Case Study of Hive using Hadoop

DOI : 10.17577/IJERTV3IS110587


1 Sai Prasad Potharaju, 2 Shanmuk Srinivas A, 3 Ravi Kumar Tirandasu

1,2,3 Assistant Professor,

1,2,3 SRES COE, Department of Computer Engineering, Kopargaon, Maharashtra, India

Abstract - Hadoop is a framework of tools rather than a single piece of software that you download onto one computer. These tools are used to run applications on big data, which is huge in volume, must be processed quickly, and can arrive in a variety of forms. To manage such big data, Hive is used as a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides an SQL-like language called HIVEQL. In this paper we explain how to use Hive on Hadoop with a simple real-time example: how to create a table, load data into the table from an external file, retrieve data from the table, and observe statistics such as the CPU time for each stage of query execution, the cumulative CPU time, and the time taken to fetch records.

Key Words: Hadoop, Hive, MapReduce, HDFS, HIVEQL

  1. INTRODUCTION

      1. Hadoop

        Hadoop is open source and is distributed under the Apache license. It is a framework of tools, not a piece of software that you simply download onto one computer. These tools are used to run applications on big data, that is, data that is huge in volume, must be processed quickly, and arrives in a variety of (often unstructured) forms. In the traditional approach, big data is processed on a single powerful computer, but such a computer does a good job only up to a limit, because it is not scalable: it processes according to its processor (core) type and speed and its memory capacity.

        Hadoop takes a different approach (Fig. 1) from the traditional one. It breaks the data into smaller pieces.

        Breaking the data into smaller pieces is a good idea, but what about the computation?

        The computation is also broken into pieces, i.e. it is processed on different nodes and the results are combined together.

        Fig. 1.

        At the highest level Hadoop has a simple architecture which breaks the data and the computation into pieces. Hadoop has two important components:

        A. MapReduce B. File System (HDFS)

        One important characteristic of Hadoop is that it works on a distributed model and is a Linux-based set of tools.

        All the machines or nodes under Hadoop have two components:

        A. Task Tracker (T) [1][2] B. Data Node (D) [1][2] (Fig. 2). The job of the task tracker is to process the task given to a particular node, and the job of the data node is to manage the data given to that node. The nodes which process a piece of data, and therefore run a task tracker and a data node, are called slaves. Above these slaves there is a master which has two additional components: 1. Job Tracker (J) [1][2] 2. Name Node (N) [1][2] (Fig. 2).

        Fig. 2.

        Job Tracker and Task Tracker come under MapReduce; Data Node and Name Node come under HDFS.

      2. HIVE

    Hive is a data warehouse system layer built on Hadoop. It allows a structure to be defined for unstructured big data, and it simplifies analysis and queries with an SQL-like scripting language called HIVEQL.

    Hive is not a relational database; it uses a database to store metadata, but the data that Hive processes is stored in HDFS. Hive is not designed for on-line transaction processing (OLTP). Hive runs on Hadoop, which is a batch-processing system where jobs can have high latency and substantial overhead, so the latency of Hive queries is generally high even for small jobs. Hive is not suited to real-time queries or row-level updates; it is best used for batch jobs over large sets of immutable data such as web logs.

  2. THE TYPICAL USE CASE OF HIVE

    Hive takes a large amount of unstructured data and places it into a structured view, as shown in Fig. 3, which can then be used by business analysts with standard business tools. Hive supports use cases such as ad-hoc queries, summarization, and data analysis.

    Fig.3.

    HIVEQL can also be extended with custom scalar functions (user defined functions, UDFs), aggregations (UDAFs) and table functions (UDTFs).
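    As a rough sketch of how such a function is plugged in (the jar path, function name and Java class below are hypothetical placeholders, not part of this paper), a custom UDF is typically registered and called from the Hive shell as follows:

    hive> ADD JAR /home/sai/udfs/my-udf.jar;
    hive> CREATE TEMPORARY FUNCTION clean_name AS 'com.example.hive.udf.CleanName';
    hive> SELECT clean_name(name) FROM mytable LIMIT 10;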

  3. HIVE ARCHITECTURE

    As shown in Fig. 4, the Hive architecture consists of:

    A. Meta store

    It stores schema information and provides structure for the stored data.

    B. HIVEQL

    This component handles query parsing, compilation, optimization and execution; queries are turned into MapReduce jobs which invoke mappers and reducers for processing.

    HIVEQL is similar to other SQL dialects: it uses familiar relational database concepts (tables, rows, columns, schema) and is based on the SQL-92 specification. Hive supports multi-table inserts, which means the same big data can be scanned once and written into several tables. It converts SQL queries into MapReduce jobs, so the user does not need to know MapReduce. It also supports plugging custom MapReduce scripts into queries.
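    For illustration, a multi-table insert scans the source data once and writes into several tables in the same statement; a minimal sketch (the table and column names are assumed for this example, not taken from the case study) is:

    hive> FROM batting
          INSERT OVERWRITE TABLE runs_by_year SELECT year, SUM(runs) GROUP BY year
          INSERT OVERWRITE TABLE players_by_year SELECT year, COUNT(DISTINCT player_id) GROUP BY year;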

    C. User Interface

    Hive can be approached through a number of user interfaces: it has a web UI, its own command line interface, and it can be accessed using Microsoft HDInsight.

    Hive is a good choice when you want to query the data or need answers to specific questions, and you are already familiar with SQL.

    Fig.4.

  4. USING HIVE WITH HADOOP

A. Prerequisites to work with Hive

The prerequisites for setting up Hive and running queries are:

1. A stable build of Hadoop
2. Java 1.6 installed on the machine
3. Basic Java programming skills
4. Basic SQL knowledge
5. Basic knowledge of Linux

    To start Hive, first start all the Hadoop services using the command $ start-all.sh. Then check whether all the services are running using the jps command; if all services are running, use the command $ hive to start Hive.
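    Putting those commands together, the start-up sequence looks roughly like this (the daemons listed by jps depend on the Hadoop version, so treat this as a sketch):

    $ start-all.sh        # starts the HDFS and MapReduce daemons
    $ jps                 # should list NameNode, DataNode, JobTracker, TaskTracker, etc.
    $ hive                # opens the Hive shell
    hive>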

    B. HIVE TABLES

    Here we explain the basics of Hive tables. A Hive table consists of:

    Data: typically a file or group of files in HDFS.

    Schema: held as metadata stored in a relational database.

    Schema and data are separate: a schema can be defined for existing data, and data can be added or removed independently of the schema. You have to define a schema if you have existing data in HDFS that you want to use in Hive.
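    As a sketch of defining a schema over data that already sits in HDFS (the table name, columns and path below are assumptions for illustration, not part of the case study), an external table lets Hive read the existing files in place:

    hive> CREATE EXTERNAL TABLE weblogs (ip STRING, ts STRING, url STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          LOCATION '/user/sai/weblogs';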

    1. Managing Tables

      See the current tables:

      hive> SHOW TABLES;

      Check the schema:

      hive> DESCRIBE mytable;

      Change the table name:

      hive> ALTER TABLE mytable RENAME TO mt;

      Add a column:

      hive> ALTER TABLE mytable ADD COLUMNS (mycol STRING);

      Drop a partition:

      hive> ALTER TABLE mytable DROP PARTITION (age=17);
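      Note that DROP PARTITION applies only to a table that was declared with partitions; a minimal sketch (the table and columns here are assumed for illustration) would be:

      hive> CREATE TABLE players (name STRING, runs INT) PARTITIONED BY (age INT);
      hive> ALTER TABLE players DROP PARTITION (age=17);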

    2. Loading Data

      Use LOAD DATA to import data into a Hive table:

      hive> LOAD DATA LOCAL INPATH 'path of input data' INTO TABLE mytable;

      The files are not modified by Hive. Use the keyword OVERWRITE [3][4] to write over a file of the same name. The default Hive warehouse location is /user/hive/warehouse. Hive can read all of the files in a particular directory. The schema is checked only when the data is read, and if a row does not match the schema it is read as null.
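      For example, reloading the same data with the OVERWRITE keyword replaces the existing contents of the table instead of adding to them (the path here is a placeholder):

      hive> LOAD DATA LOCAL INPATH 'path of input data' OVERWRITE INTO TABLE mytable;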

    3. Insert

      Use the INSERT [3] statement to populate a table with data from another Hive table. Since query results are usually large, it is best to use an INSERT clause to tell Hive where to store the results of your query.

      Example:

      hive> CREATE TABLE age_count (age INT, count BIGINT);
      hive> INSERT OVERWRITE TABLE age_count
            SELECT age, COUNT(age) FROM mytable GROUP BY age;

    4. Performing Queries

      The SELECT statement is used for this. It supports the following:

      WHERE clause
      UNION ALL and DISTINCT
      GROUP BY and HAVING
      LIMIT clause
      Regular expression column specification
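      A query combining several of these clauses might look like the following sketch (the table and column names are assumed):

      hive> SELECT age, COUNT(age) AS cnt
            FROM mytable
            WHERE age > 15
            GROUP BY age
            HAVING COUNT(age) > 1
            LIMIT 10;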

  5. CASE STUDY

In this section we give a real example of how to use Hive on top of Hadoop, together with some statistics about query execution. In this example we have a predefined dataset (cricket.csv) with more than 20 columns and more than 10,000 records. The example below explains how to display a few records based on the attributes of cricket.

The cricket.csv file has attributes such as name, country, Player_iD, Year, stint, teamid, run, igl, D, G, G_batting, AB, R, H, 2B, 3B, HR, RBI, SB, CS, BB, SO, IBB, HBP, SH, SF, GIDP, G_OLD.

Follow the steps below in order to retrieve data from the big tables.

    1. Start Hadoop

      $ start-all.sh

    2. Create a Folder and copy all predefined datasets

      $ hadoop fs -mkdir sampledataset

      $ hadoop fs -put /home/sai/hivedataset/ /user/sai/sampledataset

      Note: hivedataset contains the big data tables in .csv format; these tables are copied from the local file system into Hadoop.
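      As an optional check (not part of the original steps), the copied files can be listed to confirm they reached HDFS:

      $ hadoop fs -ls /user/sai/sampledataset/hivedataset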

    3. Start Hive

      $ hive

    4. Create Table in Hive

      hive> create table temp_cricket (col_value STRING);

    5. Load the data from the csv file into the table temp_cricket

      hive> LOAD DATA INPATH '/user/sai/sampledataset/hivedataset/cricket.csv' OVERWRITE INTO TABLE temp_cricket;
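      A quick optional check that the raw rows were loaded (not part of the original steps):

      hive> SELECT col_value FROM temp_cricket LIMIT 5;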

    6. Create Another table to Extract data

      hive>create table batting (player_id STRING, year INT, runs INT);

    7. Insert data into the newly created table by extracting fields from temp_cricket

hive> insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_cricket;

In the above query, player_id, year and run are extracted from the temp_cricket table and loaded into the batting table.
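To see how the extraction works, note that the pattern contains a single capture group which matches one comma-separated field per repetition; Java regex semantics keep only the last repetition, so the {n} quantifier effectively selects the n-th column. A sketch on a made-up row (not taken from cricket.csv) illustrates this:

hive> SELECT regexp_extract('p001,2010,1,IND,55', '^(?:([^,]*)\,?){2}', 1);
-- expected result: 2010, the second field of the sample string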

Execute a few queries and observe the results.

Example 1

Display year and maximum runs scored in each year

hive> SELECT year, max(runs) FROM batting GROUP BY year;

Total Jobs = 1

Hadoop job information for Stage-1:
Number of mappers: 1; Number of reducers: 1
Stage-1 map = 0%, reduce = 0%
Stage-1 map = 100%, reduce = 0%, cumulative CPU 3.45 sec
Stage-1 map = 100%, reduce = 100%, cumulative CPU 6.35 sec
MapReduce total cumulative CPU time: 6.35 sec
Time taken: 46.09 sec

Example 2

Display player_id, year and the maximum runs scored in each year

hive> SELECT a.year, a.player_id, a.runs FROM batting a
      JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
      ON (a.year = b.year AND a.runs = b.runs);

Total Jobs = 2

Launching job 1 out of 2
Hadoop job information for Stage-2:
Number of mappers: 1; Number of reducers: 1
Stage-2 map = 0%, reduce = 0%
Stage-2 map = 100%, reduce = 0%, cumulative CPU 3.0 sec
Stage-2 map = 100%, reduce = 100%, cumulative CPU 5.95 sec
MapReduce total cumulative CPU time: 5.95 sec

Launching job 2 out of 2
Hadoop job information for Stage-3:
Number of mappers: 1; Number of reducers: 0
Stage-3 map = 0%, reduce = 0%
Stage-3 map = 100%, reduce = 0%, cumulative CPU 3.12 sec
Stage-3 map = 100%, reduce = 100%, cumulative CPU 3.12 sec
MapReduce total cumulative CPU time: 3.12 sec

Time taken: 65.743 sec

As there are two SELECT statements (the outer query and the inner sub-query), two jobs have been executed. The CPU time varies according to the system configuration.

CONCLUSION

Hive works with Hadoop to allow you to query and manage large-scale data using a familiar SQL-like interface. Hive provides a command line interface to the shell, and Microsoft HDInsight provides console access. Hive tables consist of data and schema, which are kept separate for maximum flexibility. The Hive Query Language (HIVEQL) supports familiar SQL operations including joins, subqueries, ORDER BY, SORT BY, etc. The CPU time observed depends on the Hadoop cluster configuration (number of nodes) and the system configuration.

REFERENCES

  1. Hadoop. http://hadoop.apache.org, 2009.

  2. HDFS (Hadoop Distributed File System) architecture. http://hadoop.apache.org/common/docs/current/hdfs_design.html, 2009.

  3. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy. Hive – A Warehousing Solution Over a Map-Reduce Framework. http://www.vldb.org/pvldb/2/vldb09-938.pdf

  4. Jason Rutherglen, Dean Wampler, Edward Capriolo. Programming Hive. http://www.reedbushey.com/99Programming%20Hive.pdf
