Please see my other blog for Oracle EBusiness Suite Posts - EBMentors

Search This Blog

Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Thursday, April 27, 2017

Hive for Oracle Developers and DBAs - Part II


In the first Hive post  we have discussed the basic usage and functionality of Hive , today we move forward and discuss some advance functionality. I'll cover Collection, Tables and Partitions in this post.


Collection Data Types                                                              

STRUCT
Analogous to a C struct or an “object.” Fields can be accessed using the “dot” notation. For example, if a column name is of type STRUCT {first STRING; last STRING}, then the first name field can be referenced using name.first.

MAP

collection of key-value tuples, where the fields are accessed using array notation (e.g., ['key']). For example, if a column name is of type MAP with key→value pairs 'first'→'Abuzar' and 'last'→'Ali', then the last name can be referenced using name['last'].

ARRAY

Ordered sequences of the same type that are indexable using zero-based integers. For example, if a column name is of type ARRAY of strings with the value ['Abuzar', 'Ali'], then the second element can be referenced using name[1].

More about Tables                                                                 

Hive offers significant extensions to support a wide range of flexibility where the data files for tables are stored, the formats used, etc.

TBLPROPERTIES is to add additional documentation in a key-value format, TBLPROPERTIES can be used to express essential metadata about the database connection.

Managed Tables: Also called internal tables, because Hive controls the life cycle of their data. When we drop a managed table , Hive deletes  the data in the table. Managed tables are less convenient for sharing with other tools.  For example, suppose we have data that is created and used primarily by Pig or other tools, but we want to run some queries against it.

CREATE TABLE IF NOT EXISTS employees(
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address')
COMMENT 'Table holds information about employees'
row format delimited
fields terminated by ','
collection items terminated by '$'
map keys terminated by '#'
LOCATION '/userdata/employees'
TBLPROPERTIES ('creator'='Inam', 'created_at'='2017-02-17 12:55:00','project'='Hive POC')
;


Get Table information: 
hive (scott)> desc extended employees;


Replacing EXTENDED with FORMATTED provides more readable but also more verbose output.

hive (scott)> desc formatted employees;



Hive automatically adds two table properties: last_modified_by holds the username of the last user to modify the table, and last_modified_time holds the epoch time in seconds of that modification.

hive (scott)> SHOW TBLPROPERTIES employees;
OK
prpt_name       prpt_value
comment Table holds information about employees
created_at      2017-02-17 12:55:00
creator Inam
project Hive POC
transient_lastDdlTime   1493125674
Time taken: 0.251 seconds, Fetched: 5 row(s)

Prepare data for the employees (complex) table (with collection columns)
Create the test csv file for the employees table to be loaded later
employees.csv (located at /tmp/mydata)
Abuzar,1000,Ikram$Usman$,pf#3.5$loan#9.4,College Rd.$Lahore$Punjab$41000
Ali,2000,Ikram$Usman$,pf#5.5$loan#7.4,College Rd.$Lahore$Punjab$41000

Load data into employees (complex) table
hive (scott)> load data local inpath '/tmp/mydata/employees.csv' into table scott.employees;
Loading data to table scott.employees
OK
Time taken: 0.227 seconds

view employees (complex) table data
hive (scott)> select name,salary,subordinates,deductions from employees;
OK
name    salary  subordinates    deductions
Abuzar  1000.0  ["Ikram","Usman",""]    {"pf":3.5,"loan":9.4}
Ali     2000.0  ["Ikram","Usman",""]    {"pf":5.5,"loan":7.4}
Time taken: 0.177 seconds, Fetched: 2 row(s)

hive (scott)> select name,salary,subordinates,deductions,address from employees;
OK
name    salary  subordinates    deductions      address
Abuzar  1000.0  ["Ikram","Usman",""]    {"pf":3.5,"loan":9.4}   {"street":"College Rd.","city":"Lahore","state":"Punjab","zip":41000}
Ali     2000.0  ["Ikram","Usman",""]    {"pf":5.5,"loan":7.4}   {"street":"College Rd.","city":"Lahore","state":"Punjab","zip":41000}
Time taken: 0.13 seconds, Fetched: 2 row(s)

Load further data using put (uploading file into HDFS employees table directory)
hive (scott)> !hdfs dfs -put /tmp/mydata/employees2.csv /userdata/employees;

Query again ans you will see more rows (existing in employees2.csv)

hive (scott)> select name,salary,subordinates,deductions from employees;
OK
name    salary  subordinates    deductions
Abuzar  1000.0  ["Ikram","Usman",""]    {"pf":3.5,"loan":9.4}
Ali     2000.0  ["Ikram","Usman",""]    {"pf":5.5,"loan":7.4}
Ashraf  4000.0  ["Ikram","Usman",""]    {"pf":3.5,"loan":9.4}
Mubeen  5000.0  ["Ikram","Usman",""]    {"pf":5.5,"loan":7.4}
Time taken: 0.14 seconds, Fetched: 4 row(s)


Copy the schema (but not the data) of an existing table
hive (scott)> CREATE TABLE IF NOT EXISTS employees2 LIKE employees;

List the tables in other database
hive (scott)> SHOW TABLES IN scott;
OK
tab_name
dept
dual
emp
emp2
employees
employees2

By Regular expression, Not all regular expression features are supported
hive (scott)> SHOW TABLES IN scott LIKE 'e.*';
OK
tab_name
emp
emp2
employees
employees2
Time taken: 0.033 seconds, Fetched: 4 row(s)

External Tables: If we want to study data with many tools for different data sources (eg; datasource1,datasource2) , we have to ensure the schema we’ll use next matches the schemas of both these data sources.


Let’s assume the data files are in HDFS directory /data/stocks. 

[hdpsysuser@hdpmaster ~]$ hdfs dfs -mkdir -p /data/stocks
[hdpsysuser@hdpmaster ~]$ hdfs dfs -chmod 777 /data/stocks

The following table declaration creates an external table that can read all the data files for this comma-delimited data in /data/stocks:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks(
exchange_name STRING,
price_open FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

hive (scott)> select * from stocks; -- -- NO ROWS RETURNED
OK
stocks.exchange_name    stocks.price_open
Time taken: 0.119 seconds

Now I created 2 CSVs and placed them in /tmp/mydata folder on hive server to furhter put them in HDFS /data/stocks location

asia.csv
Karachi Stock Exchange,1000
Riyadh Stock Exchange,2000

us.csv
NY Stock Exchange,3000

Put asia.csv to HDFS
[hdpsysuser@hdpmaster ~]$ hdfs dfs -put /tmp/mydata/asia.csv /data/stocks

hive (scott)> select * from stocks;
OK
stocks.exchange_name    stocks.price_open
Karachi Stock Exchange  1000.0
Riyadh Stock Exchange   2000.0
Time taken: 0.157 seconds, Fetched: 2 row(s)

Put the second csv also and query again
[hdpsysuser@hdpmaster ~]$ hdfs dfs -put /tmp/mydata/us.csv /data/stocks
hive (scott)> select * from stocks;
OK
stocks.exchange_name    stocks.price_open
Karachi Stock Exchange  1000.0
Riyadh Stock Exchange   2000.0
NY Stock Exchange       3000.0
Time taken: 0.168 seconds, Fetched: 3 row(s)

Hive does not assume that external table owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted. Some HiveQL constructs are not permitted for external tables.

Partitioned Managed Tables: Often partitioning is used for distributing load horizontally, moving data physically closer to its most frequent users, and other purposes. Partitioned tables have important performance benefits, and they can help organize data in a logical fashion, such as hierarchically. Partitioning tables changes how Hive structures the data storage. 

If we create below table in the scott database, there will be an employees_part directory for the table, Hive will create subdirectories reflecting the partitioning structure.

CREATE TABLE IF NOT EXISTS employees_part(
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address')
COMMENT 'Table holds information about employees with partitions'
PARTITIONED BY (country STRING, state STRING)
row format delimited
fields terminated by ','
collection items terminated by '$'
map keys terminated by '#'
TBLPROPERTIES ('creator'='Inam', 'created_at'='2017-05-17 12:55:00','project'='Hive POC')
;

hive (scott)> describe employees_part;
OK
col_name        data_type       comment
name                    string
salary                  float
subordinates            array<string>
deductions              map<string,float>
address                 struct<street:string,city:string,state:string,zip:int>
country                 string
state                   string

# Partition Information
# col_name              data_type               comment

country                 string
state                   string
Time taken: 0.037 seconds, Fetched: 13 row(s)
hive (scott)>

Observe above table description, columns declared in partitioned by clause have been added in the column list automatically.

List the content of the location of employees_part table
hive (scott)> !hdfs dfs -ls /user/hive/warehouse/scott.db/employees_part;

You see no folders as data has not been loaded yet.

hive (scott)> select name,country,state from employees_part;
OK
name    country state
Time taken: 0.205 seconds


Prepare data to be loaded into partition table
Create 3 CSVs like below.
employees_part1.csv
Abuzar,1000,Ikram$Usman$,pf#3.5$loan#9.4,College Rd.$Lahore$Punjab$41000,PAK,Punjab
Ali,2000,Ikram$Usman$,pf#5.5$loan#7.4,College Rd.$Lahore$Punjab$41000,PAK,Punjab
employees_part2.csv

Zeeshan,1000,Ikram$Usman$,pf#3.5$loan#9.4,College Rd.$Lahore$Punjab$41000,PAK,Sindh

Hafiz,2000,Ikram$Usman$,pf#5.5$loan#7.4,College Rd.$Lahore$Punjab$41000,PAK,Sindh
employees_part3.csv
Saleem,1000,Ikram$Usman$,pf#3.5$loan#9.4,College Rd.$Lahore$Punjab$41000,Saudia,Riyadh
Danish,2000,Ikram$Usman$,pf#5.5$loan#7.4,College Rd.$Lahore$Punjab$41000,Saudia,Qaseem

Load Data into partitioned table (Dynamic Partitions)

Loading csv data to a partitioned table involves below mentioned two steps:
1- Load csv file to a non-partitioned table.
2- Load non-partitioned table data to partitioned table.


Create non-partitioned table
CREATE TABLE IF NOT EXISTS employees_nonpart(
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address',
country STRING,
state STRING
)
COMMENT 'Table holds information about employees to be used for partitions later'
row format delimited
fields terminated by ','
collection items terminated by '$'
map keys terminated by '#'
TBLPROPERTIES ('creator'='Inam', 'created_at'='2017-05-17 12:55:00','project'='Hive POC')
;

Now load data into non-partitioned table
hive (scott)> load data local inpath '/tmp/mydata/employees_part1.csv' into table scott.employees_nonpart;
Loading data to table scott.employees_nonpart
OK
Time taken: 0.589 seconds
hive (scott)> load data local inpath '/tmp/mydata/employees_part2.csv' into table scott.employees_nonpart;
Loading data to table scott.employees_nonpart
OK
Time taken: 0.28 seconds
hive (scott)> load data local inpath '/tmp/mydata/employees_part3.csv' into table scott.employees_nonpart;
Loading data to table scott.employees_nonpart
OK
Time taken: 0.211 seconds

hive (scott)> select * from employees_nonpart;

Now load non-partitioned table data to partitioned table employees_part (created earlier)
To load partitioned table we use below commands:

hive (scott)> SET hive.exec.dynamic.partition = true;
hive (scott)> SET hive.exec.dynamic.partition.mode = nonstrict;
hive (scott)> INSERT OVERWRITE TABLE employees_part PARTITION(country, state) SELECT  * from employees_nonpart;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdpclient_20170426145142_d2035d89-2b7c-4a26-a9f1-c34b60bbb2e7
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1493031715875_0014, Tracking URL = http://nn01:8088/proxy/application_1493031715875_0014/
Kill Command = /usr/hadoopsw/hadoop-2.7.3/bin/hadoop job  -kill job_1493031715875_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-26 14:51:49,155 Stage-1 map = 0%,  reduce = 0%
2017-04-26 14:51:54,361 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.75 sec
MapReduce Total cumulative CPU time: 2 seconds 750 msec
Ended Job = job_1493031715875_0014
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nn01:9000/user/hive/warehouse/scott.db/employees_part/.hive-staging_hive_2017-04-26_14-51-42_297_7475299094026344956-1/-ext-10000
Loading data to table scott.employees_part partition (country=null, state=null)

Loaded : 4/4 partitions.
         Time taken to load dynamic partitions: 0.475 seconds
         Time taken for adding to write entity : 0.001 seconds
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.75 sec   HDFS Read: 7354 HDFS Write: 751 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 750 msec
OK
employees_nonpart.name  employees_nonpart.salary        employees_nonpart.subordinates  employees_nonpart.deductions    employees_nonpart.address employees_nonpart.country       employees_nonpart.state
Time taken: 14.228 seconds

Now verify your dynamic partitions are there
hive (scott)> !hdfs dfs -ls /user/hive/warehouse/scott.db/employees_part;
Found 2 items
drwxrwxrwx   - hdpclient supergroup          0 2017-04-26 14:52 /user/hive/warehouse/scott.db/employees_part/country=Pakistan
drwxrwxrwx   - hdpclient supergroup          0 2017-04-26 14:52 /user/hive/warehouse/scott.db/employees_part/country=Saudia







Show partitions
hive (scott)> SHOW PARTITIONS employees_part;
OK
partition
country=Pakistan/state=Punjab
country=Pakistan/state=Sindh
country=Saudia/state=Qaseem
country=Saudia/state=Riyadh
Time taken: 0.084 seconds, Fetched: 4 row(s)

hive (scott)> SHOW PARTITIONS employees_part PARTITION(country='Pakistan');
OK
partition
country=Pakistan/state=Punjab
country=Pakistan/state=Sindh
Time taken: 0.086 seconds, Fetched: 2 row(s)

hive (scott)> SELECT * FROM employees_part
            > WHERE country = 'Saudia' AND state = 'Riyadh';
OK
employees_part.name     employees_part.salary   employees_part.subordinates     employees_part.deductions       employees_part.address  employees_part.country    employees_part.state
Saleem  1000.0  ["Ikram","Usman",""]    {"pf":3.5,"loan":9.4}   {"street":"College Rd.","city":"xxxx","state":"yyyyy","zip":41000}      Saudia    Riyadh
Time taken: 0.496 seconds, Fetched: 1 row(s)

Once created, the partition keys (country and state) behave like regular columns. Perhaps the most important reason to partition data is for faster queries. In the previous
query, which limits the results to employees in Riyadh, it is only necessary to scan the
contents of one directory even if we have thousands of country and state directories. For very large data sets, partitioning can dramatically improve query performance, but only if the partitioning scheme reflects common range filtering (e.g., by locations, timestamp ranges).

Hive Strict mode
Remember, a query across all partitions could trigger an enormous MapReduce job if the
table data and number of partitions are large. A highly suggested safety measure is
putting Hive into “strict” mode, which prohibits queries of partitioned tables without
a WHERE clause that filters on partitions. You can set the mode to “nonstrict,” as in the following session:
hive (scott)> set hive.mapred.mode=strict;
hive (scott)> SELECT ep.name, ep.salary FROM employees_part ep LIMIT 100;
FAILED: SemanticException Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please make sure that hive.strict.checks.large.query is set to false and that hive.mapred.mode is not set to 'strict' to enable them. No partition predicate for Alias "ep" Table "employees_part"
hive (scott)>  set hive.mapred.mode=nonstrict;
hive (scott)> SELECT ep.name, ep.salary FROM employees_part ep LIMIT 100;
OK
ep.name ep.salary
Abuzar  1000.0
Ali     2000.0
Zeeshan 1000.0
Hafiz   2000.0
Danish  2000.0
Saleem  1000.0
Time taken: 0.171 seconds, Fetched: 6 row(s)


Static Partition
We have below csv and we want to create the partition for it. For static partition you don't need to have the partition columns' data in your csv.
employee_part4.csv
Aslam,8000,Ikram$Usman$,pf#3.5$loan#9.4,College Rd.$Lahore$Punjab$41000
Akram,9000,Ikram$Usman$,pf#5.5$loan#7.4,College Rd.$Lahore$Punjab$41000

hive (scott)> LOAD DATA LOCAL INPATH '/tmp/mydata/employees_part4.csv'
            > INTO TABLE employees_part
            > PARTITION (country = 'UAE', state = 'Dubai');
Loading data to table scott.employees_part partition (country=UAE, state=Dubai)
OK
Time taken: 0.409 seconds

Verify partitions
hive (scott)> !hdfs dfs -ls /user/hive/warehouse/scott.db/employees_part;
Found 3 items
drwxrwxrwx   - hdpclient supergroup          0 2017-04-26 14:52 /user/hive/warehouse/scott.db/employees_part/country=Pakistan
drwxrwxrwx   - hdpclient supergroup          0 2017-04-26 14:52 /user/hive/warehouse/scott.db/employees_part/country=Saudia
drwxrwxrwx   - hdpclient supergroup          0 2017-04-26 15:54 /user/hive/warehouse/scott.db/employees_part/country=UAE


External Partitioned Tables: 
You can use partitioning with external tables. In fact, you may find that this is your most common scenario for managing large production data sets. The combination gives you a way to “share” data with other tools, while still optimizing query performance. Consider an example that fits this scenario well ie; logfile analysis.

I've a transaction log (5GB), first I checked the content to know what data is available.

hive (scott)> !head -2 /tmp/mydata/translog.csv;
"RECORD_ID","EMP_ID","REQUEST_TYPE","JSON_INPUT","RESULT_CODE","RESULT_DESCRIPTION","ERROR_TYPE","START_TIME","PROCCESSING_TIME","REQUEST_CHANNEL","CUSTOM_1","CUSTOM_2","CUSTOM_3","CUSTOM_4","CUSTOM_5","CUSTOM_6"
273483692,1084721533,"LOAD_EMP_PROFILE",1084721533,0,"?? ??? ??????? ?????","",03-DEC-16 02.47.57.856000000 AM,7,"WEB","(Google Chrome): Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36","","","192.168.155.140 ","",""
hive (scott)>

Observe that first line contains columns and rest is data. In order to load into HDFS (Table) we need to remove the first line in this csv file and then create table for this file.

Delete first line from the file
[root@en01 ~]# sed  -i 1d /tmp/mydata/translog.csv

Verify first line is deleted
hive (scott)> !head -1 /tmp/mydata/translog.csv;
273483692,1084721533,"LOAD_EMP_PROFILE",1084721533,0,"?? ??? ??????? ?????","",03-DEC-16 02.47.57.856000000 AM,7,"WEB","(Google Chrome): Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36","","","192.168.155.140 ","",""

CREATE EXTERNAL TABLE translog(
RECORD_ID STRING, 
EMP_ID STRING, 
REQUEST_TYPE STRING, 
JSON_INPUT STRING,  
RESULT_CODE STRING,
RESULT_DESCRIPTION STRING,
ERROR_TYPE STRING,
START_TIME STRING,
PROCCESSING_TIME STRING,
REQUEST_CHANNEL STRING,
CUSTOM_1 STRING,
CUSTOM_2 STRING,
CUSTOM_3 STRING,
CUSTOM_4 STRING,
CUSTOM_5 STRING,
CUSTOM_6 STRING
)
PARTITIONED BY (YEAR int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
;

See here we are not determining any location for this external table, we will specify the location while loading data to a partition of this table. You need to add partition to this table first and then you will load data.

Adding Partition: An ALTER TABLE statement is used to add each partition separately. It must specify a value for each partition key, the year in our case.

hive (scott)> ALTER TABLE translog ADD PARTITION(YEAR = 2016) LOCATION '/userdata/translog/2016';
OK
Time taken: 0.073 seconds
hive (scott)>

The directory convention we used above is completely up to us. Here, we follow a hierarchical directory structure, because it’s a logical way to organize our data, but there is no requirement to do so.


Now load data to this partition

hive (scott)> LOAD DATA LOCAL INPATH '/tmp/mydata/translog.csv' into table scott.translog PARTITION (YEAR=2016);
Loading data to table scott.translog partition (year=2016)
OK
Time taken: 48.713 seconds
hive (scott)>

hive (scott)> select count(RECORD_ID) from translog;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdpclient_20170427104243_2de52449-d8c8-4798-bf22-f4f46f96f4d2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1493031715875_0016, Tracking URL = http://nn01:8088/proxy/application_1493031715875_0016/
Kill Command = /usr/hadoopsw/hadoop-2.7.3/bin/hadoop job  -kill job_1493031715875_0016
Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 1
2017-04-27 10:42:49,988 Stage-1 map = 0%,  reduce = 0%
2017-04-27 10:42:58,326 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 48.43 sec
2017-04-27 10:42:59,358 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 65.38 sec
2017-04-27 10:43:00,386 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 109.71 sec
2017-04-27 10:43:02,451 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 116.71 sec
2017-04-27 10:43:03,478 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 151.69 sec
2017-04-27 10:43:04,508 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 172.22 sec
2017-04-27 10:43:05,540 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 175.18 sec
MapReduce Total cumulative CPU time: 2 minutes 55 seconds 180 msec
Ended Job = job_1493031715875_0016
MapReduce Jobs Launched:
Stage-Stage-1: Map: 20  Reduce: 1   Cumulative CPU: 175.18 sec   HDFS Read: 5194481859 HDFS Write: 108 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 55 seconds 180 msec
OK
c0
14531811
Time taken: 24.327 seconds, Fetched: 1 row(s)
hive (scott)>

Add another partition
hive (scott)> ALTER TABLE translog ADD PARTITION(YEAR = 2017) LOCATION '/userdata/translog/2017';
OK
Time taken: 0.065 seconds

hive (scott)> show partitions translog;
OK
partition
year=2016
year=2017
Time taken: 0.077 seconds, Fetched: 2 row(s)

Now you can load data to this new partition.
hive (scott)> LOAD DATA LOCAL INPATH '/tmp/mydata/translog.csv' into table scott.translog PARTITION (YEAR=2017);
Loading data to table scott.translog partition (year=2017)
OK
Time taken: 47.846 seconds

See some performance benefits of partition
hive (scott)> select count(RECORD_ID) from translog;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdpclient_20170427113327_6b0bdbc7-3c90-4f49-8d28-7b6ad464da0c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1493031715875_0024, Tracking URL = http://nn01:8088/proxy/application_1493031715875_0024/
Kill Command = /usr/hadoopsw/hadoop-2.7.3/bin/hadoop job  -kill job_1493031715875_0024
Hadoop job information for Stage-1: number of mappers: 39; number of reducers: 1
2017-04-27 11:33:33,067 Stage-1 map = 0%,  reduce = 0%
2017-04-27 11:33:41,325 Stage-1 map = 15%,  reduce = 0%, Cumulative CPU 47.02 sec
2017-04-27 11:33:44,428 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 55.76 sec
2017-04-27 11:33:45,459 Stage-1 map = 34%,  reduce = 0%, Cumulative CPU 121.97 sec
2017-04-27 11:33:46,489 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 138.21 sec
2017-04-27 11:33:47,516 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 185.94 sec
2017-04-27 11:33:48,543 Stage-1 map = 72%,  reduce = 0%, Cumulative CPU 240.19 sec
2017-04-27 11:33:51,625 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 250.42 sec
2017-04-27 11:33:52,649 Stage-1 map = 90%,  reduce = 0%, Cumulative CPU 293.86 sec
2017-04-27 11:33:54,698 Stage-1 map = 90%,  reduce = 30%, Cumulative CPU 294.7 sec
2017-04-27 11:33:55,729 Stage-1 map = 92%,  reduce = 30%, Cumulative CPU 305.05 sec
2017-04-27 11:33:56,757 Stage-1 map = 97%,  reduce = 30%, Cumulative CPU 324.86 sec
2017-04-27 11:33:57,782 Stage-1 map = 97%,  reduce = 32%, Cumulative CPU 324.94 sec
2017-04-27 11:33:58,807 Stage-1 map = 100%,  reduce = 32%, Cumulative CPU 337.91 sec
2017-04-27 11:34:00,855 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 340.31 sec
MapReduce Total cumulative CPU time: 5 minutes 40 seconds 310 msec
Ended Job = job_1493031715875_0024
MapReduce Jobs Launched:
Stage-Stage-1: Map: 39  Reduce: 1   Cumulative CPU: 340.31 sec   HDFS Read: 10388960288 HDFS Write: 108 SUCCESS
Total MapReduce CPU Time Spent: 5 minutes 40 seconds 310 msec
OK
c0
29063622
Time taken: 34.146 seconds, Fetched: 1 row(s)


hive (scott)> select count(RECORD_ID) from translog where year=2016;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdpclient_20170427113617_2f9d44ec-202a-49c2-92b5-d416c82801b9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1493031715875_0025, Tracking URL = http://nn01:8088/proxy/application_1493031715875_0025/
Kill Command = /usr/hadoopsw/hadoop-2.7.3/bin/hadoop job  -kill job_1493031715875_0025
Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 1
2017-04-27 11:36:22,653 Stage-1 map = 0%,  reduce = 0%
2017-04-27 11:36:29,880 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 7.67 sec
2017-04-27 11:36:30,908 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 38.78 sec
2017-04-27 11:36:31,934 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 46.55 sec
2017-04-27 11:36:32,958 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 73.47 sec
2017-04-27 11:36:33,984 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 110.84 sec
2017-04-27 11:36:35,008 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 117.97 sec
2017-04-27 11:36:36,032 Stage-1 map = 85%,  reduce = 0%, Cumulative CPU 145.64 sec
2017-04-27 11:36:37,058 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 174.74 sec
2017-04-27 11:36:39,112 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 177.62 sec
MapReduce Total cumulative CPU time: 2 minutes 57 seconds 620 msec
Ended Job = job_1493031715875_0025
MapReduce Jobs Launched:
Stage-Stage-1: Map: 20  Reduce: 1   Cumulative CPU: 177.62 sec   HDFS Read: 5194481084 HDFS Write: 108 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 57 seconds 620 msec
OK
c0
14531811
Time taken: 22.753 seconds, Fetched: 1 row(s)
hive (scott)>

Drop a partition
hive (scott)> ALTER TABLE translog drop PARTITION(YEAR = 2017);
Dropped the partition year=2017
OK
Time taken: 0.095 seconds



No comments: