Note: All posts take a practical approach and avoid lengthy theory. Everything has been tested on development servers. Please don't try any post on production servers until you are sure.

Saturday, February 25, 2017

Hive for Oracle Developers and DBAs - Part I

The Hadoop ecosystem emerged as a cost-effective way of working with large data sets. It imposes a particular programming model, called MapReduce, for breaking computation tasks into units that can be distributed around a cluster of commodity, server-class hardware, thereby providing cost-effective, horizontal scalability.

Underneath this computation model is a distributed file system called the Hadoop Distributed Filesystem (HDFS). However, a challenge remains: how do you move an existing data infrastructure to Hadoop when that infrastructure is based on traditional relational databases and the Structured Query Language (SQL)? What about the large base of SQL users, both expert database designers and administrators, as well as casual users who use SQL to extract information from their data warehouses?

This is where Hive comes in. Hive provides a SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL), for querying data stored in a Hadoop cluster.

Please remember that if you need OLTP features for large-scale data, you should consider a NoSQL database instead: for example HBase (a NoSQL database integrated with Hadoop), Cassandra, or DynamoDB if you are using Amazon's Elastic MapReduce (EMR) or Elastic Compute Cloud (EC2).

Being an Oracle person myself, my fellow DBAs asked me to build an analogy for Hive so that they could work in Hive much like the familiar Oracle schema, i.e. scott. This post covers that analogy, which helps Oracle developers and DBAs run (almost, though not exactly :) ) the same SQL they already know from Oracle.

Prerequisites
Complete my previous post Hive Installation and Configuration 

Preparing Oracle Data for Hive                                                

Step 1: Connect to SQL*Plus as scott/tiger and run the queries below to extract the data that will be loaded into Hive later.

SQL> select deptno||','|| dname|| ','||loc from scott.dept;
DEPTNO||','||DNAME||','||LOC
--------------------------------------------------
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

SQL> set lines 100
SQL> set pages 25
SQL> select empno||','|| ename|| ','||job||','||mgr||','||hiredate||','||sal||','||comm||','||deptno  from scott.emp;

EMPNO||','||ENAME||','||JOB||','||MGR||','||HIREDATE||','||SAL||','||COMM||','||DEPTNO
------------------------------------------------------------------
7369,SMITH,CLERK,7902,17-DEC-80,800,,20
7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30
7521,WARD,SALESMAN,7698,22-FEB-81,1250,500,30
7566,JONES,MANAGER,7839,02-APR-81,2975,,20
7654,MARTIN,SALESMAN,7698,28-SEP-81,1250,1400,30
7698,BLAKE,MANAGER,7839,01-MAY-81,2850,,30
7782,CLARK,MANAGER,7839,09-JUN-81,2450,,10
7788,SCOTT,ANALYST,7566,19-APR-87,3000,,20
7839,KING,PRESIDENT,,17-NOV-81,5000,,10
7844,TURNER,SALESMAN,7698,08-SEP-81,1500,0,30
7876,ADAMS,CLERK,7788,23-MAY-87,1100,,20
7900,JAMES,CLERK,7698,03-DEC-81,950,,30
7902,FORD,ANALYST,7566,03-DEC-81,3000,,20
7934,MILLER,CLERK,7782,23-JAN-82,1300,,10
14 rows selected.

Save the above output as dept.csv and emp.csv respectively; we will load these files into Hive.
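If you prefer to spool these files straight out of SQL*Plus instead of copying the output by hand, a minimal sketch is below (the /tmp/mydata path is just an example; spool captures everything shown on screen, so some cleanup of the spooled files may still be needed):

SQL> set heading off feedback off pagesize 0 trimspool on
SQL> spool /tmp/mydata/dept.csv
SQL> select deptno||','|| dname|| ','||loc from scott.dept;
SQL> spool off
SQL> spool /tmp/mydata/emp.csv
SQL> select empno||','|| ename|| ','||job||','||mgr||','||hiredate||','||sal||','||comm||','||deptno from scott.emp;
SQL> spool off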

Step 2: Also take the DDL scripts for Oracle's dept and emp tables; they will help us create the corresponding tables in Hive later.
create table dept(
  deptno number(2,0),
  dname  varchar2(14),
  loc    varchar2(13),
  constraint pk_dept primary key (deptno)
);

create table emp(
  empno    number(4,0),
  ename    varchar2(10),
  job      varchar2(9),
  mgr      number(4,0),
  hiredate date,
  sal      number(7,2),
  comm     number(7,2),
  deptno   number(2,0),
  constraint pk_emp primary key (empno),
  constraint fk_deptno foreign key (deptno) references dept (deptno)

);
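For reference, Hive has roughly comparable data types (INT, STRING, DOUBLE), but it does not enforce PRIMARY KEY or FOREIGN KEY constraints, and later in this post we simply create the tables with STRING columns. A typed Hive sketch of the same tables (not what we actually use below) would look like:

create table dept(deptno INT, dname STRING, loc STRING);

create table emp(empno INT, ename STRING, job STRING, mgr INT,
  hiredate STRING, sal DOUBLE, comm DOUBLE, deptno INT);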


Preparing Hive to use Oracle Data                                           

Create DB in Hive
Create the database "scott" (analogous Oracle schema) in hive, by default DB will be created in /user/hive/warehouse which is under a top-level directory specified by the property hive.metastore.warehouse.dir, with .db extension eg; /user/hive/warehouse/scott.db.  You can override this default location if you want. Play with some hive commands as below


hive> create database scott;

OK

Time taken: 0.546 seconds

hive> use scott;

Create a test table
hive> create table test1(id int);
hive> insert into test1 values(10);

Show databases
hive> show databases like 'sc.*';
database_name
scott

Describe database
hive> describe database scott;
db_name comment location        owner_name      owner_type      parameters
scott           hdfs://hdpmaster:9000/user/hive/warehouse/scott.db      hdpsysuser      USER

hive> describe database extended scott;
db_name comment location        owner_name      owner_type      parameters
scott           hdfs://hdpmaster:9000/stg       hdpsysuser      USER    {date=2017-02-17, creator=Inam Bukhari}

Set Hive variable
Unfortunately, there is no command that shows which database is your current working database, but you can display it in the prompt by setting a Hive variable; such variables last until the session ends.

hive> set hive.cli.print.current.db=true;
hive (scott)>

Drop database
hive> drop database scott;

Remember, dropping a database also drops its HDFS folders, so be careful. You will get the error below if your database still contains tables.

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database scott is not empty. One or more tables exist.)

You can use the cascade option in this case

hive (scott)> drop database stgdb cascade;

Alter Database
hive> ALTER DATABASE stgdb SET DBPROPERTIES ('edited-by' = 'Bukhari');
hive (scott)> describe database extended stgdb;
db_name comment location        owner_name      owner_type      parameters

stgdb           hdfs://hdpmaster:9000/stg       hdpsysuser      USER    {edited-by=Bukhari}


Create DB at your own location
You can override the default location: create database scott location '/your location';

First create a folder in HDFS for your database:
[hdpsysuser@hdpmaster ~]$ hdfs dfs -mkdir -p /scottdb
[hdpsysuser@hdpmaster ~]$ hdfs dfs -chmod 777 /scottdb

hive> create database scott location '/scottdb';
OR
hive> create database scott location '/scottdb' WITH DBPROPERTIES ('creator' = 'Inam Bukhari', 'date' = '2017-02-17');
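You can confirm the new location afterwards; the location column should now show your /scottdb folder instead of the default warehouse path.

hive> describe database scott;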

Populate example tables (dept,emp)  in Hive database


Use/login to scott
hive> use scott;
OK
Time taken: 0.009 seconds
hive>

Create dual table 
hive (scott)> create table dual(c string);
OK
Time taken: 0.12 seconds
hive (scott)> insert into dual values('x');
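Just like Oracle's DUAL, you can now use it for quick expressions, for example:

hive (scott)> select 1+1 from dual;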

Create tables in scott
CREATE  TABLE scott.dept( deptno STRING,  dname  STRING,  loc    STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Load data into scott.dept table
Copy the csv files created earlier to a location on the machine where you run the Hive client (here /tmp/mydata). The LOCAL keyword tells Hive to read the file from the local filesystem and copy it into the table's location in HDFS.

hive>LOAD DATA LOCAL INPATH '/tmp/mydata/dept.csv' OVERWRITE INTO TABLE scott.dept;
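If the csv file is already in HDFS you can omit the LOCAL keyword; note that in that case Hive moves the file into the table's directory rather than copying it. A sketch, assuming you first put the file into a staging folder such as /stg in HDFS:

[hdpsysuser@hdpmaster ~]$ hdfs dfs -put /tmp/mydata/dept.csv /stg/
hive> LOAD DATA INPATH '/stg/dept.csv' OVERWRITE INTO TABLE scott.dept;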

hive> select * from scott.dept;
OK
10      ACCOUNTING      NEW YORK
20      RESEARCH        DALLAS
30      SALES   CHICAGO
40      OPERATIONS      BOSTON

Observe that there is no column header. If you want to see the column names of the table in Hive, you need to set the following Hive configuration property.

Setting column headers for results
hive> set hive.cli.print.header=true;

Second way: Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands it contains, if any.

[hdpsysuser@hdpmaster ~]$ echo "SET hive.cli.print.header=true;" >> ~/.hiverc

Exit from Hive and start it again, and you will see the headers.

Third way: redirect output to the console or to a file.
[hdpsysuser@hdpmaster ~]$ hive -e 'set hive.cli.print.header=true;select * from scott.dept;'
[hdpsysuser@hdpmaster ~]$ hive -e 'set hive.cli.print.header=true;select * from scott.dept;' >> /tmp/empoutput.txt

You can put other startup commands in ~/.hiverc as well, for example:
set hive.cli.print.current.db=true;
use scott;

You can set this property in hive-site.xml also.
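If you go the hive-site.xml route, the entry is a standard Hadoop-style property block, e.g.:

<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>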

Create second table 'emp' 
create table scott.emp(
  empno    STRING,  ename    STRING,  job      STRING,
  mgr      STRING,  hiredate STRING,  sal      STRING,
  comm     STRING,  deptno   STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Load data into scott.emp and select it.

LOAD DATA LOCAL INPATH '/tmp/mydata/emp.csv' OVERWRITE INTO TABLE scott.emp;

hive> select '--------------Emp Data----------------';
hive> select * from scott.emp;
OK
7369    SMITH   CLERK   7902    17-DEC-80       800             20
7499    ALLEN   SALESMAN        7698    20-FEB-81       1600    300     30
7521    WARD    SALESMAN        7698    22-FEB-81       1250    500     30
7566    JONES   MANAGER 7839    02-APR-81       2975            20
7654    MARTIN  SALESMAN        7698    28-SEP-81       1250    1400    30
7698    BLAKE   MANAGER 7839    01-MAY-81       2850            30
7782    CLARK   MANAGER 7839    09-JUN-81       2450            10
7788    SCOTT   ANALYST 7566    19-APR-87       3000            20
7839    KING    PRESIDENT               17-NOV-81       5000            10
7844    TURNER  SALESMAN        7698    08-SEP-81       1500    0       30
7876    ADAMS   CLERK   7788    23-MAY-87       1100            20
7900    JAMES   CLERK   7698    03-DEC-81       950             30
7902    FORD    ANALYST 7566    03-DEC-81       3000            20
7934    MILLER  CLERK   7782    23-JAN-82       1300            10

Time taken: 0.081 seconds, Fetched: 14 row(s)

Suppress INFO and WARNING messages
If you want to suppress INFO and WARNING messages, start Hive with the -S (silent) option; -v echoes the executed query and -e lets you run a query passed on the command line.

[hdpsysuser@hdpmaster ~]$ hive -S
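For example, to run a single query quietly from the OS shell, combine -S with -e:

[hdpsysuser@hdpmaster ~]$ hive -S -e 'select count(*) from scott.emp;'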


Using Hive Queries                                                                  

Output the results of a HiveQL query to CSV

hive> insert overwrite local directory '/tmp/output1' select * from scott.emp;

emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno

If you don't use the LOCAL keyword, the output is written to HDFS instead. In either case Hive creates the destination folder and writes the output files into it.

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/empoutput' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
select * from scott.emp;

Run the OS command from the Hive shell and check the destination; you will find a file named 000000_0.
hive> !ls -ltr /tmp/empoutput;
total 4
-rw-r--r--. 1 hdpsysuser hdpsysuser 605 Feb 16 13:02 000000_0

View the file using cat to see the output:

hive> !cat /tmp/empoutput/000000_0;

Test join
hive> select d.deptno,d.dname,e.ename,e.job,e.sal from dept d,emp e where d.deptno=e.deptno;

Top N rows
hive>  select deptno,ename,sal from emp order by sal desc limit 5;

Calculated field
hive> select ename,sal,comm,sal+comm total from emp where comm > 0;

Aggregate function
hive> select deptno,count(*),max(sal),min(sal) from emp group by deptno;

In Hive, count(*) counts all rows and count(columnA) will only count rows where columnA is non-NULL.


hive> select count(*),count(ename),count(comm) from emp;
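With the data loaded above, count(*) and count(ename) both count all 14 employees, while count(comm) counts only rows where comm is non-NULL. One caveat: because emp was created with STRING columns from a delimited text file, empty comm fields may load as empty strings rather than NULLs, in which case count(comm) still counts them; with true NULLs it would return 4 (only ALLEN, WARD, MARTIN and TURNER have a comm value in the CSV).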

CTAS
hive> create table emp2 as select * from emp;
Insert into table from another table

hive> insert into emp2 select * from emp;

DML
hive> select count(*) from emp2;
c0
57
hive> insert into emp2(deptno,empno,ename,comm) values(10,1,'ABC',NULL);
hive> select count(*) from emp2;
c0

58
hive> update emp2 set sal=1000 where empno=1;
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
hive> delete from emp2;

FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
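These statements fail because UPDATE and DELETE require a transactional (ACID) table and a transaction manager such as DbTxnManager. As a rough sketch only (assuming your Hive version and configuration support ACID; the table name emp_acid is hypothetical), such a table would be declared along these lines:

create table emp_acid(empno STRING, ename STRING, sal STRING)
clustered by (empno) into 4 buckets
stored as orc
tblproperties('transactional'='true');

We will not go down that path here; instead, use the workaround below.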

Delete workaround
The best approach is to partition your data such that the rows you want to drop are in a partition unto themselves. You can then drop the partition without impacting the rest of your table. This is a fairly sustainable model, even if your dataset grows quite large.
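A rough sketch of that pattern (the table name emp_part is hypothetical; here the data is partitioned by deptno):

hive> create table emp_part(empno STRING, ename STRING, sal STRING)
    > partitioned by (deptno STRING);
hive> insert into table emp_part partition (deptno='10')
    > select empno, ename, sal from emp where deptno='10';
hive> alter table emp_part drop partition (deptno='10');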

Alternatively, you can use the approach below: select the rows you want to keep with a WHERE condition and rewrite them into the table itself using INSERT OVERWRITE.

hive> insert overwrite table emp2 select * from emp2 where empno <> 1;

hive> select * from emp2 where empno=1; 


Put Comment in query
hive> select ename--,job
    > from emp;



Temporary Folders and Logs


Hive uses temporary folders both on the machine running the Hive client and the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the hive client when the query is finished. However, in cases of abnormal hive client termination, some data may be left behind. The configuration details are as follows:

On the HDFS cluster this is set to /tmp/hive-<username> by default and is controlled by the configuration variable hive.exec.scratchdir. On the client machine, it is hardcoded to /tmp/<username>.


Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems like S3 or even NFS.

Show all Hive configuration variables
hive> set;
Show a specific parameter
hive> set system:user.name;
system:user.name=hdpsysuser
hive> set hive.exec.scratchdir;
hive.exec.scratchdir=/tmp/hive


The logs are stored in the directory /tmp/<user.name>:
/tmp/<user.name>/hive.log
