DBMentors - Inam Bukhari's Blog: Hive for Oracle Developers and DBAs

The Hadoop ecosystem emerged as a cost-effective way of working with large data sets. It imposes a particular programming model, called MapReduce, for breaking up computation tasks into units that can be distributed around a cluster of commodity, server class hardware, thereby providing cost-effective, horizontal scalability.

Underneath this computation model is a distributed file system called the Hadoop Distributed Filesystem (HDFS). However, a challenge remains; how do you move an existing data infrastructure to Hadoop, when that infrastructure is based on traditional relational databases and the Structured Query Language (SQL)? What about the large base of SQL users, both expert database designers and administrators, as well as casual users who use SQL to extract information from their data warehouses?

This is where Hive comes in. Hive provides a SQL dialect, called Hive Query Language

(abbreviated HiveQL or just HQL) for querying data stored in a Hadoop cluster.

Please remember if you need OLTP features for large-scale data, you should consider using a NoSQL database. Examples include HBase, a NoSQL database integrated with Hadoop like Cassandra and DynamoDB if you are using Amazon’s Elastic MapReduce (EMR) or Elastic Compute Cloud (EC2).

As being Oracle chap, my fellow DBAs requested to create analogy for the Hive so that they may be able to work in Hive like familiar Oracle schema ie; scott. This post covers this analogy which helps Oracle Developers and DBA to run the same (almost not exactly :) ) SQL (Oracle) they are familiar with.

Prerequisites

Complete my previous post Hive Installation and Configuration

Preparing Oracle Data for Hive

Step 1: Connect to SQL Plus using Scott/tiger and run the below query to take data to be loaded to Hive later.

SQL> select deptno||','|| dname|| ','||loc from scott.dept;
DEPTNO||','||DNAME||','||LOC
--------------------------------------------------
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

SQL> set lines 100
SQL> set pages 25
SQL> select empno||','|| ename|| ','||job||','||mgr||','||hiredate||','||sal||','||comm||','||deptno from scott.emp;

EMPNO||','||ENAME||','||JOB||','||MGR||','||HIREDATE||','||SAL||','||COMM||','||DEPTNO
------------------------------------------------------------------
7369,SMITH,CLERK,7902,17-DEC-80,800,,20
7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30
7521,WARD,SALESMAN,7698,22-FEB-81,1250,500,30
7566,JONES,MANAGER,7839,02-APR-81,2975,,20
7654,MARTIN,SALESMAN,7698,28-SEP-81,1250,1400,30
7698,BLAKE,MANAGER,7839,01-MAY-81,2850,,30
7782,CLARK,MANAGER,7839,09-JUN-81,2450,,10
7788,SCOTT,ANALYST,7566,19-APR-87,3000,,20
7839,KING,PRESIDENT,,17-NOV-81,5000,,10
7844,TURNER,SALESMAN,7698,08-SEP-81,1500,0,30
7876,ADAMS,CLERK,7788,23-MAY-87,1100,,20
7900,JAMES,CLERK,7698,03-DEC-81,950,,30
7902,FORD,ANALYST,7566,03-DEC-81,3000,,20
7934,MILLER,CLERK,7782,23-JAN-82,1300,,10
14 rows selected.

Take the above output and create dept.csv and emp.csv respectively which we will load into hive.

Step 2: Take the Oracle's table (dept and emp) DDL script also which will help us to create the table in hive later
create table dept(
deptno number(2,0),
dname varchar2(14),
loc varchar2(13),
constraint pk_dept primary key (deptno)
);

create table emp(
empno number(4,0),
ename varchar2(10),
job varchar2(9),
mgr number(4,0),
hiredate date,
sal number(7,2),
comm number(7,2),
deptno number(2,0),
constraint pk_emp primary key (empno),
constraint fk_deptno foreign key (deptno) references dept (deptno)

);

Preparing Hive to use Oracle Data

Create DB in Hive

Create the database "scott" (analogous Oracle schema) in hive, by default DB will be created in /user/hive/warehouse which is under a top-level directory specified by the property hive.metastore.warehouse.dir, with .db extension eg; /user/hive/warehouse/scott.db. You can override this default location if you want. Play with some hive commands as below

hive> create database scott;

Time taken: 0.546 seconds

hive> use scott;

Create a test table

hive> create table test1(id int);

hive> insert into test1 values(10);

Show databases

hive> show databases like 'sc.*';

database_name

scott

Describe database

hive> describe database scott;

db_name comment location owner_name owner_type parameters

scott hdfs://hdpmaster:9000/user/hive/warehouse/scott.db hdpsysuser USER

hive> describe database extended scott;

db_name comment location owner_name owner_type parameters

scott hdfs://hdpmaster:9000/stg hdpsysuser USER {date=2017-02-17, creator=Inam Bukhari}

Set Hive variable

Unfortunately, there is no command to show you which database is your current working database but you can know by setting hive variable on hive prompt, these variables last util session ends.

hive> set hive.cli.print.current.db=true;

hive (scott)>

Drop database

hive> drop database scott;

Remember, dropping database will drop the HDFS folders also so be careful. you could get the below error if your database has tables.

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database scott is not empty. One or more tables exist.)

You can use the cascade option in this case

hive (scott)> drop database stgdb cascade;

Alter Database

hive> ALTER DATABASE stgdb SET DBPROPERTIES ('edited-by' = 'Bukhari');

hive (scott)> describe database extended stgdb;
db_name comment location owner_name owner_type parameters

stgdb hdfs://hdpmaster:9000/stg hdpsysuser USER {edited-by=Bukhari}

Create DB at your own location

You could change the default location, create database scott location '/your location';

first create folder in HDFS for your DB

[hdpsysuser@hdpmaster ~]$ hdfs dfs -mkdir -p /scottdb

[hdpsysuser@hdpmaster ~]$ hdfs dfs -chmod 777 /scottdb

hive> create database scott location '/scottdb';
OR

hive> create database scott location '/scottdb' WITH DBPROPERTIES ('creator' = 'Inam Bukhari', 'date' = '2017-02-17');

Populate example tables (dept,emp) in Hive database

Use/login to scott

hive> use scott;

Time taken: 0.009 seconds

hive>

Create dual table

hive (scott)> create table dual(c string);

Time taken: 0.12 seconds

hive (scott)> insert into dual values('x');

Create tabels in scott

CREATE TABLE scott.dept( deptno STRING, dname STRING, loc STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Load data into scott.dept table

You should have copied the csv files created earlier above on some location of your HDFS client and load into HDFS.

hive>LOAD DATA LOCAL INPATH '/tmp/mydata/dept.csv' OVERWRITE INTO TABLE scott.dept;

hive> select * from scott.dept;

10 ACCOUNTING NEW YORK

20 RESEARCH DALLAS

30 SALES CHICAGO

40 OPERATIONS BOSTON

Observe there is no column header, If we want to see the columns names of the table in Hive, you need to set the following hive conf property.

Setting column headers for results

hive> set hive.cli.print.header=true;

Second way, Hive automatically looks for a file named .hiverc in your HOME directory (User's HOME) and runs the commands it contains, if any

[hdpsysuser@hdpmaster ~]$ echo "SET hive.cli.print.header=true;" >> ~/.hiverc

exit from hive and then re-run and you will see the headers

Third way, redirect output on console or to file
[hdpsysuser@hdpmaster ~]$ hive -e 'set hive.cli.print.header=true;select * from scott.dept;'
[hdpsysuser@hdpmaster ~]$ hive -e 'set hive.cli.print.header=true;select * from scott.dept;' >> /tmp/empoutput.txt

set hive.cli.print.current.db=true;
use scott;

You can set this property in hive-site.xml also.

Create second table 'emp'
create table scott.emp( empno STRING, ename STRING, job STRING,
mgr STRING, hiredate STRING, sal STRING,
comm STRING, deptno STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Load data into scott.emp and select it.

LOAD DATA LOCAL INPATH '/tmp/mydata/emp.csv' OVERWRITE INTO TABLE scott.emp;

hive> select '--------------Emp Data----------------';
hive> select * from scott.emp;
OK
7369 SMITH CLERK 7902 17-DEC-80 800 20
7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30
7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30
7566 JONES MANAGER 7839 02-APR-81 2975 20
7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 30
7698 BLAKE MANAGER 7839 01-MAY-81 2850 30
7782 CLARK MANAGER 7839 09-JUN-81 2450 10
7788 SCOTT ANALYST 7566 19-APR-87 3000 20
7839 KING PRESIDENT 17-NOV-81 5000 10
7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 30
7876 ADAMS CLERK 7788 23-MAY-87 1100 20
7900 JAMES CLERK 7698 03-DEC-81 950 30
7902 FORD ANALYST 7566 03-DEC-81 3000 20
7934 MILLER CLERK 7782 23-JAN-82 1300 10

Time taken: 0.081 seconds, Fetched: 14 row(s)

Suppress INFO and WARNING messages
If you want to suppress INFO and WARNING messages on hive you could start hive using -S option, -v is for query display, -e is for executing the query

[hdpsysuser@hdpmaster ~]$ hive -S

Using Hive Queries

Output the results of a HiveQL query to CSV

hive> insert overwrite local directory '/tmp/output1' select * from scott.emp;

emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno

If you don't use local keyword then file is created on HDFS. it will create the folder in destination where output is written

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/empoutput'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select * from scott.emp;

run the os command from hive shell and see the destination, you will get the 000000_0, view this file
hive> !ls -ltr /tmp/empoutput;
total 4
-rw-r--r--. 1 hdpsysuser hdpsysuser 605 Feb 16 13:02 000000_0

view the file using cat and you will see the output here

hive> !cat /tmp/empoutput/000000_0;

Test join
hive> select d.deptno,d.dname,e.ename,e.job,e.sal from dept d,emp e where d.deptno=e.deptno;

Top N rows
hive> select deptno,ename,sal from emp order by sal desc limit 5;

Calculated field
hive> select ename,sal,comm,sal+comm total from emp where comm > 0;

Aggregate function
hive> select deptno,count(*),max(sal),min(sal) from emp group by deptno;

In Hive, count(*) counts all rows and count(columnA) will only count rows where columnA is non-NULL.

hive> select count(*),count(ename),count(comm) from emp;

CTAS
hive> create table emp2 as select * from emp;
Insert into table from another table

hive> insert into emp2 select * from emp;

DML
hive> select count(*) from emp2;
c0
57
hive> insert into emp2(deptno,empno,ename,comm) values(10,1,'ABC',NULL);
hive> select count(*) from emp2;
c0

58
hive> update emp2 set sal=1000 where empno=1;
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
hive> delete from emp2;

FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.

Delete workaround

The best approach is to partition your data such that the rows you want to drop are in a partition unto themselves. You can then drop the partition without impacting the rest of your table. This is a fairly sustainable model, even if your dataset grows quite large.

alternative you could use below approach, take your desired data to be existed in where condition and rewrite to the table itself using insert.

hive> insert overwrite table emp2 select * from emp2 where empno <> 1;

hive> select * from emp2 where empno=1;

Put Comment in query

hive> select ename--,job

> from emp;

Temporary Folders and Logs

Hive uses temporary folders both on the machine running the Hive client and the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the hive client when the query is finished. However, in cases of abnormal hive client termination, some data may be left behind. The configuration details are as follows:

On the HDFS cluster this is set to /tmp/hive-<username> by default and is controlled by the configuration variable hive.exec.scratchdir, On the client machine, this is hardcoded to /tmp/<username>

Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems like S3 or even NFS.

Show all hive configuration variable
hive> set;
Show specific parameter
hive> set system:user.name;
system:user.name=hdpsysuser
hive> set hive.exec.scratchdir;
hive.exec.scratchdir=/tmp/hive

The logs are stored in the directory /tmp/<user.name>:
/tmp/<user.name>/hive.log

Next Post
Hive for Oracle Developers and DBAs - Part II