Please see my other blog for Oracle EBusiness Suite Posts - EBMentors

Search This Blog

Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Monday, November 06, 2017

Working with HBase on HDP

Introduction
Apache HBase is a No-SQL database that runs on a Hadoop cluster. It is ideal for storing unstructured or semi-structured data. It was designed to scale due to the fact that data that is accessed together is stored together which allows to build big data applications for scaling and eliminating limitations of relational databases. 

Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs. Through HBase you can have random real-time read/write access to data in the Hadoop File System.

HBase and HDFS

HDFSHBase
HDFS is a distributed file system suitable for storing large files.HBase is a database built on top of the HDFS.
HDFS does not support fast individual record lookups.HBase provides fast lookups for larger tables.
It provides high latency batch processing; It provides low latency access to single rows from billions of records (Random access).
It provides only sequential access of data.HBase internally uses Hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.


HBase Architectural Components

Physically, HBase is composed of three types of servers in a master slave type of architecture.

Region servers: They serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. They are collocated with the HDFS DataNodes which enable data locality.

RegionsRegions are nothing but tables that are split up and spread across the region servers. These tables are divided horizontally by row key range and assigned to region servers.

HBase Master process: Region assignment, DDL (create, delete tables) operations are handled by this process

Zookeeper: a part of Hadoop ecosystem to maintain a live cluster state.





Storage Mechanism in HBase

HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key value pairs. A table have multiple column families and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in an HBase:

  • Table is a collection of rows.
  • Row is a collection of column families.
  • Column family is a collection of columns.
  • Column is a collection of key value pairs.






Region Server Components

A Region Server runs on an HDFS data node and has the following components:

WAL: Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.

BlockCache: is the read cache. It stores frequently read data in memory. Least Recently Used data is evicted when full.

MemStore: It is the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family per region.

Hfiles: To store the rows as sorted KeyValues on disk.




Installation 

You should have Hadoop installed already in order to isntall HBase. My environment has Hortonworks Data Platform, So I just added the HBase service using Ambari Admin console. After adding it you can access HBase using its shell.


Working with HBase Shell

HBase comes with an interactive shell from where you can communicate with HBase components and perform operations on them. You can communicate with HBase using Java API also but for this post we will be using only shell.

[root@dn04 ~]# su hbase
[hbase@dn04 root]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.6.1.0-129, r718c773662346de98a8ce6fd3b5f64e279cb87d4, Wed May 31 03:27:31 UTC 2017

hbase(main):001:0>

General  commands

--Show cluster status
hbase(main):010:0> status
1 active master, 0 backup masters, 2 servers, 1 dead, 6.0000 average load

hbase(main):011:0> status 'simple'

hbase(main):012:0> status 'summary'

hbase(main):014:0> status 'detailed'

--version Info

hbase(main):014:0> version

--User info
hbase(main):015:0> whoami


Create Table
Syntax: create '<table_name>','<column_family_name>'
hbase(main):005:0> create 'emp','ProfessionalData','IncomeData'
0 row(s) in 2.2810 seconds

=> Hbase::Table - emp

Drop Table
hbase(main):003:0> disable 'emp'
0 row(s) in 4.2880 seconds

hbase(main):004:0> drop 'emp'
0 row(s) in 1.2960 seconds

Check the table
hbase(main):006:0> list
TABLE
emp
1 row(s) in 0.0210 seconds

=> ["emp"]

Existence of Table
hbase(main):012:0> exists 'emp'
Table emp does exist
0 row(s) in 0.0200 seconds


Describe Table
hbase(main):013:0> describe 'emp'
Table emp is ENABLED
emp
COLUMN FAMILIES DESCRIPTION
{NAME => 'IncomeData', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRE
SSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'ProfessionalData', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0350 seconds



alter
Alter is the command used to make changes to an existing table. Using this command, you can change the maximum number of cells of a column family, set and delete table scope operators, and delete a column family from a table.

Changing the Maximum Number of Cells of a Column Family
hbase(main):014:0> alter 'emp', NAME => 'ProfessionalData', VERSIONS => 5
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 3.2610 seconds

Table Scope Operators

You can set and remove table scope operators such as MAX_FILESIZE, READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc.

hbase(main):015:0> alter 'emp', READONLY
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 2.3000 seconds

hbase(main):017:0> alter 'emp', WRITE
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 2.2080 seconds

Removing Table Scope Operators

We can also remove the table scope operators. Given below is the syntax to remove ‘MAX_FILESIZE’ from emp table.
alter 'emp', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'

Deleting a Column Family
alter ‘ emp’, ‘delete’ => ‘ column family ’ 

Data Manipulation

Think I've below data in a csv format (/data/mydata/emp.csv) which I want to use for the table (emp) I created earlier.

SQL> select empno||','|| ename|| ','||job||','||mgr||','||hiredate||','||sal||','||comm||','||deptno  from scott.emp;

EMPNO||','||ENAME||','||JOB||','||MGR||','||HIREDATE||','||SAL||','||COMM||','||DEPTNO
------------------------------------------------------------------
7369,SMITH,CLERK,7902,17-DEC-80,800,,20
7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30
7521,WARD,SALESMAN,7698,22-FEB-81,1250,500,30
7566,JONES,MANAGER,7839,02-APR-81,2975,,20
7654,MARTIN,SALESMAN,7698,28-SEP-81,1250,1400,30
7698,BLAKE,MANAGER,7839,01-MAY-81,2850,,30
7782,CLARK,MANAGER,7839,09-JUN-81,2450,,10
7788,SCOTT,ANALYST,7566,19-APR-87,3000,,20
7839,KING,PRESIDENT,,17-NOV-81,5000,,10
7844,TURNER,SALESMAN,7698,08-SEP-81,1500,0,30
7876,ADAMS,CLERK,7788,23-MAY-87,1100,,20
7900,JAMES,CLERK,7698,03-DEC-81,950,,30
7902,FORD,ANALYST,7566,03-DEC-81,3000,,20
7934,MILLER,CLERK,7782,23-JAN-82,1300,,10
14 rows selected.



Now I want to insert the second row of above table (csv), Using put command, you can insert rows into a table. Its syntax is as follows:

put ’<table name>’,’row1’,’<colfamily:colname>’,’<value>’

hbase(main):007:0> put 'emp','7499','ProfessionalData:ename','ALLEN'
0 row(s) in 0.0960 seconds

hbase(main):008:0> put 'emp','7499','ProfessionalData:job','SALESMAN'
0 row(s) in 0.0110 seconds

hbase(main):009:0> put 'emp','7499','IncomeData:sal','1600'
0 row(s) in 0.0120 seconds

hbase(main):010:0> put 'emp','7499','IncomeData:comm','300'
0 row(s) in 0.0120 seconds

Scan Data

hbase(main):011:0> scan 'emp'
ROW            COLUMN+CELL
 7499          column=IncomeData:comm, timestamp=1509953805568, value=300
 7499          column=IncomeData:sal, timestamp=1509953798889, value=1600
 7499          column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN
 7499          column=ProfessionalData:job, timestamp=1509953791768, value=SALESMAN
1 row(s) in 0.0410 seconds


Updating Data

You can update an existing cell value using the put command.
put ‘table name’,’row ’,'Column family:column name',’new value’

hbase(main):018:0> scan 'emp'
ROW                                         COLUMN+CELL
 7499         column=IncomeData:comm, timestamp=1509953805568, value=300
 7499         column=IncomeData:sal, timestamp=1509953798889, value=1600
 7499         column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN
 7499         column=ProfessionalData:job, timestamp=1509953791768, value=SALESMAN
1 row(s) in 0.0300 seconds

hbase(main):019:0> put 'emp','7499','ProfessionalData:job','MANAGER'
0 row(s) in 0.0200 seconds

hbase(main):020:0> scan 'emp'
ROW           COLUMN+CELL
 7499         column=IncomeData:comm, timestamp=1509953805568, value=300
 7499         column=IncomeData:sal, timestamp=1509953798889, value=1600
 7499         column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN
 7499         column=ProfessionalData:job, timestamp=1509955831838, value=MANAGER
1 row(s) in 0.0180 seconds


Reading Data

Syntax: get ’<table name>’,’row1’
hbase(main):021:0> get 'emp', '7499'
COLUMN                                      CELL
 IncomeData:comm                            timestamp=1509953805568, value=300
 IncomeData:sal                             timestamp=1509953798889, value=1600
 ProfessionalData:ename                     timestamp=1509953783956, value=ALLEN
 ProfessionalData:job                       timestamp=1509955831838, value=MANAGER
4 row(s) in 0.0560 seconds


Reading a Specific Column
Syntax:  get 'table name', ‘rowid’, {COLUMN ⇒ ‘column family:column name ’}

hbase(main):022:0> get 'emp', '7499', {COLUMN => 'ProfessionalData:ename'}
COLUMN                                      CELL
 ProfessionalData:ename       timestamp=1509953783956, value=ALLEN
1 row(s) in 0.0300 seconds

Deleting Data

Deleting a Specific Cell in a Table
Syntax:  delete ‘<table name>’, ‘<row>’, ‘<column name >’, ‘<time stamp>’

hbase(main):023:0> delete 'emp', '7499', 'IncomeData:sal',1509953798889
0 row(s) in 0.0440 seconds

hbase(main):027:0> scan 'emp'
ROW                                         COLUMN+CELL
 7499           column=IncomeData:comm, timestamp=1509953805568, value=300
 7499           column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN
 7499           column=ProfessionalData:job, timestamp=1509955831838, value=MANAGER
1 row(s) in 0.0110 seconds

Deleting All Cells in a Table
hbase(main):028:0> deleteall 'emp','74499'
0 row(s) in 0.0170 seconds

ImportTsv Utility in HBase:

ImportTsv is a utility that will load data in TSV or CSV format into a specified HBase table. The column names of the TSV data must be specified using the -Dimporttsv.columns option. This option takes the form of comma-separated column names, where each column name is either a simple column family, or a columnfamily:qualifier. The special column name HBASE_ROW_KEY is used to designate that this column should be used as the row key for each imported record. You must specify exactly one column to be the row key, and you must specify a column name for every column that exists in the input data.

Next argument is the table name where you want the data to be imported

Third argument specifies the input directory of CSV data.


First copy the csv file into HDFS
[root@dn04 ~]# su hdfs
[hdfs@dn04 root]$ hadoop fs -copyFromLocal /data/mydata/emp.csv /tmp

Now use ImportTsv tool

[hbase@dn04 root]$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=,  -Dimporttsv.columns="HBASE_ROW_KEY,ProfessionalData:ename,ProfessionalData:job,ProfessionalData:mgr,ProfessionalData:hiredate,IncomeData:sal,IncomeData:comm,IncomeData:deptno" emp hdfs://nn01:8020/tmp/emp.csv

2017-11-06 11:37:23,465 INFO  [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x4e268090 connecting to ZooKeeper ensemble=dn02:2181,dn04:2181,dn03:2181
2017-11-06 11:37:23,472 INFO  [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-129--1, built on 05/31/2017 03:01 GMT
2017-11-06 11:37:23,472 INFO  [main] zookeeper.ZooKeeper: Client environment:host.name=dn04
2017-11-06 11:37:23,472 INFO  [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_121
2017-11-06 11:37:23,472 INFO  [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation

..
...
ImportTsv
                Bad Lines=0
        File Input Format Counters
                Bytes Read=617
        File Output Format Counters
                Bytes Written=0



Now scan and verify import
hbase(main):001:0> scan 'emp'
ROW                                         COLUMN+CELL
 7369                                       column=IncomeData:comm, timestamp=1509957443054, value=20
 7369                                       column=IncomeData:sal, timestamp=1509957443054, value=
 7369                                       column=ProfessionalData:empno, timestamp=1509957443054, value=SMITH
 7369                                       column=ProfessionalData:ename, timestamp=1509957443054, value=CLERK
 7369                                       column=ProfessionalData:hiredate, timestamp=1509957443054, value=800
 7369                                       column=ProfessionalData:job, timestamp=1509957443054, value=7902
 7369                                       column=ProfessionalData:mgr, timestamp=1509957443054, value=17-DEC-80
 7499                                       column=IncomeData:comm, timestamp=1509957443054, value=30
 7499                                       column=IncomeData:sal, timestamp=1509957443054, value=300
 7499                                       column=ProfessionalData:empno, timestamp=1509957443054, value=ALLEN
 7499                                       column=ProfessionalData:ename, timestamp=1509957443054, value=SALESMAN
 7499                                       column=ProfessionalData:hiredate, timestamp=1509957443054, value=1600
 7499                                       column=ProfessionalData:job, timestamp=1509957443054, value=7698
 7499                                       column=ProfessionalData:mgr, timestamp=1509957443054, value=20-FEB-81



Accessing HBase Programmatically

You can access HBase using java APIs provided. Below is given as an example

1- Set required java environment varialbes

[hbase@te1-hdp-rp-dn04 ~]$ export CLASSPATH=/usr/hdp/2.6.1.0-129/hbase/lib/*
[hbase@te1-hdp-rp-dn04 ~]$ export CLASSPATH=$CLASSPATH:/usr/hdp/2.6.1.0-129/hadoop/*:.

2- write Java code
***********Java File CreateTable.java ******************
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.TableName;

import org.apache.hadoop.conf.Configuration;

public class CreateTable {
      
   public static void main(String[] args)  {
try{

    Configuration conf = HBaseConfiguration.create();
    System.out.println("conf==> "+conf);
    conf.set("hbase.zookeeper.property.clientPort", "2181");
            conf.set("hbase.zookeeper.quorum", "te1-hdp-rp-dn04");
      conf.set("zookeeper.znode.parent", "/hbase-unsecure");

    // Instantiating HbaseAdmin class
    HBaseAdmin admin = new HBaseAdmin(conf);

     // Instantiating table descriptor class
   HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("empz"));

  // Adding column families to table descriptor
  tableDescriptor.addFamily(new HColumnDescriptor("personal"));
  tableDescriptor.addFamily(new HColumnDescriptor("professional"));

// Execute the table through admin
admin.createTable(tableDescriptor);

System.out.println(" Table created ");

// Getting all the list of tables using HBaseAdmin object
HTableDescriptor[] tableDescriptorLst = admin.listTables();

// printing all the table names.
        for (int i=0; i<tableDescriptorLst.length;i++ ){
        System.out.println(tableDescriptorLst[i].getNameAsString());
}

//Disable table
Boolean b = admin.isTableDisabled("empz");
if(!b){
   admin.disableTable("empz");
   System.out.println("Table disabled");
}





}catch(Exception ex){      System.out.println("Error--> "+ ex.toString());  }
   }
}

3- Compile java code

[hbase@te1-hdp-rp-dn04 ~]$ javac CreateTable.java

4- Run java class file

[hbase@te1-hdp-rp-dn04 ~]$ java CreateTable
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
conf==> Configuration: core-default.xml, core-site.xml, hbase-default.xml, hbase-site.xml
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
 Table created

5- Verify from hbase shell

hbase(main):007:0> list
TABLE
SYSTEM.CATALOG
SYSTEM.FUNCTION
SYSTEM.SEQUENCE
SYSTEM.STATS
....
empz

11 row(s) in 0.0100 seconds

=> ["SYSTEM.CATALOG", "SYSTEM.FUNCTION", "SYSTEM.SEQUENCE", "SYSTEM.STATS", "TEST", "emp", "empz", "pagecounts", "scott.emp_hbase1", "scott.emp_hbase2", "test"]
hbase(main):008:0>



No comments: