Working with HBase on HDP

Introduction

Apache HBase is a No-SQL database that runs on a Hadoop cluster. It is ideal for storing unstructured or semi-structured data. It was designed to scale due to the fact that data that is accessed together is stored together which allows to build big data applications for scaling and eliminating limitations of relational databases.

Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs. Through HBase you can have random real-time read/write access to data in the Hadoop File System.

HBase and HDFS

HDFS HBase

HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of the HDFS.

HDFS does not support fast individual record lookups. HBase provides fast lookups for larger tables.

It provides high latency batch processing; It provides low latency access to single rows from billions of records (Random access).

It provides only sequential access of data. HBase internally uses Hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.

HDFS	HBase
HDFS is a distributed file system suitable for storing large files.	HBase is a database built on top of the HDFS.
HDFS does not support fast individual record lookups.	HBase provides fast lookups for larger tables.
It provides high latency batch processing;	It provides low latency access to single rows from billions of records (Random access).
It provides only sequential access of data.	HBase internally uses Hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.

HBase Architectural Components

Physically, HBase is composed of three types of servers in a master slave type of architecture.

Region servers: They serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. They are collocated with the HDFS DataNodes which enable data locality.

Regions: Regions are nothing but tables that are split up and spread across the region servers. These tables are divided horizontally by row key range and assigned to region servers.

HBase Master process: Region assignment, DDL (create, delete tables) operations are handled by this process

Zookeeper: a part of Hadoop ecosystem to maintain a live cluster state.

Storage Mechanism in HBase

HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key value pairs. A table have multiple column families and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in an HBase:

Table is a collection of rows.
Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key value pairs.

Region Server Components

A Region Server runs on an HDFS data node and has the following components:

WAL: Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.

BlockCache: is the read cache. It stores frequently read data in memory. Least Recently Used data is evicted when full.

MemStore: It is the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family per region.

Hfiles: To store the rows as sorted KeyValues on disk.

Installation

You should have Hadoop installed already in order to isntall HBase. My environment has Hortonworks Data Platform, So I just added the HBase service using Ambari Admin console. After adding it you can access HBase using its shell.

Working with HBase Shell

HBase comes with an interactive shell from where you can communicate with HBase components and perform operations on them. You can communicate with HBase using Java API also but for this post we will be using only shell.

[root@dn04 ~]# su hbase

[hbase@dn04 root]$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version 1.1.2.2.6.1.0-129, r718c773662346de98a8ce6fd3b5f64e279cb87d4, Wed May 31 03:27:31 UTC 2017

hbase(main):001:0>

General commands

--Show cluster status

hbase(main):010:0> status

1 active master, 0 backup masters, 2 servers, 1 dead, 6.0000 average load

hbase(main):011:0> status 'simple'

hbase(main):012:0> status 'summary'

hbase(main):014:0> status 'detailed'

--version Info

hbase(main):014:0> version

--User info

hbase(main):015:0> whoami

Create Table

Syntax: create '<table_name>','<column_family_name>'

hbase(main):005:0> create 'emp','ProfessionalData','IncomeData'

0 row(s) in 2.2810 seconds

=> Hbase::Table - emp

Drop Table

hbase(main):003:0> disable 'emp'

0 row(s) in 4.2880 seconds

hbase(main):004:0> drop 'emp'

0 row(s) in 1.2960 seconds

Check the table

hbase(main):006:0> list

TABLE

emp

1 row(s) in 0.0210 seconds

=> ["emp"]

Existence of Table

hbase(main):012:0> exists 'emp'

Table emp does exist

0 row(s) in 0.0200 seconds

Describe Table

hbase(main):013:0> describe 'emp'

Table emp is ENABLED

emp

COLUMN FAMILIES DESCRIPTION

{NAME => 'IncomeData', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRE

SSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

{NAME => 'ProfessionalData', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',

COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

2 row(s) in 0.0350 seconds

alter

Alter is the command used to make changes to an existing table. Using this command, you can change the maximum number of cells of a column family, set and delete table scope operators, and delete a column family from a table.

Changing the Maximum Number of Cells of a Column Family

hbase(main):014:0> alter 'emp', NAME => 'ProfessionalData', VERSIONS => 5

Updating all regions with the new schema...

0/1 regions updated.

1/1 regions updated.

Done.

0 row(s) in 3.2610 seconds

Table Scope Operators

You can set and remove table scope operators such as MAX_FILESIZE, READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc.

hbase(main):015:0> alter 'emp', READONLY

Updating all regions with the new schema...

1/1 regions updated.

Done.

0 row(s) in 2.3000 seconds

hbase(main):017:0> alter 'emp', WRITE

Updating all regions with the new schema...

1/1 regions updated.

Done.

0 row(s) in 2.2080 seconds

Removing Table Scope Operators

We can also remove the table scope operators. Given below is the syntax to remove ‘MAX_FILESIZE’ from emp table.

alter 'emp', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'

Deleting a Column Family

alter ‘ emp’, ‘delete’ => ‘ column family ’

Data Manipulation

Think I've below data in a csv format (/data/mydata/emp.csv) which I want to use for the table (emp) I created earlier.

SQL> select empno||','|| ename|| ','||job||','||mgr||','||hiredate||','||sal||','||comm||','||deptno from scott.emp;

EMPNO||','||ENAME||','||JOB||','||MGR||','||HIREDATE||','||SAL||','||COMM||','||DEPTNO
------------------------------------------------------------------
7369,SMITH,CLERK,7902,17-DEC-80,800,,20
7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30
7521,WARD,SALESMAN,7698,22-FEB-81,1250,500,30
7566,JONES,MANAGER,7839,02-APR-81,2975,,20
7654,MARTIN,SALESMAN,7698,28-SEP-81,1250,1400,30
7698,BLAKE,MANAGER,7839,01-MAY-81,2850,,30
7782,CLARK,MANAGER,7839,09-JUN-81,2450,,10
7788,SCOTT,ANALYST,7566,19-APR-87,3000,,20
7839,KING,PRESIDENT,,17-NOV-81,5000,,10
7844,TURNER,SALESMAN,7698,08-SEP-81,1500,0,30
7876,ADAMS,CLERK,7788,23-MAY-87,1100,,20
7900,JAMES,CLERK,7698,03-DEC-81,950,,30
7902,FORD,ANALYST,7566,03-DEC-81,3000,,20
7934,MILLER,CLERK,7782,23-JAN-82,1300,,10
14 rows selected.

Now I want to insert the second row of above table (csv), Using put command, you can insert rows into a table. Its syntax is as follows:

put ’<table name>’,’row1’,’<colfamily:colname>’,’<value>’

hbase(main):007:0> put 'emp','7499','ProfessionalData:ename','ALLEN'

0 row(s) in 0.0960 seconds

hbase(main):008:0> put 'emp','7499','ProfessionalData:job','SALESMAN'

0 row(s) in 0.0110 seconds

hbase(main):009:0> put 'emp','7499','IncomeData:sal','1600'

0 row(s) in 0.0120 seconds

hbase(main):010:0> put 'emp','7499','IncomeData:comm','300'

0 row(s) in 0.0120 seconds

Scan Data

hbase(main):011:0> scan 'emp'

ROW COLUMN+CELL

7499 column=IncomeData:comm, timestamp=1509953805568, value=300

7499 column=IncomeData:sal, timestamp=1509953798889, value=1600

7499 column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN

7499 column=ProfessionalData:job, timestamp=1509953791768, value=SALESMAN

1 row(s) in 0.0410 seconds

Updating Data

You can update an existing cell value using the put command.

put ‘table name’,’row ’,'Column family:column name',’new value’

hbase(main):018:0> scan 'emp'

ROW COLUMN+CELL

7499 column=IncomeData:comm, timestamp=1509953805568, value=300

7499 column=IncomeData:sal, timestamp=1509953798889, value=1600

7499 column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN

7499 column=ProfessionalData:job, timestamp=1509953791768, value=SALESMAN

1 row(s) in 0.0300 seconds

hbase(main):019:0> put 'emp','7499','ProfessionalData:job','MANAGER'

0 row(s) in 0.0200 seconds

hbase(main):020:0> scan 'emp'

ROW COLUMN+CELL

7499 column=IncomeData:comm, timestamp=1509953805568, value=300

7499 column=IncomeData:sal, timestamp=1509953798889, value=1600

7499 column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN

7499 column=ProfessionalData:job, timestamp=1509955831838, value=MANAGER

1 row(s) in 0.0180 seconds

Reading Data

Syntax: get ’<table name>’,’row1’

hbase(main):021:0> get 'emp', '7499'

COLUMN CELL

IncomeData:comm timestamp=1509953805568, value=300

IncomeData:sal timestamp=1509953798889, value=1600

ProfessionalData:ename timestamp=1509953783956, value=ALLEN

ProfessionalData:job timestamp=1509955831838, value=MANAGER

4 row(s) in 0.0560 seconds

Reading a Specific Column

Syntax: get 'table name', ‘rowid’, {COLUMN ⇒ ‘column family:column name ’}

hbase(main):022:0> get 'emp', '7499', {COLUMN => 'ProfessionalData:ename'}

COLUMN CELL

ProfessionalData:ename timestamp=1509953783956, value=ALLEN

1 row(s) in 0.0300 seconds

Deleting Data

Deleting a Specific Cell in a Table

Syntax: delete ‘<table name>’, ‘<row>’, ‘<column name >’, ‘<time stamp>’

hbase(main):023:0> delete 'emp', '7499', 'IncomeData:sal',1509953798889

0 row(s) in 0.0440 seconds

hbase(main):027:0> scan 'emp'

ROW COLUMN+CELL

7499 column=IncomeData:comm, timestamp=1509953805568, value=300

7499 column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN

7499 column=ProfessionalData:job, timestamp=1509955831838, value=MANAGER

1 row(s) in 0.0110 seconds

Deleting All Cells in a Table

hbase(main):028:0> deleteall 'emp','74499'

0 row(s) in 0.0170 seconds

ImportTsv Utility in HBase:

ImportTsv is a utility that will load data in TSV or CSV format into a specified HBase table. The column names of the TSV data must be specified using the -Dimporttsv.columns option. This option takes the form of comma-separated column names, where each column name is either a simple column family, or a columnfamily:qualifier. The special column name HBASE_ROW_KEY is used to designate that this column should be used as the row key for each imported record. You must specify exactly one column to be the row key, and you must specify a column name for every column that exists in the input data.

Next argument is the table name where you want the data to be imported

Third argument specifies the input directory of CSV data.

First copy the csv file into HDFS

[root@dn04 ~]# su hdfs

[hdfs@dn04 root]$ hadoop fs -copyFromLocal /data/mydata/emp.csv /tmp

Now use ImportTsv tool

[hbase@dn04 root]$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,ProfessionalData:ename,ProfessionalData:job,ProfessionalData:mgr,ProfessionalData:hiredate,IncomeData:sal,IncomeData:comm,IncomeData:deptno" emp hdfs://nn01:8020/tmp/emp.csv

2017-11-06 11:37:23,465 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x4e268090 connecting to ZooKeeper ensemble=dn02:2181,dn04:2181,dn03:2181

2017-11-06 11:37:23,472 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-129--1, built on 05/31/2017 03:01 GMT

2017-11-06 11:37:23,472 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=dn04

2017-11-06 11:37:23,472 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_121

2017-11-06 11:37:23,472 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation

...

ImportTsv

Bad Lines=0

File Input Format Counters

Bytes Read=617

File Output Format Counters

Bytes Written=0

Now scan and verify import

hbase(main):001:0> scan 'emp'

ROW COLUMN+CELL

7369 column=IncomeData:comm, timestamp=1509957443054, value=20

7369 column=IncomeData:sal, timestamp=1509957443054, value=

7369 column=ProfessionalData:empno, timestamp=1509957443054, value=SMITH

7369 column=ProfessionalData:ename, timestamp=1509957443054, value=CLERK

7369 column=ProfessionalData:hiredate, timestamp=1509957443054, value=800

7369 column=ProfessionalData:job, timestamp=1509957443054, value=7902

7369 column=ProfessionalData:mgr, timestamp=1509957443054, value=17-DEC-80

7499 column=IncomeData:comm, timestamp=1509957443054, value=30

7499 column=IncomeData:sal, timestamp=1509957443054, value=300

7499 column=ProfessionalData:empno, timestamp=1509957443054, value=ALLEN

7499 column=ProfessionalData:ename, timestamp=1509957443054, value=SALESMAN

7499 column=ProfessionalData:hiredate, timestamp=1509957443054, value=1600

7499 column=ProfessionalData:job, timestamp=1509957443054, value=7698

7499 column=ProfessionalData:mgr, timestamp=1509957443054, value=20-FEB-81

Accessing HBase Programmatically

You can access HBase using java APIs provided. Below is given as an example

1- Set required java environment varialbes

[hbase@te1-hdp-rp-dn04 ~]$ export CLASSPATH=/usr/hdp/2.6.1.0-129/hbase/lib/*
[hbase@te1-hdp-rp-dn04 ~]$ export CLASSPATH=$CLASSPATH:/usr/hdp/2.6.1.0-129/hadoop/*:.

2- write Java code
***********Java File CreateTable.java ******************
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.TableName;

import org.apache.hadoop.conf.Configuration;

public class CreateTable {

public static void main(String[] args) {
try{

Configuration conf = HBaseConfiguration.create();
System.out.println("conf==> "+conf);
conf.set("hbase.zookeeper.property.clientPort", "2181");
conf.set("hbase.zookeeper.quorum", "te1-hdp-rp-dn04");
conf.set("zookeeper.znode.parent", "/hbase-unsecure");

// Instantiating HbaseAdmin class
HBaseAdmin admin = new HBaseAdmin(conf);

// Instantiating table descriptor class
HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("empz"));

// Adding column families to table descriptor
tableDescriptor.addFamily(new HColumnDescriptor("personal"));
tableDescriptor.addFamily(new HColumnDescriptor("professional"));

// Execute the table through admin
admin.createTable(tableDescriptor);

System.out.println(" Table created ");

// Getting all the list of tables using HBaseAdmin object
HTableDescriptor[] tableDescriptorLst = admin.listTables();

// printing all the table names.
for (int i=0; i<tableDescriptorLst.length;i++ ){
System.out.println(tableDescriptorLst[i].getNameAsString());
}

//Disable table
Boolean b = admin.isTableDisabled("empz");
if(!b){
admin.disableTable("empz");
System.out.println("Table disabled");
}

}catch(Exception ex){ System.out.println("Error--> "+ ex.toString()); }
}
}

3- Compile java code

[hbase@te1-hdp-rp-dn04 ~]$ javac CreateTable.java

4- Run java class file

[hbase@te1-hdp-rp-dn04 ~]$ java CreateTable

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

conf==> Configuration: core-default.xml, core-site.xml, hbase-default.xml, hbase-site.xml

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Table created

5- Verify from hbase shell

hbase(main):007:0> list

TABLE

SYSTEM.CATALOG

SYSTEM.FUNCTION

SYSTEM.SEQUENCE

SYSTEM.STATS

....

empz

11 row(s) in 0.0100 seconds

=> ["SYSTEM.CATALOG", "SYSTEM.FUNCTION", "SYSTEM.SEQUENCE", "SYSTEM.STATS", "TEST", "emp", "empz", "pagecounts", "scott.emp_hbase1", "scott.emp_hbase2", "test"]

hbase(main):008:0>

Access hbase table from Hive

Use the HBaseStorageHandler to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive will not create to drop that table directly – you’ll have to use the HBase shell to do so.

Registering the table is only the first step. As part of that registration, you also need to specify a column mapping. This is how you link Hive column names to the HBase table’s rowkey and columns. Do so using the hbase.columns.mapping SerDe property.
CREATE external TABLE hbase_emp(rowkey STRING, comm STRING, sal STRING,ename string,job string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,IncomeData:comm,IncomeData:sal,ProfessionalData:ename,ProfessionalData:job') TBLPROPERTIES ('hbase.table.name' = 'emp');

0: jdbc:hive2://te1-hdp-rp-dn04:10000/elmlogs> CREATE external TABLE hbase_emp(rowkey STRING, comm STRING, sal STRING,ename string,job string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,IncomeData:comm,IncomeData:sal,ProfessionalData:ename,ProfessionalData:job') TBLPROPERTIES ('hbase.table.name' = 'emp');
No rows affected (0.75 seconds)

0: jdbc:hive2://te1-hdp-rp-dn04:10000/elmlogs> describe hbase_emp;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| rowkey | string | |
| comm | string | |
| sal | string | |
| ename | string | |
| job | string | |
+-----------+------------+----------+--+
5 rows selected (0.386 seconds)

0: jdbc:hive2://te1-hdp-rp-dn04:10000/elmlogs> select * from hbase_emp;

+-------------------+-----------------+----------------+------------------+----------------+--+
| hbase_emp.rowkey | hbase_emp.comm | hbase_emp.sal | hbase_emp.ename | hbase_emp.job |
+-------------------+-----------------+----------------+------------------+----------------+--+
| 7499 | 300 | 1600 | ALLEN | SALESMAN |
+-------------------+-----------------+----------------+------------------+----------------+--+
1 row selected (0.536 seconds)

DBMentors - Inam Bukhari's Blog

Pages

Please see my other blog for Oracle EBusiness Suite Posts - EBMentors

Search This Blog

Monday, November 06, 2017

Working with HBase on HDP

1 comment:

Translate

Followers

Labels

Blog Archive

About Me

Total Pageviews