Apache HBase is a NoSQL database that runs on a Hadoop cluster. It is well suited to storing unstructured or semi-structured data. It was designed to scale: data that is accessed together is stored together, which makes it possible to build big data applications that avoid the scaling limitations of relational databases.
Hadoop on its own is oriented toward batch processing, and data is accessed sequentially. That means a job has to scan the entire dataset even for the simplest lookup. HBase adds random, real-time read/write access to data stored in the Hadoop File System.
HBase and HDFS
HDFS | HBase |
HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS. |
HDFS does not support fast individual record lookups. | HBase provides fast lookups in large tables. |
It provides high-latency batch processing. | It provides low-latency access to single rows from billions of records (random access). |
It provides only sequential access to data. | HBase stores data in sorted, indexed files (HFiles) on HDFS, which enables fast random lookups by row key. |
HBase Architectural Components
Physically, HBase is composed of three types of servers in a master-slave architecture: region servers, the HBase Master, and ZooKeeper.
Region servers: They serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. They are colocated with the HDFS DataNodes, which enables data locality.
Regions: Regions are horizontal partitions of a table. Each table is divided by row key range, and the resulting regions are assigned to region servers.
HBase Master process: Region assignment and DDL operations (create, delete tables) are handled by this process.
ZooKeeper: Part of the Hadoop ecosystem; it maintains live cluster state.
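To make the idea of regions as row key ranges concrete, here is a minimal sketch in plain Java (no HBase libraries). It assumes a hypothetical RegionLocator class, made-up server names, and made-up split points at 7500 and 7800; real HBase resolves regions through its meta table, so treat this only as an illustration of range-based lookup.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: maps each region's start row key to the server hosting it.
class RegionLocator {
    // Sorted by start key; a region covers [startKey, nextStartKey).
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String serverName) {
        regionsByStartKey.put(startKey, serverName);
    }

    // The owning region is the one with the greatest start key <= rowKey.
    String locate(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("No region covers row key " + rowKey);
        }
        return entry.getValue();
    }

    // Assumed layout: three regions with illustrative split points.
    static RegionLocator sample() {
        RegionLocator locator = new RegionLocator();
        locator.addRegion("", "regionserver-1");     // first region starts at the empty key
        locator.addRegion("7500", "regionserver-2");
        locator.addRegion("7800", "regionserver-3");
        return locator;
    }

    public static void main(String[] args) {
        RegionLocator locator = sample();
        System.out.println(locator.locate("7499")); // prints regionserver-1
        System.out.println(locator.locate("7654")); // prints regionserver-2
        System.out.println(locator.locate("7902")); // prints regionserver-3
    }
}
```

Because row keys are compared as sorted strings, neighbouring keys land in the same region, which is why row key design affects how evenly load spreads across region servers.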
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row key. The table schema defines only column families; the columns inside them are stored as key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Column values within a column family are stored contiguously on disk. Each cell value of the table has a timestamp. In short, in HBase:
- Table is a collection of rows.
- Row is a collection of column families.
- Column family is a collection of columns.
- Column is a collection of key value pairs.
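The hierarchy above can be pictured as nested sorted maps. The following is a rough sketch in plain Java, not HBase code; the LogicalTable class and its methods are hypothetical names chosen for illustration, and the timestamps are taken from the shell output later in this post.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of the logical layout: row -> column family -> column -> timestamp -> value.
class LogicalTable {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    // Stores one versioned cell value, creating intermediate maps on demand.
    void put(String row, String family, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String get(String row, String family, String column) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families = rows.get(row);
        if (families == null) return null;
        NavigableMap<String, NavigableMap<Long, String>> columns = families.get(family);
        if (columns == null) return null;
        NavigableMap<Long, String> versions = columns.get(column);
        if (versions == null || versions.isEmpty()) return null;
        return versions.lastEntry().getValue(); // highest timestamp wins
    }

    public static void main(String[] args) {
        LogicalTable emp = new LogicalTable();
        emp.put("7499", "ProfessionalData", "job", 1509953791768L, "SALESMAN");
        emp.put("7499", "ProfessionalData", "job", 1509955831838L, "MANAGER");
        emp.put("7499", "IncomeData", "sal", 1509953798889L, "1600");
        System.out.println(emp.get("7499", "ProfessionalData", "job")); // prints MANAGER
    }
}
```

The sketch also hints at why an update is just another put: a newer timestamp shadows the older value rather than overwriting it in place.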
Region Server Components
A Region Server runs on an HDFS data node and has the following components:
WAL: Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.
BlockCache: The read cache. It stores frequently read data in memory. Least recently used data is evicted when the cache is full.
MemStore: The write cache. It stores new data that has not yet been written to disk, and the data is sorted before it is written to disk. There is one MemStore per column family per region.
HFiles: They store the rows as sorted KeyValues on disk.
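How the WAL, MemStore, and HFiles cooperate on a write can be sketched in a few lines of plain Java. This is a simplified model under stated assumptions: the StoreSketch class is hypothetical, the flush is triggered by entry count (HBase uses a size in bytes), and the BlockCache and WAL cleanup are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of the write path for one column family of one region.
class StoreSketch {
    final List<String> wal = new ArrayList<>();                      // stands in for the WAL file
    final TreeMap<String, String> memStore = new TreeMap<>();        // sorted in-memory write cache
    final List<TreeMap<String, String>> hfiles = new ArrayList<>();  // immutable sorted files
    private final int flushThreshold;

    StoreSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);   // 1. log the edit first so it can be replayed after a crash
        memStore.put(key, value);     // 2. then apply it to the MemStore
        if (memStore.size() >= flushThreshold) {
            flush();                  // 3. a full MemStore is written out as a new file
        }
    }

    void flush() {
        if (memStore.isEmpty()) return;
        hfiles.add(new TreeMap<>(memStore)); // copied in sorted key order
        memStore.clear();
    }

    // Read path in this sketch: MemStore first, then files from newest to oldest.
    String get(String key) {
        String value = memStore.get(key);
        if (value != null) return value;
        for (int i = hfiles.size() - 1; i >= 0; i--) {
            value = hfiles.get(i).get(key);
            if (value != null) return value;
        }
        return null;
    }

    public static void main(String[] args) {
        StoreSketch store = new StoreSketch(2);
        store.put("7499/job", "SALESMAN");
        store.put("7499/ename", "ALLEN");   // second entry triggers a flush
        store.put("7499/job", "MANAGER");   // newer value stays in the MemStore
        System.out.println(store.get("7499/job"));   // prints MANAGER
        System.out.println(store.get("7499/ename")); // prints ALLEN
        System.out.println(store.hfiles.size());     // prints 1
    }
}
```

Writing the log entry before touching the MemStore is the point of the design: anything acknowledged to the client can be rebuilt from the WAL even if the in-memory data is lost.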
You should already have Hadoop installed in order to install HBase. My environment runs Hortonworks Data Platform, so I just added the HBase service using the Ambari admin console. After adding it, you can access HBase through its shell.
Working with HBase Shell
HBase comes with an interactive shell from which you can communicate with HBase components and perform operations on them. You can also communicate with HBase through the Java API (an example appears later in this post); this section uses only the shell.
[root@dn04 ~]# su hbase
[hbase@dn04 root]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version, r718c773662346de98a8ce6fd3b5f64e279cb87d4, Wed May 31 03:27:31 UTC 2017
General Commands
--Show cluster status
hbase(main):010:0> status
1 active master, 0 backup masters, 2 servers, 1 dead, 6.0000 average load
hbase(main):011:0> status 'simple'
hbase(main):012:0> status 'summary'
hbase(main):014:0> status 'detailed'
--version Info
hbase(main):014:0> version
--User info
hbase(main):015:0> whoami
Create Table
Syntax: create '<table_name>','<column_family_name>'
hbase(main):005:0> create 'emp','ProfessionalData','IncomeData'
0 row(s) in 2.2810 seconds
=> Hbase::Table - emp
Drop Table
hbase(main):003:0> disable 'emp'
0 row(s) in 4.2880 seconds
hbase(main):004:0> drop 'emp'
0 row(s) in 1.2960 seconds
List Tables
hbase(main):006:0> list
1 row(s) in 0.0210 seconds
=> ["emp"]
Existence of Table
hbase(main):012:0> exists 'emp'
Table emp does exist
0 row(s) in 0.0200 seconds
Describe Table
hbase(main):013:0> describe 'emp'
Table emp is ENABLED
2 row(s) in 0.0350 seconds
Alter Table
Alter is the command used to make changes to an existing table. Using this command, you can change the maximum number of versions (cells kept per column) of a column family, set and delete table scope operators, and delete a column family from a table.
Changing the Maximum Number of Versions of a Column Family
hbase(main):014:0> alter 'emp', NAME => 'ProfessionalData', VERSIONS => 5
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
0 row(s) in 3.2610 seconds
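What VERSIONS => 5 means for a single cell can be sketched as follows. This is plain Java with a hypothetical VersionedCell class; it trims old versions eagerly on every write, whereas HBase hides surplus versions on read and physically removes them later, so take it as a model of the visible behaviour only.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical sketch: one cell that retains at most maxVersions timestamped values.
class VersionedCell {
    private final int maxVersions;
    // Ordered newest first, so the first entry is the current value.
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // drop the oldest version
        }
    }

    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    int versionCount() {
        return versions.size();
    }

    long oldestTimestamp() {
        return versions.lastKey();
    }

    public static void main(String[] args) {
        VersionedCell job = new VersionedCell(5);
        for (long ts = 1; ts <= 7; ts++) {
            job.put(ts, "v" + ts);
        }
        System.out.println(job.latest());          // prints v7
        System.out.println(job.versionCount());    // prints 5
        System.out.println(job.oldestTimestamp()); // prints 3
    }
}
```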
Table Scope Operators
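You can set and remove table scope operators such as MAX_FILESIZE, READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc. The shell's built-in help for alter shows the attribute => value form for these, for example alter 't1', MAX_FILESIZE => '134217728'.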
hbase(main):015:0> alter 'emp', READONLY
Updating all regions with the new schema...
1/1 regions updated.
0 row(s) in 2.3000 seconds
hbase(main):017:0> alter 'emp', WRITE
Updating all regions with the new schema...
1/1 regions updated.
0 row(s) in 2.2080 seconds
Removing Table Scope Operators
We can also remove the table scope operators. Given below is the syntax to remove ‘MAX_FILESIZE’ from emp table.
alter 'emp', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'
Deleting a Column Family
alter 'emp', 'delete' => '<column family>'
Data Manipulation
Suppose I have the data below in CSV format (/data/mydata/emp.csv), which I want to load into the table (emp) I created earlier.
SQL> select empno||','|| ename|| ','||job||','||mgr||','||hiredate||','||sal||','||comm||','||deptno from scott.emp;
14 rows selected.
Now I want to insert the second row of the above table (CSV). Using the put command, you can insert rows into a table one cell at a time. Its syntax is as follows:
put '<table name>','<row key>','<colfamily:colname>','<value>'
hbase(main):007:0> put 'emp','7499','ProfessionalData:ename','ALLEN'
0 row(s) in 0.0960 seconds
hbase(main):008:0> put 'emp','7499','ProfessionalData:job','SALESMAN'
0 row(s) in 0.0110 seconds
hbase(main):009:0> put 'emp','7499','IncomeData:sal','1600'
0 row(s) in 0.0120 seconds
hbase(main):010:0> put 'emp','7499','IncomeData:comm','300'
0 row(s) in 0.0120 seconds
Scan Data
hbase(main):011:0> scan 'emp'
7499 column=IncomeData:comm, timestamp=1509953805568, value=300
7499 column=IncomeData:sal, timestamp=1509953798889, value=1600
7499 column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN
7499 column=ProfessionalData:job, timestamp=1509953791768, value=SALESMAN
1 row(s) in 0.0410 seconds
Updating Data
You can update an existing cell value using the put command.
put '<table name>','<row key>','<column family:column name>','<new value>'
hbase(main):018:0> scan 'emp'
7499 column=IncomeData:comm, timestamp=1509953805568, value=300
7499 column=IncomeData:sal, timestamp=1509953798889, value=1600
7499 column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN
7499 column=ProfessionalData:job, timestamp=1509953791768, value=SALESMAN
1 row(s) in 0.0300 seconds
hbase(main):019:0> put 'emp','7499','ProfessionalData:job','MANAGER'
0 row(s) in 0.0200 seconds
hbase(main):020:0> scan 'emp'
7499 column=IncomeData:comm, timestamp=1509953805568, value=300
7499 column=IncomeData:sal, timestamp=1509953798889, value=1600
7499 column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN
7499 column=ProfessionalData:job, timestamp=1509955831838, value=MANAGER
1 row(s) in 0.0180 seconds
Reading Data
Syntax: get '<table name>','<row key>'
hbase(main):021:0> get 'emp', '7499'
IncomeData:comm timestamp=1509953805568, value=300
IncomeData:sal timestamp=1509953798889, value=1600
ProfessionalData:ename timestamp=1509953783956, value=ALLEN
ProfessionalData:job timestamp=1509955831838, value=MANAGER
4 row(s) in 0.0560 seconds
Reading a Specific Column
Syntax: get '<table name>', '<row key>', {COLUMN => '<column family:column name>'}
hbase(main):022:0> get 'emp', '7499', {COLUMN => 'ProfessionalData:ename'}
ProfessionalData:ename timestamp=1509953783956, value=ALLEN
1 row(s) in 0.0300 seconds
Deleting Data
Deleting a Specific Cell in a Table
Syntax: delete '<table name>', '<row key>', '<column name>', <timestamp>
hbase(main):023:0> delete 'emp', '7499', 'IncomeData:sal',1509953798889
0 row(s) in 0.0440 seconds
hbase(main):027:0> scan 'emp'
7499 column=IncomeData:comm, timestamp=1509953805568, value=300
7499 column=ProfessionalData:ename, timestamp=1509953783956, value=ALLEN
7499 column=ProfessionalData:job, timestamp=1509955831838, value=MANAGER
1 row(s) in 0.0110 seconds
Deleting All Cells in a Row
hbase(main):028:0> deleteall 'emp','7499'
0 row(s) in 0.0170 seconds
ImportTsv Utility in HBase
ImportTsv is a utility that loads data in TSV or CSV format into a specified HBase table. The column names of the TSV data must be specified using the -Dimporttsv.columns option. This option takes the form of comma-separated column names, where each column name is either a simple column family or a columnfamily:qualifier. The special column name HBASE_ROW_KEY designates the column to be used as the row key for each imported record. You must specify exactly one column to be the row key, and you must specify a column name for every column that exists in the input data.
The next argument is the name of the table into which the data is imported.
The third argument specifies the input directory (or file) of CSV data in HDFS.
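The positional pairing between the -Dimporttsv.columns list and the fields of each input line can be sketched in plain Java. The TsvColumnMapper class below is hypothetical and is not how ImportTsv is implemented; the sample line is an illustrative record in the usual scott.emp layout, since the CSV contents are not reproduced in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: pairs each field of a delimited line with the column named at the same position.
class TsvColumnMapper {
    static final String ROW_KEY = "HBASE_ROW_KEY";

    static Map<String, String> map(String columnSpec, String line, String separator) {
        String[] columns = columnSpec.split(",");
        // Limit -1 keeps trailing empty fields, e.g. an empty commission.
        String[] fields = line.split(Pattern.quote(separator), -1);
        if (columns.length != fields.length) {
            // Comparable to what the import reports as a bad line.
            throw new IllegalArgumentException(
                    "Expected " + columns.length + " fields but found " + fields.length);
        }
        Map<String, String> cells = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            cells.put(columns[i], fields[i]);
        }
        if (!cells.containsKey(ROW_KEY)) {
            throw new IllegalArgumentException("Column list must contain " + ROW_KEY);
        }
        return cells;
    }

    public static void main(String[] args) {
        String spec = "HBASE_ROW_KEY,ProfessionalData:ename,ProfessionalData:job,ProfessionalData:mgr,"
                + "ProfessionalData:hiredate,IncomeData:sal,IncomeData:comm,IncomeData:deptno";
        String line = "7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30"; // illustrative input line
        Map<String, String> cells = map(spec, line, ",");
        System.out.println(cells.get(ROW_KEY));                  // prints 7499
        System.out.println(cells.get("ProfessionalData:ename")); // prints ALLEN
        System.out.println(cells.get("IncomeData:sal"));         // prints 1600
    }
}
```

Since the pairing is purely positional, an extra or missing name in the column list shifts every following value into the wrong column without any error, so it is worth checking the list against the file header before importing.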
First copy the csv file into HDFS
[root@dn04 ~]# su hdfs
[hdfs@dn04 root]$ hadoop fs -copyFromLocal /data/mydata/emp.csv /tmp
Now use ImportTsv tool
[hbase@dn04 root]$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,ProfessionalData:ename,ProfessionalData:job,ProfessionalData:mgr,ProfessionalData:hiredate,IncomeData:sal,IncomeData:comm,IncomeData:deptno" emp hdfs://nn01:8020/tmp/emp.csv
2017-11-06 11:37:23,465 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x4e268090 connecting to ZooKeeper ensemble=dn02:2181,dn04:2181,dn03:2181
2017-11-06 11:37:23,472 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-129--1, built on 05/31/2017 03:01 GMT
2017-11-06 11:37:23,472 INFO [main] zookeeper.ZooKeeper: Client
2017-11-06 11:37:23,472 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_121
2017-11-06 11:37:23,472 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
Bad Lines=0
File Input Format Counters
Bytes Read=617
File Output Format Counters
Bytes Written=0
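Now scan and verify the import. The output below appears to come from a run whose column list also contained ProfessionalData:empno right after HBASE_ROW_KEY (and no IncomeData:deptno), so each value sits one column to the right of where its name suggests: for example, ename holds the job and comm holds the department number.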
hbase(main):001:0> scan 'emp'
7369 column=IncomeData:comm, timestamp=1509957443054, value=20
7369 column=IncomeData:sal, timestamp=1509957443054, value=
7369 column=ProfessionalData:empno, timestamp=1509957443054, value=SMITH
7369 column=ProfessionalData:ename, timestamp=1509957443054, value=CLERK
7369 column=ProfessionalData:hiredate, timestamp=1509957443054, value=800
7369 column=ProfessionalData:job, timestamp=1509957443054, value=7902
7369 column=ProfessionalData:mgr, timestamp=1509957443054, value=17-DEC-80
7499 column=IncomeData:comm, timestamp=1509957443054, value=30
7499 column=IncomeData:sal, timestamp=1509957443054, value=300
7499 column=ProfessionalData:empno, timestamp=1509957443054, value=ALLEN
7499 column=ProfessionalData:ename, timestamp=1509957443054, value=SALESMAN
7499 column=ProfessionalData:hiredate, timestamp=1509957443054, value=1600
7499 column=ProfessionalData:job, timestamp=1509957443054, value=7698
7499 column=ProfessionalData:mgr, timestamp=1509957443054, value=20-FEB-81
Accessing HBase Programmatically
You can also access HBase through its Java API. An example is given below.
1- Set required Java environment variables
[hbase@te1-hdp-rp-dn04 ~]$ export CLASSPATH=/usr/hdp/*
[hbase@te1-hdp-rp-dn04 ~]$ export CLASSPATH=$CLASSPATH:/usr/hdp/*:.
2- Write Java code
***********Java File ******************
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTable {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        System.out.println("conf==> " + conf);
        // ZooKeeper connection settings for this cluster
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("hbase.zookeeper.quorum", "te1-hdp-rp-dn04");
        conf.set("zookeeper.znode.parent", "/hbase-unsecure");
        try {
            // Instantiating HBaseAdmin class (HBase 1.x client API)
            HBaseAdmin admin = new HBaseAdmin(conf);
            // Instantiating table descriptor class
            HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("empz"));
            // Adding column families to table descriptor
            tableDescriptor.addFamily(new HColumnDescriptor("personal"));
            tableDescriptor.addFamily(new HColumnDescriptor("professional"));
            // Create the table through admin
            admin.createTable(tableDescriptor);
            System.out.println(" Table created ");
            // Getting the list of all tables using the HBaseAdmin object
            HTableDescriptor[] tableDescriptorLst = admin.listTables();
            // Printing all the table names
            for (int i = 0; i < tableDescriptorLst.length; i++) {
                System.out.println(tableDescriptorLst[i].getNameAsString());
            }
            // Check whether the table is disabled
            boolean b = admin.isTableDisabled("empz");
            System.out.println("Table disabled: " + b);
            admin.close();
        } catch (Exception ex) {
            System.out.println("Error--> " + ex.toString());
        }
    }
}
3- Compile Java code
[hbase@te1-hdp-rp-dn04 ~]$ javac CreateTable.java
4- Run Java class file
[hbase@te1-hdp-rp-dn04 ~]$ java CreateTable
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See for further details.
conf==> Configuration: core-default.xml, core-site.xml, hbase-default.xml, hbase-site.xml
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.
Table created
5- Verify from HBase shell
hbase(main):007:0> list
11 row(s) in 0.0100 seconds
=> ["SYSTEM.CATALOG", "SYSTEM.FUNCTION", "SYSTEM.SEQUENCE", "SYSTEM.STATS", "TEST", "emp", "empz", "pagecounts", "scott.emp_hbase1", "scott.emp_hbase2", "test"]
Access HBase Table from Hive
Use the HBaseStorageHandler to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive will not create or drop that table directly – you’ll have to use the HBase shell to do so.
Registering the table is only the first step. As part of that registration, you also need to specify a column mapping. This is how you link Hive column names to the HBase table’s rowkey and columns. Do so using the hbase.columns.mapping SerDe property.
CREATE external TABLE hbase_emp(rowkey STRING, comm STRING, sal STRING,ename string,job string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,IncomeData:comm,IncomeData:sal,ProfessionalData:ename,ProfessionalData:job') TBLPROPERTIES ('hbase.table.name' = 'emp');
0: jdbc:hive2://te1-hdp-rp-dn04:10000/elmlogs> CREATE external TABLE hbase_emp(rowkey STRING, comm STRING, sal STRING,ename string,job string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,IncomeData:comm,IncomeData:sal,ProfessionalData:ename,ProfessionalData:job') TBLPROPERTIES ('hbase.table.name' = 'emp');
No rows affected (0.75 seconds)
0: jdbc:hive2://te1-hdp-rp-dn04:10000/elmlogs> describe hbase_emp;
| col_name | data_type | comment |
| rowkey | string | |
| comm | string | |
| sal | string | |
| ename | string | |
| job | string | |
5 rows selected (0.386 seconds)
0: jdbc:hive2://te1-hdp-rp-dn04:10000/elmlogs> select * from hbase_emp;
| hbase_emp.rowkey | hbase_emp.comm | hbase_emp.sal | hbase_emp.ename | hbase_emp.job |
| 7499 | 300 | 1600 | ALLEN | SALESMAN |
1 row selected (0.536 seconds)
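The column mapping used in the CREATE TABLE statement above is matched to the Hive column list by position, with :key standing for the HBase row key. A small plain-Java sketch of that pairing follows; the HiveHBaseMapping class is a hypothetical helper for illustration and is not part of Hive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: pairs Hive column names with hbase.columns.mapping entries by position.
class HiveHBaseMapping {
    static Map<String, String> pair(String[] hiveColumns, String columnsMapping) {
        String[] targets = columnsMapping.split(",");
        if (targets.length != hiveColumns.length) {
            throw new IllegalArgumentException(
                    "Mapping has " + targets.length + " entries for " + hiveColumns.length + " Hive columns");
        }
        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < hiveColumns.length; i++) {
            pairs.put(hiveColumns[i], targets[i].trim());
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] hiveColumns = {"rowkey", "comm", "sal", "ename", "job"};
        String mapping = ":key,IncomeData:comm,IncomeData:sal,ProfessionalData:ename,ProfessionalData:job";
        Map<String, String> pairs = pair(hiveColumns, mapping);
        System.out.println(pairs.get("rowkey")); // prints :key
        System.out.println(pairs.get("job"));    // prints ProfessionalData:job
    }
}
```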