Introduction
Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.
Unlike other storage options for big data analytics, Kudu isn't just a file format. It's a live storage system that supports low-latency, millisecond-scale access to individual rows. Kudu isn't designed to be an OLTP system; it targets fast processing of OLAP workloads.
We are using Kudu to feed data to our real-time dashboards.
Quick Concepts
- Columnar Data Store
- A columnar data store stores data in strongly-typed columns
- you can read a single column, or a portion of that column, while ignoring other columns
- because a given column contains only one type of data, pattern-based compression can be orders of magnitude more efficient
- Table has a schema and a totally ordered primary key. A table is split into segments called tablets
- Tablet
- is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases
- replicated on multiple tablet servers
- Tablet Server
- stores and serves tablets to clients
- For a given tablet, one tablet server acts as a leader, and the others act as follower replicas of that tablet
- Only leaders service write requests, while leaders or followers each service read requests
- Leaders are elected using the Raft Consensus Algorithm.
- One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers
- Master
- The master keeps track of all the tablets, tablet servers, the Catalog Table, and other metadata related to the cluster
- there can only be one acting master (the leader)
- also coordinates metadata operations for clients
- All the master’s data is stored in a tablet, which can be replicated to all the other candidate masters.
- Tablet servers heartbeat to the master at a set interval (the default is once per second).
- Raft Consensus Algorithm
- a means to guarantee fault-tolerance and consistency, both for regular tablets and for master data
- Catalog Table
- central location for metadata of Kudu
- tables, schemas, locations, and states
- tablets, which tablet servers have replicas of each tablet, the tablet’s current state, and start and end keys
- may not be read or written directly
- accessible only via metadata operations exposed in the client API
- Logical Replication
- Kudu replicates operations, not on-disk data. This is referred to as logical replication, as opposed to physical replication
- inserts and updates transmit data over the network, but deletes do not need to move any data
- physical operations such as compaction do not need to transmit data over the network in Kudu, unlike storage systems that use HDFS
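The columnar layout described above is easy to demonstrate with a toy example. The sketch below is generic Python, not Kudu code: it lays the same rows out row-wise and column-wise and compresses both, showing why a column of uniform, repetitive values compresses much better.

```python
import zlib

# Toy dataset: rows of (user_id, country). Row-wise storage interleaves
# values of different types; column-wise storage keeps each column uniform.
rows = [(i, "United States" if i % 2 == 0 else "Germany") for i in range(1000)]

row_wise = "".join(f"{uid},{country};" for uid, country in rows).encode()
col_wise = (
    ",".join(str(uid) for uid, _ in rows) + "|" +
    ",".join(country for _, country in rows)
).encode()

# The country column is a long run of repeating values, so the
# column-wise layout compresses into fewer bytes.
print("row-wise compressed:", len(zlib.compress(row_wise)))
print("col-wise compressed:", len(zlib.compress(col_wise)))
```

The same effect is what lets a columnar engine scan one column while skipping the rest entirely.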
My Environment:
- Master Nodes
1 Master Node
- Tablet Servers
All 5 Data Nodes as Tablet Servers
- Storage
Separate hard drives for the data and WAL directories of Kudu
--fs_wal_dir and --fs_metadata_dir should be placed on high-performance drives with high bandwidth and low latency
- mirror the drives containing these directories in order to make recovering from a drive failure easier
- --fs_data_dirs takes a comma-separated list of directories; data will be striped across them. Let Kudu manage its own striping over multiple devices rather than delegating the striping to a RAID-0 array
- --log_dir, the directory where the server writes its log files
- --block_cache_capacity_mb, Maximum amount of memory allocated to the Kudu Tablet Server’s block cache.
- --memory_limit_hard_bytes, Maximum amount of memory a Tablet Server can consume before it starts rejecting all incoming writes.
- kudu.table.history_max_age_sec, Number of seconds to retain history for tablets in this table.
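Putting the flags above together, a tablet server gflagfile might look like the following sketch. The paths and sizes here are illustrative examples, not recommendations (kudu.table.history_max_age_sec is a table property, not a server flag, so it does not appear here):

```
# Illustrative tserver.gflagfile; adjust paths and sizes for your hardware
--fs_wal_dir=/ssd1/kudu/wal
--fs_metadata_dir=/ssd1/kudu/meta
--fs_data_dirs=/data1/kudu,/data2/kudu
--log_dir=/var/log/kudu
--block_cache_capacity_mb=512
--memory_limit_hard_bytes=4294967296
```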
Setting up Kudu
Kudu source code can be downloaded from Apache and built yourself, but I've taken the already-built RPM. Kudu has a master/tablet server architecture: you need at least one master and one or more tablet servers. You can query Kudu using Impala or Presto; we are using Presto.
1- Download the Kudu RPM from the path below; I'm using 1.8
Media Path: /opt/sources/kudu/kudu-1.8.0-1.x86_64.rpm
Master: ha01 x.x.x.210
Web UI: http://x.x.x.210:8051/tablet-servers
RPC Port: x.x.x.210:7051
2- Install the RPM
[admin@ha01]$ sudo rpm -ivh /opt/sources/kudu/kudu-1.8.0-1.x86_64.rpm
Installing the package above will create
- kudu user on OS
- log location /var/log/kudu
- configuration location /etc/kudu/conf and /etc/default/kudu-master
3- Configure Master
vim /etc/kudu/conf/master.gflagfile
--log_dir=/opt/datastack/log/kudu
--fs_wal_dir=/opt/datastack/kudu/master
--fs_data_dirs=/opt/datastack/kudu/master
4- Create the folders defined in the configuration, give ownership/permissions to the kudu user, and start the services
mkdir -p /opt/datastack/log/kudu
mkdir -p /opt/datastack/kudu/master
chown kudu:kudu -R /opt/datastack/log/kudu
chown kudu:kudu -R /opt/datastack/kudu/master
Start the Master service
[admin@ha01]$ sudo systemctl enable kudu-master.service
[admin@ha01]$ sudo systemctl start kudu-master.service
[admin@ha01]$ sudo systemctl status kudu-master.service
Verify that the Master service is running by using a browser
http://x.x.x.210:8051/
5- Check Cluster Health
[admin@ha01 ~]$ su - kudu
-bash-4.2$ kudu cluster ksck x.x.x.210:7051
6- Installing and Configuring the Kudu Tablet Servers
Tablet Servers
dn01 x.x.x.196
dn02 x.x.x.197
dn03 x.x.x.198
dn04 x.x.x.199
dn05 x.x.x.200
Tablet Server Web UI: http://x.x.x.198:8050
[admin@dn01]$ sudo rpm -ivh /opt/sources/kudu/kudu-1.8.0-1.x86_64.rpm
7- Configure Tablet Servers
vim /etc/kudu/conf/tserver.gflagfile
sudo mkdir -p /data/datastack/kudu/tserver
sudo mkdir -p /data/datastack/log/kudu
sudo chown kudu:kudu -R /data/datastack/kudu/
sudo chown kudu:kudu -R /data/datastack/log/kudu/
DN01
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/data/datastack/log/kudu
--fs_wal_dir=/data/datastack/kudu/tserver
--fs_data_dirs=/data/datastack/kudu/tserver
DN02
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/data/datastack/log/kudu
--fs_wal_dir=/data/datastack/kudu/tserver
--fs_data_dirs=/data/datastack/kudu/tserver
DN03
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/var/log/kudu
--fs_wal_dir=/data/kudu/wals/tserver
--fs_data_dirs=/data/kudu/data/tserver
DN04
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/var/log/kudu
--fs_wal_dir=/data/kudu/wals/tserver
--fs_data_dirs=/data/kudu/data/tserver
DN05
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/var/log/kudu
--fs_wal_dir=/data/kudu/wals/tserver
--fs_data_dirs=/data/kudu/data/tserver
8- Create the folders defined in the configuration, give ownership/permissions to the kudu user, and start the services as done with the Kudu master
e.g.:
[root@hdpmaster ~]# systemctl start kudu-tserver.service
Configure Presto to query Kudu
Configuration on Presto (0.326) Coordinator and Workers
Create a catalog properties file $PRESTO_HOME/etc/catalog/kudu.properties
$PRESTO_HOME points to the Presto home directory
su - presto326
ps -ef | grep presto326
vi /opt/presto-server-326/etc/catalog/kudu.properties
kudu.client.master-addresses=x.x.x.210
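For reference, a minimal kudu.properties only needs the connector name and the master address. This sketch assumes the default master RPC port 7051 from the setup above:

```
connector.name=kudu
kudu.client.master-addresses=x.x.x.210:7051
```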
After configuration, restart the Presto services on the coordinator and all workers
Configure Presto client
Add driver
DriverName: <<Any name>>
DriverURL: jdbc:presto://x.x.x.204:6060;User=presto326;Catalog=Kudu;Schema=default;
Create Alias for Kudu access
You are now ready to send queries to Kudu
-- Testing: Create a users table in the default schema
CREATE TABLE kudu.default.users (
user_id int WITH (primary_key = true),
first_name varchar,
last_name varchar
) WITH (
partition_by_hash_columns = ARRAY['user_id'],
partition_by_hash_buckets = 2,
number_of_replicas = 3
)
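The partition_by_hash_buckets = 2 clause above spreads rows across two tablets by hashing the primary key column. The Python sketch below illustrates the idea only; it uses MD5 for the demonstration, not Kudu's actual internal hash function:

```python
import hashlib

def bucket_for(user_id: int, num_buckets: int = 2) -> int:
    """Map a primary-key value to a hash bucket.

    Illustrative only: Kudu uses its own internal hash, not MD5.
    """
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

# Each row lands deterministically in one of the two buckets,
# so reads and writes for a given key always hit the same tablet.
for uid in (1, 2, 3):
    print(uid, "-> bucket", bucket_for(uid))
```

Hash partitioning gives an even spread of writes across tablets; range partitioning (also supported by Kudu) would instead keep adjacent keys together.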
INSERT INTO kudu.default.users VALUES (1, 'Zeeshan', 'Aziz'), (2, 'Alaa', 'Salameh'),(3,'Ali','Haider');
SELECT * FROM kudu.default.users;
-- You can also write query results directly into Kudu via Presto
INSERT INTO kudu.default.users
SELECT * FROM (SELECT 3, 'Alaa', 'Salameh') t1;
ALTER TABLE kudu.default.users RENAME TO kudu.default.newusers;
ALTER TABLE kudu.default.users ADD COLUMN newcol int;
ALTER TABLE kudu.default.users DROP COLUMN newcol;
SHOW SCHEMAS;
SHOW TABLES;
SHOW CREATE TABLE kudu.default.users;
SHOW COLUMNS FROM kudu.default.users;
DESCRIBE kudu.default.users;
DROP TABLE kudu.default.users;
Note:
The Kudu primary key enforces a uniqueness constraint. Inserting a second row with the same primary key results in updating the existing row (‘UPSERT’).
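The upsert behaviour in the note above can be modelled as a dictionary keyed by the primary key: a second insert with the same key replaces the existing row instead of adding a duplicate. This is a toy model of the semantics, not the Kudu client API:

```python
# Toy model of Kudu's primary-key upsert semantics.
table = {}  # primary key -> row

def upsert(user_id, first_name, last_name):
    # Writing an existing key replaces the row rather than duplicating it,
    # mirroring how a second INSERT with the same primary key behaves.
    table[user_id] = {"first_name": first_name, "last_name": last_name}

upsert(3, "Ali", "Haider")
upsert(3, "Alaa", "Salameh")   # same key: row 3 is updated, not duplicated
print(len(table))              # 1
print(table[3]["first_name"])  # Alaa
```

This is why the earlier INSERT of (3, 'Alaa', 'Salameh') after (3, 'Ali', 'Haider') leaves a single row for key 3.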
Kudu Command Line
1- Check that master and tablet server processes are running, and that table metadata is consistent
-bash-4.2$ kudu cluster ksck en01
2- Move tablet replicas between tablet servers to balance replica counts for each table and for the cluster as a whole
kudu cluster rebalance en01
3- Dump the tree of a Kudu filesystem
kudu fs dump tree -fs_wal_dir=/data/kudu/master
kudu fs dump tree -fs_wal_dir=/data/kudu/master -fs_data_dirs=/data/kudu/master
4- Dump the UUID of a Kudu filesystem
kudu fs dump uuid -fs_wal_dir=/data/kudu/master -fs_data_dirs=/data/kudu/master
5- Get the status of a Kudu Master
kudu master status en01
6- Get the current timestamp of a Kudu Master
kudu master timestamp en01
7- List masters in a Kudu cluster
kudu master list en01
Errors
1- Failed to initialize sys tables async: on-disk and provided master lists are different
http://mail-archives.apache.org/mod_mbox/kudu-user/201711.mbox/%3CCADY20s6Bj3ThDh0_ZBvBqxXoqXd54g9rAn1uHJ3-dfyLpateOg@mail.gmail.com%3E
https://stackoverflow.com/questions/51204270/unable-to-start-kudu-master
2- Unable to load consensus metadata (the metadata file is missing and some tablets are unavailable)
http://mail-archives.apache.org/mod_mbox/kudu-user/201806.mbox/%3CCAPV-+Aw44_u_R47vFjpwjxtg7ApPfxOOrKCEh6DrGc1e+b=PLA@mail.gmail.com%3E
https://issues.apache.org/jira/browse/KUDU-2195?focusedCommentId=16328129&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16328129