Please see my other blog for Oracle EBusiness Suite Posts - EBMentors

Search This Blog

Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Sunday, May 03, 2020

Working with Apache Kudu


Introduction                                                                        


Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.

Unlike other storage for big data analytics, Kudu isn't just a file format. It's a live storage system which supports low-latency millisecond-scale access to individual rows. Kudu isn't designed to be an OLTP system, but Fast processing of OLAP workloads.



We are using Kudu to provide the data to our real time dashboards.

Quick Concepts


- Columnar Data Store
   - A columnar data store stores data in strongly-typed columns
     - you can read a single column, or a portion of that column, while ignoring other columns
   - a given column contains only one type of data, pattern-based compression can be orders of magnitude more efficient
- Table has a schema and a totally ordered primary key. A table is split into segments called tablets
- Tablet 
   - is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases
     - replicated on multiple tablet servers
- Tablet Server
- stores and serves tablets to clients
- For a given tablet, one tablet server acts as a leader, and the others act as follower replicas of that tablet
- Only leaders service write requests, while leaders or followers each service read requests
- Leaders are elected using Raft Consensus Algorithm. 
- One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers
- Master
- The master keeps track of all the tablets, tablet servers, the Catalog Table, and other metadata related to the cluster
- there can only be one acting master (the leader)
- also coordinates metadata operations for clients
- All the master’s data is stored in a tablet, which can be replicated to all the other candidate masters.
- Tablet servers heartbeat to the master at a set interval (the default is once per second).
- Raft Consensus Algorithm
- a means to guarantee fault-tolerance and consistency, both for regular tablets and for master data
- Catalog Table
- central location for metadata of Kudu 
- tables,  schemas, locations, and states 
- tablets,  which tablet servers have replicas of each tablet, the tablet’s current state, and start and end keys
- may not be read or written directly
- accessible only via metadata operations exposed in the client API
- Logical Replication
    - Kudu replicates operations, not on-disk data. This is referred to as logical replication, as opposed to physical replication
- inserts and updates do transmit data over the network, deletes do not need to move any data
- compaction, do not need to transmit the data over the network in Kudu, different from storage systems that use HDFS


My Environment:

- Master Nodes
1 Master Node

- Tablet Servers
All 5 Data Nodes as Tablet Servers

- Storage
     Different hard drives for the the data and WAL directories of Kudu

   --fs_wal_dir and --fs_metadata_dir be placed on a high-performance drives with high       bandwidth and low latency
- mirror the drives containing these directories in order to make recovering from a drive failure easier
- --fs_data_dirs be  comma-separated list of directories, data will be striped across the directories, let Kudu manage its own striping over multiple devices rather than delegating the striping to a RAID-0 array
- --log_dir 
- --block_cache_capacity_mb, Maximum amount of memory allocated to the Kudu Tablet Server’s block cache.
- --memory_limit_hard_bytes, Maximum amount of memory a Tablet Server can consume before it starts rejecting all incoming writes.
- kudu.table.history_max_age_sec, Number of seconds to retain history for tablets in this table.




Setting up Kudu                                                                    

Kudu source code can be downloaded and build by your own from Apache but i've taken the already built RPM. Kudu is based on master client architecture, you need at least one Master and one or more tablet servers. You can query kudu by using Impala or Presto, we are using Presto.

1- Download Kudu RPM from below, i'm using 1.8



2- Installing configuring Kudu Master

Media Path: /opt/sources/kudu/kudu-1.8.0-1.x86_64.rpm
Master: ha01 x.x.x.210
Web UI: http://x.x.x..210:8051/tablet-servers
RPC Port: x.x.x..210:7051

[admin@ha01]$ sudo rpm -ivh /opt/sources/kudu/kudu-1.8.0-1.x86_64.rpm


installing the package above will create 
- kudu user on OS
- log location /var/log/kudu
- configuration location /etc/kudu/conf and /etc/default/kudu-master

3- Configure Master
vim /etc/kudu/conf/master.gflagfile

--log_dir=/opt/datastack/log/kudu
--fs_wal_dir=/opt/datastack/kudu/master
--fs_data_dirs=/opt/datastack/kudu/master


4- Create Related Folders defined in configuration, give ownership/permissions to kudu user and start the services

mkdir -p /opt/datastack/log/kudu
mkdir -p /opt/datastack/kudu/master
mkdir -p /opt/datastack/kudu/master

chown kudu:kudu -R /opt/datastack/log/kudu
chown kudu:kudu -R /opt/datastack/kudu/master
chown kudu:kudu -R /opt/datastack/kudu/master

start the Master services

[admin@ha01]$ sudo systemctl status kudu-master.service
[admin@ha01]$ sudo systemctl enable kudu-master.service
[admin@ha01]$ sudo systemctl start kudu-master.service
[admin@ha01]$ sudo systemctl status kudu-master.service
Verify that Master services are running by using browser

http://x.x.x.210:8051/

5- Check Cluster Health 

[admin@ha01 ~]$ su - kudu
-bash-4.2$ kudu cluster ksck x.x.x.210:7051

6- Installing Configuring Kudu Tablet Server 

Tablet Servers
dn01     x.x.x.196
dn02     x.x.x.197
dn03     x.x.x.198
dn04     x.x.x.199
dn05     x.x.x.200

WEB UI Tablet : http://x.x.x.198:8050

[admin@dn01]$ sudo rpm -ivh /opt/sources/kudu/kudu-1.8.0-1.x86_64.rpm


7- Configure Tablet Servers 

vim /etc/kudu/conf/tserver.gflagfile

sudo mkdir -p /data/datastack/kudu/tserver
sudo mkdir -p /data/datastack/log/kudu
sudo chown kudu:kudu -R /data/datastack/kudu/
sudo chown kudu:kudu -R /data/datastack/log/kudu/

DN01 
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/data/datastack/log/kudu
--fs_wal_dir=/data/datastack/kudu/tserver
--fs_data_dirs=/data/datastack/kudu/tserver

DN02
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/data/datastack/log/kudu
--fs_wal_dir=/data/datastack/kudu/tserver
--fs_data_dirs=/data/datastack/kudu/tserver

DN03
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/var/log/kudu
--fs_wal_dir=/data/kudu/wals/tserver
--fs_data_dirs=/data/kudu/data/tserver

DN04 
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/var/log/kudu
--fs_wal_dir=/data/kudu/wals/tserver
--fs_data_dirs=/data/kudu/data/tserver

DN05
--tserver_master_addrs=x.x.x.210:7051
--log_dir=/var/log/kudu
--fs_wal_dir=/data/kudu/wals/tserver
--fs_data_dirs=/data/kudu/data/tserver

8- Create Related Folders defined in configuration, give ownership/permissions to kudu user and start the services as done with kudu master
eg;
[root@hdpmaster ~]# systemctl start kudu-tserver.service


Configure Presto to query Kudu                                         

Configuration on Presto (0.326) Coordinator and Workers 
Create a catalog properties file $PRESTO_HOME/etc/catalog/kudu.properties
$PRESTO_HOME is pointing to Presto home directory

su - presto326
ps -ef | grep presto326

vi /opt/presto-server-326/etc/catalog/kudu.properties

connector.name=kudu

kudu.client.master-addresses=x.x.x.210


After configuration restart the presto services on all presto workers and coordinator

Configure Presto client                                                    


 Add driver 
DriverName: <<Any name>>
DrvierURL: jdbc:presto://x.x.x.204:6060;User=presto326;Catalog=Kudu;Schema=default;

Create Alias for Kudu access

You are ready to do send queries to Kudu

--  Testing: Create a users table in the default schema

CREATE TABLE kudu.default.users (
  user_id int WITH (primary_key = true),
  first_name varchar,
  last_name varchar
) WITH (
  partition_by_hash_columns = ARRAY['user_id'],
  partition_by_hash_buckets = 2,
  number_of_replicas = 3
)


INSERT INTO kudu.default.users VALUES (1, 'Zeeshan', 'Aziz'), (2, 'Alaa', 'Salameh'),(3,'Ali','Haider');

SELECT * FROM kudu.default.users;

--can send summaries directly to Kudu by Presto
INSERT INTO kudu.default.users
select * from (select 3,'Alaa','Salameh') t1


ALTER TABLE kudu.default.users RENAME TO kudu.default.newusers

ALTER TABLE kudu.default.users ADD COLUMN newcol int

ALTER TABLE kudu.default.users DROP COLUMN newcol

SHOW SCHEMAS;
SHOW TABLES;
SHOW CREATE TABLE kudu.default.users;
SHOW COLUMNS FROM kudu.default.users;
DESCRIBE kudu.default.users;
DROP  TABLE kudu.default.users

Note:
The Kudu primary key enforces a uniqueness constraint. Inserting a second row with the same primary key results in updating the existing row (‘UPSERT’).

Kudu Command Line                                                      

1- Check that master and tablet server processes are running, and that table metadata is consistent
-bash-4.2$ kudu cluster ksck en01

2-  Move tablet replicas between tablet servers to balance replica counts for each table and for the cluster as a whole
kudu cluster rebalance en01

3- Dump the tree of a Kudu filesystem 
kudu fs dump tree -fs_wal_dir=/data/kudu/master
kudu fs dump tree -fs_wal_dir=/data/kudu/master -fs_data_dirs=/data/kudu/master

4-Dump the UUID of a Kudu filesystem

kudu fs dump uuid -fs_wal_dir=/data/kudu/master -fs_data_dirs=/data/kudu/master

5- Get the status of a Kudu Master 
kudu master status en01

6-  Get the current timestamp of a Kudu Master
kudu master timestamp en01

7- List masters in a Kudu cluster 
 kudu master list en01


Errors                                                                             

1- Failed to initialize sys tables async: on-disk and provided master lists are different
http://mail-archives.apache.org/mod_mbox/kudu-user/201711.mbox/%3CCADY20s6Bj3ThDh0_ZBvBqxXoqXd54g9rAn1uHJ3-dfyLpateOg@mail.gmail.com%3E

https://stackoverflow.com/questions/51204270/unable-to-start-kudu-master

2- Unable to load consensus metadata (the metadata file is missed and the part of tablets is unavailable)
http://mail-archives.apache.org/mod_mbox/kudu-user/201806.mbox/%3CCAPV-+Aw44_u_R47vFjpwjxtg7ApPfxOOrKCEh6DrGc1e+b=PLA@mail.gmail.com%3E

https://issues.apache.org/jira/browse/KUDU-2195?focusedCommentId=16328129&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16328129