Note: All posts follow a practical approach and avoid lengthy theory. Everything has been tested on development servers; please do not try any post on production servers until you are sure.

Tuesday, July 05, 2022

Create Data Lake Without Hadoop

In this post, the focus is to build a modern data lake using only open source technologies. I will walk through a step-by-step process to demonstrate how we can leverage an S3-compatible object store (MinIO) and a distributed SQL query engine (Presto) to achieve this. For some administrative work we may use Hive as well.


Environment:

1- CentOS

2- Hive 3 along with Hadoop 3 (binaries only, just to use the jars)

3- Hive Metastore running

4- Minio Object Store (S3-Compatible)

5- Presto Cluster

Installing Minio                                            

Minio is an open source distributed object storage server written in Go, designed to provide S3 storage functionality on private cloud infrastructure. It is well suited for storing unstructured data such as photos, videos, log files, backups, and container images. The size of an object can range from a few KBs to a maximum of 5 TB.

Since it supports Amazon’s S3 protocol, it works seamlessly with Presto and allows you to query data directly out of your on-premises data lake. We’ll walk through connecting Presto to a MinIO object storage server.

I'll show the commands and output in this post without describing much about the technology itself, since the focus is to demonstrate a fully functional data lake without HDFS.

[hdpsysuser@hdpmaster minio]$ pwd

/data/apps/minio

-- MINIO SERVER

[hdpsysuser@hdpmaster minio]$ wget https://dl.min.io/server/minio/release/linux-amd64/minio

[hdpsysuser@hdpmaster minio]$ chmod +x minio

[hdpsysuser@hdpmaster minio]$ ./minio server /data/miniolocal --console-address ":9001"

OR

[hdpsysuser@hdpmaster ~]$ /data/apps/minio/minio server /data/miniolocal --console-address ":9001"
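By default the server starts with the minioadmin/minioadmin credentials used with the client below. If you want your own credentials, a minimal sketch (the values here are just placeholders) is to export MINIO_ROOT_USER and MINIO_ROOT_PASSWORD before starting the server:

-- optional: custom credentials (placeholder values)
[hdpsysuser@hdpmaster minio]$ export MINIO_ROOT_USER=myminioadmin
[hdpsysuser@hdpmaster minio]$ export MINIO_ROOT_PASSWORD=myminiosecret
[hdpsysuser@hdpmaster minio]$ ./minio server /data/miniolocal --console-address ":9001"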





-- MINIO CLIENT

[hdpsysuser@hdpmaster minio]$ wget https://dl.min.io/client/mc/release/linux-amd64/mc

[hdpsysuser@hdpmaster minio]$ chmod +x mc

[hdpsysuser@hdpmaster minio]$ ./mc config host add myminio http://hdpmaster:9000 minioadmin minioadmin

mc: Configuration written to `/home/hdpsysuser/.mc/config.json`. Please update your access credentials.

mc: Successfully created `/home/hdpsysuser/.mc/share`.

mc: Initialized share uploads `/home/hdpsysuser/.mc/share/uploads.json` file.

mc: Initialized share downloads `/home/hdpsysuser/.mc/share/downloads.json` file.

-- Create an S3 bucket in MinIO using the MinIO client

[hdpsysuser@hdpmaster minio]$ ./mc mb myminio/datalake

Bucket created successfully `myminio/datalake`.

-- confirm that your S3 bucket has been successfully created

[hdpsysuser@hdpmaster minio]$ ./mc ls myminio

[2022-07-05 08:42:30 +03]     0B datalake/



[hdpsysuser@hdpmaster ~]$ /data/apps/minio/mc alias set minio http://127.0.0.1:9000 minioadmin minioadmin
Added `minio` successfully.
[hdpsysuser@hdpmaster ~]$ /data/apps/minio/mc ls minio
[2022-07-05 08:42:30 +03]     0B datalake/
[2022-07-06 11:14:45 +03]     0B logstash-output/
[2022-08-03 08:05:48 +03]     0B testbucket/


[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc config host add minio http://192.168.56.101:9000 minioadmin minioadmin --api S3v4
[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc ls minio
[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc ls minio/datalake/raworders/2022/08/24/17
[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc cat minio/datalake/raworders/2022/08/24/17/ls.s3.ffb518a6-f262-4f16-b5e5-d2f231dac11c.2022-08-24T20.18.part391.txt

[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc find minio/datalake/raworders/2022/08/24/16/ --name "*.txt"                               

[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc admin config get minio compression
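Besides reading configuration, mc admin can also report overall server health and drive usage; a quick check against the alias defined above:

[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc admin info minio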


[hdpsysuser@hdpmaster ~]$ mkdir -p /data2/mydata/miniodownloads

-- copy from local fs to minio; the folder (key prefix) will be created if it does not exist
[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc cp --recursive /data2/mydata/miniodownloads/06/hourlogs.zip minio/datalake/raworders/2022/08/25/06

-- copy from minio to local fs
[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc cp --recursive minio/datalake/raworders/2022/08/25/06 /data2/mydata/miniodownloads

-- Zip all files in the local fs and upload the archive to the minio bucket; -m deletes the files after zipping
[hdpsysuser@hdpmaster 05]$ zip -m /data2/mydata/miniodownloads/06/hourlogs.zip /data2/mydata/miniodownloads/06/*.txt

-- list files in zip
[hdpsysuser@hdpmaster 05]$ less hourlogs.zip
[hdpsysuser@hdpmaster 05]$ zipinfo hourlogs.zip
[hdpsysuser@hdpmaster 05]$ vim hourlogs.zip

-- Create Tar FILE
[hdpsysuser@hdpmaster 06]$ tar -czvf file06hour.tar.gz *.txt


-- check space on minio
[hdpsysuser@hdpmaster 05]$ /data2/apps/minio/mc du minio/datalake/raworders/2022/08/25/06
-- stat of OBJECT
[hdpsysuser@hdpmaster 05]$ /data2/apps/minio/mc stat minio/datalake/raworders/2022/08/24/15/hourlogs.zip

--remove object
[hdpsysuser@hdpmaster 05]$ /data2/apps/minio/mc rm --recursive --force minio/datalake/raworders/2022/08/24/16/
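If you want to keep an entire local directory in sync with a bucket prefix instead of copying file by file, mc mirror can do that; a minimal sketch (the source directory and target prefix below are only examples):

-- mirror a local directory to a bucket prefix
[hdpsysuser@hdpmaster ~]$ /data2/apps/minio/mc mirror /data2/mydata/miniodownloads minio/datalake/mirrored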




-- Optional: using hdfs dfs as a client to S3
If you already have Hadoop and want to use hdfs dfs as your S3 client, you can do so with the steps below.
-- set the properties below in hdfs-site.xml / core-site.xml
<property>
<name>fs.s3a.endpoint</name>
<value>http://hdpmaster:9000</value>
</property>

<property>
<name>fs.s3a.access.key</name>
<value>minioadmin</value>
</property>

<property>
<name>fs.s3a.secret.key</name>
<value>minioadmin</value>
</property>

<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>

<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>

<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>false</value>
</property>

-- test with the hdfs command

[hdpsysuser@hdpmaster ~]$ hdfs dfs -ls s3a://datalake/

[hdpsysuser@hdpmaster ~]$ hdfs dfs -copyToLocal s3a://datalake/cust.csv .

Note:
You may get an exception when pointing the hdfs command at MinIO (S3) because of missing jars. In that case, download the related jars and place them in the Hadoop installation; the details below may help.

[hdpsysuser@hdpmaster minio]$ hdfs dfs -ls s3a://

-ls: Fatal internal error

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

These are Hadoop filesystem client classes, found in the hadoop-aws JAR. An exception reporting this class as missing means that this JAR is not on the classpath.

S3A depends upon two JARs, alongside hadoop-common and its dependencies.

hadoop-aws JAR.
aws-java-sdk-bundle JAR.

The versions of hadoop-common and hadoop-aws must be identical; place the jars in Hadoop's common lib folder.
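In a stock Hadoop 3 binary distribution both JARs ship under share/hadoop/tools/lib, so a minimal sketch of putting them on the classpath (assuming HADOOP_HOME points to your Hadoop 3 install) is:

-- copy the S3A jars next to hadoop-common
[hdpsysuser@hdpmaster ~]$ cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-*.jar $HADOOP_HOME/share/hadoop/common/lib/
[hdpsysuser@hdpmaster ~]$ cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-*.jar $HADOOP_HOME/share/hadoop/common/lib/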




Configuring Hive Metastore                      

In order for Presto to connect to Minio, it needs a cataloging service which the Hive Metastore provides. It tells Presto how the tables are defined and where the data is located.

-- add the following properties to hive-site.xml (/data/apps/apache-hive-3.1.2-bin/conf)


<property>

   <name>fs.s3a.connection.ssl.enabled</name>

   <value>false</value>

<description>Enables or disables SSL connections to S3.</description>


</property>


<property>

<name>fs.s3a.endpoint</name>

<description>AWS S3 endpoint to connect to. An up-to-date list is

provided in the AWS Documentation: regions and endpoints. Without this

property, the standard region (s3.amazonaws.com) is assumed.

</description>

<value>http://hdpmaster:9000</value>

</property>


<property>

<name>fs.s3a.awsAccessKeyId</name>

<description>AWS access key ID.</description>

<value>minioadmin</value>

</property>


<property>

<name>fs.s3a.awsSecretAccessKey</name>

<description>AWS secret key.</description>

<value>minioadmin</value>

</property>


<property>

<name>fs.s3a.path.style.access</name>

<value>true</value>

<description>Enable S3 path style access ie disabling the default virtual hosting behaviour.

Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting.

</description>

</property>


<property>

<name>fs.s3a.impl</name>

<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>

<description>The implementation class of the S3A Filesystem</description>

</property>


 <property>

       <name>hive.metastore.schema.verification</name> 

       <value>false</value> 

 </property> 


<property>

    <name>javax.jdo.option.ConnectionURL</name>

    <value>jdbc:postgresql://hdpmaster:5432/hive</value>

    <description>PostgreSQL JDBC driver connection URL</description>

  </property>


  <property>

    <name>javax.jdo.option.ConnectionDriverName</name>

    <value>org.postgresql.Driver</value>

    <description>PostgreSQL metastore driver class name</description>

  </property>


  <property>

    <name>javax.jdo.option.ConnectionUserName</name>

    <value>hive</value>

    <description>the username for the DB instance</description>

  </property>


  <property>

    <name>javax.jdo.option.ConnectionPassword</name>

    <value>hive</value>

    <description>the password for the DB instance</description>

  </property>

</configuration>


Note: If you wish to use your local file system for the Hive warehouse, you could use the properties below in hive-site.xml

<property>

    <name>hive.metastore.warehouse.dir</name>

    <value>file:///data/hivelocalfs</value>

    <description>location of default database for the warehouse</description>

</property>


 <property> 

      <name>fs.default.name</name> 

      <value>file:///data/hivelocalfs</value> 

   </property>
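With hive-site.xml in place, initialize the metastore schema in PostgreSQL (if not already done) and start the metastore service, which listens on port 9083 by default; a minimal sketch, assuming the hive database from the JDBC URL above already exists and HIVE_HOME points to the Hive 3 install:

-- initialize the metastore schema and start the service
[hdpsysuser@hdpmaster ~]$ $HIVE_HOME/bin/schematool -dbType postgres -initSchema
[hdpsysuser@hdpmaster ~]$ $HIVE_HOME/bin/hive --service metastore &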


Configuring Presto Catalog                       

We need to configure the catalog for minio in Presto cluster.

-- Configure the Presto catalog (minio.properties) and restart Presto

[presto@hdpmaster ~]$ cd /data/apps/presto-server-0.260.1/etc/catalog/

connector.name=hive-hadoop2

hive.metastore.uri=thrift://hdpmaster:9083

hive.s3.path-style-access=true

hive.s3.endpoint=http://hdpmaster:9000

hive.s3.aws-access-key=minioadmin

hive.s3.aws-secret-key=minioadmin

hive.s3select-pushdown.enabled=true

hive.s3.ssl.enabled=false
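After saving minio.properties, restart Presto so it picks up the new catalog (repeat on every node of the cluster); a minimal sketch using the launcher script that ships with the server:

[presto@hdpmaster catalog]$ /data/apps/presto-server-0.260.1/bin/launcher restart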

-- verify from presto if the configuration shows the new catalog

[presto@hdpmaster catalog]$ prestocli --server hdpmaster:6060 --catalog minio --schema default

presto:default> SHOW CATALOGS;


-- check different presto commands

presto:default> CREATE SCHEMA minio.myschema;

CREATE SCHEMA

presto:default> drop schema myschema;

DROP SCHEMA

Test with Data Lake                                     

-- Create schema in hive with location (no HDFS running)

0: jdbc:hive2://hdpmaster:10000/default> CREATE SCHEMA myschema location 's3a://datalake/cust.db/myschema'; 

INFO  : Starting task [Stage-0:DDL] in serial mode

INFO  : Completed executing command(queryId=hdpsysuser_20220705142740_4e18c1b1-72f0-49bd-be41-d773bb07f28d); Time taken: 0.086 seconds

INFO  : OK

INFO  : Concurrency mode is disabled, not creating a lock manager

No rows affected (0.131 seconds)



-- connect to beeline using new schema

0: jdbc:hive2://hdpmaster:10000/default> !connect jdbc:hive2://hdpmaster:10000/myschema

Connecting to jdbc:hive2://hdpmaster:10000/myschema

Enter username for jdbc:hive2://hdpmaster:10000/myschema: 

Enter password for jdbc:hive2://hdpmaster:10000/myschema: 

Connected to: Apache Hive (version 3.1.2)

Driver: Hive JDBC (version 3.1.2)

Transaction isolation: TRANSACTION_REPEATABLE_READ


-- create external table in Hive (non-partitioned and partitioned) 

CREATE EXTERNAL TABLE customer_ext(

  cust_id int,

  cust_name string

  )

  ROW FORMAT DELIMITED

  FIELDS TERMINATED BY ','

  STORED AS TEXTFILE

  LOCATION  's3a://datalake/cust.db/customer_ext';

Upload the CSV file to the external table's location and verify with a query.
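A hedged sketch of that step, assuming a local cust.csv with comma-separated cust_id,cust_name rows (the file itself is hypothetical), using the same mc and prestocli commands as above:

-- upload the file to the table's location, then query it from Presto
[hdpsysuser@hdpmaster minio]$ ./mc cp cust.csv myminio/datalake/cust.db/customer_ext/
[presto@hdpmaster catalog]$ prestocli --server hdpmaster:6060 --catalog minio --schema default --execute "select * from customer_ext"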





CREATE EXTERNAL TABLE customer(

  cust_id int,

  cust_name string

  )

  PARTITIONED BY (cust_loc string) STORED AS ORC

  LOCATION  's3a://datalake/cust.db/customer';

  0: jdbc:hive2://hdpmaster:10000/default> CREATE EXTERNAL TABLE customer(

. . . . . . . . . . . . . . . . . . . .>   cust_id int,

. . . . . . . . . . . . . . . . . . . .>   cust_name string

. . . . . . . . . . . . . . . . . . . .>   )

. . . . . . . . . . . . . . . . . . . .>   PARTITIONED BY (cust_loc string) STORED AS ORC

. . . . . . . . . . . . . . . . . . . .>   LOCATION  's3a://datalake/cust.db/customer';

INFO  : Starting task [Stage-0:DDL] in serial mode

INFO  : Completed executing command(queryId=hdpsysuser_20220705140103_173889db-5494-4588-94cb-eae8a9d71f0d); Time taken: 0.174 seconds

INFO  : OK

INFO  : Concurrency mode is disabled, not creating a lock manager

No rows affected (0.23 seconds)

-- query table from presto
presto:default> select * from customer_ext;
 cust_id | cust_name | cust_loc 
---------+-----------+----------
(0 rows)

Query 20220705_110230_00032_ibg9w, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

-- Create Internal table in Hive

CREATE  TABLE customer_int(

  cust_id int,

  cust_name string

  )

  PARTITIONED BY (cust_loc string) STORED AS ORC

  LOCATION  's3a://datalake/cust.db/customer_int';

  

-- insert into internal table

presto:default> insert into customer_int values(1,'inam','lahore');

INSERT: 1 row

Query 20220705_110901_00039_ibg9w, FINISHED, 1 node

Splits: 19 total, 19 done (100.00%)

0:04 [0 rows, 0B] [0 rows/s, 0B/s]


presto:default> insert into customer_int values(1,'ali','riyadh');

presto:default> insert into customer_int values(3,'abuzar','riyadh');


presto:default> select * from customer_int;

 cust_id | cust_name | cust_loc 

---------+-----------+----------

       1 | ali       | riyadh   

       1 | inam      | lahore   

(2 rows)


Query 20220705_111320_00042_ibg9w, FINISHED, 1 node

Splits: 18 total, 18 done (100.00%)

0:01 [2 rows, 698B] [1 rows/s, 589B/s]


-- delete record presto

presto:default> delete from customer_int where cust_loc='lahore'; --complete partition

DELETE


Query 20220705_111430_00043_ibg9w, FINISHED, 1 node

Splits: 1 total, 1 done (100.00%)

0:02 [0 rows, 0B] [0 rows/s, 0B/s]

presto:default> delete from customer_int where cust_id=3;

Query 20220705_111718_00046_ibg9w failed: This connector only supports delete where one or more partitions are deleted entirely

--Query from Hive (beeline)

0: jdbc:hive2://hdpmaster:10000/default> select * from customer_int;

INFO  : OK

INFO  : Concurrency mode is disabled, not creating a lock manager

+-----------------------+-------------------------+------------------------+

| customer_int.cust_id  | customer_int.cust_name  | customer_int.cust_loc  |

+-----------------------+-------------------------+------------------------+

| 1                     | inam                    | lahore                 |

+-----------------------+-------------------------+------------------------+

1 row selected (4.489 seconds)

--Create schema in Hive and internal table in it
-- create schema in hive

CREATE SCHEMA myschema location 's3a://datalake/cust.db/myschema';

1: jdbc:hive2://hdpmaster:10000/myschema> use myschema;

1: jdbc:hive2://hdpmaster:10000/myschema> show tables;

INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (0.092 seconds)


-- Create Internal table in Hive without location
CREATE  TABLE customer_int2(
  cust_id int,
  cust_name string
  )
  PARTITIONED BY (cust_loc string) STORED AS ORC
  ;

1: jdbc:hive2://hdpmaster:10000/myschema> CREATE  TABLE customer_int2(
. . . . . . . . . . . . . . . . . . . . >   cust_id int,
. . . . . . . . . . . . . . . . . . . . >   cust_name string
. . . . . . . . . . . . . . . . . . . . >   )
. . . . . . . . . . . . . . . . . . . . >   PARTITIONED BY (cust_loc string) STORED AS ORC
. . . . . . . . . . . . . . . . . . . . >   ;
command(queryId=hdpsysuser_20220705144453_2a3b6d8f-d5a8-4066-92b4-ebc35e5d1ba6); Time taken: 0.11 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
No rows affected (0.207 seconds)
1: jdbc:hive2://hdpmaster:10000/myschema> 


-- use presto to insert in table

presto:default> use myschema;
USE

presto:myschema> insert into customer_int2 values(3,'abuzar','riyadh');
INSERT: 1 row

Query 20220705_114816_00059_ibg9w, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]





Complete Example with Trino/Presto without using Hive CLI

[trino391@hdpmaster ~]$  trinocli --server http://hdpmaster:6060 --catalog minio

SHOW CATALOGS;

SHOW SCHEMAS in minio;
SHOW TABLES IN minio.part;

CREATE SCHEMA minio.part
WITH (location = 's3a://part/');  -- create the bucket in minio first
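The bucket can be created from the MinIO client before running the CREATE SCHEMA, for example:

[hdpsysuser@hdpmaster ~]$ /data/apps/minio/mc mb minio/part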

-- Create a table with no partitions
CREATE TABLE minio.part.no_part (id int, name varchar, dt varchar)
WITH (
  format = 'ORC'
);

INSERT INTO minio.part.no_part 
VALUES 
  (1, 'part-1', '2020-11-18'), 
  (2, 'part-2', '2020-11-18'),
  (3, 'part-3', '2020-11-19'), 
  (4, 'part-4', '2020-11-19'),
  (5, 'part-5', '2020-11-20'), 
  (6, 'part-6', '2020-11-20');

CREATE TABLE minio.part.orders (id int, name varchar, dt varchar)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['dt']
);

INSERT INTO minio.part.orders 
VALUES 
  (1, 'part-1', '2020-11-18'), 
  (2, 'part-2', '2020-11-18'),
  (3, 'part-3', '2020-11-19'), 
  (4, 'part-4', '2020-11-19'),
  (5, 'part-5', '2020-11-20'), 
  (6, 'part-6', '2020-11-20');

SELECT *
FROM minio.part.no_part
WHERE dt = '2020-11-20';
 
SELECT *
FROM minio.part.orders
WHERE dt = '2020-11-20';

DELETE FROM minio.part.orders 
WHERE dt = '2020-11-18';


SELECT *
FROM minio.part.orders;

-- Make sure you are using the minio catalog (which is the hive connector renamed)
CALL system.sync_partition_metadata('part', 'orders', 'ADD');
CALL system.sync_partition_metadata('part', 'orders', 'DROP');
CALL system.sync_partition_metadata('part', 'orders', 'FULL');
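These calls are useful when partition directories are added or removed directly in the bucket outside of Trino. To check which partitions the metastore currently knows about, the Hive connector also exposes a hidden $partitions table; a quick check from the shell (same trinocli wrapper as above):

[trino391@hdpmaster ~]$ trinocli --server http://hdpmaster:6060 --catalog minio --execute 'SELECT * FROM part."orders$partitions"'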

 -- Create a table with multi partitions
CREATE TABLE minio.part.multi_part (id int, name varchar, year varchar, month varchar, day varchar)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['year', 'month', 'day']
);

INSERT INTO minio.part.multi_part 
VALUES 
  (1, 'part-1', '2020', '11', '18'), 
  (2, 'part-2', '2020', '11', '18'),
  (3, 'part-3', '2020', '11', '19'), 
  (4, 'part-4', '2020', '11', '19'),
  (5, 'part-5', '2020', '11', '20'), 
  (6, 'part-6', '2020', '11', '20'),
  (7, 'part-7', '2019', '11', '18'), 
  (8, 'part-8', '2019', '01', '18'),
  (9, 'part-9', '2019', '11', '19'), 
  (10, 'part-10', '2019', '01', '19'),
  (11, 'part-11', '2019', '11', '20'), 
  (12, 'part-12', '2019', '01', '20');
