In this post, the focus is on building a modern data lake using only open-source technologies. I will walk through a step-by-step process to demonstrate how we can leverage an S3-compatible object store (MinIO) and a distributed SQL query engine (Presto) to achieve this. For some administrative work we will use Hive as well.
Environment:
1- CentOS
2- Hive 3 along with Hadoop 3 (binaries only, to use the JARs)
3- Hive Metastore running
4- MinIO object store (S3-compatible)
5- Presto cluster
Installing MinIO
MinIO is an open-source distributed object storage server written in Go, designed for private cloud infrastructure and providing S3-compatible storage. It is well suited to unstructured data such as photos, videos, log files, backups, and container images; an object can range in size from a few KB up to a maximum of 5 TB.
Since MinIO speaks Amazon's S3 protocol, it works seamlessly with Presto and allows you to query data directly out of your on-premises data lake. We'll walk through connecting Presto to a MinIO object storage server.
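Because the protocol is standard S3, any S3 client can talk to MinIO by pointing it at the MinIO endpoint. As a quick sketch (hypothetical: assumes the AWS CLI is installed, and uses the endpoint and default credentials that appear later in this post):

```shell
# List buckets on a local MinIO with the generic AWS CLI
# (endpoint and credentials match the defaults used in this walkthrough).
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
if command -v aws >/dev/null 2>&1; then
  aws --endpoint-url http://hdpmaster:9000 s3 ls
else
  echo "aws CLI not installed; skipping"
fi
```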
I'll show the commands and their output without describing the technologies themselves in much depth, since the focus is a fully functional data lake without HDFS.
[hdpsysuser@hdpmaster minio]$ pwd
/data/apps/minio
-- MINIO SERVER
[hdpsysuser@hdpmaster minio]$ wget https://dl.min.io/server/minio/release/linux-amd64/minio
[hdpsysuser@hdpmaster minio]$ chmod +x minio
[hdpsysuser@hdpmaster minio]$ ./minio server /data/miniolocal --console-address ":9001"
OR
[hdpsysuser@hdpmaster ~]$ /data/apps/minio/minio server /data/miniolocal --console-address ":9001"
-- MINIO CLIENT
[hdpsysuser@hdpmaster minio]$ wget https://dl.min.io/client/mc/release/linux-amd64/mc
[hdpsysuser@hdpmaster minio]$ chmod +x mc
[hdpsysuser@hdpmaster minio]$ ./mc config host add myminio http://hdpmaster:9000 minioadmin minioadmin
mc: Configuration written to `/home/hdpsysuser/.mc/config.json`. Please update your access credentials.
mc: Successfully created `/home/hdpsysuser/.mc/share`.
mc: Initialized share uploads `/home/hdpsysuser/.mc/share/uploads.json` file.
mc: Initialized share downloads `/home/hdpsysuser/.mc/share/downloads.json` file.
-- To create an S3 bucket in MinIO, use the MinIO client
[hdpsysuser@hdpmaster minio]$ ./mc mb myminio/datalake
Bucket created successfully `myminio/datalake`.
-- confirm that your S3 bucket has been successfully created
[hdpsysuser@hdpmaster minio]$ ./mc ls myminio
[2022-07-05 08:42:30 +03] 0B datalake/
[hdpsysuser@hdpmaster minio]$ hdfs dfs -ls s3a://
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
These are Hadoop filesystem client classes, found in the hadoop-aws JAR. An exception reporting this class as missing means that the JAR is not on the classpath.
S3A depends on two JARs, alongside hadoop-common and its dependencies:
hadoop-aws JAR
aws-java-sdk-bundle JAR
The versions of hadoop-common and hadoop-aws must be identical, and both JARs should be placed in Hadoop's common lib folder.
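The copy step can be sketched as follows (the Hadoop home and version are assumptions; match them to your own install, where the two JARs ship under `share/hadoop/tools/lib`):

```shell
# Copy the S3A jars from Hadoop's tools dir into the common lib dir so they
# land on the default classpath (paths and versions are assumptions).
HADOOP_HOME=${HADOOP_HOME:-/data/apps/hadoop-3.1.1}
SRC="$HADOOP_HOME/share/hadoop/tools/lib"
DST="$HADOOP_HOME/share/hadoop/common/lib"
for jar in "$SRC"/hadoop-aws-*.jar "$SRC"/aws-java-sdk-bundle-*.jar; do
  [ -f "$jar" ] && cp "$jar" "$DST/"
done
echo "copied S3A jars (if present) to $DST"
```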
Configuring Hive Metastore
-- add the following configuration to the hive-site.xml (/data/apps/apache-hive-3.1.2-bin/conf)
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>false</value>
<description>Enables or disables SSL connections to S3.</description>
</property>
<property>
<name>fs.s3a.endpoint</name>
<description>AWS S3 endpoint to connect to. An up-to-date list is
provided in the AWS Documentation: regions and endpoints. Without this
property, the standard region (s3.amazonaws.com) is assumed.
</description>
<value>http://hdpmaster:9000</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<description>AWS access key ID (fs.s3a.awsAccessKeyId is the old, deprecated name).</description>
<value>minioadmin</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>AWS secret key (fs.s3a.awsSecretAccessKey is the old, deprecated name).</description>
<value>minioadmin</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
<description>Enable S3 path style access ie disabling the default virtual hosting behaviour.
Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting.
</description>
</property>
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
<description>The implementation class of the S3A Filesystem</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://hdpmaster:5432/hive</value>
<description>PostgreSQL JDBC driver connection URL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<description>PostgreSQL metastore driver class name</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>the username for the DB instance</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>the password for the DB instance</description>
</property>
</configuration>
Note: If you wish to use your local file system for the Hive warehouse, you could use the properties below in hive-site.xml (on current Hadoop versions, fs.defaultFS is the preferred name for the deprecated fs.default.name):
<property>
<name>hive.metastore.warehouse.dir</name>
<value>file:///data/hivelocalfs</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>fs.default.name</name>
<value>file:///data/hivelocalfs</value>
</property>
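With hive-site.xml in place, the metastore (and HiveServer2, for the beeline sessions shown later) can be started. A sketch, assuming the Hive install path used in this post:

```shell
# Start the Hive metastore and HiveServer2 in the background
# (HIVE_HOME is the install path from this walkthrough; skipped if absent).
export HIVE_HOME=${HIVE_HOME:-/data/apps/apache-hive-3.1.2-bin}
if [ -x "$HIVE_HOME/bin/hive" ]; then
  "$HIVE_HOME/bin/hive" --service metastore &
  "$HIVE_HOME/bin/hive" --service hiveserver2 &
else
  echo "hive not found at $HIVE_HOME; skipping"
fi
```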
Configuring Presto Catalog
We need to configure a catalog for MinIO in the Presto cluster.
-- Create the Presto catalog file (minio.properties) and restart Presto
[presto@hdpmaster ~]$ cd /data/apps/presto-server-0.260.1/etc/catalog/
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hdpmaster:9083
hive.s3.path-style-access=true
hive.s3.endpoint=http://hdpmaster:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
hive.s3select-pushdown.enabled=true
hive.s3.ssl.enabled=false
-- verify from Presto that the new catalog appears
[presto@hdpmaster catalog]$ prestocli --server hdpmaster:6060 --catalog minio --schema default
presto:default> SHOW CATALOGS;
-- check different presto commands
presto:default> CREATE SCHEMA minio.myschema;
CREATE SCHEMA
presto:default> drop schema myschema;
DROP SCHEMA
Testing the Data Lake
0: jdbc:hive2://hdpmaster:10000/default> CREATE SCHEMA myschema location 's3a://datalake/cust.db/myschema';
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hdpsysuser_20220705142740_4e18c1b1-72f0-49bd-be41-d773bb07f28d); Time taken: 0.086 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
No rows affected (0.131 seconds)
-- connect to beeline using new schema
0: jdbc:hive2://hdpmaster:10000/default> !connect jdbc:hive2://hdpmaster:10000/myschema
Connecting to jdbc:hive2://hdpmaster:10000/myschema
Enter username for jdbc:hive2://hdpmaster:10000/myschema:
Enter password for jdbc:hive2://hdpmaster:10000/myschema:
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
-- create external table in Hive (non-partitioned and partitioned)
CREATE EXTERNAL TABLE customer_ext(
cust_id int,
cust_name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://datalake/cust.db/customer_ext';
Upload the CSV file to the external table location and verify with a query.
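The upload can be done with the MinIO client configured earlier; a sketch with hypothetical sample rows (the `myminio` alias and bucket path come from the steps above):

```shell
# Create a small sample CSV and upload it to the external table's location
# (sample data is made up; 'myminio' is the mc alias configured earlier).
cat > /tmp/customers.csv <<'EOF'
1,inam
2,ali
3,abuzar
EOF
if command -v mc >/dev/null 2>&1; then
  mc cp /tmp/customers.csv myminio/datalake/cust.db/customer_ext/
else
  echo "mc not on PATH; skipping upload"
fi
wc -l < /tmp/customers.csv
```

After the upload, a `select * from customer_ext;` in beeline should return the uploaded rows.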
CREATE EXTERNAL TABLE customer(
cust_id int,
cust_name string
)
PARTITIONED BY (cust_loc string) STORED AS ORC
LOCATION 's3a://datalake/cust.db/customer';
0: jdbc:hive2://hdpmaster:10000/default> CREATE EXTERNAL TABLE customer(
. . . . . . . . . . . . . . . . . . . .> cust_id int,
. . . . . . . . . . . . . . . . . . . .> cust_name string
. . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . .> PARTITIONED BY (cust_loc string) STORED AS ORC
. . . . . . . . . . . . . . . . . . . .> LOCATION 's3a://datalake/cust.db/customer';
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hdpsysuser_20220705140103_173889db-5494-4588-94cb-eae8a9d71f0d); Time taken: 0.174 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
No rows affected (0.23 seconds)
-- Create Internal table in Hive
CREATE TABLE customer_int(
cust_id int,
cust_name string
)
PARTITIONED BY (cust_loc string) STORED AS ORC
LOCATION 's3a://datalake/cust.db/customer_int';
-- insert into internal table
presto:default> insert into customer_int values(1,'inam','lahore');
INSERT: 1 row
Query 20220705_110901_00039_ibg9w, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:04 [0 rows, 0B] [0 rows/s, 0B/s]
presto:default> insert into customer_int values(1,'ali','riyadh');
presto:default> insert into customer_int values(3,'abuzar','riyadh');
presto:default> select * from customer_int;
cust_id | cust_name | cust_loc
---------+-----------+----------
1 | ali | riyadh
1 | inam | lahore
(2 rows)
Query 20220705_111320_00042_ibg9w, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:01 [2 rows, 698B] [1 rows/s, 589B/s]
-- delete record presto
presto:default> delete from customer_int where cust_loc='lahore'; -- deletes the complete partition
DELETE
Query 20220705_111430_00043_ibg9w, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]
presto:default> delete from customer_int where cust_id=3;
Query 20220705_111718_00046_ibg9w failed: This connector only supports delete where one or more partitions are deleted entirely
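This failure is expected: Presto's Hive connector only supports deletes that remove whole partitions. One workaround, sketched below, is to drop the partition from Hive instead (the connection URL matches the beeline session above; the partition value is from the sample data):

```shell
# Drop a whole partition via beeline instead of a row-level delete
# (requires a running HiveServer2; skipped here when beeline is absent).
if command -v beeline >/dev/null 2>&1; then
  beeline -u jdbc:hive2://hdpmaster:10000/myschema \
    -e "ALTER TABLE customer_int DROP PARTITION (cust_loc='riyadh');"
  status="partition dropped"
else
  status="beeline not on PATH; skipping"
fi
echo "$status"
```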
--Query from Hive (beeline)
0: jdbc:hive2://hdpmaster:10000/default> select * from customer_int;
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+-----------------------+-------------------------+------------------------+
| customer_int.cust_id | customer_int.cust_name | customer_int.cust_loc |
+-----------------------+-------------------------+------------------------+
| 1 | inam | lahore |
+-----------------------+-------------------------+------------------------+
1 row selected (4.489 seconds)