As the emerging concept of a data lakehouse continues to gain traction, I thought I would write its hello world, which I have named Hello Lakehouse. In this post I'll first explain some necessary concepts and then move on to the implementation using open source technologies.
Introduction
A data lakehouse is an open data management architecture that combines the flexibility and cost-efficiency of data lakes with the data management and structure features of data warehouses, all on one data platform. So it is worth discussing the data warehouse a little first, then the data lake, and finally the lakehouse.
Data Warehouse
A data warehouse is purpose-built for BI and reporting; however, it doesn't provide much support for video, audio, text, data science, or ML, and its support for streaming is limited. It is closed and based on proprietary formats, which is why most data ends up being stored in data lakes instead. Furthermore, a DWH has high implementation and maintenance costs, becomes outdated, and requires regular maintenance.
Data Lake
A data lake stores all data types in open file formats on low-cost cloud/object storage, which makes it flexible and cheap. However, it lacks the transactional guarantees, schema enforcement, and governance of a warehouse, which often leads to unreliable, hard-to-manage data.
Data Lakehouse - Fixing Data Lake Problems
A lakehouse is a combined approach enabled by a new system design: implementing data structures and data management features similar to those in a data warehouse directly on top of low-cost cloud/object storage in open formats. A lakehouse gives you data versioning, governance, security, reduced data duplication, and the ACID properties that are needed even for unstructured data.
Data lakehouses usually start as data lakes containing all data types; the data is then converted to an open table format (e.g. Delta Lake, Apache Hudi, or Apache Iceberg), an open-source storage layer that brings reliability to data lakes by enabling the ACID transactional guarantees of traditional data warehouses on top of them.
Data lakehouses give the most widely used business intelligence tools (Tableau, Power BI) direct access to the data for advanced analytics. Additionally, data lakehouses use open data formats (such as Parquet) together with APIs and machine learning libraries, including Python/R, making it straightforward for data scientists and machine learning engineers to work with the data.
Data lakehouse architecture enforces schema and data integrity, making it easier to implement robust data security and governance mechanisms.
The main disadvantage of a data lakehouse is that it is still a relatively new technology. As such, it is unclear whether it will live up to all of its promises, and it may be years before data lakehouses can compete with mature big-data storage solutions.
Quick Comparison - DWH, Data Lake, Lakehouse
                 | Data Warehouse      | Data Lake                    | Data Lakehouse
Data types       | Structured only     | All types                    | All types
Format           | Closed, proprietary | Open                         | Open
Cost             | High                | Low                          | Low
Reliability      | High quality, ACID  | Low quality, no ACID         | High quality, ACID
Best suited for  | BI & reporting      | Data science, ML, streaming  | BI, SQL, data science, ML
Environment:
1- CentOS
2- S3 Object Store (Minio), the storage layer for our lakehouse
3- Hive metastore service (HMS), a service that stores metadata and is used by Trino
4- Apache Iceberg, an open table format for huge analytic datasets
5- Trino 391, a distributed SQL query engine (requires Java 17+)
Step-1: Check your S3 Object store is working
I've chosen Minio for the storage layer of my lakehouse. If you want to install Minio, you can check this post: Create Data Lake Without Hadoop. Pay particular attention to the section "Configuring Hive Metastore" for the S3-related configuration of the metastore.
[hdpsysuser@hdpmaster ~]$ /data/apps/minio/minio server /data/miniolocal --console-address ":9001"
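Once the server is up, it's worth confirming that it's reachable and that the bucket holding the lakehouse data exists. A quick sketch with the MinIO client (mc) follows; the alias, credentials, and bucket name (datalake, matching the paths used later) are placeholders for my environment:
# register the MinIO endpoint under an alias (credentials are placeholders)
[hdpsysuser@hdpmaster ~]$ mc alias set myminio http://hdpmaster:9000 minioadmin minioadmin
# create the bucket that will hold the Iceberg data
[hdpsysuser@hdpmaster ~]$ mc mb myminio/datalake
# confirm the bucket is visible
[hdpsysuser@hdpmaster ~]$ mc ls myminio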
Step-2: Check HMS is working
The HMS, with its S3-related configuration, must be running.
[hdpsysuser@hdpmaster ~]$ hive --service metastore
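If you want a quick sanity check that the metastore is listening, its Thrift endpoint (port 9083 by default; an assumption if you've changed it in hive-site.xml) should show up as an open socket:
# look for the metastore's Thrift listener
[hdpsysuser@hdpmaster ~]$ ss -tlnp | grep 9083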
Step-3: Configure Trino for Iceberg
The Iceberg table state is maintained in metadata files. All changes to table state create a new metadata file and replace the old metadata with an atomic swap. The table metadata file tracks the table schema, partitioning config, custom properties, and snapshots of the table contents.
Iceberg data files can be stored in either Parquet or ORC format, as determined by the format property in the table definition; the file format defaults to ORC. Since Iceberg stores the paths to data files in the metadata files, it only consults the underlying file system for files that must be read.
Iceberg scales to petabytes independent of the underlying storage layer and the access engine layer.
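Trino needs a catalog file that points the Iceberg connector at the metastore and the S3 store. Below is a minimal sketch of such a file (e.g. etc/catalog/iceberg.properties); the metastore URI, endpoint, and credentials are placeholders for my environment, and the property names follow the Trino 391 Iceberg connector, so adjust them to your installation:
# Iceberg connector backed by the Hive metastore
connector.name=iceberg
hive.metastore.uri=thrift://hdpmaster:9083
# default file format for new tables
iceberg.file-format=ORC
# S3 (MinIO) access - endpoint and credentials are placeholders
hive.s3.endpoint=http://hdpmaster:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false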
Step-4: Test with Iceberg tables
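The queries below run against a small dept table (plus an emp table used for the materialized view example further down). As a rough sketch, and assuming the Trino CLI session already points at the Iceberg catalog and the datalake bucket, the schema and tables could be created like this (the column layout mirrors the classic SCOTT demo schema and is my assumption):
-- schema backed by a location in the MinIO bucket
CREATE SCHEMA icebergdb
WITH (location = 's3a://datalake/icebergdb');

-- Iceberg tables stored as ORC
CREATE TABLE icebergdb.dept (
  deptno integer,
  dname  varchar,
  loc    varchar
)
WITH (format = 'ORC');

CREATE TABLE icebergdb.emp (
  empno  integer,
  ename  varchar,
  job    varchar,
  sal    double,
  comm   double,
  deptno integer
)
WITH (format = 'ORC');

-- a couple of sample rows to query against
INSERT INTO icebergdb.dept VALUES (10, 'ACCOUNTING', 'NEW YORK'), (20, 'RESEARCH', 'DALLAS');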
Metadata columns/Tables
In addition to the defined columns, the Iceberg connector automatically exposes path metadata as a hidden column ("$path") in each table.
trino:icebergdb> SELECT *, "$path"
-> FROM icebergdb.dept;
trino:icebergdb> SELECT * FROM "dept$data"; -- alias for the Iceberg table itself
trino:icebergdb> SELECT * FROM "dept$properties"; -- retrieve the properties of the current snapshot
key | value
----------------------+-------
write.format.default | ORC
(1 row)
trino:icebergdb> SELECT * FROM "dept$files"; -- detailed overview of the data files
content | file_path | file_format | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_c
---------+-----------------------------------------------------------------------------+-------------+--------------+--------------------+--------------+-----------------+-------------
0 | s3a://datalake/icebergdb/dept/data/021c3231-602a-403c-8396-620c94c09572.orc | ORC | 4 | 602 | NULL | {1=4, 2=4, 3=4} | {1=0, 2=0, 3
(1 row)
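The connector exposes a few more metadata tables in the same "table$suffix" style; for example, the snapshots and history tables show what each write created and how the current snapshot evolved:
trino:icebergdb> SELECT snapshot_id, operation, summary FROM "dept$snapshots";
trino:icebergdb> SELECT * FROM "dept$history"; -- snapshot lineage of the table
trino:icebergdb> SELECT * FROM "dept$manifests"; -- manifest files of the current snapshot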
The Iceberg connector supports materialized view management. In the underlying system, each materialized view consists of a view definition and an Iceberg storage table. You can use the Iceberg table properties to control the created storage table and therefore its layout and performance.
CREATE MATERIALIZED VIEW pmvw_emp
COMMENT 'Testing materialized view'
WITH ( format = 'ORC', partitioning = ARRAY['deptno'] )
AS
SELECT d.deptno,d.dname,e.ename,e.job,e.sal + coalesce(e.comm,0) "Total Salary"
FROM dept d,emp e
WHERE d.deptno=e.deptno;
-- Show defined materialized view properties
trino:icebergdb> SELECT * FROM system.metadata.materialized_view_properties;
-- Show metadata about the materialized views
trino:icebergdb> SELECT * FROM system.metadata.materialized_views;
trino:icebergdb> SHOW CREATE MATERIALIZED VIEW pmvw_emp;
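The storage table is repopulated when you refresh the materialized view; after that you can query it like any other table and drop it when it's no longer needed:
trino:icebergdb> REFRESH MATERIALIZED VIEW pmvw_emp; -- rebuild the Iceberg storage table
trino:icebergdb> SELECT * FROM pmvw_emp WHERE deptno = 10;
trino:icebergdb> DROP MATERIALIZED VIEW pmvw_emp;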