Monday, August 08, 2022
Satisfy your search use cases using OpenSearch & OpenSearch Dashboards
Thursday, July 28, 2022
Hello Lakehouse! Building Your First On-Prem Data Lakehouse
As the emerging concept of a data lakehouse continues to gain traction, I thought I would write its hello world, which I have named Hello Lakehouse. In this post I will first explain some necessary concepts and then move on to the implementation using open source technologies.
Monday, July 25, 2022
Centralized Logging with Fluentd/Fluent Bit and MinIO
Fluentd is an open source data collector for building a unified logging layer. Once installed on a server, it runs in the background to collect, parse, transform, analyze, and store various types of data. It is written in Ruby for flexibility, with performance-sensitive parts in C. td-agent is a stable distribution package of Fluentd with a 30-40 MB memory footprint.
Fluent Bit is a lightweight data forwarder for Fluentd, with a memory footprint of roughly 450 KB. It is designed specifically for forwarding data from the edge (containers, servers, embedded systems) to Fluentd aggregators.
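As a rough illustration of this edge-to-aggregator pattern, a minimal Fluent Bit configuration could tail local application logs and ship them to a Fluentd aggregator over the forward protocol. The log path, tag, and aggregator host below are assumptions for the sketch, not values from the post.

# Hypothetical edge config: tail application logs and forward them to Fluentd
[INPUT]
    Name  tail
    Path  /var/log/app/*.log
    Tag   app.logs

[OUTPUT]
    Name   forward
    Match  app.*
    Host   fluentd-aggregator.local
    Port   24224

On the aggregator side, a matching forward input in td-agent (default port 24224) would receive these records and could route them on to longer-term storage, for example MinIO through an S3-compatible output.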
Thursday, July 07, 2022
Using Filebeat/Logstash to send logs to a MinIO Data Lake
To aggregate logs directly into an object store like MinIO, you can use the Logstash S3 output plugin. Logstash buffers events and periodically writes objects to S3 (or any S3-compatible store), which are then available for later analysis. For more information, please review the related post at the end of this post.
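As a rough sketch of what such a pipeline could look like, the configuration below points the S3 output at a MinIO endpoint; the endpoint, bucket, and credentials are placeholders, and force_path_style is the setting typically required for S3-compatible stores.

input {
  beats { port => 5044 }   # e.g. Filebeat shipping logs to Logstash
}

output {
  s3 {
    endpoint            => "http://minio.local:9000"   # hypothetical MinIO endpoint
    access_key_id       => "MINIO_ACCESS_KEY"          # placeholder credentials
    secret_access_key   => "MINIO_SECRET_KEY"
    bucket              => "logs"                      # hypothetical bucket
    codec               => "json_lines"
    time_file           => 5                           # roll a new object roughly every 5 minutes
    additional_settings => { "force_path_style" => true }
  }
}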
Tuesday, July 05, 2022
Create Data Lake Without Hadoop
In this post, the focus is on building a modern data lake using only open source technologies. I will walk through a step-by-step process to demonstrate how we can leverage an S3-compatible object storage (MinIO) and a distributed SQL query engine (Presto) to achieve this. For some administrative work, we may use Hive as well.
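To give a flavour of how the pieces connect, a Presto Hive catalog can be pointed at MinIO by overriding the S3 endpoint in the catalog properties; the metastore URI, endpoint, and credentials below are placeholders for the sketch, not values from the post.

# etc/catalog/minio.properties (hypothetical catalog file)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive-metastore.local:9083
hive.s3.endpoint=http://minio.local:9000
hive.s3.aws-access-key=MINIO_ACCESS_KEY
hive.s3.aws-secret-key=MINIO_SECRET_KEY
hive.s3.path-style-access=true

Tables registered in the Hive metastore with a location on MinIO can then be queried from Presto with plain SQL.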
Monday, May 04, 2020
Connect to Presto from Spark
1- Copy the Presto JDBC driver to the Spark master's jars location, e.g. /opt/progs/spark-2.4.5-bin-hadoop2.7/jars
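Once the driver jar is on Spark's classpath, Presto can be queried over JDBC from spark-shell. A minimal sketch is shown below; the coordinator URL, catalog/schema, and table name are assumptions, not values from the post.

// Read a Presto table into a DataFrame over JDBC (spark-shell, Scala)
val prestoDF = spark.read
  .format("jdbc")
  .option("driver", "com.facebook.presto.jdbc.PrestoDriver")
  .option("url", "jdbc:presto://presto-coordinator:8080/hive/default") // assumed coordinator, catalog and schema
  .option("dbtable", "web_logs")                                       // hypothetical table
  .option("user", "spark")
  .load()

prestoDF.show(10)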
Kudu Integration with Spark
Kudu integrates with Spark through the Data Source API. I downloaded the jar files from the location below:
https://jar-download.com/artifacts/org.apache.kudu/kudu-spark2_2.11/1.10.0/source-code
You can place the jar files in $SPARK_HOME/jars (e.g. /opt/progs/spark-2.4.5-bin-hadoop2.7/jars) if you don't want to use the --jars option with spark-shell.
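With the kudu-spark2 jars on the classpath, a Kudu table can be read through the Data Source API. The spark-shell sketch below assumes a hypothetical Kudu master address and table name.

// Load a Kudu table as a DataFrame via the Data Source API (spark-shell, Scala)
import org.apache.kudu.spark.kudu._

val kuduDF = spark.read
  .options(Map(
    "kudu.master" -> "kudu-master-1:7051",  // assumed Kudu master address
    "kudu.table"  -> "users"                // hypothetical table name
  ))
  .format("kudu")
  .load()

kuduDF.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users LIMIT 10").show()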
Sunday, May 03, 2020
Working with Ignite [In-Memory Data Grid]
Introduction
Working with Apache Kudu
Introduction