Please see my other blog for Oracle EBusiness Suite Posts - EBMentors

Note: All posts take a practical approach and avoid lengthy theory. All have been tested on development servers. Please don’t try any post on production servers until you are sure.

Thursday, August 10, 2017

Working with Talend for Big Data (TOSBD)


Talend (Eclipse-based) provides unified development and management tools to integrate and process all of your data with an easy-to-use visual designer. It helps companies become data driven by making data more accessible, improving its quality and quickly moving it where it’s needed for real-time decision making.
Talend for Big Data is built on top of Talend's data integration solution. It enables users to access, transform, move and synchronize big data by leveraging the Apache Hadoop Big Data Platform, and it makes the Hadoop platform easier to use.

Tuesday, August 08, 2017

Analyzing/Parsing syslogs using Hive and Presto


My company asked me to provide a solution for syslog aggregation across all environments so that the logs can be analyzed for insights. Logs should be captured first, then retained, and finally processed by the analyst team in the way they already query and process data in a database. The requirements are not yet clear, and the volume of data can't be determined at this stage.

Wednesday, August 02, 2017

Working with Apache Cassandra (RHEL 7)

Cassandra (created at Facebook for inbox search) is, like HBase, a NoSQL database, which generally means you cannot manipulate the data with SQL. However, Cassandra implements CQL (Cassandra Query Language), whose syntax is modeled after SQL and which is designed to manage extremely large data sets. It is a distributed database: clients can connect to any node in the cluster and access any data.

Tuesday, August 01, 2017

Hortonworks - Using HDP Spark SQL

Using SQLContext, Apache Spark SQL can read data directly from the file system. This is useful when the data you are trying to analyze does not reside in Apache Hive (for example, JSON files stored in HDFS).

Monday, July 31, 2017

Installing/Configuring Hortonworks Data Platform [HDP]

Ambari is a completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Apache Ambari takes the guesswork out of operating Hadoop. As part of the Hortonworks Data Platform, it allows enterprises to plan, install and securely configure HDP, making it easier to provide ongoing cluster maintenance and management, no matter the size of the cluster.

Monday, July 10, 2017

Working with Apache Spark SQL

What is Spark?
Apache Spark is a lightning-fast, in-memory cluster computing technology designed for fast computation. Spark does not depend on Hadoop because it has its own cluster management; Hadoop is just one of the ways to deploy Spark, which can use Hadoop for storage. Spark extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.

Saturday, June 24, 2017

Installing/Configuring and working with Apache Kafka


Apache Kafka is an open source, distributed publish-subscribe messaging system, designed mainly for persistent messaging, high throughput, support for multiple clients, and real-time message visibility to consumers.

Kafka addresses a common real-time problem: handling large volumes of information as it arrives and routing it to multiple consumers quickly. It provides seamless integration between producers and consumers of information without blocking the producers, and without letting producers know who the final consumers are. It also supports parallel data loading into Hadoop systems.

Friday, June 23, 2017

Forward syslog to Flume with rsyslog



In computing, syslog is a standard for message logging. It allows separation of the software that generates messages, the system that stores them, and the software that reports and analyzes them. Each message is labeled with a facility code, indicating the software type generating the message, and assigned a severity label.

Computer system designers may use syslog for system management and security auditing as well as general informational, analysis, and debugging messages. A wide variety of devices, such as printers, routers, and message receivers across many platforms use the syslog standard. This permits the consolidation of logging data from different types of systems in a central repository. Implementations of syslog exist for many operating systems.
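
The facility code and severity label described above travel together in a single number at the start of each message. As a minimal sketch, assuming the `<PRI>` prefix used by RFC 3164 and RFC 5424 (priority = facility × 8 + severity), the two parts can be recovered as follows. The helper name `decode_pri` is hypothetical, and the short facility names are the conventional keywords; codes 13 to 15 have no universally agreed keyword, so the names used for them here are only illustrative.

```python
# Conventional short names; the numeric code is the list index.
# Names for facility codes 13-15 are illustrative, not standardized keywords.
FACILITIES = [
    "kern", "user", "mail", "daemon", "auth", "syslog", "lpr", "news",
    "uucp", "cron", "authpriv", "ftp", "ntp", "audit", "alert", "clock",
    "local0", "local1", "local2", "local3",
    "local4", "local5", "local6", "local7",
]
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def decode_pri(message):
    """Split '<PRI>rest' into (facility, severity, rest)."""
    if not message.startswith("<"):
        raise ValueError("message has no <PRI> prefix")
    end = message.index(">")
    pri = int(message[1:end])
    if not 0 <= pri <= 191:  # 24 facilities * 8 severities
        raise ValueError("PRI out of range: %d" % pri)
    # PRI = facility * 8 + severity, so divmod recovers both parts.
    facility, severity = divmod(pri, 8)
    return FACILITIES[facility], SEVERITIES[severity], message[end + 1:]


print(decode_pri("<34>su: 'su root' failed"))
```

Here 34 = 4 × 8 + 2, so the message is attributed to the `auth` facility with severity `crit`.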

Streaming Twitter Data by Flume using Cloudera Twitter Source

In my previous post, Streaming Twitter Data using Apache Flume, I fetched tweets using Flume and Twitter streaming for data analysis. The Twitter source converts tweets to Avro format and sends the Avro events to downstream HDFS sinks. When a Hive table backed by Avro loaded the data, I got the error message "Avro block size is invalid or too large". To overcome this issue, I used the Cloudera TwitterSource rather than the Apache TwitterSource.

Streaming Twitter Data using Apache Flume


Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of streaming event data. It is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (event/log data) from various web servers and services like Facebook and Twitter to HDFS.

Building Teradata Presto Cluster

Before working through this post, you should review the posts below.

Installing/Configuring PrestoDB
Working with PrestoDB Connectors

In this post, I'll cover the following:

1- Installing and configuring Presto Admin
2- Installing Presto Cluster on a single node
3- Using Presto ODBC Driver
4- Installing and configuring Presto Cluster with one coordinator and three workers

Working with PrestoDB Connectors

First complete my previous post, Installing/Configuring PrestoDB.

Presto lets you connect to other databases through connectors, which provide the metadata and data for queries, so that you can run queries and joins across several sources. In this post we will work with some of these connectors. A coordinator (a master daemon) uses connectors to get the metadata (such as table schemas) needed to build a query plan. Workers use connectors to get the actual data that they will process.

Installing/Configuring PrestoDB


Presto (invented at Facebook) is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. It allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. Unlike Hive, Presto doesn’t use the MapReduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. A single Presto query can combine data (through pluggable connectors) from multiple sources, allowing for analytics across your entire organization. It is targeted at analysts who expect response times ranging from sub-second to minutes.

Managing HDFS Quotas

The Hadoop Distributed File System (HDFS) allows the administrator to set quotas for the number of names used and the amount of space used for individual directories. Name quotas and space quotas operate independently, but the administration and implementation of the two types of quotas are closely parallel.

Hadoop DFSAdmin Commands

The dfsadmin tools are a specific set of tools designed to help you root out information about your Hadoop Distributed File system (HDFS). As an added bonus, you can use them to perform some administration operations on HDFS as well.

Recover the deleted file/folder in HDFS

By default, Hadoop deletes files and directories permanently, but sometimes they are deleted accidentally and you want to get them back. You have to enable the Trash feature for this purpose. Two properties (fs.trash.interval and fs.trash.checkpoint.interval) must be set in core-site.xml so that deleted files and directories are moved into the .Trash folder, which is located in HDFS at /user/$USER/.Trash.
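
As a rough sketch of what those two settings look like, the snippet below builds the corresponding `<property>` entries for core-site.xml. Both values are expressed in minutes; the values 1440 (one day) and 60 are assumptions chosen only for illustration, and `trash_properties` is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def trash_properties(interval_minutes, checkpoint_minutes):
    """Return a <configuration> fragment holding the two trash properties."""
    config = ET.Element("configuration")
    settings = (
        # Minutes a deleted item stays in .Trash (0 disables trash).
        ("fs.trash.interval", interval_minutes),
        # Minutes between trash checkpoints.
        ("fs.trash.checkpoint.interval", checkpoint_minutes),
    )
    for name, value in settings:
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(config)
    return ET.tostring(config, encoding="unicode")


# Illustrative values only: keep deleted items for one day, checkpoint hourly.
print(trash_properties(1440, 60))
```

The printed `<property>` elements would be merged into the existing `<configuration>` section of core-site.xml rather than replacing the file.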

Hive Streaming

Streaming offers an alternative way to transform data. During a streaming job, the Hadoop 
Streaming API opens an I/O pipe to an external process. Data is then passed to 
the process, which operates on the data it reads from the standard input and writes the 
results out through the standard output, and back to the Streaming API job.
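
To make the pipe model concrete, here is a minimal sketch of the external process side. It assumes rows arrive one per line with tab-separated columns, which is how Hive's `TRANSFORM ... USING` clause passes data by default; the function names and the transformation itself (upper-casing the first column) are purely illustrative.

```python
import io


def transform(line):
    """Upper-case the first tab-separated column; pass the rest through."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = fields[0].upper()
    return "\t".join(fields)


def run(stdin, stdout):
    """Read rows from stdin, write one transformed row per line to stdout."""
    for line in stdin:
        stdout.write(transform(line) + "\n")


# Local demonstration with in-memory streams standing in for the pipe.
demo_out = io.StringIO()
run(io.StringIO("alice\t30\nbob\t25\n"), demo_out)
print(demo_out.getvalue(), end="")
```

When used as a real streaming script, the demonstration lines would be replaced by a call to `run(sys.stdin, sys.stdout)`, so the job can feed rows in and read results back.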

Thursday, June 08, 2017

Installing/Configuring and Working on Apache Sqoop


Apache Sqoop is a Hadoop ecosystem tool (a Hadoop client) designed to efficiently transfer bulk data between Apache Hadoop and structured datastores like Oracle. It helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. It can also be used to extract data from Hadoop and export it into external structured datastores.

Friday, June 02, 2017

Apache PIG - a Short Tutorial


Apache Pig is an abstraction over MapReduce. It was developed as a research project at Yahoo in 2006 and was open sourced via the Apache incubator in 2007. In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project. It is a tool/platform used to analyze large sets of data by representing them as data flows. To write data analysis programs, Pig provides a high-level language known as Pig Latin. Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.

Tuesday, May 30, 2017

Creating External Table for HDFS using Oracle Connector for Hadoop (OSCH)


Oracle Big Data Connectors facilitate access to data stored in an Apache Hadoop cluster. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware. There are three connectors available; in this post we are going to work with Oracle SQL Connector for Hadoop Distributed File System.

Sunday, May 14, 2017

Connect Oracle SQL Developer to Hive

Oracle SQL Developer is one of the most common SQL client tools used by developers, data analysts and data architects to interact with Oracle and other relational systems, so extending SQL Developer to connect to Hive is very useful for Oracle users. You can use the SQL Worksheet to query, create and alter Hive tables, dynamically accessing data sources defined in the Hive metastore.

Tuesday, May 02, 2017

Using Hadoop Compression

Hadoop Compression

Hive can read data from a variety of sources, such as text files, sequence files, or even custom formats using Hadoop’s InputFormat APIs, and it can write data to various formats using the OutputFormat API. You can leverage Hadoop to store data compressed and save significant disk storage. Compression can also increase throughput and performance. Compressing and decompressing data incurs extra CPU overhead; however, the I/O savings from moving fewer bytes into memory can result in a net performance gain.

Sunday, April 30, 2017

Hive for Oracle Developers and DBAs - Part III

Today we will discuss some more topics in Hive, such as Hive queries, distributed clauses, sampling data, views, indexes and schema design. You can review the related posts below.

Hive for Oracle Developers and DBAs - Part I
Hive for Oracle Developers and DBAs - Part II

Thursday, April 27, 2017

Hive for Oracle Developers and DBAs - Part II

In the first Hive post we discussed the basic usage and functionality of Hive. Today we move forward and discuss some advanced functionality. I'll cover collections, tables and partitions in this post.

Sunday, April 23, 2017

Hive Installation and Configuration

What is Hive?

Apache Hive (which originated at Facebook) is a data warehouse system built to work on Hadoop and manage large datasets residing in HDFS. Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data using SQL. At the same time, Hive's SQL gives users multiple places to integrate their own functionality for custom analysis, such as User Defined Functions (UDFs). It is not designed for online transaction processing and is best used for traditional data warehousing tasks.

Saturday, February 25, 2017

Hive for Oracle Developers and DBAs - Part I

The Hadoop ecosystem emerged as a cost-effective way of working with large data sets. It imposes a particular programming model, called MapReduce, for breaking up computation tasks into units that can be distributed around a cluster of commodity, server class hardware, thereby providing cost-effective, horizontal scalability.
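
The idea of breaking a computation into distributable units can be illustrated with the classic word count. The following is a toy, single-process sketch of the map, shuffle and reduce steps; it is not how Hadoop itself is implemented, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big cluster"]))
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is what would let a cluster run those units on different machines.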

Thursday, February 23, 2017

Hadoop Administration: Accessing HDFS (File system & Shell Commands)

You can access HDFS in many different ways. HDFS provides a native Java application programming interface (API) and a native C-language wrapper for the Java API. In addition, you can use a web browser to browse HDFS files. I'll be using only the CLI in this post.

Tuesday, February 21, 2017

Setting up Hadoop Edge/Gateway Node (Hadoop Client)

We have a three-node Hadoop cluster (2.7.3), with one master and two slaves, already running in our environment. Now we want to set up a fourth instance as a client machine (analogous to an Oracle client) and submit commands from that client machine to the Hadoop cluster.

Monday, February 20, 2017

Setting up multi node Apache Hadoop Cluster 2.7.3 on RHEL 7.3

A multi-node Hadoop cluster uses a master-slave architecture across multiple nodes to accomplish Big Data processing. To set up a multi-node Hadoop cluster, I am going to use three machines (one as the MasterNode and the other two as SlaveNodes).

Saturday, February 18, 2017

Setting up single node Apache Hadoop Cluster 2.7.3 on RHEL 7.3

The purpose of this post is to set up a single-node Apache Hadoop 2.7.3 cluster for Big Data enthusiasts who want to learn parallel computation for tackling large datasets. Knowledge of Linux is a prerequisite for this post.

Below is the information for my environment. I'm using RHV.
RHV, Red Hat Enterprise Linux Server 7.3
Master Node:   hostname → hdpmaster           IP → 192.166.44.170     rootpwd → hadoop123

Hadoop Ecosystem - Quick Introduction

This is the data age: data, data everywhere. Although we cannot measure the total volume of data stored electronically, it was estimated at 4.4 zettabytes in 2013 and is forecast to grow tenfold to 44 zettabytes by 2020. Clearly, we can call this the Zettabyte Era. A zettabyte is equal to one thousand exabytes, one million petabytes, or one billion terabytes.
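
The unit relationships above can be checked with a few lines of arithmetic. This sketch assumes decimal (SI) units, where each step up is a factor of 1,000; the `UNITS` table and the `convert` helper are illustrative names.

```python
# Bytes per unit, assuming decimal (SI) prefixes.
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def convert(value, src, dst):
    """Convert a quantity from one unit to another via bytes."""
    return value * UNITS[src] / UNITS[dst]


for target in ("EB", "PB", "TB"):
    print("1 ZB =", convert(1, "ZB", target), target)

# The 2013 estimate expressed in exabytes.
print("4.4 ZB =", convert(4.4, "ZB", "EB"), "EB")
```

Under the same assumption, the forecast of 44 zettabytes corresponds to 44 billion terabytes.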

Sunday, February 12, 2017

Big Data - The Bigger Picture

I’ve titled this post "The Bigger Picture" instead of "The Big Picture" because even the big picture comes with a lot of detail. The aim of this post is to provide a broad understanding of the topic without going into deeper details.