Note: All posts take a practical approach and avoid lengthy theory. All have been tested on development servers. Please do not try any post on a production server until you are sure.

Saturday, February 18, 2017

Hadoop Ecosystem - Quick Introduction

This is the age of data: data, data everywhere. Although we cannot measure the total volume of data stored electronically, it was estimated at 4.4 zettabytes in 2013 and forecast to grow tenfold to 44 zettabytes by 2020. Clearly we can say this is the Zettabyte Era. A zettabyte is equal to one thousand exabytes, one million petabytes, or one billion terabytes.

As data grows too fast for any organization, and people rely on data heavily for better decision making, it is becoming impossible to store and process this huge (big) data on any single server, no matter how well we scale it up. So how do we deal with this challenge? The savior is "Hadoop".

What is Hadoop?

Hadoop is an open-source Java-based programming framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Since its initial release, Hadoop has been continuously developed and updated. The second iteration of Hadoop (Hadoop 2) improved resource management and scheduling.

Why is Hadoop important?

    Ability: Hadoop can store and process huge volumes of widely varied data quickly. Sources may include social media, the Internet of Things (IoT) and machine-generated data.
    Computing power: Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
    Fault tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
    Flexibility: Unlike traditional relational databases, you don't have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
    Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data.
    Scalability: You can easily grow your system to handle more data simply by adding nodes. Little administration is required.

Organizations can deploy Hadoop components and supporting software packages in their local data center. However, most big data projects depend on short-term use of substantial computing resources. This type of usage is best-suited to highly scalable public cloud services, such as Amazon Web Services (AWS), Google Cloud Platform and Microsoft Azure. Public cloud providers often support Hadoop components through basic services, such as AWS Elastic Compute Cloud and Simple Storage Service instances.

Hadoop Modules and Projects

As a software framework, Hadoop is composed of numerous functional modules like Hadoop Common, Hadoop Distributed File System, Hadoop YARN and Hadoop MapReduce. Together these systems give users the tools to support additional Hadoop projects, along with the ability to process large data sets in real time while automatically scheduling jobs and managing cluster resources.

Below is a short description of the main Hadoop components.


The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters.
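HDFS achieves both scale and fault tolerance by splitting each file into fixed-size blocks (128 MB by default in Hadoop 2) and storing several copies of every block on different DataNodes (three by default). The Python sketch below is purely illustrative: the round-robin placement, node names and sizes are assumptions for the toy, not HDFS's actual placement algorithm.

```python
from itertools import cycle

def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file of file_size bytes into blocks and assign each block
    `replication` copies on distinct nodes (toy round-robin placement)."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    ring = cycle(range(len(nodes)))
    placement = {}
    for b in range(n_blocks):
        start = next(ring)
        # Pick `replication` consecutive nodes so all copies are distinct.
        placement[b] = [nodes[(start + i) % len(nodes)] for i in range(replication)]
    return placement

# A 400 MB file with 128 MB blocks needs 4 blocks, each stored on 3 of 5 nodes.
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
layout = place_blocks(400 * 2**20, 128 * 2**20, nodes)
```

If any single DataNode fails, every block it held still has two live copies elsewhere, which is why the cluster keeps serving reads.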

Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology. YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation's open source distributed processing framework.

Hadoop MapReduce is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.
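The classic way to see the MapReduce model is a word count: a map phase emits (key, value) pairs, the framework shuffles pairs by key, and a reduce phase combines each key's values. This is a minimal local Python sketch of that flow, not real Hadoop code, which would run these phases in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one line of input.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts -> {"the": 3, "quick": 1, "brown": 1, "fox": 2, "lazy": 1, "dog": 1}
```

On a real cluster each input line (or block) could be mapped on a different node, and the framework handles the shuffle and any re-execution of failed tasks.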

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It supports queries expressed in an SQL-like language called HiveQL, which Hive automatically translates into MapReduce jobs executed on Hadoop. In addition, HiveQL supports plugging custom MapReduce scripts into queries.
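Because HiveQL reads much like ordinary SQL, the flavor of a Hive query can be sketched locally. Here SQLite (from the Python standard library) stands in for a Hive table purely for illustration; the table and data are invented, and in Hive the same GROUP BY would be compiled into a MapReduce job over data in HDFS.

```python
import sqlite3

# In-memory SQLite stands in for a Hive table in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("a", "/home"), ("b", "/home"), ("a", "/docs")])

# HiveQL shares this syntax: count page views per URL.
rows = conn.execute(
    "SELECT url, COUNT(*) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
# rows -> [("/docs", 1), ("/home", 2)]
```

The appeal of Hive is exactly this: analysts write familiar SQL, and the translation to distributed MapReduce execution is automatic.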

R Connectors
R and Hadoop can be integrated to scale data analytics to big data analytics using R connectors.

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification. Many of the implementations use the Apache Hadoop platform.

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
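A Pig Latin script is a sequence of dataflow steps (LOAD, GROUP, FOREACH ... GENERATE), where each step names a new relation. The Python sketch below mirrors that shape on an in-memory list; the field names and data are invented for illustration, and in Pig each step would run as parallel MapReduce work.

```python
from itertools import groupby
from operator import itemgetter

# Toy input relation of (user, action) tuples, as Pig's LOAD would produce.
records = [("alice", "click"), ("bob", "view"),
           ("alice", "view"), ("alice", "click")]

# Equivalent of: grouped = GROUP records BY user;
records.sort(key=itemgetter(0))
grouped = {user: list(rows) for user, rows in groupby(records, key=itemgetter(0))}

# Equivalent of: counts = FOREACH grouped GENERATE group, COUNT(records);
counts = {user: len(rows) for user, rows in grouped.items()}
# counts -> {"alice": 3, "bob": 1}
```

Each statement consumes one relation and produces another, which is what makes the structure easy to parallelize.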

Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.

Apache ZooKeeper is a distributed, open-source coordination service for distributed applications: it maintains configuration information and naming, and provides distributed synchronization and group services. It allows distributed processes to coordinate with each other through a shared hierarchical namespace, which is organized similarly to a standard file system.

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data (e.g., log data) into the Hadoop Distributed File System (HDFS). It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.

Apache Sqoop ("SQL to Hadoop") is a Java-based, console-mode application designed for transferring bulk data between Apache Hadoop and non-Hadoop datastores, such as relational databases, NoSQL databases and data warehouses.

Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS). Use Apache HBase when you need random, real-time read/write access to your Big Data.
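Conceptually, an HBase table behaves like a sorted map from (row key, column) to value, where columns are named "family:qualifier" and individual cells can be read or written at random. The dictionary-based Python sketch below illustrates only that data model, not HBase's storage engine; the table, row keys and column names are invented.

```python
# Toy model of HBase's data model: a map of (row_key, column) -> value
# supporting random reads and writes of individual cells.
table = {}

def put(row, column, value):
    # Write one cell, e.g. column "info:name" in column family "info".
    table[(row, column)] = value

def get(row, column):
    # Random read of one cell; missing cells return None.
    return table.get((row, column))

put("user#1", "info:name", "alice")
put("user#1", "info:email", "alice@example.com")
put("user#2", "info:name", "bob")
# get("user#1", "info:name") -> "alice"
```

Rows need not share the same set of columns, which is how HBase stores sparse data efficiently.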

Apache Ambari is a software project of the Apache Software Foundation. Ambari enables system administrators to provision, manage and monitor a Hadoop cluster, and also to integrate Hadoop with the existing enterprise infrastructure.