Please see my other blog for Oracle EBusiness Suite Posts - EBMentors


Note: All posts take a practical approach and avoid lengthy theory. All have been tested on development servers. Please don’t try any post on production servers until you are sure.

Saturday, June 24, 2017

Installing/Configuring and working with Apache Kafka


Apache Kafka is an open source, distributed publish-subscribe messaging system, designed mainly for persistent messaging, high throughput, support for multiple clients, and real-time message visibility for consumers.

Kafka addresses a common real-time problem in software systems: handling high volumes of incoming information and routing it to multiple consumers quickly. It integrates the producers and consumers of information without blocking the producers and without the producers needing to know who the final consumers are. It also supports parallel data loading into Hadoop systems.
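The decoupling described above can be illustrated with a toy in-memory model. This is only a sketch of the publish-subscribe idea, not the Kafka client API: the class `ToyTopic`, its methods, and the consumer names are all hypothetical.

```python
from collections import defaultdict


class ToyTopic:
    """Hypothetical in-memory stand-in for a topic: an append-only message log."""

    def __init__(self):
        self.log = []                    # messages are kept, not removed on read
        self.offsets = defaultdict(int)  # consumer name -> next index to read

    def publish(self, message):
        # The producer only appends; it never waits for or knows about consumers.
        self.log.append(message)

    def poll(self, consumer):
        # Each consumer reads from its own position, independently of the others.
        start = self.offsets[consumer]
        batch = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return batch


topic = ToyTopic()
topic.publish("page_view:/home")
topic.publish("page_view:/cart")
first_for_analytics = topic.poll("analytics")

topic.publish("page_view:/checkout")
all_for_audit = topic.poll("audit")              # a late consumer still sees everything
second_for_analytics = topic.poll("analytics")   # only the message it has not read yet
```

Because every consumer keeps its own read position, adding a new consumer requires no change on the producer side.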

Friday, June 23, 2017

Forward syslog to Flume with rsyslog



In computing, syslog is a standard for message logging. It allows separation of the software that generates messages, the system that stores them, and the software that reports and analyzes them. Each message is labeled with a facility code, indicating the software type generating the message, and assigned a severity label.

Computer system designers may use syslog for system management and security auditing as well as general informational, analysis, and debugging messages. A wide variety of devices, such as printers, routers, and message receivers across many platforms use the syslog standard. This permits the consolidation of logging data from different types of systems in a central repository. Implementations of syslog exist for many operating systems.
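As a small illustration of the facility and severity labels, the syslog protocol packs both into a single priority (PRI) number computed as facility * 8 + severity. The sketch below uses only a subset of the standard facility codes; the function names are my own.

```python
# Subset of the standard syslog facility and severity codes.
FACILITIES = {"kern": 0, "user": 1, "mail": 2, "daemon": 3, "auth": 4,
              "local0": 16, "local7": 23}
SEVERITIES = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
              "warning": 4, "notice": 5, "info": 6, "debug": 7}


def encode_pri(facility, severity):
    """Combine a facility name and severity name into the PRI value."""
    return FACILITIES[facility] * 8 + SEVERITIES[severity]


def decode_pri(pri):
    """Split a PRI value back into (facility code, severity code)."""
    return divmod(pri, 8)


auth_crit = encode_pri("auth", "crit")      # 4 * 8 + 2
local0_info = encode_pri("local0", "info")  # 16 * 8 + 6
print(f"<{auth_crit}> <{local0_info}>")
print(decode_pri(local0_info))
```

The PRI value is what appears in angle brackets at the start of a raw syslog message.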

Streaming Twitter Data by Flume using Cloudera Twitter Source

My previous post, Streaming Twitter Data using Apache Flume, fetched tweets using Flume and Twitter streaming for data analysis. The Twitter source converts tweets to Avro format and sends the Avro events to downstream HDFS sinks. When a Hive table backed by Avro loaded the data, I got the error message "Avro block size is invalid or too large". To overcome this issue, I used the Cloudera TwitterSource rather than the Apache TwitterSource.

Streaming Twitter Data using Apache Flume


Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of streaming event data. It is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (event/log data) from various web servers and services like Facebook and Twitter to HDFS.

Building Teradata Presto Cluster

Before working through this post, you should review the posts below.

Installing/Configuring PrestoDB
Working with PrestoDB Connectors

In this post, I'll cover the following:

1- Installing and configuring Presto Admin
2- Installing Presto Cluster on a single node
3- Using Presto ODBC Driver
4- Installing and configuring Presto Cluster with one coordinator and three workers

Working with PrestoDB Connectors

First complete my previous post, Installing/Configuring PrestoDB.

Presto lets you connect to other databases through connectors, which provide the metadata and data for queries, so that you can run queries and joins across several sources. In this post we will work with some of these connectors. A coordinator (a master daemon) uses connectors to get the metadata (such as table schemas) needed to build a query plan. Workers use connectors to get the actual data that they will process.

Installing/Configuring PrestoDB


Presto (invented at Facebook) is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. Unlike Hive, Presto doesn’t use the map reduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. A single Presto query can combine data (through pluggable connectors) from multiple sources, allowing for analytics across your entire organization. It is targeted at analysts who expect response times ranging from sub-second to minutes.

Managing HDFS Quotas

The Hadoop Distributed File System (HDFS) allows the administrator to set quotas for the number of names used and the amount of space used for individual directories. Name quotas and space quotas operate independently, but the administration and implementation of the two types of quotas are closely parallel.

Hadoop DFSAdmin Commands

The dfsadmin tools are a specific set of tools designed to help you root out information about your Hadoop Distributed File System (HDFS). As an added bonus, you can use them to perform some administration operations on HDFS as well.

Recover the deleted file/folder in HDFS

By default, Hadoop deletes files and directories permanently, but sometimes they are deleted accidentally and you want to get them back. To make that possible, you have to enable the Trash feature. Two properties (fs.trash.interval and fs.trash.checkpoint.interval) must be set in core-site.xml so that deleted files and directories are moved to the .Trash folder, which is located in HDFS at /user/$USER/.Trash.
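As a rough sketch of what those two settings look like, the snippet below generates the corresponding property entries for core-site.xml. Both values are in minutes; the numbers used here (1440 and 60) are illustrative assumptions, not recommendations, and the helper name `trash_properties` is my own.

```python
import xml.etree.ElementTree as ET


def trash_properties(interval_minutes, checkpoint_minutes):
    """Build a <configuration> fragment holding the two trash properties."""
    config = ET.Element("configuration")
    settings = (
        ("fs.trash.interval", interval_minutes),               # minutes before trash is purged
        ("fs.trash.checkpoint.interval", checkpoint_minutes),  # minutes between trash checkpoints
    )
    for name, value in settings:
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return config


# Assumed example values: keep deleted files for one day, checkpoint hourly.
fragment = trash_properties(1440, 60)
ET.indent(fragment)  # pretty-printing helper, requires Python 3.9+
xml_text = ET.tostring(fragment, encoding="unicode")
print(xml_text)
```

The printed property elements would be merged into the existing configuration element of core-site.xml rather than replacing the file.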

Hive Streaming

Streaming offers an alternative way to transform data. During a streaming job, the Hadoop 
Streaming API opens an I/O pipe to an external process. Data is then passed to 
the process, which operates on the data it reads from the standard input and writes the 
results out through the standard output, and back to the Streaming API job.
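A minimal sketch of such an external process is shown below. Hive passes rows to the process as tab-separated lines on standard input and reads tab-separated lines back from standard output. The transformation itself (trimming and lower-casing every column) and the sample rows are assumptions chosen only for illustration.

```python
import io
import sys


def transform(line):
    """Trim and lower-case every tab-separated column of one input row."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(field.strip().lower() for field in fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Read rows from the pipe one at a time and write the results back.
    for line in stdin:
        stdout.write(transform(line) + "\n")


# Simulate the pipe with in-memory streams; a real streaming script
# would simply call main() and use the process's own stdin/stdout.
fake_input = io.StringIO("1\tAlice \tRIYADH\n2\tBob\t Jeddah\n")
fake_output = io.StringIO()
main(fake_input, fake_output)
result = fake_output.getvalue()
print(result, end="")
```

In a Hive query, a script like this would be attached through the TRANSFORM ... USING clause.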

Thursday, June 08, 2017

Installing/Configuring and Working on Apache Sqoop


Apache Sqoop is a Hadoop ecosystem tool (a Hadoop client) designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as Oracle. It helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. It can also be used to extract data from Hadoop and export it into external structured datastores.

Friday, June 02, 2017

Apache PIG - a Short Tutorial


Apache Pig is an abstraction over MapReduce. It was developed as a research project at Yahoo in 2006 and open sourced via the Apache Incubator in 2007. In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project. It is a tool/platform used to analyze large data sets by representing them as data flows. To write data analysis programs, Pig provides a high-level language known as Pig Latin. Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs, that is, into Map and Reduce tasks.
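To give a feel for what "converted to Map and Reduce tasks" means, here is a plain Python sketch of a word-count data flow split into a map step and a reduce step. It only illustrates the general MapReduce idea; it is not how the Pig Engine actually plans jobs, and the sample records and function names are hypothetical.

```python
from itertools import groupby

# Hypothetical input records, one line of text each.
records = ["hadoop pig", "pig hive", "pig"]


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_phase(pairs):
    """Group the pairs by word (after sorting) and sum the counts per word."""
    for word, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))


counts = dict(reduce_phase(map_phase(records)))
print(counts)
```

A Pig Latin script expresses the same kind of flow (load, group, count) at a higher level and leaves the map and reduce steps to the engine.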