

Note: All the posts are based on a practical approach, avoiding lengthy theory. All have been tested on development servers. Please don't test any post on production servers until you are sure.

Tuesday, April 24, 2018

Configuring Logstash with Elasticsearch


Introduction



Logstash is an open source data collection engine with real-time pipelining capabilities. It can dynamically unify data from disparate sources and normalize it into destinations of your choice, cleansing and democratizing all your data for diverse downstream analytics and visualization use cases. You can clean and transform data during ingestion to gain near real-time insights at index or output time. Logstash comes out of the box with many aggregations and mutations along with pattern matching, geo mapping, and dynamic lookup capabilities.
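
The post configures a real Logstash pipeline, but as a rough, hedged illustration of the same ingest-transform-output flow, here is a minimal Python sketch that parses log lines with a regex and bulk-indexes the events using the official elasticsearch client. The host, index name, and log path are assumptions, not values from the post.

import re
from elasticsearch import Elasticsearch, helpers

# Assumed local cluster; replace with your own nodes.
es = Elasticsearch("http://localhost:9200")

# Grok-style parsing done by hand with a regular expression.
pattern = re.compile(r"(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)")

def events(path):
    with open(path) as f:
        for line in f:
            m = pattern.match(line)
            if m:
                yield {"_index": "app-logs", "_source": m.groupdict()}

# Bulk-index the parsed events (the "output" stage of the pipeline).
helpers.bulk(es, events("/var/log/app.log"))   # assumed log file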

Monday, April 23, 2018

Integrating Hadoop and Elasticsearch


Introduction

Hadoop shines as a batch processing system, but serving real-time results can be challenging. For truly interactive data discovery, ES-Hadoop (the Elasticsearch-Hadoop connector) lets you index Hadoop data into the Elastic Stack to take full advantage of the speedy Elasticsearch engine and beautiful Kibana visualizations.
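
As a hedged sketch of what the connector enables, the PySpark job below writes a DataFrame read from HDFS into an Elasticsearch index through the ES-Hadoop Spark SQL data source. It assumes the elasticsearch-hadoop jar is on the Spark classpath and that Elasticsearch is reachable at localhost:9200; the paths and index name are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-hadoop-demo").getOrCreate()

# Assumed HDFS path holding JSON records.
df = spark.read.json("hdfs:///data/events/")

(df.write
   .format("org.elasticsearch.spark.sql")   # ES-Hadoop Spark SQL data source
   .option("es.nodes", "localhost")
   .option("es.port", "9200")
   .option("es.resource", "events/doc")     # target index/type
   .mode("append")
   .save())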

Thursday, April 19, 2018

Working with Elasticsearch


Introduction


Elasticsearch is a distributed, scalable, real-time search and analytics engine built on top of Apache Lucene™. Lucene (a library) is arguably the most advanced, high-performance, and fully featured search engine library in existence today, both open source and proprietary. Elasticsearch enables you to search, analyze, and explore your data whether you need full-text search, real-time analytics of structured data, or a combination of the two.
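
A minimal sketch with the official Python client, indexing one document and searching it back; the host, index name, and 7.x-style client calls are assumptions for illustration.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local node

# Index a document, then make it searchable immediately.
es.index(index="articles", id=1,
         body={"title": "Working with Elasticsearch", "views": 42})
es.indices.refresh(index="articles")

# Full-text match query against the title field.
result = es.search(index="articles",
                   body={"query": {"match": {"title": "elasticsearch"}}})
print(result["hits"]["total"])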

Tuesday, April 17, 2018

Configure Rsyslog with Any Log File



Modern Linux distributions ship with Rsyslog, which has some nice additional functionality: the imfile module provides the ability to convert any standard text file into a syslog message.
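
For comparison only, here is a rough Python sketch of what imfile automates: follow a plain text file and emit each new line to the local syslog daemon. The file path, tag, and facility are assumptions; the post itself does this purely with rsyslog configuration.

import time
import syslog

syslog.openlog(ident="myapp", facility=syslog.LOG_LOCAL0)

with open("/var/log/myapp/app.log") as f:   # assumed application log
    f.seek(0, 2)                            # start at the end of the file
    while True:
        line = f.readline()
        if line:
            syslog.syslog(syslog.LOG_INFO, line.rstrip())
        else:
            time.sleep(1)                   # wait for new lines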

Tuesday, March 20, 2018

HDFS Centralized Cache Management



Due to increasing memory capacity, many interesting working sets are able to fit in aggregate cluster memory. By using HDFS centralized cache management, applications can take advantage of the performance benefits of in-memory computation. Cluster cache state is aggregated and controlled by the NameNode, allowing application schedulers to place their tasks for cache locality.
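
A hedged sketch of driving the hdfs cacheadmin CLI from Python: create a cache pool and pin a hot directory into it. The pool name, path, and replication factor are assumptions; run it as a user with the required HDFS permissions.

import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a pool, add a directive for a hot dataset, then list it.
run(["hdfs", "cacheadmin", "-addPool", "analytics-pool"])
run(["hdfs", "cacheadmin", "-addDirective",
     "-path", "/data/hot", "-pool", "analytics-pool", "-replication", "2"])
run(["hdfs", "cacheadmin", "-listDirectives", "-pool", "analytics-pool"])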

Configuring ACLs on HDFS


Access Control Lists (ACLs) extend the HDFS permission model to support more granular file access based on arbitrary combinations of users and groups. We will discuss how to use ACLs on the Hadoop Distributed File System (HDFS).
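
As a hedged illustration, the snippet below uses the standard hdfs dfs ACL commands to grant a second user read access to a directory and to set a default ACL for new children. The user, group, and path are placeholders, and dfs.namenode.acls.enabled must be true on the cluster.

import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# Grant an extra user read/execute access on an existing directory.
run(["hdfs", "dfs", "-setfacl", "-m", "user:analyst:r-x", "/data/sales"])
# Default ACL so files created later inherit group read access.
run(["hdfs", "dfs", "-setfacl", "-m", "default:group:bi:r-x", "/data/sales"])
# Inspect the resulting ACL entries.
run(["hdfs", "dfs", "-getfacl", "/data/sales"])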


Tuesday, March 13, 2018

Spooling Files to HBase using Flume


Scenario:

One of my team members wants to upload the contents of files placed in a specific directory (a spooling directory) to HBase for some analysis. For this purpose we will use Flume's spooldir source, which allows users and applications to drop files into the spooling directory; each line is processed as one event and put into HBase. It is assumed that the Hadoop cluster and HBase are running; our environment is HDP 2.6.
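
The Flume agent itself is configured in a properties file (spooldir source, HBase sink) in the post. As a hedged verification sketch, the Python code below drops a file into the spooling directory and then scans the target HBase table through the Thrift gateway with happybase; the paths, table name, and host are assumptions.

import shutil
import happybase

# 1. Place a file where the Flume spooldir source is watching.
shutil.copy("/tmp/events.csv", "/var/flume/spool/events.csv")

# 2. After Flume has processed it, check the rows that landed in HBase.
conn = happybase.Connection("hbase-master")   # assumes the HBase Thrift server is running
table = conn.table("flume_events")
for row_key, data in table.scan(limit=5):
    print(row_key, data)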

Tuesday, February 06, 2018

Integrating Hadoop Cluster with Microsoft Azure Blob Storage

Introduction

Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. All access to Azure Storage is done through a storage account. 
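
A hedged sketch of what the integration looks like from the cluster side: once the storage account key is configured in core-site.xml, blobs can be addressed through the wasb:// scheme like any other Hadoop filesystem. The container and account names below are placeholders.

import subprocess

wasb_path = "wasb://mycontainer@mystorageaccount.blob.core.windows.net/landing/"

# Create a directory in the container, copy a local file up, and list it.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", wasb_path], check=True)
subprocess.run(["hdfs", "dfs", "-put", "/tmp/sample.csv", wasb_path], check=True)
subprocess.run(["hdfs", "dfs", "-ls", wasb_path], check=True)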

Monday, January 29, 2018

Install SQL Server 2017 on Linux [RHEL]


SQL Server 2017 now runs on Linux. It’s the same SQL Server database engine, with many similar features and services regardless of your operating system.

I'm providing the straightforward installation steps for it below.
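
The installation steps themselves are covered in the post; as a quick, hedged post-install check, the sketch below connects with pymssql and prints the engine version. The host and sa password are placeholders.

import pymssql

conn = pymssql.connect(server="localhost", user="sa",
                       password="YourStrong!Passw0rd", database="master")
cur = conn.cursor()
cur.execute("SELECT @@VERSION")   # confirms the Linux-hosted engine answers T-SQL
print(cur.fetchone()[0])
conn.close()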

Thursday, January 25, 2018

Using Avro with Hive and Presto



Prerequisites


The AvroSerde allows users to read and write Avro data as Hive tables.
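
A hedged PyHive sketch of the idea: create a Hive table backed by the AvroSerde (Hive 0.14+ accepts STORED AS AVRO) and query it. The HiveServer2 host, table name, and columns are placeholders; Presto can then read the same Avro-backed table.

from pyhive import hive

conn = hive.connect(host="hiveserver2-host", port=10000, username="hive")
cur = conn.cursor()

# STORED AS AVRO lets Hive derive the Avro schema from the column definitions.
cur.execute("""
    CREATE TABLE IF NOT EXISTS tweets_avro (
        id BIGINT,
        user_name STRING,
        text STRING
    )
    STORED AS AVRO
""")

cur.execute("SELECT COUNT(*) FROM tweets_avro")
print(cur.fetchone())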

Working with Apache Avro to manage Big Data Files


What is Avro?


Apache Avro is a language-neutral data serialization system and a preferred tool for serializing data in Hadoop. Serialization is the process of translating data structures or object state into a binary or textual form so the data can be transported over a network or stored on persistent storage. Once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again.
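
A minimal serialization round trip with the fastavro library, illustrating the schema-driven binary encoding described above; the schema, records, and file name are made up for the example.

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age",  "type": "int"},
    ],
})

records = [{"name": "alice", "age": 40}, {"name": "bob", "age": 30}]

# Serialize the records into an Avro container file.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Deserialize them back.
with open("users.avro", "rb") as inp:
    for rec in reader(inp):
        print(rec)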

Custom Merge Utility for Flume Generated Files


Problem:

My client is streaming tweets to an HDFS location, and thousands of Flume files are being created there. A Hive external table has been created over this location, and its performance degrades when there are too many small files.
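
One hedged way to compact the small files (not necessarily the utility from the post): read everything under the landing directory with Spark, coalesce to a handful of partitions, and write the result to a new location the external table can point at. The paths and partition count are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flume-file-merge").getOrCreate()

# Thousands of small Flume files under the landing directory.
raw = spark.read.text("hdfs:///flume/twitter/")

# Rewrite them as a few larger files for the external table to read.
raw.coalesce(8).write.mode("overwrite").text("hdfs:///flume/twitter_merged/")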

Tuesday, January 23, 2018

How to merge multiple part files (ORC) in Hadoop created by PolyBase?


Problem:

One of my clients is using PolyBase to query and offload SQL Server data to Hadoop. While offloading data in Hive ORC format, PolyBase creates multiple part files in HDFS. For better query performance, all these part files need to be merged.
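
A hedged sketch of one way to do the merge: read the ORC directory with Spark, coalesce, and write a compacted copy. The paths and target file count are assumptions, not the exact approach from the post.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-part-merge").getOrCreate()

# Directory of small ORC part files written by PolyBase.
parts = spark.read.orc("hdfs:///polybase/sales_export/")

# Rewrite as a small number of larger ORC files.
parts.coalesce(4).write.mode("overwrite").orc("hdfs:///polybase/sales_export_merged/")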

Sunday, January 21, 2018

Using Microsoft PolyBase to query Big Data

Introduction

PolyBase is a technology that accesses data outside of the database via the T-SQL language. In SQL Server 2016, it allows you to run queries on external data in Hadoop or to import/export data from Azure Blob Storage. Queries are optimized to push computation to Hadoop. PolyBase does not require you to install additional software in your Hadoop environment. Querying external data uses the same syntax as querying a database table. This all happens transparently: PolyBase handles the details behind the scenes, and no knowledge of Hadoop is required by the end user to query external tables.
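
A hedged illustration of that last point: from the client side, a PolyBase external table is queried like any other table. The connection string and table name below are placeholders; dbo.HdfsSales is assumed to be an external table already created over files in Hadoop.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlserver-host;DATABASE=BigDataDemo;UID=sa;PWD=YourStrong!Passw0rd"
)
cur = conn.cursor()

# Same T-SQL syntax as for a regular table; PolyBase pushes work to Hadoop.
cur.execute("SELECT TOP 10 product, SUM(amount) FROM dbo.HdfsSales GROUP BY product")
for row in cur.fetchall():
    print(row)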


Monday, January 08, 2018

Processing Twitter (JSON) Data in Oracle (12c External Table)


Problem:


  • We have live Twitter stream data ingested by Flume into our Hadoop cluster.
  • Flume is generating too many files in HDFS: about 2 files per second, roughly 172k files a day.
  • We have to process the Flume-generated Twitter JSON files.
  • We created an Oracle external table over the Twitter JSON files, but performance is very poor because of the large number of files.
  • We need a remedy for the above issues.

Thursday, January 04, 2018

Using Preprocessor with External Table [Over HDFS]

Oracle 11g Release 2 introduced the PREPROCESSOR clause to identify a directory object and script used to process files before they are read by the external table. This feature was backported to 11gR1 (11.1.0.7). The PREPROCESSOR clause is especially useful for reading compressed files, since they can be decompressed and piped straight into the external table process without ever being unzipped on the file system.
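
A hedged sketch of what such a table can look like, submitted here through cx_Oracle: the DDL pipes gzipped files through a wrapper script named in the PREPROCESSOR clause. The directory objects, script name, columns, and credentials are placeholders.

import cx_Oracle

ddl = """
CREATE TABLE sales_ext (
    sale_id NUMBER,
    amount  NUMBER
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY data_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        PREPROCESSOR exec_dir:'zcat_wrapper.sh'
        FIELDS TERMINATED BY ','
    )
    LOCATION ('sales.csv.gz')
)
REJECT LIMIT UNLIMITED
"""

conn = cx_Oracle.connect("scott", "tiger", "dbhost/orclpdb")
conn.cursor().execute(ddl)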

Partitioning Oracle (12c) External Table [Over HDFS]


Partitioned external tables were introduced in Oracle Database 12c Release 2 (12.2), allowing external tables to benefit from partition pruning and partition-wise joins. With the exception of hash partitioning, most partitioning and subpartitioning strategies are supported, with some restrictions. In this post I've created a test to get better performance from an external table over HDFS.

Optimizing NFS Performance [HDP NFS]


Introduction

You may experience poor performance when using NFS. Careful analysis of your environment, from both the client and the server point of view, is the first step toward optimal NFS performance. Aside from the general network configuration - appropriate network capacity, faster NICs, full duplex settings to reduce collisions, agreement in network speed among the switches and hubs, etc. - some of the most important client optimization settings are the NFS data transfer buffer sizes, specified by the mount command options rsize and wsize.