Please see my other blog for Oracle EBusiness Suite Posts - EBMentors


Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Sunday, February 12, 2017

Big Data - The Bigger Picture


I’ve put "The Bigger Picture" in the title instead of "The Big Picture" because even the big picture comes with more detail than you might expect. The aim of this post is to provide a broad understanding of the topic without getting into deeper details.

Today the term big data draws a lot of attention, but behind the hype there's a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, there is a potential treasure trove of non-traditional, less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information. Decreases in the cost of both storage and compute power have made it feasible to collect this data, which would have been thrown away only a few years ago. As a result, more and more companies are looking to include non-traditional yet potentially very valuable data alongside their traditional enterprise data in their business intelligence analysis.

Scenarios for Big Data

Big data is, admittedly, an over-hyped buzzword used by software and hardware companies to boost their sales. Behind the hype, however, there is a real and extremely important technology trend with impressive business potential. Before we move on to definitions of big data, let's get an understanding of what makes it such an industry phenomenon and what has given rise to its prominence.
It makes sense to first look at some of the scenarios where big data is produced: where is it coming from?




Certainly, web and internet scenarios, including the analysis of web logs, are the canonical examples that most people identify pretty quickly. Likewise, social media data, things like Foursquare check-ins, Facebook status messages, and Twitter messages, is another example. Even cell towers produce all kinds of data, both about the calls they connect and complete and about the devices that pass near them: how long they stay, what the signal strength is, and what platforms and brand names those devices are associated with. This can help a company offer better services in some areas. Please see the image below to get a clearer picture of the data being generated today.


Here you see that a lot of data processing is required, and to process this huge volume of data you need a big hardware infrastructure (clustered servers), possibly hundreds or thousands of nodes, where node failure is expected rather than exceptional. The number of nodes in a cluster does not remain constant either. Think about how you would process this ever-increasing flood of data.

When data is enormous and ever increasing, you need to think about new ways to manage the processing of your data rather than relying on traditional ones.


What Enables Big Data?


So far we have seen the scenarios that force us to think about something new, "Big Data"; now let's observe some enablers that have come together recently to explain why big data has become so big.

Big data technology, for the most part, runs on commodity hardware: cheap servers and cheap disks. This is a big difference from data analysis technologies and marketplaces that, until recently, focused much more on expensive appliances, network storage, and other technologies that smaller organizations really didn't have access to. Closely related is the vast reduction in the cost of storage; disk drives have become much denser and much cheaper, which means it's not just cheaper to analyze the data, it's also cheaper to keep it around.
A third thing to consider is that much of the big data technology is based on open source software, which drops yet another barrier to entry for organizations of all sizes.

Big Data Defined


It's important to realize that there's no hard and fast definition of how much data you need for it to be considered big data. Whatever data was big yesterday is not big today, and whatever is big today may not be big tomorrow, so it is a relative term. But in a nutshell we can define it as below.

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. 



For some, data size matters; for others it's not about the data size but how you use it. Along those lines, some people believe that the real hallmark of big data isn't necessarily how much data you have: it can also involve how quickly that data is arriving, being sampled, and recorded, and it can refer to the variability of that data, how it's structured and how consistently or inconsistently it is structured. The three Vs seek to describe all of this.

3 Vs

Volume is the first thought that comes with big data: the big part. Some experts consider petabytes the starting point of big data. As we generate more and more data, this starting point will surely keep growing. However, volume in itself is not a perfect criterion of big data, as the other two Vs have a more direct impact.

Velocity refers to the speed at which the data is being generated or the frequency with which it is delivered. Think of the stream of data coming from the sensors in the highways, or the video cameras at airports that scan and process faces in a crowd. There is also the click stream data of popular e-commerce web sites.

Variety is about all the different data and file types that are available. Just think about the music files in the iTunes store (about 28 million songs and over 30 billion downloads), or the movies in Netflix (over 75,000), the articles in the New York Times web site (more than 13 million starting in 1851), tweets (over 500 million every day), foursquare check-ins with geolocation data (over five million every day), and then you have all the different log files produced by any system that has a computer embedded. When you combine these three Vs, you will start to get a more complete picture of what big data is all about.


Alternate Data Processing Techniques

Big data is not only about the data, it is also about alternative data processing techniques that can better handle the three Vs as they increase their values. The traditional relational database is well known for the following characteristics:
• Transactional support for the ACID properties:
• Atomicity: all changes are done as if they were a single operation.
• Consistency: at the end of any transaction, the system is in a valid state.
• Isolation: the actions that produce the results appear to have been done sequentially, one at a time.
• Durability: all the changes made to the system are permanent.
• Response times usually in the subsecond range, while handling thousands of interactive users.
• Data sizes on the order of terabytes.
• Typically uses the SQL-92 standard as the main programming language.
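The ACID properties above, particularly atomicity, can be seen in a few lines of Python. This is a minimal sketch using the standard library's sqlite3 module; the table and account values are hypothetical, chosen only to illustrate a failed transfer being rolled back.

```python
import sqlite3

# In-memory database for illustration; table and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # the with-block is one transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        # Simulate a failure mid-transfer; the debit above must not survive alone.
        raise RuntimeError("simulated crash before crediting bob")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both rows unchanged: the partial update was rolled back
```

Because the simulated crash happens inside the transaction, neither account changes: all-or-nothing is exactly what atomicity promises, and it is this kind of guarantee that big data systems often relax in exchange for scale.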

In general, relational databases cannot handle the three Vs well. So what helps you here?

MapReduce: Simplified Data Processing


MapReduce is an algorithmic approach to dealing with data, specifically big data. It was developed and perfected inside Google, which then published a paper to share the approach with the rest of the industry. Hadoop is, in effect, the open source implementation of Google's MapReduce.

In this approach, a large amount of data is divided into several smaller batches, and each batch is processed in parallel. It has two phases: Map and Reduce.

The map step concerns itself with splitting the data up and doing some pre-processing on each of the chunks. The reduce step then takes the output of the map step and aggregates the data. Specifically, the map step is responsible for outputting data in a key and value format. The reduce step expects the data in that format, sorted by key so that all data items related to a given key are contiguous, and it produces output with only one piece of data for each key. In that sense it is aggregating: collapsing all the rows of data for a given key into a single row.
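The map, sort-by-key, and reduce steps described above can be sketched in plain Python using the classic word-count example. This is a toy single-process illustration, not Hadoop's actual API; the function names and input chunks are invented for the sketch.

```python
from itertools import groupby
from operator import itemgetter

# Toy word count; names and input are illustrative, not Hadoop's API.
def map_step(chunk):
    # Emit (key, value) pairs: one (word, 1) per word in the chunk.
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_step(key, values):
    # Aggregate all values for one key into a single output row.
    return (key, sum(values))

chunks = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each chunk is processed independently (in Hadoop, in parallel).
mapped = [pair for chunk in chunks for pair in map_step(chunk)]

# Shuffle/sort phase: sort by key so all values for a key are contiguous.
mapped.sort(key=itemgetter(0))

# Reduce phase: one output row per key.
counts = dict(
    reduce_step(key, [v for _, v in group])
    for key, group in groupby(mapped, key=itemgetter(0))
)
print(counts["the"])  # 3
```

In a real cluster the map calls run on many nodes at once and the framework performs the shuffle and sort across the network, but the contract is the same: mappers emit key-value pairs, and each reducer sees all values for its key together.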


Again, Hadoop is the technology that most prominently implements MapReduce, but there are other technologies, including certain NoSQL databases, that employ the concept in their own data processing and programming models.

Big Data Technology - Hadoop


Hadoop is an Apache project that combines a MapReduce (defined above) engine with a distributed file system called HDFS, the Hadoop Distributed File System.
Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing. It is an enabler of certain types of NoSQL distributed databases (such as HBase), which can allow data to be spread across thousands of servers with little reduction in performance.



HDFS allows the local disks on each of the nodes in the Hadoop cluster to be pooled together as a single pool of storage, and the files that exist on a given node are typically replicated on other nodes; by default there are two additional copies of each file. If a particular node fails, we won't lose data, because that data is also stored on at least two other nodes by default. Files can't be updated in HDFS. In certain cases they can be appended to, but they can't be rewritten.
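The replication idea above can be sketched as a toy model. This is a simplified simulation of HDFS-style placement with a replication factor of 3 (one copy plus two additional); the node names and round-robin placement policy are invented for illustration and are not how HDFS actually chooses replica locations.

```python
# Toy model of HDFS-style replication: each block is stored on 3 distinct
# nodes (HDFS's default replication factor). Node names and the round-robin
# placement policy are illustrative only, not HDFS internals.
REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4", "node5"]

def place_block(block_id):
    # Pick 3 distinct nodes for each block, rotating through the cluster.
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(REPLICATION)]

placement = {b: place_block(b) for b in range(10)}

# Simulate a node failure: every block should still have surviving copies.
failed = "node2"
survivors = {b: [n for n in locs if n != failed] for b, locs in placement.items()}
print(min(len(locs) for locs in survivors.values()))  # 2: no block lost all copies
```

Losing any single node leaves at least two copies of every block, which is why node failure in a large cluster can be treated as routine rather than as a disaster.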


Opting Big Data - Strategy

Big data is not for everyone; it is for those who believe that better decisions are made using data relevant to the issue, not merely human judgment based on individual experience. Thus the use of big data requires some strategic analysis. You should assess your systems and define the capabilities needed to make your big data initiative a success.

You will have to overcome obstacles, which can be technical: insufficient big data tools, or limited data-driven decision making at senior levels.


You have to devise a strategy for how you will move forward on the big data road. To be successful you will need data scientists on your team: professionals who are adept with the analytical and visualization tools required to process and recognize patterns in data, and who are equally comfortable with business concepts and operations.

Think about pitfalls. Probably the biggest is that big data technology sits squarely in the hype cycle right now. That doesn't make it bad technology, but it does mean there are lots of potential distractions out there. Make sure you cut through those distractions so that you can use sound logic for your road map going forward.

Consider whether your data volumes truly are big, and consider that even if the nature of the data in your business really qualifies as big data, you may not be collecting enough of it to reach the big data benchmark. Some types of businesses are very transactional, some don't necessarily benefit from analysis of social media sentiment, and some are well-established with a stable clientele and little competitive worry.