I’ve titled this post "The Bigger Picture" rather than "The Big Picture" because even the big picture comes with many more details. The aim of this post is to provide a broad understanding of the topic without delving into those deeper details.
Today
the term big data draws a lot of attention, but behind the hype there's a
simple story. For decades, companies have been making business decisions based
on transnational data stored in relational databases. Beyond that critical
data, however, there is a potential treasure trove of non-traditional, less
structured data: weblogs, social media, email, sensors, and photographs that
can be mined for useful information. Decreases in the cost of both storage and
compute power have made it feasible to collect this data - which would have
been thrown away only a few years ago. As a result, more and more companies are
looking to include non-traditional yet potentially very valuable data with
their traditional enterprise data in their business intelligence analysis.
Scenarios for Big Data
Big data is, admittedly, an over-hyped
buzzword used by software and hardware companies to boost their sales. Behind
the hype, however, there is a real and extremely important technology trend
with impressive business potential. Before we move on to definitions of Big Data, though, let's get an understanding of what makes it such an industry phenomenon: what has given rise to the prominence of Big Data?
It makes sense, first, to understand some of the scenarios where Big Data is produced. Where is it coming from?
Certainly, web and internet scenarios, including the analysis of web logs, are the canonical example, and the one most people identify pretty quickly. Likewise, social media data, such as Foursquare check-ins, Facebook status messages, and Twitter messages, is another example. Even cell towers produce all kinds of data, both about the calls they connect and complete and about the devices that pass near the towers: how long they stay in range, what the signal strength is, and what platforms and brand names those devices are associated with. This can help a company offer better services in some areas. See the image below for a clearer picture of the data being generated today.
As you can see, a lot of data processing is required, and to process data at this scale you need a large hardware infrastructure: clustered servers, possibly numbering in the hundreds or thousands. At that scale, node failure is expected rather than exceptional, and the number of nodes in a cluster does not remain constant either. How would you process this ever-increasing volume of data? When data is enormous and ever growing, you need to think about new ways to manage its processing rather than relying on the traditional ones.
What Enables Big Data?
So far we have seen the scenarios that force us to think about something new. Now let's look at some enablers that have come together recently and explain why Big Data has become so big.
Big Data technology, for the most part, runs on commodity hardware: cheap servers, cheap disks. This is a big difference from the data analysis technologies and products of even the recent past, which focused much more on expensive appliances, network storage, and other technologies that smaller organizations simply didn't have access to. Closely correlated with that is the vast reduction in the cost of storage: disk drives have become much denser and much cheaper recently, which means it's not just cheaper to analyze the data, it's also cheaper to keep it around.
A third thing to consider is that much of the Big Data technology is based on open source software, which has lowered yet another barrier to entry for organizations of all sizes.
Big Data Defined
It's important to realize that there's no hard and fast definition of how much data you need for it to be considered big data. Whatever data was big yesterday is not big today, and whatever is big today may not be big tomorrow; "big" is a relative term. In a nutshell, though, we can define it as follows:
Big data is
the term for a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional
data processing applications.
For some, data size matters; for others, it's not about the size but how you use it. Along those lines, some people believe that the real hallmark of big data isn't necessarily how much data you have. It can also involve how quickly that data is arriving, being sampled, and recorded, and it can refer to the variability of that data: how it's structured, and how consistently or inconsistently it is structured. The three Vs seek to describe all of this.
3 Vs
Volume is the first thought that comes
with big data: the big
part.
Some experts consider petabytes the starting point of big data. As we generate more and more data, this starting point will keep growing. However, volume in itself is not a perfect criterion of big data, as the other two Vs have a more direct impact.
Velocity refers to the speed at which the
data is being generated or the frequency with which it is delivered. Think of
the stream of data coming from the sensors in the highways, or the video
cameras at airports that scan and process faces
in a crowd. There is also the click stream data of popular e-commerce web
sites.
Variety is about all the different data
and file types that are available. Just think about the music files in the
iTunes store (about 28 million songs and over 30 billion downloads), or the
movies in Netflix (over 75,000), the articles in the New York Times web site
(more than 13 million starting in 1851), tweets (over 500 million every day),
foursquare check-ins with geolocation data (over five million every day), and
then you have all the different log files produced by any system that has a
computer embedded. When you combine these three Vs, you will start to get a
more complete picture of what big data is all about.
Alternate Data Processing Techniques
Big data is not only about the data; it is also about alternative data processing techniques that can better handle the three Vs as their values increase. The traditional relational database is well known for the following characteristics:
• Transactional support for the ACID properties:
• Atomicity: Where all changes are done
as if they are a single operation.
• Consistency: At the end of any
transaction, the system is in a valid state.
• Isolation: The actions to create the
results appear to have been done sequentially, one at a time.
• Durability: All the changes made to the
system are permanent.
• The response times are usually in the subsecond
range, while handling thousands of interactive users.
• The data size is in the order of
Terabytes.
• Typically uses the SQL-92 standard as
the main programming language.
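The ACID guarantees listed above can be seen in miniature with SQLite, which ships with Python. The sketch below (account names and amounts are invented for illustration) demonstrates atomicity: when a transfer fails partway through, the whole transaction rolls back and no partial change survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(amount):
    # Atomicity: both UPDATEs commit together, or neither does.
    with conn:  # opens a transaction; rolls back automatically on exception
        conn.execute("UPDATE accounts SET balance = balance - ? "
                     "WHERE name = 'alice'", (amount,))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute("UPDATE accounts SET balance = balance + ? "
                     "WHERE name = 'bob'", (amount,))

try:
    transfer(200)  # fails: alice only has 100
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} - the failed transfer left no trace
```

The first UPDATE did run, but because the transaction raised before committing, the database returned to its prior valid state, which is exactly the atomicity and consistency behavior described above.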
In general, relational databases cannot handle the three Vs well. So what can help here?
MapReduce: Simplified Data Processing
MapReduce is an algorithmic approach to dealing with data, and with big data in particular. It was developed and perfected inside Google, which then published a paper sharing the approach with the rest of the industry. Hadoop is, in effect, the open source implementation of Google's MapReduce.
In this approach, a large amount of data is divided into several smaller batches, and each of those batches is processed in parallel. It has two phases: Map and Reduce.
The map step concerns itself with splitting the data up and doing some pre-processing on each of the chunks. The reduce step will then take the output of
the map step and aggregate the data. Specifically, the map step will be
responsible for outputting data in a key and value format. The reduce step will
then expect the data in that format, it will expect that the data is sorted by
the key so that all data items related to a given key are contiguous and it
will produce output where there is only one piece of data for each key. And in
that sense, it's aggregating, it's aggregating all the rows of data for a given
key into a single row.
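The map, sort-by-key, and reduce flow described above can be sketched in plain Python, without Hadoop. This is the classic word-count illustration; the input documents here are invented for the example.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data is everywhere"]

# Map: emit (key, value) pairs - here, (word, 1) for every word seen.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: group all pairs for the same key together, as the
# framework does before handing data to the reducers.
mapped.sort(key=itemgetter(0))

# Reduce: aggregate each key's values into a single output row.
counts = {key: sum(v for _, v in pairs)
          for key, pairs in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

In a real Hadoop job the mapped pairs would be partitioned across many machines and each reducer would see only its own sorted slice of the keys, but the contract is the same: mappers emit key/value pairs, and each reducer receives all values for a key contiguously.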
Again, Hadoop is the technology that most prominently implements MapReduce, but other technologies, including certain NoSQL databases, employ the concept in their own data processing and programming models.
Big Data Technology - Hadoop
Hadoop is an Apache project that combines a MapReduce engine (described above) with a distributed file system called HDFS, the Hadoop Distributed File System.
Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing. It is an enabler of certain types of NoSQL distributed databases (such as HBase), which can allow data to be spread across thousands of servers with little reduction in performance.
HDFS allows the local disks on each of the nodes in the Hadoop cluster to be pooled together as a single pool of storage, and the files stored on a given node are typically replicated on other nodes; by default there are two additional copies of each file. If a particular node fails, we won't lose data, because that data is also stored on at least two other nodes. Files can't be updated in HDFS. In certain cases they can be appended to, but they can't be rewritten.
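Why three copies on distinct nodes make single-node failure harmless can be shown with a toy simulation. The sketch below is a deliberate simplification (random placement, not HDFS's actual rack-aware placement policy), with invented node and block names:

```python
import random

NODES = [f"node{i}" for i in range(10)]
REPLICATION = 3  # HDFS default: the original copy plus two replicas

# Place each block on REPLICATION distinct nodes.
random.seed(42)
placement = {f"block{b}": random.sample(NODES, REPLICATION)
             for b in range(20)}

def survives(failed_node):
    # Data survives a failure if every block keeps at least one live copy.
    return all(any(n != failed_node for n in copies)
               for copies in placement.values())

# With 3 copies on distinct nodes, no single failure can lose a block.
assert all(survives(n) for n in NODES)
print("all blocks survive any single node failure")
```

Because `random.sample` never places two copies of a block on the same node, losing any one node still leaves at least two copies of every block, which is exactly the fault-tolerance property described above.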
Adopting Big Data - Strategy
Big Data is not for everyone. It is for those who believe that better decisions are made using data relevant to the issue, not merely human judgment based on individual experience. The use of big data therefore requires some strategic analysis. You should assess your systems and define the capabilities needed to make your Big Data initiative a success.
You will have to overcome obstacles, which can be technical (insufficient big data tools) or organizational (limited data-driven decision making at senior levels). You have to devise a strategy for how you will move down the Big Data road. To be successful you will need data scientists on your team: professionals who are adept with the analytical and visualization tools required to process and recognize patterns in data, and who are equally comfortable with business concepts and operations.
Think about pitfalls, too. Probably the biggest one is that Big Data technology sits so squarely in the hype cycle right now. That doesn't make it bad technology, but it does mean there are lots of potential distractions that make it harder to reach sound decisions. Make sure you cut through those distractions so that you can use sound logic for your road map going forward.
Consider whether your data volumes truly are big, and consider that even if the nature of the data in your business really qualifies as Big Data, you may not be collecting enough of it to reach the Big Data benchmark. Some types of businesses are very transactional, some don't necessarily benefit from analysis of social media sentiment, and some are well established with a stable clientele and little competitive worry.