Sunday, January 13, 2019

Why Big Data and What is Big Data?

Before we look at what big data is, let’s start with why big data came into the picture.

Big data gets generated in multi-petabyte quantities every day. Data changes fast and comes in different formats, e.g. audio, video, pictures, text, structured, unstructured, etc., which are difficult to manage and process using an RDBMS or other traditional technologies. In the early 2000s, tech companies like Google and Yahoo found that existing technologies could not cope with such varied data at such huge volume, so they started looking for alternative solutions, and that's how big data is here today. You will find more about the history of big data at the end of this post.


Now, what is big data?

Is big data a tool? A language? A solution? Or something else?

Well, it’s a platform that comprises many tools, and fortunately most of them are open source. However, since there are many tools available in the market to solve big data challenges, the next confusion arises: which tool to use when? I will write about this in my next post.

Let’s focus on the concept of big data. People think big data is always about huge data, but that’s not the case. To be a candidate for a big data solution, a problem should meet at least one of the three Vs:
 1) Volume
 2) Velocity
 3) Variety

Fig 1: The three elements of a big data challenge

High volume: Social media platforms like Facebook have billions of users, a huge amount of content is created on YouTube every hour, an organization like NASA was generating 1.73 gigabytes of data every few seconds as of the end of 2014, and Maersk vessels send huge volumes of data over the network every minute.

High velocity: The speed of the data matters. You may need to capture real-time data from IoT devices, and your mobile devices produce tons of data every day. Some businesses can’t wait long, so you may have to capture data in near real time and process it immediately. Some industries, such as retail, require real-time data.

High variety: Different types of data are mixed on the same platform, e.g. Wikipedia, Twitter and Facebook hold a mix of text, audio, video, images, etc. Regular businesses also receive data in different formats, which needs to be transformed into useful output.
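
To make the "at least one of the three Vs" rule concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Workload` fields, the function names and especially the threshold values are hypothetical and not an industry standard, so adjust them to your own situation.

```python
from dataclasses import dataclass

# Hypothetical cut-offs chosen only for illustration; they are not official limits.
VOLUME_TB_PER_DAY = 1.0
VELOCITY_EVENTS_PER_SEC = 10_000
VARIETY_FORMATS = 3


@dataclass
class Workload:
    tb_per_day: float    # how much data arrives each day
    events_per_sec: int  # how fast records arrive
    formats: set         # kinds of data, e.g. {"text", "video"}


def matched_vs(w: Workload) -> list:
    """Return the names of the Vs that this workload meets."""
    checks = {
        "volume": w.tb_per_day >= VOLUME_TB_PER_DAY,
        "velocity": w.events_per_sec >= VELOCITY_EVENTS_PER_SEC,
        "variety": len(w.formats) >= VARIETY_FORMATS,
    }
    return [name for name, met in checks.items() if met]


def is_big_data_candidate(w: Workload) -> bool:
    """A workload qualifies when it meets at least one of the 3 Vs."""
    return len(matched_vs(w)) >= 1


# A small but very fast sensor feed: low volume, one format, high velocity.
sensor_feed = Workload(tb_per_day=0.2, events_per_sec=50_000, formats={"json"})
print(matched_vs(sensor_feed), is_big_data_candidate(sensor_feed))
# prints: ['velocity'] True
```

Note that the sensor feed is nowhere near "huge", yet it still qualifies because of its velocity alone, which is the point made earlier: big data is not only about size.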

So when your organization deals with one or more of the above 3 Vs, it's time to consider moving to a big data platform. According to research reported by Forbes [1], 11% of the companies that said in 2015 they had no plan to use big data had started using it by 2017. In 2015, 17% said they were using big data, and that number increased to 53% in 2017. The research also found that, among all industries, finance and telecom are ahead in adopting big data.

History of big data (literally, how Hadoop was invented):

Data started growing exponentially, arriving in various types and at great velocity, and existing transactional databases could not handle it. Many say Google was the first to face this challenge: to gain an advantage in search, Google wanted to create a catalog of the entire Internet. To be successful, they had to meet the challenges presented by the 3 Vs (as mentioned above) in an innovative way.

Google tackled the big data problem by making a group of interconnected, inexpensive computers work together. This was revolutionary. Over a span of a couple of years, Google released papers describing the parts of their big data solution. Building on these, Doug Cutting and Mike Cafarella began developing an open source implementation as part of their Nutch web search project. The work continued at Yahoo! after Cutting joined the company in 2006 and became the Apache Software Foundation project called Hadoop, named after the toy elephant of Mr. Cutting’s son.

When people talk about big data, the first name that comes up is ‘Hadoop’. You may see it expanded as "High-availability distributed object-oriented platform", but Hadoop is not an acronym. It is a distributed platform used for maintaining, scaling, error handling, self-healing and securing large-scale data. This data can be structured or unstructured. As mentioned earlier, if data is huge and varied and needs to be processed instantly, traditional systems are unable to handle it. That is where Hadoop comes into the picture.

But please remember, big data is not only Hadoop. Many other tools work within the Hadoop ecosystem, and you will need them to solve big data challenges. I am going to write about them in my next post.


[1] https://www.forbes.com/sites/louiscolumbus/2017/12/24/53-of-companies-are-adopting-big-data-analytics/#19852cfd39a1 (accessed on 11th Jan 2019)