BIG DATA in small words
I have to admit that, to me, the name Big Data sounds somewhat childish. It is as if you, a very intelligent and highly educated IT consultant, were asking your six-year-old son: ‘Hey Bill, daddy is working with lots of unstructured, huge data sets and we need a name for it.’ Bill: ‘Aaaaaa…… Big Data?’ :)
A simple definition of Big Data is: very large sets of unstructured data, with sizes beyond the ability of commonly used software tools to capture, manage and process the data in a tolerable time frame in order to enable enhanced decision making, discovery and process optimization. The size of these data sets is constantly increasing, from a few terabytes at the beginning of this millennium, to many petabytes today and many exabytes tomorrow. [petabyte (PB) = 10^15 bytes, exabyte (EB) = 10^18 bytes].
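To put these units in perspective, here is a minimal Python sketch of the arithmetic. It assumes the decimal (SI) definitions given above; the constant names and the 5 PB figure are only illustrative.

```python
# Decimal (SI) byte units, matching the definitions in the text above.
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18

# How many of the smaller unit fit into the next larger one.
terabytes_per_petabyte = PETABYTE // TERABYTE
petabytes_per_exabyte = EXABYTE // PETABYTE

# Illustrative figure only: a hypothetical 5 PB data set expressed in terabytes.
dataset_bytes = 5 * PETABYTE
dataset_terabytes = dataset_bytes // TERABYTE

print(terabytes_per_petabyte, petabytes_per_exabyte, dataset_terabytes)
```

Each step up the scale is a factor of one thousand, which is why a data set that grows from terabytes to exabytes is a million times larger.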
In 2001, META Group (now part of Gartner Inc.) defined the 3Vs of Big Data (volume, velocity and variety); a fourth V (veracity) was added later:
- Volume – the amount of data
- Velocity – the speed at which data flows in and out
- Variety – the range of data types and sources
- Veracity – the quality of the data
The next step, after acknowledging the inability of conventional software to process Big Data, was to develop tools able to solve this problem. Seisint Inc. developed a C++ based distributed file-sharing framework for data storage and query, followed in later years by MapReduce and Hadoop, which brought more advanced approaches to Big Data processing.
In order to set up a Big Data processing environment one will need:
- A large number of host machines (nodes) organized in a cluster. The nodes can be partitioned into racks.
- A highly performant storage array of reasonable size
- A software framework with three main components:
  - The framework providing the computational resources (CPU, memory, etc.) needed for application execution. Hadoop uses YARN (Yet Another Resource Negotiator) for this task.
  - The framework providing permanent, reliable and distributed storage. Hadoop uses HDFS (Hadoop Distributed File System); Amazon uses S3 (Simple Storage Service).
  - The MapReduce framework, which is the software layer implementing the MapReduce paradigm. In layman’s terms, MapReduce was designed to take big data and use parallel distributed computing to turn it into regular-sized data, by mapping the data and then reducing it.
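The map and reduce steps described above can be illustrated with the classic word-count example. This is only a single-process Python sketch of the idea, not Hadoop's API: the function names are made up for illustration, and in a real cluster each input chunk would be mapped on a different node and the framework would handle the grouping step.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    mapped = []
    for chunk in chunks:  # sequential here; parallel across nodes in a real cluster
        mapped.extend(map_phase(chunk))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Two tiny "chunks" standing in for blocks of a very large file.
chunks = ["big data is big", "data about data"]
counts = word_count(chunks)
print(counts)
```

The input could be arbitrarily large, but the output is just one small number per distinct word, which is the sense in which MapReduce turns big data into regular-sized data.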
For more details about the MapReduce paradigm read the article:
The evolution of Big Data processing ecosystems has triggered the emergence of other non-conventional techniques, like the NoSQL technologies. A NoSQL (“non SQL” or “non-relational”) database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases (Wikipedia).
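To make “modeled in means other than tabular relations” a bit more concrete, here is a toy Python sketch of a document-style store. It is not the API of any real NoSQL database; `store`, `put` and `find` are hypothetical names, and the sensor records are invented sample data.

```python
import json

# Hypothetical in-memory "document store": each key maps to a JSON-like document.
store = {}

def put(key, document):
    """Save a document under a key; the JSON round trip stores an independent copy."""
    store[key] = json.loads(json.dumps(document))

def find(predicate):
    """Return every stored document for which the predicate is true."""
    return [doc for doc in store.values() if predicate(doc)]

# Documents do not share a fixed schema: each one carries its own fields.
put("sensor:1", {"type": "traffic_camera", "city": "Oslo", "frames": 1200})
put("sensor:2", {"type": "atm", "city": "Oslo", "withdrawals": [50, 200]})
put("sensor:3", {"type": "traffic_camera", "location": {"lat": 48.2, "lon": 16.4}})

cameras = find(lambda doc: doc.get("type") == "traffic_camera")
print(len(cameras))
```

In a relational table every row would need the same columns; here the two camera documents have different fields and one of them contains a nested structure, yet both can be stored and queried side by side.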
Another interesting development was the evolution of Apache Spark from a component of the Hadoop ecosystem into a fast and general engine for large-scale data processing in its own right. Apache Spark can run in standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos.
With the very fast evolution of the Internet of Things (IoT), the variety V of Big Data is amazing. The sources can be any smart device: smart cars, smart cities, satellites, traffic cameras, surveillance cameras, ATMs, etc. The data is collected in every known format, with several new formats appearing every day. That points to the real challenge with Big Data, which is the first V (volume), and its growth rate is incredible!
I am, generally speaking, an optimist, and I believe that the future will bring us fantastic ways of processing the Super-Big Data of tomorrow, which can only be described as “as big as China”! :)
For details about this topic, contact me at: firstname.lastname@example.org