Big data has been one of the hottest buzzwords and trends in technology for quite some time now. However, beyond the media buzz, big data is a fundamental technology for building future businesses and sustainable enterprises. It allows businesses to better understand their clients and make better decisions based on that understanding. It also allows decision makers to predict and understand failures, errors, and threats as well as opportunities, and hence to create more sustainable systems.
It’s true that data has existed for as long as we can remember, but the exponential increase in the volume, variety, and velocity of that data is what earned it the name Big Data.
We generate quintillions of bytes of data every day. In a single minute, more than 500 thousand photos are shared on Snapchat, 4 million videos are viewed by users, and more than 3 million search queries are run on Google. The sheer volume of valuable insight in that enormous amount of data creates the need for Big Data frameworks that manage and analyze the data with the resources at hand.
Hadoop is an Apache open source framework for storing and processing large datasets across clusters of computers. Its modules are designed to be fault-tolerant: based on the assumption that hardware will fail at one point or another, Hadoop replicates the data to other nodes in the cluster, so that if a failure takes place the data can be retrieved from another node, saving the hassle of data loss and inconsistency. That is not the only benefit that Hadoop offers, though. Apache Hadoop is scalable, which is an important requirement in the Big Data world, as users should be able to expand their infrastructure and distribute a large amount of data over multiple servers working in parallel. It is also a flexible framework, and the way Hadoop distributes data across the cluster lets processing run close to where the data is stored, which speeds it up. Hadoop comes with four modules: Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN (Yet Another Resource Negotiator), and Hadoop MapReduce.
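To make the replication idea concrete, here is a minimal sketch in plain Python. It is a toy model, not the HDFS API: the class `ToyCluster`, its methods, the node names, and the placement rule are all hypothetical and chosen only for illustration. It shows a block being written to more than one node and still being readable after one of those nodes fails.

```python
# Toy model of block replication; NOT the real HDFS API.
# ToyCluster, its methods, and the placement rule are illustrative assumptions.
from typing import Dict, List, Set


class ToyCluster:
    """Keeps each block on `replication` distinct nodes."""

    def __init__(self, nodes: List[str], replication: int = 2) -> None:
        if replication > len(nodes):
            raise ValueError("replication factor exceeds number of nodes")
        self.nodes = nodes
        self.replication = replication
        self.storage: Dict[str, Dict[str, bytes]] = {n: {} for n in nodes}
        self.failed: Set[str] = set()

    def write(self, block_id: str, data: bytes) -> List[str]:
        # Pick consecutive nodes, starting at a position derived from the block id.
        start = sum(block_id.encode()) % len(self.nodes)
        targets = [self.nodes[(start + i) % len(self.nodes)]
                   for i in range(self.replication)]
        for node in targets:
            self.storage[node][block_id] = data
        return targets

    def fail(self, node: str) -> None:
        # Mark a node as unavailable to simulate a hardware failure.
        self.failed.add(node)

    def read(self, block_id: str) -> bytes:
        # Serve the block from any live node that holds a copy.
        for node in self.nodes:
            if node not in self.failed and block_id in self.storage[node]:
                return self.storage[node][block_id]
        raise IOError(f"all replicas of {block_id} are unavailable")


cluster = ToyCluster(["node-a", "node-b", "node-c"], replication=2)
holders = cluster.write("block-1", b"customer records")
cluster.fail(holders[0])             # one of the two holders goes down
recovered = cluster.read("block-1")  # served from the surviving replica
```

With a replication factor of 2 the block survives a single node failure. A real cluster would also notice the lost copy and re-replicate it, which this sketch leaves out.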
While Hadoop is a widely known and loved framework, there are a few cons to using it. Security features are disabled by default, so scientists and users should always keep that in mind when working with sensitive data. It is also a poor fit for small data, so it is mostly suited to large companies or entities that generate or hold a large amount of data.
Spark is another Apache framework for batch processing. It is fairly easy to use and enables users to write applications in Java, Scala, Python, and R. Spark itself is written in Scala, so a developer who knows or is willing to learn Scala is in the best position to get the most out of it. Thanks to its DAG scheduler, physical execution engine, and query optimizer, Apache states that it can run some workloads up to a hundred times faster than Hadoop MapReduce. Spark is well suited to machine learning too, but running it on a cluster requires a cluster manager and a distributed storage system.
Spark can be used as a standalone framework or in conjunction with other systems: it integrates with Hadoop, Apache Mesos, Cassandra, HBase, and Kubernetes, which makes it suitable for most types of deployment, from an individual small-scale setup to a large-scale business one.
Spark is based on a data structure known as the RDD, or Resilient Distributed Dataset. An RDD is a read-only collection of items partitioned across the machines of the cluster. RDDs make the system fault-tolerant as well: Spark tracks how each RDD was derived, so lost partitions can be recomputed after a failure instead of losing data. Spark comes with a stack of libraries that extends its functionality even more, including Spark SQL, Spark Streaming, MLlib, and GraphX. Spark code is reusable as well, which makes it even friendlier and easier to use.
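The way a read-only, partitioned dataset can recover from a failure is easier to see in code. The sketch below is plain Python, not the PySpark API: `ToyRDD` and all of its methods are hypothetical names made up for this illustration. Each derived dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt from that record. One simplification to keep in mind: real Spark evaluates transformations lazily, while this toy computes them eagerly.

```python
# Toy illustration of RDD-style recovery; NOT the PySpark API.
# ToyRDD and its methods are illustrative assumptions.
from typing import Callable, List, Optional


class ToyRDD:
    """Read-only, partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions: List[Optional[tuple]],
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable] = None) -> None:
        self._partitions = list(partitions)  # tuples keep the items immutable
        self._parent = parent
        self._fn = fn

    @classmethod
    def from_list(cls, items: list, num_partitions: int) -> "ToyRDD":
        # Deal the items round-robin into the requested number of partitions.
        parts = [tuple(items[i::num_partitions]) for i in range(num_partitions)]
        return cls(parts)

    def map(self, fn: Callable) -> "ToyRDD":
        # A transformation never mutates this dataset; it returns a new one.
        parts = [tuple(fn(x) for x in p) for p in self._partitions]
        return ToyRDD(parts, parent=self, fn=fn)

    def lose_partition(self, index: int) -> None:
        # Simulate the machine holding this partition going down.
        self._partitions[index] = None

    def get_partition(self, index: int) -> tuple:
        if self._partitions[index] is None:
            if self._parent is None:
                raise RuntimeError("source partition lost; nothing to recompute from")
            # Rebuild only the missing partition from the parent dataset.
            source = self._parent.get_partition(index)
            self._partitions[index] = tuple(self._fn(x) for x in source)
        return self._partitions[index]

    def collect(self) -> list:
        return [x for i in range(len(self._partitions))
                for x in self.get_partition(i)]


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=2)
squares = numbers.map(lambda x: x * x)
squares.lose_partition(1)           # a "node" holding partition 1 fails
result = sorted(squares.collect())  # the partition is recomputed on demand
```

Only the missing partition is rebuilt; the surviving one is reused as is. Holding every partition in memory is also where the memory cost of this design comes from.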
However, there are cons to using Spark too: when it is used with Hadoop it is not small-data friendly either, and keeping RDDs in memory carries a high memory cost in return for the performance.
Storm is another framework offered by Apache for data processing, specifically real-time processing. It is simple and can be used with any programming language, so you can use it with your favorite language, and it is said to be fun to use as well. The core idea of Storm is to define small, discrete operations and compose them into a topology, which functions as a pipeline for transforming the data. Storm is used for real-time analytics, continuous computation, online machine learning, and much more. According to Apache, “a benchmark clocked it at over a million tuples processed per second per node”. Storm is also scalable and fault-tolerant.
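The idea of chaining small, discrete operations into a pipeline can be sketched with plain Python generators. This is not the Storm API, and it runs in a single process rather than on a cluster. The function names are hypothetical. In Storm's own vocabulary the source would be a spout and each processing step a bolt.

```python
# Toy pipeline in the spirit of a Storm topology; NOT the Storm API.
# All function names are illustrative assumptions.
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_source(sentences: Iterable[str]) -> Iterator[str]:
    """Source of the stream (the role a spout plays in Storm)."""
    for sentence in sentences:
        yield sentence


def split_words(stream: Iterable[str]) -> Iterator[str]:
    """Small operation 1: split each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def running_count(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Small operation 2: emit an updated count as each word arrives."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Wire the operations together; each item flows through them one at a time.
topology = running_count(split_words(sentence_source([
    "storm processes streams",
    "streams never end",
])))
updates = list(topology)
```

Because each stage handles one item at a time and passes it on, results are available as data arrives rather than after a whole batch finishes, which is the property that matters for real-time processing.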
Apache Samza is a powerful framework for asynchronous, real-time stream processing. It uses Apache Kafka for messaging and Hadoop YARN for fault tolerance, security, and resource management. Samza offers a suite of great features, such as a simple API that is comparable to MapReduce, processor isolation, durability, scalability, and pluggability, which lets you run Samza with other execution environments. The framework is made to handle large amounts of state, i.e. many gigabytes per partition. It snapshots each processor’s state and, upon restart, restores it to a consistent snapshot. Samza is best used for filtering, joins, distributed RPC, and reprocessing.
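Snapshotting and restoring a processor's state can be illustrated with a small, self-contained Python sketch. It is not the Samza API and uses a local JSON file where a real deployment would use durable, distributed storage. The class `CountingProcessor`, the checkpoint file format, and the message list are assumptions made for the example. The key point it shows is that the state and the input position are saved together, so a restarted processor resumes from a consistent point without double counting.

```python
# Toy stateful processor with checkpointing; NOT the Samza API.
# CountingProcessor and the JSON checkpoint format are illustrative assumptions.
import json
import os
import tempfile
from typing import Dict, List


class CountingProcessor:
    def __init__(self, checkpoint_path: str) -> None:
        self.checkpoint_path = checkpoint_path
        self.offset = 0                    # position in the input stream
        self.counts: Dict[str, int] = {}   # the processor's state
        self._restore()

    def _restore(self) -> None:
        # On startup, load the last snapshot if one exists.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                saved = json.load(f)
            self.offset = saved["offset"]
            self.counts = saved["counts"]

    def snapshot(self) -> None:
        # Write to a temp file, then rename, so a snapshot is never half-written.
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"offset": self.offset, "counts": self.counts}, f)
        os.replace(tmp, self.checkpoint_path)

    def process(self, messages: List[str]) -> None:
        # Skip messages that are already reflected in the restored state.
        for key in messages[self.offset:]:
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1


log = ["click", "view", "click", "view", "click"]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

first = CountingProcessor(path)
first.process(log[:3])   # handles the first three messages
first.snapshot()         # state and offset are saved together

second = CountingProcessor(path)  # simulates a restart
second.process(log)               # resumes after the third message
final_counts = second.counts
```

If the offset were saved separately from the counts, a crash between the two writes could leave them out of step, which is why the sketch writes both in one atomic file replacement.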
There is also Apache Flink, which we can’t pass by without mentioning. It is a powerful framework for both real-time and batch processing. It offers many high-level functions while remaining similar to MapReduce, and it performs very well.
The main criterion when comparing these frameworks is the environment and application they will be used for, as the nature of the data and the environment shifts the need from one framework to another. Using the wrong framework for a situation results in wasted resources.
Research by Avendus Capital in 2015 indicated that the big data market in India was hovering around US$1.15 billion. It is estimated that India alone will face a shortage of 250,000 data scientists and engineers by the end of this year, which points to the massive potential waiting in the big data field.