Making Sense Out of the Big Data Tangle
Our Puzzled Customers
In this era of Big Data, many of the IT people I talk with have a number of questions about the technology and trends associated with this new paradigm.
For example, many of them are feeling somewhat overwhelmed with the amount of data they now have to deal with – data that seems to be growing exponentially.
Many of the comments I hear go something like this: “I never thought we would be swamped with so much information. Are there Big Data solutions available now that can help me deal with this deluge of data?”
Some of our more knowledgeable customers ask, “Should I look for an ESB (enterprise service bus) solution or is Apache Spark with Kafka the right approach?” And then they add, “By the way, I really don’t understand Hadoop MapReduce and Spark.”
Other typical comments and questions are: “I am not concerned with speed – is Spark still a viable option? We realize that we need a framework to help us process our huge data sets, but why is it so difficult to adopt Hadoop for this task? What if we make the move to Spark – is our data stored in HDFS or Hive still of any use or do we need to adapt it to this new platform?”
Some better-informed data scientists and computer gurus use the volume and complexity of the data they handle to decide whether to keep their existing analytics implementation in MapReduce or shift to Spark.
Let’s see if we can bring a little more clarity to these and other Big Data questions.
Data and Analytics
The term “analytics” has been around for a long time now. However, the term Big Data is a relative newcomer. The difference between the terms analysis and analytics is that analytics covers not only analysing data but also converting the results into actionable insights.
The term Big Data seems to have first been used by computer scientist John R. Mashey. A report from the META Group (now part of Gartner) was the first to identify the “3 Vs” (volume, variety, and velocity) perspective of Big Data.
The MapReduce (MR) paradigm had already been discussed in functional programming literature when Google’s seminal paper on MR provided scalable implementations of the paradigm on a cluster of nodes and fuelled its rapid development in the Big Data space.
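To make the paradigm concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample sentences are made up for illustration, and a real cluster would run each phase in parallel across nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run the three phases in sequence on a list of text records."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input, two "documents".
counts = word_count(["big data big insights", "data at scale"])
print(counts)
```

Because every map call looks at a single record and every reduce call looks at a single key, both phases can be spread over many machines without coordination, which is the property that scalable cluster implementations exploit.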
Hadoop, which comprises the MR implementation, along with the Hadoop Distributed File System (HDFS), has now become the de facto standard for data processing, with a lot of industrial game changers such as Disney, Sears, Wal-Mart, and AT&T creating their own Hadoop cluster installations.
When immediate response is required, Spark is changing the nature of the game. Like Hadoop MapReduce, Spark is a processing tool in the Big Data domain. One advantage of Spark over Hadoop MR is that it can work with data from many sources. Spark is also the fastest open source engine for sorting a petabyte or more of data.
Spark vs. MapReduce
Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers and researchers. They had previously been working on Hadoop MapReduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast in order to handle interactive queries and iterative algorithms. Its design included capabilities like support for in-memory storage and efficient fault recovery. Hadoop performs well when data can be partitioned into independent chunks.
However, Hadoop is difficult to adopt. There are a variety of reasons for this:
• It lacks Open Database Connectivity (ODBC) support, which forces many tool developers to build separate Hadoop connectors.
• Hadoop is not performant for all types of applications, especially scientific analysis. It is heavily dependent on disk, the slowest component during processing.
• In computations where joins are needed and access to data across splits is required, establishing correlations among data chunks is a time-consuming and difficult task.
Back to MapReduce. When it comes to iterative computations, a key algorithmic approach for research and BI, MR does not perform well. Fetching the data again on each iteration is time consuming and can cause significant performance hits. Normally, the termination condition check, which determines whether the computation is complete, runs outside the MapReduce job, and it must finish before any dependent computation can start.
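The pattern can be sketched in plain Python as a driver loop around a single map-and-reduce pass. This is an illustration only, not Hadoop or Spark code: load_partitions, run_job, and iterate are hypothetical names, the data is a tiny made-up one-dimensional clustering problem, and the cache_input flag merely imitates keeping the input in memory between iterations.

```python
def load_partitions():
    """Hypothetical stand-in for reading the input splits from disk."""
    load_partitions.calls += 1
    return [[1.0, 1.2, 0.8], [5.0, 5.2], [4.8, 1.1]]


load_partitions.calls = 0


def run_job(partitions, centers):
    """One map + reduce pass of 1-D k-means: assign points, then average."""
    partials = []
    for part in partitions:
        # Map side: each split builds its own [sum, count] per center.
        local = {i: [0.0, 0] for i in range(len(centers))}
        for x in part:
            nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            local[nearest][0] += x
            local[nearest][1] += 1
        partials.append(local)
    # Reduce side: merge the partial sums and counts for each center.
    new_centers = []
    for i, old in enumerate(centers):
        total = sum(p[i][0] for p in partials)
        count = sum(p[i][1] for p in partials)
        new_centers.append(total / count if count else old)
    return new_centers


def iterate(centers, cache_input, tol=1e-9, max_iter=50):
    """Driver loop: the termination check lives here, outside the job."""
    cached = load_partitions() if cache_input else None
    for iteration in range(1, max_iter + 1):
        partitions = cached if cache_input else load_partitions()
        new_centers = run_job(partitions, centers)
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # converged, so no further job is submitted
            break
    return centers, iteration


load_partitions.calls = 0
centers, iterations = iterate([0.0, 6.0], cache_input=False)
reloads_without_cache = load_partitions.calls

load_partitions.calls = 0
cached_centers, _ = iterate([0.0, 6.0], cache_input=True)
reloads_with_cache = load_partitions.calls

print(centers, iterations, reloads_without_cache, reloads_with_cache)
```

Without caching, the input is read once per iteration; with caching it is read once in total. On a toy data set the difference is invisible, but when each read means scanning large files on disk, the repeated fetch dominates the running time of an iterative algorithm.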
For analytics on streaming data derived from, for example, sensors on a factory floor or IoT devices, or for applications that require multiple operations on the same data, you probably want to go with Spark. If you currently have an ESB implementation that uses queues or topics to process data, and you want to replace it or design a new system that can handle high-volume, high-velocity data, then Kafka with Spark is an ideal solution. Depending on the number of messages per second and the built-in functionality you need from the queuing system, you might want to run a comparison between Kafka and RabbitMQ. Kafka handles higher volumes but provides less built-in functionality, such as filtering. RabbitMQ offers more functional operations, but at the cost of speed. MapReduce’s processing style is justified if data operations and reporting requirements are mostly static and you can wait for batch-mode processing.
Scientific Calculation and Big Data
Analytics on huge data sets (petabytes and more), with Hadoop as the preferred technology tool, has already proven itself to be a sound choice, and Spark has increased the performance many times over. You can understand the viability of Spark and Hadoop by looking at the characteristics of the following classes of computation.
1. In algorithmic terms, problems with O(N) complexity are the ideal case for Hadoop MapReduce. These involve basic statistical operations such as computing the mean, median, and variance.
2. Linear systems problems like linear regression and Principal Component Analysis (PCA) can be done via Hadoop although it’s not easy (e.g., the use of kernel principal component analysis [kernel PCA], and kernel regression).
3. The algorithmic complexity of O(N^2) or O(N^3) is not handled efficiently by Hadoop MapReduce. These are problems that involve distances, kernels, or other kinds of similarity between points or sets of points (tuples).
4. Graph data processing and computations that include centrality, commute distances, and ranking are very hard to partition across a cluster. Euclidean graph problems and graph searches are difficult to realize over Hadoop MapReduce.
5. Linear or quadratic programming approaches are also harder to realize over Hadoop, because they involve complex iterations and operations on large matrices, especially at high dimensions. Conjugate gradient descent (CGD), due to its iterative nature, is also hard to solve using Hadoop. The Alternating Direction Method of Multipliers has been realized efficiently over Message Passing Interface (MPI), whereas the Hadoop implementation requires several iterations and may be less efficient.
6. Integration of functions is important in Big Data analytics. Integrals arise in Bayesian inference as well as in random effects models. Quadrature approaches that are sufficient for low-dimensional integrals might be realizable on Hadoop, but they are not suitable for the high-dimensional integration that arises in the Bayesian inference approach to Big Data analytical problems.
Markov Chain Monte Carlo is also hard to realize over Hadoop MapReduce. MCMC is iterative in nature because the chain must converge to a stationary distribution – this might happen only after several iterations.
7. Alignment problems are those that involve matching between data objects or sets of objects. They occur in various domains, such as image de-duplication and the multiple sequence alignments used in computational biology. They are typically solved with dynamic programming or Hidden Markov Models (HMMs), both of which require iterations or recursions and, as mentioned earlier, are a poor fit for Hadoop MR.
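The contrast between items 1 and 3 in the list above can be illustrated with a small plain-Python sketch. It is an assumption-laden toy, not Hadoop code: summarize, merge, mean_and_variance, and cross_split_pairs are hypothetical names, and the three splits are made-up sample data.

```python
from functools import reduce


def summarize(split):
    """Map step: one pass over a split yields (count, sum, sum of squares)."""
    return (len(split), sum(split), sum(x * x for x in split))


def merge(a, b):
    """Reduce step: partial summaries combine by element-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(splits):
    """O(N) statistics: no split ever needs to see another split's records."""
    n, total, total_sq = reduce(merge, map(summarize, splits))
    mean = total / n
    return mean, total_sq / n - mean * mean  # population variance


def cross_split_pairs(splits):
    """Count the point pairs whose two members live in different splits."""
    n = sum(len(s) for s in splits)
    all_pairs = n * (n - 1) // 2
    local_pairs = sum(len(s) * (len(s) - 1) // 2 for s in splits)
    return all_pairs - local_pairs


# Hypothetical data already divided into three input splits.
splits = [[2.0, 4.0], [4.0, 4.0, 5.0], [5.0, 7.0, 9.0]]
mean, variance = mean_and_variance(splits)
remote_pairs = cross_split_pairs(splits)
print(mean, variance, remote_pairs)
```

The mean and variance need only three numbers from each split, so the work partitions cleanly. A pairwise-distance computation over the same eight points involves 28 pairs, and 21 of them span two splits, so most of the work requires moving records between partitions. The sum-of-squares formula is used here for brevity; it can lose precision on large or badly scaled data.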
Getting Big Data up to Speed
To summarize, simpler problems or smaller versions of the Big Data challenges are doable in Hadoop with MapReduce. Spark can be seen as the next-generation data processing alternative to Hadoop MapReduce in the Big Data space where speed matters. We can say with confidence that Spark is a replacement for Hadoop MapReduce in general, and for scientific analysis in particular, because it offers far more API calls than just ‘Map’ and ‘Reduce’. Spark’s components, such as Spark SQL, Spark Streaming, and GraphX, make data analysis much easier.