When it comes to real-time big data architectures, today there are choices. There is more than just Lambda on the menu, and in this blog series, I’ll discuss a couple of these options and compare them using relevant use cases. So, how do you select the right architecture for your real-time project? Let’s get started.
Before we dive into the architecture, let’s discuss some of the requirements of real-time data processing systems in big data scenarios.
The most obvious of these requirements is that data is in motion. In other words, the data is continuous and unbounded. What really matters is when you analyze this data. If you are looking for answers against the current snapshot of data or have specific low-latency requirements, then you’re probably looking at a real-time scenario.
In addition, there are very often business deadlines to be met. After all, if there were no consequences to missing deadlines for real-time analysis, then the process could be batched. These consequences can range from complete failure to simple degradation of service.
Since we are talking about big data, we also expect to push the limits on volume, velocity and possibly even variety of data.
Real-time data processing often requires qualities such as scalability, fault tolerance, predictability, resiliency against stream imperfections, and extensibility.
New Architectures for the New Data Era
To address this need, new architectures were born… or in other words, necessity is the mother of invention.
The Lambda Architecture, attributed to Nathan Marz, is one of the more common architectures you will see in real-time data processing today. It is designed to handle low-latency reads and updates in a linearly scalable and fault-tolerant way.
The data stream entering the system is fed into both a batch layer and a speed layer.
The batch layer stores the raw data as it arrives, and computes the batch views for consumption. Naturally, batch processes will occur on some interval and will be long-lived. The scope of data is anywhere from hours to years.
The speed layer is used to compute the real-time views that complement the batch views.
Any query may get a complete picture by retrieving data from both the batch views and the real-time views. The queries will get the best of both worlds. The batch views may be processed with more complex or expensive rules and may have better data quality and less skew, while the real-time views give you up-to-the-moment access to the latest possible data. As time goes on, real-time data expires and is replaced with data in the batch views.
One additional benefit of this architecture is that you can replay the same incoming data and produce new views if the code or a formula changes.
The biggest drawback of this architecture has been the need to maintain two distinct (and possibly complex) systems to generate the batch and speed layers. Luckily, with Spark Streaming (abstraction layer) or Talend (Spark Batch and Streaming code generator), this has become far less of an issue… although the operational burden still exists.
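To make the flow above concrete, here is a minimal in-memory sketch of the dual feed, the two kinds of views, and a query that merges them. Everything in it is an assumption made for illustration: the LambdaSketch class, the page-view counting use case, and the integer timestamps are hypothetical, and a real system would use distributed storage and processing engines rather than Python lists and counters.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of a Lambda Architecture counting page views (hypothetical)."""

    def __init__(self):
        self.master_dataset = []        # batch layer: append-only raw events
        self.batch_view = Counter()     # output of the last batch run
        self.recent = []                # events not yet covered by a batch run
        self.realtime_view = Counter()  # speed layer: incremental counts

    def ingest(self, ts, page):
        # Dual feed: every event goes to both the batch and the speed layer.
        self.master_dataset.append((ts, page))
        self.recent.append((ts, page))
        self.realtime_view[page] += 1

    def run_batch(self, up_to):
        # Recompute the batch view from the raw data up to a cutoff timestamp.
        self.batch_view = Counter(
            page for ts, page in self.master_dataset if ts <= up_to
        )
        # Expire real-time data that the batch view now covers.
        self.recent = [(ts, page) for ts, page in self.recent if ts > up_to]
        self.realtime_view = Counter(page for _, page in self.recent)

    def query(self, page):
        # A complete answer merges the batch view and the real-time view.
        return self.batch_view[page] + self.realtime_view[page]


lam = LambdaSketch()
for ts, page in [(1, "home"), (2, "pricing"), (3, "home")]:
    lam.ingest(ts, page)
lam.run_batch(up_to=3)               # batch view now covers timestamps 1 to 3
for ts, page in [(4, "home"), (5, "docs")]:
    lam.ingest(ts, page)             # only the speed layer has seen these

print(lam.query("home"))             # 2 from the batch view + 1 real-time = 3
```

Note that the query never needs to know which layer an event came from. It only adds the two views, which works here because counts are additive; views that are not simply additive would need a different merge step.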
Next, we’ll discuss the Kappa Architecture.
The Kappa Architecture was first described by Jay Kreps. It focuses on only processing data as a stream. It is not a replacement for the Lambda Architecture, except where your use case fits. In this architecture, incoming data is streamed through a real-time layer, and the results are placed in the serving layer for queries.
The idea is to handle both real-time data processing and continuous reprocessing in a single stream processing engine. That’s right, reprocessing occurs from the stream. This requires that the incoming data stream can be replayed (very quickly), either in its entirety or from a specific position. If there are any code changes, then a second stream process would replay all previous data through the latest real-time engine and replace the data stored in the serving layer.
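As a rough illustration of reprocessing from the stream, the sketch below replays a log through a new version of the processing code and then swaps the result into the serving layer. The log here is just a Python list standing in for a replayable stream, and the names replay, process_v1, and process_v2, along with the "ignore small amounts" rule change, are hypothetical choices made for this example.

```python
# Hypothetical replayable log: an ordered, append-only list of events.
log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 3},
    {"user": "a", "amount": 2},
    {"user": "b", "amount": 20},
]


def process_v1(state, event):
    # Original rule: total amount per user.
    state[event["user"]] = state.get(event["user"], 0) + event["amount"]


def process_v2(state, event):
    # Changed rule (assumed for illustration): ignore amounts below 5.
    if event["amount"] >= 5:
        state[event["user"]] = state.get(event["user"], 0) + event["amount"]


def replay(events, processor, from_offset=0):
    """Run a processor over the log from a given position and return its output."""
    state = {}
    for event in events[from_offset:]:
        processor(state, event)
    return state


# The serving layer initially holds the output of the first code version.
serving_v1 = replay(log, process_v1)
serving = serving_v1

# After a code change, a second stream job replays all previous data through
# the new code, and its output replaces what the serving layer holds.
serving = replay(log, process_v2)

print(serving_v1)  # {'a': 12, 'b': 23}
print(serving)     # {'a': 10, 'b': 20}
```

The from_offset parameter corresponds to replaying "from a specific position" rather than in the log's entirety.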
This architecture attempts to simplify things by keeping only one code base rather than managing one for each of the batch and speed layers in the Lambda Architecture. In addition, queries only need to look in a single serving location instead of going against batch and speed views.
The complication of this architecture mostly revolves around having to process this data in a stream, such as handling duplicate events, cross-referencing events or maintaining order. These operations are generally easier to do in batch processing.
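To give a feel for why these operations take extra care in a stream, here is a small sketch of two of them: dropping duplicate events and restoring order. It assumes each event carries a unique id and a timestamp ts, and that no event arrives more than max_delay time units late. Those field names and the bound are assumptions for this example, and the approach shown is only one simple way to do it.

```python
import heapq
import itertools


def dedupe(events):
    """Drop events whose id has already been seen."""
    seen = set()  # simplification: grows without bound on an endless stream
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            yield event


def reorder(events, max_delay):
    """Emit events in timestamp order, assuming bounded lateness."""
    heap = []
    tie = itertools.count()  # tie-breaker so event dicts are never compared
    watermark = float("-inf")
    for event in events:
        heapq.heappush(heap, (event["ts"], next(tie), event))
        # Anything at or before the watermark can no longer be overtaken.
        watermark = max(watermark, event["ts"] - max_delay)
        while heap and heap[0][0] <= watermark:
            yield heapq.heappop(heap)[2]
    while heap:  # the stream ended: flush whatever is still buffered
        yield heapq.heappop(heap)[2]


# Hypothetical input: event 2 arrives late and event 3 is delivered twice.
incoming = [
    {"id": 1, "ts": 1},
    {"id": 3, "ts": 3},
    {"id": 2, "ts": 2},
    {"id": 3, "ts": 3},
    {"id": 4, "ts": 5},
]

cleaned = list(reorder(dedupe(incoming), max_delay=2))
print([event["id"] for event in cleaned])  # [1, 2, 3, 4]
```

In a batch job the same result is a sort and a distinct over a complete data set. In a stream, the code has to hold state and decide how long to wait for stragglers, which is the trade-off that max_delay makes explicit.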
One Size May Not Fit All
Many real-time use cases will fit a Lambda Architecture well. The same cannot be said of the Kappa Architecture. If the batch and streaming analyses are identical, then using Kappa is likely the best solution. In some cases, however, having access to a complete set of data in a batch window may yield certain optimizations that would make Lambda better performing and perhaps even simpler to implement.
There are also some very complex situations where the batch and streaming algorithms produce very different results (using machine learning models, expert systems, or inherently very expensive operations that must be performed differently in real-time) which would require using Lambda.
So, that covers the two most popular real-time data processing architectures. The next articles in this series will dive deeper into each of these and we’ll discuss concrete use cases and the technologies that would often be found in these architectures.
“How to beat the CAP theorem” by Nathan Marz
“Questioning the Lambda Architecture” by Jay Kreps
“Big Data” by Nathan Marz, James Warren