Introduction to Apache Beam
This post is the first in a series explaining the overarching goal and purpose of the Apache Beam project. In future posts, we will explain how to use Apache Beam to implement data processing jobs.
When you have an existing big data platform, the continuous evolution of that platform is important. If you’re currently using Apache Hadoop MapReduce jobs to process your data, you may want to migrate to Apache Spark in order to leverage new features and improve performance. You may also want to implement streaming data processing in addition to your existing batch processing capabilities.
Alternatively, you may be looking for an easy way to integrate with, or upgrade to, different technologies. For instance, if you are using Apache Spark today, you may want to consider Apache Flink for some use cases, or as part of a proof-of-concept (PoC) to weigh its benefits.
When you have various runtimes on-site, switching between technologies can be difficult and costly.
Apache Beam, a new distributed processing tool currently incubating at the Apache Software Foundation (ASF), provides an abstraction layer that lets developers write their code against the Beam programming model. With Apache Beam, an implementation is agnostic to the runtime technology being used, so you can switch technologies quickly and easily.
Apache Beam's programming model is also unified: developers use the same model to implement both batch and streaming data processing. That is where the Apache Beam name comes from: B (for Batch) and EAM (for strEAM).
To implement your data processes using the Beam programming model, you use an SDK or DSL provided by Beam. Currently, only one SDK is available: the Java SDK. However, a Python SDK is expected to be released, and Beam will soon provide a Scala SDK and additional DSLs (a declarative DSL in XML, for instance).
With Apache Beam, you first choose a Beam SDK, then implement your data processes as Beam pipelines. You don't need to worry about the actual runtime where your processes will be deployed and run, because Apache Beam provides runners. A runner is responsible for translating your pipelines to the target runtime. Apache Beam provides turnkey runners for Apache Spark, Apache Flink, and the Google Cloud Dataflow platform. It will also soon provide new runners for Apache Hadoop MapReduce, Apache Karaf, and more.
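To make the split between pipeline and runner more concrete, here is a minimal sketch of the idea in plain Java. It deliberately does not use the Beam SDK: `SketchPipeline`, `SketchRunner`, `LoopRunner`, and `StreamRunner` are hypothetical names invented for this illustration, and real runners translate pipelines to distributed engines rather than to in-memory loops. The only point it shows is that the pipeline is defined once and each runner decides how to execute it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for a pipeline: an ordered list of element-wise steps.
// This is NOT the Beam SDK API; all names here are made up for illustration.
final class SketchPipeline {
    private final List<Function<String, String>> steps = new ArrayList<>();

    SketchPipeline apply(Function<String, String> step) {
        steps.add(step);
        return this;
    }

    List<Function<String, String>> steps() {
        return steps;
    }
}

// A runner turns the pipeline definition into work on one concrete runtime.
interface SketchRunner {
    List<String> run(SketchPipeline pipeline, List<String> input);
}

// Executes the steps one element at a time in a plain loop.
final class LoopRunner implements SketchRunner {
    public List<String> run(SketchPipeline pipeline, List<String> input) {
        List<String> out = new ArrayList<>();
        for (String element : input) {
            String value = element;
            for (Function<String, String> step : pipeline.steps()) {
                value = step.apply(value);
            }
            out.add(value);
        }
        return out;
    }
}

// Executes the same steps by composing them and mapping over a stream.
final class StreamRunner implements SketchRunner {
    public List<String> run(SketchPipeline pipeline, List<String> input) {
        Function<String, String> composed = Function.identity();
        for (Function<String, String> step : pipeline.steps()) {
            composed = composed.andThen(step);
        }
        return input.stream().map(composed).collect(Collectors.toList());
    }
}

final class RunnerSketchDemo {
    public static void main(String[] args) {
        // The pipeline is written once, with no reference to any runner.
        SketchPipeline pipeline = new SketchPipeline()
            .apply(String::trim)
            .apply(String::toUpperCase);
        List<String> input = Arrays.asList(" spark ", " flink ");

        // Swapping the runner does not touch the pipeline code.
        System.out.println(new LoopRunner().run(pipeline, input));
        System.out.println(new StreamRunner().run(pipeline, input));
    }
}
```

Both calls print `[SPARK, FLINK]`. Changing the execution strategy means choosing a different runner object, and the pipeline definition stays exactly as it was.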
To summarize, Apache Beam offers several unique capabilities for advancing your big data environment, including:
- Portability - Your data processing jobs stay exactly the same, decoupled from the actual runtime that you use. You run the same code with different runners (abstraction) and backends: on premises, in the cloud, or locally.
- Unified processing - Apache Beam provides the same unified model for batch and stream processing.
- Advanced features - The Apache Beam programming model supports advanced features that implement the latest data processing patterns. For instance, the programming model supports event windowing, triggering, watermarking, late-arriving data, etc.
- Extensible model and SDK - Most parts of Apache Beam can be extended to meet the needs of your environment. Say you want to support a new runtime: you can create your own runner. Perhaps you wish to create a DSL: just build it on top of one of the provided SDKs. Or, say you need to support a new data source: just create a new Beam IO (a custom data source and sink implementation).
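As a rough illustration of what event windowing means in the list above, the following plain-Java sketch groups events into fixed one-minute windows by the time each event occurred, not by the order in which it arrived. It does not use the Beam SDK, and it does not model triggers or watermarks. `SketchEvent` and `FixedWindowSketch` are hypothetical names, and the timestamps and values are made-up sample data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical event type: a value plus the time the event actually occurred.
final class SketchEvent {
    final long eventTimeMillis;
    final int value;

    SketchEvent(long eventTimeMillis, int value) {
        this.eventTimeMillis = eventTimeMillis;
        this.value = value;
    }
}

final class FixedWindowSketch {
    // Start of the fixed window containing the timestamp.
    // Assumes non-negative timestamps and a positive window size.
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - (eventTimeMillis % windowSizeMillis);
    }

    // Sums values per window, keyed by window start, using event time only.
    static Map<Long, Integer> sumPerWindow(List<SketchEvent> events, long windowSizeMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (SketchEvent e : events) {
            sums.merge(windowStart(e.eventTimeMillis, windowSizeMillis), e.value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Arrival order differs from event-time order: the events stamped
        // 10 s and 59 s arrive after an event stamped 61 s.
        List<SketchEvent> arrivals = Arrays.asList(
            new SketchEvent(61_000L, 5),
            new SketchEvent(10_000L, 1),
            new SketchEvent(59_000L, 2),
            new SketchEvent(75_000L, 3));

        System.out.println(sumPerWindow(arrivals, 60_000L));
    }
}
```

This prints `{0=3, 60000=8}`: the two events that arrived late still land in the first window because they are assigned by event time. In a real streaming system the hard part is deciding when a window is complete enough to emit, which is the problem that watermarks and triggers address.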
In the next blog post, we will look at the Apache Beam programming model in more detail, including data pipelines, PCollection, PTransform, and IO.
About Jean-Baptiste (@jbonofre)
ASF Member, PMC for Apache Karaf, PMC for Apache ServiceMix, PMC for Apache ACE, PMC for Apache Syncope, Committer for Apache ActiveMQ, Committer for Apache Archiva, Committer for Apache Camel, Contributor for Apache Falcon