Apache Beam in 2017: Use Cases, Progress and Continued Innovation
On January 10, 2017, Apache Beam (Beam) was promoted to a top-level Apache Software Foundation project. It was an important milestone that validated the value of the project and the legitimacy of its community, and heralded its growing adoption. In the past year, Apache Beam has experienced tremendous momentum, with significant growth in both its community and feature set. Let us walk you through some of the notable achievements.
First, let’s take a deeper look at how Apache Beam was used in 2017. As a unified framework for batch and stream processing, Beam enables a wide spectrum of use cases. Here are some that exemplify its versatility:
In 2017, Apache Beam had 174 contributors worldwide, from many different organizations. The Apache community was proud to count 18 PMC members and 31 committers among that mix. The community had seven releases in 2017, each bringing a rich set of new features and fixes.
The most obvious and encouraging sign of the growth of the Beam community, and validation of its core value proposition of portability, is the addition of significant new runners (i.e. execution engines). We entered 2017 with Apache Flink, Apache Spark 1.x, Google Cloud Dataflow, Apache Apex, and Apache Gearpump. In 2017, the following new and updated runners were developed:
In addition to runners, Beam added new IO connectors, some notable ones being the Cassandra, MQTT, AMQP, HBase/HCatalog, JDBC, Solr, Tika, Redis, and Elasticsearch connectors. Beam’s IO connectors make it possible to read from or write to data sources/sinks even when they are not natively supported by the underlying execution engine. Beam also provides fully pluggable filesystem support, allowing us to extend our coverage to HDFS, Amazon S3, Microsoft Azure Storage, and Google Cloud Storage. We continue to add new IO connectors and filesystems to extend Beam use cases.
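The pluggable-filesystem idea above boils down to scheme-based dispatch: each storage backend registers under a URI scheme, and file operations are routed by the scheme of the path. The sketch below illustrates that pattern in plain Python; the `FileSystems` and `InMemoryFileSystem` names are hypothetical stand-ins, not the Beam SDK's actual API.

```python
# Toy sketch of scheme-based filesystem dispatch (NOT the Beam API):
# backends register under a URI scheme, and reads/writes are routed
# by parsing the scheme out of the path.
from urllib.parse import urlparse


class InMemoryFileSystem:
    """Hypothetical stand-in for a real backend such as HDFS or S3."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]


class FileSystems:
    """Registry that routes file operations to the backend for a scheme."""

    _registry = {}

    @classmethod
    def register(cls, scheme, filesystem):
        cls._registry[scheme] = filesystem

    @classmethod
    def _lookup(cls, path):
        return cls._registry[urlparse(path).scheme]

    @classmethod
    def write(cls, path, data):
        cls._lookup(path).write(path, data)

    @classmethod
    def read(cls, path):
        return cls._lookup(path).read(path)


# Registering a new backend is one call; pipeline code that uses
# FileSystems.read/write never changes.
FileSystems.register("mem", InMemoryFileSystem())
FileSystems.write("mem://bucket/events.txt", b"hello")
```

Because the routing layer owns the scheme lookup, adding support for a new storage system never touches code that reads or writes by path, which is what lets Beam grow its filesystem coverage without changes to IO connectors.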
A particularly telling sign of the maturity of an open source community is when it is able to join forces with multiple other open source communities and mutually improve the state of each one. Over the past few months, the Beam, Calcite, and Flink communities have come together to define a robust specification for streaming SQL, with contributing engineers from more than four organizations. If you are excited by the prospect of improving the state of streaming SQL, please join us!
In addition to SQL, new specifications for XML- and JSON-based declarative DSLs are at the proof-of-concept (PoC) stage and in need of additional contributors.
Innovation is important to the success of any open source project, and Apache Beam has a rich history of bringing ground-breaking ideas to the open source community. For example, Apache Beam was the first to introduce some seminal concepts in the world of big-data processing:
- Unified batch and streaming SDKs that enable users to author big-data jobs without having to learn multiple, disparate SDKs/APIs;
- Cross-engine portability: giving enterprises the confidence that workloads authored today will not have to be rewritten when open source engines become outdated and are supplanted by newer ones; and
- Semantics that are essential for reasoning about unbounded unordered data, and achieving consistent and correct output from a streaming job.
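The last point above is the heart of the Beam model: elements are grouped by their own event timestamps, not by when they happen to arrive. The following is a conceptual sketch in plain Python (deliberately not the Beam SDK) showing why fixed event-time windows give the same answer regardless of arrival order; the window size and sample events are invented for illustration.

```python
# Conceptual sketch (plain Python, NOT the Beam SDK) of event-time
# windowing: each element is assigned to a window by its own timestamp,
# so the result is independent of the order in which elements arrive.
from collections import Counter

WINDOW_SIZE = 60  # seconds; fixed ("tumbling") windows


def window_counts(events):
    """Count events per fixed event-time window.

    `events` is an iterable of (event_time_seconds, payload) pairs,
    possibly out of order.
    """
    counts = Counter()
    for event_time, _payload in events:
        # Assign the element to the window containing its timestamp.
        window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        counts[window_start] += 1
    return dict(counts)


in_order = [(5, "a"), (42, "b"), (61, "c"), (130, "d")]
shuffled = [(130, "d"), (5, "a"), (61, "c"), (42, "b")]

# Same windows, same counts, whatever the arrival order.
assert window_counts(in_order) == window_counts(shuffled)
```

A real streaming job must also decide *when* each window's result is emitted (watermarks and triggers), which this sketch omits; the point here is only that grouping by event time makes correct, consistent output possible on unbounded, unordered input.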
In 2017, the pace of innovation continued with the addition of the following capabilities:
- A cross-language portability framework accompanied by a Go SDK;
- Dynamically Shardable IO (SplittableDoFn);
- Support for schemas in PCollection, allowing the community to extend Beam’s runner capabilities; and
- Extensions addressing new use cases such as machine learning, and new data formats.
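Of the capabilities above, SplittableDoFn is the most subtle: a unit of work is described by a *restriction* (such as an offset range), and a tracker can split off the unprocessed remainder mid-execution so another worker can pick it up. The sketch below is a heavily simplified illustration of that idea in plain Python; the class and method names are invented for this example and do not match Beam's actual `RestrictionTracker` API.

```python
# Simplified sketch (NOT Beam's SplittableDoFn API) of dynamically
# shardable IO: work is an offset range, progress is made by claiming
# offsets, and the unprocessed remainder can be split off at any time.
class OffsetRangeTracker:
    def __init__(self, start, stop):
        self.start, self.stop = start, stop
        self.position = start  # next offset to claim

    def try_claim(self, offset):
        """Claim one offset; fails once the range is exhausted or split."""
        if offset >= self.stop:
            return False
        self.position = offset + 1
        return True

    def split_remainder(self):
        """Hand off [position, stop) as a new restriction and shrink ours."""
        remainder = (self.position, self.stop)
        self.stop = self.position
        return remainder


tracker = OffsetRangeTracker(0, 100)
for offset in range(100):
    if not tracker.try_claim(offset):
        break
    if offset == 39:  # pretend the runner requests a split here
        residual = tracker.split_remainder()
        break
# residual is now (40, 100); the original tracker owns only [0, 40).
```

Because the split happens through the tracker rather than by re-partitioning the source up front, a runner can rebalance a straggling read while it is in flight, which is what makes the IO "dynamically shardable."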
Areas of improvement
Any retrospective view of a project is incomplete without an honest assessment of areas for improvement. Among the parts of the Beam project still requiring attention, two aspects stand out:
- Helping runners showcase their individual strengths. After all, portability does not imply homogeneity. Different runners have different areas in which they excel, and we need to do a better job of highlighting their individual strengths.
- Building on the previous point, helping users make a more informed decision when they select a runner or migrate from one to another.
In 2018, we aim to take proactive steps to improve the above aspects.
Ethos of the project and its community
The world of batch and stream big-data processing today is reminiscent of the Tower of Babel parable: progress slowed because people spoke different languages. Similarly, there are multiple disparate big-data SDKs/APIs today, each with its own distinct terminology for similar concepts. The side effect is user confusion and slower adoption.
For this reason, in order to help inspire and accelerate adoption, the Apache Beam project aims to provide an industry standard portable SDK that will:
- Benefit users by providing innovation with stability: The separation of SDKs and engines enables healthy competition between runners, without requiring users to constantly learn new SDKs/APIs and rewrite their workloads to benefit from each innovation; and
- Benefit big-data engines by growing the pie for everyone: Making it easier for users to author, maintain, upgrade and migrate their big-data workloads will lead to significant growth in the number of big-data deployments in production.
The Apache Beam community is proud of the tremendous achievements we’ve been able to attain thus far, and we hope you will join us in continuing the project’s advancement in the years to come.