Today, enterprises need to collect and analyze more and more data to drive greater business insight and improve customer experiences.
To process this data, technology stacks have evolved to include cloud data warehouses and data lakes, big data processing, serverless computing, containers, machine learning, and more.
Increasingly, a new architectural model is layered across this network of systems: “data pipelines” that collect, transform, and deliver data to the people and machines that need it for transactions and decision making. Like a circulatory system delivering nutrients to the body, data pipelines deliver data to fuel insights that can improve a company’s revenue and profitability.
However, even as data pipelines gain in popularity, many enterprises underestimate how difficult they can be to manage properly. Enterprises often lack the time or resources to introspect data as it moves across the cloud, so data lineage and relationships are often not captured, and data pipelines become islands unto themselves. Likewise, many first-generation data pipelines have to be rearchitected as the underlying systems and schemas change. Without proper attention, they can even break.
To ensure their data pipelines work reliably, we suggest adopters start by making sure the pipeline’s “nervous system” is built to support reliability, change, and future-proof innovation.
Understanding the Data Pipeline Lifecycle
While data pipelines can deliver a wide range of benefits, getting them right requires a broad perspective. So, we suggest organizations get familiar with the full data pipeline lifecycle, from design and deployment to operation and governance.
Taking a lifecycle approach will greatly improve the chances that your data pipeline is intelligent and responsive to your company’s needs, and that it provides quick access to trusted data. Having deep experience in data pipeline design and delivery, I’d like to share the overall lifecycle, along with tips for a successful implementation.
It is vital to design data pipelines so they can easily adapt to different connectivity protocols (database, application, API, sensor protocol), different processing speeds (batch, micro-batch, streaming), different data structures (structured, unstructured), and different qualities of service (throughput, resiliency, cost, etc.).
Challenges in the design phase include how to access the data, what its structure is, and whether it can be trusted. Live feedback on what you are building is therefore important; without it, you are stuck in a tedious design-test-debug-design loop. A flexible, intuitive, and intelligent design interface helps here, with features such as autosuggestion, data sampling for live preview, and design optimization.
The framework must also have the right level of instrumentation so developers can capture and act on events, addressing changes in data structure and content in real time.
Modern data pipelines also need to support:
- Changes in data semantics and data structure, with compatibility ensured through a schemaless approach
- Data quality validation rules to detect anomalies in the content flowing through the pipeline
- Full data lineage to address governance requirements such as GDPR
- “Out-of-order” real-time data processing when data arrives late
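To make the last item more concrete, here is a minimal sketch of out-of-order handling using event-time windows. It is an illustration under stated assumptions, not the behavior of any particular product: the window size, the allowed lateness, and the names `window_start` and `aggregate_by_event_time` are all hypothetical. Events are grouped by when they happened rather than when they arrived, and a window is only finalized once the watermark (the latest event time seen so far) has moved well past it.

```python
from collections import defaultdict

WINDOW_SECONDS = 60     # tumbling window size (assumed for illustration)
ALLOWED_LATENESS = 30   # grace period after a window ends (assumed)


def window_start(event_time):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def aggregate_by_event_time(events):
    """Sum values per event-time window, tolerating out-of-order arrival.

    `events` is an iterable of (event_time, value) pairs in ARRIVAL order.
    The watermark is the largest event time seen so far. A window is
    finalized once the watermark reaches its end plus ALLOWED_LATENESS;
    events for a window that is already past that point are dropped.
    Returns (window sums keyed by window start, list of dropped events).
    """
    open_windows = defaultdict(int)
    closed = {}
    dropped = []
    watermark = 0
    for event_time, value in events:
        watermark = max(watermark, event_time)
        start = window_start(event_time)
        if start + WINDOW_SECONDS + ALLOWED_LATENESS <= watermark:
            # Too late: the window for this event has already expired.
            dropped.append((event_time, value))
            continue
        open_windows[start] += value
        # Finalize every open window the watermark has moved past.
        expired = [s for s in open_windows
                   if s + WINDOW_SECONDS + ALLOWED_LATENESS <= watermark]
        for s in expired:
            closed[s] = open_windows.pop(s)
    closed.update(open_windows)  # flush remaining windows at end of input
    return closed, dropped
```

For the arrival sequence `[(5, 1), (70, 2), (50, 3), (130, 4), (160, 5), (20, 6)]`, the event at time 50 arrives after the event at time 70 but is still counted in its proper window, while the event at time 20 arrives after its window has expired and is reported as dropped. A real pipeline would usually route such late events to a side output instead of discarding them.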
When building a data pipeline, it’s important to develop the pipeline to be as portable and agile as possible. This helps ensure that your early technology choices prove long-lasting and do not require a complete re-architecture later.
During the deployment phase, challenges can include where to deploy each part of the pipeline (local to data or at the edge), on what runtime (cloud, big data, containers), and how to effectively scale to meet demand.
For example, we often see clients who start on-premises, then move to a cloud or hybrid platform, then incorporate a multi-cloud and/or serverless computing platform with machine learning. Working through an abstraction layer such as Apache Beam helps provide this level of flexibility and portability, because the data pipeline is abstracted from its runtime.
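The idea of separating a pipeline from its runtime can be sketched in a few lines of plain Python. This is not Apache Beam’s API; the names `build_pipeline`, `LocalRunner`, and `ThreadedRunner`, the record fields, and the tax rate are all hypothetical. The pipeline is defined once as a list of transforms, and the choice of where and how it executes is left to an interchangeable runner.

```python
from concurrent.futures import ThreadPoolExecutor

TAX_RATE = 0.2  # illustrative constant, not from the article


def parse_amount(record):
    """Convert the 'amount' field from text to a number."""
    return {**record, "amount": float(record["amount"])}


def add_tax(record):
    """Add a derived 'amount_with_tax' field."""
    total = round(record["amount"] * (1 + TAX_RATE), 2)
    return {**record, "amount_with_tax": total}


def build_pipeline():
    """The pipeline definition: an ordered list of transforms, no runtime details."""
    return [parse_amount, add_tax]


def apply_all(transforms, record):
    """Run one record through every transform in order."""
    for transform in transforms:
        record = transform(record)
    return record


class LocalRunner:
    """Executes the pipeline sequentially in the current process."""

    def run(self, transforms, records):
        return [apply_all(transforms, r) for r in records]


class ThreadedRunner:
    """Executes the same pipeline on a thread pool; output order is preserved."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, transforms, records):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(lambda r: apply_all(transforms, r), records))
```

Calling `LocalRunner().run(build_pipeline(), records)` and `ThreadedRunner().run(build_pipeline(), records)` yields the same results from the same pipeline definition. Swapping the runner changes the execution environment without touching the transforms, which is the property that makes a later move between platforms less disruptive.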
Another consideration is scale. Data pipelines and their underlying infrastructure need to be able to scale to handle increasing volumes of data. In today’s cloud era, the good news is that you can get the scalability you need at a cost you can afford. One technique that works is a distributed processing strategy in which you process some data locally (e.g., IoT data), and/or use serverless Spark platforms where you pay only for what you need, when you need it.
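As a rough illustration of processing data locally, the sketch below collapses raw sensor readings into one small summary per device before anything is sent upstream. The function name `summarize_locally` and the shape of the readings are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def summarize_locally(readings):
    """Collapse raw (device_id, value) readings into one summary per device.

    Intended to run close to the data source so that only the summaries,
    not every raw reading, need to travel to the central platform.
    """
    by_device = defaultdict(list)
    for device_id, value in readings:
        by_device[device_id].append(value)
    return {
        device_id: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for device_id, values in by_device.items()
    }
```

With thousands of readings per device per minute, the payload sent to the cloud shrinks to a handful of numbers per device. The trade-off is that raw values are no longer available centrally, so this approach suits metrics that can be summarized without losing what downstream consumers need.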
Operate and Optimize.
This phase presents a range of challenges in capturing and correlating data, as well as in delivering analytics and insights as outcomes. Among the challenges are how to handle changing data structures and pipelines that fail, and how to optimize and improve data pipelines over time. We find that AI/ML is now mature enough to be very helpful here.
At runtime, data pipelines need to respond and improve intelligently rather than fail. Examples include autoscaling through serverless infrastructure provisioning and automatic load balancing as data volumes increase, dynamically adjusting to changing schemas, and autocorrection. All of this can be AI-driven, drawing on technical, business, or operational metadata, whether historical or real-time.
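A simple, rule-based version of adjusting to a changed schema rather than failing might look like the sketch below. It uses fixed rules instead of AI, and the expected schema, the rename table, and the function name `adapt_record` are all hypothetical. The point is only the behavior: the pipeline keeps running and records what it had to adapt.

```python
EXPECTED_SCHEMA = {"customer_id": int, "email": str, "amount": float}  # assumed
KNOWN_RENAMES = {"cust_id": "customer_id", "e_mail": "email"}          # assumed


def adapt_record(record, schema=EXPECTED_SCHEMA, renames=KNOWN_RENAMES):
    """Return (adapted_record, notes) instead of raising on schema drift."""
    notes = []
    adapted = {}
    # Map known old field names onto the current names.
    for key, value in record.items():
        target = renames.get(key, key)
        if target != key:
            notes.append(f"renamed {key} -> {target}")
        adapted[target] = value
    # Fill missing fields and coerce values of the wrong type.
    for field, field_type in schema.items():
        if field not in adapted:
            adapted[field] = None
            notes.append(f"missing {field}")
        elif adapted[field] is not None and not isinstance(adapted[field], field_type):
            try:
                adapted[field] = field_type(adapted[field])
                notes.append(f"coerced {field} to {field_type.__name__}")
            except (TypeError, ValueError):
                adapted[field] = None
                notes.append(f"could not coerce {field}")
    # Keep unknown fields, but report them so someone can review the change.
    for key in adapted:
        if key not in schema:
            notes.append(f"new field {key}")
    return adapted, notes
```

For an input such as `{"cust_id": "42", "amount": 10, "channel": "web"}`, the record comes back with `customer_id` as the integer 42, `amount` as 10.0, `email` set to `None`, and the unexpected `channel` field preserved, together with notes describing each adjustment. Those notes are the kind of operational metadata an AI-driven approach could learn from.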
AI is also used to optimize data pipeline operations and highlight bottlenecks, decreasing the mean time to detect errors, investigate, and troubleshoot. And with auto-detection of, or adaptation to, schema changes at runtime, AI keeps your pipelines running.
One last tip here. Practitioners can also optimize their pipelines with machine learning through a framework like Apache Spark. Spark’s machine learning algorithms and utilities (packaged through MLlib) allow data practitioners to introduce intelligence into their Spark data pipelines.
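As a much simpler stand-in for AI-driven bottleneck detection, the sketch below flags pipeline stages whose latest run time is far outside their historical norm. It is a plain statistical baseline (a z-score check), not a machine learning model, and the function name `find_slow_stages`, the threshold, and the stage names are assumptions for illustration.

```python
from statistics import mean, stdev


def find_slow_stages(history, latest, threshold=3.0):
    """Return stages whose latest duration is unusually high.

    `history` maps stage name -> list of past durations in seconds.
    `latest` maps stage name -> most recent duration in seconds.
    A stage is flagged when its latest duration exceeds the historical
    mean by more than `threshold` standard deviations.
    """
    flagged = []
    for stage, durations in history.items():
        if len(durations) < 2 or stage not in latest:
            continue  # not enough data to judge this stage
        baseline = mean(durations)
        spread = stdev(durations)
        if spread == 0:
            # History is perfectly flat: any increase counts as a deviation.
            is_slow = latest[stage] > baseline
        else:
            is_slow = (latest[stage] - baseline) / spread > threshold
        if is_slow:
            flagged.append(stage)
    return flagged
```

Given a history in which `transform` normally takes about 30 seconds and a latest run of 60 seconds, only `transform` is flagged, which points an operator straight at the stage to investigate. Production systems would typically account for seasonality and data volume as well, which is where learned models become useful.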
As companies integrate many more types of structured and unstructured data, they must understand the lineage of that data, cleanse it, and govern it. Having a well-crafted data governance strategy in place from the start is a fundamental practice for any project, helping to ensure consistent, common processes and responsibilities.
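To show what capturing lineage can mean in practice, here is a minimal sketch in which each pipeline step records which datasets it read and which dataset it produced. The class `LineageLog`, its methods, and the dataset names are hypothetical; real deployments would normally rely on a metadata catalog rather than an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageLog:
    """In-memory record of which datasets were derived from which."""

    entries: list = field(default_factory=list)

    def record(self, output, inputs, step):
        """Note that `step` produced `output` from the given `inputs`."""
        self.entries.append({
            "output": output,
            "inputs": list(inputs),
            "step": step,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        })

    def upstream(self, dataset):
        """Return every dataset that `dataset` depends on, directly or indirectly."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for entry in self.entries:
                if entry["output"] == current:
                    for source in entry["inputs"]:
                        if source not in seen:
                            seen.add(source)
                            stack.append(source)
        return seen
```

If a `dedupe` step produces `clean_orders` from `raw_orders`, and a `join` step produces `revenue_report` from `clean_orders` and `customers`, then `upstream("revenue_report")` returns all three source datasets. That answer is what a governance team needs when asked which reports are affected by a change to, or a deletion request against, a given source.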
We suggest users start by identifying business drivers for data that needs to be carefully controlled and the benefits expected from this effort. This strategy will be the basis of your data governance framework.
Common governance challenges for the adaptive data pipeline include complying with recent regulations such as the General Data Protection Regulation (GDPR) for European data or the California Consumer Privacy Act (CCPA); failure to comply can bring severe penalties.
If built correctly, your data pipeline can remain accurate, resilient, and hassle-free, and can even grow smarter over time to keep pace with your changing environment, whether that means moving from batch to streaming and real-time, from hybrid to cloud or multi-cloud, or from Spark 1.0 to 2.4 to the next big thing.
This article was originally published on Integration Developer News.