Building Agile Data Lakes with Robust Ingestion and Transformation Frameworks – Part 1

This post was authored by Venkat Sundaram from Talend and Ramu Kalvakuntla from Accenture.

With the advent of Big Data technologies like Hadoop, there has been a major disruption in the information management industry. The excitement around it is not only about the three Vs – volume, velocity and variety – of data but also the ability to provide a single platform to serve all data needs across an organization. This single platform is called the Data Lake. The goal of a data lake initiative is to ingest data from all known systems within an enterprise and store it in this central platform to meet enterprise-wide analytical needs.

However, a few years back Gartner warned that a large percentage of data lake initiatives have failed or will fail, becoming more of a data swamp than a data lake. How do we prevent this? We have teamed up with one of our partners, Accenture, to discuss the data challenges enterprises face, what causes data lakes to become swamps, and the characteristics of a robust data ingestion framework that can help make the data lake more agile. We have partnered with Accenture Insights on multiple customer engagements to build these robust ingestion and transformation frameworks as part of their enterprise data lake solutions.

Current Data Challenges:

Enterprises face many challenges with data today, from siloed data stores and massive data growth to expensive platforms and lack of business insights. Let’s take a look at these individually:

1. Siloed Data Stores

Nearly every organization is struggling with siloed data stores spread across multiple systems and databases. Many organizations have hundreds, if not thousands, of database servers. They’ve likely created separate data stores for different groups such as Finance, HR, Supply Chain, Marketing and so forth for convenience’s sake, but they’re struggling big time because of inconsistent results. 

I have personally seen this across multiple companies: they can’t tell exactly how many active customers they have or what the gross margin per item is because they get varying answers from groups that have their own version of the data, calculations and key metrics.

2. Massive Data Growth

It is no surprise that data is growing exponentially across all enterprises. Back in 2002, when we first built a terabyte warehouse, our team was so excited! But today even a petabyte is considered small. Data has grown a thousandfold, in many cases in less than two decades, and organizations can no longer manage it all with their traditional databases.

Traditional systems scale vertically rather than horizontally, so when our current database reaches its capacity, we can’t just add another server to expand; we have to forklift the data into newer, higher-capacity servers. But even that has limitations. IT has become stuck in this deep web and is unable to manage systems and data efficiently.

Diagram 1: Current Data Challenges

3. Expensive Platforms

Traditional relational MPP databases are appliance-based and come with very high costs. There are cases where companies are paying more than $100K per terabyte and are unable to keep up with this expense as data volumes rapidly grow from terabytes to exabytes.

4. Lack of Business Insights

Because of all of the above challenges, the business is focused only on descriptive analytics, a rear-view-mirror look at what happened yesterday, last month, last year, year over year, etc., instead of on predictive and prescriptive analytics that surface key insights on what to do next.

What is the Solution?

One possible solution is consolidating all disparate data sources into a single platform called a data lake. Many organizations have started down this path and failed miserably. Their data lakes have morphed into unmanageable data swamps.

What does a data swamp look like? Here’s an analogy: when you go to a public library to borrow a book or video, the first thing you do is search the catalog to find out whether the material you want is available, and if so, where to find it. Usually, you are in and out of the library in a couple of minutes. But instead, let’s say when you go to the library there is no catalog, and books are piled all over the place, with fiction in one area, non-fiction in another and so forth. How would you find the book you are looking for? Would you ever go to that library again? Many data lakes are like this, with different groups in the organization loading data into them without a catalog or proper metadata and governance.

A data lake should be more like a data library, where every dataset is indexed and cataloged, and where a gatekeeper decides what data should go into the lake to prevent duplicates and other issues. For this to happen properly, we need an ingestion framework, which acts like a funnel as shown below.

Diagram 2: Data Ingestion Framework / Funnel
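
To make the library analogy concrete, here is a minimal sketch of a catalog that also plays the gatekeeper role. Everything in it is an assumption for illustration: the `DataCatalog` class, its `register` and `search` methods, and the sample dataset names are hypothetical and not part of any specific product. A real data lake would back this with a persistent metadata store rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DataCatalog:
    """Hypothetical in-memory catalog; a real lake would use a metadata store."""
    entries: dict = field(default_factory=dict)

    def register(self, name: str, source: str, owner: str) -> bool:
        """Admit a dataset only if the same source has not been cataloged yet."""
        key = source.strip().lower()
        if key in self.entries:
            return False  # duplicate source: the gatekeeper rejects it
        self.entries[key] = {"name": name, "owner": owner}
        return True

    def search(self, term: str) -> list:
        """Return the names of cataloged datasets whose name contains the term."""
        term = term.lower()
        return sorted(
            e["name"] for e in self.entries.values() if term in e["name"].lower()
        )


catalog = DataCatalog()
first = catalog.register("customers", "crm.public.customers", "Marketing")
# Same source table requested by another group: rejected as a duplicate.
second = catalog.register("customers_copy", "CRM.public.customers", "Finance")
print(first, second, catalog.search("cust"))
```

Keying the catalog on the source system rather than on the dataset name is one possible design choice: it stops two groups from loading the same table under different names.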

A data ingestion framework should have the following characteristics:

  • A single framework to perform all data ingestion consistently into the data lake.
  • Metadata-driven architecture that captures which datasets are to be ingested, when and how often they are ingested, how the metadata of those datasets is captured, and which credentials are needed to connect to the source systems.
  • Template design architecture to build generic templates that read the metadata supplied to the framework and automate the ingestion process for different data formats, in both batch and real time.
  • Tracking of metrics, events and notifications for all data ingestion activities.
  • A single, consistent method to capture all data ingestion along with technical metadata, data lineage and governance.
  • Proper data governance with “search and catalog” to find data within the data lake.
  • Data profiling to collect anomalies in the datasets so data stewards can review them and come up with data quality and transformation rules.
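
As a rough illustration of how the metadata-driven and template ideas in this list fit together, the sketch below drives ingestion entirely from metadata records. It is not the framework described in this post: the metadata fields, the `TEMPLATES` registry, the `ingest` function and the inline sample data are all assumptions made up for the example, and the "lake" is just a dictionary so the code runs without any external systems.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Hypothetical metadata records: one entry per dataset to ingest.
INGESTION_METADATA = [
    {"dataset": "orders", "format": "csv", "schedule": "daily", "mode": "batch"},
    {"dataset": "clicks", "format": "json", "schedule": "hourly", "mode": "batch"},
]

# Stand-ins for source systems so the sketch runs without any connections.
SAMPLE_SOURCES = {
    "orders": "id,amount\n1,9.50\n2,20.00\n",
    "clicks": '[{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]',
}


def read_csv(raw):
    """Template for delimited text: returns one dict per row."""
    return list(csv.DictReader(io.StringIO(raw)))


def read_json(raw):
    """Template for JSON arrays: returns the parsed list of records."""
    return json.loads(raw)


# Generic templates keyed by format; supporting a new format means adding one reader.
TEMPLATES = {"csv": read_csv, "json": read_json}


def ingest(entry, lake, events):
    """Run one metadata entry through its template and record a tracking event."""
    reader = TEMPLATES[entry["format"]]
    rows = reader(SAMPLE_SOURCES[entry["dataset"]])
    lake[entry["dataset"]] = rows
    events.append({
        "dataset": entry["dataset"],
        "rows": len(rows),
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    })


lake, events = {}, []
for entry in INGESTION_METADATA:
    ingest(entry, lake, events)

print([(e["dataset"], e["rows"]) for e in events])
```

The point of the pattern is that onboarding a new dataset means adding a metadata record, not writing a new job; the event list stands in for the tracking and notification layer.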

Diagram 3: Data Ingestion Framework Architecture

Modern Data Architecture Reference Architecture

Data lakes are a foundational structure for Modern Data Architecture solutions, where they become a single platform to land all disparate data sources and to stage raw data, profile data for data stewards, apply transformations, move data, and run machine learning and advanced analytics, so that organizations can ultimately find deep insights and perform what-if analysis.
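
The profiling step mentioned above can be pictured with a small sketch. This is only an illustration under simple assumptions: the `profile` function and the sample records are hypothetical, and the checks (null counts, distinct values, observed value types) are a minimal subset of what a dedicated profiling tool would report.

```python
from collections import Counter


def profile(rows):
    """Summarize each column: null count, distinct count, and observed value types."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(
                name, {"nulls": 0, "values": set(), "types": Counter()}
            )
            if value is None or value == "":
                stats["nulls"] += 1  # treat None and empty strings as missing
                continue
            stats["values"].add(value)
            stats["types"][type(value).__name__] += 1
    return {
        name: {
            "nulls": s["nulls"],
            "distinct": len(s["values"]),
            "types": dict(s["types"]),
        }
        for name, s in columns.items()
    }


# Hypothetical raw records with two anomalies: a missing country and a
# customer_id that arrived as text instead of a number.
sample = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": ""},
    {"customer_id": "3", "country": "US"},
]
report = profile(sample)
print(report)
```

A data steward reading this report would see mixed types in `customer_id` and a missing `country`, which is the kind of finding that turns into a data quality or transformation rule.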

Unlike traditional data warehouses, where the business won’t see the data until it’s curated, the modern data architecture lets businesses ingest new data sources through the framework and analyze them within hours or days, instead of months or years.

In the next part of this series, we’ll discuss “What is Metadata Driven Architecture?” and see how it enables organizations to build the robust ingestion and transformation frameworks behind successful agile data lake solutions. Let me know your thoughts in the comments, and head to Accenture for more info.
