Building Successful Governed Data Lakes with Agile Data Lake Methodology – Volume 2
This is the second part in a series of blogs that discuss how to successfully build governed data lakes. To all those who have read the first part, thank you! If you haven’t read it, we invite you to read it before continuing, as this second part will build upon it and dive a bit deeper. In this blog, we are going to help you understand how to build a successful data lake with proper governance using metadata-driven architectures and frameworks.
What is a Metadata-Driven Architecture?
First, let’s talk about what I mean by “metadata-driven” architecture. A metadata-driven architecture lets developers use metadata (data about data) to decouple the specifics of each dataset from the logic that processes it. Metadata-driven frameworks let us create generic templates and pass the metadata as parameters at run time, so we write the logic once and reuse it many times. This type of architecture provides a single, consistent method of ingesting data into the data lake, improves speed to market and makes it possible to govern what goes into the lake.
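To make the idea concrete, here is a minimal Python sketch of a generic template driven by metadata. It is not the Talend implementation; the function name `ingest_delimited` and the metadata keys (`delimiter`, `columns`, `has_header`) are assumptions made up for illustration.

```python
import csv
import io

def ingest_delimited(source, params):
    """Generic template: every source-specific detail arrives as metadata."""
    reader = csv.reader(source, delimiter=params["delimiter"])
    if params.get("has_header", False):
        next(reader, None)  # skip the header row when the metadata says there is one
    columns = params["columns"]
    return [dict(zip(columns, row)) for row in reader]

# Two different feeds, one template; only the (hypothetical) metadata differs.
vendor_meta = {"delimiter": "|", "columns": ["id", "name"], "has_header": True}
claims_meta = {"delimiter": ",", "columns": ["claim_id", "amount"]}

vendor_rows = ingest_delimited(io.StringIO("id|name\n1|Acme\n2|Globex\n"), vendor_meta)
claims_rows = ingest_delimited(io.StringIO("C-9,120.50\n"), claims_meta)
```

Onboarding a third feed in this sketch means adding one more metadata entry, not another function.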
How does the Clarity Insights Data Ingestion Framework work?
Clarity Insights has successfully built many governed data lake solutions for clients, and as part of this effort, we have created a data ingestion framework using a metadata-driven architecture on the Talend Big Data Platform. We picked Talend because it is lightweight, open source and a code generator, which gives us the flexibility to design generic components for both data ingestion and transformation processes.
The diagram below illustrates the functionality of the ingestion framework built by Clarity Insights.
What are the core components of the Data Ingestion Framework?
Framework Database – this metadata database stores:
- Global parameters — such as your Hadoop environment details, Hive database and IP addresses
- Configuration metadata — such as what ingestions to run, in what order to run, which templates to use and how many parallel processes to run
- Operational Metadata — such as what jobs ran at what time, for how long, how many records were processed and job status
The metadata database can be set up on any RDBMS.
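As a rough illustration of the three kinds of metadata described above, the sketch below creates a tiny framework database in SQLite through Python. All table and column names are assumptions chosen for this example, and SQLite is only a stand-in for whichever RDBMS you use; the actual framework schema is not shown in this post.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE global_parameter (      -- global parameters
    name  TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
CREATE TABLE process (               -- configuration metadata
    process_id   INTEGER PRIMARY KEY,
    process_name TEXT NOT NULL UNIQUE,
    max_parallel INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE process_module (        -- which templates run, and in what order
    process_id    INTEGER NOT NULL REFERENCES process(process_id),
    run_order     INTEGER NOT NULL,
    template_name TEXT NOT NULL,
    PRIMARY KEY (process_id, run_order)
);
CREATE TABLE run_history (           -- operational metadata
    run_id        INTEGER PRIMARY KEY,
    process_id    INTEGER NOT NULL REFERENCES process(process_id),
    template_name TEXT NOT NULL,
    started_at    TEXT NOT NULL,
    duration_sec  REAL,
    record_count  INTEGER,
    status        TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO global_parameter VALUES ('hive_database', 'lake_raw')")
conn.execute("INSERT INTO process VALUES (1, 'clarity demo for DF', 4)")
conn.executemany(
    "INSERT INTO process_module VALUES (?, ?, ?)",
    [(1, 1, "Build Object List"), (1, 2, "Get Object Definition")],
)

# Read the configured modules back in run order.
modules = [row[0] for row in conn.execute(
    "SELECT template_name FROM process_module WHERE process_id = 1 ORDER BY run_order"
)]
```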
Reusable Templates/Components – Some templates that are built in Talend include:
- Object Discovery — to identify the number of objects that need to be ingested from a database, or the files in a given directory
- Metadata Definitions — to pull metadata from an RDBMS, from delimited files, or from Excel mappings for fixed-length files
- Database Ingestions — Sqoop components to ingest data from RDBMS sources such as Oracle, SQL Server, MySQL, AS400, DB2 etc.
- File Ingestion — templates for fixed-length, delimited, XML and JSON files, etc.
- Change Data Capture — component to identify changes since the data was ingested in the last run along with metadata changes on source tables or source files
Common Services – the framework leverages services including:
- Restartability — the framework is completely re-startable based on the run history collected at the most granular level in the framework database
- Parallel processing — determines the optimal number of parallel jobs to run based on the configurations within the metadata store
- Dependency Management — runs the jobs in the sequence defined by the dependencies in the metadata store
- Indexing/Cataloging — creates an index and catalog using metadata management tools such as Talend Data Catalog/Atlas/Cloudera Navigator etc.
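Dependency management and restartability fit together naturally. The sketch below is an assumption about how such a service could work, not the framework's actual code: it orders jobs with Python's standard `graphlib` module and skips anything the run history already marks as complete. The job names and the `plan_run` function are hypothetical.

```python
from graphlib import TopologicalSorter

def plan_run(dependencies, completed):
    """Return the jobs still to run, in an order that respects dependencies."""
    order = TopologicalSorter(dependencies).static_order()
    return [job for job in order if job not in completed]

# Each job maps to the set of jobs it depends on (hypothetical names).
dependencies = {
    "build_object_list": set(),
    "get_object_definition": {"build_object_list"},
    "ingest_objects": {"get_object_definition"},
    "hive_usable": {"ingest_objects"},
}

first_run = plan_run(dependencies, completed=set())

# After a failure in the third job, a restart resumes from that job.
restart = plan_run(
    dependencies,
    completed={"build_object_list", "get_object_definition"},
)
```

In a real deployment the `completed` set would come from the run history in the framework database instead of being hard-coded.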
The master process is the job that the enterprise scheduler runs, passing it a Process ID. The master process pulls all the jobs, dependencies, parameters, etc. and runs the jobs in the order in which they are configured within the metadata store. Everything is controlled through this one process at run time. None of the child Talend jobs know what they are processing until the master process provides input to them at run time.
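Here is one way to picture the master process in plain Python. This is a sketch under assumptions: the `TEMPLATES` registry, the `run_master` function, the shape of `metadata_store` and the process ID 101 are all invented for illustration, and the child functions stand in for Talend jobs.

```python
def build_object_list(ctx):
    # Child job: learns which files to list only from the context it is handed.
    ctx["objects"] = sorted(ctx["params"]["files"])

def ingest_objects(ctx):
    # Child job: builds target locations from parameters supplied at run time.
    ctx["ingested"] = [f"{ctx['params']['target']}/{name}" for name in ctx["objects"]]

# Hypothetical registry mapping template names to generic child jobs.
TEMPLATES = {
    "Build Object List": build_object_list,
    "Ingest Objects": ingest_objects,
}

def run_master(process_id, metadata_store):
    """Look up a process by ID and run its modules in the configured order."""
    process = metadata_store[process_id]
    ctx = {"params": process["params"]}
    for _, template_name in sorted(process["modules"]):
        TEMPLATES[template_name](ctx)
    return ctx

# Hypothetical metadata: modules are stored as (run order, template name).
metadata_store = {
    101: {
        "params": {"files": ["b.txt", "a.txt"], "target": "landing"},
        "modules": [(2, "Ingest Objects"), (1, "Build Object List")],
    }
}

result = run_master(101, metadata_store)
```

The scheduler only ever needs to know the process ID; everything else is resolved from metadata.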
When a request comes in to ingest a new source system, it goes to the governance council. The council reviews the request and checks the data catalog to see if this data already exists in the data lake. If it is a new dataset, they enter the details in the framework database. The governance process is completely integrated with the data ingestion framework, thus creating a fully governed data lake.
How is the Data Ingestion Framework implemented in Talend?
Let’s say we receive a batch of pipe-delimited files from an external vendor on a DMZ server every day. The data consumer asks the governance council to set up ingestion for this new source. The council checks the data catalog and finds that this data does not yet exist in the lake. So, they set up a new process and its details in the framework database.
A new process called ‘clarity demo for DF’ has been created in the process table.
Parameters for this process have been entered with the location of files, type of files, delimiter and schema information.
The last step in setting up the process metadata is to create the modules and the sequence in which they need to run. In this example, the first template, “Build Object List”, gets the list of files located in the inbound directory; the second, “Get Object Definition”, gets the metadata of each of these files; the third, “Ingest Objects into S3”, ingests the data into S3; and the fourth, “Hive Usable”, creates compressed Hive tables.
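Written out as data, the setup for this example might look something like the sketch below. The process name and the four template names come from the example above; the directory path, the key names and the `schema_source` value are placeholders we made up, since the real values live in the framework database.

```python
# Hypothetical representation of the process metadata for this example.
process = {
    "name": "clarity demo for DF",
    "parameters": {
        "inbound_dir": "/data/inbound/vendor",  # placeholder path
        "file_type": "delimited",
        "delimiter": "|",
        "schema_source": "header_row",          # placeholder value
    },
    "modules": [
        {"seq": 3, "template": "Ingest Objects into S3"},
        {"seq": 1, "template": "Build Object List"},
        {"seq": 4, "template": "Hive Usable"},
        {"seq": 2, "template": "Get Object Definition"},
    ],
}

def module_sequence(proc):
    """Return template names in the order they are configured to run."""
    return [m["template"] for m in sorted(proc["modules"], key=lambda m: m["seq"])]

sequence = module_sequence(process)
```

The modules are deliberately stored out of order here to show that the run order comes from the `seq` value, not from where a row happens to sit.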
Now that we have set up the process metadata in the database, we will use it in a Talend template for ingesting these files. Here’s what the Talend job looks like:
The above template has three subjobs – (i) a pre-job, (ii) main subjob, and (iii) a post-job. The pre-job gets the required metadata from the framework database for the process and the list of modules that need to run. The main subjob is responsible for ingestion of data into S3 buckets. The post-job loads the run history for tracking and restartability. As you can see, no new code has been written to ingest new datasets. When the process completes successfully, data will be available within Hive for data stewards to analyze the data, profile the results to understand the anomalies and apply business glossary rules and definitions.
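The pre-job, main subjob and post-job pattern can be approximated in a few lines of Python. This is only a sketch of the idea, not a rendering of the Talend job: `run_with_history`, its arguments and the fields written to `history` are assumptions for illustration.

```python
import time

def run_with_history(process_id, load_metadata, main_subjob, history):
    """Run one process as pre-job, main subjob and post-job."""
    metadata = load_metadata(process_id)       # pre-job: fetch metadata for the process
    started = time.monotonic()
    status, records = "FAILED", 0
    try:
        records = main_subjob(metadata)        # main subjob: do the ingestion
        status = "SUCCESS"
    finally:
        history.append({                       # post-job: always record the outcome
            "process_id": process_id,
            "status": status,
            "records": records,
            "duration_sec": time.monotonic() - started,
        })
    return status

history = []
outcome = run_with_history(
    7,
    load_metadata=lambda pid: {"rows": ["a", "b", "c"]},  # stand-in for a database lookup
    main_subjob=lambda meta: len(meta["rows"]),           # stand-in for the ingestion step
    history=history,
)
```

Because the history entry is written in a `finally` block, a failed run is still recorded before the error propagates, which is what makes a later restart possible.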
The framework captures the metadata of every file and every table. Any time the metadata changes on the source system, the framework dynamically detects the change, creates a new metadata definition within Hive and tags it to the original tables. It also creates a view on top of the Hive tables for business users to query, and in many cases users will not even notice that anything changed underneath.
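The detection step can be thought of as a comparison between the last known definition and the current one. The sketch below shows that comparison in Python under simplifying assumptions: a definition is just an ordered list of column names, and the function names are hypothetical. It does not generate Hive DDL.

```python
def detect_schema_change(previous, current):
    """Compare two column lists and report added and removed columns."""
    added = [c for c in current if c not in previous]
    removed = [c for c in previous if c not in current]
    return added, removed

def view_columns(versions):
    """Union of columns across all definition versions, in first-seen order."""
    seen = []
    for columns in versions:
        for column in columns:
            if column not in seen:
                seen.append(column)
    return seen

v1 = ["member_id", "name", "zip"]           # definition captured on an earlier run
v2 = ["member_id", "name", "zip", "email"]  # source added a column since then

added, removed = detect_schema_change(v1, v2)
stable_view = view_columns([v1, v2])
```

A view built over the combined column list is what lets existing queries keep working when the source adds a column.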
If we have ingestion requirements around new file formats such as JSON, XML or industry standard formats such as HL7, all we have to do is build a new generic template in Talend, and the rest flows through the same ingestion process. Once we do this, the data lake becomes more like a data library in which every dataset is indexed and cataloged.
A robust and scalable Data Ingestion Framework needs to have the following characteristics:
- Single framework to perform all data ingestions consistently into the governed data lake
- Metadata-driven architecture that captures which datasets need to be ingested, when and how often to ingest them, how to capture the metadata of those datasets, and what credentials are needed to connect to the source systems
- Template design architecture to build generic templates that can read the metadata supplied in the framework and automate the ingestion process for different formats of data, both in batch and real-time
- Tracking metrics, events and notifications for all data ingestion activities
- Single consistent method to capture all data ingestion along with technical metadata, data lineage and governance
- Proper data governance with “search and catalogue” to find data within the data lake
- Data Profiling to collect the anomalies in the datasets so data stewards can look at them and come up with data quality and transformation rules
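As a small illustration of the profiling characteristic in the list above, here is a minimal Python sketch that counts missing and distinct values per column. The `profile` function and the sample rows are assumptions for illustration; a real profiling step would collect far more than this.

```python
def profile(rows, columns):
    """Collect simple per-column statistics for data stewards to review."""
    stats = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_empty = [v for v in values if v not in (None, "")]
        stats[col] = {
            "rows": len(values),
            "missing": len(values) - len(non_empty),
            "distinct": len(set(non_empty)),
        }
    return stats

# Hypothetical sample of ingested records.
rows = [
    {"id": "1", "state": "OH"},
    {"id": "2", "state": ""},
    {"id": "3", "state": "OH"},
]
report = profile(rows, ["id", "state"])
```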
In the next part of this series, we’ll discuss how to tie the data governance process into the data ingestion framework, and we’ll see how it enables organizations to build governed data lake solutions. While we are not getting into nitty-gritty details in these blog posts, the key takeaway is this: it’s vital to have frameworks like these to be successful in your data lake initiatives. Check back soon for the next installment!