In my previous blog “Beyond ‘The Data Vault’” I examined various data storage options and a practical architecture/design for an Enterprise Data Vault Warehouse. As you may have realized by now, I am quite smitten with this innovative data modeling methodology, and I recommend that anyone developing a ‘Data Lake’ or Data Warehouse on Big Data platforms consider it a critical design paradigm. Dan Linstedt has worked hard on expanding this concept and recently introduced Data Vault 2.0 (shameless plug), which extends the methodology, architecture, and data model (HUB/LNK/SAT) implementation and best practices.
Today’s massive acceleration and accumulation of data truly demands high technology, a reliable architecture, and clear best practices to accommodate it, access it, and gain ‘Real Value’ from it. Moving this voluminous data from point ‘A’ to ‘B’ is a serious challenge. Talend is designed for this challenge and embraces it head on. So where do Talend and “The Data Vault” converge? Well, let’s find out…
Talend Data Fabric
Data Integration is the foundation upon which data system architects and implementation specialists develop ETL/ELT processes for getting data from somewhere to elsewhere. We then add Data Quality, Data Governance, Master Data Management, and Web Services as needed, implementing many acronyms along the way: ESB, SOAP, REST, CDC, HDFS, SPARK, HIVE, and many more. Talend Data Fabric comprises a full-featured development platform you are obviously familiar with, so let’s leave all the marketing fluff and sales pitches to those who live and breathe that sort of thing. In my mind, it’s the technology, the methodology, and best practices that truly matter.
In the case of a Data Vault (DV), Talend is emerging as a key technology utilized by many companies to sort out their DV data processing requirements. There are 2 specific provisions of interest:
- Schema Generation – using external metadata, having the ability to synchronize a Data Vault model with the ETL/ELT toolset
- Job Templates – using the synchronized Data Vault model, generate and maintain Talend jobs for ingestion, manipulation, & subsequent down-stream processing
While today’s Talend does not yet support automated facilities to manage these needs, work has begun to examine how, where, and when to implement them. Talend is happy to be working closely with Dan Linstedt on these features and we hope to bring the most robust ‘Data Vault’ enabled data processing tool to the market soon. If I have any sway, that is. Meanwhile, many customers currently using Talend and their ‘Data Vault’ are happily demonstrating considerable success the old-fashioned way. They code it!
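To give a feel for what coding it by hand can involve, here is a minimal Python sketch of the Schema Generation idea: take external metadata describing a hub and emit DDL for the hub and its satellite. It is only an illustration, not a Talend feature or API. The `HubSpec` class, the `hub_`/`sat_` table prefixes, and the `_hk`, `load_dts`, and `record_source` column names are assumptions chosen for the example, loosely following common Data Vault 2.0 naming habits.

```python
from dataclasses import dataclass, field


@dataclass
class HubSpec:
    """Hypothetical metadata record describing one hub and its satellite."""
    name: str                                       # e.g. "customer"
    business_key: str                               # natural key column from the source
    attributes: dict = field(default_factory=dict)  # satellite column -> SQL type


def hub_ddl(spec: HubSpec) -> str:
    """Render CREATE TABLE for the hub: hash key, business key, audit columns."""
    return (
        f"CREATE TABLE hub_{spec.name} (\n"
        f"    {spec.name}_hk CHAR(32) PRIMARY KEY,\n"
        f"    {spec.business_key} VARCHAR(100) NOT NULL,\n"
        f"    load_dts TIMESTAMP NOT NULL,\n"
        f"    record_source VARCHAR(50) NOT NULL\n"
        f");"
    )


def sat_ddl(spec: HubSpec) -> str:
    """Render CREATE TABLE for the satellite holding the descriptive attributes."""
    cols = "".join(f"    {col} {sql_type},\n" for col, sql_type in spec.attributes.items())
    return (
        f"CREATE TABLE sat_{spec.name} (\n"
        f"    {spec.name}_hk CHAR(32) NOT NULL REFERENCES hub_{spec.name},\n"
        f"    load_dts TIMESTAMP NOT NULL,\n"
        f"{cols}"
        f"    record_source VARCHAR(50) NOT NULL,\n"
        f"    PRIMARY KEY ({spec.name}_hk, load_dts)\n"
        f");"
    )


if __name__ == "__main__":
    spec = HubSpec("customer", "customer_id", {"full_name": "VARCHAR(100)"})
    print(hub_ddl(spec))
    print(sat_ddl(spec))
```

Keeping the model in one metadata structure and rendering everything from it is what makes synchronization plausible: change the metadata, regenerate, and the schema and the jobs built against it can be kept in step.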
Talend has joined the WWDVC 2016 (World Wide Data Vault Consortium), held in Stowe, Vermont this May 25th through May 29th, as a Platinum Sponsor. As an event sponsor presenting information on the product roadmap, features (for those new to Talend), and specifics on how to build data integration jobs for Data Vaults, we expect to finally bring ETL/ELT technology to this prestigious audience. A three-hour hands-on session will demonstrate how to build Talend jobs for both ‘Relational’ and ‘Big Data’ Data Vault environments. Ed Ost, Talend’s ‘Director, Worldwide Technology Alliances, Channel Partners’, has fashioned the simple Enterprise Data Vault depicted in my previous blog into a sandbox for attendees to learn from and play in.
Here are some sample implementations using the Amazon AWS cloud-based infrastructure:
The Relational flow demonstrates a dump/load technique in which the source data lands in an AWS S3 bucket and is then processed into a Data Vault defined and stored in RDS. From there one can ELT the data directly into a de-normalized RedShift Data Mart for analytical queries.
The Big Data flow demonstrates a direct read/load from the source data using ‘Sqoop’ into an S3/RedShift Data Vault that has Point-In-Time and Bridge tables to allow equi-join queries. It also uses direct ELT to populate a de-normalized RedShift Data Mart for analytical queries.
The Big Data flow using Spark demonstrates a variant where the source data is read and loaded into an S3/Hive Data Vault using Spark, followed by a more traditional ETL process that populates a de-normalized RedShift Data Mart for analytical queries.
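For readers new to the Point-In-Time (PIT) tables mentioned in the flows above: a PIT table records, for each hub key and snapshot date, which satellite row was in effect, so that downstream queries can join on equality rather than on date ranges. The following Python sketch shows only that lookup logic on in-memory data. The function name `build_pit` and the sample keys and dates are made up for illustration; in the flows described here the equivalent work would be done in the database or in a Talend job.

```python
from bisect import bisect_right
from datetime import date


def build_pit(sat_load_dates, snapshot_dates):
    """Return PIT rows: (hub_key, snapshot_date, satellite load date in effect).

    sat_load_dates maps a hub key to the load dates of its satellite rows.
    The load date is None when no satellite row existed yet at the snapshot.
    """
    rows = []
    for hub_key, load_dates in sat_load_dates.items():
        ordered = sorted(load_dates)
        for snap in snapshot_dates:
            # Number of load dates on or before the snapshot date.
            idx = bisect_right(ordered, snap)
            effective = ordered[idx - 1] if idx > 0 else None
            rows.append((hub_key, snap, effective))
    return rows


if __name__ == "__main__":
    # Hypothetical satellite history for one customer hub key.
    history = {"C1": [date(2016, 1, 5), date(2016, 3, 1)]}
    snapshots = [date(2016, 1, 1), date(2016, 2, 1), date(2016, 4, 1)]
    for row in build_pit(history, snapshots):
        print(row)
```

Once such rows are materialized, a query can join hub, PIT, and satellite on the hub key and the exact load date, which is the equi-join the PIT table exists to enable.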
This year’s event will be special, not because I am speaking, no; but because W. H. (Bill) Inmon is! The Father of Data Warehousing himself. Other Talend partners like Analytix/DS and Snowflake are also sponsoring this event, so it should prove to be the absolute best place to be for meeting, networking, and discussing Data Vaults with thought leaders from around the world.
Is Your “Data Lake” a Swamp?
Now that we’ve looked at how Talend and the ‘Data Vault’ work together, I want to address a key topic that discomforts me. As ‘buzz’ words go, the ‘Data Lake’ is fast becoming the most annoying, misunderstood, and misused term yet. The essence of the idea is to dump ALL your data into this puddle, this pond, this lake, this ocean; or perhaps this swamp, this cesspool! Without technology, methodology, and best practices, a swamp is exactly what you’ll get. I do like the idea of dumping everything into a Data Lake. In fact it adheres to one of Dan Linstedt’s precepts: “100% of the data 100% of the time”. If you put all your data into the lake, then you’ll already have it when a business need arises. No downtime required! The real focus should be on the query instead. I’m sure you’ve guessed that Talend and the ‘Data Vault’ are the answer. Creating a Data Lake based upon the Data Vault model and methodology in a Big Data environment, with robust data integration tools (Talend) plus pliable best practices, is, I believe, the right way to go. Swim in the lake, don’t drown in it!
So, not my usual lengthy blog; yes, I can keep it brief. The reality is: ‘busy is, as busy does’. I hope to see many of you Data Vault enthusiasts at the WWDVC in May. If not, let’s keep the dialog going here; post your comments, ask your questions, raise the debate! I’d be more than happy to respond in kind.