What is “The Data Vault” and why do we need it?

For anything you might want to do, understanding the problem and using the right tools is essential.  The methodologies and best practices that inevitably arise become the catalyst for innovation and superior results.  Database systems, particularly data warehouse systems, are no exception; yet do the best data modeling methodologies of the past still offer the best solution today?

Big Data, arguably a very hot topic, will clearly play a significant part in the future of business intelligence solutions.  Frankly, reliance upon Inmon’s relational 3NF and Kimball’s STAR schema strategies alone simply no longer applies.  Knowing and using the best data modeling methodology is a key design priority and has become critical to successful implementations.  Persisting with outdated data modeling methodologies is like putting wagon wheels on a Ferrari.

Today, virtually all businesses make money using the Internet.  Harvesting the data they create in an efficient way and making sense of it has become a considerable IT challenge.  One can easily debate the pros and cons of the data modeling methodologies of the past, but that will not be the focus of this blog.  Instead, let’s talk about something relatively new that offers a way to easily craft adaptable, sensible data models that energize your data warehouse:  The Data Vault!

[Image: Data Vault, adaptable to change]

Enterprise Data Warehouse (EDW) systems aim to provide true Business Intelligence (BI) for the data-driven enterprise.  Companies must act on the critical metrics ingrained in this vital, vibrant data.  Providing an essential data integration process that ultimately supports a variety of reporting requirements is a key goal of these Enterprise Data Warehouse systems.  Building them involves significant design, development, administration, and operational effort.  When upstream business systems, structures, or rules change, fail to provide consistent data, or require new systems integration, the reengineering that follows presents us with problem #1: the one constant is change, so how well can an EDW/BI solution adapt?

"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." Charles Darwin

Consumption and analysis of business data by diverse user communities has become critical to maintaining a competitive edge, yet today’s technology often demands highly trained end users.  Capturing, processing, transforming, cleansing, and reporting on this data may be well understood, but in most cases the sheer volume of data is overwhelming.  Yup, problem #2: Really Big Data, often characterized by Volume, Velocity, Variety, Variability, Veracity, Visualization, and Value!

Crafting effective, efficient EDW/BI systems that are simple to use and report against quickly becomes a daunting technical ordeal, even for veteran engineering teams.  Many integrated technologies are required: database systems, data processing (ETL) tools like Talend, various programming languages, administration, reporting, and interactive graphics software, plus high-performance networks and powerful computers with very large storage capacities.  The design, creation, delivery, and support of robust, effortless EDW/BI systems for simplified, intelligent use is, you guessed it, problem #3: Complexity!

Often we see comprehensive and elegant solutions delivered to business users that fail to address the true needs of the business.  We’re told that’s just the way it is, owing to technical requirements (limitations; wink, wink) and/or design parameters (lack of features; nudge, nudge).  Hence problem #4: The Business Domain; fit the data to meet the needs of the business, not the other way around!

Furthermore, as upstream systems change (and they will), as EDW/BI technology plows ahead (and it must), and as the dynamic complexities involved prevail (relentlessly), new data sources occasionally need to be added to the mix.  They are usually unpredicted and unplanned for.  The integration impact can be enormous, often requiring a complete regeneration of the aggregated data; hence problem #5: Flexibility, or the lack thereof!

So how do we solve these problems?  Well …

Bill Inmon, widely regarded as the father of data warehousing, defines a data warehouse as:

A subject oriented, nonvolatile, time-variant collection of data in support of management’s decisions
(http://en.wikipedia.org/wiki/Bill_Inmon)


[Image: Star schema]

Ralph Kimball (http://en.wikipedia.org/wiki/Ralph_Kimball), a pioneering data warehousing architect, developed the “dimensional modeling” methodology now regarded as the de facto standard in the area of decision support.  The Dimensional Model (called a “star schema”) is different from Inmon’s “normalized modeling” (sometimes loosely called a “snowflake schema”) methodology.  In Kimball’s Star Schema, transactional data is partitioned into aggregated “facts” with referential “dimensions” surrounding and providing descriptors that define the facts.  The Normalized Model (3NF, or “third normal form”) stores data in related tables following the relational database design rules established by E. F. Codd and Raymond F. Boyce in the early 1970s, which eliminate data redundancy.  Both methodologies foster vigorous debate among EDW/BI architects as to which is best, and both have weaknesses when dealing with inevitable changes in the systems feeding the data warehouse and in cleansing data to conform to their strict requirements.

Further, the OLAP cube (for “online analytical processing”) is a data structure that allows fast analysis of data from multiple perspectives.  The cube structure is created from either a Star or Snowflake Schema stored as metadata, from which one can view or “pivot” the data in various ways.  Generally, cubes have one time-based dimension that supports a historical representation of the data.  Creating OLAP cubes can be very expensive, and they often contain a significant amount of data that is of little or no use.  The 80/20 rule appears to hold true in many cases (only 20% of the OLAP cube data proves useful), which begs the question: built upon a traditional architecture, does an OLAP cube truly deliver sufficient ROI?  Often, the answer is a resounding NO!  Durable EDW/BI systems must deliver real value.

 

A Fresh Approach

The Data Vault is a hybrid data modeling methodology that provides historical data representation from multiple sources and is designed to be resilient to environmental change.  Originally conceived in 1990 and released in 2000 as a public-domain modeling methodology, it was created by Dan Linstedt, who describes a resulting Data Vault database as:

A detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.  It is a hybrid approach encompassing the best of breed between 3NF and Star Schemas.  The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.
(http://en.wikipedia.org/wiki/Data_Vault_Modeling)

Focused on the business process, the Data Vault, as a data integration architecture, has robust standards and definitional methods that unite information in order to make sense of it.  The Data Vault model comprises three basic table types (a minimal schema sketch follows the list below):

[Image: the Data Vault model]

HUB (blue): contains a list of unique business keys, each with its own surrogate key.  Metadata describing the origin of the business key, or record ‘source’, is also stored to track where and when the data originated.

LNK (red): establishes relationships between business keys (typically Hubs, though Links can link to other Links), essentially describing a many-to-many relationship.  Links are often used to deal with changes in data granularity, reducing the impact of adding a new business key to a linked Hub.

SAT (yellow): holds descriptive attributes that can change over time (similar to a Kimball Type II slowly changing dimension).  Where Hubs and Links form the structure of the data model, Satellites contain the temporal and descriptive attributes, including metadata linking them to their parent Hub or Link tables.  Metadata attributes recording the date a record became valid and the date it expired give Satellites powerful historical capabilities, enabling queries that go ‘back in time’.
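To make the three table types concrete, here is a minimal sketch of what a Hub, a Link, and a Satellite might look like as relational tables.  The table and column names (hub_customer, lnk_customer_order, sat_customer, load_date, record_source, and so on) are illustrative assumptions for this post, not a prescribed Data Vault layout.

```python
# A minimal, illustrative sketch of the three Data Vault table types using SQLite.
# All names here are assumptions for demonstration purposes only.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# HUB: one row per unique business key, plus metadata about where/when it was first seen.
cur.execute("""
CREATE TABLE hub_customer (
    hub_customer_key  TEXT PRIMARY KEY,        -- surrogate (hash) key
    customer_bk       TEXT NOT NULL UNIQUE,    -- the business key itself
    load_date         TEXT NOT NULL,           -- when the key first arrived
    record_source     TEXT NOT NULL            -- which system it came from
)""")

# LINK: a many-to-many association between business keys (here, customer and order).
cur.execute("""
CREATE TABLE lnk_customer_order (
    lnk_customer_order_key  TEXT PRIMARY KEY,
    hub_customer_key        TEXT NOT NULL REFERENCES hub_customer,
    hub_order_key           TEXT NOT NULL,
    load_date               TEXT NOT NULL,
    record_source           TEXT NOT NULL
)""")

# SATELLITE: descriptive attributes that change over time, attached to a parent Hub.
cur.execute("""
CREATE TABLE sat_customer (
    hub_customer_key  TEXT NOT NULL REFERENCES hub_customer,
    load_date         TEXT NOT NULL,           -- date the record became valid
    load_end_date     TEXT,                    -- date it expired (NULL = current)
    record_source     TEXT NOT NULL,
    customer_name     TEXT,
    customer_city     TEXT,
    PRIMARY KEY (hub_customer_key, load_date)
)""")
conn.commit()
```

A ‘back in time’ query is then simply a filter on the Satellite’s load_date and load_end_date columns for the point in time of interest.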

There are several key advantages to the Data Vault approach:

- Simplifies the data ingestion process

- Removes the cleansing requirement of a Star Schema

- Instantly provides auditability for HIPAA and other regulations

- Puts the focus on the real problem instead of programming around it

- Easily allows for the addition of new data sources without disruption to existing schema

Simply put, the Data Vault is both a data modeling technique and a methodology that accommodates historical data, auditing, and tracking of data.

The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework
Bill Inmon

 

Adaptable

By separating business keys (which are generally static) and the associations between them from their descriptive attributes, a Data Vault confronts the problem of change in the environment.  With these keys as the structural backbone of the data warehouse, all related data can be organized around them.  These Hubs (business keys), Links (associations), and Satellites (descriptive attributes) support a highly adaptable data structure while maintaining a high degree of data integrity.  Dan Linstedt often compares the Data Vault to a simplistic view of the brain, where neurons correspond to Hubs and Satellites and dendrites are Links (vectors of information); some Links are like synapses (vectors in the opposite direction).  Links can be created or dropped on the fly as business relationships change, morphing the data model as needed without any impact on the existing data structures.  Problem #1 Solved!
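As a sketch of that adaptability, a brand-new business relationship becomes a brand-new Link table; nothing that already exists needs to be altered.  The names below (lnk_customer_loyalty and its columns) are assumptions, consistent with the illustrative schema shown earlier.

```python
# Illustrative only: a new business relationship is captured by creating a new Link
# table; existing Hub, Link, and Satellite tables are left completely untouched.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE lnk_customer_loyalty (
    lnk_customer_loyalty_key  TEXT PRIMARY KEY,   -- hash of the two business keys
    hub_customer_key          TEXT NOT NULL,      -- points at an existing Hub
    hub_loyalty_program_key   TEXT NOT NULL,      -- points at another existing Hub
    load_date                 TEXT NOT NULL,
    record_source             TEXT NOT NULL
)""")
conn.commit()
# Retiring the relationship later is equally non-invasive:
# cur.execute("DROP TABLE lnk_customer_loyalty")
```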

 

Big Data

Data Vault 2.0 arrived on the scene in 2013, incorporating seamless integration of Big Data technologies along with methodology, architecture, and best-practice implementations.  Through this adoption, very large amounts of data can easily be incorporated into a Data Vault designed for storage on products like Hadoop, Infobright, MongoDB, and many other NoSQL options.  By eliminating the cleansing requirements of a Star Schema design, the Data Vault excels with huge data sets, decreasing ingestion times and enabling parallel insertions that leverage the power of Big Data systems.  Problem #2 Solved!
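One practical reason parallel insertion works so well is that Data Vault 2.0 favors deterministic hash keys computed from the business key rather than database-generated sequence numbers, so Hub, Link, and Satellite loads do not have to wait on one another.  Here is a minimal sketch of such a key; the MD5 choice, the delimiter, and the normalization rules are assumptions for illustration.

```python
import hashlib

def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Deterministic surrogate key derived only from the business key.

    Because the key does not depend on a database sequence, Hubs, Links,
    and Satellites can be loaded in parallel and in any order, on an RDBMS
    or on a Big Data platform.
    """
    normalized = delimiter.join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same customer always hashes to the same Hub key, so two loaders
# running in parallel can never disagree about it.
print(hash_key("CUST-0042"))                  # a hub_customer_key
print(hash_key("CUST-0042", "ORD-2015-07"))   # a lnk_customer_order_key
```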

 

Simplification

Crafting an effective, efficient Data Vault model can be done quickly once you understand the basics of the three table types: Hub, Satellite, and Link!  Identifying the business keys first and defining the Hubs is always the best place to start.  From there, Hub-Satellites represent source table columns that can change, and finally Links tie it all together.  Remember that Link-Satellite tables are also possible.  Once you’ve got these concepts, it’s easy.  After you’ve completed your Data Vault model, the next common step is to build the ETL data integration process to populate it.  While a Data Vault data model is not limited to EDW/BI solutions, anytime you need to move data from some source into some target, a data integration process is generally required.  Talend’s mission is to connect the data-driven enterprise.

With its suite of integration software, Talend simplifies the development process, reduces the learning curve, and decreases total cost of ownership with a unified, open, and predictable ETL platform.  A proven ETL technology, Talend can certainly be used to populate and maintain a robust EDW/BI system built upon a Data Vault data model.  Problem #3 Solved!
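As a rough sketch of what such a data integration process does, whether built with a tool like Talend or hand-coded, a Hub load inserts only business keys it has never seen before, and a Satellite load inserts a new row only when the descriptive attributes have actually changed.  The in-memory structures and names below are assumptions for illustration, not a reference implementation.

```python
from datetime import date

hub_customer = {}   # business key -> hub row
sat_customer = []   # satellite rows; the newest change per key is appended last

def load_hub(business_key, record_source, load_date):
    """Insert the business key only if it has never been seen before."""
    if business_key not in hub_customer:
        hub_customer[business_key] = {"hub_key": business_key,  # a hash key in practice
                                      "load_date": load_date,
                                      "record_source": record_source}

def load_satellite(business_key, attrs, record_source, load_date):
    """Append a new Satellite row only when the attributes have changed."""
    history = [r for r in sat_customer if r["hub_key"] == business_key]
    latest = max(history, key=lambda r: r["load_date"], default=None)
    if latest is None or {k: latest[k] for k in attrs} != attrs:
        sat_customer.append({"hub_key": business_key,
                             "load_date": load_date,
                             "record_source": record_source,
                             **attrs})

# One incoming source row; good, bad, or ugly, it is ingested as-is.
today = date(2015, 9, 1).isoformat()
load_hub("CUST-0042", "crm", today)
load_satellite("CUST-0042", {"customer_name": "Acme Corp", "customer_city": "Dover"},
               "crm", today)
```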

 

Your Business

The Data Vault essentially defines the ontology of an enterprise in that it describes the business domain and the relationships within it.  Business rules must be processed before populating a Star Schema; with a Data Vault, you can push them downstream, after EDW ingestion.  An additional Data Vault philosophy is that all data is relevant, even if it is wrong.  Dan Linstedt suggests that wrong data is a business problem, not a technical one.  I agree!  An EDW is really not the right place to fix (cleanse) bad data.  The simple premise of the Data Vault is to ingest 100% of the source data 100% of the time: good, bad, or ugly.  Highly relevant in today’s world, auditability and traceability of all the data in the data warehouse thus become a standard requirement.  This data model is architected specifically to meet the needs of today’s EDW/BI systems.  Problem #4 Solved!
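As a small sketch of pushing business rules downstream, the raw Satellite keeps values exactly as they were received, and a rule such as standardizing country codes is applied only when data flows out of the vault toward a Star Schema or report.  The sample rows and the rule itself are assumptions for illustration.

```python
# Raw Satellite rows are kept exactly as received; cleansing happens downstream.
RAW_SAT_ROWS = [
    {"hub_key": "a1", "country": "USA",  "load_date": "2015-09-01"},
    {"hub_key": "a2", "country": "U.S.", "load_date": "2015-09-01"},  # "bad" data is kept too
]

# An assumed business rule, owned by the business, applied on the way out.
COUNTRY_RULE = {"USA": "US", "U.S.": "US", "UNITED STATES": "US"}

def to_dimension(raw_rows):
    """Apply business rules while deriving a reporting structure from the vault."""
    for row in raw_rows:
        yield {"hub_key": row["hub_key"],
               "country_code": COUNTRY_RULE.get(row["country"].upper(), "UNKNOWN")}

print(list(to_dimension(RAW_SAT_ROWS)))
```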
 

To understand the Data Vault is to understand the business

(http://danlinstedt.com)

 

Flexible

The Data Vault methodology is based on SEI/CMMI Level 5 best practices, combining many of its components with best practices from Six Sigma, TQM, and agile SDLC.  Data Vault projects have short, controlled release cycles and can consist of a production release every two to three weeks, automatically adopting the repeatable, consistent, and measurable practices expected at CMMI Level 5.  When new data sources need to be added, similar business keys are likely; new Hubs, Satellites, and Links can be created and then linked to existing Data Vault structures without any change to the existing data model.  Problem #5 Solved!
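As an illustrative sketch, onboarding a new source that shares a business key with an existing Hub typically means adding a source-specific Satellite (plus any genuinely new Hubs or Links); the tables already in place are neither altered nor reloaded.  The names below are assumptions, as in the earlier sketches.

```python
# Illustrative only: a new "billing" source describing customers we already track
# gets its own Satellite on the existing customer Hub; nothing existing changes.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE sat_customer_billing (
    hub_customer_key  TEXT NOT NULL,     -- same key the existing Hub already holds
    load_date         TEXT NOT NULL,
    load_end_date     TEXT,
    record_source     TEXT NOT NULL DEFAULT 'billing',
    credit_limit      REAL,
    payment_terms     TEXT,
    PRIMARY KEY (hub_customer_key, load_date)
)""")
conn.commit()
```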

 

Conclusion

The Data Vault modeling technique and methodology address the problems we identified above:

- It adapts to a changing business environment

- It supports very large data sets

- It simplifies the EDW/BI design complexities

- It increases usability by business users because it is modeled after the business domain

- It allows for new data sources to be added without impacting the existing design

This technological advancement is already proving to be highly effective and efficient.  Easy to design, build, populate, and change, the Data Vault is a clear winner.  Very Cool!  Do you want one?

Visit http://learndatavault.com or http://www.keyldv.com/lms for much more on Data Vault modeling and methodology.

 


Comments

DanielUpton
@Frans, Data Vault makes the process of designing, loading, and evolving the DWH simpler. In exchange for this, yes, it does require additional tables, and this has often made it intimidating to communicate to those typically finding comfort in either 3NF or dimensional data models for the communication of design satisfying requirements. Daniel Upton
Dale Anderson
@Claude & @Brice I do not specifically suggest that Data Vault should or does replace the STAR Schema methodology. In fact I believe it can enhance it. STAR schemas are notorious for being hard to build and keep current with changes in systems upstream and in the required data cleansing. My experience with Data Vaults is to place them BEFORE populating a STAR schema. This way you have the appropriate auditing, business key relationships, and adaptability that can strengthen a FACT table. Also as there are so many great BI visualization tools that target STAR schemas, so don't throw them away. d;)
Claude
I like the Data Vault methodology very much as a means of storing data into an auditable system of records database with history. Maintaining the data in its original format makes it especially suitable for auditing. I do intend to recommend this to clients in the future. That said, I'm afraid that you're overselling this methodology if you offer it as an alternative to a star schema. The ETL is there for a reason and that reason doesn't go away because you have a hub, link, and spoke schema. If there is a change upstream then of course the Data Vault is easy to change; since it contains no business interpretation. At some point, this wonderfully uncleansed data is going to have to used in reports - and you are going to have to either have cleaned the data before or you'll wind up leaving it to the business users to do so. If there's a business change, if you don't have a star schema to change, you'll have to change all your reporting queries to reflect the new business realities - as task far more daunting to me than changing the star schema. And star schema or no star schema, Data Vault does nothing to prevent the need for this change downstream. I see great value in using the data vault as a unified data source for the star schema, but I would never recommend anyone use the vault as a reporting platform.
Brice.M
Just to be sure I got it well, DV is not supposed to replace an old fashioned Star Model. It is more about a new methodology to load data into the Staging area or the ODS depending on your system. I can't see how it could fit 100% of the business needs if no cleansing is done. In case of replacing ODS and reporting done on it I say a big YES, or at least a big WHY NOT as modeling DV looks easy enough. But again as I got it, it cannot replace a STAR model with cleaned data on which Reporting tools will be based on. Am I right ?
Dale Anderson
Edgar, Thank you for reading my blog and your comments. Data Vaults are a ‘methodology’ and not a ‘technology’ and actually do simplify the alternative ‘STAR’ schemas in many ways. My main concern with Fact and Dimension tables is the impact to them when upstream systems change. All too often I’ve had to redesign, truncate, and repopulate STAR schemas to accommodate these changes. This was not adaptive enough for my success goals. Also the cleansing requirements required to ensure aggregations are accurate imposed tremendous ETL development which becomes more complex when you factor in the requirement to enable both Full and Incremental load processing. Data Vaults eliminate these issues effectively, thus my conclusion is that simplification is achieved. I did not say it was simple. Sure, there are many great tools out there that provide highly useful capabilities for STAR schemas, like SAP, however the cost for these tools, plus the cost to build and maintain successful data warehouses using these can be prohibitive. Many companies simply can’t afford the software and the people needed to craft, maintain, and use them. Data Vaults can be designed and built quickly. Data Vaults are also highly adaptive. Clearly this efficient use of resources lowers cost and delivers more in a timely way. I believe that once data is placed into a Data Vault, it becomes a suitable source for a STAR schema which can be built and populated repeatedly thus enabling all those great BI tools that use them. Maybe it’s possible that your experience with Data Vaults was not as successful due to a design and/or implementation that may not have followed the methodology correctly. Without a proper understanding of how to design your Data Vault model, it is possible to make matters worse, not better. But that is true with any methodology, including Kimball’s. Also please read another related post, Beyond “The Data Vault”: http://www.talend.com/blog/2015/08/14/beyond-%E2%80%9Cthe-data-vault%E2%80%9D Cheers!
edgar blumenthal
After three DataVault cycles: DV was an interesting method for DIY DWH STAGING building , but i also confirm utterly NOT to see the Simplification! For working good , you need old-school rdbms-es, SH Satellites on Hub tables, SL Satellite on Link tables, etc. including their DataFlow SH and SL Transformations. A Reporting Layer [what Reporting Tools for DV??, so you need a StarDimensional Datamodel!!! and expensive BO/BI4 on top ].....And lot of customizing, DIY implementation of authorizations, variables, index ranges, index management!!!, workflow and awkward auditing. All which you get of of the box with e.g. SAP BW! Not to speak about the risks for the end-data result. Which frequently is compromised by all the complexity in the real DV world!! SAP HANA implements BI3 [=360degree EPM] and ETL differently, Generating [!] SQL and sql-scripts , using In-Memory real-time aggregations etc.etc. making the copy-DWh redundant and allowing on the fly Report and Datamodel Mash Up modifications. In HANA SP9 we can commit these VirtualModels back to persistance. So basically we are set for the next 20 years! That is time and cost-saving! kind regards, edgar in dwh since 1996
Dale Anderson
Thank you for your comment! The real value is in the ability to create and populate a DV quickly and its high degree of adaptability. Having only the 3 table types simplifies its design and construction. Storing data within the model becomes quite straight forward and does not require any data cleansing. Plus when upstream systems change (as they often do), the potential "re-do" (often needed for a STAR schema) is not required; Adding or changing data stores is non-disruptive. From my experience in EDW construction and maintenance, I find this to be extremely helpful; thus simplified. Please read my 2nd blog on this topic "Beyond the Data Vault" for more detail on this.
Frans
I like datavault, but I can't see how it's bringing Simplification. For sure it's more standardised, but I can't say it's making a DWH more simple.
