What is “The Data Vault” and why do we need it?
For anything you might want to do, understanding the problem and using the right tools is essential. Resulting methodologies and best practices that inevitably arise become the catalyst for innovation and superior accomplishments. Database systems, particularly data warehouse systems are no exception, yet does the best data modeling methodologies of the past offer the best solution today?
Big Data, agreeably a very hot topic, will clearly play a significant part in the future of business intelligence solutions. Frankly the reliance upon Inmon’s Relational 3NF and Kimball’s STAR schema strategies simply no longer apply. Using and knowing how to use the best data modeling methodology is a key design priority and has become critical to successful implementations. Persisting with outdated data modeling methodologies is like putting wagon wheels on a Ferrari.
Today, virtually all businesses make money using the Internet. Harvesting the data they create in an efficient way and making sense of it has become a considerable IT challenge. One can easily debate the pros and cons involved in the data modeling methodologies of the past, but that will not be the focus of this blog. Instead let’s talk about something relatively new that offers a way to easily craft adaptable, sensible, data models that energize your data warehouse: The Data Vault!
Enterprise Data Warehouse (EDW) systems aim to provide true Business Intelligence (BI) for the data-driven enterprise. Companies must address critical metrics ingrained in this vital, vibrant data. Providing an essential data integration process that eventually supports a variety of reporting requirements is a key goal for these Enterprise Data Warehouse systems. Building them involves significant design, development, administration, and operational effort. When upstream business systems, structures, or rules change, fail to provide consistent data, or require new systems integration solutions, the minimum reengineering requirements present us with problem #1: The one constant is change; so how well can an EDW/BI solution adapt?
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." Charles Darwin
Consumption and analysis of business data by diverse user communities has become a critical reality to maintain a competitive edge yet technological realities today often require highly trained end-users. Capturing, processing, transforming, cleansing, and reporting on this data may be understandable, but in most cases the sheer volume of data can be overwhelming; Yup, problem #2: Really Big Data; often characterized as: Volume, Velocity, Variety, Variability, Veracity, Visualization, & Value!
Crafting effective and efficient EDW/BI systems, simplified for usability and reporting on this data, quickly becomes a daunting and often difficult technical ordeal even for veteran engineering teams. Several integrated technologies are required from database systems, data processing (ETL) tools like Talend, various programming languages, administration, reporting, and interactive graphics software to high performance networks and powerful computers having very large storage capacities. The design, creation, delivery, and support of robust, effortless EDW/BI systems for simplified, intelligent use are, you guessed it; problem #3: Complexity!
Often we see comprehensive and elegant solutions delivered to the business user that fails to understand the true needs of the business. We’re told that’s just the way it is due to technical requirements (limitations; wink, wink) and/or design parameters (lack of features; nudge, nudge). Hence; problem #4: The Business Domain; fit the data to meet the needs of the business, not the other way around!
Furthermore, as upstream systems change (and they will), as EDW/BI technology plows ahead (and they must), as the dynamic complexities involved prevail (relentlessly), every so often new data sources need to be added to the mix. These are usually unpredicted and unplanned for. The integration impact can be enormous often requiring complete regeneration of the aggregated data; hence, problem #5: Flexibility; or the lack there of!
So how do we solve these problems? Well …
Bill Inmon widely regarded as the father of data warehousing, defines a data warehouse as:
“A subject oriented, nonvolatile, time-variant collection of data in support of management’s decisions”
Ralph Kimball (http://en.wikipedia.org/wiki/Ralph_Kimball), a pioneering data warehousing architect, developed the “dimensional modeling” methodology now regarded as the de-facto standard in the area of decision support. The Dimensional Model (called a “star schema”) is different from Inman’s “normalized modeling” (sometimes called a “snowflake schema”) methodology. In Kimball’s Star Schema, transactional data is partitioned into aggregated “facts” with referential “dimensions” surrounding and providing descriptors that define the facts. The Normalized Model (3NF or “third normal form”) stores data in related “tables” following relational database design rules established by E. F. Codd and Raymond F. Boyce in the early 1970’s that eliminate data redundancy. Fostering vigorous debate amongst EDW/BI Architects as to which methodology is best, both have weakness when dealing with inevitable changes in the systems feeding the data warehouse and in cleansing data to conform to strict methodology requirements.
Further, the OLAP cube (for “online analytical processing”) is a data structure that allows fast analysis of data from multiple perspectives. The cube structure is created from either a Star or Snowflake Schema stored as metadata from which one can view or “pivot” the data in various ways. Generally cubes have one time based dimension that supports a historical representation of data. Creating OLAP cubes can be very expensive and often create a significant amount of data that is of little or no use. The 80/20 rule appears in many cases to hold true (where only 20% of the OLAP cube data proves useful) which begs the question: Built upon a traditional architecture does an OLAP cube truly deliver sufficient ROI? Often, the answer is a resounding, NO! Durable EDW/BI systems must deliver real value.
A Fresh Approach
The Data Vault is a hybrid data modeling methodology providing historical data representation from multiple sources designed to be resilient to environmental changes. Originally conceived in 1990 and released in 2000 as a public domain modeling methodology, Dan Linstedt, its creator, describes a resulting Data Vault database as:
“A detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3NF and Star Schemas. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.”
Focused on the business process, the Data Vault as a data integration architecture, has robust standards and definitional methods which unite information in order to make sense if it. The Data Vault model is comprised of three basic table types:
HUB (blue): containing a list of unique business keys having its own surrogate key. Metadata describing the origin of the business key, or record ‘source’ is also stored to track where and when the data originated.
LNK (red): establishing relationships between business keys (typically hubs, but links can link to other links); essentially describing a many-to-many relationship. Links are often used to deal with changes in data granularity reducing the impact of adding a new business key to a linked Hub.
SAT (yellow): holding descriptive attributes that can change over time (similar to a Kimball Type II slowly changing dimension). Where Hubs and Links form the structure of the data model, Satellites contain temporal and descriptive attributes including metadata linking them to their parent Hub or Link tables. Metadata attributes within a Satellite table containing a date the record became valid and a date it expired provide powerful historical capabilities enabling queries that can go ‘back-in-time’.
There are several key advantages to the Data Vault approach:
- Simplifies the data ingestion process
- Removes the cleansing requirement of a Star Schema
- Instantly provides auditability for HIPPA and other regulations
- Puts the focus on the real problem instead of programming around it
- Easily allows for the addition of new data sources without disruption to existing schema
Simply put, the Data Vault is both a data modeling technique and methodology which accommodates historical data, auditing, and tracking of data.
“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework”
Through the separation of business keys (as they are generally static) and the associations between them from their descriptive attributes, a Data Vault confronts the problem of change in the environment. Using these keys as the structural backbone of a data warehouse all related data can be organized around them. These Hubs (business keys), Links (associations), and SAT (descriptive attributes) support a highly adaptable data structure while maintaining a high degree of data integrity. Dan Linstedt often correlates the Data Vault to a simplistic view of the brain where neurons are associated with Hubs and Satellites and where dendrites are Links (vectors of information). Some Links are like synapses (vectors in the opposite direction). They can be created or dropped on the fly as business relationships change automatically morphing the data model as needed without impact to the existing data structures. Problem #1 Solved!
Data Vault v2.0 arrived on the scene in 2013 and incorporates seamless integration of Big Data technologies along with methodology, architecture, and best practice implementations. Through this adoption, very large amounts of data can easily be incorporated into a Data Vault designed to store using products like Hadoop, Infobright, MongoDB and many other NoSQL options. Eliminating the cleansing requirements of a Star Schema design, the Data Vault excels when dealing with huge data sets by decreasing ingestion times, and enabling parallel insertions which leverages the power of Big Data systems. Problem #2 Solved!
Crafting an effective and efficient Data Vault model can be done quickly once you understand the basics of the 3 table types: Hub, Satellite, and Link! Identifying the business keys 1st and defining the Hubs is always the best place to start. From there Hub-Satellites represent source table columns that can change, and finally Links tie it all up together. Remember it is also possible to have Link-Satellite tables too. Once you’ve got these concepts, it’s easy. After you’ve completed your Data Vault model the next common thing to do is build the ETL data integration process to populate it. While a Data Vault data model is not limited to EDW/BI solutions, anytime you need to get data out of some data source and into some target, a data integration process is generally required. Talend’s mission is to connect the data-driven enterprise.
With its suite of integration software, Talend simplifies the development process, reduces the learning curve, and decreases total cost of ownership with a unified, open, and predictable ETL platform. A proven ETL technology, Talend can certainly be used to populate and maintain a robust EDW/BI system built upon a Data Vault data model. Problem #3 Solved!
The Data Vault essentially defines the Ontology of an Enterprise in that it describes the business domain and relationships within it. Processing business rules must occur before populating a Star Schema. With a Data Vault you can push them downstream, post EDW ingestion. An additional Data Vault philosophy is that all data is relevant, even if it is wrong. Dan Linstedt suggests that data being wrong is a business problem, not a technical one. I agree! An EDW is really not the right place to fix (cleanse) bad data. The simple premise of the Data Vault is to ingest 100% of the source data 100% of the time; good, bad, or ugly. Relevant in today’s world, auditability and traceability of all the data in the data warehouse thus become a standard requirement. This data model is architected specifically to meet the needs of today’s EDW/BI systems. Problem #4 Solved!
“To understand the Data Vault is to understand the business”
The Data Vault methodology is based on SEI/CMMI Level 5 best practices and includes many of its components combining them with best practices from Six Sigma, TQM, and SDLC (Agile). Data Vault projects have short controlled release cycles and can consist of a production release every 2 or 3 weeks automatically adopting the repeatable, consistent, and measurable projects expected at CMMI Level 5. When new data sources need to be added, similar business keys are likely, new Hubs-Satellites-Links can be added and then further linked to existing Data Vault structures without any change to the existing data model. Problem #5 Solved!
In conclusion, the Data Vault modeling and methodology addresses the elements of the problems we identified above:
- It adapts to a changing business environment
- It supports very large data sets
- It simplifies the EDW/BI design complexities
- It increases usability by business users because it is modeled after the business domain
- It allows for new data sources to be added without impacting the existing design
This technological advancement is already proving to be highly effective and efficient. Easy to design, build, populate, and change, the Data Vault is a clear winner. Very Cool! Do you want one?