In my last blog “What is ‘The Data Vault’ and why do we need it?” I introduced a fresh, compelling methodology for data warehouse modeling authored and invented by Dan Linstedt (http://danlinstedt.com) called ‘the Data Vault’. Solving the many characteristic and inherent problems found in crafting an Enterprise Data Warehouse (EDW), I discussed how the Data Vault’s high adaptability simplifies business ontologies, incorporating Big Data to result in durable yet flexible solutions that most engineering departments would dream of.
Before we get into the Data Vault EDW aspects however, I think we need to cover some basics.
Data Storage Systems
Certainly by now everyone has heard about Big Data; including no doubt the hype, disinformation, and misunderstandings about what Big Data is that are, grudgingly, just as pervasive. So let’s back up a moment and confine the discussion to a sensible level. Setting Relational 3NF and STAR schemas aside, ignoring e-commerce, business intelligence, and data integration, let’s look instead at the main storage facilities that encompass data technologies. These are:
- Database Engines
- ROW: your traditional Relational Database Management System (RDBMS)
- COLUMN: relatively new, widely misunderstood, feels like a normal RDBMS
- NoSQL: new kid on the block; really means ‘NOT ONLY SQL’
- File Systems: everything else under the sun (ASCII/EBCDIC, CSV, JSON, XML, HTML, etc.)
The ROW based database storage methodology is one most of us are already familiar with. Depending upon your vendor of choice (like Oracle, Microsoft, IBM, Postgres, etc…) Data Definition Language (DDL) and Data Manipulation Language (DML) syntax, collectively called SQL, creates tables for the storage and retrieval of structured records, row by row. Commonly based upon some form of ‘key’ identifier, the Relational 3rd Normal Form (3NF) Data Model thrives upon the ROW based database engine and is widely used for many Transactional (OLTP) and/or Analytic (OLAP) systems and/or applications. Highly efficient in complex schema designs and data queries, ROW based database engines offer a tried and true way to build solid data-‘based’ solutions. We should not throw this away, I won’t!
The COLUMN based database storage methodology has been around, quietly, for a while as an alternative to ROW based databases where aggregations are essential. Various vendors (like InfoBright, Vertica, SAP Hana, Sybase IQ, etc…) generally use similar DDL and DML syntax from ROW based databases, yet under the hood things are radically different; a highly efficient engine for processing structured records, column by column; perfect for aggregations (SUM/MIN/MAX/COUNT/AVG/PCT)! This is the main factor that sets it apart from ROW based engines. Some of these column based technologies also provide high data storage compression which allows for a much smaller disk footprint. In some cases as much as 80/1 over their row based counterpart. We should adopt this where appropriate; I do!
The NoSQL based storage methodology (notice I don’t call this a database) is the newer kid on the block which many vendors are vying for your immediate attention (like Cassandra, Cloudera, Hortonworks, MapR, MongoDB, etc.). Many people think that NoSQL technologies are here to replace ROW or COLUMN based databases; that is simply not the case. Instead, as a highly optimized, highly scalable, high performance distributed ‘file system’ (see HDFS below), the NoSQL storage capabilities offer striking features simply not practical with ROW or COLUMN databases. Dramatically enhancing file I/O technologies, NoSQL extends out to new opportunities that were either unavailable, impracticable, or both. Let’s dive a bit deeper on this; OK?
There are three main variations of NoSQL technologies. These include (click on image for web link):
- Key Value: which support fast transaction inserts (like an internet shopping cart); Generally stores data in memory and great for web applications that needs considerable in/out data operations;
- Document Store: which stores highly unstructured data as named value pairs; Great for web traffic analysis, detailed information, and applications that look at user behavior, actions, and logs in real time;
- Column Store: which is focused upon massive amounts of unstructured data across distributed systems (think Facebook & Google);Great for shallow but wide based data relationships yet fails miserably at ad-hoc queries;
Most NoSQL vendors support structured, semi-structured, or non-structured data which can be very useful. The real value, I believe, comes in the fact that NoSQL technologies ingest HUGE amounts of data, very FAST. Forget Megabytes or Gigabytes, or even Terabytes, we are talking Petabytes and beyond! Gobs and gobs of data! With the clustering and multi-threaded inner-workings, scaling to future-proof the expected explosion of data, a NoSQL environment is an apparent ‘no-brainer’. Sure, let’s get excited, but let’s also temper it with the understanding that NoSQL is complementary, not competitive to more traditional databases systems. Also note that NoSQL is NOT A DATABASE but a highly distributed, parallelized ‘file system’ and really great at dealing with lots of non-structured data; did I say BIG DATA?
NoSQL technologies have both strengths and weaknesses. Let’s look at these too:
- NoSQL Strengths
- A winner when you need the ability to store and look up Big Data
- Commodity Hardware based
- Fast Data Ingestion (loads)
- Fast Lookup Speeds (across clusters)
- Streaming Data
- Scalable Data Capacity & Distributed Storage
- Application focused
- NoSQL Weakness
- Conceivably an expensive infrastructure (CPU/RAM/DISK)
- Complexities are hard to understand
- Lack of native SQL interface
- Limited programmatic interfaces
- Poor performance on Update/Delete operations
- Good engineering talent still hard to find
- Inadequate for Analytic Queries (aggregations, metrics, BI)
The FILE SYSTEM data storage methodology is really straightforward and easy. Fundamentally file systems rely upon a variety of storage media (like Local Disks, RAID, NAS, FTP, etc.) and managed by an Operating System (Windows/Linux/MacOS) supporting a variety of file access technologies (like FAT, NTFS, XFS, EXT3, etc.). Files can comprise almost anything, be formatted in many ways, and utilized in a wide variety of application and/or systems. Usually files are organized into folders and/or sub-folders making the file system an essential element to almost all computing today. But then you already know this; Right?
So where does Hadoop fit it in, and what is HDFS? The ‘Hadoop Distributed File System’ (HDFS) is a highly fault-tolerant file system that runs on low-cost, commodity servers. Spread across multiple ‘nodes’ in a hardware cluster (sometimes hundreds or even thousands of nodes), 64Mb ‘chunks’ or data segments are processed using a ‘MapReduce’ programming model that takes advantage of the highly efficient parallel, distributed algorithm.
HDFS is focused on high throughput (fast) data access and support for very large files. To enable data streaming HDFS has relaxed a few restrictions imposed by POSIX (https://en.wikipedia.org/wiki/POSIX) standards to allow support for batch processing applications targeting HDFS.
The Apache Hadoop Project is an open-source framework written in Java that is made up of the following modules:
- Hadoop Common: which contains libraries and utilities
- HDFS: the distributed file system
- YARN: a resource manager responsible for cluster utilization & job scheduling (Apache YARN)
- MapReduce: a programming model for large scale data processing
Collectively, this Hadoop ‘package’, has become the basis for several commercially available and enhanced products, which include (click on image for web link):
So let’s call them all: Data Stores
Let’s bring these three very different data storage technologies into a conjoined perspective; I think it behooves us all to consider that essentially all three offer certain value and benefits across multiple use cases. They are collectively and generically therefore: Data Stores!
Regardless of what type of system you are building, I’ve always subscribed to the notion that you use the right tool for the job. This logic applies to data storage too. Each of these data storage technologies offer specific features and benefits therefore should be used in specific ways appropriate to the requirements. Let’s review:
- ROW based databases should prevail when you want a complex, but not too-huge data set that requires efficient storage, retrieval, update, & delete for OLTP and even some OLAP usage;
- COLUMN based database are clearly aimed at analytics; optimized for aggregations coupled with huge data compression and should be adopted for most business intelligence usage;
- NoSQL based data solutions step in when you need to ingest BIG DATA, FAST, Fast, fast… and when you only really need to make correlations across the data quickly;
- File Systems are the underlying foundation upon which all these others are built. Let’s not forget that!
The Enterprise Data ‘Vault’ Warehouse
Now that we have discussed where and how we might store data, let’s look at the process for crafting an Enterprise Data Warehouse (an obvious Big Data use case) based on a Data Vault model.
An EDW is generally comprised of data originating from a ‘Source’ data store; likely an e-commerce system, or Enterprise Application, or perhaps even generated from machine ‘controls’. The simple desire is to provide useful reporting on metrics aggregated from this ‘Source’ data. Yet ‘IT’ engineering departments often struggle with the large volume and veracity of the data and often fail at the construction of an effective, efficient, and pliable EDW Architecture. The complexities are not the subject of this Blog; however anyone who has been involved in crafting an EDW knows what I am talking about. To be fair, it is harder than it may seem.
Traditionally an enterprise funnels ‘Source’ data into some form of staging area, often called an ‘Operational Data Store’ or ODS. From there, the data is processed further into either a Relational 3NF or STAR data model in the EDW where aggregated processing produdces the business metrics desired. We learned from my previous Blog that this is problematic and time consuming, causing tremendous pressure on data clensing, transformations, and re-factoring when systems up-stream change.
This is where the Data Vault shines!
After constructing a robust ODS (which I believe is sound architecture for staging data prior to populating an EDW) designing the Data Vault is the next task. Let’s look at a simple ODS schema:
Let’s start with the HUB table design. See if you can identify the business keys and the ‘Static’ attributes from the ‘Source’ data structures to include into the HUB tables. Remember also that HUB tables define their own unique surrogate Primary Key and should contain record load date and source attributes.
The LNK ‘link’ tables capture relationships between HUB tables and may include specific ‘transactional’ attributes (there are none in this example). These LNK tables also have a unique surrogate Primary Key and should record the linkage load date.
Finally the SAT ‘satellite’ table capture all the variable attributes. These are constructed from all the remaining, valuable ‘Source’ data attributes that may change over time. The SAT tables do not define their own unique surrogate keys; instead they incorporate either the HUB or the LNK table surrogates plus the record load date combined as the Primary Key.
Additionally the SAT tables include a record load end date column which is designed to contain a NULL value for one and only one instance of a satellite row representing the ‘current’ record. When SAT attribute values change up-stream, a new record is inserted into the SAT table, updating the previously ‘current’ record by setting the record load end date to the date of the newly loaded record.
One very cool result of using this Data Vault model is that it is easily possible to create queries that go “Back-In-Time” as it will be possible to check the ‘rec_load_date’ and the ‘rec_load_end_date’ values to determine what the record attribute values were in the past. For those who have tried, they know, this is very hard to accomplish using a STAR schema.
Eventually data aggregations (MIN/MAX/SUM/AVG/COUNT/PCT), often called ‘Business Metrics’ or ‘Data Points’ must be generated from the Data Vault tables. Reporting systems could query the tables directly, which is a viable solution. I think however, that this methodology puts a strain on any EDW, and instead a column-based database could be utilized instead. As previously discussed, these column-based database engines are very effective for storing and retrieving data aggregations. The design of these column-based tables could be highly de-normalized (consider this example), versioned to account for history. This solution effectively replaces the FACT/DIMENSION relationship requirements and the potentially complex (populate and retrieve) data queries of the STAR schema data model.
Yes, for those who can read between the lines, this does impose an additional data processing step which must be built, tested, and incorporated into an execution schedule. The benefits of doing this extra work are huge. Once the business metrics are stored, pre-aggregated, reporting systems will always provide fast retrieval, consistent values, and managed storage footprints. These become the ‘Data Marts’ for the business user and worth the effort!
Getting data from ‘Source’ data systems, to Data Vault tables, to column-based Data Marts, require tools. Data Integration tools. Yes, Talend Tools! As a real-world example, before I joined Talend, I successfully designed and built such an EDW platform. I used Talend and Infobright. The originating ‘source’ data merged 15 identical databases with 45 tables each, into an ODS. Synchronized data from the ODS to the Data Vault model with 30 tables, and further populated 9 InfoBright tables. The only real transformation requirements were to map the ‘Source’ tables to Data Vault tables, then to de-normalized tables. After processing over 30 billion records, the resulting column-based ‘Data Marts’ could execute aggregated query results, across 6 Billion records, in under 5 seconds. Not bad!
Wow -- A lot of information has been presented here. I can attest that there is a lot more that could have been covered. I will close to say, humbly, that an Enterprise Data ‘Vault’ Warehouse, using any of the data stores discussed above, is worth your consideration. The methodology is sound and highly adaptable. The real work then becomes how you define your business metrics and the ETL/ELT data integration process. We believe, here at Talend, not too surprisingly, that our tools are well suited for building fast, pliable, maintainable, operational code that takes you “Beyond the Data Vault”.