Business Applications, Data Integration, Master Data Management, Data Warehousing, Big Data, Data Lakes, and Machine Learning; these all have (or should have) a common and essential ingredient: A Data Model; Let us NOT forget about that; Or, as in many situations I run into, ignore it completely!
The Data Model is the backbone of almost all of our high value, mission critical, business solutions from e-Commerce and Point-of-Sale, through Financial, Product, and Customer Management, to Business Intelligence and IoT. Without a proper Data Model, where is the business data? Probably: Lost!
After the success of my Blog Series on Talend Job Design Patterns & Best Practices (please read Part 1, Part 2, Part 3, and Part 4), which covers 32 Best Practices and discusses the best way to build your jobs in Talend, I hinted that data modeling would be forthcoming. Well, here it is!
Data Models and Data Modeling Methodologies have been around since the beginning of time. Well, since the beginning of computing anyway. Data needs structure in order to make sense of it and provide a way for computers to deal with its bits and bytes. Sure, today we deal with unstructured and semi-structured data too, but for me it simply means that we evolved to more sophisticated paradigms than our computing predecessors had to deal with. The Data Model therefore remains, and provides the basis upon which we build highly advanced business applications. Like the Talend best practices, I believe we must take our data models and modeling methods seriously.
Looking back at the history of Data Modeling may enlighten us, so I did some research to refresh myself.
A Brief History Lesson on the Data Model
In the ‘Computing Dark Ages’, we used flat record layouts, or arrays; all data saved to tape or large disk drives for subsequent retrieval. However, in 1958, J. W. Young and H. K. Kent described modeling information systems as “a precise and abstract way of specifying the informational and time characteristics of a data processing problem”. Soon after in 1959, CODASYL or the ‘Conference/Committee on Data Systems Languages’, a consortium, was formed by the Charles Babbage Institute at the University of Minnesota which led to standard programming languages like COBOL and the ‘Integrated Data Store’ (IDS); an early database technology designed in the 1960’s at GE/Honeywell by Charles Bachman. IDS proved difficult to use, so it evolved to become the ‘Integrated Database Management System’ (IDMS) developed at B. F. Goodrich (a US aerospace company at the time, and yes the tire company we know today), marketed by Cullinane Database Systems (now owned by Computer Associates). These two data modeling methodologies called the ‘Hierarchal Data Model’ and the ‘Network Data Model’ respectively, were both very common across mainframe computing for the next 50 years. You may still find them in use today. Wow!
In the late 1960’s, while working at IBM, E. F. Codd in collaboration with C. J. Date (author of ‘An Introduction to Database Systems’), mapped Codd’s innovative data modeling theories resulting in the ‘Relational Model of Data for Large Shared Data Banks’ publication in 1970. Codd’s campaign to ensure vendors implemented the methodology properly published his famous ’Twelve Rules of the Relational Model’ in 1985. Actually, thirteen rules numbered zero to twelve; Codd was clearly a computer geek of his day. The Relational Model also introduced the concept of ‘Normalization’ with the definition of the ‘Five Normal Forms’. Many of us talk about ‘3NF’ or the ‘3rd Normal Form’, but do you know how to define it? Read up on these two links and find out if you really know what you think you know. There will be a quiz at the end! Not …
The next significant data modeling methodology arrived in 1996, proposed by Ralph Kimball (retired), in his groundbreaking book co-authored by Margy Ross, ‘The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling’. Kimball’s widely adopted ‘Star Schema’ data model applied concepts introduced in the data warehouse paradigm first proposed in the 1970‘s by W. H. (Bill) Inmon (named in 2007 by Computerworld as one of the ten most influencial people of the first 40 years in computing). Inmon’s ‘Building the Data Warehouse’, published in 1991 has become the defacto standard for all data warehouse computing. While there has been some history of disagreement between Inmon and Kimball over the proper approach to data warehouse implementation, Margy Ross, of the Kimball Group in her article ‘Differences of Opinion’ presents a fair and balanced explanation for your worthy consideration.
Recently a new data modeling methodology has emerged as a strong contender. The Data Vault! Its author and inventor, Dan Linsdedt, first conceived the Data Vault in 1990 and released a publication to the public domain in 2001. The Data Vault model resolves many competing Inmon & Kimball arguments, incorporating historical lineage of data, and offering a highly adaptable, auditable, and expandable paradigm. A critical improvement (IMHO); I invite you to read my blog on ‘What is “The Data Vault” and why do we need it?’. Linstedt’s Data Vault proved invaluable on several significant DOD, NSA, and Corporate projects. In 2013, Linsdedt released Data Vault 2.0 addressing Big Data, NoSQL, unstructured, semi-structured data integration coupled with SDLC best practices on how to use it. Perfect timing, I’d say. So here we are …
A Data Model Summary
A quick summary of the different data modeling methodologies historically include:
- Flat Model – single, two-dimensional array of data elements
- Hierarchical Model – records containing fields and sets defining a parent/child hierarchy
- Network Model – similar to hierarchical model allowing one-to-many relationships using a junction ‘link’ table mapping
- Relational Model – collection of predicates over finite set of predicate variables defined with constraints on the possible values and combination of values
- Star Schema Model – normalized fact and dimension tables removing low cardinality attributes for data aggregations
- Data Vault Model – records long term historical data from multiple data sources using hub, satellite, and link tables
Database Development Life Cycle – DDLC
Today’s dialogue seems to focus entirely on complexity and sheer volume of data. Important, sure, but again I’d like to remind you that the Data Model should be an important part of the discussion. As requirements evolve, the schema (a Data Model) must follow along – or even lead the way; regardless, it needs to be managed. Therefore, I submit to you, the Database Development Life Cycle!
For every environment (like DEV/TEST/PROD) where data is involved, developers need to accommodate and adapt code to its inevitable structural mutation. Similar to the Software Development Life Cycle (SDLC), a database should embrace appropriate Data Model Design & Best Practices. Of the many Data Models that I have designed, clear precepts have emerged which include:
- Adaptability – creating schemas that withstand enhancement or correction
- Expandability – creating schemas that grow beyond expectations
- Fundamentality – creating schemas that deliver on features and functionality
- Portability – creating schemas that can be hosted on disparate systems
- Exploitation – creating schemas that maximize a host technology
- Efficient Storage – creating optimized schema disk footprint
- High Performance – creating optimized schemas that excel
These design precepts incorporate the essence of any chosen modeling methodology, some in contradiction with others. In my experience regardless of these dichotomies, a data model has just three stages of life – cradle to grave:
- A Fresh INSTALL – based upon the current version of the schema
- Apply an UPGRADE – drop/create/alter dB objects upgrading one version to the next
- Data MIIGRATION – where a disruptive ‘upgrade’ occurs (like splitting tables or platform)
Designing the Data Model can be a labor of love entailing both the tedious attention to detail tempered with the creative abstraction of ambiguity. Personally drawn to challenging schemas, I look for cracks and crevices to correct, which often present themselves in various ways. For example:
- χ Composite Primary Keys avoid them, rarely effective or appropriate; there are some exceptions depending upon the data model
- χ Bad Primary Keys usually datetime and/or strings (except a GUID or Hash) are inappropriate
- χ Bad Indexing either too few or too many
- χ Column Datatypes when you only need an Integer don’t use a Long (or Big Integer), especially on a primary key
- χ Storage Allocation inconsiderate of data size and growth potential
- χ Circular References where a table A has a relationship with table B, table B has a relationship with table C, and table C has a relationship with table A – this is simply bad design (IMHO)
Let us consider then a database design best practice: The design and release process of a data model. I believe that when crafting a data model one should follow a prescribed process similar to this:
Self-explanatory to most perhaps, yet let me emphasize the importance of adopting this process. While schema changes are inevitable, getting a solid data model early in any software development project is essential. Undoubtedly minimizing the impact to application code is desirable for delivering successful software projects. Schema changes can be an expensive proposition so understanding the database life cycle and its role becomes very important. Versioning your database model is critical. Use graphical diagrams to illustrate the designs. Create a ‘Data Dictionary’ or ‘Glossary’ and track lineage for historical changes. It is a higher discipline; but it works!
Data Modeling Methodologies
Understanding the history of the Data Model and the best process under which to design them is only the starting point. As a Database Architect for both Transactional (OLTP) and Analytical (OLAP) models, I have discovered that the first three steps illustrated above represent about 80% of the work. So let us consider that next.
Sometimes Data Models are easy, usually due to simplicity and/or small stature. Data Models can also be very hard, usually due to complexity, diversity, and/or sheer size and shape of the data and the many places throughout the Enterprise where it is used. I believe we should understand as early as possible the full extent of what and where data is, how it is affected by, or affects the applications and systems using it, and why it is there in the first place. Getting your head around who needs what and how to deliver it is the challenge. Mapping it out to ensure a solid Data Model is the goal. Choosing the right data modeling methodology is paramount.
Business Value of a Data Model
Talend ETL/ELT jobs are written to read and write data. We do this ostensibly to deliver value to the business. Why then do we need a Data Model? What purpose does it serve? Can’t we simply process it and be done?
From a technical perspective, we rely on the data model to provide a structure upon which we manipulate data flow. The life cycle of a Data Model directly impacts job design, performance, and scalability. And this is just the tip of the iceberg, technically. The business perspective is perhaps more abstract. Foremost the Data Model validates the business requirements. It provides a critical definition for systems integration and the structural control of data used by the business, thus ensuring various functional and/or operational tenets. I submit that the business becomes wholly inefficient without a Data Model. Agreed?
Without the Data Model and tools like Talend, data can completely fail to provide business value, or worse impede its success through inaccuracy, misuse, or misunderstanding. In my experience having a well-defined Data Model and DDLC best practice accelerates and augments the business value of data.
In Part 2 of this series, I will illustrate and examine the basics and value of the Logical and Physical Data Model. I will also propose an expansion on the way we differentiate our data: holistically first, then separating out the conceptual details, before we even attempt a Logical or Physical design. These, to help us better understand the data, model the data, and validate the model of our Database Design.
Until then, ponder on the information presented here, and feel free to leave any comments, questions, and/or debate the principles presented. Cheers!