Data Modeling: Ensuring Data You Can Trust
Here’s food for thought: How can you verify that your data is being fully and efficiently utilized to enhance your business, if no standard exists to ensure the basic accuracy, coverage, extensibility, and interpretability of data? Without a process to ensure clean and quality data you can trust, you have no power or backing for reports and insights behind the best data-driven decisions.
Data modeling is a critical first step in defining the structure of available data, insuring against mistakes, and ultimately certifying actionable downstream results. Preparing a robust data model means knowing the process and its benefits, the various types of models, best practices, and the relevant software tools which are available.
What is data modeling?
Data modeling is the process of describing the structure, associations, and constraints relevant to available data, eventually encoding these rules into a reusable standard. From highlighting the most-critical high-level concepts underlying data from a stakeholders’ standpoint, to precisely documenting the contents, data types, and rules governing individual columns in a database, it is an integral part of the planning stage for any analytics deployment or business intelligence project.
Data modeling notations and techniques are often graphical and intended for clarity and portability; the resulting specifications can effectively be communicated across the levels of an organization, from engineers to managers to analysts, as well as clients or other businesses. The process itself inherently encourages discussion, collaboration, and careful consideration of the transformation of requirements to implemented data reality.
The creation and adherence to data modeling standards is recommended for a variety of applications — from integration and API development to the design of data repositories both on-premises and in the cloud (whether a single database, a data warehouse, a data mart, a data lake, etc.)
During this process, the emphasis is on the need for availability and organization of data, not how the data will be applied. A data model, therefore, is the formal documentation of conventions for the management of a data resource and essentially a guide to implementation, usage, or extension of such a resource.
Why data modeling is important.
An effective data modelling procedure leads to better allocation of human and computational resources, anticipates issues before they arise, bolsters cross-functional communication, and enforces compliance (regulatory and internal) — all while guaranteeing underlying data quality, security, and accessibility.
At the business level, a completed data model makes an organization more responsive to change, lowers risks, and increases effectiveness — ultimately reducing costs.
At the engineering level, redundancy is minimized and systems naturally become easier to integrate, easier to interface with, and more compatible with each other.
Three types of data models.
There are three overarching types of data models which should be generated while planning an analytics project. Together they range from the most abstract to the most discrete of specifications, but each involves contribution from a distinct subset of stakeholders.
1. The conceptual data model.
First, the conceptual data model (sometimes called a conceptual schema) expresses initial requirements at a high level, and is agnostic of technologies or software. Here the general business scope (or domain) of the model is used as a launching point for discussion among process owners. Important classes in the data are selected and semantic relationships between them are described, all based on real-world instances which these classes represent.
Though these deliberations will typically involve managers and data architects, anyone should be able to understand the resulting model. The focus is on the business understanding of data — technical details are reserved for the following two modes.
2. The logical data model.
The conceptual schema can be translated into a logical data model, where more detailed specifications are first dealt with. This type of model concerns the structure of required data: typically how it might be represented in an actual repository.
A single conceptual model can result in multiple logical models, depending on the complexity of the underlying data represented and on the sophistication of business requirements. Architects continue to drive the conversation as they begin to involve data engineers.
3. The physical data model.
Finally, specifications can be encoded in physical data models. These describe individual database tables, component columns and types, and details of keys and relationships — while accounting for performance, access, and resource allocation. The technologies involved have all been selected and accounted for (e.g. choices of local vs. cloud, types of DBMSs, and APIs).
At this point, data modeling has reached the highest level of granularity before actual implementation or deployment. The onus now lies with data engineers and database administrators.
To Note: The difference between logical and physical schemas has often blurred in practice. Some tools and engineering paradigms even combine these distinct models (and the conceptual is also sometimes omitted or undistinguished from the logical). Remember:
- The conceptual model is universally interpretable and deals exclusively with the business requirements
- The logical model bridges business entities and attributes to data structures
- The physical model is a realized implementation, albeit on paper
Data modeling best practices.
Although there are countless benefits to data modeling, there are also inherent dangers in creating and using bad models. To prevent unnecessary overhead and inefficiency, among a host of other problems, sticking to best practices is critical.
Approach data modeling as a holistic process.
It’s important to start with the widest perspective, both temporally and cross-sectionally. Though the three types of models have a natural ordering and distinct purposes, the final specifications should be considered a living document.
The process might be required in full as a nascent business dives into analytics, or in part to extend an established data repository with new use cases. It may revisit itself for reengineering, integration of legacy systems, or migration to the cloud.
Additionally, many enterprises have data ecosystems composed of multiple components and silos, in addition to the exploding global networks of data externally available. It’s important to design flexibly and modularly, with a view to both the future and to this wider ecosystem.
Practically, this means you might start with a bubble chart of your business’ data silos as well as external or anticipated sources. This overview could reveal unconsidered entities to include in a conceptual model, or even highlight inefficiencies and lead to organization-wide improvements. You might consider this a 0th step to the process, but also one that should be revisited as a business or application evolves.
Don’t skip the concept schema.
A beginning modeler might prefer to skip the conceptual phase as unnecessary, while a manager might imagine that architects will seamlessly translate a business idea to a logical schema without input or discussion, but the conceptual model is the foundation for the entire process. To wit, this phase is independent of technological considerations because the business should drive the adoption and utilization of technologies.
Concept modeling refines the original idea or purpose behind a project to its component business requirements and constraints. Failing to make these definitions at the beginning incurs great risk of needing to revisit assumptions down the line. Even worse, there is a danger of passing along ambiguous or incorrect interpretations of entities and relationships, translating to erroneous data.
As previously mentioned, a completed conceptual model is also a standalone communicable resource, potentially the only documentation which can be understood across an enterprise. Conceptualization should always be a well-defined step in any new data modeling process.
Verify your logic carefully.
Again, logical modeling is the bridge between pure business requirements and technical constraints. It is the most important stage for validating compatibility between vision and technical possibility. Before work on a physical model can even be considered, entity relationship diagrams (ERDs) must be carefully assembled and reviewed, and entities themselves discussed and classified into logical buckets (uniquely identifiable entities with unique primary keys, child entities and their relational connection to parents, etc.)
Though this step is still removed from the pure technicalities which engineers will encode in implementation, it explores and validates the bare feasibility of organizing the data previously determined as a business requirement. Logical barriers and ideas incompatible with structured data storage must be discovered and resolved here.
Create a detailed and reusable reference.
The physical model synthesizes the result of all discussions while also creating the most detailed specifications for a data storage project. Here, the focus has moved from abstract perspective and prioritization of concepts to a fully-fledged set of technical guidelines. The specific objects and the technologies containing and serving them should be codified.
Your physical schema must be an engineering reference that is complete, maximally specific, yet navigable. With this in mind, a successful physical modelling step will produce:
- A full diagram of data objects, containers, and relational associations.
- A data dictionary which links every technical component and object to a summary of the relevant information passed down from the more abstracted models.
The cloud and the future of data modeling.
Data modeling is an evolving paradigm (the three-schema approach dates back to the 1970s!), and the rise of cloud storage and computing might seem to disrupt the process.
It is clear, however, that data modeling alongside the cloud will only continue to provide value for businesses and be a critical part of planning for the foreseeable future.
- These techniques continue to provide clear lines of cross-functional communication and understanding in an increasingly fractured and fast-moving technological environment where those links only become more important.
- As companies move their infrastructures to the cloud, data modeling enables stakeholders to make informed decisions about what, how, and when data should be transitioned.
- The rise of ELT and its variants, vast improvements in infrastructure for data storage and migration (most recently fueled by this expansion of cloud computing), and growing interest in streaming data, are all symptoms of a morphing data ecosystem. Understanding how to adopt and integrate foundational new technologies such as these, even anticipating the next ‘big shift,’ starts with a commitment to modeling the underlying data and the drivers for their existence.
Data modeling is at its core a paradigm of careful data understanding before analysis or action, and so will only grow more valuable in light of these trends.
Getting started with data modeling.
Across three distinct types of schema, the data modeling procedure encompasses all different aspects of planning for any data project. It is a paradigm which encourages good business practices alongside technical preparedness, and its continuing value and relevance is clear.
- If you are a manager seeking to improve communication of requirements and streamline processes, a further look at conceptual modeling, the details of object-role models (ORMs) or UML class diagrams, and class responsibility collaborator (CRC) cards is in order.
- For fledgling engineers and analysts, researching entity-relationship diagrams, relevant graphical notations (e.g. Crow’s Foot), and perusing example data dictionaries would be a valuable next step.
Ready to get started with Talend?
More related articles
- What is MySQL? Everything You Need to Know
- What is Middleware? Technology’s Go-to Middleman
- What is Shadow IT? Definition, Risks, and Examples
- What is Serverless Architecture?
- What is SAP?
- What is ERP and Why Do You Need It?
- What is “The Data Vault” and why do we need it?
- What is a Data Lab?
- Understanding Cloud Storage
- What is a Legacy System?
- What is Data as a Service?
- What is a Data Mart?
- What is Data Processing?
- Understanding data mining: Techniques, applications, and tools
- What is Apache Hive?
- Data Munging: A Process Overview in Python
- What is a Data Source?
- Data Transformation Defined
- SQL vs NoSQL: Differences, Databases, and Decisions
- How modern data architecture drives real business results
- Data Gravity: What it Means for Your Data
- CRM Database: What it is and How to Make the Most of Yours
- Data Conversion 101: Improving Database Accuracy