What is Data Health?
Organizations around the world rely on data more than ever. However, there is a difference between being surrounded by data every day and using data to make everyday business decisions. The only way to meet fundamental business goals is to take action based on high-quality, trusted data: healthy data. But we are living in the era of big data, and the more data an organization manages, the harder it can be to keep that data healthy.
Most people know intuitively that healthy data should be clean, complete, and in compliance with legal and regulatory requirements. Unfortunately, those factors alone won’t guarantee that data is ready to use for business decisions. Most organizations can’t measure just how healthy their data is — and it's foolish to rely on data whose health you can’t measure. Part of the problem is that while people think they understand what data health means, they struggle to define or evaluate data health.
Let’s start with a clear definition of data health.
Data health definition
Data health is the condition of a company's data and how well it supports effective, timely decisions and business objectives. To know that your organization’s data is healthy, you must be able to prove that it’s valid, complete, and of sufficient quality to produce analytics that decision-makers can feel comfortable relying on for business decisions.
Talend’s vision of data health combines technologies and behaviors to measure and manage data for better discoverability, understandability, and value. Healthy data means that everyone in the organization can access the information they need, when they need it, and use it without wondering about its validity.
Like any health care system, data health involves monitoring and intervention across the entire life cycle. We think of data health in a framework of prevention, treatment, and community support:
- Preventative care: identifying data challenges preemptively
- Effective treatments: systematically curing data reliability issues and risks
- Supportive culture: establishing a discipline of collaborative data care
With data health metrics to prove the business value of data, an organization can improve nearly any aspect of its operations:
- Enhance sales and marketing analytics
- Address data governance and compliance
- Improve business processes
- Transform the customer experience
- Drive 360-degree engagement
- Enable machine learning and AI
Without healthy data, all of those processes go awry. You can’t address the right customers, shorten sales cycles, or improve processes if the data you’re basing your work on is inaccurate, uncontrolled, or out of date. Unhealthy data slows decision-making and degrades its quality, which raises costs and can reduce revenue. As you scale up to using big data, the health of the data becomes increasingly important. It is critical for companies working with big data to institute health metrics.
So how can you tell if your data is healthy?
Measuring data health
Data quality is a major consideration for data health. The Data Management Association of the UK defines six dimensions for measuring data quality:
- Accuracy: The degree to which data correctly describes the real-world object or event being described
  - Example: Are the calculations of employees’ wages based on their actual work hours?
- Completeness: The proportion of data stored in a dataset against the potential for 100%
  - Example: Do address records contain data in all address fields necessary to get a postal mailing to its destination? Full postal code? Country name?
- Consistency: The absence of difference, when comparing two or more representations of a thing against a definition
  - Example: Does one table contain data characterized as belonging to a particular division, even though that division has been eliminated after a reorganization?
- Timeliness: The degree to which data represents reality from the required point in time
  - Example: If budget decisions are made based on sales statistics, how quickly is sales data made available to decision makers?
- Uniqueness: No item, or entity instance, is recorded more than once based upon how that thing is identified
  - Example: When a system updates a record, can you be sure it isn’t creating a duplicate of the original record with more current information?
- Validity or conformity: The degree to which data conforms to the syntax (format, type, or range) of its definition
  - Example: A street address of 1000 Data Way is valid (though not necessarily accurate), while an address of /03H8 Data Way is not.
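To make three of these dimensions concrete, here is a minimal Python sketch that scores completeness, uniqueness, and validity on a handful of address records. The sample records, the field names, and the street-address pattern are all hypothetical and chosen only for illustration; real rules would come from your own data definitions. Accuracy, consistency, and timeliness usually need an external reference (the real-world value, a second dataset, or a timestamp requirement), so they are left out here.

```python
import re

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "street": "1000 Data Way", "postal_code": "94000", "country": "US"},
    {"id": 2, "street": "/03H8 Data Way", "postal_code": "94000", "country": "US"},
    {"id": 3, "street": "22 Main Street", "postal_code": "", "country": "US"},
    {"id": 3, "street": "22 Main Street", "postal_code": "", "country": "US"},
]

REQUIRED_FIELDS = ("street", "postal_code", "country")

# Assumed validity rule: a house number, whitespace, then a street name.
STREET_PATTERN = re.compile(r"^\d+\s+\S.*$")


def completeness(records, fields):
    """Share of required field values that are populated."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) not in (None, ""))
    return filled / total


def uniqueness(records, key):
    """Ratio of distinct key values to the total number of records."""
    if not records:
        return 1.0
    return len({r[key] for r in records}) / len(records)


def validity(records, field, pattern):
    """Share of populated values that match the expected syntax."""
    values = [r[field] for r in records if r.get(field)]
    if not values:
        return 1.0
    return sum(1 for v in values if pattern.match(v)) / len(values)


if __name__ == "__main__":
    print("completeness:", round(completeness(RECORDS, REQUIRED_FIELDS), 3))
    print("uniqueness:  ", uniqueness(RECORDS, "id"))
    print("validity:    ", validity(RECORDS, "street", STREET_PATTERN))
```

In this sample, two empty postal codes lower completeness, the repeated id lowers uniqueness, and the malformed street address lowers validity. Note that the validity check says nothing about accuracy: a well-formed address can still be the wrong one.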
Data teams must decide for themselves what level of data quality qualifies as healthy, and they should be able to certify that level of quality to data users so that those users, in turn, can be confident using the data. Remember, though, that data that is sound but unavailable or untrusted still isn’t supporting business decisions. It isn’t healthy data.
Since data health is a measure of data’s value to the business, transparency and accessibility are as important as quality. If decision makers don’t have ready access to the data they need, the organization may as well not have that data. On the other hand, privacy requirements for personally identifiable information (PII) may apply. In those cases, it is best to restrict access to the affected data so that unprivileged users cannot see it. A strong data governance technology platform that enlists relevant business experts as data stewards can help improve data accuracy and security alike.
At your organization, data health metrics may include additional factors such as reasonability and integrity. Whatever factors you include, the point is to be able to rely on your data to be useful across the enterprise. The higher you can rate your data across each of these dimensions, the healthier you can consider your data.
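One way to turn ratings across dimensions into a single reading is a weighted average, reported alongside the weakest dimension so that a strong average does not hide a problem area. The sketch below is only one possible approach; this article does not prescribe a formula, and the scores, the equal default weights, and the function name `health_summary` are assumptions made for illustration.

```python
def health_summary(scores, weights=None):
    """Combine per-dimension scores (each between 0.0 and 1.0) into a
    weighted average and name the lowest-scoring dimension.

    If weights are given, they must contain an entry for every dimension
    in scores; otherwise every dimension is weighted equally.
    """
    if not scores:
        raise ValueError("at least one dimension score is required")
    weights = weights or {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    overall = sum(scores[name] * weights[name] for name in scores) / total_weight
    weakest = min(scores, key=scores.get)
    return {"overall": overall, "weakest": weakest}


if __name__ == "__main__":
    # Hypothetical ratings for the six dimensions.
    example_scores = {
        "accuracy": 0.90,
        "completeness": 0.80,
        "consistency": 1.00,
        "timeliness": 0.70,
        "uniqueness": 0.75,
        "validity": 0.75,
    }
    print(health_summary(example_scores))
```

Additional factors such as reasonability or integrity can be added as extra keys without changing the function.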
Data health assessment
Once you know what to measure, how do you go about assessing the well-being of your data?
A holistic data health system relies on universal metrics of data quality. With standard metrics, evaluation of data’s trustworthiness and actionability becomes possible. As described above, it is not enough for those preparing corporate data to know that the data meets quality standards. End users can only truly trust their decisions when they have metrics proving data quality.
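As a rough illustration of how documented standards let end users see proof of quality, the sketch below compares measured scores with minimum thresholds and reports which dimensions fall short. The threshold values and the `certify` function are hypothetical; each organization would set and document its own standards.

```python
# Hypothetical minimum scores; each data team would set and document its own.
THRESHOLDS = {"completeness": 0.95, "uniqueness": 0.99, "validity": 0.98}


def certify(scores, thresholds):
    """Return (passed, failures).

    failures maps each dimension that falls short of its threshold to a
    (measured, required) pair. A dimension with no measured score is
    treated as 0.0, so unmeasured data never passes by default.
    """
    failures = {
        name: (scores.get(name, 0.0), required)
        for name, required in thresholds.items()
        if scores.get(name, 0.0) < required
    }
    return (not failures, failures)


if __name__ == "__main__":
    measured = {"completeness": 0.97, "uniqueness": 0.75, "validity": 0.99}
    passed, failures = certify(measured, THRESHOLDS)
    print("certified:", passed)
    for name, (got, required) in failures.items():
        print(f"  {name}: measured {got}, required {required}")
```

Treating a missing score as a failure reflects the point made earlier: data whose health cannot be measured should not be relied on.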
Talend’s 2021 Data Health Survey revealed that fewer than half of executives are certain that their company even uses data quality standards. About a third of executives said there were no documented standards in place, and another 19% said they weren’t sure. When asked if they saw a need for universal, cross-industry data quality standards, 95% of executives agreed.
Given the volume of data your organization is probably managing through SaaS platforms, databases, and public-facing web servers, it’ll be impossible to have someone examine every record across all datasets. The best approach is to employ a data platform that includes both data integration and governance capabilities.
You can use the software both to get a reading on data health and to cure unhealthy data. Ideally, you should be able to get instant insight into what data you can trust and have tools to fix the data you can’t. The platform should address data health issues by offering self-service access, pervasive data quality tools, and comprehensive governance capabilities that span all data flows and data sources from end to end.
How healthy is your data?
Do you have confidence in your organization’s ability to deliver decision-ready data? Do you wonder about your data health statistics? Talend can help. Start with a free checkup: export a subset of your data and run it through the Talend Trust Assessor. This free service provides a rapid evaluation of the validity, completeness, and uniqueness of your data. If you just want to see how it works, try it with our sample dataset first.