The only way to meet fundamental business goals is to make decisions based on high-quality data — but the more data an organization manages, the harder it can be for the company to keep its data healthy.
We know that healthy data is clean, complete, and in compliance with legal and regulatory requirements. Unfortunately, most organizations can’t measure just how healthy their data is — and it’s foolish to rely on data whose health you can’t measure. Fortunately, there are ways to measure and assess data health, and take appropriate corrective action when necessary.
Data health definition
Data health is a measure of the quality of your data. You know that your organization’s data is healthy if you can prove that it’s valid, complete, and of sufficient quality to produce analytics you can feel comfortable basing business decisions on.
With healthy data, an organization can:
- Enhance sales and marketing analytics
- Address data governance and compliance
- Improve business processes
- Transform the customer experience
- Drive 360-degree engagement
- Enable machine learning and AI
But without healthy data, all of those processes go awry. You can’t address the right customers, shorten sales cycles, or improve processes if the data you’re basing your work on is inaccurate, uncontrolled, or out of date. Unhealthy data costs companies time and quality in their decision-making, and can negatively affect revenue as well.
So how can you tell if your data is healthy?
Measuring data health
The health of your data depends on its quality. The Data Management Association of the UK defines six dimensions of data quality:
- Accuracy: The degree to which data correctly describes the real-world object or event being described. Example: Are the calculations of employees’ wages based on their actual work hours?
- Completeness: The proportion of data stored against the potential for 100%. Example: In an address record, do you have all the address fields necessary to get a postal mailing to its destination? ZIP code? ZIP+4? Country name? Do you have address values for all records?
- Consistency: The absence of difference, when comparing two or more representations of a thing against a definition. Example: Does one table contain data characterized as belonging to a particular division, even though that division has been eliminated after a reorganization?
- Timeliness: The degree to which data represent reality from the required point in time. Example: For health care research, such as the development of a vaccine, do you have access to the latest research data? How quickly is that data made available to you?
- Uniqueness: No entity instance (thing) is recorded more than once based upon how that thing is identified. Example: When a system updates a record, can you be sure it isn’t creating a duplicate of the original record, but with more current information?
- Validity or conformity: The degree to which data conforms to the syntax (format, type, or range) of its definition. Example: A street address of 1000 Data Way is valid (though not necessarily accurate), while an address of 03H8 Data Way is not.
Other analysts would include factors such as reasonability, accessibility, and integrity. Whatever factors you include, the point is to be able to rely on your data to be useful across the enterprise. The higher you can rate your data across each of these dimensions, the healthier you can consider your data.
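To make the idea of rating data along these dimensions concrete, here is a minimal Python sketch that scores a handful of records for completeness, uniqueness, and validity. The records, field names, and the street-address rule are all illustrative assumptions rather than part of any standard, and a real assessment would use rules agreed on by your own data team.

```python
import re

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "street": "1000 Data Way", "zip": "94105"},
    {"id": 2, "street": "03H8 Data Way", "zip": "94105"},
    {"id": 3, "street": "22 Main Street", "zip": None},
    {"id": 1, "street": "1000 Data Way", "zip": "94105"},  # repeats id 1
]

# Assumed validity rule: a street address begins with a numeric house number
# followed by whitespace and a street name.
STREET_PATTERN = re.compile(r"^\d+\s+\S.*$")


def completeness(rows, fields):
    """Share of expected field values that are present (not None or empty)."""
    total = len(rows) * len(fields)
    filled = sum(
        1 for row in rows for field in fields if row.get(field) not in (None, "")
    )
    return filled / total if total else 0.0


def uniqueness(rows, key):
    """Ratio of distinct key values to the total number of rows."""
    if not rows:
        return 0.0
    return len({row[key] for row in rows}) / len(rows)


def validity(rows, field, pattern):
    """Share of non-empty values in a field that match the expected format."""
    values = [row[field] for row in rows if row.get(field)]
    if not values:
        return 0.0
    return sum(1 for value in values if pattern.match(value)) / len(values)


scores = {
    "completeness": completeness(records, ["street", "zip"]),
    "uniqueness": uniqueness(records, "id"),
    "validity": validity(records, "street", STREET_PATTERN),
}

for dimension, score in scores.items():
    print(f"{dimension}: {score:.1%}")
```

Each function returns a ratio between 0 and 1, so the scores can be tracked over time or compared against whatever threshold your team decides qualifies as healthy. Accuracy, consistency, and timeliness usually need an external reference (a source of truth, a second system, or a timestamp), which is why they are left out of this sketch.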
Data teams must make their own assessments of the necessary level of data quality to qualify for data health — and they should be able to certify that level of quality to data users, so they in turn can be confident using the data.
A side note: Considerations such as data privacy for personally identifiable information (PII) may apply, keeping some data unavailable to unprivileged users. A strong data governance technology platform can help with data accuracy and security, and the underlying data should exhibit strong data health for those who have the ability to access it.
Data health assessment
Once you know what to measure, how do you go about assessing the health of your data? Given the volume of data your organization is probably managing through SaaS platforms, databases, and public-facing web servers, it’s impossible to examine every record.
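When examining every record is impractical, one common workaround is to score a random sample and treat the result as an estimate. The sketch below illustrates that idea in Python under stated assumptions: the dataset is synthetic, the `email` field and the `estimate_completeness` helper are hypothetical names, and a production assessment would also report a margin of error.

```python
import random


def estimate_completeness(rows, field, sample_size, seed=0):
    """Estimate the share of rows with a value in `field` from a random sample."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    sample = rng.sample(rows, min(sample_size, len(rows)))
    if not sample:
        return 0.0
    present = sum(1 for row in sample if row.get(field) not in (None, ""))
    return present / len(sample)


# Synthetic dataset (an assumption for illustration): every tenth record
# is missing its email address, so true completeness is 90%.
population = [
    {"id": i, "email": None if i % 10 == 0 else f"user{i}@example.com"}
    for i in range(10_000)
]

estimate = estimate_completeness(population, "email", sample_size=500)
print(f"Estimated email completeness from 500 records: {estimate:.1%}")
```

A larger sample narrows the gap between the estimate and the true value, at the cost of more processing. The same pattern can be applied to other measurable dimensions such as validity.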
The best approach is to employ a data platform that includes both data integration and governance capabilities. You can use the software both to get a reading on data health and to “cure” unhealthy data. Ideally you should be able to get instant insight into what data you can trust and have tools to fix the data you can’t. The platform should address data health issues by offering self-service access, pervasive data quality tools, and comprehensive governance capabilities that span all data flows and data sources from end to end.
How healthy is your data?
Do you have confidence in your organization’s data health? Do you wonder how accurate, complete, and timely it is? Talend can help. Export a subset of your data and run it through the Talend Trust Assessor — a free tool that gives you feedback on the validity, completeness, and uniqueness of your data — or try it with our sample dataset.