What is data integrity and why is it important?
Imagine this: A pharmaceutical company touts the safety of its new wonder drug. But when the FDA inspects the offshore production facility, work is halted immediately; important quality-control data is missing. Unfortunately, this real-life example of compromised data integrity is all too common. Problems with the accuracy and consistency of data can cause everything from minor hassles to significant business problems.
In this era of big data, when more pieces of information are processed and stored than ever, data health has become a pressing issue — and implementing measures that preserve the integrity of the data that’s collected is increasingly important. Understanding the fundamentals of data integrity and how it works is the first step in keeping data safe. Read on to learn more about what data integrity is, why it’s essential, and what you can do to keep your data healthy.
What is data integrity?
Data integrity is the overall accuracy, completeness, and consistency of data. Data integrity also refers to the safety of data in regard to regulatory compliance — such as GDPR compliance — and security. It is maintained by a collection of processes, rules, and standards implemented during the design phase. When the integrity of data is secure, the information stored in a database will remain complete, accurate, and reliable, no matter how long it’s stored or how often it’s accessed.
Data integrity plays a central role in protecting you from data loss and data leaks. To keep your data safe from outside forces acting with malicious intent, you must first ensure that internal users are handling data correctly. By implementing the appropriate data validation and error checking, you can ensure that sensitive data is never miscategorized or stored incorrectly, either of which would expose you to risk.
In SQL databases, data integrity is enforced largely through constraints, which are rules attached to columns and tables. A primary key constraint, for example, ensures that each row of a table is uniquely identified so that it can be retrieved on its own. Constraints prevent invalid data from entering the base tables of the database, which helps maintain data integrity.
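As a minimal sketch of how constraints keep invalid rows out of a table, the following uses Python's built-in sqlite3 module. The customers table, its columns, and the try_insert helper are hypothetical names chosen for illustration, and other database engines may word their error messages differently.

```python
import sqlite3

# In-memory database; the "customers" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- uniquely identifies each row
        email       TEXT NOT NULL UNIQUE   -- must be present and not repeated
    )
    """
)
with conn:  # commit the first valid row
    conn.execute("INSERT INTO customers VALUES (1, 'ana@example.com')")


def try_insert(row):
    """Return None on success, or the constraint error message on rejection."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO customers VALUES (?, ?)", row)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


duplicate_key = try_insert((1, "bob@example.com"))  # reuses primary key 1
missing_email = try_insert((2, None))               # violates NOT NULL
valid_row = try_insert((3, "cy@example.com"))       # satisfies every constraint
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

print(duplicate_key, missing_email, valid_row, row_count, sep="\n")
```

The two invalid rows are rejected by the database itself, so only the rows that satisfy every constraint are stored, regardless of which application attempted the insert.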
Types of data integrity
Maintaining data integrity requires an understanding of the two types of data integrity: physical integrity and logical integrity. Each is a set of processes and methods that enforces data integrity.
Physical integrity is the protection of the completeness and accuracy of data as it’s stored, maintained, and retrieved. When natural disasters strike, the power goes out, or a disk drive crashes, the physical integrity of data is compromised. Human error, storage erosion, and a host of other issues can also make it impossible for data processing managers, system programmers, applications programmers, and internal auditors to obtain accurate data.
Logical integrity keeps data unchanged as it’s used in different ways in a relational database. Logical integrity protects data from human error and hackers as well, but in a much different way than physical integrity does. There are four types of logical integrity:
- Entity integrity. Entity integrity relies on the creation of primary keys, the unique values that identify pieces of data, to ensure that data isn’t listed more than once and that no primary key field in a table is null. It’s a feature of relational systems, which store data in tables that can be linked and used in a variety of ways.
- Referential integrity. Referential integrity refers to the series of processes that make sure data is stored and used uniformly across related tables. Rules embedded into the database’s structure about how foreign keys are used ensure that every foreign key value matches an existing primary key in the table it references, so that only appropriate changes, additions, or deletions of data occur. Rules may include constraints that eliminate the entry of duplicate data, guarantee that data entry is accurate, and/or disallow the entry of data that doesn’t apply.
- Domain integrity. Domain integrity is the collection of processes that ensure the accuracy of each piece of data in a domain. In this context, a domain is a set of acceptable values that a column is allowed to contain. It can include constraints and other measures that limit the format, type, and amount of data entered.
- User-defined integrity. User-defined integrity involves the rules and constraints created by the user to fit their particular needs. Sometimes entity, referential, and domain integrity aren’t enough to safeguard data. Often, specific business rules must be taken into account and incorporated into data integrity measures.
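The four types of logical integrity above can each be mapped to a database feature. The sketch below is one possible illustration using Python's sqlite3 module. The customers and orders tables, the rejected helper, and the business rule that shipped orders are read-only are all assumptions made for the example, not rules taken from any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY                -- entity integrity
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,               -- entity integrity
        customer_id INTEGER NOT NULL
                    REFERENCES customers(customer_id), -- referential integrity
        quantity    INTEGER NOT NULL
                    CHECK (quantity > 0),              -- domain integrity
        status      TEXT NOT NULL
                    CHECK (status IN ('open', 'shipped'))  -- domain integrity
    );
    -- User-defined integrity: a hypothetical business rule that shipped
    -- orders can no longer be changed.
    CREATE TRIGGER orders_no_edit_after_ship
    BEFORE UPDATE ON orders
    WHEN OLD.status = 'shipped'
    BEGIN
        SELECT RAISE(ABORT, 'shipped orders are read-only');
    END;
    INSERT INTO customers VALUES (1);
    INSERT INTO orders VALUES (100, 1, 2, 'shipped');
    """
)


def rejected(sql):
    """Run one statement and report whether the database refused it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return False
    except sqlite3.DatabaseError:
        return True


results = {
    "entity": rejected("INSERT INTO customers VALUES (1)"),
    "referential": rejected("INSERT INTO orders VALUES (101, 999, 1, 'open')"),
    "domain": rejected("INSERT INTO orders VALUES (102, 1, 0, 'open')"),
    "user_defined": rejected("UPDATE orders SET quantity = 5 WHERE order_id = 100"),
    "valid_order": rejected("INSERT INTO orders VALUES (103, 1, 3, 'open')"),
}
print(results)
```

Each of the first four statements breaks exactly one rule and is refused, while the last one satisfies all of them and is stored.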
Data integrity characteristics
Data integrity has several core characteristics:
- Completeness. To what degree is the data fully available in the database?
- Accuracy. Is the data in the right form and is it correct and true?
- Consistency. Consistency of data can be low level (e.g., customer contact info is formatted in the same way everywhere) or high level (e.g., different groups are using the same dataset).
- Timeliness. How near to real-time is the data being collected? Old data is often not useful.
- Compliance. Does the data meet compliance standards, such as data privacy regulations and other regulations?
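Some of these characteristics can be turned into simple measurements. The sketch below scores a handful of made-up records for completeness, consistency, and timeliness. The field names, the phone format, and the 30-day freshness window are assumptions for illustration only, and accuracy and compliance generally need checks against an external source or policy that a snippet like this cannot provide.

```python
import re
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = ("name", "phone", "updated_at")   # hypothetical schema
PHONE_FORMAT = re.compile(r"^\d{3}-\d{3}-\d{4}$")   # assumed house format


def profile(records, now, max_age=timedelta(days=30)):
    """Return the share of records that pass each check (0.0 to 1.0)."""
    total = len(records)
    complete = sum(
        1 for r in records if all(r.get(field) for field in REQUIRED_FIELDS)
    )
    consistent = sum(
        1 for r in records if r.get("phone") and PHONE_FORMAT.match(r["phone"])
    )
    timely = sum(
        1 for r in records
        if r.get("updated_at") and now - r["updated_at"] <= max_age
    )
    return {
        "completeness": complete / total,
        "consistency": consistent / total,
        "timeliness": timely / total,
    }


def day(year, month, dom):
    return datetime(year, month, dom, tzinfo=timezone.utc)


records = [
    {"name": "Ana", "phone": "555-010-1234", "updated_at": day(2024, 1, 20)},
    {"name": "Bob", "phone": "(555) 010 9999", "updated_at": day(2024, 1, 25)},
    {"name": "Cy", "phone": None, "updated_at": day(2023, 6, 1)},
    {"name": "Di", "phone": "555-010-7777", "updated_at": day(2023, 11, 1)},
]
scores = profile(records, now=day(2024, 1, 31))
print(scores)  # {'completeness': 0.75, 'consistency': 0.5, 'timeliness': 0.5}
```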
What data integrity is not
With so much talk about data integrity, it’s easy for its true meaning to be muddled. Often data security and data quality are incorrectly substituted for data integrity, but each term has a distinct meaning.
Data integrity is not data security
Data security is the collection of measures taken to protect data from unauthorized access and from the corruption that can follow. It incorporates the use of systems, processes, and procedures that restrict unauthorized access and keep data inaccessible to those who may wish to use it in harmful or unintended ways. Breaches in data security may be small and easy to contain or large and capable of causing significant damage.
While data integrity is concerned with keeping information intact and accurate for the entirety of its existence, the goal of data security is to protect information from outside attacks. Data security is but one of the many facets of data integrity. Data security is not broad enough to include the many processes necessary for keeping data complete and accurate over time.
Data integrity is not data quality
Does the data in your database meet company-defined standards and the needs of your business? Data quality answers these questions with an assortment of processes that measure your data’s age, relevance, accuracy, completeness, and reliability.
Much like data security, data quality is only a part of data integrity, but a crucial one. Data integrity encompasses every aspect of data quality and goes further by implementing an assortment of rules and processes that govern how data is entered, stored, transferred, and much more.
Data integrity and GDPR compliance
Data integrity is key to complying with data protection regulations like GDPR. Non-compliance with these regulations can make companies liable for large penalties. In some instances, they may be sued on top of these penalties. Repeated compliance violations can even put companies out of business.
Fortunately, there are ways to ensure the data integrity you need to comply with GDPR and other data protection legislation.
Data integrity risks
An assortment of factors can affect the integrity of the data stored in a database. A few examples include the following:
- Human error: When individuals enter information incorrectly, duplicate or delete data, don’t follow the appropriate protocols, or make mistakes during the implementation of procedures meant to safeguard information, data integrity is jeopardized.
- Transfer errors: When data can’t successfully transfer from one location in a database to another, a transfer error has occurred.
- Bugs and viruses: Software bugs can corrupt data unintentionally, while malware such as spyware and viruses can invade a computer and alter, delete, or steal data.
- Compromised hardware: Sudden computer or server crashes, and problems with how a computer or other device functions, are examples of significant failures and may be indications that your hardware is compromised. Compromised hardware may render data incorrect or incomplete, limit or eliminate access to data, or make information hard to use.
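Transfer errors and hardware faults are often detected by comparing a checksum of the data before and after it moves. The following is a minimal sketch using SHA-256 from Python's standard hashlib module. The function names and the sample payloads are hypothetical, and a real pipeline would typically hash files or streams in chunks rather than small byte strings.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def transfer_is_intact(source: bytes, received: bytes) -> bool:
    """Compare checksums taken before and after a transfer."""
    return sha256_digest(source) == sha256_digest(received)


# Hypothetical payloads: the second copy was altered somewhere in transit.
original = b"batch=42;result=pass;operator=jdoe"
corrupted = b"batch=42;result=fail;operator=jdoe"

clean_copy_ok = transfer_is_intact(original, original)
corrupted_copy_ok = transfer_is_intact(original, corrupted)
print(clean_copy_ok, corrupted_copy_ok)  # True False
```

A checksum only reveals that something changed. Recovering the correct data still depends on a backup or a retransmission.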
Risks to data integrity can be minimized, and in some cases eliminated, by doing the following:
- Limiting access to data and changing permissions to restrict changes to information by unauthorized parties
- Validating data to make sure it’s correct both when it’s gathered and when it’s used
- Backing up data
- Using logs to keep track of when data is added, modified, or deleted
- Conducting regular internal audits
- Using error-detection software
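Of the measures listed above, logging lends itself to a short illustration. The sketch below uses database triggers to record every insert, update, and delete in an audit table, using Python's sqlite3 module. The readings and audit_log tables and the trigger names are hypothetical, and a production audit trail would usually also capture which user made the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE readings (
        reading_id INTEGER PRIMARY KEY,
        value      REAL NOT NULL
    );
    -- One row per change, with the values before and after it.
    CREATE TABLE audit_log (
        log_id     INTEGER PRIMARY KEY,
        action     TEXT NOT NULL,
        reading_id INTEGER NOT NULL,
        old_value  REAL,
        new_value  REAL,
        logged_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER readings_after_insert AFTER INSERT ON readings BEGIN
        INSERT INTO audit_log (action, reading_id, new_value)
        VALUES ('INSERT', NEW.reading_id, NEW.value);
    END;
    CREATE TRIGGER readings_after_update AFTER UPDATE ON readings BEGIN
        INSERT INTO audit_log (action, reading_id, old_value, new_value)
        VALUES ('UPDATE', OLD.reading_id, OLD.value, NEW.value);
    END;
    CREATE TRIGGER readings_after_delete AFTER DELETE ON readings BEGIN
        INSERT INTO audit_log (action, reading_id, old_value)
        VALUES ('DELETE', OLD.reading_id, OLD.value);
    END;
    """
)

with conn:  # one transaction covering all three changes
    conn.execute("INSERT INTO readings VALUES (1, 7.2)")
    conn.execute("UPDATE readings SET value = 7.4 WHERE reading_id = 1")
    conn.execute("DELETE FROM readings WHERE reading_id = 1")

history = conn.execute(
    "SELECT action, old_value, new_value FROM audit_log ORDER BY log_id"
).fetchall()
print(history)
# [('INSERT', None, 7.2), ('UPDATE', 7.2, 7.4), ('DELETE', 7.4, None)]
```

Because the triggers live in the database, the log is written no matter which application makes the change, which is what makes it useful during an internal audit.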
Getting started with data integrity
Protecting the integrity of your company’s data using traditional methods can seem like an overwhelming task. Secure, cloud-based data integration platforms offer a modern alternative that also provides a real-time view of all of your data. With industry-leading cloud integration tools, you can connect multiple source data applications and get access to all of your company’s data in one location.
Take a look at the Definitive Guide to Data Governance to find out how to establish a framework for data integrity.