Data integrity vs. data quality: Is there a difference?
In short, yes. When we talk about data integrity, we’re referring to the overarching completeness, accuracy, consistency, accessibility, and security of an organization’s data. Together, these factors determine the reliability of the organization’s data. Data quality applies those same criteria to measure a dataset’s level of integrity and, in turn, its reliability and suitability for its intended use. Data quality and integrity are vital to a data-driven organization that employs analytics for business decisions, offers self-service data access for internal stakeholders and provides data offerings to customers.
Data integrity
To achieve a high level of data integrity, an organization implements processes, rules and standards that govern how data is collected, stored, accessed, edited and used. These processes, rules and standards work in tandem to:
- Validate data and input
- Remove duplicate data
- Provide data backups and ensure business continuity
- Safeguard data via access controls
- Maintain an audit trail for accountability and compliance
An organization can use any number of tools and private or public cloud environments throughout the data lifecycle to maintain data integrity through something known as data governance. This is the practice of creating, updating and consistently enforcing the processes, rules and standards that prevent errors, data loss, data corruption, mishandling of sensitive or regulated data, and data breaches.
The benefits of data integrity
An organization with a high level of data integrity can:
- Increase the likelihood and speed of data recoverability in the event of a breach or unplanned downtime
- Protect against unauthorized access and data modification
- Achieve and maintain compliance more effectively
Good data integrity can also improve business decision outcomes by increasing the accuracy of an organization’s analytics. The more complete, accurate and consistent a dataset is, the more informed business intelligence and business processes become. As a result, leaders are better equipped to set and achieve goals that benefit their organization and drive employee and consumer confidence.
Data science tasks such as machine learning also benefit greatly from good data integrity. The more trustworthy and accurate the records a machine learning model is trained on, the better that model will be at making business predictions or automating tasks.
The different types of data integrity
There are two main categories of data integrity: physical data integrity and logical data integrity.
Physical data integrity is the protection of data wholeness (meaning the data isn’t missing important information), accessibility and accuracy while data is stored or in transit. Natural disasters, power outages, human error and cyberattacks pose risks to the physical integrity of data.
Logical data integrity refers to the protection of data consistency and completeness while it’s being accessed by different stakeholders and applications across departments, disciplines, and locations. Logical data integrity is achieved through four kinds of constraints (illustrated in the sketch after this list):
- Ensuring every record has a unique, non-null identifier so entries aren’t duplicated (entity integrity)
- Ensuring relationships between records stay valid, so references always point to data that exists (referential integrity)
- Ensuring values fall within an acceptable range or format for their field (domain integrity)
- Ensuring data meets an organization’s unique or industry-specific rules (user-defined integrity)
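A minimal Python sketch of what each constraint checks, using hypothetical customers and orders records. In practice a database typically enforces these rules itself; the field names and the "supported countries" rule here are assumptions for illustration only.

```python
# Hypothetical records used only to illustrate the four logical integrity checks.
customers = [
    {"customer_id": 1, "country": "DE"},
    {"customer_id": 2, "country": "FR"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 250.0},
    {"order_id": 11, "customer_id": 3, "amount": -5.0},  # violates two checks below
]

# Entity integrity: every customer has a unique, non-null identifier.
ids = [c["customer_id"] for c in customers]
assert all(i is not None for i in ids) and len(ids) == len(set(ids))

# Referential integrity: every order points to an existing customer.
missing_refs = [o for o in orders if o["customer_id"] not in set(ids)]

# Domain integrity: order amounts must be non-negative numbers.
bad_domain = [o for o in orders if not (isinstance(o["amount"], (int, float)) and o["amount"] >= 0)]

# User-defined integrity: an organization-specific rule, e.g. orders only from supported countries.
supported = {"DE", "FR", "IT"}
bad_custom = [c for c in customers if c["country"] not in supported]

print(missing_refs, bad_domain, bad_custom)
```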
How data integrity differs from data security
Data security is a subcomponent of data integrity and refers to the measures taken to prevent unauthorized data access or manipulation. Effective data security protocols and tools contribute to strong data integrity. In other words, data security is the means while data integrity is the goal. Data recoverability — in the event of a breach, attack, power outage or service interruption — falls under the realm of data security.
The consequences of poor data integrity
Human errors, transfer errors, malicious acts, insufficient security and hardware malfunctions all contribute to “bad data,” which negatively impacts an organization’s data integrity. An organization contending with one or more of these issues risks experiencing:
Poor data quality
Low-quality data leads to poor decision-making because of inaccurate and uninformed analytics. Reduced data quality can result in productivity losses, revenue decline and reputational damage.
Insufficient data security
Data that isn’t properly secured is at an increased risk of a data breach or being lost to a natural disaster or other unplanned event. And without proper insight and control over data security, an organization can more easily fall out of compliance with local, regional, and global regulations, such as the European Union’s General Data Protection Regulation.
Data quality
Data quality is essentially the measure of data integrity. A dataset’s accuracy, completeness, consistency, validity, uniqueness, and timeliness are the data quality measures organizations employ to determine the data’s usefulness and effectiveness for a given business use case.
How to determine data quality
Data quality analysts assess a dataset using the dimensions listed above and assign an overall score. When data ranks high across every dimension, it is considered high-quality data that is reliable and trustworthy for the intended use case or application. To measure and maintain high-quality data, organizations use data quality rules, also known as data validation rules, to ensure datasets meet criteria as defined by the organization.
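As a rough illustration of what such data quality rules can look like in practice, the sketch below applies a few hypothetical validation rules to individual records and reports which rules fail. The field names, patterns and thresholds are assumptions, not a prescribed standard.

```python
import re

# Hypothetical validation rules: each maps a rule name to a predicate over a record.
RULES = {
    "email_format": lambda r: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email", "")) is not None,
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    "country_code_iso2": lambda r: isinstance(r.get("country"), str) and len(r["country"]) == 2,
}

def validate(record: dict) -> list[str]:
    """Return the names of the rules the record violates."""
    return [name for name, rule in RULES.items() if not rule(record)]

print(validate({"email": "ada@example.com", "age": 36, "country": "GB"}))  # []
print(validate({"email": "not-an-email", "age": 200, "country": "GBR"}))   # fails all three rules
```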
The benefits of good data quality
Improved efficiency
Business users and data scientists don’t have to waste time locating or formatting data across disparate systems. Instead, they can readily access and analyze datasets with greater confidence, and they avoid the time otherwise lost to acting on incomplete or inaccurate data.
Increased data value
Because data is formatted consistently and contextualized for the user or application, organizations can derive value from data that may have otherwise been discarded or ignored.
Improved collaboration and better decision-making
High-quality data eliminates discrepancies across systems and departments and keeps processes and procedures working from consistent information. Collaboration and decision-making among stakeholders improve because everyone relies on the same data.
Reduced costs and improved regulatory compliance
High-quality data is easy to locate and access. Because there is no need to re-create or track down datasets, labor costs are reduced, and manual data entry errors become less likely. And because high-quality data is easy to store in the correct environment as well as collect and compile in mandatory reports, an organization can better ensure compliance and avoid regulatory penalties.
Improved employee and customer experiences
High-quality data provides more accurate, in-depth insights an organization can use to provide a more personalized and impactful experience for employees and customers.
The six dimensions of data quality
To determine data quality and assign an overall score, analysts evaluate a dataset using these six dimensions, also known as data characteristics:
- Accuracy: Is the data provably correct and does it reflect real-world knowledge?
- Completeness: Does the data comprise all relevant and available information? Are there missing data elements or blank fields?
- Consistency: Do corresponding data values match across locations and environments?
- Validity: Is data being collected in the correct format for its intended use?
- Uniqueness: Is data duplicated or overlapping with other data?
- Timeliness: Is data up to date and readily available when needed?
The higher a dataset scores in each of these dimensions, the greater its overall score. A high overall score indicates that a dataset is reliable, easily accessible, and relevant.
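One way an overall score could be rolled up from per-dimension scores, assuming each dimension has already been measured on a 0-1 scale. The scores and equal weighting below are illustrative assumptions, not a standard.

```python
# Hypothetical per-dimension scores for one dataset, each on a 0-1 scale.
dimension_scores = {
    "accuracy": 0.97,
    "completeness": 0.92,
    "consistency": 0.88,
    "validity": 0.95,
    "uniqueness": 0.99,
    "timeliness": 0.90,
}

# Overall score as a simple (optionally weighted) average across the six dimensions.
weights = {dim: 1.0 for dim in dimension_scores}  # equal weights by default
total_weight = sum(weights.values())
overall = sum(score * weights[dim] for dim, score in dimension_scores.items()) / total_weight

print(f"overall data quality score: {overall:.2f}")  # ~0.94 for the values above
```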
How to improve data quality
Some common methods and initiatives organizations use to improve data quality include:
Data profiling
Data profiling, also known as data quality assessment, is the process of auditing an organization’s data in its current state. This is done to uncover errors, inaccuracies, gaps, inconsistent data, duplications, and accessibility barriers. Any number of data quality tools can be used to profile datasets and detect data anomalies that need correction.
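As one possible sketch using pandas (assuming a tabular dataset already loaded into a DataFrame), a basic profile can summarize data types, null counts, distinct values and duplicates per column; dedicated data quality tools go much further.

```python
import pandas as pd

# Hypothetical dataset; in practice this would be read from a source system.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "b@example.com"],
    "signup_date": ["2023-01-05", "2023-02-31", "2023-03-01", "2023-03-01"],
})

# Per-column profile: data type, missing values, distinct values.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "distinct": df.nunique(),
})
print(profile)
print("duplicate rows:", int(df.duplicated().sum()))
print("duplicate customer_id values:", int(df["customer_id"].duplicated().sum()))
```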
Data cleansing
Data cleansing is the process of remediating the data quality issues and inconsistencies discovered during data profiling. This includes deduplicating datasets so that the same entry doesn’t unintentionally exist in multiple locations.
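A minimal cleansing sketch on a small, assumed DataFrame: it removes exact duplicates and drops records missing a business key. Real cleansing pipelines apply many more remediations.

```python
import pandas as pd

# Assumed dataset flagged during profiling as containing duplicates and gaps.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
})

cleaned = (
    df.drop_duplicates(subset=["customer_id", "email"])  # remove exact duplicate entries
      .dropna(subset=["customer_id"])                    # drop records missing the business key
      .reset_index(drop=True)
)
print(cleaned)  # two rows survive: customer_id 1 and 2
```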
Data standardization
This is the process of bringing disparate data assets and unstructured big data into a consistent format so that data is complete and ready for use, regardless of its source. To standardize data, business rules are applied to ensure datasets conform to an organization’s standards and needs.
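A standard-library-only sketch of the idea: values arriving in several source formats are converted to one agreed representation. The date formats and country mapping below are assumed business rules, not a universal specification.

```python
from datetime import datetime

DATE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y"]      # formats seen in source systems (assumed)
COUNTRY_MAP = {"deu": "DE", "fr": "FR", "france": "FR"}  # assumed business rule

def standardize_date(value: str) -> str:
    """Convert any known source format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def standardize_country(value: str) -> str:
    """Map free-form country values onto ISO 3166-1 alpha-2 codes."""
    return COUNTRY_MAP.get(value.strip().lower(), value.upper())

print(standardize_date("12.07.2023"), standardize_country("France"))  # 2023-07-12 FR
```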
Geocoding
Geocoding is the process of adding location metadata to an organization’s datasets. By tagging data with geographical coordinates to track where it originated from, where it has been and where it resides, an organization can ensure national and global geographic data standards are being met. For example, geographic metadata can help an organization ensure that its management of customer data stays compliant with GDPR.
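As a toy sketch of attaching location metadata, the code below uses an assumed in-memory lookup table standing in for a real geocoding service, and the region tag it adds is purely illustrative.

```python
# Assumed lookup table standing in for a real geocoding service.
CITY_COORDINATES = {
    "berlin": (52.52, 13.405),
    "paris": (48.8566, 2.3522),
}

def geocode(record: dict) -> dict:
    """Attach latitude/longitude metadata based on the record's city field."""
    coords = CITY_COORDINATES.get(record.get("city", "").lower())
    if coords:
        record["latitude"], record["longitude"] = coords
        record["data_region"] = "EU"  # e.g. used downstream for GDPR residency checks
    return record

print(geocode({"customer_id": 7, "city": "Berlin"}))
```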
Matching or linking
This is the method of identifying, merging, and resolving duplicate or redundant data.
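A minimal matching sketch using the standard library’s difflib: two records are flagged as candidates for merging when their names are sufficiently similar. The similarity threshold and the name-only comparison are assumptions; production matching usually combines several fields.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = [
    {"id": 1, "name": "Jon Smith"},
    {"id": 2, "name": "John Smith"},
    {"id": 3, "name": "Jane Doe"},
]

THRESHOLD = 0.8  # assumed cut-off for treating two names as the same entity
candidates = [
    (a["id"], b["id"])
    for i, a in enumerate(records)
    for b in records[i + 1:]
    if similarity(a["name"], b["name"]) >= THRESHOLD
]
print(candidates)  # [(1, 2)]
```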
Data quality monitoring
Maintaining good data quality requires continuous data quality management. Data quality monitoring is the practice of revisiting previously scored datasets and reevaluating them based on the six dimensions of data quality. Many data analysts use a data quality dashboard to visualize and track data quality KPIs.
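A small sketch of the monitoring idea: re-score a dataset on each run and flag any dimension that falls below a target or degrades compared with the previous run. The score history, target and tolerated drop are assumed values.

```python
# Previous and current per-dimension scores for one dataset (assumed values).
previous = {"accuracy": 0.97, "completeness": 0.92, "consistency": 0.88,
            "validity": 0.95, "uniqueness": 0.99, "timeliness": 0.90}
current  = {"accuracy": 0.96, "completeness": 0.84, "consistency": 0.88,
            "validity": 0.95, "uniqueness": 0.99, "timeliness": 0.71}

TARGET = 0.85    # assumed minimum acceptable score per dimension
MAX_DROP = 0.05  # assumed tolerated degradation between runs

alerts = [
    dim for dim in current
    if current[dim] < TARGET or previous[dim] - current[dim] > MAX_DROP
]
print("dimensions needing attention:", alerts)  # ['completeness', 'timeliness']
```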
Batch and real-time validation
This is the deployment of data validation rules across all applications and data types at scale to ensure all datasets adhere to specific standards. This can be done periodically as a batch process, or continuously in real time through processes like change data capture.
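A sketch of the batch side, assuming the same style of named validation rules shown earlier: a periodic job runs every record through the rules and reports violation counts. A real-time variant would apply the same function to each change event as it arrives, for example from a change data capture stream.

```python
# Named validation rules applied at scale (field names are assumptions).
RULES = {
    "id_present": lambda r: r.get("customer_id") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
}

def validate_batch(records: list[dict]) -> dict[str, int]:
    """Count violations per rule across a whole batch of records."""
    violations = {name: 0 for name in RULES}
    for record in records:
        for name, rule in RULES.items():
            if not rule(record):
                violations[name] += 1
    return violations

batch = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": None, "amount": -3.0},
]
print(validate_batch(batch))  # {'id_present': 1, 'amount_non_negative': 1}
```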
Master data management
Master data management (MDM) is the act of creating and maintaining an organization-wide centralized data registry where all data is cataloged and tracked. This gives the organization a single location to quickly view and assess its datasets regardless of where that data resides or its type. For example, customer data, supply chain information and marketing data would all reside in an MDM environment.
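A toy sketch of the registry idea only: a central catalog that records, for each dataset, its business domain, where it resides and its latest quality score. The fields and entries are assumptions for illustration; real MDM platforms manage far richer metadata and matching logic.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str            # business name of the dataset
    domain: str          # e.g. customer, supply chain, marketing
    location: str        # where the data physically resides
    quality_score: float

# Central, organization-wide registry keyed by dataset name.
catalog: dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    catalog[entry.name] = entry

register(CatalogEntry("customer_master", "customer", "warehouse.prod.customers", 0.94))
register(CatalogEntry("supplier_master", "supply chain", "erp.suppliers", 0.88))

# Single place to view and assess datasets regardless of where they reside.
for entry in sorted(catalog.values(), key=lambda e: e.quality_score):
    print(f"{entry.name:<16} {entry.domain:<13} {entry.location:<26} {entry.quality_score:.2f}")
```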
Data integrity, data quality and IBM
IBM offers a wide range of integrated data quality and governance capabilities including data profiling, data cleansing, data monitoring, data matching and data enrichment to ensure data consumers have access to trusted, high-quality data. IBM’s data governance solution helps organizations establish an automated, metadata-driven foundation that assigns data quality scores to assets and improves curation via out-of-the-box automation rules to simplify data quality management.
With data observability capabilities, IBM can help organizations detect and resolve issues within data pipelines faster. The partnership with Manta for automated data lineage capabilities enables IBM to help clients find, track and prevent issues closer to the source.
Learn more about designing the right data architecture to elevate your data quality.