Data Quality Checks: How to Test 6 Data Quality Dimensions

January 6, 2024


What Is Data Quality Testing?

Data quality testing involves evaluating data to ensure it meets specific standards for accuracy, completeness, consistency, and more. These tests identify discrepancies and errors in data that can lead to flawed analysis or decision-making. Testing is a continuous process carried out at various stages of data handling, from collection to processing and storage, providing a view of data health.

Organizations use data quality testing to maintain data integrity, reliability, and trustworthiness. This process helps ensure that data sources produce accurate results when used in analytics, reporting, or operational tasks. By implementing data quality testing, companies can minimize risks associated with inaccurate data and increase confidence in data-driven decisions.

Key Dimensions of Data Quality and Relevant Data Checks

1. Accuracy

Accuracy in data quality ensures that information correctly represents the real-world entities it models. It's essential for preventing errors in data-driven decisions and operations. Accurate data must be verified against trusted sources and refined regularly to maintain its precision. Errors and discrepancies can distort outcomes and lead to mistrust in data assets.

Relevant checks include:

  • Source verification: Compare data against trusted reference sources to confirm correctness.
  • Threshold analysis: Validate that values fall within expected thresholds or ranges.
  • Anomaly detection: Identify outliers using statistical methods to flag potential inaccuracies.
  • Cross-field validation: Ensure relationships between fields are logical (e.g., birth date < current date).
  • Historical comparison: Assess data against historical records to detect deviations.
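
A minimal sketch of a few of these accuracy checks using Pandas. The `orders` DataFrame, its columns, and the thresholds are hypothetical placeholders; in practice they would come from your own data and business rules.

```python
import pandas as pd

# Hypothetical orders data; column names and values are illustrative only.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [25.0, 310.0, -5.0, 12000.0],
    "birth_date": pd.to_datetime(["1985-03-01", "1990-07-15", "2030-01-01", "1972-11-30"]),
})

# Threshold analysis: amounts should fall within an expected range.
out_of_range = df[(df["amount"] < 0) | (df["amount"] > 10_000)]

# Cross-field validation: a birth date must precede the current date.
future_births = df[df["birth_date"] >= pd.Timestamp.today()]

# Anomaly detection: flag amounts more than three standard deviations from the mean.
z_scores = (df["amount"] - df["amount"].mean()) / df["amount"].std()
outliers = df[z_scores.abs() > 3]

print(f"{len(out_of_range)} out of range, {len(future_births)} future birth dates, {len(outliers)} statistical outliers")
```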

2. Completeness

Completeness of data refers to the extent to which all required data is available. Missing data can lead to incorrect interpretations and faulty decision-making processes. To ensure completeness, businesses must identify essential data points and verify that all fields are filled appropriately.

Relevant checks include:

  • Null value checks: Identify and report missing values in required fields.
  • Mandatory field validation: Confirm all critical fields are populated.
  • Record-level completeness: Check if datasets include all required records.
  • Cross-system completeness: Ensure data transfers between systems do not omit critical elements.
  • Range and enumeration verification: Validate that fields restricted to a known range or set of values contain only values from that range or set.
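
A minimal completeness sketch in Pandas, assuming a hypothetical customer table where `customer_id` and `email` are required and `status` is restricted to a known set of values:

```python
import pandas as pd

# Hypothetical customer records; the required fields below are illustrative.
df = pd.DataFrame({
    "customer_id": [101, 102, None, 104],
    "email": ["a@example.com", None, "c@example.com", "d@example.com"],
    "status": ["active", "inactive", "active", "pending"],
})

REQUIRED_FIELDS = ["customer_id", "email"]
ALLOWED_STATUSES = {"active", "inactive"}

# Null value / mandatory field checks: report missing values per required column.
missing_counts = df[REQUIRED_FIELDS].isna().sum()

# Record-level completeness: rows missing any required field.
incomplete_rows = df[df[REQUIRED_FIELDS].isna().any(axis=1)]

# Enumeration verification: statuses outside the known set of values.
bad_statuses = df[~df["status"].isin(ALLOWED_STATUSES)]

print(missing_counts.to_dict(), len(incomplete_rows), len(bad_statuses))
```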

3. Consistency

Consistency in data quality means maintaining uniformity across datasets, regardless of the source or format. Inconsistent data can result from varied data formats, differing scales, or versioning errors. Upholding consistency requires coherent data handling policies and routine checks for anomalies.

Relevant checks include:

  • Format standardization: Ensure consistent data formatting across datasets (e.g., date formats, numeric precision).
  • Field-level consistency: Compare repeated values across datasets for uniformity (e.g., customer ID).
  • Version control validation: Verify that the most recent data versions are used across systems.
  • Cross-system checks: Ensure consistency between integrated datasets or systems.
  • Schema consistency: Confirm that database schemas adhere to a unified standard.
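
A sketch of a few consistency checks across two hypothetical extracts of the same customers (a `crm` table and a `billing` table); the column names and expected schema are assumptions for illustration:

```python
import pandas as pd

# Two hypothetical extracts of the same customers from different systems.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "signup_date": ["2024-01-05", "2024/01/06", "2024-01-07"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4]})

# Format standardization: signup dates should all follow ISO 8601 (YYYY-MM-DD).
iso_pattern = r"^\d{4}-\d{2}-\d{2}$"
bad_dates = crm[~crm["signup_date"].str.match(iso_pattern)]

# Cross-system / field-level consistency: customer IDs present in one system but not the other.
only_in_crm = set(crm["customer_id"]) - set(billing["customer_id"])
only_in_billing = set(billing["customer_id"]) - set(crm["customer_id"])

# Schema consistency: both extracts should expose the columns we expect.
expected_crm_columns = {"customer_id", "signup_date"}
schema_ok = expected_crm_columns.issubset(crm.columns)

print(len(bad_dates), only_in_crm, only_in_billing, schema_ok)
```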

4. Timeliness

Timeliness refers to data being available when needed for decision-making. Delayed or outdated information reduces the effectiveness of insights. Ensuring timeliness requires efficient data collection processes and immediate data updates.

Relevant checks include:

  • Timestamp validation: Confirm that timestamps on data entries align with processing requirements.
  • Latency monitoring: Measure and monitor delays in data capture or transfer processes.
  • Scheduling adherence: Validate that data updates or feeds occur as per the defined schedule.
  • Alert mechanisms: Implement alerts for delayed data updates or stale records.
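
A minimal timeliness sketch using only the standard library; the latency and staleness thresholds are illustrative and would normally come from your own SLAs:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical timestamps for the newest record in a feed.
latest_event_time = datetime(2024, 1, 6, 8, 0, tzinfo=timezone.utc)
ingested_at = datetime(2024, 1, 6, 9, 30, tzinfo=timezone.utc)

MAX_LATENCY = timedelta(hours=1)     # acceptable delay between event and ingestion
MAX_STALENESS = timedelta(hours=24)  # how old the newest record may be

# Latency monitoring: how long did the record take to arrive?
latency = ingested_at - latest_event_time

# Staleness / scheduling adherence: has the feed delivered anything recently?
staleness = datetime.now(timezone.utc) - ingested_at

if latency > MAX_LATENCY:
    print(f"ALERT: ingestion latency {latency} exceeds {MAX_LATENCY}")
if staleness > MAX_STALENESS:
    print(f"ALERT: feed is stale; last record ingested {staleness} ago")
```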

5. Uniqueness

Uniqueness in data ensures that each entry is distinct, without duplications. Duplicate data can clutter datasets, skew analysis, and increase storage costs. Uniqueness is particularly important when managing customer records, resource allocations, and supply chain data.

Relevant checks include:

  • Duplicate detection: Use algorithms to identify duplicate entries in datasets.
  • Primary key validation: Ensure primary key constraints are enforced in databases.
  • Merge check: Confirm that merged datasets do not introduce duplicate records.
  • Entity resolution: Perform entity matching to ensure distinct representation of individuals or objects.
  • Storage cleanup: Periodically remove redundant or duplicate data.
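
A short uniqueness sketch in Pandas, assuming a hypothetical customer table with a surrogate key (`customer_id`) and a natural key (`email`):

```python
import pandas as pd

# Hypothetical customer table with a surrogate key and a natural key (email).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com", "b@example.com", "a@example.com"],
})

# Primary key validation: the surrogate key must be unique.
duplicate_keys = df[df["customer_id"].duplicated(keep=False)]

# Duplicate detection on a natural key: the same email appearing on multiple rows.
duplicate_emails = df[df["email"].duplicated(keep=False)]

# Merge check / storage cleanup: deduplicate, keeping the first occurrence.
deduplicated = df.drop_duplicates(subset=["customer_id"], keep="first")

print(len(duplicate_keys), len(duplicate_emails), len(deduplicated))
```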

6. Validity

Validity refers to whether data values conform to defined formats and fall within expected ranges. Invalid data can arise from input errors or faulty data integrations, which makes validity checks crucial for ensuring data accuracy and usability.

Relevant checks include:

  • Format validation: Check if data values comply with predefined formats (e.g., email, phone number).
  • Range validation: Ensure numeric or date values fall within accepted ranges.
  • Domain constraints: Verify values belong to a defined set of valid categories or codes.
  • Dependency validation: Check that data values conform to interdependent rules (e.g., zip code matching city).
  • Regular expression matching: Use regex patterns to validate textual fields (e.g., postal codes, identification numbers).
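
A minimal validity sketch in Pandas; the email pattern, age range, and country list are simplified placeholders for real validation rules:

```python
import re
import pandas as pd

# Hypothetical contact records; the formats and ranges below are illustrative.
df = pd.DataFrame({
    "email": ["a@example.com", "not-an-email", "c@example.com"],
    "age": [34, -2, 130],
    "country": ["US", "DE", "XX"],
})

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
VALID_COUNTRIES = {"US", "DE", "FR", "GB"}

# Format validation / regular expression matching: emails must match a basic pattern.
invalid_emails = df[~df["email"].apply(lambda v: bool(EMAIL_RE.match(v)))]

# Range validation: ages must fall within a plausible range.
invalid_ages = df[(df["age"] < 0) | (df["age"] > 120)]

# Domain constraints: country codes must belong to the accepted set.
invalid_countries = df[~df["country"].isin(VALID_COUNTRIES)]

print(len(invalid_emails), len(invalid_ages), len(invalid_countries))
```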

Tools and Techniques for Data Quality Checks

Automated Tools

Automated tools streamline data quality checks by leveraging predefined rules and algorithms to identify and correct issues at scale. These tools often integrate with ETL pipelines, databases, and analytics platforms, providing validation during data ingestion and processing.

Key features include duplicate detection, schema validation, and outlier identification. These tools offer user-friendly interfaces for creating workflows that enforce consistency, accuracy, and completeness. Platforms may also incorporate machine learning to detect anomalies and suggest corrections based on historical patterns.

Manual Techniques

Manual techniques remain relevant in scenarios where domain expertise is essential for identifying subtle errors or interpreting data nuances that automated systems might overlook. Analysts may use tools like spreadsheets or SQL queries to perform ad hoc checks, verify calculations, or validate complex relationships between data points.

These methods are particularly useful for small-scale data audits, testing new data pipelines, or addressing one-off issues. While manual techniques are time-consuming, they provide a layer of quality assurance that automated tools alone cannot achieve.

Best practices for manual checks include creating detailed guidelines for reviewers, using peer reviews to minimize oversight, and documenting findings to inform future improvements in automated systems.

Scripting and Programming

Custom scripts and programs allow for tailored data quality checks that address unique business requirements or complex datasets. Common programming languages such as Python, R, or SQL are used to implement advanced validation logic, automate repetitive checks, and process large datasets efficiently.

For example, Python libraries like Pandas and NumPy enable analysts to write scripts for checking data consistency, handling missing values, and detecting anomalies. SQL queries are often used to validate database integrity, enforce referential constraints, and perform cross-table comparisons.
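
As a small illustration of the SQL side, the following sketch uses an in-memory SQLite database to check referential integrity between two hypothetical tables; the schema and data are invented for the example:

```python
import sqlite3

# In-memory example database with a deliberately orphaned order row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 99, 40.0);  -- customer 99 does not exist
""")

# Referential integrity: orders whose customer_id has no matching customer.
orphans = conn.execute("""
    SELECT o.id
    FROM orders o
    LEFT JOIN customers c ON c.id = o.customer_id
    WHERE c.id IS NULL
""").fetchall()

# Cross-table comparison: total order count vs. orders attached to known customers.
total, linked = conn.execute("""
    SELECT COUNT(*), COUNT(c.id)
    FROM orders o
    LEFT JOIN customers c ON c.id = o.customer_id
""").fetchone()

print(f"{len(orphans)} orphaned orders; {linked}/{total} orders linked to a customer")
```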

Scripting offers flexibility and scalability, enabling organizations to adapt their data quality processes to evolving needs. However, it requires skilled personnel to develop the scripts and can create an ongoing maintenance burden.

5 Best Practices for Effective Data Quality Checks

1. Automate Repetitive Checks Wherever Possible

Manual checks, while valuable in some scenarios, are prone to human error and become increasingly inefficient as data volumes grow. Automation helps streamline repetitive tasks, such as checking for missing fields, identifying duplicates, and validating data against predefined business rules. By leveraging automation tools and frameworks, organizations can ensure that checks are performed consistently and at scale.

Technologies like rule-based engines, machine learning models, and data validation scripts are particularly useful for monitoring data quality continuously and flagging anomalies in real time. This reduces the burden on data teams and frees them to focus on resolving issues rather than identifying them. Additionally, automated checks can be scheduled to run at specific intervals or triggered dynamically during critical data operations.
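
A minimal sketch of scheduling a recurring check suite, using the third-party `schedule` package as one lightweight option; cron or an orchestrator would serve the same purpose, and `run_quality_checks` is a hypothetical stand-in for your own checks:

```python
import time
import schedule  # third-party package: pip install schedule

def run_quality_checks() -> None:
    # Placeholder for the repetitive checks described above (nulls, duplicates, ranges, ...).
    issues = []  # in practice, collect failed-check results here
    if issues:
        print(f"Found {len(issues)} data quality issues")  # or route to an alerting channel
    else:
        print("All checks passed")

# Run the full check suite every hour; a cron job or an orchestrator works equally well.
schedule.every().hour.do(run_quality_checks)

while True:
    schedule.run_pending()
    time.sleep(60)
```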

2. Integrate Checks into ETL Pipelines

ETL pipelines are the backbone of data processing workflows, moving data from source systems to storage and analytics platforms. By embedding data quality checks directly into ETL pipelines, organizations can catch errors at the earliest stages of the data lifecycle. For example, schema validation can ensure that incoming data adheres to the expected format and structure, while transformation logic can include checks to prevent invalid or incomplete records from proceeding to the next step.
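
A minimal sketch of a validation step that could sit between extract and load; the expected schema, column names, and the `extract`/`load` functions referenced in the comment are hypothetical:

```python
import pandas as pd

EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "ordered_at": "datetime64[ns]"}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Reject batches with schema drift; drop rows that fail basic record checks."""
    # Schema validation: incoming data must expose the expected columns and dtypes.
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns or str(df[column].dtype) != dtype:
            raise ValueError(f"Schema check failed for column '{column}'")

    # Record-level checks: filter out rows that should not proceed to the load step.
    valid = df[df["amount"].notna() & (df["amount"] >= 0)]
    rejected = len(df) - len(valid)
    if rejected:
        print(f"Quarantined {rejected} invalid rows before loading")
    return valid

# extract() and load() are hypothetical stand-ins for your pipeline's own steps:
# load(validate(extract()))
```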

Integration within ETL also allows organizations to enforce consistency across datasets by applying normalization rules and reconciling discrepancies before loading data into target systems. Many ETL tools come equipped with data quality modules that facilitate this process.

3. Perform Checks Regularly (Batch vs. Real-Time)

Regular checks are essential to ensure that datasets remain reliable and relevant over time. Batch checks are ideal for processing large volumes of data in scheduled intervals, such as daily, weekly, or monthly cycles. These are particularly useful for historical data, reporting, and archival systems, where real-time updates may not be critical. Real-time data quality checks are indispensable for dynamic environments, such as e-commerce platforms, IoT ecosystems, or financial trading systems.

Real-time checks validate data as it is generated or ingested, ensuring that immediate actions are based on accurate and up-to-date information. A hybrid approach, which combines batch processing for static datasets with real-time monitoring for live systems, offers the best of both worlds.

4. Maintain an Audit Trail for Identified and Resolved Issues

An audit trail serves as a critical record-keeping mechanism for all data quality activities. It provides a transparent history of issues, their root causes, and the steps taken to resolve them. Such documentation is invaluable for compliance with regulatory standards like GDPR or HIPAA, which often require organizations to demonstrate accountability in their data management practices. Beyond compliance, audit trails enable organizations to perform root cause analysis, identifying recurring patterns and systemic issues that need to be addressed at their source.

A well-maintained audit trail includes detailed logs that capture timestamps, error descriptions, responsible parties, and resolution outcomes. These records can be stored in centralized systems accessible to all stakeholders, fostering collaboration and continuous improvement.
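
A minimal sketch of an append-only audit log, using a dataclass and a JSON Lines file; the fields and the example issue are illustrative, and a database table or ticketing system could play the same role:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DataQualityIssue:
    """One audit-trail entry: what failed, where, who owns it, and how it was resolved."""
    detected_at: str
    check_name: str
    dataset: str
    description: str
    owner: str
    resolution: str | None = None
    resolved_at: str | None = None

def log_issue(issue: DataQualityIssue, path: str = "dq_audit_log.jsonl") -> None:
    # Append-only JSON Lines file accessible to all stakeholders.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(issue)) + "\n")

log_issue(DataQualityIssue(
    detected_at=datetime.now(timezone.utc).isoformat(),
    check_name="null_check.orders.customer_id",
    dataset="orders",
    description="Rows missing customer_id after nightly load",
    owner="data-eng-oncall",
))
```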

5. Collaborate Across Teams to Define Expectations and Priorities

Effective data quality management requires input and cooperation from multiple stakeholders, including data engineers, analysts, business leaders, and compliance officers. Each group has unique perspectives and priorities, which must be aligned to ensure a unified approach. For instance, while business teams may focus on the accuracy and timeliness of data for decision-making, compliance teams may emphasize adherence to regulations.

To align these goals, organizations should hold cross-functional workshops and meetings to define quality standards, acceptable error thresholds, and processes for resolving disputes. Collaboration tools, such as shared dashboards and data catalogs, can help teams maintain visibility into data quality metrics and issues. Clear communication channels and governance frameworks ensure that everyone understands their roles and responsibilities.

Automating Data Quality with Dagster

Data quality can be significantly improved by tightly coupling data quality checks with the orchestration of the processes that generate the data.

Dagster’s Asset Checks feature allows you to automate data quality checks and validate data assets as soon as they are materialized, proactively alerting you to data quality issues and preventing malformed or erroneous data from flowing to downstream systems.
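
A minimal sketch of an asset check, loosely based on Dagster's `@asset_check` decorator and `AssetCheckResult`; the `orders` asset and its file-based storage are hypothetical, and the exact API may differ slightly between Dagster versions:

```python
import pandas as pd
from dagster import AssetCheckResult, Definitions, asset, asset_check

@asset
def orders() -> None:
    # Hypothetical asset: writes the orders table to a local file.
    pd.DataFrame({"order_id": [1, 2, 3], "amount": [25.0, 310.0, 47.5]}).to_csv("orders.csv", index=False)

@asset_check(asset=orders)
def orders_have_no_null_ids() -> AssetCheckResult:
    # Runs when the asset is materialized and flags bad data before it reaches downstream consumers.
    df = pd.read_csv("orders.csv")
    null_ids = int(df["order_id"].isna().sum())
    return AssetCheckResult(passed=null_ids == 0, metadata={"null_order_ids": null_ids})

defs = Definitions(assets=[orders], asset_checks=[orders_have_no_null_ids])
```

When the `orders` asset is materialized, the check runs alongside it, and its pass/fail status and metadata surface next to the asset in the Dagster UI.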

The Dagster+ Catalog, meanwhile, enables master data management and easier discovery of trusted data assets across your organization. Unlike traditional data catalogs, Dagster is also aware of the context and processes surrounding the materialization of these assets, offering your team built-in monitoring and observability capabilities that allow you to visualize the health of your data pipelines and perform root-cause analyses from within the same platform.

To learn more about how Dagster can serve as the single pane of glass for the health of your data pipelines, see this deep dive into Asset Checks.