Data quality issues are the silent killers of today’s data-driven initiatives, leading to incorrect insights, flawed decisions, operational inefficiencies, and financial loss.
Traditional data quality management approaches are reactive and fragmented, often validating quality only when unexpected results appear. This disjointed method is costly and inefficient, causing delays and persistent errors that affect decision-making.
Without embedded data quality criteria, organizations miss the chance to ensure data accuracy and reliability from the start, leading to higher costs and diminished trust in data insights.
Data is used to both run and improve the ways that the organization achieves their business objectives ... [data quality ensures that] data is of sufficient quality to meet the business needs.
The Practitioner's Guide to Data Quality Improvement
Automated data pipelines with integrated quality checks contribute to precise analysis and stable operations. The orchestrator environment is crucial for monitoring and enforcing data quality throughout the data lifecycle, ensuring accurate, complete, and actionable data.
In our recent Dagster Deep Dive, "Building Reliable Data Platforms," we discussed strategies for integrating data quality checks using Dagster. If you missed the live event, we’ve embedded the on-demand video below.
Key Point Summary
Throughout the webinar, we explored a few key points:
Understanding Data Quality
It’s important to establish a consensus, so we first explained what it means to have quality data, touching on why it’s crucial for making informed business decisions and ensuring operational efficiency. We also examined the impact of poor data quality, which can cost organizations significant time, money, and resources.
Six Dimensions of Data Quality
To help with assessing data quality, we highlighted six key dimensions to consider when observing data:
- Completeness: Whether all required attributes and records are populated.
- Timeliness: Whether data is available within a specific, predetermined time frame.
- Consistency: Whether data is uniform and reliable across multiple data stores and across stages of data processing.
- Accuracy: Whether the data correctly represents the real-world conditions or facts it is supposed to reflect.
- Validity: Whether the data conforms to the syntactic and semantic rules set for its domain.
- Uniqueness: Whether duplicates are present within a dataset where duplicates should not exist.
We also reviewed strategies to build rules for data validation across these dimensions so that you can enhance your overall data integrity.
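To make these dimensions concrete, here is a minimal sketch (not from the webinar) of what rule-based checks for a few of them might look like on a hypothetical pandas `orders` table; the column names and thresholds are illustrative assumptions.

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> dict[str, bool]:
    """Illustrative rule checks for a hypothetical orders table."""
    return {
        # Completeness: required attributes are populated on every record.
        "completeness": bool(df[["order_id", "customer_id", "amount"]].notna().all().all()),
        # Uniqueness: no duplicate order identifiers.
        "uniqueness": not df["order_id"].duplicated().any(),
        # Validity: amounts conform to the domain rule of being non-negative.
        "validity": bool((df["amount"] >= 0).all()),
        # Timeliness: newest record is no older than one day (assumes tz-naive timestamps).
        "timeliness": bool(pd.Timestamp.now() - df["created_at"].max() <= pd.Timedelta(days=1)),
    }

# Example usage:
# failed = [dim for dim, ok in validate_orders(orders_df).items() if not ok]
```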
Frameworks for Data Validation
Tools like Great Expectations (GX) and dbt tests exist to enforce data quality, so we showed how they can be integrated into data pipelines to uphold quality standards.
Tip: If you’re interested in improving data pipeline reliability without writing custom logic for data testing, check out this blog post on using Dagster and GX together.
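As a taste of what these frameworks look like in practice, here is a minimal sketch using Great Expectations' legacy pandas-style API (roughly GX 0.15.x and earlier; newer GX versions use a different, context-based API). The file and column names are illustrative assumptions.

```python
import great_expectations as ge

# Load a dataset with the legacy pandas-backed API.
orders = ge.read_csv("orders.csv")

# Declare expectations that map to the dimensions above.
orders.expect_column_values_to_not_be_null("order_id")             # completeness
orders.expect_column_values_to_be_unique("order_id")               # uniqueness
orders.expect_column_values_to_be_between("amount", min_value=0)   # validity

# Validate and inspect the overall result.
results = orders.validate()
print(results.success)
```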
Implementing Data Validations
Next, we demoed practical applications of Dagster’s tools for automating and maintaining data quality, highlighting features like Dagster’s asset checks and showing how you can configure them to automatically ensure that data meets predefined quality standards.
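Here is a minimal sketch of what such an asset check can look like, assuming a simple in-memory pandas asset; the asset and column names are illustrative, not the exact code from the demo.

```python
import pandas as pd
from dagster import AssetCheckResult, Definitions, asset, asset_check

@asset
def orders() -> pd.DataFrame:
    # Stand-in for a real ingestion step.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})

@asset_check(asset=orders)
def orders_order_id_is_unique(orders: pd.DataFrame) -> AssetCheckResult:
    # Uniqueness check: fail if any order_id appears more than once.
    duplicate_count = int(orders["order_id"].duplicated().sum())
    return AssetCheckResult(
        passed=duplicate_count == 0,
        metadata={"duplicate_count": duplicate_count},
    )

defs = Definitions(assets=[orders], asset_checks=[orders_order_id_is_unique])
```

Once loaded, the check runs alongside materializations of the asset and its results appear next to the asset in the Dagster UI.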
Supporting Organizational Data Quality
We also highlighted the importance of collaboration across teams to uphold data standards, addressing how to support quality standards organization-wide, the role of data platforms, and the importance of embedding data quality controls throughout the data lifecycle, from ingestion to analysis.
Monitoring Data Asset Health
We demonstrated how Dagster’s platform can monitor and alert on data quality issues in real time, ensuring ongoing compliance with data quality standards, and showed how you can set up these checks to get the feedback you need to maintain that integrity.
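One pattern worth noting for monitoring is check severity: a sketch, assuming the `orders` asset from the earlier example, where a non-critical issue is reported with WARN severity so it surfaces in the UI and alerting without being treated as a hard failure. The threshold and column are illustrative.

```python
import pandas as pd
from dagster import AssetCheckResult, AssetCheckSeverity, asset_check

@asset_check(asset="orders")
def orders_amount_within_expected_range(orders: pd.DataFrame) -> AssetCheckResult:
    # Flag unusually large amounts as a warning rather than an error,
    # so the issue shows up in monitoring without blocking anything.
    outlier_count = int((orders["amount"] > 10_000).sum())
    return AssetCheckResult(
        passed=outlier_count == 0,
        severity=AssetCheckSeverity.WARN,
        metadata={"outlier_count": outlier_count},
    )
```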
Key Takeaways From the Session
- Data quality is not just a technical requirement but a strategic asset that drives business success.
- Implementing structured data validation across the data lifecycle can significantly reduce errors and enhance data usability.
- Collaborative efforts between data engineers, governance teams, and platform owners are essential to enforce and maintain data quality standards.
- Leveraging modern tools and frameworks can automate the process, ensuring consistency and reliability in data quality checks.
Q&A Highlights
During the webinar, we held a Q&A session with attendees but didn’t get to answer every question due to time constraints. Below are our answers to the remaining questions:
- What is the best practice for testing asset checks in Dagster, and does an asset check failure prevent the I/O manager from running?
- Best practices for testing asset checks are detailed in the Dagster documentation, which provides guidelines on how to effectively test asset checks within your data pipelines. The documentation does not directly specify whether an asset check failure prevents the I/O manager from running.
- Can you provide examples or best practices for multi-asset checks in Dagster, particularly for replicating data validity across systems?
- In Dagster, asset checks are typically defined on a single asset, but you can involve multiple assets in the validation process. For example, you might define an asset check that compares two different assets (e.g., tables in a database) to ensure they are equal or meet certain conditions. A practical example might involve an @asset_check decorator applied to one asset, where the check function verifies the equality of two related assets using a SQL query or other comparison method (see the sketch after this Q&A list). While no specific examples were provided in the webinar, you can also explore libraries like Datafold's data-diff for similar functionality outside of Dagster.
- Can partitioned assets work with asset checks in Dagster?
- Currently, it's a known limitation that partitioned assets may not work effectively with asset checks. More details can be found in this GitHub issue and the Dagster docs.
- When are asset checks on Source Assets run, especially considering Source Assets can be external to the Dagster deployment?
- Currently, asset checks on Source Assets need to be manually scheduled or executed independently. This setup may not be ideal, especially when the Source Asset is external, but the feedback has been noted and will be shared with the development team for potential improvements.
- Can we define Dagster Assets at a customer level for periodic job runs for each customer, including validations/data anomaly detection partitioned by customer?
- Yes, it's possible to define Dagster Assets at a customer level. You could either use a factory method to produce a pipeline for each customer or create a custom run configuration with customer information, allowing the reuse of the same pipeline across different customers. If you are already using a single job definition that runs periodically with different customer information configurations, you can refactor your job to run materialization of assets for each customer with asset checks. Implementing customer-specific validations might introduce complexity, but it's feasible to integrate them into Dagster's UI, divided by customer configuration.
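For the multi-asset comparison question above, here is a rough sketch of the pattern described in that answer. It assumes a hypothetical `replicated_orders` asset that loads as a pandas DataFrame and an illustrative source-database connection string; a row-count comparison is shown for brevity, whereas a real replication check would compare contents or checksums.

```python
import pandas as pd
import sqlalchemy as sa
from dagster import AssetCheckResult, asset_check

# Illustrative connection string for the source system (replace with your own).
SOURCE_DB_URL = "postgresql://user:pass@source-host/source_db"

@asset_check(asset="replicated_orders")
def replicated_orders_matches_source(replicated_orders: pd.DataFrame) -> AssetCheckResult:
    # Compare the replicated table's row count against the source system's count.
    engine = sa.create_engine(SOURCE_DB_URL)
    with engine.connect() as conn:
        source_count = conn.execute(sa.text("SELECT COUNT(*) FROM orders")).scalar_one()
    replica_count = len(replicated_orders)
    return AssetCheckResult(
        passed=int(source_count) == replica_count,
        metadata={"source_count": int(source_count), "replica_count": replica_count},
    )
```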
Conclusion
Ensuring data quality is critical for making reliable business decisions and maintaining operational efficiency. By integrating comprehensive data quality checks into data pipelines using tools like Dagster, organizations can enhance their data management practices and reduce errors. Collaborative efforts and modern frameworks are essential to uphold data standards and drive business success.
Watch the on-demand webinar we’ve embedded to explore these strategies and see the practical applications for yourself. And stay tuned for future deep dives for more insights on staying on top of your data quality management.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!