Data Lineage for Compliance: GDPR, HIPAA, and What Auditors Want

GDPR and HIPAA auditors don't want dashboards. Learn how column-level lineage built into your pipeline architecture produces audit-ready evidence on demand.

Auditors don't evaluate dashboards. They follow a specific row of data from its raw source, through every transformation, to the number they're questioning. Compliance breaks down in the gap between what data teams typically show and what regulators require. Lineage graphs, catalog entries, and screenshots don't constitute evidence. Closing that gap means building evidence into your pipeline architecture at development time, not assembling it the week before a review.

TL;DR

  • Regulators ask for provenance of a specific field, not confirmation that a lineage tool exists.
  • GDPR field-level compliance requires column-level lineage rather than table-level flow maps.
  • HIPAA audits focus on access trails and minimum necessary exposure at each transformation hop.
  • Manual documentation decays; teams that have moved to automated lineage report 57 percent faster audit prep.
  • Teams that can't produce field-level provenance on demand are betting on how thorough the next auditor happens to be.

What regulators actually ask data teams for

The question an auditor asks is never “do you have a lineage tool?” It’s “show me where this number came from and who touched it.” The distinction matters because many teams treat lineage as an internal documentation exercise rather than as evidence for regulatory reviews.

GDPR specifics: provenance, retention, and right to be forgotten

GDPR asks three questions about personal data: where was it collected, what systems processed it, and can it be deleted?

Provenance

Proving provenance under GDPR means tracing a specific field, such as a customer email, a device ID, or a behavioral record, back to the collection point. Table-level lineage won’t satisfy this. An auditor who asks “where did this email address come from?” doesn’t want to hear that the customers table came from raw_crm. Column-level lineage answers that customers.email originated in raw_crm.contact_email, was normalized to lowercase in a staging model, and passed unchanged into the reporting layer. That chain is the evidence.

Column-level lineage is required for GDPR field-level compliance. Lineage shows how personal data is collected, processed, and stored, which is the auditable traceability GDPR requires.
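
The chain described above can be modeled as a small column-level graph and walked backwards from the field an auditor asks about. This is a minimal plain-Python sketch, not Dagster's API: the `Column` and `Hop` types, the `HOPS` list, the `stg_customers` staging model name, and the `trace_upstream` helper are all illustrative assumptions.

```python
from typing import List, NamedTuple


class Column(NamedTuple):
    asset: str
    name: str


class Hop(NamedTuple):
    source: Column
    target: Column
    transform: str


# Hypothetical column-level edges mirroring the email example in the text.
HOPS = [
    Hop(Column("raw_crm", "contact_email"), Column("stg_customers", "email"), "lowercase"),
    Hop(Column("stg_customers", "email"), Column("customers", "email"), "unchanged"),
]


def trace_upstream(column: Column, hops: List[Hop]) -> List[Hop]:
    """Collect every hop that feeds `column`, walking upstream.

    The result is reversed so that, for a linear chain, the hop closest
    to the collection point comes first.
    """
    by_target = {}
    for hop in hops:
        by_target.setdefault(hop.target, []).append(hop)

    chain = []
    seen = set()
    frontier = [column]
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        for hop in by_target.get(current, []):
            chain.append(hop)
            frontier.append(hop.source)
    chain.reverse()
    return chain


for hop in trace_upstream(Column("customers", "email"), HOPS):
    print(f"{hop.source.asset}.{hop.source.name} -> "
          f"{hop.target.asset}.{hop.target.name} ({hop.transform})")
```

The printed chain is the kind of answer the auditor's question calls for: a source column, each transformation, and the final field, rather than a table-to-table arrow.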

Retention and right to be forgotten

Right-to-be-forgotten requests expose incomplete lineage immediately. When downstream assets aren’t mapped, a deletion touches only the source system and the personal data persists in every downstream copy. The compliance team then signs off on a deletion that isn’t complete.

Column-level lineage addresses this by showing every downstream location that holds a copy of the field in question. An incomplete lineage map produces an incomplete deletion. The audit doesn’t catch the gap. The next data breach does.
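
To make the deletion problem concrete, the sketch below walks a column-level map downstream and lists every location that would need to be covered by an erasure request. It is plain Python under stated assumptions: the `DOWNSTREAM` mapping, the asset names in it, and the `deletion_targets` helper are hypothetical and stand in for whatever lineage store a team actually uses.

```python
from collections import deque

# Hypothetical column-level edges: source column -> columns derived from it.
DOWNSTREAM = {
    "raw_crm.contact_email": ["stg_customers.email"],
    "stg_customers.email": ["customers.email", "marketing_export.email"],
    "customers.email": ["reporting.customer_summary.email_hash"],
}


def deletion_targets(source_column, downstream):
    """Breadth-first walk returning every column that holds a copy or derivative."""
    targets = []
    seen = {source_column}
    queue = deque([source_column])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                targets.append(child)
                queue.append(child)
    return targets


targets = deletion_targets("raw_crm.contact_email", DOWNSTREAM)
print(f"{len(targets)} downstream locations to erase:")
for column in targets:
    print(" ", column)
```

A deletion that stops at `raw_crm.contact_email` leaves every column in `targets` in place, which is the incomplete deletion described above. The walk is only as complete as the map it runs on.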

HIPAA specifics: PHI flow, access trails, and minimum necessary

HIPAA audits are structured differently from GDPR reviews. GDPR auditors trace provenance and verify deletion. HIPAA auditors verify access: who saw protected health information (PHI), through which system, at what time, and whether that access was consistent with a documented need.

PHI flow

The minimum necessary standard in HIPAA requires that PHI is disclosed only to the extent needed for a specific purpose. For data pipelines, that means each transformation hop should expose only the fields required downstream. Lineage must capture not just where PHI went, but what was visible at each stage. A model that passes 40 PHI fields to an analytics layer when only 3 are needed represents a compliance gap, and table-level lineage won’t surface it.
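
A minimum necessary check at a single hop can be as simple as a set difference between the PHI columns a model exposes and the columns its consumers need. The sketch below assumes that per-hop column lists are available from lineage metadata; the `PipelineHop` type, the `PHI_COLUMNS` set, and the column names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Hypothetical classification of which columns count as PHI.
PHI_COLUMNS = frozenset(
    {"patient_name", "date_of_birth", "ssn", "diagnosis_code", "visit_date"}
)


@dataclass(frozen=True)
class PipelineHop:
    name: str
    exposed_columns: FrozenSet[str]      # what the model passes downstream
    required_downstream: FrozenSet[str]  # what consumers actually read


def excess_phi(hop: PipelineHop, phi_columns: FrozenSet[str] = PHI_COLUMNS) -> List[str]:
    """PHI columns exposed at this hop that no downstream consumer requires."""
    return sorted((hop.exposed_columns & phi_columns) - hop.required_downstream)


hop = PipelineHop(
    name="stg_visits -> analytics_visits",
    exposed_columns=frozenset(
        {"patient_name", "date_of_birth", "ssn", "diagnosis_code", "visit_date", "clinic_id"}
    ),
    required_downstream=frozenset({"diagnosis_code", "visit_date", "clinic_id"}),
)
print(hop.name, "exposes unneeded PHI:", excess_phi(hop))
```

Table-level lineage cannot feed a check like this because it has no notion of `exposed_columns` at all.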

Access trails

Dagster’s asset-centric approach captures the data produced at each step rather than merely logging which task ran. For HIPAA access trail requirements, this difference is significant. When an auditor asks which assets a user’s query touched, the answer should come from the orchestration system, not from reconstructed server logs.

Access logs tell auditors who accessed data and when. Lineage tells them what that data contained. A data visibility framework combines both: the lineage graph shows the transformation path, and the access log shows who traversed it.
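
One way to picture that combination is a join between access events and the columns each asset contained. The following is a plain-Python sketch, not a Dagster feature: the `COLUMN_SOURCES` mapping, the `ACCESS_LOG` entries, and the `visibility_report` helper are made-up illustrations of the two data sets involved.

```python
from datetime import datetime

# Hypothetical lineage: asset -> {column: upstream source column}.
COLUMN_SOURCES = {
    "stg_visits": {
        "patient_id": "raw_ehr.patient_id",
        "diagnosis_code": "raw_ehr.dx_code",
        "clinic_id": "raw_ehr.clinic_id",
    },
    "visit_counts": {
        "clinic_id": "stg_visits.clinic_id",
        "visit_count": "stg_visits.patient_id",
    },
}

# Hypothetical access log entries: who read which asset, and when.
ACCESS_LOG = [
    {"user": "analyst_a", "asset": "stg_visits", "at": datetime(2025, 1, 6, 9, 30)},
    {"user": "analyst_b", "asset": "visit_counts", "at": datetime(2025, 1, 6, 10, 5)},
]


def visibility_report(access_log, column_sources):
    """Attach to each access event the columns the asset held and where they came from."""
    report = []
    for event in access_log:
        columns = column_sources.get(event["asset"], {})
        report.append(
            {
                "user": event["user"],
                "asset": event["asset"],
                "at": event["at"].isoformat(),
                "columns": sorted(columns),
                "sources": sorted(set(columns.values())),
            }
        )
    return report


for row in visibility_report(ACCESS_LOG, COLUMN_SOURCES):
    print(row["at"], row["user"], row["asset"], row["columns"])
```

Each row answers both halves of the auditor's question at once: the log supplies the user and timestamp, and the lineage map supplies what was visible.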

Why manual evidence falls apart under audit

Manual lineage documentation starts decaying the moment a pipeline changes. A wiki updated in Q1 doesn’t reflect the transformation logic deployed in Q3, and a Slack thread where an engineer explained a join condition six months ago isn’t audit evidence.

A compliance officer can’t certify a transformation they can’t reproduce. When field calculation logic lives only in memory, teams reconstruct the history manually during an audit. That reconstruction slows reviews and introduces errors.

Automating lineage removes that reconstruction work, which reduces manual data processing time and speeds up audit preparation.

A data catalog addresses this problem for teams whose documented schemas stay accurate over time. Slow-moving pipelines and well-commented dbt models can survive a light-touch review with a catalog updated quarterly. The problem is that most regulated pipelines aren’t stable, and the audit that matters tends to arrive when the stack is mid-migration. A catalog captures metadata at a point in time. It does not capture what actually ran during the run that produced the record being questioned.

Column-level lineage as audit evidence

Table-level lineage is a starting point. Column-level lineage is the audit artifact.

What the granularity difference means

Table-level lineage tells an auditor that the customers table came from raw_crm. Column-level lineage tells them that email came from raw_crm.contact_email, was lowercased in stg_customers, and was masked to email_hash in the reporting layer. That answer tells auditors whether PII handling was correct at each step. SOX Section 404 demands the same granularity: an auditable record of data provenance, transformation logic, and processing controls.
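
With a column-level path in hand, "was PII handling correct at each step" becomes a check that can be run rather than asserted. The sketch below is a simplified illustration: the `EMAIL_PATH` steps, the `reporting.customers` asset name, the `mask` transform label, and the `unmasked_columns` helper are assumptions, and it treats a field as staying masked once any masking transform has been applied.

```python
from typing import List, Optional, Tuple

# Each step is (column, transform applied to produce it). Names are illustrative.
Step = Tuple[str, Optional[str]]

EMAIL_PATH: List[Step] = [
    ("raw_crm.contact_email", None),             # collection point
    ("stg_customers.email", "lowercase"),        # normalization only, still PII
    ("reporting.customers.email_hash", "mask"),  # masked on entry to reporting
]

# Transform labels that count as masking (an assumption for this sketch).
MASKING_TRANSFORMS = {"mask"}


def unmasked_columns(path: List[Step], layer_prefix: str) -> List[str]:
    """Columns in the given layer that the field reaches without being masked first."""
    masked = False
    exposed = []
    for column, transform in path:
        if transform in MASKING_TRANSFORMS:
            masked = True
        if column.startswith(layer_prefix) and not masked:
            exposed.append(column)
    return exposed


print("Unmasked PII in reporting layer:", unmasked_columns(EMAIL_PATH, "reporting."))
```

A path that copies the email into the reporting layer without a masking step would be returned by `unmasked_columns`, which is the finding a table-level map has no way to express.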

The performance trade-off

Deeper lineage capture carries a real cost, and different approaches trade visualization richness against runtime performance. An architecture where lineage is native to asset definitions sidesteps this trade-off because the dependency graph is declared at build time rather than inferred from runtime logs.

How Dagster closes the gap

In Dagster, lineage is captured automatically when you define an asset that depends on another. The dependency graph exists before the first run. Unlike task-based orchestrators that require external integrations to reconstruct lineage after execution, Dagster captures this metadata as part of the asset definition. Post-execution reconstruction creates what amounts to an archaeology problem, where teams must parse logs to determine what actually ran.
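
The idea of a dependency graph that exists before the first run can be illustrated with a toy decorator that records dependencies at definition time. This is deliberately not Dagster's API: `declared_asset` and `DEPENDENCY_GRAPH` are hypothetical names in a plain-Python sketch meant only to show the declare-then-run ordering.

```python
from typing import Callable, Dict, List

# Filled in when the module is imported, before any asset function executes.
DEPENDENCY_GRAPH: Dict[str, List[str]] = {}


def declared_asset(*deps: Callable) -> Callable:
    """Register a function and its upstream dependencies at definition time."""
    def register(fn: Callable) -> Callable:
        DEPENDENCY_GRAPH[fn.__name__] = [dep.__name__ for dep in deps]
        return fn
    return register


@declared_asset()
def raw_crm():
    return [{"contact_email": "Ada@Example.com"}]


@declared_asset(raw_crm)
def stg_customers(rows):
    # Normalize the email to lowercase, as in the staging step described earlier.
    return [{"email": row["contact_email"].lower()} for row in rows]


@declared_asset(stg_customers)
def customers(rows):
    return rows


# The graph is complete even though nothing has run yet.
print(DEPENDENCY_GRAPH)
```

Because the graph is a by-product of the definitions, there is nothing to reconstruct from logs afterwards. A task-based design that only records what executed would have an empty graph at this point.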

Dagster’s column-level lineage extends this to the field level. Engineers can specify column dependencies manually for Dagster-native assets or use automated dbt integrations. The result is a field-level provenance record that lives in the orchestration system, reflects what actually ran, and doesn’t depend on a catalog entry updated on a separate schedule.

Audit logs, RBAC, and the rest of the compliance stack

Lineage answers what happened to the data. Audit logs answer who accessed it and when. Role-based access control answers who should have been allowed to. A compliance stack that has one without the other two leaves auditors with partial answers.

What each layer covers

Lineage traces the transformation path: which assets produced which fields, which models consumed them, and what logic ran. Audit logs provide the access record: timestamps, user identities, query patterns. RBAC establishes the authorization boundary: which roles have permission to read or write which assets.

These three layers produce the full answer to a HIPAA or GDPR audit request. An auditor asking “did this user have authorization to access this PHI?” looks to RBAC. The follow-up question about whether they actually accessed it is answered by logs. The lineage graph tells them what that PHI contained and where it came from.
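
The division of labor between the three layers can be sketched as a single lookup that consults each one in turn. Everything here is hypothetical plain Python: the role names, the `ROLE_GRANTS`, `USER_ROLES`, `ACCESS_EVENTS`, and `ASSET_LINEAGE` structures, and the `answer_audit_request` helper are assumptions standing in for a real RBAC system, audit log, and lineage store.

```python
# RBAC layer: which roles may read which assets (hypothetical).
ROLE_GRANTS = {
    "clinical_analyst": {"stg_visits"},
    "finance_analyst": {"billing_summary"},
}
USER_ROLES = {"analyst_a": "clinical_analyst", "analyst_b": "finance_analyst"}

# Audit log layer: (user, asset, ISO timestamp) access events (hypothetical).
ACCESS_EVENTS = [
    ("analyst_a", "stg_visits", "2025-01-06T09:30:00"),
    ("analyst_b", "stg_visits", "2025-01-06T10:05:00"),
]

# Lineage layer: asset -> {column: upstream source column} (hypothetical).
ASSET_LINEAGE = {
    "stg_visits": {"diagnosis_code": "raw_ehr.dx_code", "patient_id": "raw_ehr.patient_id"},
}


def answer_audit_request(user: str, asset: str) -> dict:
    """Combine authorization, actual access, and data content for one user and asset."""
    role = USER_ROLES.get(user)
    authorized = asset in ROLE_GRANTS.get(role, set())
    accessed_at = [ts for u, a, ts in ACCESS_EVENTS if u == user and a == asset]
    return {
        "authorized": authorized,                     # RBAC
        "accessed_at": accessed_at,                   # audit log
        "columns": ASSET_LINEAGE.get(asset, {}),      # lineage
        "unauthorized_access": bool(accessed_at) and not authorized,
    }


print(answer_audit_request("analyst_b", "stg_visits"))
```

Removing any one of the three inputs leaves a field in the result that cannot be filled in, which is the partial answer the section warns about.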

Regulated industries in practice

BenchSci, a pharmaceutical AI company, used Dagster to coordinate disparate data sources and analytics tools across their drug discovery platform. The team reduced computational costs and data errors by materializing assets only when necessary. Every materialization is intentional and logged, which keeps the lineage graph accurate.

Group 1001, a large insurance organization, moved hundreds of SQL Server tables and tens of millions of records in 4 months with only 2 developers. They built observability and lineage into the migration pipelines throughout. The 10x productivity gain came from pipeline state being visible and auditable throughout the migration, achieved without sacrificing observability for speed.

In both cases, the compliance evidence wasn’t assembled for the audit. It was already in the system.

Built for regulated industries: explore Dagster+ Pro

Dagster+ Pro implements this compliance stack in a single platform: column-level provenance, asset-centric lineage, audit logs, and RBAC. Column-level lineage in the Dagster+ UI lets engineers trace how a field is created and used across the entire data platform. External Assets extend that coverage to pipelines orchestrated outside Dagster, so ingestion syncs or jobs from other orchestrators don’t create a gap in the lineage graph.

The compliance pressure is expanding beyond banking. AI pipelines face the same provenance question as GDPR-regulated data: where did this field come from, what transformed it, and who had access?

A team that builds lineage into assets at development time isn’t preparing for audits. They’re already done. The evidence is in the system, and the audit becomes a query rather than a scramble. Teams that can’t produce column-level provenance for a field on demand are betting on how thorough the next auditor is.

FAQs about data lineage for compliance

Which granularity do GDPR and HIPAA auditors actually require?

Auditors require column-level lineage to verify the provenance and handling of specific fields like customer emails or protected health information. Table-level maps fail to prove that a specific sensitive field was correctly masked or deleted downstream. Automated, granular lineage results in faster audit preparation by providing this field-level evidence on demand.

How does lineage handle right-to-be-forgotten requests?

Lineage identifies every downstream asset that holds a copy or transformation of a specific user’s data, ensuring deletion requests are complete. Without column-level mapping, personal data often persists in downstream reporting tables even after the source record is removed. In Dagster, the TableColumnLineage object is used to declare these dependencies, creating an auditable trail for GDPR field-level compliance.

Can lineage coverage extend to assets orchestrated outside Dagster?

Dagster uses External Assets to incorporate metadata from tools like Fivetran into a single lineage graph. This prevents blind spots in the audit trail when data moves through ingestion syncs or legacy schedulers. That cross-cutting visibility matters most when data moves across systems with different orchestrators, which is precisely the scenario where audit trails tend to break down.

What is the ROI of switching from manual to automated lineage?

Automating lineage reduces manual data processing time by 85 percent and can lead to 45 percent cost savings within a year, according to Opus Technologies. Manual documentation decays immediately as pipelines change, forcing teams into data archaeology during audits. Automated systems capture lineage at development time, turning audit preparation into a query rather than a weeks-long reconstruction project.

How do dbt lineage and orchestrator lineage work together?

dbt provides transformation lineage within the data warehouse, while the orchestrator captures the end-to-end journey from ingestion to the final BI layer. Dagster integrates with dbt to pull these column-level dependencies into the primary UI, creating a unified evidence layer. This combined record supports the SOX Section 404 requirement for an auditable record of data provenance and processing controls.
