Working with dbt Seeds: Quick Tutorial & Critical Best Practices

What Are dbt Seeds? 

dbt (data build tool) seeds are static CSV files stored within your dbt project that are loaded into your analytics warehouse as database tables. Seeds bring small, fixed reference datasets, such as mapping tables, default configurations, or lookup values, into your data models without an ETL process. 

Why Use Seeds?

By defining the data in CSV files, typically placed in the seeds/ directory of the dbt project, teams maintain version control and reproducibility with minimal complexity. Seeds are useful for incorporating stable, foundational data that many transformations rely on. Teams may use seeds for things like currency conversion rates, country codes, feature flags, or business-specific mappings.

Seed Command Syntax

The dbt seed command loads all CSV files from the seeds/ directory into your data warehouse as tables:

dbt seed

To load only specific seed files, use the --select flag followed by the seed name:

dbt seed --select "country_codes"

This command loads only the country_codes.csv file into your warehouse.

When to Use (and Avoid) Seeds

When to Use Seeds

Use seeds when you need small, stable datasets that don’t change frequently and need to be tightly integrated into your dbt project. Typical use cases include:

  • Reference tables like country codes, product categories, or customer tiers
  • Configuration mappings such as feature flags or business rules
  • Historical lookups including static currency conversion rates or shipping zones
  • Standardized data like time zones or ISO codes

Seeds are helpful when data needs to be present immediately upon project setup, without relying on external systems or pipelines, and when data needs to be shared across multiple development environments.

When to Avoid Seeds

Seeds are not suitable for dynamic, large-scale, or sensitive datasets. Avoid seeds in the following situations:

  • Large datasets: If the file exceeds 5–10MB or includes tens of thousands of rows, dbt seed becomes inefficient and can slow down both dbt runs and Git operations. In such cases, use an ELT pipeline and treat the data as a source.
  • Frequently updated data: When the dataset is updated regularly (e.g., from external systems or user inputs), seeds require manual updates, which breaks automation and increases maintenance overhead. Automate ingestion instead.
  • Collaborative write access: Seeds are read-only once loaded. If other systems or teams need to update the data, seeds introduce version drift. Use a centralized database or API for shared data instead.
  • Sensitive data: CSV files don’t support row-level security or access control. If privacy is a concern, rely on your data platform’s security features and load data through governed processes.
  • Business-editable content: If non-technical users need to update the data, Git workflows are not practical. Build user-friendly tools for data entry and sync updates into your warehouse programmatically.

Quick Tutorial: Add dbt Seeds to Your DAG 

To add a seed to your DAG, begin by placing your CSV file in the seeds/ directory of your dbt project. For example, you might create a file called seeds/country_codes.csv containing static reference data like country codes and names.
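
As a sketch, the file's contents might look like this (the column names and rows are illustrative):

country_code,country_name
US,United States
CA,Canada
DE,Germany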

Next, load the seed into your warehouse by running:

dbt seed

This command will create a table in your target schema with the same name as the CSV file (e.g., country_codes). You can confirm that the table was created by checking the dbt logs, which will report how many rows were inserted and how long the operation took.

To use the seed in your dbt models, reference it with the ref function, just like you would a model:

-- models/example.sql
select * from {{ ref('country_codes') }}

This allows downstream models to treat the seed as a normal table in your DAG. Any changes to the CSV will be picked up the next time dbt seed is run.

For configuration, you can adjust behavior in your dbt_project.yml, such as setting column types or enabling quoting. You can also document and test seeds using YAML property files, just like models, to ensure consistency.
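
For example, a minimal sketch of such configuration in dbt_project.yml, assuming the country_codes seed from above and a project named my_project (a placeholder):

seeds:
  my_project:  # placeholder for your dbt project name
    country_codes:
      +column_types:
        country_code: varchar(2)  # force a fixed-width string type
      +quote_columns: false  # don't quote column names when loading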

Best Practices for Using dbt Seeds 

Here are some important practices to consider when working with seeds in dbt.

1. Clearly Documenting Seed Data Sources

Each seed file should include metadata about its origin, intent, and ownership. This documentation lives in a schema.yml file under the seeds: section, where you can specify descriptions for the table and each column. The documentation should answer questions like:

  • Where did this data come from? (e.g., manually created, exported from Salesforce, derived from internal logic)
  • What does each field represent?
  • Who is responsible for maintaining it?
  • How frequently is the data expected to change?

Additionally, if the seed is meant to replicate an authoritative source (e.g., ISO country codes), cite that source explicitly. This ensures team members understand whether a seed is the source of truth or a convenience copy, and whether they should trust or cross-reference it.
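
A minimal sketch of this documentation, reusing the country_codes seed from the tutorial (the maintainer and review cadence are illustrative):

version: 2

seeds:
  - name: country_codes
    description: >
      Two-letter country codes and names, copied from the ISO 3166-1
      standard. Convenience copy, not the source of truth; owned by
      the analytics team and reviewed quarterly.
    columns:
      - name: country_code
        description: ISO 3166-1 alpha-2 code
      - name: country_name
        description: English short name of the country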

2. Maintaining Seed File Size and Scope

Seeds should remain small and focused to ensure fast project execution and easy code review. Large seed files (e.g., over 10MB or 10,000 rows) can slow down the dbt seed command, increase memory usage, and cause performance degradation during warehouse loads.

To maintain performance:

  • Split large datasets into logical segments across multiple seed files if needed
  • Avoid storing transactional data, logs, or other high-volume records in seeds
  • Use gzip compression outside dbt only if necessary, but prefer reducing data volume instead

Additionally, scope each seed narrowly: if a seed mixes multiple unrelated domains (e.g., combining country codes and currency mappings), break it into separate files. This improves maintainability, makes it easier to apply tests and documentation, and limits rework when part of the seed changes.

3. Implementing Consistent Seed Naming Conventions

Use a predictable naming scheme to make seeds easier to reference and manage in your project. Recommended conventions include:

  • Use lowercase letters and underscores (e.g., feature_flags.csv, not FeatureFlags.csv)
  • Avoid ambiguous names like lookup or misc_data
  • Match naming patterns used for models and sources to reinforce consistency
  • Prefer plural names for datasets (e.g., user_segments, not user_segment)

This helps ensure naming alignment across models, documentation, tests, and ref statements. Consistent names also simplify collaboration across teams and make version control diffs easier to understand. Avoid embedding version numbers or temporary labels in filenames—manage changes through Git history and comments instead.

4. Testing Seed Data for Freshness and Validity

Although seeds are intended to be relatively static, they can become outdated or misaligned with business logic over time. Establish a recurring review process—monthly or quarterly—to ensure seed data remains accurate and relevant.

Use dbt’s testing framework to define basic constraints like:

  • Unique keys (e.g., id fields)
  • Non-null values for required columns
  • Valid values in enumerated fields (e.g., status must be active or inactive)
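
A minimal sketch of these constraints in a seed property file (the customer_tiers seed and its id and status columns are hypothetical):

version: 2

seeds:
  - name: customer_tiers
    columns:
      - name: id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive']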

Additionally, write custom tests or macros to compare seed values against other tables or external APIs where applicable. If a seed replicates a known source, set up automated jobs or alerts to detect drift or missing updates.
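
For instance, a singular test can flag seed rows that are missing from an authoritative table; here the reference source and its columns are hypothetical:

-- tests/assert_country_codes_in_reference.sql
-- Fails if any seed country code is absent from the reference table
select s.country_code
from {{ ref('country_codes') }} as s
left join {{ source('reference_data', 'iso_countries') }} as r
  on s.country_code = r.country_code
where r.country_code is null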

5. Securely Managing Sensitive Information in Seeds

dbt seeds are stored as plain CSV files, often committed to Git, which makes them unsuitable for sensitive or regulated data like PII, PHI, or access tokens. These files lack encryption, access control, and auditing capabilities.

Avoid including sensitive data such as:

  • Customer names, emails, or IDs
  • Financial or health records
  • Internal passwords, secrets, or tokens

If you must work with such data, use a secure ingestion method outside of dbt (e.g., via Dagster, Airbyte, Fivetran, or a secure ETL job) and treat it as a source, not a seed. For less sensitive but still private data, enforce Git access controls, and consider excluding seed directories from commits using .gitignore, or loading such data only in local dev environments.

Orchestrating dbt Data Pipelines with Dagster

Dagster is an open-source data orchestration platform with first-class support for orchestrating dbt pipelines. As a general-purpose orchestrator, Dagster lets you scale beyond what dbt seeds can offer and seamlessly manage the complex upstream dependencies your dbt project relies on.

It offers teams a unified control plane for not only dbt assets, but also ingestion, transformation, and ML workflows. With a Python-native approach, it unifies SQL, Python, and more into a single, testable, and observable platform.

Head over to our docs on Dagster’s integration with dbt and dbt Cloud to learn more about using Dagster with dbt to unlock better scheduling, lineage, and observability.
