Why Use dbt with Databricks?
Databricks is a cloud-based platform for large-scale data engineering, analytics, and machine learning. It is built on Apache Spark and offers a collaborative workspace where data engineers, analysts, and scientists can interact with data.
dbt (data build tool) complements Databricks, helping manage data transformations using SQL and software engineering best practices. When combined, they create a scalable analytics stack that supports version-controlled, testable, and documented data pipelines:
- SQL-based development: dbt allows analysts and engineers to write transformations in pure SQL, aligning well with Databricks’ support for SQL and SQL-compatible interfaces.
- Version control and reproducibility: dbt integrates with Git for version control, enabling collaborative development and auditability across teams working in Databricks.
- Automated testing and documentation: dbt brings built-in testing and documentation tools, which improve the quality and transparency of transformations done in Databricks.
- Environment management: dbt enables clear separation between development, staging, and production environments, aligning with Databricks' workspace model and job orchestration features.
- Scalable execution: dbt models can be compiled into efficient SQL code and executed at scale using Databricks’ distributed compute engine.
- Incremental models: dbt supports incremental processing out-of-the-box, helping optimize performance when used with large datasets in Databricks.
Understanding the dbt-Databricks Adapter
The dbt-databricks adapter is a dedicated integration that enables dbt to run efficiently on Databricks. It replaces the need for the more generic dbt-spark adapter when working in Databricks environments, offering deeper compatibility and better performance by leveraging Databricks-native features.
Uses pure Python
One of the key advantages of the adapter is its simplified setup. Unlike the generic dbt-spark adapter, which typically relies on an ODBC driver to connect to Databricks, dbt-databricks uses pure Python APIs, making installation and configuration straightforward. Developers can get started quickly using standard Python tooling (pip install dbt-databricks), without extra system dependencies.
Learn more in our detailed guide to dbt Python
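For dbt Core users, connection details live in a profiles.yml entry. A minimal sketch for this adapter might look like the following; the host, HTTP path, and token values are placeholders to replace with your own workspace details (dbt Cloud users configure the same settings through the UI instead):
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: hive_metastore
      schema: default
      host: your-workspace.cloud.databricks.com
      http_path: /sql/1.0/warehouses/your-warehouse-id
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4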
How it works
The adapter defaults to the Delta Lake table format, which aligns with Databricks’ native storage model. This enables dbt to use Delta features such as MERGE for incremental materializations, improving efficiency when working with large datasets. Other file formats, including Parquet, CSV, JSON, and Avro, can be selected through the model configuration.
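For example, a model can opt out of the default Delta format with a config block like this sketch (the model and source names are illustrative, and parquet stands in for any of the supported formats):
-- models/staging/stg_events_parquet.sql (hypothetical model)
{{ config(
    materialized = 'table',
    file_format = 'parquet'   -- overrides the adapter default of 'delta'
) }}
select * from {{ source('raw', 'events') }}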
The adapter also supports Unity Catalog, allowing users to define models using Databricks' three-level namespace (catalog, schema, and table). This improves organization, security, and access control in shared environments.
Performance considerations
From a performance standpoint, SQL generated by dbt-databricks can be accelerated by Databricks’ Photon engine—a vectorized query engine optimized for fast execution—when models run on Photon-enabled SQL warehouses or clusters. This typically yields more efficient query processing than running the same workloads through the generic Spark-based adapter.
Quick Tutorial: Using dbt with Databricks from Setup to Scheduled Models
This tutorial shows how to get started with dbt for Databricks. Instructions are adapted from the dbt documentation.
1. Set Up a Databricks Workspace
Start by creating a Databricks account or logging into an existing one. Fill out the registration form and verify your email address. After verification, select your preferred cloud provider—AWS, Azure, or GCP. For this guide, AWS is used, but the process is similar across platforms. Do not select the Community Edition, as it lacks the necessary compute capabilities.
You’ll be directed to the setup page where you can choose a trial plan (Premium by default, or Enterprise if needed). Provide a name for your workspace and select an AWS region. Then click Start Quickstart. This triggers an AWS CloudFormation stack deployment.
If you encounter a session timeout or error, return to the Databricks console and manually create the workspace via the UI. When prompted, enter your Databricks password under Databricks Account Credentials, accept the terms, and click Create stack. Once the stack is created, return to the Databricks tab and open your new workspace.
2. Load Source Data into Databricks
To begin working with data, switch to the SQL interface by selecting SQL Editor from the left-hand menu. Then open SQL Warehouses, listed in the left sidebar under the SQL heading, and start the default Starter Warehouse, which may take a few minutes to initialize.
Next, upload the sample data files:
Click the + New button, which takes you to the asset upload page, and upload jaffle_shop_customers.csv. Set the table properties:
- Catalog: hive_metastore
- Database: default
- Table name: jaffle_shop_customers
Confirm the column types:
- id: bigint
- first_name: string
- last_name: string
Click Create once the schema is validated. Repeat the upload and configuration steps for the other two files: jaffle_shop_orders.csv and stripe_payments.csv.
Once all tables are loaded, verify them by running SQL queries in the editor:
select * from default.jaffle_shop_customers;
select * from default.jaffle_shop_orders;
select * from default.stripe_payments;
To ensure access across users working in the workspace, execute:
grant all privileges on schema default to <DB-USER>;
3. Connect dbt to Databricks
There are two main ways to connect dbt Cloud to Databricks: manually or through Partner Connect in the Databricks Marketplace. For simplicity and quicker setup, use the Marketplace route:
- In the Databricks sidebar, click Marketplace
- Search for the dbt tile.
- Open the product called dbt Cloud; on its integration page, click the Connect button to start the integration.
- Choose a catalog (e.g., hive_metastore) and click Next.
- Select your SQL warehouse. If one isn't available, create a new warehouse and return to this step.
- Choose a schema (such as default) and click Add.
Partner Connect creates a Databricks service principal (DBT_CLOUD_USER) and access token. It also assigns necessary privileges for schema usage and SQL warehouse access. After configuration, click Connect to dbt. This opens dbt Cloud in a new browser tab where you can set up your account.
Next, in dbt Cloud Studio, initialize your project:
- Click Start developing in the Studio IDE. This may take a few minutes.
- Click Initialize dbt project to scaffold your directory with starter files.
- Make an initial commit (initial commit) to your managed Git repo.
- Create a new file with the query:
select * from default.jaffle_shop_customers;
Save the file and run dbt run in the terminal to verify the setup. You should see a successful run output.
4. Build your First Model
Create a new branch (add-customers-model) using the Version Control tab. Then, in the models directory:
- Click Create file, name it customers.sql, and paste a model query that joins customers and orders to calculate first_order_date, most_recent_order_date, and number_of_orders (a sketch of such a model appears below this list).
- Save the file.
- In the terminal, run dbt run. The output should show that the model(s), including any intermediate steps, were built successfully.
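A sketch of such a customers.sql model, adapted from the standard jaffle_shop example and assuming the table and column names used earlier in this tutorial, might look like this:
-- models/customers.sql
with customers as (
    select
        id as customer_id,
        first_name,
        last_name
    from default.jaffle_shop_customers
),
orders as (
    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status
    from default.jaffle_shop_orders
),
customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),
final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders
        on customers.customer_id = customer_orders.customer_id
)
select * from final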
5. Control Materializations
By default, dbt materializes models as views. To change this to tables:
- Open dbt_project.yml.
- Update the configuration:
name: 'jaffle_shop'

models:
  jaffle_shop:
    +materialized: table
    example:
      +materialized: view
- Save and run dbt run.
To override this for a specific model, add the following to the top of a .sql file:
{{ config(
materialized = 'table'
) }}
After making changes, run dbt run --full-refresh to reapply the new materialization strategy.
6. Add Tests to Validate Your Data
Create a new file, models/schema.yml, with the following content:
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
You can also include tests such as accepted_values for status columns and relationships for foreign-key checks (a sketch appears after this step). To execute the tests, run the following command:
dbt test
All tests should pass. Failing tests will return the number of invalid records.
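As an illustration, the accepted_values and relationships tests mentioned above could be attached to an orders model like this; the model name (stg_orders), column names, and status values are assumptions based on the standard jaffle_shop data:
models:
  - name: stg_orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id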
7. Commit, Merge, and Deploy Your Changes
If you edited the protected main branch, dbt will prompt you to create a new one. Enter a commit message like "Add customers model, tests, docs", commit your changes, and merge the branch into main.
If you're already on a new branch, go to Version Control, commit your changes, and merge.
Use the Deploy section in dbt Cloud to schedule production builds.
- Go to Environments and create a new one named Production.
- Choose the latest dbt version and assign your deployment target schema.
- Save the environment.
Then, go to Jobs:
- Click Create one, name it (e.g., "Production run"), and select your new environment.
- Under Execution Settings, add dbt build as a command.
- Enable Generate docs on run.
- Save and click Run now.
Monitor the job progress in Run history. Once complete, view your project documentation to confirm it reflects the deployed state.
You've now completed the full cycle of building, testing, and deploying a dbt project on Databricks using real data and transformations.
Best Practices for Using dbt with Databricks
Here are some useful practices to consider when working with Databricks and dbt.
1. Leverage Unity Catalog for Governance
Unity Catalog provides centralized data governance across Databricks workspaces. When used with dbt, it allows models to be defined with full three-level namespaces (catalog.schema.table), supporting clear data organization and consistent access control.
Use Unity Catalog to enforce fine-grained permissions on schemas and tables, manage lineage, and standardize naming conventions. Ensure that your dbt profiles and project configurations specify the correct catalog and schema to take advantage of this structure.
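As a sketch, project-level configuration can combine documentation persistence and grants so that access control travels with your models; the folder name (marts) and group name (data_analysts) below are assumptions:
models:
  jaffle_shop:
    marts:
      +persist_docs:
        relation: true
        columns: true
      +grants:
        select: ['data_analysts']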
2. Leverage Streaming Tables and Materialized Views
Databricks supports streaming tables and materialized views, which can simplify near-real-time data pipelines. While dbt doesn't natively support streaming semantics, you can integrate streaming data by creating external tables or views over streaming sources (e.g., Auto Loader or Structured Streaming outputs).
Materialized views can be managed outside of dbt but referenced within models. For use cases that require freshness and low-latency access, consider building and refreshing these views using Databricks SQL tasks and orchestrating them alongside dbt jobs.
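As an illustration, in workspaces where Databricks SQL materialized views are available, a view could be created and refreshed outside of dbt and then referenced from dbt models; the catalog, schema, and table names below are placeholders:
-- Run in the Databricks SQL editor or a scheduled SQL task (not managed by dbt)
CREATE OR REPLACE MATERIALIZED VIEW main.analytics.recent_orders_mv AS
SELECT
    order_id,
    customer_id,
    status,
    order_date
FROM main.raw.orders
WHERE order_date >= date_sub(current_date(), 7);

-- Refresh on a schedule to keep the view current
REFRESH MATERIALIZED VIEW main.analytics.recent_orders_mv;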
3. Optimize Delta Tables via Post-Hooks
Databricks Delta tables benefit from regular maintenance such as OPTIMIZE, VACUUM, and ANALYZE commands. In dbt, these can be automated using post-hook configurations.
For example:
models:
  my_project:
    +post-hook:
      - "OPTIMIZE {{ this }}"
Apply these hooks selectively to large or frequently queried tables to improve query performance and maintain healthy storage conditions.
4. Use Incremental Models for Efficient Data Processing
Incremental models in dbt reduce compute costs by processing only new or changed data. This is especially important when working with large datasets in Databricks.
Leverage dbt’s is_incremental() function within models to define logic that handles inserts, updates, or merges, as sketched below. Use Delta Lake’s ACID capabilities and the MERGE operation to ensure safe, performant updates. Align your partitioning strategy with the most common query filters to improve read performance on these models.
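A minimal sketch of such an incremental model, assuming an orders source with a load-timestamp column (_loaded_at is a hypothetical name, as is the jaffle_shop source):
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',   -- uses Delta Lake MERGE under the hood
    unique_key = 'order_id',
    partition_by = 'order_date'
) }}

select
    order_id,
    customer_id,
    status,
    order_date,
    _loaded_at
from {{ source('jaffle_shop', 'orders') }}

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what is already in the target
  where _loaded_at > (select max(_loaded_at) from {{ this }})
{% endif %}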
5. Use the databricks_copy_into Macro for Data Ingestion
For loading external data into Delta tables, use the databricks_copy_into macro provided in the dbt-databricks package. This macro wraps the COPY INTO SQL command, enabling efficient ingestion from cloud object stores (e.g., S3, ADLS, GCS).
The macro supports parameters for file format, schema inference, and incremental loads. It is particularly useful for loading semi-structured data like JSON or CSV, and for automating ingestion pipelines as part of dbt runs.
Example usage:
{{ databricks_copy_into(
    table_name='raw_data',
    path='s3://my-bucket/data/',
    file_format='CSV',
    format_options={'header': 'true'}
) }}
Use this macro in staging models or seed layers to bootstrap ingestion processes directly from external sources.
Orchestrating Data Pipelines with dbt, Databricks, and Dagster
When teams combine dbt, Databricks, and Dagster, they gain an end-to-end data platform that is reliable, observable, and suitable for modern analytics workflows. Each tool plays a clear role in the overall architecture, and together they deliver strong guarantees around quality, governance, and operational control.
Why Use Dagster with dbt and Databricks
Dagster is a data orchestrator that manages the execution and coordination of work across your data systems. It introduces a shared definition of assets, strong observability features, and precise dependency management. When used alongside dbt and Databricks, Dagster becomes the control plane for your analytics and ML pipelines.
Key advantages include:
- Unified asset graph: Dagster represents dbt models, Databricks SQL tasks, external data sources, and notebooks as assets. This allows teams to understand dependencies, view lineage, and track the impact of upstream or downstream changes from a single interface.
- Strong dbt integration: Dagster can parse a dbt project and generate software-defined assets that correspond to dbt models and tests. This gives structure to dbt runs, improves modularity, and provides detailed runtime visibility in the Dagster UI.
- Native Databricks integration: Dagster connects directly to Databricks clusters, jobs, and SQL warehouses. It can trigger Databricks tasks, coordinate notebook execution, and orchestrate SQL workloads that run on Delta Lake.
- Flexible automation and scheduling: Dagster supports both time-based schedules and event-driven sensors. This enables everything from nightly dbt transformations to pipelines triggered by new files or updated Delta tables.
- Observability and governance: Dagster records metadata for every asset, including materialization history, data quality outcomes, and execution logs. This improves operational insight across the combined Databricks and dbt ecosystem.
How They Work Together
A typical flow includes the following steps:
- Data is ingested into Databricks using structured streaming or external ingestion services.
- Dagster triggers dbt jobs that build staging, intermediate, and mart models using Delta Lake.
- Downstream analytics or machine learning tasks in Databricks run after the relevant dbt models have been updated.
- dbt tests run within Dagster, and their results determine whether downstream steps proceed.
- Dagster schedules, monitors, and retries production runs, ensuring consistency and reliability across environments.
This structure creates a unified, well-governed, and fully observable pipeline that scales across teams and workloads.