Revisiting the Poor Man’s Data Lake with MotherDuck | Dagster Blog

June 22, 2023 · 6 minute read

Revisiting the Poor Man’s Data Lake with MotherDuck

Pete Hunt (@floydophone)

Last year we wrote a blog post called Building a Poor Man’s Data Lake with DuckDB. In it, we showed how to combine two excellent open-source technologies, DuckDB and Dagster, to build a powerful, cheap, and useful data platform.

Since then, MotherDuck has emerged from stealth to build a high-powered cloud version of DuckDB that competes with other cloud data warehouses like Snowflake.

In this blog post, we will migrate the Poor Man’s Data Lake away from S3 and Parquet files and into a single system: MotherDuck. As you’ll see, the migration is remarkably straightforward thanks to the elegant design of DuckDB, MotherDuck, and Dagster, and we can realize all of the benefits of MotherDuck without even touching our business logic.


What is MotherDuck?

MotherDuck is a cloud service that hosts your DuckDB tables. Rather than pointing DuckDB at a single database file on disk, you can point it at a cloud-hosted database that can be accessed anywhere with high performance, just like any other cloud data warehouse like Snowflake or BigQuery.

Let’s get started with it. But before we do, let’s remind ourselves of how DuckDB works. Install a recent version and create an in-memory database via the terminal:

$ duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D create table foo as select 1 as x, 2 as y;
D select * from foo;
┌───────┬───────┐
│   x   │   y   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     2 │
└───────┴───────┘

As you can see, we’ve used the CLI to create a new table and can read and write data to it. However, this is stored in a transient, in-memory database, and we’ll lose our data if we quit the DuckDB CLI:

$ duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select * from foo;
Error: Catalog Error: Table with name foo does not exist!
Did you mean "pg_proc"?
LINE 1: select * from foo;

We can instead save our database to a file on disk by passing a filename as a command-line argument:

$ duckdb mydatabase.duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
D create table foo as select 1 as x, 2 as y;
D select * from foo;
┌───────┬───────┐
│   x   │   y   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     2 │
└───────┴───────┘

…and if we quit and reopen the CLI, the data is still there!

$ duckdb mydatabase.duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
D select * from foo;
┌───────┬───────┐
│   x   │   y   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     2 │
└───────┴───────┘

This is useful for a single engineer developing on a single machine, but what if you want to share this data with others? You need to find some way to distribute this database file. Additionally, if multiple stakeholders want to add data to the same DB, you need to figure out how to ensure that all stakeholders have up-to-date copies of the data and that their changes are merged into a single consistent database.

That’s where MotherDuck comes in. Head on over to app.motherduck.com and get an access token. Then, simply connect to MotherDuck right from the DuckDB CLI!

$ duckdb 'md:helloworld?token=<your token here>'
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
D create table foo as select 1 as x, 2 as y;
D select * from foo;
┌───────┬───────┐
│   x   │   y   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     2 │
└───────┴───────┘

We can use the same argument to duckdb on a different machine and still see the same data:

$ duckdb 'md:helloworld?token=<your token here>'
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
D select * from foo;
┌───────┬───────┐
│   x   │   y   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     2 │
└───────┴───────┘

So how did this work?

We passed a funny-looking argument to duckdb. This is a URL to our MotherDuck instance: md: is the MotherDuck protocol, helloworld is the name of our database (which will be created automatically if it doesn’t exist), and ?token=... provides our access token.

This works identically in the DuckDB Python bindings too. Here’s a minimal sketch (MOTHERDUCK_TOKEN is an environment variable name we made up for this example):
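
import os
import duckdb

# Connect to the cloud-hosted "helloworld" database from Python.
# MOTHERDUCK_TOKEN is a hypothetical env var holding your access token.
con = duckdb.connect(f"md:helloworld?token={os.environ['MOTHERDUCK_TOKEN']}")
print(con.sql("select * from foo").df())

Now let’s start hacking on the Poor Man’s Data Lake and get it integrated with MotherDuck.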

Step 1: Connecting to MotherDuck

If you take a look at the original project, we created a class that wrapped our DuckDB connection:

from duckdb import connect

# SQL, collect_dataframes, and sql_to_string are helpers defined alongside
# this class in the project's duckpond.py.
class DuckDB:
    def __init__(self, options=""):
        self.options = options

    def query(self, select_statement: SQL):
        db = connect(":memory:")
        db.query("install httpfs; load httpfs;")
        db.query(self.options)

        dataframes = collect_dataframes(select_statement)
        for key, value in dataframes.items():
            db.register(key, value)

        result = db.query(sql_to_string(select_statement))
        if result is None:
            return
        return result.df()

We are going to make a slight change to this class to allow users to pass in their MotherDuck URL if they so desire. Update the class to read as follows:

class DuckDB:
    def __init__(self, options="", url=":memory:"):
        self.options = options
        self.url = url

    def query(self, select_statement: SQL):
        db = connect(self.url)
        # ...everything else remains unchanged...
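
Because url defaults to ":memory:", existing callers keep the old in-memory behavior; opting in to the cloud is just a matter of passing a MotherDuck URL. For example:

local = DuckDB()                                             # in-memory, as before
cloud = DuckDB(url="md:helloworld?token=<your token here>")  # runs on MotherDuck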

Now we can pass in a MotherDuck URL, and we will automatically run our project on MotherDuck. Recall the Definitions in the original project’s __init__.py file:

from dagster import Definitions, load_assets_from_package_module
from jaffle import assets
from jaffle.duckpond import DuckPondIOManager, DuckDB

DUCKDB_LOCAL_CONFIG="""
set s3_access_key_id='test';
set s3_secret_access_key='test';
set s3_endpoint='localhost:4566';
set s3_use_ssl='false';
set s3_url_style='path';
"""

defs = Definitions(
    assets=load_assets_from_package_module(assets),
    resources={"io_manager":  DuckPondIOManager("datalake", DuckDB(DUCKDB_LOCAL_CONFIG))}
)

We can simply pass in our MotherDuck URL as an environment variable, and our project should still run identically to before, except this time it’s running in The Cloud™.

import os
defs = Definitions(
    assets=load_assets_from_package_module(assets),
    resources={"io_manager":  DuckPondIOManager("datalake", DuckDB(DUCKDB_LOCAL_CONFIG, url=os.environ["MOTHERDUCK_URL"]))}
)
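
To run it, export the URL before launching Dagster, for example (the database name jaffle here is arbitrary; use whatever you like):

$ export MOTHERDUCK_URL='md:jaffle?token=<your token here>'
$ dagster dev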

Pretty easy! But we aren’t taking full advantage of MotherDuck quite yet.


Step 2: Replacing the IOManager

Recall that Dagster uses IO Managers to abstract away the underlying storage system from business logic. This makes it very straightforward to swap out our existing S3-and-Parquet storage for one powered by MotherDuck, without touching any of our business logic!
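
To make this concrete, here is a simplified sketch of an asset in this style (the asset name and query are illustrative, not taken from the project). The business logic only produces a SQL object; the IO manager decides where the result lives:

from dagster import asset
from jaffle.duckpond import SQL

@asset
def order_counts(orders: SQL) -> SQL:
    # No S3, Parquet, or MotherDuck in sight -- just a SQL fragment.
    # The configured IO manager materializes it wherever it sees fit.
    return SQL(
        "select status, count(*) as n from $orders group by status",
        orders=orders,
    )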

Let’s start by copy-pasting the existing DuckPondIOManager as a new IOManager called MotherduckIOManager (inside of duckpond.py):

class MotherduckIOManager(IOManager):
    def __init__(self, bucket_name: str, duckdb: DuckDB, prefix=""):
        self.bucket_name = bucket_name
        self.duckdb = duckdb
        self.prefix = prefix

    def _get_s3_url(self, context):
        if context.has_asset_key:
            id = context.get_asset_identifier()
        else:
            id = context.get_identifier()
        return f"s3://{self.bucket_name}/{self.prefix}{'/'.join(id)}.parquet"

    def handle_output(self, context, select_statement: SQL):
        if select_statement is None:
            return

        if not isinstance(select_statement, SQL):
            raise ValueError(
                f"Expected asset to return a SQL; got {select_statement!r}"
            )

        self.duckdb.query(
            SQL(
                "copy $select_statement to $url (format parquet)",
                select_statement=select_statement,
                url=self._get_s3_url(context),
            )
        )

    def load_input(self, context) -> SQL:
        return SQL("select * from read_parquet($url)", url=self._get_s3_url(context))

We need to make some changes to this:

  • We need to remove S3 from handle_output(), and replace it with a MotherDuck table
  • We need to read from that MotherDuck table in load_input()

Let’s refactor this class step-by-step. First, let’s modify the constructor to remove any reference to S3 buckets:

class MotherduckIOManager(IOManager):
    def __init__(self, duckdb: DuckDB, prefix=""):
        self.duckdb = duckdb
        self.prefix = prefix

Next, let’s replace our helper method that creates S3 URLs with one that creates MotherDuck table names:

    def _get_table_name(self, context):
        if context.has_asset_key:
            id = context.get_asset_identifier()
        else:
            id = context.get_identifier()
        return f"{self.prefix}{'_'.join(id)}"
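
For example, an asset with key ["stg", "customers"] and the default empty prefix maps to the table stg_customers.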

Next, we’ll modify handle_output() to CREATE OR REPLACE a table by that name, rather than write a Parquet file to S3:

    def handle_output(self, context, select_statement: SQL):
        if select_statement is None:
            return

        if not isinstance(select_statement, SQL):
            raise ValueError(
                f"Expected asset to return a SQL; got {select_statement!r}"
            )

        self.duckdb.query(
            SQL(
                "create or replace table $table_name as $select_statement",
                table_name=self._get_table_name(context),
                select_statement=select_statement,
            )
        )

Notice the only thing we changed here was replacing the COPY SQL statement with a CREATE OR REPLACE TABLE statement towards the end of the function.
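
For a hypothetical asset named customers, the rendered SQL goes from roughly:

copy (select ...) to 's3://datalake/customers.parquet' (format parquet)

to:

create or replace table customers as (select ...)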

Finally, let’s read from that table:

    def load_input(self, context) -> SQL:
        return SQL(
            "select * from $table_name", table_name=self._get_table_name(context)
        )

Not too hard! We just SELECT from the table instead of reading a Parquet file as we did before.

One final step to wire it up. Let’s swap out the IOManager in our Definitions inside of __init__.py:

from jaffle.duckpond import MotherduckIOManager
defs = Definitions(
    assets=load_assets_from_package_module(assets),
    resources={
        "io_manager": MotherduckIOManager(DuckDB(url=os.environ["MOTHERDUCK_URL"]))
    },
)

The only change here is that we’ve changed the io_manager resource to point to MotherduckIOManager instead of DuckPondIOManager.

Step 3: There is no step 3!

Now if you simply run the project, you will see beautiful green boxes in the Dagster UI (after a short moment).

And even better, we can open up the MotherDuck UI and explore our tables, just like any other data warehouse.
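
You can also verify from the terminal with the same md: URL style we used earlier (the database and table names depend on your MOTHERDUCK_URL and asset keys):

$ duckdb 'md:jaffle?token=<your token here>'
D show tables;
D select * from customers limit 5;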

This is a huge usability improvement on top of S3 and Parquet, and it’s much easier to collaborate using MotherDuck rather than vanilla DuckDB.

It is worth noting that this project has over 100 lines of business logic that we didn't even need to touch (in a real app, this could be tens of thousands, if not more!). Due to MotherDuck’s deep integration with DuckDB and Dagster’s IOManager abstraction, we were easily able to scale up from our “poor man’s data lake” to something much more robust, easy to use, and team-friendly.

That’s all we have for today. Thanks for following along, and let us know if you have success with this stack by joining our community and providing feedback!



We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a Github discussion. If you run into any bugs, let us know with a Github issue. And if you're interested in working with us, check out our open roles!
