November 2, 2022 · 5 minute read

Running Data Science Notebooks with Dagster: a Noteable integration

Jamie DeMaria

Over the last few years, the discipline of data management (in its broadest sense) has seen an exciting convergence of several disciplines: data engineering, data science, data visualization, machine learning, and business analytics are all working to break down silos and learn from each other.

Each has brought its own set of tools, frameworks, and ways of working to the table.

Plenty of critical business applications and processes run scheduled data pipelines on Dagster. A growing use case is in data science, and the tool of choice in that field is the data notebook.

One of the more exciting notebook providers is Noteable. Noteable is a collaborative data notebook that enables data-driven teams to use and visualize data, together. Two key focus areas for the Noteable team are:

  • creating a platform that is accessible to the many personas that touch notebooks
  • connecting that platform to the heterogeneous toolset that makes up the tech stack of data science.

From this perspective, we believe Dagster users will find Noteable's philosophy very simpatico with our own approach to product development.

If orchestrating notebooks is on your roadmap, the Noteable team has just made this much easier with their new Dagster integration.

A diagram of how the Dagster, Dagstermill, and Noteable systems work together.

Doesn’t Dagster already run notebooks?

Yes, it does. From the get-go, Dagster has provided support for notebooks thanks to the Dagstermill library (docs here), which was integral to the first releases of Dagster. The Dagstermill package makes it straightforward to run notebooks using the Dagster tools, integrating them into data pipelines with heterogeneous computation and storage: for instance, Spark jobs, SQL statements run against a data warehouse, or arbitrary Python code.

Dagstermill lets you:

  • Run notebooks as ops in heterogeneous data jobs with minimal changes to notebook code
  • Define data dependencies to flow inputs and outputs between notebooks and between notebooks and other ops
  • Use Dagster resources and the Dagster config system from inside notebooks
  • Aggregate notebook logs with logs from other Dagster ops
  • Yield custom materializations and other Dagster events from your notebook code to keep Dagster aware of your operations inside a notebook.

With Dagstermill there is no need to translate notebook code into some other (less readable and interpretable) format. Instead, you can use notebooks as ops directly in data pipelines with minimal effort. This runs your notebook within the Dagster environment and provides links to the resulting files.
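
For reference, a minimal Dagstermill asset definition looks something like the sketch below. The asset name and notebook path are placeholders of our own; see the Dagstermill docs for the full API.

from dagster import file_relative_path
from dagstermill import define_dagstermill_asset

# Wrap a local Jupyter notebook as a Dagster asset.
# The path below is a placeholder for your own notebook file.
iris_notebook = define_dagstermill_asset(
    name="iris_notebook",
    notebook_path=file_relative_path(__file__, "notebooks/iris.ipynb"),
)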

The Dagster + Noteable power-up…

By introducing Noteable into the equation, you gain several powerful capabilities using the existing dagstermill interface, including:

  • commentable and collaborative notebook links
  • improved visualizations
  • native data connectors
  • notebook commenting and notifications
  • the ability to fix issues on live jobs
  • automatic notebook versioning tied to the original notebook file

For example, should your scheduled notebook execution fail, you get a live link to the notebook, making it easier to locate and debug errors. The live notebook stays around for 90 minutes by default so you can troubleshoot, and the notebook run is retained indefinitely so you can relaunch it as a copy later. This can save hours or even days when a long-running execution hits an issue such as data drift or a renamed column.

Below is an example of a failed Noteable notebook orchestrated by Dagster. You can edit this session to correct the problem.

example of a failed Noteable notebook orchestrated by Dagster

This allows you to fix an issue, then resume execution of the notebook from the last successful asset materialization. This is a major productivity boost (not to mention a less stressful development experience).

Let’s look at how to set up a scheduled Noteable notebook using Dagster:

  1. Build your Noteable workflow:
  • Create a notebook within Noteable, or select an existing notebook to use.
  • If you would like to override parameters used within the notebook, mark the cell that contains them with the parameters tag, using the dropdown next to the cell you want replaced (see the example below). Marking a cell as parameters allows Dagster to override the notebook's default parameters at execution time.

[Note: run parameters can come from run_config or from upstream assets.]

For more details about what a parameterized cell is used for, see the Papermill documentation.
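
For illustration, a cell tagged parameters might look like the sketch below. The names and default values are hypothetical; at execution time, an injected cell overrides them with the values Dagster passes in.

# This cell carries the "parameters" tag. Papermill injects a new cell
# after it that overrides these defaults with the values supplied by Dagster.
n_estimators = 100          # hypothetical model parameter
dataset_path = "iris.csv"   # hypothetical input location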

  • Next, copy the notebook URL. You can schedule a specific version of a notebook using that version’s URL. If you just want to use the notebook or version id, you can use noteable://{id} as the input name.
  • Get your API token from Noteable
    • Within user settings, go to the API Token page, and generate a new token. Add the token to Dagster secrets for the key NOTEABLE_TOKEN.
  • You can also override the host domain name by setting the NOTEABLE_DOMAIN key in Dagster secrets (e.g., NOTEABLE_DOMAIN=cluster_name.noteable.io). If not provided, the client will assume the default app.noteable.io public offering.
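
As a quick, purely illustrative sanity check, you can confirm these variables are visible to your environment before launching Dagster:

import os

# NOTEABLE_TOKEN is required; NOTEABLE_DOMAIN falls back to the public offering.
token = os.environ["NOTEABLE_TOKEN"]
domain = os.environ.get("NOTEABLE_DOMAIN", "app.noteable.io")
print(f"Will connect to {domain}")
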
  2. Set up Dagster:
pip install dagster
dagster project scaffold --name my-dagster-project
cd my-dagster-project
  • Add the papermill_origami requirement to your pyproject.toml:
requires = ["setuptools", "papermill_origami"]
  • In your my_dagster_project/assets/__init__.py file, define a notebook asset:
    • You can find your notebook_id in the URL for your notebook, like this:
      • https://app.noteable.io/f/9b92ef52-29af-497a-bbd1-d14c18b27e5d/What-can-you-do-in-a-Noteable-notebook.ipynb
from papermill_origami.noteable_dagstermill import (
    define_noteable_dagster_asset,
)

# The notebook id is the UUID segment of the notebook URL shown above.
notebook_id = "9b92ef52-29af-497a-bbd1-d14c18b27e5d"

noteable_asset = define_noteable_dagster_asset(
    name="my_noteable_asset",
    notebook_id=notebook_id,
)
  • Add the domain and token secrets to Dagster:
    • NOTEABLE_DOMAIN [Optional]
    • NOTEABLE_TOKEN

To find out how to add these secrets to your setup, check out the Dagster guides on using environment variables and secrets.

  3. Run the notebook:

We are going to run and test this integration locally. We recommend running Dagster's UI (Dagit) locally to test connecting to and running Noteable:

pip install -e ".[dev]"
dagit

You should see an Asset Graph (Dagster’s visual representation of a DAG) using your defined operations and assets, including the new Noteable node.

  • Next, launch a materialization of your notebook asset by clicking ‘Materialize all’
  • View the execution results with a link to the modified run of the original notebook

Once you are happy with these changes, you can push your code to production via a branch using the Transitioning Data Pipelines from Development to Production guide.

Passing Data to Noteable from Dagster

Using Papermill with Dagster follows the same principles of data transport for both Dagstermill and Noteable, with one exception: parameters are serialized into the notebook's parameter cell rather than loaded at runtime by the library. This works much like a Dagstermill notebook execution, except that the integration serializes your data directly into the injected notebook cell instead of relying on a data loader object.

Let’s look at an example.

Say you have an asset that fetches a small DataFrame, and you want to pass that DataFrame as an input to a Noteable notebook. First, you define the asset. In this case, we load the iris dataset into a Pandas DataFrame.

import numpy as np
import pandas as pd
from sklearn import datasets

from dagster import asset

@asset
def iris():
    sk_iris = datasets.load_iris()
    return pd.DataFrame(
        data=np.c_[sk_iris["data"], sk_iris["target"]],
        columns=sk_iris["feature_names"] + ["target"],
    )

Then you define the ins argument to define_noteable_dagster_asset as you would for any other asset:

from dagster import AssetIn, AssetKey

noteable_asset = define_noteable_dagster_asset(
    name="my_noteable_asset",
    notebook_id=notebook_id,
    ins={
        "iris": AssetIn(key=AssetKey("iris")),
    },
)

This produces an asset graph in Dagster. Our DataFrame is now an asset generated by the iris step and acts as an input to the notebook.

When materialized, the notebook will have an iris variable loaded after the parameter cell is executed. You can now reference this anywhere in your notebook as a local variable.
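
For example, once the parameter cell has run, downstream cells can treat iris like any other local DataFrame. This is just a sketch; the calls shown are ordinary pandas:

# Inside the Noteable notebook, after the injected parameters cell runs:
iris.describe()                # summary statistics for the iris DataFrame
iris.groupby("target").mean()  # per-class feature means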

Restrictions on Parameters

Parameterization today serializes the content from the Dagster node to the Noteable notebook.

This means that:

  A) only cloud-pickleable parameters can be passed, and
  B) very large parameters (above a couple of MB) will be rejected.
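
If you are unsure whether a given parameter will make it through, one way to check locally (assuming cloudpickle is installed; my_param is a stand-in for your object) is to serialize it yourself and inspect the size:

import cloudpickle

my_param = {"rows": list(range(1000))}  # stand-in for your actual parameter
payload = cloudpickle.dumps(my_param)   # raises if the object isn't picklable (restriction A)
print(f"{len(payload) / 1_000_000:.2f} MB")  # payloads beyond a couple of MB are rejected (restriction B)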

If you have more complicated parameters you wish to pass, consider putting them into a database or blob storage and referencing the data's path so it can be loaded at runtime. Noteable also supports Data Connections with native SQL cells, so loading from a shared database is easy.
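
One way to apply this pattern (a sketch, with a hypothetical S3 bucket, and assuming s3fs and credentials are configured) is to have the upstream asset persist the data and pass only its location to the notebook:

import pandas as pd
from sklearn import datasets

from dagster import asset

@asset
def iris_path() -> str:
    # Persist the DataFrame where both Dagster and the notebook can reach it;
    # the bucket below is hypothetical.
    df = datasets.load_iris(as_frame=True).frame
    path = "s3://my-bucket/iris.parquet"
    df.to_parquet(path)
    return path  # the notebook receives this small string instead of the data

Inside the notebook, pd.read_parquet(iris_path) then loads the data at run time.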

More science in the pipeline

We hope this overview illustrates how the tools used in data orchestration are integrating more closely with the toolkit used in data science. The premise of the Modern Data Stack is to empower data teams with best-of-breed tools that support these different disciplines while also encouraging software development best practices (version control, CI/CD, testing), and these integrations are a big step in that direction.

This is an exciting first version, and you can look forward to future collaboration between the Noteable and Elementl teams to provide a more seamless developer experience as data science teams weave more and more models into the core pipelines that drive modern organizations.



We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!
