Building shared spaces for data teams at Drizly
March 15, 2021 • 5 minute read
- Dennis Hume
A shared space for non-SQL workflows
When we started rearchitecting our data infrastructure at Drizly a year and a half ago, we knew we wanted our data platform to be an application, supporting multiple stakeholders contributing and collaborating together in shared spaces. At the time we were a small team of data scientists and analysts who had to support the analytic needs of the entire company (from marketing to operations, partnerships and more).
Like many teams, we found that without a common data platform, there was a lot of duplicated data work being done, and getting that work to be consistent and reproducible was always a struggle.
We want to give people across our organization the ability to iterate quickly on data projects in a safe environment.
But non-SQL workflows posed a tougher challenge. Here the different needs of the team became more apparent. Data scientists want the ability to develop machine learning models, with unique dependencies, that adhere to strict SLAs; while an analyst may want to add Python transformations to a table in the warehouse without having to worry about the underlying infrastructure.
We saw Dagster as a framework that could provide us with a new shared space to support all our data roles.
One of the things that we’ve found works really well with Snowflake and dbt is that everything is built around the idea of a SQL model that fits into an overarching DAG. This is helpful for members of the team, who can always think about how the model they’re working on fits into the DAG.
Dagster is built on the same principles, and we liked the way our team members could think of the work they were doing as a solid that fits into a pipeline.
But to take full advantage of Dagster, you also want to leverage additional abstractions beyond the DAG, like modes, presets, and resources. We wanted to make the power of these abstractions available to our users, but we didn’t want learning about Dagster concepts to be a barrier to getting work integrated.
So we spent a lot of time thinking about how to standardize our processes. Our goal was to make it possible for someone to spin up a Dagster pipeline in a single PR during their first week on the job.
Standardized deployments for the entire development lifecycle
Before Dagster, deploying new data pipelines involved a lot of effort and risk. Usually, someone would build something on their local machine, we would configure the bare minimum AWS resources to support it, and we would throw it into production to test. This led to drawn out deployments with numerous back and forth conversations to resolve issues. As a result, getting new data workflows into production was a burden and making changes to existing pipelines was generally avoided.
Instead we wanted to frontload testing and invest a lot in our workflow to ensure a smooth transition through the development lifecycle and give users the tools they needed for their pipelines from the beginning.
One reason our Snowflake and dbt environment are effective is that every user can experiment with their models safely within a cloned database, run unit tests, and know that their work will be reflected in production when it’s merged. Our goal was to get to that same level of confidence with Dagster.
We decided on 4 key deployments within our Dagster project, each of which is tied to a stage in the development lifecycle and a deployment target for running Dagster, whether on a local machine or on AWS:
|local||Local, Python virtual environment||Ensure pipelines compile and check logic of solids against locally-stored, pre-configured data||< 30 seconds|
|docker compose||Local, Docker||Ensure that the pipeline fits into the overall Dagster application and that all images are created correctly. Also test any schedules or sensors.||1 min|
|dev||AWS dev account||Ensure integration testing and that all AWS permissions are configured correctly||5-10 mins|
|prod||AWS prod account||5-10 mins|
These deployments allow our team to take advantage of Dagster features without necessarily knowing all the details of the abstraction layer.
We use environment variables to filter the Dagster objects (modes, presets, resources, and schedules) that are available in each deployment. This makes sure that users have the facilities they need at each stage of their dev work, while preventing confusion (trying to launch a pipeline that might not have the supporting infrastructure present) and adding safety (not having someone accidentally launch a prod pipeline from their machine).
For example, in our local deployment, users only have access to local mode. This means they can only run their pipeline with mocks and no external dependencies. This is the perfect environment to ensure the pipeline is compiling, to test the logic of solids on small, pre-configured data sets, or to execute unit tests.
This process lets users take a pipeline from their local machine to production without worrying about infrastructure details.
When a user moves past this stage to deploy using docker-compose, they will have access to both local and dev mode. Now they can run their pipelines with resources configured to use Docker services within the compose environment or connections to sandbox environments.
Finally, when a user is ready to put out a PR, our CI runs final integration tests on our Dagster project in our AWS dev and prod accounts.
This process lets users take a pipeline from their local machine to production without worrying about infrastructure details. Whether they are setting up scheduled jobs or creating an event driven workflow with sensors, all configuration details can be maintained by the user, in Python, alongside their pipeline code.
Light-weight infrastructure for local development
Designing our workflow around different deployments also allows us to use different Dagster instance configurations for different stages of development. This means users don’t need to set up a lot of local infrastructure in order to run tests; the same code will run in production with production-appropriate scheduling and execution.
|local||local||Default||Single||Running a single Dagit locally on a single workspace. Running unit tests.|
|docker compose||local, dev||Docker||All||Running our entire Dagster project (all workspaces) as well as the daemon and any supporting infrastructure|
|dev||local, dev||AWS ECS||All||Our Dagster AWS stack on our dev AWS account|
|prod||prod||AWS ECS||All||Production Dagster|
The local deployment uses the default run launcher, while the docker compose deployment uses the
DockerRunLauncher to isolate runs in their own containers.
Our two AWS deployments (on our dev and prod) accounts both take advantage of a Dagster ECS Launcher (currently a custom launcher).
Our deployments in compose, dev and prod also include the Dagster daemon to run our schedules. This means users can define the logic of their own schedules in code and test them without having to think of the relationship between Dagit and the daemon.
This flexibility was a big selling point for Dagster, since our team can benefit from all of the Dagster features without ever touching the
workspace.yaml files (or even knowing that these files exist!).
Enabling self-service on the data platform
This separation of concerns means that a small core data infrastructure team can efficiently ship a data platform for users with different skillsets. We use a set of cookie-cutter templates to ensure that any new pipelines are set up with the correct modes, presets, and schedules for our deployment environments.
Dagster and the structure of our deployments are getting us to the point where everyone on the team can feel comfortable trying things out.
Now, anyone can contribute an additional pipeline, without any real coordination overhead to ensure that it will behave like everything else we’ve already stood up.
Before we started this work, people on the team would say “I want to try and do X but I don’t want to break anything.” As a data engineer, it always warmed my heart to hear people worry about the data stack. But as we continue to grow as an organization, we wanted there to be as little unnecessary coordination as possible preventing people from doing their jobs.
We want to give people across our organization the ability to iterate quickly on data projects in a safe environment. Dagster and the structure of our deployments are getting us to the point where everyone on the team can feel comfortable trying things out.
We are still getting up to speed with everything Dagster can do, and we’re excited about streamlining more of our processes. For example, we have a number of isolated workflows running in one-off environments (like AWS Step Functions) that would be better consolidated on Dagster. We also have microservices that might be better expressed as sensors within the Dagster system. We’d like to use the Dagit UI as a unified interface for these functions, some of which lie outside of the sphere of analytics proper.
Incrementally Adopting Dagster at Mapbox
- Ben Pleasanton