Abstracting Pipelines for Analysts with a YAML DSL
In this case study, we share how the data engineering team at SimpliSafe redesigned its analytics workflows to support dozens of analysts and hundreds of business users.
By deploying Dagster’s composable, fully-featured framework and automating deployments, the engineering team developed tremendous leverage, supporting a pool of analysts many times the size of the engineering team.
By eliminating overhead and removing the analysts’ dependency on the central team, the engineers can focus on expanding the platform and building new features to boost data quality and contain costs.
SimpliSafe: Home security means data at scale
SimpliSafe’s team of more than 1,200 people provides home protection services and security technology to over four million people throughout the United States and the United Kingdom, guarding against everything from intruders to fires, water damage, and more.
As a major supplier of in-home IoT technology, SimpliSafe generates and manages an impressive amount of data. The engineering team pipes 5TB of data daily from fifty heterogeneous data source types. They run 450 data pipelines for pulling data from sources and another 350 for data aggregation.
The analytics team’s central Athena instance manages 40 TB of AWS S3 data across 1,700 tables, some exceeding a trillion records. Between the analysts and the business users, the data platform serves 300 people.
With just six people on the engineering team to support all critical business processes, it is vital to set up the business users to work with as much autonomy as possible. Doing so lets the engineering team focus on enhancing the data processes instead of getting tied up in maintaining and fixing things.
In this case study, we will share how the data engineering team at SimpliSafe built a data platform on Dagster and EKS, abstracting pipeline creation away from the analysts so they can define and deploy new data pipelines without depending on the central data engineering team.
“EKS and Dagster, they just work. It’s amazing. In the morning, I open the UI, and I see all the pipelines that ran overnight.” - Daniel Nuriyev
The need for a data platform
When Daniel joined SimpliSafe, the team had set up its data flows on StreamSets. SimpliSafe’s business has seasonal spikes, and during one of these spikes StreamSets stopped working, creating a major challenge for Daniel and his team. He had to scramble to rewrite the most critical pipelines in Python at a breakneck pace. Though Daniel’s quick work got the team back on track, the event exposed serious weaknesses in the old setup, which would not scale reliably.
After the unpleasant fire drill, the team started to map out the foundations of a better data platform.
Reviewing the available options
With a revised set of specifications in hand and the vision of abstracting away the complexity from the analysts, the data engineering team evaluated 25 possible solutions. Knowing they would need to contend with complex scheduling, handle upstream dependencies, and define functions at runtime, they looked for a composable, flexible solution with strong deployment options.
The final decision came down to an evaluation of Dagster and Prefect, with Dagster winning out on the strength of its more complete feature set, which not only met SimpliSafe’s requirements but also left room for future development.
The SimpliSafe team started work on the new architecture in January 2022, with engineers rotating through the build. Over a three-week period, the team set up Dagster and EKS and built a basic YAML parser, with deployment handled by AWS CDK.
Abstracting away the pipeline design for analysts with a YAML DSL
SimpliSafe has dozens of data analysts who need to rapidly define and put into production new pipelines for transforming and reporting on critical business data.
The upstream data sits in heterogeneous sources, including databases (MySQL, MongoDB, etc.) and business SaaS applications, some of which have rate-limited APIs. In addition, other data teams provide data for analysis by exporting it from their source systems and dropping it in a shared location.
The analysts are typically conversant with SQL but not Python, so Daniel and the team set out to build a domain-specific language (DSL): an internal interface that would let analysts define the sources, schedules, and transformations required in a format that is easy to learn and quick to deploy.
From YAML to deployed pipelines
Building on Dagster and Amazon Elastic Kubernetes Service (EKS), the SimpliSafe team developed an abstraction layer that let the analysts install and run a local instance of Dagster, then specify in YAML the properties of the pipeline they needed: source data, schedules, dependencies, transformations required, and final destination(s) for the new data.
From setting up their local instance to learning the process of specifying pipelines in YAML, most analysts are up and running within a day, says Daniel.
Once the pipeline works locally, analysts submit a pull request (including the YAML file and any SQL files required) to put their pipeline into production.
From here, there is a streamlined approval process (an engineering review, analytics approval, and signoff by the analytics manager), and the pipeline is added to production.
The YAML file is stored in a discrete Dagster code location and translated on the fly to Dagster Ops and Assets.
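This on-the-fly translation can be pictured as a small factory that walks the parsed spec and emits one callable per pipeline step. The sketch below is a hypothetical, simplified illustration of that pattern; the function names (`make_step`, `build_ops`) and config keys are assumptions, not SimpliSafe’s actual schema, and in the real platform each callable would be wrapped with Dagster’s `@op` or `@asset` decorators.

```python
from typing import Callable, Dict, List

# Hypothetical parsed YAML spec: these keys are illustrative assumptions,
# not SimpliSafe's actual DSL schema.
spec = {
    "pipeline": "orders_daily",
    "steps": [
        {"name": "extract_orders", "kind": "mysql_query"},
        {"name": "load_orders", "kind": "athena_write", "deps": ["extract_orders"]},
    ],
}

def make_step(step: Dict) -> Callable[[], str]:
    """Create one callable per step; the real platform would wrap this
    with Dagster's @op or @asset decorator."""
    def run() -> str:
        # Placeholder for the actual extract/load logic.
        return f"ran {step['name']} ({step['kind']})"
    run.__name__ = step["name"]
    return run

def build_ops(spec: Dict) -> List[Callable[[], str]]:
    """Translate the parsed spec into ordered callables (a topological
    sort on 'deps' would go here; this sample is already in order)."""
    return [make_step(s) for s in spec["steps"]]

ops = build_ops(spec)
print([op() for op in ops])
```

Because the callables are generated rather than hand-written, adding a new pipeline is purely a matter of adding a new YAML file, which is what keeps the analysts independent of the engineering team.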
As an example, an analyst’s YAML file can connect to MySQL and pipe data into Athena.
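A spec of this kind might look like the following sketch; every key and value below is a hypothetical illustration, not SimpliSafe’s actual DSL schema.

```yaml
# Hypothetical spec; key names are illustrative, not SimpliSafe's actual DSL.
pipeline: orders_daily
schedule: "0 6 * * *"          # run every morning at 06:00
source:
  type: mysql
  connection: orders_db        # resolved from a managed secrets store
  table: orders
destination:
  type: athena
  database: analytics
  table: team_sales_orders     # team prefix keeps tables segregated
```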
Analysts can also include SQL in their YAML specification file. SQL of this kind does not pipe data from a source; it aggregates existing data for analytics.
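An aggregation spec of that sort might look like the sketch below; again, the keys are assumptions rather than SimpliSafe’s actual DSL, and the referenced SQL file is the kind submitted alongside the YAML in the pull request.

```yaml
# Hypothetical aggregation spec; key names are illustrative.
pipeline: sales_summary_weekly
schedule: "0 7 * * MON"
depends_on:
  - orders_daily               # upstream pipeline dependency
sql: sales_summary.sql         # SQL file submitted alongside the YAML
destination:
  type: athena
  database: analytics
  table: team_sales_summary
```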
If analysts need to include more complex logic, they always have the option to inject Python code.
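As an illustration, injected logic might be an ordinary transformation function registered alongside the YAML; the function below and its registration mechanism are assumptions, not SimpliSafe’s actual interface, and in practice such a function would be exposed to the pipeline as a Dagster op or asset.

```python
# Hypothetical injected transformation; the hook into the pipeline is
# an assumption, not SimpliSafe's actual interface.
def normalize_zip_codes(rows):
    """Custom logic too fiddly for the YAML DSL: pad US ZIP codes
    that lost their leading zeros upstream."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        row["zip"] = str(row["zip"]).zfill(5)
        out.append(row)
    return out

print(normalize_zip_codes([{"zip": 2139}, {"zip": "90210"}]))
# prints [{'zip': '02139'}, {'zip': '90210'}]
```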
“Essentially, we developed a better version of dbt,” says Daniel, “We have tens of analysts at the company, and we never ran into a problem; they all are happy.”
Finally, data is queried through the Athena service or by using Tableau.
Limitations of the YAML DSL approach
There are some limitations to this approach. First, not everything can be abstracted away in YAML. When rare edge cases do emerge, Daniel and the team can assist. If the analyst has some familiarity with Python, they can code and add their own Dagster Assets to the pipeline.
The second limitation is that not everything can be tested locally, especially inter-pipeline dependencies. For this reason, SimpliSafe has a staged deployment process and testing prior to moving to production.
Maintaining the platform and order in the data
The engineering team consists of six engineers (including Daniel), and Dagster and its pipelines are just part of their overall workload. Supporting the Dagster platform involves having one engineer on call to assist users when needed. Otherwise, the team works on connecting new data sources, troubleshooting, and building out the overall platform.
The SimpliSafe team stays current with the rapidly evolving Dagster project through quarterly updates and so far has not encountered any major rework due to new releases.
To make the entire data process reliable, Daniel has avoided a Data Lake approach. Instead, he has maintained high quality by controlling data flow into the warehouse. Data is well structured in Athena, allowing each team to have a discrete table prefix. Each new data source is carefully structured by the engineering team before it is ingested.
Each analyst team’s work is segregated with Athena table name prefixes and separate Dagster folders with their own pipelines.
“Teams can create a mess for themselves in their data, but not for the other teams,” says Daniel.
The team also builds jobs to purge old data from the databases and enforce naming conventions on the tables. Dagster supports this effort because any engineer can implement business logic. Database maintenance depends on a number of Python checks added to Dagster jobs, and the team uses a combination of jobs and sensors to keep the process efficient.
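One such check might validate the team-prefix naming convention before a table is created. The sketch below is a minimal hypothetical example of that kind of check (the prefix list and function name are assumptions), of the sort that could run inside a Dagster job.

```python
# Hypothetical naming-convention check of the kind that might run inside
# a Dagster job; the prefix list is an assumption for illustration.
ALLOWED_PREFIXES = ("team_sales_", "team_ops_", "team_support_")

def table_name_is_valid(table: str) -> bool:
    """Enforce that every Athena table carries its team's prefix."""
    return table.startswith(ALLOWED_PREFIXES)

assert table_name_is_valid("team_sales_orders")
assert not table_name_is_valid("orders")
```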
The benefits of a solid data platform
With a solid platform in place, the SimpliSafe engineering team has tremendous leverage, spending minimal time troubleshooting issues and supporting a pool of analysts several times the size of the engineering team.
With less time spent on reactive efforts, Daniel and the team can focus on enabling the organization with new data sources, optimizing the existing processes, and building new features to boost data quality and contain costs.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!