November 28, 2023 • 8 minute read •

Case Study: Abstracting Pipelines for Analysts with a YAML DSL

How SimpliSafe’s small engineering team uses a YAML DSL within Dagster’s powerful data platform to support analysts and business stakeholders.

Name: Fraser Marlow
Handle: @frasermarlow

In this case study, we share how the data engineering team at SimpliSafe redesigned its analytics workflows to support dozens of analysts and hundreds of business users.

By deploying Dagster’s composable, fully-featured framework and automating deployments, the engineering team gained tremendous leverage, supporting a pool of analysts many times the size of the engineering team.

By eliminating overhead and removing the analysts’ dependency on the central team, the engineers can focus on expanding the platform and building new features to boost data quality and contain costs.

SimpliSafe: Home security means data at scale

SimpliSafe is a 1,200+ strong team that provides home protection services and security technology to over four million people throughout the United States and the United Kingdom, guarding against everything from intruders to fires, water damage, and more.

SimpliSafe is a home protection company that generates a lot of data through IoT devices.

As a major supplier of in-home IoT technology, SimpliSafe generates and manages an impressive amount of data. The engineering team pipes 5TB of data daily from fifty heterogeneous data source types. They run 450 data pipelines for pulling data from sources and another 350 for data aggregation.

The analytics team’s central Athena instance manages 40TB of AWS S3 data across 1,700 tables, some exceeding 1T records. Between the analysts and the business users, there are 300 users of the data platform.

With just six people on the engineering team to support all critical business processes, it is vital to set up the business users to work with as much autonomy as possible. Doing so lets the engineering team focus on enhancing the data processes instead of getting tied up in maintaining and fixing things.

In this case study, we will share how the data engineering team at SimpliSafe built a data platform on Dagster and EKS, abstracting away the pipeline creation steps from the analysts to remove any dependencies on the central data engineering team when it comes to defining and deploying new data pipelines.

“EKS and Dagster, they just work. It’s amazing. In the morning, I open the UI, and I see all the pipelines that ran overnight.” - Daniel Nuriyev

The need for a data platform

The need for a robust data platform was apparent early on to Daniel Nuriyev, Senior Manager of Data Engineering at SimpliSafe. But it was one particular fire drill that drove the point home.

When Daniel joined SimpliSafe, the team had set up data flows on Streamsets. SimpliSafe has seasonal spikes in the business. During one of these spikes, Streamsets stopped working, creating a major challenge for Daniel and his team. He had to scramble to rewrite the data management processes for the most critical pipelines in Python at a breakneck pace. Though Daniel’s quick thinking helped to get the team back on track, the event exposed some serious weaknesses in the old setup, which would not scale reliably.

After the unpleasant fire drill, the team started to map out the foundations of a better data platform.

Reviewing the available options

With a revised set of specifications in hand and the vision of abstracting away the complexity from the analysts, the data engineering team evaluated 25 possible solutions. Knowing they would need to contend with complex scheduling, upstream dependencies, and functions defined at run time, they looked for a composable, flexible solution with strong deployment options.

The final decision came down to an evaluation of Dagster and Prefect, with Dagster winning out based on the more complete feature set, which not only met SimpliSafe’s requirements but provided options for future development.

The SimpliSafe team started work on the new architecture in January 2022, with engineers rotating through the build. Over a three-week period, the team set up Dagster and EKS and built a basic YAML parser, with deployment handled by AWS CDK.

Abstracting away the pipeline design for analysts with a YAML DSL

SimpliSafe has dozens of data analysts who need to rapidly define and put into production new pipelines for transforming and reporting on critical business data.

The upstream data sits in heterogeneous sources, including databases (MySQL, MongoDB, etc.) and business SaaS applications, some of which have rate-limited APIs. In addition, other data teams provide data for analysis by exporting it from their source systems and dropping it in a shared location.

The analysts are typically conversant with SQL but not Python, so Daniel and the team set out to build a Domain Specific Language (DSL) - an internal interface that would allow the analysts to define the sources, schedules, and transformation required in a format that would be easy to learn and quick to deploy.

From YAML to deployed pipelines

Building on Dagster and Amazon Elastic Kubernetes Service (EKS), the SimpliSafe team developed an abstraction layer that let the analysts install and run a local instance of Dagster, then specify in YAML the properties of the pipeline they needed: source data, schedules, dependencies, transformations required, and final destination(s) for the new data.

From setting up their local instance to learning the process of specifying pipelines in YAML, most analysts are up and running within a day, says Daniel.

Once the pipeline works locally, analysts submit a pull request (including the YAML file and any SQL files required) to put their pipeline into production.

A simplified diagram of SimpliSafe's analytics platform architecture.

From here, there is a streamlined approval process (an engineering review, analytics approval, and signoff by the analytics manager), and the pipeline is added to production.

The YAML file is stored in a discrete Dagster code location and translated on the fly to Dagster Ops and Assets.

Here is a sample of an analyst’s YAML file that connects to MySQL and pipes data into Athena:

ingest_method: reloaded
schedule: 0 0 * * *
error_level: error
description: AST Call Log
resources:
  cpu: 1
  memory: 4Gi
steps:
  - name: load
    type: source
    resource: mysql
    config:
      database_name: prd_replica
      table_name: ast_call_log
      connection_secret: analytics-prd_replica
      batch_size: 1000000
  - name: hash_columns
    type: transform
    op: default_transform
    needs:
      - load
    config:
      hash_columns:
        - agent_device
        - agent_name
        - callerid
  - name: target
    type: sink
    resource: athena
    needs:
      - hash_columns
    config:
      table_name: ast_call_log
      primary_key: call_id
  - name: target_pii
    type: sink
    resource: athena
    needs:
      - load
    config:
      table_name: ast_call_log
      primary_key: call_id
      pii: True
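The `needs` entries in the sample above describe a small dependency graph. The article does not show how SimpliSafe's parser resolves it, so the following is only a minimal sketch, using Python's standard `graphlib` module, of how such a list of steps could be turned into a valid execution order. The `steps` list is a hand-written stand-in for the parsed YAML, and `execution_order` is a hypothetical helper, not part of Dagster or of SimpliSafe's code.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory form of the `steps` list from the sample above,
# as it might look after YAML parsing (only `name` and `needs` are kept).
steps = [
    {"name": "load"},
    {"name": "hash_columns", "needs": ["load"]},
    {"name": "target", "needs": ["hash_columns"]},
    {"name": "target_pii", "needs": ["load"]},
]


def execution_order(steps):
    """Return step names ordered so each step follows everything it needs."""
    names = {step["name"] for step in steps}
    graph = {}
    for step in steps:
        needs = step.get("needs", [])
        unknown = [n for n in needs if n not in names]
        if unknown:
            raise ValueError(f"step {step['name']!r} needs unknown steps: {unknown}")
        graph[step["name"]] = set(needs)
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
```

Several orderings are valid here (`target` and `target_pii` are independent of each other), so the sketch only guarantees that every step appears after the steps it needs.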

Analysts can also include SQL in their YAML specification file. The following example does not pipe data from a source; instead, it aggregates existing data for analytics:

ingest_method: reloaded
error_level: error
dependencies:
  assets:
    - datalake_agg__order_detail
    - datalake_agg__node_revisions
description: >
  This table is for order component details
steps:
  - name: free_components
    type: link
    op: ctas
    config:
      table_name: _free_components
      query_file: _free_components.sql
      temp: True
  - name: monitoring_orders
    type: link
    op: ctas
    config:
      table_name: _monitoring_orders
      query_file: _monitoring_orders.sql
      temp: True
  - name: sdl_order_component_details
    type: link
    op: ctas
    needs:
      - free_components
      - monitoring_orders
    config:
      table_name: order_component_details
      query_file: sdl_order_component_details.sql
validation:
  asset: datalake_agg__sdl_order_component_details
  steps:
    - name: order_count
      type: generic_sql
      config:
        sql: >
          SELECT
              order_count
          FROM datalake_agg.sdl_order_component_details
    - name: today_lower_count
      type: generic_sql
      config:
        sql: >
          SELECT
            (a.count-b.count) as count
          FROM datalake_agg.sdl_orders_audit as a,
          datalake_agg.sdl_order_component_details as b
          HAVING (a.count-b.count) > 0
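The `ctas` op in the sample above presumably stands for CREATE TABLE AS SELECT, which materializes the result of a query as a new table. As a rough illustration of the idea, the sketch below wraps a query in a CTAS statement and runs it against an in-memory SQLite database. SQLite is only a stand-in for Athena here, the `orders` table and the query text are invented for the example, `build_ctas` is a hypothetical helper, and mapping `temp: True` to a temporary table is an assumption about what that flag means.

```python
import sqlite3


def build_ctas(table_name, query, temp=False):
    """Wrap a SELECT statement in CREATE TABLE ... AS (CTAS)."""
    keyword = "CREATE TEMP TABLE" if temp else "CREATE TABLE"
    return f'{keyword} "{table_name}" AS {query.strip().rstrip(";")}'


# In-memory SQLite database with made-up data, standing in for Athena.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, price REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 0.0), (2, 19.99), (3, 0.0)])

# Hypothetical stand-in for the contents of _free_components.sql.
free_components_sql = "SELECT order_id FROM orders WHERE price = 0;"

# Materialize the query result as an intermediate table, then read it back.
conn.execute(build_ctas("_free_components", free_components_sql, temp=True))
rows = conn.execute('SELECT order_id FROM "_free_components" ORDER BY order_id').fetchall()
print(rows)
```

Later steps that list `free_components` under `needs` could then query the intermediate table by name, which is the pattern the YAML sample suggests.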

If analysts need to include more complex logic, they always have the option to inject Python code:

ingest_method: appended
skip: False
appended_schedule:
  schedules:
    - hour_of_day: 1
error_level: error
resources:
  memory: 16Gi
  cpu: 1
description: >
  Create a non pii version of datalake_pii_agg in datalake_agg
steps:
  - name: query
    type: source
    resource: athena_query_extract
    config:
      query: SELECT * FROM "datalake_pii_agg" where _partitioned__date = {START_DATE}
      output_type: dataframe
      big: True
      schema:
        app_group_id: string
        conversion_behavior_index: Int64
  - name: hash_columns
    type: transform
    op: default_transform
    needs:
      - query
    config:
      hash_columns:
        - email_address
        - phone_number
  - name: cast_columns
    type: transform
    op: cast_column_types
    needs:
      - hash_columns
    config:
      columns:
        app_group_id: string
        conversion_behavior_index: Int64
  - name: target
    type: sink
    resource: athena
    needs:
      - cast_columns
    config:
      table_name: users
      primary_key: id
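Both samples that handle PII rely on a `hash_columns` setting. The article does not say which hash function or data structure the `default_transform` op uses, so the sketch below is only one plausible reading: it replaces the listed columns with SHA-256 digests, operating on plain Python dictionaries rather than a dataframe. The function name, the sample row, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def hash_columns(rows, columns):
    """Return copies of rows with the named columns replaced by SHA-256 hex digests."""
    hashed = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's rows are not modified
        for column in columns:
            value = new_row.get(column)
            if value is not None:  # leave missing or NULL values untouched
                new_row[column] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        hashed.append(new_row)
    return hashed


# Made-up example row; the column names follow the YAML sample above.
users = [{"id": 1, "email_address": "a@example.com", "phone_number": None}]
safe_users = hash_columns(users, ["email_address", "phone_number"])
print(safe_users[0]["email_address"])
```

A plain unsalted hash keeps values joinable across tables but can be reversed by guessing common inputs, so a keyed hash (for example via Python's `hmac` module) may be a better fit where that matters.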

“Essentially, we developed a better version of dbt,” says Daniel. “We have tens of analysts at the company, and we never ran into a problem; they all are happy.”

Finally, data is queried through the Athena service or by using Tableau.

Limitations of the YAML DSL approach

There are some limitations to this approach. First, not everything can be abstracted away in YAML. When rare edge cases do emerge, Daniel and the team can assist. If the analyst has some familiarity with Python, they can code and add their own Dagster Assets to the pipeline.

The second limitation is that not everything can be tested locally, especially inter-pipeline dependencies. For this reason, SimpliSafe has a staged deployment process and testing prior to moving to production.

Maintaining the platform and order in the data

The engineering team includes six engineers (including Daniel), but Dagster and pipelines are just part of their overall workload. Supporting the Dagster platform involves having one engineer on call to support users when needed. Otherwise, the team works on connecting new data sources, troubleshooting, and building out the overall platform.

The SimpliSafe team stays current with the rapidly evolving Dagster project with quarterly updates and so far has not encountered any major rework due to new releases.

To make the entire data process reliable, Daniel has avoided a Data Lake approach. Instead, he has maintained high quality by controlling data flow into the warehouse. Data is well structured in Athena, allowing each team to have a discrete table prefix. Each new data source is carefully structured by the engineering team before it is ingested.

Each analyst team’s work is segregated with Athena table name prefixes and separate Dagster folders with their own pipelines.

“Teams can create a mess for themselves in their data, but not for the other teams,” says Daniel.

The team also builds jobs to purge old data from the databases and enforce naming conventions on the tables. Dagster supports this effort because any engineer can implement business logic. Database maintenance depends on a number of Python checks added to Dagster jobs, and the team uses a combination of jobs and sensors to keep these checks efficient.
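As an illustration of what a naming-convention check of this kind might look like, here is a minimal sketch in plain Python. The article does not describe SimpliSafe's actual rules, so the prefixes, the double-underscore separator, the snake_case pattern, and the `find_violations` helper are all assumptions; in practice a function like this could be called from a Dagster job.

```python
import re

# Hypothetical team prefixes; the article only says each team has its own table prefix.
ALLOWED_PREFIXES = ("datalake_agg", "datalake_pii_agg")
# Assumed convention: lowercase snake_case names.
TABLE_NAME = re.compile(r"^[a-z][a-z0-9_]*$")


def find_violations(table_names, allowed_prefixes=ALLOWED_PREFIXES):
    """Return (table_name, reason) pairs for names that break the assumed conventions."""
    violations = []
    for name in table_names:
        if not TABLE_NAME.match(name):
            violations.append((name, "not lowercase snake_case"))
        elif not any(name.startswith(prefix + "__") for prefix in allowed_prefixes):
            violations.append((name, "missing a known team prefix"))
    return violations


tables = ["datalake_agg__order_detail", "Scratch_Table", "tmp_orders"]
violations = find_violations(tables)
for name, reason in violations:
    print(f"{name}: {reason}")
```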

The benefits of a solid data platform

With a solid platform in place, the SimpliSafe engineering team has tremendous leverage, spending minimal time troubleshooting issues and supporting a pool of analysts several times the size of the engineering team.

With less time spent on reactive efforts, Daniel and the team can focus on enabling the organization with new data sources, optimizing the existing processes, and building new features to boost data quality and contain costs.
