February 8, 2024 • 5 minute read

Standardize Pipelines with Domain-Specific Languages

By implementing DSLs, data teams can open their data platform to many more users without compromising on standards.

Name: Elliot Gunn
Handle: @elliot

Name: Tim Castillo
Handle: @tim

Domain-specific languages (DSLs) facilitate better configuration, integration, and management of data workflows. Also known as “small languages”, they enable data professionals to articulate pipeline logic more intuitively without delving into low-level coding intricacies.

For data engineering practitioners, embracing DSLs is not just about technical efficiency. DSLs simplify, standardize, and democratize data processes, which lets teams scale data operations while still meeting broader organizational goals.

In this article:

Strategic Advantages of DSL in Data Pipelines
DSLs in Data Engineering: an overview
DSLs and Data Orchestration
Implementing Modern Data Engineering Practices with DSL
Considerations
Conclusion

Strategic Advantages of DSL in Data Pipelines

Before we jump into what DSLs are specifically, let's talk about the problem they help to solve.

Facilitating collaboration

Simplicity lends itself to cross-functional collaboration. DSLs written in formats such as YAML, which we will look at later in this article, are designed to be "human readable": easy for people to write and understand. A clean DSL can be understood by non-engineers, such as business analysts or project managers, which makes collaboration across data teams easier.

For example, the SimpliSafe team developed an abstraction layer that let analysts specify in YAML the properties of the Dagster pipeline they needed: source data, schedules, dependencies, transformations required, and final destination(s) for the new data. Their analysts, who understand SQL but not Python, can put pipelines into production with only the YAML file and any SQL files needed.

Faster iteration and deployment

DSLs also allow for faster iteration and deployment, speeding up the process of testing and rolling out new features or making incremental adjustments in the data pipelines. For instance, consider a custom DSL that includes a feature toggle for anomaly detection. A data platform team can activate this feature across multiple data engineering teams simply by switching a flag in the DSL, thus providing analytical capabilities quickly without the need for further code modifications. Standardizing auditing processes or implementing security patches can also be a simple matter of updating the DSL definitions.
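
As a minimal sketch of the feature-toggle idea, assume a hypothetical DSL in which each pipeline spec is a dictionary and the platform team owns a set of shared defaults. The names `PLATFORM_DEFAULTS`, `anomaly_detection`, and `build_steps` are illustrative assumptions, not part of any real library.

```python
# Hypothetical platform-wide defaults owned by the data platform team.
PLATFORM_DEFAULTS = {"anomaly_detection": False}


def build_steps(spec, defaults=PLATFORM_DEFAULTS):
    """Expand a pipeline spec into an ordered list of step names."""
    # Per-pipeline settings override the platform defaults.
    settings = {**defaults, **spec.get("features", {})}
    steps = ["extract", "transform"]
    if settings["anomaly_detection"]:
        # The toggle adds a step; pipeline authors change no code.
        steps.append("detect_anomalies")
    steps.append("load")
    return steps


sales_spec = {"name": "daily_sales_report"}
print(build_steps(sales_spec))
print(build_steps(sales_spec, defaults={"anomaly_detection": True}))
```

Flipping the default in one place changes the steps generated for every pipeline that does not override it, which is the effect the paragraph above describes.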

For data practitioners concerned about creating sustainable systems, DSLs are a way to get ahead of the scale and complexity of a growing data platform.

Supporting proper DevOps practices

Finally, DSLs are aligned with DevOps and DataOps methodologies, which emphasize automation, continuous integration, and efficient deployment practices:

Automation: a DSL can automate data validation and cleansing, ensuring that data pipelines consistently receive clean, well-formatted data.
Continuous integration (CI): if a DSL defines the data transformation logic, changes to that logic can be quickly integrated and tested in the CI pipeline, ensuring that modifications do not disrupt existing functionality.
Efficient deployment practices: a single configuration file for a data processing job can be deployed across multiple environments with minimal changes, ensuring consistency and reducing the risk of environment-specific errors.

Adopting DSLs can be a step towards realizing these principles more fully in data engineering practices.
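
To make the CI and deployment points concrete, here is a rough sketch under stated assumptions: the required keys, the environment names, and the `validate` and `render` helpers are all hypothetical. It checks a job spec the way a CI step might, then renders the same base spec for two environments.

```python
import copy

# Assumed schema: keys that every job spec must define.
REQUIRED_KEYS = {"name", "source", "destination"}


def validate(spec):
    """Return a list of problems; an empty list means the spec is valid."""
    missing = sorted(REQUIRED_KEYS - set(spec))
    return [f"missing key: {key}" for key in missing]


def render(spec, overrides):
    """Apply environment-specific overrides to a copy of the base spec."""
    rendered = copy.deepcopy(spec)
    rendered.update(overrides)
    return rendered


base = {"name": "daily_sales_report", "source": "sales", "destination": "warehouse"}
environments = {
    "staging": {"destination": "warehouse_staging"},
    "production": {},
}

for env, overrides in environments.items():
    job = render(base, overrides)
    # A CI job could fail the build whenever this list is non-empty.
    print(env, job["destination"], validate(job))
```

Keeping the overrides small and explicit is one way to limit environment-specific drift; a real platform would likely validate against a fuller schema.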

DSLs in Data Engineering: an overview

Now that we have discussed their benefits, let's delve into what DSLs are and how to put them into practice. Here's a definition:

Domain-specific languages are programming languages for a specific area of application.

Unlike general-purpose languages like Python, Java, or C++, which are designed to be versatile and applicable across a wide range of programming scenarios, DSLs are tailored for specific tasks, offering a more focused approach to solving particular challenges.

You may be familiar with some common public DSLs, such as SQL for databases, Terraform for cloud resources, or CSS for styling web pages. They offer ways to concisely express how a system should work without going into too much detail.

Many organizations also write their own custom DSLs to address unique requirements. For example, a front-end engineering team may want to empower marketing teams to build a website with the brand’s standard components without having to write HTML, CSS, or JavaScript.

A familiar example of DSLs in the data world is SQL, a declarative language designed for managing and querying databases. SQL’s approach of specifying “what” needs to be done, leaving the “how” to the database’s query engine, highlights how DSLs abstract away complexity in data engineering. This allows anyone to easily get data from a database without having to know about how the data is physically stored and partitioned.
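
The contrast between "what" and "how" can be sketched with Python's built-in `sqlite3` module. The `sales` table and its rows are made up for illustration. The SQL query states the desired result, while the loop spells out each step of the same aggregation.

```python
import sqlite3

# Illustrative data: (region, amount) pairs.
rows = [("north", 120), ("south", 80), ("north", 50)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Declarative: state the result we want; the engine decides how to compute it.
declarative = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

# Imperative: spell out every step of the aggregation ourselves.
imperative = {}
for region, amount in rows:
    imperative[region] = imperative.get(region, 0) + amount

print(declarative == imperative)
conn.close()
```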

Another public DSL, Terraform, is designed for infrastructure as code (IaC). Terraform uses a declarative approach to define and manage infrastructure, allowing users to specify the desired end state of their infrastructure setup. Terraform then handles the details of provisioning and managing the underlying resources. Because resources can reference each other's IDs, the relationships between them are explicit in the code, where they would be harder to decipher from a UI.

In machine learning operations (MLOps), DSLs can offer streamlined ways to define and manage machine learning workflows. While public DSLs offer predefined capabilities, custom DSLs can be developed to cater to specific organizational needs, including the integration of both declarative and imperative elements. For instance, a custom MLOps DSL might standardize the deployment and monitoring of ML models, while also allowing for the integration of custom code or scripts to handle specific requirements.

This abstraction provided by DSLs allows data engineers to focus on the high-level design of data workflows, much like software engineers can concentrate on the overall architecture of their applications when using factory patterns.

DSLs and Data Orchestration

DSLs can also be designed to simplify and streamline the data orchestration process. They can enable data engineers to define complex workflows and dependencies in a more intuitive and less error-prone manner than using general-purpose programming languages. This leads to increased productivity as engineers can focus on the orchestration logic rather than the intricacies of coding.

Data orchestration often involves collaboration between different teams (like data science, engineering, and operations). A DSL, being more focused and easier to understand, can bridge the gap between these teams, allowing for better collaboration and understanding of the data workflows.

Many DSLs are designed to integrate seamlessly with the modern data ecosystem, including cloud platforms, big data processing frameworks, and machine learning pipelines. This integration can be critical in creating a cohesive and efficient data platform.

YAML: a tool of choice for DSLs

Because of its human-readable nature, YAML is a popular format for DSLs that define data workflows in data engineering:

workflow:
  name: daily_sales_report
  steps:
    - extract:
        source: sql_database
        query: SELECT * FROM sales
    - transform:
        script: aggregate_sales.py
    - load:
        destination: bi_tool

Note that DSLs are not restricted to YAML; you can also write them in JSON or your own custom format.
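
As a sketch of the JSON option, the workflow above can be written as JSON and read with Python's standard `json` module. The `HANDLERS` table and the `plan` function are hypothetical: they only describe what each step would do, where a real interpreter would run queries and scripts.

```python
import json

# The workflow above, written as JSON instead of YAML.
WORKFLOW_JSON = """
{
  "workflow": {
    "name": "daily_sales_report",
    "steps": [
      {"extract": {"source": "sql_database", "query": "SELECT * FROM sales"}},
      {"transform": {"script": "aggregate_sales.py"}},
      {"load": {"destination": "bi_tool"}}
    ]
  }
}
"""

# Hypothetical handlers; real ones would query, run scripts, and write data.
HANDLERS = {
    "extract": lambda cfg: f"extract from {cfg['source']}",
    "transform": lambda cfg: f"run {cfg['script']}",
    "load": lambda cfg: f"load into {cfg['destination']}",
}


def plan(document):
    """Turn a workflow document into an ordered list of planned actions."""
    actions = []
    for step in json.loads(document)["workflow"]["steps"]:
        # Each step is a single-key mapping: {step_type: config}.
        (step_type, cfg), = step.items()
        if step_type not in HANDLERS:
            raise ValueError(f"unknown step type: {step_type}")
        actions.append(HANDLERS[step_type](cfg))
    return actions


for action in plan(WORKFLOW_JSON):
    print(action)
```

Rejecting unknown step types early is a deliberate choice here: it keeps the set of things the DSL can express small and easy to review.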

Implementing Modern Data Engineering Practices with DSL

Because of the benefits listed above, there is a growing trend in data engineering to develop custom DSLs tailored to unique organizational requirements. In fact, data engineers and data platform engineers are likely already using DSLs in some capacity to manage your organization’s data infrastructure.

Implementation within a data platform, however, varies based on:

the specific needs and scale of the organization,
the technological stack in place,
the level of technical expertise within the team, and
the overarching data strategy and governance policies.

DSLs in Data Engineering

DSLs are often used to standardize data pipelines. Dagster’s repository provides an example of using YAML to create asset-based data pipelines. In that example, the assets.yaml file uses the DSL to define a group of assets within a data pipeline:

group_name: assets_dsl_example
assets:
  - asset_key: "foo/bar"
    # replace "sql" with whatever information is needed for your assets
    sql: "SELECT 1, 2, 3"
  - asset_key: "foo/baz"
    description: "This is a description of the baz asset"
    deps:
      - "foo/bar"
    sql: "INSERT into baz SELECT * from bar"

Here, each asset has a unique identifier, metadata about it, and a definition for how it's computed. This structure is part of a larger configuration for managing complex data workflows, enabling clear and structured definitions of tasks and their relationships.

Defining assets and their relationships in a DSL makes complex data workflows easier to manage. Data engineers can therefore empower analysts and other non-engineering team members to productionize their Dagster asset graphs and pipelines without needing in-depth technical expertise. This democratization of data pipeline creation accelerates development and fosters more collaboration on the platform.
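
To illustrate how the `deps` entries give the pipeline its structure, here is a sketch that uses only the Python standard library. It is not how Dagster itself loads the file. It assumes the YAML has already been parsed into a dictionary, and `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# The assets.yaml content above, as the dictionary a YAML parser would produce.
spec = {
    "group_name": "assets_dsl_example",
    "assets": [
        {"asset_key": "foo/bar", "sql": "SELECT 1, 2, 3"},
        {
            "asset_key": "foo/baz",
            "description": "This is a description of the baz asset",
            "deps": ["foo/bar"],
            "sql": "INSERT into baz SELECT * from bar",
        },
    ],
}


def execution_order(spec):
    """Return asset keys ordered so each asset follows its dependencies."""
    graph = {a["asset_key"]: set(a.get("deps", [])) for a in spec["assets"]}
    # Catch typos in deps before anything runs.
    unknown = set().union(*graph.values()) - set(graph)
    if unknown:
        raise ValueError(f"deps reference undefined assets: {sorted(unknown)}")
    return list(TopologicalSorter(graph).static_order())


print(execution_order(spec))
```

An orchestrator derives the same kind of ordering from the declared dependencies, so the person writing the YAML never has to state the run order explicitly.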

Considerations

Escape Hatches for Flexibility

One thing to consider in using DSLs for data engineering is ensuring flexibility for various types of data workflows. What happens when your data engineer wants to do something outside of the predefined scope of the DSL?

Consider a scenario where your DSL is designed for a specific sequence of tasks: ingesting data from S3, processing it into features, and training a machine learning model. Your data scientist now wants to send a subset of the processed data via email instead of training an ML model.

A best practice is to create an “escape hatch” in your DSL: a way for users to insert custom code or tasks that the DSL does not natively support. This ensures that while the DSL handles standard, common tasks efficiently, it does not become a limiting factor for unique or unconventional data operations.
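
One possible shape for such an escape hatch, sketched with hypothetical names throughout (`BUILTIN_STEPS`, the `custom` step type, and `email_subset` are assumptions for illustration): built-in step types cover the standard path, and a `custom` step defers to a callable that the user registers.

```python
def ingest(data, cfg):
    # Stand-in for reading from S3.
    return data + [f"ingested:{cfg['path']}"]


def featurize(data, cfg):
    # Stand-in for feature processing.
    return data + ["features"]


# Step types the DSL supports natively.
BUILTIN_STEPS = {"ingest": ingest, "featurize": featurize}


def email_subset(data, cfg):
    """User-supplied code that the DSL has no native support for."""
    return data + [f"emailed:{cfg['to']}"]


def run(steps, custom=None):
    """Run steps in order, passing each step's output to the next."""
    custom = custom or {}
    data = []
    for step in steps:
        if step["type"] == "custom":
            # Escape hatch: look up a user-registered callable by name.
            handler = custom[step["callable"]]
        else:
            handler = BUILTIN_STEPS[step["type"]]
        data = handler(data, step.get("config", {}))
    return data


pipeline = [
    {"type": "ingest", "config": {"path": "s3://bucket/raw"}},
    {"type": "featurize"},
    {"type": "custom", "callable": "email_subset", "config": {"to": "team@example.com"}},
]
result = run(pipeline, custom={"email_subset": email_subset})
print(result)
```

Requiring custom callables to be registered by name keeps the escape hatch visible in the spec, so reviewers can see exactly where a pipeline leaves the standard path.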

Conclusion

DSLs streamline the configuration of data pipelines, transforming what was once a cumbersome, error-prone process into a more manageable and intuitive task. They offer a unified language to integrate complex data architectures with diverse data sources and storage systems, ensuring that data flows smoothly across the entire pipeline.

For further reading, consult the example in the Dagster repo.

We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!
