Using AWS Glue with Dagster

The AWS Glue integration lets you launch AWS Glue jobs from Dagster, pass parameters to the job code, and stream logs and structured messages back into Dagster.

About this Integration

The dagster-aws integration library provides the PipesGlueClient resource, which launches AWS Glue jobs from Dagster assets and ops. You can pass parameters to the Glue code, and Dagster receives events from the job as it runs, such as logs, asset checks, and asset materializations. Only minimal code changes are required on the job side.

Installation

pip install dagster-aws

Examples

import boto3
from dagster import AssetExecutionContext, Definitions, asset
from dagster_aws.pipes import (
    PipesGlueClient,
    PipesS3ContextInjector,
    PipesS3MessageReader,
)


@asset
def glue_pipes_asset(
    context: AssetExecutionContext, pipes_glue_client: PipesGlueClient
):
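    # Run the Glue job and build a MaterializeResult from the events it reported.
    return pipes_glue_client.run(
        context=context,
        job_name="Example Job",  # placeholder: name of an existing Glue job
        arguments={"some_parameter_value": "1"},  # parameters for the Glue job
    ).get_materialize_result()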


defs = Definitions(
    assets=[glue_pipes_asset],
    resources={
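        # The resource key must match the parameter name used by the asset.
        "pipes_glue_client": PipesGlueClient(
            client=boto3.client("glue"),
            # Passes the Pipes context to the Glue job through S3.
            context_injector=PipesS3ContextInjector(
                client=boto3.client("s3"),
                bucket="my-bucket",  # placeholder bucket name
            ),
            # Reads the messages that the Glue job writes to S3.
            message_reader=PipesS3MessageReader(
                client=boto3.client("s3"), bucket="my-bucket"
            ),
        )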
    },
)
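
On the Glue side, the job script needs a small amount of Pipes setup to receive context from Dagster and report events back. The following is a minimal sketch under stated assumptions, not a definitive implementation. It assumes the dagster-pipes package is available in the Glue job's Python environment, and it uses that package's S3 context loader and message writer to pair with the PipesS3ContextInjector and PipesS3MessageReader configured above. The build_metadata helper, the row_count metadata key, and the value 42 are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict


def build_metadata(row_count: int) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: wrap a row count in the Pipes metadata format."""
    return {"row_count": {"raw_value": row_count, "type": "int"}}


def main() -> None:
    # Imported inside main() so build_metadata stays importable without the
    # Glue/Pipes dependencies installed.
    import boto3
    from dagster_pipes import (
        PipesCliArgsParamsLoader,
        PipesS3ContextLoader,
        PipesS3MessageWriter,
        open_dagster_pipes,
    )

    s3_client = boto3.client("s3")

    with open_dagster_pipes(
        # Reads the context that Dagster's S3 context injector wrote.
        context_loader=PipesS3ContextLoader(s3_client),
        # Writes messages to S3 for Dagster's S3 message reader to pick up.
        message_writer=PipesS3MessageWriter(s3_client),
        # Reads the Pipes bootstrap parameters from the job's CLI arguments.
        params_loader=PipesCliArgsParamsLoader(),
    ) as pipes:
        pipes.log.info("Hello from the AWS Glue job")
        # Assumption: 42 stands in for a value computed by the real job.
        pipes.report_asset_materialization(metadata=build_metadata(42))
```

In the actual Glue script, call main() at the end of the file so the job runs it on start.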

About AWS Glue

AWS Glue is a fully managed, serverless cloud service for discovering, preparing, and integrating data for analytics, machine learning, and application development. It supports a wide range of data sources and formats and integrates with other AWS services. AWS Glue provides tools to create, run, and manage ETL (extract, transform, load) jobs, and its serverless architecture scales with the workload, which suits data engineers and analysts who need to process and prepare data without managing infrastructure.