Blog
Deciphering Arcane Kubernetes and ECS Errors with Dagster

Deciphering Arcane Kubernetes and ECS Errors with Dagster

May 17, 2023
Deciphering Arcane Kubernetes and ECS Errors with Dagster
Deciphering Arcane Kubernetes and ECS Errors with Dagster

Recent enhancements allow Dagster to surface clearer and more actionable errors to accelerate your development cycles.

In this article, we will delve into recent improvements in surfacing errors we have built to enhance the Dagster developer experience in the 1.3 release, in both Dagster Open Source and Dagster Cloud.

Any developer can attest that a large part of our job involves dealing with errors, and that errors come in many different degrees of usefulness. When something breaks, we might strike gold and find ourselves with a detailed stack trace that pinpoints exactly where the problem is and how to fix it. But just as frequently, we're faced with inscrutable error codes, sporadic crashes that are impossible to reproduce, or the dreaded “Something went wrong”.

Grappling with errors is hard enough when the surface area of our work is isolated to just a single machine. As data engineers, however, we inevitably deal with many disparate systems and environments, each with its own idiosyncratic flavor of edge cases and errors. For example, when faced with an error like this when spinning up a new job in an Amazon ECS cluster:

DNS resolution failed for myjob-de906371f79d85188234c23ec44a8f1534cf4f34-38c2b8.serverless-agents-namespace-1:4000: C-ares status is not ARES_SUCCESS qtype=A name=myjob-de906371f79d85188234c23ec44a8f1534cf4f34-38c2b8.serverless-agents-namespace-1 is_balancer=0: Domain name not found

A seasoned platform engineer might know what ARES_SUCCESS means and what makes it desirable, but depending on team members accumulating those battle scars creates significant barriers for newcomers to a team. To be fair, I personally had no idea what it meant, and neither did most of the otherwise-very-talented engineers on the Dagster Labs team.

How Dagster surfaces errors

Dagster is an orchestrator for the whole data engineering development lifecycle - and an essential part of that lifecycle is debugging and dealing with errors. Previously, we discussed some of the debugging features of Dagster. We think of the interface for displaying errors in Dagster user code as one of our core user-facing surfaces, and as a result, we continue to invest a lot of effort to ensure that errors are displayed in Dagster with as much context as possible.

Sending Structured Errors across Process Boundaries

One of the core design principles of Dagster's architecture is that user code never runs in the same process as Dagster system code, with all communication between the two happening over a structured API. This unique design lets users run code in multiple isolated Python environments simultaneously, preventing issues in one user-defined job from destabilizing the system.

Errors are considered part of the structured API between user code and system code - when an error is raised within Dagster code, the full recursive stack trace is captured in a custom serialized format.

These stack traces can be safely displayed in system processes like the UI while still maintaining separation between system code and user code, and let Dagster users take advantage of the scalability of the cloud without having to sacrifice observability when something inevitably goes wrong within user code.

Interested in trying Dagster+ for Free?

        Enterprise orchestration that puts developer experience first. Serverless or hybrid deployments, native branching, and out-of-the-box CI/CD.        

             Try Dagster+ Free for 30 days            

Surfacing platform errors, not just code errors

Many errors are simply problems in code, where the root cause of the problem can be captured in a stack trace, and the remediation can be easily derived. The serialized errors described above generally handle these cases well. However, there are a multitude of potential problems in production systems that cannot be cleanly expressed as just a Python exception.

Take a Kubernetes pod running a Dagster job, for example. It might fail to start up due to any of the following reasons:

  • The pod is configured to use a secret that doesn't exist
  • The pod has requested more resources than the cluster has available.
  • The Docker image was built using the wrong system architecture.

If our error reporting relied solely on Python exceptions raised from within user code, these would have nothing to show - the pod that runs the code never spun up in the first place, and even if it did, it wouldn't have the context to understand what went wrong or how to fix it. This resulted in frustrating experiences for our users - instead of clear and actionable errors they saw cryptic timeouts like this:

To fix this, we built a separate monitoring process that's responsible for periodically checking in on all of a Dagster deployment's in-progress runs. If it detects that the run has moved into a problematic state, it gathers context about the container in which the run is happening and surfaces a detailed error message including logs, context about the environment in which the job was running, and suggested next steps to investigate further.

For example, the previously shown task that timed out due to a missing secret now displays an error like this—containing the key insight that the pod failed due to a missing secret—and instructions about what to do to investigate further:

A task that failed due to running out of memory previously would have failed with a similarly inscrutable timeout, but now looks like this:

And particularly thorny issues like building your image using the wrong system architecture can now include specific guidance for how to resolve the issue:

We plan to continue enhancing these capabilities over time to cover more and more failure cases.

Conclusion

Data orchestration tools are playing a more vital role today than simply scheduling tasks and reporting back on the basic “success/failure” state.  To deliver on the premise of the full development lifecycle, the orchestration layer has to provide more in terms of error reporting, helping data engineers rapidly and surgically pinpoint issues.  Short of this, the tool is wasting your time.

While this is not an easy task in a complex distributed tech stack, we continue to make inroads in delivering a better developer experience.

Have feedback or questions? Start a discussion in Slack or Github.

Interested in working with us? View our open roles.

Want more content like this? Follow us on LinkedIn.

Dagster Newsletter

Get updates delivered to your inbox

Latest writings

The latest news, technologies, and resources from our team.

Multi-Tenancy for Modern Data Platforms
Webinar

April 13, 2026

Multi-Tenancy for Modern Data Platforms

Learn the patterns, trade-offs, and production-tested strategies for building multi-tenant data platforms with Dagster.

Deep Dive: Building a Cross-Workspace Control Plane for Databricks
Webinar

March 24, 2026

Deep Dive: Building a Cross-Workspace Control Plane for Databricks

Learn how to build a cross-workspace control plane for Databricks using Dagster — connecting multiple workspaces, dbt, and Fivetran into a single observable asset graph with zero code changes to get started.

Dagster Running Dagster: How We Use Compass for AI Analytics
Webinar

February 17, 2026

Dagster Running Dagster: How We Use Compass for AI Analytics

In this Deep Dive, we're joined by Dagster Analytics Lead Anil Maharjan, who demonstrates how our internal team utilizes Compass to drive AI-driven analysis throughout the company.

How to Orchestrate Across Multiple Databricks Workspaces Without Losing Your Mind
How to Orchestrate Across Multiple Databricks Workspaces Without Losing Your Mind
Blog

April 20, 2026

How to Orchestrate Across Multiple Databricks Workspaces Without Losing Your Mind

Once your pipelines span multiple Databricks workspaces, you're no longer orchestrating a single system you're coordinating a distributed one.

Dagster 1.13: Octopus's Garden
Dagster 1.13: Octopus's Garden
Blog

April 9, 2026

Dagster 1.13: Octopus's Garden

Dagster skills, partitioned asset checks, state backed components, virtual assets, and stronger integrations.

Monorepos, the hub-and-spoke model, and Copybara
Monorepos, the hub-and-spoke model, and Copybara
Blog

April 3, 2026

Monorepos, the hub-and-spoke model, and Copybara

How we configure Copybara for bi-directional syncing to enable a hub-and-spoke model for Git repositories

How Magenta Telekom Built the Unsinkable Data Platform
Case study

February 25, 2026

How Magenta Telekom Built the Unsinkable Data Platform

Magenta Telekom rebuilt its data infrastructure from the ground up with Dagster, cutting developer onboarding from months to a single day and eliminating the shadow IT and manual workflows that had long slowed the business down.

Scaling FinTech: How smava achieved zero downtime with Dagster
Case study

November 25, 2025

Scaling FinTech: How smava achieved zero downtime with Dagster

smava achieved zero downtime and automated the generation of over 1,000 dbt models by migrating to Dagster's, eliminating maintenance overhead and reducing developer onboarding from weeks to 15 minutes.

Zero Incidents, Maximum Velocity: How HIVED achieved 99.9% pipeline reliability with Dagster
Case study

November 18, 2025

Zero Incidents, Maximum Velocity: How HIVED achieved 99.9% pipeline reliability with Dagster

UK logistics company HIVED achieved 99.9% pipeline reliability with zero data incidents over three years by replacing cron-based workflows with Dagster's unified orchestration platform.

Modernize Your Data Platform for the Age of AI
Guide

January 15, 2026

Modernize Your Data Platform for the Age of AI

While 75% of enterprises experiment with AI, traditional data platforms are becoming the biggest bottleneck. Learn how to build a unified control plane that enables AI-driven development, reduces pipeline failures, and cuts complexity.

Download the eBook on How to Scale Data Teams
Guide

November 5, 2025

Download the eBook on How to Scale Data Teams

From a solo data practitioner to an enterprise-wide platform, learn how to build systems that scale with clarity, reliability, and confidence.

Download the eBook Primer on How to Build Data Platforms
Guide

February 21, 2025

Download the eBook Primer on How to Build Data Platforms

Learn the fundamental concepts to build a data platform in your organization; covering common design patterns for data ingestion and transformation, data modeling strategies, and data quality tips.

AI Driven Data Engineering
Course

March 19, 2026

AI Driven Data Engineering

Learn how to build Dagster applications faster using AI-driven workflows. You'll use Dagster's AI tools and skills to scaffold pipelines, write quality code, and ship data products with confidence while still learning the fundamentals.

Dagster & ETL
Course

July 11, 2025

Dagster & ETL

Learn how to ingest data to power your assets. You’ll build custom pipelines and see how to use Embedded ETL and Dagster Components to build out your data platform.

Testing with Dagster
Course

April 21, 2025

Testing with Dagster

In this course, learn best practices for testing, including unit tests, mocks, integration tests and applying them to Dagster.