September 27, 2024 • 3 minute read •
Dagster Deep Dive Recap: Orchestrating Flexible Compute for ML with Dagster and Modal
- Name
- TéJaun RiChard
- Handle
- @tejaun
Machine learning requires scalable and flexible infrastructure to handle heavy computing tasks like model training and data processing. In many teams, the challenge comes when trying to manage this infrastructure without getting slammed with complex configurations, like writing Kubernetes YAML or managing GPU operators.
In our most recent Dagster Deep Dive, led by Colton Padden (Developer Advocate at Dagster) and Charles Frye (AI Engineer at Modal), we jumped into how using Dagster and Modal together helps automate and streamline these processes so that they become more developer-friendly and scalable.
In case you missed it (or just want to watch it over again), I’ve embedded the video below.
Highlights
The deep dive covered different ways of using Dagster and Modal for machine learning workflows. Here’s a rundown of the main points Colton and Charles discussed:
Orchestration with Dagster
Colton began the demo by talking about Dagster’s ability to orchestrate ML pipelines. He highlighted key Dagster features like asset-based workflows for managing data dependencies, partitioned assets for handling time-series data, and sensors for triggering runs based on external events, such as a new episode of a podcast.
Colton finished up this segment by talking about how Dagster could integrate with different compute environments through Dagster Pipes.
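To make the sensor idea concrete, here is a minimal plain-Python sketch of cursor-based polling, the pattern a sensor can use to notice a new podcast episode. It is not Dagster's API and not the code from the demo: the `Episode` record, the `poll_feed` function, and the use of a publish timestamp as the cursor are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical record for one RSS feed entry."""
    guid: str
    published: int  # Unix timestamp, used here as the cursor value


def poll_feed(entries, cursor):
    """Return (new_episodes, new_cursor) for entries published after cursor.

    `cursor` is the timestamp of the newest episode already processed,
    or None on the very first evaluation.
    """
    last_seen = cursor if cursor is not None else -1
    new = sorted(
        (e for e in entries if e.published > last_seen),
        key=lambda e: e.published,
    )
    # Only advance the cursor when something new was found.
    new_cursor = new[-1].published if new else cursor
    return new, new_cursor


feed = [Episode("ep-1", 100), Episode("ep-2", 200)]
first_batch, cursor = poll_feed(feed, None)     # both episodes are new
feed.append(Episode("ep-3", 300))
second_batch, cursor = poll_feed(feed, cursor)  # only ep-3 is new
```

Because the cursor is stored between evaluations, each poll only returns episodes that have not been seen yet, so a run would be requested once per new episode rather than once per poll.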
Scalable Infrastructure with Modal
Charles followed up by showing Modal’s scaling capabilities and explaining how it can easily parallelize workloads. He demonstrated this by splitting podcast audio and transcribing multiple segments simultaneously. Modal also has serverless execution, which Charles demonstrated by showing how it automatically allocates necessary resources, including GPU support for accelerated machine learning tasks.
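The split-and-transcribe pattern Charles showed can be sketched in plain Python. This is an illustration only: a local thread pool stands in for Modal's remote fan-out, `transcribe_segment` is a placeholder rather than a real Whisper call, and the function names and 60-second segment length are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(duration_s, segment_s):
    """Return (start, end) second offsets covering the whole audio file."""
    return [
        (start, min(start + segment_s, duration_s))
        for start in range(0, duration_s, segment_s)
    ]


def transcribe_segment(segment):
    """Placeholder for a real speech-to-text call on one segment."""
    start, end = segment
    return f"[text for {start}-{end}s]"


def transcribe_parallel(duration_s, segment_s, max_workers=4):
    """Transcribe all segments concurrently and join them in order."""
    segments = split_into_segments(duration_s, segment_s)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so the pieces join back correctly.
        pieces = list(pool.map(transcribe_segment, segments))
    return " ".join(pieces)


transcript = transcribe_parallel(duration_s=150, segment_s=60)
```

The key property is that each segment is independent, so the work can be spread across as many workers as are available and the results stitched back together in their original order.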
Demo: A Podcast Summary Application
Colton and Charles then combined Dagster and Modal to automate podcast summarization.
The system was designed to fetch new podcast episodes using RSS feeds, download and store audio files in cloud storage, transcribe audio segments using the Whisper model, summarize the transcript with OpenAI’s language models, and email concise podcast summaries to users.
The demo showed how Dagster’s orchestration and Modal’s scalable compute complement each other.
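The podcast pipeline can be viewed as a dependency graph, which is the idea behind asset-based workflows. The sketch below uses Python's standard `graphlib` module rather than Dagster itself. The step names are hypothetical labels for the stages described above, not the asset names used in the demo.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
dependencies = {
    "fetch_feed": set(),
    "download_audio": {"fetch_feed"},
    "transcribe": {"download_audio"},
    "summarize": {"transcribe"},
    "send_email": {"summarize"},
}

# A valid execution order: every step runs after its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())


def downstream_of(step, deps):
    """Return every step that depends on `step`, directly or transitively."""
    affected = set()
    frontier = [step]
    while frontier:
        current = frontier.pop()
        for name, parents in deps.items():
            if current in parents and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


stale_after_retranscribe = downstream_of("transcribe", dependencies)
```

Declaring dependencies explicitly is what makes selective re-runs possible: if the transcription step is re-executed, only the summary and email steps downstream of it need to be refreshed.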
Takeaways
Here’s the TL;DR of the insights from the deep dive.
- Dagster’s orchestration and Modal’s scalable infrastructure give you a strong solution for machine learning pipelines, particularly for tasks needing parallelism like audio transcription and large-scale data processing.
- Modal offers flexibility without complexity, giving you a way to scale GPU workloads without needing to deal with complex Kubernetes configurations so you can focus on development.
- Being able to rerun specific assets or pipelines through Dagster’s UI while relying on Modal’s support for parallel workloads makes it easier to run machine learning systems in production.
- Dagster’s developer-friendly features (partitions, sensors, and cursors) simplify pipeline orchestration for engineers, especially when dealing with data sources like RSS feeds.
Conclusion
With Dagster and Modal, teams can streamline their machine learning pipelines while reducing infrastructure complexity. By pairing simpler pipeline orchestration with auto-scaling compute, the two tools let developers focus on building and refining applications instead of managing infrastructure.
If you’re interested in using this approach in your own projects:
- Watch the video above to see the full conversation.
- Join the Dagster and Modal Slack communities to learn more and connect with other developers tackling similar challenges.
- Visit our platform page for more information about Dagster, or start a free trial and begin creating projects.
Stay tuned for the next deep dive!
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!