October 27, 2022 · 9 minute read

Orchestrating Data Science at Zephyr AI

Fraser Marlow


On April 14, 2003, the Human Genome Project (HGP) announced to great fanfare that it had mapped the human genome. This represented a giant leap in our understanding of functional biology and ushered in the era of personalized medicine.

A less well-known fact is that the HGP had only mapped 85% of the genome. The complete genome was only finished in January 2022.

Genetic data, it turns out, is difficult. And at around 30GB to 70GB per single sequence file (FASTQ and BAM formats), it’s also a lot of data.

Just ask the data team at Zephyr AI, a Washington, DC-based company working on AI-driven precision medicine and drug discovery.

Founded in 2021, the company specializes in predictive analytics, sifting through patient data to find patterns or “interrelationships.” The team achieves this by ingesting large amounts of complex, disparate molecular, experimental, and clinical data and applying novel machine learning algorithms to draw new insights in the fields of patient care, drug development, and healthcare administration.

“To build predictive models of therapeutic response (like in cancer) or of chronic disease progression (like in diabetes), you need three fundamental components: domain knowledge from which you can select the features and outcomes for your desired prediction task, data scientists who can build, test, and scale machine learning models, and real-world data to evaluate the performance of your models. From the start, Zephyr has been focused on excelling across these dimensions,” said Zephyr AI Chief Technology Officer Jeff Sherman. “The Zephyr ecosystem allows experts across areas of domain knowledge to form dynamic teams and creatively tackle new prediction tasks.”

By comparing a small amount of genetic information from a patient’s tumor to a large collection of tumor genotypes, Zephyr AI can provide insights into the best drug or treatment plan. This concept of "learn from many, apply to one" forms the foundation of precision medicine and is central to Zephyr's approach and mission.

We caught up with Fahad Khan and Jordan Wolinsky, both senior software engineers working on Zephyr AI’s data platform, to learn about the process of collaborating across teams and ingesting and managing such large volumes of data in complex data science pipelines.

Fahad Khan
Senior Software Engineer at Zephyr AI

Jordan Wolinsky
Senior Software Engineer at Zephyr AI

The challenge: Different teams, different datasets, different tools, and lots of data

Both genomic data and health records are large datasets, and both get analyzed by domain specialists using bespoke tools.

“For our team, it’s mostly a matter of getting the different experts across the different domains to build their tools or their data transformation. And then kind of hooking it all up,” says Fahad. “As we work on a project, it’s a case of making sure we have everything set up with the observability, the assets we need, and then being able to push that out with one-click deploys, so the data is constantly refreshed and ready to go.”

The data team was tasked with pulling together very large datasets from disparate sources into two production pipelines:

  • The bioinformatics oncology pipeline, which takes a de-identified patient's raw DNA sequence (140 to 160 gigabytes per patient) and processes it to identify the relevant mutations that drive their cancer.
  • The predictive analytics pipeline, which analyzes electronic health record (EHR) and claims data. This pipeline runs over a terabyte of Parquet files, roughly 9 billion rows of raw data.

The data scientists receive the data assets produced, run their own computations and predictive analytics workloads, and return the trained models for inclusion in the production pipelines.

Combining these large datasets from disparate domains and tools would quickly overwhelm naive task schedulers. The different toolsets make it hard for teams to collaborate. And without a single control plane for the pipelines, it is difficult to understand data lineage and manage data processing.

From early days to building on Dagster

In the very early days of the company, the data team was manually running data science algorithms on open-source biology data. They were applying preprocessing (e.g., thresholding) and post-processing but had no orchestration in place. The team was small (three developers and two computational biologists), and the platform was still fairly simple, with just one repository and a single ambitious hypothesis that they were looking to prove before translating it into more complicated models.

When they reached a point where they had a large library of functions in place, they needed to bring some structure to the process.

The data platform team needed to engineer a system that would allow for one-click deploys so that changes could be rapidly implemented and reliably deployed at scale.

“I feel like there are a lot of problems that we didn't encounter because we adopted Dagster so early.” - Jordan Wolinsky

The data team had experience with both Dagster and Airflow, but at this early stage, Dagster provided a number of key benefits:

  • A shared Python-based framework for developing data pipelines with powerful abstractions
  • A single pane of glass for observability
  • A centralized control plane with shared logging and governance

While the observability and parameterization functions boosted developer productivity in the immediate term, Dagster’s value was also in providing structure for the ongoing pipeline development process, which proved invaluable as multiple technologies came into the picture.

A major requirement for the kinds of high-stakes clinical decisions Zephyr is building predictive models for is transparency, another key feature of Zephyr’s core enabling technology that was made easier to implement using Dagster.


As the size and complexity of the projects grew, the team realized another benefit: The teams were able to maintain one master orchestration view in Dagit yet manage many distinct deployments with isolated codebases. With this framework, each team maintains its own deployment in its own repository, avoiding dependency conflicts or merge conflicts with other teams. With Dagster, dozens of teams can work in parallel on separate workflows without stepping on each other's toes.

The team deploys on K8s using AWS EKS clusters, maintaining staging and production environments. The teams are organized by business focus area and maintain their deployments through CI/CD with Helm.

The Zephyr AI tech stack circa 2022

Fahad came onto the team shortly after Dagster was implemented in May 2021. He was familiar with Airflow but rapidly adapted to the Dagster way of working. Fahad had to build a new data pipeline for new data types but found he could reuse the previous abstractions without worrying about the underlying pipeline functions.

This greatly accelerated the speed of development and Zephyr AI’s ability to create new pipelines very quickly. With Dagster, “you're just writing Python code instead of being a yaml developer, which is what I used to feel sometimes when I'm just writing data pipelines.” says Fahad.

The value of software-defined assets

One of Dagster’s key differentiators is the use of Software-Defined Assets, a core abstraction that provides a declarative approach to managing and orchestrating data.

“We’ve become big fans of software-defined assets. That's been a big win for us. Just being able to hit ‘materialize all’ and then get all of our data is crucial.” says Jordan.
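The declarative model Jordan describes can be sketched in plain Python. This is a toy illustration, not Zephyr's actual pipeline (the asset names and compute functions here are hypothetical): a "materialize all" is just a walk of the asset dependency graph in topological order.

```python
# Toy sketch of declarative, asset-based orchestration. Asset names and
# compute functions are hypothetical stand-ins for a real bioinformatics flow.
from graphlib import TopologicalSorter

# Each asset declares the upstream assets it depends on.
ASSETS = {
    "raw_sequences": [],
    "variant_calls": ["raw_sequences"],
    "mutation_report": ["variant_calls"],
}

# The computation for each asset receives its upstream results by name.
COMPUTE = {
    "raw_sequences": lambda deps: "fastq-data",
    "variant_calls": lambda deps: f"variants({deps['raw_sequences']})",
    "mutation_report": lambda deps: f"report({deps['variant_calls']})",
}

def materialize_all() -> dict:
    """Walk the dependency graph in topological order ('materialize all')."""
    results = {}
    for name in TopologicalSorter(ASSETS).static_order():
        upstream = {dep: results[dep] for dep in ASSETS[name]}
        results[name] = COMPUTE[name](upstream)
    return results
```

The orchestrator's real value is that this ordering, plus staleness tracking, comes for free once assets declare their dependencies.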

In a joint meeting in October of 2022 between the data science, data platform, and product teams, the product team was describing their future vision. Their ultimate goal was to process patient data and provide predictive analytics directly to physicians.

“[It] sounded like a lot to build, but the day after we got all the requirements, we showed them Dagster, the assets, and the lineage, and how the declarative aspect of it makes it very easy to know when assets are out of date or when they need to be updated,” said Fahad.

In this sense, Dagster becomes a collaborative tool for understanding the assets available, and sharing their current status and lineage.

“There is a lot of value in not just explaining what transformations we're doing, but also what the result of that transformation is, especially to someone who is purely seeing it as data, like a product person,” says Fahad.


Zephyr AI has started using Dagster to publish metadata about the software-defined assets directly to Slack channels using the Slack integration. This posts a message with a link to completed assets.

“So if it's an S3 bucket, there's a link to an S3 console. Or if it's going to MLFlow or a Jupyter Notebook, just link out directly to it,” says Fahad.
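As a rough sketch of that pattern, the link posted for each completed asset can be derived from its storage URI. The helper and the console URL format below are illustrative assumptions, not Zephyr's actual code:

```python
# Hypothetical helper: turn a materialized asset's storage location into a
# clickable link for a Slack message. The AWS console URL format shown here
# is for illustration only.
from urllib.parse import quote

def asset_link(uri: str) -> str:
    """Map an s3:// URI to an S3 console URL; pass other URLs through."""
    if uri.startswith("s3://"):
        bucket, _, key = uri[len("s3://"):].partition("/")
        return f"https://s3.console.aws.amazon.com/s3/buckets/{bucket}?prefix={quote(key)}"
    # e.g. an MLflow run page or a hosted notebook URL is already clickable
    return uri
```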

Today the Zephyr AI team has ten people actively developing on Dagster, but all teams in the company interact with the system in some fashion.


Getting creative with Dagster: Perl, S3, Shell commands, and Python

Dagster is a great development framework for building pipelines using a Pythonic syntax, so it provides a good deal of familiarity and flexibility to Python developers.

Dagster also turns out to be a versatile tool for orchestrating tasks beyond the realm of Python. One particular piece of functionality was easier for the data science team to manage in Scala and was initially orchestrated with a Kubernetes CronJob. Jordan came up with the idea of orchestrating the job in Dagster using a Docker image in 10 to 15 lines of code, giving the team all the metadata, run history, and observability on those jobs.
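Stripped of Dagster specifics, that pattern is a thin wrapper around an external command whose exit status and output the orchestrator records. The docker invocation in the comment is an assumption for illustration; any command works:

```python
# Minimal sketch of wrapping an external (e.g. containerized Scala) job so an
# orchestrator op can capture its exit status and logs.
import subprocess

def run_external_job(cmd: list[str]) -> dict:
    """Run a command and return the metadata an op would record."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"job failed: {proc.stderr}")
    return {"command": " ".join(cmd), "stdout": proc.stdout}

# Inside an op this might be something like (hypothetical image name):
#   run_external_job(["docker", "run", "--rm", "my-scala-job:latest"])
```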

Once data engineers get a handle on the new toolset, they can start to re-architect their pipelines to make more efficient use of their infrastructure in a way that is specific to their workload.

One example is the work Fahad has done with the bioinformatics pipelines: connecting various Java programs and C++ and Perl scripts that process DNA sequencing data.

Fahad explains: “A lot of file-based tools assume that the file is going to be on your local file system. You can't really tell them to go to S3 and pull data. And they're also invoked using Shell commands that call out to some process.”

But Dagster provided a framework for seamlessly weaving these tools into a pipeline. This is made easier using Dagster’s I/O Managers: user-provided objects that store asset and op outputs and load them as inputs to downstream assets and ops.

“It goes back to the fact that it's all on Python. We can use stuff like pathlib to manage file paths. What the I/O Managers can do is basically make the S3 file download completely transparent. And then the I/O manager is a component that you don't really have to think about when you're writing pipelines.”

“Now all you're doing is you're getting paths in and paths out. You call whatever shell command you want on it. It creates a very nice developer experience because you're just writing reusable functions in Python that operate on file paths.”
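A toy version of that "paths in, paths out" idea looks like the following. A local directory stands in for S3 here (the real I/O manager would call something like boto3), and the downstream function only ever sees a plain local path:

```python
# Sketch of transparent file localization: pipeline functions receive local
# paths, and the loader hides whether the file came from a remote store.
import shutil
import tempfile
from pathlib import Path

def localize(uri: str, remote_root: Path) -> Path:
    """Resolve a store:// URI to a local path, copying the file if needed."""
    if uri.startswith("store://"):
        src = remote_root / uri[len("store://"):]
        dst = Path(tempfile.mkdtemp()) / src.name
        shutil.copy(src, dst)  # with S3 this would be a download call
        return dst
    return Path(uri)  # already a local path

def count_lines(path: Path) -> int:
    """Downstream step: a reusable function that operates on a file path."""
    with path.open() as f:
        return sum(1 for _ in f)
```

The downstream step never knows (or cares) where the bytes lived, which is exactly the developer experience Fahad describes.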

What the Zephyr team loves about working with Dagster is that they can identify incremental opportunities to get more out of the platform.

From the Zephyr perspective, “we are actively debating what can be internalized into a Dagster op (like a function doing sklearn training) or externalized (like a function that delegates to a Docker image that needs to run on Spark running on Kubernetes),” says fellow Zephyr AI engineer Ripple Khera. “We are interested in some bridge of the two that would allow for code to be written inline in a Dagster op but transparently executed on a declared execution engine (like Spark on Kubernetes, or EMR). We would love to consume new features from Dagster that make it easier to write code that leverages external execution engines.”

Moving forward, the team also plans on incorporating model training pipelines written in R.

A photograph of the Zephyr AI team on a cruise.
The Zephyr AI team taking some time off from solving complex bioinformatics challenges.

Scaling jobs on K8s

With the volume of data Zephyr is handling, scaling jobs is a key consideration.

With Dagster, the Zephyr AI team can swap in the K8s executor, so ops run not just in different processes but in a cluster of pods. One 80GB genetic sequence file gets broken up into 24 different portions of the genetic sequence. Each sub-sequence gets sent to an individual op, processed, and then merged back together.

“We're getting a lot of speed-ups and performance benefits just from using those,” says Fahad.

In essence, the Zephyr AI team has built their own MapReduce-like job on top of Kubernetes. It's just a DAG whose execution is completely abstracted away, and they don’t have to worry about it when writing the actual code.
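The fan-out/fan-in shape can be sketched in a few lines. The chunking and per-chunk work below are placeholders for the real genomic tools; in the actual pipeline each chunk would run as its own op in its own pod:

```python
# Sketch of the fan-out/fan-in pattern: split a sequence into up to n
# portions, process each independently, then merge the results back together.
def split(seq: str, n: int) -> list[str]:
    """Break a sequence into up to n roughly equal portions."""
    size = -(-len(seq) // n)  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def process(chunk: str) -> str:
    """Stand-in for the per-chunk work done by an individual op/pod."""
    return chunk.upper()

def run(seq: str, n: int = 24) -> str:
    """Fan out to chunks, process each, fan back in (merge)."""
    return "".join(process(c) for c in split(seq, n))
```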

What the future looks like

"Most of the work now is on the actual data, understanding the data and combining it, bringing it together. It’s a nice place to be, not fighting the tools but using the tools." - Fahad Khan

The Zephyr AI engineering team is now focused on how many different types of data sources they can merge into one common data set. Part of doing that is bringing together even more types of tools across different types of environments to build the holistic model for analyzing patient clinical data alongside genetic profiling.

“We've mostly solved that by building abstractions through Dagster. We can build up a pipeline that's running Perl 5 all the way through with no issue and then can do the same thing with Spark. And it doesn't feel much different,” says Fahad.

For Zephyr, the next big data science feats will be validating their predictive models in even larger real-world data sets and continuing to expand and improve their models to bring better outcomes and a better health care system to more people.

