July 24, 2024 • 3 minute read
Case Study: How Petal Incrementally Adopted a Data Orchestrator
By Fraser Marlow (@frasermarlow)
Adopting a modern data orchestrator may seem like a big leap, one that could cause your data operations to go dark for weeks or months. But this is not the case. Dagster can be adopted incrementally, adding value to your data operations from day one and then scaling as your needs grow. Here we explore one such example.
Meet Petal
Petal is an innovative FinTech firm that provides access to credit by looking at creditworthiness beyond the traditional credit scores used by most providers. Petal focuses as much on helping consumers save and build up good credit as it does on facilitating purchases.
Liem Truong is an Engineering Manager at Empower, Petal’s parent company, and works on the Petal business. He is part of a small team of three data engineers. They support two product engineering teams, a growth team, and teams managing existing customer accounts, as well as the internal data teams. Petal is increasingly tapping into ML for underwriting models, and their data ecosystem continues to expand.
Importantly, Petal has partners across the financial services ecosystem, including VISA, WebBank, Credit Karma, Equifax, Experian, and TransUnion. These partners rely on timely and accurate data feeds from companies like Petal.
The Petal team is building on AWS and uses several capabilities from the AWS toolkit.
Upstream, the team replicates data from their operational datastore using AWS Database Migration Service (DMS) using CDC. Data is also received from external partners, such as daily reports from Petal’s card processor. A large volume of daily data comes from credit reporting bureaus. With around 300,000 customers, the data scales up rapidly.
A loading framework acquires the data via SFTP or from S3, applies parsing mechanisms on each job, then lands raw data to S3 and loads to Redshift. From here dbt handles the transformation into data assets for downstream stakeholders.
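The acquire-parse-land-load flow described above can be sketched in a few lines of Python. This is a minimal, hedged illustration of the pattern only; every name here (`fetch_report`, `parse_report`, and so on) is hypothetical, and the real framework would call out to SFTP, S3, and Redshift rather than work in memory.

```python
# Hypothetical sketch of a loading-framework step: acquire a partner file,
# parse it per-job, land the raw bytes, and load parsed rows to a warehouse.

def fetch_report(source: str) -> bytes:
    """Acquire a raw partner file (in practice: via SFTP or from S3)."""
    # Stand-in: canned CSV bytes instead of a network call.
    return b"customer_id,balance\n1001,250.00\n1002,99.50\n"

def parse_report(raw: bytes) -> list[dict]:
    """Apply the job's parsing mechanism to the raw file."""
    header, *rows = raw.decode().strip().splitlines()
    cols = header.split(",")
    return [dict(zip(cols, row.split(","))) for row in rows]

def land_raw(raw: bytes, key: str) -> str:
    """Land the untouched raw bytes (in practice: an S3 put)."""
    return f"s3://raw-bucket/{key}"

def load_to_warehouse(records: list[dict], table: str) -> int:
    """Load parsed records (in practice: a Redshift COPY). Returns row count."""
    return len(records)

raw = fetch_report("card_processor/daily")
raw_uri = land_raw(raw, "card_processor/2024-07-24.csv")
loaded = load_to_warehouse(parse_report(raw), "staging.card_transactions")
```

Keeping each step a plain function is what makes the later migration easy: the same functions can be dropped into Dagster assets or ops without rewriting them.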
Moving beyond Jenkins
Prior to Dagster, Petal was running a number of CRON schedules triggered through Jenkins. But the Jenkins setup was due for an overhaul, and the team was missing the functionality of a modern data orchestrator.
“We never really had an orchestration tool. We relied on a bunch of CRON jobs stitched together with Jenkins. We wanted to break away from that and adopt a true data-centric tool to provide more context-aware orchestration,” recalls Liem.
“The main part we wanted to derive value from was event-based triggering from one model to the next. We wanted to have a dependency graph and streamline the overall pipeline. From the dbt side, we had the lineage and we got a lot of value from that, but it was lacking in our orchestration process. Now we can see it end-to-end.”
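The event-based triggering Liem describes can be pictured as a dependency graph where materializing one asset kicks off every downstream asset whose inputs are now fresh. The sketch below is conceptual only, not Dagster's actual API, and the asset names are invented for illustration:

```python
# Conceptual sketch of event-based triggering over a dependency graph:
# when an upstream asset materializes, downstream assets run as soon as
# ALL of their inputs are fresh. (Not Dagster's real API.)
from collections import defaultdict

deps = {  # downstream asset -> upstream assets it depends on (hypothetical)
    "stg_transactions": ["raw_card_feed"],
    "stg_bureau": ["raw_bureau_feed"],
    "fct_credit_usage": ["stg_transactions", "stg_bureau"],
}

# Invert the mapping so each upstream knows who it should trigger.
dependents = defaultdict(list)
for downstream, upstreams in deps.items():
    for up in upstreams:
        dependents[up].append(downstream)

fresh: set[str] = set()
run_log: list[str] = []

def materialize(asset: str) -> None:
    """Run one asset, then trigger any dependent whose inputs are all fresh."""
    run_log.append(asset)
    fresh.add(asset)
    for downstream in dependents[asset]:
        if all(up in fresh for up in deps[downstream]):
            materialize(downstream)

# Two upstream feeds arrive as events; the joined model runs only once
# both of its parents have materialized.
materialize("raw_card_feed")
materialize("raw_bureau_feed")
```

Note how `fct_credit_usage` runs exactly once, at the end, even though it has two parents arriving at different times; that is the "context-aware" behavior a plain cron schedule cannot express.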
An incremental adoption approach
The team has been building out many of Dagster's constructs, but adopting them incrementally. They retained much of the code from the original ingestion flow and dropped it into the Dagster framework.
“The team is taking bits of our pipeline and creating the Assets and Ops to run the pipelines on Dagster,” says Liem.
Over time, the plan is to adopt more and more of the Dagster capabilities.
While Liem did not personally select Dagster as the foundation of the data platform, today he manages the team that works on Dagster. Petal adopted Dagster+ Serverless to delegate the infrastructure concerns.
Weaving in dbt
Petal adopted dbt Core early in their design. The team is leaning on the dagster-dbt integration to shift all dbt jobs over to Dagster and fully leverage the scheduling capabilities.
Incrementally shifting collaboration
In a similar way, the team is incrementally using Dagster for collaboration. So far, the data cataloging has been done through the dbt docs, but as the team builds out the Dagster catalog, Liem sees this as being a foundation for collaborating with internal and possibly external stakeholders.
Setting up options for the future
In conclusion, Petal is a great example of how a small data team can lay the foundations for future functionality without having to invest a lot of time today or disrupt current pipelines. Dagster has the breadth of functionality to replace homespun or simple cron-based scheduling processes and gives you options for adopting capabilities like event-driven scheduling, data cataloging, data quality tests, and so much more.
If you want to try out Dagster+ Serverless, you can sign up for a free trial today.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!