LLM training pipelines with Langchain, Airbyte, and Dagster

July 5, 2023
This tutorial shows you how to combine Langchain, Airbyte, and Dagster to build maintainable and scalable pipelines for training LLMs.

Training Large Language Models (LLMs) requires appropriate contextual data. Typically, this data is dispersed across many sources: books, websites, articles, open datasets, external services, and an array of databases and data warehouses. Keeping this data fresh requires a robust pipeline - a task that quickly outgrows ad hoc shell or Python scripts.

Dagster can play a pivotal role in LLM training by orchestrating the services involved: running ingestion tasks, transforming and structuring your data, and making it readily available to your LLM. When paired with an ingestion tool such as Airbyte and a framework for developing LLM-powered applications like LangChain, making data accessible to LLMs becomes not only feasible but also maintainable and scalable.

This article walks through the process of establishing such a data pipeline.

Overview of what we’ll be building

The first step in training an LLM is to aggregate or ingest the dataset that the LLM will be trained on.

Popular public sources to find datasets include Kaggle, Google Dataset Search, Hugging Face, Data.gov, UCI’s Machine Learning Repository, Wikipedia/WikiData, and many others.

The data then needs to be transformed, cleaned and prepared for training.

Let’s start with the data ingestion step, using Airbyte. In step 1, we will set up a connection in Airbyte to fetch the relevant data. Airbyte supports hundreds of data sources and also lets you implement your own.

In step 2, we will configure the pipeline in Dagster to process the data and store it in a vectorstore.

In step 3, we will make the contextual information in the vectorstore available to the LLM using the LangChain retrieval Question/Answering (QA) module.

The final code for this example can be found on GitHub.

Prerequisites for Training LLMs

This is not a beginner’s tutorial, and we will assume you are familiar with Python, comfortable working on the command line, and have a basic familiarity with Dagster.

To run, you need:

  • Python 3 and Docker installed locally
  • An OpenAI API key
  • A number of Python dependencies, which can be installed as follows:
pip install openai faiss-cpu requests beautifulsoup4 tiktoken dagster_managed_elements langchain dagster dagster-airbyte dagit

Bash

Step 1: Set up the Airbyte connection

First, start Airbyte locally, as described in the Airbyte README.md file, and set up a connection:

  • Configure a source - if you don’t have sample data ready, you can use the “Sample Data (Faker)” data source
  • Configure a “Local JSON” destination with path /local - Dagster will pick up the data from there
  • Configure a connection from your configured source to the local JSON destination. Set the “Replication frequency” to manual, as Dagster will take care of running the sync at the right time.
  • To keep things simple, only enable a single stream of records (in the example below, we select the “Account” stream from the Salesforce source)

Step 2: Configure the pipeline in Dagster

Configure the Software-defined Asset graph in Dagster in a new file ingest.py:

First, load the existing Airbyte connection as a Dagster asset (no need to define it manually). The load_assets_from_airbyte_instance function uses the API to fetch existing connections from your Airbyte instance and makes them available as assets that can be specified as dependencies of the Python-defined assets that process the records in the subsequent steps.

from dagster_airbyte import load_assets_from_airbyte_instance, AirbyteResource

# Points Dagster at the locally running Airbyte instance
airbyte_instance = AirbyteResource(
    host="localhost",
    port="8000",
)

# Load all existing Airbyte connections as Dagster assets
airbyte_assets = load_assets_from_airbyte_instance(
    airbyte_instance,
    key_prefix="airbyte_asset",
)

Python

Then, add the LangChain loader that turns the raw JSONL file into LangChain documents, as a dependent asset (set stream_name to the name of the Airbyte stream of records you want to make accessible to the LLM - in my case it’s Account):

from langchain.document_loaders import AirbyteJSONLoader
from dagster import asset, AssetKey

# Set this to the Airbyte stream you enabled, e.g. "Account"
stream_name = ""

airbyte_loader = AirbyteJSONLoader(
    f"/tmp/airbyte_local/_airbyte_raw_{stream_name}.jsonl"
)

@asset(
    non_argument_deps={AssetKey(["airbyte_asset", stream_name])},
)
def raw_documents():
    # Turn the raw JSONL file written by Airbyte into LangChain documents
    return airbyte_loader.load()

Python

Then, add another step to the pipeline that splits the documents into chunks, so they will fit into the LLM context later:

from langchain.text_splitter import RecursiveCharacterTextSplitter

@asset
def documents(raw_documents):
    # Split each document into ~1000-character chunks so they fit
    # into the LLM's context window later
    return RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(raw_documents)

Python

The next step generates the embeddings for the documents:


from langchain.vectorstores.faiss import FAISS
from langchain.embeddings import OpenAIEmbeddings
import pickle

@asset
def vectorstore(documents):
    # Embed the document chunks with OpenAI and index them in FAISS
    vectorstore_contents = FAISS.from_documents(documents, OpenAIEmbeddings())
    # Persist the index so the query application can load it later
    with open("vectorstore.pkl", "wb") as f:
        pickle.dump(vectorstore_contents, f)

Python

Finally, define how to manage IO and export the definitions for Dagster. In this example, we use the default IO manager, which saves asset outputs as pickled files on the local file system:


from dagster import Definitions, build_asset_reconciliation_sensor, AssetSelection

defs = Definitions(
    assets=[airbyte_assets, raw_documents, documents, vectorstore],
    sensors=[
        # Automatically materialize downstream assets whenever their
        # upstream assets (like the Airbyte sync) are updated
        build_asset_reconciliation_sensor(
            AssetSelection.all(),
            name="reconciliation_sensor",
        )
    ],
)

Python
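
By default, the intermediate asset outputs are pickled into Dagster’s local storage directory. If you want them in a specific location, one option is to swap in a filesystem IO manager (a minimal sketch; the base_dir below is an arbitrary example):

from dagster import Definitions, FilesystemIOManager

defs = Definitions(
    assets=[airbyte_assets, raw_documents, documents, vectorstore],
    # Store asset outputs under an explicit directory instead of the default
    resources={"io_manager": FilesystemIOManager(base_dir="/data/assets")},
)

Python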

See the full ingest.py file here

Step 3: Load your data

Now, we can run the pipeline:

  • Run export OPENAI_API_KEY=<your api key>
  • Run dagster dev -f ingest.py to start Dagster
  • Go to http://127.0.0.1:3000/asset-groups to see the Dagster UI. Click the “Materialize” button to materialize all the assets. This will run all steps of the pipeline:
    • Triggering an Airbyte job to load the data from the source into a local JSONL file
    • Splitting the data into document chunks that will fit the context window of the LLM
    • Embedding these documents
    • Storing the embeddings in a local vector database for later retrieval
  • Once the run completes, a vectorstore.pkl file will appear in your local directory - this contains the embeddings for the data we just loaded via Airbyte.

Alternatively, you can materialize the Dagster assets directly from the command line using dagster asset materialize --select \* -f ingest.py

Step 4: Create a simple QA application with Langchain

The next step is to run a QA chain using LLMs. Create a new file query.py and load the embeddings:


from langchain.vectorstores import VectorStore
import pickle

vectorstore_file = "vectorstore.pkl"

# Load the pickled FAISS index created by the ingestion pipeline
with open(vectorstore_file, "rb") as f:
    local_vectorstore: VectorStore = pickle.load(f)

Python

Initialize the LLM and the QA retrieval chain based on the vectorstore:

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=local_vectorstore.as_retriever(),
)

Python

Add a basic way of querying the model with a question-answering loop:

print("Chat LangChain Demo")
print("Ask a question to begin:")
while True:
    query = input("")
    answer = qa.run(query)
    print(answer)
    print("\nWhat else can I help you with:")

See the full query.py file here

You can run the QA bot with OPENAI_API_KEY=YOUR_API_KEY python query.py

When asking questions about your use case (e.g. CRM data), LangChain manages the interaction between the LLM and the vectorstore (roughly as in the sketch after this list):

  • The user submits a question.
  • LangChain embeds the question in the same way the incoming records were embedded during the ingest phase.
  • A similarity search of the embeddings returns the most relevant documents.
  • The retrieved documents are passed to the LLM as context, and the LLM formulates an answer based on this contextual information.
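
For intuition, here is a minimal sketch of what the “stuff” retrieval chain does under the hood (the question and the k value are arbitrary examples; the real chain adds prompt templating and other details):

from langchain.llms import OpenAI

question = "How many accounts were created this year?"  # example question

# Retrieve the most similar document chunks from the vectorstore
docs = local_vectorstore.similarity_search(question, k=4)

# "Stuff" the retrieved chunks into a single prompt for the LLM
context = "\n\n".join(doc.page_content for doc in docs)
prompt = f"Answer the question based on the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(OpenAI(temperature=0)(prompt))

Python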

How to build on this example

This is just a simple demo, but it showcases how to use Airbyte and Dagster to bring data into a format that LangChain can use.

From this core use case, there are a lot of directions to explore further:

  • Get deeper into what can be done with Dagster by reading this excellent article on the Dagster blog
  • Check out the Airbyte catalog to learn more about the kinds of data you can load with minimal effort
  • If you are dealing with large amounts of data, consider storing it on S3 or a similar service - this is supported by the Airbyte S3 destination and built-in Dagster IO managers
  • Look into the various vectorstores supported by LangChain - a managed service like Pinecone or Weaviate can simplify things a lot
  • Productionize your pipeline by deploying Dagster properly and running it on a schedule
  • A big advantage of LLMs is that they can be multi-purpose - add multiple retrieval “tools” to your QA system to allow the bot to answer a wide range of questions (see the sketch after this list)
  • LangChain doesn’t stop at question answering - explore the LangChain documentation to learn about other use cases like summarization, information extraction and autonomous agents
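
As an illustration of the multi-tool idea, here is a minimal sketch using LangChain’s agent API. The docs_qa chain referenced in the commented-out tool is a hypothetical second retrieval chain you would build the same way as qa:

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.llms import OpenAI

tools = [
    Tool(
        name="CRM QA",
        func=qa.run,
        description="Answers questions about CRM account data.",
    ),
    # Hypothetical second retrieval chain over a different vectorstore:
    # Tool(name="Docs QA", func=docs_qa.run, description="Answers questions about product docs."),
]

# The agent picks a tool based on each tool's description
agent = initialize_agent(tools, OpenAI(temperature=0), agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
agent.run("How many accounts do we have?")

Python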

This article was written by Joe Reuter, software engineer at Airbyte. You can find the original article here.
