
August 28, 2023 · 13 minute read

ML pipelines for fine-tuning LLMs

LLM fine-tuning best practices for creating a clean production ML pipeline, streamlining model training, and operationalizing fine-tuned LLMs.
Odette Harary (@odette)

We fine-tuned an LLM for Dagster-specific technical support, using parameter-efficient techniques such as LoRA. This article shares our findings and demonstrates best practices in creating a clean production ML pipeline for fine-tuned LLMs.

We'll also discuss operationalizing and keeping the machine learning model up to date, and monitoring the quality of the predicted responses.

Large language models and the curse of the cut-off

Software engineers have found LLMs remarkably efficient at writing functional code and answering fairly sophisticated programming questions.

Naturally, Dagster users were keen to use these tools to speed up their data pipeline work but rapidly ran into a snag: Since ChatGPT’s training data cut-off date is September 2021, any updates to the Dagster APIs will not be reflected in ChatGPT’s responses.

A screenshot from ChatGPT showing the cutoff date of September 2021, making it less than ideal for learning the new APIs in Dagster.

There have been 6,809 commits to the Dagster project since then. At Dagster Labs, we release new features every week and have deprecated some of our older concepts. General purpose LLMs—while tremendously powerful—will never be a great fit for providing user support on rapidly evolving solutions like Dagster. They would require frequent re-training, and training a generalized LLM takes enormous amounts of data and computing.

Luckily, a base Large Language Model can be used as a starting point to build a new model specific to your needs. The technique is called “fine-tuning”.

Fine-tuning LLMs

Fine-tuning is when a pre-trained model (a “foundation model”) is customized using additional data to learn new information or be trained for a specific task. With an efficient training process, a pretrained model can provide more accurate and contextually relevant results. For example, taking an LLM and fine-tuning it on Dagster's latest docs can lead to an answer that includes Dagster's latest concepts such as declarative scheduling, advanced partitioning, or Software-defined Assets.

Use Case: a Dagster support chatbot.

As Dagster’s user base grows, scaling support is key to ensuring all users are getting effective support.

Currently, we provide support via our Slack Community, a public copy of which can be found here. Some questions that come into the channel are entirely novel, but many have been asked before. While a full-blown response to a nuanced question will still require a Dagster engineer to look into it, a fine-tuned LLM should be able to handle frequently asked or less demanding questions without human intervention. At the very least, it would be helpful for our support team to have a starting point for answering questions generated from historical data (even if these starting points are not available publicly).

This is what we will be doing in this tutorial: using a base LLM as a starting point and fine-tuning it on Dagster’s most up-to-date technical documentation.

We'll also explore Dagster's resources, making model experimentation more structured and traceable.

We’ll start with the notebook version of fine-tuning an LLM on unstructured Slack data and walk step by step through converting it into production-ready Dagster code.

There are several ML fine-tuning techniques. In this blog, we will perform Supervised Fine-Tuning (SFT) for dialogue, which trains the model to answer specific questions rather than to model entire dialogues. This is the most common type of fine-tuning: the other two options, pre-training for completion and RLHF, require far more computational power and higher-quality dialogue data, respectively.

To keep this tutorial simple, we'll be running the Dagster pipeline on a regular CPU (as opposed to a GPU), which will allow you to test out the pipeline locally without any additional setup. LoRA (low-rank adaptation of large language models) lets us train only a small subset of the LLM's parameters and helps prevent overfitting (when a model becomes so tailored, or "fit", to the specifics of its training data that it performs poorly on novel data in the wild).

In a prior post, Build a GitHub Support Bot with GPT3, LangChain, and Python, we used the GPT-3 API and were limited by the number of tokens we could consume. In this tutorial, we'll go through an example that does not require a token or even advanced GPUs (which are frequently used in LLM training), so you can try this out for free, on regular hardware.

Spoiler alert: this blog post can't guarantee groundbreaking performance in your specific use case, but it can teach you how to fine-tune an LLM using parameter-efficient techniques such as LoRA, and how to apply best practices for building a clean pipeline that will make your life better when going to production (and maintenance!).

Picking the right LLM to start with

Our go-to models for fine-tuning are HuggingFace’s Fine-tuned LAnguage Net (FLAN) models, such as flan-t5-xl and flan-t5-xxl with roughly 3B and 11B parameters, respectively. There are many LLMs, but FLAN is one of the most popular families for fine-tuning: it offers a trade-off between size and performance, with variants ranging from small to extremely large models.
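As a quick illustration (ours, not code from the tutorial pipeline), swapping between variants is just a matter of changing the checkpoint name:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# The checkpoint choice is illustrative; larger variants trade compute for quality
checkpoint = "google/flan-t5-small"  # or "google/flan-t5-xl", "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)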

From notebooks to Dagster Code

Machine learning engineers are used to working with notebooks. They can see results, test different outputs, jump around to tweak data, and quickly pivot back to model training. However, once a bug arises, this approach leads to jumping between cells, duplicated notebooks, and difficulty finding the correct version of the code.

So we will start off with this notebook that has a built-out supervised fine-tuning model for our use case and walk through how to take an ML notebook and transform it into Dagster code.

Moving a trained model to production is typically complex. But this is where Dagster helps. Dagster is a powerful orchestrator, whether building data or machine learning pipelines.

Some advantages of using Dagster to build machine learning pipelines include:

  • Automation of the pipeline, including refreshing the data and the models based on business needs
  • Experimentation capabilities by using resources to test out different hyper-parameters
  • Environment management features, which allow for testing in dev and seamlessly deploying in production in a cloud environment.

Thinking in Assets and Resources

Throughout this section, we will go through the notebook with examples of using Dagster’s assets and resources.

A Dagster asset is any object in persistent storage, such as a table, file, or persisted machine learning model.

Dagster resources are objects that are shared across the implementations of multiple software-defined assets, ops, schedules, and sensors. These resources are designed to be easily swapped out.
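To make these two concepts concrete before we dive into the real pipeline, here is a minimal, hypothetical sketch of an asset that reads its data through a resource (the names are illustrative and not part of the tutorial code):

from dagster import ConfigurableResource, asset

class MessageStore(ConfigurableResource):
    # Hypothetical resource: points at wherever the raw Slack export lives
    path: str

    def load(self) -> str:
        with open(self.path) as f:
            return f.read()

@asset
def support_messages(message_store: MessageStore) -> str:
    # The body is plain Python; Dagster tracks the returned object as an asset
    return message_store.load()

Swapping MessageStore for a different implementation (say, one that reads from cloud storage in production) would not require touching the asset code.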

The Slack text will be used as raw input data to train the existing Flan LLM. We will encode the input text in a way that matches the encoding of the existing model. We'll then evaluate our results and use a held-out dataset to predict answers for Dagster questions the model has not yet seen.

From raw data to Dagster @asset

The first asset we will consider is raw_datasets, which will obtain and split the input data into a train and test dataset, almost identical to the notebook version (the cell called Pre-process dataset in our Notebook).

A Notebook in the Google Colab environment allows us to run notebooks against CPUs and GPUs on the cloud.

The first change in converting our notebook code to Dagster is adding the @asset decorator to our functions. This defines a Dagster asset, which pairs the code that produces an object with the object that gets materialized.

As part of the migration from tasks to Software-defined Assets, we’ve renamed the function from get_dataset to raw_datasets. Dagster uses the function name to label the asset, making it clear in this case that this code generates the raw dataset.

Making our pipeline configurable

To make our pipeline easier to maintain, we’ll add configurations to abstract the tuning options from execution. This way, we can change configurations based on environments, easily tweak tuning parameters, and test our code without touching the core functions that operate on our data. For now, we’ll set:

  • where Slack messages are stored
  • what percentage of Slack questions we will use for training vs. testing
  • the input and output keys in our dataset.

Finally, we convert the notebook’s print() statements to asset metadata. Asset metadata can be more helpful than a print statement since it is tracked in Dagster with each materialization.

import json

import datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from dagster import Config, MetadataValue, Output, asset

class RawDatasetConfig(Config):
    path_to_slack_messages: str = "dataset/dagster-support-dataset.json"
    seed: int = 42
    train_size: float = 0.95
    input_text_key: str = "question"
    output_text_key: str = "gpt4_replies_target"

@asset(description="Slack Q&A user data")
def raw_datasets(config: RawDatasetConfig):

    with open(config.path_to_slack_messages, "r") as f:
        data = json.load(f)

    format_data = []
    for sample in data:
        format_data.append(
            {
                "input_text": sample[config.input_text_key],
                "output_text": sample[config.output_text_key],
            }
        )
    format_data = pd.DataFrame(format_data)

    train, test = train_test_split(format_data,
                                   random_state=config.seed,
                                   train_size=config.train_size)
    ## split the test set into a validation set and inference set
    validation, inference = train_test_split(test,
                                random_state=config.seed,
                                train_size=.8)

    dataset_train = datasets.Dataset.from_pandas(train)
    dataset_validation = datasets.Dataset.from_pandas(validation)
    dataset_inference = datasets.Dataset.from_pandas(inference)

    dataset = datasets.DatasetDict(
        {"train": dataset_train, "validation": dataset_validation, "inference": dataset_inference}
    )

    return Output(
        dataset,
        metadata={
            "Train dataset size": len(dataset_train),
            "Validation dataset size": len(dataset_validation),
            "Inference dataset size": len(dataset_inference),
        },
    )

Tokenizing our data

Now that we have our first data import step defined, let's look at a downstream task. We will take the data and tokenize it with the same encodings as the model. In a machine learning pipeline, tokenization takes large blocks of text and converts them into smaller numeric units that can be read by large language models.
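As a quick illustration (not part of the pipeline code), here is roughly what the tokenizer produces for a single question; the sample text is arbitrary:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
encoded = tok("How do I backfill a partitioned asset?", truncation=True)
print(encoded["input_ids"])                              # a short list of integer token ids
print(tok.convert_ids_to_tokens(encoded["input_ids"]))   # the subword pieces they correspond to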

In the existing notebook code, the pre-processing section is very dense. It contains code that pre-processes the data and then tokenizes both the train and test data. To make this pipeline more aligned with assets rather than tasks, we will break this apart so that the tokenizer and the datasets (processed_datasets, train_dataset, val_dataset, and inference_dataset) are all separate assets rather than one big function.

Building in Resources

For the tokenizer asset, we will introduce our first Dagster resource. A Dagster resource lets you capture config and connections to external systems, which can easily be swapped out for production versions and reduces code redundancy.

We'll create a BaseLLM resource which we will add to downstream assets. We'll define the parameters we need for the tokenizer and a function that fetches the pre-trained tokenizer we will be using. This resource lets us change the model in a single location and makes it easy to experiment with different models.

from dagster import ConfigurableResource
from transformers import AutoTokenizer

class BaseLLM(ConfigurableResource):
    model_name: str
    load_in_8bit: bool
    def PretrainedTokenizer(self):
        return AutoTokenizer.from_pretrained(self.model_name)

We'll define our tokenizer asset with the following code.

@asset(
    description= "HuggingFace Tokenizer",
)
def tokenizer(BaseLLM: BaseLLM):
    my_tokenizer = BaseLLM.PretrainedTokenizer()
    return Output(
        my_tokenizer,
        metadata={"model_name": MetadataValue.text(BaseLLM.model_name)},
    )
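Assets and resources are ultimately wired together in a Definitions object; a minimal sketch of that wiring (the model name here is just an example) looks roughly like this:

from dagster import Definitions

defs = Definitions(
    assets=[raw_datasets, tokenizer],
    resources={
        # The key must match the parameter name the assets use ("BaseLLM" above)
        "BaseLLM": BaseLLM(model_name="google/flan-t5-small", load_in_8bit=False),
    },
)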

Processed data as an asset

The next asset we'll define is processed_datasets. This is the first asset we'll introduce that has dependencies, which tell Dagster what upstream assets are required as inputs. When setting up automation, Dagster will take these dependencies into account and can parallelize some materializations.

This asset tokenizes the raw data; tokenizer and raw_datasets are its inputs.

@asset(
    description="Processed and deanonymized Q&A data",
)
def processed_datasets(
    tokenizer, raw_datasets):
    tokenized_inputs = datasets.concatenate_datasets([raw_datasets["train"], raw_datasets["validation"],  raw_datasets["inference"]]).map(
        lambda x: tokenizer(x["input_text"], truncation=True),
        batched=True,
        remove_columns=["input_text", "output_text"],
    )
    input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
    max_source_lengths = int(np.percentile(input_lengths, 95))
    print(f"Max source lengths: {max_source_lengths}")

    tokenized_targets = datasets.concatenate_datasets([raw_datasets["train"], raw_datasets["validation"],  raw_datasets["inference"]]).map(
        lambda x: tokenizer(x["output_text"], truncation=True),
        batched=True,
        remove_columns=["input_text", "output_text"],
    )
    target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
    max_target_lengths = int(np.percentile(target_lengths, 95))
    print(f"Max target lengths: {max_target_lengths}")

    def preprocess_function(sample, padding="max_length"):
        # add prefix to the input for t5
        inputs = [item for item in sample["input_text"]]
        # tokenize inputs
        model_inputs = tokenizer(inputs, max_length=max_source_lengths, padding=padding, truncation=True)
        # Tokenize targets with the `text_target` keyword argument
        labels = tokenizer(
            text_target=sample["output_text"],
            max_length=max_target_lengths,
            padding=padding,
            truncation=True,
        )
        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
        # padding in the loss.
        if padding == "max_length":
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label]
                for label in labels["input_ids"]
            ]

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized_dataset = datasets.DatasetDict()
    tokenized_dataset["train"] = raw_datasets["train"].map(
        preprocess_function, batched=True, remove_columns=["input_text", "output_text"]
    )
    tokenized_dataset["validation"] = raw_datasets["validation"].map(
        preprocess_function, batched=True, remove_columns=["input_text", "output_text"]
    )
    tokenized_dataset["inference"] = raw_datasets["inference"].map(
        preprocess_function, batched=True, remove_columns=["input_text", "output_text"]
    )
    return tokenized_dataset

The last step in breaking down this code block into assets is creating train_dataset, val_dataset, and inference_dataset.

@asset(
    description= "Training dataset",
)
def train_dataset(processed_datasets: datasets.DatasetDict):
    dataset = processed_datasets["train"]
    return Output(
        dataset,
        metadata={
            "size (bytes)": MetadataValue.int(dataset.size_in_bytes),
            "info": MetadataValue.text(dataset.info.description),
            "len": MetadataValue.int(len(dataset)),
        },
    )

@asset(
    description= "Validation dataset",
)
def val_dataset(processed_datasets: datasets.DatasetDict):
    dataset = processed_datasets["validation"]
    return Output(
        dataset,
        metadata={
            "size (bytes)": MetadataValue.int(dataset.size_in_bytes),
            "info": MetadataValue.text(dataset.info.description),
            "len": MetadataValue.int(len(dataset)),
        },
    )

@asset(
    description="Inference dataset",
)
def inference_dataset(processed_datasets: datasets.DatasetDict):
    dataset = processed_datasets["inference"]
    return Output(
        dataset,
        metadata={
            "size (bytes)": MetadataValue.int(dataset.size_in_bytes),
            "info": MetadataValue.text(dataset.info.description),
            "len": MetadataValue.int(len(dataset)),
        },
    )

At this point, we have the train and validation data tokenized and ready to be used to fine-tune the model.
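As a sanity check (our suggestion, not a step in the original notebook), everything defined so far can be materialized in-process from a script or test:

from dagster import materialize

result = materialize(
    [raw_datasets, tokenizer, processed_datasets, train_dataset, val_dataset, inference_dataset],
    # raw_datasets falls back to the defaults in RawDatasetConfig; the model name is illustrative
    resources={"BaseLLM": BaseLLM(model_name="google/flan-t5-small", load_in_8bit=False)},
)
assert result.success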

Train LoRA model with Trainer

In the next notebook cell, we set variables that typically get revisited and added to as engineers iterate on a notebook. In Dagster, Config and ConfigurableResource help organize this configuration so the relevant values can be adjusted during experimentation. Here we are using google/flan-t5-small as the base model and lora-flan-t5-base as the PEFT model id.

We'll organize these values in two parts: some belong to the BaseLLM resource we already used in tokenizer, and the rest go into a new resource focused on training.

For BaseLLM, we'll extend our existing resource so that the same parameters used for the tokenizer also load the base model and wrap it with LoRA.

from typing import List

from dagster import ConfigurableResource
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

class BaseLLM(ConfigurableResource):
    model_name: str
    load_in_8bit: bool
    r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.2
    bias: str = "none"
    target_modules: List[str] = ["q", "v"]

    def PretrainedTokenizer(self):
        return AutoTokenizer.from_pretrained(self.model_name)

    def LoraModel(self):
        model = AutoModelForSeq2SeqLM.from_pretrained(self.model_name, load_in_8bit=self.load_in_8bit)
        lora_config = LoraConfig(
            r=self.r,
            lora_alpha=self.lora_alpha,
            target_modules=self.target_modules,
            lora_dropout=self.lora_dropout,
            bias=self.bias,
            task_type=TaskType.SEQ_2_SEQ_LM,
        )
        if self.load_in_8bit:
            model = prepare_model_for_int8_training(model)
        model = get_peft_model(model, lora_config)
        return model

We'll also set up TrainingResource, which defines the options we want for the LLM supervised fine-tuning process, including things like the learning rate and the number of epochs (training rounds).

from transformers import Seq2SeqTrainingArguments

class TrainingResource(ConfigurableResource):
    peft_model_id: str = "lora-flan-t5-base"
    num_train_epochs: int = 1
    per_device_eval_batch_size: int = 8
    per_device_train_batch_size: int = 8
    gradient_accumulation_steps: int = 1
    lr: float = 1e-3

    def training_args(self):
        return Seq2SeqTrainingArguments(
            do_train=True,
            do_eval=True,
            evaluation_strategy="epoch",
            logging_strategy="epoch",
            save_strategy="epoch",
            per_device_eval_batch_size=self.per_device_eval_batch_size,
            per_device_train_batch_size=self.per_device_train_batch_size,
            gradient_accumulation_steps=self.gradient_accumulation_steps,
            output_dir=self.peft_model_id,
            auto_find_batch_size=True,
            learning_rate=self.lr,
            num_train_epochs=self.num_train_epochs,
            logging_dir=f"{self.peft_model_id}/logs",
            use_mps_device=False,
        )

Using resources for ML experimentation

Resources give us the ability to adjust the parameters of a machine learning model and test out different options to compare performance.

We'll also add a resource that gets the accelerator; in this case, we are using a CPU, but this can be used to toggle different computing environments. For example, you might have a smaller training dataset in development and a large GPU environment for production, which uses a larger model and is a more expensive process.

import torch
from dagster import ConfigurableResource

class AcceleratorResource(ConfigurableResource):
    def get_device(self) -> torch.device:
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
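One way to do that toggling (our convention, not something prescribed by the tutorial) is to pick resource values based on an environment variable, for example:

import os

def get_resources():
    # DAGSTER_DEPLOYMENT is our own convention; any environment flag works
    if os.getenv("DAGSTER_DEPLOYMENT") == "prod":
        return {
            "BaseLLM": BaseLLM(model_name="google/flan-t5-xl", load_in_8bit=True),
            "accelerator": AcceleratorResource(),
        }
    return {
        "BaseLLM": BaseLLM(model_name="google/flan-t5-small", load_in_8bit=False),
        "accelerator": AcceleratorResource(),
    }

The returned dictionary can then be passed as the resources argument of Definitions.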

Now that the resources are set up, we'll create an asset for the base Large Language Model and an asset for the fine-tuned model.

@asset(
    description="Base HuggingFace large language model for fine-tuning",
)
def base_llm(
    BaseLLM: BaseLLM, accelerator: AcceleratorResource
) -> Output[torch.nn.Module]:

    model = BaseLLM.LoraModel()
    # Place model on accelerator
    model = model.to(accelerator.get_device())

    # print_trainable_parameters() only prints and returns None, so compute the
    # trainable parameter count explicitly for the asset metadata
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    return Output(
        model,
        metadata={
            "model_name": MetadataValue.text(BaseLLM.model_name),
            "trainable_parameters": MetadataValue.int(trainable_params),
        },
    )

The finetuned_llm asset will also compute performance measures of the model, which can be logged as Dagster metadata.

Using Dagster resources and metadata creates an environment that enables machine learning experimentation. If you want to try different learning rates, instead of changing a specific cell in the notebook, you can change the configuration in the resource so you can easily track what values created what results.
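One way to set this up (a sketch; exact details depend on your Dagster version) is to leave the training resource unconfigured in code and supply its values when launching a run:

from dagster import Definitions

defs = Definitions(
    assets=[base_llm, tokenizer, train_dataset, val_dataset],  # plus finetuned_llm, defined below
    resources={
        "BaseLLM": BaseLLM(model_name="google/flan-t5-small", load_in_8bit=False),
        "accelerator": AcceleratorResource(),
        # Resolve the training options at launch time, so a run can try, say,
        # num_train_epochs: 5 or lr: 0.0003 and have its metrics recorded against it
        "TrainingResource": TrainingResource.configure_at_launch(),
    },
)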

import evaluate
import numpy as np
import torch
from dagster import AutoMaterializePolicy, MetadataValue, Output, asset
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer

def evaluate_peft_model(sample, model, tokenizer, max_target_length=512):
    # generate a predicted answer for the sample question
    outputs = model.generate(
        input_ids=sample["input_ids"].unsqueeze(0).cpu(),
        do_sample=True,
        top_p=0.9,
        max_new_tokens=max_target_length,
    )
    prediction = tokenizer.decode(
        outputs[0].detach().cpu().numpy(), skip_special_tokens=True
    )
    # decode the eval sample, replacing the -100 label padding with the pad token id
    labels = np.where(
        sample["labels"] != -100, sample["labels"], tokenizer.pad_token_id
    )
    labels = tokenizer.decode(labels, skip_special_tokens=True)
    return prediction, labels

@asset(
    description= "A LoRA fine-tuned HuggingFace large language model",
    # freshness_policy=FreshnessPolicy(maximum_lag_minutes=3),
    auto_materialize_policy=AutoMaterializePolicy.lazy(),
)
def finetuned_llm(
    TrainingResource: TrainingResource,
    base_llm: torch.nn.Module,
    tokenizer,
    train_dataset,
    val_dataset,
    accelerator: AcceleratorResource,
) -> Output[torch.nn.Module]:
    # Place model on accelerator
    base_llm = base_llm.to(accelerator.get_device())

    label_pad_token_id = -100
    data_collator = DataCollatorForSeq2Seq(
        tokenizer,
        model=base_llm,
        label_pad_token_id=label_pad_token_id,
        pad_to_multiple_of=8,
    )

    training_args = TrainingResource.training_args()

    trainer = Seq2SeqTrainer(
        model=base_llm,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )

    trainer.train()
    eval_metrics = trainer.evaluate()

    _finetuned_llm = trainer.model
    _finetuned_llm.eval()
    ## Peft model loaded
    metric = evaluate.load("rouge")
    predictions, references = [], []

    model = _finetuned_llm.to(accelerator.get_device())
    for sample in val_dataset.with_format("torch"):
        prediction, labels = evaluate_peft_model(sample, model, tokenizer, max_target_length=512)
        predictions.append(prediction)
        references.append(labels)

    rouge = metric.compute(
        predictions=predictions, references=references, use_stemmer=True
    )

    eval_rouge_scores = {
        "rouge1": rouge["rouge1"] * 100,
        "rouge2": rouge["rouge2"] * 100,
        "rougeL": rouge["rougeL"] * 100,
        "rougeLsum": rouge["rougeLsum"] * 100,
    }

    eval_metric_all = {**eval_metrics, **eval_rouge_scores}

    return Output(
        _finetuned_llm,
        metadata={
            name: MetadataValue.float(value) for name, value in eval_metric_all.items()
        },
    )

Model predictions

The last step is using the machine learning model to predict responses to questions. We’ll create a predictions asset that uses the fine-tuned model to answer new Slack questions.

import torch
from dagster import AutoMaterializePolicy, MetadataValue, Output, asset
from tutorial.resources.resources import AcceleratorResource

@asset(description="Predictions on Slack user questions")
def predictions(
    finetuned_llm: torch.nn.Module,
    tokenizer,
    inference_dataset,
    accelerator: AcceleratorResource,
):
    finetuned_llm = finetuned_llm.to(accelerator.get_device())
    predictions = []
    questions = []
    with torch.inference_mode():

        for sample in inference_dataset.with_format('torch', device=accelerator.get_device()):
            max_target_length = 512
            outputs = finetuned_llm.generate(
                input_ids=sample["input_ids"].unsqueeze(0).cpu(),
                do_sample=True,
                top_p=0.9,
                max_new_tokens=max_target_length,
            )
            prediction = tokenizer.decode(
                outputs[0].detach().cpu().numpy(), skip_special_tokens=True
            )
            predictions.append(prediction)
            questions.append(
                tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
            )

    return Output(
        predictions,
        metadata={
            "predictions": MetadataValue.json(predictions),
            "questions": MetadataValue.json(questions),})

Dagster in action

By now, our asset graph will look like this:

We can see all the dependencies built out and our assets arranged into different groups. If we wanted to build a fine-tuning model for different Slack messages or any new dataset, we could reuse some of the assets.

Machine Learning experimentation in Dagster

Diving into finetuned_llm, we can see the evaluation metrics on the side navigation.

When we click on the asset, we will see the full materialization history of our machine learning pipeline.

The metadata plots track the values of the evaluation that we have logged.

We had originally set the number of training epochs (training rounds) to 1, which we can see by navigating to TrainingResource.

Let's adjust num_train_epochs to 5 and the model to lora-flan-t5-xxl to see the impact on our evaluation metrics. We'll be able to see the updated values in the Dagster UI.

Looking at the latest metadata plots, we can see the metrics relative to the previous runs and continue to iterate on our model.

Checking our results

Let's now take a look at the predicted answers to some of our sample natural-language questions, comparing the models on the same data:

Slack question:

Here is a representative request posted to our Slack channel:

"Hi! Is there a way to configure jobs so that assets will only be rerun if the partition is missing or out of date? Or do I need to use a sensor?”

ChatGPT output:

“As of my last knowledge update in September 2021, Dagster does not natively provide a built-in feature to automatically rerun jobs only when assets are missing or out of date. Instead, you would typically use sensors to achieve this behavior. Sensors are a key component of Dagster's event-driven architecture and are designed for tasks like monitoring external systems, checking for changes, and triggering pipeline runs based on conditions….”

Base Flan Model output:

“Do you need a sensor”

Fine-tuned Model output:

"You can configure jobs so that assets will only be rerun if the partition is missing or out of date and you want to use a sensor to configure jobs so that assets will only be rerun if the partition is missing or out of date."

Dagster Engineer’s response:

“Hi, versioning/staleness isn’t supported yet for partitioned assets (we’re working towards it), so using a sensor, for now, is your best bet.”

Comparing the three models’ predictions:

  • ChatGPT notes that it was last updated in September 2021, when Dagster did not yet provide a feature to materialize stale assets. Since 2021, the ability to materialize assets when they are out of date has been added.
  • The base model responds with the same question the user asked and does not provide additional insight, and barely matches human language.
  • The fine-tuned model provides the most insight of the three, though it is not on par with the Dagster engineer's response, which adds the nuance that the feature exists but is not yet supported for partitioned assets.
  • Our Engineer’s response provides the best context, and is still the most helpful.

Automating our ML pipeline

The last step in turning ML models into production-ready pipelines is automation. A schedule, sensor, or auto-materialization policy can trigger each asset. With automated data flows and parameter-efficient fine-tuning, keeping machine learning systems up to date can be done quite cost-efficiently.

These guides are helpful for understanding automation in Dagster, for both data and machine learning pipelines.
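For example, a weekly job that refreshes everything from the raw Slack data through to predictions might look roughly like this (a hedged sketch; the selection and cron cadence are arbitrary choices of ours):

from dagster import AssetSelection, ScheduleDefinition, define_asset_job

retrain_job = define_asset_job(
    name="retrain_finetuned_llm",
    selection=AssetSelection.all(),  # or narrow it, e.g. AssetSelection.assets(predictions).upstream()
)

weekly_retrain_schedule = ScheduleDefinition(
    job=retrain_job,
    cron_schedule="0 6 * * 1",  # Mondays at 06:00
)

Both objects would then be passed to Definitions via its jobs and schedules arguments; alternatively, individual assets can rely on the AutoMaterializePolicy already attached to finetuned_llm above.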

Conclusion

Machine learning models are challenging to train and maintain, but with Dagster and asset-based coding, you can remove many of the time-consuming steps involved and simplify the life of your ML team. Orchestrating your machine learning pipeline streamlines model training and helps operationalize fine-tuned LLMs. Many data scientists have also found this problem-solving approach useful more generally in their research.

Feel free to star and fork this repo, change the problem (i.e., dataset) to your own, and enjoy the structure and tightening of our ML pipelines.



We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a Github discussion. If you run into any bugs, let us know with a Github issue. And if you're interested in working with us, check out our open roles!
