Blog

Monorepos, the hub-and-spoke model, and Copybara

April 3, 2026

How we configure Copybara for bi-directional syncing to enable a hub-and-spoke model for Git repositories

From the outside, Dagster can look like a single open source project that lives in the dagster-io/dagster repository. What is less visible is how many systems sit behind that public surface.

The public side of Dagster has always resided in a single repo containing the core framework and Dagster-supported integrations. However, we've also needed to maintain a private internal repository since the launch of Dagster Plus.

We live in a world where software development depends more and more on context, coordination, and AI-assisted tooling. To help us make cross-cutting changes more easily in this new environment, we recently completed a project to unify our codebases. In this post, we walk through the structure and tooling we chose, how syncing between the internal and external repositories works, and what this means for contributors going forward.

The structure we chose

The structure we landed on is a hub-and-spoke model centered around an internal monorepo.

We do most development in a single repository that includes product code, internal tooling, and synchronized copies of the dagster-io/dagster and dagster-io/skills repositories in their own subdirectories. We use Copybara to keep them in sync in both directions.

This creates a hub-and-spoke model:

  • The internal monorepo is the hub
  • Public repositories are spokes
  • Changes can flow from the internal repo out to public repos
  • Changes made in public repos can also flow back into the internal repo

This gives us a unified internal development environment without forcing every internal system into the public repository surface area.

For a broader argument for monorepos as a development model, Google’s classic writeup, Why Google Stores Billions of Lines of Code in a Single Repository, is still a good reference. Our version is different in one important way: we wanted those monorepo advantages internally while still keeping focused public repositories for contributors.

Copybara

We chose to use an open source tool from Google called Copybara to implement this model.

Copybara provides a rich API for synchronizing code between repositories. Unfortunately, it does not officially support the “two-way sync” approach described above, where changes can be made directly to either the hub or a spoke repository.

Copybara’s blessed path is instead to use one repository as the single source of truth. In this model, PRs opened against the “spoke” repos dagster and skills cannot be merged directly. Instead, a spoke PR automatically triggers a mirror PR against the private hub. When the hub PR is merged and synced back out to the spoke repository, the original PR is auto-closed. We rejected this model because it confuses external contributors and fails to properly credit them. So we needed to work around Copybara’s limitations to achieve robust two-way syncing.

Adding two-way syncing

For outbound sync, we export a subdirectory from the internal monorepo, move it to the root of the public repository, and explicitly skip commits that already originated in that public repo. That skip logic is the key piece that makes two-way sync workable.

def _skip_synced_from_dagster(ctx):
    if ctx.find_label("Dagster-RevId"):
        core.fail_with_noop("Skipping commit that originated from dagster-io/dagster.")

core.workflow(
    name = "sync-dagster",
    origin = git.origin(
        url = "file:///workspace/build/buildkite/dagster/copybara-internal-to-public",
        ref = "master",
    ),
    destination = git.destination(
        url = "https://github.com/dagster-io/dagster.git",
        fetch = "master",
        push = "master",
    ),
    origin_files = glob(["public/dagster/**"]),
    destination_files = glob(["**"]),
    transformations = [
        core.dynamic_transform(impl = _skip_synced_from_dagster),
        core.move("public/dagster", ""),
        core.move("js_modules/.yarnrc.oss.yml", "js_modules/.yarnrc.yml"),
        core.move("js_modules/package.oss.json", "js_modules/package.json"),
        core.move("js_modules/yarn.oss.lock", "js_modules/yarn.lock"),
    ],
    mode = "ITERATIVE",
    reversible_check = True,
    custom_rev_id = "Internal-RevId",
)

The interesting part here is not that we add brand new metadata. Copybara already records origin revisions by default using GitOrigin-RevId. What we do instead is rename that revision label per sync pair. When internal commits are synced to dagster, we write Internal-RevId instead of GitOrigin-RevId. When dagster commits are synced back to internal, we write Dagster-RevId. That makes the source repo obvious in commit history, and it gives the skip function a reliable signal for loop prevention.
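
Concretely, the label is written as a Git trailer in the synced commit’s message. For illustration (these hashes are made up):

```
# A commit in dagster-io/dagster that was synced out from the monorepo:
Fix flaky sensor test

Internal-RevId: 3f2a9c1e5b7d0c44a1e2f3a4b5c6d7e8f9a0b1c2

# A commit in the internal monorepo that was synced in from dagster:
Add docs for asset checks

Dagster-RevId: 9e4c0a2f81b3d5c7e9f1a3b5c7d9e1f3a5b7c9d1
```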

The inbound side does the inverse. It takes the public repository and mounts it back into the right subdirectory inside the monorepo. The custom_rev_id is important here. Without it, commits synced from dagster and skills would both use Copybara’s default GitOrigin-RevId, and iterative sync would start misattributing commits across repositories. In practice, that would break the sync, because a commit that actually came from skills could be mistaken for one that came from dagster, and vice versa.

def _skip_synced_from_internal(ctx):
    if ctx.find_label("Internal-RevId"):
        core.fail_with_noop("Skipping commit that originated from dagster-io/internal.")

core.workflow(
    name = "sync-internal",
    origin = git.origin(
        url = "file:///workspace/build/buildkite/dagster/copybara-dagster-to-internal",
        ref = "master",
    ),
    destination = git.destination(
        url = "https://github.com/dagster-io/internal.git",
        fetch = "master",
        push = "master",
    ),
    destination_files = glob(["public/dagster/**"]),
    transformations = [
        core.dynamic_transform(impl = _skip_synced_from_internal),
        core.move("js_modules/.yarnrc.yml", "js_modules/.yarnrc.oss.yml"),
        core.move("js_modules/package.json", "js_modules/package.oss.json"),
        core.move("js_modules/yarn.lock", "js_modules/yarn.oss.lock"),
        core.move("", "public/dagster"),
    ],
    mode = "ITERATIVE",
    reversible_check = True,
    custom_rev_id = "Dagster-RevId",
)


_skip_synced_from_internal is the other half of the loop-prevention story. When a commit originates in the internal monorepo, is synced out to dagster, and then appears in dagster history, this function tells the inbound sync to noop instead of importing that same change back into internal again.
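
Taken together, the two skip functions enforce one invariant: import a commit only if it does not already carry the opposite direction’s rev-id label. A minimal Python sketch of that decision (the helper is our own illustration; only the label names come from the configs above):

```python
def should_import(commit_message: str, skip_label: str) -> bool:
    """Decide whether a sync should import this commit.

    skip_label is the rev-id label written by the opposite direction's
    workflow, e.g. "Internal-RevId" for the dagster -> internal sync.
    """
    # Copybara records labels as "Label: value" trailer lines. A commit
    # carrying the opposite label has already round-tripped, so importing
    # it again would create a sync loop.
    return not any(
        line.startswith(skip_label + ":")
        for line in commit_message.splitlines()
    )

# Originated internally, synced out, now visible in dagster history:
# the inbound sync must treat it as a no-op.
assert should_import("Fix bug\n\nInternal-RevId: abc123", "Internal-RevId") is False

# A genuine external contribution has no such label and is imported.
assert should_import("Community fix", "Internal-RevId") is True
```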

Operationally, these syncs are run by Buildkite pipelines. A pipeline runs whenever there is a commit to master in either the hub or a spoke repository. On the internal side, the pipeline then checks which subtree changed and only runs the relevant outbound sync:

steps:
  - commands:
      - |
        if git diff --name-only HEAD~1 HEAD | grep -q '^public/dagster/'; then
          copybara ./copy.bara.sky sync-dagster \
            --git-committer-email=devtools@dagsterlabs.com \
            --git-committer-name='Dagster Devtools' || test $? -eq 4
        fi
      - |
        if git diff --name-only HEAD~1 HEAD | grep -q '^public/skills/'; then
          copybara ./copy.bara.sky sync-skills \
            --git-committer-email=devtools@dagsterlabs.com \
            --git-committer-name='Dagster Devtools' || test $? -eq 4
        fi


At the pipeline level, we also add guardrails so the syncs do not retrigger on commits that were themselves created by Copybara. In Terraform, those conditions look like this:

module "copybara-internal-to-public" {
  name           = "copybara: internal to public"
  git_repository = "git@github.com:dagster-io/internal.git"
  conditionals = <<EOF
build.branch == "master" &&
build.message !~ /Dagster-RevId:/ &&
build.message !~ /Skills-RevId:/
EOF
}

module "copybara-dagster-to-internal" {
  name           = "copybara: dagster to internal"
  git_repository = "git@github.com:dagster-io/dagster.git"
  conditionals = <<EOF
build.branch == "master" && build.message !~ /Internal-RevId:/
EOF
}


The important details are the operational ones. We treat Copybara no-op exit code 4 as success. We scope syncs to specific subtrees instead of syncing everything on every change. And we use repo-specific revision IDs like Internal-RevId, Dagster-RevId, and Skills-RevId to keep iterative sync and loop prevention scoped to the right repo pair.

Preventing race conditions with the GitHub merge queue

The Copybara setup described above implements two-way syncing, but it has a crucial problem. If a commit lands on master of one repo before outstanding changes from its partner have been synced, the system’s integrity is compromised: the sync can silently revert those outstanding changes. This happens because each sync writes the origin’s current view of the shared files into the destination, so any destination-side change that has not yet flowed back gets overwritten.
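
To make the failure mode concrete, here is a toy simulation (our own sketch, not Copybara code) of the overwrite. Each “repo” is a dict of path to contents, and a sync copies the origin’s view of the shared files over the destination’s:

```python
# Both repos start in sync on a shared file.
internal = {"public/dagster/feature.py": "v1"}
dagster = {"feature.py": "v1"}

# 1. A change lands in internal but has NOT yet been synced out.
internal["public/dagster/feature.py"] = "v2"

# 2. Meanwhile an OSS PR merges to dagster's master.
dagster["readme.md"] = "community fix"

# 3. The inbound (dagster -> internal) sync runs first. It mirrors
#    dagster's state into internal's subtree -- including the stale
#    "v1" of feature.py, reverting the unsynced internal change.
for path, contents in dagster.items():
    internal["public/dagster/" + path] = contents

assert internal["public/dagster/feature.py"] == "v1"  # v2 was silently lost
assert internal["public/dagster/readme.md"] == "community fix"
```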

To avoid this issue, we gate merges behind a sync check that ensures the hub and spoke are in sync before landing. The check needs to run not just when a PR is pushed, but immediately before merge. We do this with a GitHub Actions workflow combined with the GitHub merge queue. Here is the workflow that gates merges from dagster to internal:

# Copybara Sync Gate (OSS)
#
# Prevents merging OSS PRs while unsynced internal→OSS changes exist.
# Without this gate, the Copybara OSS→internal sync of this commit would
# overwrite those internal changes, silently reverting them.
#
# Required status check: the job name "copybara-sync-gate".
#
# Works with GitHub merge queue: the pull_request trigger provides a
# passthrough (instant success) so the PR can enter the merge queue. The
# real check runs on the merge_group trigger at merge time, polling for
# up to 10 minutes if the sync is behind.
#
# Prerequisites:
#   - Repository secret ELEMENTL_DEVTOOLS_PAT: a GitHub PAT with read
#     access to dagster-io/internal (needed to check internal commit history).

name: Copybara Sync Gate

on:
  pull_request:
    types: [opened, synchronize, reopened]
  merge_group:

permissions:
  contents: read

jobs:
  copybara-sync-gate:
    runs-on: ubuntu-latest
    steps:
      - name: Wait for internal-to-OSS sync to catch up
        if: github.event_name == 'merge_group'
        env:
          GH_TOKEN: ${{ secrets.ELEMENTL_DEVTOOLS_PAT }}
        run: |
          set -euo pipefail

          check_sync() {
            # Find the last internal commit that was synced to OSS.
            # Copybara labels synced commits with "Internal-RevId: <internal-hash>".
            LAST_SYNCED=$(gh api "repos/${{ github.repository }}/commits" \
              --method GET -f sha=master -f per_page=100 \
              --jq 'map(select(.commit.message | test("Internal-RevId:"))) | .[0].commit.message' \
            | grep -oP 'Internal-RevId: \K[a-f0-9]+')

            if [ -z "$LAST_SYNCED" ]; then
              echo "Could not find any Internal-RevId in the last 100 OSS commits"
              return 1
            fi

            # Check how many internal commits are ahead of the last synced one.
            COMPARE=$(gh api "repos/dagster-io/internal/compare/${LAST_SYNCED}...master")
            AHEAD=$(echo "$COMPARE" | jq '.ahead_by')

            if [ "$AHEAD" -eq 0 ]; then
              echo "Internal-to-OSS sync is caught up"
              return 0
            fi

            # Filter out commits that originated from OSS (noops for this direction).
            # Then check if any remaining commits touch public/dagster/.
            UNSYNCED_SHAS=$(echo "$COMPARE" | jq -r \
              '[.commits[] | select(.commit.message | test("Dagster-RevId:") | not)] | .[].sha')

            if [ -z "$UNSYNCED_SHAS" ]; then
              echo "All $AHEAD ahead commits originated from OSS — sync is effectively caught up"
              return 0
            fi

            # Check if any unsynced commits touch public/dagster/.
            BLOCKING=0
            for SHA in $UNSYNCED_SHAS; do
              TOUCHES_OSS=$(gh api "repos/dagster-io/internal/commits/$SHA" \
                --jq '[.files[].filename | select(startswith("public/dagster/"))] | length')
              if [ "$TOUCHES_OSS" -gt 0 ]; then
                BLOCKING=$((BLOCKING + 1))
              fi
            done

            if [ "$BLOCKING" -eq 0 ]; then
              echo "No unsynced internal commits touch public/dagster/ — safe to merge"
              return 0
            fi

            echo "$BLOCKING unsynced internal commit(s) touch public/dagster/"
            return 1
          }

          MAX_ATTEMPTS=30  # 10 minutes at 20s intervals
          for i in $(seq 1 $MAX_ATTEMPTS); do
            if check_sync; then
              echo "Sync gate passed"
              exit 0
            fi

            if [ "$i" -eq 1 ]; then
              echo "Waiting for copybara internal-to-public sync to complete..."
            fi

            if [ "$i" -eq "$MAX_ATTEMPTS" ]; then
              echo "::error::Internal-to-OSS sync did not catch up within 10 minutes."
              echo "::error::Merging this PR risks having internal changes silently reverted."
              echo "::error::Check the 'copybara: internal to public' Buildkite pipeline."
              exit 1
            fi

            sleep 20
          done



This workflow is configured as a required status check in GitHub branch protection. It runs on both pull request events and merge_group, which is GitHub’s event for entering the merge queue. The practical reason it runs on pull request events at all is that GitHub does not expose a way to make a check required for merge_group but not for pull request events. So the check passes through on ordinary pull request events, and does the real enforcement on merge_group. If the corresponding Copybara pipeline is behind, the merge waits instead of risking a silent revert.
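
The passthrough behavior falls out of the step-level `if` in the workflow above: on `pull_request` events the gate job runs zero steps and reports success immediately, which satisfies the required check and lets the PR enter the queue, while on `merge_group` the step actually executes. Stripped to its skeleton, the pattern looks like this (the script name is an illustrative stand-in for the real check):

```yaml
on:
  pull_request:
  merge_group:

jobs:
  copybara-sync-gate:
    runs-on: ubuntu-latest
    steps:
      # Skipped on pull_request -> the job succeeds instantly (passthrough).
      # Runs only at merge time, inside the merge queue.
      - name: Enforce sync at merge time
        if: github.event_name == 'merge_group'
        run: ./check_sync.sh  # hypothetical stand-in for the real check
```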

What this means for contributors

This architecture does mean that some kinds of changes, especially large cross-cutting ones, may be developed internally and then synced into the public repository rather than appearing first as a sequence of public pull requests. As a result, you may sometimes see fewer intermediate PRs in dagster than you would expect if every part of the work were happening directly in that repository.

For contributors, the practical point is that your workflow should not change. If you open a pull request against the public dagster repository, it should work the same way it always has. You should not need to think about the internal monorepo or the sync process in order to contribute normally.

At the same time, the public repository remains a real part of the development loop. Public changes still flow inward, and contributions to the open source project still shape Dagster in meaningful ways.

Over time, we think this structure will let us do a better job on both sides of that boundary. It should help us ship framework improvements more cleanly, keep the public repository more navigable, and reduce the incidental complexity that would otherwise build up between Dagster and Dagster+.

That is ultimately what we want from our development model: better engineering leverage internally, and a better open source project externally.

Have feedback or questions? Start a discussion in Slack or GitHub.

