Reconciliation

Reconciliation is how Trailer keeps what is actually running on a host in step with what you asked for. You describe the state you want through the web client or the API. The agent on each host continuously works to make that host match.

Desired state and observed state

There are two pictures of the world:

Desired state is what you configured: which workspaces, images, networks, and volumes should exist, and how each is set up. The server stores this in its database.
Observed state is what actually exists on the host right now: the running containers, their images, networks, and volumes.

Reconciliation is the process of comparing the two and making the changes that bring the host in line with the desired state.
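
As a rough illustration of that comparison, the sketch below diffs two name-to-configuration mappings and sorts the result into things to create, update, and remove. The `Plan` type, the `diff` function, and the dictionary shape are hypothetical and chosen only for this example; they are not Trailer's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Changes needed to bring a host in line with the desired state."""
    create: list[str] = field(default_factory=list)
    update: list[str] = field(default_factory=list)
    remove: list[str] = field(default_factory=list)


def diff(desired: dict[str, dict], observed: dict[str, dict]) -> Plan:
    """Compare two name -> config mappings (a hypothetical shape)."""
    plan = Plan()
    for name, config in desired.items():
        if name not in observed:
            plan.create.append(name)       # wanted but missing on the host
        elif observed[name] != config:
            plan.update.append(name)       # present but configured differently
    for name in observed:
        if name not in desired:
            plan.remove.append(name)       # present but no longer wanted
    return plan


desired = {"web": {"image": "app:2"}, "db": {"image": "postgres:16"}}
observed = {"web": {"image": "app:1"}, "old": {"image": "app:0"}}
plan = diff(desired, observed)
print(plan)
```

Here `db` would be created, `web` updated, and `old` removed.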

A pull model

The agent pulls work from the server rather than the server pushing changes to the agent. At a regular interval, which is configurable per host, the agent asks the server for the desired state of its host, compares that against what is running, and applies the difference. Because the agent pulls on a schedule, a brief network interruption or a restart is self-correcting: the next cycle simply picks up wherever things stand.

sequenceDiagram
    participant U as User
    participant S as Server
    participant A as Agent
    participant R as Container runtime
    U->>S: create or edit a workspace
    S->>S: validate and store desired state
    loop every reconciliation interval
        A->>S: ask for this host's desired state
        S-->>A: desired workspaces, images, networks, volumes
        A->>R: create, update, or remove resources
        A->>S: report status and messages
    end
    S-->>U: live status updates
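
The loop in the diagram can be sketched as follows. This is a minimal sketch, not the agent's real code: `run_agent`, `fetch_desired`, `observe`, and `converge` are hypothetical names, and treating an unreachable server as a `ConnectionError` is an assumption made for the example.

```python
import time
from typing import Callable


def run_agent(
    fetch_desired: Callable[[], dict],
    observe: Callable[[], dict],
    converge: Callable[[dict, dict], None],
    interval_seconds: float,
    max_cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run up to max_cycles pull cycles and return how many succeeded."""
    succeeded = 0
    for _ in range(max_cycles):
        try:
            desired = fetch_desired()       # pull: the agent asks the server
            converge(desired, observe())    # apply the difference on the host
            succeeded += 1
        except ConnectionError:
            pass  # server unreachable; the next cycle starts from scratch
        sleep(interval_seconds)
    return succeeded


# Simulated server that is unreachable on the second request.
calls = {"n": 0}

def flaky_fetch() -> dict:
    calls["n"] += 1
    if calls["n"] == 2:
        raise ConnectionError("server unreachable")
    return {"web": {"image": "app:2"}}

host: dict = {}

def observe() -> dict:
    return dict(host)

def converge(desired: dict, observed: dict) -> None:
    host.clear()
    host.update(desired)

ok = run_agent(flaky_fetch, observe, converge, 0.0, 3, sleep=lambda s: None)
print(ok, host)
```

The missed cycle needs no special recovery: the third cycle fetches the current desired state and converges as usual.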

The reconciliation cycle

Within a single cycle the agent works in a deliberate order so that each resource exists before anything that depends on it. Independent resources within a step are handled at the same time.

flowchart TD
    A["Shared prerequisites<br/>(GPU support, reverse proxy)"] --> B["Networks and volumes"]
    B --> C["Images<br/>(build or pull)"]
    C --> D["Workspaces<br/>(create, update, or recreate)"]
    D --> E["Snapshots<br/>(capture running workspaces)"]
    E --> F["Cleanup<br/>(remove resources no longer wanted)"]

Shared prerequisites. Host-wide services such as GPU support and the reverse proxy that routes requests are set up first, since workspaces may rely on them.
Networks and volumes. Created before any workspace attaches to a network or mounts a volume.
Images. Built or pulled before the workspaces that run from them.
Workspaces. Created, updated, or recreated to match their configuration.
Snapshots. Running workspaces are captured into new images once deployments settle.
Cleanup. Resources that are no longer in the desired state are removed. This step runs only when it is safe to do so (see Staying stable and idempotent).
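
One way to picture "ordered steps, concurrent within a step" is the sketch below, which uses Python's `asyncio`. The step names, the `run_cycle` function, and the `work` mapping are hypothetical; the real agent may schedule work quite differently.

```python
import asyncio

# Step order follows the list above; the identifiers are made up for this sketch.
STEPS = [
    "prerequisites",
    "networks_and_volumes",
    "images",
    "workspaces",
    "snapshots",
    "cleanup",
]


async def run_cycle(work: dict[str, list[str]], log: list[str]) -> None:
    async def handle(step: str, resource: str) -> None:
        await asyncio.sleep(0)  # stand-in for real work on the resource
        log.append(f"{step}:{resource}")

    for step in STEPS:
        # Resources within a step are independent, so they run concurrently.
        # gather() returns only once all of them have finished, which is what
        # guarantees the next step sees its dependencies in place.
        await asyncio.gather(*(handle(step, r) for r in work.get(step, [])))


log: list[str] = []
asyncio.run(
    run_cycle(
        {
            "workspaces": ["web", "api"],
            "images": ["app"],
            "networks_and_volumes": ["net", "data"],
        },
        log,
    )
)
print(log)
```

Whatever order the resources finish in within a step, every network and volume entry is logged before the image, and the image before either workspace.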

Applying changes to a workspace

When a workspace already exists, the agent decides whether the change can be applied to the running container or whether the container has to be replaced. Most properties of a container can only be set when it is created, so changing them means removing the old container and creating a fresh one in its place. The workspace keeps its identity, name, and attached storage across a recreate.

Changes that replace the container:

The workspace image
CPU, memory, or shared-memory limits
Environment variables
Ports
Volumes
GPU or nested virtualization access
Startup command
Routing settings

Changes applied without replacing the container:

Attaching or detaching a network: the container is reconnected in place.
Renaming the workspace: the existing container is renamed.
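
The decision can be sketched as a lookup over which fields changed. The field names below are hypothetical stand-ins for the properties in the two lists above, and treating any unrecognized field as requiring a recreate is an assumption of this sketch, not documented behavior.

```python
# Hypothetical field names; the grouping mirrors the two lists above.
RECREATE_FIELDS = {
    "image", "cpu_limit", "memory_limit", "shm_size", "env", "ports",
    "volumes", "gpu", "nested_virtualization", "command", "routing",
}
IN_PLACE_FIELDS = {"networks", "name"}


def plan_change(desired: dict, observed: dict) -> str:
    """Return 'none', 'in_place', or 'recreate' for a workspace change."""
    changed = {
        key
        for key in desired.keys() | observed.keys()
        if desired.get(key) != observed.get(key)
    }
    if not changed:
        return "none"
    # Only safe to patch the running container if every change is in-place.
    # Unknown fields fall through to 'recreate' (an assumption of this sketch).
    return "in_place" if changed <= IN_PLACE_FIELDS else "recreate"


running = {"name": "web", "image": "app:1", "networks": ["front"]}
print(plan_change({**running, "name": "web-2"}, running))
print(plan_change({**running, "image": "app:2"}, running))
```

A change that mixes both kinds, such as a rename together with a new image, resolves to a recreate.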

Images and builds

Images move through their own lifecycle, which you can watch live on the image detail page:

stateDiagram-v2
    [*] --> Pending
    Pending --> Building
    Building --> Finished
    Building --> Failed
    Building --> Cancelled
    Finished --> [*]

Custom images are built from your configuration on the host. Build output streams back so you can follow progress and read any errors.
External images are pulled from a registry, with credentials when the registry needs them.
Snapshots capture a running workspace into a new image after deployments in the cycle have completed.

A build that fails is reported with its error and can be retried. A build can also be cancelled while it is running.
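
The lifecycle in the diagram can be expressed as a small transition table. This is only a sketch: `ImageState`, `TRANSITIONS`, and `advance` are hypothetical names, and retry is deliberately not modeled because the diagram does not show which state a retried build returns to.

```python
from enum import Enum


class ImageState(Enum):
    PENDING = "Pending"
    BUILDING = "Building"
    FINISHED = "Finished"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Allowed moves, copied from the state diagram above.
TRANSITIONS = {
    ImageState.PENDING: {ImageState.BUILDING},
    ImageState.BUILDING: {
        ImageState.FINISHED,
        ImageState.FAILED,
        ImageState.CANCELLED,
    },
}


def advance(current: ImageState, target: ImageState) -> ImageState:
    """Return target if the diagram allows the move, else raise ValueError."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = advance(ImageState.PENDING, ImageState.BUILDING)
state = advance(state, ImageState.CANCELLED)  # cancelled while running
print(state.value)
```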

Reporting status

The agent reports back over two channels:

Per-resource updates are sent the moment a step finishes: a container created, a build completed, an error raised. This is what drives the live status badges and messages you see in the web client, so a workspace that starts, stops, or fails updates without a refresh. Workspaces move through states such as Deployment pending, Deploying, Starting, Running, Stopping, and Stopped, with Snapshotting while a running workspace is being captured, or Error if something goes wrong.
Heartbeats are sent on their own interval and carry the host’s overall health: the available runtimes and, when metrics are enabled, host, GPU, and per-workspace metrics. The server’s reply can adjust the host’s settings on the fly.

stateDiagram-v2
    dp: Deployment pending
    d: Deploying
    s: Starting
    r: Running
    st: Stopping
    sd: Stopped
    sn: Snapshotting
    e: Error
    [*] --> dp
    dp --> d
    d --> s
    s --> r
    r --> st
    st --> sd
    sd --> d: redeploy
    r --> sn
    sn --> r
    d --> e
    s --> e
    e --> d: retry

Staying stable and idempotent

Reconciliation runs continuously, so it is designed to be safe to repeat. Two ideas keep it from causing needless churn:

Thorough comparison. Before recreating anything, the agent compares the desired and observed configuration in detail rather than at a glance. Differences that do not actually matter, such as a different ordering of the same values, do not trigger a recreate. This prevents a workspace from being restarted on every cycle for no reason.
Overlapping-cycle safety. If one cycle takes a long time (a large image build, for example) and a newer cycle begins, the cleanup step is deferred to the newer cycle, which has the freshest desired state. This avoids a slow cycle removing something a newer cycle just created.
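
A common way to make a comparison ignore ordering is to normalize both sides before comparing. The sketch below assumes that `env`, `ports`, and `networks` are the order-insensitive fields; that choice and the function names are illustrative, not a statement of what the agent actually normalizes.

```python
# Assumption for this sketch: these fields hold values whose order is irrelevant.
ORDER_INSENSITIVE = ("env", "ports", "networks")


def normalize(config: dict) -> dict:
    """Return a copy with order-insensitive lists put into canonical order."""
    out = dict(config)
    for key in ORDER_INSENSITIVE:
        if key in out:
            out[key] = sorted(out[key])
    return out


def needs_recreate(desired: dict, observed: dict) -> bool:
    return normalize(desired) != normalize(observed)


desired = {"image": "app:2", "env": ["A=1", "B=2"], "ports": [8080, 443]}
observed = {"image": "app:2", "env": ["B=2", "A=1"], "ports": [443, 8080]}
print(needs_recreate(desired, observed))                       # same values, different order
print(needs_recreate({**desired, "env": ["A=1"]}, observed))   # a real difference
```

A naive `desired != observed` would report a difference for the first pair and restart the workspace on every cycle.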

Overlapping-cycle safety has a trade-off worth knowing about. Cleanup is the last step of a cycle, so it only runs once everything earlier in that cycle has finished. On a host under sustained heavy build activity, each cycle can stay busy long enough to be overtaken by the next one before it reaches cleanup, so cleanup keeps being deferred. When that happens, removing resources you have already deleted can lag well behind the rest of reconciliation. Capturing new snapshots, which also runs late in the cycle and waits on in-progress builds, can lag for the same reason. This is a delay rather than a loss: once the build load eases and cycles begin finishing before the next one starts, the pending cleanup and snapshots complete on the following cycle.
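
One plausible way to implement this deferral is a counter that records the newest cycle, with each cycle checking that it is still the newest before it cleans up. The `CycleTracker` class below is a hypothetical sketch of that idea; the source does not describe the agent's actual mechanism.

```python
import threading


class CycleTracker:
    """Tracks the newest cycle so that only it may run cleanup."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest = 0

    def begin(self) -> int:
        """Start a new cycle and return its id."""
        with self._lock:
            self._latest += 1
            return self._latest

    def may_clean_up(self, cycle_id: int) -> bool:
        """True only if no newer cycle has started since this one began."""
        with self._lock:
            return cycle_id == self._latest


tracker = CycleTracker()
slow = tracker.begin()    # e.g. stuck on a large image build
newer = tracker.begin()   # the next interval fires while 'slow' is still busy

print(tracker.may_clean_up(slow))   # overtaken, so it skips cleanup
print(tracker.may_clean_up(newer))  # newest cycle, so it may clean up
```

The same check also shows the trade-off: if a newer cycle always begins before the current one reaches cleanup, `may_clean_up` keeps returning `False` until the load eases.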

When things go wrong

Failures are contained and self-healing:

A failure on one resource is reported and does not stop the rest of the cycle from proceeding.
A failed cycle does not stop the loop. The next interval simply tries again with the current desired state.
Status updates are retried until they reach the server, so the web client eventually reflects the true state.
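
The first and third points can be sketched as follows. The function names are hypothetical, and the exponential backoff and the attempt cap are assumptions added so the example terminates; the text above says only that updates are retried until they reach the server.

```python
import time
from typing import Callable, Iterable


def reconcile_all(
    resources: Iterable[str], apply_one: Callable[[str], None]
) -> dict[str, str]:
    """Apply every resource, collecting errors instead of aborting the cycle."""
    errors: dict[str, str] = {}
    for name in resources:
        try:
            apply_one(name)
        except Exception as exc:  # contain the failure to this one resource
            errors[name] = str(exc)
    return errors


def send_with_retry(send, update, max_attempts=5, base_delay=0.5, sleep=time.sleep) -> int:
    """Send a status update, retrying on ConnectionError; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(update)
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # backoff is an assumption
    raise ValueError("max_attempts must be at least 1")


def apply_one(name: str) -> None:
    if name == "broken":
        raise RuntimeError("image not found")

errors = reconcile_all(["web", "broken", "api"], apply_one)

outbox: list[dict] = []
failures = iter([True, True, False])

def flaky_send(update: dict) -> None:
    if next(failures):
        raise ConnectionError("server unreachable")
    outbox.append(update)

attempts = send_with_retry(flaky_send, {"web": "Running"}, sleep=lambda s: None)
print(errors, attempts)
```

The failure on `broken` is recorded while `web` and `api` still proceed, and the status update arrives on the third attempt.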

Tuning reconciliation

The reconciliation interval, the heartbeat interval, and whether a host is enabled at all are per-host settings you can change from the host’s detail page. They take effect without redeploying the agent. Disabling a host tells its agent to stop the host’s workspaces on the next cycle. See the Configuration Reference for the full list of host settings.
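
Since the server's heartbeat reply can adjust a host's settings, an agent might fold that reply into its current settings along these lines. The `HostSettings` fields, the reply keys, and the default values are placeholders invented for this sketch; see the Configuration Reference for the real setting names.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class HostSettings:
    # Placeholder defaults for illustration only, not Trailer's defaults.
    reconcile_interval_s: float = 30.0
    heartbeat_interval_s: float = 10.0
    enabled: bool = True


def apply_reply(current: HostSettings, reply: dict) -> HostSettings:
    """Return new settings with any recognized keys from the reply applied."""
    known_names = {f.name for f in fields(HostSettings)}
    known = {k: v for k, v in reply.items() if k in known_names}
    return replace(current, **known)


settings = HostSettings()
settings = apply_reply(settings, {"reconcile_interval_s": 5.0, "unrelated": 1})
print(settings)
```

Because the settings are replaced in memory, the next cycle picks up the new interval without restarting or redeploying the agent.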