GPU & NVIDIA Support

Screenshot: the host detail page, showing the NVIDIA driver and container toolkit deployment toggles plus the detected GPU inventory.

Trailer.dev has two separate GPU concerns that are easy to confuse:

  • Host-level NVIDIA deployment: the agent can deploy the NVIDIA driver and the NVIDIA Container Toolkit into the host’s Docker, so containers can use the GPU at all.
  • Per-workspace acceleration: a single workspace opts in to GPU sharing, DRI hardware acceleration, nested virtualization, or (for Windows VDI) full GPU passthrough.

The first is configured once per host. The second is configured per workspace.

The agent can deploy two host-scoped containers:

  • The NVIDIA driver (DeployDriver). Runs the nvidia-driver init entrypoint as a privileged container with a restart policy. It mounts /run/nvidia (shared propagation), /lib/firmware, and /var/log so the driver it installs is visible to the rest of the host.
  • The NVIDIA Container Toolkit (DeployToolkit). Runs nvidia-toolkit, installs the nvidia Docker runtime into the host’s /etc/docker, restarts Docker so the runtime is picked up, then stays running.

Both are configured per host:

| Setting | Meaning | Default |
|---|---|---|
| Deploy driver | Deploy the NVIDIA driver container | off |
| Deploy toolkit | Deploy the NVIDIA Container Toolkit container | off |
| Driver image | NVIDIA driver container image | nvcr.io/nvidia/driver:580.126.16 |
| Toolkit image | NVIDIA Container Toolkit image | nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04 |

These settings live on the host and are delivered to the agent. On each reconcile the agent prunes whichever of the two containers is disabled and deploys (or upgrades) whichever is enabled. If the running container's image does not match the configured image, the stale container is removed and a new one is deployed from the configured image.
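The reconcile rule can be sketched as a small decision function. This is an illustrative sketch only, not the agent's code: the `Desired` type, the `plan` function, and the action strings are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Desired:
    """Hypothetical view of one component's host settings."""
    enabled: bool  # the "Deploy driver" or "Deploy toolkit" toggle
    image: str     # the configured image reference


def plan(name: str, desired: Desired, running_image: Optional[str]) -> list[str]:
    """Return the ordered actions for one host-scoped container.

    running_image is None when no such container exists on the host.
    """
    if not desired.enabled:
        # Disabled components are pruned if they are present.
        return [f"remove {name}"] if running_image is not None else []
    if running_image is None:
        return [f"deploy {name} {desired.image}"]
    if running_image != desired.image:
        # Stale image: remove the old container, then deploy the configured one.
        return [f"remove {name}", f"deploy {name} {desired.image}"]
    return []  # already converged, nothing to do
```

Calling `plan` once for the driver and once for the toolkit on each reconcile yields an empty list whenever the host already matches its settings, which keeps the loop idempotent.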

When the toolkit is (re)deployed, the nvidia runtime becomes available, and the agent restarts itself to pick it up. Host-level deployment is Linux only.

flowchart TD
  A[Host NVIDIA config] --> B{deployDriver?}
  B -- yes --> C[Run nvidia-driver init container]
  B -- no --> D[Remove driver container]
  A --> E{deployToolkit?}
  E -- yes --> F[Run nvidia-toolkit, install nvidia runtime, restart Docker]
  E -- no --> G[Remove toolkit container]
  F --> H[Agent restarts to pick up nvidia runtime]

The driver container is optional. If an operator already manages the NVIDIA driver on the host, enable only the toolkit. The toolkit is always pointed at /run/nvidia/driver (via NVIDIA_DRIVER_ROOT and DRIVER_ROOT_CTR_PATH), so an operator-managed driver must be reachable there.

Independent of whether Trailer deployed the driver, the agent reports the host’s NVIDIA GPUs on every heartbeat. Two data sources are merged:

  • A scan of /sys/bus/pci/devices for NVIDIA display-class devices. This is the authoritative device list and works even with no driver loaded.
  • nvidia-smi -q -x output, when a driver is present. This enriches each device with product name, temperature, utilization, memory, power, and virtualization mode. nvidia-smi is run on the host, or, if that fails, inside the deployed driver container.
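The merge of the two sources can be sketched as follows. This is a minimal sketch under stated assumptions, not the agent's implementation: the function names are hypothetical, the XML element names (`gpu`, `pci/pci_bus_id`, `uuid`, `product_name`) are assumed from typical nvidia-smi -q -x output, and the bus ID is assumed to carry an 8-digit PCI domain that must be shortened to match sysfs names.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Optional

NVIDIA_VENDOR = "0x10de"  # PCI vendor ID for NVIDIA


def scan_pci(root: Path) -> dict[str, dict]:
    """List NVIDIA display-class devices under a sysfs-style PCI devices dir."""
    gpus: dict[str, dict] = {}
    for dev in sorted(root.iterdir()):
        vendor = (dev / "vendor").read_text().strip()
        pci_class = (dev / "class").read_text().strip()
        # PCI base class 0x03 is "display controller".
        if vendor == NVIDIA_VENDOR and pci_class.startswith("0x03"):
            bdf = dev.name.lower()
            gpus[bdf] = {"bdf": bdf}
    return gpus


def short_bdf(bus_id: str) -> str:
    """Shorten an 8-digit-domain bus ID to the 4-digit form sysfs uses."""
    domain, rest = bus_id.lower().split(":", 1)
    return f"{domain[-4:]}:{rest}"


def merge(gpus: dict[str, dict], smi_xml: Optional[str]) -> dict[str, dict]:
    """Enrich the PCI inventory with nvidia-smi fields where available."""
    if not smi_xml:
        return gpus  # no driver present: the PCI scan alone is reported
    for gpu in ET.fromstring(smi_xml).iter("gpu"):
        bus_id = gpu.findtext("pci/pci_bus_id") or gpu.get("id", "")
        entry = gpus.get(short_bdf(bus_id)) if ":" in bus_id else None
        if entry is None:
            continue  # the PCI scan stays the authoritative device list
        entry["uuid"] = gpu.findtext("uuid")
        entry["product_name"] = gpu.findtext("product_name")
    return gpus
```

In real use `root` would be `Path("/sys/bus/pci/devices")`. Keying both sources by BDF means a device with no driver loaded still appears in the inventory, just without the enriched fields.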

For each GPU the agent records its PCI address (BDF), PCI device ID, bound kernel driver, IOMMU group, and (when nvidia-smi is available) UUID and live metrics. GPU metrics are collected on every heartbeat regardless of configuration. The CollectGpuMetrics dynamic config flag only controls whether the server persists the time-series metrics. See Resource monitoring for how metrics surface in the UI.

A workspace has a boolean Hardware Acceleration option in its configuration. When enabled, the agent attaches DRI/render devices to the container:

  • On a Linux host: /dev/dri.
  • On a Windows (WSL) host: /dev/dxg, /dev/dri/card0, /dev/dri/renderD128, plus a read-only bind mount of /usr/lib/wsl.

This is vendor-neutral device sharing for OpenGL/Vulkan/VA-API style workloads. It does not require the NVIDIA toolkit. Toggling hardware acceleration on or off triggers a workspace recreate.

A workspace also has a Nested Virtualization option. When enabled, the agent attaches /dev/kvm and /dev/net/tun. This is what backs nested hypervisors and the Windows VDI runtime.
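The device lists for the two workspace options can be summarized in one function. This is a sketch for illustration: the `Attachments` type, the `attachments` function, and the `host_os` values are hypothetical, and it assumes the nested virtualization devices are the same regardless of host type, which the text above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Attachments:
    """Hypothetical container for what gets attached to a workspace."""
    devices: list[str] = field(default_factory=list)
    read_only_mounts: list[str] = field(default_factory=list)


def attachments(host_os: str, hardware_acceleration: bool,
                nested_virtualization: bool) -> Attachments:
    out = Attachments()
    if hardware_acceleration:
        if host_os == "windows":  # Windows (WSL) host
            out.devices += ["/dev/dxg", "/dev/dri/card0", "/dev/dri/renderD128"]
            out.read_only_mounts.append("/usr/lib/wsl")
        else:  # Linux host
            out.devices.append("/dev/dri")
    if nested_virtualization:
        # Assumption: identical on every host type.
        out.devices += ["/dev/kvm", "/dev/net/tun"]
    return out
```

For example, `attachments("linux", True, True).devices` evaluates to `["/dev/dri", "/dev/kvm", "/dev/net/tun"]`.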

A non-Windows workspace can have specific GPUs attached (each stored as a UUID plus BDF). For these workspaces the agent sets NVIDIA_VISIBLE_DEVICES to the attached GPUs’ UUIDs, so the NVIDIA Container Runtime exposes exactly those devices inside the container. This is the path that requires the toolkit and runtime described above.

If the workspace already sets NVIDIA_VISIBLE_DEVICES explicitly in its environment variables, the agent leaves that value alone.
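The precedence between attached GPUs and an explicit environment value can be sketched like this. The function name is hypothetical and the comma-separated UUID format is an assumption based on how the NVIDIA Container Runtime accepts device lists. The agent's actual code may differ.

```python
def with_visible_devices(env: dict[str, str], gpu_uuids: list[str]) -> dict[str, str]:
    """Return a copy of env with NVIDIA_VISIBLE_DEVICES filled in if needed."""
    result = dict(env)
    if "NVIDIA_VISIBLE_DEVICES" in result:
        return result  # an explicit workspace value always wins
    if gpu_uuids:
        # Assumption: UUIDs are joined with commas.
        result["NVIDIA_VISIBLE_DEVICES"] = ",".join(gpu_uuids)
    return result
```

A workspace with no attached GPUs and no explicit value gets no variable at all in this sketch.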

GPU sharing via NVIDIA_VISIBLE_DEVICES and GPU passthrough are mutually exclusive. A Windows VDI workspace hands its attached GPUs to the guest VM via VFIO instead of advertising them to the container runtime.

For a Windows VDI workspace the agent:

  • Auto-loads the vfio-pci kernel module if needed, then fails fast if it is still not loaded.
  • Validates each requested GPU. Mediated modes (vGPU / vSGA), MIG-enabled GPUs, GPUs with no IOMMU group, and GPUs whose IOMMU group is not cleanly isolated to one physical card are rejected.
  • Rebinds the whole IOMMU group (the GPU and its companion functions such as HDMI audio) to vfio-pci, then maps /dev/vfio/vfio plus each /dev/vfio/<group> device into the container.
  • Records the passed-through PCI addresses in a container label so the devices are released back to the host driver when the workspace is deleted.
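The validation step can be sketched as a pure function over the inventory data. This is an illustration under assumptions, not the agent's logic: the `Gpu` type and its field names are hypothetical, mediated modes are detected by looking for "vgpu" or "vsga" in the reported virtualization mode, and "cleanly isolated to one physical card" is interpreted here as every device in the IOMMU group sharing the GPU's PCI domain, bus, and device number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gpu:
    """Hypothetical slice of the per-GPU inventory record."""
    bdf: str                            # e.g. "0000:01:00.0"
    virtualization_mode: str = "None"   # as reported by nvidia-smi, if available
    mig_enabled: bool = False
    iommu_group: Optional[str] = None


def slot(bdf: str) -> str:
    """Strip the PCI function number: '0000:01:00.1' becomes '0000:01:00'."""
    return bdf.rsplit(".", 1)[0]


def rejection_reason(gpu: Gpu, group_members: list[str]) -> Optional[str]:
    """Return why a GPU cannot be passed through, or None if it is acceptable.

    group_members lists the BDFs of every device in the GPU's IOMMU group.
    """
    mode = gpu.virtualization_mode.lower()
    if "vgpu" in mode or "vsga" in mode:
        return "mediated mode (vGPU / vSGA)"
    if gpu.mig_enabled:
        return "MIG enabled"
    if gpu.iommu_group is None:
        return "no IOMMU group"
    # Assumed isolation rule: all group members are functions of the same card.
    if any(slot(member) != slot(gpu.bdf) for member in group_members):
        return "IOMMU group not isolated to one physical card"
    return None
```

Under this interpretation, a group containing the GPU and its HDMI audio function (for example 0000:01:00.0 and 0000:01:00.1) passes, while a group that also contains a device from another slot is rejected.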

Because VFIO pins all of the guest's RAM, the container's memlock ulimit is raised to unlimited. See Windows virtual desktops for the rest of the Windows VDI runtime.

Host-level driver/toolkit deployment, GPU inventory/metrics, and NVIDIA_VISIBLE_DEVICES sharing are NVIDIA-specific. Hardware acceleration (DRI device sharing) and VFIO passthrough are vendor-neutral at the device level, but the GPU detection and passthrough validation paths target NVIDIA display devices.