GPU & NVIDIA Support
Trailer.dev has two separate GPU concerns that are easy to confuse:
- Host-level NVIDIA deployment: the agent can deploy the NVIDIA driver and the NVIDIA Container Toolkit into the host's Docker, so containers can use the GPU at all.
- Per-workspace acceleration: a single workspace opts in to GPU sharing, DRI hardware acceleration, nested virtualization, or (for Windows VDI) full GPU passthrough.
The first is configured once per host. The second is configured per workspace.
Host-level NVIDIA driver and toolkit
The agent can deploy two host-scoped containers:
- The NVIDIA driver (DeployDriver). Runs the nvidia-driver init entrypoint as a privileged container with a restart policy. It mounts /run/nvidia (shared propagation), /lib/firmware, and /var/log so the driver it installs is visible to the rest of the host.
- The NVIDIA Container Toolkit (DeployToolkit). Runs nvidia-toolkit, installs the nvidia Docker runtime into the host's /etc/docker, restarts Docker so the runtime is picked up, then stays running.
Both are configured per host:
| Setting | Meaning | Default |
|---|---|---|
| Deploy driver | Deploy the NVIDIA driver container | off |
| Deploy toolkit | Deploy the NVIDIA Container Toolkit container | off |
| Driver image | NVIDIA driver container image | nvcr.io/nvidia/driver:580.126.16 |
| Toolkit image | NVIDIA Container Toolkit image | nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04 |
These settings live on the host and are delivered to the agent. On each reconcile the agent prunes whichever of the two is disabled and deploys (or upgrades) whichever is enabled. If the running container's image does not match the desired version, the stale one is removed and the new version deployed.
When the toolkit is (re)deployed the nvidia runtime becomes available, so the agent restarts itself to pick it up. Host-level deployment is Linux only.
flowchart TD
A[Host NVIDIA config] --> B{deployDriver?}
B -- yes --> C[Run nvidia-driver init container]
B -- no --> D[Remove driver container]
A --> E{deployToolkit?}
E -- yes --> F[Run nvidia-toolkit, install nvidia runtime, restart Docker]
E -- no --> G[Remove toolkit container]
F --> H[Agent restarts to pick up nvidia runtime]
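The reconcile behavior described above can be sketched as a pure planning function. This is a minimal illustration, not the agent's actual code: the NvidiaHostConfig field names, the plan_reconcile helper, and the action tuples are all hypothetical, and only the default image names come from the settings table.

```python
from dataclasses import dataclass

# Defaults copied from the settings table.
DRIVER_IMAGE = "nvcr.io/nvidia/driver:580.126.16"
TOOLKIT_IMAGE = "nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04"


@dataclass(frozen=True)
class NvidiaHostConfig:
    # Hypothetical field names; both deployments are off by default.
    deploy_driver: bool = False
    deploy_toolkit: bool = False
    driver_image: str = DRIVER_IMAGE
    toolkit_image: str = TOOLKIT_IMAGE


def plan_reconcile(cfg, running):
    """Return the ordered actions for one reconcile pass.

    `running` maps a component name ("driver" or "toolkit") to the image of
    the container that currently exists; a missing key means no container.
    """
    wanted = {
        "driver": (cfg.deploy_driver, cfg.driver_image),
        "toolkit": (cfg.deploy_toolkit, cfg.toolkit_image),
    }
    actions = []
    for name, (enabled, image) in wanted.items():
        current = running.get(name)
        if not enabled:
            # Prune whichever component is disabled.
            if current is not None:
                actions.append(("remove", name))
            continue
        if current == image:
            continue  # already running the desired version
        if current is not None:
            actions.append(("remove", name))  # stale image
        actions.append(("deploy", name, image))
        if name == "toolkit":
            # A (re)deployed toolkit makes the nvidia runtime available,
            # so the agent restarts itself to pick it up.
            actions.append(("restart_agent",))
    return actions
```

Returning a plan instead of acting directly keeps the decision logic easy to test without a Docker daemon.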
The driver container is optional. If an operator already manages the NVIDIA driver on the host, enable only the toolkit. The toolkit is always pointed at /run/nvidia/driver (via NVIDIA_DRIVER_ROOT and DRIVER_ROOT_CTR_PATH), so an operator-managed driver must be reachable there.
GPU inventory and metrics
Independent of whether Trailer deployed the driver, the agent reports the host's NVIDIA GPUs on every heartbeat. Two data sources are merged:
- A scan of /sys/bus/pci/devices for NVIDIA display-class devices. This is the authoritative device list and works even with no driver loaded.
- nvidia-smi -q -x output, when a driver is present. This enriches each device with product name, temperature, utilization, memory, power, and virtualization mode. nvidia-smi is run on the host, or, if that fails, inside the deployed driver container.
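A PCI scan of this kind can be sketched with nothing but sysfs reads. The function and field names below are hypothetical and this is not the agent's implementation; the sysfs layout it relies on (the vendor, class, and device files, plus the driver and iommu_group symlinks) is standard on Linux.

```python
import os
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def _read(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _link_name(path):
    """Return the basename of a symlink target, or None if the link is absent."""
    try:
        return Path(os.readlink(path)).name
    except OSError:
        return None


def scan_nvidia_gpus(sysfs_root="/sys/bus/pci/devices"):
    """List NVIDIA display-class PCI devices; works with no driver loaded."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    gpus = []
    for dev in sorted(root.iterdir()):
        vendor = _read(dev / "vendor")
        pci_class = _read(dev / "class") or ""
        # Base class 0x03 is "display controller" (VGA, 3D controller, ...).
        if vendor != NVIDIA_VENDOR_ID or not pci_class.startswith("0x03"):
            continue
        gpus.append({
            "bdf": dev.name,                        # e.g. 0000:01:00.0
            "device_id": _read(dev / "device"),
            "driver": _link_name(dev / "driver"),   # None when unbound
            "iommu_group": _link_name(dev / "iommu_group"),
        })
    return gpus
```

Filtering on the class code is what keeps companion functions such as the HDMI audio controller (class 0x04) out of the GPU list.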
For each GPU the agent records its PCI address (BDF), PCI device ID, bound kernel driver, IOMMU group, and (when nvidia-smi is available) UUID and live metrics. GPU metrics are collected on every heartbeat regardless of configuration. The CollectGpuMetrics dynamic config flag only controls whether the server persists the time-series metrics. See Resource monitoring for how metrics surface in the UI.
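Merging the two sources comes down to joining on the PCI address. The sketch below assumes the nvidia-smi XML exposes each GPU as a gpu element whose id attribute is the PCI bus ID, with child elements such as uuid and product_name; treat those element names, and the merge_smi and normalize_bdf helpers, as assumptions to verify against real nvidia-smi -q -x output rather than as the agent's actual parser.

```python
import xml.etree.ElementTree as ET


def normalize_bdf(bus_id):
    """Convert an 8-digit PCI domain (as nvidia-smi prints it) to sysfs's 4-digit form."""
    domain, rest = bus_id.lower().split(":", 1)
    return f"{int(domain, 16):04x}:{rest}"


def merge_smi(devices, smi_xml):
    """Return copies of the PCI-scan records enriched with nvidia-smi data."""
    by_bdf = {}
    for gpu in ET.fromstring(smi_xml).iter("gpu"):
        by_bdf[normalize_bdf(gpu.get("id"))] = {
            # Element paths are assumed; missing elements yield None.
            "uuid": gpu.findtext("uuid"),
            "product_name": gpu.findtext("product_name"),
            "temperature": gpu.findtext("temperature/gpu_temp"),
            "utilization": gpu.findtext("utilization/gpu_util"),
        }
    # The PCI scan stays authoritative: devices unknown to nvidia-smi are
    # kept unchanged, and nvidia-smi entries with no PCI match are ignored.
    return [{**dev, **by_bdf.get(dev["bdf"], {})} for dev in devices]
```

A GPU bound to vfio-pci, for example, would still appear in the result with its BDF and IOMMU group but without a UUID or live metrics.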
Per-workspace hardware acceleration
A workspace has a boolean Hardware Acceleration option in its configuration. When enabled, the agent attaches DRI/render devices to the container:
- On a Linux host: /dev/dri.
- On a Windows (WSL) host: /dev/dxg, /dev/dri/card0, /dev/dri/renderD128, plus a read-only bind mount of /usr/lib/wsl.
This is vendor-neutral device sharing for OpenGL/Vulkan/VA-API style workloads. It does not require the NVIDIA toolkit. Toggling hardware acceleration on or off triggers a workspace recreate.
A workspace also has a Nested Virtualization option. When enabled the agent attaches /dev/kvm and /dev/net/tun. This is what backs nested hypervisors and the Windows VDI runtime.
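The two options translate into a small set of device and mount attachments. The following sketch only restates the rules above in code; the function name, the host_os values, and the assumption that /usr/lib/wsl is mounted at the same path inside the container are illustrative, not taken from the agent.

```python
def workspace_attachments(host_os, hardware_acceleration=False,
                          nested_virtualization=False):
    """Return (devices, mounts) to attach to a workspace container.

    `host_os` is "linux" or "windows" (a Windows host running WSL).
    """
    devices, mounts = [], []
    if hardware_acceleration:
        if host_os == "linux":
            devices.append("/dev/dri")
        elif host_os == "windows":
            devices += ["/dev/dxg", "/dev/dri/card0", "/dev/dri/renderD128"]
            # Read-only bind mount; the in-container target path is assumed.
            mounts.append("/usr/lib/wsl:/usr/lib/wsl:ro")
    if nested_virtualization:
        devices += ["/dev/kvm", "/dev/net/tun"]
    return devices, mounts
```

Neither branch touches NVIDIA_VISIBLE_DEVICES, which reflects that this path is vendor-neutral and independent of the toolkit.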
Per-workspace NVIDIA GPU sharing
A non-Windows workspace can have specific GPUs attached (each stored as a UUID plus BDF). For these workspaces the agent sets NVIDIA_VISIBLE_DEVICES to the attached GPUs' UUIDs, so the NVIDIA Container Runtime exposes exactly those devices inside the container. This is the path that requires the toolkit and runtime described above.
If the workspace already sets NVIDIA_VISIBLE_DEVICES explicitly in its environment variables, the agent leaves that value alone.
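As a rough sketch of that precedence rule, assuming attached GPUs are represented as dictionaries with a uuid key (a hypothetical shape) and that the UUIDs are joined with commas, which is the list format NVIDIA_VISIBLE_DEVICES accepts:

```python
def apply_gpu_env(env, attached_gpus):
    """Return a copy of `env` with NVIDIA_VISIBLE_DEVICES set when appropriate."""
    result = dict(env)
    # An explicit value in the workspace environment always wins.
    if "NVIDIA_VISIBLE_DEVICES" in result or not attached_gpus:
        return result
    result["NVIDIA_VISIBLE_DEVICES"] = ",".join(g["uuid"] for g in attached_gpus)
    return result
```

For example, a workspace with two attached GPUs and no explicit override would end up with both UUIDs in the variable, while a workspace that sets NVIDIA_VISIBLE_DEVICES=all keeps that value untouched.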
Windows VDI GPU passthrough
GPU sharing via NVIDIA_VISIBLE_DEVICES and GPU passthrough are mutually exclusive. A Windows VDI workspace hands its attached GPUs to the guest VM via VFIO instead of advertising them to the container runtime.
For a Windows VDI workspace the agent:
- Auto-loads the vfio-pci kernel module if needed, then fails fast if it is still not loaded.
- Validates each requested GPU. Mediated modes (vGPU / vSGA), MIG-enabled GPUs, GPUs with no IOMMU group, and GPUs whose IOMMU group is not cleanly isolated to one physical card are rejected.
- Rebinds the whole IOMMU group (the GPU and its companion functions such as HDMI audio) to vfio-pci, then maps /dev/vfio/vfio plus each /dev/vfio/<group> device into the container.
- Records the passed-through PCI addresses in a container label so the devices are released back to the host driver when the workspace is deleted.
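The validation and device-mapping steps above can be illustrated as follows. This is a sketch under stated assumptions, not the agent's logic: the Gpu record and helper names are hypothetical, and "isolated to one physical card" is interpreted here as every device in the IOMMU group sharing the GPU's domain:bus:device prefix, which may be stricter or looser than the real check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gpu:
    bdf: str                     # e.g. "0000:01:00.0"
    iommu_group: Optional[str]   # e.g. "14", or None when there is no group
    mediated: bool = False       # vGPU / vSGA mode
    mig_enabled: bool = False


def physical_slot(bdf):
    """'0000:01:00.1' -> '0000:01:00'; functions of one card share this prefix."""
    return bdf.rsplit(".", 1)[0]


def validate_for_passthrough(gpu, group_members):
    """Return a rejection reason, or None if the GPU may be passed through.

    `group_members` lists the BDFs in the GPU's IOMMU group, as found under
    /sys/kernel/iommu_groups/<group>/devices.
    """
    if gpu.mediated:
        return "mediated mode (vGPU / vSGA)"
    if gpu.mig_enabled:
        return "MIG is enabled"
    if gpu.iommu_group is None:
        return "no IOMMU group"
    if {physical_slot(b) for b in group_members} != {physical_slot(gpu.bdf)}:
        return "IOMMU group is not isolated to one physical card"
    return None


def vfio_device_paths(groups):
    """Device nodes to map: the VFIO control node plus one node per group."""
    return ["/dev/vfio/vfio"] + [f"/dev/vfio/{g}" for g in sorted(set(groups), key=int)]
```

A GPU at 0000:01:00.0 whose group also contains its audio function 0000:01:00.1 passes this check, while a group that also includes a device on another bus is rejected.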
Because VFIO pins all of the guest's RAM, the container's memlock ulimit is raised to unbounded. See Windows virtual desktops for the rest of the Windows VDI runtime.
Supported vendors
Host-level driver/toolkit deployment, GPU inventory/metrics, and NVIDIA_VISIBLE_DEVICES sharing are NVIDIA-specific. Hardware acceleration (DRI device sharing) and VFIO passthrough are vendor-neutral at the device level, but the GPU detection and passthrough validation paths target NVIDIA display devices.