Server: 8x NVIDIA H100 SXM5 80GB (NV18 all-to-all NVLink topology)
All 8 GPUs are configured for VFIO passthrough into KubeVirt virtual machines.
Host (Ubuntu 24.04 + k3s + Cozystack)
→ VFIO-PCI driver (all 8 GPUs isolated at initramfs level)
→ KubeVirt virt-handler (device plugin → registers GPUs with kubelet)
→ KubeVirt (PCI passthrough into VMs)
→ Nested Kubernetes cluster (workload-main)
→ NVIDIA GPU Operator (installs drivers + device plugin inside VM)
→ Pods with GPU resource requests
Method: initramfs driver_override
A script placed in `/etc/initramfs-tools/scripts/init-top/` sets `driver_override=vfio-pci` for all 8 GPUs before the nvidia driver can load. This is critical: driverctl and udev rules fire later in boot and cause a deadlock (processes stuck in D state).
Modules vfio, vfio_iommu_type1, and vfio_pci are loaded via the initramfs. A reboot is required after configuration.
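The exact script is not reproduced here; below is a minimal sketch, assuming the filename `vfio-pci-override` used in the recovery step at the end of this section and the PCI addresses from the table that follows:

```bash
#!/bin/sh
# /etc/initramfs-tools/scripts/init-top/vfio-pci-override (sketch)
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in
    prereqs) prereqs; exit 0 ;;
esac

# Set driver_override for all 8 H100s before any GPU driver can claim them.
# Addresses must match the PCI table below.
for dev in 0000:0a:00.0 0000:18:00.0 0000:3b:00.0 0000:44:00.0 \
           0000:87:00.0 0000:90:00.0 0000:b8:00.0 0000:c1:00.0; do
    if [ -e "/sys/bus/pci/devices/$dev/driver_override" ]; then
        echo vfio-pci > "/sys/bus/pci/devices/$dev/driver_override"
    fi
done
```

The script must be executable and baked in with `update-initramfs -u -k all`; the recovery procedure at the end of this section simply removes it and rebuilds the initramfs.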
| GPU | PCI Address | Driver | Purpose |
|---|---|---|---|
| 0 | 0A:00.0 | vfio-pci | Passthrough |
| 1 | 18:00.0 | vfio-pci | Passthrough |
| 2 | 3B:00.0 | vfio-pci | Passthrough |
| 3 | 44:00.0 | vfio-pci | Passthrough |
| 4 | 87:00.0 | vfio-pci | Passthrough |
| 5 | 90:00.0 | vfio-pci | Passthrough |
| 6 | B8:00.0 | vfio-pci | Passthrough |
| 7 | C1:00.0 | vfio-pci | Passthrough |
KubeVirt's virt-handler DaemonSet acts as a Kubernetes Device Plugin:
- Reads `permittedHostDevices` from the KubeVirt CR — finds `pciVendorSelector: "10DE:2330"`
- Scans the PCI bus for devices matching that vendor:device ID
- Finds all 8 H100 GPUs (the PCI ID is the same regardless of the bound driver)
- Registers them with kubelet via the Device Plugin API
- Node reports `nvidia.com/H100_SXM5_80GB: 8` in capacity/allocatable
Important: virt-handler does NOT check which driver is bound (nvidia vs vfio-pci). It counts PCI devices by ID. The actual VFIO binding only matters at VM start time — KubeVirt will fail to pass a GPU through if it isn't bound to vfio-pci.
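A quick check on the host cluster that the resource actually shows up (substitute the real node name):

```bash
# Host cluster: virt-handler should have registered 8 devices with kubelet
kubectl describe node <node-name> | grep H100_SXM5_80GB
# Expected under both Capacity and Allocatable:
#   nvidia.com/H100_SXM5_80GB:  8
```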
KubeVirt is patched to allow H100 PCI passthrough into VMs:
```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: cozy-kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:2330"
          resourceName: "nvidia.com/H100_SXM5_80GB"
          externalResourceProvider: false
```
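One way to apply this is a merge patch against the KubeVirt CR (a sketch; if Cozystack reconciles this CR itself, the change may need to go through its own configuration instead):

```bash
kubectl patch kubevirt kubevirt -n cozy-kubevirt --type merge -p '
{"spec": {"configuration": {"permittedHostDevices": {"pciHostDevices": [
  {"pciVendorSelector": "10DE:2330",
   "resourceName": "nvidia.com/H100_SXM5_80GB",
   "externalResourceProvider": false}
]}}}}'
```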
The nested cluster requests GPUs and gets the NVIDIA GPU Operator automatically via a Cozystack addon:

```yaml
apiVersion: apps.cozystack.io/v1alpha1
kind: Kubernetes
metadata:
  name: main
  namespace: tenant-workload
spec:
  nodeGroups:
    md0:
      minReplicas: 1
      maxReplicas: 1
      resources:
        cpu: 56
        memory: 512Gi
      gpus:
        - name: "nvidia.com/H100_SXM5_80GB"
        - name: "nvidia.com/H100_SXM5_80GB"
  addons:
    gpuOperator:
      enabled: true
```

Inside the VM, the GPU appears as a regular PCI device — identical to bare metal. The GPU Operator (deployed by Cozystack as an addon) handles everything:
- nvidia driver container — installs nvidia kernel driver inside the VM
- nvidia-device-plugin DaemonSet — registers GPUs with kubelet via Device Plugin API
- nvidia-container-toolkit — configures containerd runtime for GPU support in containers
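A quick way to confirm these components are running (the exact namespace depends on how the addon installs the operator, so this just greps across all namespaces):

```bash
# Against the nested cluster's kubeconfig
kubectl get pods -A | grep -Ei 'gpu-operator|nvidia'
```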
The nested cluster's kubelet then reports nvidia.com/gpu: N in node status, and pods can request GPUs through standard resource requests/limits.
There is no difference from the nested cluster's perspective between bare-metal GPU and VFIO-passthrough GPU — GPU Operator works identically in both cases.
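A minimal smoke test from inside the nested cluster, assuming nothing beyond the standard resource name and a public CUDA base image (the tag is illustrative):

```bash
# Nested cluster: kubelet should report the GPUs
kubectl describe nodes | grep 'nvidia.com/gpu'

# Throwaway pod that requests one GPU and prints nvidia-smi output
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

kubectl logs gpu-smoke-test   # once the pod completes; should list an H100
```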
- Node preparation (k3s prerequisites)
- k3s installation
- GPU passthrough setup (all 8 GPUs) + reboot
- Cozystack deployment
- KubeVirt GPU patch
- Create nested cluster with GPU requests + GPU Operator addon
After boot, verify GPU isolation:
```bash
# All 8 GPUs should show vfio-pci
for gpu in 0a 18 3b 44 87 90 b8 c1; do
  echo "GPU $gpu:"; lspci -nnk -s $gpu:00.0 | grep driver
done
# Expected: Kernel driver in use: vfio-pci (for all GPUs)
```

If boot fails after GPU passthrough configuration:
```bash
# Via IPMI/recovery console:
rm /etc/initramfs-tools/scripts/init-top/vfio-pci-override
update-initramfs -u -k all
reboot
```

- IOMMU must be enabled on the host (`iommu=pt` in kernel params)
- No nvidia driver on the host — all 8 GPUs are vfio-pci, the host has no GPU access
- No nvidia-docker/toolkit on the host — containerd is sufficient, GPU passthrough happens at the KVM/VFIO level
- GPU Operator runs inside the nested cluster VMs — handles drivers, CUDA, container runtime
- PCI device ID: NVIDIA H100 SXM5 80GB = `10DE:2330`
- NVLink topology: NV18 all-to-all — all GPUs are peers, no hierarchy impact from selection
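Two quick host-side sanity checks that follow from the notes above:

```bash
# Count H100s by PCI vendor:device ID; expect 8
lspci -nn -d 10de:2330 | wc -l

# Non-empty output means the IOMMU is active and groups were created
ls /sys/kernel/iommu_groups/ | head
```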