GPU Passthrough Architecture in ai-gpu Cluster (Cozystack + KubeVirt + NVIDIA H100)

Overview

Server: 8x NVIDIA H100 SXM5 80GB (NV18 all-to-all NVLink topology)

All 8 GPUs are configured for VFIO passthrough into KubeVirt virtual machines.

Passthrough Layers

Host (Ubuntu 24.04 + k3s + Cozystack)
  → VFIO-PCI driver (all 8 GPUs isolated at initramfs level)
    → KubeVirt virt-handler (device plugin → registers GPUs with kubelet)
      → KubeVirt (PCI passthrough into VMs)
        → Nested Kubernetes cluster (workload-main)
          → NVIDIA GPU Operator (installs drivers + device plugin inside VM)
            → Pods with GPU resource requests

1. GPU Isolation on Host

Method: initramfs driver_override

A script placed in /etc/initramfs-tools/scripts/init-top/ sets driver_override=vfio-pci for all 8 GPUs before the nvidia driver loads. This is critical: driverctl and udev rules fire later in the boot process and cause deadlocks (processes stuck in the D state). A sketch of such a script is shown below.

The vfio, vfio_iommu_type1, and vfio_pci modules are loaded via the initramfs. A reboot is required after configuration.
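
A minimal sketch, assuming the PCI addresses from the GPU Assignment table below and the filename referenced in the Rollback section; the actual script in the cluster may differ:

#!/bin/sh
# /etc/initramfs-tools/scripts/init-top/vfio-pci-override -- illustrative sketch
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in
    prereqs) prereqs; exit 0 ;;
esac

# Force vfio-pci onto all 8 H100s before any GPU driver can claim them.
for bdf in 0000:0a:00.0 0000:18:00.0 0000:3b:00.0 0000:44:00.0 \
           0000:87:00.0 0000:90:00.0 0000:b8:00.0 0000:c1:00.0; do
    if [ -e "/sys/bus/pci/devices/$bdf" ]; then
        echo vfio-pci > "/sys/bus/pci/devices/$bdf/driver_override"
    fi
done
modprobe vfio-pci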

GPU Assignment

GPU  PCI Address  Driver    Purpose
0    0A:00.0      vfio-pci  Passthrough
1    18:00.0      vfio-pci  Passthrough
2    3B:00.0      vfio-pci  Passthrough
3    44:00.0      vfio-pci  Passthrough
4    87:00.0      vfio-pci  Passthrough
5    90:00.0      vfio-pci  Passthrough
6    B8:00.0      vfio-pci  Passthrough
7    C1:00.0      vfio-pci  Passthrough

2. How GPUs Appear in Kubernetes (Host Cluster)

KubeVirt's virt-handler DaemonSet acts as a Kubernetes Device Plugin:

  1. Reads permittedHostDevices from KubeVirt CR — finds pciVendorSelector: "10DE:2330"
  2. Scans PCI bus for devices matching that vendor:device ID
  3. Finds all 8 H100 GPUs (PCI ID is the same regardless of bound driver)
  4. Registers them with kubelet via the Device Plugin API
  5. Node reports: nvidia.com/H100_SXM5_80GB: 8 in capacity/allocatable

Important: virt-handler does NOT check which driver is bound (nvidia vs vfio-pci). It counts PCI devices by ID. The actual VFIO binding only matters at VM start time: KubeVirt will fail to pass a GPU through to a VM if it is not bound to vfio-pci.
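
As a quick check on the host cluster (the node name is a placeholder), the resource should appear in the node's allocatable:

# Host cluster: the node should advertise all 8 passthrough GPUs as a device-plugin resource.
kubectl get node <node-name> \
  -o jsonpath="{.status.allocatable['nvidia\.com/H100_SXM5_80GB']}"
# Expected output: 8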

3. KubeVirt Configuration

KubeVirt is patched to allow H100 PCI passthrough into VMs:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: cozy-kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:2330"
          resourceName: "nvidia.com/H100_SXM5_80GB"
          externalResourceProvider: false
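
One way to apply this (a sketch; if Cozystack manages the KubeVirt CR itself, the same block belongs in its configuration instead):

# Merge the permittedHostDevices block into the existing KubeVirt CR (kubectl patch accepts YAML).
kubectl -n cozy-kubevirt patch kubevirt kubevirt --type=merge --patch '
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:2330"
          resourceName: "nvidia.com/H100_SXM5_80GB"
          externalResourceProvider: false
'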

4. Nested Kubernetes Cluster with GPU

The nested cluster requests GPUs and gets the NVIDIA GPU Operator automatically via the Cozystack gpuOperator addon:

apiVersion: apps.cozystack.io/v1alpha1
kind: Kubernetes
metadata:
  name: main
  namespace: tenant-workload
spec:
  nodeGroups:
    md0:
      minReplicas: 1
      maxReplicas: 1
      resources:
        cpu: 56
        memory: 512Gi
      gpus:
        - name: "nvidia.com/H100_SXM5_80GB"
        - name: "nvidia.com/H100_SXM5_80GB"
  addons:
    gpuOperator:
      enabled: true
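
Once the cluster is provisioned, the worker VM should carry both requested GPUs as host devices. A rough way to inspect this from the host cluster (the VMI name is a placeholder; whether Cozystack renders the GPUs under devices.gpus or devices.hostDevices in the VMI spec is an assumption to verify):

# Host cluster: list the tenant's VMIs and inspect their GPU / host-device sections.
kubectl -n tenant-workload get vmi
kubectl -n tenant-workload get vmi <vmi-name> -o yaml | grep -A4 -i -e 'gpus:' -e 'hostDevices:'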

5. How GPUs Appear in Kubernetes (Nested Cluster)

Inside the VM, the GPU appears as a regular PCI device — identical to bare-metal. The GPU Operator (deployed by Cozystack as an addon) handles everything:

  1. nvidia driver container — installs nvidia kernel driver inside the VM
  2. nvidia-device-plugin DaemonSet — registers GPUs with kubelet via Device Plugin API
  3. nvidia-container-toolkit — configures containerd runtime for GPU support in containers

The nested cluster's kubelet then reports nvidia.com/gpu: N in node status, and pods can request GPUs through standard resource requests/limits.

There is no difference from the nested cluster's perspective between a bare-metal GPU and a VFIO-passthrough GPU — the GPU Operator works identically in both cases.
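
A minimal end-to-end smoke test inside the nested cluster once the GPU Operator pods are running (the CUDA image tag is an assumption; any CUDA base image works):

# Nested cluster: request one GPU and run nvidia-smi to confirm the passthrough end to end.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl logs gpu-smoke-test   # after the pod completes: should list one H100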

6. Deployment Order

  1. Node preparation (k3s prerequisites)
  2. k3s installation
  3. GPU passthrough setup (all 8 GPUs) + reboot
  4. Cozystack deployment
  5. KubeVirt GPU patch
  6. Create nested cluster with GPU requests + GPU Operator addon

7. Verification

After boot, verify GPU isolation:

# All 8 GPUs should show vfio-pci
for gpu in 0a 18 3b 44 87 90 b8 c1; do
  echo "GPU $gpu:"; lspci -nnk -s $gpu:00.0 | grep driver
done
# Expected: Kernel driver in use: vfio-pci (for all GPUs)

8. Rollback

If boot fails after GPU passthrough configuration:

# Via IPMI/recovery console:
rm /etc/initramfs-tools/scripts/init-top/vfio-pci-override
update-initramfs -u -k all
reboot

Key Technical Details

  • IOMMU must be enabled on the host (iommu=pt in kernel params); see the quick checks after this list
  • No nvidia driver on host — all 8 GPUs are vfio-pci, host has no GPU access
  • No nvidia-docker/toolkit on host — containerd is sufficient, GPU passthrough happens at KVM/VFIO level
  • GPU Operator runs inside nested cluster VMs — handles drivers, CUDA, container runtime
  • PCI device IDs: NVIDIA H100 SXM5 80GB = 10DE:2330
  • NVLink topology: NV18 all-to-all; all GPUs are peers, so which GPUs are assigned to a VM has no topology impact
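
Quick host-side checks for the IOMMU requirement (standard /proc and sysfs paths; GPU 0 at 0a:00.0 is used as the example; the vfio-pci binding check is covered in Verification above):

# IOMMU enabled and GPUs placed in IOMMU groups.
grep -o 'iommu=pt' /proc/cmdline
dmesg | grep -i -e DMAR -e 'AMD-Vi' -e IOMMU | head
find /sys/kernel/iommu_groups/ -name '0000:0a:00.0'
# The last command should print a path like /sys/kernel/iommu_groups/<N>/devices/0000:0a:00.0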