Server: 8x NVIDIA H100 SXM5 80GB (NV18 all-to-all NVLink topology)
All 8 GPUs are configured for VFIO passthrough into KubeVirt virtual machines.
Host (Ubuntu 24.04 + k3s + Cozystack)
→ VFIO-PCI driver (all 8 GPUs isolated at initramfs level)
→ KubeVirt virt-handler (device plugin → registers GPUs with kubelet)
→ KubeVirt (PCI passthrough into VMs)
→ Nested Kubernetes cluster (workload-main)
→ NVIDIA GPU Operator (installs drivers + device plugin inside VM)
→ Pods with GPU resource requests
Method: initramfs driver_override
A script placed in `/etc/initramfs-tools/scripts/init-top/` sets `driver_override=vfio-pci` for all 8 GPUs before the nvidia driver can load. This is critical: driverctl and udev rules fire later in boot and cause a deadlock (processes stuck in D state).
Modules vfio, vfio_iommu_type1, and vfio_pci are loaded via the initramfs. A reboot is required after configuration.
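The exact script is not reproduced here; below is a minimal sketch, assuming the filename `vfio-pci-override` used in the recovery step at the end of this section and the PCI addresses from the table that follows:

```bash
#!/bin/sh
# /etc/initramfs-tools/scripts/init-top/vfio-pci-override (sketch)
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in
    prereqs) prereqs; exit 0 ;;
esac

# Set driver_override for all 8 H100s before any GPU driver can claim them.
# Addresses must match the PCI table below.
for dev in 0000:0a:00.0 0000:18:00.0 0000:3b:00.0 0000:44:00.0 \
           0000:87:00.0 0000:90:00.0 0000:b8:00.0 0000:c1:00.0; do
    if [ -e "/sys/bus/pci/devices/$dev/driver_override" ]; then
        echo vfio-pci > "/sys/bus/pci/devices/$dev/driver_override"
    fi
done
```

The script must be executable and baked in with `update-initramfs -u -k all`; the recovery procedure at the end of this section simply removes it and rebuilds the initramfs.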
| GPU | PCI Address | Driver | Purpose |
|---|---|---|---|
| 0 | 0A:00.0 | vfio-pci | Passthrough |
| 1 | 18:00.0 | vfio-pci | Passthrough |
| 2 | 3B:00.0 | vfio-pci | Passthrough |
| 3 | 44:00.0 | vfio-pci | Passthrough |
| 4 | 87:00.0 | vfio-pci | Passthrough |
| 5 | 90:00.0 | vfio-pci | Passthrough |
| 6 | B8:00.0 | vfio-pci | Passthrough |
| 7 | C1:00.0 | vfio-pci | Passthrough |
KubeVirt's virt-handler DaemonSet acts as a Kubernetes Device Plugin:
- Reads `permittedHostDevices` from the KubeVirt CR — finds `pciVendorSelector: "10DE:2330"`
- Scans the PCI bus for devices matching that vendor:device ID
- Finds all 8 H100 GPUs (the PCI ID is the same regardless of the bound driver)
- Registers them with kubelet via the Device Plugin API
- Node reports `nvidia.com/H100_SXM5_80GB: 8` in capacity/allocatable
Important: virt-handler does NOT check which driver is bound (nvidia vs vfio-pci). It counts PCI devices by ID. The actual VFIO binding only matters at VM start time — KubeVirt will fail to pass a GPU through if it isn't bound to vfio-pci.
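A quick check on the host cluster that the resource actually shows up (substitute the real node name):

```bash
# Host cluster: virt-handler should have registered 8 devices with kubelet
kubectl describe node <node-name> | grep H100_SXM5_80GB
# Expected under both Capacity and Allocatable:
#   nvidia.com/H100_SXM5_80GB:  8
```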
KubeVirt is patched to allow H100 PCI passthrough into VMs:
```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: cozy-kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:2330"
          resourceName: "nvidia.com/H100_SXM5_80GB"
          externalResourceProvider: false
```
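One way to apply this is a merge patch against the KubeVirt CR (a sketch; if Cozystack reconciles this CR itself, the change may need to go through its own configuration instead):

```bash
kubectl patch kubevirt kubevirt -n cozy-kubevirt --type merge -p '
{"spec": {"configuration": {"permittedHostDevices": {"pciHostDevices": [
  {"pciVendorSelector": "10DE:2330",
   "resourceName": "nvidia.com/H100_SXM5_80GB",
   "externalResourceProvider": false}
]}}}}'
```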
The nested cluster requests GPUs and gets the NVIDIA GPU Operator automatically via a Cozystack addon:

```yaml
apiVersion: apps.cozystack.io/v1alpha1
kind: Kubernetes
metadata:
  name: main
  namespace: tenant-workload
spec:
  nodeGroups:
    md0:
      minReplicas: 1
      maxReplicas: 1
      resources:
        cpu: 56
        memory: 512Gi
      gpus:
        - name: "nvidia.com/H100_SXM5_80GB"
        - name: "nvidia.com/H100_SXM5_80GB"
  addons:
    gpuOperator:
      enabled: true
```

Inside the VM, the GPU appears as a regular PCI device — identical to bare metal. The GPU Operator (deployed by Cozystack as an addon) handles everything:
- nvidia driver container — installs nvidia kernel driver inside the VM
- nvidia-device-plugin DaemonSet — registers GPUs with kubelet via Device Plugin API
- nvidia-container-toolkit — configures containerd runtime for GPU support in containers
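A quick way to confirm these components are running (the exact namespace depends on how the addon installs the operator, so this just greps across all namespaces):

```bash
# Against the nested cluster's kubeconfig
kubectl get pods -A | grep -Ei 'gpu-operator|nvidia'
```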
The nested cluster's kubelet then reports nvidia.com/gpu: N in node status, and pods can request GPUs through standard resource requests/limits.
There is no difference from the nested cluster's perspective between bare-metal GPU and VFIO-passthrough GPU — GPU Operator works identically in both cases.
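A minimal smoke test from inside the nested cluster, assuming nothing beyond the standard resource name and a public CUDA base image (the tag is illustrative):

```bash
# Nested cluster: kubelet should report the GPUs
kubectl describe nodes | grep 'nvidia.com/gpu'

# Throwaway pod that requests one GPU and prints nvidia-smi output
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

kubectl logs gpu-smoke-test   # once the pod completes; should list an H100
```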
- Node preparation (k3s prerequisites)
- k3s installation
- GPU passthrough setup (all 8 GPUs) + reboot
- Cozystack deployment
- KubeVirt GPU patch
- Create nested cluster with GPU requests + GPU Operator addon
After boot, verify GPU isolation:
```bash
# All 8 GPUs should show vfio-pci
for gpu in 0a 18 3b 44 87 90 b8 c1; do
  echo "GPU $gpu:"; lspci -nnk -s $gpu:00.0 | grep driver
done
# Expected: Kernel driver in use: vfio-pci (for all GPUs)
```

If boot fails after GPU passthrough configuration:
```bash
# Via IPMI/recovery console:
rm /etc/initramfs-tools/scripts/init-top/vfio-pci-override
update-initramfs -u -k all
reboot
```

- IOMMU must be enabled on the host (`iommu=pt` in kernel params)
- No nvidia driver on the host — all 8 GPUs are vfio-pci, the host has no GPU access
- No nvidia-docker/toolkit on the host — containerd is sufficient, GPU passthrough happens at the KVM/VFIO level
- GPU Operator runs inside the nested cluster VMs — handles drivers, CUDA, container runtime
- PCI device ID: NVIDIA H100 SXM5 80GB = `10DE:2330`
- NVLink topology: NV18 all-to-all — all GPUs are peers, no hierarchy impact from selection
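Two quick host-side sanity checks that follow from the notes above:

```bash
# Count H100s by PCI vendor:device ID; expect 8
lspci -nn -d 10de:2330 | wc -l

# Non-empty output means the IOMMU is active and groups were created
ls /sys/kernel/iommu_groups/ | head
```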