This guide covers adding an NVIDIA Orin Nano as a worker node to an existing RKE2 cluster managed by Rancher, with full GPU support enabled.
By default, running the Rancher custom cluster script will not enable GPU support in containerd. This guide provides the necessary steps to configure the NVIDIA runtime for RKE2/containerd.
- Existing RKE2 cluster with 3 nodes managed by Rancher
- NVIDIA Orin Nano with JetPack installed (drivers pre-installed; a quick check is shown after this list)
- Access to Rancher UI to generate custom cluster join script
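Before joining the node, it can be worth confirming the JetPack driver stack is actually present on the Orin Nano. A minimal check, assuming a standard JetPack 5.x install (older JetPack releases do not ship `nvidia-smi`, so the L4T release file is the more reliable indicator):

```bash
# Show the L4T/JetPack release the board is running
cat /etc/nv_tegra_release

# JetPack 5.x and later also ship nvidia-smi for the integrated Orin GPU
nvidia-smi
```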
```bash
# Add the NVIDIA container toolkit repo
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
```
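A quick sanity check that the toolkit installed the binaries the later steps rely on (the runtime path below is the one referenced by the containerd configuration in this guide):

```bash
# Both binaries come from the nvidia-container-toolkit package
which nvidia-container-runtime
nvidia-ctk --version
```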
Create `/etc/rancher/rke2/config.yaml` so the node registers with a GPU label:

```yaml
node-label:
  - "nvidia.com/gpu=true"
```

Execute the custom cluster join script from the Rancher UI to add the node to your cluster.
Important: Use the template method to ensure configuration persists across RKE2 restarts and upgrades.
Create `/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl`:

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```

Or use `nvidia-ctk` to configure the template:
```bash
sudo nvidia-ctk runtime configure \
  --runtime=containerd \
  --config=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
```

Restart the RKE2 agent so containerd regenerates its config from the template:

```bash
sudo systemctl restart rke2-agent
```
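Before moving on, it can help to confirm that containerd actually registered the nvidia runtime. A sketch using the crictl binary that RKE2 bundles; the paths below assume a default RKE2 install:

```bash
# RKE2 ships crictl next to its embedded containerd; the socket lives under /run/k3s
sudo /var/lib/rancher/rke2/bin/crictl \
  --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
  info | grep -i nvidia
```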
Apply the NVIDIA device plugin DaemonSet to your cluster. Save the following as `nvidia-device-plugin.yaml`:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
          name: nvidia-device-plugin-ctr
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```

Apply with:
```bash
kubectl apply -f nvidia-device-plugin.yaml
```

Verify that the GPU is advertised in node capacity:

```bash
kubectl get nodes -o json | jq '.items[].status.capacity'
```

Look for `nvidia.com/gpu: "1"` in the Orin Nano node's capacity.
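If you prefer to query just the Orin Nano instead of every node, a jsonpath lookup works too (replace `orin-nano` with your actual node name):

```bash
# Prints the number of allocatable GPUs on the node, e.g. 1
kubectl get node orin-nano -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```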
Also confirm the device plugin pod is running:

```bash
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
```

To test GPU access from a workload, save the following as `gpu-test.yaml`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:11.0.3-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu: "true"
```

```bash
kubectl apply -f gpu-test.yaml
kubectl logs gpu-test
```

You should see `nvidia-smi` output showing the GPU.
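Once the output looks right, the test pod can be removed:

```bash
kubectl delete pod gpu-test
```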
- DO NOT manually edit `/var/lib/rancher/rke2/agent/etc/containerd/config.toml` directly - RKE2 will regenerate this file on restarts, upgrades, and configuration changes
- ALWAYS use `config.toml.tmpl` for persistent containerd customizations - this is the RKE2-native, declarative way to manage containerd configuration
RKE2 regenerates containerd config when:
- RKE2 agent service restarts (some scenarios)
- RKE2 is upgraded
- Cluster configuration changes are applied via Rancher
- RKE2 config: `/etc/rancher/rke2/config.yaml`
- Containerd config template: `/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl`
- Containerd config (generated): `/var/lib/rancher/rke2/agent/etc/containerd/config.toml`
```bash
# Check if nvidia-container-runtime is installed
which nvidia-container-runtime

# Check RKE2 logs
sudo journalctl -u rke2-agent -f

# Verify containerd config was applied
sudo grep nvidia /var/lib/rancher/rke2/agent/etc/containerd/config.toml
```
```bash
# Check pod status
kubectl describe pod -n kube-system -l name=nvidia-device-plugin-ds

# Check node labels
kubectl get nodes --show-labels | grep nvidia
```

If needed, restart the RKE2 agent on the node:

```bash
sudo systemctl restart rke2-agent
sudo systemctl status rke2-agent
```
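If the device plugin pod is running but the GPU never appears in node capacity, its own logs usually explain why (same label selector as above):

```bash
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
```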