Adding an NVIDIA Jetson Orin Nano as a GPU Worker Node to an RKE2 Cluster

This guide covers adding an NVIDIA Jetson Orin Nano as a worker node to an existing RKE2 cluster managed by Rancher, with full GPU support enabled.

Overview

Running the Rancher custom cluster registration script alone does not enable GPU support in containerd. This guide covers the additional steps needed to configure the NVIDIA runtime for RKE2's bundled containerd.

Prerequisites

  • Existing RKE2 cluster with 3 nodes managed by Rancher
  • NVIDIA Orin Nano with JetPack installed (drivers pre-installed)
  • Access to Rancher UI to generate custom cluster join script

Installation Steps

1. Install NVIDIA Container Toolkit on Orin Nano

# Add the NVIDIA container toolkit repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
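
A quick sanity check that the toolkit is installed before moving on (version output will vary by JetPack release):

# Confirm the runtime binary and CLI are on the PATH
which nvidia-container-runtime
nvidia-ctk --version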

2. Configure RKE2 Node Labels (Before Joining Cluster)

Create /etc/rancher/rke2/config.yaml (create the directory first if it does not already exist):

node-label:
  - "nvidia.com/gpu=true"

3. Run Rancher Custom Cluster Script

Execute the custom cluster join script from Rancher UI to add the node to your cluster.
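
Once the script finishes, confirm the node registers and becomes Ready. The journalctl unit name below assumes a Rancher v2.6+ custom cluster, which bootstraps nodes via rancher-system-agent:

# On the Orin Nano: watch the agent bootstrap
sudo journalctl -u rancher-system-agent -f

# From a machine with cluster access: confirm the node appears
kubectl get nodes -o wide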

4. Configure NVIDIA Runtime for Containerd (PERSISTENT METHOD)

Important: Use the template method to ensure configuration persists across RKE2 restarts and upgrades.

Create /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl:

version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
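
Note: when config.toml.tmpl exists, RKE2 renders it in place of its built-in template, so a template containing only the snippet above drops the rest of RKE2's containerd settings. A safer starting point is to copy the file RKE2 already generated and append the nvidia runtime sections to that copy (a sketch using the default RKE2 paths):

# Seed the template from the currently generated config, then edit the copy
sudo cp /var/lib/rancher/rke2/agent/etc/containerd/config.toml \
        /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl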

OR use nvidia-ctk to configure the template:

sudo nvidia-ctk runtime configure \
  --runtime=containerd \
  --config=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl

sudo systemctl restart rke2-agent
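
After the restart, the nvidia runtime should appear in the regenerated config:

# The runtimes.nvidia block should be present in the rendered config
sudo grep -A3 'runtimes.nvidia' /var/lib/rancher/rke2/agent/etc/containerd/config.toml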

5. Deploy NVIDIA Device Plugin

Apply the NVIDIA device plugin DaemonSet to your cluster:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Apply with:

kubectl apply -f nvidia-device-plugin.yaml
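
The DaemonSet should schedule exactly one pod onto the GPU-labeled node; the rollout can be watched with:

kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset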

Verification

Check Node GPU Capacity

kubectl get nodes -o json | jq '.items[].status.capacity'

Look for nvidia.com/gpu: "1" in the Orin Nano node's capacity.
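
A single node can also be queried directly (hypothetical node name orin-nano):

kubectl get node orin-nano -o jsonpath="{.status.capacity['nvidia\.com/gpu']}"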

Verify Device Plugin is Running

kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds

Test GPU Access with a Pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu: "true"
Apply and check the logs:

kubectl apply -f gpu-test.yaml
kubectl logs gpu-test

You should see nvidia-smi output showing the GPU.
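
The test pod can be cleaned up afterwards:

kubectl delete pod gpu-test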

Important Notes

Config Persistence

  • DO NOT manually edit /var/lib/rancher/rke2/agent/etc/containerd/config.toml directly
  • RKE2 will regenerate this file on restarts, upgrades, and configuration changes
  • ALWAYS use config.toml.tmpl for persistent containerd customizations
  • This is the RKE2-native declarative way to manage containerd configuration

When RKE2 Overwrites Config

RKE2 regenerates containerd config when:

  • The rke2-agent service starts or restarts (the config is re-rendered on each start)
  • RKE2 is upgraded
  • Cluster configuration changes are applied via Rancher

File Locations Reference

  • RKE2 config: /etc/rancher/rke2/config.yaml
  • Containerd config template: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  • Containerd config (generated): /var/lib/rancher/rke2/agent/etc/containerd/config.toml

Troubleshooting

GPU Not Showing Up

# Check if nvidia-container-runtime is installed
which nvidia-container-runtime

# Check RKE2 logs
sudo journalctl -u rke2-agent -f

# Verify containerd config was applied
sudo cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml | grep nvidia

Device Plugin Not Running

# Check pod status
kubectl describe pod -n kube-system -l name=nvidia-device-plugin-ds

# Check node labels
kubectl get nodes --show-labels | grep nvidia
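
If the pod is running but no GPU is advertised, the plugin's own logs usually show why (for example, a missing nvidia runtime):

# Inspect the device plugin logs
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds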

Restart RKE2 Agent

sudo systemctl restart rke2-agent
sudo systemctl status rke2-agent
