Adding an NVIDIA Jetson Orin Nano as a GPU Worker Node to an RKE2 Cluster

This guide covers adding an NVIDIA Jetson Orin Nano as a worker node to an existing RKE2 cluster managed by Rancher, with full GPU support enabled.

Overview

Running the Rancher custom cluster registration script alone does not enable GPU support in containerd. This guide covers the additional steps needed to configure the NVIDIA runtime for RKE2's bundled containerd.

Prerequisites

  • Existing RKE2 cluster with 3 nodes managed by Rancher
  • NVIDIA Orin Nano with JetPack installed (drivers pre-installed)
  • Access to Rancher UI to generate custom cluster join script

Installation Steps

1. Install NVIDIA Container Toolkit on Orin Nano

# Add the NVIDIA container toolkit repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
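
A quick sanity check that the toolkit is installed before moving on (version output will vary by JetPack release):

# Confirm the runtime binary and CLI are on the PATH
which nvidia-container-runtime
nvidia-ctk --version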

2. Configure RKE2 Node Labels (Before Joining Cluster)

Create /etc/rancher/rke2/config.yaml (create the directory first if it does not already exist):

node-label:
  - "nvidia.com/gpu=true"

3. Run Rancher Custom Cluster Script

Execute the custom cluster join script from Rancher UI to add the node to your cluster.
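
Once the script finishes, confirm the node registers and becomes Ready. The journalctl unit name below assumes a Rancher v2.6+ custom cluster, which bootstraps nodes via rancher-system-agent:

# On the Orin Nano: watch the agent bootstrap
sudo journalctl -u rancher-system-agent -f

# From a machine with cluster access: confirm the node appears
kubectl get nodes -o wide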

4. Configure NVIDIA Runtime for Containerd (PERSISTENT METHOD)

Important: Use the template method to ensure configuration persists across RKE2 restarts and upgrades.

Create /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl:

version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
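
Note: when config.toml.tmpl exists, RKE2 renders it in place of its built-in template, so a template containing only the snippet above drops the rest of RKE2's containerd settings. A safer starting point is to copy the file RKE2 already generated and append the nvidia runtime sections to that copy (a sketch using the default RKE2 paths):

# Seed the template from the currently generated config, then edit the copy
sudo cp /var/lib/rancher/rke2/agent/etc/containerd/config.toml \
        /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl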

OR use nvidia-ctk to configure the template:

sudo nvidia-ctk runtime configure \
  --runtime=containerd \
  --config=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl

sudo systemctl restart rke2-agent
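
After the restart, the nvidia runtime should appear in the regenerated config:

# The runtimes.nvidia block should be present in the rendered config
sudo grep -A3 'runtimes.nvidia' /var/lib/rancher/rke2/agent/etc/containerd/config.toml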

5. Deploy NVIDIA Device Plugin

Apply the NVIDIA device plugin DaemonSet to your cluster:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Apply with:

kubectl apply -f nvidia-device-plugin.yaml
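
The DaemonSet should schedule exactly one pod onto the GPU-labeled node; the rollout can be watched with:

kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset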

Verification

Check Node GPU Capacity

kubectl get nodes -o json | jq '.items[].status.capacity'

Look for nvidia.com/gpu: "1" in the Orin Nano node's capacity.
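
A single node can also be queried directly (hypothetical node name orin-nano):

kubectl get node orin-nano -o jsonpath="{.status.capacity['nvidia\.com/gpu']}"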

Verify Device Plugin is Running

kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds

Test GPU Access with a Pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu: "true"
Apply and check the logs:

kubectl apply -f gpu-test.yaml
kubectl logs gpu-test

You should see nvidia-smi output showing the GPU.
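
The test pod can be cleaned up afterwards:

kubectl delete pod gpu-test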

Important Notes

Config Persistence

  • DO NOT manually edit /var/lib/rancher/rke2/agent/etc/containerd/config.toml directly
  • RKE2 will regenerate this file on restarts, upgrades, and configuration changes
  • ALWAYS use config.toml.tmpl for persistent containerd customizations
  • This is the RKE2-native declarative way to manage containerd configuration

When RKE2 Overwrites Config

RKE2 regenerates containerd config when:

  • The rke2-agent service starts or restarts (the config is re-rendered on each start)
  • RKE2 is upgraded
  • Cluster configuration changes are applied via Rancher

File Locations Reference

  • RKE2 config: /etc/rancher/rke2/config.yaml
  • Containerd config template: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  • Containerd config (generated): /var/lib/rancher/rke2/agent/etc/containerd/config.toml

Troubleshooting

GPU Not Showing Up

# Check if nvidia-container-runtime is installed
which nvidia-container-runtime

# Check RKE2 logs
sudo journalctl -u rke2-agent -f

# Verify containerd config was applied
sudo cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml | grep nvidia

Device Plugin Not Running

# Check pod status
kubectl describe pod -n kube-system -l name=nvidia-device-plugin-ds

# Check node labels
kubectl get nodes --show-labels | grep nvidia
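
If the pod is running but no GPU is advertised, the plugin's own logs usually show why (for example, a missing nvidia runtime):

# Inspect the device plugin logs
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds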

Restart RKE2 Agent

sudo systemctl restart rke2-agent
sudo systemctl status rke2-agent
