RHAIIS Preview: NVIDIA Nemotron v3 (Nano 30B-A3B) on Red Hat AI Inference Server


This is a technical quick-start gist for the latest Red Hat AI Inference Server (RHAIIS) preview image, featuring NVIDIA Nemotron v3 Nano 30B-A3B models on vLLM.

Preview image tag (this release):

  • registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3

Upstream model family (Hugging Face):

  • nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
  • nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  • nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8

Technology Preview notice

The Red Hat AI Inference Server images used in this guide are a Technology Preview and are not yet fully supported. They are intended for evaluation only; production workloads should wait for the official GA release in the Red Hat container registries.

Reference Links

  • NVIDIA Nemotron v3 collection (models + datasets) (Hugging Face)
  • Red Hat reference setup + serving guide (Day-0 article for Mistral, setup steps still apply) (Red Hat Developer)

Minimal prereqs (fast path)

If you want the full setup notes (SELinux, cache dir perms, NVSwitch / Fabric Manager, etc.), see the Red Hat reference setup + serving guide in the Reference Links above.

Here’s the quick version:

1) Login + pull

podman login registry.redhat.io
podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3
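
To confirm the image landed locally, list it (output format varies by podman version):

podman images registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9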

2) (SELinux) allow containers to use GPU devices

If SELinux is enabled:

sudo setsebool -P container_use_devices 1
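
To verify SELinux mode and that the boolean took effect:

getenforce
getsebool container_use_devices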

3) Hugging Face token

echo 'export HF_TOKEN=<your_hf_token>' > private.env
source private.env
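
Optional sanity check: if the token is valid, the Hugging Face whoami endpoint returns your account info:

curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2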

Quick start: serve Nemotron v3 Nano (BF16)

This runs the post-trained BF16 checkpoint:

podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size=4g \
  -p 8000:8000 \
  --tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  --env "HF_HUB_CACHE=/opt/app-root/src/.cache" \
  registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3 \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --served-model-name nemotron-nano-v3-bf16 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 128 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
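
Once the server finishes loading and logs that it's listening, a quick readiness check is the models endpoint, which should list the served model name:

curl http://localhost:8000/v1/models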

Notes

  • --tensor-parallel-size: set this to the number of GPUs you want to shard the model across (e.g. 2, 4, 8); see the example after these notes.

  • If you’re on a multi-GPU NVSwitch box and hit weirdness, see the reference guide’s Fabric Manager notes. (Red Hat Developer)
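
For example, on a 2-GPU host the only change to the BF16 command above is the sharding flag:

    --tensor-parallel-size 2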


Quick start: serve Nemotron v3 Nano (FP8)

This runs the FP8 checkpoint:

podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size=4g \
  -p 8000:8000 \
  --tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  --env "HF_HUB_CACHE=/opt/app-root/src/.cache" \
  registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3 \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
    --served-model-name nemotron-nano-v3-fp8 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 128 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
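
FP8 weights are roughly half the size of their BF16 counterparts, so this variant should have a noticeably smaller memory footprint. A quick way to compare while each server is running:

nvidia-smi --query-gpu=memory.used --format=csv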

Smoke test (completions)

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-nano-v3-bf16",
    "max_tokens": 250,
    "prompt": "Write a perl script that outputs an advertisement for noodles marketed to sysadmins"
  }'
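
To extract just the generated text from the JSON response (assuming jq is installed):

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-nano-v3-bf16",
    "max_tokens": 250,
    "prompt": "Write a perl script that outputs an advertisement for noodles marketed to sysadmins"
  }' | jq -r '.choices[0].text'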

If you prefer chat-style:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-nano-v3-bf16",
    "messages": [
      {"role": "user", "content": "Give me a 5-bullet pitch for noodles marketed to sysadmins."}
    ],
    "max_tokens": 250
  }'
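
Both endpoints also speak the standard OpenAI-style streaming protocol; add "stream": true to the request and pass -N to curl so it doesn't buffer the server-sent events:

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-nano-v3-bf16",
    "messages": [
      {"role": "user", "content": "Give me a 5-bullet pitch for noodles marketed to sysadmins."}
    ],
    "max_tokens": 250,
    "stream": true
  }'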
