This is a technical quick-start gist for the latest Red Hat AI Inference Server (RHAIIS) preview image, featuring NVIDIA Nemotron v3 Nano 30B-A3B models on vLLM.
Preview image tag (this release):
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3
Upstream model family (Hugging Face):
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
Technology Preview notice
The Red Hat AI Inference Server images used in this guide are a Technology Preview and not yet fully supported. They are for evaluation only, and production workloads should wait for the upcoming official GA release from the Red Hat container registries.
Reference Links
- NVIDIA Nemotron v3 collection (models + datasets) (Hugging Face)
- Red Hat reference setup + serving guide (Day-0 article for Mistral, setup steps still apply) (Red Hat Developer)
If you want the full setup notes (SELinux, cache dir perms, NVSwitch / Fabric Manager, etc.), use the Mistral release reference guide linked above.
Here’s the quick version:
podman login registry.redhat.io
podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3

If SELinux is enabled:

sudo setsebool -P container_use_devices 1
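Optional (not in the original gist): confirm that podman can actually see the GPUs through CDI before launching. This assumes the NVIDIA Container Toolkit is installed, which provides nvidia-ctk; skip it if your GPU passthrough is set up differently.

# List the CDI device specs that back --device nvidia.com/gpu=all
nvidia-ctk cdi list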
Set your Hugging Face token:

echo 'export HF_TOKEN=<your_hf_token>' > private.env
source private.env
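Optional sanity check (again, not part of the original gist): verify the token works before the container tries to pull weights. This uses the Hugging Face whoami-v2 API.

# Should return a JSON blob with your Hugging Face account name
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2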
This runs the post-trained BF16 checkpoint nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (Hugging Face):
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3 \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--served-model-name nemotron-nano-v3-bf16 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000

Notes
- --tensor-parallel-size: set this to match how many GPUs you want to shard across (e.g. 2, 4, 8).
- If you’re on a multi-GPU NVSwitch box and hit weirdness, see the reference guide’s Fabric Manager notes. (Red Hat Developer)
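Once the logs show the server is up, a quick smoke test against the standard vLLM OpenAI-compatible endpoints (this assumes the port mapping above):

# Returns HTTP 200 once the engine is ready
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health

# Should list the served model name, nemotron-nano-v3-bf16
curl -s http://localhost:8000/v1/models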
This runs the FP8 checkpoint nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 (Hugging Face):
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3 \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--served-model-name nemotron-nano-v3-fp8 \
--quantization fp8 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
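FP8 weights should roughly halve the checkpoint's memory footprint compared to BF16. If you want to see that on your hardware, watch GPU memory while each variant loads (plain nvidia-smi, nothing image-specific):

# Poll GPU memory every 2 seconds while the server loads the checkpoint
watch -n 2 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv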
-H "Content-Type: application/json" \
-d '{
"max_tokens": 250,
"prompt": "Write a perl script that outputs an advertisement for noodles marketed to sysadmins"
}'If you prefer chat-style:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron-nano-v3-bf16",
"messages": [
{"role": "user", "content": "Give me a 5-bullet pitch for noodles marketed to sysadmins."}
],
"max_tokens": 250
}'
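The server also speaks the standard OpenAI streaming protocol. A minimal sketch with the same chat payload, adding "stream": true so tokens arrive as server-sent events:

# -N disables curl's output buffering so chunks print as they arrive
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron-nano-v3-bf16",
"messages": [
{"role": "user", "content": "Give me a 5-bullet pitch for noodles marketed to sysadmins."}
],
"max_tokens": 250,
"stream": true
}'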