This is a technical quick-start gist for the latest Red Hat AI Inference Server (RHAIIS) preview image, featuring NVIDIA Nemotron v3 Nano 30B-A3B models on vLLM.
Preview image tag (this release):
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3
Upstream model family (Hugging Face):
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
Technology Preview notice
The Red Hat AI Inference Server images used in this guide are a Technology Preview and not yet fully supported. They are for evaluation only, and production workloads should wait for the upcoming official GA release from the Red Hat container registries.
Reference Links
- NVIDIA Nemotron v3 collection (models + datasets) (Hugging Face)
- Red Hat reference setup + serving guide (Day-0 article for Mistral, setup steps still apply) (Red Hat Developer)
If you want the full setup notes (SELinux, cache dir perms, NVSwitch / Fabric Manager, etc.), use the Mistral release reference guide linked above.
Here’s the quick version:
podman login registry.redhat.io
podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3

If SELinux is enabled:

sudo setsebool -P container_use_devices 1
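Optional (not in the original gist): confirm that podman can actually see the GPUs through CDI before launching. This assumes the NVIDIA Container Toolkit is installed, which provides nvidia-ctk; skip it if your GPU passthrough is set up differently.

# List the CDI device specs that back --device nvidia.com/gpu=all
nvidia-ctk cdi list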
Set your Hugging Face token:

echo 'export HF_TOKEN=<your_hf_token>' > private.env
source private.env
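Optional sanity check (again, not part of the original gist): verify the token works before the container tries to pull weights. This uses the Hugging Face whoami-v2 API.

# Should return a JSON blob with your Hugging Face account name
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2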
This runs the post-trained BF16 checkpoint nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (Hugging Face):
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3 \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--served-model-name nemotron-nano-v3-bf16 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000

Notes
- --tensor-parallel-size: set this to match how many GPUs you want to shard across (e.g. 2, 4, 8).
- If you’re on a multi-GPU NVSwitch box and hit weirdness, see the reference guide’s Fabric Manager notes. (Red Hat Developer)
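Once the logs show the server is up, a quick smoke test against the standard vLLM OpenAI-compatible endpoints (this assumes the port mapping above):

# Returns HTTP 200 once the engine is ready
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health

# Should list the served model name, nemotron-nano-v3-bf16
curl -s http://localhost:8000/v1/models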
This runs the FP8 checkpoint nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 (Hugging Face):
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:nvidia-nemotron-v3 \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--served-model-name nemotron-nano-v3-fp8 \
--quantization fp8 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
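FP8 weights should roughly halve the checkpoint's memory footprint compared to BF16. If you want to see that on your hardware, watch GPU memory while each variant loads (plain nvidia-smi, nothing image-specific):

# Poll GPU memory every 2 seconds while the server loads the checkpoint
watch -n 2 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv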
-H "Content-Type: application/json" \
-d '{
"max_tokens": 250,
"prompt": "Write a perl script that outputs an advertisement for noodles marketed to sysadmins"
}'If you prefer chat-style:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron-nano-v3-bf16",
"messages": [
{"role": "user", "content": "Give me a 5-bullet pitch for noodles marketed to sysadmins."}
],
"max_tokens": 250
}'
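The server also speaks the standard OpenAI streaming protocol. A minimal sketch with the same chat payload, adding "stream": true so tokens arrive as server-sent events:

# -N disables curl's output buffering so chunks print as they arrive
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron-nano-v3-bf16",
"messages": [
{"role": "user", "content": "Give me a 5-bullet pitch for noodles marketed to sysadmins."}
],
"max_tokens": 250,
"stream": true
}'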