πŸš€ vLLM Docker Deployment Commands for NVIDIA H100

Tested Docker commands for deploying Hugging Face LLMs with vLLM on a single NVIDIA H100 GPU. Every command mounts the local Hugging Face cache, reads your Hugging Face access token in place of <token>, and exposes vLLM's OpenAI-compatible API on host port 8002 behind the API key token-sse123.

openai/gpt-oss-120b

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    -p 8002:8000 \
    --ipc=host \
    vllm/vllm-openai:gptoss \
    --api-key token-sse123 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --max-num-seqs 16 \
    --model openai/gpt-oss-120b
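
Once the container logs show the server is up, you can sanity-check the deployment with a chat completion request. A minimal sketch, assuming the server is reachable on localhost:8002 with the --api-key value above:

curl http://localhost:8002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer token-sse123" \
    -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
    }'

The same check works for every command below; only the "model" field changes.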

meta-llama/Llama-3.3-70B-Instruct

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    -p 8002:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --api-key token-sse123 \
    --gpu-memory-utilization 0.92 \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --max-model-len 32768 \
    --max-num-seqs 64 \
    --enable-prefix-caching \
    --model meta-llama/Llama-3.3-70B-Instruct
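
At 16-bit precision the 70B weights alone are roughly 140 GB, so this command relies on in-flight bitsandbytes quantization to fit a single 80 GB H100. While the model loads, you can watch memory headroom from the host (assuming nvidia-smi is on your PATH):

watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv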

Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    -p 8002:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --api-key token-sse123 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 16384 \
    --max-num-seqs 64 \
    --enable-prefix-caching \
    --model Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

Qwen/Qwen3-30B-A3B-Thinking-2507-FP8

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    -p 8002:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --api-key token-sse123 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 32768 \
    --max-num-seqs 64 \
    --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --model Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
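
With --reasoning-parser qwen3, vLLM separates the model's thinking tokens from the final answer and returns them in a distinct reasoning_content field on the message. A quick way to inspect it, assuming jq is installed:

curl -s http://localhost:8002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer token-sse123" \
    -d '{
        "model": "Qwen/Qwen3-30B-A3B-Thinking-2507-FP8",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "max_tokens": 512
    }' | jq '.choices[0].message.reasoning_content'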

mistralai/Mixtral-8x7B-Instruct-v0.1

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    -p 8002:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --api-key token-sse123 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --enable-prefix-caching \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1

deepseek-ai/DeepSeek-R1-Distill-Llama-70B

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    -p 8002:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --api-key token-sse123 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 32768 \
    --max-num-seqs 64 \
    --enable-prefix-caching \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B
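
Since every command publishes the API on host port 8002, you can confirm which model a container is currently serving by listing the models endpoint (assuming the defaults above):

curl http://localhost:8002/v1/models \
    -H "Authorization: Bearer token-sse123"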