This Dockerfile builds a container for running vLLM (Large Language Model inference engine) on CPU with specific patches and optimizations. Here's a breakdown:

Base Image

FROM openeuler/vllm-cpu:0.9.1-oe2403lts

  • Uses OpenEuler Linux distribution's pre-built vLLM image (version 0.9.1)
  • Built for CPU inference (not GPU)
  • Based on OpenEuler 24.03 LTS

Critical Patch (Lines 4-5)

RUN sed -i 's|cpu_count_per_numa = cpu_count // numa_size|cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count|g' \
    /workspace/vllm/vllm/worker/cpu_worker.py

What it does:

  • Fixes a division-by-zero bug in vLLM's CPU worker
  • Original code: cpu_count_per_numa = cpu_count // numa_size
  • Patched code: cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count

Why needed: On systems where NUMA detection reports zero nodes (as in some Docker/VM environments), the original line raises a ZeroDivisionError and the worker crashes. The patch adds a conditional fallback so the full CPU count is used instead.
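
A minimal standalone sketch of the same logic (with hypothetical values) makes the failure mode concrete:

cpu_count = 8
numa_size = 0  # NUMA detection can report 0 inside some VMs/containers

# Original line: raises ZeroDivisionError when numa_size == 0
# cpu_count_per_numa = cpu_count // numa_size

# Patched line: falls back to the full CPU count instead
cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count
print(cpu_count_per_numa)  # 8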

Environment Variables (Lines 7-11)

These optimize vLLM for CPU execution:

  • VLLM_TARGET_DEVICE=cpu - Explicitly targets CPU (not CUDA/ROCm)
  • VLLM_CPU_KVCACHE_SPACE=1 - Allocates 1 GiB of memory for the key-value (KV) cache
  • OMP_NUM_THREADS=2 - Limits OpenMP to 2 threads (prevents over-subscription)
  • OPENBLAS_NUM_THREADS=1 - Single-threaded BLAS operations
  • MKL_NUM_THREADS=1 - Single-threaded Intel MKL operations

Threading Strategy: The conservative thread limits prevent CPU thrashing and contention. This suggests the container is designed for environments with limited CPU resources or where multiple containers run concurrently.
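
For reference, assuming plain ENV instructions, lines 7-11 of the Dockerfile likely look like this sketch:

ENV VLLM_TARGET_DEVICE=cpu
ENV VLLM_CPU_KVCACHE_SPACE=1
ENV OMP_NUM_THREADS=2
ENV OPENBLAS_NUM_THREADS=1
ENV MKL_NUM_THREADS=1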

Use Case

This container would be used to:

  1. Run LLM inference on CPU-only systems
  2. Handle environments without NUMA node detection (VMs, Docker Desktop, cloud containers)
  3. Provide stable, predictable performance with controlled threading

To build (run from the directory containing the Dockerfile): docker build -t vllm-cpu-patched .
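
Once built, a hypothetical way to run it (the entrypoint and model name here are assumptions; the openeuler/vllm-cpu base image's defaults may differ):

# Serve a small model via vLLM's OpenAI-compatible server on port 8000
docker run --rm -p 8000:8000 vllm-cpu-patched \
    vllm serve Qwen/Qwen2.5-0.5B-Instruct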
