This Dockerfile builds a container for running vLLM (a large language model inference engine) on CPU, with a bug-fix patch and conservative threading defaults. Here's a breakdown:
Base Image
FROM openeuler/vllm-cpu:0.9.1-oe2403lts
- Uses the openEuler distribution's pre-built vLLM image (version 0.9.1)
- Built for CPU inference (not GPU)
- Based on openEuler 24.03 LTS
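If you want to sanity-check the base image before building, pulling and inspecting it works as usual (this assumes the tag is reachable from your configured registry):

docker pull openeuler/vllm-cpu:0.9.1-oe2403lts
docker image inspect openeuler/vllm-cpu:0.9.1-oe2403lts --format '{{.Os}}/{{.Architecture}}'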
Critical Patch (Lines 4-5)
RUN sed -i 's|cpu_count_per_numa = cpu_count // numa_size|cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count|g' \
    /workspace/vllm/vllm/worker/cpu_worker.py
What it does:
- Fixes a division-by-zero bug in vLLM's CPU worker
- Original code: cpu_count_per_numa = cpu_count // numa_size
- Patched code: cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count
Why needed: In some Docker/VM environments, NUMA detection reports zero nodes, so the original expression divides by zero and crashes the CPU worker at startup. The patch falls back to using the full CPU count when numa_size is 0.
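Because sed -i exits successfully even when its pattern matches nothing, it can be worth confirming that the patch actually landed in the built image. A quick check along these lines should work (vllm-cpu-patched is the tag used in the build command at the end; --entrypoint is overridden in case the base image defines its own):

docker run --rm --entrypoint grep vllm-cpu-patched \
  -n 'numa_size > 0' /workspace/vllm/vllm/worker/cpu_worker.py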
Environment Variables (Lines 7-11)
These optimize vLLM for CPU execution:
- VLLM_TARGET_DEVICE=cpu - Explicitly targets CPU (not CUDA/ROCm)
- VLLM_CPU_KVCACHE_SPACE=1 - Allocates 1GB for key-value cache storage
- OMP_NUM_THREADS=2 - Limits OpenMP to 2 threads (prevents over-subscription)
- OPENBLAS_NUM_THREADS=1 - Single-threaded BLAS operations
- MKL_NUM_THREADS=1 - Single-threaded Intel MKL operations
Threading Strategy: The conservative thread limits prevent CPU thrashing and contention. This suggests the container is designed for environments with limited CPU resources or where multiple containers run concurrently.
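Since these are plain ENV defaults, they can be overridden at run time on hosts with more headroom. A minimal sketch with illustrative values, reusing the vllm-cpu-patched tag from the build command below:

# Example: larger KV cache and more OpenMP threads on a bigger host
docker run --rm \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  -e OMP_NUM_THREADS=8 \
  vllm-cpu-patched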
Use Case
This container would be used to:
- Run LLM inference on CPU-only systems
- Handle environments without NUMA node detection (VMs, Docker Desktop, cloud containers)
- Provide stable, predictable performance with controlled threading
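A typical invocation would start vLLM's OpenAI-compatible server. This is a sketch that assumes the image exposes the vllm CLI on PATH and has no conflicting ENTRYPOINT (otherwise the equivalent module entry point is python3 -m vllm.entrypoints.openai.api_server); the model name is only a small example:

docker run --rm -p 8000:8000 \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  vllm-cpu-patched \
  vllm serve Qwen/Qwen2.5-0.5B-Instruct --host 0.0.0.0 --port 8000

Once up, the server answers OpenAI-style requests (e.g. /v1/chat/completions) on localhost:8000.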
To build (from the directory containing the Dockerfile): docker build -t vllm-cpu-patched .