This Dockerfile builds a container for running vLLM (Large Language Model inference engine) on CPU with specific patches and optimizations. Here's a breakdown:

Base Image

FROM openeuler/vllm-cpu:0.9.1-oe2403lts

  • Uses OpenEuler Linux distribution's pre-built vLLM image (version 0.9.1)
  • Built for CPU inference (not GPU)
  • Based on OpenEuler 24.03 LTS

Critical Patch (Lines 4-5)

RUN sed -i 's|cpu_count_per_numa = cpu_count // numa_size|cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count|g' \
    /workspace/vllm/vllm/worker/cpu_worker.py

What it does:

  • Fixes a division-by-zero bug in vLLM's CPU worker
  • Original code: cpu_count_per_numa = cpu_count // numa_size
  • Patched code: cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count

Why needed: On systems where NUMA detection reports zero nodes (as in some Docker/VM environments), the original line raises a ZeroDivisionError and the worker crashes. The patch adds a conditional fallback so the full CPU count is used instead.
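
A minimal standalone sketch of the same logic (with hypothetical values) makes the failure mode concrete:

cpu_count = 8
numa_size = 0  # NUMA detection can report 0 inside some VMs/containers

# Original line: raises ZeroDivisionError when numa_size == 0
# cpu_count_per_numa = cpu_count // numa_size

# Patched line: falls back to the full CPU count instead
cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count
print(cpu_count_per_numa)  # 8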

Environment Variables (Lines 7-11)

These optimize vLLM for CPU execution:

  • VLLM_TARGET_DEVICE=cpu - Explicitly targets CPU (not CUDA/ROCm)
  • VLLM_CPU_KVCACHE_SPACE=1 - Allocates 1 GiB of memory for the key-value (KV) cache
  • OMP_NUM_THREADS=2 - Limits OpenMP to 2 threads (prevents over-subscription)
  • OPENBLAS_NUM_THREADS=1 - Single-threaded BLAS operations
  • MKL_NUM_THREADS=1 - Single-threaded Intel MKL operations

Threading Strategy: The conservative thread limits prevent CPU thrashing and contention. This suggests the container is designed for environments with limited CPU resources or where multiple containers run concurrently.
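
For reference, assuming plain ENV instructions, lines 7-11 of the Dockerfile likely look like this sketch:

ENV VLLM_TARGET_DEVICE=cpu
ENV VLLM_CPU_KVCACHE_SPACE=1
ENV OMP_NUM_THREADS=2
ENV OPENBLAS_NUM_THREADS=1
ENV MKL_NUM_THREADS=1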

Use Case

This container would be used to:

  1. Run LLM inference on CPU-only systems
  2. Handle environments without NUMA node detection (VMs, Docker Desktop, cloud containers)
  3. Provide stable, predictable performance with controlled threading

To build (run from the directory containing the Dockerfile): docker build -t vllm-cpu-patched .
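
Once built, a hypothetical way to run it (the entrypoint and model name here are assumptions; the openeuler/vllm-cpu base image's defaults may differ):

# Serve a small model via vLLM's OpenAI-compatible server on port 8000
docker run --rm -p 8000:8000 vllm-cpu-patched \
    vllm serve Qwen/Qwen2.5-0.5B-Instruct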
