This gist provides a production-ready docker-compose.yaml for running OpenWebUI + ComfyUI on NVIDIA DGX Spark (Grace Blackwell / GB10).
NVIDIA's blueprints are a good baseline, but they break down once you combine ComfyUI, Ollama, and OpenWebUI in a multi-service setup, mainly due to dependency drift, frontend changes, and memory assumptions. This configuration closes those gaps while preserving NVIDIA's Blackwell-optimized stack.
- **Native `sm_121` support**: Uses `nvcr.io/nvidia/pytorch:25.10-py3`, which includes PyTorch 2.9.0a0 compiled for Blackwell and CUDA 13.
- **Shared memory tuning**: Sets `shm_size: 16gb` and appropriate `ulimits` to avoid bus errors during large tensor transfers; required to use the Spark's 128 GB unified memory effectively (see the compose sketch after this list).
- **Dependency protection**: Installs missing Python packages (`transformers`, `torchsde`, `einops`, `av`, `comfyui-frontend-package`, etc.) without overwriting NVIDIA's optimized PyTorch build.
- **ComfyUI frontend handling**: Explicitly installs `comfyui-frontend-package` and the workflow templates, which are now mandatory after recent ComfyUI changes.
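A minimal sketch of how these settings fit together in the compose file (an illustrative excerpt only, not the full file from this gist; the service name, port mapping, volume path, and `ulimits` values are placeholders):

```yaml
services:
  comfyui:
    image: nvcr.io/nvidia/pytorch:25.10-py3   # PyTorch 2.9.0a0 built for Blackwell / CUDA 13
    runtime: nvidia                           # requires the NVIDIA Container Toolkit
    shm_size: 16gb                            # avoids bus errors during large tensor transfers
    ulimits:
      memlock: -1                             # example values; adjust to your workload
      stack: 67108864
    ports:
      - "8188:8188"                           # ComfyUI's default port
    volumes:
      - ./comfyui:/workspace                  # placeholder path
```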
- **Hardware**: DELL GB10 | NVIDIA DGX Spark (GB10)
- **Driver**: `580.95.05` or newer (required for `sm_121`)
- **Software**: Docker Engine with the NVIDIA Container Toolkit (`runtime: nvidia`)
- **Secrets**: `.env` file with `GOOGLE_API_KEY` and `GOOGLE_CX` (used by OpenWebUI RAG search)
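A minimal `.env` sketch (the variable names come from the list above; the values are placeholders to replace with your own credentials):

```env
# Placeholder credentials -- replace with your own values.
GOOGLE_API_KEY=your-google-api-key
GOOGLE_CX=your-search-engine-id
```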
- Configure the Docker runtime (if not already done):

  ```bash
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker
  ```

- Start the stack:

  ```bash
  docker compose up -d
  ```

- Monitor the first startup:

  ```bash
  docker logs -f comfyui
  ```

  The initial run takes ~60 seconds to clone ComfyUI and install media dependencies.
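Once the container is up, an optional sanity check confirms that the Blackwell-targeted PyTorch build sees the GB10 GPU (this assumes the container is named `comfyui`, matching the log command above):

```bash
# Print the PyTorch version and the detected GPU from inside the running container.
docker exec comfyui python -c "import torch; print(torch.__version__, torch.cuda.get_device_name(0))"
```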
Use the following node ID mappings from the Blackwell-optimized `default.json` workflow:
| Feature | Node ID | Class Type |
|---|---|---|
| Prompt | 6 | CLIPTextEncode |
| Model | 4 | CheckpointLoaderSimple |
| Sampler | 3 | KSampler |
| Latent / Size | 5 | EmptyLatentImage |
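For orientation, these IDs address entries in the workflow's API-format JSON, where each node is keyed by its ID and carries a `class_type` plus `inputs`. A trimmed, illustrative sketch (field values are placeholders, not copied from the actual `default.json`):

```json
{
  "6": { "class_type": "CLIPTextEncode",         "inputs": { "text": "your prompt here", "clip": ["4", 1] } },
  "4": { "class_type": "CheckpointLoaderSimple", "inputs": { "ckpt_name": "model.safetensors" } },
  "3": { "class_type": "KSampler",               "inputs": { "model": ["4", 0], "steps": 20, "seed": 0 } },
  "5": { "class_type": "EmptyLatentImage",       "inputs": { "width": 1024, "height": 1024, "batch_size": 1 } }
}
```

When wiring ComfyUI into OpenWebUI's image-generation settings, these are the node IDs to point the prompt, model, sampler, and image-size fields at.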
- **Blackwell coherency**: Leverages NVLink-C2C (≈900 GB/s) between the Grace CPU and the Blackwell GPU.
- **Root pip warnings**: Expected. The container installs dependencies globally so that NVIDIA's libraries stay correctly linked.
- **Unified memory usage**: Designed to run large diffusion workflows and very large LLMs (100B–200B parameters via Ollama) concurrently within the 128 GB unified memory pool.
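As a concrete illustration of that last point (the `ollama` service name and the model tag are assumptions, not taken from this gist), a large model can be served while ComfyUI renders images:

```bash
# Run a large LLM in the Ollama container while ComfyUI generates images;
# both workloads share the 128 GB unified memory pool.
docker exec -it ollama ollama run gpt-oss:120b "Describe the GB10's unified memory architecture."
```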