Date: 2026-02-08
Session: Production Worker Autoscaler Testing
Test App: 9f7f86c5-bf69-496e-b53c-9b8b62a00161 (prod-worker-test)
Test App URL: https://prod-worker-test-sk7o7.ondigitalocean.app
Status: COMPLETE
Image Tag: latest (commits a1cbf31, 207e226)
Worker Configuration:
- Instance: 16GB (apps-d-4vcpu-16gb)
- Autoscaler: MemoryAwareAutoscaler (70% HIGH, 50% LOW thresholds)
- Tasks: 2000 x 15s sleep
- EST_WORKER_RSS_MB: 250
- max_safe_workers: 44
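For reference, the 44-worker cap follows from the predictive formula described later in this report (a quick check, assuming the 16GB instance exposes 16384 MB):

```python
# Quick check of the 44-worker cap (assumes the 16GB instance exposes 16384 MB).
total_mb = 16384          # apps-d-4vcpu-16gb
HIGH_PCT = 0.70           # autoscaler HIGH threshold
EST_WORKER_RSS_MB = 250   # configured per-worker estimate

max_safe_workers = int(total_mb * HIGH_PCT / EST_WORKER_RSS_MB - 1)
print(max_safe_workers)   # -> 44
```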
| Metric | Value |
|---|---|
| Worker RSS at startup | ~239 MB (per child after fork) |
| Worker RSS mid-load | ~239-240 MB |
| Worker RSS peak | ~240.5 MB |
| Peak worker count | 39 |
| Peak memory % | 52.0% |
| BLOCK_SCALEUP events | 0 (HOLD events at >50%) |
| OOM kills | 1 (first run before autoscaler fix, 0 after fix) |
| max_safe_workers | 44 |
| Est throughput | ~2.6 tasks/sec (39 workers / 15s) |
Status: COMPLETE
Image Tag: latest (commit 962047d)
Worker Configuration: Same as Test #1 except EST_WORKER_RSS_MB=100 (no-PyTorch image)
| Metric | Value |
|---|---|
| Worker RSS at startup | ~237-238 MB (per child after fork) |
| Worker RSS mid-load | ~237-238 MB |
| Worker RSS peak | ~238.2 MB |
| Peak worker count | 39 |
| Peak memory % | 51.4% |
| BLOCK_SCALEUP events | 0 |
| OOM kills | 0 |
| max_safe_workers | 113 (but capped at 100 by --autoscale) |
| Est throughput | ~2.6 tasks/sec (39 workers / 15s) |
| Metric | Test #1 (PyTorch) | Test #2 (No PyTorch) | Difference | % Change |
|---|---|---|---|---|
| Worker RSS (startup) | ~239 MB | ~238 MB | ~1 MB | 0.4% |
| Worker RSS (peak) | ~240.5 MB | ~238.2 MB | ~2.3 MB | 1.0% |
| Peak worker count | 39 | 39 | 0 | 0% |
| Peak memory % | 52.0% | 51.4% | 0.6 pp | 1.2% |
| Docker image (buildcache) | 880 MB | 536 MB | 344 MB | 39% smaller |
Removing PyTorch has virtually NO impact on Celery worker memory (~1-2 MB/worker difference).
This is because PyTorch is NOT imported by the Celery worker process. It's only imported by:
- `ctr_visual_mockup.py` → used by FastAPI API endpoints only
- Not in any file imported by `celery_worker.py`
The ~238 MB per-worker RSS comes entirely from production code imports:
- google-genai, openai, supabase, celery, aiohttp, kombu
- LangChain, OpenCV, PIL, matplotlib
- Core business logic modules
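The per-child RSS figures above can be spot-checked from inside the container; a minimal sketch, assuming psutil is installed (it is not necessarily part of the production image):

```python
# Spot-check per-child RSS of the Celery prefork pool.
# Assumes psutil is available; adjust the name filter if the worker
# processes show up as "python" rather than "celery" on your platform.
import psutil

for proc in psutil.process_iter(["name", "ppid", "memory_info"]):
    if proc.info["name"] and "celery" in proc.info["name"]:
        rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
        print(f"pid={proc.pid} ppid={proc.info['ppid']} rss={rss_mb:.1f} MB")
```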
Files using PyTorch:
- `listingoptimisation_ai_agent/utils/ctr_optimization/ctr_visual_mockup.py`
Specific usage:
- Lines 18, 23-24, 26-27: torch, torchvision, pytorch_grad_cam imports
- Lines 80-92: Module-level ResNet50 model initialization + transform
- Lines 245-255: GradCAM heatmap generation (in function body)
Is it imported by Celery worker?: NO
- `ctr_visual_mockup.py` is only imported by `api_v1/endpoints/ctr_visual_mockup.py`
- Which is only imported by `api.py` (FastAPI router)
- `celery_worker.py` does NOT import any of these
Can PyTorch be made lazy?: YES (implemented in test/remove-pytorch branch)
- Moved imports inside the function that uses them
- Model initialization deferred to first use
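A minimal sketch of that pattern (illustrative only, not the exact code in the test/remove-pytorch branch; the helper name `_get_model` is hypothetical):

```python
# Sketch: defer torch/torchvision imports and ResNet50 initialization to first
# use, so neither FastAPI nor Celery pays the cost at import time.
_model = None
_transform = None

def _get_model():
    """Build and cache the model on first call (hypothetical helper name)."""
    global _model, _transform
    if _model is None:
        import torch
        from torchvision import models, transforms

        torch.set_grad_enabled(False)  # inference only
        _model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        _model.eval()
        _transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])
    return _model, _transform
```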
- Problem: Default Celery autoscaler calls `_maybe_scale()` on EVERY task receive
- Result: 2→51 processes in 24 seconds → OOM kill (exit 128)
- Root cause: available memory is read from `/proc/meminfo` MemAvailable, but COW pages from fresh forks aren't dirty yet → it underreports actual memory use
- Fix: Added a predictive worker cap (`max_safe_workers`) based on EST_WORKER_RSS_MB
- Worker cap: `max_safe_workers = total_mb * HIGH_PCT / EST_RSS - 1`
- Three states (see the sketch after this list):
- GROW: mem < LOW_PCT (50%) → allow default autoscaler to add workers
- HOLD: LOW_PCT < mem < HIGH_PCT → don't scale up, don't shrink
- BLOCK & SHRINK: mem > HIGH_PCT (70%) → shrink pool by 1
- Result: Stable at 39 workers, 51-52% memory, no OOM
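A condensed sketch of the three-state logic (illustrative, not the exact MemoryAwareAutoscaler from commit 207e226; `_total_memory_mb` and `_memory_used_pct` are hypothetical helpers):

```python
# Illustrative three-state, memory-aware autoscaler sketch for Celery 5.x prefork.
import os
from celery.worker.autoscale import Autoscaler

LOW_PCT = 0.50    # below this: GROW
HIGH_PCT = 0.70   # above this: BLOCK & SHRINK

def _total_memory_mb() -> int:
    """Container memory budget: CONTAINER_MEMORY_MB override, else physical RAM."""
    env = os.environ.get("CONTAINER_MEMORY_MB")
    if env:
        return int(env)
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // (1024 * 1024)

def _memory_used_pct(total_mb: int) -> float:
    """Used fraction from /proc/meminfo MemAvailable (known to lag during fork bursts)."""
    with open("/proc/meminfo") as f:
        fields = dict(line.split(":", 1) for line in f)
    available_mb = int(fields["MemAvailable"].split()[0]) // 1024
    return 1.0 - available_mb / total_mb

class MemoryAwareAutoscaler(Autoscaler):
    def _maybe_scale(self, req=None):
        total_mb = _total_memory_mb()
        est_rss = int(os.environ.get("EST_WORKER_RSS_MB", "250"))
        # Predictive cap, independent of the (laggy) live reading.
        max_safe_workers = int(total_mb * HIGH_PCT / est_rss - 1)
        used = _memory_used_pct(total_mb)

        if used >= HIGH_PCT and self.processes > self.min_concurrency:
            try:
                self.pool.shrink(1)        # BLOCK & SHRINK
            except ValueError:
                pass                       # all child processes currently busy
            return True
        if used >= LOW_PCT:
            return None                    # HOLD: no growth, no shrink
        # GROW: defer to the default algorithm, but never beyond the cap.
        self.max_concurrency = min(self.max_concurrency, max_safe_workers)
        return super()._maybe_scale(req)
```

Keying the cap off EST_WORKER_RSS_MB sidesteps the laggy MemAvailable reading during fork bursts, which is what caused the original 2→51 runaway.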
- Celery's default autoscaler stops growing when there are no pending tasks
- With 39 workers at 15s each, the queue drains faster than new tasks arrive
- The autoscaler converges at the "just enough" worker count
- HOLD at 50% also caps the effective count
PyTorch doesn't affect Celery worker memory. The lazy import from test/remove-pytorch branch should be merged to avoid loading the ResNet50 model at FastAPI startup for endpoints that don't use it.
The improved autoscaler (v2 from commit 207e226) prevents OOM kills and should be deployed to production. Key env vars:
- `CELERY_WORKER_AUTOSCALER=listingoptimisation_ai_agent.utils.celery_worker:MemoryAwareAutoscaler`
- `CONTAINER_MEMORY_MB=2048` (for 2GB instances)
- `EST_WORKER_RSS_MB=250` (measured)
With 2048 MB total and 250 MB per worker:
- max_safe_workers = 2048 * 0.70 / 250 - 1 = 4.7 → 4 workers
- This matches the production `--autoscale=20,2` being capped by memory
The ~238 MB per-worker RSS is driven by production imports, not PyTorch. To reduce it:
- Lazy-load google-genai, openai, langchain (only when tasks actually use them)
- Move import-heavy modules to be loaded on-demand
- This could potentially reduce RSS to ~75-100 MB/worker
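A sketch of the deferred-import pattern (the task name and model are hypothetical; only the structure is the point):

```python
# Sketch: import heavy SDKs inside the task body so a forked worker only
# pays for them when a task actually needs them (names are hypothetical).
from celery import shared_task

@shared_task
def generate_listing_copy(prompt: str) -> str:
    # Deferred import: the openai client loads on first execution of this
    # task in a given child process, not at worker boot.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model only
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Deferred imports are paid per child on first use rather than shared copy-on-write from the parent, so the saving is largest for workers that never run these tasks.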
OOM kills lose all in-flight tasks. Set acks_late=True on critical tasks so they're re-queued on worker failure.
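For example (a sketch; `reject_on_worker_lost=True` is an additional Celery option commonly paired with acks_late for worker-kill scenarios, and the task name is hypothetical):

```python
# Sketch: acknowledge only after the task finishes, so an OOM-killed worker's
# in-flight task is redelivered to another worker.
from celery import shared_task

@shared_task(acks_late=True, reject_on_worker_lost=True)
def optimise_listing(listing_id: str) -> None:
    ...  # critical work; re-queued if the worker dies mid-task
```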
- Memory detection: `os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')` works on DO App Platform
- `/proc/meminfo` MemAvailable underreports during rapid fork due to COW pages
- Celery v5.5.3 with prefork pool
- Test used sleep tasks (no actual computation) — real tasks may have different RSS patterns
- Docker image compressed size: WITH PyTorch 880 MB → WITHOUT 536 MB (39% reduction)