A comprehensive guide to understanding the CUA-Bench codebase, focusing on benchmark launching and Docker container execution.
- Overview
- Project Structure
- High-Level Architecture
- CLI Command Flow
- Docker Container Architecture
- Task Definition & Execution
- Agent Architecture
- Session Management
- Complete Execution Flow
- Key Classes Reference
CUA-Bench is a toolkit for creating, managing, and evaluating computer-use (desktop automation) benchmarks for AI agents. It supports:
- Webtop environments: HTML/CSS/JS-based desktop simulations via Playwright
- VM environments: Real Windows/Linux/macOS/Android VMs
- Agent evaluation: Test AI agents on desktop automation tasks
- Dataset generation: Convert trajectories into training data
cua-bench/
├── cua_bench/ # Main package
│ ├── cli/ # CLI commands (run, interact, sessions, etc.)
│ ├── sessions/ # Session & provider management
│ │ └── providers/ # Docker provider implementation
│ ├── agents/ # Agent implementations (CuaAgent, GeminiAgent)
│ ├── computers/ # Desktop session providers (webtop, computer)
│ ├── batch/ # Batch execution solver (runs inside container)
│ ├── processors/ # Dataset processing (aguvis, gui-r1)
│ ├── Environment2.py # Core environment class
│ ├── core.py # Task class & make() function
│ ├── decorators.py # @tasks_config, @setup_task, etc.
│ ├── types.py # Action types (ClickAction, TypeAction, etc.)
│ └── tracing.py # Trajectory recording
├── tasks/ # Example task environments
├── Dockerfile # Container for batch execution
└── pyproject.toml # Dependencies
graph TB
subgraph "User Interface"
CLI[cb CLI]
end
subgraph "Orchestration Layer"
RUN[Run Command]
SM[Session Manager]
DP[Docker Provider]
end
subgraph "Docker Containers"
MC[Main Container<br/>cua-bench:latest]
CC[Child Container<br/>cua-xfce:latest]
end
subgraph "Inside Main Container"
BS[Batch Solver]
ENV[EnvironmentV2]
AG[Agent]
TR[Tracing]
end
subgraph "Session Providers"
WT[Webtop Session<br/>Playwright]
VM[VM Session<br/>Computer API]
end
CLI --> RUN
RUN --> SM
SM --> DP
DP --> MC
DP --> CC
MC --> BS
BS --> ENV
ENV --> AG
ENV --> TR
ENV --> WT
ENV --> VM
CC -.-> VM
| Command | Purpose |
|---|---|
| `cb run` | Execute tasks with agents or oracle solutions |
| `cb interact` | Interactive task execution (non-headless) |
| `cb sessions` | List/manage/stop running sessions |
| `cb runs` | Watch batch runs, aggregate statistics |
| `cb process` | Convert outputs to training datasets |
| `cb view-trace` | Open HTML trace viewer |
sequenceDiagram
participant U as User
participant CLI as cb run
participant DP as Docker Provider
participant DC as Docker Container
participant BS as Batch Solver
participant ENV as Environment
participant AG as Agent
U->>CLI: cb run tasks/my_env --agent cua-agent
CLI->>CLI: Load .env, validate args
CLI->>CLI: Count task variants
CLI->>DP: Create provider instance
loop For each task variant
CLI->>DP: start_session(env_path, variant_idx)
DP->>DP: Create Docker network (if needed)
DP->>DC: docker run cua-bench:latest
DC->>BS: python -m cua_bench.batch.solver
BS->>ENV: make(env_path)
BS->>ENV: reset(task_id)
alt Agent Mode
BS->>AG: perform_task(description, session)
loop Until done or max_steps
AG->>ENV: screenshot()
AG->>AG: LLM decision
AG->>ENV: execute_action()
end
else Oracle Mode
BS->>ENV: solve()
end
BS->>ENV: evaluate()
BS->>ENV: tracing.save_to_disk()
end
CLI->>U: Display results in TUI
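The per-variant loop in the sequence above, combined with the `--max-parallel` option of `cb run`, suggests bounded concurrency. The following is a minimal sketch of that idea using `asyncio.Semaphore`; the names `run_variants` and `fake_start_session` are hypothetical and are not taken from the CUA-Bench source, which may schedule sessions differently.

```python
import asyncio


async def run_variants(num_variants, max_parallel, start_session):
    """Await start_session(idx) for every variant, at most max_parallel at a time."""
    gate = asyncio.Semaphore(max_parallel)

    async def run_one(idx):
        async with gate:  # blocks while max_parallel sessions are already active
            return await start_session(idx)

    # gather() preserves input order, so results[i] belongs to variant i
    return await asyncio.gather(*(run_one(i) for i in range(num_variants)))


async def fake_start_session(variant_idx):
    # Hypothetical stand-in for launching a container and waiting for it to exit.
    await asyncio.sleep(0.01)
    return {"variant": variant_idx, "status": "completed"}


results = asyncio.run(
    run_variants(num_variants=5, max_parallel=2, start_session=fake_start_session)
)
```

A semaphore keeps the launcher simple: every variant is scheduled up front, and the limit is enforced at the point where a session would actually start.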
graph LR
subgraph "Docker Network: cua-bench_default"
MC1[Main Container 1<br/>Session: abc123]
MC2[Main Container 2<br/>Session: def456]
CC1[Child Container<br/>XFCE Desktop]
CC2[Child Container<br/>XFCE Desktop]
end
MC1 -->|API Port 8000| CC1
MC2 -->|API Port 8000| CC2
HOST[Host Machine]
HOST -->|VNC 6901| CC1
HOST -->|VNC 6902| CC2
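The diagram shows each child container's VNC endpoint on a distinct host port (6901, 6902). How the Docker provider actually picks ports is not stated here; the sketch below shows one simple approach, choosing the lowest unused port at or above a base value. The function name `allocate_ports` and the `used` set are assumptions made for illustration.

```python
def allocate_ports(used, base=6901, count=1):
    """Return the `count` lowest ports >= base that are not in `used`."""
    ports = []
    candidate = base
    while len(ports) < count:
        if candidate not in used:
            ports.append(candidate)
        candidate += 1
    return ports


used = set()
first = allocate_ports(used)    # port for the first child container
used.update(first)
second = allocate_ports(used)   # port for the second child container
```

A real implementation would also need to check that the port is free on the host, for example by attempting to bind a socket to it.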
flowchart TD
A[start_session called] --> B{Task needs<br/>child container?}
B -->|Yes: provider=computer| C[Create XFCE Container]
B -->|No: provider=webtop| D[Skip child container]
C --> E[Allocate VNC + API ports]
E --> F[Start child container]
F --> G[Build CUA_TASK_CONTAINERS env var]
D --> H[Build docker run command]
G --> H
H --> I[Mount /app/env read-only]
I --> J[Mount /tmp/td_output for traces]
J --> K[Pass env vars:<br/>API keys, BATCH_TASK_INDEX]
K --> L[docker run -d cua-bench:latest]
L --> M[Save session to ~/.cua/cbsessions.json]
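The steps in this flow (read-only mount at `/app/env`, trace mount at `/tmp/td_output`, environment variables, detached `docker run`) can be sketched as a function that assembles the argument list. This is an illustration only: the function name, the `--name` scheme, and the choice of which API keys to forward are assumptions, not the Docker provider's actual code.

```python
import os


def build_docker_run_command(session_id, env_path, output_dir, variant_idx,
                             network="cua-bench_default",
                             image="cua-bench:latest",
                             passthrough=("ANTHROPIC_API_KEY", "GOOGLE_API_KEY"),
                             environ=None):
    """Assemble a `docker run` argument list for one session."""
    environ = os.environ if environ is None else environ
    cmd = [
        "docker", "run", "-d",
        "--name", f"cb-{session_id}",  # hypothetical naming scheme
        "--network", network,
        "-v", f"{os.path.abspath(env_path)}:/app/env:ro",        # task code, read-only
        "-v", f"{os.path.abspath(output_dir)}:/tmp/td_output",   # traces written here
        "-e", f"BATCH_TASK_INDEX={variant_idx}",
    ]
    for key in passthrough:
        if key in environ:
            # `-e KEY` without a value forwards the caller's value,
            # which keeps secrets out of the argument list.
            cmd += ["-e", key]
    cmd.append(image)
    return cmd
```

The resulting list could be handed to `subprocess.run(cmd, env=..., check=True)`; returning a list rather than a shell string avoids quoting problems with paths that contain spaces.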
graph TD
A[python:3.12-slim] --> B[System Dependencies<br/>libnss3, libxkbcommon0, etc.]
B --> C[Node.js + npm]
C --> D[Claude Code CLI]
D --> E[Python Dependencies<br/>pip install -e .]
E --> F[Playwright + Chromium]
F --> G[Non-root user: cuauser]
G --> H[cua-bench:latest]
Tasks are defined using decorators that mark functions for discovery:
import cua_bench as cb
from cua_bench.types import ClickAction  # assumed import path; action types are defined in types.py

@cb.tasks_config(split="train")
def load():
    """Returns list of Task objects"""
    return [
        cb.Task(
            description="Click the Submit button",
            computer={"provider": "webtop", "setup_config": {...}}
        )
    ]

@cb.setup_task(split="train")
async def start(task_cfg, session):
    """Initialize task state"""
    await session.launch_window(html="...", title="My App")

@cb.solve_task(split="train")
async def solve(task_cfg, session):
    """Oracle solution"""
    await session.execute_action(ClickAction(x=100, y=200))

@cb.evaluate_task(split="train")
async def evaluate(task_cfg, session) -> list[float]:
    """Return rewards"""
    # `success` stands for a task-specific check of the final state
    return [1.0 if success else 0.0]

flowchart TD
A[make(env_path)] --> B[Load main.py as module]
B --> C[Scan for _td_type attributes]
C --> D{Function type?}
D -->|tasks_config| E[Register tasks_config_fn]
D -->|setup_task| F[Register setup_task_fn]
D -->|solve_task| G[Register solve_task_fn]
D -->|evaluate_task| H[Register evaluate_task_fn]
E --> I[Create EnvironmentV2]
F --> I
G --> I
H --> I
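The discovery flow above relies on decorators leaving a `_td_type` attribute on each function, which a loader then scans for. A minimal sketch of that pattern follows. Only the attribute name `_td_type` comes from the diagram; `mark`, `discover`, and `_td_meta` are hypothetical names, and the sketch builds a throwaway module in memory instead of loading `main.py` from disk.

```python
import types


def mark(td_type, **meta):
    """Hypothetical decorator factory: tag a function with its role."""
    def wrap(fn):
        fn._td_type = td_type
        fn._td_meta = meta
        return fn
    return wrap


def discover(module):
    """Map each _td_type found in `module` to the function carrying it."""
    registry = {}
    for name in dir(module):
        obj = getattr(module, name)
        td_type = getattr(obj, "_td_type", None)
        if callable(obj) and td_type is not None:
            registry[td_type] = obj
    return registry


demo = types.ModuleType("demo_env")


@mark("tasks_config", split="train")
def load():
    return ["task-0", "task-1"]


@mark("evaluate_task", split="train")
def evaluate(task_cfg, session):
    return [1.0]


demo.load = load
demo.evaluate = evaluate
demo.helper = lambda: None  # untagged, so discovery ignores it

registry = discover(demo)
```

Tagging functions with an attribute means the task file needs no explicit registration call: importing the module is enough for the loader to find every hook.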
stateDiagram-v2
[*] --> Created: make()
Created --> Ready: reset(task_id)
Ready --> Stepping: step(action)
Stepping --> Ready: screenshot returned
Ready --> Solving: solve()
Solving --> Ready: oracle complete
Ready --> Evaluated: evaluate()
Evaluated --> Closed: close()
Closed --> [*]
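The lifecycle above can be read as a transition table keyed by the current state and the method being called. The sketch below encodes the diagram that way so that an out-of-order call fails loudly. It is an illustration of the state machine, not a claim about how `EnvironmentV2` enforces ordering; `TRANSITIONS`, `advance`, and `replay` are hypothetical names.

```python
TRANSITIONS = {
    ("Created", "reset"): "Ready",
    ("Ready", "step"): "Ready",      # passes through Stepping until the screenshot returns
    ("Ready", "solve"): "Ready",     # passes through Solving until the oracle completes
    ("Ready", "evaluate"): "Evaluated",
    ("Evaluated", "close"): "Closed",
}


def advance(state, call):
    """Return the state reached by invoking `call` in `state`."""
    try:
        return TRANSITIONS[(state, call)]
    except KeyError:
        raise ValueError(f"{call}() is not valid in state {state}") from None


def replay(calls, state="Created"):
    """Apply a sequence of calls and return the final state."""
    for call in calls:
        state = advance(state, call)
    return state


final_state = replay(["reset", "step", "step", "evaluate", "close"])
```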
classDiagram
class BaseAgent {
<<abstract>>
+name() str
+perform_task(description, session, logging_dir) AgentResult
}
class CuaAgent {
+name() "cua-agent"
+perform_task() AgentResult
-_create_custom_computer()
}
class GeminiAgent {
+name() "gemini"
+perform_task() AgentResult
}
class AgentResult {
+total_input_tokens: int
+total_output_tokens: int
+failure_mode: FailureMode
}
BaseAgent <|-- CuaAgent
BaseAgent <|-- GeminiAgent
CuaAgent --> AgentResult
GeminiAgent --> AgentResult
flowchart TD
A[perform_task called] --> B[Create computer adapter]
B --> C[Take initial screenshot]
C --> D{Step < max_steps?}
D -->|Yes| E[Send screenshot + task to LLM]
E --> F[Parse action from response]
F --> G{Action type?}
G -->|click| H[Execute click]
G -->|type| I[Execute typing]
G -->|key| J[Execute keypress]
G -->|done| K[Exit loop]
H --> L[Take screenshot]
I --> L
J --> L
L --> M[Track tokens]
M --> D
D -->|No| N[Max steps reached]
K --> O[Return AgentResult]
N --> O
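The loop in the flowchart above can be sketched with the LLM call replaced by a scripted function. Everything here is a stand-in written for illustration: `FakeSession`, `LoopResult`, `agent_loop`, and `scripted` are hypothetical, actions are plain dictionaries rather than the real action classes, and the token counts are invented constants.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class LoopResult:
    steps: int                 # number of actions executed
    total_input_tokens: int
    total_output_tokens: int
    finished: bool             # True if "done" arrived before max_steps


class FakeSession:
    """Hypothetical desktop session that only records executed actions."""
    def __init__(self):
        self.executed = []

    async def screenshot(self):
        return b"png-bytes"

    async def execute_action(self, action):
        self.executed.append(action)


async def agent_loop(description, session, decide, max_steps=10):
    tokens_in = tokens_out = 0
    shot = await session.screenshot()
    for step in range(max_steps):
        action, used_in, used_out = await decide(description, shot)
        tokens_in += used_in
        tokens_out += used_out
        if action["type"] == "done":
            return LoopResult(step, tokens_in, tokens_out, True)
        await session.execute_action(action)
        shot = await session.screenshot()  # the next decision sees the new state
    return LoopResult(max_steps, tokens_in, tokens_out, False)


def scripted(plan):
    """Build a decide() that replays a fixed list of actions."""
    remaining = iter(plan)

    async def decide(description, shot):
        return next(remaining), 50, 5  # (action, input tokens, output tokens)

    return decide


session = FakeSession()
plan = [{"type": "click", "x": 100, "y": 200},
        {"type": "type", "text": "hello"},
        {"type": "done"}]
result = asyncio.run(agent_loop("Click the Submit button", session, scripted(plan)))
```

Passing `decide` as a parameter keeps the loop testable without network access; a real agent would put the model call and response parsing behind the same interface.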
stateDiagram-v2
[*] --> running: start_session()
running --> completed: Container exits 0
running --> failed: Container exits non-zero
running --> stopped: stop_session()
completed --> deleted: cleanup
failed --> deleted: cleanup
stopped --> deleted: cleanup
deleted --> [*]
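The transitions out of `running` depend only on the container's exit code and on whether `stop_session()` was called. A minimal sketch of that mapping follows; `session_status` and its parameters are hypothetical names, and the rule that an explicit stop takes precedence over the exit code is an assumption.

```python
def session_status(exit_code, stop_requested=False):
    """Map container state to a session status.

    exit_code is None while the container is still running.
    """
    if stop_requested:
        return "stopped"
    if exit_code is None:
        return "running"
    return "completed" if exit_code == 0 else "failed"
```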
Sessions are tracked in ~/.cua/cbsessions.json:
{
  "session_id": "abc123",
  "container_id": "sha256:...",
  "env_path": "/path/to/tasks/my_env",
  "output_dir": "/tmp/outputs/abc123",
  "status": "running",
  "provider": "docker",
  "run_id": "run_20240101_120000",
  "agent": "cua-agent",
  "model": "claude-sonnet-4-20250514",
  "child_containers": ["container_id_1"],
  "created_at": 1704110400
}

flowchart TB
subgraph "1. CLI Layer"
A[cb run tasks/my_env<br/>--agent cua-agent]
B[Parse arguments]
C[Load .env file]
D[Count task variants]
end
subgraph "2. Session Planning"
E[Create session tasks]
F[Build container script]
G[Initialize Docker provider]
end
subgraph "3. Docker Orchestration"
H[Create shared network]
I[For each session:]
J[Start child containers<br/>if needed]
K[Start main container]
L[Mount volumes]
M[Pass env vars]
end
subgraph "4. Container Execution"
N[batch/solver.py]
O[Load environment]
P[reset(task_id)]
Q{Agent or<br/>Oracle?}
R[Agent loop]
S[Oracle solve]
T[evaluate()]
U[Save trace]
end
subgraph "5. Monitoring"
V[Watch TUI]
W[Poll container status]
X[Display rewards]
Y[Show statistics]
end
A --> B --> C --> D --> E --> F --> G
G --> H --> I --> J --> K --> L --> M
M --> N --> O --> P --> Q
Q -->|Agent| R --> T
Q -->|Oracle| S --> T
T --> U
G --> V --> W --> X --> Y
The core environment wrapper that delegates to session providers:
class EnvironmentV2:
    # Injected functions from decorators
    tasks_config_fn: Callable    # Load task list
    setup_task_fn: Callable      # Initialize task state
    solve_task_fn: Callable      # Oracle solution
    evaluate_task_fn: Callable   # Compute rewards

    # Session management
    session: DesktopSession      # Webtop or Computer provider
    bot: Bot                     # Helper for solution writing
    tracing: Tracing             # Trajectory recording

    # Lifecycle methods
    async def reset(task_id) -> Tuple[Image, Task]
    async def step(action) -> Image
    async def solve() -> Image
    async def evaluate() -> List[float]
    async def close() -> None

# Mouse actions
ClickAction(x: int, y: int)
RightClickAction(x: int, y: int)
DoubleClickAction(x: int, y: int)
DragAction(from_x, from_y, to_x, to_y, duration)
ScrollAction(direction: str, amount: int)
# Keyboard actions
TypeAction(text: str)
KeyAction(key: str)
HotkeyAction(keys: List[str])
# Control actions
DoneAction()
WaitAction(seconds: float)

classDiagram
class SessionProvider {
<<abstract>>
+start_session(session_id, env_path, container_script, ...)
+get_session_status(session_id)
+stop_session(session_id)
+get_session_logs(session_id, tail)
}
class DockerProvider {
-network_name: str
-created_network: bool
+start_session()
+get_session_status()
+stop_session()
-_ensure_network()
-_create_child_containers()
}
class DesktopSession {
<<abstract>>
+screenshot() Image
+execute_action(action)
+launch_window(html, title)
+close()
}
class WebDesktopSession {
-browser: Browser
-page: Page
+screenshot()
+execute_action()
}
class VMDesktopSession {
-computer: Computer
+screenshot()
+execute_action()
}
SessionProvider <|-- DockerProvider
DesktopSession <|-- WebDesktopSession
DesktopSession <|-- VMDesktopSession
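The `DesktopSession` hierarchy above is an abstract interface with one subclass per backend. The sketch below shows that shape with Python's `abc` module and an in-memory test double. It is not the project's code: `InMemorySession` is hypothetical, and `ClickAction` and `TypeAction` are local dataclasses that only mirror the signatures listed earlier.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ClickAction:  # local stand-in mirroring ClickAction(x, y)
    x: int
    y: int


@dataclass
class TypeAction:   # local stand-in mirroring TypeAction(text)
    text: str


class DesktopSession(ABC):
    """Interface that every backend has to implement."""

    @abstractmethod
    async def screenshot(self): ...

    @abstractmethod
    async def execute_action(self, action): ...


class InMemorySession(DesktopSession):
    """Hypothetical test double: logs actions instead of driving a desktop."""

    def __init__(self):
        self.log = []

    async def screenshot(self):
        return f"frame-{len(self.log)}"

    async def execute_action(self, action):
        self.log.append(action)


async def demo():
    session = InMemorySession()
    await session.execute_action(ClickAction(x=100, y=200))
    await session.execute_action(TypeAction(text="hello"))
    return session.log, await session.screenshot()


log, frame = asyncio.run(demo())
```

Because callers depend only on the abstract methods, the same agent or oracle code can run against a Playwright-backed session, a VM-backed session, or a double like this one in unit tests.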
# Run with oracle solution
cb run tasks/my_env --oracle
# Run with AI agent
cb run tasks/my_env --agent cua-agent --model claude-sonnet-4-20250514
# Run multiple variants in parallel
cb run tasks/my_env --max-parallel 8 --max-variants 10
# Interactive mode (opens browser)
cb interact tasks/my_env

# List all sessions
cb sessions list
# View logs for a session
cb sessions logs <session_id>
# Stop a running session
cb sessions stop <session_id>
# Clean up stopped sessions
cb sessions --clean

# View trace in browser
cb view-trace ./outputs/session_abc123
# Process for training data
cb process ./outputs --mode aguvis-stage-1
# Push to Hugging Face
cb process ./outputs --push-to-hub username/dataset

| Variable | Purpose |
|---|---|
| `ANTHROPIC_API_KEY` | API key for Claude models |
| `GOOGLE_API_KEY` | API key for Gemini models |
| `BATCH_TASK_INDEX` | Current task variant index (set by container) |
| `BATCH_TASK_COUNT` | Total variants (set by container) |
| `CUA_TASK_CONTAINERS` | JSON of child container info |
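Code running inside the main container could read the batch variables from the table with a small helper like the one below. This is a sketch under assumptions: the helper name, the returned keys, and the fallback values (index 0, count 1) are invented for illustration, and the shape of the `CUA_TASK_CONTAINERS` JSON is not specified here, so it is parsed and returned as-is.

```python
import json
import os


def read_batch_env(environ=None):
    """Collect batch-related settings from the environment."""
    environ = os.environ if environ is None else environ
    raw_containers = environ.get("CUA_TASK_CONTAINERS")
    return {
        # Fallbacks are assumptions for running outside a batch container.
        "task_index": int(environ.get("BATCH_TASK_INDEX", "0")),
        "task_count": int(environ.get("BATCH_TASK_COUNT", "1")),
        # Parsed without assuming any particular schema.
        "containers": json.loads(raw_containers) if raw_containers else None,
    }


settings = read_batch_env({"BATCH_TASK_INDEX": "3", "BATCH_TASK_COUNT": "10"})
```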