| name | version | license | description |
|---|---|---|---|
| remote-coding-orchestration | 0.1 | CC0-1.0 | A reusable playbook for coordinating one or more coding agents (Codex CLI, Claude Code, Cursor, etc.) on a separate dev machine via SSH with a tight feedback loop: scenario-driven harness, frequent commits, retro notes, gated research, and an evidence-based watchdog to keep work progressing without chat spam. |

This document is an agent-agnostic operating method for shipping a non-trivial engineering project with coding agents. It assumes you have:
- a codebase with tests (or a scenario harness you’re building)
- a “worker” machine (local or remote) where agents do most of the editing
- a human operator who steers via short timeboxed loops
It is intentionally tool/runtime neutral. Replace placeholders for your environment.
- Keep agent work measurable (tests/scenarios drive progress).
- Avoid “local maxima” (overfitting tests, brittle hacks, chatty coordination).
- Keep the worker non-idle while minimizing thrash and spam.
- Make progress visible via small commits + retro notes.
Each iteration must be short and outcome-driven.
- Pick exactly one next scenario / failure mode
  - Smallest test/scenario that increases real capability.
  - Prefer scenarios that are hard to game.
- Task an agent with a timebox (default: 30 minutes)
  - Codex-style agents are best for “run tests → patch → rerun”.
  - Review-style agents are best for tightening docs/harness assertions.
- Proof
  - Require: test/harness output + `git diff --stat`.
- Checkpoint
  - One small commit with a clear message.
- Retro (2–6 bullets)
  - What worked
  - What got stuck
  - Biggest bottleneck
  - Next adjustment
Progress only counts if:
- a scenario/test was added, or
- the suite moved closer to green (and the delta is committed).
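
The progress rule above can be written down as a small check. This is a minimal sketch, not part of any particular harness: the `IterationResult` record and its field names are hypothetical, and it assumes you can count failing scenarios before and after a run.

```python
from dataclasses import dataclass


@dataclass
class IterationResult:
    """Hypothetical summary of one timeboxed iteration."""
    scenarios_added: int   # new scenarios/tests added in this iteration
    failures_before: int   # failing scenarios before the run
    failures_after: int    # failing scenarios after the run
    committed: bool        # True if the delta landed in a commit


def counts_as_progress(result: IterationResult) -> bool:
    """Apply the rule: a scenario was added, or the suite moved
    closer to green and that delta is committed."""
    if result.scenarios_added > 0:
        return True
    closer_to_green = result.failures_after < result.failures_before
    return closer_to_green and result.committed
```

An uncommitted improvement returns `False` here on purpose: a fix that is not committed cannot be reviewed or reverted, so it is not counted.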
- SSH target: `WORKER_SSH_HOST` (example: `devbox`)
- Repo location: `~/src/REPO`
- Ensure PATH is correct in non-interactive SSH sessions (macOS Homebrew example): `export PATH=/opt/homebrew/bin:$PATH`
- Codex CLI: `codex`
- Claude Code: `claude`
Use whatever you have installed; the playbook only assumes you can run an agent non-interactively.
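
As an illustration of running a command non-interactively on the worker, the sketch below builds an `ssh` argument list that fixes PATH and changes into the repo before running the command. The function name `build_ssh_command` is hypothetical; the host, repo path, and PATH prefix are the placeholders used in this playbook. It assumes the repo path and command are trusted strings without characters that need shell quoting.

```python
def build_ssh_command(host: str, repo_dir: str, command: str,
                      extra_path: str = "/opt/homebrew/bin") -> list[str]:
    """Return an argv list that runs `command` inside `repo_dir` on `host`.

    The remote string is interpreted by the remote login shell, so `~`
    in repo_dir expands there. Inputs are assumed to be trusted.
    """
    remote = f"export PATH={extra_path}:$PATH && cd {repo_dir} && {command}"
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which keeps unattended runs from hanging on input.
    return ["ssh", "-o", "BatchMode=yes", host, remote]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(..., timeout=...) once the host exists.
    print(build_ssh_command("devbox", "~/src/REPO", "git status --porcelain"))
```

Passing the list to `subprocess.run` with a `timeout` gives the operator a hard upper bound on how long a remote step can block.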
- Use artifact-oriented prompts:
  - exact file paths to edit
  - exact command(s) to run
  - exact success criteria
- Always include a timebox (10–30 minutes).
- Require end-of-task proof:
  - harness/test output
  - `git diff --stat`
  - and optionally `git status --porcelain`
- If an agent becomes interactive or silent:
  - kill and restart with a smaller prompt
  - reduce scope (“get suite green, make one commit”)
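
One way to keep prompts artifact-oriented is to generate them from structured inputs. The sketch below is illustrative only: `build_prompt` and its wording are hypothetical, and the file paths and commands in the usage example are made up.

```python
def build_prompt(files: list[str], commands: list[str],
                 success_criteria: list[str], timebox_minutes: int = 30) -> str:
    """Assemble a prompt with exact files, commands, criteria, and a timebox."""
    if not 10 <= timebox_minutes <= 30:
        raise ValueError("timebox should be between 10 and 30 minutes")
    lines = [f"Timebox: {timebox_minutes} minutes.", "Edit only these files:"]
    lines += [f"- {path}" for path in files]
    lines.append("Run exactly:")
    lines += [f"- {cmd}" for cmd in commands]
    lines.append("Success criteria:")
    lines += [f"- {item}" for item in success_criteria]
    # End-of-task proof is requested in every prompt.
    lines.append("When done, paste the harness/test output and `git diff --stat`.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(["src/app.py"], ["pytest -q"], ["all scenarios pass"], 20))
```

Rejecting timeboxes outside 10–30 minutes at build time is a design choice in this sketch; a softer variant could clamp the value instead.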
Suggested division of labor:
- Builder agent: minimal code changes to satisfy scenarios
- Reviewer agent: docs/harness hardening; adversarial test design; usability critique
These sources have good, practical framing:
- Anthropic: Demystifying evals for AI agents (tasks/trials/graders/traces/outcomes)
- Layered testing for agentic systems (unit → deterministic scenarios → E2E)
- Deterministic replay (record/replay to reduce non-determinism)
- Make tasks, trials, graders, traces, outcomes explicit.
- Prefer outcome-based evals (verifiable end state), not just string matching.
- Capture traces for debugging and regression detection.
- Add a replay mode when possible.
- Keep scenarios realistic and progressively harder.
- Add negative assertions (what must not happen).
- Add anti-gaming checks:
  - ensure the real binary runs (not only a stub)
  - ensure outputs have structure, not just keywords
  - assert dedupe/no-spam constraints
- Happy-path semantics
- Anti-gaming assertions + negative assertions
- 3+ agent scenarios + interleaved events
- Messy reality (stale entries, partial intents, expiry)
- Adversarial cases (token collisions, substring traps, repeated posts)
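
To make the structural, negative, and dedupe checks concrete, here is a minimal sketch of a grader over harness output. It assumes, purely for illustration, that the tool under test emits one JSON object per line; the required keys and the forbidden action name are hypothetical and should be replaced with your own schema.

```python
import json

REQUIRED_KEYS = {"agent", "action", "target"}  # hypothetical event schema
FORBIDDEN_ACTIONS = {"force_push"}             # hypothetical "must not happen" action


def check_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    violations = []
    seen = set()
    for n, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Structure check: keyword-only text fails here.
            violations.append(f"line {n}: not valid JSON")
            continue
        if not isinstance(event, dict) or not REQUIRED_KEYS.issubset(event):
            violations.append(f"line {n}: missing required keys")
            continue
        # Negative assertion: some actions must never appear.
        if event["action"] in FORBIDDEN_ACTIONS:
            violations.append(f"line {n}: forbidden action {event['action']}")
        # Dedupe/no-spam: an identical event must not be emitted twice.
        key = json.dumps(event, sort_keys=True)
        if key in seen:
            violations.append(f"line {n}: duplicate event")
        seen.add(key)
    return violations
```

A stub that prints the expected keywords without the expected structure fails the first check, which is the point of preferring structure over string matching.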
Policy suggestion:
- 1 new scenario per 30-minute builder run
- every 2nd run: reviewer tightens one existing scenario
Research is useful, but it’s easy to waste time.
- Trigger: blocked or repeating the same failure for ~8–10 minutes.
- Budget: max ~5 minutes per run.
- Output must be one artifact (short note or retro bullet) and map to:
  - a new scenario/assertion, or
  - a concrete minimal fix.
If it can’t map to a scenario/fix, stop.
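
The research gate can be expressed as two small checks. This is a sketch under assumptions: the names `should_research` and `ResearchNote` are hypothetical, the 8-minute trigger is the lower end of the range above, and it assumes the operator or harness tracks minutes blocked and minutes already spent on research in the current run.

```python
from dataclasses import dataclass
from typing import Optional


def should_research(minutes_blocked: float, minutes_researched: float,
                    trigger_minutes: float = 8, budget_minutes: float = 5) -> bool:
    """Allow research only when blocked long enough and still within budget."""
    return minutes_blocked >= trigger_minutes and minutes_researched < budget_minutes


@dataclass
class ResearchNote:
    """Hypothetical single artifact produced by a research step."""
    summary: str
    maps_to_scenario: Optional[str] = None  # new scenario/assertion it motivates
    maps_to_fix: Optional[str] = None       # concrete minimal fix it motivates

    def is_actionable(self) -> bool:
        # A note that maps to neither a scenario nor a fix means: stop researching.
        return bool(self.maps_to_scenario or self.maps_to_fix)
```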
Make “worker not idle” a system property, not a hope.
Preferred design: deterministic, evidence-based watchdog.
- Check cadence: every ~3–5 minutes.
- Productivity evidence (prefer these):
  - tracked agent PID is alive
  - agent log file was updated recently (e.g., within the last 5 minutes)
- Fallback evidence:
  - process list pattern match (be careful; it’s brittle)
- Cooldown exists only to avoid thrash; keep it short enough to avoid long idle gaps.
  - suggested default: 10 minutes
- Stuck detection (mandatory): kill and restart with a smaller prompt if either
  - the PID is alive but the log hasn’t updated within the threshold (e.g., 10 minutes), or
  - the logs show “waiting for input”.
- Default: watchdog is silent.
- Notify only on meaningful state transitions (e.g., “started a new builder run”).
- If notifying, keep it to one sentence.
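
The watchdog rules above reduce to a pure decision function, which keeps the logic deterministic and easy to test. This is a minimal sketch, not a complete watchdog: `Action`, `decide`, and `notification` are hypothetical names, the thresholds are the suggested defaults from this section, and it assumes the caller gathers the evidence (PID liveness, log age, log tail) separately. In this sketch the cooldown applies only to starting a fresh run, not to restarting a stuck one; that is a design choice you may want to change.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"        # evidence of progress, or still cooling down
    START = "start"      # no agent running: start a new run
    RESTART = "restart"  # agent looks stuck: kill and restart with a smaller prompt


def decide(pid_alive: bool, log_age_s: float, log_tail: str,
           since_last_start_s: float,
           stuck_after_s: float = 600, cooldown_s: float = 600) -> Action:
    """Map collected evidence to one watchdog action."""
    if pid_alive:
        stale = log_age_s > stuck_after_s
        waiting = "waiting for input" in log_tail.lower()
        return Action.RESTART if (stale or waiting) else Action.NONE
    if since_last_start_s < cooldown_s:
        return Action.NONE  # avoid thrash right after a start
    return Action.START


def notification(action: Action) -> Optional[str]:
    """Silent by default; one sentence only on a state transition."""
    if action is Action.START:
        return "Started a new builder run."
    if action is Action.RESTART:
        return "Restarted a stuck builder run with a smaller prompt."
    return None
```

On POSIX systems the evidence can be collected with `os.kill(pid, 0)` (raises `ProcessLookupError` if the process is gone) and `time.time() - os.path.getmtime(log_path)` for the log age; a scheduler such as cron can then call `decide` every 3–5 minutes.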
At least every ~2 iterations:
- update a repo-local Agent Guide for using the tool (<= 1 page)
- add a smoke test for the guide examples
Treat the guide as a contract:
- short
- marks unstable things “subject to change”
- tested to prevent drift
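
A smoke test for the guide needs a way to pull the examples out of it. The sketch below assumes the guide is a markdown file whose examples live in fenced code blocks; `extract_examples` is a hypothetical helper, and how each example is then executed and judged is left to your harness.

```python
FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block


def extract_examples(guide_text: str) -> list[str]:
    """Return the body of every fenced code block in the guide, in order."""
    examples, current, in_block = [], [], False
    for line in guide_text.splitlines():
        if line.strip().startswith(FENCE):
            if in_block:
                # Closing fence: store the collected block.
                examples.append("\n".join(current))
                current = []
            in_block = not in_block
        elif in_block:
            current.append(line)
    return examples
```

A smoke test could then run each extracted example against the real tool and fail if any exits non-zero, so the guide cannot drift from actual behavior unnoticed.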
- Run the harness/tests.
- Fix the minimal thing.
- Add exactly ONE new realistic scenario.
- Make suite pass.
- Update retro (3–6 bullets).
- Make ONE commit.
- Provide proof: harness output + `git diff --stat`.
- Review scenarios/docs.
- Identify one weak point that can be gamed.
- Tighten exactly one scenario assertion (docs/harness only).
- Provide proof: `git diff --stat`.
End state: a system that stays moving, gets harder over time, and produces durable artifacts (tests + traces + commits + retro).