name	version	license	description
remote-coding-orchestration	0.1	CC0-1.0	A reusable playbook for coordinating one or more coding agents (Codex CLI, Claude Code, Cursor, etc.) on a separate dev machine via SSH with a tight feedback loop: scenario-driven harness, frequent commits, retro notes, gated research, and an evidence-based watchdog to keep work progressing without chat spam.

Remote Coding Orchestration (Generic Playbook)

This document is an agent-agnostic operating method for shipping a non-trivial engineering project with coding agents. It assumes you have:

a codebase with tests (or a scenario harness you’re building)
a “worker” machine (local or remote) where agents do most of the editing
a human operator who steers via short timeboxed loops

It is intentionally tool/runtime neutral. Replace placeholders for your environment.

Goals

Keep agent work measurable (tests/scenarios drive progress).
Avoid “local maxima” (overfitting tests, brittle hacks, chatty coordination).
Keep the worker non-idle while minimizing thrash and spam.
Make progress visible via small commits + retro notes.

Core loop (never skip)

Each iteration must be short and outcome-driven.

Pick exactly one next scenario / failure mode
- Smallest test/scenario that increases real capability.
- Prefer scenarios that are hard to game.
Task an agent with a timebox (default: 30 minutes)
- Codex-style agents are best for “run tests → patch → rerun”.
- Review-style agents are best for tightening docs/harness assertions.
Proof
- Require: test/harness output + git diff --stat.
Checkpoint
- One small commit with a clear message.
Retro (2–6 bullets)
- What worked
- What got stuck
- Biggest bottleneck
- Next adjustment

Progress only counts if:

a scenario/test was added, or
the suite moved closer to green (and the delta is committed).
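The progress rule above can be written down as a small check. This is a minimal sketch, not part of the playbook itself: the `IterationResult` fields and the `counts_as_progress` name are hypothetical, and "closer to green" is assumed to mean fewer failing scenarios than before.

```python
from dataclasses import dataclass


@dataclass
class IterationResult:
    """Before/after counts for one loop iteration (hypothetical shape)."""
    scenarios_before: int
    scenarios_after: int
    failures_before: int
    failures_after: int
    committed: bool  # True if the delta landed in a commit


def counts_as_progress(r: IterationResult) -> bool:
    # A new scenario/test counts on its own.
    scenario_added = r.scenarios_after > r.scenarios_before
    # Moving closer to green counts only if the delta is committed.
    closer_to_green = r.failures_after < r.failures_before and r.committed
    return scenario_added or closer_to_green
```

An uncommitted fix does not count under this reading, which keeps the checkpoint step from being skipped.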

Remote worker conventions (SSH)

SSH target: WORKER_SSH_HOST (example: devbox)
Repo location: ~/src/REPO
Ensure PATH is correct in non-interactive SSH sessions (macOS Homebrew example):
- export PATH=/opt/homebrew/bin:$PATH
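One way to apply these conventions is to build every remote command through a single helper, so the PATH fix is never forgotten. The sketch below only assembles the argument list; `build_ssh_command` is a hypothetical name, and the host, repo path and command in the usage note are the placeholders from this section.

```python
import shlex


def build_ssh_command(host: str, repo: str, command: str,
                      extra_path: str = "/opt/homebrew/bin") -> list[str]:
    """Return an argv list that runs `command` inside `repo` on `host`.

    The repo path is left unquoted on purpose so that a leading ~ still
    expands on the remote side; do not pass untrusted input here.
    """
    remote = (
        f"export PATH={shlex.quote(extra_path)}:$PATH"
        f" && cd {repo} && {command}"
    )
    return ["ssh", host, remote]
```

The result can be passed to `subprocess.run(cmd, capture_output=True, text=True, timeout=...)`, for example `build_ssh_command("devbox", "~/src/REPO", "git status --porcelain")`.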

Agent binaries (examples)

Codex CLI: codex
Claude Code: claude

Use whatever you have installed; the playbook only assumes you can run an agent non-interactively.

Steering rules (keep agents productive)

Use artifact-oriented prompts:
- exact file paths to edit
- exact command(s) to run
- exact success criteria
Always include a timebox (10–30 minutes).
Require end-of-task proof:
- harness/test output
- git diff --stat
- and optionally git status --porcelain
If an agent stops to wait for interactive input or goes silent:
- kill and restart with a smaller prompt
- reduce scope (“get suite green, make one commit”)
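The steering rules above lend themselves to a prompt template that refuses to render when a required piece is missing. This is a sketch under assumptions: `build_prompt` and its wording are illustrative, and no particular agent CLI or flag is implied.

```python
def build_prompt(files, commands, success_criteria, timebox_min=30):
    """Render an artifact-oriented task prompt as plain text."""
    if not 10 <= timebox_min <= 30:
        raise ValueError("timebox should stay within 10-30 minutes")
    if not (files and commands and success_criteria):
        raise ValueError("files, commands and success criteria are all required")
    lines = [f"Timebox: {timebox_min} minutes.", "Files to edit:"]
    lines += [f"- {path}" for path in files]
    lines.append("Commands to run:")
    lines += [f"- {cmd}" for cmd in commands]
    lines.append("Done when:")
    lines += [f"- {item}" for item in success_criteria]
    # End-of-task proof is always requested, never optional.
    lines.append("End with proof: harness/test output and git diff --stat.")
    return "\n".join(lines)
```

Failing fast on an empty list is the point of the helper: a prompt without exact paths, commands or success criteria is the kind that tends to produce a silent or wandering agent.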

Suggested division of labor:

Builder agent: minimal code changes to satisfy scenarios
Reviewer agent: docs/harness hardening; adversarial test design; usability critique

Evaluation principles (recommended reading)

These sources have good, practical framing:

Anthropic: Demystifying evals for AI agents (tasks/trials/graders/traces/outcomes)
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
Layered testing for agentic systems (unit → deterministic scenarios → E2E)
- https://virtuslab.com/blog/ai/testing-evaluating-agentic-systems/
Deterministic replay (record/replay to reduce non-determinism)
- https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/

Practices to adopt

Make tasks, trials, graders, traces, outcomes explicit.
Prefer outcome-based evals (verifiable end state), not just string matching.
Capture traces for debugging and regression detection.
Add a replay mode when possible.
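A replay mode can start very small. The sketch below records the results of non-deterministic calls to a JSON file and serves them back on later runs; the `Tape` class and its key scheme are hypothetical, and it assumes the results are JSON-serializable.

```python
import json
from pathlib import Path


class Tape:
    """Record call results to disk, or replay them without calling again."""

    def __init__(self, path, mode):
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        self.path = Path(path)
        self.mode = mode
        # In replay mode the tape must already exist on disk.
        self.entries = json.loads(self.path.read_text()) if mode == "replay" else {}

    def call(self, key, fn, *args):
        if self.mode == "replay":
            # A KeyError here means the tape is missing this call.
            return self.entries[key]
        result = fn(*args)
        self.entries[key] = result
        return result

    def save(self):
        self.path.write_text(json.dumps(self.entries, indent=2))
```

A scenario would run once in record mode against the real dependency, commit the tape, and use replay mode in the regular suite so failures point at code changes rather than at flaky inputs.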

Harness guidance (avoid test overfitting)

Keep scenarios realistic and progressively harder.
Add negative assertions (what must not happen).
Add anti-gaming checks:
- ensure the real binary runs (not only a stub)
- ensure outputs have structure, not just keywords
- assert dedupe/no-spam constraints
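Two of these checks are easy to express as harness helpers. The sketch assumes, purely for illustration, that the tool under test emits one JSON object per line with `id`, `type` and `body` fields; that format and the helper names are not from the playbook.

```python
import json
from collections import Counter


def parse_events(raw, required_keys=("id", "type", "body")):
    """Assert the output is structured, not just text containing keywords."""
    events = []
    for n, line in enumerate(raw.splitlines(), 1):
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            raise AssertionError(f"line {n} is not JSON") from exc
        if not isinstance(event, dict):
            raise AssertionError(f"line {n} is not a JSON object")
        missing = [k for k in required_keys if k not in event]
        if missing:
            raise AssertionError(f"line {n} is missing keys: {missing}")
        events.append(event)
    if not events:
        raise AssertionError("no events were produced")
    return events


def check_no_spam(events, max_repeats=1):
    """Assert no (type, body) pair appears more than max_repeats times."""
    counts = Counter((e["type"], e["body"]) for e in events)
    repeated = {k: c for k, c in counts.items() if c > max_repeats}
    if repeated:
        raise AssertionError(f"duplicate events: {repeated}")
```

A stub that merely prints the expected keyword fails `parse_events`, and an implementation that posts the same message twice fails `check_no_spam`.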

Progressive realism ladder

Happy-path semantics
Anti-gaming assertions + negative assertions
3+ agent scenarios + interleaved events
Messy reality (stale entries, partial intents, expiry)
Adversarial cases (token collisions, substring traps, repeated posts)

Policy suggestion:

One new scenario per 30-minute builder run
Every second run, the reviewer tightens one existing scenario

Research in the loop (gated)

Research is useful, but it’s easy to waste time on it.

Trigger: blocked or repeating the same failure for ~8–10 minutes.
Budget: max ~5 minutes per run.
Output must be one artifact (short note or retro bullet) and map to:
- a new scenario/assertion, or
- a concrete minimal fix.

If it can’t map to a scenario/fix, stop.
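The gate can be made mechanical so that it does not depend on anyone's judgment in the moment. In this sketch the function names and the `maps_to` values are hypothetical; the thresholds default to the low end of the ranges given above.

```python
def may_start_research(minutes_blocked, research_minutes_used,
                       trigger_min=8, budget_min=5):
    """Allow research only when blocked long enough and budget remains."""
    return minutes_blocked >= trigger_min and research_minutes_used < budget_min


def research_is_actionable(maps_to):
    """Accept a research note only if it maps to a scenario or a fix."""
    return maps_to in ("scenario", "fix")
```

A note for which `research_is_actionable` returns `False` is the signal to stop researching and return to the loop.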

Keep the worker non-idle (watchdog)

Make “worker not idle” a system property, not a hope.

Preferred design: deterministic, evidence-based watchdog.

Check cadence: every ~3–5 minutes.
Productivity evidence (prefer these):
1. tracked agent PID is alive
2. agent log file was updated recently (e.g., last 5 minutes)
Fallback evidence:
- process list pattern match (be careful; it’s brittle)

Cooldown + stuck detection

Cooldown exists only to avoid thrash; keep it short enough to avoid long idle gaps
- suggested default: 10 minutes
Stuck detection (mandatory). Kill and restart with a smaller prompt if either holds:
- the PID is alive but the log hasn’t been updated within the threshold (e.g., 10 minutes)
- the logs show “waiting for input”
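The evidence checks, cooldown and stuck detection described above fit into one decision function. This is a sketch under assumptions: it is POSIX-only (`os.kill(pid, 0)` behaves differently on Windows), the `Action` names are hypothetical, and the "waiting for input" marker is whatever string your agent actually logs.

```python
import os
import time
from enum import Enum


class Action(Enum):
    NONE = "none"        # worker is productive, or still in cooldown
    START = "start"      # no live agent and cooldown has elapsed
    RESTART = "restart"  # agent is alive but stuck


def pid_alive(pid):
    try:
        os.kill(pid, 0)  # signal 0 checks existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def log_age_seconds(path, now):
    try:
        return now - os.path.getmtime(path)
    except OSError:
        return float("inf")  # a missing log counts as stale


def decide(pid, log_path, last_start, now=None, log_tail="",
           stuck_after=600, cooldown=600):
    now = time.time() if now is None else now
    if pid is not None and pid_alive(pid):
        stale = log_age_seconds(log_path, now) > stuck_after
        waiting = "waiting for input" in log_tail.lower()
        return Action.RESTART if stale or waiting else Action.NONE
    if now - last_start < cooldown:
        return Action.NONE  # avoid thrash right after a start
    return Action.START
```

Running `decide` every few minutes from cron or a timer keeps the watchdog deterministic: the same PID, log file and timestamps always give the same action.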

Noise control

Default: watchdog is silent.
Notify only on meaningful state transitions (e.g., “started a new builder run”).
If notifying, keep it to one sentence.
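Silence by default follows naturally if notifications are tied to state changes rather than to checks. A minimal sketch, with a hypothetical `TransitionNotifier` class and a caller-supplied `send` function:

```python
class TransitionNotifier:
    """Forward a message only when the observed state changes."""

    def __init__(self, send):
        self._send = send
        self._last_state = None

    def observe(self, state, message):
        if state == self._last_state:
            return False  # same state as last check: stay silent
        self._last_state = state
        # Keep the notification to its first line only.
        self._send(message.splitlines()[0] if message else state)
        return True
```

Note that the first observation counts as a transition in this sketch; seed `_last_state` with the expected initial state if that first message is unwanted.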

Operator UX / onboarding pressure (avoid local maxima)

At least every ~2 iterations:

update a repo-local Agent Guide for using the tool (<= 1 page)
add a smoke test for the guide examples

Treat the guide as a contract:

short
marks unstable things “subject to change”
tested to prevent drift

Template prompts (copy/paste)

Builder prompt (30 min)

Run the harness/tests.
Fix the minimal thing.
Add exactly ONE new realistic scenario.
Make the suite pass.
Update retro (2–6 bullets).
Make ONE commit.
Provide proof: harness output + git diff --stat.

Reviewer prompt (20–30 min)

Review scenarios/docs.
Identify one weak point that can be gamed.
Tighten exactly one scenario assertion (docs/harness only).
Provide proof: git diff --stat.

End state: a system that stays moving, gets harder over time, and produces durable artifacts (tests + traces + commits + retro).

krlvi/remote-coding-orchestration.md
