name: remote-coding-orchestration
version: 0.1
license: CC0-1.0
description: A reusable playbook for coordinating one or more coding agents (Codex CLI, Claude Code, Cursor, etc.) on a separate dev machine via SSH with a tight feedback loop: scenario-driven harness, frequent commits, retro notes, gated research, and an evidence-based watchdog to keep work progressing without chat spam.

Remote Coding Orchestration (Generic Playbook)

This document is an agent-agnostic operating method for shipping a non-trivial engineering project with coding agents. It assumes you have:

  • a codebase with tests (or a scenario harness you’re building)
  • a “worker” machine (local or remote) where agents do most of the editing
  • a human operator who steers via short timeboxed loops

It is intentionally tool- and runtime-neutral; replace the placeholders with values from your environment.

Goals

  • Keep agent work measurable (tests/scenarios drive progress).
  • Avoid “local maxima” (overfitting tests, brittle hacks, chatty coordination).
  • Keep the worker non-idle while minimizing thrash and spam.
  • Make progress visible via small commits + retro notes.

Core loop (never skip)

Each iteration must be short and outcome-driven.

  1. Pick exactly one next scenario / failure mode

    • Smallest test/scenario that increases real capability.
    • Prefer scenarios that are hard to game.
  2. Task an agent with a timebox (default: 30 minutes)

    • Codex-style agents are best for “run tests → patch → rerun”.
    • Review-style agents are best for tightening docs/harness assertions.
  3. Proof

    • Require: test/harness output + git diff --stat.
  4. Checkpoint

    • One small commit with a clear message.
  5. Retro (2–6 bullets)

    • What worked
    • What got stuck
    • Biggest bottleneck
    • Next adjustment

Progress only counts if:

  • a scenario/test was added, or
  • the suite moved closer to green (and the delta is committed).
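A minimal way to capture the proof and checkpoint for steps 3–4 (a sketch; make test stands in for whatever harness/test command your project uses):

    # capture proof for the checkpoint (`make test` is a stand-in for your harness command)
    make test 2>&1 | tee /tmp/run-proof.log
    git diff --stat
    git status --porcelain
    git commit -am "scenario: <one-line description of the capability gained>"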

Remote worker conventions (SSH)

  • SSH target: WORKER_SSH_HOST (example: devbox)
  • Repo location: ~/src/REPO
  • Ensure PATH is correct in non-interactive SSH sessions (macOS Homebrew example):
    • export PATH=/opt/homebrew/bin:$PATH
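A quick sanity check that the non-interactive environment is set up correctly (WORKER_SSH_HOST and REPO are the placeholders above):

    # verify PATH, repo location, and agent binaries in a non-interactive session
    ssh WORKER_SSH_HOST 'export PATH=/opt/homebrew/bin:$PATH
      cd ~/src/REPO || exit 1
      command -v codex; command -v claude
      git status --porcelain'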

Agent binaries (examples)

  • Codex CLI: codex
  • Claude Code: claude

Use whatever you have installed; the playbook only assumes you can run an agent non-interactively.
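A sketch of kicking off a logged, non-interactive run on the worker. The ~/agent-run paths are assumptions (they are reused by the watchdog sketch later), claude -p is only one example invocation, and exact flags vary by agent and version, so check your agent's --help:

    # start a background agent run, capture its log and PID for the watchdog below
    ssh WORKER_SSH_HOST 'export PATH=/opt/homebrew/bin:$PATH
      cd ~/src/REPO || exit 1
      mkdir -p ~/agent-run
      nohup claude -p "Run the tests, fix the minimal thing, add one scenario, make one commit." \
        > ~/agent-run/builder.log 2>&1 &
      echo $! > ~/agent-run/builder.pid'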

Steering rules (keep agents productive)

  • Use artifact-oriented prompts:
    • exact file paths to edit
    • exact command(s) to run
    • exact success criteria
  • Always include a timebox (10–30 minutes).
  • Require end-of-task proof:
    • harness/test output
    • git diff --stat
    • and optionally git status --porcelain
  • If an agent becomes interactive or silent:
    • kill and restart with a smaller prompt
    • reduce scope (“get suite green, make one commit”)

Suggested division of labor:

  • Builder agent: minimal code changes to satisfy scenarios
  • Reviewer agent: docs/harness hardening; adversarial test design; usability critique
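For example, an artifact-oriented builder tasking might look like this (the file names and test command are hypothetical placeholders):

    Timebox: 30 minutes.
    Edit only src/scheduler.py and tests/test_retry.py.
    Run: make test
    Success: the new "retry after transient failure" scenario passes and the rest of the suite stays green.
    Proof: paste the test output and git diff --stat, then make ONE commit.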

Evaluation principles (recommended reading)

These sources have good, practical framing:

Practices to adopt

  • Make tasks, trials, graders, traces, and outcomes explicit.
  • Prefer outcome-based evals (verifiable end state), not just string matching.
  • Capture traces for debugging and regression detection.
  • Add a replay mode when possible.
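A lightweight way to capture a trace per run (the traces/ layout is an assumption; make test again stands in for your harness command):

    # record one trace directory per run so failures can be replayed and regressions diffed
    RUN_DIR="traces/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$RUN_DIR"
    git rev-parse HEAD > "$RUN_DIR/commit.txt"
    make test 2>&1 | tee "$RUN_DIR/output.log"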

Harness guidance (avoid test overfitting)

  • Keep scenarios realistic and progressively harder.
  • Add negative assertions (what must not happen).
  • Add anti-gaming checks:
    • ensure the real binary runs (not only a stub)
    • ensure outputs have structure, not just keywords
    • assert dedupe/no-spam constraints
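A sketch of what such anti-gaming assertions can look like in a scenario script (the binary name, flag, and JSON shape are all hypothetical):

    # run the real binary, not a stub, and assert structure + no duplicates
    out="$(./bin/realtool process fixtures/case1.txt --json)"
    echo "$out" | jq -e '.events | length > 0' > /dev/null    # structured output, not just keywords
    dupes="$(echo "$out" | jq -r '.events[].id' | sort | uniq -d)"
    [ -z "$dupes" ] || { echo "duplicate events: no-spam constraint violated"; exit 1; }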

Progressive realism ladder

  1. Happy-path semantics
  2. Anti-gaming assertions + negative assertions
  3. Scenarios with 3+ agents and interleaved events
  4. Messy reality (stale entries, partial intents, expiry)
  5. Adversarial cases (token collisions, substring traps, repeated posts)

Policy suggestion:

  • 1 new scenario per 30-minute builder run
  • every 2nd run: reviewer tightens one existing scenario

Research in the loop (gated)

Research is useful, but it’s easy to waste time.

  • Trigger: blocked or repeating the same failure for ~8–10 minutes.
  • Budget: max ~5 minutes per run.
  • Output must be one artifact (short note or retro bullet) and map to:
    • a new scenario/assertion, or
    • a concrete minimal fix.

If it can’t map to a scenario/fix, stop.

Keep the worker non-idle (watchdog)

Make “worker not idle” a system property, not a hope.

Preferred design: deterministic, evidence-based watchdog.

  • Check cadence: every ~3–5 minutes.
  • Productivity evidence (prefer these):
    1. tracked agent PID is alive
    2. agent log file was updated recently (e.g., last 5 minutes)
  • Fallback evidence:
    • process list pattern match (be careful; it’s brittle)
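A minimal watchdog sketch along these lines (the ~/agent-run paths match the launch sketch above and are assumptions; schedule it every ~5 minutes via cron/launchd):

    # gather productivity evidence; exit silently if the worker looks busy
    PID_FILE=~/agent-run/builder.pid
    LOG_FILE=~/agent-run/builder.log

    pid="$(cat "$PID_FILE" 2>/dev/null)"
    alive=false; fresh=false
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null && alive=true          # evidence 1: tracked PID is alive
    find "$LOG_FILE" -mmin -5 2>/dev/null | grep -q . && fresh=true    # evidence 2: log updated in last 5 min

    $alive && $fresh && exit 0    # productive: stay silent; otherwise fall through to the handling below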

Cooldown + stuck detection

  • Cooldown exists only to avoid thrash; keep it short enough to avoid long idle gaps
    • suggested default: 10 minutes
  • Stuck detection (mandatory): kill and restart with a smaller prompt when
    • the PID is alive but the log hasn’t updated within the threshold (e.g., 10 minutes), or
    • the logs show “waiting for input”
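Continuing the watchdog sketch above, stuck detection plus a short cooldown might look like this (thresholds and the relaunch helper are assumptions):

    # stuck: process alive but log silent for >10 minutes, so kill it and let the restart path take over
    if $alive && [ -z "$(find "$LOG_FILE" -mmin -10 2>/dev/null)" ]; then
      kill "$pid" 2>/dev/null
    fi

    # cooldown: restart only if the last restart was more than 10 minutes ago (or never)
    STAMP=~/agent-run/last-restart
    if [ -z "$(find "$STAMP" -mmin -10 2>/dev/null)" ]; then
      touch "$STAMP"
      ~/bin/start-builder-run.sh    # hypothetical relaunch helper that issues a smaller prompt
    fi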

Noise control

  • Default: watchdog is silent.
  • Notify only on meaningful state transitions (e.g., “started a new builder run”).
  • If notifying, keep it to one sentence.
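Continuing the sketch, one way to keep notifications to state transitions only (notify is a hypothetical one-line notifier; swap in your own channel):

    # remember the last reported state and only notify when it changes
    if $alive; then new_state="builder running"; else new_state="restarted builder with a smaller prompt"; fi
    prev="$(cat ~/agent-run/state 2>/dev/null)"
    if [ "$new_state" != "$prev" ]; then
      echo "$new_state" > ~/agent-run/state
      notify "watchdog: $new_state"    # one sentence, nothing more
    fi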

Operator UX / onboarding pressure (avoid local maxima)

At least every ~2 iterations:

  • update a repo-local Agent Guide for using the tool (<= 1 page)
  • add a smoke test for the guide examples

Treat the guide as a contract:

  • short
  • marks unstable things “subject to change”
  • tested to prevent drift
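A sketch of the smoke test for the guide examples (assumes the guide lives at docs/AGENT_GUIDE.md and that its examples are fenced sh/bash blocks; both are assumptions):

    # extract fenced shell examples from the guide and run them; fails fast if the docs have drifted
    awk '/^```(sh|bash)/{f=1; next} /^```/{f=0} f' docs/AGENT_GUIDE.md > /tmp/guide-examples.sh
    bash -e /tmp/guide-examples.sh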

Template prompts (copy/paste)

Builder prompt (30 min)

  • Run the harness/tests.
  • Fix the minimal thing.
  • Add exactly ONE new realistic scenario.
  • Make suite pass.
  • Update retro (3–6 bullets).
  • Make ONE commit.
  • Provide proof: harness output + git diff --stat.

Reviewer prompt (20–30 min)

  • Review scenarios/docs.
  • Identify one weak point that can be gamed.
  • Tighten exactly one scenario assertion (docs/harness only).
  • Provide proof: git diff --stat.

End state: a system that stays moving, gets harder over time, and produces durable artifacts (tests + traces + commits + retro).
