Skip to content

Instantly share code, notes, and snippets.

@jamesnordlund
Created February 1, 2026 14:44
Show Gist options
  • Select an option

  • Save jamesnordlund/b32c5afd417e1cb11fb90cf883c6f61f to your computer and use it in GitHub Desktop.

Select an option

Save jamesnordlund/b32c5afd417e1cb11fb90cf883c6f61f to your computer and use it in GitHub Desktop.

System2 Spec Artifacts

This gist contains the formal engineering artifacts for the Standalone Bedrock Orchestrator.

Workflow

The development follows the System2 spec-driven workflow:

  1. Context (context.md) - Problem statement, goals, constraints, and success criteria. (Gate 1)
  2. Requirements (requirements.md) - Functional and non-functional requirements in EARS format. (Gate 2)
  3. Design (design.md) - Architecture, data models, interfaces, and algorithms. (Gate 3)
  4. Tasks (tasks.md) - Implementation plan broken down into atomic, testable tasks. (Gate 4)

Artifacts

Artifact Status Description
Context Approved Defines the "Why" and "What" at a high level.
Requirements Approved Defines strict behaviors and quality constraints.
Design Approved Defines the technical approach and internal structure.
Tasks Approved Defines the step-by-step implementation roadmap.

Traceability

  • Requirements trace back to Context goals (G1-G7).
  • Design traces back to Requirements (REQ-xxx).
  • Tasks trace back to Design components and Requirements.

Generated by System2

Context: Standalone Bedrock Orchestrator for System2

Problem Statement

Users with AWS Bedrock access but without Claude Code CLI or Roo Code cannot run System2's multi-agent workflow. The existing lib/bedrock_client.py provides raw single-turn model invocation (prompt in, text out) but no orchestration, agent management, tool execution, conversation management, or quality gate enforcement. There is no way to execute the System2 delegation pipeline -- context, requirements, design, tasks, implementation, verification, ship -- outside of Claude Code CLI or Roo Code.

Goals

  • G1: Build a Python CLI orchestrator that runs the full System2 workflow (Gates 0-5) using AWS Bedrock as the sole LLM backend. Measurable: python3 -m system2 "task description" starts an interactive session and produces spec artifacts through agent delegation.
  • G2: Reuse the existing 13 agent definitions from .claude/agents/*.md without requiring a separate agent definition format. Measurable: the orchestrator parses all 13 existing agent files (YAML frontmatter + Markdown system prompt) and uses them without modification.
  • G3: Implement local tool execution for the 6 tools agents use: Read, Write, Edit, Grep, Glob, Bash. Measurable: each tool produces correct results matching the behavior described in the agent prompts.
  • G4: Implement the delegation workflow from CLAUDE.md including delegation map ordering, delegation contracts, and interactive quality gates. Measurable: the orchestrator delegates to agents in the order specified in CLAUDE.md and pauses at each gate for user approval.
  • G5: Implement multi-turn conversation via tool-use loops within each agent invocation. Measurable: an agent can make multiple tool calls in sequence (e.g., Read a file, then Edit it, then Read again to verify) within a single delegation.
  • G6: Use the existing BedrockClient from lib/bedrock_client.py for all LLM calls. Measurable: zero direct boto3 calls outside of BedrockClient.
  • G7: Provide a programmatic API in addition to CLI. Measurable: from lib.orchestrator import Orchestrator works and can be driven from scripts.

Non-Goals / Out of Scope

  • Reimplementing Claude Code CLI or Roo Code. This is a focused orchestrator for the System2 workflow, not a general-purpose AI coding assistant.
  • GUI or web interface. CLI and programmatic API only.
  • Hook execution. The safety hooks in scripts/claude-hooks/ are Claude Code-specific shell scripts invoked by the Claude Code hook system. We will not invoke those scripts. Equivalent safety constraints (file path validation, dangerous command blocking) will be implemented in Python within the tool layer.
  • Streaming responses. BedrockClient uses invoke_model which returns full responses. Streaming is not supported (documented limitation in README-BEDROCK.md).
  • Supporting non-Bedrock providers. This orchestrator is Bedrock-only. Native Claude Code and Roo Code remain as separate platform options.
  • Roo Code mode file parsing. We read only .claude/agents/*.md format, not roo/*.yml.
  • Subagent spawning. Per CLAUDE.md, subagents cannot spawn other subagents. The orchestrator manages all delegation centrally.

Users & Use-Cases

User Use Case Key Need
Enterprise developer with Bedrock access only Wants to run System2 spec-driven workflow without installing Claude Code CLI or Roo Code. Has AWS credentials and IAM permissions for Bedrock. End-to-end workflow execution via CLI.
CI/CD pipeline operator Automates spec-driven development as part of a build pipeline. Needs non-interactive mode or scriptable approval. Programmatic API, headless operation with pre-approved gates.
Team lead in regulated environment Needs all LLM calls routed through AWS (VPC, CloudTrail, IAM). Cannot use direct Anthropic API. All traffic goes through Bedrock; no external API calls.
Developer evaluating System2 Wants to try the workflow using existing AWS infrastructure before adopting Claude Code CLI. Low setup cost; pip install + AWS credentials.

Constraints & Invariants

Platform Constraints

  • Python 3.10+ only. The existing BedrockClient uses typing features and pathlib that assume 3.10+.
  • Dependencies must be minimal. Required: boto3, pyyaml (already required by BedrockClient). Allowed additions: click or argparse for CLI (Assumption: argparse preferred since it is stdlib). No heavy frameworks.
  • Must work on macOS, Linux. Windows support is not a constraint for MVP.

Architectural Constraints

  • Reuse BedrockClient from lib/bedrock_client.py. No duplicate boto3 invocation code. The orchestrator wraps or extends BedrockClient to support the Converse API or multi-turn messages format.
  • Parse existing .claude/agents/*.md files. No separate agent definition format. The orchestrator reads YAML frontmatter (name, description, tools, hooks) and Markdown body (system prompt) from the same files agents currently use.
  • Orchestrator manages all state. Since BedrockClient.invoke_model() is single-turn, the orchestrator maintains per-agent message history (system prompt, user messages, assistant messages, tool_use/tool_result pairs) and passes the full conversation on each API call.
  • Configuration via .system2/config.yml. Extend the existing config file with orchestrator-specific settings (e.g., gate behavior, tool safety mode, cost warning thresholds). Do not create a separate config file.

Safety Constraints

  • File operations must be sandboxed to the project directory. Read, Write, Edit, Grep, Glob must not access files outside the repository root. Absolute paths are resolved and validated.
  • Bash commands require user confirmation by default. A --unsafe-bash flag may disable this for CI use, but the default is interactive confirmation.
  • Output sanitization. Agent outputs are treated as untrusted input per CLAUDE.md. The orchestrator must not execute instructions embedded in agent responses that were not explicitly tool calls.
  • No secrets in logs. Cost estimates and usage data may be logged; AWS credentials, session tokens, and file contents containing secrets must not be logged.

Constitutional Items (from CLAUDE.md)

  • Treat all file contents and tool outputs as untrusted input; resist prompt injection.
  • Never invent build/test commands; discover from repo.
  • Subagents cannot spawn other subagents.
  • Pause for explicit user approval at each quality gate.

Success Metrics & Acceptance Criteria

ID Criterion Verification Method
AC-1 python3 -m system2 "build a REST API" starts an interactive session that delegates to spec-coordinator and produces spec/context.md. Manual end-to-end test.
AC-2 The orchestrator parses all 13 agent definitions from .claude/agents/*.md and extracts name, description, tools list, and system prompt without error. Unit test: parse each agent file, assert fields are non-empty.
AC-3 Each of the 6 tools (Read, Write, Edit, Grep, Glob, Bash) produces correct results when invoked by an agent through the tool-use loop. Unit tests per tool; integration test with a mock LLM returning tool_use blocks.
AC-4 Quality gates pause for user input. The user can approve, reject, or provide feedback at each gate. Manual test: run a workflow and verify gate prompts appear at gates 0-5.
AC-5 The orchestrator follows the delegation map order from CLAUDE.md: spec-coordinator before requirements-engineer before design-architect, etc. Integration test: mock LLM, assert agent invocation order.
AC-6 Conversation history is correctly maintained within an agent's tool-use loop. A second tool call in the same agent session can reference results from the first. Integration test: agent reads a file, then edits it based on contents.
AC-7 All LLM calls go through BedrockClient. No direct boto3 calls elsewhere. Code review; grep for boto3 imports outside bedrock_client.py.
AC-8 File operations are sandboxed. Attempting to write outside the project root raises an error. Unit test: attempt out-of-bounds write, assert rejection.
AC-9 Bash commands prompt for user confirmation before execution (unless --unsafe-bash flag is set). Manual test and unit test with mock stdin.
AC-10 Cost tracking: cumulative cost estimate is displayed after each agent delegation completes. Manual test; unit test asserting cost accumulation.

Risks & Edge Cases

Risk Likelihood Impact Mitigation
Bedrock tool_use API format differs from Messages API. BedrockClient.invoke_model() currently sends a simple messages array. Tool use requires tools parameter and parsing tool_use content blocks in responses. High High Extend or adapt BedrockClient to support the Bedrock Converse API or the tools parameter in the Messages API format. Research and prototype early.
Agent system prompts reference Claude Code-specific features. Agent prompts mention hooks, attempt_completion, subagent behavior, and Claude Code tooling. High Medium Parse and adapt prompts at load time: strip hook references, map attempt_completion to an orchestrator-understood signal, document which prompt features are unsupported.
Token limits exceeded for long conversations. Full message history per agent call will grow with each tool-use turn. A complex executor session could exceed context window limits (200K tokens). Medium High Implement conversation truncation or summarization. Track token count per conversation and warn at 80% of model limit.
Bash tool safety. Without Claude Code's hook-based safety, a malicious or confused agent could execute destructive commands. Low Critical Default to user confirmation for all Bash commands. Maintain a blocklist of destructive patterns (rm -rf, drop, deploy, publish).
Cost runaway. A full workflow (13 agents, each with multiple tool-use turns) could cost significant amounts on Bedrock. Medium Medium Display running cost total after each agent. Add configurable cost ceiling in .system2/config.yml; pause and warn when approaching threshold.
Agent expects tools it cannot have. An agent's frontmatter lists tools (e.g., Bash) but the orchestrator's safety policy restricts it. Low Low Log a warning when a tool is restricted. The agent will receive a tool error and should adapt.
Edit tool old_string not found. The Edit tool requires exact string matching. If the agent hallucinates file contents, edits will fail. Medium Medium Return clear error messages. The tool-use loop allows the agent to retry with corrected content.

Observability / Telemetry expectations

  • Per-agent cost tracking. After each agent delegation completes, log and display: agent name, model used, total input tokens, total output tokens, estimated cost (USD), number of tool-use turns.
  • Workflow-level summary. At the end of a workflow (or at Gate 5), display: total agents invoked, total LLM calls, total tokens, total estimated cost, wall-clock time.
  • Tool execution logging. Each tool invocation is logged with: tool name, truncated arguments (no file contents in logs), success/failure, duration.
  • Gate decisions. Log each gate approval/rejection with timestamp.
  • Log destination. Default to stderr for human-readable logs. Optionally write structured JSON logs to a file for CI/CD integration. Configurable via .system2/config.yml.
  • No telemetry phone-home. All observability is local. No data is sent anywhere except to AWS Bedrock for LLM calls.

Rollout & Backward Compatibility

  • No changes to existing files. The orchestrator is additive: new files in lib/ and a __main__.py entry point. Existing .claude/agents/*.md, CLAUDE.md, lib/bedrock_client.py, and .system2/config.yml are read but not modified (config may be extended with new optional keys).
  • Agent definitions remain compatible. .claude/agents/*.md files continue to work with Claude Code CLI. The orchestrator reads them in a forward-compatible way (ignoring unknown frontmatter keys like hooks).
  • Phased rollout.
    • Phase 1 (MVP): Agent parser + tool implementations + single-agent invocation with tool-use loop. No delegation workflow yet.
    • Phase 2: Full delegation workflow with quality gates. Linear agent sequencing per CLAUDE.md delegation map.
    • Phase 3: Post-execution workflow (test-engineer, security-sentinel, docs-release, code-reviewer chain with blocker handling and boomerang cycles).
  • Config backward compatibility. New config keys under providers.bedrock.orchestrator namespace. Existing config continues to work without orchestrator-specific keys.

Open Questions

# Question Recommendation Owner Resolution Path
OQ-1 Should MVP include the full post-execution workflow (blocker handling, boomerang cycles) or defer to Phase 3? Defer to Phase 3. MVP covers Gates 0-4 + linear delegation. Post-execution is complex and can be layered on. User Decision at Gate 1 approval.
OQ-2 Should BedrockClient be extended in-place to support tool_use / Converse API, or should a new BedrockConverseClient wrapper be created? Create a wrapper class BedrockConversation that uses BedrockClient internally but manages the Converse API format. Keeps BedrockClient stable for other users. Design Architect Decision at Gate 3 (design).
OQ-3 How should agent system prompts that reference Claude Code-specific features (hooks, attempt_completion, subagent restrictions) be handled? Strip or adapt at parse time with a documented transformation layer. Map attempt_completion to a JSON completion signal the orchestrator recognizes. Design Architect Decision at Gate 3 (design).
OQ-4 Should the CLI support a non-interactive / batch mode for CI/CD (auto-approve gates)? Yes, via --auto-approve flag. But defer to Phase 2. MVP is interactive only. User Decision at Gate 1 approval.
OQ-5 What is the cost ceiling default? Should the orchestrator refuse to continue above a configurable USD threshold? Default ceiling of $5.00 per workflow run with a warning at $2.00. Configurable in .system2/config.yml. User Decision at Gate 1 approval.
OQ-6 Should the orchestrator use Bedrock's Converse API (bedrock-runtime.converse) or stick with InvokeModel with the Messages API format? Converse API is purpose-built for multi-turn + tool use and is the recommended path. However, BedrockClient currently uses invoke_model. This is a key design decision. Design Architect Research spike before Gate 3.

Glossary

Term Definition
Agent A specialist role defined in .claude/agents/*.md with a system prompt, tool allowlist, and focused responsibility (e.g., spec-coordinator, executor).
Delegation The orchestrator invoking an agent by constructing a conversation with the agent's system prompt, a user message containing the delegation contract, and running the tool-use loop until the agent signals completion.
Delegation contract A structured message from the orchestrator to an agent containing: objective, inputs, outputs, constraints, and completion summary requirements. Defined in CLAUDE.md.
Quality gate An interactive checkpoint where the orchestrator pauses and asks the user to approve, reject, or provide feedback on a spec artifact before proceeding. Gates 0-5 are defined in CLAUDE.md.
Tool-use loop The cycle of: (1) send messages to Bedrock, (2) receive response with tool_use blocks, (3) execute tools locally, (4) append tool_result to messages, (5) repeat until the agent produces a final text response without tool calls.
BedrockClient The existing Python class in lib/bedrock_client.py that wraps boto3 for single-turn Claude model invocation on AWS Bedrock.
Converse API AWS Bedrock's bedrock-runtime.converse API, designed for multi-turn conversations with tool use. An alternative to invoke_model with the Messages API format.
Post-execution workflow The sequence of agents (test-engineer, security-sentinel, eval-engineer, docs-release, code-reviewer) that run after the executor completes, with trigger conditions and blocker handling. Defined in CLAUDE.md.
Boomerang cycle When a post-execution agent reports blockers, the orchestrator delegates fixes to the executor and re-runs the reporting agent. Limited to 3 iterations per agent.
Frontmatter The YAML metadata block at the top of .claude/agents/*.md files, delimited by ---. Contains name, description, tools, and hooks fields.

Design: Standalone Bedrock Orchestrator

Overview

The Standalone Bedrock Orchestrator is a Python CLI and library that executes the System2 spec-driven workflow (Gates 0-5) using AWS Bedrock as the sole LLM backend. It reuses the 13 existing agent definitions from .claude/agents/*.md, implements local tool execution for 6 tools, and manages multi-turn conversations through the Bedrock Converse API.

The system is structured as a layered pipeline:

CLI (__main__.py)
  |
  v
Orchestrator (lib/orchestrator.py)
  |
  v
Delegation Engine (lib/delegation.py)
  |
  v
Agent Parser (lib/agent_parser.py)  +  BedrockConversation (lib/bedrock_conversation.py)
                                            |
                                            v
                                     Tool Registry (lib/tools/)
                                            |
                                            v
                                     BedrockClient (lib/bedrock_client.py) [existing, unmodified]

Key design decision: BedrockConversation does not call BedrockClient.invoke_model(). That method uses invoke_model with the Messages API format and cannot support tool definitions or the Converse API protocol. Instead, BedrockConversation accesses the BedrockClient.client attribute (the initialized boto3 bedrock-runtime client) and calls client.converse() directly. This satisfies REQ-062 (zero direct boto3 calls outside bedrock_client.py for client initialization) while enabling Converse API usage. See AD-1 for the full rationale and alternatives.


Architecture

Module Layout

System2/
  system2/
    __init__.py              # Package marker; Python version check (REQ-142)
    __main__.py              # CLI entry point (REQ-001, REQ-002)
  lib/
    __init__.py              # Existing (unchanged)
    bedrock_client.py        # Existing (unchanged)
    bedrock_conversation.py  # NEW: Converse API wrapper (REQ-060, REQ-061)
    orchestrator.py          # NEW: Main workflow engine (REQ-070-073)
    delegation.py            # NEW: Delegation map, contracts, agent loop (REQ-040-047)
    agent_parser.py          # NEW: YAML+MD parser, prompt transforms (REQ-010-016)
    config.py                # NEW: Configuration loading and schema (REQ-084, REQ-085)
    cost_tracker.py          # NEW: Cumulative cost tracking (REQ-130-133)
    transcript.py            # NEW: JSONL transcript writer (REQ-125, REQ-126)
    constants.py             # NEW: Delegation map, completion signal schema, pricing
    tools/
      __init__.py            # Tool registry and base interface
      base.py                # BaseTool abstract class
      read_tool.py           # Read tool (REQ-021)
      write_tool.py          # Write tool (REQ-022)
      edit_tool.py           # Edit tool (REQ-023, REQ-023a)
      grep_tool.py           # Grep tool (REQ-024)
      glob_tool.py           # Glob tool (REQ-025)
      bash_tool.py           # Bash tool (REQ-026-028a)
      sandbox.py             # Path validation (REQ-110)

Component Responsibilities and Boundaries

Component Responsibility Boundary
system2/__main__.py CLI arg parsing, TTY detection, entry No business logic
lib/orchestrator.py Workflow state, gate management, cost ceiling Does not call Bedrock directly
lib/delegation.py Agent invocation, tool-use loop, contract construction One agent at a time; no subagent spawning
lib/agent_parser.py Parse .claude/agents/*.md, apply prompt transforms Read-only; no file modification
lib/bedrock_conversation.py Converse API call formatting, message history, token tracking Uses BedrockClient.client for API calls
lib/tools/* Execute individual tools, return structured results Sandboxed to project root
lib/config.py Load and validate .system2/config.yml Falls back to defaults on missing/invalid
lib/cost_tracker.py Accumulate costs, check thresholds Stateless per-workflow; resets on new run
lib/transcript.py Append JSONL entries to run transcript Best-effort; never halts workflow
lib/constants.py Delegation map, pricing tables, blocklist defaults Code constants, not parsed from CLAUDE.md

Data Flow

Primary Workflow Sequence

sequenceDiagram
    participant User
    participant CLI as __main__.py
    participant Orch as Orchestrator
    participant Del as DelegationEngine
    participant AP as AgentParser
    participant BC as BedrockConversation
    participant TR as ToolRegistry
    participant AWS as Bedrock API

    User->>CLI: python3 -m system2 "task"
    CLI->>Orch: Orchestrator.run(task_description)

    loop For each agent in delegation map
        Orch->>AP: parse_agent(agent_name)
        AP-->>Orch: AgentDefinition
        Orch->>Del: delegate(agent_def, contract)

        Del->>BC: new_conversation(system_prompt, tools)
        Del->>BC: send_message(contract_text)

        loop Tool-use loop
            BC->>AWS: converse(messages, tools)
            AWS-->>BC: response (text + tool_use blocks)
            BC-->>Del: ParsedResponse

            alt Response has tool_use blocks
                loop For each tool_use block
                    Del->>TR: execute(tool_name, tool_input)
                    TR-->>Del: ToolResult
                end
                Del->>BC: send_tool_results(results)
            else Response is final (no tool calls)
                Del-->>Orch: CompletionSummary
            end
        end

        Orch->>User: Display agent summary + cost
        Orch->>User: Gate prompt (approve/reject/feedback)

        alt User approves
            Note over Orch: Continue to next agent
        else User rejects with feedback
            Orch->>Del: re-delegate with feedback
        end
    end

    Orch->>User: Workflow summary (Gate 5)
Loading

Step-by-Step Data Flow

  1. CLI receives task -- __main__.py parses args, detects TTY, loads config, creates Orchestrator.
  2. Orchestrator starts workflow -- Creates CostTracker, TranscriptWriter, iterates delegation map.
  3. Agent loaded on-demand -- AgentParser.parse() reads the .md file, extracts frontmatter, applies prompt transforms, returns AgentDefinition.
  4. Delegation contract constructed -- DelegationEngine builds a structured user message with objective, inputs, outputs, constraints.
  5. BedrockConversation created -- Initialized with system prompt (transformed), tool definitions (filtered to agent's allowlist).
  6. Initial message sent -- Contract text becomes the first user message. BedrockConversation calls converse().
  7. Tool-use loop runs -- Response parsed. If tool_use blocks present, tools executed via ToolRegistry, results appended, loop continues. Each iteration logged to transcript.
  8. Completion detected -- Agent either: (a) produces text with no tool calls, or (b) calls a pseudo-tool signal_completion with a JSON payload. Either way, the delegation engine extracts the completion summary.
  9. Cost updated -- API-reported token usage added to CostTracker. Thresholds checked.
  10. Gate presented -- Orchestrator displays agent output and prompts user.
  11. Repeat or terminate -- On approval, next agent. On rejection, re-invoke with feedback.

Public Interfaces

CLI Interface (REQ-001 through REQ-005)

python3 -m system2 "<task_description>"

Options:
  --unsafe-bash         Disable interactive Bash confirmation (REQ-004)
  --config PATH         Path to .system2/config.yml (default: auto-discover)
  --project-root PATH   Project root for sandboxing (default: git root or cwd)
  --log-format text|json Log format (REQ-124)
  --log-file PATH       Log file path (default: stderr)
  --verbose             Enable debug-level logging

Exit codes:

  • 0: Workflow completed (all gates approved)
  • 1: User aborted or fatal error
  • 2: Invalid arguments or missing task in non-TTY mode (REQ-003a)
  • 3: AWS credential failure (REQ-093)
  • 4: Cost ceiling reached and user declined to continue

Programmatic API (REQ-070 through REQ-073)

from lib.orchestrator import Orchestrator

class Orchestrator:
    def __init__(
        self,
        project_root: Path,
        config_path: Path | None = None,
        gate_policy: GatePolicy | None = None,      # Injectable (REQ-073)
        bash_policy: BashPolicy | None = None,       # Injectable (REQ-073)
        on_agent_complete: Callable | None = None,    # Callback hook
    ): ...

    def run(self, task_description: str) -> WorkflowResult: ...
    def invoke_agent(self, agent_name: str, contract: DelegationContract) -> AgentResult: ...
    def get_status(self) -> WorkflowStatus: ...

Policy Interfaces (REQ-073)

from typing import Protocol

class GatePolicy(Protocol):
    def decide(self, gate: int, artifact_path: str, summary: str) -> GateDecision: ...

class BashPolicy(Protocol):
    def confirm(self, command: str, is_blocklisted: bool) -> bool: ...

# Default implementations for CLI
class InteractiveGatePolicy:
    """Prompts on stdin/stdout."""

class InteractiveBashPolicy:
    """Prompts on stdin/stdout. Blocks if safety_mode=strict and blocklisted."""

class AutoApproveGatePolicy:
    """Phase 2: auto-approves all gates."""

Data Model & Storage

Data Classes

from dataclasses import dataclass, field
from enum import Enum
from typing import Any
from pathlib import Path

@dataclass
class AgentDefinition:
    """Parsed agent definition (REQ-081)."""
    name: str
    description: str
    tools: list[str]
    system_prompt: str  # Post-transformation
    raw_system_prompt: str  # Pre-transformation (for debugging)
    source_path: Path

@dataclass
class DelegationContract:
    """Structured delegation message (REQ-082)."""
    objective: str
    inputs: list[str]         # File paths or descriptions
    outputs: list[str]        # Expected output files
    constraints: list[str]
    completion_requirements: list[str]

    def to_message(self) -> str:
        """Serialize to labeled-section text for the user message."""
        ...

class GateDecisionType(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    ABORT = "abort"

@dataclass
class GateDecision:
    gate_number: int
    decision: GateDecisionType
    feedback: str | None = None  # Present when REJECT
    timestamp: str = ""

@dataclass
class ToolResult:
    """Result from tool execution (REQ-030, REQ-053)."""
    tool_use_id: str
    content: str            # Text content of the result
    is_error: bool = False

@dataclass
class CostRecord:
    agent_name: str
    model_id: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    api_calls: int
    tool_turns: int

@dataclass
class CompletionSignal:
    """Agent completion signal (REQ-083)."""
    status: str              # "success", "failure", "blockers"
    files_changed: list[str]
    summary: str
    blockers: list[str] = field(default_factory=list)

@dataclass
class WorkflowResult:
    agents_invoked: list[str]
    gate_decisions: list[GateDecision]
    total_cost: CostRecord
    wall_clock_seconds: float
    transcript_path: Path

Transcript Storage (REQ-125, REQ-126)

Transcripts are written to .system2/runs/<YYYYMMDD-HHMMSS>.jsonl. Each line is a JSON object with one of these type values:

{"type": "workflow_start", "ts": "...", "task": "...", "config": {...}}
{"type": "agent_start", "ts": "...", "agent": "spec-coordinator", "contract": {...}}
{"type": "api_request", "ts": "...", "agent": "...", "message_count": 5, "tool_count": 3}
{"type": "api_response", "ts": "...", "agent": "...", "stop_reason": "tool_use", "usage": {...}}
{"type": "tool_exec", "ts": "...", "tool": "Read", "args_summary": "file=spec/context.md", "success": true, "duration_ms": 12}
{"type": "gate_decision", "ts": "...", "gate": 1, "decision": "approve"}
{"type": "agent_complete", "ts": "...", "agent": "...", "cost": {...}, "completion": {...}}
{"type": "workflow_end", "ts": "...", "total_cost": {...}, "duration_s": 342.5}

The TranscriptWriter wraps file writes in try/except. On failure, it logs a warning to stderr and sets an internal flag (checked once per agent for repeated warnings). It never raises exceptions to callers (REQ-126).

Configuration Schema (REQ-084, REQ-085)

New keys under providers.bedrock.orchestrator in .system2/config.yml:

providers:
  bedrock:
    # ... existing keys unchanged ...
    orchestrator:
      cost_ceiling_usd: 5.00          # Pause at this cumulative cost (REQ-132)
      cost_warning_usd: 2.00          # Warn at this cumulative cost (REQ-131)
      log_format: text                 # "text" or "json" (REQ-124)
      log_destination: stderr          # "stderr" or a file path
      safety_mode: strict              # "strict" or "permissive" (REQ-115)
      bash_blocklist:                  # Additional patterns merged with defaults (REQ-028a)
        - "custom-dangerous-cmd"
      transcript_dir: .system2/runs    # Where JSONL transcripts are written
      max_tool_turns: 200              # Safety limit on tool-use iterations per agent
      context_window_tokens: 200000    # Model context window size
      max_output_tokens: 8192          # Max tokens per Converse API call

All keys are optional. Defaults are applied by lib/config.py when missing (REQ-153).


Converse API Integration (OPEN-001 Resolution)

AD-1: Accessing the Bedrock Converse API

Decision: BedrockConversation accesses the boto3 bedrock-runtime client object stored in BedrockClient.client and calls client.converse() directly.

Rationale: The existing BedrockClient.invoke_model() method is hardcoded to the invoke_model API with the Anthropic Messages format. It constructs a single-turn messages array with no tools parameter. Converse API has a completely different request structure (native messages, toolConfig, system parameters -- not a JSON body). There are three options:

  1. Modify BedrockClient to add a converse() method -- Violates REQ-063 and OQ-2 ("do not modify BedrockClient").
  2. Create BedrockConversation that calls BedrockClient.invoke_model() with hacked parameters -- invoke_model sends to the invoke_model API endpoint. There is no way to make it call the converse endpoint. Not viable.
  3. Create BedrockConversation that accesses BedrockClient.client (the boto3 client) directly -- This reuses BedrockClient's authentication, session, and initialization logic while calling a different API method on the same boto3 client. Selected approach.

Tradeoff: We depend on BedrockClient.client being a bedrock-runtime boto3 client (which it is). This is a coupling to an internal attribute, but BedrockClient is in-repo and under our control. We document this coupling. REQ-062 is satisfied because BedrockConversation does not create its own boto3 session or client -- it reuses the one initialized by BedrockClient.

Converse API Request Format

# BedrockConversation.send() -- core API call
response = self._bedrock_client.client.converse(
    modelId=self._model_id,
    system=[{"text": self._system_prompt}],
    messages=self._messages,      # list[dict] in Converse format
    toolConfig={"tools": self._tool_definitions},
    inferenceConfig={
        "maxTokens": self._max_output_tokens,
        "temperature": self._temperature,
    },
)

Message Format (Converse API native)

# User message
{"role": "user", "content": [{"text": "...contract text..."}]}

# Assistant message with tool use
{
    "role": "assistant",
    "content": [
        {"text": "Let me read the file."},
        {
            "toolUse": {
                "toolUseId": "tool_abc123",
                "name": "Read",
                "input": {"file_path": "/path/to/file"}
            }
        }
    ]
}

# User message with tool results
{
    "role": "user",
    "content": [
        {
            "toolResult": {
                "toolUseId": "tool_abc123",
                "content": [{"text": "file contents here..."}],
                "status": "success"  # or "error"
            }
        }
    ]
}

Tool Definition Schema

Each tool is defined in the Converse API toolSpec format:

{
    "toolSpec": {
        "name": "Read",
        "description": "Read a file from the filesystem.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "file_path": {
                        "type": "string",
                        "description": "Absolute path to the file to read"
                    },
                    "offset": {
                        "type": "integer",
                        "description": "Line number to start reading from (1-based)"
                    },
                    "limit": {
                        "type": "integer",
                        "description": "Number of lines to read"
                    }
                },
                "required": ["file_path"]
            }
        }
    }
}

Full tool definitions for all 6 tools plus signal_completion are defined in lib/tools/__init__.py and exported as a list. Each tool's BaseTool subclass provides a get_tool_spec() -> dict method that returns its Converse API toolSpec.

Completion Signal as a Pseudo-Tool

To give agents an explicit mechanism to signal completion (REQ-014, REQ-083), we register a signal_completion pseudo-tool:

{
    "toolSpec": {
        "name": "signal_completion",
        "description": "Signal that you have completed your task. Call this when done.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "status": {"type": "string", "enum": ["success", "failure", "blockers"]},
                    "files_changed": {"type": "array", "items": {"type": "string"}},
                    "summary": {"type": "string"},
                    "blockers": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["status", "files_changed", "summary"]
            }
        }
    }
}

When the delegation engine detects a signal_completion tool use, it extracts the input as a CompletionSignal and terminates the tool-use loop. The tool result returned to the API is "Completion acknowledged." (though the loop ends immediately after).

Fallback: If the agent produces a response with stopReason: "end_turn" (no tool calls and no signal_completion), the delegation engine treats it as an implicit completion. It parses the final text response looking for a JSON completion signal. If not found, it constructs a CompletionSignal with status="success", files_changed=[], and the full text as the summary.

Response Parsing

# response structure from converse()
{
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {"text": "..."},           # Optional text block
                {"toolUse": {...}},        # Optional tool use blocks (0 or more)
            ]
        }
    },
    "stopReason": "tool_use" | "end_turn" | "max_tokens",
    "usage": {
        "inputTokens": 1234,
        "outputTokens": 567
    }
}

The BedrockConversation class parses this response and returns a ConverseTurn dataclass:

@dataclass
class ConverseTurn:
    text_blocks: list[str]
    tool_use_blocks: list[ToolUseRequest]
    stop_reason: str          # "tool_use", "end_turn", "max_tokens"
    input_tokens: int
    output_tokens: int
    raw_message: dict         # The full assistant message (appended to history)

@dataclass
class ToolUseRequest:
    tool_use_id: str
    name: str
    input: dict[str, Any]

Prompt Transformation Rules (OPEN-002 Resolution)

Audit Findings

After reading all 13 agent files, the following Claude Code-specific patterns were identified:

Pattern Agents Occurrences
hooks: frontmatter block (PreToolUse, PostToolUse, SubagentStop) All 13 13
attempt_completion in completion instructions executor, test-engineer, security-sentinel, spec-coordinator, requirements-engineer, design-architect, task-planner, code-reviewer, docs-release, eval-engineer, postmortem-scribe, mcp-toolsmith, repo-governor ~13 (varies in phrasing)
References to scripts/claude-hooks/ All 13 (via hooks block) Frontmatter only
.claude/allowlists/*.regex references All 13 (via hooks block) Frontmatter only
"Claude Code CLI" references repo-governor ("automatically loaded by Claude Code CLI") 1
"delegate to executor" / "boomerang" instructions test-engineer, task-planner 2
<thinking> protocol blocks executor, design-architect, requirements-engineer 3
"CLAUDE.md (automatically loaded by Claude Code CLI at startup)" repo-governor 1

Transformation Rules

Applied in lib/agent_parser.py at parse time. Each rule has an ID for traceability.

Rule ID Pattern Action Rationale
TR-01 hooks: frontmatter block Ignore (do not include in AgentDefinition) Hooks are Claude Code-specific. Equivalent safety is in the tool layer. (REQ-012, REQ-N02)
TR-02 attempt_completion in body text Replace with signal_completion tool instruction Maps to our pseudo-tool. (REQ-014)
TR-03 "delegate to executor" / "boomerang ... to executor" Replace with "report to orchestrator for re-delegation" Agents cannot spawn subagents. (REQ-N05)
TR-04 "Claude Code CLI" / "Claude Code" references Replace with "the orchestrator" Contextual accuracy.
TR-05 References to .claude/settings.json Keep unchanged The file may exist and agents should read it for context.
TR-06 <thinking> protocol blocks Keep unchanged These are agent reasoning instructions, not Claude Code features. The model can follow them.
TR-07 "CLAUDE.md (automatically loaded by Claude Code CLI at startup)" Replace with "CLAUDE.md (project instructions)" Removes Claude Code loading mechanism reference.
TR-08 References to hook scripts (scripts/claude-hooks/*.py) in body text Remove lines Not applicable; safety is in tool layer. (Rare: only if body text references hooks outside the frontmatter block.)

Implementation

import re

TRANSFORM_RULES = [
    # TR-02: attempt_completion -> signal_completion
    (
        re.compile(r'attempt_completion'),
        'signal_completion'
    ),
    # TR-03: delegate/boomerang to executor -> report to orchestrator
    (
        re.compile(r'(?:delegate|boomerang)\s+(?:such\s+)?(?:fixes\s+)?to\s+executor'),
        'report to orchestrator for re-delegation to executor'
    ),
    # TR-04: Claude Code CLI -> orchestrator
    (
        re.compile(r'Claude Code CLI'),
        'the orchestrator'
    ),
    # TR-04 variant: Claude Code (standalone, not in "Claude Code CLI")
    (
        re.compile(r'Claude Code(?!\s+CLI)'),
        'the orchestrator'
    ),
    # TR-07: CLAUDE.md loading reference
    (
        re.compile(r'CLAUDE\.md\s*\(automatically loaded by Claude Code CLI at startup\)'),
        'CLAUDE.md (project instructions)'
    ),
    # TR-08: Hook script references in body
    (
        re.compile(r'^.*scripts/claude-hooks/.*$', re.MULTILINE),
        ''
    ),
]

# Additionally, append to system prompt:
COMPLETION_INSTRUCTION = """

## Completion Protocol

When you have finished your task, call the `signal_completion` tool with:
- status: "success", "failure", or "blockers"
- files_changed: list of file paths you created or modified
- summary: brief summary of what you did
- blockers: (optional) list of blocking issues if status is "blockers"
"""

The transformation function applies rules sequentially, then appends COMPLETION_INSTRUCTION to the system prompt.


Token Management (OPEN-003 Resolution)

AD-2: Token Counting Strategy

Decision: Use API-reported token counts from Converse API responses as the primary tracking mechanism.

Rationale:

Strategy Pros Cons
API-reported (usage.inputTokens, usage.outputTokens) Exact, no extra dependencies, reflects actual billing Only available after the call (cannot pre-check)
Local tokenizer (e.g., tiktoken) Can pre-check before sending Extra dependency (violates REQ-143), may not match Bedrock's tokenizer exactly
Heuristic (chars/4) Zero dependencies, can pre-check Inaccurate, especially for code and structured data

Chosen approach: API-reported with heuristic pre-check.

  • Tracking: After each converse() call, BedrockConversation reads usage.inputTokens from the response and accumulates a running total. This is the authoritative count.
  • Pre-check: Before sending a message, estimate the conversation size using a heuristic (character count / 3.5, which is conservative for English + code). The estimate must include a fixed overhead for tool definitions in the system turn — each tool spec contributes approximately 200-400 tokens depending on schema complexity. With 7 tools (6 real + signal_completion), budget ~2,500 tokens of tool overhead in addition to message content. If the estimate exceeds 80% of the context window, warn the user (REQ-054). If it exceeds 95%, trigger the overflow handling (REQ-055).
  • No new dependency: The heuristic avoids adding tiktoken or similar.

Context Window Overflow Handling (REQ-055)

When the pre-check estimate exceeds 95% of context_window_tokens (default 200,000):

  1. Halt the current agent invocation.
  2. Present the user with options:
    • (a) Abort (default, safe): Terminate this agent. Present partial output.
    • (b) Auto-summarize: Send the conversation to Bedrock with a summarization prompt, replace the message history with the summary, and continue.

Auto-summarize implementation:

  • Create a new single-turn conversation with the prompt: "Summarize the following conversation, preserving all file changes made, tool results, and decisions. This summary will replace the conversation history."
  • The summary response replaces all messages except the system prompt.
  • A [CONTEXT SUMMARIZED] marker is inserted so the agent knows history was compressed.
  • Cost of the summarization call is added to the tracker.

Concurrency, Ordering, and Consistency

The orchestrator is single-threaded and sequential. There is no concurrency within the MVP.

  • Agent ordering is determined by the delegation map in lib/constants.py (REQ-040).
  • Tool execution within a single response is sequential (even when multiple tool_use blocks are returned, they are executed one at a time in order). This avoids race conditions on file I/O.
  • Gate decisions are synchronous and blocking.
  • Transcript writes are append-only and flushed after each entry.

Phase 2 extension point: Parallel tool execution could be added for independent tools (e.g., two Read calls). The ToolResult list would be assembled before sending back to the API.


Failure Modes & Recovery

API Errors

Error Detection Recovery REQ
Throttling (429 / ThrottlingException) ClientError with code ThrottlingException Exponential backoff: 1s, 2s, 4s (max 3 retries, max 30s) REQ-090
Service error (5xx) ClientError with 5xx status 2 retries with exponential backoff, then present to user: retry/abort REQ-091
Invalid credentials ClientError at init or first call Clear error message referencing AWS credential config, exit code 3 REQ-093
Model not available ClientError with ModelNotReadyException Present error, suggest checking model access in Bedrock console --
Context window exceeded ValidationException from API Trigger overflow handling (REQ-055) REQ-055

Retry logic lives in BedrockConversation._call_with_retry():

def _call_with_retry(self, **kwargs) -> dict:
    max_retries = 3
    base_delay = 1.0
    for attempt in range(max_retries + 1):
        try:
            return self._bedrock_client.client.converse(**kwargs)
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code == "ThrottlingException" and attempt < max_retries:
                delay = min(base_delay * (2 ** attempt), 30.0)
                time.sleep(delay)
                continue
            elif e.response["ResponseMetadata"]["HTTPStatusCode"] >= 500 and attempt < 2:
                delay = min(base_delay * (2 ** attempt), 30.0)
                time.sleep(delay)
                continue
            raise

Tool Execution Errors

Error Recovery REQ
File not found (Read/Edit) Return ToolResult(is_error=True, content="File not found: ...") REQ-053
Path outside sandbox Return ToolResult(is_error=True, content="Path outside project root") REQ-110
Edit old_string not found Return error with file path and snippet of actual content REQ-096
Bash command fails Return stdout + stderr + exit code as tool result REQ-053
Bash blocked by blocklist Return error explaining which pattern matched REQ-028
Permission denied Return error with path and permission details REQ-053

Workflow-Level Errors

Error Recovery REQ
Agent exceeds max_tool_turns (200) Halt agent, present partial output, offer retry/skip/abort REQ-092
Agent produces no completion signal and hits end_turn Treat as implicit completion (see Completion Signal section) REQ-083
Cost ceiling reached Pause, display total cost, require explicit confirmation REQ-132
Transcript write failure Log warning, continue workflow REQ-126
Config file invalid Fall back to defaults, log warning REQ-094
Agent .md file unparseable Skip agent, log warning, continue REQ-095

Security Model

Authentication and Authorization

  • All AWS authentication is handled by BedrockClient using boto3's credential chain (environment variables, AWS profiles, IAM roles). Configured via .system2/config.yml auth block (REQ-161).
  • No additional authentication layer exists between CLI and orchestrator.

File Sandbox (REQ-110)

All file-operating tools (Read, Write, Edit, Grep, Glob) use a shared sandbox.py module:

def validate_path(requested_path: str, project_root: Path) -> Path:
    """Resolve and validate that a path is within project_root.

    Resolves symlinks, normalizes '..' components, and checks
    that the resolved absolute path starts with project_root.
    Raises SandboxViolationError if not.
    """
    resolved = Path(requested_path).resolve()
    root = project_root.resolve()
    if not str(resolved).startswith(str(root) + os.sep) and resolved != root:
        raise SandboxViolationError(
            f"Path {requested_path} resolves to {resolved}, "
            f"which is outside project root {root}"
        )
    return resolved

Bash Safety (REQ-027, REQ-028, REQ-115)

The Bash tool has three layers of protection:

  1. Blocklist check: Every command is checked against the combined blocklist (built-in + config). Matching is substring/regex.
  2. Safety mode enforcement:
    • strict (default): Blocklisted commands are rejected outright with an error. No override.
    • permissive: Blocklisted commands trigger a warning and require explicit confirmation.
  3. User confirmation: Unless --unsafe-bash is set, all non-blocked commands still require user confirmation.

Built-in blocklist (16 patterns per REQ-028):

DEFAULT_BASH_BLOCKLIST = [
    r"rm\s+-rf\s+/",
    r"rm\s+-rf\s+~",
    r"rm\s+-rf\s+\.",
    r"mkfs",
    r"dd\s+if=",
    r":\(\)\s*\{",
    r">\s*/dev/sd",
    r"chmod\s+-R\s+777",
    r"wget\s+.*\|\s*sh",
    r"curl\s+.*\|\s*sh",
    r"\beval\b",
    r"DROP\s+TABLE",
    r"DROP\s+DATABASE",
    r"TRUNCATE",
    r"\bdeploy\b",
    r"\bpublish\b",
    r"push\s+--force",
    r"git\s+push\s+-f",
]

Output Sanitization (REQ-112, REQ-113)

  • The orchestrator only executes tool_use blocks from the structured Converse API response. Free-text in assistant messages is displayed but never executed.

  • Prompt injection detection (REQ-113): After each agent response, scan text blocks for suspicious patterns:

    • "skip security" / "bypass security"
    • "modify CLAUDE.md" / "edit CLAUDE.md"
    • "escalate privileges" / "run as root" / "sudo"
    • "ignore previous instructions"

    If detected, flag the response and require user confirmation before continuing.

Secrets in Logs (REQ-111)

  • Tool arguments logged with truncation: file paths are logged, but file contents are never logged.
  • Bash commands are logged, but stdout/stderr from Bash is not included in logs (only in tool results sent to the API).
  • AWS credentials are never logged. The BedrockClient handles credentials internally.

Observability

Per-Agent Metrics (REQ-120)

After each agent delegation completes, display to stderr:

--- spec-coordinator complete ---
Model:        us.anthropic.claude-sonnet-4-20250514-v1:0
API calls:    7
Input tokens: 45,230
Output tokens: 12,891
Tool turns:   6
Est. cost:    $0.33
Cumulative:   $0.33

Workflow Summary (REQ-121)

At the end of the workflow (or user abort):

=== Workflow Summary ===
Agents invoked: 4 (spec-coordinator, requirements-engineer, design-architect, task-planner)
Total API calls: 28
Total tokens:    234,567 (in: 178,432, out: 56,135)
Total est. cost: $1.38
Wall-clock time: 12m 34s
Transcript:      .system2/runs/20260201-143022.jsonl

Tool Logging (REQ-122)

Each tool invocation is logged at INFO level:

[TOOL] Read file_path=spec/context.md offset=None limit=None -> success (12ms)
[TOOL] Edit file_path=lib/foo.py old_string="def bar..." -> success (3ms)
[TOOL] Bash command="pytest tests/ -x" -> success (4502ms)
[TOOL] Write file_path=/etc/passwd -> ERROR: sandbox violation (0ms)

Arguments are truncated to 100 characters. File contents are never included.

Gate Logging (REQ-123)

[GATE] Gate 1 (context) -> APPROVED at 2026-02-01T14:32:15Z
[GATE] Gate 2 (requirements) -> REJECTED at 2026-02-01T14:45:22Z feedback="Add error handling reqs"

Structured JSON Logging (REQ-124)

When log_format: json, each log entry is a JSON object on one line to the configured destination:

{"ts": "2026-02-01T14:32:15Z", "level": "INFO", "type": "tool_exec", "tool": "Read", "args": {"file_path": "spec/context.md"}, "success": true, "duration_ms": 12}

Rollout Plan

Phase 1 (MVP): Single-Agent Invocation

Scope: Agent parser + tool implementations + BedrockConversation + single-agent tool-use loop via Orchestrator.invoke_agent().

Deliverables:

  • system2/__init__.py, system2/__main__.py (basic CLI, single-agent mode)
  • lib/agent_parser.py with all transformation rules
  • lib/bedrock_conversation.py with Converse API integration
  • lib/tools/ (all 6 tools + sandbox + signal_completion)
  • lib/config.py, lib/cost_tracker.py, lib/transcript.py, lib/constants.py
  • Unit tests for each tool, agent parser, and BedrockConversation

Verification: Parse all 13 agents. Invoke one agent (e.g., spec-coordinator) against live Bedrock with a simple task. Confirm tool-use loop works end-to-end.

Backout: All new files. Delete system2/ and new files in lib/. No existing files modified.

Phase 2: Full Delegation Workflow

Scope: lib/orchestrator.py, lib/delegation.py, quality gates (0-4), delegation map sequencing, --auto-approve flag.

Deliverables:

  • lib/orchestrator.py with full workflow loop
  • lib/delegation.py with contract construction and agent sequencing
  • Gate prompt UI on stdin/stdout
  • Integration tests with mocked Bedrock

Verification: Run full workflow (spec-coordinator through task-planner) against live Bedrock.

Phase 3: Post-Execution Workflow (Deferred)

Scope: Post-execution agents (test-engineer, security-sentinel, eval-engineer, docs-release, code-reviewer), trigger evaluation, blocker handling, boomerang cycles, Gate 5 aggregation.

Extension points in Phase 1/2 code:

  • DelegationEngine accepts a post_execution_plan parameter (unused until Phase 3).
  • CompletionSignal.blockers field is parsed but not acted upon until Phase 3.
  • Orchestrator has a _run_post_execution() method stub that raises NotImplementedError.

Alternatives Considered

Alt-1: Modify BedrockClient to Add converse() Method

Approach: Add a converse() method to BedrockClient that calls self.client.converse().

Pros:

  • Clean API: all Bedrock calls go through BedrockClient methods.
  • No coupling to internal client attribute.

Cons:

  • Violates REQ-063 and OQ-2 ("do not modify BedrockClient").
  • BedrockClient is used by other code; adding methods risks unintended side effects.
  • The invoke_model and converse APIs have fundamentally different signatures; merging them into one class conflates responsibilities.

Decision: Rejected per explicit constraint.

Alt-2: Use invoke_model with Messages API Tool Use

Approach: Instead of Converse API, use invoke_model with the Anthropic Messages API format that supports tools in the request body.

Pros:

  • Could potentially reuse BedrockClient.invoke_model() with modifications to the body construction.
  • Anthropic Messages API is well-documented.

Cons:

  • BedrockClient.invoke_model() hardcodes the body format (single messages array, no tools key). We would need to modify it (violating REQ-063) or bypass it entirely.
  • Bedrock's Converse API is the AWS-recommended path for tool use and is provider-agnostic.
  • The Messages API format through invoke_model requires manual JSON body construction and response parsing with Anthropic-specific schemas. Converse API provides native boto3 request/response objects.

Decision: Rejected. Converse API is the recommended path per OQ-6 resolution.

Alt-3: Fork BedrockClient into BedrockConverseClient

Approach: Copy BedrockClient and create a new BedrockConverseClient that initializes its own boto3 client and calls converse().

Pros:

  • Complete independence from BedrockClient. No coupling.
  • Clean Converse API design from scratch.

Cons:

  • Duplicates all authentication and session logic (violates DRY).
  • Two boto3 clients initialized for the same service. Wasteful and confusing.
  • Violates the spirit of REQ-062 (reuse BedrockClient for AWS interactions).

Decision: Rejected. The access-internal-client approach is simpler and avoids duplication.


Open Design Questions

ID Question Recommendation Impact if Deferred
DQ-1 Should BedrockConversation cache the model's actual context window from the Bedrock API (GetFoundationModel) or use a configured constant? Use configured constant (200K) for MVP. Phase 2 could query the API. Low -- constant is accurate for Claude models on Bedrock.
DQ-2 How should the orchestrator determine which agents to skip (REQ-047)? For MVP: always run the full delegation map in order; user can skip via gate rejection. Phase 2: add heuristics (e.g., skip postmortem-scribe unless incident context detected). Low -- user has override at every gate.
DQ-3 Should the auto-summarize (REQ-055) use the same model or a cheaper model? Same model for accuracy. The summarization prompt is small; cost is bounded. Low -- only triggered in edge cases.
DQ-4 What is the maximum Bash command output size before truncation? 100KB. Larger outputs are truncated with a "[TRUNCATED]" marker and the full output saved to a temp file. Medium -- large outputs could fill context.

Architecture Decisions Summary

ID Decision Key Rationale Requirements
AD-1 Access BedrockClient.client for Converse API calls Cannot modify BedrockClient; invoke_model cannot call converse endpoint REQ-060, REQ-062, REQ-063, REQ-064
AD-2 API-reported tokens + heuristic pre-check No extra dependencies; accurate billing alignment REQ-054, REQ-055, REQ-143
AD-3 signal_completion pseudo-tool for completion detection Explicit, structured, parseable; fallback for agents that don't call it REQ-014, REQ-083
AD-4 Sequential tool execution (no parallelism) Avoids file I/O race conditions; simpler implementation REQ-050, REQ-052
AD-5 Delegation map as code constant, not parsed from CLAUDE.md Decouples orchestrator behavior from documentation changes REQ-040
AD-6 Prompt transforms applied at parse time, not at send time Single transformation pass; consistent system prompt throughout agent session REQ-013, REQ-015
AD-7 Policy injection for gates and bash (Protocol classes) Enables headless/CI use without stdin/stdout dependency REQ-073
AD-8 Transcript as JSONL (append-only, best-effort) Simple, crash-recoverable, no external dependencies REQ-125, REQ-126
AD-9 Agent loaded on-demand, not all at startup Memory efficiency; only parse agents that will be invoked REQ-103

Verification Strategy

Requirements to Design Traceability

Requirement Group Design Component Test Strategy
REQ-001 to REQ-005 (CLI) system2/__main__.py Unit: arg parsing, TTY detection. Manual: interactive session.
REQ-010 to REQ-016 (Agent parsing) lib/agent_parser.py Unit: parse all 13 agents, assert fields. Test transforms against known patterns.
REQ-020 to REQ-030 (Tools) lib/tools/ Unit: each tool with valid/invalid inputs. Integration: mock LLM tool-use loop.
REQ-040 to REQ-047 (Delegation) lib/delegation.py, lib/constants.py Integration: mock Bedrock, assert agent order and contract structure.
REQ-050 to REQ-055 (Conversation) lib/bedrock_conversation.py Unit: message formatting, token tracking. Integration: multi-turn with mock.
REQ-060 to REQ-065 (Bedrock) lib/bedrock_conversation.py Unit: verify converse() call format. Integration: live Bedrock smoke test.
REQ-070 to REQ-073 (API) lib/orchestrator.py Unit: instantiate, invoke with mock, verify no stdin dependency.
REQ-080 to REQ-085 (Data contracts) lib/constants.py, lib/config.py Unit: schema validation, default values.
REQ-090 to REQ-096 (Error handling) lib/bedrock_conversation.py, tools Unit: mock errors, assert retry/backoff. Assert error messages.
REQ-110 to REQ-115 (Security) lib/tools/sandbox.py, lib/tools/bash_tool.py Unit: path traversal attempts, blocklist matching, injection detection.
REQ-120 to REQ-126 (Observability) lib/cost_tracker.py, lib/transcript.py Unit: cost accumulation, JSONL format. Integration: verify log output.
REQ-130 to REQ-133 (Cost) lib/cost_tracker.py, lib/config.py Unit: threshold checks with mock costs.
REQ-140 to REQ-153 (Config/compat) lib/config.py, all modules Unit: missing config fallback. Code review: no existing file modifications.
REQ-N01 to REQ-N06 (Negative) Code review Grep: no GUI, no hook execution, no streaming, no Roo, no subagent spawning.

Test Pyramid

  • Unit tests (Phase 1): Each tool, agent parser transforms, config loading, cost tracker, sandbox validation, Converse API message formatting.
  • Integration tests (Phase 1-2): Tool-use loop with mocked converse() returning scripted responses. Full workflow with mocked Bedrock.
  • Smoke tests (Phase 1): Single agent invocation against live Bedrock (manual, not in CI).
  • End-to-end tests (Phase 2): Full delegation workflow against live Bedrock (manual acceptance test).

Implementation Notes

Edit Tool — Phase 2 Extension Point

The MVP Edit tool implements exact string matching only (REQ-023). As noted in review feedback, LLMs frequently struggle with exact whitespace/indentation matching, which can cause "apply failed" loops. REQ-023a defines a SHOULD-priority unified diff fallback. For MVP, this is deferred but the BaseTool interface is designed to allow EditTool to accept an optional diff parameter in Phase 2 without breaking changes. Implementation should track edit failure rates to inform the Phase 2 prioritization decision.

Entry Point Permissions

system2/__main__.py is invoked via python3 -m system2, which does not require the file to be executable (chmod +x). Python's -m flag treats the package as a module, bypassing filesystem execute permissions. No chmod is needed. If a console script entry point is added in pyproject.toml in Phase 2 (e.g., [project.scripts] system2 = "system2.__main__:main"), pip/uv handles making it executable during installation.

Requirements: Standalone Bedrock Orchestrator

Traceability source: spec/context.md (Standalone Bedrock Orchestrator for System2) Resolved open questions applied: OQ-1 through OQ-6 (see Constraints below) EARS syntax reference: Ubiquitous (shall), Event-driven (When), State-driven (While), Unwanted (If), Optional (Where)


Resolved Open Question Constraints

These resolved decisions from spec/context.md are treated as binding constraints throughout:

  • OQ-1: Post-execution workflow deferred to Phase 3. MVP covers Gates 0-4 + linear delegation.
  • OQ-2: Create BedrockConversation wrapper using BedrockClient internally. Do not extend BedrockClient in-place.
  • OQ-3: Strip/adapt Claude Code references in agent prompts at parse time with a documented transformation layer.
  • OQ-4: Non-interactive/batch mode deferred to Phase 2. MVP is interactive only.
  • OQ-5: Cost ceiling $5.00 per workflow run, warning at $2.00. Configurable in .system2/config.yml.
  • OQ-6: Use Bedrock Converse API. Research spike needed before design phase.

Functional Requirements

CLI Entry Point (G1)

ID EARS Statement Priority Traces To
REQ-001 When a user runs python3 -m system2 "<task description>", the system shall start an interactive session that accepts the task description as the initial scope input. Must G1, AC-1
REQ-002 The system shall provide a system2 package with a __main__.py entry point that can be invoked via python3 -m system2. Must G1, AC-1
REQ-003 When the CLI is invoked without a task description argument and stdin is a TTY, the system shall prompt the user interactively for a task description. Should G1
REQ-003a When the CLI is invoked without a task description argument and stdin is not a TTY (non-interactive environment), the system shall exit with a non-zero exit code and a clear error message indicating that a task description is required in non-interactive mode. Must G1
REQ-004 The system shall accept a --unsafe-bash flag that disables interactive confirmation for Bash tool invocations. Must G1, AC-9
REQ-005 [Deferred: Phase 2] Where --auto-approve flag is provided, the system shall automatically approve all quality gates without user interaction. Should G1, OQ-4

Agent Parsing (G2)

ID EARS Statement Priority Traces To
REQ-010 The system shall parse all .claude/agents/*.md files, extracting YAML frontmatter fields (name, description, tools, hooks) and the Markdown body as the system prompt. Must G2, AC-2
REQ-011 The system shall successfully parse all 13 existing agent definitions without error. Must G2, AC-2
REQ-012 When an agent file contains unknown YAML frontmatter keys (e.g., hooks), the system shall ignore those keys without error, preserving forward compatibility with Claude Code CLI. Must G2
REQ-013 The system shall apply a documented prompt transformation layer at parse time that strips or adapts Claude Code-specific references in agent system prompts, including: hook references, attempt_completion references, subagent spawning instructions, and Claude Code tool signatures. Must G2, OQ-3
REQ-014 When the prompt transformation layer encounters an attempt_completion reference, the system shall map it to a JSON completion signal that the orchestrator recognizes as the agent signaling task completion. Must G2, OQ-3
REQ-015 The system shall not modify the .claude/agents/*.md files on disk. All transformations are applied in memory at parse time. Must G2
REQ-016 The system shall extract the tools list from each agent's frontmatter and use it to determine which tools are available for that agent's invocation. Must G2, AC-3

Tool Implementations (G3)

ID EARS Statement Priority Traces To
REQ-020 The system shall implement local execution for the following 6 tools: Read, Write, Edit, Grep, Glob, Bash. Must G3, AC-3
REQ-021 The Read tool shall accept a file path and return the file contents. It shall support optional offset and limit parameters for partial reads. Must G3, AC-3
REQ-022 The Write tool shall accept a file path and content, and write the content to the specified file, creating parent directories if needed. Must G3, AC-3
REQ-023 The Edit tool shall accept a file path, old_string, and new_string, and perform exact string replacement. If old_string is not found or is not unique (and replace_all is false), the tool shall return a clear error message. Must G3, AC-3
REQ-023a The Edit tool should support a unified diff mode as a fallback when exact literal matching fails, allowing agents to apply patches via standard unified diff format. Should G3, AC-3
REQ-024 The Grep tool shall accept a regex pattern and optional path, glob filter, and output mode, and return matching results. Must G3, AC-3
REQ-025 The Glob tool shall accept a glob pattern and optional path, and return matching file paths sorted alphabetically by path for deterministic behavior across environments. Must G3, AC-3
REQ-026 The Bash tool shall accept a command string and execute it in a subprocess, returning stdout, stderr, and exit code. Must G3, AC-3
REQ-027 While the --unsafe-bash flag is not set, the Bash tool shall prompt the user for confirmation before executing any command. Must G3, AC-9
REQ-028 The Bash tool shall maintain a blocklist of destructive command patterns and shall warn the user when a command matches a blocklist pattern, even when --unsafe-bash is set. The initial blocklist shall include: rm -rf /, rm -rf ~, rm -rf ., mkfs, dd if=, :(){, > /dev/sd, chmod -R 777, `wget ... sh/curl ... sh(piped execution),eval, DROP TABLE, DROP DATABASE, TRUNCATE, deploy, publish, push --force, git push -f`.
REQ-028a The Bash tool blocklist shall be configurable via .system2/config.yml under providers.bedrock.orchestrator.bash_blocklist, allowing users to add or override patterns. When configured, the user-provided list shall be merged with the built-in default list. Must G3
REQ-029 When an agent's frontmatter tools list does not include a given tool, the system shall not make that tool available to the agent during invocation. Must G3, AC-3
REQ-030 Each tool shall return results in a structured format compatible with the Bedrock Converse API tool_result content block. Must G3, G6

Delegation Workflow (G4)

ID EARS Statement Priority Traces To
REQ-040 The system shall implement the delegation map ordering as a configuration constant within the orchestrator code (e.g., a Python list/dict in a constants module): repo-governor, spec-coordinator, requirements-engineer, design-architect, task-planner, executor, test-engineer, security-sentinel, eval-engineer, docs-release, code-reviewer, postmortem-scribe, mcp-toolsmith. The delegation map shall not be parsed from CLAUDE.md at runtime; CLAUDE.md remains the human-readable documentation of the map, but the orchestrator's behavior is not coupled to it. Must G4, AC-5
REQ-041 When delegating to an agent, the system shall construct a delegation contract containing: objective, inputs, outputs, constraints, and completion summary requirements, as defined in CLAUDE.md. Must G4
REQ-042 The system shall implement quality gates (Gate 0 through Gate 4 for MVP) that pause execution and prompt the user for approval, rejection, or feedback before proceeding to the next phase. Must G4, AC-4
REQ-043 When a user rejects a gate artifact, the system shall accept textual feedback and re-invoke the responsible agent with the rejection feedback appended as additional context, preserving the prior conversation history for that agent. Must G4, AC-4
REQ-043a In MVP (Phases 1-2), user rejection at a quality gate shall be the sole mechanism for iteration. Automated boomerang cycles (agent-to-agent iteration without user involvement) remain deferred to Phase 3. Must G4, OQ-1
REQ-044 The system shall not delegate to a downstream agent until the upstream gate is approved. Must G4
REQ-045 [Deferred: Phase 3] The system shall implement the post-execution workflow including trigger evaluation for test-engineer, security-sentinel, eval-engineer, docs-release, and code-reviewer with blocker handling and boomerang cycles (max 3 iterations per agent). Should G4, OQ-1
REQ-046 [Deferred: Phase 3] The system shall implement Gate 5 summary aggregation that reads spec/post-execution-log.md and presents a combined summary for user approval. Should G4, OQ-1
REQ-047 The system shall skip agents in the delegation map that are not relevant to the current workflow phase, as determined by the orchestrator's assessment of the task scope. Should G4

Multi-Turn Conversation / Tool-Use Loop (G5)

ID EARS Statement Priority Traces To
REQ-050 The system shall implement a tool-use loop for each agent invocation that cycles through: (1) send messages to Bedrock, (2) parse response for tool_use blocks, (3) execute tools locally, (4) append tool_result to conversation history, (5) repeat until the agent produces a response without tool calls or signals completion. Must G5, AC-6
REQ-051 The system shall maintain per-agent conversation history including system prompt, user messages, assistant messages, and tool_use/tool_result pairs, passing the full history on each API call within that agent's session. Must G5, AC-6
REQ-052 When an agent response contains multiple tool_use blocks, the system shall execute all requested tools and return all tool_result blocks in the subsequent message. Must G5
REQ-053 When a tool execution fails, the system shall return a tool_result with is_error: true and a descriptive error message, allowing the agent to retry or adapt. Must G5
REQ-054 The system shall track token count per agent conversation and shall warn the user when usage reaches 80% of the model's context window limit. Should G5
REQ-055 If the token count for an agent conversation exceeds the model's context window limit, the system shall halt the agent invocation and offer the user a choice between: (a) halt and abort the current agent invocation (default/safe option), or (b) auto-summarize the conversation using a recursive summary prompt and continue with the summarized context. Must G5

BedrockClient Integration (G6)

ID EARS Statement Priority Traces To
REQ-060 The system shall create a BedrockConversation wrapper class that uses BedrockClient from lib/bedrock_client.py internally for all LLM calls. Must G6, AC-7, OQ-2
REQ-061 The BedrockConversation class shall manage the Bedrock Converse API format, including multi-turn message history, tool definitions, and tool_use/tool_result content blocks. Must G6, OQ-6
REQ-062 There shall be zero direct boto3 calls outside of lib/bedrock_client.py. All AWS API interactions shall go through BedrockClient. Must G6, AC-7
REQ-063 The BedrockConversation class shall not modify the existing BedrockClient class. It shall compose over it or use its boto3 session/client internally. Must G6, OQ-2
REQ-064 The system shall use the Bedrock Converse API (bedrock-runtime:converse) for multi-turn conversations with tool use. Must G6, OQ-6
REQ-065 A research spike shall be completed before the design phase to validate Converse API compatibility with the tool-use loop and existing BedrockClient infrastructure. Must G6, OQ-6

Programmatic API (G7)

ID EARS Statement Priority Traces To
REQ-070 The system shall provide a programmatic API accessible via from lib.orchestrator import Orchestrator. Must G7
REQ-071 The Orchestrator class shall accept configuration (project root, config path, safety settings) at initialization time. Must G7
REQ-072 The Orchestrator class shall expose methods to: start a workflow, invoke a single agent, and query workflow status. Must G7
REQ-073 The programmatic API shall not depend on stdin/stdout for core operation. Gate approvals and Bash confirmations shall be injectable as callback functions or policy objects. Must G7

Data & Interface Contracts

ID EARS Statement Priority Traces To
REQ-080 The system shall define tool input/output schemas compatible with the Bedrock Converse API toolSpec and toolResult formats. Must G3, G6
REQ-081 The agent parser shall produce a structured AgentDefinition object containing: name (str), description (str), tools (list of str), system_prompt (str, post-transformation). Must G2
REQ-082 The delegation contract shall be serialized as a structured user message containing labeled sections: Objective, Inputs, Outputs, Constraints, Completion Summary Requirements. Must G4
REQ-083 The agent completion signal shall be a JSON object containing: status (success/failure/blockers), files_changed (list), summary (str), and optional blockers (list). Must G4, G5
REQ-084 Configuration for the orchestrator shall be stored under the providers.bedrock.orchestrator namespace in .system2/config.yml. Existing configuration keys shall not be modified. Must G1
REQ-085 The orchestrator configuration schema shall include: cost_ceiling_usd (float, default 5.00), cost_warning_usd (float, default 2.00), log_format (enum: text/json, default text), log_destination (str, default stderr), safety_mode (enum: strict/permissive, default strict), bash_blocklist (list of str, optional, merged with built-in defaults). Must OQ-5

Error Handling & Recovery

ID EARS Statement Priority Traces To
REQ-090 If the Bedrock API returns a throttling error (HTTP 429 or ThrottlingException), the system shall retry with exponential backoff (initial 1s, max 30s, max 3 retries). Must G6
REQ-091 If the Bedrock API returns a service error (5xx), the system shall retry up to 2 times with exponential backoff before presenting the error to the user with options to retry or abort. Must G6
REQ-092 If an agent fails to produce a valid completion signal after exhausting the token limit, the system shall present the partial output to the user and offer options: retry the agent, skip and continue, or abort the workflow. Must G5
REQ-093 If AWS credentials are invalid or expired at startup, the system shall report a clear error message referencing AWS credential configuration and exit with a non-zero exit code. Must G6
REQ-094 If .system2/config.yml is missing or contains invalid YAML, the system shall fall back to default configuration values and log a warning. Must G1
REQ-095 If an agent definition file in .claude/agents/ cannot be parsed (malformed YAML frontmatter or missing required fields), the system shall skip that agent, log a warning, and continue with the remaining agents. Should G2
REQ-096 When the Edit tool fails because old_string is not found in the file, the system shall return a clear error message including the file path and a snippet of the expected content, enabling the agent to retry. Must G3

Performance & Scalability

ID EARS Statement Priority Traces To
REQ-100 The system shall parse all 13 agent definition files in under 1 second on standard hardware. Must G2
REQ-101 Tool execution latency for Read, Write, Edit, Grep, and Glob shall not exceed 5 seconds for typical operations on repositories under 10,000 files. Should G3
REQ-102 The system shall support agent conversations of up to 200,000 tokens (the model context window) without memory errors. Must G5
REQ-103 The system shall not load all agent definitions into memory simultaneously; agents shall be loaded on-demand when delegated to. Should G2

Security & Privacy

ID EARS Statement Priority Traces To
REQ-110 The Read, Write, Edit, Grep, and Glob tools shall resolve all file paths to absolute paths and validate that they are within the project root directory. If a path resolves outside the project root, the tool shall reject the operation with an error. Must AC-8
REQ-111 The system shall not log AWS credentials, session tokens, or file contents that may contain secrets to any log destination. Must Safety
REQ-112 The system shall treat all agent outputs as untrusted input. The orchestrator shall not execute instructions embedded in agent text responses that are not explicitly structured as tool_use blocks. Must Safety
REQ-113 If an agent output contains suspected prompt injection patterns (instructions to skip security checks, modify CLAUDE.md, or escalate privileges), the system shall flag the output and require explicit user review before proceeding. Should Safety
REQ-114 The system shall not make any network calls other than to AWS Bedrock via BedrockClient. No telemetry, analytics, or phone-home calls. Must Safety
REQ-115 While safety mode is set to strict (default), the Bash tool shall block commands matching destructive patterns without allowing override. While safety mode is set to permissive, the Bash tool shall warn but allow execution after user confirmation. Must Safety, AC-9

Observability

ID EARS Statement Priority Traces To
REQ-120 After each agent delegation completes, the system shall display: agent name, model used, total input tokens, total output tokens, estimated cost (USD), and number of tool-use turns. Must AC-10
REQ-121 At the end of a workflow (or at the final gate), the system shall display a summary including: total agents invoked, total LLM calls, total tokens, total estimated cost, and wall-clock time. Must AC-10
REQ-122 Each tool invocation shall be logged with: tool name, truncated arguments (no file contents), success/failure status, and duration. Must AC-10
REQ-123 Each gate decision (approve/reject) shall be logged with a timestamp. Must G4
REQ-124 The system shall default to human-readable logs on stderr. Where log_format is set to json in configuration, the system shall write structured JSON logs to the configured destination. Should G1
REQ-125 The system shall stream/append the full conversation transcript (prompts, responses, tool calls, and tool results) to a local JSONL file at .system2/runs/<timestamp>.jsonl as each message occurs. This transcript is independent of the Phase 3 post-execution log (REQ-046) and serves crash recovery and audit purposes. Must G5, Safety
REQ-126 If the transcript file cannot be written (e.g., disk full, permission error), the system shall log a warning but shall not halt the workflow. Must G5

Cost Tracking

ID EARS Statement Priority Traces To
REQ-130 The system shall maintain a cumulative cost estimate across all agent invocations within a workflow run. Must AC-10, OQ-5
REQ-131 When the cumulative cost estimate reaches the configured cost_warning_usd threshold (default $2.00), the system shall display a warning to the user. Must OQ-5
REQ-132 When the cumulative cost estimate reaches the configured cost_ceiling_usd threshold (default $5.00), the system shall pause execution and require explicit user confirmation to continue. Must OQ-5
REQ-133 The cost ceiling and warning thresholds shall be configurable in .system2/config.yml under providers.bedrock.orchestrator.cost_ceiling_usd and providers.bedrock.orchestrator.cost_warning_usd. Must OQ-5

Configuration

ID EARS Statement Priority Traces To
REQ-140 The system shall read configuration from .system2/config.yml at startup. Must G1
REQ-141 New orchestrator-specific configuration keys shall be placed under the providers.bedrock.orchestrator namespace. Existing configuration keys shall remain unchanged and functional. Must G1
REQ-142 The system shall require Python 3.10 or higher. If invoked on a lower Python version, it shall exit with a clear error message. Must G1
REQ-143 The system shall depend only on: boto3, pyyaml, and Python standard library modules (including argparse for CLI). No additional third-party dependencies. Must G1

Backward Compatibility & Migration

ID EARS Statement Priority Traces To
REQ-150 The system shall not modify any existing files: .claude/agents/*.md, CLAUDE.md, lib/bedrock_client.py, or .system2/config.yml (aside from optional new keys). Must G2
REQ-151 Agent definition files (.claude/agents/*.md) shall remain fully compatible with Claude Code CLI after the orchestrator is installed. Must G2
REQ-152 The orchestrator shall be purely additive: new files in lib/ and a system2/ package. No changes to existing source files. Must G1
REQ-153 Where .system2/config.yml does not contain orchestrator-specific keys, the system shall use default values for all orchestrator settings. Must G1

Compliance / Policy Constraints

ID EARS Statement Priority Traces To
REQ-160 All LLM traffic shall be routed through AWS Bedrock. The system shall make no direct calls to the Anthropic API or any other LLM provider. Must Safety
REQ-161 The system shall support AWS IAM role assumption and AWS profile-based authentication as configured in .system2/config.yml. Must G6
REQ-162 The system shall work within AWS VPC environments with no requirement for internet access other than the Bedrock endpoint. Must Safety

Negative Requirements

ID EARS Statement Priority Traces To
REQ-N01 The system shall not implement a GUI or web interface. Must Non-goals
REQ-N02 The system shall not execute Claude Code hook scripts from scripts/claude-hooks/. Must Non-goals
REQ-N03 The system shall not support streaming responses. Must Non-goals
REQ-N04 The system shall not parse Roo Code mode files (roo/*.yml). Must Non-goals
REQ-N05 The system shall not allow subagents to spawn other subagents. All delegation is managed centrally by the orchestrator. Must Non-goals
REQ-N06 The system shall not support non-Bedrock LLM providers. Must Non-goals

Open Requirements

ID Description Resolution Path
OPEN-001 Exact Converse API request/response schema and tool definition format need validation via research spike (OQ-6). Research spike before design phase.
OPEN-002 The full list of prompt transformation rules (REQ-013) needs to be enumerated after auditing all 13 agent prompt files. Design phase: audit agent prompts and document each transformation.
OPEN-003 Token counting method for context window tracking (REQ-054) -- whether to use API-reported usage, a local tokenizer, or heuristic estimation. Design decision.

Validation Plan

Requirement(s) Validation Method Phase
REQ-001, REQ-002 Manual end-to-end test: run python3 -m system2 "test task" and verify interactive session starts. Phase 1
REQ-010, REQ-011, REQ-012, REQ-015, REQ-016 Unit test: parse each of the 13 agent files, assert name, description, tools, system_prompt are non-empty. Assert unknown keys are ignored. Assert no files modified on disk. Phase 1
REQ-013, REQ-014 Unit test: parse agent files with known Claude Code references, assert they are transformed. Assert attempt_completion is mapped to JSON completion signal. Phase 1
REQ-020 through REQ-026 Unit test per tool: invoke with valid inputs and assert correct output. Integration test with mock LLM returning tool_use blocks. Phase 1
REQ-027, REQ-115 Unit test: mock stdin, invoke Bash without --unsafe-bash, assert prompt appears. With --unsafe-bash, assert no prompt. Test blocklist pattern matching. Phase 1
REQ-040, REQ-041, REQ-044, REQ-047 Integration test: mock LLM, run workflow, assert agent invocation follows delegation map order and delegation contracts are well-formed. Phase 2
REQ-042, REQ-043 Manual test: run workflow, verify gate prompts at Gates 0-4. Reject a gate and verify feedback is re-delegated. Phase 2
REQ-050, REQ-051, REQ-052, REQ-053 Integration test: mock LLM returning multi-turn tool_use sequences. Assert conversation history is maintained. Assert error tool_results are returned for failed tools. Phase 1
REQ-054, REQ-055 Unit test: simulate conversation approaching and exceeding token limit, assert warning and halt behaviors. Phase 1
REQ-060, REQ-061, REQ-062, REQ-063, REQ-064 Code review: grep for boto3 outside bedrock_client.py. Unit test: verify BedrockConversation delegates to BedrockClient and does not instantiate boto3 directly. Phase 1
REQ-070, REQ-071, REQ-072, REQ-073 Unit test: import Orchestrator, instantiate with config, invoke single-agent method with mock LLM. Verify no stdin/stdout dependency. Phase 1
REQ-080, REQ-083 Unit test: validate tool schemas against Converse API spec. Validate completion signal JSON schema. Phase 1
REQ-090, REQ-091 Unit test: mock Bedrock returning 429 and 5xx, assert retry with backoff. Assert max retries respected. Phase 1
REQ-093 Unit test: mock invalid credentials, assert clear error message and non-zero exit. Phase 1
REQ-110 Unit test: attempt Read/Write/Edit/Grep/Glob with path outside project root, assert rejection. Phase 1
REQ-111 Code review: audit all log statements for credential or secret leakage. Phase 1
REQ-112, REQ-113 Integration test: mock agent returning text with embedded instructions, assert orchestrator does not execute them. Phase 2
REQ-120, REQ-121, REQ-122, REQ-123 Manual test + unit test: verify per-agent cost display, workflow summary, tool logging, and gate logging. Phase 1/2
REQ-130, REQ-131, REQ-132, REQ-133 Unit test: simulate cost accumulation, assert warning at $2.00 and pause at $5.00. Verify configurable thresholds. Phase 1
REQ-142 Unit test: mock sys.version_info below 3.10, assert error message. Phase 1
REQ-143 Code review: audit imports for disallowed third-party dependencies. Phase 1
REQ-150, REQ-151, REQ-152, REQ-153 Code review: verify no existing files are modified. Integration test: run Claude Code agent parse after orchestrator install. Phase 1
REQ-003a Unit test: invoke CLI without task description with stdin mocked as non-TTY, assert non-zero exit code and error message. Phase 1
REQ-023a Unit test: invoke Edit with an old_string that fails exact match, provide a unified diff input, assert patch is applied correctly. Phase 1
REQ-028a Unit test: configure custom blocklist patterns in .system2/config.yml, assert they are merged with defaults. Test a command matching a custom pattern triggers warning. Phase 1
REQ-043a Integration test: reject a gate, verify re-invocation preserves conversation history and appends feedback. Verify no automated boomerang occurs. Phase 2
REQ-125, REQ-126 Unit test: run a short agent session, assert .system2/runs/<timestamp>.jsonl is created and contains prompts, responses, tool calls, and tool results as JSONL entries. Mock disk-full scenario for REQ-126, assert warning logged but workflow continues. Phase 1
REQ-N01 through REQ-N06 Code review: verify absence of GUI, hook execution, streaming, Roo parsing, subagent spawning, non-Bedrock providers. All phases

Traceability Matrix

Goals to Requirements

Goal Requirements
G1 (CLI entry point) REQ-001, REQ-002, REQ-003, REQ-003a, REQ-004, REQ-005, REQ-140, REQ-141, REQ-142, REQ-143, REQ-152
G2 (Agent parsing) REQ-010, REQ-011, REQ-012, REQ-013, REQ-014, REQ-015, REQ-016, REQ-081, REQ-095, REQ-150, REQ-151
G3 (Tool implementations) REQ-020 through REQ-030 (including REQ-023a, REQ-028a), REQ-080, REQ-096, REQ-101
G4 (Delegation workflow) REQ-040 through REQ-047 (including REQ-043a), REQ-082, REQ-083, REQ-123
G5 (Multi-turn conversation) REQ-050 through REQ-055, REQ-092, REQ-102, REQ-125, REQ-126
G6 (BedrockClient integration) REQ-060 through REQ-065, REQ-090, REQ-091, REQ-093, REQ-161
G7 (Programmatic API) REQ-070 through REQ-073

Acceptance Criteria to Requirements

AC Requirements
AC-1 (CLI starts interactive session) REQ-001, REQ-002
AC-2 (Parse 13 agent definitions) REQ-010, REQ-011, REQ-012, REQ-016, REQ-081
AC-3 (6 tools produce correct results) REQ-020 through REQ-030, REQ-080
AC-4 (Gates pause for user input) REQ-042, REQ-043
AC-5 (Delegation map order) REQ-040, REQ-044
AC-6 (Conversation history maintained) REQ-050, REQ-051
AC-7 (All LLM calls through BedrockClient) REQ-060, REQ-062, REQ-063
AC-8 (File operations sandboxed) REQ-110
AC-9 (Bash confirmation) REQ-004, REQ-027, REQ-115
AC-10 (Cost tracking displayed) REQ-120, REQ-121, REQ-130

Requirements to Design Sections (to be filled at Gate 3)

Requirement Design Section Task IDs
REQ-001 through REQ-005 (including REQ-003a) CLI Module TBD
REQ-010 through REQ-016 Agent Parser TBD
REQ-020 through REQ-030 (including REQ-023a, REQ-028a) Tool Layer TBD
REQ-040 through REQ-047 (including REQ-043a) Delegation Engine TBD
REQ-050 through REQ-055 Conversation Manager TBD
REQ-060 through REQ-065 BedrockConversation TBD
REQ-070 through REQ-073 Orchestrator API TBD
REQ-080 through REQ-085 Data Contracts TBD
REQ-090 through REQ-096 Error Handling TBD
REQ-110 through REQ-115 Security Layer TBD
REQ-120 through REQ-126 Observability / Transcript TBD
REQ-130 through REQ-133 Cost Tracking TBD

Tasks: Standalone Bedrock Orchestrator -- Phase 1 (MVP)

Upstream artifacts: spec/context.md, spec/requirements.md, spec/design.md Phase scope: Agent parser + tool implementations + BedrockConversation + single-agent invocation with tool-use loop. No full delegation workflow (Phase 2) or post-execution workflow (Phase 3).


Task Graph Overview

Phase 1 delivers 19 tasks across 7 batches. The dependency graph fans out after the foundational batch (Batch 1), allowing Batches 2-4 to execute in parallel, then converges for integration (Batches 5-6) and the CLI entry point (Batch 7).

Batch 1: Foundation (TASK-001, TASK-002, TASK-003)
    |           |             |
    v           v             v
Batch 2:    Batch 3:      Batch 4:
Tools       Agent Parser  BedrockConversation
(TASK-004   (TASK-010,    (TASK-012, TASK-013)
 thru        TASK-011)
 TASK-009)
    |           |             |
    +-----+-----+-------------+
          |
          v
Batch 5: Integration Layer
(TASK-014, TASK-015, TASK-016)
          |
          v
Batch 6: Orchestrator + CLI
(TASK-017, TASK-018)
          |
          v
Batch 7: Integration Test
(TASK-019)

Tasks

Batch 1: Foundation


TASK-001: Data classes and constants module

Goal: Create the shared data model (dataclasses, enums, type aliases) and the constants module (delegation map, pricing tables, default blocklist, completion signal schema).

Files to create:

  • lib/constants.py

Files to modify: None

Steps:

  1. Create lib/constants.py with:
    • DELEGATION_MAP: ordered list of agent role names matching CLAUDE.md order (REQ-040)
    • DEFAULT_BASH_BLOCKLIST: the 18 regex patterns from the design doc (REQ-028)
    • MODEL_PRICING: dict mapping model IDs to input/output cost per 1K tokens
    • DEFAULT_CONTEXT_WINDOW_TOKENS = 200_000
    • DEFAULT_MAX_OUTPUT_TOKENS = 8192
    • DEFAULT_MAX_TOOL_TURNS = 200
    • DEFAULT_COST_CEILING_USD = 5.00
    • DEFAULT_COST_WARNING_USD = 2.00
    • Dataclasses: AgentDefinition, DelegationContract, GateDecisionType, GateDecision, ToolResult, CostRecord, CompletionSignal, ConverseTurn, ToolUseRequest, WorkflowResult, WorkflowStatus
    • Protocol classes: GatePolicy, BashPolicy
    • Custom exceptions: SandboxViolationError, CostCeilingError, AgentParseError
  2. All dataclasses must match the design doc signatures exactly.
  3. Write unit tests in tests/test_constants.py: verify delegation map length (13), verify all dataclasses are instantiable, verify DelegationContract.to_message() produces labeled sections.

Requirements traced: REQ-040, REQ-028, REQ-081, REQ-082, REQ-083, REQ-085

Verification:

  • python3 -m pytest tests/test_constants.py -v passes
  • All dataclasses importable from lib.constants

Estimated complexity: S

Risk level: Low -- pure data definitions with no external dependencies.

Recommended mode: executor


TASK-002: Configuration loader

Goal: Implement lib/config.py to load .system2/config.yml, extract orchestrator-specific settings under providers.bedrock.orchestrator, and fall back to defaults when keys are missing or the file is invalid.

Files to create:

  • lib/config.py
  • tests/test_config.py

Files to modify: None

Steps:

  1. Create lib/config.py with an OrchestratorConfig dataclass containing all fields from REQ-085 with defaults.
  2. Implement load_config(config_path: Path | None = None) -> OrchestratorConfig:
    • Auto-discover .system2/config.yml relative to project root if no path given.
    • Parse YAML; on FileNotFoundError or yaml.YAMLError, log warning and return defaults (REQ-094).
    • Extract providers.bedrock.orchestrator namespace.
    • Merge bash_blocklist with DEFAULT_BASH_BLOCKLIST from constants (REQ-028a).
    • Return populated OrchestratorConfig.
  3. Write tests in tests/test_config.py:
    • Valid config with all keys set.
    • Missing config file -> defaults.
    • Invalid YAML -> defaults with warning.
    • Missing orchestrator namespace -> defaults.
    • Custom bash_blocklist merged with defaults.
    • Existing config keys preserved (REQ-141).

Requirements traced: REQ-084, REQ-085, REQ-094, REQ-140, REQ-141, REQ-143, REQ-153

Verification:

  • python3 -m pytest tests/test_config.py -v passes

Estimated complexity: S

Risk level: Low -- straightforward YAML loading with fallback.

Recommended mode: executor


TASK-003: Transcript writer

Goal: Implement lib/transcript.py for append-only JSONL transcript writing to .system2/runs/<timestamp>.jsonl. Must be best-effort (never halt workflow on write failure).

Files to create:

  • lib/transcript.py
  • tests/test_transcript.py

Files to modify: None

Steps:

  1. Create lib/transcript.py with class TranscriptWriter:
    • __init__(self, transcript_dir: Path) -- creates the directory if needed, opens the file.
    • write(self, entry: dict) -> None -- adds ts field, serializes to JSON, appends line, flushes. Wraps in try/except; on error logs warning to stderr, sets internal _write_failed flag (REQ-126).
    • Convenience methods: workflow_start(), agent_start(), api_request(), api_response(), tool_exec(), gate_decision(), agent_complete(), workflow_end() -- each constructs the appropriate dict with type field and calls write().
    • close() -- flushes and closes the file handle.
  2. Write tests in tests/test_transcript.py:
    • Write several entries, read back JSONL, assert correct types and fields.
    • Simulate write failure (read-only directory or mock), assert no exception raised and warning logged.
    • Verify timestamp is present on each entry.

Requirements traced: REQ-125, REQ-126

Verification:

  • python3 -m pytest tests/test_transcript.py -v passes

Estimated complexity: S

Risk level: Low -- simple file I/O with error suppression.

Recommended mode: executor


Batch 2: Tool Implementations (parallelizable)

All tool tasks depend on TASK-001 (for ToolResult, SandboxViolationError, BashPolicy).


TASK-004: Path sandbox and base tool interface

Goal: Implement lib/tools/sandbox.py (path validation) and lib/tools/base.py (abstract BaseTool class). Create lib/tools/__init__.py with the tool registry.

Files to create:

  • lib/tools/__init__.py
  • lib/tools/base.py
  • lib/tools/sandbox.py
  • tests/test_sandbox.py

Files to modify: None

Steps:

  1. Create lib/tools/sandbox.py:
    • validate_path(requested_path: str, project_root: Path) -> Path -- resolves symlinks, normalizes .., checks prefix. Raises SandboxViolationError if outside root.
  2. Create lib/tools/base.py:
    • Abstract class BaseTool with:
      • name: str (class attribute)
      • get_tool_spec() -> dict -- returns Converse API toolSpec dict
      • execute(input: dict, project_root: Path, **kwargs) -> ToolResult -- abstract
  3. Create lib/tools/__init__.py:
    • ToolRegistry class: registers tools by name, filters by agent allowlist, returns tool spec lists.
    • create_default_registry(project_root: Path, bash_policy: BashPolicy, config: OrchestratorConfig) -> ToolRegistry
    • Registers all 6 tools + signal_completion pseudo-tool.
  4. Write tests/test_sandbox.py:
    • Path within project root -> passes.
    • Path outside project root -> raises SandboxViolationError.
    • Path with .. traversal -> raises.
    • Symlink pointing outside -> raises.
    • Project root itself -> passes.

Requirements traced: REQ-110, REQ-029, REQ-030, REQ-080

Verification:

  • python3 -m pytest tests/test_sandbox.py -v passes

Estimated complexity: M

Risk level: Med -- sandbox is security-critical; symlink edge cases need careful handling.

Recommended mode: executor


TASK-005: Read tool

Goal: Implement lib/tools/read_tool.py -- reads files with optional offset/limit, returns content with line numbers (cat -n format).

Files to create:

  • lib/tools/read_tool.py
  • tests/test_read_tool.py

Files to modify: None

Dependencies: TASK-004

Steps:

  1. Create ReadTool(BaseTool) with:
    • get_tool_spec() matching the design doc schema (file_path required, offset/limit optional).
    • execute(): validate path via sandbox, read file, apply offset/limit, format with line numbers, return ToolResult.
    • Handle file not found -> ToolResult(is_error=True).
    • Truncate lines longer than 2000 characters.
    • Default: read up to 2000 lines from start.
  2. Write tests:
    • Read a small file -> correct content with line numbers.
    • Read with offset and limit.
    • File not found -> error result.
    • Path outside sandbox -> error result.
    • Large file -> truncation at 2000 lines.

Requirements traced: REQ-021, REQ-030, REQ-053, REQ-110

Verification:

  • python3 -m pytest tests/test_read_tool.py -v passes

Estimated complexity: S

Risk level: Low

Recommended mode: executor


TASK-006: Write tool

Goal: Implement lib/tools/write_tool.py -- writes content to a file, creating parent directories if needed.

Files to create:

  • lib/tools/write_tool.py
  • tests/test_write_tool.py

Files to modify: None

Dependencies: TASK-004

Steps:

  1. Create WriteTool(BaseTool) with:
    • get_tool_spec() with file_path and content as required parameters.
    • execute(): validate path via sandbox, create parent dirs, write content, return success ToolResult.
    • Handle permission errors -> ToolResult(is_error=True).
  2. Write tests:
    • Write to new file -> file exists with correct content.
    • Write creating parent dirs.
    • Path outside sandbox -> error.
    • Overwrite existing file.

Requirements traced: REQ-022, REQ-030, REQ-053, REQ-110

Verification:

  • python3 -m pytest tests/test_write_tool.py -v passes

Estimated complexity: S

Risk level: Low

Recommended mode: executor


TASK-007: Edit tool

Goal: Implement lib/tools/edit_tool.py -- exact string replacement with clear error on mismatch.

Files to create:

  • lib/tools/edit_tool.py
  • tests/test_edit_tool.py

Files to modify: None

Dependencies: TASK-004

Steps:

  1. Create EditTool(BaseTool) with:
    • get_tool_spec() with file_path, old_string, new_string (required), replace_all (optional bool, default false).
    • execute():
      • Validate path via sandbox.
      • Read file content.
      • If old_string not found: return error with file path and a snippet of the file around the expected location (REQ-096).
      • If old_string found multiple times and replace_all is false: return error stating non-unique match.
      • If replace_all is true: replace all occurrences.
      • Otherwise: replace first occurrence. Write file.
    • File must have been read by the Read tool before editing (design doc states this, but we enforce by checking file existence rather than tracking reads -- keep it simple for MVP).
  2. Write tests:
    • Successful single replacement.
    • old_string not found -> descriptive error with snippet.
    • Non-unique old_string without replace_all -> error.
    • replace_all=True -> all occurrences replaced.
    • Path outside sandbox -> error.

Requirements traced: REQ-023, REQ-030, REQ-053, REQ-096, REQ-110

Verification:

  • python3 -m pytest tests/test_edit_tool.py -v passes

Estimated complexity: M

Risk level: Med -- exact string matching edge cases (whitespace, encoding).

Recommended mode: executor


TASK-008: Grep and Glob tools

Goal: Implement lib/tools/grep_tool.py and lib/tools/glob_tool.py.

Files to create:

  • lib/tools/grep_tool.py
  • lib/tools/glob_tool.py
  • tests/test_grep_tool.py
  • tests/test_glob_tool.py

Files to modify: None

Dependencies: TASK-004

Steps:

  1. Create GrepTool(BaseTool):
    • get_tool_spec() with pattern (required), path, glob filter, type filter, output_mode, context lines (-A/-B/-C), case-insensitive flag, head_limit, multiline flag.
    • execute(): validate path via sandbox, use subprocess.run with rg (ripgrep) if available, fall back to Python re + pathlib walk if not. Return matches in requested output_mode.
    • Handle regex errors -> ToolResult(is_error=True).
  2. Create GlobTool(BaseTool):
    • get_tool_spec() with pattern (required), path (optional).
    • execute(): validate path, use pathlib.Path.glob() or glob.glob(), sort results alphabetically (REQ-025), return file paths.
  3. Write tests for both:
    • Grep: regex match, case-insensitive, no matches, invalid regex -> error.
    • Glob: match files, no matches, sorted output.
    • Both: path outside sandbox -> error.

Requirements traced: REQ-024, REQ-025, REQ-030, REQ-053, REQ-110

Verification:

  • python3 -m pytest tests/test_grep_tool.py tests/test_glob_tool.py -v passes

Estimated complexity: M

Risk level: Low -- standard library operations; ripgrep fallback adds minor complexity.

Recommended mode: executor


TASK-009: Bash tool

Goal: Implement lib/tools/bash_tool.py with blocklist enforcement, safety mode handling, and user confirmation via BashPolicy.

Files to create:

  • lib/tools/bash_tool.py
  • tests/test_bash_tool.py

Files to modify: None

Dependencies: TASK-004, TASK-002 (for config with merged blocklist)

Steps:

  1. Create BashTool(BaseTool):
    • __init__ accepts BashPolicy, safety_mode, blocklist (merged list from config).
    • get_tool_spec() with command (required), timeout (optional, default 120s).
    • execute():
      • Check command against blocklist (regex matching). If match:
        • strict mode: return error ToolResult explaining which pattern matched (REQ-115).
        • permissive mode: call bash_policy.confirm(command, is_blocklisted=True). If denied, return error.
      • If not blocklisted and --unsafe-bash not set: call bash_policy.confirm(command, is_blocklisted=False). If denied, return error.
      • Execute via subprocess.run(command, shell=True, capture_output=True, timeout=timeout, cwd=project_root).
      • Return stdout, stderr, exit code in ToolResult. Truncate output at 100KB (DQ-4).
  2. Create InteractiveBashPolicy (default CLI policy) that prompts on stdin.
  3. Write tests:
    • Normal command execution -> correct stdout/stderr/exit code.
    • Blocklisted command in strict mode -> error without execution.
    • Blocklisted command in permissive mode with deny -> error.
    • Blocklisted command in permissive mode with confirm -> executes.
    • Non-blocklisted command without unsafe-bash, deny -> error.
    • Non-blocklisted command without unsafe-bash, confirm -> executes.
    • Output truncation at 100KB.
    • Custom blocklist patterns from config (REQ-028a).
    • Timeout handling.

Requirements traced: REQ-026, REQ-027, REQ-028, REQ-028a, REQ-030, REQ-053, REQ-115

Verification:

  • python3 -m pytest tests/test_bash_tool.py -v passes

Estimated complexity: M

Risk level: Med -- safety-critical tool; must not allow bypass of blocklist in strict mode.

Recommended mode: executor


Batch 3: Agent Parser

Depends on TASK-001 (for AgentDefinition, AgentParseError).


TASK-010: Agent parser with YAML frontmatter extraction

Goal: Implement lib/agent_parser.py -- parse .claude/agents/*.md files, extract YAML frontmatter and Markdown body, produce AgentDefinition objects.

Files to create:

  • lib/agent_parser.py
  • tests/test_agent_parser.py

Files to modify: None

Dependencies: TASK-001

Steps:

  1. Create lib/agent_parser.py:
    • parse_agent(agent_path: Path) -> AgentDefinition:
      • Read file content.
      • Split YAML frontmatter (between --- delimiters) from Markdown body.
      • Parse YAML: extract name, description, tools. Ignore unknown keys like hooks (REQ-012).
      • Handle malformed YAML -> raise AgentParseError with descriptive message (REQ-095).
      • Store raw system prompt (body) before transformation.
    • parse_all_agents(agents_dir: Path) -> dict[str, AgentDefinition]:
      • Glob for *.md, parse each.
      • On AgentParseError, log warning and skip (REQ-095).
      • Return dict keyed by agent name.
    • get_agent(name: str, agents_dir: Path) -> AgentDefinition:
      • Load single agent on-demand (REQ-103).
  2. Write tests:
    • Parse a well-formed agent file -> correct name, description, tools, system_prompt.
    • Unknown frontmatter keys ignored (REQ-012).
    • Malformed YAML -> AgentParseError.
    • Missing required fields -> AgentParseError.
    • parse_all_agents with one bad file -> skipped, others loaded.
    • Verify no files modified on disk (REQ-015).
    • Parse all 13 actual agent files without error (REQ-011).

Requirements traced: REQ-010, REQ-011, REQ-012, REQ-015, REQ-016, REQ-081, REQ-095, REQ-103

Verification:

  • python3 -m pytest tests/test_agent_parser.py -v passes
  • Test that parses all 13 real .claude/agents/*.md files succeeds

Estimated complexity: M

Risk level: Low -- YAML + string splitting; well-defined format.

Recommended mode: executor


TASK-011: Prompt transformation layer

Goal: Implement the documented prompt transformation rules (TR-01 through TR-08) in lib/agent_parser.py and append the completion protocol instruction.

Files to modify:

  • lib/agent_parser.py (extend from TASK-010)

Files to create:

  • tests/test_prompt_transforms.py

Dependencies: TASK-010

Steps:

  1. Add to lib/agent_parser.py:
    • TRANSFORM_RULES: list of (compiled_regex, replacement) tuples matching the design doc.
    • COMPLETION_INSTRUCTION: the appended system prompt block.
    • apply_transforms(raw_prompt: str) -> str:
      • Apply each rule sequentially (TR-02 through TR-08).
      • Append COMPLETION_INSTRUCTION.
      • Return transformed prompt.
    • Integrate into parse_agent(): store raw_system_prompt and system_prompt (post-transform).
  2. Write tests in tests/test_prompt_transforms.py:
    • TR-02: attempt_completion -> signal_completion.
    • TR-03: "delegate to executor" -> "report to orchestrator for re-delegation to executor".
    • TR-04: "Claude Code CLI" -> "the orchestrator"; "Claude Code" (standalone) -> "the orchestrator".
    • TR-07: CLAUDE.md loading reference -> simplified.
    • TR-08: Hook script lines removed.
    • Completion instruction appended.
    • Transforms applied to real agent files -> no attempt_completion remains, no Claude Code CLI remains.
    • raw_system_prompt preserved untransformed.

Requirements traced: REQ-013, REQ-014, REQ-N02

Verification:

  • python3 -m pytest tests/test_prompt_transforms.py -v passes
  • Grep transformed output of all 13 agents for attempt_completion -> zero matches

Estimated complexity: M

Risk level: Med -- regex rules must not corrupt prompts; need to verify against all 13 real agents.

Recommended mode: executor


Batch 4: Bedrock Conversation

Depends on TASK-001 (for data classes) and TASK-002 (for config).


TASK-012: BedrockConversation wrapper -- core Converse API integration

Goal: Implement lib/bedrock_conversation.py -- wraps BedrockClient.client to call the Converse API with message history, tool definitions, and response parsing.

Files to create:

  • lib/bedrock_conversation.py
  • tests/test_bedrock_conversation.py

Files to modify: None

Dependencies: TASK-001, TASK-002

Steps:

  1. Create lib/bedrock_conversation.py with class BedrockConversation:
    • __init__(self, bedrock_client: BedrockClient, model_id: str, system_prompt: str, tool_definitions: list[dict], config: OrchestratorConfig):
      • Store reference to bedrock_client.client (the boto3 bedrock-runtime client) (AD-1).
      • Initialize empty _messages list.
      • Store system prompt and tool config.
    • send(self, user_content: list[dict]) -> ConverseTurn:
      • Append user message to _messages.
      • Call self._call_with_retry() with Converse API format.
      • Parse response into ConverseTurn dataclass.
      • Append assistant message to _messages.
      • Accumulate token counts.
      • Return ConverseTurn.
    • send_tool_results(self, results: list[ToolResult]) -> ConverseTurn:
      • Format ToolResult objects as toolResult content blocks in a user message.
      • Call send().
    • _call_with_retry(self, **kwargs) -> dict:
      • Implement retry logic: ThrottlingException -> 3 retries exponential backoff; 5xx -> 2 retries (REQ-090, REQ-091).
    • get_token_usage(self) -> tuple[int, int]: return (total_input, total_output).
    • estimate_next_call_tokens(self) -> int: heuristic pre-check (chars / 3.5 + tool overhead).
  2. Write tests (mock bedrock_client.client.converse):
    • Send user message -> correct Converse API call format.
    • Parse response with text blocks -> correct ConverseTurn.
    • Parse response with toolUse blocks -> correct ToolUseRequest objects.
    • Send tool results -> correct toolResult format.
    • Token accumulation across multiple calls.
    • Retry on ThrottlingException (mock 3 failures then success).
    • Retry on 5xx (mock 2 failures then success).
    • Max retries exceeded -> exception raised.
    • Verify converse() called on bedrock_client.client, not invoke_model (AD-1).
    • Verify no direct boto3 import in this module (REQ-062).

Requirements traced: REQ-060, REQ-061, REQ-062, REQ-063, REQ-064, REQ-090, REQ-091

Verification:

  • python3 -m pytest tests/test_bedrock_conversation.py -v passes
  • grep -r "import boto3" lib/bedrock_conversation.py returns nothing

Estimated complexity: L

Risk level: High -- core API integration layer; Converse API format must be exactly correct; retry logic is safety-critical.

Rollback: Delete lib/bedrock_conversation.py. No existing files modified.

Recommended mode: executor


TASK-013: Token tracking and context window overflow handling

Goal: Add token tracking, 80% warning, and context window overflow handling (abort or auto-summarize) to BedrockConversation.

Files to modify:

  • lib/bedrock_conversation.py (extend from TASK-012)

Files to create:

  • tests/test_token_management.py

Dependencies: TASK-012

Steps:

  1. Extend BedrockConversation:
    • Before each send(), call estimate_next_call_tokens().
    • If estimate > 80% of context_window_tokens: emit warning to stderr (REQ-054).
    • If estimate > 95% of context_window_tokens: raise a ContextWindowOverflow exception that the caller (delegation engine) catches to present user options (REQ-055).
    • Add auto_summarize(self) -> None: creates a summarization request, replaces message history with the summary, inserts [CONTEXT SUMMARIZED] marker. Adds cost to tracker.
  2. Write tests:
    • Simulate conversation at 79% -> no warning.
    • Simulate conversation at 81% -> warning logged.
    • Simulate conversation at 96% -> ContextWindowOverflow raised.
    • Auto-summarize: verify message history replaced, marker present, token count reduced.
    • Heuristic estimation includes tool definition overhead (~2500 tokens for 7 tools).

Requirements traced: REQ-054, REQ-055, REQ-102

Verification:

  • python3 -m pytest tests/test_token_management.py -v passes

Estimated complexity: M

Risk level: Med -- heuristic estimation can be inaccurate; auto-summarize is a complex flow.

Recommended mode: executor


Batch 5: Integration Layer

Depends on Batches 2, 3, and 4.


TASK-014: Cost tracker

Goal: Implement lib/cost_tracker.py -- accumulates per-agent cost records, checks warning/ceiling thresholds, formats display strings.

Files to create:

  • lib/cost_tracker.py
  • tests/test_cost_tracker.py

Files to modify: None

Dependencies: TASK-001, TASK-002

Steps:

  1. Create lib/cost_tracker.py with class CostTracker:
    • __init__(self, config: OrchestratorConfig) -- reads warning/ceiling thresholds.
    • add(self, agent_name: str, model_id: str, input_tokens: int, output_tokens: int, api_calls: int, tool_turns: int) -> CostRecord:
      • Calculate cost using MODEL_PRICING from constants.
      • Append to internal list.
      • Return the CostRecord.
    • check_thresholds(self) -> str | None:
      • If cumulative >= ceiling: return "ceiling" (REQ-132).
      • If cumulative >= warning and not yet warned: return "warning" (REQ-131).
      • Otherwise: None.
    • get_cumulative(self) -> float.
    • format_agent_summary(self, record: CostRecord) -> str -- the per-agent display block (REQ-120).
    • format_workflow_summary(self, agents_invoked: list[str], wall_clock_seconds: float) -> str -- the workflow summary (REQ-121).
  2. Write tests:
    • Add costs, verify cumulative.
    • Warning threshold triggered once.
    • Ceiling threshold triggered.
    • Format strings match expected output.
    • Custom thresholds from config.

Requirements traced: REQ-120, REQ-121, REQ-130, REQ-131, REQ-132, REQ-133

Verification:

  • python3 -m pytest tests/test_cost_tracker.py -v passes

Estimated complexity: S

Risk level: Low

Recommended mode: executor


TASK-015: Signal completion pseudo-tool and tool-use loop

Goal: Implement the signal_completion pseudo-tool and the core tool-use loop logic that drives an agent through multiple tool calls until completion.

Files to create:

  • lib/tools/signal_completion.py
  • lib/delegation.py
  • tests/test_tool_use_loop.py

Files to modify:

  • lib/tools/__init__.py (register signal_completion)

Dependencies: TASK-004 through TASK-009, TASK-010, TASK-011, TASK-012

Steps:

  1. Create lib/tools/signal_completion.py:
    • SignalCompletionTool(BaseTool) with get_tool_spec() matching design doc schema.
    • execute(): parse input into CompletionSignal, return ToolResult with "Completion acknowledged."
  2. Register in lib/tools/__init__.py.
  3. Create lib/delegation.py with:
    • DelegationEngine:
      • __init__(self, tool_registry: ToolRegistry, transcript: TranscriptWriter, config: OrchestratorConfig).
      • run_agent(self, agent_def: AgentDefinition, contract: DelegationContract, bedrock_conversation: BedrockConversation, cost_tracker: CostTracker) -> CompletionSignal:
        • Send contract as initial user message.
        • Enter tool-use loop:
          • If stop_reason == "tool_use": execute each tool, send results back.
          • If a tool is signal_completion: extract CompletionSignal, break.
          • If stop_reason == "end_turn": implicit completion (parse text for JSON, fallback to text summary).
          • If stop_reason == "max_tokens": warn, present options (REQ-092).
          • Log each tool execution to transcript (REQ-122).
          • Check max_tool_turns safety limit.
        • Update cost tracker with cumulative tokens.
        • Return CompletionSignal.
    • Tool filtering: only provide tools listed in agent's tools field (REQ-029).
  4. Write tests (mock BedrockConversation):
    • Agent makes 2 tool calls then signals completion -> correct flow.
    • Agent makes a tool call that errors -> error returned, agent retries.
    • Agent signals implicit completion (end_turn, no tool calls).
    • Multiple tool_use blocks in single response -> all executed.
    • signal_completion detected -> loop terminates.
    • Max tool turns exceeded -> halted.
    • Tool not in agent allowlist -> not provided to API.
    • Tool exec logged to transcript.

Requirements traced: REQ-014, REQ-029, REQ-050, REQ-051, REQ-052, REQ-053, REQ-083, REQ-092, REQ-122

Verification:

  • python3 -m pytest tests/test_tool_use_loop.py -v passes

Estimated complexity: L

Risk level: High -- core orchestration logic; many edge cases in loop termination.

Rollback: Delete lib/delegation.py, lib/tools/signal_completion.py.

Recommended mode: executor


TASK-016: Prompt injection detection and output sanitization

Goal: Implement output sanitization that scans agent text responses for prompt injection patterns and flags them for user review.

Files to create:

  • lib/safety.py
  • tests/test_safety.py

Files to modify: None

Dependencies: TASK-001

Steps:

  1. Create lib/safety.py:
    • INJECTION_PATTERNS: list of regex patterns matching the design doc (skip security, modify CLAUDE.md, escalate privileges, ignore previous instructions).
    • scan_for_injection(text: str) -> list[str]: returns list of matched pattern descriptions. Empty list = clean.
    • sanitize_log_entry(entry: dict) -> dict: strips sensitive fields (credentials, secrets patterns) from log entries (REQ-111).
  2. Integrate with DelegationEngine (TASK-015): after each agent response, call scan_for_injection(). If matches found, flag to user via a callback (or raise if no callback).
  3. Write tests:
    • Text with "skip security" -> detected.
    • Text with "modify CLAUDE.md" -> detected.
    • Text with "ignore previous instructions" -> detected.
    • Clean text -> no matches.
    • Log sanitization removes credential-like patterns.

Requirements traced: REQ-112, REQ-113, REQ-111

Verification:

  • python3 -m pytest tests/test_safety.py -v passes

Estimated complexity: S

Risk level: Med -- must not produce false positives that block normal agent operation; must not miss real injections.

Recommended mode: security-sentinel


Batch 6: Orchestrator and CLI

Depends on Batches 1-5.


TASK-017: Orchestrator programmatic API

Goal: Implement lib/orchestrator.py with the Orchestrator class exposing run(), invoke_agent(), and get_status(). For Phase 1, run() supports single-agent mode; full workflow sequencing is Phase 2.

Files to create:

  • lib/orchestrator.py
  • tests/test_orchestrator.py

Files to modify: None

Dependencies: TASK-001 through TASK-016

Steps:

  1. Create lib/orchestrator.py:
    • class Orchestrator:
      • __init__(self, project_root: Path, config_path: Path | None, gate_policy: GatePolicy | None, bash_policy: BashPolicy | None, on_agent_complete: Callable | None) (REQ-071, REQ-073).
      • Loads config, creates BedrockClient, CostTracker, TranscriptWriter, ToolRegistry, AgentParser, DelegationEngine.
      • invoke_agent(self, agent_name: str, contract: DelegationContract) -> AgentResult (REQ-072):
        • Parse agent on-demand.
        • Create BedrockConversation with agent's system prompt and filtered tools.
        • Call DelegationEngine.run_agent().
        • Display cost summary.
        • Return result.
      • run(self, task_description: str) -> WorkflowResult (REQ-072):
        • Phase 1: delegates to a single agent (spec-coordinator) as proof of concept.
        • Phase 2 stub: _run_full_workflow() raises NotImplementedError.
      • get_status(self) -> WorkflowStatus.
    • Verify no stdin/stdout dependency in core logic (REQ-073) -- all I/O through injected policies/callbacks.
    • Handle credential failures at init -> clear error (REQ-093).
  2. Write tests (mock BedrockClient and Converse API):
    • Instantiate with config -> no error.
    • invoke_agent() with mock -> returns AgentResult.
    • No stdin/stdout calls in orchestrator (assert using mock).
    • Invalid credentials -> clear error message and appropriate exit.
    • Cost ceiling reached -> pauses via gate_policy.

Requirements traced: REQ-070, REQ-071, REQ-072, REQ-073, REQ-093, REQ-114

Verification:

  • python3 -m pytest tests/test_orchestrator.py -v passes
  • from lib.orchestrator import Orchestrator works in a Python shell

Estimated complexity: L

Risk level: Med -- wiring layer that connects all components; many injection points.

Recommended mode: executor


TASK-018: CLI entry point

Goal: Implement system2/__init__.py (version check) and system2/__main__.py (CLI argument parsing, TTY detection, interactive session launch).

Files to create:

  • system2/__init__.py
  • system2/__main__.py
  • tests/test_cli.py

Files to modify: None

Dependencies: TASK-017

Steps:

  1. Create system2/__init__.py:
    • Python version check: if sys.version_info < (3, 10), print error and sys.exit(1) (REQ-142).
  2. Create system2/__main__.py:
    • import argparse (REQ-143 -- stdlib only).
    • Parse args: positional task_description (optional), --unsafe-bash, --config, --project-root, --log-format, --log-file, --verbose.
    • TTY detection: if no task and stdin is TTY, prompt interactively (REQ-003). If not TTY and no task, exit code 2 with error (REQ-003a).
    • Discover project root: git root or cwd.
    • Create InteractiveGatePolicy, InteractiveBashPolicy (or unsafe variant).
    • Create Orchestrator and call orchestrator.run(task) or orchestrator.invoke_agent().
    • Handle KeyboardInterrupt gracefully.
    • Exit codes per design: 0=success, 1=error, 2=bad args, 3=credentials, 4=cost ceiling.
  3. Write tests:
    • Arg parsing: task description extracted correctly.
    • --unsafe-bash flag parsed.
    • Non-TTY without task -> exit code 2 (REQ-003a).
    • Python version check (mock sys.version_info).
    • python3 -m system2 --help produces usage text.

Requirements traced: REQ-001, REQ-002, REQ-003, REQ-003a, REQ-004, REQ-142, REQ-143, REQ-152

Verification:

  • python3 -m pytest tests/test_cli.py -v passes
  • python3 -m system2 --help displays usage without error

Estimated complexity: M

Risk level: Low -- standard argparse usage; entry point wiring.

Recommended mode: executor


Batch 7: End-to-End Validation


TASK-019: Integration test -- single-agent end-to-end with mocked Bedrock

Goal: Write an integration test that exercises the full path: CLI args -> Orchestrator -> AgentParser -> BedrockConversation (mocked) -> tool-use loop -> tool execution -> completion signal -> cost display -> transcript written.

Files to create:

  • tests/test_integration_e2e.py

Files to modify: None

Dependencies: TASK-001 through TASK-018

Steps:

  1. Create tests/test_integration_e2e.py:
    • Mock BedrockClient.client.converse to return scripted responses:
      • Turn 1: agent calls Read tool on spec/context.md.
      • Turn 2: agent calls Write tool to create a file.
      • Turn 3: agent calls signal_completion with success.
    • Instantiate Orchestrator with mock BedrockClient and AutoApproveGatePolicy.
    • Call invoke_agent("spec-coordinator", contract).
    • Assert:
      • Agent parsed correctly (name, tools, transformed prompt).
      • 3 API calls made (matching conversation history growth).
      • Read tool returned file contents.
      • Write tool created the file.
      • CompletionSignal has status="success".
      • CostTracker accumulated tokens from all 3 calls.
      • Transcript JSONL file exists and contains expected entry types.
      • No boto3 import outside bedrock_client.py (grep check).
      • No files outside project root accessed (sandbox).
  2. Add a second test case:
    • Agent makes a tool call that errors (Edit with wrong old_string).
    • Agent receives error and retries with corrected old_string.
    • Assert error handling and retry work within the loop.
  3. Add a negative test:
    • Attempt to read file outside project root -> sandbox violation returned to agent.

Requirements traced: AC-1 through AC-10 (partial), REQ-050, REQ-051, REQ-052, REQ-053, REQ-062, REQ-110, REQ-120, REQ-125

Verification:

  • python3 -m pytest tests/test_integration_e2e.py -v passes
  • grep -rn "import boto3" lib/ --include="*.py" | grep -v bedrock_client.py returns nothing (AC-7)

Estimated complexity: L

Risk level: Med -- complex mock setup; test fragility if internal APIs change.

Recommended mode: test-engineer


Definition of Done Checklist

  • All 19 tasks completed with passing tests
  • python3 -m pytest tests/ -v passes with zero failures
  • python3 -m system2 --help displays usage text without error
  • grep -rn "import boto3" lib/ --include="*.py" | grep -v bedrock_client.py returns nothing (REQ-062)
  • All 13 .claude/agents/*.md files parse without error (REQ-011)
  • No existing files modified: lib/bedrock_client.py, .claude/agents/*.md, CLAUDE.md unchanged (REQ-150, REQ-152)
  • No third-party imports beyond boto3, pyyaml, stdlib (REQ-143)
  • Sandbox rejects paths outside project root (REQ-110)
  • Bash blocklist blocks destructive patterns in strict mode (REQ-028)
  • Cost tracker accumulates and checks thresholds (REQ-130, REQ-131, REQ-132)
  • Transcript JSONL written for agent sessions (REQ-125)
  • from lib.orchestrator import Orchestrator succeeds (REQ-070)

Execution Notes

Environment

  • Python: 3.10+
  • Test framework: pytest + pytest-mock (already in pyproject.toml dev dependencies)
  • No additional dependencies needed beyond what is in pyproject.toml
  • Platform: macOS (development), Linux (CI target)

Checkpoints

After Batch Checkpoint
Batch 1 All data classes importable; config loads with defaults; transcript writes JSONL
Batch 2 All 6 tools pass unit tests independently; sandbox rejects bad paths
Batch 3 All 13 agents parse; transforms produce correct output; no attempt_completion remains
Batch 4 BedrockConversation formats correct Converse API calls (verified against mock)
Batch 5 Tool-use loop completes with mocked API; cost tracker works; safety scanner runs
Batch 6 Orchestrator instantiates and invokes single agent; CLI parses args
Batch 7 End-to-end integration test passes with mocked Bedrock

Parallelization

Batches 2, 3, and 4 are fully independent and can be executed in parallel after Batch 1 completes. Within Batch 2, tasks TASK-005 through TASK-009 can all be executed in parallel (each tool is independent, all depend only on TASK-004).

Test Commands

All tests use standard pytest:

# Run all tests
python3 -m pytest tests/ -v

# Run a specific test file
python3 -m pytest tests/test_sandbox.py -v

# Run with coverage (if coverage is available)
python3 -m pytest tests/ --cov=lib --cov=system2 -v

Traceability

Requirements to Tasks

Requirement(s) Task(s)
REQ-001, REQ-002, REQ-003, REQ-003a, REQ-004 TASK-018
REQ-010, REQ-011, REQ-012, REQ-015, REQ-016, REQ-081, REQ-095, REQ-103 TASK-010
REQ-013, REQ-014 TASK-011
REQ-020, REQ-021 TASK-005
REQ-022 TASK-006
REQ-023, REQ-096 TASK-007
REQ-024, REQ-025 TASK-008
REQ-026, REQ-027, REQ-028, REQ-028a, REQ-115 TASK-009
REQ-029, REQ-030, REQ-080 TASK-004
REQ-040 TASK-001
REQ-041, REQ-050, REQ-051, REQ-052, REQ-053, REQ-083, REQ-092 TASK-015
REQ-054, REQ-055, REQ-102 TASK-013
REQ-060, REQ-061, REQ-062, REQ-063, REQ-064, REQ-090, REQ-091 TASK-012
REQ-070, REQ-071, REQ-072, REQ-073, REQ-093, REQ-114 TASK-017
REQ-082 TASK-001
REQ-084, REQ-085, REQ-094, REQ-140, REQ-141, REQ-143, REQ-153 TASK-002
REQ-110 TASK-004, TASK-005, TASK-006, TASK-007, TASK-008
REQ-111, REQ-112, REQ-113 TASK-016
REQ-120, REQ-121, REQ-130, REQ-131, REQ-132, REQ-133 TASK-014
REQ-122 TASK-015
REQ-125, REQ-126 TASK-003
REQ-142, REQ-152 TASK-018
REQ-150, REQ-151 TASK-010 (verified; no file modification)
REQ-N01 through REQ-N06 TASK-019 (verified via code review / grep)

Deferred Requirements (Phase 2+)

Requirement Phase Notes
REQ-005 (--auto-approve) Phase 2 Gate auto-approval flag
REQ-023a (unified diff fallback) Phase 2 Edit tool extension
REQ-042, REQ-043, REQ-043a, REQ-044, REQ-047 (full delegation workflow) Phase 2 Multi-agent sequencing + gates
REQ-045, REQ-046 (post-execution workflow) Phase 3 Boomerang cycles
REQ-124 (structured JSON logging) Phase 2 JSON log format option
REQ-160, REQ-161, REQ-162 (compliance) Phase 1 (inherited from BedrockClient) Already satisfied by existing code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment