OpenClaw Architecture Analysis - Retry and Error Handling Patterns

Generated: 2026-02-11
Source: Research for issue conductorbot-6cr

Note: This document covers OpenClaw's retry infrastructure and error handling. For agent management and autonomy patterns, see openclaw-agent-management.md.

1. Agent Harness

Main Entry Point

File: /Users/ajsharp/code/github/openclaw/src/commands/agent.ts

The agent harness wraps agent execution through the agentCommand function, which:

Validates input and session parameters
Resolves workspace, agent configuration, and model selection
Delegates to either CLI agents or embedded Pi agents
Handles model fallback and delivery

Core Agent Runner

File: /Users/ajsharp/code/github/openclaw/src/agents/pi-embedded-runner/run.ts

The runEmbeddedPiAgent function is the main orchestration layer:

export async function runEmbeddedPiAgent(
  params: RunEmbeddedPiAgentParams,
): Promise<EmbeddedPiRunResult>

Key responsibilities:

Session lane management - Uses queue-based execution lanes to prevent concurrent runs
Workspace resolution - Determines working directory with fallback logic
Auth profile management - Rotates through API keys/profiles when rate-limited
Retry orchestration - Coordinates multiple retry strategies
Result building - Constructs payloads from assistant responses and tool calls
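The session lane idea from the first responsibility can be illustrated with one promise chain per session key. This is a minimal sketch under stated assumptions, not OpenClaw's implementation: the class name `SessionLanes`, the `enqueue` method, and the use of a plain string as the lane key are all hypothetical.

```typescript
// Hypothetical per-session lane: tasks that share a key run strictly one at a time.
type Task<T> = () => Promise<T>;

class SessionLanes {
  // Tail of the promise chain for each lane key. Tails never reject.
  private tails = new Map<string, Promise<unknown>>();

  enqueue<T>(laneKey: string, task: Task<T>): Promise<T> {
    const previous = this.tails.get(laneKey) ?? Promise.resolve();
    // Start the task only after the previous task in this lane has settled.
    const next = previous.then(() => task());
    // Swallow failures on the stored tail so one failed run does not block the lane.
    const tail = next.catch(() => undefined);
    this.tails.set(laneKey, tail);
    // Drop the entry once the lane drains, so the map does not grow without bound.
    void tail.then(() => {
      if (this.tails.get(laneKey) === tail) this.tails.delete(laneKey);
    });
    // The caller still sees the task's own result or error.
    return next;
  }
}
```

Runs queued under different keys proceed independently; only runs that share a key are serialized.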

2. Looping Mechanism

Multi-Level Retry Strategy

The architecture implements three layered retry loops. They run in sequence or nested inside one another, not in parallel:

A. Auth Profile Rotation Loop

Location: runEmbeddedPiAgent (lines 357-384)

while (profileIndex < profileCandidates.length) {
  const candidate = profileCandidates[profileIndex];
  // Skip profiles that are still in cooldown
  if (candidate && isProfileInCooldown(authStore, candidate)) {
    profileIndex += 1;
    continue;
  }
  // First usable profile: apply its credentials and stop searching
  await applyApiKeyInfo(profileCandidates[profileIndex]);
  break;
}

Termination: The loop ends when it finds a profile that is not in cooldown, or when every candidate has been skipped
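A cooldown check of this kind can be sketched with a map from profile ID to the time its cooldown ends. This is an assumption-laden illustration: the store shape, `isCoolingDown`, and `nextUsableProfile` are hypothetical stand-ins, and the real `isProfileInCooldown` and `authStore` may work differently.

```typescript
// Hypothetical store shape: profile id -> epoch milliseconds at which its cooldown ends.
type CooldownStore = Map<string, number>;

function isCoolingDown(store: CooldownStore, profileId: string, now: number = Date.now()): boolean {
  const until = store.get(profileId);
  // A profile with no entry, or with an expired entry, is usable.
  return until !== undefined && until > now;
}

// Returns the index of the first usable profile at or after `start`, or -1 if none is usable.
function nextUsableProfile(
  candidates: string[],
  store: CooldownStore,
  start: number = 0,
  now: number = Date.now(),
): number {
  return candidates.findIndex((id, i) => i >= start && !isCoolingDown(store, id, now));
}
```

Passing `now` explicitly keeps the check deterministic in tests; a return value of -1 corresponds to the "every candidate skipped" exit of the loop.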

B. Main Execution Loop with Context Overflow Handling

Location: runEmbeddedPiAgent (lines 392-863)

const MAX_OVERFLOW_COMPACTION_ATTEMPTS = 3;
let overflowCompactionAttempts = 0;

while (true) {
  attemptedThinking.add(thinkLevel);

  const attempt = await runEmbeddedAttempt({ ... });

  // Handle context overflow with auto-compaction
  if (contextOverflowError) {
    if (!isCompactionFailure &&
        overflowCompactionAttempts < MAX_OVERFLOW_COMPACTION_ATTEMPTS) {
      overflowCompactionAttempts++;
      const compactResult = await compactEmbeddedPiSessionDirect({ ... });
      if (compactResult.compacted) {
        continue; // Retry after compaction
      }
    }
    // Try tool result truncation as last resort
    if (!toolResultTruncationAttempted) {
      const truncResult = await truncateOversizedToolResultsInSession({ ... });
      if (truncResult.truncated) {
        overflowCompactionAttempts = 0;
        continue;
      }
    }
    return { /* context overflow error */ };
  }

  // Handle auth failures with profile rotation
  if (shouldRotate) {
    if (lastProfileId) {
      await markAuthProfileFailure({ ... });
    }
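    const rotated = await advanceAuthProfile();
    if (rotated) {
      continue; // Retry with the next auth profile
    }
    // No profile left to rotate to; the exhausted-profiles exit is elided from this excerpt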
  }

  // Success path
  return { payloads, meta };
}

Termination conditions:

Successful completion
All auth profiles exhausted
Context overflow unrecoverable
Non-retryable error (AbortError, image errors, role ordering conflicts)
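The context overflow branch is the part of this loop that most needs a termination argument, so it is worth isolating. The sketch below is a simplified, synchronous model of that branch only; `OverflowHooks`, `runWithOverflowRecovery`, and the outcome type are hypothetical, and the auth rotation path is left out.

```typescript
type AttemptOutcome =
  | { ok: true; text: string }
  | { ok: false; reason: "context_overflow" | "other" };

// Hypothetical hooks standing in for the real attempt, compaction, and truncation calls.
interface OverflowHooks {
  attempt: () => AttemptOutcome;
  compact: () => boolean; // true if the session actually shrank
  truncateToolResults: () => boolean; // true if anything was truncated
}

function runWithOverflowRecovery(hooks: OverflowHooks, maxCompactions: number = 3): AttemptOutcome {
  let compactions = 0;
  let truncationTried = false;

  while (true) {
    const outcome = hooks.attempt();
    // Success and non-overflow failures leave this loop immediately.
    if (outcome.ok || outcome.reason !== "context_overflow") return outcome;

    // First choice: compact the session, up to the attempt budget.
    if (compactions < maxCompactions) {
      compactions += 1;
      if (hooks.compact()) continue;
    }

    // Last resort, used at most once: truncate oversized tool results.
    if (!truncationTried) {
      truncationTried = true;
      if (hooks.truncateToolResults()) {
        compactions = 0; // truncation earns a fresh compaction budget
        continue;
      }
    }

    // Nothing left to try: report the overflow.
    return outcome;
  }
}
```

Because truncation runs at most once and each compaction budget is finite, this version makes at most `2 * maxCompactions + 2` attempts even when every recovery step claims success but the overflow persists.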

C. Model Fallback Loop

File: /Users/ajsharp/code/github/openclaw/src/agents/model-fallback.ts

export async function runWithModelFallback<T>(params: {
  cfg: OpenClawConfig | undefined;
  provider: string;
  model: string;
  fallbacksOverride?: string[];
  run: (provider: string, model: string) => Promise<T>;
}): Promise<{ result: T; provider: string; model: string; attempts: FallbackAttempt[] }>

Retry logic:

for (let i = 0; i < candidates.length; i += 1) {
  const candidate = candidates[i];

  try {
    const result = await params.run(candidate.provider, candidate.model);
    return { result, provider: candidate.provider, model: candidate.model, attempts };
  } catch (err) {
    if (shouldRethrowAbort(err)) throw err;
    const normalized = coerceToFailoverError(err, { ... });
    if (!isFailoverError(normalized)) throw err;
    lastError = normalized;
    attempts.push({ /* failed attempt */ });
  }
}
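To make the control flow concrete, here is a self-contained, simplified variant of the same pattern. It is a sketch, not the OpenClaw function: `runWithFallback`, `failoverError`, `isFailover`, and the behavior after the last candidate fails (throwing a summary error) are assumptions, since the excerpt above does not show what happens once the loop ends.

```typescript
interface ModelRef {
  provider: string;
  model: string;
}

interface FailedAttempt extends ModelRef {
  reason: string;
  message: string;
}

// Hypothetical marker: an Error carrying a `reason` string is eligible for failover.
type FailoverError = Error & { reason: string };

function failoverError(reason: string, message: string): FailoverError {
  return Object.assign(new Error(message), { reason });
}

function isFailover(err: unknown): err is FailoverError {
  return err instanceof Error && "reason" in err && typeof (err as FailoverError).reason === "string";
}

async function runWithFallback<T>(
  candidates: ModelRef[],
  run: (provider: string, model: string) => Promise<T>,
): Promise<{ result: T; used: ModelRef; attempts: FailedAttempt[] }> {
  const attempts: FailedAttempt[] = [];
  for (const candidate of candidates) {
    try {
      const result = await run(candidate.provider, candidate.model);
      return { result, used: candidate, attempts };
    } catch (err) {
      // Errors that are not failover-eligible (for example bugs or aborts) propagate at once.
      if (!isFailover(err)) throw err;
      attempts.push({ ...candidate, reason: err.reason, message: err.message });
    }
  }
  // Assumption: once every candidate has failed, report the full attempt history.
  const summary = attempts.map((a) => `${a.provider}/${a.model}: ${a.reason}`).join("; ");
  throw new Error(`All ${candidates.length} model candidates failed (${summary})`);
}
```

Returning the `attempts` list alongside the result lets the caller log which candidates were skipped and why, even on success.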

3. Inter-Agent Communication

Subagent Spawning System

Tool: /Users/ajsharp/code/github/openclaw/src/agents/tools/sessions-spawn-tool.ts

Agents spawn isolated subagents via the sessions_spawn tool:

const childSessionKey = `agent:${targetAgentId}:subagent:${crypto.randomUUID()}`;

await callGateway({
  method: "agent",
  params: {
    sessionKey: childSessionKey,
    message: task,
    deliver: false,
    model: subagentModel,
    thinking: subagentThinking,
    systemPrompt: buildSubagentSystemPrompt(task),
  },
});
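The session key layout above is easy to build and to take apart again. The helpers below are a sketch: `buildSubagentSessionKey` and `parseSubagentSessionKey` are hypothetical names, and the parser assumes agent IDs never contain a colon, which the source does not state.

```typescript
import { randomUUID } from "node:crypto";

// Builds a key in the layout shown above: agent:<agentId>:subagent:<uuid>
function buildSubagentSessionKey(targetAgentId: string): string {
  return `agent:${targetAgentId}:subagent:${randomUUID()}`;
}

// Returns null when the key does not match the subagent layout.
// Assumption: agent IDs contain no ":" characters.
function parseSubagentSessionKey(key: string): { agentId: string; runUuid: string } | null {
  const match = /^agent:([^:]+):subagent:([0-9a-f-]{36})$/i.exec(key);
  const agentId = match?.[1];
  const runUuid = match?.[2];
  return agentId && runUuid ? { agentId, runUuid } : null;
}
```

A fresh UUID per spawn means two subagents of the same target agent never share a session, which is what keeps them isolated from each other.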

Subagent Result Delivery

File: /Users/ajsharp/code/github/openclaw/src/agents/subagent-announce.ts

export async function runSubagentAnnounceFlow(params: {
  childSessionKey: string;
  childRunId: string;
  requesterSessionKey: string;
  task: string;
  timeoutMs: number;
  cleanup: "delete" | "keep";
}): Promise<boolean>

Three delivery modes:

Steer - Inject message into active parent run (real-time)
Queue - Enqueue for delivery after current run completes
Interrupt - Force delivery for urgent notifications

4. Error Handling

Hierarchical Error Classification

type FailoverReason =
  | "auth"             // Authentication failure → rotate profiles
  | "rate_limit"       // Rate limiting → cooldown + rotate
  | "billing"          // Quota exceeded → skip provider
  | "timeout"          // Request timeout → retry or fallback
  | "context_overflow" // Prompt too large → compaction
  | "image_size"       // Image too large → user error
  | "role_ordering"    // Message ordering violation → user error
  | "unknown";         // Unclassified → fallback
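A classifier that produces these reasons might look like the sketch below. The status-code and message heuristics are illustrative assumptions only (401/403, 402, 408, and 429 are standard HTTP codes, but which ones OpenClaw actually inspects is not shown in this document), and `classifyError` is a hypothetical name.

```typescript
type FailoverReason =
  | "auth"
  | "rate_limit"
  | "billing"
  | "timeout"
  | "context_overflow"
  | "image_size"
  | "role_ordering"
  | "unknown";

// Hypothetical error shape: an optional HTTP status plus a message.
interface ProviderErrorLike {
  status?: number;
  message?: string;
}

function classifyError(err: ProviderErrorLike): FailoverReason {
  const message = (err.message ?? "").toLowerCase();
  // Status-based checks first; these mappings are assumptions, not OpenClaw's rules.
  if (err.status === 401 || err.status === 403) return "auth";
  if (err.status === 429) return "rate_limit";
  if (err.status === 402) return "billing";
  if (err.status === 408 || message.includes("timed out")) return "timeout";
  // Message-based check; the matched phrase is an illustrative guess.
  if (message.includes("context length")) return "context_overflow";
  // Anything unrecognized stays "unknown" so the caller can still fall back.
  return "unknown";
}
```

Keeping "unknown" as the default means a new provider error never crashes the classifier; it simply takes the generic fallback path.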

Recovery Strategy Matrix

| Error Type | Primary Recovery | Secondary Recovery | Final Fallback |
| --- | --- | --- | --- |
| `auth` | Rotate API key profile | Mark profile failed + cooldown | Model fallback |
| `rate_limit` | Rotate to next profile | Wait for cooldown | Model fallback |
| `context_overflow` | Auto-compact session (max 3x) | Truncate tool results | User error message |
| `timeout` | Retry same model | None | Model fallback |
| `billing` | Skip provider entirely | None | Model fallback |
| `image_size` | None | None | User error message |
| `role_ordering` | None | None | User error message |
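The matrix can be encoded as data so the engine walks an ordered list of steps per error type instead of branching by hand. This is a sketch: the step names, `RECOVERY_PLAN`, and `nextRecoveryStep` are hypothetical, and the `unknown` row is inferred from the "Unclassified → fallback" comment on the type rather than from the matrix itself.

```typescript
type FailoverReason =
  | "auth"
  | "rate_limit"
  | "billing"
  | "timeout"
  | "context_overflow"
  | "image_size"
  | "role_ordering"
  | "unknown";

// Hypothetical step identifiers, one per distinct cell in the matrix.
type RecoveryStep =
  | "rotate_profile"
  | "mark_failed_and_cooldown"
  | "wait_for_cooldown"
  | "compact_session"
  | "truncate_tool_results"
  | "retry_same_model"
  | "skip_provider"
  | "model_fallback"
  | "user_error";

// Ordered recovery steps per reason; "None" cells in the matrix are simply omitted.
const RECOVERY_PLAN: Record<FailoverReason, readonly RecoveryStep[]> = {
  auth: ["rotate_profile", "mark_failed_and_cooldown", "model_fallback"],
  rate_limit: ["rotate_profile", "wait_for_cooldown", "model_fallback"],
  context_overflow: ["compact_session", "truncate_tool_results", "user_error"],
  timeout: ["retry_same_model", "model_fallback"],
  billing: ["skip_provider", "model_fallback"],
  image_size: ["user_error"],
  role_ordering: ["user_error"],
  unknown: ["model_fallback"], // assumption based on the type's "fallback" comment
};

// Returns the next step to try, or null once the plan for this reason is exhausted.
function nextRecoveryStep(reason: FailoverReason, stepsAlreadyTried: number): RecoveryStep | null {
  return RECOVERY_PLAN[reason][stepsAlreadyTried] ?? null;
}
```

Typing the table as `Record<FailoverReason, ...>` makes the compiler flag any reason added to the union without a matching recovery plan.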

Key Patterns for ConductorBot

✅ Adopt

Multi-level retry loops - Auth rotation, execution retries, model fallback
Context overflow auto-compaction - Prevent hitting token limits
Error classification system - Structured recovery based on error type
Model fallback configuration - YAML-defined fallback chains
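For the YAML-defined fallback chains, one possible shape for a per-step retry block, validated after the YAML has been parsed into a plain object, is sketched below. Everything here is an assumption for illustration: the field names `maxAttempts` and `fallbacks`, the defaults, and `parseStepRetryConfig` are hypothetical and are not taken from OpenClaw or from the existing ConductorBot schema.

```typescript
// Hypothetical per-step retry settings for a workflow definition.
interface StepRetryConfig {
  maxAttempts: number; // total tries on the primary model, at least 1
  fallbacks: string[]; // ordered model identifiers to try after the primary
}

// Validates an already-parsed YAML value and fills in defaults.
function parseStepRetryConfig(raw: unknown): StepRetryConfig {
  // A missing block means "no retries, no fallbacks".
  if (raw === undefined || raw === null) return { maxAttempts: 1, fallbacks: [] };
  if (typeof raw !== "object" || Array.isArray(raw)) {
    throw new Error("retry must be a mapping");
  }
  const obj = raw as Record<string, unknown>;

  const maxAttempts = obj["maxAttempts"] ?? 1;
  if (typeof maxAttempts !== "number" || !Number.isInteger(maxAttempts) || maxAttempts < 1) {
    throw new Error("retry.maxAttempts must be an integer >= 1");
  }

  const fallbacks = obj["fallbacks"] ?? [];
  if (!Array.isArray(fallbacks) || !fallbacks.every((f) => typeof f === "string")) {
    throw new Error("retry.fallbacks must be a list of strings");
  }

  return { maxAttempts, fallbacks: fallbacks.map((f) => String(f)) };
}
```

Validating at load time keeps malformed retry settings from surfacing halfway through a workflow run.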

⚠️ Consider

Parallel step execution - For independent workflow steps
Subagent announce flow - Async step completion notifications

❌ Skip

Agent-driven task decomposition - Conflicts with declarative YAML paradigm
Session file persistence - SQLite already handles this
Gateway architecture - Not needed for single-tenant setup

Reference Files

OpenClaw

~/code/github/openclaw/src/agents/pi-embedded-runner/run.ts (retry orchestration)
~/code/github/openclaw/src/agents/model-fallback.ts (model fallback)
~/code/github/openclaw/src/agents/pi-embedded-helpers.ts (error classification)
~/code/github/openclaw/src/agents/subagent-announce.ts (async result delivery)

ConductorBot

claude-conductor/src/core/workflow-engine.ts (add retries)
claude-conductor/src/core/context-store.ts (add compaction)
claude-conductor/src/providers/provider.ts (wrap with fallback)
claude-conductor/src/schemas/workflow-schema.ts (add retry config)

