@ajsharp
Created February 12, 2026 15:36
OpenClaw Architecture Analysis - Retry and Error Handling Patterns

Generated: 2026-02-11
Source: Research for issue conductorbot-6cr

Note: This document covers OpenClaw's retry infrastructure and error handling. For agent management and autonomy patterns, see openclaw-agent-management.md.


1. Agent Harness

Main Entry Point

File: /Users/ajsharp/code/github/openclaw/src/commands/agent.ts

The agent harness wraps agent execution through the agentCommand function, which:

  • Validates input and session parameters
  • Resolves workspace, agent configuration, and model selection
  • Delegates to either CLI agents or embedded Pi agents
  • Handles model fallback and delivery

Core Agent Runner

File: /Users/ajsharp/code/github/openclaw/src/agents/pi-embedded-runner/run.ts

The runEmbeddedPiAgent function is the main orchestration layer:

export async function runEmbeddedPiAgent(
  params: RunEmbeddedPiAgentParams,
): Promise<EmbeddedPiRunResult>

Key responsibilities:

  • Session lane management - Uses queue-based execution lanes to prevent concurrent runs
  • Workspace resolution - Determines working directory with fallback logic
  • Auth profile management - Rotates through API keys/profiles when rate-limited
  • Retry orchestration - Coordinates multiple retry strategies
  • Result building - Constructs payloads from assistant responses and tool calls
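The "session lane" idea can be illustrated with a per-key promise chain: tasks that share a lane key run one at a time, while tasks on different keys proceed independently. This is a minimal sketch, assuming lanes are keyed by session key; `SessionLanes` and `enqueue` are hypothetical names, not OpenClaw's API.

```typescript
// Hypothetical sketch: serialize async work per lane key (e.g. a session key).
class SessionLanes {
  // Tail of each lane's promise chain; new work is chained after it.
  private tails = new Map<string, Promise<unknown>>();

  enqueue<T>(lane: string, task: () => Promise<T>): Promise<T> {
    const prev = this.tails.get(lane) ?? Promise.resolve();
    // Start this task only after the previous task in the same lane settles.
    const run = prev.then(() => task());
    // Keep a tail that never rejects, so one failed task does not block the lane.
    const tail = run.then(
      () => undefined,
      () => undefined,
    );
    this.tails.set(lane, tail);
    // Forget idle lanes so the map does not grow without bound.
    void tail.then(() => {
      if (this.tails.get(lane) === tail) this.tails.delete(lane);
    });
    return run;
  }
}
```

Chaining on a settled-only tail is the design choice that matters: a run that throws still releases the lane for the next queued run.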

2. Looping Mechanism

Multi-Level Retry Strategy

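The architecture layers three retry loops rather than running them concurrently: profile selection and the main execution loop live inside runEmbeddedPiAgent, while model fallback sits outside it in model-fallback.ts.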

A. Auth Profile Rotation Loop

Location: runEmbeddedPiAgent (lines 357-384)

while (profileIndex < profileCandidates.length) {
  const candidate = profileCandidates[profileIndex];
  // Skip profiles that are cooling down after a recent failure
  if (candidate && isProfileInCooldown(authStore, candidate)) {
    profileIndex += 1;
    continue;
  }
  // First usable profile: load its credentials and stop searching
  await applyApiKeyInfo(profileCandidates[profileIndex]);
  break;
}

Termination: Stops at the first profile that is not in cooldown, or exits once every candidate has been skipped.
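A self-contained version of the selection step, assuming a cooldown is stored as an "unusable until" timestamp per profile ID. `AuthStore`, `cooldownUntil`, `markFailure`, and `pickProfile` are hypothetical stand-ins for OpenClaw's `isProfileInCooldown`/`markAuthProfileFailure`, not their real signatures.

```typescript
// Hypothetical sketch: cooldown-aware auth profile selection.
interface AuthStore {
  // Epoch milliseconds until which each profile should not be used.
  cooldownUntil: Map<string, number>;
}

function isInCooldown(store: AuthStore, profileId: string, now: number): boolean {
  return (store.cooldownUntil.get(profileId) ?? 0) > now;
}

// Record a failure by putting the profile into cooldown for `cooldownMs`.
function markFailure(store: AuthStore, profileId: string, now: number, cooldownMs: number): void {
  store.cooldownUntil.set(profileId, now + cooldownMs);
}

// Index of the first usable profile at or after `start`, or -1 if all are cooling down.
function pickProfile(store: AuthStore, candidates: string[], start: number, now: number): number {
  for (let i = start; i < candidates.length; i += 1) {
    const id = candidates[i];
    if (id !== undefined && !isInCooldown(store, id, now)) return i;
  }
  return -1;
}
```

Rotating after a failure is then `markFailure` on the current ID followed by `pickProfile(store, candidates, current + 1, now)`.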

B. Main Execution Loop with Context Overflow Handling

Location: runEmbeddedPiAgent (lines 392-863)

const MAX_OVERFLOW_COMPACTION_ATTEMPTS = 3;
let overflowCompactionAttempts = 0;
let toolResultTruncationAttempted = false;

while (true) {
  attemptedThinking.add(thinkLevel);

  const attempt = await runEmbeddedAttempt({ ... });

  // Handle context overflow with auto-compaction
  if (contextOverflowError) {
    if (!isCompactionFailure &&
        overflowCompactionAttempts < MAX_OVERFLOW_COMPACTION_ATTEMPTS) {
      overflowCompactionAttempts++;
      const compactResult = await compactEmbeddedPiSessionDirect({ ... });
      if (compactResult.compacted) {
        continue; // Retry after compaction
      }
    }
    // Try tool result truncation as last resort (once per run)
    if (!toolResultTruncationAttempted) {
      toolResultTruncationAttempted = true;
      const truncResult = await truncateOversizedToolResultsInSession({ ... });
      if (truncResult.truncated) {
        overflowCompactionAttempts = 0; // Allow compaction again on the smaller session
        continue;
      }
    }
    return { /* context overflow error */ };
  }

  // Handle auth failures with profile rotation
  if (shouldRotate) {
    if (lastProfileId) {
      await markAuthProfileFailure({ ... });
    }
    const rotated = await advanceAuthProfile();
    if (rotated) {
      continue;
    }
  }

  // Success path
  return { payloads, meta };
}

Termination conditions:

  • Successful completion
  • All auth profiles exhausted
  • Context overflow unrecoverable
  • Non-retryable error (AbortError, image errors, role ordering conflicts)
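The overflow branch reduces to a bounded recovery policy: up to three compactions, then a single tool-result truncation pass that resets the compaction budget, then give up. The sketch below isolates that policy with injected callbacks; `runWithOverflowRecovery`, `OverflowHooks`, and their boolean return contracts are assumptions for illustration, not OpenClaw's types.

```typescript
// Hypothetical sketch of the context-overflow recovery policy.
type AttemptResult = { ok: true; text: string } | { ok: false; overflow: boolean };

interface OverflowHooks {
  attempt: () => Promise<AttemptResult>;
  compact: () => Promise<boolean>; // true if the session was compacted
  truncateToolResults: () => Promise<boolean>; // true if anything was truncated
}

const MAX_OVERFLOW_COMPACTION_ATTEMPTS = 3;

async function runWithOverflowRecovery(hooks: OverflowHooks): Promise<string> {
  let compactionAttempts = 0;
  let truncationAttempted = false;

  while (true) {
    const result = await hooks.attempt();
    if (result.ok) return result.text;
    if (!result.overflow) throw new Error("non-retryable failure");

    // First line of defense: compact the session, at most N times per budget.
    if (compactionAttempts < MAX_OVERFLOW_COMPACTION_ATTEMPTS) {
      compactionAttempts += 1;
      if (await hooks.compact()) continue;
    }
    // Last resort: shrink oversized tool results once, then allow compaction again.
    if (!truncationAttempted) {
      truncationAttempted = true;
      if (await hooks.truncateToolResults()) {
        compactionAttempts = 0;
        continue;
      }
    }
    throw new Error("context overflow: recovery exhausted");
  }
}
```

Because truncation can run only once, the loop is guaranteed to terminate: at most six compactions and one truncation before the error is surfaced.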

C. Model Fallback Loop

File: /Users/ajsharp/code/github/openclaw/src/agents/model-fallback.ts

export async function runWithModelFallback<T>(params: {
  cfg: OpenClawConfig | undefined;
  provider: string;
  model: string;
  fallbacksOverride?: string[];
  run: (provider: string, model: string) => Promise<T>;
}): Promise<{ result: T; provider: string; model: string; attempts: FallbackAttempt[] }>

Retry logic:

for (let i = 0; i < candidates.length; i += 1) {
  const candidate = candidates[i];

  try {
    const result = await params.run(candidate.provider, candidate.model);
    return { result, provider: candidate.provider, model: candidate.model, attempts };
  } catch (err) {
    if (shouldRethrowAbort(err)) throw err; // Cancellation is never retried
    const normalized = coerceToFailoverError(err, { ... });
    if (!isFailoverError(normalized)) throw err; // Only failover-class errors advance the chain
    lastError = normalized;
    attempts.push({ /* failed attempt */ });
  }
}
// All candidates failed (error reporting elided)
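A self-contained version of the fallback chain, simplified for illustration: `FailoverError`, `Candidate`, and `runWithFallback` are hypothetical, and the rethrow rules approximate `shouldRethrowAbort`/`isFailoverError` rather than reproduce them.

```typescript
// Hypothetical sketch of a model-fallback chain.
class FailoverError extends Error {
  readonly reason: string;
  constructor(message: string, reason: string) {
    super(message);
    this.name = "FailoverError";
    this.reason = reason;
  }
}

interface Candidate {
  provider: string;
  model: string;
}

interface FallbackAttempt extends Candidate {
  error: string;
  reason: string;
}

async function runWithFallback<T>(
  candidates: Candidate[],
  run: (provider: string, model: string) => Promise<T>,
): Promise<{ result: T; provider: string; model: string; attempts: FallbackAttempt[] }> {
  const attempts: FallbackAttempt[] = [];
  let lastError: unknown = new Error("no fallback candidates configured");
  for (const { provider, model } of candidates) {
    try {
      const result = await run(provider, model);
      return { result, provider, model, attempts };
    } catch (err) {
      // Cancellation and non-failover errors stop the chain immediately.
      if (err instanceof Error && err.name === "AbortError") throw err;
      if (!(err instanceof FailoverError)) throw err;
      lastError = err;
      attempts.push({ provider, model, error: err.message, reason: err.reason });
    }
  }
  // Every candidate failed with a failover-eligible error.
  throw lastError;
}
```

Rethrowing unclassified errors keeps programming bugs from being masked as "the model was unavailable".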

3. Inter-Agent Communication

Subagent Spawning System

Tool: /Users/ajsharp/code/github/openclaw/src/agents/tools/sessions-spawn-tool.ts

Agents spawn isolated subagents via the sessions_spawn tool:

const childSessionKey = `agent:${targetAgentId}:subagent:${crypto.randomUUID()}`;

await callGateway({
  method: "agent",
  params: {
    sessionKey: childSessionKey,
    message: task,
    deliver: false,
    model: subagentModel,
    thinking: subagentThinking,
    systemPrompt: buildSubagentSystemPrompt(task),
  },
});
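The child session key encodes the target agent plus a random suffix that makes each spawn unique. A small builder/parser pair makes that format explicit; `buildSubagentSessionKey` and `parseSubagentSessionKey` are hypothetical helpers, and the format is inferred only from the template string above.

```typescript
import { randomUUID } from "node:crypto";

interface SubagentKey {
  agentId: string;
  id: string;
}

// Build a key in the `agent:<agentId>:subagent:<uuid>` shape used by sessions_spawn.
function buildSubagentSessionKey(agentId: string): string {
  return `agent:${agentId}:subagent:${randomUUID()}`;
}

// Parse the key back; returns null for keys that are not subagent sessions.
function parseSubagentSessionKey(key: string): SubagentKey | null {
  const [, agentId, id] = /^agent:([^:]+):subagent:([^:]+)$/.exec(key) ?? [];
  if (!agentId || !id) return null;
  return { agentId, id };
}
```

This parser assumes agent IDs never contain a colon; if they can, a different delimiter or encoding would be needed.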

Subagent Result Delivery

File: /Users/ajsharp/code/github/openclaw/src/agents/subagent-announce.ts

export async function runSubagentAnnounceFlow(params: {
  childSessionKey: string;
  childRunId: string;
  requesterSessionKey: string;
  task: string;
  timeoutMs: number;
  cleanup: "delete" | "keep";
}): Promise<boolean>

Three delivery modes:

  • Steer - Inject message into active parent run (real-time)
  • Queue - Enqueue for delivery after current run completes
  • Interrupt - Force delivery for urgent notifications

4. Error Handling

Hierarchical Error Classification

type FailoverReason =
  | "auth"             // Authentication failure → rotate profiles
  | "rate_limit"       // Rate limiting → cooldown + rotate
  | "billing"          // Quota exceeded → skip provider
  | "timeout"          // Request timeout → retry or fallback
  | "context_overflow" // Prompt too large → compaction
  | "image_size"       // Image too large → user error
  | "role_ordering"    // Message ordering violation → user error
  | "unknown";         // Unclassified → fallback
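Classification is what makes the recovery matrix actionable. The real heuristics live in pi-embedded-helpers.ts and are not reproduced here; this sketch assumes HTTP-style status codes and a few message substrings purely for illustration, and `classifyFailover` is a hypothetical name.

```typescript
type FailoverReason =
  | "auth" | "rate_limit" | "billing" | "timeout"
  | "context_overflow" | "image_size" | "role_ordering" | "unknown";

// Hypothetical classifier: the status codes and substrings are illustrative assumptions.
function classifyFailover(status: number | undefined, message: string): FailoverReason {
  const msg = message.toLowerCase();
  if (status === 401 || status === 403) return "auth";
  if (status === 429) return "rate_limit";
  if (status === 402 || msg.includes("quota")) return "billing";
  if (status === 408 || msg.includes("timed out")) return "timeout";
  if (msg.includes("context length") || msg.includes("too many tokens")) return "context_overflow";
  if (msg.includes("image") && msg.includes("too large")) return "image_size";
  if (msg.includes("roles must alternate")) return "role_ordering";
  return "unknown";
}
```

Checking status codes before message text keeps the common cases cheap and unambiguous; substring matching is only a fallback for providers that report errors in prose.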

Recovery Strategy Matrix

| Error Type       | Primary Recovery              | Secondary Recovery             | Final Fallback     |
|------------------|-------------------------------|--------------------------------|--------------------|
| auth             | Rotate API key profile        | Mark profile failed + cooldown | Model fallback     |
| rate_limit       | Rotate to next profile        | Wait for cooldown              | Model fallback     |
| context_overflow | Auto-compact session (max 3x) | Truncate tool results          | User error message |
| timeout          | Retry same model              | None                           | Model fallback     |
| billing          | Skip provider entirely        | None                           | Model fallback     |
| image_size       | None                          | None                           | User error message |
| role_ordering    | None                          | None                           | User error message |
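Encoding the matrix as data keeps the recovery order in one place and lets the compiler check that every `FailoverReason` is covered. The `Step` names mirror the table's columns; they, `RECOVERY`, and `nextStep` are illustrative, not OpenClaw identifiers. The `unknown` row follows the type comment ("Unclassified → fallback") since the table omits it.

```typescript
type FailoverReason =
  | "auth" | "rate_limit" | "billing" | "timeout"
  | "context_overflow" | "image_size" | "role_ordering" | "unknown";

type Step =
  | "rotate_profile" | "mark_failed_cooldown" | "wait_cooldown"
  | "compact" | "truncate_tool_results" | "retry_same_model"
  | "skip_provider" | "model_fallback" | "user_error";

// Ordered recovery steps per reason, following the Recovery Strategy Matrix.
const RECOVERY: Record<FailoverReason, readonly Step[]> = {
  auth: ["rotate_profile", "mark_failed_cooldown", "model_fallback"],
  rate_limit: ["rotate_profile", "wait_cooldown", "model_fallback"],
  context_overflow: ["compact", "truncate_tool_results", "user_error"],
  timeout: ["retry_same_model", "model_fallback"],
  billing: ["skip_provider", "model_fallback"],
  image_size: ["user_error"],
  role_ordering: ["user_error"],
  unknown: ["model_fallback"],
};

// Next step to try after `alreadyTried` steps; undefined once the chain is exhausted.
function nextStep(reason: FailoverReason, alreadyTried: number): Step | undefined {
  return RECOVERY[reason][alreadyTried];
}
```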

Key Patterns for ConductorBot

✅ Adopt

  1. Multi-level retry loops - Auth rotation, execution retries, model fallback
  2. Context overflow auto-compaction - Recover from token-limit errors instead of failing the run
  3. Error classification system - Structured recovery based on error type
  4. Model fallback configuration - YAML-defined fallback chains

⚠️ Consider

  1. Parallel step execution - For independent workflow steps
  2. Subagent announce flow - Async step completion notifications
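For parallel step execution, one simple approach is to wrap each independent step so it never rejects and then await them together with `Promise.all`; a failing step is recorded without discarding its siblings' results. `WorkflowStep`, `StepOutcome`, and `runIndependentSteps` are hypothetical ConductorBot names, not existing code.

```typescript
interface WorkflowStep<T> {
  id: string;
  run: () => Promise<T>;
}

type StepOutcome<T> =
  | { id: string; status: "ok"; value: T }
  | { id: string; status: "failed"; error: string };

// Run steps that do not depend on each other concurrently and collect every outcome.
async function runIndependentSteps<T>(steps: WorkflowStep<T>[]): Promise<StepOutcome<T>[]> {
  return Promise.all(
    steps.map(async (step): Promise<StepOutcome<T>> => {
      try {
        return { id: step.id, status: "ok", value: await step.run() };
      } catch (err) {
        // Capture the failure instead of rejecting, so sibling steps still report.
        return { id: step.id, status: "failed", error: String(err) };
      }
    }),
  );
}
```

Outcomes come back in the same order as the input steps, which makes it straightforward to map them onto the declarative YAML step list.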

❌ Skip

  1. Agent-driven task decomposition - Conflicts with declarative YAML paradigm
  2. Session file persistence - SQLite already handles this
  3. Gateway architecture - Not needed for single-tenant setup

Reference Files

OpenClaw

  • ~/code/github/openclaw/src/agents/pi-embedded-runner/run.ts (retry orchestration)
  • ~/code/github/openclaw/src/agents/model-fallback.ts (model fallback)
  • ~/code/github/openclaw/src/agents/pi-embedded-helpers.ts (error classification)
  • ~/code/github/openclaw/src/agents/subagent-announce.ts (async result delivery)

ConductorBot

  • claude-conductor/src/core/workflow-engine.ts (add retries)
  • claude-conductor/src/core/context-store.ts (add compaction)
  • claude-conductor/src/providers/provider.ts (wrap with fallback)
  • claude-conductor/src/schemas/workflow-schema.ts (add retry config)