@jacoblyles
Created February 7, 2026 19:26
OpenClaw Rescue Agent - RFC Spec Page #pagedrop
<!DOCTYPE html>
<html lang="en" data-theme="dark">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>OpenClaw Rescue Agent - Intelligent Self-Healing for Your AI Assistant</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@picocss/pico@2/css/pico.min.css">
<script type="module">
import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
mermaid.initialize({ startOnLoad: true, theme: 'dark', securityLevel: 'loose' });
</script>
<style>
:root {
--primary: #88a6ff;
--primary-hover: #6b8eff;
}
.hero {
text-align: center;
padding: 4rem 0 3rem;
background: linear-gradient(135deg, #1a1f35 0%, #0d1117 100%);
border-radius: 1rem;
margin-bottom: 3rem;
}
.hero h1 {
font-size: 3rem;
margin-bottom: 1rem;
background: linear-gradient(135deg, #88a6ff 0%, #6b8eff 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
}
.hero .tagline {
font-size: 1.5rem;
color: #8b949e;
margin-bottom: 2rem;
}
.badge {
display: inline-block;
padding: 0.25rem 0.75rem;
border-radius: 1rem;
font-size: 0.875rem;
font-weight: 600;
margin: 0.25rem;
}
.badge.status { background: #1f6feb; color: white; }
.badge.version { background: #238636; color: white; }
.feature-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
gap: 1.5rem;
margin: 2rem 0;
}
.feature-card {
background: #161b22;
border: 1px solid #30363d;
border-radius: 0.5rem;
padding: 1.5rem;
transition: transform 0.2s, border-color 0.2s;
}
.feature-card:hover {
transform: translateY(-2px);
border-color: #88a6ff;
}
.feature-card h3 {
margin-top: 0;
color: #88a6ff;
}
.feature-icon {
font-size: 2rem;
margin-bottom: 0.5rem;
}
.problem-statement {
background: #1c2128;
border-left: 4px solid #f85149;
padding: 1.5rem;
border-radius: 0.5rem;
margin: 2rem 0;
}
.solution-highlight {
background: #1c2128;
border-left: 4px solid #238636;
padding: 1.5rem;
border-radius: 0.5rem;
margin: 2rem 0;
}
.cta-section {
text-align: center;
padding: 3rem 0;
background: linear-gradient(135deg, #1a1f35 0%, #0d1117 100%);
border-radius: 1rem;
margin-top: 3rem;
}
.cta-button {
display: inline-block;
padding: 1rem 2rem;
background: #88a6ff;
color: #0d1117;
font-weight: 700;
border-radius: 0.5rem;
text-decoration: none;
transition: background 0.2s;
margin: 0.5rem;
}
.cta-button:hover {
background: #6b8eff;
color: #0d1117;
}
pre code {
font-size: 0.875rem;
}
.timeline {
position: relative;
padding-left: 2rem;
}
.timeline::before {
content: '';
position: absolute;
left: 0;
top: 0;
bottom: 0;
width: 2px;
background: #30363d;
}
.timeline-item {
position: relative;
margin-bottom: 2rem;
}
.timeline-item::before {
content: '';
position: absolute;
left: -2.5rem;
top: 0.5rem;
width: 1rem;
height: 1rem;
border-radius: 50%;
background: #88a6ff;
border: 3px solid #0d1117;
}
.access-method {
background: #161b22;
border: 1px solid #30363d;
border-radius: 0.5rem;
padding: 1rem;
margin-bottom: 1rem;
}
.access-method h4 {
margin: 0 0 0.5rem 0;
color: #88a6ff;
}
.pros-cons {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 1rem;
margin-top: 0.5rem;
font-size: 0.875rem;
}
.pros { color: #7ee787; }
.cons { color: #ff7b72; }
</style>
</head>
<body>
<main class="container">
<!-- Hero Section -->
<div class="hero">
<h1>🛟 OpenClaw Rescue Agent</h1>
<p class="tagline">Intelligent Self-Healing for Your AI Assistant</p>
<div>
<span class="badge status">RFC / Design Proposal</span>
<span class="badge version">v1.0</span>
</div>
</div>
<!-- Executive Summary -->
<section>
<h2>What is the Rescue Agent?</h2>
<p>
The <strong>OpenClaw Rescue Agent</strong> is an independent, AI-powered debugging and recovery system
that monitors, diagnoses, and repairs the main OpenClaw gateway when it fails. Unlike simple restart
mechanisms, it uses AI to understand problems, autonomously fix configuration issues, analyze logs,
and take corrective actions with appropriate guardrails.
</p>
</section>
<!-- Problem Statement -->
<section>
<h2>The Problem</h2>
<div class="problem-statement">
<h3>🚨 When OpenClaw Breaks, You're on Your Own</h3>
<p>
The main OpenClaw gateway is a complex system with many failure modes:
</p>
<ul>
<li><strong>Configuration errors</strong> from manual edits (trailing commas, typos, invalid JSON)</li>
<li><strong>Process crashes</strong> from resource exhaustion or uncaught exceptions</li>
<li><strong>Plugin failures</strong> cascading to the main process</li>
<li><strong>Port conflicts</strong> when orphaned processes block the gateway</li>
<li><strong>Database issues</strong> or migration failures</li>
</ul>
<p>
<strong>Current recovery process:</strong> SSH into the server → dig through logs →
diagnose the issue → fix config or kill processes → restart → hope it works.
</p>
<p>
This is <em>frustrating</em>, <em>time-consuming</em>, and requires technical expertise.
For non-technical users, it's a complete blocker.
</p>
</div>
</section>
<!-- Solution -->
<section>
<h2>The Solution</h2>
<div class="solution-highlight">
<h3>✅ An Always-On Safety Net</h3>
<p>
The Rescue Agent is a separate, independent process that:
</p>
<ul>
<li><strong>Survives when the main gateway crashes</strong> — it's a completely independent process</li>
<li><strong>Thinks intelligently</strong> — uses AI to diagnose problems, not just pattern matching</li>
<li><strong>Fixes problems autonomously</strong> — auto-repairs common issues (config errors, port conflicts, restarts)</li>
<li><strong>Stays accessible</strong> — 8 different access methods ensure you can always reach it</li>
<li><strong>Stays secure</strong> — guardrails prevent dangerous operations, audit logs track everything</li>
</ul>
</div>
</section>
<!-- Architecture -->
<section>
<h2>Architecture</h2>
<p>
The Rescue Agent runs as a completely separate process from the main gateway,
with minimal dependencies and maximum resilience.
</p>
<pre class="mermaid">
graph TB
subgraph "Main OpenClaw Gateway"
A[Gateway Process]
B[All Plugins]
C[Full AI Stack]
D[User Channels]
end
subgraph "Rescue Agent"
E[Independent Process]
F[Lightweight AI]
G[Health Monitor]
H[8 Access Methods]
end
subgraph "Shared Resources"
I[Config Files]
J[Log Files]
K[Database]
end
G -->|monitors| A
G -->|auto-repair| A
E --> F
E --> G
E --> H
A -.->|reads/writes| I
A -.->|writes| J
A -.->|reads/writes| K
E -.->|reads/writes| I
E -.->|reads| J
E -.->|reads only| K
style A fill:#1f6feb
style E fill:#238636
style G fill:#f85149
</pre>
<h3>Key Design Principles</h3>
<ul>
<li><strong>Process Separation:</strong> Rescue Agent is a separate executable (<code>openclaw rescue start</code>)</li>
<li><strong>No Shared Dependencies:</strong> Doesn't import main gateway code — can start even if main is broken</li>
<li><strong>Minimal Footprint:</strong> Lightweight AI model (Claude Haiku) for speed and cost efficiency</li>
<li><strong>Shared State:</strong> Shares only config files (read/write), logs (read-only), and the database (read-only, for safety)</li>
</ul>
</section>
<!-- Key Features -->
<section>
<h2>Key Features</h2>
<h3>🧠 AI-Powered Diagnostics</h3>
<p>
Instead of simple pattern matching, the Rescue Agent uses Claude to:
</p>
<ul>
<li>Analyze log files and extract meaningful error patterns</li>
<li>Diagnose root causes, not just symptoms</li>
<li>Suggest fixes with confidence scores</li>
<li>Learn from past failures to prevent future issues</li>
</ul>
<details>
<summary><strong>Example: AI Log Analysis</strong></summary>
<pre><code>User: /rescue diagnose
Rescue Agent:
🔍 Analyzing last 500 log lines...
Found 3 issues:
1. ❌ CRITICAL: Uncaught exception in Telegram plugin
Pattern: "TypeError: Cannot read property 'id' of undefined"
Frequency: 47 times in last 10 minutes
Likely cause: Missing user validation in message handler
2. ⚠️ WARNING: Redis connection timeout
Pattern: "Error: connect ETIMEDOUT"
Frequency: 12 times in last hour
Likely cause: Redis server not running or firewall issue
3. 💡 INFO: High memory usage trend
Current: 1.2GB (up from 800MB 2 hours ago)
Likely cause: Memory leak in session manager
Suggested actions:
- Restart gateway (will clear memory leak)
- Check if Redis is running: systemctl status redis
- Review Telegram plugin code at line 342</code></pre>
</details>
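<p>
One way to keep AI log analysis cheap is to collapse repeated errors into counts before any model sees them. The sketch below is illustrative only: <code>summarize_errors</code> and <code>ERROR_RE</code> are hypothetical names, not part of OpenClaw, and the code assumes log lines carry messages shaped like <code>SomeError: message</code>.
</p>

```python
import re
from collections import Counter

# Assumed message shape: an identifier ending in "Error" or "Exception",
# followed by ": " and free text. Real gateway logs may look different.
ERROR_RE = re.compile(r"\b(\w*(?:Error|Exception)): (.+)")


def summarize_errors(lines, top=3):
    """Return the `top` most frequent error messages as (message, count) pairs."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[f"{match.group(1)}: {match.group(2).strip()}"] += 1
    return counts.most_common(top)
```

<p>
A summary like this could be placed in the diagnostic prompt instead of the raw log lines, which keeps the request small and makes frequency information explicit.
</p>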
<h3>🔧 Autonomous Repair Capabilities</h3>
<div class="feature-grid">
<div class="feature-card">
<div class="feature-icon">📝</div>
<h3>Config Auto-Fix</h3>
<p>Automatically detects and repairs JSON syntax errors, missing fields, and common typos. Creates backups before every change.</p>
</div>
<div class="feature-card">
<div class="feature-icon">🔄</div>
<h3>Smart Restarts</h3>
<p>Auto-restarts the gateway when unhealthy, with exponential backoff to prevent restart loops.</p>
</div>
<div class="feature-card">
<div class="feature-icon">⚔️</div>
<h3>Port Conflict Resolution</h3>
<p>Detects when ports are blocked by orphaned processes and kills them (with confirmation).</p>
</div>
<div class="feature-card">
<div class="feature-icon">⏮️</div>
<h3>Config Rollback</h3>
<p>Maintains backups of every config change, with one-command rollback to any previous version.</p>
</div>
</div>
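<p>
The backup-before-change and rollback behaviour described in the cards above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names and the <code>.bak</code> naming scheme are hypothetical, and the real agent may store backups differently.
</p>

```python
import json
import os
import shutil
import tempfile
import time
from typing import Optional


def write_config_with_backup(path: str, new_text: str) -> Optional[str]:
    """Validate new_text as JSON, back up the current file, then replace it.

    Returns the backup path, or None if there was no existing file to back up.
    """
    json.loads(new_text)  # raises json.JSONDecodeError; nothing is written on failure

    backup_path = None
    if os.path.exists(path):
        # Hypothetical naming scheme: original path + nanosecond timestamp + ".bak".
        backup_path = f"{path}.{time.time_ns()}.bak"
        shutil.copy2(path, backup_path)

    # Write to a temp file in the same directory, then rename it over the target,
    # so a crash mid-write cannot leave a half-written config behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(new_text)
    os.replace(tmp_path, path)
    return backup_path


def rollback(path: str, backup_path: str) -> None:
    """Restore a previously saved backup over the live config file."""
    shutil.copy2(backup_path, path)
```

<p>
Validating before writing means a bad AI-suggested fix is rejected without touching the live file, and the returned backup path is what a rollback command would need.
</p>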
<h3>🌐 8 Access Methods</h3>
<p>
Multiple ways to reach the Rescue Agent ensure you're never locked out:
</p>
<div class="access-method">
<h4>1. 📱 Telegram (<code>/rescue</code> prefix)</h4>
<p>Most convenient — chat-based interface using your existing bot.</p>
<div class="pros-cons">
<div class="pros">✅ Familiar interface<br>✅ Works from anywhere<br>✅ Built-in auth</div>
<div class="cons">❌ Requires internet<br>❌ Single point of failure</div>
</div>
</div>
<div class="access-method">
<h4>2. 📧 Email-based Access</h4>
<p>Send commands via email, get responses back automatically.</p>
<div class="pros-cons">
<div class="pros">✅ Works when Telegram is down<br>✅ Any email client</div>
<div class="cons">❌ Slower (1-2 min latency)<br>❌ Requires IMAP setup</div>
</div>
</div>
<div class="access-method">
<h4>3. 🔐 SSH Command Interface</h4>
<p>Interactive shell: <code>openclaw rescue shell</code></p>
<div class="pros-cons">
<div class="pros">✅ Full terminal control<br>✅ Works when network is degraded</div>
<div class="cons">❌ Requires SSH access<br>❌ Terminal-only</div>
</div>
</div>
<div class="access-method">
<h4>4. 🔌 Unix Socket API</h4>
<p>Local IPC for scripting: <code>/tmp/openclaw-rescue.sock</code></p>
<div class="pros-cons">
<div class="pros">✅ Fast local communication<br>✅ No network required</div>
<div class="cons">❌ Local access only<br>❌ Requires socket permissions</div>
</div>
</div>
<div class="access-method">
<h4>5. 🌍 Tailscale Web UI</h4>
<p>Browser-based GUI accessible via Tailscale IP.</p>
<div class="pros-cons">
<div class="pros">✅ User-friendly interface<br>✅ Works on any device</div>
<div class="cons">❌ Requires Tailscale<br>❌ Limited to Tailscale network</div>
</div>
</div>
<div class="access-method">
<h4>6. 📡 mDNS/Bonjour Discovery</h4>
<p>Auto-discover rescue agent on local network.</p>
<div class="pros-cons">
<div class="pros">✅ Zero-config discovery<br>✅ Works offline</div>
<div class="cons">❌ LAN only<br>❌ Requires mDNS support</div>
</div>
</div>
<div class="access-method">
<h4>7. 💬 SMS via Twilio (Optional)</h4>
<p>Send commands via SMS when your own internet connection is down.</p>
<div class="pros-cons">
<div class="pros">✅ Works without internet on your phone<br>✅ Most reliable fallback</div>
<div class="cons">❌ Costs money<br>❌ SMS delays<br>❌ Server still needs a connection to Twilio</div>
</div>
</div>
<div class="access-method">
<h4>8. 🔴 Physical Hardware Button (Advanced)</h4>
<p>GPIO button (Raspberry Pi) triggers emergency restart.</p>
<div class="pros-cons">
<div class="pros">✅ Works when all network access is down<br>✅ Physical confirmation</div>
<div class="cons">❌ Requires GPIO hardware<br>❌ Single action only</div>
</div>
</div>
</section>
<!-- How It Works -->
<section>
<h2>How It Works</h2>
<p>
The Rescue Agent continuously monitors the main gateway's health and responds to failures:
</p>
<pre class="mermaid">
sequenceDiagram
participant U as User
participant R as Rescue Agent
participant G as Main Gateway
participant AI as Claude AI
loop Health Check (every 30s)
R->>G: HTTP GET /health
alt Gateway Healthy
G-->>R: 200 OK
R->>R: Reset failure counter
else Gateway Down
G--xR: No response (timeout)
R->>R: Increment failure counter
end
end
alt 3 Consecutive Failures
R->>R: Trigger auto-repair
R->>G: Attempt graceful restart
alt Restart Successful
G-->>R: 200 OK on health check
R->>U: ✅ Auto-repaired (Telegram/Email)
else Restart Failed
R->>AI: Analyze logs + config
AI-->>R: Diagnosis + fix suggestions
R->>U: ⚠️ Manual intervention needed
U->>R: /rescue diagnose
R->>U: 🔍 [Detailed AI diagnosis]
U->>R: /rescue fix config
R->>R: Apply AI-suggested fix
R->>G: Restart gateway
G-->>R: 200 OK
R->>U: ✅ Fixed!
end
end
</pre>
<h3>Proactive Monitoring</h3>
<ul>
<li><strong>HTTP Health Checks:</strong> Ping <code>/health</code> endpoint every 30 seconds</li>
<li><strong>Process Monitoring:</strong> Verify main gateway PID exists</li>
<li><strong>Log Activity:</strong> Detect if logs are stale (possible hang)</li>
<li><strong>Resource Usage:</strong> Alert on high memory, CPU, or low disk space</li>
</ul>
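<p>
The first three checks above could be approximated with the standard library alone. This is a sketch, not the agent's actual monitor: the function names are hypothetical, the PID check is POSIX-only, and the staleness threshold is left to the caller because the RFC does not specify one.
</p>

```python
import os
import time
import urllib.error
import urllib.request
from typing import Optional


def http_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP error codes, malformed URLs.
        return False


def pid_alive(pid: int) -> bool:
    """Check that a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def log_is_stale(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """A log that has not been written recently may indicate a hung process."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) > max_age_seconds
    except OSError:
        return True  # a missing or unreadable log is treated as stale
```

<p>
Combining the three signals helps separate a crashed process (no PID) from a hung one (PID present, but no HTTP response and a stale log).
</p>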
<h3>Auto-Restart Logic</h3>
<ol>
<li>3 consecutive failed health checks → trigger a restart</li>
<li>Graceful restart with a 30s timeout</li>
<li>If the restart fails, retry with exponential backoff (after 1 min, then after 5 min)</li>
<li>After 3 failed restart attempts → stop and alert the user</li>
<li>Max 3 restarts per 10 minutes (prevents restart loops)</li>
</ol>
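<p>
The restart rules above amount to a small state machine: a consecutive-failure counter, a fixed backoff schedule, and a rolling-window cap. The sketch below shows one possible shape. <code>RestartPolicy</code> and <code>backoff_delay</code> are hypothetical names, and treating the first attempt as immediate is an assumption.
</p>

```python
from collections import deque

# Delay before each restart attempt: immediately, then 1 min, then 5 min.
# The immediate first attempt is an assumption; the RFC lists only 1 min and 5 min.
BACKOFF_SECONDS = (0, 60, 300)


def backoff_delay(attempt: int):
    """Delay in seconds before 0-based restart attempt `attempt`, or None to give up."""
    if 0 <= attempt < len(BACKOFF_SECONDS):
        return BACKOFF_SECONDS[attempt]
    return None


class RestartPolicy:
    """Tracks failed health checks and decides whether a restart is allowed."""

    def __init__(self, failure_threshold=3, max_restarts=3, window_seconds=600):
        self.failure_threshold = failure_threshold
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.consecutive_failures = 0
        self.restart_times = deque()

    def record_check(self, healthy: bool) -> None:
        # Any healthy check resets the counter; failures accumulate.
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1

    def should_restart(self, now: float) -> bool:
        # Drop restarts that have aged out of the rolling window.
        while self.restart_times and now - self.restart_times[0] >= self.window_seconds:
            self.restart_times.popleft()
        return (self.consecutive_failures >= self.failure_threshold
                and len(self.restart_times) < self.max_restarts)

    def record_restart(self, now: float) -> None:
        self.restart_times.append(now)
        self.consecutive_failures = 0
```

<p>
Passing <code>now</code> in explicitly, rather than reading the clock inside the class, keeps the policy deterministic and easy to test.
</p>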
</section>
<!-- Configuration -->
<section>
<h2>Configuration</h2>
<p>
The Rescue Agent is configured via <code>~/.openclaw/rescue.json</code>:
</p>
<details open>
<summary><strong>Example Configuration</strong></summary>
<pre><code>{
"version": "1.0",
"enabled": true,
"ai": {
"primaryModel": "anthropic/claude-haiku-3-5",
"fallbackModels": ["openai/gpt-4o-mini", "google/gemini-flash-2-0"],
"maxTokensPerRequest": 4096,
"temperature": 0.1
},
"monitoring": {
"enabled": true,
"healthCheckIntervalSeconds": 30,
"unhealthyThresholdChecks": 3,
"autoRestartEnabled": true,
"autoFixConfigEnabled": true
},
"access": {
"telegram": {
"enabled": true,
"commandPrefix": "/rescue",
"allowedUserIds": [123456789]
},
"ssh": {
"enabled": true,
"command": "openclaw rescue shell"
},
"web": {
"enabled": false,
"port": 7878,
"tailscaleOnly": true
}
},
"capabilities": {
"allowConfigEdit": true,
"allowProcessControl": true,
"allowCommandExecution": true,
"commandWhitelist": ["systemctl", "launchctl", "pm2", "git"],
"dangerousCommandsRequireConfirmation": true
},
"security": {
"auditLog": "~/.openclaw/logs/rescue-audit.log",
"requireAuthForAllAccess": true,
"maxConcurrentSessions": 3,
"rateLimitPerMinute": 20
}
}</code></pre>
</details>
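<p>
Because the Rescue Agent must start even when files have been hand-edited, it would likely want to validate its own config defensively. The following sketch reads only the <code>monitoring</code> block. <code>load_monitoring_settings</code> is a hypothetical helper, and the defaults simply mirror the example values above rather than any documented defaults.
</p>

```python
import json

# Defaults mirror the example configuration; they are assumptions, not documented defaults.
MONITORING_DEFAULTS = {
    "healthCheckIntervalSeconds": 30,
    "unhealthyThresholdChecks": 3,
}


def load_monitoring_settings(text: str) -> dict:
    """Parse rescue.json text and return validated monitoring settings."""
    config = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(config, dict):
        raise ValueError("top-level config must be a JSON object")
    monitoring = config.get("monitoring", {})
    if not isinstance(monitoring, dict):
        raise ValueError("monitoring must be a JSON object")

    settings = dict(MONITORING_DEFAULTS)
    for key in MONITORING_DEFAULTS:
        if key in monitoring:
            settings[key] = monitoring[key]

    for key, value in settings.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"monitoring.{key} must be a positive integer")
    return settings
```

<p>
Every failure surfaces as a <code>ValueError</code>, so a caller can catch one exception type and fall back to safe defaults instead of crashing.
</p>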
</section>
<!-- Security -->
<section>
<h2>Security Model</h2>
<p>
The Rescue Agent has powerful capabilities and must be secured carefully:
</p>
<h3>🔒 Authentication</h3>
<ul>
<li><strong>Telegram:</strong> Whitelist of allowed user IDs (no fallback to "admin")</li>
<li><strong>Email:</strong> Whitelist of sender addresses + SPF/DKIM validation</li>
<li><strong>SSH:</strong> System SSH auth + optional rescue token</li>
<li><strong>Web UI:</strong> Tailscale Whois API + session cookies + CSRF tokens</li>
</ul>
<h3>🛡️ Guardrails</h3>
<ul>
<li><strong>Confirmation Required:</strong> Dangerous operations (delete, kill, config changes) need explicit approval</li>
<li><strong>Automatic Backups:</strong> Every destructive action creates a backup first</li>
<li><strong>Command Whitelist:</strong> Only allowed commands can execute (configurable)</li>
<li><strong>Rate Limiting:</strong> Max 20 requests/minute, max 3 restarts per 10 minutes</li>
<li><strong>Audit Logging:</strong> All actions logged to append-only file (cannot be disabled)</li>
</ul>
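<p>
Two of these guardrails, the command whitelist and the rate limit, are simple enough to sketch. The names below are hypothetical, and the whitelist check assumes commands are later executed as an argument list without a shell. If a shell were involved, input such as <code>systemctl status; rm -rf /</code> would pass this check and still be dangerous.
</p>

```python
import shlex
from collections import deque

# Mirrors "commandWhitelist" from the example configuration.
COMMAND_WHITELIST = frozenset({"systemctl", "launchctl", "pm2", "git"})


def is_allowed(command_line: str, whitelist=COMMAND_WHITELIST) -> bool:
    """Allow a command only if its executable name is on the whitelist.

    Assumes the resulting argv is run WITHOUT a shell, so metacharacters
    such as ';' reach the program as literal arguments.
    """
    try:
        argv = shlex.split(command_line)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(argv) and argv[0] in whitelist


class RateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.events = deque()

    def allow(self, now: float) -> bool:
        # Forget requests that have left the window.
        while self.events and now - self.events[0] >= self.window_seconds:
            self.events.popleft()
        if len(self.events) >= self.max_requests:
            return False  # denied requests do not count against the window
        self.events.append(now)
        return True
```

<p>
Matching on the bare executable name also rejects absolute paths such as <code>/usr/bin/systemctl</code>. Whether that is desirable is a design choice the RFC leaves open.
</p>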
<h3>📋 Audit Log Example</h3>
<pre><code>{"timestamp":"2026-02-07T11:05:30Z","action":"config_edit","user":"telegram:123456789","details":{"file":"config.json","changes":"Fixed syntax error"},"success":true}
{"timestamp":"2026-02-07T11:06:15Z","action":"process_restart","user":"ssh:alice","details":{"service":"gateway"},"success":true}
{"timestamp":"2026-02-07T11:07:42Z","action":"command_exec","user":"telegram:987654321","details":{"command":"kill 12847"},"success":false,"error":"Unauthorized user"}</code></pre>
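<p>
Entries in this format are plain JSON Lines, so producing them takes very little code. The sketch below uses hypothetical helper names. Note that opening a file in append mode does not make it tamper-proof: real append-only enforcement would need operating-system support, which is outside this example.
</p>

```python
import json
from datetime import datetime, timezone


def audit_record(action, user, details, success, error=None, now=None) -> str:
    """Build one audit entry as a compact, single-line JSON string.

    `now` should be a timezone-aware datetime; it defaults to the current UTC time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "action": action,
        "user": user,
        "details": details,
        "success": success,
    }
    if error is not None:
        record["error"] = error
    return json.dumps(record, separators=(",", ":"))


def append_audit(path: str, line: str) -> None:
    """Append one entry to the log; mode 'a' never truncates existing content."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

<p>
One entry per line keeps the log easy to tail and filter with standard tools, and each line can be parsed independently of the rest of the file.
</p>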
<h3>🚫 What the Rescue Agent CANNOT Do</h3>
<ul>
<li>Run commands as root (unless explicitly configured)</li>
<li>Execute arbitrary shell commands (whitelist enforcement)</li>
<li>Disable audit logging</li>
<li>Delete audit logs</li>
<li>Modify rescue agent's own config (requires main gateway)</li>
</ul>
</section>
<!-- Implementation Roadmap -->
<section>
<h2>Implementation Roadmap</h2>
<div class="timeline">
<div class="timeline-item">
<h3>Phase 1: Core Infrastructure (Weeks 1-2)</h3>
<ul>
<li>Separate process architecture</li>
<li>Basic CLI: <code>openclaw rescue start/stop/status</code></li>
<li>Config file with schema validation</li>
<li>Health check monitor (HTTP + process)</li>
<li>Audit logging</li>
</ul>
</div>
<div class="timeline-item">
<h3>Phase 2: AI Integration (Week 3)</h3>
<ul>
<li>Model selection and fallback logic</li>
<li>Log analysis with AI</li>
<li>Config auto-fix with AI</li>
<li>Diagnostic AI prompts</li>
</ul>
</div>
<div class="timeline-item">
<h3>Phase 3: Access Methods (Weeks 4-5)</h3>
<ul>
<li>Telegram <code>/rescue</code> commands</li>
<li>SSH shell interface</li>
<li>Unix socket API</li>
<li>Email monitoring (optional)</li>
</ul>
</div>
<div class="timeline-item">
<h3>Phase 4: Capabilities (Week 6)</h3>
<ul>
<li>Process control (restart/stop/start)</li>
<li>Config rollback</li>
<li>Command execution with guardrails</li>
<li>Service management (launchd/systemd)</li>
</ul>
</div>
<div class="timeline-item">
<h3>Phase 5: Monitoring &amp; Alerts (Week 7)</h3>
<ul>
<li>Auto-restart logic</li>
<li>Alerting system</li>
<li>Quiet hours</li>
<li>Escalation flow</li>
</ul>
</div>
<div class="timeline-item">
<h3>Phase 6: Advanced Access (Week 8+)</h3>
<ul>
<li>Tailscale web UI</li>
<li>mDNS discovery</li>
<li>SMS via Twilio (optional)</li>
<li>Hardware button support</li>
</ul>
</div>
</div>
</section>
<!-- Success Metrics -->
<section>
<h2>Success Metrics</h2>
<article>
<h3>📉 Mean Time To Recovery (MTTR)</h3>
<p>
<strong>Baseline:</strong> ~30 minutes (current manual recovery)<br>
<strong>Target:</strong> &lt;5 minutes with auto-restart
</p>
</article>
<article>
<h3>🎯 Auto-Fix Success Rate</h3>
<p>
<strong>Target:</strong> &gt;70% of common issues fixed without human intervention
</p>
</article>
<article>
<h3>⏱️ Rescue Agent Uptime</h3>
<p>
<strong>Target:</strong> &gt;99.9% uptime (the safety net must always be there)
</p>
</article>
<article>
<h3>😊 User Satisfaction</h3>
<p>
<strong>Target:</strong> &gt;4.5/5 stars from users who've needed rescue
</p>
</article>
</section>
<!-- Why This Matters -->
<section>
<h2>Why This Matters</h2>
<blockquote>
"Your AI assistant should be as reliable as your phone. When it breaks, it should fix itself —
or at least make it dead simple for you to fix it."
</blockquote>
<p>
OpenClaw is designed to be a personal AI that's always available. But complex software breaks.
The Rescue Agent ensures that:
</p>
<ul>
<li><strong>Non-technical users</strong> can recover without SSH expertise</li>
<li><strong>Technical users</strong> save time with AI-powered diagnostics</li>
<li><strong>Common failures</strong> auto-repair without human intervention</li>
<li><strong>Emergency access</strong> is always available via multiple fallback methods</li>
</ul>
<p>
This isn't just a nice-to-have — it's critical infrastructure for production OpenClaw deployments.
</p>
</section>
<!-- Call to Action -->
<div class="cta-section">
<h2>Get Involved</h2>
<p>
This is an open RFC for the OpenClaw community. We'd love your feedback!
</p>
<div style="margin: 2rem 0;">
<a href="https://github.com/Martian-Engineering/openclaw/tree/feature/rescue-agent" class="cta-button">
📂 View GitHub Branch
</a>
<a href="https://github.com/Martian-Engineering/openclaw/issues/new?title=Rescue%20Agent%20RFC%20Feedback" class="cta-button">
💬 Leave Feedback
</a>
</div>
<p style="color: #8b949e; margin-top: 2rem;">
Want to contribute? Check out the implementation roadmap above and pick a phase!
</p>
</div>
<!-- Footer -->
<footer style="text-align: center; padding: 2rem 0; border-top: 1px solid #30363d; margin-top: 3rem; color: #8b949e;">
<p>
<strong>OpenClaw Rescue Agent</strong> — RFC v1.0<br>
Created: 2026-02-07 | Author: OpenClaw Team<br>
<a href="https://github.com/Martian-Engineering/openclaw" style="color: #88a6ff;">OpenClaw on GitHub</a>
</p>
</footer>
</main>
</body>
</html>