
πŸ” Senior Engineer Audit: @tangle/agent-driver

Package: @tangle/agent-driver v0.1.0
Auditor: Ferdinand (AI)
Date: 2026-02-04
Scope: Architecture, API design, observability, production readiness


Executive Summary

This is a clean, minimal LLM-driven browser agent with good bones. The core observe→decide→execute loop is well-implemented, and the separation of concerns (Driver, Brain, Runner) shows architectural maturity. However, it's clearly in "MVP/prototype" stage—fine for tests, but missing critical features for production use.

Verdict: Solid foundation. Needs ~2 sprints of hardening for production.


1. maxTurns Semantics

Question: Is it clear that maxTurns = max observe→decide→execute cycles (any action)?

Answer: Mostly, but could be clearer.

What the code does:

for (let i = 1; i <= maxTurns; i++) {
  // 1. Observe
  // 2. Decide
  // 3. Execute
}

Each loop iteration is ONE complete cycle, and any action counts: click, type, scroll, wait, etc.

The ambiguity:

  • The JSDoc says /** Max turns before giving up */, which is vague
  • The Turn type says /** One observe → decide → execute cycle */, which is better!
  • Someone might assume "turns" means "user interactions" or "typing turns"

Recommendation:

export interface Scenario {
  /**
   * Maximum observe→decide→execute cycles before aborting.
   * Each cycle is one LLM call + one action (click, type, scroll, etc.)
   * @default 20
   */
  maxTurns?: number;
}

Rating: 7/10. Semantics are correct; documentation could be crisper.


2. Directive Flexibility

Question: Can users pass any goal/directive? Is it flexible enough?

Answer: Yes, it's very flexible.

export interface Scenario {
  goal: string;           // ✅ Any natural language goal
  startUrl?: string;      // ✅ Optional starting point
  maxTurns?: number;      // ✅ Configurable limit
}

Strengths:

  • goal is free-form natural language
  • No rigid structure imposed
  • Works for: "Login as admin", "Add item to cart", "Find the pricing page"

Limitations:

  • No support for multi-step scenarios (first do X, then Y)
  • No way to pass context/hints (e.g., "the password is in env var")
  • No assertion/validation hooks ("verify checkout total is $99")

Recommendation: add optional context and assertions fields:

export interface Scenario {
  goal: string;
  startUrl?: string;
  maxTurns?: number;
  /** Additional context for the LLM (credentials, hints, etc.) */
  context?: string;
  /** Expected success criteria for validation */
  assertions?: string[];
}
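
For example, a scenario exercising these fields might look like this (the values are purely illustrative):

const scenario: Scenario = {
  goal: 'Purchase the cheapest annual plan',
  startUrl: 'https://example.com/pricing',
  maxTurns: 15,
  // Hints the agent can use without baking them into the goal text
  context: 'Test card number is 4242 4242 4242 4242. The admin password is in the ADMIN_PASSWORD env var.',
  // Success criteria a validation step could check after the run
  assertions: ['The confirmation page shows "Order complete"', 'The total is $99'],
};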

Rating: 8/10. Great for simple goals; needs extension for complex scenarios.


3. Logs & Telemetry

Question: What's captured? What's MISSING?

✅ Currently Captured:

| Data | Where | Notes |
|---|---|---|
| Turn number | Turn.turn | Good |
| Page state | Turn.state | URL, title, snapshot |
| Action taken | Turn.action | Full action object |
| Raw LLM response | Turn.rawLLMResponse | ✅ Excellent for debugging |
| Duration | Turn.durationMs | Per-turn timing |
| Errors | Turn.error | When caught |
| Total time | AgentResult.totalMs | Aggregate |

❌ MISSING (Critical for Production):

| Missing | Impact | Priority |
|---|---|---|
| Conversation history | LLM has no memory of previous turns! | 🔴 Critical |
| Screenshots | Can't debug visual issues | 🔴 Critical |
| Reasoning/CoT | No visibility into "why" | 🟡 High |
| Token usage | Can't track costs | 🟡 High |
| Action success/failure | Did click actually work? | 🟡 High |
| Retry mechanism | One failure = total abort | 🟡 High |
| Structured logging | Only console.log with debug flag | 🟢 Medium |
| Trace IDs | Can't correlate across services | 🟢 Medium |

🚨 Critical Issue: No Conversation History!

// brain/index.ts
const response = await this.client.chat.completions.create({
  messages: [
    { role: 'system', content: SYSTEM_PROMPT },
    { role: 'user', content: prompt },  // ← Only current state!
  ],
});

The LLM has amnesia! Each turn is completely independent. This causes:

  • Agent clicks same button repeatedly
  • Agent retries failed actions identically
  • Agent can't learn from previous attempts
  • Multi-step reasoning is impossible

Fix:

class Brain {
  private history: ChatCompletionMessageParam[] = [];
  
  async decide(goal: string, state: PageState): Promise<...> {
    const userMessage = { role: 'user', content: buildPrompt(goal, state) };
    
    const response = await this.client.chat.completions.create({
      messages: [
        { role: 'system', content: SYSTEM_PROMPT },
        ...this.history,
        userMessage,
      ],
    });
    
    // Store for next turn
    this.history.push(userMessage);
    this.history.push({ role: 'assistant', content: response.choices[0].message.content });
    
    // Trim if too long
    if (this.history.length > 20) this.history = this.history.slice(-10);
  }
}

📸 No Screenshot Capture

The PageState only has a text snapshot. For debugging:

  • You can't see what the agent "saw"
  • You can't verify element visibility
  • You can't debug selector issues

Recommendation:

export interface PageState {
  url: string;
  title: string;
  snapshot: string;
  screenshot?: Buffer;  // Optional, configurable
}
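
With that field in place, a small post-run helper can dump what the agent "saw" on each turn. A sketch, assuming AgentResult exposes the recorded turns array:

import { mkdir, writeFile } from 'node:fs/promises';

async function dumpScreenshots(result: AgentResult, dir = './agent-screenshots') {
  await mkdir(dir, { recursive: true });
  for (const turn of result.turns) {
    if (turn.state.screenshot) {
      // One JPEG per turn: 001.jpg, 002.jpg, ...
      await writeFile(`${dir}/${String(turn.turn).padStart(3, '0')}.jpg`, turn.state.screenshot);
    }
  }
}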

🔄 No Retry Mechanism

} catch (err) {
  // Immediate abort, no retry
  return { success: false, reason: err, ... };
}

One transient failure (network glitch, slow load) = complete failure.

Recommendation:

interface AgentConfig {
  retries?: number;        // Default: 3
  retryDelayMs?: number;   // Default: 1000
  retryableErrors?: string[];  // Patterns to retry
}
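
A sketch of how retryableErrors could gate retries, paired with the withRetry helper shown in Fix 3 below (the default patterns here are illustrative):

function isRetryable(err: unknown, patterns: string[] = ['timeout', 'net::', 'ECONNRESET']): boolean {
  const message = err instanceof Error ? err.message : String(err);
  // Only retry errors whose message matches a configured pattern
  return patterns.some((pattern) => message.includes(pattern));
}

// Inside withRetry's catch block, rethrow immediately when the failure isn't transient:
// if (!isRetryable(err, config.retryableErrors)) throw err;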

Rating: 4/10. Basic turn logging exists, but critical production features are missing.


4. Architecture Quality

Driver Interface

export interface Driver {
  observe(): Promise<PageState>;
  execute(action: Action): Promise<void>;
}

Assessment: Clean and minimal.

✅ Perfect abstraction level
✅ Easy to implement new drivers (Puppeteer, WebDriver, etc.)
✅ Testable (easy to mock)
❌ execute returns void, so there is no feedback on success/failure
❌ No lifecycle hooks (setup, teardown)

Recommendation:

export interface Driver {
  observe(): Promise<PageState>;
  execute(action: Action): Promise<ActionResult>;  // Did it work?
  screenshot?(): Promise<Buffer>;
  close?(): Promise<void>;
}

interface ActionResult {
  success: boolean;
  error?: string;
  changedElements?: string[];  // What changed after action
}
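
A sketch of what an ActionResult-returning execute could look like in a Playwright driver (the Action fields used here are assumptions):

// drivers/playwright.ts
async execute(action: Action): Promise<ActionResult> {
  try {
    switch (action.type) {
      case 'click':
        await this.page.click(action.selector);
        break;
      case 'type':
        await this.page.fill(action.selector, action.text);
        break;
      // ...other action types
    }
    return { success: true };
  } catch (err) {
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}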

Brain Swappability

Currently: Hardcoded OpenAI SDK

import OpenAI from 'openai';
// ...
this.client = new OpenAI({ ... });

Can you use Anthropic? Technically yes, by pointing baseURL at an OpenAI-compatible endpoint. But:

  • No native Anthropic SDK support
  • No Claude-specific features (extended thinking, tool use)
  • OpenAI response format is assumed

Recommendation: abstract the LLM layer:

interface LLMProvider {
  complete(messages: Message[]): Promise<string>;
}

class OpenAIProvider implements LLMProvider { ... }
class AnthropicProvider implements LLMProvider { ... }

class Brain {
  constructor(private provider: LLMProvider) {}
}
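
Fleshing out the OpenAI side of that abstraction as a sketch (the Message type and default model are assumptions):

import OpenAI from 'openai';

type Message =
  | { role: 'system'; content: string }
  | { role: 'user'; content: string }
  | { role: 'assistant'; content: string };

class OpenAIProvider implements LLMProvider {
  constructor(
    private client = new OpenAI(),   // picks up OPENAI_API_KEY from the environment
    private model = 'gpt-4o-mini',
  ) {}

  async complete(messages: Message[]): Promise<string> {
    const response = await this.client.chat.completions.create({
      model: this.model,
      messages,
      temperature: 0,
    });
    return response.choices[0]?.message?.content ?? '';
  }
}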

Production Viability

| Aspect | Status | Notes |
|---|---|---|
| Error handling | ⚠️ Basic | Single try/catch, no recovery |
| Graceful shutdown | ❌ Missing | No way to cancel mid-run |
| Resource cleanup | ❌ Missing | Page/browser left open |
| Rate limiting | ❌ Missing | Can hammer the LLM API |
| Circuit breaker | ❌ Missing | No backoff on repeated failures |
| Idempotency | ❌ Missing | Re-running may double-execute |

Rating: 6/10. Good abstraction; needs production hardening.
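
The "graceful shutdown" gap in the table above is worth a concrete sketch: threading an AbortSignal through the runner lets callers cancel mid-run (the run signature and return shape here are illustrative):

export interface RunOptions {
  signal?: AbortSignal;
}

async function run(scenario: Scenario, driver: Driver, brain: Brain, opts: RunOptions = {}) {
  for (let turn = 1; turn <= (scenario.maxTurns ?? 20); turn++) {
    // Bail out cleanly if the caller aborted (timeout, Ctrl+C, test teardown)
    if (opts.signal?.aborted) {
      return { success: false, reason: 'aborted' };
    }
    const state = await driver.observe();
    const { action } = await brain.decide(scenario.goal, state);
    await driver.execute(action);
  }
  return { success: false, reason: 'maxTurns exceeded' };
}

// Usage: give the agent at most 60 seconds
// const result = await run(scenario, driver, brain, { signal: AbortSignal.timeout(60_000) });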


5. Missing Features for Production

Must Have (P0)

  1. Conversation history: the LLM needs context from previous turns
  2. Screenshot capture: debug visual state
  3. Retry mechanism: handle transient failures
  4. Abort signal/cancellation: stop long-running agents
  5. Action result feedback: know if actions succeeded

Should Have (P1)

  1. Structured logging: JSON logs with trace IDs
  2. Token/cost tracking: budget awareness
  3. Multi-LLM support: Anthropic, Gemini, local models
  4. Hooks/middleware: onBeforeAction, onAfterAction, onError (see the sketch after this list)
  5. State assertions: verify expected outcomes
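
A sketch of what those hooks could look like (the interface name and runner wiring are assumptions):

export interface AgentHooks {
  onBeforeAction?(turn: number, action: Action): void | Promise<void>;
  onAfterAction?(turn: number, action: Action, state: PageState): void | Promise<void>;
  onError?(turn: number, error: Error): void | Promise<void>;
}

// In the runner, each hook is awaited around execute():
// await hooks.onBeforeAction?.(turn, action);
// await driver.execute(action);
// await hooks.onAfterAction?.(turn, action, await driver.observe());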

Nice to Have (P2)

  1. Visual element references: "Click the blue button", not just selectors
  2. Parallel action support: fill multiple fields at once
  3. Record/replay: capture runs for playback
  4. Human-in-the-loop: pause and ask for help
  5. Metrics export: Prometheus/OpenTelemetry integration

Ratings Summary

| Category | Rating | Notes |
|---|---|---|
| API Design | 7/10 | Clean, intuitive, good types. Minor gaps in docs. |
| Observability/Debugging | 4/10 | Turn logging is good, but missing screenshots, history, structured logs. |
| Extensibility | 6/10 | Driver interface is solid. Brain is not swappable. No hooks. |
| Production Readiness | 3/10 | MVP only. Missing retries, cancellation, conversation history, error recovery. |
| Code Quality | 8/10 | Clean, well-organized, proper TypeScript. Good separation of concerns. |

Overall: 5.6/10

Translation: Great prototype, not production-ready. The bones are good; this could be excellent with 2-3 weeks of focused work.


Recommended Next Steps

Week 1: Critical Fixes

  • Add conversation history to Brain
  • Add screenshot capture to PageState
  • Add retry mechanism to runner
  • Add ActionResult feedback from execute()

Week 2: Production Hardening

  • Add cancellation/abort signal
  • Add structured logging with trace IDs (see the sketch below)
  • Add lifecycle hooks (onTurn, onError, onComplete)
  • Add token usage tracking
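
For the structured logging item, a minimal sketch of JSON logs keyed by a per-run trace ID (event names are illustrative; crypto.randomUUID generates the ID):

import { randomUUID } from 'node:crypto';

function createLogger(traceId: string = randomUUID()) {
  return {
    traceId,
    log(event: string, data: Record<string, unknown> = {}) {
      // One JSON object per line; trivial to ship to any log aggregator
      console.log(JSON.stringify({ ts: new Date().toISOString(), traceId, event, ...data }));
    },
  };
}

// Usage in the runner:
// const logger = createLogger();
// logger.log('turn.start', { turn, url: state.url });
// logger.log('turn.action', { turn, action });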

Week 3: Extensibility

  • Abstract LLM provider interface
  • Add Anthropic provider
  • Add configuration validation
  • Add comprehensive test suite

Code Snippets for Quick Wins

Fix 1: Conversation History

// brain/index.ts
import type { ChatCompletionMessageParam } from 'openai/resources/chat';

export class Brain {
  private history: ChatCompletionMessageParam[] = [];
  
  reset() {
    this.history = [];
  }

  async decide(goal: string, state: PageState): Promise<{ action: Action; raw: string }> {
    const userContent = `GOAL: ${goal}\n\nCURRENT PAGE:\nURL: ${state.url}\nTitle: ${state.title}\n\nELEMENTS:\n${state.snapshot}\n\nWhat action should you take?`;
    
    const response = await this.client.chat.completions.create({
      model: this.model,
      messages: [
        { role: 'system', content: SYSTEM_PROMPT },
        ...this.history,
        { role: 'user', content: userContent },
      ],
      temperature: 0,
      max_tokens: 200,
    });

    const raw = response.choices[0]?.message?.content || '';
    
    // Persist conversation
    this.history.push({ role: 'user', content: userContent });
    this.history.push({ role: 'assistant', content: raw });
    
    // Trim old history to avoid context overflow
    if (this.history.length > 16) {
      this.history = this.history.slice(-12);
    }

    return { action: this.parse(raw), raw };
  }
}

Fix 2: Screenshot Capture

// drivers/playwright.ts
export class PlaywrightDriver implements Driver {
  async observe(): Promise<PageState> {
    const [url, title, snapshot, screenshot] = await Promise.all([
      this.page.url(),
      this.page.title(),
      this.extractSnapshot(),
      this.options.captureScreenshots 
        ? this.page.screenshot({ type: 'jpeg', quality: 50 })
        : undefined,
    ]);

    return { url, title, snapshot, screenshot };
  }
}

Fix 3: Retry Wrapper

// runner.ts
async function withRetry<T>(
  fn: () => Promise<T>,
  retries: number = 3,
  delayMs: number = 1000
): Promise<T> {
  let lastError: Error | undefined;
  
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      if (i < retries - 1) {
        await new Promise(r => setTimeout(r, delayMs * (i + 1)));
      }
    }
  }
  
  throw lastError;
}
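
Usage in the runner might then look like this, wrapping the calls most prone to transient failures (retry counts and delays shown explicitly):

// runner.ts
const state = await withRetry(() => driver.observe(), 3, 1000);
const { action, raw } = await withRetry(() => brain.decide(scenario.goal, state), 3, 1000);
await withRetry(() => driver.execute(action), 3, 1000);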

Conclusion

@tangle/agent-driver is a well-designed prototype that demonstrates good architectural instincts. The observe→decide→execute loop is clean, the types are well-thought-out, and the code is readable.

However, it's missing several table-stakes features for production:

  • Conversation history (the LLM is currently amnesiac!)
  • Screenshot capture
  • Retry logic
  • Structured observability

The good news: the foundation is solid enough that these can be added incrementally without major refactoring.

Recommendation: Add conversation history first (it's a ~20-line fix that dramatically improves agent behavior), then tackle screenshots and retries before any production use.


Audit conducted by Ferdinand • @tangle/agent-driver v0.1.0
