
πŸ” Senior Engineer Audit: @tangle/agent-driver

Package: @tangle/agent-driver v0.1.0
Auditor: Ferdinand (AI)
Date: 2026-02-04
Scope: Architecture, API design, observability, production readiness


Executive Summary

This is a clean, minimal LLM-driven browser agent with good bones. The core observe→decide→execute loop is well-implemented, and the separation of concerns (Driver, Brain, Runner) shows architectural maturity. However, it's clearly in "MVP/prototype" stage—fine for tests, but missing critical features for production use.

Verdict: Solid foundation. Needs ~2 sprints of hardening for production.


1. maxTurns Semantics

Question: Is it clear that maxTurns = max observe→decide→execute cycles (any action)?

Answer: Mostly, but could be clearer.

What the code does:

for (let i = 1; i <= maxTurns; i++) {
  // 1. Observe
  // 2. Decide
  // 3. Execute
}

Each loop iteration is ONE complete cycle, and any action counts: click, type, scroll, wait, etc.

The ambiguity:

  • The JSDoc says /** Max turns before giving up */, which is vague
  • The Turn type says /** One observe → decide → execute cycle */, which is better!
  • Someone might assume "turns" means "user interactions" or "typing turns"

Recommendation:

export interface Scenario {
  /**
   * Maximum observe→decide→execute cycles before aborting.
   * Each cycle is one LLM call + one action (click, type, scroll, etc.)
   * @default 20
   */
  maxTurns?: number;
}

Rating: 7/10. Semantics are correct; documentation could be crisper.


2. Directive Flexibility

Question: Can users pass any goal/directive? Is it flexible enough?

Answer: Yes, it's very flexible.

export interface Scenario {
  goal: string;           // ✅ Any natural language goal
  startUrl?: string;      // ✅ Optional starting point
  maxTurns?: number;      // ✅ Configurable limit
}

Strengths:

  • goal is free-form natural language
  • No rigid structure imposed
  • Works for: "Login as admin", "Add item to cart", "Find the pricing page"

Limitations:

  • No support for multi-step scenarios (first do X, then Y)
  • No way to pass context/hints (e.g., "the password is in env var")
  • No assertion/validation hooks ("verify checkout total is $99")

Recommendation: add optional context and assertions fields:

export interface Scenario {
  goal: string;
  startUrl?: string;
  maxTurns?: number;
  /** Additional context for the LLM (credentials, hints, etc.) */
  context?: string;
  /** Expected success criteria for validation */
  assertions?: string[];
}
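
For example, a scenario exercising these fields might look like this (the values are purely illustrative):

const scenario: Scenario = {
  goal: 'Purchase the cheapest annual plan',
  startUrl: 'https://example.com/pricing',
  maxTurns: 15,
  // Hints the agent can use without baking them into the goal text
  context: 'Test card number is 4242 4242 4242 4242. The admin password is in the ADMIN_PASSWORD env var.',
  // Success criteria a validation step could check after the run
  assertions: ['The confirmation page shows "Order complete"', 'The total is $99'],
};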

Rating: 8/10. Great for simple goals; needs extension for complex scenarios.


3. Logs & Telemetry

Question: What's captured? What's MISSING?

✅ Currently Captured:

| Data | Where | Notes |
|---|---|---|
| Turn number | Turn.turn | Good |
| Page state | Turn.state | URL, title, snapshot |
| Action taken | Turn.action | Full action object |
| Raw LLM response | Turn.rawLLMResponse | ✅ Excellent for debugging |
| Duration | Turn.durationMs | Per-turn timing |
| Errors | Turn.error | When caught |
| Total time | AgentResult.totalMs | Aggregate |

❌ MISSING (Critical for Production):

| Missing | Impact | Priority |
|---|---|---|
| Conversation history | LLM has no memory of previous turns! | 🔴 Critical |
| Screenshots | Can't debug visual issues | 🔴 Critical |
| Reasoning/CoT | No visibility into "why" | 🟡 High |
| Token usage | Can't track costs | 🟡 High |
| Action success/failure | Did click actually work? | 🟡 High |
| Retry mechanism | One failure = total abort | 🟡 High |
| Structured logging | Only console.log with debug flag | 🟢 Medium |
| Trace IDs | Can't correlate across services | 🟢 Medium |

🚨 Critical Issue: No Conversation History!

// brain/index.ts
const response = await this.client.chat.completions.create({
  messages: [
    { role: 'system', content: SYSTEM_PROMPT },
    { role: 'user', content: prompt },  // ← Only current state!
  ],
});

The LLM has amnesia! Each turn is completely independent. This causes:

  • Agent clicks same button repeatedly
  • Agent retries failed actions identically
  • Agent can't learn from previous attempts
  • Multi-step reasoning is impossible

Fix:

class Brain {
  private history: ChatCompletionMessageParam[] = [];
  
  async decide(goal: string, state: PageState): Promise<...> {
    const userMessage = { role: 'user', content: buildPrompt(goal, state) };
    
    const response = await this.client.chat.completions.create({
      messages: [
        { role: 'system', content: SYSTEM_PROMPT },
        ...this.history,
        userMessage,
      ],
    });
    
    // Store for next turn
    this.history.push(userMessage);
    this.history.push({ role: 'assistant', content: response.choices[0].message.content });
    
    // Trim if too long
    if (this.history.length > 20) this.history = this.history.slice(-10);
  }
}

📸 No Screenshot Capture

The PageState only has a text snapshot. For debugging:

  • You can't see what the agent "saw"
  • You can't verify element visibility
  • You can't debug selector issues

Recommendation:

export interface PageState {
  url: string;
  title: string;
  snapshot: string;
  screenshot?: Buffer;  // Optional, configurable
}
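
With that field in place, a small post-run helper can dump what the agent "saw" on each turn. A sketch, assuming AgentResult exposes the recorded turns array:

import { mkdir, writeFile } from 'node:fs/promises';

async function dumpScreenshots(result: AgentResult, dir = './agent-screenshots') {
  await mkdir(dir, { recursive: true });
  for (const turn of result.turns) {
    if (turn.state.screenshot) {
      // One JPEG per turn: 001.jpg, 002.jpg, ...
      await writeFile(`${dir}/${String(turn.turn).padStart(3, '0')}.jpg`, turn.state.screenshot);
    }
  }
}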

🔄 No Retry Mechanism

} catch (err) {
  // Immediate abort, no retry
  return { success: false, reason: err, ... };
}

One transient failure (network glitch, slow load) = complete failure.

Recommendation:

interface AgentConfig {
  retries?: number;        // Default: 3
  retryDelayMs?: number;   // Default: 1000
  retryableErrors?: string[];  // Patterns to retry
}
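
A sketch of how retryableErrors could gate retries, paired with the withRetry helper shown in Fix 3 below (the default patterns here are illustrative):

function isRetryable(err: unknown, patterns: string[] = ['timeout', 'net::', 'ECONNRESET']): boolean {
  const message = err instanceof Error ? err.message : String(err);
  // Only retry errors whose message matches a configured pattern
  return patterns.some((pattern) => message.includes(pattern));
}

// Inside withRetry's catch block, rethrow immediately when the failure isn't transient:
// if (!isRetryable(err, config.retryableErrors)) throw err;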

Rating: 4/10. Basic turn logging exists, but critical production features are missing.


4. Architecture Quality

Driver Interface

export interface Driver {
  observe(): Promise<PageState>;
  execute(action: Action): Promise<void>;
}

Assessment: Clean and minimal.

✅ Perfect abstraction level
✅ Easy to implement new drivers (Puppeteer, WebDriver, etc.)
✅ Testable (easy to mock)
❌ execute returns void, so there is no feedback on success/failure
❌ No lifecycle hooks (setup, teardown)

Recommendation:

export interface Driver {
  observe(): Promise<PageState>;
  execute(action: Action): Promise<ActionResult>;  // Did it work?
  screenshot?(): Promise<Buffer>;
  close?(): Promise<void>;
}

interface ActionResult {
  success: boolean;
  error?: string;
  changedElements?: string[];  // What changed after action
}
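
A sketch of what an ActionResult-returning execute could look like in a Playwright driver (the Action fields used here are assumptions):

// drivers/playwright.ts
async execute(action: Action): Promise<ActionResult> {
  try {
    switch (action.type) {
      case 'click':
        await this.page.click(action.selector);
        break;
      case 'type':
        await this.page.fill(action.selector, action.text);
        break;
      // ...other action types
    }
    return { success: true };
  } catch (err) {
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}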

Brain Swappability

Currently: Hardcoded OpenAI SDK

import OpenAI from 'openai';
// ...
this.client = new OpenAI({ ... });

Can you use Anthropic? Technically yes, by pointing baseURL at an OpenAI-compatible endpoint. But:

  • No native Anthropic SDK support
  • No Claude-specific features (extended thinking, tool use)
  • OpenAI response format is assumed

Recommendation: abstract the LLM layer:

interface LLMProvider {
  complete(messages: Message[]): Promise<string>;
}

class OpenAIProvider implements LLMProvider { ... }
class AnthropicProvider implements LLMProvider { ... }

class Brain {
  constructor(private provider: LLMProvider) {}
}
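
Fleshing out the OpenAI side of that abstraction as a sketch (the Message type and default model are assumptions):

import OpenAI from 'openai';

type Message =
  | { role: 'system'; content: string }
  | { role: 'user'; content: string }
  | { role: 'assistant'; content: string };

class OpenAIProvider implements LLMProvider {
  constructor(
    private client = new OpenAI(),   // picks up OPENAI_API_KEY from the environment
    private model = 'gpt-4o-mini',
  ) {}

  async complete(messages: Message[]): Promise<string> {
    const response = await this.client.chat.completions.create({
      model: this.model,
      messages,
      temperature: 0,
    });
    return response.choices[0]?.message?.content ?? '';
  }
}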

Production Viability

| Aspect | Status | Notes |
|---|---|---|
| Error handling | ⚠️ Basic | Single try/catch, no recovery |
| Graceful shutdown | ❌ Missing | No way to cancel mid-run |
| Resource cleanup | ❌ Missing | Page/browser left open |
| Rate limiting | ❌ Missing | Can hammer the LLM API |
| Circuit breaker | ❌ Missing | No backoff on repeated failures |
| Idempotency | ❌ Missing | Re-running may double-execute |

Rating: 6/10. Good abstraction; needs production hardening.
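
The "graceful shutdown" gap in the table above is worth a concrete sketch: threading an AbortSignal through the runner lets callers cancel mid-run (the run signature and return shape here are illustrative):

export interface RunOptions {
  signal?: AbortSignal;
}

async function run(scenario: Scenario, driver: Driver, brain: Brain, opts: RunOptions = {}) {
  for (let turn = 1; turn <= (scenario.maxTurns ?? 20); turn++) {
    // Bail out cleanly if the caller aborted (timeout, Ctrl+C, test teardown)
    if (opts.signal?.aborted) {
      return { success: false, reason: 'aborted' };
    }
    const state = await driver.observe();
    const { action } = await brain.decide(scenario.goal, state);
    await driver.execute(action);
  }
  return { success: false, reason: 'maxTurns exceeded' };
}

// Usage: give the agent at most 60 seconds
// const result = await run(scenario, driver, brain, { signal: AbortSignal.timeout(60_000) });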


5. Missing Features for Production

Must Have (P0)

  1. Conversation history: the LLM needs context from previous turns
  2. Screenshot capture: debug visual state
  3. Retry mechanism: handle transient failures
  4. Abort signal/cancellation: stop long-running agents
  5. Action result feedback: know if actions succeeded

Should Have (P1)

  1. Structured logging: JSON logs with trace IDs
  2. Token/cost tracking: budget awareness
  3. Multi-LLM support: Anthropic, Gemini, local models
  4. Hooks/middleware: onBeforeAction, onAfterAction, onError (see the sketch after this list)
  5. State assertions: verify expected outcomes
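
A sketch of what those hooks could look like (the interface name and runner wiring are assumptions):

export interface AgentHooks {
  onBeforeAction?(turn: number, action: Action): void | Promise<void>;
  onAfterAction?(turn: number, action: Action, state: PageState): void | Promise<void>;
  onError?(turn: number, error: Error): void | Promise<void>;
}

// In the runner, each hook is awaited around execute():
// await hooks.onBeforeAction?.(turn, action);
// await driver.execute(action);
// await hooks.onAfterAction?.(turn, action, await driver.observe());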

Nice to Have (P2)

  1. Visual element references: "Click the blue button", not just selectors
  2. Parallel action support: fill multiple fields at once
  3. Record/replay: capture runs for playback
  4. Human-in-the-loop: pause and ask for help
  5. Metrics export: Prometheus/OpenTelemetry integration

Ratings Summary

| Category | Rating | Notes |
|---|---|---|
| API Design | 7/10 | Clean, intuitive, good types. Minor gaps in docs. |
| Observability/Debugging | 4/10 | Turn logging is good, but missing screenshots, history, structured logs. |
| Extensibility | 6/10 | Driver interface is solid. Brain is not swappable. No hooks. |
| Production Readiness | 3/10 | MVP only. Missing retries, cancellation, conversation history, error recovery. |
| Code Quality | 8/10 | Clean, well-organized, proper TypeScript. Good separation of concerns. |

Overall: 5.6/10

Translation: Great prototype, not production-ready. The bones are good; this could be excellent with 2-3 weeks of focused work.


Recommended Next Steps

Week 1: Critical Fixes

  • Add conversation history to Brain
  • Add screenshot capture to PageState
  • Add retry mechanism to runner
  • Add ActionResult feedback from execute()

Week 2: Production Hardening

  • Add cancellation/abort signal
  • Add structured logging with trace IDs (see the sketch below)
  • Add lifecycle hooks (onTurn, onError, onComplete)
  • Add token usage tracking
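
For the structured logging item, a minimal sketch of JSON logs keyed by a per-run trace ID (event names are illustrative; crypto.randomUUID generates the ID):

import { randomUUID } from 'node:crypto';

function createLogger(traceId: string = randomUUID()) {
  return {
    traceId,
    log(event: string, data: Record<string, unknown> = {}) {
      // One JSON object per line; trivial to ship to any log aggregator
      console.log(JSON.stringify({ ts: new Date().toISOString(), traceId, event, ...data }));
    },
  };
}

// Usage in the runner:
// const logger = createLogger();
// logger.log('turn.start', { turn, url: state.url });
// logger.log('turn.action', { turn, action });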

Week 3: Extensibility

  • Abstract LLM provider interface
  • Add Anthropic provider
  • Add configuration validation
  • Add comprehensive test suite

Code Snippets for Quick Wins

Fix 1: Conversation History

// brain/index.ts
import type { ChatCompletionMessageParam } from 'openai/resources/chat';

export class Brain {
  private history: ChatCompletionMessageParam[] = [];
  
  reset() {
    this.history = [];
  }

  async decide(goal: string, state: PageState): Promise<{ action: Action; raw: string }> {
    const userContent = `GOAL: ${goal}\n\nCURRENT PAGE:\nURL: ${state.url}\nTitle: ${state.title}\n\nELEMENTS:\n${state.snapshot}\n\nWhat action should you take?`;
    
    const response = await this.client.chat.completions.create({
      model: this.model,
      messages: [
        { role: 'system', content: SYSTEM_PROMPT },
        ...this.history,
        { role: 'user', content: userContent },
      ],
      temperature: 0,
      max_tokens: 200,
    });

    const raw = response.choices[0]?.message?.content || '';
    
    // Persist conversation
    this.history.push({ role: 'user', content: userContent });
    this.history.push({ role: 'assistant', content: raw });
    
    // Trim old history to avoid context overflow
    if (this.history.length > 16) {
      this.history = this.history.slice(-12);
    }

    return { action: this.parse(raw), raw };
  }
}

Fix 2: Screenshot Capture

// drivers/playwright.ts
export class PlaywrightDriver implements Driver {
  async observe(): Promise<PageState> {
    const [url, title, snapshot, screenshot] = await Promise.all([
      this.page.url(),
      this.page.title(),
      this.extractSnapshot(),
      this.options.captureScreenshots 
        ? this.page.screenshot({ type: 'jpeg', quality: 50 })
        : undefined,
    ]);

    return { url, title, snapshot, screenshot };
  }
}

Fix 3: Retry Wrapper

// runner.ts
async function withRetry<T>(
  fn: () => Promise<T>,
  retries: number = 3,
  delayMs: number = 1000
): Promise<T> {
  let lastError: Error | undefined;
  
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      if (i < retries - 1) {
        await new Promise(r => setTimeout(r, delayMs * (i + 1)));
      }
    }
  }
  
  throw lastError;
}
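
Usage in the runner might then look like this, wrapping the calls most prone to transient failures (retry counts and delays shown explicitly):

// runner.ts
const state = await withRetry(() => driver.observe(), 3, 1000);
const { action, raw } = await withRetry(() => brain.decide(scenario.goal, state), 3, 1000);
await withRetry(() => driver.execute(action), 3, 1000);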

Conclusion

@tangle/agent-driver is a well-designed prototype that demonstrates good architectural instincts. The observe→decide→execute loop is clean, the types are well-thought-out, and the code is readable.

However, it's missing several table-stakes features for production:

  • Conversation history (the LLM is currently amnesiac!)
  • Screenshot capture
  • Retry logic
  • Structured observability

The good news: the foundation is solid enough that these can be added incrementally without major refactoring.

Recommendation: Add conversation history first (it's a ~20-line fix that dramatically improves agent behavior), then tackle screenshots and retries before any production use.


Audit conducted by Ferdinand • @tangle/agent-driver v0.1.0
