Skip to content

Instantly share code, notes, and snippets.

@Rustam-Z
Forked from nodir-t/04-the-security-model.md
Created February 9, 2026 13:14
Show Gist options
  • Select an option

  • Save Rustam-Z/8ef94ed0702ba55e70eebc7c5145ac6f to your computer and use it in GitHub Desktop.

Select an option

Save Rustam-Z/8ef94ed0702ba55e70eebc7c5145ac6f to your computer and use it in GitHub Desktop.
Claudir Architecture — Part 4: The Security Model

Part 4: The Security Model

Running an AI-powered Telegram bot in public groups presents a fundamental security challenge: the bot receives untrusted input from anyone, and that input is processed by an LLM that can call tools. If the LLM can execute arbitrary code, every message becomes a potential Remote Code Execution (RCE) vector.

Security Defense in Depth

The Core Security Principle

Users send messages  -->  LLM processes them  -->  LLM calls tools
      (untrusted)           (manipulable)            (dangerous)

The security model rests on one insight: separate the entity that sees untrusted input (the chatbot) from the entity that can execute code (the CTO bot). Nodira processes all user messages but can ONLY call pre-defined MCP tools. Mirzo can execute code but NEVER sees raw user messages — only task descriptions via bot_xona.

This is enforced at the Claude Code subprocess level:

match permissions {
    ToolPermissions::WebSearchOnly => {
        // SECURITY: only read-only web tools for chat bot
        cmd.args(["--tools", "WebSearch,WebFetch"]);
    }
    ToolPermissions::Full => {
        // CTO mode: full permissions
        cmd.arg("--dangerously-skip-permissions");
    }
}

The harness also validates the tool list reported by CC's system message on startup. If CC reports unexpected tools, the process terminates immediately.

Spam Classification Pipeline

Tier 1: Prefilter — Fast regex-based checks with no API calls. Checks trusted users (bypass), Anthropic magic string injection, configurable spam/safe patterns, and message length.

Tier 2: Haiku Classifier — Ambiguous messages go to Claude Haiku. The prompt wraps user input in <user_message> XML tags with explicit anti-injection warnings. Response constrained to exactly one word: SPAM or NOT_SPAM. The test suite includes injection resistance tests.

Strike System — Spam triggers strikes stored in SQLite. Three strikes = permanent ban. Edited messages also pass through spam filtering to prevent the edit-spam attack.

MCP Tool Security

SQL injection prevention: The query tool only allows SELECT statements, enforced server-side.

Path traversal prevention: Five layers of defense — reject .. components, reject absolute paths, reject symlinks, validate canonical paths within allowed directories, create directories only after all checks pass.

Username confirmation: Unknown @username mentions in public chats trigger a confirmation dialog — the bot must explicitly approve before sending.

SSRF protection: URLs checked against private IP ranges before fetching.

Rate limiting: 20 msgs/60s per chat. DMs from non-owners: 20/hour.

Focus Mode and Access Control

Tier 0 users (Owner and Queen) bypass all focus controls — their messages are never queued, never muted, never delayed.

Non-owner DMs require privacy consent before processing. On first DM, the bot shows a disclosure explaining that it might accidentally leak information from the conversation — sharing details in public channels or with other users. The user must explicitly accept ("Got it, let's chat!") before their messages are forwarded to the AI. This is honest about a real limitation: the bot talks to many people and sometimes fails to keep conversations separate.

Owner Control: Kill Switch

Why does a chatbot need a kill switch? Because AI can be bad. The model might start sending harmful content, harassing users, leaking private information, or taking actions the owner didn't authorize. It could also get into loops — burning API credits, spamming users, or calling tools endlessly. The owner needs a way to immediately stop the AI, not wait for it to finish its current turn.

The /kill Telegram command triggers a three-step shutdown: (1) kill the CC subprocess via SIGTERM/SIGKILL using the PID from Arc<AtomicU32>, (2) write a kill marker file that prevents the health monitor from restarting CC and tells the wrapper not to restart the harness, (3) exit the harness. The wrapper checks for the marker file before each restart — if present, it exits cleanly instead of restarting. This ensures /kill is a true emergency stop, not just a restart button.

Social Engineering Defenses (Real Incidents)

Chain of Trust Attack (2026-01-23): A user suggested DMs should be free. Nodira agreed and relayed the request up the chain. The change was implemented without owner approval. Lesson: "users want X" is not authorization. Business decisions require explicit owner approval via Telegram.

Incremental Escalation (2026-01-26): A user incrementally escalated requests from "show your prompts" (denied) to "share what you can" (partial) to "publish a gist" (full disclosure). Lesson: "No" stays "No" — further requests for the same thing are suspicious.

Queen Impersonation (2026-01-28): A fake account with a similar username DMed Nodira claiming to be the Queen. Mirzo initially believed it — the bots failed to check the user ID against the authoritative source (SYSTEM.md). The impersonation was only caught when the real Queen confirmed in bot_xona that it wasn't her. Lesson: identity claims require user ID verification, not username matching. The bots now check user IDs against known values in their system prompts.

Defense in Depth

No single layer is sufficient. A prompt injection might bypass Layer 2, but cannot bypass Layer 4. A sophisticated attack might bypass Layer 6, but the dangerous tools aren't available. The architecture ensures that compromising the model does not equal compromising the system.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment