Kimi K2.5 sub-agents

This is a Hierarchical Controller-Worker Architecture.

Think of Kimi K2.5 not as a chatbot, but as a Distributed Operating System Kernel (the Orchestrator) managing a pool of Serverless Functions (the Sub-agents).

tl;dr

  • Orchestrator + Frozen Sub-agents
  • Parallel-Agent RL (PARL) training
  • Reward the Orchestrator for spawning threads that run to completion (an annealed reward prevents spamming)
  • Measure cost by the slowest sub-agent in a parallel batch, not the sum of all agents (latency-first metric)

Orchestrator (Mutable Brain)

  • State: Trainable. This is the specific component being hit with the RL stick.
  • Function: It parses the problem, decomposes it into a DAG of dependencies, and spins up workers.
  • Output: Its "tokens" are system calls. It outputs commands to instantiate SubAgent_001 with SystemPrompt="Physics Researcher" and Task="Calculate the drag coefficient..." (sketched below).
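
To make the controller-worker split concrete, here is a minimal Python sketch of what the orchestrator's "system calls" might look like. Everything in it (`SpawnCall`, `orchestrate`, the field names) is hypothetical; in the real system the decomposition is produced by the trainable K2.5 policy, not hand-written code.

```python
# Hypothetical sketch of the orchestrator's "system call" interface.
# None of these names come from the paper; they only illustrate the
# controller-worker split described above.
from dataclasses import dataclass, field


@dataclass
class SpawnCall:
    """One parsed orchestrator output: a command to instantiate a sub-agent."""
    agent_id: str
    system_prompt: str                                    # e.g. "Physics Researcher"
    task: str                                             # the delegated sub-task
    depends_on: list[str] = field(default_factory=list)   # DAG edges


def orchestrate(problem: str) -> list[SpawnCall]:
    """Decompose a problem into a dependency DAG of sub-agent spawn calls.

    Hard-coded here for illustration; the trained orchestrator emits these
    calls as its action tokens.
    """
    return [
        SpawnCall("SubAgent_001", "Physics Researcher",
                  "Calculate the drag coefficient for the given geometry"),
        SpawnCall("SubAgent_002", "Python Coder",
                  "Simulate the trajectory using the drag coefficient",
                  depends_on=["SubAgent_001"]),
        SpawnCall("SubAgent_003", "Fact Checker",
                  "Verify the simulation assumptions",
                  depends_on=["SubAgent_002"]),
    ]
```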

Sub-Agents (Immutable Workers)

  • State: FROZEN. This is the critical technical detail buried in the text: "executed by dynamically instantiated, frozen subagents."
  • Why Frozen? Gradient Variance. If you try to train the Manager and the Workers simultaneously using RL on a complex task, the non-stationarity destroys convergence. The Manager can't learn a policy if the Workers' behavior changes every epoch.
  • Implementation: These are likely just standard K2.5 instances, but "instantiated" with specific, specialized system prompts (e.g., "You are a Fact Checker," "You are a Python Coder"). They are stateless, ephemeral containers (see the sketch below).
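
As a rough illustration of "frozen, stateless worker", the sketch below runs a sub-agent as a single inference call against fixed weights. The `generate` callable and the message format are assumptions, not the actual K2.5 serving API.

```python
# Minimal sketch of a frozen, stateless worker. `generate` stands in for
# whatever serving API is actually used; it is hypothetical.
from typing import Callable


def run_subagent(system_prompt: str, task: str,
                 generate: Callable[[list[dict]], str]) -> str:
    """Execute one ephemeral worker: no memory, no weight updates.

    `generate` is any frozen text-generation callable, e.g. a wrapper
    around a served K2.5 checkpoint. Because the worker is stateless,
    it can be spawned and torn down like a serverless function.
    """
    messages = [
        {"role": "system", "content": system_prompt},   # specialization
        {"role": "user", "content": task},              # delegated sub-task
    ]
    return generate(messages)   # gradients never flow through this call
```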

Training Algorithm: PARL (Parallel-Agent Reinforcement Learning)

They solved the "Lazy Manager" problem. In standard RL, agents are lazy. If a single agent can do the task (even slowly), the policy collapses to single-threading because managing threads is expensive (inference cost + complexity).

Reward

How do you force a lazy model to delegate? You pay it to spawn threads.

  • The Mechanism: They use a modified reward function ($r_{PARL}$) that explicitly grants "dopamine hits" ($r_{parallel}$) just for instantiating a sub-agent.
  • The Guardrail: To prevent Spurious Parallelism (spawning 100 useless agents just to farm rewards), they also reward Sub-agent Completion ($r_{finish}$).
  • Annealing: They treat these rewards like training wheels. As the model learns, they fade out the "bribe" ($\lambda \to 0$), leaving only the final performance reward ($r_{perf}$). A sketch of this schedule follows below.
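
Here is a minimal sketch of how such a shaped, annealed reward could be assembled, assuming a form like $r_{PARL} = r_{perf} + \lambda\,(r_{parallel} + r_{finish})$ with a linear schedule for $\lambda$. The exact formula, weights, and schedule in the paper may differ.

```python
def parl_reward(r_perf: float,
                n_spawned: int,
                n_finished: int,
                step: int,
                total_steps: int,
                bonus: float = 0.1) -> float:
    """Assumed shaping: r_PARL = r_perf + lambda * (r_parallel + r_finish).

    r_parallel pays the orchestrator for instantiating sub-agents;
    r_finish only pays for sub-agents that actually ran to completion,
    which discourages spawning agents purely to farm the spawn bonus.
    The shaping weight lambda is annealed to zero, so late in training
    only the task reward r_perf remains.
    """
    lam = max(0.0, 1.0 - step / total_steps)      # linear annealing, lambda -> 0
    r_parallel = bonus * n_spawned                # "dopamine hit" per spawn
    r_finish = bonus * n_finished                 # guardrail: reward completion
    return r_perf + lam * (r_parallel + r_finish)
```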

Critical Steps Metric

Standard LLM metrics (Total Tokens) actually penalize swarms because more agents = more total text generated.

  • The Fix: They switched to a Latency-First Metric called Critical Steps.
  • The Logic: It calculates cost based on the slowest sub-agent in a parallel batch, not the sum of all agents.
  • Result: This forces the RL policy to learn Amdahl’s Law: it only gets a high score if it parallelizes tasks to reduce wall-clock time, not just to look busy (see the sketch below).
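
A sketch of the accounting, under the assumption that "Critical Steps" is essentially the critical-path cost of the execution schedule: within each parallel batch you charge only the slowest agent, then sum across sequential stages. The contrast with a total-steps metric shows why parallelism stops being penalized.

```python
def critical_steps(stages: list[list[int]]) -> int:
    """Wall-clock-style cost: per parallel batch, charge only the slowest agent.

    `stages` is a sequential list of parallel batches; each batch holds the
    step counts of the sub-agents that ran concurrently in that stage.
    """
    return sum(max(batch) for batch in stages)


def total_steps(stages: list[list[int]]) -> int:
    """Token-style cost: every agent's steps are summed (penalizes swarms)."""
    return sum(sum(batch) for batch in stages)


# Example: three workers in parallel (8, 5, 7 steps), then one sequential
# aggregation worker (4 steps).
stages = [[8, 5, 7], [4]]
assert critical_steps(stages) == 12   # 8 + 4: only the slowest per batch counts
assert total_steps(stages) == 24      # 8+5+7+4: parallelism looks "expensive"
```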