This is a Hierarchical Controller-Worker Architecture.
Think of Kimi K2.5 not as a chatbot, but as a Distributed Operating System Kernel (the Orchestrator) managing a pool of Serverless Functions (the Sub-agents).
tl;dr
- Orchestrator + Frozen Sub-agents
- Parallel Agent RL training
- Reward the Orchestrator for spawning threads that run to completion (annealed reward to prevent spamming)
- Optimise cost based on the slowest sub-agent in a parallel batch, not the sum of all agents (latency-first metric)
The Orchestrator (the Manager)
- State: Trainable. This is the specific component being hit with the RL stick.
- Function: It parses the problem, decomposes it into a DAG of dependencies, and spins up workers.
- Output: Its "tokens" are system calls. It outputs commands like: instantiate SubAgent_001 with SystemPrompt="Physics Researcher" and Task="Calculate the drag coefficient...".
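If you squint, that "system call" is presumably just structured text the Orchestrator emits. A minimal sketch of what it could look like; the SpawnCommand class, its field names, and the depends_on DAG encoding are my own illustration, not Kimi's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical schema for the Orchestrator's "system call" output.
# The class and field names are illustrative; K2.5's real format is not public.
@dataclass
class SpawnCommand:
    agent_id: str       # e.g. "SubAgent_001"
    system_prompt: str  # the specialization, e.g. "Physics Researcher"
    task: str           # the delegated unit of work
    depends_on: list[str] = field(default_factory=list)  # DAG edges: wait on these agents

plan = [
    SpawnCommand("SubAgent_001", "Physics Researcher",
                 "Calculate the drag coefficient of the proposed airfoil."),
    SpawnCommand("SubAgent_002", "Python Coder",
                 "Simulate lift vs. angle of attack using the drag coefficient.",
                 depends_on=["SubAgent_001"]),
    SpawnCommand("SubAgent_003", "Fact Checker",
                 "Verify the Reynolds-number assumptions."),
]

# Commands with no unmet dependencies form the next parallel batch.
ready = [c for c in plan if not c.depends_on]
```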
The Sub-agents (the Workers)
- State: FROZEN. This is the critical technical detail buried in the text: "executed by dynamically instantiated, frozen subagents."
- Why Frozen? Gradient Variance. If you try to train the Manager and the Workers simultaneously using RL on a complex task, the non-stationarity destroys convergence. The Manager can't learn a policy if the Workers' behavior changes every epoch.
- Implementation: These are likely just standard K2.5 instances, but "instantiated" with specific, specialized system prompts (e.g., "You are a Fact Checker," "You are a Python Coder"). They are stateless ephemeral containers.
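Under that reading, "instantiating" a worker is little more than wrapping the same frozen checkpoint in a new persona. A rough sketch, assuming a generic chat-completion style client; complete() and its message format are placeholders, not Kimi's real API:

```python
def spawn_subagent(base_model, role: str, task: str) -> str:
    """Run one stateless, ephemeral worker: same frozen weights, new persona.

    `base_model` is assumed to expose a chat-style complete(messages=...) call;
    that interface is a placeholder, not an actual Kimi API.
    """
    messages = [
        {"role": "system", "content": f"You are a {role}."},  # specialization via prompt only
        {"role": "user", "content": task},
    ]
    # No fine-tuning, no persistent memory: the worker lives for one call,
    # hands its answer back to the Orchestrator, and is discarded.
    return base_model.complete(messages=messages)

# e.g. spawn_subagent(k25, "Fact Checker", "Verify the drag-coefficient derivation.")
```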
They solved the "Lazy Manager" problem. In standard RL, agents are lazy. If a single agent can do the task (even slowly), the policy collapses to single-threading because managing threads is expensive (inference cost + complexity).
How do you force a lazy model to delegate? You pay it to spawn threads.
- The Mechanism: They use a modified reward function ($r_{PARL}$) that explicitly grants "dopamine hits" ($r_{parallel}$) just for instantiating a sub-agent.
- The Guardrail: To prevent Spurious Parallelism (spawning 100 useless agents just to farm rewards), they also reward Sub-agent Completion ($r_{finish}$).
- Annealing: They treat these rewards like training wheels. As the model learns, they fade out the "bribe" ($\lambda \to 0$), leaving only the final performance reward ($r_{perf}$).
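Stitched together, the shaped reward plausibly looks something like the sketch below. Only the $r_{perf}$ / $r_{parallel}$ / $r_{finish}$ / $\lambda$ structure comes from the description above; the weights, the cap on spawns, and the linear annealing schedule are my assumptions:

```python
def parl_reward(r_perf: float, num_spawned: int, num_finished: int,
                step: int, anneal_steps: int = 10_000) -> float:
    """Shaped reward for parallel-agent RL (illustrative, not the published formula).

    r_perf       : final task performance reward
    num_spawned  : sub-agents the Orchestrator instantiated (the r_parallel "bribe")
    num_finished : sub-agents that actually ran to completion (the r_finish guardrail)
    """
    lam = max(0.0, 1.0 - step / anneal_steps)  # training wheels: lambda -> 0
    r_parallel = 0.1 * min(num_spawned, 8)     # capped so spamming agents stops paying
    r_finish = 0.2 * num_finished              # only completed workers count
    return r_perf + lam * (r_parallel + r_finish)
```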
Standard LLM cost metrics (total tokens) actually penalize swarms, because more agents means more total text generated.
- The Fix: They switched to a Latency-First Metric called Critical Steps.
- The Logic: It calculates cost based on the slowest sub-agent in a parallel batch, not the sum of all agents.
- Result: This forces the RL policy to learn Amdahl’s Law—it only gets a high score if it parallelizes tasks to reduce wall-clock time, not just to look busy.
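A toy version of that accounting, assuming the execution trace has already been grouped into sequential stages of parallel batches; the grouping, the function name critical_steps, and the per-agent step counts are made up for illustration:

```python
def critical_steps(stages: list[list[int]]) -> int:
    """Latency-first cost: within each parallel batch, charge only the slowest
    sub-agent, then sum across sequential stages (the critical path)."""
    return sum(max(batch) for batch in stages)

# Three workers run in parallel (the slowest takes 40 steps), then one sequential step of 15.
trace = [[40, 25, 10], [15]]
print(critical_steps(trace))                # 55 <- what the RL policy is scored on
print(sum(sum(batch) for batch in trace))   # 90 <- the total-token view that punishes swarms
```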