This is a Hierarchical Controller-Worker Architecture.
Think of Kimi K2.5 not as a chatbot, but as a Distributed Operating System Kernel (the Orchestrator) managing a pool of Serverless Functions (the Sub-agents).
tl;dr
- Orchestrator + Frozen Sub-agents
- Parallel Agent RL training
- Reward the Orchestrator for spawning threads that run to completion (annealed reward to prevent spamming)
- Optimise cost based on the slowest sub-agent in a parallel batch, not the sum of all agents (latency-first metric)
The Orchestrator (the Manager)
- State: Trainable. This is the specific component being hit with the RL stick.
- Function: It parses the problem, decomposes it into a DAG of dependencies, and spins up workers.
- Output: Its "tokens" are system calls. It outputs commands like: instantiate SubAgent_001 with SystemPrompt="Physics Researcher" and Task="Calculate the drag coefficient...".
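If you squint, that "system call" is presumably just structured text the Orchestrator emits. A minimal sketch of what it could look like; the SpawnCommand class, its field names, and the depends_on DAG encoding are my own illustration, not Kimi's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical schema for the Orchestrator's "system call" output.
# The class and field names are illustrative; K2.5's real format is not public.
@dataclass
class SpawnCommand:
    agent_id: str       # e.g. "SubAgent_001"
    system_prompt: str  # the specialization, e.g. "Physics Researcher"
    task: str           # the delegated unit of work
    depends_on: list[str] = field(default_factory=list)  # DAG edges: wait on these agents

plan = [
    SpawnCommand("SubAgent_001", "Physics Researcher",
                 "Calculate the drag coefficient of the proposed airfoil."),
    SpawnCommand("SubAgent_002", "Python Coder",
                 "Simulate lift vs. angle of attack using the drag coefficient.",
                 depends_on=["SubAgent_001"]),
    SpawnCommand("SubAgent_003", "Fact Checker",
                 "Verify the Reynolds-number assumptions."),
]

# Commands with no unmet dependencies form the next parallel batch.
ready = [c for c in plan if not c.depends_on]
```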
The Sub-agents (the Workers)
- State: FROZEN. This is the critical technical detail buried in the text: "executed by dynamically instantiated, frozen subagents."
- Why Frozen? Gradient Variance. If you try to train the Manager and the Workers simultaneously using RL on a complex task, the non-stationarity destroys convergence. The Manager can't learn a policy if the Workers' behavior changes every epoch.
- Implementation: These are likely just standard K2.5 instances, but "instantiated" with specific, specialized system prompts (e.g., "You are a Fact Checker," "You are a Python Coder"). They are stateless ephemeral containers.
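Under that reading, "instantiating" a worker is little more than wrapping the same frozen checkpoint in a new persona. A rough sketch, assuming a generic chat-completion style client; complete() and its message format are placeholders, not Kimi's real API:

```python
def spawn_subagent(base_model, role: str, task: str) -> str:
    """Run one stateless, ephemeral worker: same frozen weights, new persona.

    `base_model` is assumed to expose a chat-style complete(messages=...) call;
    that interface is a placeholder, not an actual Kimi API.
    """
    messages = [
        {"role": "system", "content": f"You are a {role}."},  # specialization via prompt only
        {"role": "user", "content": task},
    ]
    # No fine-tuning, no persistent memory: the worker lives for one call,
    # hands its answer back to the Orchestrator, and is discarded.
    return base_model.complete(messages=messages)

# e.g. spawn_subagent(k25, "Fact Checker", "Verify the drag-coefficient derivation.")
```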
They solved the "Lazy Manager" problem. In standard RL, agents are lazy. If a single agent can do the task (even slowly), the policy collapses to single-threading because managing threads is expensive (inference cost + complexity).
How do you force a lazy model to delegate? You pay it to spawn threads.
- The Mechanism: They use a modified reward function ($r_{PARL}$) that explicitly grants "dopamine hits" ($r_{parallel}$) just for instantiating a sub-agent.
- The Guardrail: To prevent Spurious Parallelism (spawning 100 useless agents just to farm rewards), they also reward Sub-agent Completion ($r_{finish}$).
- Annealing: They treat these rewards like training wheels. As the model learns, they fade out the "bribe" ($\lambda \to 0$), leaving only the final performance reward ($r_{perf}$).
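Stitched together, the shaped reward plausibly looks something like the sketch below. Only the $r_{perf}$ / $r_{parallel}$ / $r_{finish}$ / $\lambda$ structure comes from the description above; the weights, the cap on spawns, and the linear annealing schedule are my assumptions:

```python
def parl_reward(r_perf: float, num_spawned: int, num_finished: int,
                step: int, anneal_steps: int = 10_000) -> float:
    """Shaped reward for parallel-agent RL (illustrative, not the published formula).

    r_perf       : final task performance reward
    num_spawned  : sub-agents the Orchestrator instantiated (the r_parallel "bribe")
    num_finished : sub-agents that actually ran to completion (the r_finish guardrail)
    """
    lam = max(0.0, 1.0 - step / anneal_steps)  # training wheels: lambda -> 0
    r_parallel = 0.1 * min(num_spawned, 8)     # capped so spamming agents stops paying
    r_finish = 0.2 * num_finished              # only completed workers count
    return r_perf + lam * (r_parallel + r_finish)
```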
Standard LLM cost metrics (total tokens) actually penalize swarms, because more agents means more total text generated.
- The Fix: They switched to a Latency-First Metric called Critical Steps.
- The Logic: It calculates cost based on the slowest sub-agent in a parallel batch, not the sum of all agents.
- Result: This forces the RL policy to learn Amdahl’s Law—it only gets a high score if it parallelizes tasks to reduce wall-clock time, not just to look busy.
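A toy version of that accounting, assuming the execution trace has already been grouped into sequential stages of parallel batches; the grouping, the function name critical_steps, and the per-agent step counts are made up for illustration:

```python
def critical_steps(stages: list[list[int]]) -> int:
    """Latency-first cost: within each parallel batch, charge only the slowest
    sub-agent, then sum across sequential stages (the critical path)."""
    return sum(max(batch) for batch in stages)

# Three workers run in parallel (the slowest takes 40 steps), then one sequential step of 15.
trace = [[40, 25, 10], [15]]
print(critical_steps(trace))                # 55 <- what the RL policy is scored on
print(sum(sum(batch) for batch in trace))   # 90 <- the total-token view that punishes swarms
```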