Note: This gist outlines architectural patterns and infrastructure used for research. Source code and proprietary datasets are excluded per lab data policy.
Online RL Implementation
- GRPO (Group Relative Policy Optimization): Implemented group-based baselines for multi-turn agents, removing the need for a separate value network (Critic).
- Loss Variants: Support for DAPO, BNPO, and DR-GRPO with token-level masking for long-context reasoning.
- Curriculum Learning: Dynamic CurriculumDatasetCallback to shift training distributions from simple to complex tasks to prevent collapse.
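
To illustrate the group-based baseline and token-level masking mentioned in the list above, here is a minimal pure-Python sketch. It is not the excluded lab code. The function names `group_relative_advantages` and `masked_token_mean`, the `scale_by_std` flag, and the use of the population standard deviation are all assumptions made for illustration. The idea is that each completion's reward is compared with the mean reward of the other completions sampled for the same prompt, so no learned value network is needed. Turning off the standard-deviation scaling roughly corresponds to the DR-GRPO-style variant.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, scale_by_std=True, eps=1e-6):
    """Advantages for one prompt's group of sampled completions.

    The group mean acts as the baseline, replacing a learned critic.
    scale_by_std and eps are hypothetical parameters for this sketch.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not scale_by_std:
        # Mean-centering only (assumed to resemble the DR-GRPO variant).
        return centered
    std = pstdev(rewards)  # population std; a sample std is another option
    return [c / (std + eps) for c in centered]


def masked_token_mean(per_token_loss, mask):
    """Token-level aggregation: average over every unmasked token in the batch.

    per_token_loss and mask are lists of equal-length lists, one per sequence.
    Masked positions (0) are excluded from both the sum and the count.
    """
    total = 0.0
    count = 0
    for losses, keep_flags in zip(per_token_loss, mask):
        for loss, keep in zip(losses, keep_flags):
            if keep:
                total += loss
                count += 1
    return total / max(count, 1)


if __name__ == "__main__":
    # Four completions for one prompt, scored 1.0 (success) or 0.0 (failure).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0], scale_by_std=False))
    # Second token of the first sequence is masked out (e.g. a tool response).
    print(masked_token_mean([[1.0, 2.0], [3.0, 4.0]], [[1, 0], [1, 1]]))
```

In a multi-turn setting, the mask would typically zero out tokens the policy did not generate (for example environment or tool outputs), so only the agent's own tokens contribute to the loss. That choice is an assumption here, not something the list above specifies.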