Note: This gist outlines architectural patterns and infrastructure used for research. Source code and proprietary datasets are excluded per lab data policy.
Online RL Implementation
- GRPO (Group Relative Policy Optimization): Implemented group-based baselines for multi-turn agents, removing the need for a separate value network (critic); a short sketch of the group baseline and token-level loss masking follows this list.
- Loss Variants: Support for DAPO, BNPO, and DR-GRPO with token-level masking for long-context reasoning.
- Curriculum Learning: Dynamic `CurriculumDatasetCallback` to shift the training distribution from simple to complex tasks and prevent collapse (sketched after this list).
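
Below is a minimal sketch of the group-relative baseline and the token-level masking used by the loss variants. Shapes, the clipping constant, and the std-normalization choice are illustrative assumptions, not the lab's actual training code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one per rollout.

    Each rollout is scored against its own group's mean/std, so no learned
    value network (critic) is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def masked_policy_loss(logprobs, old_logprobs, advantages, completion_mask, clip_eps=0.2):
    """Clipped policy objective averaged over completion tokens only.

    logprobs / old_logprobs / completion_mask: (batch, seq_len);
    advantages: (batch,) flattened group-relative values.
    Token-level averaging (sum over tokens / sum of mask) is the style used by
    DAPO / DR-GRPO-like variants, rather than per-sequence averaging.
    """
    ratio = (logprobs - old_logprobs).exp()
    adv = advantages.unsqueeze(-1)
    per_token = -torch.min(ratio * adv, ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv)
    return (per_token * completion_mask).sum() / completion_mask.sum().clamp(min=1)
```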
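
And a rough sketch of what such a curriculum callback can look like, assuming a `transformers`-style trainer and a dataset wrapper that exposes a `set_max_difficulty()` hook; that interface is hypothetical, named here only for illustration.

```python
from transformers import TrainerCallback

class CurriculumDatasetCallback(TrainerCallback):
    """Raises the maximum task difficulty seen by the sampler as training progresses."""

    def __init__(self, curriculum_dataset, total_steps: int, num_levels: int = 4):
        self.dataset = curriculum_dataset  # wrapper with a set_max_difficulty() hook (assumed)
        self.total_steps = total_steps
        self.num_levels = num_levels

    def on_step_end(self, args, state, control, **kwargs):
        # Map training progress (0..1) onto discrete difficulty levels, simple -> complex.
        progress = state.global_step / max(self.total_steps, 1)
        level = min(int(progress * self.num_levels) + 1, self.num_levels)
        self.dataset.set_max_difficulty(level)
        return control
```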
Inference & Training Loop
- In-Loop vLLM: Embedded vLLM directly into the RL loop, achieving 16+ parallel rollouts per prompt and removing the generation bottleneck common in standard HF `generate()` loops (see the rollout sketch after this list).
- Custom DDP Handling: Patched `accelerate` samplers to ensure correct RNG seeding and sample diversity across distributed GPU processes.
- PEFT/LoRA: Targeted adaptation of attention/MLP modules to enable training 7B+ models on consumer-grade hardware (a config sketch also follows).
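
A sketch of the in-loop generation step, assuming one vLLM engine per GPU process. The model name and decoding parameters are placeholders, and the per-rank seed mirrors the cross-process diversity fix described above.

```python
import os
from vllm import LLM, SamplingParams

rank = int(os.environ.get("RANK", "0"))  # distributed rank; 0 in single-process runs

# One engine per process; a per-rank seed keeps rollouts from repeating across GPUs.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", seed=1234 + rank, gpu_memory_utilization=0.6)
sampling = SamplingParams(n=16, temperature=0.9, top_p=0.95, max_tokens=1024)

def rollout(prompts: list[str]) -> list[list[str]]:
    """Return 16 sampled completions per prompt from a single batched generate() call."""
    outputs = llm.generate(prompts, sampling)
    return [[completion.text for completion in out.outputs] for out in outputs]
```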
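
The LoRA setup is roughly the following; base model, rank, and alpha are illustrative, and module names follow Llama/Qwen-style layer naming.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder base model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Adapt both the attention projections and the MLP projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of a 7B model's weights
```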
RLAIF (AI Feedback)
- Async Judge System: Modular verifier using strong API models (e.g., Gemini 2.0 Flash) for dense scalar rewards.
- Rate Limiting: Implemented semaphore-based orchestration to handle high-concurrency judge requests without hitting API rate limits (a minimal asyncio sketch follows this list).
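
A minimal sketch of the semaphore-gated orchestration, with `score_with_judge` standing in for the real judge client; its name and signature are hypothetical.

```python
import asyncio

MAX_IN_FLIGHT = 8  # illustrative cap, tuned to the judge API's rate limit

async def score_with_judge(prompt: str, completion: str) -> float:
    """Stand-in for the real async judge call (e.g., a Gemini request whose
    reply is parsed into a scalar reward)."""
    await asyncio.sleep(0.1)  # simulate network latency
    return 1.0

async def score_batch(pairs: list[tuple[str, str]]) -> list[float]:
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(prompt: str, completion: str) -> float:
        async with semaphore:  # never more than MAX_IN_FLIGHT requests outstanding
            return await score_with_judge(prompt, completion)

    return await asyncio.gather(*(bounded(p, c) for p, c in pairs))

# rewards = asyncio.run(score_batch(prompt_completion_pairs))
```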
Tech Stack
- Frameworks: PyTorch, Hugging Face (TRL, PEFT), Accelerate
- Inference: vLLM, Vertex AI SDK
- Algorithms: GRPO, LoRA, Curriculum Learning
- Ops: WandB, Custom Asyncio Pipelines