Scalable Online RL Training System (vLLM + RLAIF): Architecture Overview

Note: This gist outlines architectural patterns and infrastructure used for research. Source code and proprietary datasets are excluded per lab data policy.

Technical Highlights

Online RL Implementation

  • GRPO (Group Relative Policy Optimization): Implemented group-based baselines for multi-turn agents, removing the need for a separate value network (critic); see the advantage sketch after this list.
  • Loss Variants: Support for DAPO, BNPO, and DR-GRPO with token-level masking for long-context reasoning.
  • Curriculum Learning: Dynamic CurriculumDatasetCallback that shifts the training distribution from simple to complex tasks to prevent training collapse.
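
A minimal sketch of the group-relative baseline mentioned above: each rollout's reward is normalized against the mean and standard deviation of its own prompt group, so no learned critic is required. The function name, tensor shapes, and example values are illustrative, not taken from the training code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO-style advantages for groups of rollouts sharing one prompt.

    rewards: shape (num_prompts, group_size), one scalar reward per rollout.
    Each rollout is baselined against its own group, replacing a value network.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Illustrative example: 2 prompts, 4 rollouts each
rewards = torch.tensor([[0.1, 0.9, 0.5, 0.5],
                        [1.0, 0.0, 0.0, 0.0]])
print(group_relative_advantages(rewards))
```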

Inference & Training Loop

  • In-Loop vLLM: Embedded vLLM directly into the RL loop, achieving 16+ parallel rollouts per prompt and removing the generation bottleneck common in standard HF generate() loops (see the rollout sketch after this list).
  • Custom DDP Handling: Patched accelerate samplers to ensure correct RNG seeding and diversity across distributed GPU processes.
  • PEFT/LoRA: Targeted attention/MLP module adaptation to enable training 7B+ models on consumer-grade hardware.
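
A minimal sketch of the in-loop rollout call, assuming a standard vLLM setup: a single batched generate with n=16 produces one group of completions per prompt in one pass instead of 16 sequential HF generate() calls. The model id, sampling hyperparameters, and prompt are placeholders.

```python
from vllm import LLM, SamplingParams

# Placeholder model id; in practice the current policy checkpoint is loaded here.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", gpu_memory_utilization=0.85)

# n=16 parallel rollouts per prompt, generated in one batched call.
sampling = SamplingParams(n=16, temperature=1.0, top_p=0.95, max_tokens=1024)

prompts = ["Solve: what is 17 * 24?"]  # illustrative prompt
outputs = llm.generate(prompts, sampling)

# One group of rollouts per prompt, ready for reward scoring and the GRPO update.
rollouts = [completion.text for completion in outputs[0].outputs]
assert len(rollouts) == 16
```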

RLAIF (AI Feedback)

  • Async Judge System: Modular verifier using strong API models (e.g., Gemini 2.0 Flash) for dense scalar rewards.
  • Rate Limiting: Implemented semaphore-based orchestration to keep high-concurrency judge requests under API rate limits (see the asyncio sketch after this list).
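
A minimal sketch of the semaphore-based orchestration, assuming an asyncio pipeline: the semaphore caps in-flight judge requests so a burst of rollouts cannot exceed provider rate limits. The score_with_judge coroutine, concurrency cap, and dummy reward are hypothetical stand-ins for the actual Vertex AI / Gemini judge call.

```python
import asyncio

MAX_CONCURRENCY = 8  # illustrative cap chosen below the provider's rate limit
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def score_with_judge(prompt: str, completion: str) -> float:
    """Hypothetical judge call; in practice this wraps the Gemini client
    and parses a scalar reward out of the judge's response."""
    await asyncio.sleep(0.1)                      # stand-in for the network round trip
    return float(len(completion) % 10) / 10.0     # dummy scalar reward

async def bounded_score(prompt: str, completion: str) -> float:
    # Only MAX_CONCURRENCY requests are in flight at any time.
    async with semaphore:
        return await score_with_judge(prompt, completion)

async def score_rollouts(prompt: str, rollouts: list[str]) -> list[float]:
    return await asyncio.gather(*(bounded_score(prompt, r) for r in rollouts))

# rewards = asyncio.run(score_rollouts(prompt, rollouts))
```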

Key Technologies

  • Frameworks: PyTorch, Hugging Face (TRL, PEFT), Accelerate
  • Inference: vLLM, Vertex AI SDK
  • Algorithms: GRPO, LoRA, Curriculum Learning
  • Ops: WandB, Custom Asyncio Pipelines