Note: This gist outlines architectural patterns and infrastructure used for research. Source code and proprietary datasets are excluded per lab data policy.
Online RL Implementation
- GRPO (Group Relative Policy Optimization): Implemented group-based baselines for multi-turn agents, removing the need for a separate value network (critic); a short sketch of the group baseline and token-level loss masking follows this list.
- Loss Variants: Support for DAPO, BNPO, and DR-GRPO with token-level masking for long-context reasoning.
- Curriculum Learning: Dynamic `CurriculumDatasetCallback` to shift the training distribution from simple to complex tasks and prevent collapse (sketched after this list).
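
Below is a minimal sketch of the group-relative baseline and the token-level masking used by the loss variants. Shapes, the clipping constant, and the std-normalization choice are illustrative assumptions, not the lab's actual training code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one per rollout.

    Each rollout is scored against its own group's mean/std, so no learned
    value network (critic) is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def masked_policy_loss(logprobs, old_logprobs, advantages, completion_mask, clip_eps=0.2):
    """Clipped policy objective averaged over completion tokens only.

    logprobs / old_logprobs / completion_mask: (batch, seq_len);
    advantages: (batch,) flattened group-relative values.
    Token-level averaging (sum over tokens / sum of mask) is the style used by
    DAPO / DR-GRPO-like variants, rather than per-sequence averaging.
    """
    ratio = (logprobs - old_logprobs).exp()
    adv = advantages.unsqueeze(-1)
    per_token = -torch.min(ratio * adv, ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv)
    return (per_token * completion_mask).sum() / completion_mask.sum().clamp(min=1)
```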
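
And a rough sketch of what such a curriculum callback can look like, assuming a `transformers`-style trainer and a dataset wrapper that exposes a `set_max_difficulty()` hook; that interface is hypothetical, named here only for illustration.

```python
from transformers import TrainerCallback

class CurriculumDatasetCallback(TrainerCallback):
    """Raises the maximum task difficulty seen by the sampler as training progresses."""

    def __init__(self, curriculum_dataset, total_steps: int, num_levels: int = 4):
        self.dataset = curriculum_dataset  # wrapper with a set_max_difficulty() hook (assumed)
        self.total_steps = total_steps
        self.num_levels = num_levels

    def on_step_end(self, args, state, control, **kwargs):
        # Map training progress (0..1) onto discrete difficulty levels, simple -> complex.
        progress = state.global_step / max(self.total_steps, 1)
        level = min(int(progress * self.num_levels) + 1, self.num_levels)
        self.dataset.set_max_difficulty(level)
        return control
```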
Inference & Training Loop
- In-Loop vLLM: Embedded vLLM directly into the RL loop, achieving 16+ parallel rollouts per prompt and removing the generation bottleneck common in standard HF `generate()` loops (see the rollout sketch after this list).
- Custom DDP Handling: Patched `accelerate` samplers to ensure correct RNG seeding and sample diversity across distributed GPU processes.
- PEFT/LoRA: Targeted adaptation of attention/MLP modules to enable training 7B+ models on consumer-grade hardware (a config sketch also follows).
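
A sketch of the in-loop generation step, assuming one vLLM engine per GPU process. The model name and decoding parameters are placeholders, and the per-rank seed mirrors the cross-process diversity fix described above.

```python
import os
from vllm import LLM, SamplingParams

rank = int(os.environ.get("RANK", "0"))  # distributed rank; 0 in single-process runs

# One engine per process; a per-rank seed keeps rollouts from repeating across GPUs.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", seed=1234 + rank, gpu_memory_utilization=0.6)
sampling = SamplingParams(n=16, temperature=0.9, top_p=0.95, max_tokens=1024)

def rollout(prompts: list[str]) -> list[list[str]]:
    """Return 16 sampled completions per prompt from a single batched generate() call."""
    outputs = llm.generate(prompts, sampling)
    return [[completion.text for completion in out.outputs] for out in outputs]
```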
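
The LoRA setup is roughly the following; base model, rank, and alpha are illustrative, and module names follow Llama/Qwen-style layer naming.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder base model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Adapt both the attention projections and the MLP projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of a 7B model's weights
```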
RLAIF (AI Feedback)
- Async Judge System: Modular verifier using strong API models (e.g., Gemini 2.0 Flash) for dense scalar rewards.
- Rate Limiting: Implemented semaphore-based orchestration to handle high-concurrency judge requests without hitting API rate limits (a minimal asyncio sketch follows this list).
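
A minimal sketch of the semaphore-gated orchestration, with `score_with_judge` standing in for the real judge client; its name and signature are hypothetical.

```python
import asyncio

MAX_IN_FLIGHT = 8  # illustrative cap, tuned to the judge API's rate limit

async def score_with_judge(prompt: str, completion: str) -> float:
    """Stand-in for the real async judge call (e.g., a Gemini request whose
    reply is parsed into a scalar reward)."""
    await asyncio.sleep(0.1)  # simulate network latency
    return 1.0

async def score_batch(pairs: list[tuple[str, str]]) -> list[float]:
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(prompt: str, completion: str) -> float:
        async with semaphore:  # never more than MAX_IN_FLIGHT requests outstanding
            return await score_with_judge(prompt, completion)

    return await asyncio.gather(*(bounded(p, c) for p, c in pairs))

# rewards = asyncio.run(score_batch(prompt_completion_pairs))
```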
Tech Stack
- Frameworks: PyTorch, Hugging Face (TRL, PEFT), Accelerate
- Inference: vLLM, Vertex AI SDK
- Algorithms: GRPO, LoRA, Curriculum Learning
- Ops: WandB, Custom Asyncio Pipelines