Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Towards Async, Omni-Modal RL at Scale, Just Relax.


License Python 3.12 arXiv Documentation WeChat QR

📖 English | 📖 中文


Relax (Reinforcement Engine Leveraging Agentic X-modality) is a high-performance reinforcement learning post-training framework for multimodal large language models, open-sourced by the rednote AI platform. Built on Ray Serve with a service-oriented architecture, Relax uses Megatron-LM as the training backend and SGLang as the inference engine. The TransferQueue data transfer system fully decouples training from inference, enabling end-to-end multimodal RL training across text, images, video, and audio.


✨ Highlights

  • 🌐 Full Omni-Modal Training — One unified framework for text, vision, and audio RL — one of the few systems capable of end-to-end Omni model (Qwen3-Omni) post-training
  • ⚙️ Service-Oriented Six-Layer Architecture — Every role is an independent Ray Serve deployment, with native service-level elastic scheduling and fault recovery
  • Fully Async via TransferQueue — Rollout, Actor, ActorFwd, Reference, and Advantages run on independent GPU clusters with streaming data exchange and configurable staleness
  • 🤖 Agentic RL — Multi-turn interaction, loss masking, flexible termination, and VLM multimodal context carry-over for closed-loop "execute → observe → decide" training
  • 🔀 Elastic Rollout Scaling — Dynamically grow/shrink inference engines mid-training via HTTP REST API, with same-cluster (ray_native) and cross-cluster (external) federation modes
  • 🧠 Rich Algorithm Suite — GRPO, GSPO, SAPO, and On-Policy Distillation out of the box, with pluggable rewards and built-in GenRM (LLM-as-judge) mode
  • 🚀 Megatron + SGLang Backends — Megatron-LM (TP/PP/CP/EP) for MoE and deep models, SGLang for high-throughput inference, DCS for NCCL-broadcast weight sync
  • 📦 Production-Ready Ops — HealthManager auto-recovery, centralized Metrics Service (WandB / TensorBoard / ClearML), and Apprise real-time notifications

📢 News

📣 Updates
[04/15/2026] 🎉 Relax is now open-source!

🏗️ Architecture

[Figure: Relax Architecture]

Relax adopts a six-layer service-oriented architecture in which every role is an independent Ray Serve deployment, cleanly separating entrypoints, orchestration, components, engines, backends, and distributed capabilities:

| Layer | Responsibility |
| --- | --- |
| Entrypoints | train.py — signal handling, CLI parsing, Ray cluster connection, Controller launch |
| Orchestration | Controller (training loop, global restart), Service (placement groups, lifecycle), Registry (role & algorithm mapping) |
| Components | Ray Serve deployments: Actor, Rollout, Critic, ActorFwd, Advantages, GenRM |
| Engine | SGLang rollout engine, pluggable reward functions, request router, data filters |
| Backends | Megatron-LM training backend (TP/PP/CP/EP) and SGLang inference engine |
| Distributed | Ray Actor groups (RolloutManager / GenRMManager) and DCS (Distributed Checkpoint Service) for NCCL/GLOO weight sync |

Two execution modes are supported:

  • Colocate (Sync) — Actor and Rollout time-share the same GPUs; Rollout writes a full batch to TransferQueue, then yields GPUs for training. Memory-efficient for constrained hardware and strict on-policy (max_staleness=0).
  • Fully Async — Actor, Rollout, ActorFwd, Reference, and Advantages run on independent GPU clusters in parallel, exchanging data through TransferQueue and syncing weights asynchronously through DCS for maximum throughput with configurable staleness.

📖 Learn more: Architecture Guide · Fully Async Training · Elastic Rollout Scaling


🧠 Supported Algorithms

| Algorithm | Type | Description |
| --- | --- | --- |
| GRPO | Policy Optimization | Group Relative Policy Optimization |
| GSPO | Policy Optimization | Group Sequence Policy Optimization |
| SAPO | Policy Optimization | Sample-Aware Policy Optimization |
| On-Policy Distillation | Knowledge Transfer | Teacher-student KL penalty distillation |
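
To make the "group relative" idea behind GRPO concrete, here is a minimal sketch that normalizes each response's reward against the other responses sampled for the same prompt. It follows the generic textbook form of the advantage, not necessarily Relax's implementation; the function name, the use of population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to the same prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group mean get a positive advantage and the rest get a negative one; a group in which every response scores the same yields all-zero advantages and therefore no gradient signal.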

📖 Adding a new algorithm is straightforward — implement a service class, register it in the ALGOS registry, and you're done.
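
The register-then-look-up pattern described above might look roughly like the sketch below. Everything here is hypothetical: the dictionary only stands in for the real ALGOS registry, and the `register_algo` decorator, the `MyAlgoService` class, and its `compute_advantages` method are invented for illustration. Check the Relax source for the actual registry shape and the service base class.

```python
# Hypothetical stand-in for the ALGOS registry; the real one lives in Relax
# and may have a different shape.
ALGOS = {}


def register_algo(name):
    """Decorator that records a service class under `name` (illustrative)."""
    def decorator(cls):
        if name in ALGOS:
            raise ValueError(f"algorithm {name!r} is already registered")
        ALGOS[name] = cls
        return cls
    return decorator


@register_algo("my_algo")
class MyAlgoService:
    """Illustrative service class; the method name is an assumption."""

    def compute_advantages(self, rewards):
        # Simple mean baseline, used here only to have something to call.
        baseline = sum(rewards) / len(rewards)
        return [r - baseline for r in rewards]


# Look the algorithm up by name, as a controller might do from a CLI flag.
service = ALGOS["my_algo"]()
print(service.compute_advantages([1.0, 0.0, 0.5]))
```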


🤖 Supported Models

Relax is designed for omni-modal RL training — text, vision, and audio in one unified framework. Multimodal data is configured via the --multimodal-keys flag, with complete image/video/audio processing pipelines under relax/utils/multimodal/ for fine-grained control over image token counts, video frame sampling, and audio sample rates.

| Model Family | Sizes | Modality | Typical Tasks | Backend |
| --- | --- | --- | --- | --- |
| Qwen3 | 4B, 30B-A3B (MoE) | Text | Math reasoning, code, multi-turn dialogue, tool use | Megatron |
| Qwen3-VL | 4B, 30B-A3B | Vision + Language | Visual QA, image understanding, multimodal reasoning | Megatron |
| Qwen3.5 | 30B-A3B | Vision + Language | Visual QA, image understanding, multimodal reasoning | Megatron |
| Qwen3-Omni | 30B-A3B | Text + Vision + Audio | Audio-visual QA, omni-modal understanding | Megatron |

📖 New architectures are integrated via Megatron Bridge for automatic HF ↔ Megatron weight conversion.


📦 Installation

The recommended way to run Relax is via the official Docker image, which ships with all CUDA, PyTorch, Megatron-LM, SGLang, and Ray dependencies pre-installed and version-matched.

# Pull the official image
docker pull relaxrl/relax:latest

# Launch a container with GPUs, shared memory, and your workspace mounted
docker run -it --gpus all --ipc=host --network=host \
  -v /path/to/your/workspace:/root \
  relaxrl/relax:latest bash

# Inside the container
git clone https://github.com/redai-infra/Relax.git /root/Relax
cd /root/Relax && pip install -e .

📖 For GPU driver requirements, multi-node setup, and persistent storage mounts, see the Installation Guide.


🚀 Quick Start

Three end-to-end tasks cover text, vision-language, and omni-modal training. Each task downloads a public HuggingFace dataset and model, then launches training with a single script. Set EXP_DIR=/root (or wherever your models and datasets live) and the scripts will locate them automatically.

Task 1 — DAPO Math (Text, 8 GPUs)

Train Qwen3-4B on dapo-math-17k with GRPO. Reward is rule-based answer extraction plus symbolic math verification.

hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

cd /root/Relax && export EXP_DIR=/root
bash scripts/training/text/run-qwen3-4B-8xgpu.sh

Task 2 — Open-R1 (Vision-Language, 8 GPUs)

Train Qwen3-VL-4B on multimodal-open-r1-8k-verified with GRPO using the openr1mm reward.

hf download --repo-type dataset lmms-lab/multimodal-open-r1-8k-verified \
  --local-dir /root/multimodal-open-r1-8k-verified
hf download Qwen/Qwen3-VL-4B-Instruct --local-dir /root/Qwen3-VL-4B-Instruct

cd /root/Relax && export EXP_DIR=/root
bash scripts/training/multimodal/run-qwen3-vl-4B-8xgpu.sh

Task 3 — AVQA (Omni-Modal: Image + Audio, 16 GPUs / 2 nodes)

Train Qwen3-Omni-30B-A3B on AVQA-R1-6K with GRPO and a multiple-choice reward.

hf download --repo-type dataset harryhsing/AVQA-R1-6K --local-dir /root/AVQA-R1-6K
hf download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir /root/Qwen3-Omni-30B-A3B-Instruct

cd /root/Relax && export EXP_DIR=/root
bash -x scripts/entrypoint/spmd-multinode.sh \
  scripts/training/multimodal/run-qwen3-30B-A3B-omni-16xgpu.sh

Once running, you should see logs like:

Finish rollout 0/200
training step 0/200

Checkpoints are saved in Megatron DCP format; convert them to HuggingFace weights with scripts/tools/convert_torch_dist_to_hf_bridge.py.

📖 Full walkthrough: Quick Start Guide · Customize Training · Configuration Guide


⚡ Key Features

Fully Async Training via TransferQueue

In fully-async mode, Rollout, Actor, ActorFwd, Reference, and Advantages run on independent GPU clusters in parallel. Three mechanisms make this efficient:

  • StreamingDataLoader — Actor begins consuming samples as Rollout incrementally writes them to TransferQueue, eliminating GPU idle time between phases.
  • Configurable staleness — --max-staleness bounds how far off-policy the training data may drift, trading on-policy accuracy against throughput.
  • DCS weight sync — After each training step, weights are NCCL-broadcast from Actor to Rollout/ActorFwd/Reference via the Distributed Checkpoint Service, overlapped with the next training computation.
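
One way to picture the staleness bound is as a version-lag check on each sample. This is a minimal sketch, assuming staleness is measured as the trainer's weight version minus the version that generated the sample, and that over-stale samples are simply dropped. The `Sample` class and `take_fresh` function are hypothetical, a plain `deque` stands in for TransferQueue, and Relax's own definition may differ.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    prompt_id: int
    policy_version: int  # weight version that generated this sample


def take_fresh(queue, trainer_version, max_staleness, batch_size):
    """Pop up to batch_size samples whose version lag is within max_staleness."""
    batch = []
    while queue and len(batch) < batch_size:
        sample = queue.popleft()
        lag = trainer_version - sample.policy_version
        if lag <= max_staleness:
            batch.append(sample)
        # Samples that are too stale are dropped in this sketch.
    return batch


# Six samples produced by policy versions 0, 0, 1, 1, 2, 2.
queue = deque(Sample(i, policy_version=i // 2) for i in range(6))
batch = take_fresh(queue, trainer_version=2, max_staleness=1, batch_size=4)
print([s.prompt_id for s in batch])
```

With `max_staleness=0` the same check admits only samples generated by the current weights, which corresponds to the strict on-policy setting.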

Agentic RL

Relax provides first-class support for multi-turn, closed-loop "execute → observe → decide" training:

  • Multi-turn sampling with loss masking — model outputs (mask=1) are cleanly separated from environment observations (mask=0) so only model actions participate in training.
  • Environment / Rollout decoupling — a standard BaseInteractionEnv interface (reset, step, format_observation) lets environments evolve independently of the sampler.
  • VLM multimodal context carry-over — image_data on the Rollout side and multimodal_train_inputs on the training side are incrementally merged each turn so visual observations concatenate correctly.
  • Flexible termination — combine max_turns, token-budget exhaustion, and env-signalled done. The DeepEyes example demonstrates Agentic multi-turn GRPO with Qwen3-VL-30B-A3B.
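
The interplay between the environment interface, loss masking, and termination can be sketched with a toy loop. The method names reset, step, and format_observation come from the description above, but their signatures here are assumptions, and `CountdownEnv`, `toy_policy`, and `rollout` are hypothetical helpers rather than Relax classes. Tokens are plain strings to keep the example self-contained.

```python
class CountdownEnv:
    """Toy environment with the reset / step / format_observation shape.

    Signatures are assumptions; Relax's BaseInteractionEnv may differ.
    """

    def reset(self):
        self.remaining = 2
        return "start"

    def step(self, action):
        self.remaining -= 1
        done = self.remaining == 0  # env-signalled termination
        return f"saw:{action}", done

    def format_observation(self, observation):
        return ["<obs>", observation, "</obs>"]  # pretend tokens


def toy_policy(context):
    """Stand-in for model generation: emits two tokens per turn."""
    return ["act", str(len(context))]


def rollout(env, policy, max_turns):
    tokens, loss_mask = [], []
    obs_tokens = env.format_observation(env.reset())
    tokens += obs_tokens
    loss_mask += [0] * len(obs_tokens)      # observations: excluded from loss
    for _ in range(max_turns):              # max_turns termination
        action = policy(tokens)
        tokens += action
        loss_mask += [1] * len(action)      # model outputs: included in loss
        observation, done = env.step(" ".join(action))
        if done:
            break
        obs_tokens = env.format_observation(observation)
        tokens += obs_tokens
        loss_mask += [0] * len(obs_tokens)
    return tokens, loss_mask


tokens, loss_mask = rollout(CountdownEnv(), toy_policy, max_turns=5)
print(loss_mask)
```

Only positions with mask 1 would contribute to the policy loss, so the model is never trained to reproduce text the environment wrote.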

Elastic Rollout Scaling

Because the Rollout phase typically accounts for 60–70% of RL training time, Relax exposes HTTP REST APIs to dynamically add or remove inference engines mid-training without interrupting the training loop:

  • ray_native mode — specify a target engine count; Relax allocates resources and launches new SGLang engines inside the current Ray cluster.
  • external mode — register SGLang engines already deployed in other clusters for cross-cluster federated inference on preemptible or idle resources.

Scaling operations are asynchronous, idempotent, and mutually exclusive, and they support graceful drain-and-remove plus cancellation with rollback. Engines created from startup parameters are protected; only dynamically added engines can be scaled in.
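
The scaling rules above (idempotent targets, one operation at a time, protected startup engines) can be modelled in a few lines. This is a toy in-memory model, not Relax's scaler: the `EnginePool` class and its `scale_to` method are hypothetical, engines are just strings, and no REST endpoints are shown because their paths are not given here.

```python
import threading


class EnginePool:
    """Toy model of the scaling rules; not Relax's actual scaler."""

    def __init__(self, startup_engines):
        self.protected = list(startup_engines)  # never scaled in
        self.dynamic = []                       # added at runtime
        self._lock = threading.Lock()           # one scaling op at a time
        self._next_id = 0

    def scale_to(self, target):
        """Idempotently move the total engine count toward `target`."""
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("another scaling operation is in progress")
        try:
            # Startup engines set a floor on the pool size.
            target = max(target, len(self.protected))
            while len(self.protected) + len(self.dynamic) < target:
                self.dynamic.append(f"dyn-{self._next_id}")
                self._next_id += 1
            while len(self.protected) + len(self.dynamic) > target:
                self.dynamic.pop()  # a real system would drain requests first
            return self.protected + self.dynamic
        finally:
            self._lock.release()


pool = EnginePool(["base-0", "base-1"])
print(pool.scale_to(4))  # grows by two dynamic engines
print(pool.scale_to(4))  # same target again: no change
print(pool.scale_to(1))  # scale-in stops at the protected engines
```

Expressing the request as a target count rather than a delta is what makes a retried call harmless in this sketch.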

Megatron Training Backend & SGLang Inference

Training uses Megatron-LM with full Tensor / Pipeline / Context / Expert parallelism for MoE and ultra-deep models. Inference uses SGLang with process-lifecycle management. New model architectures plug in through Megatron Bridge for automatic HF ↔ Megatron weight conversion.

Pluggable Reward Hub

Relax ships built-in rewards for math (DeepScaler, DAPO), GPQA, F1, IFBench, multiple-choice, multimodal Open-R1, and GenRM (generative LLM-as-judge). Add a custom reward by dropping a single file into relax/engine/rewards/.
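
As a rough idea of what a single-file custom reward could contain, here is a sketch of a multiple-choice scorer. The function name, its `(response, label)` signature, and the `<answer>…</answer>` output convention are all assumptions for illustration; the interface Relax actually expects from files in relax/engine/rewards/ should be taken from the existing rewards there.

```python
import re

# Assumed output convention: the model wraps its final choice in <answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*([A-Da-d])\s*</answer>")


def multiple_choice_reward(response: str, label: str) -> float:
    """Return 1.0 if the last tagged choice matches the label, else 0.0."""
    matches = ANSWER_RE.findall(response)
    if not matches:
        return 0.0  # no parsable answer counts as wrong
    return 1.0 if matches[-1].upper() == label.strip().upper() else 0.0


print(multiple_choice_reward("The dog barks, so <answer>b</answer>", "B"))
print(multiple_choice_reward("Probably B.", "B"))
```

Scoring only the last tagged answer lets the model revise itself mid-response, and requiring the tag avoids rewarding a stray letter in the reasoning text.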

Production Operations

  • HealthManager — heartbeat monitoring with two-tier auto-recovery (in-place restart first, global restart as fallback).
  • Metrics Service — centralized Ray Serve deployment that fans out to TensorBoard, WandB, and ClearML.
  • Notifications — real-time training alerts via Apprise (Slack, WeChat, email, and more).
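
The two-tier recovery policy can be illustrated with a small heartbeat tracker. This is a minimal sketch under stated assumptions: `HeartbeatMonitor`, its methods, and the action names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the real HealthManager may track state differently.

```python
class HeartbeatMonitor:
    """Toy two-tier policy: restart a role in place, then escalate."""

    def __init__(self, timeout_s, max_local_restarts=1):
        self.timeout_s = timeout_s
        self.max_local_restarts = max_local_restarts
        self.last_beat = {}  # role -> timestamp of last heartbeat
        self.restarts = {}   # role -> in-place restarts already attempted

    def beat(self, role, now):
        self.last_beat[role] = now

    def check(self, now):
        """Return (action, role) for the first unhealthy role, if any."""
        for role, seen in self.last_beat.items():
            if now - seen <= self.timeout_s:
                continue
            if self.restarts.get(role, 0) < self.max_local_restarts:
                self.restarts[role] = self.restarts.get(role, 0) + 1
                self.last_beat[role] = now  # fresh window after the restart
                return ("restart_in_place", role)
            return ("global_restart", role)  # local recovery exhausted
        return ("healthy", None)


monitor = HeartbeatMonitor(timeout_s=10)
monitor.beat("rollout", now=0)
monitor.beat("actor", now=0)
print(monitor.check(now=5))    # everything within the timeout
monitor.beat("actor", now=12)
print(monitor.check(now=15))   # rollout missed its window: first tier
print(monitor.check(now=30))   # still silent after restart: second tier
```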

📚 Documentation

Full bilingual documentation is available at redai-infra.github.io/Relax.


🧪 Examples

| Example | Description |
| --- | --- |
| DeepEyes | Multi-modal vision-language RL with Qwen3-VL |
| On-Policy Distillation | Teacher-student knowledge distillation via KL penalty |

🤝 Contributing

We welcome contributions of all kinds! Please read our Contributing Guide to get started.


📝 Citation

If you find Relax useful in your research, please cite:

@software{relax2026,
  title  = {Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale},
  author = {Relax Contributors},
  url    = {https://github.com/redai-infra/Relax},
  year   = {2026}
}

📜 License

This project is licensed under the Apache License 2.0.


🙏 Acknowledgements

Relax stands on the shoulders of excellent open-source projects:

  • Slime — Scalable training and inference framework for reinforcement learning
  • SGLang — Fast serving framework for large language models
  • Megatron-LM & Megatron-Bridge — Large-scale distributed training framework and HF ↔ Megatron weight conversion bridge, with sincere thanks to the entire NVIDIA team
  • TransferQueue — High-performance distributed data transfer queue
  • Ray — Distributed computing framework
  • HuggingFace Transformers — State-of-the-art model hub

We sincerely thank all contributors and the open-source community for making this project possible.