Relax/README.md at main · li126com/Relax

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Towards Async, Omni-Modal RL at Scale, Just Relax.

Relax (Reinforcement Engine Leveraging Agentic X-modality) is a high-performance reinforcement learning post-training framework open-sourced by the rednote AI platform for multimodal large language models. Built on Ray Serve with a service-oriented architecture, Relax uses Megatron-LM as the training backend and SGLang as the inference engine. Through the TransferQueue data transfer system, it achieves complete decoupling of training and inference, supporting end-to-end multimodal RL training from text to images, videos, and audio.

✨ Highlights

🌐 Full Omni-Modal Training — One unified framework for text, vision, and audio RL — one of the few systems capable of end-to-end Omni model (Qwen3-Omni) post-training
⚙️ Service-Oriented Six-Layer Architecture — Every role is an independent Ray Serve deployment, with native service-level elastic scheduling and fault recovery
⚡ Fully Async via TransferQueue — Rollout, Actor, ActorFwd, Reference, and Advantages run on independent GPU clusters with streaming data exchange and configurable staleness
🤖 Agentic RL — Multi-turn interaction, loss masking, flexible termination, and VLM multimodal context carry-over for closed-loop "execute → observe → decide" training
🔀 Elastic Rollout Scaling — Dynamically grow/shrink inference engines mid-training via HTTP REST API, with same-cluster (ray_native) and cross-cluster (external) federation modes
🧠 Rich Algorithm Suite — GRPO, GSPO, SAPO, and On-Policy Distillation out of the box, with pluggable rewards and built-in GenRM (LLM-as-judge) mode
🚀 Megatron + SGLang Backends — Megatron-LM (TP/PP/CP/EP) for MoE and deep models, SGLang for high-throughput inference, DCS for NCCL-broadcast weight sync
📦 Production-Ready Ops — HealthManager auto-recovery, centralized Metrics Service (WandB / TensorBoard / ClearML), and Apprise real-time notifications

📢 News

📣 Updates
[04/15/2026] 🎉 Relax is now open-source!

🏗️ Architecture

Relax adopts a six-layer service-oriented architecture where every role is deployed as an independent Ray Serve deployment, cleanly separating orchestration, components, engines, backends, and distributed capabilities:

Layer	Responsibility
Entrypoints	`train.py` — signal handling, CLI parsing, Ray cluster connection, Controller launch
Orchestration	`Controller` (training loop, global restart), `Service` (placement groups, lifecycle), `Registry` (role & algorithm mapping)
Components	Ray Serve deployments: Actor, Rollout, Critic, ActorFwd, Advantages, GenRM
Engine	SGLang rollout engine, pluggable reward functions, request router, data filters
Backends	Megatron-LM training backend (TP/PP/CP/EP) and SGLang inference engine
Distributed	Ray Actor groups (RolloutManager / GenRMManager) and DCS (Distributed Checkpoint Service) for NCCL/GLOO weight sync

Two execution modes are supported:

Colocate (Sync) — Actor and Rollout time-share the same GPUs; Rollout writes a full batch to TransferQueue, then yields GPUs for training. Memory-efficient for constrained hardware and strict on-policy (max_staleness=0).
Fully Async — Actor, Rollout, ActorFwd, Reference, and Advantages run on independent GPU clusters in parallel, exchanging data through TransferQueue and syncing weights asynchronously through DCS for maximum throughput with configurable staleness.

📖 Learn more: Architecture Guide · Fully Async Training · Elastic Rollout Scaling

🧠 Supported Algorithms

Algorithm	Type	Description
GRPO	Policy Optimization	Group Relative Policy Optimization
GSPO	Policy Optimization	Group Sample Policy Optimization
SAPO	Policy Optimization	Sample-Aware Policy Optimization
On-Policy Distillation	Knowledge Transfer	Teacher-student KL penalty distillation

📖 Adding a new algorithm is straightforward — implement a service class, register it in the ALGOS registry, and you're done.

🤖 Supported Models

Relax is designed for omni-modal RL training — text, vision, and audio in one unified framework. Multimodal data is configured via the --multimodal-keys flag, with complete image/video/audio processing pipelines under relax/utils/multimodal/ for fine-grained control over image token counts, video frame sampling, and audio sample rates.

Model Family	Sizes	Modality	Typical Tasks	Backend
Qwen3	4B, 30B-A3B (MoE)	Text	Math reasoning, code, multi-turn dialogue, tool use	Megatron
Qwen3-VL	4B, 30B-A3B	Vision + Language	Visual QA, image understanding, multimodal reasoning	Megatron
Qwen3.5	30B-A3B	Vision + Language	Visual QA, image understanding, multimodal reasoning	Megatron
Qwen3-Omni	30B-A3B	Text + Vision + Audio	Audio-visual QA, omni-modal understanding	Megatron

📖 New architectures are integrated via Megatron Bridge for automatic HF ↔ Megatron weight conversion.

📦 Installation

The recommended way to run Relax is via the official Docker image, which ships with all CUDA, PyTorch, Megatron-LM, SGLang, and Ray dependencies pre-installed and version-matched.

# Pull the official image
docker pull relaxrl/relax:latest

# Launch a container with GPUs, shared memory, and your workspace mounted
docker run -it --gpus all --ipc=host --network=host \
  -v /path/to/your/workspace:/root \
  relaxrl/relax:latest bash

# Inside the container
git clone https://github.com/redai-infra/Relax.git /root/Relax
cd /root/Relax && pip install -e .

📖 For GPU driver requirements, multi-node setup, and persistent storage mounts, see the Installation Guide.

🚀 Quick Start

Three end-to-end tasks cover text, vision-language, and omni-modal training. Each task downloads a public HuggingFace dataset and model, then launches training with a single script. Set EXP_DIR=/root (or wherever your models and datasets live) and the scripts will locate them automatically.

Task 1 — DAPO Math (Text, 8 GPUs)

Train Qwen3-4B on dapo-math-17k with GRPO. Reward is rule-based answer extraction plus symbolic math verification.

hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

cd /root/Relax && export EXP_DIR=/root
bash scripts/training/text/run-qwen3-4B-8xgpu.sh

Task 2 — Open-R1 (Vision-Language, 8 GPUs)

Train Qwen3-VL-4B on multimodal-open-r1-8k-verified with GRPO using the openr1mm reward.

hf download --repo-type dataset lmms-lab/multimodal-open-r1-8k-verified \
  --local-dir /root/multimodal-open-r1-8k-verified
hf download Qwen/Qwen3-VL-4B-Instruct --local-dir /root/Qwen3-VL-4B-Instruct

cd /root/Relax && export EXP_DIR=/root
bash scripts/training/multimodal/run-qwen3-vl-4B-8xgpu.sh

Task 3 — AVQA (Omni-Modal: Image + Audio, 16 GPUs / 2 nodes)

Train Qwen3-Omni-30B-A3B on AVQA-R1-6K with GRPO and a multiple-choice reward.

hf download --repo-type dataset harryhsing/AVQA-R1-6K --local-dir /root/AVQA-R1-6K
hf download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir /root/Qwen3-Omni-30B-A3B-Instruct

cd /root/Relax && export EXP_DIR=/root
bash -x scripts/entrypoint/spmd-multinode.sh \
  scripts/training/multimodal/run-qwen3-30B-A3B-omni-16xgpu.sh

Once running, you should see logs like:

Finish rollout 0/200
training step 0/200

Checkpoints are saved in Megatron DCP format; convert them to HuggingFace weights with scripts/tools/convert_torch_dist_to_hf_bridge.py.

📖 Full walkthrough: Quick Start Guide · Customize Training · Configuration Guide

⚡ Key Features

Fully Async Training via TransferQueue

In fully-async mode, Rollout, Actor, ActorFwd, Reference, and Advantages run on independent GPU clusters in parallel. Three mechanisms make this efficient:

StreamingDataLoader — Actor begins consuming samples as Rollout incrementally writes them to TransferQueue, eliminating GPU idle time between phases.
Configurable staleness — --max-staleness precisely controls how off-policy training data can drift, flexibly balancing on-policy accuracy and throughput.
DCS weight sync — After each training step, weights are NCCL-broadcast from Actor to Rollout/ActorFwd/Reference via the Distributed Checkpoint Service, overlapped with the next training computation.

Agentic RL

Relax provides first-class support for multi-turn, closed-loop "execute → observe → decide" training:

Multi-turn sampling with loss masking — model outputs (mask=1) are cleanly separated from environment observations (mask=0) so only model actions participate in training.
Environment / Rollout decoupling — a standard BaseInteractionEnv interface (reset, step, format_observation) lets environments evolve independently of the sampler.
VLM multimodal context carry-over — image_data on the Rollout side and multimodal_train_inputs on the training side are incrementally merged each turn so visual observations concatenate correctly.
Flexible termination — combine max_turns, token-budget exhaustion, and env-signalled done. The DeepEyes example demonstrates Agentic multi-turn GRPO with Qwen3-VL-30B-A3B.

Elastic Rollout Scaling

Since 60–70% of RL training time is spent in the Rollout phase, Relax exposes HTTP REST APIs to dynamically add or remove inference engines mid-training without interrupting the training loop:

ray_native mode — specify a target engine count; Relax allocates resources and launches new SGLang engines inside the current Ray cluster.
external mode — register SGLang engines already deployed in other clusters for cross-cluster federated inference on preemptible or idle resources.

Scaling is asynchronous, idempotent, mutually exclusive, and supports graceful drain-and-remove plus cancellation with rollback. Engines from startup parameters are protected; only dynamically added engines can be scaled in.

Megatron Training Backend & SGLang Inference

Training uses Megatron-LM with full Tensor / Pipeline / Context / Expert parallelism for MoE and ultra-deep models. Inference uses SGLang with process-lifecycle management. New model architectures plug in through Megatron Bridge for automatic HF ↔ Megatron weight conversion.

Pluggable Reward Hub

Built-in rewards for math (DeepScaler, DAPO), GPQA, F1, IFBench, multiple-choice, multimodal Open-R1, and GenRM (generative LLM-as-judge). Add a custom reward by dropping a single file into relax/engine/rewards/.

Production Operations

HealthManager — heartbeat monitoring with two-tier auto-recovery (in-place restart first, global restart as fallback).
Metrics Service — centralized Ray Serve deployment that fans out to TensorBoard, WandB, and ClearML.
Notifications — real-time training alerts via Apprise (Slack, WeChat, email, and more).

📚 Documentation

Full bilingual documentation is available at redai-infra.github.io/Relax.

🧪 Examples

Example	Description
DeepEyes	Multi-modal vision-language RL with Qwen3-VL
On-Policy Distillation	Teacher-student knowledge distillation via KL penalty

🤝 Contributing

We welcome contributions of all kinds! Please read our Contributing Guide to get started.

📝 Citation

If you find Relax useful in your research, please cite:

@software{relax2026,
  title  = {Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale},
  author = {Relax Contributors},
  url    = {https://github.com/redai-infra/Relax},
  year   = {2026}
}

📜 License

This project is licensed under the Apache License 2.0.

🙏 Acknowledgements

Relax is built upon the shoulders of excellent open-source projects:

Slime — Scalable training and inference framework for reinforcement learning
SGLang — Fast serving framework for large language models
Megatron-LM & Megatron-Bridge — Large-scale distributed training framework and HF ↔ Megatron weight conversion bridge, with sincere thanks to the entire NVIDIA team
TransferQueue — High-performance distributed data transfer queue
Ray — Distributed computing framework
HuggingFace Transformers — State-of-the-art model hub

We sincerely thank all contributors and the open-source community for making this project possible.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

✨ Highlights

📢 News

🏗️ Architecture

🧠 Supported Algorithms

🤖 Supported Models

📦 Installation

🚀 Quick Start

Task 1 — DAPO Math (Text, 8 GPUs)

Task 2 — Open-R1 (Vision-Language, 8 GPUs)

Task 3 — AVQA (Omni-Modal: Image + Audio, 16 GPUs / 2 nodes)

⚡ Key Features

Fully Async Training via TransferQueue

Agentic RL

Elastic Rollout Scaling

Megatron Training Backend & SGLang Inference

Pluggable Reward Hub

Production Operations

📚 Documentation

🧪 Examples

🤝 Contributing

📝 Citation

📜 License

🙏 Acknowledgements

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

✨ Highlights

📢 News

🏗️ Architecture

🧠 Supported Algorithms

🤖 Supported Models

📦 Installation

🚀 Quick Start

Task 1 — DAPO Math (Text, 8 GPUs)

Task 2 — Open-R1 (Vision-Language, 8 GPUs)

Task 3 — AVQA (Omni-Modal: Image + Audio, 16 GPUs / 2 nodes)

⚡ Key Features

Fully Async Training via TransferQueue

Agentic RL

Elastic Rollout Scaling

Megatron Training Backend & SGLang Inference

Pluggable Reward Hub

Production Operations

📚 Documentation

🧪 Examples

🤝 Contributing

📝 Citation

📜 License

🙏 Acknowledgements