Add glm5 70k 300 triton a8w8 blockscale configs #2743
Open
amd-pedghazi wants to merge 6 commits into ROCm:main from
Conversation
Add FlyDSL-based `fused_qk_rope_reshape_and_cache` as a drop-in replacement for the Triton version, with automatic Triton fallback for unsupported features (GPT-J rotation, offsets, KV scaling, zeros output, full-dim cos/sin, T>1).

Performance: 1.59x average speedup over Triton for decode (T=1) across Llama-8B/70B/405B and Qwen3-72B/235B model configs.

Tests aligned with the Triton test_fused_qk_rope_reshape_and_cache suite: same input generation, reference, tolerance, and KV cache validation; 53 test cases. Requires flydsl >= 0.1.3 (no modifications needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
🏷️ CI Guide
Runs automatically on every PR:
Extended tests (opt-in via labels):
Contributor
Hi @amd-pedghazi. Does your PR have any connection to FlyDSL? I'm seeing some FlyDSL-related changes in the diff... From the PR description, I understand that your intention is just to add new configs to
Motivation
Add missing Triton GEMM A8W8 blockscale configs for GLM-5 model shapes on gfx942 (MI300X).
Technical Details
Six new shape-specialized JSON configs are added to `aiter/ops/triton/configs/gemm/` for the (N, K) pairs (2048, 2048), (2624, 6144), (3072, 6144), (3584, 512), (6144, 1536), and (6144, 2048), tuned on MI300X using the Triton tuning framework. Without these, all six shapes fall back to the generic default, which has only a single "any" bucket (BLOCK_SIZE_M=128, NUM_KSPLIT=1) and is inefficient for small-M decode.
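As a rough illustration of the per-M-bucket structure described above, a shape-specialized config file might look like the sketch below. The bucket keys and tuning fields here are illustrative assumptions only, not the exact schema used by aiter; the existing files in `aiter/ops/triton/configs/gemm/` are authoritative.

```json
{
  "32":  { "BLOCK_SIZE_M": 16,  "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 1, "NUM_KSPLIT": 4, "num_warps": 4, "num_stages": 2 },
  "any": { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 8, "NUM_KSPLIT": 1, "num_warps": 8, "num_stages": 2 }
}
```

The key idea is that small-M (decode) requests get a small BLOCK_SIZE_M plus split-K, while large-M (prefill) shapes keep a default-style tiling.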
Test Plan
Configs follow the existing format and are loaded via get_gemm_config(). Verified with the AITER microbenchmark (op_tests/op_benchmarks/triton/bench_gemm_a8w8_blockscale.py) and with end-to-end serving benchmarks on GLM-5-FP8 (8xMI300X, SGLang).
Setup:
rocm/sgl-dev:v0.5.9-rocm720-mi30x-20260309, SGLang from amd-pedghazi/sglang@aiter-glm5-attention-fixes, TP=8, aiter attention backend. The only difference between runs is the presence or absence of the 6 JSON config files.

Test Result
The microbenchmark shows a ~0.4% average per-kernel gain, with the best shape (N=6144, K=1536) improving by up to +8.4%.
End-to-end serving (GLM-5-FP8, input=70k tokens, output=300 tokens, request_rate=inf):
The large end-to-end gains (especially at low concurrency) are because the default fallback config uses BLOCK_SIZE_M=128 with NUM_KSPLIT=1 for all M values, while the tuned configs provide per-M-bucket optimization with smaller block sizes and split-K for decode (M=1-32).
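The bucket-selection behavior described above can be sketched in a few lines of Python. This is an illustrative model, not the aiter implementation: the function name `pick_config`, the table layout, and the bucket values are all hypothetical, chosen to mirror the "smallest fitting M bucket, else the generic any entry" behavior the PR describes.

```python
# Illustrative sketch (not aiter's actual code): per-M-bucket config lookup.
def pick_config(m: int, table: dict) -> dict:
    """Return the tuned config for the smallest M bucket that fits,
    falling back to the generic "any" entry."""
    for bound, cfg in sorted((int(k), v) for k, v in table.items() if k != "any"):
        if m <= bound:
            return cfg
    return table["any"]

# Hypothetical table mirroring the behavior described in the PR: small-M
# decode gets a small block size plus split-K, while large M keeps the
# default-style BLOCK_SIZE_M=128 with NUM_KSPLIT=1.
table = {
    "32":  {"BLOCK_SIZE_M": 16,  "NUM_KSPLIT": 4},
    "any": {"BLOCK_SIZE_M": 128, "NUM_KSPLIT": 1},
}

print(pick_config(1, table))     # decode-sized M hits the tuned small bucket
print(pick_config(4096, table))  # prefill-sized M falls back to "any"
```

With only an "any" bucket (as in the generic default), every M value, including M=1 decode, would receive the large-tile, no-split-K config, which is exactly the inefficiency the tuned files remove.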
Submission Checklist