Add glm5 70k 300 triton a8w8 blockscale configs #2743
Open
amd-pedghazi wants to merge 6 commits into ROCm:main from
Conversation
Add FlyDSL-based `fused_qk_rope_reshape_and_cache` as a drop-in replacement for the Triton version, with automatic Triton fallback for unsupported features (GPT-J rotation, offsets, KV scaling, zeros output, full-dim cos/sin, T>1).

Performance: 1.59x average speedup over Triton for decode (T=1) across Llama-8B/70B/405B and Qwen3-72B/235B model configs.

Tests aligned with the Triton test_fused_qk_rope_reshape_and_cache suite: same input generation, reference, tolerance, and KV cache validation; 53 test cases. Requires flydsl >= 0.1.3 (no modifications needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
🏷️ CI Guide
Runs automatically on every PR:
Extended tests (opt-in via labels):
Contributor
Hi @amd-pedghazi. Does your PR have any connection to FlyDSL? I'm seeing some FlyDSL-related changes in the diff... From the PR description, I understand that your intention is just to add new configs to
Motivation
Add missing Triton GEMM A8W8 blockscale configs for GLM-5 model shapes on gfx942 (MI300X).
Technical Details
Six new shape-specialized JSON configs are added to `aiter/ops/triton/configs/gemm/` for the (N, K) pairs (2048, 2048), (2624, 6144), (3072, 6144), (3584, 512), (6144, 1536), and (6144, 2048), tuned on MI300X using the Triton tuning framework. Without these, all six shapes fall back to the generic default, which has only a single "any" bucket (BLOCK_SIZE_M=128, NUM_KSPLIT=1) and is inefficient for small-M decode.
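As a rough illustration of the per-M-bucket structure described above, a shape-specialized config file might look like the sketch below. The bucket keys and tuning fields here are illustrative assumptions only, not the exact schema used by aiter; the existing files in `aiter/ops/triton/configs/gemm/` are authoritative.

```json
{
  "32":  { "BLOCK_SIZE_M": 16,  "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 1, "NUM_KSPLIT": 4, "num_warps": 4, "num_stages": 2 },
  "any": { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 8, "NUM_KSPLIT": 1, "num_warps": 8, "num_stages": 2 }
}
```

The key idea is that small-M (decode) requests get a small BLOCK_SIZE_M plus split-K, while large-M (prefill) shapes keep a default-style tiling.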
Test Plan
Configs follow the existing format and are loaded via get_gemm_config(). Verified with the AITER microbenchmark (op_tests/op_benchmarks/triton/bench_gemm_a8w8_blockscale.py) and with end-to-end serving benchmarks on GLM-5-FP8 (8xMI300X, SGLang).
Setup:
rocm/sgl-dev:v0.5.9-rocm720-mi30x-20260309, SGLang from amd-pedghazi/sglang@aiter-glm5-attention-fixes, TP=8, aiter attention backend. The only difference between runs is the presence or absence of the 6 JSON config files.

Test Result
The microbenchmark shows a ~0.4% average per-kernel gain, with the best shape (N=6144, K=1536) improving by up to +8.4%.
End-to-end serving (GLM-5-FP8, input=70k tokens, output=300 tokens, request_rate=inf):
The large end-to-end gains (especially at low concurrency) are because the default fallback config uses BLOCK_SIZE_M=128 with NUM_KSPLIT=1 for all M values, while the tuned configs provide per-M-bucket optimization with smaller block sizes and split-K for decode (M=1-32).
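The bucket-selection behavior described above can be sketched in a few lines of Python. This is an illustrative model, not the aiter implementation: the function name `pick_config`, the table layout, and the bucket values are all hypothetical, chosen to mirror the "smallest fitting M bucket, else the generic any entry" behavior the PR describes.

```python
# Illustrative sketch (not aiter's actual code): per-M-bucket config lookup.
def pick_config(m: int, table: dict) -> dict:
    """Return the tuned config for the smallest M bucket that fits,
    falling back to the generic "any" entry."""
    for bound, cfg in sorted((int(k), v) for k, v in table.items() if k != "any"):
        if m <= bound:
            return cfg
    return table["any"]

# Hypothetical table mirroring the behavior described in the PR: small-M
# decode gets a small block size plus split-K, while large M keeps the
# default-style BLOCK_SIZE_M=128 with NUM_KSPLIT=1.
table = {
    "32":  {"BLOCK_SIZE_M": 16,  "NUM_KSPLIT": 4},
    "any": {"BLOCK_SIZE_M": 128, "NUM_KSPLIT": 1},
}

print(pick_config(1, table))     # decode-sized M hits the tuned small bucket
print(pick_config(4096, table))  # prefill-sized M falls back to "any"
```

With only an "any" bucket (as in the generic default), every M value, including M=1 decode, would receive the large-tile, no-split-K config, which is exactly the inefficiency the tuned files remove.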
Submission Checklist