
Add glm5 70k 300 triton a8w8 blockscale configs #2743

Open
amd-pedghazi wants to merge 6 commits into ROCm:main from amd-pedghazi:add-glm5-70k-300-triton-a8w8-blockscale-configs

Conversation

@amd-pedghazi amd-pedghazi commented Apr 14, 2026

Motivation

Add missing Triton GEMM A8W8 blockscale configs for GLM-5 model shapes on gfx942 (MI300X).

Technical Details

6 new shape-specialized JSON configs added to aiter/ops/triton/configs/gemm/ for the N,K pairs (2048,2048), (2624,6144), (3072,6144), (3584,512), (6144,1536), and (6144,2048). All were tuned on MI300X using the Triton tuning framework. Without these, all 6 shapes fall back to the generic default, which has only a single "any" bucket (BLOCK_SIZE_M=128, NUM_KSPLIT=1) and is inefficient for small-M decode.
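The exact config files are not reproduced in this PR description; as an illustration only, a shape-specialized config of the kind described above might look roughly like the following (bucket keys and field names are assumptions, not copied from the repo):

```json
{
  "32": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "NUM_KSPLIT": 4
  },
  "any": {
    "BLOCK_SIZE_M": 128,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "NUM_KSPLIT": 1
  }
}
```

The idea is that small-M buckets (decode) get small BLOCK_SIZE_M plus split-K, while the "any" bucket keeps the large-tile default for prefill-sized M.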

Test Plan

Configs follow existing format and are loaded via get_gemm_config(). Verified with the AITER microbenchmark (op_tests/op_benchmarks/triton/bench_gemm_a8w8_blockscale.py) and end-to-end serving benchmarks on GLM-5-FP8 (8xMI300X, SGLang).

Setup: rocm/sgl-dev:v0.5.9-rocm720-mi30x-20260309, SGLang from amd-pedghazi/sglang@aiter-glm5-attention-fixes, TP=8, aiter attention backend. Only difference between runs is presence/absence of the 6 JSON config files.

Test Result

Microbenchmark shows ~0.4% average per-kernel gain, with best shape (N=6144, K=1536) up to +8.4%.

End-to-end serving (GLM-5-FP8, input=70k tokens, output=300 tokens, request_rate=inf):

| Prompts | Total tok/s (before → after) | Impr | TTFT ms (before → after) | Impr | TPOT ms (before → after) | Impr | E2E ms (before → after) | Impr |
|---|---|---|---|---|---|---|---|---|
| 1 | 795 → 1349 | +70% | 461 → 469 | -2% | 25.1 → 14.0 | +44% | 6850 → 4035 | +41% |
| 2 | 7069 → 9600 | +36% | 2988 → 3103 | -4% | 30.8 → 19.3 | +37% | 8067 → 6264 | +22% |
| 4 | 5771 → 7021 | +22% | 4155 → 3986 | +4% | 111.9 → 97.9 | +13% | 16433 → 14228 | +13% |
| 8 | 7614 → 8646 | +14% | 8284 → 7983 | +4% | 287.2 → 267.4 | +7% | 28715 → 26260 | +9% |
| 10 | 9641 → 11149 | +16% | 7567 → 7149 | +6% | 248.7 → 227.7 | +8% | 22868 → 20669 | +10% |
| 16 | 8228 → 9061 | +10% | 14552 → 13942 | +4% | 404.4 → 378.0 | +7% | 53120 → 49261 | +7% |
| 32 | 8629 → 9279 | +8% | 36066 → 34046 | +6% | 389.8 → 365.6 | +6% | 87352 → 81925 | +6% |

The large end-to-end gains (especially at low concurrency) are because the default fallback config uses BLOCK_SIZE_M=128 with NUM_KSPLIT=1 for all M values, while the tuned configs provide per-M-bucket optimization with smaller block sizes and split-K for decode (M=1-32).
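The per-M-bucket lookup described above can be sketched in a few lines of Python. This is a hypothetical illustration of the selection logic, not aiter's actual get_gemm_config() implementation; the bucket keys and field names are assumptions.

```python
# Hypothetical per-M-bucket config selection, mirroring the behavior
# described above: numeric keys are upper M bounds, "any" is the catch-all.
CONFIG = {
    "32": {"BLOCK_SIZE_M": 16, "NUM_KSPLIT": 4},    # tuned decode bucket
    "any": {"BLOCK_SIZE_M": 128, "NUM_KSPLIT": 1},  # generic default
}

def pick_bucket(config: dict, m: int) -> dict:
    """Return the entry for the smallest M bucket covering m,
    falling back to the 'any' bucket for large (prefill-sized) M."""
    for bound in sorted(int(k) for k in config if k != "any"):
        if m <= bound:
            return config[str(bound)]
    return config["any"]

print(pick_bucket(CONFIG, 8))     # decode-sized M -> split-K, small tile
print(pick_bucket(CONFIG, 4096))  # prefill-sized M -> default bucket
```

With only the generic default, every M (including M=1) would run the 128-row tile with no split-K, which is the inefficiency the tuned configs remove.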

Submission Checklist

root and others added 6 commits April 14, 2026 09:36
Add FlyDSL-based `fused_qk_rope_reshape_and_cache` as a drop-in replacement
for the Triton version, with automatic Triton fallback for unsupported features
(GPT-J rotation, offsets, KV scaling, zeros output, full-dim cos/sin, T>1).

Performance: 1.59x average speedup over Triton for decode (T=1) across
Llama-8B/70B/405B, Qwen3-72B/235B model configs.

Tests aligned with Triton test_fused_qk_rope_reshape_and_cache:
same input gen, reference, tolerance, KV cache validation. 53 test cases.

Requires flydsl >= 0.1.3 (no modifications needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@amd-pedghazi amd-pedghazi requested a review from a team April 14, 2026 11:18
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-355 | Run Triton tests on MI355 in addition to MI325 |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 2743 --add-label <label>`

@brunomazzottiamd brunomazzottiamd requested a review from azaidy April 14, 2026 13:33
@brunomazzottiamd
Contributor

Hi @amd-pedghazi. Does your PR have any connection to FlyDSL? I'm seeing some FlyDSL-related changes in the diff. From the PR description, I understand that your intention is just to add new configs to the aiter/ops/triton/configs/gemm directory. Am I correct? Can you please double-check and rebase on top of main? Thanks.

