feat: add/retune BF16 GEMM configs with FlyDSL backend for 6 models#2733

Open
sunway513 wants to merge 2 commits into main from bf16-gemm-retune-flydsl

Conversation

@sunway513 (Collaborator)

Summary

  • Re-tuned BF16 GEMM configs for GPT-OSS, DSV3, Kimi-K2 with FlyDSL competing alongside ASM/hipBLASLt/Triton
  • Added new BF16 GEMM tuned configs for Llama 70B, Llama 405B, Qwen 32B
  • All tuning done on MI355X (gfx950) with --libtype all (full backend competition)

Backend Distribution (708 total shapes)

| Model | Shapes | ASM | FlyDSL | hipBLASLt | Triton |
| --- | --- | --- | --- | --- | --- |
| GPT-OSS | 57 | 30 | 0 | 27 | 0 |
| DSV3 | 58 | 6 | 10 | 42 | 0 |
| Kimi-K2 | 125 | 74 | 40 | 9 | 2 |
| Llama 70B | 156 | 11 | 9 | 135 | 1 |
| Llama 405B | 156 | 12 | 3 | 138 | 3 |
| Qwen 32B | 156 | 24 | 8 | 121 | 3 |

FlyDSL highlights:

  • Kimi-K2: 32% of shapes won by FlyDSL (strongest)
  • DSV3: 17% of shapes won by FlyDSL
  • GPT-OSS: 0% (bias=True shapes not supported by FlyDSL)
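The highlight percentages above follow directly from the backend-distribution table; a minimal sketch recomputing them (numbers copied from this PR):

```python
# Recompute the FlyDSL win shares quoted above from the
# backend-distribution table in this PR description.
wins = {
    "GPT-OSS":    {"shapes": 57,  "flydsl": 0},
    "DSV3":       {"shapes": 58,  "flydsl": 10},
    "Kimi-K2":    {"shapes": 125, "flydsl": 40},
    "Llama 70B":  {"shapes": 156, "flydsl": 9},
    "Llama 405B": {"shapes": 156, "flydsl": 3},
    "Qwen 32B":   {"shapes": 156, "flydsl": 8},
}

for model, w in wins.items():
    share = 100.0 * w["flydsl"] / w["shapes"]
    print(f"{model}: {share:.0f}% of shapes won by FlyDSL")
# Kimi-K2 -> 32%, DSV3 -> 17%, GPT-OSS -> 0%
```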

Notes

  • GPT-OSS shapes all have bias=True which limits FlyDSL competitiveness
  • Large shapes (M=32768) heavily favor hipBLASLt
  • FlyDSL performs best on medium shapes without bias

Test plan

  • Verify tuned GEMM configs load correctly via gemm_tuner.py --input_file <tuned_csv>
  • Run inference sanity check on each model with new configs
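As an illustrative pre-check before running `gemm_tuner.py` (this helper is not part of that script; the column schema and backend names are taken from the tuned CSVs in this PR):

```python
# Illustrative sanity check: parse a tuned GEMM CSV and verify each row
# uses the schema and backend names seen in this PR's config files.
import csv
import io

EXPECTED_COLS = ["cu_num", "M", "N", "K", "bias", "dtype", "outdtype",
                 "scaleAB", "bpreshuffle", "libtype", "solidx", "splitK",
                 "us", "kernelName", "err_ratio", "tflops", "bw"]
KNOWN_BACKENDS = {"asm", "hipblaslt", "triton", "flydsl"}

def validate_tuned_csv(text):
    rows = list(csv.DictReader(io.StringIO(text)))
    assert rows and list(rows[0].keys()) == EXPECTED_COLS, "unexpected header"
    for r in rows:
        assert r["libtype"] in KNOWN_BACKENDS, f"unknown backend {r['libtype']}"
        assert int(r["M"]) > 0 and int(r["N"]) > 0 and int(r["K"]) > 0
    return len(rows)  # number of validated shape rows

sample = """cu_num,M,N,K,bias,dtype,outdtype,scaleAB,bpreshuffle,libtype,solidx,splitK,us,kernelName,err_ratio,tflops,bw
256,1,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,11,4.4021,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0156,0.17,168.85"""
print(validate_tuned_csv(sample))  # -> 1
```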

Tuned on MI355X (gfx950) with all backends competing (ASM, hipBLASLt,
Triton, FlyDSL). New tuned configs for Llama 70B, Llama 405B, Qwen 32B.
Re-tuned existing configs for GPT-OSS, DSV3, Kimi-K2 to include FlyDSL.

Backend wins across 708 total shapes:
- hipBLASLt: 472 (66.7%)
- ASM: 131 (18.5%)
- FlyDSL: 70 (9.9%)
- Triton: 7 (1.0%)
- Mixed/other: 28 (4.0%)
@sunway513 sunway513 requested review from a team and Copilot April 14, 2026 07:13
@github-actions (Contributor)

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-355` | Run Triton tests on MI355 in addition to MI325 |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2733 --add-label <label>`.

@valarLip (Collaborator)

please remove hipblaslt...

Copilot AI (Contributor) left a comment
Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates tuned BF16 GEMM configuration CSVs to reflect new/retuned winning kernels across multiple backends (ASM / hipBLASLt / Triton / FlyDSL), based on MI355X (gfx950) tuning results.

Changes:

  • Added a new tuned BF16 GEMM config file for Qwen 32B, including FlyDSL winners alongside ASM/hipBLASLt/Triton entries.
  • Replaced/condensed GPT-OSS tuned shapes and updated the selected winning implementations per shape.
  • Expanded/updated DSV3 tuned shapes and introduced FlyDSL winners for several small/medium shapes.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| aiter/configs/model_configs/qwen32B_bf16_tuned_gemm.csv | New tuned BF16 GEMM shape→kernel mapping for Qwen 32B, including FlyDSL entries for gfx950. |
| aiter/configs/model_configs/gptoss_bf16_tuned_gemm.csv | Retuned GPT-OSS mappings; significantly reduced the set of entries and updated backend winners. |
| aiter/configs/model_configs/dsv3_bf16_tuned_gemm.csv | Retuned/expanded DSV3 mappings and added FlyDSL winners for several shapes. |


Comment on lines +1 to +4
cu_num,M,N,K,bias,dtype,outdtype,scaleAB,bpreshuffle,libtype,solidx,splitK,us,kernelName,err_ratio,tflops,bw
256,1,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,11,4.4021,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0156,0.17,168.85
256,1,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,9.7108,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0082,1.52,1519.59
256,1,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,5,8.6543,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0097,1.36,1364.22
256,1,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,9.1855,auto,0.0,2.57,2570.02
256,1,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.2012,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.008,2.63,2634.29
256,2,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,11,4.4168,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0156,0.33,169.65
256,2,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,4,9.1765,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0109,3.21,1609.26
256,2,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,4,8.4399,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0102,2.8,1400.04
256,2,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,9.4381,auto,0.0,5.0,2502.71
256,2,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.402000000000001,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0075,5.17,2589.3
256,4,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,9,4.4035,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0098,0.67,172.9
256,4,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,5,9.9316,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0122,5.94,1489.1
256,4,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,4,8.1855,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0091,5.76,1445.96
256,4,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,9.7799,auto,0.0,9.65,2418.1
256,4,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.5586,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0074,10.21,2556.99
256,8,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,9,4.4643,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0127,1.32,175.93
256,8,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,4,9.7728,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0104,12.07,1517.75
256,8,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,3,8.6642,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0064,10.89,1370.62
256,8,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,10.2907,auto,0.0,18.34,2303.5
256,8,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.7116,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0077,20.14,2529.05
256,16,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,10,4.8658,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0146,2.42,171.31
256,16,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,4,9.8954,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.01,23.84,1507.74
256,16,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,8.7178,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0062,21.65,1371.24
256,16,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,10.4181,auto,0.0,36.23,2286.04
256,16,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,12.0177,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0077,39.26,2475.28
256,32,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,15,4.3956,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0234,5.37,211.53
256,32,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,4,9.7246,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0099,48.52,1552.12
256,32,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,5,9.5423,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0107,39.56,1269.28
256,32,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,5,11.2956,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0138,66.84,2128.21
256,32,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,12.2183,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0077,77.24,2455.6
256,48,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,13,4.7396,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0199,7.47,216.48
256,48,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,4,10.6698,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0101,66.34,1430.94
256,48,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,4,9.5586,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.009000000000000001,59.24,1283.62
256,48,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,5,11.8153,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0134,95.85,2053.49
256,48,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,3,13.0321,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0077,108.62,2321.9
256,64,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,15,4.9073,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.025,9.62,228.7
256,64,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,10.5605,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0074,89.36,1462.23
256,64,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,3,9.6563,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0067,78.18,1286.96
256,64,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,5,12.3105,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0137,122.66,1989.02
256,64,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,3,13.5485,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0077,139.31,2252.29
256,80,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,9,4.8408,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0184,12.18,251.73
256,80,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,7,5,12.6552,_ZN5aiter39bf16gemm_fp32bf16_tn_80x64_splitk_cleanE,0.0135,149.14,1952.49
256,80,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,7,3,14.0135,_ZN5aiter39bf16gemm_fp32bf16_tn_80x64_splitk_cleanE,0.0076,168.36,2195.83
256,96,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,8,15,5.1844,_ZN5aiter39bf16gemm_fp32bf16_tn_96x64_splitk_cleanE,0.026000000000000002,13.65,253.61
256,96,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,8,5,13.2504,_ZN5aiter39bf16gemm_fp32bf16_tn_96x64_splitk_cleanE,0.0137,170.93,1881.63
256,96,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,8,3,14.5394,_ZN5aiter39bf16gemm_fp32bf16_tn_96x64_splitk_cleanE,0.0076,194.72,2134.01
256,112,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,9,5.0344,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0162,16.4,280.29
256,112,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,2,17.4414,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0048,151.5,1442.29
256,112,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,4,1,19.2107,_ZN5aiter37bf16gemm_fp32bf16_tn_48x64_pf3_splitkE,0.0,171.94,1628.43
256,128,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,8,4.7402,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.016,19.91,317.99
256,128,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,7,3,11.9095,_ZN5aiter39bf16gemm_fp32bf16_tn_80x64_splitk_cleanE,0.0074,158.48,1355.07
256,128,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,2,11.2484,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0035,134.24,1160.88
256,128,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,2,17.5461,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0048,172.11,1446.41
256,128,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,4,1,19.2945,_ZN5aiter37bf16gemm_fp32bf16_tn_48x64_pf3_splitkE,0.0,195.65,1634.62
256,256,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,7,6.5053,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0157,29.01,350.08
256,256,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,8,2,15.4372,_ZN5aiter39bf16gemm_fp32bf16_tn_96x64_splitk_cleanE,0.0041,244.53,1135.63
256,256,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,6,1,15.4366,_ZN5aiter37bf16gemm_fp32bf16_tn_64x64_pf3_splitkE,0.0,195.63,927.64
304,1,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.0412,auto,0.0,1.73,1731.52
304,1,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618608,0,10.1359,auto,0.0,2.33,2329.04
304,1,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3001,auto,0.0,0.07,72.16
304,2,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618591,0,17.2327,auto,0.0,3.42,1713.21
304,2,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.1022,auto,0.0,4.67,2338.19
304,2,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,10.4159,auto,0.0,0.14,71.94
304,4,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.3212,auto,0.0,6.81,1706.3
304,4,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3169,auto,0.0,9.15,2292.24
304,4,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,10.3779,auto,0.0,0.28,73.36
304,8,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.2979,auto,0.0,13.64,1712.3
304,8,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.1527,auto,0.0,18.59,2334.81
304,8,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.8012,auto,0.0,0.55,72.71
304,16,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.3674,auto,0.0,27.17,1712.82
304,16,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3801,auto,0.0,36.37,2294.41
304,16,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,11.1528,auto,0.0,1.06,74.74
304,32,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618591,0,18.8286,auto,0.0,50.12,1593.49
304,32,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618604,0,12.1128,auto,0.0,62.33,1984.63
304,32,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,11.4643,auto,0.0,2.06,81.1
304,48,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618578,0,21.9233,auto,0.0,64.57,1380.23
304,48,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618592,0,13.6643,auto,0.0,82.88,1775.62
304,48,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,11.7001,auto,0.0,3.02,87.7
304,64,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.6181,auto,0.0,83.45,1349.15
304,64,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618592,0,14.0454,auto,0.0,107.5,1743.34
304,64,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.8327,auto,0.0,3.99,94.85
304,80,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.5254,auto,0.0,104.74,1366.07
304,80,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,15.9991,auto,0.0,117.97,1544.41
304,80,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.6769,auto,0.0,5.05,104.36
304,96,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.5065,auto,0.0,125.79,1378.59
304,96,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,16.6728,auto,0.0,135.85,1495.39
304,96,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.8916,auto,0.0,5.95,110.57
304,112,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618570,0,24.6222,auto,0.0,134.15,1270.53
304,112,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,17.3106,auto,0.0,152.65,1453.19
304,112,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.9001,auto,0.0,6.94,118.58
304,128,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618570,0,24.8097,auto,0.0,152.15,1271.24
304,128,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,17.9002,auto,0.0,168.71,1417.8
304,128,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.5906,auto,0.0,8.14,130.05
80,1,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.0412,auto,0.0,1.73,1731.52
80,1,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618608,0,10.1359,auto,0.0,2.33,2329.04
80,1,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3001,auto,0.0,0.07,72.16
80,2,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618591,0,17.2327,auto,0.0,3.42,1713.21
80,2,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.1022,auto,0.0,4.67,2338.19
80,2,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,10.4159,auto,0.0,0.14,71.94
80,4,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.3212,auto,0.0,6.81,1706.3
80,4,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3169,auto,0.0,9.15,2292.24
80,4,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,10.3779,auto,0.0,0.28,73.36
80,8,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.2979,auto,0.0,13.64,1712.3
80,8,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.1527,auto,0.0,18.59,2334.81
80,8,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.8012,auto,0.0,0.55,72.71
80,16,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.3674,auto,0.0,27.17,1712.82
80,16,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3801,auto,0.0,36.37,2294.41
80,16,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,11.1528,auto,0.0,1.06,74.74
80,32,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618591,0,18.8286,auto,0.0,50.12,1593.49
80,32,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618604,0,12.1128,auto,0.0,62.33,1984.63
80,32,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,11.4643,auto,0.0,2.06,81.1
80,48,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618578,0,21.9233,auto,0.0,64.57,1380.23
80,48,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618592,0,13.6643,auto,0.0,82.88,1775.62
80,48,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,11.7001,auto,0.0,3.02,87.7
80,64,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.6181,auto,0.0,83.45,1349.15
80,64,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618592,0,14.0454,auto,0.0,107.5,1743.34
80,64,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.8327,auto,0.0,3.99,94.85
80,80,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.5254,auto,0.0,104.74,1366.07
80,80,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,15.9991,auto,0.0,117.97,1544.41
80,80,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.6769,auto,0.0,5.05,104.36
80,96,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.5065,auto,0.0,125.79,1378.59
80,96,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,16.6728,auto,0.0,135.85,1495.39
80,96,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.8916,auto,0.0,5.95,110.57
80,112,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618570,0,24.6222,auto,0.0,134.15,1270.53
80,112,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,17.3106,auto,0.0,152.65,1453.19
80,112,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.9001,auto,0.0,6.94,118.58
80,128,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618570,0,24.8097,auto,0.0,152.15,1271.24
80,128,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,17.9002,auto,0.0,168.71,1417.8
80,128,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.5906,auto,0.0,8.14,130.05
256,1,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,13,4.5547,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0234,0.16,163.19
256,2,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,14,4.5902,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0195,0.32,163.24
256,4,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,14,4.9463,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0117,0.6,153.92
Copilot AI Apr 14, 2026

This updated GPT-OSS config only includes entries for cu_num=256. In the previous version of this file, there were also tuned entries for other cu_num values (e.g., 80 and 304). If the runtime selects configs by cu_num, dropping those entries can cause missing-config fallbacks (or outright lookup failures) on different GPU SKUs/partitions. Consider keeping the non-256 cu_num rows, or splitting configs by target device/cu_num (with explicit selection logic) so non-gfx950/256-CU cases continue to resolve deterministically.

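The reviewer's concern about cu_num-keyed lookups can be made concrete with a small sketch. Everything here is hypothetical (the function, dict layout, and fallback policy are illustrative, not aiter's actual loader):

```python
# Hypothetical sketch of cu_num-keyed config selection with a fallback,
# illustrating the missing-config concern above. Not aiter's real loader.
def select_config(configs, cu_num, shape):
    """configs maps (cu_num, M, N, K) -> tuned row. If the exact cu_num
    is missing (e.g. only 256-CU rows remain), fall back to the nearest
    cu_num that has this shape tuned instead of failing outright."""
    key = (cu_num, *shape)
    if key in configs:
        return configs[key]
    candidates = [k for k in configs if k[1:] == shape]
    if not candidates:
        return None  # caller falls back to an untuned default kernel
    best = min(candidates, key=lambda k: abs(k[0] - cu_num))
    return configs[best]

configs = {(256, 1, 128, 2880): "asm", (80, 1, 128, 2880): "triton"}
print(select_config(configs, 304, (1, 128, 2880)))  # nearest cu_num: 256 -> "asm"
```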
256,1,5120,640,False,torch.bfloat16,torch.bfloat16,False,False,asm,2,1,6.9902,_ZN5aiter37bf16gemm_fp32bf16_tn_32x64_pf3_splitkE,0.0,0.94,939.19
256,1,5120,1280,False,torch.bfloat16,torch.bfloat16,False,False,hipblaslt,440554,0,7.5725,Cijk_Alik_Bljk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT32x16x256_MI16x16x1_SN_LDSB0_AFC0_AFEM1_AFEM1_ASEM1_CLR0_CADS0_DTLA1_DTLB1_DTVA0_DTVB0_EPS0_FDSI0_GRPM1_GRVWA8_GRVWB8_GSU0_GSUAMB_GLS0_ISA950_IU1_K1_LDSTI0_LBSPPA1024_LBSPPB1024_LBSPPM0_LPA16_LPB16_LPM0_LRVW8_LWPMn1_MIAV0_MIWT2_1_MO40_NTn1_NTA4_NTB1_NTC2_NTD1_NTM0_NEPBS12_NLCA1_NLCB1_ONLL1_PGR2_PLR1_PKA1_SIA3_SS0_SPO1_SRVW0_SSO0_SVW2_SK3_SKFTR0_SKXCCM8_TLDS1_ULSGRO0_USL1_UIOFGRO0_USFGRO0_VSn1_VWA2_VWB1_WSGRA0_WSGRB0_WS64_WG16_4_2,0.0,1.73,1732.59
256,1,5120,3200,False,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.6919,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0076,2.8,2804.05
256,1,5120,5120,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,122,16,12.4189,flydsl_gemm2_abf16_wbf16_bf16_t32x128x64_split_k16_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0281,4.22,4223.34
Copilot AI Apr 14, 2026
This config encodes the target architecture directly in kernelName (e.g., suffix 'gfx950' and many hipBLASLt names include 'ISA950'). If config loading doesn’t strictly gate these entries by detected GPU arch, deploying this file on non-gfx950 targets may select incompatible kernels. Ensure the loader filters by arch (or store these under an arch-scoped path/filename convention) so the config can’t be accidentally applied on other architectures.

Suggested change
256,1,5120,5120,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,122,16,12.4189,flydsl_gemm2_abf16_wbf16_bf16_t32x128x64_split_k16_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0281,4.22,4223.34

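The arch gating the review asks for could look like the sketch below. The `gfx950`/`ISA950` markers come from the kernel names in this PR; the helper itself is an assumption, not aiter's real loading logic:

```python
# Hedged sketch of arch gating: drop tuned rows whose kernel name encodes
# a different GPU arch than the detected one. Illustrative only.
import re

def rows_for_arch(kernel_names, detected_arch):
    """Keep a kernel if its name carries no arch marker, or its marker
    matches the detected arch (e.g. 'gfx950' / hipBLASLt-style 'ISA950')."""
    isa = detected_arch.replace("gfx", "ISA")
    kept = []
    for name in kernel_names:
        markers = re.findall(r"gfx\d+|ISA\d+", name)
        if not markers or detected_arch in markers or isa in markers:
            kept.append(name)
    return kept

rows = [
    "flydsl_gemm2_abf16_wbf16_bf16_t32x128x64_gfx950",          # FlyDSL, gfx950
    "Cijk_Alik_Bljk_BBS_BH_ISA950_MT32x16x256",                  # hipBLASLt, ISA950
    "_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE",       # no arch marker
]
print(len(rows_for_arch(rows, "gfx942")))  # only the unmarked ASM kernel -> 1
```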
Re-tuned all BF16 GEMM configs on MI355X (gfx950) with --libtype
asm,triton,flydsl (no hipBLASLt). Added GLM-5 (88 shapes from CI log)
and new configs for Llama 70B, Llama 405B, Qwen 32B.

Backend wins across 796 total shapes (7 models):
- ASM: 437 (54.9%)
- FlyDSL: 224 (28.1%)
- Triton: 135 (17.0%)

Per-model breakdown:
- GPT-OSS (57): asm=54, triton=3 (bias=True, no FlyDSL support)
- DSV3 (58): flydsl=22, triton=18, asm=18
- Kimi-K2 (125): asm=77, flydsl=46, triton=2
- GLM-5 (88): asm=42, flydsl=30, triton=16
- Llama 70B (156): asm=84, flydsl=49, triton=23
- Llama 405B (156): asm=89, flydsl=43, triton=24
- Qwen 32B (156): asm=73, triton=49, flydsl=34

Tuning time without hipBLASLt: 4h total (long pole: 405B @ 4h)
vs with hipBLASLt: 10h+ total (long pole: 405B @ 8h+)
