feat: add/retune BF16 GEMM configs with FlyDSL backend for 6 models #2733
Conversation
Tuned on MI355X (gfx950) with all backends competing (ASM, hipBLASLt, Triton, FlyDSL). New tuned configs for Llama 70B, Llama 405B, and Qwen 32B; re-tuned existing configs for GPT-OSS, DSV3, and Kimi-K2 to include FlyDSL.

Backend wins across 708 total shapes:
- hipBLASLt: 472 (66.7%)
- ASM: 131 (18.5%)
- FlyDSL: 70 (9.9%)
- Triton: 7 (1.0%)
- Mixed/other: 28 (4.0%)
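As a hedged sanity check (not part of the PR itself), the quoted win counts and percentages can be cross-checked against the 708-shape total:

```python
# Illustrative check: backend win counts quoted in the PR description
# should sum to 708, and the quoted percentages follow from the counts.
wins = {"hipBLASLt": 472, "ASM": 131, "FlyDSL": 70, "Triton": 7, "Mixed/other": 28}
total = 708

assert sum(wins.values()) == total
shares = {backend: round(100 * count / total, 1) for backend, count in wins.items()}
print(shares)
# → {'hipBLASLt': 66.7, 'ASM': 18.5, 'FlyDSL': 9.9, 'Triton': 1.0, 'Mixed/other': 4.0}
```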
please remove hipblaslt...
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates tuned BF16 GEMM configuration CSVs to reflect new/retuned winning kernels across multiple backends (ASM / hipBLASLt / Triton / FlyDSL), based on MI355X (gfx950) tuning results.
Changes:
- Added a new tuned BF16 GEMM config file for Qwen 32B, including FlyDSL winners alongside ASM/hipBLASLt/Triton entries.
- Replaced/condensed GPT-OSS tuned shapes and updated the selected winning implementations per shape.
- Expanded/updated DSV3 tuned shapes and introduced FlyDSL winners for several small/medium shapes.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| aiter/configs/model_configs/qwen32B_bf16_tuned_gemm.csv | New tuned BF16 GEMM shape→kernel mapping for Qwen 32B, including FlyDSL entries for gfx950. |
| aiter/configs/model_configs/gptoss_bf16_tuned_gemm.csv | Retuned GPT-OSS mappings; significantly reduced the set of entries and updated backend winners. |
| aiter/configs/model_configs/dsv3_bf16_tuned_gemm.csv | Retuned/expanded DSV3 mappings and added FlyDSL winners for several shapes. |
cu_num,M,N,K,bias,dtype,outdtype,scaleAB,bpreshuffle,libtype,solidx,splitK,us,kernelName,err_ratio,tflops,bw
256,1,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,11,4.4021,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0156,0.17,168.85
256,1,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,9.7108,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0082,1.52,1519.59
256,1,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,5,8.6543,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0097,1.36,1364.22
256,1,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,9.1855,auto,0.0,2.57,2570.02
256,1,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.2012,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.008,2.63,2634.29
256,2,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,11,4.4168,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0156,0.33,169.65
256,2,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,4,9.1765,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0109,3.21,1609.26
256,2,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,4,8.4399,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0102,2.8,1400.04
256,2,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,9.4381,auto,0.0,5.0,2502.71
256,2,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.402,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0075,5.17,2589.3
256,4,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,9,4.4035,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0098,0.67,172.9
256,4,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,5,9.9316,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0122,5.94,1489.1
256,4,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,4,8.1855,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0091,5.76,1445.96
256,4,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,9.7799,auto,0.0,9.65,2418.1
256,4,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.5586,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0074,10.21,2556.99
256,8,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,9,4.4643,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0127,1.32,175.93
256,8,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,4,9.7728,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0104,12.07,1517.75
256,8,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,3,8.6642,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0064,10.89,1370.62
256,8,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,10.2907,auto,0.0,18.34,2303.5
256,8,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.7116,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0077,20.14,2529.05
256,16,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,10,4.8658,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0146,2.42,171.31
256,16,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,4,9.8954,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.01,23.84,1507.74
256,16,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,8.7178,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0062,21.65,1371.24
256,16,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,0,0,10.4181,auto,0.0,36.23,2286.04
256,16,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,12.0177,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0077,39.26,2475.28
256,32,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,15,4.3956,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0234,5.37,211.53
256,32,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,4,9.7246,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0099,48.52,1552.12
256,32,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,5,9.5423,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0107,39.56,1269.28
256,32,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,5,11.2956,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0138,66.84,2128.21
256,32,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,12.2183,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0077,77.24,2455.6
256,48,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,13,4.7396,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0199,7.47,216.48
256,48,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,4,10.6698,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0101,66.34,1430.94
256,48,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,4,9.5586,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.009,59.24,1283.62
256,48,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,5,11.8153,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0134,95.85,2053.49
256,48,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,3,13.0321,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0077,108.62,2321.9
256,64,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,15,4.9073,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.025,9.62,228.7
256,64,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,10.5605,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0074,89.36,1462.23
256,64,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,3,9.6563,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0067,78.18,1286.96
256,64,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,5,12.3105,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0137,122.66,1989.02
256,64,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,3,13.5485,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0077,139.31,2252.29
256,80,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,3,9,4.8408,_ZN5aiter39bf16gemm_fp32bf16_tn_48x64_splitk_cleanE,0.0184,12.18,251.73
256,80,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,7,5,12.6552,_ZN5aiter39bf16gemm_fp32bf16_tn_80x64_splitk_cleanE,0.0135,149.14,1952.49
256,80,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,7,3,14.0135,_ZN5aiter39bf16gemm_fp32bf16_tn_80x64_splitk_cleanE,0.0076,168.36,2195.83
256,96,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,8,15,5.1844,_ZN5aiter39bf16gemm_fp32bf16_tn_96x64_splitk_cleanE,0.026,13.65,253.61
256,96,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,8,5,13.2504,_ZN5aiter39bf16gemm_fp32bf16_tn_96x64_splitk_cleanE,0.0137,170.93,1881.63
256,96,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,8,3,14.5394,_ZN5aiter39bf16gemm_fp32bf16_tn_96x64_splitk_cleanE,0.0076,194.72,2134.01
256,112,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,9,5.0344,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0162,16.4,280.29
256,112,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,2,17.4414,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0048,151.5,1442.29
256,112,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,4,1,19.2107,_ZN5aiter37bf16gemm_fp32bf16_tn_48x64_pf3_splitkE,0.0,171.94,1628.43
256,128,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,8,4.7402,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.016,19.91,317.99
256,128,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,7,3,11.9095,_ZN5aiter39bf16gemm_fp32bf16_tn_80x64_splitk_cleanE,0.0074,158.48,1355.07
256,128,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,2,11.2484,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0035,134.24,1160.88
256,128,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,2,17.5461,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0048,172.11,1446.41
256,128,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,4,1,19.2945,_ZN5aiter37bf16gemm_fp32bf16_tn_48x64_pf3_splitkE,0.0,195.65,1634.62
256,256,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,5,7,6.5053,_ZN5aiter39bf16gemm_fp32bf16_tn_64x64_splitk_cleanE,0.0157,29.01,350.08
256,256,2560,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,8,2,15.4372,_ZN5aiter39bf16gemm_fp32bf16_tn_96x64_splitk_cleanE,0.0041,244.53,1135.63
256,256,2880,2048,True,torch.bfloat16,torch.bfloat16,False,False,asm,6,1,15.4366,_ZN5aiter37bf16gemm_fp32bf16_tn_64x64_pf3_splitkE,0.0,195.63,927.64
304,1,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.0412,auto,0.0,1.73,1731.52
304,1,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618608,0,10.1359,auto,0.0,2.33,2329.04
304,1,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3001,auto,0.0,0.07,72.16
304,2,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618591,0,17.2327,auto,0.0,3.42,1713.21
304,2,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.1022,auto,0.0,4.67,2338.19
304,2,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,10.4159,auto,0.0,0.14,71.94
304,4,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.3212,auto,0.0,6.81,1706.3
304,4,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3169,auto,0.0,9.15,2292.24
304,4,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,10.3779,auto,0.0,0.28,73.36
304,8,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.2979,auto,0.0,13.64,1712.3
304,8,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.1527,auto,0.0,18.59,2334.81
304,8,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.8012,auto,0.0,0.55,72.71
304,16,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.3674,auto,0.0,27.17,1712.82
304,16,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3801,auto,0.0,36.37,2294.41
304,16,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,11.1528,auto,0.0,1.06,74.74
304,32,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618591,0,18.8286,auto,0.0,50.12,1593.49
304,32,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618604,0,12.1128,auto,0.0,62.33,1984.63
304,32,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,11.4643,auto,0.0,2.06,81.1
304,48,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618578,0,21.9233,auto,0.0,64.57,1380.23
304,48,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618592,0,13.6643,auto,0.0,82.88,1775.62
304,48,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,11.7001,auto,0.0,3.02,87.7
304,64,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.6181,auto,0.0,83.45,1349.15
304,64,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618592,0,14.0454,auto,0.0,107.5,1743.34
304,64,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.8327,auto,0.0,3.99,94.85
304,80,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.5254,auto,0.0,104.74,1366.07
304,80,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,15.9991,auto,0.0,117.97,1544.41
304,80,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.6769,auto,0.0,5.05,104.36
304,96,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.5065,auto,0.0,125.79,1378.59
304,96,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,16.6728,auto,0.0,135.85,1495.39
304,96,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.8916,auto,0.0,5.95,110.57
304,112,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618570,0,24.6222,auto,0.0,134.15,1270.53
304,112,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,17.3106,auto,0.0,152.65,1453.19
304,112,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.9001,auto,0.0,6.94,118.58
304,128,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618570,0,24.8097,auto,0.0,152.15,1271.24
304,128,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,17.9002,auto,0.0,168.71,1417.8
304,128,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.5906,auto,0.0,8.14,130.05
80,1,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.0412,auto,0.0,1.73,1731.52
80,1,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618608,0,10.1359,auto,0.0,2.33,2329.04
80,1,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3001,auto,0.0,0.07,72.16
80,2,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618591,0,17.2327,auto,0.0,3.42,1713.21
80,2,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.1022,auto,0.0,4.67,2338.19
80,2,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,10.4159,auto,0.0,0.14,71.94
80,4,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.3212,auto,0.0,6.81,1706.3
80,4,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3169,auto,0.0,9.15,2292.24
80,4,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,10.3779,auto,0.0,0.28,73.36
80,8,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.2979,auto,0.0,13.64,1712.3
80,8,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.1527,auto,0.0,18.59,2334.81
80,8,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.8012,auto,0.0,0.55,72.71
80,16,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618599,0,17.3674,auto,0.0,27.17,1712.82
80,16,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,10.3801,auto,0.0,36.37,2294.41
80,16,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618601,0,11.1528,auto,0.0,1.06,74.74
80,32,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618591,0,18.8286,auto,0.0,50.12,1593.49
80,32,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618604,0,12.1128,auto,0.0,62.33,1984.63
80,32,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,11.4643,auto,0.0,2.06,81.1
80,48,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618578,0,21.9233,auto,0.0,64.57,1380.23
80,48,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618592,0,13.6643,auto,0.0,82.88,1775.62
80,48,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618603,0,11.7001,auto,0.0,3.02,87.7
80,64,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.6181,auto,0.0,83.45,1349.15
80,64,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618592,0,14.0454,auto,0.0,107.5,1743.34
80,64,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.8327,auto,0.0,3.99,94.85
80,80,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.5254,auto,0.0,104.74,1366.07
80,80,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,15.9991,auto,0.0,117.97,1544.41
80,80,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.6769,auto,0.0,5.05,104.36
80,96,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618571,0,22.5065,auto,0.0,125.79,1378.59
80,96,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,16.6728,auto,0.0,135.85,1495.39
80,96,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.8916,auto,0.0,5.95,110.57
80,112,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618570,0,24.6222,auto,0.0,134.15,1270.53
80,112,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,17.3106,auto,0.0,152.65,1453.19
80,112,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.9001,auto,0.0,6.94,118.58
80,128,5120,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618570,0,24.8097,auto,0.0,152.15,1271.24
80,128,2880,4096,True,torch.bfloat16,torch.bfloat16,False,False,triton,618590,0,17.9002,auto,0.0,168.71,1417.8
80,128,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,triton,618593,0,11.5906,auto,0.0,8.14,130.05
256,1,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,13,4.5547,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0234,0.16,163.19
256,2,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,14,4.5902,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0195,0.32,163.24
256,4,128,2880,True,torch.bfloat16,torch.bfloat16,False,False,asm,1,14,4.9463,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0117,0.6,153.92
This updated GPT-OSS config only includes entries for cu_num=256. In the previous version of this file, there were also tuned entries for other cu_num values (e.g., 80 and 304). If the runtime selects configs by cu_num, dropping those entries can cause missing-config fallbacks (or outright lookup failures) on different GPU SKUs/partitions. Consider keeping the non-256 cu_num rows, or splitting configs by target device/cu_num (with explicit selection logic) so non-gfx950/256-CU cases continue to resolve deterministically.
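A minimal sketch of the lookup behavior this comment is concerned about, assuming a hypothetical loader that indexes rows by `(cu_num, M, N, K)`; the function names and the nearest-cu_num fallback policy are illustrative, not aiter's actual API:

```python
import csv

def load_configs(path):
    """Hypothetical loader: index tuned rows by (cu_num, M, N, K)."""
    table = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (int(row["cu_num"]), int(row["M"]), int(row["N"]), int(row["K"]))
            table[key] = row
    return table

def lookup(table, cu_num, m, n, k):
    """Exact match first; fall back to the nearest tuned cu_num for the same shape."""
    hit = table.get((cu_num, m, n, k))
    if hit is not None:
        return hit
    candidates = [key for key in table if key[1:] == (m, n, k)]
    if not candidates:
        return None  # missing-config case: caller must fall back to a default backend
    nearest = min(candidates, key=lambda key: abs(key[0] - cu_num))
    return table[nearest]
```

With only cu_num=256 rows in the file, a 304-CU or 80-CU device either needs a fallback like the one above or gets `None` and a default-backend path; dropping the non-256 rows makes that behavior load-bearing.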
256,1,5120,640,False,torch.bfloat16,torch.bfloat16,False,False,asm,2,1,6.9902,_ZN5aiter37bf16gemm_fp32bf16_tn_32x64_pf3_splitkE,0.0,0.94,939.19
256,1,5120,1280,False,torch.bfloat16,torch.bfloat16,False,False,hipblaslt,440554,0,7.5725,Cijk_Alik_Bljk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT32x16x256_MI16x16x1_SN_LDSB0_AFC0_AFEM1_AFEM1_ASEM1_CLR0_CADS0_DTLA1_DTLB1_DTVA0_DTVB0_EPS0_FDSI0_GRPM1_GRVWA8_GRVWB8_GSU0_GSUAMB_GLS0_ISA950_IU1_K1_LDSTI0_LBSPPA1024_LBSPPB1024_LBSPPM0_LPA16_LPB16_LPM0_LRVW8_LWPMn1_MIAV0_MIWT2_1_MO40_NTn1_NTA4_NTB1_NTC2_NTD1_NTM0_NEPBS12_NLCA1_NLCB1_ONLL1_PGR2_PLR1_PKA1_SIA3_SS0_SPO1_SRVW0_SSO0_SVW2_SK3_SKFTR0_SKXCCM8_TLDS1_ULSGRO0_USL1_UIOFGRO0_USFGRO0_VSn1_VWA2_VWB1_WSGRA0_WSGRB0_WS64_WG16_4_2,0.0,1.73,1732.59
256,1,5120,3200,False,torch.bfloat16,torch.bfloat16,False,False,asm,1,3,11.6919,_ZN5aiter39bf16gemm_fp32bf16_tn_32x64_splitk_cleanE,0.0076,2.8,2804.05
256,1,5120,5120,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,122,16,12.4189,flydsl_gemm2_abf16_wbf16_bf16_t32x128x64_split_k16_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0281,4.22,4223.34
This config encodes the target architecture directly in kernelName (e.g., suffix 'gfx950' and many hipBLASLt names include 'ISA950'). If config loading doesn’t strictly gate these entries by detected GPU arch, deploying this file on non-gfx950 targets may select incompatible kernels. Ensure the loader filters by arch (or store these under an arch-scoped path/filename convention) so the config can’t be accidentally applied on other architectures.
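One possible shape of the gate this comment asks for, sketched under two assumptions from the CSV naming observed in this PR: FlyDSL kernel names end in an explicit `_gfx<nnn>` suffix and hipBLASLt names embed `ISA<nnn>`. The function name and parsing rules are illustrative, not aiter's loader:

```python
def filter_rows_by_arch(rows, detected_arch):
    """Hypothetical arch gate: drop rows whose kernelName is pinned
    to a different gfx architecture; arch-agnostic rows pass through.

    Assumes '_gfx950'-style suffixes (FlyDSL) and 'ISA950'-style
    tokens (hipBLASLt), per the kernel names in this config file.
    """
    kept = []
    for row in rows:
        name = row.get("kernelName", "")
        pinned = None
        if "_gfx" in name:
            pinned = "gfx" + name.rsplit("_gfx", 1)[1][:3]
        elif "ISA" in name:
            pinned = "gfx" + name.split("ISA", 1)[1][:3]
        if pinned is None or pinned == detected_arch:
            kept.append(row)
    return kept
```

Under this sketch, a gfx942 device would keep only arch-agnostic rows (e.g. Triton `auto`) and re-tune or fall back for the pinned ones, instead of launching a gfx950 binary.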
| 256,1,5120,5120,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,122,16,12.4189,flydsl_gemm2_abf16_wbf16_bf16_t32x128x64_split_k16_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0281,4.22,4223.34 |
Re-tuned all BF16 GEMM configs on MI355X (gfx950) with `--libtype asm,triton,flydsl` (no hipBLASLt). Added GLM-5 (88 shapes from CI log) and new configs for Llama 70B, Llama 405B, and Qwen 32B.

Backend wins across 796 total shapes (7 models):
- ASM: 437 (54.9%)
- FlyDSL: 224 (28.1%)
- Triton: 135 (17.0%)

Per-model breakdown:
- GPT-OSS (57): asm=54, triton=3 (bias=True, no FlyDSL support)
- DSV3 (58): flydsl=22, triton=18, asm=18
- Kimi-K2 (125): asm=77, flydsl=46, triton=2
- GLM-5 (88): asm=42, flydsl=30, triton=16
- Llama 70B (156): asm=84, flydsl=49, triton=23
- Llama 405B (156): asm=89, flydsl=43, triton=24
- Qwen 32B (156): asm=73, triton=49, flydsl=34

Tuning time without hipBLASLt: 4h total (long pole: 405B @ 4h) vs. with hipBLASLt: 10h+ total (long pole: 405B @ 8h+).
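The per-model numbers above can be cross-checked for internal consistency; this is an illustrative verification, not part of the PR:

```python
# Each model's backend wins should sum to its shape count,
# and the shape counts should sum to the 796-shape total.
per_model = {
    "GPT-OSS":    (57,  {"asm": 54, "triton": 3}),
    "DSV3":       (58,  {"flydsl": 22, "triton": 18, "asm": 18}),
    "Kimi-K2":    (125, {"asm": 77, "flydsl": 46, "triton": 2}),
    "GLM-5":      (88,  {"asm": 42, "flydsl": 30, "triton": 16}),
    "Llama 70B":  (156, {"asm": 84, "flydsl": 49, "triton": 23}),
    "Llama 405B": (156, {"asm": 89, "flydsl": 43, "triton": 24}),
    "Qwen 32B":   (156, {"asm": 73, "triton": 49, "flydsl": 34}),
}

for model, (shapes, wins) in per_model.items():
    assert sum(wins.values()) == shapes, model
assert sum(shapes for shapes, _ in per_model.values()) == 796
```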
Summary
Tuned with `--libtype all` (full backend competition); the backend distribution across the 708 total shapes is listed in the PR description above.
FlyDSL highlights:
Notes
GPT-OSS shapes use `bias=True`, which limits FlyDSL competitiveness.

Test plan
gemm_tuner.py --input_file <tuned_csv>