Fix Triton MoE GEMM shared memory exhaustion by reducing stage count #2723
Open
- Reduce `num_stages` in kernel configs
- Lowered LDS usage to avoid shared memory OOR
- Fix `triton.runtime.errors.OutOfResources` errors in MoE GEMM kernels
…x950 to ensure no bottlenecks for gfx942
brunomazzottiamd (Contributor) approved these changes on Apr 14, 2026 and left a comment:
LGTM! Let's wait for CI to pass.
Contributor
Author
To further support this: Gfx950 benchmark results using MoE Kernel Benchmark Summary (with
| TP | M val | DeepSeek-R1 | GPT-OSS 120B | Llama4 Maverick | Qwen3-235B-A22B |
|---|---|---|---|---|---|
| 1 | 512 | 285.8 | 164.0 | 27.75 | 283.4 |
| 1 | 1536 | 838.3 | 448.2 | 71.21 | 780.8 |
| 1 | 2560 | 829.7 | 181.4 | 251.10 | 712.5 |
| 1 | 3584 | 1052.0 | 219.8 | 303.80 | 949.6 |
| 1 | 4608 | 971.1 | 177.2 | 382.30 | 904.9 |
| 1 | 5632 | 1164.0 | 209.7 | 473.50 | 1059.0 |
| 1 | 6656 | 1246.0 | 246.3 | 510.50 | 994.4 |
| 1 | 7680 | 1342.0 | 247.7 | 535.80 | 1077.0 |
| 1 | 8704 | 1233.0 | 232.0 | 640.90 | 1091.0 |
| 1 | 9728 | 1295.0 | 240.9 | 713.00 | 1147.0 |
| 1 | 10752 | 1390.0 | 256.4 | 785.10 | 1140.0 |
| 1 | 11776 | 1446.0 | 267.3 | 856.20 | 1178.0 |
| 1 | 12800 | 1351.0 | 259.2 | 891.50 | 1183.0 |
| 1 | 13824 | 1357.0 | 250.3 | 954.40 | 1202.0 |
| 1 | 14848 | 1467.0 | 266.0 | 918.80 | 1202.0 |
| 1 | 15872 | 1471.0 | 270.9 | 860.90 | 1224.0 |
| 2 | 512 | 264.1 | 156.0 | 28.69 | 277.1 |
| 2 | 1536 | 661.4 | 407.4 | 78.79 | 686.6 |
| 2 | 2560 | 996.3 | 179.4 | 245.20 | 653.2 |
| 2 | 3584 | 1173.0 | 205.1 | 299.50 | 833.1 |
| 2 | 4608 | 1006.0 | 172.7 | 381.40 | 823.9 |
| 2 | 5632 | 1219.0 | 206.8 | 442.50 | 934.6 |
| 2 | 6656 | 1433.0 | 223.6 | 503.30 | 893.7 |
| 2 | 7680 | 1452.0 | 242.4 | 502.30 | 943.4 |
| 2 | 8704 | 1319.0 | 230.2 | 616.60 | 969.8 |
| 2 | 9728 | 1407.0 | 239.2 | 685.20 | 1019.0 |
| 2 | 10752 | 1507.0 | 254.4 | 752.90 | 1027.0 |
| 2 | 11776 | 1536.0 | 264.9 | 819.10 | 1063.0 |
| 2 | 12800 | 1504.0 | 256.1 | 882.70 | 1068.0 |
| 2 | 13824 | 1482.0 | 256.6 | 891.80 | 1095.0 |
| 2 | 14848 | 1534.0 | 255.8 | 856.00 | 1082.0 |
| 2 | 15872 | 1608.0 | 271.4 | 815.20 | 1094.0 |
| 4 | 512 | 252.2 | 121.4 | 30.20 | 235.5 |
| 4 | 1536 | 542.7 | 151.7 | 80.72 | 506.1 |
| 4 | 2560 | 859.6 | 105.1 | 238.80 | 524.9 |
| 4 | 3584 | 925.2 | 132.1 | 295.50 | 621.7 |
| 4 | 4608 | 833.9 | 111.9 | 360.70 | 626.9 |
| 4 | 5632 | 1011.0 | 135.2 | 434.80 | 736.1 |
| 4 | 6656 | 1057.0 | 140.4 | 469.50 | 705.6 |
| 4 | 7680 | 1165.0 | 153.6 | 462.80 | 746.5 |
| 4 | 8704 | 1095.0 | 144.5 | 574.40 | 764.6 |
| 4 | 9728 | 1172.0 | 155.8 | 635.80 | 808.4 |
| 4 | 10752 | 1218.0 | 165.1 | 696.90 | 810.1 |
| 4 | 11776 | 1268.0 | 165.2 | 756.60 | 825.4 |
| 4 | 12800 | 1197.0 | 167.2 | 762.80 | 840.3 |
| 4 | 13824 | 1242.0 | 171.4 | 809.30 | 861.4 |
| 4 | 14848 | 1260.0 | 172.0 | 811.60 | 851.1 |
| 4 | 15872 | 1315.0 | 174.9 | 783.50 | 859.2 |
| 8 | 512 | 241.2 | 87.84 | 32.70 | 177.5 |
| 8 | 1536 | 398.7 | 122.20 | 82.01 | 302.5 |
| 8 | 2560 | 650.5 | 78.46 | 210.70 | 365.4 |
| 8 | 3584 | 632.0 | 84.19 | 257.80 | 419.0 |
| 8 | 4608 | 623.3 | 79.48 | 333.50 | 461.9 |
| 8 | 5632 | 743.8 | 91.44 | 402.80 | 487.7 |
| 8 | 6656 | 733.6 | 96.01 | 419.60 | 501.2 |
| 8 | 7680 | 816.2 | 103.80 | 453.60 | 509.3 |
| 8 | 8704 | 806.6 | 103.80 | 537.10 | 533.5 |
| 8 | 9728 | 855.2 | 107.90 | 592.90 | 545.2 |
| 8 | 10752 | 861.5 | 110.70 | 647.80 | 561.4 |
| 8 | 11776 | 891.2 | 114.90 | 614.10 | 564.0 |
| 8 | 12800 | 891.9 | 115.10 | 663.10 | 577.9 |
| 8 | 13824 | 896.1 | 114.50 | 704.50 | 584.9 |
| 8 | 14848 | 906.0 | 116.40 | 734.40 | 581.7 |
| 8 | 15872 | 931.9 | 120.00 | 699.90 | 575.9 |
Motivation

Several Triton kernels were failing with:

`triton.runtime.errors.OutOfResources: shared memory`

This was observed across multiple test cases:

- `test_moe_gemm_a4w4.py` #current PR
- `test_moe_gemm_a8w4.py` #current PR
- `test_moe_gemm_a8w8.py` #current PR

The issue became prominent in testing after async copy was enabled, which increased LDS (shared memory) usage.

This PR is about the MoE GEMMs labeled with #current PR, as the other two kernels are in progress.

Technical Details

The root cause is increased LDS pressure due to:

- `num_stages`

To address this:

- Reduce `num_stages` from 2 -> 1 in affected Triton kernel configs:
  - `moe_op_gemm_a4w4.py`
  - `moe_op_gemm_a8w4.py`
  - `moe_op_gemm_a8w8.py`

No other changes were introduced beyond parameter adjustments in the config files.

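To illustrate why lowering `num_stages` relieves LDS pressure, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which every pipeline stage keeps one A tile and one B tile resident in shared memory, so usage grows linearly with the stage count. The `TileConfig` class, the helper functions, the tile sizes, and the 64 KiB budget are all hypothetical values chosen for illustration. They are not taken from the kernel configs in this PR, and the allocation the Triton compiler actually performs may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical tile configuration; not the real config schema."""
    block_m: int
    block_n: int
    block_k: int
    a_bits: int      # bits per activation element (e.g. 8 for a8)
    b_bits: int      # bits per weight element (e.g. 4 for w4)
    num_stages: int


def estimated_lds_bytes(cfg: TileConfig) -> int:
    """Simplified estimate: one A tile plus one B tile per stage."""
    a_tile = cfg.block_m * cfg.block_k * cfg.a_bits // 8
    b_tile = cfg.block_k * cfg.block_n * cfg.b_bits // 8
    return cfg.num_stages * (a_tile + b_tile)


def fits_in_lds(cfg: TileConfig, lds_budget_bytes: int) -> bool:
    """True if the estimate stays within the given budget."""
    return estimated_lds_bytes(cfg) <= lds_budget_bytes


# Illustrative budget only; check the actual limit for the target GPU.
LDS_BUDGET = 64 * 1024

two_stage = TileConfig(block_m=64, block_n=256, block_k=128,
                       a_bits=8, b_bits=8, num_stages=2)
one_stage = replace(two_stage, num_stages=1)

for cfg in (two_stage, one_stage):
    print(cfg.num_stages, estimated_lds_bytes(cfg),
          fits_in_lds(cfg, LDS_BUDGET))
```

Under this model the two-stage configuration needs twice the shared memory of the one-stage configuration. That matches the direction of the fix described above, although the exact byte counts reported by `OutOfResources` depend on the compiler and backend.
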
Test Plan

- `test_moe_gemm_a4w4.py`
- `test_moe_gemm_a8w4.py`
- `test_moe_gemm_a8w8.py`

Test Result

- `test_moe_gemm_a4w4.py`
- `test_moe_gemm_a8w4.py`
- `test_moe_gemm_a8w8.py`

Submission Checklist