[ROCm] Add waves_per_eu support to Triton GEMM config by nurmukhametov · Pull Request #769 · ROCm/xla

nurmukhametov · 2026-04-01T11:08:23Z

Add the ROCm-specific waves_per_eu occupancy hint to the Triton GEMM autotuning and compilation pipeline. This parameter controls the number of wavefronts per execution unit on AMD GPUs, allowing the LLVM backend to optimize register allocation for target occupancy.

The parameter flows through: TritonGemmKey proto -> TritonGemmConfig struct -> BlockLevelFusionConfig proto -> BlockLevelParameters -> xtile_compiler, where it is applied as the "amdgpu-waves-per-eu" LLVM function attribute on the kernel, matching Triton's own AMD backend behavior. Default value of 0 means no restriction.

The autotuner search space is extended with values {0,1,2,4} for ROCm targets.
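The search-space extension described above can be pictured with a minimal, standalone sketch. `GemmConfigSketch` and `ExpandWithWavesPerEu` are hypothetical names invented for this illustration; they are not the PR's `TritonGemmConfig` or `AddWavesPerEuParameter`, and only the candidate values {0,1,2,4} are taken from the description.

```cpp
#include <iterator>
#include <vector>

// Hypothetical stand-in for the real config struct; only the fields needed
// for this illustration are present.
struct GemmConfigSketch {
  int block_m = 0;
  int block_n = 0;
  int waves_per_eu = 0;  // 0 means "no restriction".
};

// Candidate values named in the PR description.
inline constexpr int kWavesPerEuValues[] = {0, 1, 2, 4};

// Returns one copy of every input config per candidate waves_per_eu value
// when targeting ROCm, and the inputs unchanged otherwise.
std::vector<GemmConfigSketch> ExpandWithWavesPerEu(
    const std::vector<GemmConfigSketch>& configs, bool is_rocm) {
  if (!is_rocm) return configs;
  std::vector<GemmConfigSketch> expanded;
  expanded.reserve(configs.size() * std::size(kWavesPerEuValues));
  for (const GemmConfigSketch& config : configs) {
    for (int waves : kWavesPerEuValues) {
      GemmConfigSketch updated = config;
      updated.waves_per_eu = waves;
      expanded.push_back(updated);
    }
  }
  return expanded;
}
```

Under these assumptions, every base config yields four ROCm candidates, which is the 4x growth of the search space noted later in the review.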


draganmladjenovic · 2026-04-01T11:35:19Z

On upstream this may require https://github.com/openxla/xla/blob/main/xla/service/gpu/autotuning/autotune_cache_key.h#L37 increment too.

claude · 2026-04-01T11:38:04Z

xla/autotuning.proto

    int64 num_ctas = 7;
    bool is_tma_allowed = 8;
    bool is_warp_specialization_allowed = 9;
+    int64 waves_per_eu = 10;


nit: waves_per_eu is declared as int64 here, but as int32 in backend_configs.proto (BlockLevelFusionConfig) and as int in the C++ structs (TritonGemmConfig, BlockLevelParameters). Since the max meaningful value is single digits ({0,1,2,4}), consider using int32 to match backend_configs.proto and the pattern used by num_ctas/num_stages there.

It is fine; I follow the num_ctas/num_stages pattern exactly.

Resolved — you're right, num_ctas, num_stages, and num_warps all use int64 in this proto. Consistent with the existing pattern.

claude · 2026-04-01T11:38:09Z

xla/service/gpu/matmul_utils.cc

+      proto.is_tma_allowed(), proto.is_warp_specialization_allowed(),
+      proto.waves_per_eu());


All other numeric config fields validated above have TF_RET_CHECK(... >= 1) guards. waves_per_eu is missing a validation check — a negative value from a corrupt/malicious cache entry would be silently accepted and passed to the LLVM attribute.

Consider adding:

Suggested change

       proto.is_tma_allowed(), proto.is_warp_specialization_allowed(),
-      proto.waves_per_eu());
+      proto.waves_per_eu());
+  TF_RET_CHECK(config.waves_per_eu >= 0);
+  return config;

Resolved — TF_RET_CHECK(proto.waves_per_eu() >= 0) added in this revision.
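As a rough illustration of the guard discussed in this thread, here is a standalone sketch that does not use XLA's `TF_RET_CHECK` or the generated proto classes. `GemmKeySketch` and `ValidateKey` are hypothetical names, and which fields carry the `>= 1` guard is assumed from the reviewer's description rather than copied from `matmul_utils.cc`.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the numeric fields read from the proto.
struct GemmKeySketch {
  std::int64_t num_stages = 1;
  std::int64_t num_warps = 1;
  std::int64_t num_ctas = 1;
  std::int64_t waves_per_eu = 0;
};

// Returns an error message for the first invalid field, or std::nullopt if
// every field is acceptable.
std::optional<std::string> ValidateKey(const GemmKeySketch& key) {
  if (key.num_stages < 1) return std::string("num_stages must be >= 1");
  if (key.num_warps < 1) return std::string("num_warps must be >= 1");
  if (key.num_ctas < 1) return std::string("num_ctas must be >= 1");
  // Unlike its siblings, waves_per_eu may legitimately be 0 ("no
  // restriction"), so the lower bound is 0 rather than 1.
  if (key.waves_per_eu < 0) return std::string("waves_per_eu must be >= 0");
  return std::nullopt;
}
```

The point of the asymmetric bound is that a negative value read from a cache entry is rejected before it can reach code generation, while the default of 0 still passes.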

claude · 2026-04-01T11:38:12Z

xla/backends/gpu/autotuner/triton/dot_search_space_test.cc

+  // All waves_per_eu values should be from the expected set {0,1,2,4}.
+  EXPECT_THAT(configs, Each(WavesPerEuIs(Le(4))));


nit: The assertion Le(4) would also pass for value 3, which is not in the expected set. Consider using AnyOf(0, 1, 2, 4) to precisely match kWavesPerEuValues:

Suggested change

-  // All waves_per_eu values should be from the expected set {0,1,2,4}.
-  EXPECT_THAT(configs, Each(WavesPerEuIs(Le(4))));
+  // waves_per_eu should be from the expected set {0,1,2,4}.
+  Each(WavesPerEuIs(AnyOf(0, 1, 2, 4)))));

Resolved — now using AnyOf(0, 1, 2, 4) for precise matching.
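The difference between the two assertions can be shown without gMock. The sketch below uses plain predicates as analogues of `Le(4)` and `AnyOf(0, 1, 2, 4)`; the function names are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <array>

// The set of values the search space is expected to produce.
constexpr std::array<int, 4> kExpectedWavesPerEu = {0, 1, 2, 4};

// Analogue of Le(4): accepts anything not greater than 4.
bool PassesUpperBoundCheck(int value) { return value <= 4; }

// Analogue of AnyOf(0, 1, 2, 4): accepts only members of the expected set.
bool PassesMembershipCheck(int value) {
  return std::find(kExpectedWavesPerEu.begin(), kExpectedWavesPerEu.end(),
                   value) != kExpectedWavesPerEu.end();
}
```

With these definitions, both 3 and -1 satisfy the upper-bound check but fail the membership check, which is the gap the reviewer pointed out.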

claude · 2026-04-01T11:38:17Z

xla/backends/gpu/codegen/triton/tests/fusion_emitter_device_test.cc

+
+  ASSERT_NE(result.llvm_module, nullptr);
+  for (const auto& fn : result.llvm_module->functions()) {
+    if (fn.getName() == "test_fn") {
+      auto attr = fn.getFnAttribute("amdgpu-waves-per-eu");
+      EXPECT_FALSE(attr.isStringAttribute())
+          << "waves_per_eu=0 should not set amdgpu-waves-per-eu attribute";
+    }
+  }


Unlike the non-zero test above which asserts EXPECT_TRUE(found_attr), this zero-value test doesn't verify the kernel function test_fn was actually found in the module. If test_fn is missing (e.g., due to a test infrastructure change), the test would pass vacuously. Consider adding an assertion that the function was located:

Suggested change

-  ASSERT_NE(result.llvm_module, nullptr);
-  for (const auto& fn : result.llvm_module->functions()) {
-    if (fn.getName() == "test_fn") {
-      auto attr = fn.getFnAttribute("amdgpu-waves-per-eu");
-      EXPECT_FALSE(attr.isStringAttribute())
-          << "waves_per_eu=0 should not set amdgpu-waves-per-eu attribute";
-    }
-  }
+    if (func.getName() == "test_fn") {
+      EXPECT_FALSE(func.hasFnAttribute("amdgpu-waves-per-eu"))
+          << "waves_per_eu=0 should not set amdgpu-waves-per-eu attribute";
+      found_attr = true;
+    }
+  }
+  EXPECT_TRUE(found_attr)
+      << "Kernel function 'test_fn' not found in LLVM module";

(reusing found_attr to mean "found the function" rather than "found the attribute")

Resolved — the zero-value test now tracks found_fn and asserts the kernel function was located.
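The vacuous-pass problem from this thread can be reproduced with a small model that does not depend on LLVM. `FunctionSketch` and `KernelLacksAttribute` are hypothetical names; a module is modelled as a plain list of named functions with string attributes, which is an assumption made purely for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a function in an LLVM module.
struct FunctionSketch {
  std::string name;
  std::vector<std::string> attributes;
};

// Returns true only if the kernel exists AND does not carry the attribute.
// Tracking found_fn prevents the check from succeeding when the kernel is
// missing and the loop body never runs.
bool KernelLacksAttribute(const std::vector<FunctionSketch>& functions,
                          const std::string& kernel_name,
                          const std::string& attribute) {
  bool found_fn = false;
  bool attribute_absent = true;
  for (const FunctionSketch& fn : functions) {
    if (fn.name != kernel_name) continue;
    found_fn = true;
    if (std::find(fn.attributes.begin(), fn.attributes.end(), attribute) !=
        fn.attributes.end()) {
      attribute_absent = false;
    }
  }
  return found_fn && attribute_absent;
}
```

Without the `found_fn` term, a module that contains no `test_fn` at all would be reported as passing, which is the case the reviewer wanted the test to catch.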

claude · 2026-04-01T11:38:28Z

Review Summary

Clean, well-structured PR that threads waves_per_eu through the full Triton GEMM stack (proto → config → autotuner → LLVM IR). The propagation is complete across all required touchpoints and tests cover both the positive and zero-value cases.

4 inline comments posted:

Proto type inconsistency (int64 vs int32 elsewhere) — nit
Missing >= 0 validation in FromProto — suggested fix
Test assertion Le(4) is weaker than the intended set {0,1,2,4} — nit
Zero-value device test could pass vacuously if kernel function isn't found — suggested fix

No correctness issues found. The 4x search space expansion for ROCm is noted and appears intentional.

🤖 Generated with Claude Code

claude · 2026-04-01T14:35:36Z

Re-review Summary

All 4 previous review comments have been addressed:

Proto type consistency (int64 in autotuning.proto) — confirmed consistent with sibling fields
Validation check — TF_RET_CHECK(proto.waves_per_eu() >= 0) added in matmul_utils.cc
Test precision — AnyOf(0, 1, 2, 4) matcher now used in search space test
Vacuous test guard — found_fn tracking added to device test for zero-value case

No new issues found. The waves_per_eu plumbing is complete and well-tested across proto definitions, config structs, autotuner search space, and the xtile compiler. LGTM.

nurmukhametov · 2026-04-02T08:57:18Z

On upstream this may require https://github.com/openxla/xla/blob/main/xla/service/gpu/autotuning/autotune_cache_key.h#L37 increment too.

I have incremented it here. I wonder whether I also need to do something similar because of the backend_proto change?

i-chaochen · 2026-04-07T14:16:30Z

xla/backends/gpu/autotuner/triton/dot_search_space.cc

+void TritonDotFusionSearchSpace::AddWavesPerEuParameter(
+    const ConfigWithNotes& config,
+    std::vector<ConfigWithNotes>& updated_configs) const {
+  static constexpr int kWavesPerEuValues[] = {0, 1, 2, 4};


Could you add this one as a reference? https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html#auto-tunable-kernel-configurations

Maybe this one is better? https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html#mi300x-occupancy-vgpr-table

I have added a comment based on both links here. I think it is better for the code to be self-contained, and I believe a code comment will outlive any URL.

i-chaochen · 2026-04-07T14:19:34Z

xla/backends/gpu/codegen/triton/xtile_compiler.cc

      VerifyModule(*ll_triton_module);
    }

+    // Apply ROCm-specific waves_per_eu attribute if set.


Add this one as a reference: https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html#compute-the-occupancy-of-a-kernel

Add the ROCm-specific `waves_per_eu` hint to the Triton GEMM config. This parameter specifies the minimum number of wavefronts per execution unit on AMD GPUs. The LLVM backend uses this to:

1) limit the number of SGPRs and VGPRs available per wave, which affects register allocation;
2) set register pressure thresholds for the instruction scheduler.

The parameter flows through: TritonGemmKey proto -> TritonGemmConfig struct -> BlockLevelFusionConfig proto -> BlockLevelParameters -> xtile_compiler, where it is applied as the "amdgpu-waves-per-eu" LLVM function attribute on the kernel, matching Triton's own AMD backend behavior. Default value of 0 means no restriction.

The autotuner search space is extended with values {0,1,2,4} for ROCm targets.
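The last step of that flow, deciding whether to emit the attribute at all, might look roughly like the sketch below. `WavesPerEuAttributeValue` is a hypothetical helper, and encoding the value as a single decimal string is an assumption; the actual code in `xtile_compiler.cc` is not shown in this conversation.

```cpp
#include <optional>
#include <string>

// Returns the string value to attach under the "amdgpu-waves-per-eu" key, or
// std::nullopt when no attribute should be set. A value of 0 means "no
// restriction", so nothing is emitted in that case.
std::optional<std::string> WavesPerEuAttributeValue(int waves_per_eu) {
  if (waves_per_eu <= 0) return std::nullopt;
  return std::to_string(waves_per_eu);
}

// In real code the result would presumably be passed to something like
// llvm::Function::addFnAttr("amdgpu-waves-per-eu", value); that call is left
// out so the sketch compiles without LLVM.
```

This mirrors the two device tests discussed in the review: a non-zero value produces a string attribute, and the default of 0 leaves the kernel function untouched.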

nurmukhametov force-pushed the anurmukh/add-waves-per-eu-triton-gemm-config branch from 7ff0909 to 9c4694e Compare April 1, 2026 11:33

nurmukhametov added the claude-review Request a Claude AI code review for this PR label Apr 1, 2026

claude bot reviewed Apr 1, 2026


github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 1, 2026

nurmukhametov added the claude-review Request a Claude AI code review for this PR label Apr 1, 2026

github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 1, 2026

nurmukhametov requested review from draganmladjenovic and i-chaochen April 2, 2026 08:57

nurmukhametov force-pushed the anurmukh/add-waves-per-eu-triton-gemm-config branch 2 times, most recently from 3dac08c to b471c38 Compare April 7, 2026 13:31

i-chaochen reviewed Apr 7, 2026


nurmukhametov force-pushed the anurmukh/add-waves-per-eu-triton-gemm-config branch 4 times, most recently from b8cfb81 to 22d19a8 Compare April 7, 2026 16:29

nurmukhametov requested a review from i-chaochen April 8, 2026 08:30

nurmukhametov force-pushed the anurmukh/add-waves-per-eu-triton-gemm-config branch from 22d19a8 to 34bb84c Compare April 8, 2026 08:47
