[ROCm] Remove rocPRIM detail:: dependency from CUB scan kernel by magaonka-amd · Pull Request #792 · ROCm/xla

magaonka-amd · 2026-04-12T00:35:46Z

Note: this PR is opened for discussion, not for merging.

- Replace rocprim::detail::default_scan_config_base<T> with a hardcoded scan config that uses the same generic formula (block_size=256, items_per_thread=max(1, 16/item_scale), transpose load/store, warp_scan).
- Remove the #include of rocprim/device/detail/device_config_helper.hpp.
- Fix the build breakage with TheRock >= 7.11, where default_scan_config_base was renamed/restructured as part of the rocPRIM config modernization.

Why move to hardcoded configs instead of a try_compile() approach?
I honestly couldn't find a single instance of try_compile() being used in XLA, so I would be the first to introduce it, and I wasn't sure that would fly upstream.

Why not version-guard this?
ROCm version numbers are currently all over the place because of the ROCm-to-TheRock transition, which makes version guarding hard.

Also, the existing code was only pulling the default configs. My bad: I was trying to see whether there is a way to do this per arch, but I was wrong. rocPRIM has arch-specific scan configs (up to gfx942) for runtime selection, but here we need the scan config at compile time.
So I think a hardcoded value is not bad, but exactly which scan config we should choose is debatable.

The debatable part is laid out below.

The default numbers I added in this PR are the recommended default values from rocPRIM. To get a better picture, I ran a small sweep over the scan config combinations listed further down to see whether any of them is significantly better than the default. All my testing was on MI355 with ROCm 7.2.

My benchmarking patch looks like this:

diff --git a/xla/stream_executor/rocm/BUILD b/xla/stream_executor/rocm/BUILD
index af04328591..fa9865ce64 100644
--- a/xla/stream_executor/rocm/BUILD
+++ b/xla/stream_executor/rocm/BUILD
@@ -1270,10 +1270,12 @@ xla_test(
         "//xla/tsl/platform:errors",
         "//xla/tsl/platform:statusor",
         "@com_google_absl//absl/cleanup",
+        "@com_google_absl//absl/log:check",
         "@com_google_absl//absl/status",
         "@com_google_absl//absl/strings:str_format",
         "@com_google_absl//absl/strings:string_view",
-        "@com_google_googletest//:gtest_main",
+        "//xla/tsl/platform:test_benchmark",
+        "//xla/tsl/platform:test_main",
         "@local_config_rocm//rocm:rocm_headers",
     ],
 )
diff --git a/xla/stream_executor/rocm/cub_scan_kernel_rocm_test.cc b/xla/stream_executor/rocm/cub_scan_kernel_rocm_test.cc
index 043ed8d041..c0c52d3d9a 100644
--- a/xla/stream_executor/rocm/cub_scan_kernel_rocm_test.cc
+++ b/xla/stream_executor/rocm/cub_scan_kernel_rocm_test.cc
@@ -26,6 +26,7 @@ limitations under the License.
 #include <gmock/gmock.h>
 #include <gtest/gtest.h>
 #include "absl/cleanup/cleanup.h"
+#include "absl/log/check.h"
 #include "absl/status/status.h"
 #include "absl/strings/str_format.h"
 #include "absl/strings/string_view.h"
@@ -39,6 +40,7 @@ limitations under the License.
 #include "xla/stream_executor/stream.h"
 #include "xla/stream_executor/stream_executor.h"
 #include "xla/tsl/lib/core/status_test_util.h"
+#include "xla/tsl/platform/test_benchmark.h"
 #include "xla/tsl/platform/errors.h"
 #include "xla/tsl/platform/statusor.h"
 #include "xla/xla_data.pb.h"
@@ -346,5 +348,72 @@ INSTANTIATE_TEST_SUITE_P(
                        ::testing::Values(false)),
     ParametersToString);
 
+//===----------------------------------------------------------------------===//
+// Performance benchmarks
+//===----------------------------------------------------------------------===//
+
+static se::Platform* GetRocmPlatform() {
+  return se::PlatformManager::PlatformWithName("ROCM").value();
+}
+
+static void BM_CubScan1D(benchmark::State& state) {
+  int64_t row_length = state.range(0);
+  se::StreamExecutor* executor =
+      GetRocmPlatform()->ExecutorForDevice(0).value();
+  auto stream = executor->CreateStream(std::nullopt).value();
+  se::DeviceAddress<float> device_data =
+      executor->AllocateArray<float>(row_length);
+  size_t scratch_bytes =
+      CubScanGetScratchSize(xla::F32, 1, row_length, 1, CubScanKind::kSum,
+                            false)
+          .value();
+  se::DeviceMemory<uint8_t> scratch =
+      executor->AllocateArray<uint8_t>(std::max(scratch_bytes, size_t{1}));
+  auto cleanup = absl::MakeCleanup([&]() {
+    executor->Deallocate(&device_data);
+    executor->Deallocate(&scratch);
+  });
+  auto hip_stream =
+      static_cast<hipStream_t>(stream->platform_specific_handle().stream);
+
+  for (auto _ : state) {
+    CHECK_OK(CubScanLaunchKernel(xla::F32, scratch.opaque(), scratch_bytes,
+                                 device_data.opaque(), device_data.opaque(), 1,
+                                 row_length, 1, CubScanKind::kSum, false,
+                                 hip_stream));
+    CHECK_OK(stream->BlockHostUntilDone());
+  }
+  state.SetBytesProcessed(state.iterations() * row_length * sizeof(float));
+}
+BENCHMARK(BM_CubScan1D)->RangeMultiplier(4)->Range(4096, 16 * 1024 * 1024);
+
+static void BM_CubScan2D(benchmark::State& state) {
+  int64_t row_length = state.range(0);
+  int64_t col_length = state.range(1);
+  int64_t total = row_length * col_length;
+  se::StreamExecutor* executor =
+      GetRocmPlatform()->ExecutorForDevice(0).value();
+  auto stream = executor->CreateStream(std::nullopt).value();
+  se::DeviceAddress<float> device_data = executor->AllocateArray<float>(total);
+  auto cleanup =
+      absl::MakeCleanup([&]() { executor->Deallocate(&device_data); });
+  auto hip_stream =
+      static_cast<hipStream_t>(stream->platform_specific_handle().stream);
+
+  for (auto _ : state) {
+    CHECK_OK(CubScanLaunchKernel(
+        xla::F32, nullptr, 0, device_data.opaque(), device_data.opaque(), 1,
+        row_length, col_length, CubScanKind::kSum, false, hip_stream));
+    CHECK_OK(stream->BlockHostUntilDone());
+  }
+  state.SetBytesProcessed(state.iterations() * total * sizeof(float));
+}
+BENCHMARK(BM_CubScan2D)
+    ->Args({1024, 1024})
+    ->Args({4096, 256})
+    ->Args({256, 4096})
+    ->Args({512, 2048})
+    ->Args({8192, 128});
+
 }  // namespace
 }  // namespace stream_executor::rocm

My iteration logic to sweep all combinations:

BS means kBlockSize
IPT means kItemsPerThread
ALGO means block_scan_algorithm

for BS in 128 256; do
  for IPT in 4 8 12 16 20 24; do
    for ALGO in using_warp_scan reduce_then_scan; do
      run_bench $BS $IPT $ALGO
    done
  done
done
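Because kBlockSize, kItemsPerThread, and the scan algorithm are compile-time constants, each combination in the loop above presumably needs its own build. The following is only a sketch of one way to wire that up: the `SCAN_SWEEP_*` macro names, the `SweepScanConfig` struct, and the stand-in `ScanAlgo` enum are hypothetical and are not taken from the benchmarking patch.

```cpp
// Hypothetical sweep knobs (names are assumptions, not from the PR).
// Each benchmark build would override them on the compiler command line, e.g.
//   -DSCAN_SWEEP_BS=128 -DSCAN_SWEEP_IPT=8 -DSCAN_SWEEP_REDUCE_THEN_SCAN=1
#ifndef SCAN_SWEEP_BS
#define SCAN_SWEEP_BS 256
#endif
#ifndef SCAN_SWEEP_IPT
#define SCAN_SWEEP_IPT 16
#endif
#ifndef SCAN_SWEEP_REDUCE_THEN_SCAN
#define SCAN_SWEEP_REDUCE_THEN_SCAN 0
#endif

// Stand-in for rocprim::block_scan_algorithm so the sketch builds without
// rocPRIM headers.
enum class ScanAlgo { kUsingWarpScan, kReduceThenScan };

struct SweepScanConfig {
  static constexpr int kBlockSize = SCAN_SWEEP_BS;
  static constexpr int kItemsPerThread = SCAN_SWEEP_IPT;
  static constexpr int kTileSize = kBlockSize * kItemsPerThread;
  static constexpr ScanAlgo kScanAlgorithm = SCAN_SWEEP_REDUCE_THEN_SCAN
                                                 ? ScanAlgo::kReduceThenScan
                                                 : ScanAlgo::kUsingWarpScan;
};

// Guard against typos in the sweep script: only the swept block sizes pass.
static_assert(SweepScanConfig::kBlockSize == 128 ||
                  SweepScanConfig::kBlockSize == 256,
              "sweep only covers block sizes 128 and 256");
static_assert(SweepScanConfig::kItemsPerThread > 0,
              "items per thread must be positive");
```

With no overrides the struct falls back to the 256/16/using_warp_scan default; with Bazel the defines could be passed through `--copt`, one build per combination.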

Below are my 2D scan results for the various scan config combinations:

**Hardware**: AMD Instinct MI355X (gfx950, CDNA4), ROCm 7.2  
**Benchmark**: Google Benchmark, F32 inclusive scan, `BM_CubScan2D`  
**Load/Store**: `block_load_transpose` / `block_store_transpose` (all configs)

#### 2D Batched Scan Throughput (GB/s) -- higher is better

| block_size | items_per_thread | scan_algorithm    | 1024x1024 | 4096x256 | 256x4096 | 512x2048 | 8192x128 | Avg GB/s |
|------------|------------------|-------------------|-----------|----------|----------|----------|----------|----------|
| 128        | 4                | using_warp_scan   | 293.9     | 225.5    | 295.1    | 304.9    | 172.7    | 258.4    |
| 128        | 4                | reduce_then_scan  | 266.5     | 207.8    | 274.5    | 278.6    | 158.3    | 237.2    |
| 128        | 8                | using_warp_scan   | **301.7** | 249.4    | 282.2    | **294.6**| 200.7    | **265.7**|
| 128        | 8                | reduce_then_scan  | 300.7     | 245.1    | 278.2    | 293.0    | 194.8    | 262.4    |
| 128        | 12               | using_warp_scan   | 293.9     | 255.8    | 271.1    | 289.7    | 204.8    | 263.0    |
| 128        | 12               | reduce_then_scan  | 278.4     | 236.4    | 251.3    | 270.6    | 192.0    | 245.7    |
| 128        | 16               | using_warp_scan   | 269.0     | 241.6    | 230.3    | 253.0    | 202.9    | 239.4    |
| 128        | 16               | reduce_then_scan  | 268.5     | 240.5    | 228.1    | 252.4    | 202.5    | 238.4    |
| 128        | 20               | using_warp_scan   | 284.5     | 256.8    | 242.8    | 268.0    | 205.0    | 251.4    |
| 128        | 20               | reduce_then_scan  | 261.2     | 239.3    | 227.4    | 252.5    | 193.9    | 234.8    |
| 128        | 24               | using_warp_scan   | 258.2     | 231.6    | 212.0    | 240.8    | 204.2    | 229.4    |
| 128        | 24               | reduce_then_scan  | 275.6     | 244.3    | 222.5    | 256.0    | 211.7    | 242.0    |
| 256        | 4                | using_warp_scan   | 281.7     | 244.2    | 249.2    | 265.9    | 206.7    | 249.5    |
| 256        | 4                | reduce_then_scan  | 282.2     | 243.5    | 252.7    | 270.6    | 199.9    | 249.8    |
| 256        | 8                | using_warp_scan   | 270.6     | 255.6    | 231.6    | 256.1    | 220.9    | 247.0    |
| 256        | 8                | reduce_then_scan  | 295.8     | **275.3**| 248.9    | 280.0    | **231.0**| 266.2    |
| 256        | 12               | using_warp_scan   | 285.5     | 267.3    | 235.3    | 267.4    | 242.1    | 259.5    |
| 256        | 12               | reduce_then_scan  | 268.0     | 248.6    | 223.6    | 252.5    | 226.3    | 243.8    |
| **256**    | **16**           | **using_warp_scan**| 251.7    | 264.6    | 196.3    | 230.9    | 236.5    | 236.0    |
| 256        | 16               | reduce_then_scan  | 251.4     | 261.8    | 196.7    | 231.0    | 235.4    | 235.3    |
| 256        | 20               | using_warp_scan   | 252.3     | 263.4    | 195.5    | 227.2    | 231.1    | 233.9    |
| 256        | 20               | reduce_then_scan  | 269.6     | 282.5    | 205.0    | 243.9    | 243.2    | 248.8    |
| 256        | 24               | using_warp_scan   | 258.9     | 277.5    | 183.0    | 223.3    | 238.1    | 236.1    |
| 256        | 24               | reduce_then_scan  | 241.6     | 258.5    | 172.3    | 210.7    | 224.7    | 221.6    |


#### Top 5 Configs by Average Throughput

| Rank | Config                            | Avg GB/s | vs Default |
|------|-----------------------------------|----------|------------|
| 1    | 256/8/reduce_then_scan            | 266.2    | +12.8%     |
| 2    | 128/8/using_warp_scan             | 265.7    | +12.6%     |
| 3    | 128/12/using_warp_scan            | 263.0    | +11.4%     |
| 4    | 128/8/reduce_then_scan            | 262.4    | +11.2%     |
| 5    | 256/12/using_warp_scan            | 259.5    | +10.0%     |
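For reference, the "Avg GB/s" and "vs Default" columns can be reproduced from the per-shape numbers. The sketch below does that for the default row (256/16/using_warp_scan) and the top-ranked row (256/8/reduce_then_scan); the helper names are illustrative only.

```cpp
#include <cstddef>

// Per-shape throughput in GB/s, copied from the sweep table
// (1024x1024, 4096x256, 256x4096, 512x2048, 8192x128).
struct Row {
  double gbps[5];
};

constexpr Row kDefault{{251.7, 264.6, 196.3, 230.9, 236.5}};  // 256/16/using_warp_scan
constexpr Row kBest{{295.8, 275.3, 248.9, 280.0, 231.0}};     // 256/8/reduce_then_scan

// Arithmetic mean over the five benchmark shapes.
constexpr double Mean(const Row& r) {
  double sum = 0.0;
  for (std::size_t i = 0; i < 5; ++i) sum += r.gbps[i];
  return sum / 5.0;
}

// How much faster `candidate` is than `baseline`, in percent.
constexpr double SpeedupPercent(double candidate, double baseline) {
  return (candidate / baseline - 1.0) * 100.0;
}

// How much slower `baseline` is than `candidate`, in percent.
constexpr double SlowdownPercent(double baseline, double candidate) {
  return (1.0 - baseline / candidate) * 100.0;
}

// Tolerances absorb floating-point rounding.
static_assert(Mean(kDefault) > 235.9 && Mean(kDefault) < 236.1, "avg ~236.0");
static_assert(Mean(kBest) > 266.1 && Mean(kBest) < 266.3, "avg ~266.2");
```

`SpeedupPercent(Mean(kBest), Mean(kDefault))` comes out at roughly 12.8, matching the table. The same gap read the other way, `SlowdownPercent(Mean(kDefault), Mean(kBest))`, is roughly 11.3, so "12.8% faster" and "12.8% slower" are not interchangeable.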

My experiments above show that the default is not the best configuration, at least on MI355. I would like to hear opinions on whether it is okay to move away from the rocPRIM default config and introduce our own scan config numbers here.
Also, good numbers on MI355 may not mean good performance on all hardware, so I'm a little unsure about picking the best config here.


claude · 2026-04-12T00:43:16Z

xla/stream_executor/rocm/cub_scan_kernel_rocm_impl.cu.cc

+      (sizeof(T) + sizeof(int) - 1) / sizeof(int);
+  static constexpr int kBlockSize = 256;
+  static constexpr int kItemsPerThread =
+      static_cast<int>(16u / kItemScale > 0 ? 16u / kItemScale : 1u);


nit: The ternary expression 16u / kItemScale > 0 ? 16u / kItemScale : 1u is correct by C++ operator precedence, but it can trip up readers who have to mentally verify the binding. Consider std::max(1u, 16u / kItemScale) (with <algorithm> included): same semantics, immediately clear intent, and it avoids writing the division expression twice (the compiler would fold both forms to the same constant anyway).

Suggested change

-      static_cast<int>(16u / kItemScale > 0 ? 16u / kItemScale : 1u);
+      static_cast<int>(std::max(1u, 16u / kItemScale));
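To check that the `std::max` form is a drop-in replacement, a standalone sketch can compare both spellings at compile time. The helper functions below are hypothetical and take the element size in bytes instead of a type `T`; they are not part of the PR.

```cpp
#include <algorithm>
#include <cstddef>

static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int");

// Mirrors kItemScale from the patch: element size in units of sizeof(int),
// rounded up.
constexpr unsigned int ItemScale(std::size_t bytes) {
  return static_cast<unsigned int>((bytes + sizeof(int) - 1) / sizeof(int));
}

// Original spelling from the patch.
constexpr int ItemsPerThreadTernary(std::size_t bytes) {
  return static_cast<int>(16u / ItemScale(bytes) > 0 ? 16u / ItemScale(bytes)
                                                     : 1u);
}

// Suggested spelling; std::max is constexpr since C++14.
constexpr int ItemsPerThreadMax(std::size_t bytes) {
  return static_cast<int>(std::max(1u, 16u / ItemScale(bytes)));
}

// Both forms agree for common element sizes and for the clamped case.
static_assert(ItemsPerThreadTernary(4) == ItemsPerThreadMax(4), "float-sized");
static_assert(ItemsPerThreadTernary(8) == ItemsPerThreadMax(8), "double-sized");
static_assert(ItemsPerThreadTernary(128) == ItemsPerThreadMax(128), "clamped");
```

The 128-byte case is the only one where the clamp matters: the item scale is 32, `16u / 32` is 0, and both forms return 1.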

claude · 2026-04-12T00:43:27Z

xla/stream_executor/rocm/cub_scan_kernel_rocm_impl.cu.cc

+// Scan config using rocPRIM's generic default formula for block size and
+// items-per-thread. Hardcoded to avoid depending on rocPRIM internal
+// detail:: API (default_scan_config_base was renamed/restructured since
+// TheRock >= 7.11).
 template <typename T>
 struct ScanConfig {
-  using RocprimConfig =
-      typename rocprim::detail::default_scan_config_base<T>::type;
-  static constexpr int kBlockSize = RocprimConfig::block_size;
-  static constexpr int kItemsPerThread = RocprimConfig::items_per_thread;
+  static constexpr unsigned int kItemScale =
+      (sizeof(T) + sizeof(int) - 1) / sizeof(int);
+  static constexpr int kBlockSize = 256;
+  static constexpr int kItemsPerThread =
+      static_cast<int>(16u / kItemScale > 0 ? 16u / kItemScale : 1u);
  static constexpr int kTileSize = kBlockSize * kItemsPerThread;
-  static constexpr auto kLoadMethod = RocprimConfig::block_load_method;
-  static constexpr auto kStoreMethod = RocprimConfig::block_store_method;
-  static constexpr auto kScanAlgorithm = RocprimConfig::block_scan_method;
+  static constexpr auto kLoadMethod =
+      rocprim::block_load_method::block_load_transpose;
+  static constexpr auto kStoreMethod =
+      rocprim::block_store_method::block_store_transpose;
+  static constexpr auto kScanAlgorithm =
+      rocprim::block_scan_algorithm::using_warp_scan;


Observation on performance vs. the CUDA path: the CUDA implementation uses CUB's policy_hub for architecture-tuned configs at compile time, so it adapts to different GPU architectures automatically. This hardcoded config can't adapt. Your own benchmark data shows that config 256/16/using_warp_scan (which is what this PR produces for float) achieves ~236 GB/s average on MI355X, while the best config you found (256/8/reduce_then_scan at 266.2 GB/s) is about 12.8% faster.

Since this is opened for discussion, a few options to consider:

1. Keep the current generic default (as in this PR) for safety across architectures, but add a TODO comment noting the performance gap and the intent to revisit if rocPRIM exposes a stable public config API.

2. Use the better-performing config (reduce_then_scan with items_per_thread=8 at block_size=256) if MI355X is the primary target. This needs validation on other architectures (MI210, MI300X, etc.) to avoid regressions.

3. Parameterize by architecture at compile time using __gfx*__ macros, similar to how the CUDA path dispatches by SM version. This is more work but would let you pick optimal configs per GPU family.

Any of these are defensible; the key is documenting the choice and the known gap.
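The compile-time architecture option could look roughly like the sketch below. It is an illustration under stated assumptions, not a proposed patch: `ScanTuning`, `SelectF32Tuning`, and the stand-in `ScanAlgo` enum are hypothetical, `__gfx950__` is assumed to be the macro the compiler defines when targeting MI355X, and the gfx950 numbers come only from the F32 sweep in the PR description.

```cpp
// Stand-in for rocprim::block_scan_algorithm so the sketch builds without
// rocPRIM headers.
enum class ScanAlgo { kUsingWarpScan, kReduceThenScan };

// Hypothetical bundle of the three swept parameters.
struct ScanTuning {
  int block_size;
  int items_per_thread;
  ScanAlgo algo;
};

// Picks a tuning for 4-byte elements based on the target architecture.
constexpr ScanTuning SelectF32Tuning() {
#if defined(__gfx950__)
  // Assumption: best average in the MI355X F32 sweep (256/8/reduce_then_scan).
  return ScanTuning{256, 8, ScanAlgo::kReduceThenScan};
#else
  // Generic fallback: the value this PR hardcodes for float.
  return ScanTuning{256, 16, ScanAlgo::kUsingWarpScan};
#endif
}

constexpr ScanTuning kF32Tuning = SelectF32Tuning();
static_assert(kF32Tuning.block_size > 0 && kF32Tuning.items_per_thread > 0,
              "tuning values must be positive");
```

One caveat worth verifying before going this way: architecture macros are normally visible only in the device compilation pass, so host code that derives the grid size from the tile size would see the fallback branch and could disagree with the kernel unless the host picks the same values another way.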

claude · 2026-04-12T00:43:38Z

Review Summary

Reviewed the removal of rocprim::detail::default_scan_config_base<T> and its replacement with hardcoded scan config values. The fix correctly addresses the build breakage from TheRock >= 7.11.

Key observations (details in inline comments):

The ternary for kItemsPerThread is correct but could be clearer; std::max is suggested.
The chosen default config (256/16/using_warp_scan for float) averages ~236 GB/s, while the best config found in the benchmarks (256/8/reduce_then_scan) is ~12.8% faster. Since this is opened for discussion, the inline comment covers several paths forward.

No correctness issues found: the kernel interface is unchanged, so existing tests should continue to pass. The main open question is the performance/portability tradeoff for the scan config, which the author has already flagged.

nurmukhametov · 2026-04-13T10:29:48Z

It looks good to me.

Regarding the config change, I am OK with changing it but should we test performance for other datatypes (fp16, fp8) first?
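As context for that question, here is a small sketch of what the PR's formula yields per element size. `TileFor` and `TileInfo` are hypothetical helpers, and the byte sizes 1, 2, and 4 stand in for fp8, fp16, and float.

```cpp
#include <algorithm>
#include <cstddef>

static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int");

// Hypothetical summary of one tile for a given element size.
struct TileInfo {
  int items_per_thread;
  int tile_elems;
  std::size_t tile_bytes;
};

// Applies the PR's generic formula with block_size fixed at 256.
constexpr TileInfo TileFor(std::size_t elem_bytes) {
  const unsigned int item_scale =
      static_cast<unsigned int>((elem_bytes + sizeof(int) - 1) / sizeof(int));
  const int items_per_thread =
      static_cast<int>(std::max(1u, 16u / item_scale));
  const int tile_elems = 256 * items_per_thread;
  return TileInfo{items_per_thread, tile_elems,
                  static_cast<std::size_t>(tile_elems) * elem_bytes};
}

// 1-, 2- and 4-byte elements all get 16 items per thread.
static_assert(TileFor(1).items_per_thread == 16, "fp8-sized");
static_assert(TileFor(2).items_per_thread == 16, "fp16-sized");
static_assert(TileFor(4).items_per_thread == 16, "float-sized");
```

Under this formula fp8, fp16, and float share the same 256/16 config, so a tile covers 4096 elements in each case but only 4096, 8192, and 16384 bytes respectively. That says nothing about which config is fastest for the narrower types, which is why a separate sweep would be needed.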

draganmladjenovic · 2026-04-14T10:38:41Z

@magaonka-amd https://github.com/ROCm/rocm-libraries/blob/13bf528af264e243f181b85b23acb739ebb35d61/projects/rocprim/rocprim/include/rocprim/device/device_segmented_scan.hpp#L469
I might be celebrating prematurely, but can you build the 2D row scan on top of this? That way all the gritty details stay in rocPRIM.
It does require an additional device array, so I might be celebrating prematurely.
Edit: It is an iterator, so you might get by with an iota iterator and not have to use device memory at all.
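A CPU-side model of the iterator idea, assuming the linked rocPRIM entry point accepts begin/end offset iterators as the comment suggests. Nothing below calls rocPRIM: `RowOffsetIterator`, `Buf`, and `SegmentedInclusiveSum` are hypothetical stand-ins that only show how fixed-length row offsets can be computed on the fly instead of being stored in an array.

```cpp
#include <cstddef>

// Fixed-size buffer wrapper so the model can run at compile time.
template <std::size_t N>
struct Buf {
  float v[N];
};

// Stand-in for a counting/transform iterator: the offset of row `r` is
// computed on demand, so no offsets array is needed.
// shift = 0 yields begin offsets, shift = 1 yields end offsets.
struct RowOffsetIterator {
  std::size_t row_length;
  std::size_t shift;
  constexpr std::size_t operator[](std::size_t row) const {
    return (row + shift) * row_length;
  }
};

// Reference semantics of a segmented inclusive sum: the running total
// restarts at the beginning of every segment.
template <std::size_t N, typename OffsetIt>
constexpr Buf<N> SegmentedInclusiveSum(const Buf<N>& in, std::size_t segments,
                                       OffsetIt begin, OffsetIt end) {
  Buf<N> out{};
  for (std::size_t s = 0; s < segments; ++s) {
    float running = 0.0f;
    for (std::size_t i = begin[s]; i < end[s] && i < N; ++i) {
      running += in.v[i];
      out.v[i] = running;
    }
  }
  return out;
}

// Two rows of three elements: {1, 2, 3} and {4, 5, 6}.
constexpr Buf<6> kScanned = SegmentedInclusiveSum(
    Buf<6>{{1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f}}, 2, RowOffsetIterator{3, 0},
    RowOffsetIterator{3, 1});
static_assert(kScanned.v[2] == 6.0f, "first row sums to 6");
```

On the device the same role would be played by whatever counting and transform iterators rocPRIM provides; whether that path matches the performance of the current batched kernel is an open question.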

magaonka-amd added the claude-review Request a Claude AI code review for this PR label Apr 12, 2026

claude bot reviewed Apr 12, 2026

github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 12, 2026

magaonka-amd wants to merge 1 commit into ROCm:main from magaonka-amd:fix/cubscan-remove-rocprim-detail-dep
