9 changes: 3 additions & 6 deletions xla/stream_executor/rocm/rocm_executor.cc
@@ -682,12 +682,9 @@ absl::StatusOr<std::unique_ptr<Kernel>> RocmExecutor::LoadKernel(
   const auto& cubin = spec.cuda_cubin_in_memory()->cubin_bytes;
   const char* hsaco = reinterpret_cast<const char*>(cubin.data());
   absl::MutexLock lock{in_memory_modules_mu_};
-  ModuleHandle module_handle = HsacoModuleHandle(hsaco, cubin.size());
-  hipModule_t& module = in_memory_modules_[module_handle];
-
-  if (module == nullptr) {
-    TF_ASSIGN_OR_RETURN(module, LoadHsaco(&rocm_context_, hsaco));
-  }
+  TF_ASSIGN_OR_RETURN(ModuleHandle module_handle,
+                      LoadModuleFromHsaco(hsaco, cubin.size()));
Comment on lines +685 to +686:
This is the core fix and looks correct. Previously LoadKernel populated in_memory_modules_ and kernel_to_gpu_binary_ but skipped gpu_binary_to_module_, so UnloadGpuBinary (line 627) would find no entry and return false without cleaning up. Routing through LoadModuleFromHsaco now correctly populates both gpu_binary_to_module_ (with refcount) and in_memory_modules_, making load/unload symmetric.

This also achieves full parity with the CUDA LoadKernel -> LoadModuleFromCuBin path (cuda_executor.cc:971-984).

+  hipModule_t module = gpu_binary_to_module_.at(module_handle).first;
nit: .at() will throw std::out_of_range if the key is missing. In practice this is safe because LoadModuleFromHsaco always inserts into gpu_binary_to_module_ before returning (line 783), and the CUDA equivalent (cuda_executor.cc:978) uses the identical pattern. Just noting for reviewers that this acts as a defensive assertion rather than a silent failure path.

kernel_to_gpu_binary_[rocm_kernel.get()] = module_handle;

VLOG(2) << "getting function " << kernel_name << " from module " << module;