[Common] Reduced padding kernel compilation time #2827

Open
Oleg-Goncharov wants to merge 3 commits into NVIDIA:main from Oleg-Goncharov:pr_reduced_padding_kernel_compilation

Oleg-Goncharov (Collaborator) commented Apr 2, 2026

Description

This PR reduces the compilation time of padding.cu from approximately 600 seconds to 3 seconds by removing the outer-loop unroll.

Kernel performance remains effectively unchanged across different outer-loop unroll factors, with one exception: unroll 4 measured about 21% slower than the other configurations. The input multi-tensor consists of square tensors with dimensions {1024, 2048, 4096, 8192, 16384}. Measured kernel runtime in microseconds:

  • 263.1 — unroll 8 (fully unrolled)
  • 318.7 — unroll 4
  • 263.2 — unroll 2
  • 262.2 — unroll 1
  • 261.8 — no unroll
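
For context, timings of this kind are commonly collected with CUDA events. The following is a minimal sketch, not the harness used for the numbers above: `touch_rows` and `time_touch_rows_us` are hypothetical names, the trip count and launch shape are assumptions, and the toy kernel only stands in for the padding kernel. It also illustrates why "unroll 1" and "no unroll" are separate rows: `#pragma unroll 1` asks the compiler not to unroll, while omitting the pragma leaves the decision to its heuristics.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
  do {                                                                    \
    cudaError_t err_ = (call);                                            \
    if (err_ != cudaSuccess) {                                            \
      std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
      std::exit(EXIT_FAILURE);                                            \
    }                                                                     \
  } while (0)

constexpr int kIterations = 8;  // hypothetical outer-loop trip count

// Toy stand-in kernel (hypothetical): each thread increments kIterations rows
// of one column. kUnroll is the requested outer-loop unroll factor.
template <int kUnroll>
__global__ void touch_rows(float* data, int rows, int cols) {
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (col >= cols) return;
#pragma unroll (kUnroll)
  for (int iter = 0; iter < kIterations; ++iter) {
    const int row = blockIdx.y * kIterations + iter;
    if (row < rows) data[static_cast<size_t>(row) * cols + col] += 1.0f;
  }
}

// Returns the average runtime of touch_rows<kUnroll> in microseconds.
template <int kUnroll>
float time_touch_rows_us(int rows, int cols, int warmup = 3, int repeats = 20) {
  float* data = nullptr;
  const size_t bytes = static_cast<size_t>(rows) * cols * sizeof(float);
  CUDA_CHECK(cudaMalloc(&data, bytes));
  CUDA_CHECK(cudaMemset(data, 0, bytes));

  const dim3 block(256);
  const dim3 grid((cols + 255) / 256, (rows + kIterations - 1) / kIterations);

  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));

  // Warm-up launches are excluded from the measurement.
  for (int i = 0; i < warmup; ++i) touch_rows<kUnroll><<<grid, block>>>(data, rows, cols);

  CUDA_CHECK(cudaEventRecord(start));
  for (int i = 0; i < repeats; ++i) touch_rows<kUnroll><<<grid, block>>>(data, rows, cols);
  CUDA_CHECK(cudaEventRecord(stop));
  CUDA_CHECK(cudaEventSynchronize(stop));
  CUDA_CHECK(cudaGetLastError());

  float elapsed_ms = 0.0f;
  CUDA_CHECK(cudaEventElapsedTime(&elapsed_ms, start, stop));

  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));
  CUDA_CHECK(cudaFree(data));
  return elapsed_ms * 1000.0f / repeats;  // milliseconds -> microseconds per launch
}
```

Calling, for example, `time_touch_rows_us<8>(4096, 4096)` and `time_touch_rows_us<1>(4096, 4096)` would compare a fully unrolled outer loop against one with unrolling disabled. A "no unroll" variant needs a separate kernel with the pragma removed.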

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Removed the outer #pragma unroll directive.
  • Reduced compile-time overhead in the padding kernel.
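
The shape of the change can be sketched with a simplified kernel. This is an illustration under stated assumptions, not the code in padding.cu: `pad_rows_sketch`, `pad_rows_on_gpu`, and the constants `kRowsPerBlock`, `kVec`, and `kThreads` are hypothetical, and the real kernels operate on multiple tensors with vectorized loads. The point is only that the outer loop carries no `#pragma unroll` while the inner loop keeps it.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
  do {                                                                    \
    cudaError_t err_ = (call);                                            \
    if (err_ != cudaSuccess) {                                            \
      std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
      std::exit(EXIT_FAILURE);                                            \
    }                                                                     \
  } while (0)

// Hypothetical tile parameters for illustration; not the values in padding.cu.
constexpr int kRowsPerBlock = 8;  // outer-loop trip count, known at compile time
constexpr int kVec = 4;           // elements per thread per row (inner-loop trip count)
constexpr int kThreads = 128;     // threads per block

// Copies num_rows rows from `in` to `out` and zero-fills rows up to padded_rows.
// Both matrices are row-major with `cols` columns.
__global__ void pad_rows_sketch(const float* in, float* out, int num_rows,
                                int padded_rows, int cols) {
  const int col_base = (blockIdx.x * kThreads + threadIdx.x) * kVec;
  const int row_base = blockIdx.y * kRowsPerBlock;

  // Outer loop: no `#pragma unroll`, so the compiler's own heuristics apply.
  for (int iter = 0; iter < kRowsPerBlock; ++iter) {
    const int row = row_base + iter;
    if (row >= padded_rows) break;
    // Inner loop: forced unrolling is kept.
#pragma unroll
    for (int v = 0; v < kVec; ++v) {
      const int col = col_base + v;
      if (col >= cols) continue;
      const size_t idx = static_cast<size_t>(row) * cols + col;
      out[idx] = (row < num_rows) ? in[idx] : 0.0f;  // zero-fill padded rows
    }
  }
}

// Host wrapper: host_in must hold num_rows * cols elements.
std::vector<float> pad_rows_on_gpu(const std::vector<float>& host_in, int num_rows,
                                   int padded_rows, int cols) {
  const size_t in_bytes = static_cast<size_t>(num_rows) * cols * sizeof(float);
  const size_t out_count = static_cast<size_t>(padded_rows) * cols;
  float* d_in = nullptr;
  float* d_out = nullptr;
  CUDA_CHECK(cudaMalloc(&d_in, in_bytes));
  CUDA_CHECK(cudaMalloc(&d_out, out_count * sizeof(float)));
  CUDA_CHECK(cudaMemcpy(d_in, host_in.data(), in_bytes, cudaMemcpyHostToDevice));

  const dim3 block(kThreads);
  const dim3 grid((cols + kThreads * kVec - 1) / (kThreads * kVec),
                  (padded_rows + kRowsPerBlock - 1) / kRowsPerBlock);
  pad_rows_sketch<<<grid, block>>>(d_in, d_out, num_rows, padded_rows, cols);
  CUDA_CHECK(cudaGetLastError());

  std::vector<float> host_out(out_count);
  CUDA_CHECK(cudaMemcpy(host_out.data(), d_out, out_count * sizeof(float),
                        cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaFree(d_in));
  CUDA_CHECK(cudaFree(d_out));
  return host_out;
}
```

Because the trip count is a compile-time constant, dropping the outer pragma does not prevent unrolling; it only stops forcing it, which is where the compile-time saving reported in this PR comes from.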

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Oleg-Goncharov requested a review from ptrendx on April 2, 2026 14:57

greptile-apps bot (Contributor) commented Apr 2, 2026

Greptile Summary

This PR removes the #pragma unroll directive from the outer for loop in both multi_padding_kernel and multi_unpadding_kernel, reducing padding.cu compile time from ~600 s to ~3 s with no measurable runtime regression (benchmark data provided in the description). The inner loops retain their #pragma unroll directives, and n_iterations remains a constexpr so the compiler can still apply heuristic unrolling if profitable.

Confidence Score: 5/5

Safe to merge — minimal one-line removal per kernel with benchmark data confirming no performance regression.

The change removes two #pragma unroll directives from outer loops with a statically known trip count of 8. Inner loops remain unrolled, correctness is unaffected, and the PR includes empirical runtime data showing no regression. No logic, interface, or API changes are present.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/util/padding.cu | Removed #pragma unroll from the outer iteration loop in both padding and unpadding kernels; inner loops still unrolled; no logic change. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[multi_padding_kernel / multi_unpadding_kernel] --> B[Find tensor for block]
    B --> C["outer loop: for iter in 0..n_iterations\n(n_iterations = WARP_SIZE / n_warps = 8)\nNo longer force-unrolled"]
    C --> D["#pragma unroll\nfor i2 in 0..nvec"]
    D --> E[Load input vector]
    E --> F["#pragma unroll\nfor j2 in 0..nvec — copy to output"]
    F --> G{row < num_rows?}
    G -- yes --> H[Store output vector]
    G -- no --> I{row < padded_num_rows?}
    I -- yes --> J[Write zeros — padding kernel only]
    I -- no --> K[Skip]
    H --> C
    J --> C
    K --> C

Reviews (3): Last reviewed commit: "Merge branch 'main' into pr_reduced_padd..."

ptrendx (Member) commented Apr 9, 2026

Please benchmark the kernel before and after this change.
