[Common] Reduced padding kernel compilation time by Oleg-Goncharov · Pull Request #2827 · NVIDIA/TransformerEngine

Oleg-Goncharov · 2026-04-02T14:56:51Z

Description

This PR reduces the compilation time of padding.cu from approximately 600 seconds to 3 seconds by removing the outer-loop unroll.

Kernel performance remains effectively unchanged across the outer-loop unroll factors tested, with the exception of unroll 4, which measured slower. The input multi-tensor consists of square tensors with dimensions {1024, 2048, 4096, 8192, 16384}. Measured kernel runtime in microseconds:

263.1 — unroll 8 (fully unrolled)
318.7 — unroll 4
263.2 — unroll 2
262.2 — unroll 1
261.8 — no unroll

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Removed the outer #pragma unroll directive.
Reduced compile-time overhead in the padding kernel.
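The shape of the change can be illustrated with a minimal host-side sketch of a two-level loop nest in which only the inner loop carries an unroll hint. This is not the code from padding.cu: the function `sum_tile`, the macro `SKETCH_UNROLL`, and the constants `kIterations` and `kVec` are hypothetical names chosen for illustration.

```cpp
// Expands to `#pragma unroll` where the compiler understands it (nvcc, clang)
// and to nothing elsewhere, so the sketch also builds as plain host C++.
#if defined(__CUDACC__) || defined(__clang__)
#define SKETCH_UNROLL _Pragma("unroll")
#else
#define SKETCH_UNROLL
#endif

// Hypothetical sizes for illustration; not taken from padding.cu.
constexpr int kIterations = 8;  // outer trip count, known at compile time
constexpr int kVec = 4;         // inner trip count, known at compile time

// Sums a kIterations x kVec tile stored contiguously in `tile`.
int sum_tile(const int* tile) {
  int total = 0;
  // No unroll hint on the outer loop. Because kIterations is a compile-time
  // constant, the compiler remains free to unroll it if its heuristics agree.
  for (int iter = 0; iter < kIterations; ++iter) {
    // The inner loop keeps its explicit unroll hint.
    SKETCH_UNROLL
    for (int i = 0; i < kVec; ++i) {
      total += tile[iter * kVec + i];
    }
  }
  return total;
}
```

Forcing the outer loop to unroll as well would replicate the already unrolled inner body once per outer iteration, which is one plausible reason the compiler has far more code to process in the fully unrolled variant.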

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

greptile-apps · 2026-04-02T14:58:17Z

Greptile Summary

This PR removes the #pragma unroll directive from the outer for loop in both multi_padding_kernel and multi_unpadding_kernel, reducing padding.cu compile time from ~600 s to ~3 s with no measurable runtime regression (benchmark data provided in the description). The inner loops retain their #pragma unroll directives, and n_iterations remains a constexpr so the compiler can still apply heuristic unrolling if profitable.

Confidence Score: 5/5

Safe to merge — minimal one-line removal per kernel with benchmark data confirming no performance regression.

The change removes two #pragma unroll directives from outer loops with a statically known trip count of 8. Inner loops remain unrolled, correctness is unaffected, and the PR includes empirical runtime data showing no regression. No logic, interface, or API changes are present.

No files require special attention.

Important Files Changed

Filename	Overview
transformer_engine/common/util/padding.cu	Removed `#pragma unroll` from the outer iteration loop in both padding and unpadding kernels; inner loops still unrolled; no logic change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[multi_padding_kernel / multi_unpadding_kernel] --> B[Find tensor for block]
    B --> C["outer loop: for iter in 0..n_iterations\n(n_iterations = WARP_SIZE / n_warps = 8)\nNo longer force-unrolled"]
    C --> D["#pragma unroll\nfor i2 in 0..nvec"]
    D --> E[Load input vector]
    E --> F["#pragma unroll\nfor j2 in 0..nvec — copy to output"]
    F --> G{row < num_rows?}
    G -- yes --> H[Store output vector]
    G -- no --> I{row < padded_num_rows?}
    I -- yes --> J[Write zeros — padding kernel only]
    I -- no --> K[Skip]
    H --> C
    J --> C
    K --> C
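
The row-level branches in the flowchart above can be mirrored by a simple host-side reference. This is a sketch under assumptions, not the kernel itself: it handles a single row-major tensor rather than a multi-tensor, ignores vectorized loads, and the names `pad_rows_reference` and `row_length` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Copies a row-major `num_rows x row_length` matrix into a
// `padded_num_rows x row_length` output and zero-fills the extra rows.
// Assumes padded_num_rows >= num_rows.
std::vector<float> pad_rows_reference(const std::vector<float>& input,
                                      std::size_t num_rows,
                                      std::size_t padded_num_rows,
                                      std::size_t row_length) {
  std::vector<float> output(padded_num_rows * row_length);
  for (std::size_t row = 0; row < padded_num_rows; ++row) {
    for (std::size_t col = 0; col < row_length; ++col) {
      const std::size_t idx = row * row_length + col;
      if (row < num_rows) {
        output[idx] = input[idx];  // "Store output vector" branch
      } else {
        output[idx] = 0.0f;        // "Write zeros" branch (padding only)
      }
    }
  }
  // Rows at or beyond padded_num_rows are never visited, which corresponds
  // to the "Skip" branch in the flowchart.
  return output;
}
```

A reference like this could be used to check a kernel's output element by element, for example by padding a 3 x 2 matrix to 4 rows and confirming that the final row is all zeros.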

Reviews (3): Last reviewed commit: "Merge branch 'main' into pr_reduced_padd..."

ptrendx · 2026-04-09T16:33:02Z

Please benchmark the kernel before and after this change.
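
One way to structure such a before/after comparison is a warm-up phase followed by repeated timed runs, reporting the median. The sketch below is a generic host-side harness and an assumption on my part: the PR does not state how its runtimes were collected, and `median_runtime_us` is a hypothetical helper. For a GPU kernel, the callable would need to launch the kernel and wait for it to finish (for example with `cudaDeviceSynchronize`), or the timing would be done with CUDA events instead.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Calls `fn` `warmup` times untimed, then `repeats` times timed, and returns
// the median wall-clock duration in microseconds (0.0 if repeats <= 0).
template <typename Fn>
double median_runtime_us(Fn&& fn, int warmup, int repeats) {
  if (repeats <= 0) return 0.0;
  for (int i = 0; i < warmup; ++i) fn();

  std::vector<double> samples;
  samples.reserve(static_cast<std::size_t>(repeats));
  for (int i = 0; i < repeats; ++i) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    samples.push_back(
        std::chrono::duration<double, std::micro>(stop - start).count());
  }
  std::sort(samples.begin(), samples.end());
  return samples[samples.size() / 2];
}
```

The median is used here rather than the mean so that an occasional slow run does not dominate the reported figure.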

Reduced padding kernel compilation time

0136e94

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Oleg-Goncharov requested a review from ptrendx April 2, 2026 14:57

Oleg-Goncharov and others added 2 commits April 10, 2026 13:04

Completely removed unroll for better performance

f649114

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Merge branch 'main' into pr_reduced_padding_kernel_compilation

e5bbc06

Oleg-Goncharov wants to merge 3 commits into NVIDIA:main from Oleg-Goncharov:pr_reduced_padding_kernel_compilation

2 participants
