
merge main into amd-main#2153

Open
z1-cciauto wants to merge 156 commits into amd-main from upstream_merge_202604120006

Conversation

@z1-cciauto
Collaborator

No description provided.

aengelke and others added 30 commits April 10, 2026 14:29
Apparently required by some older libstdc++ versions.
…dify-Write Sequence, Fix llvm#189183 (llvm#190350)

This patch improves the SystemZ cost model to identify Read-Modify-Write
sequences that can be folded into a single instruction (e.g., ASI, NI,
OI). If a load, a scalar arithmetic operation (ADD, SUB, AND, OR, XOR)
with an immediate, and a store all target the same memory location and
have no external uses, the cost of the arithmetic and store instructions
should be 0. This implementation does not include the
TTI::TCK_RecipThroughput CostKind, as it causes a regression in
non-power-2-subvector-extract.ll.

Fixes llvm#189183. (Refer to it for an example.)

---------

Co-authored-by: anoopkg6 <anoopkg6@github.com>
Summary:
Naked functions are intended to allow the user to write the entirety of
the function block, so we shouldn't include the `waitcnt` instructions
for them.
…#191208)

This moves the test of whether the iteration variable of an affected DO
loop is marked as threadprivate. This makes the `ordCollapseLevel`
member unnecessary.

Issue: llvm#191249
Added the generate-libc-headers custom target depending on libc-headers.

This allows troubleshooting headers without needing to install them
first.
…vm#191375)

While in this area I also removed unnecessary annotations for wchar_size
and also cleaned up some more function attributes.
…1408)

Failure to read all required fields for msgbuf isn't ObjectFile's fault
but FreeBSD-Kernel-Core plugin specific. Thus this should be logged
through `LLDBLog::Process` rather than `LLDBLog::Object`.

Signed-off-by: Minsoo Choo <minsoochoo0122@proton.me>
…lvm#186981)

This PR follows suit of the Extensions.md document and provides the same
file for OpenMP API extensions. These have previously been stored in
OpenMPSupport.md. Having a more centralized view and place for these
extensions seems useful.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
llvm#191289)

Also, update the conformance script to look for closed issues when
searching for unlinked issues.
…ne table coverage in isolation (llvm#183790)

Patch 2 of 3 to add to llvm-dwarfdump the ability to measure DWARF
coverage of local variables in terms of source lines, as discussed in
[this
RFC](https://discourse.llvm.org/t/rfc-debug-info-coverage-tool-v2/83266).

This patch adds the ability to compare a variable’s coverage against a
baseline, e.g. an unoptimised compilation of the same code. This is
provided using the optional `--coverage-baseline` argument.

When a baseline is provided, the output also includes a per-variable
measure of the line table’s coverage (`LT`, `LTRatio`), distinct from
the variable’s coverage proper. See section 2.2 of the RFC for details
on this metric.
Reworked libc/docs/gpu/building.rst to match the style of
getting_started.rst:

* Removed mkdir and cd commands.
* Used -S and -B flags for CMake.
* Used -C flag for Ninja.
* Split commands into smaller blocks with brief explanations.

Use the same terminology as elsewhere in the LLVM libc docs and move
away from the deprecated runtime terms.

* Standard runtimes build -> Bootstrap Build
* Runtimes cross build -> Two-stage Cross-compiler Build
In llvm#178306, I made an incorrect assumption that traversing `allproc` in
reverse would yield increasing pid order, based on the fact that new
processes are added at the head of `allproc`. However, this assumption is
false under certain circumstances, such as pid reuse, so threads were not
sorted correctly. Instead of relying on any assumption, explicitly sort
threads by the pid retrieved from memory.
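
A minimal sketch of the explicit sort described above. `ThreadRecord` and `sortThreadsByPid` are hypothetical names for illustration and are not taken from the plugin; the tie-break on `tid` is also an assumption, added only to make the order deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a thread record read from core memory.
struct ThreadRecord {
  int32_t pid;
  int32_t tid;
};

// Order by pid explicitly instead of trusting the traversal order of the
// process list. Ties are broken by tid (an assumption of this sketch).
inline void sortThreadsByPid(std::vector<ThreadRecord> &threads) {
  std::sort(threads.begin(), threads.end(),
            [](const ThreadRecord &a, const ThreadRecord &b) {
              if (a.pid != b.pid)
                return a.pid < b.pid;
              return a.tid < b.tid;
            });
}
```

With this shape, a reused pid that appears out of list order still ends up in the right position, because nothing depends on where the record was found.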

Fixes: 5349c66 (llvm#178306)

---------

Signed-off-by: Minsoo Choo <minsoochoo0122@proton.me>
llvm#191231)

…ties

Some of the utilities may be used in symbol resolution which is before
the expression analysis is done. In such situations, the typedExpr's
normally stored in parser::Expr may not be available. To be able to
obtain the numeric values of expressions, using the analyzer directly
may be necessary, which requires SemanticsContext to be provided.
…m#191098)

The motivation of this PR is to refactor and expose DSO helper functions
so they can be used by all compiler-rt libraries, including the profile
library, without duplicating dlopen/dlsym (non-Windows) or
LoadLibrary/GetProcAddress (Windows) logic in each runtime.

Implement the helpers in namespace __interception in
interception_linux.cpp for non-Windows targets and interception_win.cpp
for Windows, and use them from the existing Linux interception path for
RTLD_NEXT/RTLD_DEFAULT/dlvsym lookups.

This is NFC for existing libraries that already use interception's
public APIs; sanitizer and interception lit behavior is unchanged.
In some cases the use of *-DAG seemed to confuse the update scripts
because of the clash with FileCheck's built-in -DAG suffix.
Specialize linalg.generic to linalg.mmt4d based on index map
…erage (llvm#187368)

We don't need to run the full exhaustive test for all floating points,
as long as we're testing the radix sort code path (which we are, since
radix sort triggers at 1024 elements).

This reduces the test execution time on my machine from 20s to 12s.

Fixes llvm#187329
Fix iterator misuse in four BOLT passes, caught by _GLIBCXX_DEBUG
(enabled via LLVM_ENABLE_EXPENSIVE_CHECKS=ON).

* AllocCombiner: combineAdjustments() erases instructions while
iterating in reverse via llvm::reverse(BB), invalidating the reverse
iterator. Defer erasures to after the loop using a SmallVector.
* ShrinkWrapping: processDeletions() uses
std::prev(BB.eraseInstruction(II)) which is undefined when II ==
begin(). Restructure to standard forward iteration with erase.
* DataflowAnalysis: run() unconditionally dereferences BB->rbegin(),
which crashes on empty basic blocks (possible after the ShrinkWrapping
fix). Guard with an emptiness check.
* IndirectCallPromotion: rewriteCall() dereferences the end iterator via
&(*IndCallBlock.end()). Replace with &IndCallBlock.back().
* TailDuplication: constantAndCopyPropagate() uses
std::prev(OriginalBB.eraseInstruction(Itr)) which is undefined when Itr
== begin(). Restructure to standard forward iteration with erase.
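
The two repair patterns named in the list above can be sketched on a plain `std::list<int>`. This is an illustration only: the function names are hypothetical and the container stands in for a basic block, not for BOLT's actual data structures.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Standard forward iteration with erase: erase() returns the iterator to
// the next element, so std::prev() on a possibly-begin() iterator is never
// needed.
inline void eraseNegativesForward(std::list<int> &insts) {
  for (auto it = insts.begin(); it != insts.end();) {
    if (*it < 0)
      it = insts.erase(it);
    else
      ++it;
  }
}

// Deferred erasure: collect the victims during a reverse walk and erase
// them only after the loop, so the reverse iterator is never invalidated.
inline void eraseNegativesDeferred(std::list<int> &insts) {
  std::vector<std::list<int>::iterator> toErase;
  for (auto rit = insts.rbegin(); rit != insts.rend(); ++rit)
    if (*rit < 0)
      toErase.push_back(std::prev(rit.base())); // element rit refers to
  for (auto it : toErase)
    insts.erase(it);
}
```

Both functions are safe on an empty container, which is the case the DataflowAnalysis guard above has to handle.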
…8271)

Example:

    int foo(int a, int b) { return a - 1 + ~b; }

Before, on AArch64:

    mvn w8, w1
    add w8, w0, w8
    sub w0, w8, #1

After (matches gcc):

    sub w0, w0, w1
    sub w0, w0, #2

Proof: https://alive2.llvm.org/ce/z/g_bV01
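
The fold rests on the two's-complement identity `~b == -b - 1`, so `a - 1 + ~b == a - b - 2`. A small sketch that checks this with unsigned arithmetic (chosen here so that wraparound is well defined); `before` and `after` are illustrative names, not functions from the patch:

```cpp
#include <cassert>
#include <cstdint>

// The original expression, written on unsigned values.
constexpr uint32_t before(uint32_t a, uint32_t b) { return a - 1u + ~b; }

// The folded form: two subtractions, no bitwise NOT.
constexpr uint32_t after(uint32_t a, uint32_t b) { return a - b - 2u; }

static_assert(before(10u, 3u) == after(10u, 3u), "identity holds");
static_assert(before(0u, 0xFFFFFFFFu) == after(0u, 0xFFFFFFFFu),
              "identity holds at the wraparound edge");
```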
…#191413)

Squelch the stage-2 compile time regression introduced by the variadic
m_Combine(And|Or) matchers, by replacing the std::apply on a std::tuple
with a recursive inheritance.
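
For readers unfamiliar with the technique, here is a generic sketch of a variadic conjunction built by recursive inheritance rather than by `std::apply` over a `std::tuple`. It is not the actual matcher code; `AllOf` and `allOf` are hypothetical names, and plain callables stand in for pattern matchers.

```cpp
#include <cassert>

// Base case: an empty conjunction matches everything.
template <typename... Preds> struct AllOf {
  template <typename T> bool match(const T &) const { return true; }
};

// Recursive case: store the first predicate and inherit from the
// conjunction of the remaining ones.
template <typename First, typename... Rest>
struct AllOf<First, Rest...> : AllOf<Rest...> {
  First first;

  AllOf(First f, Rest... rest) : AllOf<Rest...>(rest...), first(f) {}

  template <typename T> bool match(const T &v) const {
    // Short-circuits: the base is only consulted if `first` accepts v.
    return first(v) && AllOf<Rest...>::match(v);
  }
};

// Convenience factory that deduces the predicate types.
template <typename... Preds> AllOf<Preds...> allOf(Preds... ps) {
  return AllOf<Preds...>(ps...);
}
```

Each level instantiates one small class instead of tuple machinery plus an `std::apply` call, which is the kind of template work that tends to show up in compile time.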
…ORTED for zOS (llvm#190835)

Tests in `llvm/test/Examples` and `llvm/test/ExecutionEngine` use JIT,
which is unsupported for zOS, causing the tests to fail.

---------

Co-authored-by: Bahareh Farhadi <bahareh.farhadi@ibm.com>
The default inliner policy changed slightly, which was expected after PR
llvm#190168.
Coro hasn't yet been fixed up for profcheck, so new tests are likely to
fail.

mtune.ll exercises loop vectorizer (not fixed)
When a user calls `omp_control_tool` while a tool is attached that has
registered the `ompt_control_tool` callback, the tool should receive a
callback with the user's arguments.

However, in llvm#112924, it was discovered that this only happens after at
least one host side directive or runtime call calling into
`__kmp_do_middle_initialize` has been executed.

The check for `__kmp_init_middle` in `FTN_CONTROL_TOOL` did not try to
do the middle initialization and instead always returned `-2` (no tool).
A tool therefore received no callback. The user program did not get the
info that there is a tool attached. To fix this, change the explicit
return to a call of `__kmp_middle_initialize()`, as done in several
other places of `libomp`.

Further handling is then done in `__kmp_control_tool`, where the values
`-2` (no tool), `-1` (no callback), or the tool's return value are
returned.

Also expand the tests to introduce checks where no callback is
registered, or `omp_control_tool` is called before any OpenMP directive.

Fixes llvm#112924

CC @jprotze, @hansangbae

Signed-off-by: Jan André Reuter <j.reuter@fz-juelich.de>
…(NFC) (llvm#191430)

CompilationGraph owns all nodes and edges via `unique_ptr`, but exposes
pointers to the underlying objects. Make them non-movable to maintain
stable addresses.
Make them non-copyable since we don't want to copy `Command` objects
they hold or create duplicate root nodes.

Apply full rule-of-five to `CompilationGraph`.
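
A minimal sketch of the ownership shape described above, with hypothetical `Node` and `Graph` types. Which of the five special members the real `CompilationGraph` deletes or defaults is not stated here; in this sketch the graph is assumed to be non-copyable but movable, which keeps node addresses stable because only the owning pointers move.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Nodes are handed out by raw pointer, so they must never change address:
// neither copyable nor movable.
struct Node {
  explicit Node(int id) : id(id) {}
  Node(const Node &) = delete;
  Node &operator=(const Node &) = delete;
  Node(Node &&) = delete;
  Node &operator=(Node &&) = delete;
  int id;
};

// All five special members are spelled out (rule of five).
class Graph {
public:
  Graph() = default;
  ~Graph() = default;
  Graph(const Graph &) = delete;            // would duplicate owned nodes
  Graph &operator=(const Graph &) = delete;
  Graph(Graph &&) = default;                // assumption: moves are allowed
  Graph &operator=(Graph &&) = default;

  Node *addNode(int id) {
    nodes.push_back(std::make_unique<Node>(id));
    return nodes.back().get();
  }

private:
  std::vector<std::unique_ptr<Node>> nodes;
};

static_assert(!std::is_copy_constructible<Node>::value, "node is pinned");
static_assert(!std::is_move_constructible<Node>::value, "node is pinned");
static_assert(!std::is_copy_constructible<Graph>::value, "no graph copies");
```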
…m IntegerExpandSetCCOperands. NFC (llvm#191353)

LHSLo and RHSLo must have the same type, we don't need to check both.
Same for LHSHi and RHSHi.
While running in server mode, multiple clients can be connected at the
same time. In LLDBUtils we had a static mutex that can cause other
clients to hang due to the single static lock.

Instead, I adjusted the logic to take the existing SBMutex as a parameter
and hold that mutex during command handling.
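
The difference between the two locking shapes can be sketched with `std::mutex` standing in for `SBMutex`. `runCommand` is a hypothetical helper, not the LLDBUtils function, and the returned string is a placeholder for real command handling.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Problematic shape (shown as a comment only): a single function-local
// static lock is shared by every connected client.
//   static std::mutex g_mutex;

// Sketch of the alternative: the caller passes in the mutex that belongs
// to its own session, so unrelated clients never contend on it.
inline std::string runCommand(std::mutex &session_mutex,
                              const std::string &cmd) {
  std::lock_guard<std::mutex> guard(session_mutex);
  return "ran: " + cmd; // placeholder for the actual command handling
}
```

With per-session mutexes, a client that holds its own lock for a long time no longer blocks commands issued on behalf of another client.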
alexey-bataev and others added 26 commits April 11, 2026 08:18
…calar

The LLVM cost model uses integer-valued throughput costs which cannot
represent fractional costs. For 2-element vectors, this rounding can
make vectorization appear profitable when it actually produces more
instructions than the scalar code — the overhead from shuffles, inserts,
extracts, and buildvectors is underestimated.
Add an instruction-count safety check in getTreeCost that estimates
the number of vector instructions (including gathers, shuffles, and
extracts) and compares against the number of scalar instructions.
If the vector code would produce more instructions, reject the tree
regardless of what the cost model says. This catches cases where
fractional cost rounding hides real overhead.

The check is gated behind -slp-inst-count-check (default: on) and
only applies to 2-element root trees where rounding errors matter most.
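
As a rough illustration of the check described above (not the code in `getTreeCost`), the decision can be modelled as a comparison of two estimated counts. `TreeEstimate` and `rejectByInstCount` are hypothetical names, and how the real estimate is computed is not shown here.

```cpp
#include <cassert>

// Hypothetical summary of one SLP tree; the real pass derives these
// numbers from the tree entries.
struct TreeEstimate {
  unsigned rootWidth;    // number of elements in the root vector
  unsigned vectorInsts;  // vector arithmetic/memory instructions
  unsigned gatherInsts;  // buildvector/insert overhead
  unsigned shuffleInsts; // shuffles
  unsigned extractInsts; // extracts for external users
  unsigned scalarInsts;  // instructions in the scalar code being replaced
};

// Returns true when the tree should be rejected regardless of the cost
// model. `checkEnabled` models -slp-inst-count-check (default: on).
inline bool rejectByInstCount(const TreeEstimate &t,
                              bool checkEnabled = true) {
  if (!checkEnabled || t.rootWidth != 2)
    return false; // only 2-element root trees are checked
  unsigned vectorTotal =
      t.vectorInsts + t.gatherInsts + t.shuffleInsts + t.extractInsts;
  return vectorTotal > t.scalarInsts;
}
```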

Reviewers: hiraditya, bababuck, RKSimon

Pull Request: llvm#190414
When SLPReVec is enabled, getValueType returns the vector result type
for InsertElement instructions rather than the scalar element type. This
caused getEntryCost to propagate an incorrect ScalarTy (e.g. <4 x float>
instead of float) into getScalarizationOverhead and getWidenedType,
triggering an assertion failure and producing wrong cost estimates.
Narrow ScalarTy to its element type when costing vectorized
InsertElement entries whose inserted operands are scalars.
Fixes llvm#191175.

Reviewers: 

Pull Request: llvm#191628
Fixes:
```
warning: format specifies type 'long' but the argument has type 'intptr_t' ...
```
…91299)

After llvm#189372 both minimum
iteration checks for epilogue vectorization are created in VPlan, which
removes the last blocker for unconditionally running
materializeConstantVectorTripCount. This enables additional folds for
plans in the native path, as well as removes some trip count
computations with epilogue vectorization.

PR: llvm#191299
…#191498)

NSSW/NUSW on a wider AddRec does not imply NSSW/NUSW on a narrower
AddRec.
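
A concrete counterexample makes the point: the recurrence `{0,+,1}` stepped 300 times never wraps at 16 bits, but the same recurrence at 8 bits wraps after 255 steps. The helper below is a brute-force illustration for the unsigned case with a hypothetical name; it is unrelated to how SCEV reasons about flags.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if adding `step` to `start` for `iters` iterations ever
// wraps around as an unsigned value of the given bit width.
// Preconditions of this sketch: 1 <= bits <= 64 and step fits in `bits`.
inline bool wrapsUnsigned(uint64_t start, uint64_t step, unsigned iters,
                          unsigned bits) {
  const uint64_t max = (bits == 64) ? ~0ull : ((1ull << bits) - 1);
  uint64_t v = start & max;
  for (unsigned i = 0; i < iters; ++i) {
    if (v > max - step)
      return true; // v + step would exceed the largest representable value
    v += step;
  }
  return false;
}
```

So a no-wrap fact proven for the wide recurrence cannot simply be copied onto its truncation.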

Fixes llvm#191382.
The output currently contains
```
            "unicode32"
            'u' or "unsigned decimal"
            'p' or
            "pointer"
            "char[]"
            "int8_t[]"
```
The 'p' and "pointer" are supposed to appear on the same line. When
we're about to print "pointer," we check whether it would exceed the
column limit (in which case, we insert a line feed). This check only
checks for spaces as separators, but in this case, "words" may be
separated by newlines as well. Look for them too.
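
A sketch of the idea, not the LLDB implementation: when deciding whether the next word fits, the current column has to be measured from the last line break, not only from the last space. `currentColumn` and `appendWord` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Width of the text after the last newline (the whole string if none).
inline std::size_t currentColumn(const std::string &out) {
  std::size_t nl = out.rfind('\n');
  return nl == std::string::npos ? out.size() : out.size() - nl - 1;
}

// Append `word`, inserting a line feed only when the word would push the
// current line past `limit` columns.
inline void appendWord(std::string &out, const std::string &word,
                       std::size_t limit) {
  std::size_t col = currentColumn(out);
  if (col != 0) {
    if (col + 1 + word.size() > limit)
      out += '\n';
    else
      out += ' ';
  }
  out += word;
}
```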
…n (NFC) (llvm#189489)

This NFC prepares the scheduler's rematerialization stage for
integration with the target-independent rematerializer. It brings
various small design changes and optimizations to the stage's internal
state to make the not-exactly-NFC rematerializer integration as small as
possible.

The main changes are, in no particular order:

- Sort and pick useful rematerialization candidates by their index in
the vector of candidates instead of directly sorting objects within the
candidate vector. This reduces the amount of data movement and
simplifies the candidate selection logic.
- Move some data members from `PreRARematStage::RematReg` to
`PreRARematStage::ScoredRemat`. This makes the former a simplified
version of the rematerializer's own internal register representation
(`Rematerializer::Reg`), which can be cleanly deleted during
integration.
- Remove an inferable argument to `modifyRegionSchedule`. This allows
the stage to stop tracking the parent block of each region.
- Use a boolean (`RevertAllRegions`) to track the scheduling revert
decision post rematerialization instead of clearing `RescheduleRegions`.
This avoids re-computing the latter during rollback.
- Estimate usefulness of rematerialization from `GCNRegPressure` instead
of from `Register` (requires adding a new method variant in
`GCNRPTarget`).
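
The first item in the list above, sorting candidates by index rather than moving the objects themselves, can be sketched as follows. `Candidate` and `orderByScore` are hypothetical names; the real stage uses its own scoring and candidate types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical candidate; imagine a larger payload next to the score.
struct Candidate {
  int score;
};

// Returns candidate indices ordered from highest to lowest score. The
// candidate vector itself is left untouched, so only small integers move.
inline std::vector<std::size_t>
orderByScore(const std::vector<Candidate> &cands) {
  std::vector<std::size_t> order(cands.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return cands[a].score > cands[b].score;
                   });
  return order;
}
```

Because the candidates stay in place, any index or pointer into the original vector remains valid while the selection logic walks `order`.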
We had a report of some assertion failures in llvm#190054 (comment), and
some msan failures in llvm#190056.

These appear to be due to default-constructed StringRefs being used in
some cases. To address this, we can provide default initializers that
should prevent such cases from causing further problems.
…leSpec when checking LoadScriptFromSymFile setting (llvm#191473)

We were incorrectly passing the script's `FileSpec` into
`GetScriptLoadStyleForModule`. As a result, if a script's name wasn't
actually the same as the module name, the
`target.auto-load-scripts-for-modules` setting didn't take effect.

This patch passes the module's `FileSpec` instead. For `dSYM`s we save
the original `FileSpec` because the loop tries to strip extensions until
it finds a script. But we still want to use the module's name.

**AI Usage**:
- Used Claude to write the unit-test skeletons. Then reviewed/adjusted
them manually
Ensure all StringRef members are default initialized to avoid potential
bugs.
…eter packs (llvm#191484)

I believe that is the intent of SubstIndex in AssociatedConstraint.
So this enforces the checking explicitly, in case nested SubstIndexes
confuse our poor constraint evaluator.

I reverted the previous fix 257cc5a because that was wrong.
As a drive-by fix, this also removes a strange assertion and an
unnecessary SubstIndex setup in nested requirement transform.

No release note because this is a regression fix.

Fixes llvm#188505
Fixes llvm#190169
AsmPrinter needs to hold state between doInitialization,
runOnMachineFunction, and doFinalization, which are all separate passes
in the NewPM. Storing this state externally somewhere like
MachineModuleInfo or a new analysis is possible, but a bit messy given
some state, particularly EHHandler objects, has backreferences into the
AsmPrinter and assumes there is a single AsmPrinter throughout the
entire compilation. So instead, store AsmPrinter in an analysis that
stays constant throughout compilation which solves all these problems.
This also means we can also just let AsmPrinter continue to own the
MCStreamer, which means object file emission should work after this as
well.

This does require passing the ModuleAnalysisManager into
buildCodeGenPipeline to register the AsmPrinterAnalysis, but that seems
pretty reasonable to do.

Reviewers: paperchalice, RKSimon, arsenm

Pull Request: llvm#191535
…m#186766)

## Description

When `AMDGPUTargetLowering::performStoreCombine` inserts a synthetic
bitcast to convert vector types (e.g. `<1 x float>` → `i32`) for stores,
the bitcast inherits the **store's** SDLoc. When
`DAGCombiner::visitBITCAST` later folds `bitcast(load)` → `load`, the
resulting load loses its original debug location.

## Analysis

The bitcast is **not** present in the initial SelectionDAG — it is
inserted during DAGCombine by
`AMDGPUTargetLowering::performStoreCombine`. This can be observed with
`-debug-only=isel,dagcombine`:

```
Initial selection DAG: no bitcast, load is v1f32 directly used by store

Combining: t17: ch = store ... /tmp/beans.c:6:14
 ... into: t20: ch = store ... /tmp/beans.c:6:14

Combining: t19: i32 = bitcast [ORD=3] # D:1 t13, /tmp/beans.c:6:14
 ... into: t21: i32,ch = load ... /tmp/beans.c:6:14
```

In `performStoreCombine` (`AMDGPUISelLowering.cpp`):

```cpp
SDLoc SL(N);  // N = store node → SL has store's DebugLoc
...
SDValue CastVal = DAG.getNode(ISD::BITCAST, SL, NewVT, Val);
// bitcast gets store's DebugLoc, not load's
```

When `visitBITCAST` folds `bitcast(load)` → `load`, it uses `SDLoc(N)`
(the bitcast's loc = store's loc), so the resulting load loses its
original debug location.

```
Before (initial DAG):
  t13: v1f32 = load ...           line 2   ; original load
  t14: ch    = store t13, ...     line 3   ; store

After performStoreCombine:
  t13: v1f32 = load ...           line 2   ; original load
  t19: i32   = bitcast t13        line 3   ; synthetic bitcast (store's loc!)
  t20: ch    = store t19, ...     line 3

After visitBITCAST folds (incorrect):
  t21: i32 = load ...             line 0   ; lost debug location

After visitBITCAST folds (expected):
  t21: i32 = load ...             line 2   ; preserves load's location
```

## Fix

Target-specific fix in `AMDGPUISelLowering.cpp` `performStoreCombine`:
use `DAG.getBitcast()` instead of `DAG.getNode(ISD::BITCAST, SL, ...)`.
`getBitcast()` internally uses `SDLoc(V)` (the value operand's SDLoc),
so the synthetic bitcast naturally inherits the load's DebugLoc instead
of the store's:

```cpp
// Before:
SDValue CastVal = DAG.getNode(ISD::BITCAST, SL, NewVT, Val);
if (OtherUses) {
    SDValue CastBack = DAG.getNode(ISD::BITCAST, SL, VT, CastVal);

// After:
SDValue CastVal = DAG.getBitcast(NewVT, Val);
if (OtherUses) {
    SDValue CastBack = DAG.getBitcast(VT, CastVal);
```

This is consistent with `performLoadCombine` where the bitcast also uses
the load's `SDLoc`.
When building the LLVM installer on Windows, fix CRT / dllimport
mismatch and unused locals / tautological comparisons in env handling.
We prefer statically linking all library dependencies.
Fixes a few warnings found while building the LLVM installer with
`llvm/utils/release/build_llvm_release.bat --x64 --version 23.0.0
--skip-checkout --local-python`.
…romoting. (llvm#191568)

The conversion needs to be done by promoting to f32. If we're already at
LMUL=8, we need to split before we can promote.
@z1-cciauto z1-cciauto requested a review from a team April 12, 2026 04:06
