Add m4sme_p and m4sme_e sub-configurations #921

Open
Luislo1 wants to merge 8 commits into flame:master from Luislo1:master

Luislo1 commented Mar 16, 2026

This PR adds two new arm64 sub-configurations to BLIS, m4sme_p and m4sme_e. Both target the ARM instruction set architecture and are optimized for the Apple Silicon M4 processor, using the Scalable Matrix Extension 2 (SME2). The m4sme_p sub-configuration is tuned for execution on the performance cores and m4sme_e for the efficiency cores. The armsme kernels folder contains implementations of packing (1m) and of the level-3 gemm kernel, written with intrinsics.

The m4sme_p sub-configuration is selected automatically by the hardware detection heuristic when the machine supports SME2. The m4sme_e sub-configuration must be selected manually. Both sub-configurations are temporarily blacklisted for all compilers except Clang 17 or later on Darwin.
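As a rough illustration of the selection rule described above, here is a minimal sketch in C. It is not the heuristic from this PR: the enum, the function names, and the way a manual choice is modeled are all hypothetical, and the sysctl key `hw.optional.arm.FEAT_SME2` is an assumption that should be checked against Apple's documentation. On non-Apple hosts the probe simply reports that SME2 is absent.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical identifiers; BLIS uses its own arch IDs. */
typedef enum {
    SUBCONFIG_NONE,     /* no SME sub-configuration chosen */
    SUBCONFIG_M4SME_P,  /* performance cores */
    SUBCONFIG_M4SME_E   /* efficiency cores  */
} subconfig_t;

/* Probe for SME2. Assumption: Darwin exposes the feature under the
   sysctl key below. Returns 1 if present, 0 otherwise. */
static int host_has_sme2(void)
{
#ifdef __APPLE__
    int value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname("hw.optional.arm.FEAT_SME2", &value, &size, NULL, 0) != 0)
        return 0; /* key missing: treat as unsupported */
    return value != 0;
#else
    return 0;
#endif
}

/* A manual choice always wins; otherwise SME2 support selects m4sme_p.
   m4sme_e is never picked automatically. */
static subconfig_t pick_subconfig(int has_sme2, subconfig_t manual)
{
    assert(has_sme2 == 0 || has_sme2 == 1);
    if (manual != SUBCONFIG_NONE)
        return manual;
    return has_sme2 ? SUBCONFIG_M4SME_P : SUBCONFIG_NONE;
}
```

Calling `pick_subconfig(host_has_sme2(), SUBCONFIG_NONE)` models automatic detection, while passing `SUBCONFIG_M4SME_E` as the second argument models the manual selection that the efficiency-core sub-configuration requires.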

Regarding implementation details, the kernel dimensions are SVL (Streaming Vector Length) agnostic. We have included three gemm kernels for single precision and four for double precision, based on the arrangements that the ZA storage allows. Please note that although the kernels themselves are SVL agnostic, the bli_cntx_init file is configured specifically for the M4: Apple silicon currently supports only an SVL of 512 bits (64 bytes), which gives an SME 2D array (ZA) of 4 KiB.
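One plausible reading of the kernel counts above is that each kernel uses every ZA tile of its element type, arranged as a rectangle whose sides are power-of-two multiples of SVL. The sketch below enumerates those arrangements. The type and function names are hypothetical, and the mapping from arrangements to the kernels in this PR is an assumption. It relies only on the SME property that ZA provides four 32-bit tiles and eight 64-bit tiles, independent of SVL.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel footprint expressed in multiples of SVL (hypothetical type). */
typedef struct {
    unsigned rows_svl;
    unsigned cols_svl;
} tile_shape;

/* SME exposes as many ZA tiles as the element size in bytes:
   4 tiles for 32-bit elements, 8 tiles for 64-bit elements. */
static unsigned za_tile_count(unsigned elem_bytes)
{
    return elem_bytes;
}

/* Write up to cap shapes that use every tile into out and return the
   total number of shapes found. */
static size_t enumerate_shapes(unsigned elem_bytes, tile_shape *out, size_t cap)
{
    assert(elem_bytes == 4 || elem_bytes == 8);
    unsigned tiles = za_tile_count(elem_bytes);
    size_t n = 0;
    /* tiles is a power of two, so doubling r visits every divisor. */
    for (unsigned r = 1; r <= tiles; r *= 2) {
        if (n < cap) {
            out[n].rows_svl = r;
            out[n].cols_svl = tiles / r;
        }
        n++;
    }
    return n;
}
```

Under this reading, `enumerate_shapes(4, ...)` yields 1x4, 2x2 and 4x1 for single precision, and `enumerate_shapes(8, ...)` yields 1x8, 2x4, 4x2 and 8x1 for double precision, which agrees with the three and four kernels mentioned in the description.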

Developers for the m4sme implementation:
@Luislo1
@figual
@luismacostero

We welcome your thoughts and look forward to your feedback on this implementation.

Luislo1 added 8 commits March 16, 2026 12:09
- The configuration is optimal for the Apple M4 chip's performance cores with an SVL of 512 bits.
- The sgemm kernel's size is 2SVLx2SVL and the dgemm's is 4SVLx2SVL
- Use the ZA tiles' reading capabilities for more efficient packing
- Include packing routines for both kernels
- Disable new kernels using an #if 0 block in the m4sme_p subconfig
- Reuse the SVLx4SVL and SVLx8SVL packing routines
- Reuse the 4SVLx2SVL packing routine
- Adjust block size to optimal value for the smaller SME engine shared by the efficiency core cluster of the Apple M4 chip.
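To make the kernel sizes in the commit notes concrete, the following sketch converts them to element counts and accumulator bytes. It assumes that "SVL" in sizes such as 2SVLx2SVL counts elements of the kernel's data type per streaming vector. The helper names are hypothetical and not part of BLIS.

```c
#include <assert.h>
#include <stddef.h>

/* Elements of elem_bytes each that fit in one streaming vector. */
static size_t svl_elems(size_t svl_bits, size_t elem_bytes)
{
    /* The architecture defines SVL as a multiple of 128 bits. */
    assert(svl_bits % 128 == 0 && elem_bytes > 0);
    return svl_bits / (8 * elem_bytes);
}

/* Total ZA storage in bytes: (SVL/8) x (SVL/8). */
static size_t za_bytes(size_t svl_bits)
{
    size_t row_bytes = svl_bits / 8;
    return row_bytes * row_bytes;
}

/* Accumulator bytes for a (rows_svl * SVL) x (cols_svl * SVL) block. */
static size_t accum_bytes(size_t svl_bits, size_t elem_bytes,
                          size_t rows_svl, size_t cols_svl)
{
    size_t n = svl_elems(svl_bits, elem_bytes);
    return (rows_svl * n) * (cols_svl * n) * elem_bytes;
}
```

With an SVL of 512 bits, a 2SVLx2SVL single-precision block is 32x32 floats and a 4SVLx2SVL double-precision block is 32x16 doubles. Both occupy 4096 bytes, the same as `za_bytes(512)`, so under this assumption each kernel's accumulator fills the whole ZA array.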