Add m4sme_p and m4sme_e sub-configurations by Luislo1 · Pull Request #921 · flame/blis

Luislo1 · 2026-03-16T11:22:27Z

This PR adds two new arm64 sub-configurations to BLIS, m4sme_p and m4sme_e. They target the ARM instruction set architecture and are optimized for the Apple Silicon M4 processor, using the Scalable Matrix Extension 2 (SME2). The m4sme_p sub-configuration is tuned for execution on the performance cores and m4sme_e for the efficiency cores. The armsme kernels folder contains implementations of the packing routines (1m) and the level 3 gemm kernel, both written with intrinsics.

The m4sme_p sub-configuration is selected automatically by the hardware detection heuristic when the machine supports SME2. The m4sme_e sub-configuration must be selected manually. Both sub-configurations are temporarily blacklisted for all compilers except Clang 17 or later on Darwin.
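For readers who want a picture of what such a detection step can look like, here is a minimal sketch. It assumes the check is done through `sysctlbyname` with the `hw.optional.arm.FEAT_SME2` key that recent macOS versions expose; the PR text does not say which mechanism the heuristic uses, and the `arch_choice` enum and function names are hypothetical, not BLIS identifiers.

```c
#include <stddef.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns 1 if the OS reports SME2 support, 0 otherwise.
   Assumption: the sysctl key below is the one consulted; on non-Apple
   platforms this sketch simply reports "not supported". */
static int has_sme2(void)
{
#ifdef __APPLE__
    int value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname("hw.optional.arm.FEAT_SME2", &value, &size, NULL, 0) != 0)
        return 0; /* key missing or query failed: treat as unsupported */
    return value != 0;
#else
    return 0;
#endif
}

/* Hypothetical stand-ins for sub-configuration identifiers. */
typedef enum { ARCH_GENERIC_ARM64, ARCH_M4SME_P } arch_choice;

/* Only the P-core configuration is picked automatically; m4sme_e is
   never returned here because it must be selected manually. */
static arch_choice pick_subconfig(int sme2_available)
{
    return sme2_available ? ARCH_M4SME_P : ARCH_GENERIC_ARM64;
}
```

A caller would use it as `pick_subconfig(has_sme2())`; the fallback value stands for whatever arm64 configuration would otherwise be chosen.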

On the implementation side, the kernels are agnostic to the Streaming Vector Length (SVL). We include three gemm kernels for single precision and four for double precision, corresponding to the arrangements that the ZA storage allows. Note that although the kernels themselves are SVL-agnostic, the bli_cntx_init file is configured specifically for the M4: Apple silicon currently supports only an SVL of 512 bits (64 bytes), which gives an SME 2D array (ZA) of 4 KiB.
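The numbers above can be checked with a little arithmetic. The sketch below assumes that "SVL" in a kernel name such as 2SVLx2SVL means the number of elements in one streaming vector, and relies on the SME rule that an element size of N bytes gives N ZA tiles (4 for 32-bit, 8 for 64-bit). The struct and function names are illustrative and do not come from the PR.

```c
#include <stddef.h>

/* Geometry of the ZA storage for one element type (illustrative names). */
typedef struct {
    size_t svl_bytes;     /* streaming vector length in bytes            */
    size_t za_bytes;      /* whole ZA array: svl_bytes x svl_bytes       */
    size_t elems_per_vec; /* elements of this type per streaming vector  */
    size_t num_tiles;     /* ZA tiles available for this element size    */
    size_t tile_elems;    /* elements held by a single tile              */
} za_geometry;

static za_geometry za_geometry_for(size_t svl_bits, size_t elem_bytes)
{
    za_geometry g;
    g.svl_bytes     = svl_bits / 8;
    g.za_bytes      = g.svl_bytes * g.svl_bytes;
    g.elems_per_vec = g.svl_bytes / elem_bytes;
    g.num_tiles     = elem_bytes; /* 4 tiles for float, 8 for double */
    g.tile_elems    = g.elems_per_vec * g.elems_per_vec;
    return g;
}

/* A kernel of (m_mult * SVL) x (n_mult * SVL) elements covers
   m_mult * n_mult tiles; it fills ZA exactly when that product
   equals the number of tiles for the element type. */
static int kernel_fills_za(za_geometry g, size_t m_mult, size_t n_mult)
{
    return m_mult * n_mult == g.num_tiles;
}
```

With an SVL of 512 bits this gives 16 floats or 8 doubles per vector and a 4096-byte ZA array. Under the stated interpretation, 2SVLx2SVL for sgemm is a 32x32 block and 4SVLx2SVL for dgemm is a 32x16 block, and each of the seven kernel shapes listed in the commits uses every available tile.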

Developers for the m4sme implementation:
@Luislo1
@figual
@luismacostero

We welcome your thoughts and look forward to your feedback on this implementation.


Luislo1 added 8 commits March 16, 2026 12:09

Create m4sme_p configuration and first kernels

6b49e47

- The configuration is optimal for the Apple M4 chip's performance cores with an SVL of 512 bits.
- The sgemm kernel's size is 2SVLx2SVL and the dgemm's is 4SVLx2SVL
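As background on why the kernel sizes are expressed in multiples of SVL: SME accumulates outer products of two vectors into ZA tiles, so a micro-kernel updates a whole block of C per step of the k loop. The following is a scalar reference model of that computation in plain C, not the intrinsics code from this PR; the function name and the packed layouts are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Scalar model of an outer-product GEMM micro-kernel.
   Computes C += sum over p of a_p * b_p^T, where
     a_p is the p-th packed column slice of A (mr values),
     b_p is the p-th packed row slice of B (nr values),
   and C is an mr x nr row-major block with leading dimension ldc.
   On SME hardware, each p iteration would map to outer-product
   accumulate instructions targeting the ZA tiles. */
static void ref_ukernel_outer(size_t mr, size_t nr, size_t k,
                              const float *a_packed,
                              const float *b_packed,
                              float *c, size_t ldc)
{
    for (size_t p = 0; p < k; ++p) {
        const float *a = a_packed + p * mr;
        const float *b = b_packed + p * nr;
        for (size_t i = 0; i < mr; ++i)
            for (size_t j = 0; j < nr; ++j)
                c[i * ldc + j] += a[i] * b[j];
    }
}
```

For the 2SVLx2SVL sgemm case at an SVL of 512 bits, `mr` and `nr` would both be 32 if SVL is read as 16 floats per vector.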

Add packing routines for sgemm and dgemm kernels

c8a29f8

- Use the ZA tiles' reading capabilities for more efficient packing
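One plausible reading of this commit note, offered as an assumption since the PR text gives no detail: a ZA tile can be written as horizontal slices and read back as vertical slices, which transposes a block without extra shuffle instructions. The scalar model below imitates that idea with an ordinary 2D array; `TILE_DIM` and the function name are hypothetical, and a real tile edge would follow from the SVL.

```c
#include <stddef.h>

#define TILE_DIM 4 /* hypothetical small tile edge for illustration */

/* Scalar model of packing with a transpose through a tile:
   rows of the source block are written as horizontal slices,
   then the tile is read out column by column (vertical slices),
   so dst holds the transposed block contiguously. */
static void pack_transpose_via_tile(const float *src, size_t ld_src,
                                    float *dst)
{
    float tile[TILE_DIM][TILE_DIM];

    /* Horizontal slice writes: one source row per tile row. */
    for (size_t h = 0; h < TILE_DIM; ++h)
        for (size_t j = 0; j < TILE_DIM; ++j)
            tile[h][j] = src[h * ld_src + j];

    /* Vertical slice reads: one tile column per output row. */
    for (size_t v = 0; v < TILE_DIM; ++v)
        for (size_t i = 0; i < TILE_DIM; ++i)
            dst[v * TILE_DIM + i] = tile[i][v];
}
```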

Add SVLx4SVL sgemm and SVLx8SVL dgemm kernels

294afc6

- Include packing routines for both kernels
- Disable new kernels using an #if 0 block in the m4sme_p subconfig

Add 4SVLxSVL sgemm and 8SVLxSVL dgemm kernels

0cfed38

- Reuse the SVLx4SVL and SVLx8SVL packing routines

Add 2SVLx4SVL dgemm kernel

e839c2e

- Reuse the 4SVLx2SVL packing routine

Add m4sme_e configuration optimized for E-cores

491f1a3

- Adjust the block size to the optimal value for the smaller SME engine shared by the efficiency core cluster of the Apple M4 chip.

Fix sysctl include

18e8a87

Fix register size values

562a2b8

Luislo1 wants to merge 8 commits into flame:master from Luislo1:master
