Add m4sme_p and m4sme_e sub-configurations #921
Open
Luislo1 wants to merge 8 commits into flame:master from
- The configuration is optimal for the Apple M4 chip's performance cores with an SVL of 512 bits.
- The sgemm kernel's size is 2SVLx2SVL and the dgemm's is 4SVLx2SVL
- Use the ZA tiles' reading capabilities for more efficient packing
- Include packing routines for both kernels
- Disable new kernels using an #if 0 block in the m4sme_p subconfig
- Reuse the SVLx4SVL and SVLx8SVL packing routines
- Reuse the 4SVLx2SVL packing routine
- Adjust block size to optimal value for the smaller SME engine shared by the efficiency core cluster of the Apple M4 chip.
This PR adds two new arm64 sub-configurations to BLIS, m4sme_p and m4sme_e. They target the ARM instruction set architecture and are optimized for the Apple Silicon M4 processor using the Scalable Matrix Extension 2 (SME2). The m4sme_p sub-configuration is tuned for execution on the performance cores and m4sme_e for the efficiency cores. The armsme kernels folder contains implementations of the packing (1m) routines and the level-3 gemm kernel, programmed with intrinsics.
The m4sme_p sub-configuration is chosen through a hardware detection heuristic if the machine supports the SME2 feature. The m4sme_e sub-configuration must be selected manually. The sub-configurations are temporarily blacklisted for all compilers except Clang 17 or later on Darwin.
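As a rough illustration of what such a detection heuristic could look like, the sketch below queries the Darwin sysctl key `hw.optional.arm.FEAT_SME2` and maps the result to a sub-configuration. This is an assumption for illustration only, not the code in this PR: the names `has_sme2`, `choose_subconfig`, and the `subconfig_t` enum are hypothetical, and the fallback value merely stands in for whatever configuration BLIS would otherwise select.

```c
#include <stdbool.h>
#include <stddef.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical identifiers; BLIS uses its own arch ids. */
typedef enum { SUBCONFIG_FALLBACK, SUBCONFIG_M4SME_P } subconfig_t;

/* Returns true only when the OS reports SME2 support.
 * Assumes the Darwin sysctl key hw.optional.arm.FEAT_SME2;
 * on other platforms this sketch conservatively reports false. */
static bool has_sme2(void)
{
#ifdef __APPLE__
    int value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname("hw.optional.arm.FEAT_SME2", &value, &size, NULL, 0) != 0)
        return false; /* key missing: treat as unsupported */
    return value != 0;
#else
    return false;
#endif
}

/* m4sme_p is picked automatically when SME2 is present;
 * m4sme_e is never auto-selected and must be requested manually. */
static subconfig_t choose_subconfig(bool sme2_available)
{
    return sme2_available ? SUBCONFIG_M4SME_P : SUBCONFIG_FALLBACK;
}
```

A caller would use `choose_subconfig(has_sme2())`; keeping the query and the decision separate makes the decision testable on machines without SME2.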
Regarding the implementation details, the kernels are SVL (Streaming Vector Length) agnostic. We have included 3 different gemm kernels for single precision and 4 for double precision, based on the possibilities afforded by the ZA storage. Please note that although the kernels themselves are agnostic, the bli_cntx_init file is configured specifically for the M4: Apple silicon currently supports only an SVL of 512 bits (64 bytes), with an SME 2D array size of 4 KiB.
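To make the numbers concrete, the following sketch works out the micro-tile dimensions for a given SVL. It assumes that "SVL" in a kernel size such as 2SVLx2SVL means the number of elements in one streaming vector, and that the first factor is the row count; the type `tile_geom_t` and both functions are hypothetical helpers, not BLIS code. The one architectural fact it relies on is that the ZA array is SVL-bytes by SVL-bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of BLIS. */
typedef struct {
    size_t mr;       /* micro-tile rows, in elements    */
    size_t nr;       /* micro-tile columns, in elements */
    size_t za_bytes; /* accumulator bytes held in ZA    */
} tile_geom_t;

/* ZA is a square array of SVL-bytes x SVL-bytes. */
static size_t za_capacity_bytes(size_t svl_bits)
{
    size_t svl_bytes = svl_bits / 8;
    return svl_bytes * svl_bytes;
}

/* m_vecs and n_vecs are the multipliers in "m_vecs SVL x n_vecs SVL". */
static tile_geom_t tile_geometry(size_t svl_bits, size_t elem_bytes,
                                 size_t m_vecs, size_t n_vecs)
{
    assert(svl_bits % 128 == 0 && elem_bytes > 0);
    size_t elems_per_vec = (svl_bits / 8) / elem_bytes;
    tile_geom_t g;
    g.mr = m_vecs * elems_per_vec;
    g.nr = n_vecs * elems_per_vec;
    g.za_bytes = g.mr * g.nr * elem_bytes;
    return g;
}
```

Under these assumptions, an SVL of 512 bits gives a 32x32 single-precision tile for 2SVLx2SVL and a 32x16 double-precision tile for 4SVLx2SVL. Each holds 4096 bytes of accumulators, which matches the 4 KiB ZA capacity quoted above.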
Developers for the m4sme implementation:
@Luislo1
@figual
@luismacostero
We welcome your thoughts and look forward to your feedback on this implementation.