This patch introduces the 'zen4' configuration.
Key Changes:
- Added 'zen4' configuration directory and base make_defs.mk.
- Implemented an optimized daddv kernel (bli_daddv_zen4_int) using
AVX-512 intrinsics.
- The daddv implementation uses:
* 8x/4x/2x unrolling for unit-stride vectors to maximize vector-add throughput.
* AVX-512 masked loads/stores for tail (fringe) cases, eliminating
the need for a scalar fallback loop when the vector length is not a
multiple of the SIMD width.
- The initial configuration falls back to 'zen' kernels for the remaining
Level-1 operations, which are scheduled for AVX-512 optimization in future updates.
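
For reviewers unfamiliar with the masked-fringe pattern, here is a minimal sketch of the idea described above. It is not the code in bli_daddv_zen4_int: the names daddv_sketch and daddv_avx512_sketch are hypothetical, only 2x/1x unrolling is shown, unit stride is assumed, and a runtime check falls back to a plain loop on machines without AVX-512F.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define DADDV_SKETCH_AVX512 1

/* Hypothetical illustration: y[i] += x[i] for unit-stride vectors. */
__attribute__((target("avx512f")))
static void daddv_avx512_sketch(size_t n, const double *x, double *y)
{
    size_t i = 0;

    /* 2x-unrolled body: two ZMM registers, 16 doubles per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m512d x0 = _mm512_loadu_pd(x + i);
        __m512d x1 = _mm512_loadu_pd(x + i + 8);
        __m512d y0 = _mm512_loadu_pd(y + i);
        __m512d y1 = _mm512_loadu_pd(y + i + 8);
        _mm512_storeu_pd(y + i,     _mm512_add_pd(y0, x0));
        _mm512_storeu_pd(y + i + 8, _mm512_add_pd(y1, x1));
    }

    /* 1x body: one ZMM register, 8 doubles. */
    for (; i + 8 <= n; i += 8) {
        __m512d xv = _mm512_loadu_pd(x + i);
        __m512d yv = _mm512_loadu_pd(y + i);
        _mm512_storeu_pd(y + i, _mm512_add_pd(yv, xv));
    }

    /* Fringe: 1..7 leftover elements handled by one masked load/add/store.
       Lanes whose mask bit is 0 are neither read nor written. */
    size_t rem = n - i;
    if (rem != 0) {
        __mmask8 k = (__mmask8)((1u << rem) - 1u);
        __m512d xv = _mm512_maskz_loadu_pd(k, x + i);
        __m512d yv = _mm512_maskz_loadu_pd(k, y + i);
        _mm512_mask_storeu_pd(y + i, k, _mm512_add_pd(yv, xv));
    }
}
#endif

/* Dispatcher (hypothetical): uses the AVX-512 path only when the CPU has it. */
void daddv_sketch(size_t n, const double *x, double *y)
{
    if (n == 0) return;
    assert(x != NULL && y != NULL);
#ifdef DADDV_SKETCH_AVX512
    if (__builtin_cpu_supports("avx512f")) {
        daddv_avx512_sketch(n, x, y);
        return;
    }
#endif
    for (size_t i = 0; i < n; ++i) y[i] += x[i];  /* portable fallback */
}
```

With n = 37, for example, the 2x loop covers 32 elements and the remaining 5 go through the masked path, so no scalar tail loop runs on the AVX-512 path.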
This commit introduces high-performance AVX-512 kernels for the SCALV and SETV operations, targeting the AMD Zen 4 architecture across S, D, and Z precisions.
Key Changes:
- Instruction Set: Migrated core loops to AVX-512 (ZMM) intrinsics to maximize data throughput.
- Throughput & Unrolling:
* Implemented aggressive unrolling (e.g., 512 elements for ssetv, 48 complex elements for zscalv) to minimize loop overhead and saturate execution ports.
* Added logic for non-unit stride (incx != 1) fallback paths.
- Remainder Handling: Replaced manual scalar tail loops with AVX-512 masked loads/stores (_mm512_mask_storeu_ps/pd) for cleaner and more efficient fringe-case processing.
- Precision Support:
* Single (s), Double (d), and Double Complex (z) implementations added for both SCALV and SETV.
- Kernels Added:
* bli_sscalv_zen4_int, bli_dscalv_zen4_int, bli_zscalv_zen4_int
* bli_ssetv_zen4_int, bli_dsetv_zen4_int, bli_zsetv_zen4_int
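
As a rough illustration of how the unit-stride fast path, the incx != 1 fallback, and the masked remainder fit together, here is a single-precision scaling sketch. The names sscalv_sketch and sscalv_unit_avx512 are hypothetical and this is not the code in bli_sscalv_zen4_int: the unroll factor is much smaller than the one quoted above, incx is assumed positive, and any special handling of particular alpha values is left out.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SSCALV_SKETCH_AVX512 1

/* Hypothetical illustration: x[i] *= alpha for a unit-stride vector. */
__attribute__((target("avx512f")))
static void sscalv_unit_avx512(size_t n, float alpha, float *x)
{
    const __m512 av = _mm512_set1_ps(alpha);  /* broadcast alpha to 16 lanes */
    size_t i = 0;

    /* 2x-unrolled body: 32 floats per iteration. */
    for (; i + 32 <= n; i += 32) {
        __m512 x0 = _mm512_loadu_ps(x + i);
        __m512 x1 = _mm512_loadu_ps(x + i + 16);
        _mm512_storeu_ps(x + i,      _mm512_mul_ps(av, x0));
        _mm512_storeu_ps(x + i + 16, _mm512_mul_ps(av, x1));
    }

    /* 1x body: 16 floats. */
    for (; i + 16 <= n; i += 16) {
        _mm512_storeu_ps(x + i, _mm512_mul_ps(av, _mm512_loadu_ps(x + i)));
    }

    /* Remainder: 1..15 elements via a 16-bit lane mask instead of a scalar loop. */
    size_t rem = n - i;
    if (rem != 0) {
        __mmask16 k = (__mmask16)((1u << rem) - 1u);
        __m512 xv = _mm512_maskz_loadu_ps(k, x + i);
        _mm512_mask_storeu_ps(x + i, k, _mm512_mul_ps(av, xv));
    }
}
#endif

/* Dispatcher (hypothetical). Assumes incx > 0. */
void sscalv_sketch(size_t n, float alpha, float *x, ptrdiff_t incx)
{
    if (n == 0) return;
    assert(x != NULL && incx > 0);
#ifdef SSCALV_SKETCH_AVX512
    if (incx == 1 && __builtin_cpu_supports("avx512f")) {
        sscalv_unit_avx512(n, alpha, x);
        return;
    }
#endif
    /* Non-unit stride (or no AVX-512): plain strided loop. */
    for (size_t i = 0; i < n; ++i) x[(ptrdiff_t)i * incx] *= alpha;
}
```

A SETV kernel would follow the same shape with the load and multiply dropped: broadcast the value once, then issue full stores in the unrolled body and one masked store for the remainder.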