Implement optimized ccopyv and zcopyv kernels for Zen/2/3 #924
Open
harsdave wants to merge 1 commit into flame:master
Description:
This patch implements high-performance single-precision complex copy (ccopyv) and double-precision complex copy (zcopyv) kernels.
Key Changes:
Vectorization: Uses AVX/AVX2 intrinsics (__m256, __m256d) to process multiple complex elements per instruction (four single-precision or two double-precision complex elements per 256-bit register) in the unit-stride case (incx == 1, incy == 1).
Conjugation Support: Conjugates on the fly in bli_is_conj cases by applying a sign-flip mask (_mm256_setr_ps(1, -1, ...)) that negates the imaginary parts, avoiding a separate pass over the data.
Loop Unrolling: Unrolls the main loop across 8 vector registers (32 elements per iteration for ccopyv, 16 for zcopyv) to increase instruction-level parallelism and help hide memory latency.
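For reviewers who want a concrete picture of the unit-stride path, here is a minimal sketch of the vectorize-and-conjugate idea for the single-precision case. It is not the code in this patch: the `cfloat_t` type and the `ccopyv_sketch` function are hypothetical stand-ins, the loop is unrolled over only 2 registers (8 elements per iteration) rather than 8, non-unit strides are not handled, and the GCC/Clang `target("avx")` attribute is assumed so the file builds without extra flags.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical stand-in for a single-precision complex element:
   real part first, imaginary part second. */
typedef struct { float real; float imag; } cfloat_t;

/* Sketch only: copies n unit-stride complex elements from x to y,
   conjugating them when conj is nonzero. */
__attribute__((target("avx")))
void ccopyv_sketch(int conj, size_t n, const cfloat_t *x, cfloat_t *y)
{
    const float *xf = (const float *)x;
    float *yf = (float *)y;
    size_t i = 0;

    if (conj) {
        /* +1 in the real lanes, -1 in the imaginary lanes. */
        const __m256 mask = _mm256_setr_ps(1.0f, -1.0f, 1.0f, -1.0f,
                                           1.0f, -1.0f, 1.0f, -1.0f);

        /* Each __m256 holds 4 complex elements; 2 registers per iteration. */
        for (; i + 8 <= n; i += 8) {
            __m256 v0 = _mm256_loadu_ps(xf + 2 * i);
            __m256 v1 = _mm256_loadu_ps(xf + 2 * i + 8);
            _mm256_storeu_ps(yf + 2 * i,     _mm256_mul_ps(v0, mask));
            _mm256_storeu_ps(yf + 2 * i + 8, _mm256_mul_ps(v1, mask));
        }

        /* Scalar tail for the remaining 0..7 elements. */
        for (; i < n; ++i) {
            y[i].real =  x[i].real;
            y[i].imag = -x[i].imag;
        }
    } else {
        /* Plain copy: same unrolling, no multiply. */
        for (; i + 8 <= n; i += 8) {
            _mm256_storeu_ps(yf + 2 * i,     _mm256_loadu_ps(xf + 2 * i));
            _mm256_storeu_ps(yf + 2 * i + 8, _mm256_loadu_ps(xf + 2 * i + 8));
        }
        for (; i < n; ++i) {
            y[i] = x[i];
        }
    }
}
```

The conjugation is folded into the same load/store pass, so the data is read and written once. A double-precision variant would follow the same shape with `__m256d`, `_mm256_loadu_pd`, and a four-lane mask, at two complex elements per register.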