Implement optimized ccopyv and zcopyv kernels for Zen/2/3 #924

Open
harsdave wants to merge 1 commit into flame:master from harsdave:amd-optimized-ccopyv-and-zcopyv-kernels
@harsdave
Contributor

Description:
This patch implements high-performance single-precision complex copy (ccopyv) and double-precision complex copy (zcopyv) kernels.

Key Changes:

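Vectorization: Uses AVX/AVX2 intrinsics (__m256, __m256d) to process multiple complex elements per instruction (four single-precision complex values per __m256, two double-precision complex values per __m256d) in the unit-stride case (incx == 1, incy == 1).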

Conjugation Support: Implements efficient on-the-fly conjugation for bli_is_conj cases using sign-flip masks (_mm256_setr_ps(1, -1, ...)), avoiding separate passes.

Loop Unrolling: Employs an 8-register unrolling scheme (32 elements per iteration for ccopyv, 16 for zcopyv) to maximize instruction-level parallelism and hide memory latency.
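
For readers who want a concrete picture of the changes listed above, here is a minimal sketch of a unit-stride single-precision complex copy with on-the-fly conjugation. It is not the code from this patch: the names `cfloat_t`, `ccopyv_sketch`, `ccopyv_avx_unit`, and `ccopyv_scalar` are hypothetical, the conjugation flag is a plain `int` rather than a BLIS `conj_t`, conjugation is assumed to be a multiply by the sign mask, and the loop is unrolled over 2 registers rather than 8 to keep it short.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical stand-in for a single-precision complex type. */
typedef struct { float real; float imag; } cfloat_t;

/* Scalar path: handles any stride and the leftover tail elements. */
static void ccopyv_scalar(int conj, size_t n,
                          const cfloat_t *x, ptrdiff_t incx,
                          cfloat_t *y, ptrdiff_t incy)
{
    for (size_t i = 0; i < n; ++i) {
        const cfloat_t *xi = x + (ptrdiff_t)i * incx;
        cfloat_t *yi = y + (ptrdiff_t)i * incy;
        yi->real = xi->real;
        yi->imag = conj ? -xi->imag : xi->imag;
    }
}

/* AVX path for unit stride. Returns how many elements it copied. */
__attribute__((target("avx")))
static size_t ccopyv_avx_unit(int conj, size_t n,
                              const cfloat_t *x, cfloat_t *y)
{
    /* +1 on real lanes, -1 on imaginary lanes (interleaved layout). */
    const __m256 sign = _mm256_setr_ps(1.0f, -1.0f, 1.0f, -1.0f,
                                       1.0f, -1.0f, 1.0f, -1.0f);
    size_t i = 0;

    /* Two registers per iteration: 2 x 4 = 8 complex elements. */
    for (; i + 8 <= n; i += 8) {
        __m256 v0 = _mm256_loadu_ps((const float *)(x + i));
        __m256 v1 = _mm256_loadu_ps((const float *)(x + i + 4));
        if (conj) {
            v0 = _mm256_mul_ps(v0, sign);
            v1 = _mm256_mul_ps(v1, sign);
        }
        _mm256_storeu_ps((float *)(y + i), v0);
        _mm256_storeu_ps((float *)(y + i + 4), v1);
    }

    /* One register at a time: 4 complex elements. */
    for (; i + 4 <= n; i += 4) {
        __m256 v = _mm256_loadu_ps((const float *)(x + i));
        if (conj) v = _mm256_mul_ps(v, sign);
        _mm256_storeu_ps((float *)(y + i), v);
    }
    return i;
}

/* Driver: vector path for unit stride when AVX is present, scalar otherwise. */
void ccopyv_sketch(int conj, size_t n,
                   const cfloat_t *x, ptrdiff_t incx,
                   cfloat_t *y, ptrdiff_t incy)
{
    if (n == 0) return;
    assert(x != NULL && y != NULL);

    size_t done = 0;
    if (incx == 1 && incy == 1 && __builtin_cpu_supports("avx"))
        done = ccopyv_avx_unit(conj, n, x, y);

    ccopyv_scalar(conj, n - done,
                  x + (ptrdiff_t)done * incx, incx,
                  y + (ptrdiff_t)done * incy, incy);
}
```

Calling `ccopyv_sketch(1, n, x, 1, y, 1)` writes the conjugate of `x` into `y` in a single pass. The sketch assumes GCC or Clang on x86-64 (for the `target` attribute and `__builtin_cpu_supports`). A double-precision variant would follow the same shape with `__m256d`, `_mm256_setr_pd(1.0, -1.0, 1.0, -1.0)`, and two elements per register.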

