Implement optimized ccopyv and zcopyv kernels for Zen/2/3 by harsdave · Pull Request #924 · flame/blis

harsdave · 2026-03-24T13:39:51Z

Description:
This patch implements high-performance complex copy (ccopyv) and double-complex copy (zcopyv) kernels.
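For reference, the operation both kernels compute is y := conjx(x). Below is a minimal scalar sketch of those semantics, not the code in this patch. The `cfloat_t` type and the `ref_ccopyv` name are hypothetical stand-ins for illustration, and positive strides are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a single-precision complex element
   (real part followed by imaginary part). */
typedef struct { float real; float imag; } cfloat_t;

/* Scalar reference for y := conjx(x) over n elements.
   conj_x != 0 negates the imaginary part while copying.
   Assumes incx > 0 and incy > 0 (an assumption of this sketch). */
static void ref_ccopyv(int conj_x, ptrdiff_t n,
                       const cfloat_t *x, ptrdiff_t incx,
                       cfloat_t *y, ptrdiff_t incy)
{
    assert(n <= 0 || (x != NULL && y != NULL));
    for (ptrdiff_t i = 0; i < n; ++i) {
        const cfloat_t xi = x[i * incx];
        y[i * incy].real = xi.real;
        y[i * incy].imag = conj_x ? -xi.imag : xi.imag;
    }
}
```

The zcopyv case has the same shape with `double` fields. A scalar loop like this is also the natural fallback for non-unit strides, where contiguous vector loads do not apply.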

Key Changes:

Vectorization: Uses AVX/AVX2 intrinsics (__m256, __m256d) to process several complex elements per instruction (four single-precision or two double-precision elements per 256-bit register) in the unit-stride case (incx == 1, incy == 1).

Conjugation Support: Implements efficient on-the-fly conjugation for bli_is_conj cases using sign-flip masks (_mm256_setr_ps(1, -1, ...)), avoiding separate passes.

Loop Unrolling: Unrolls the main loop across 8 vector registers (32 elements per iteration for ccopyv, 16 for zcopyv) to increase instruction-level parallelism and help hide memory latency.
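The three points above can be sketched together for the single-precision, unit-stride case. This is an illustrative sketch under stated assumptions, not the kernel from this patch: the function name `sketch_ccopyv_unit` is hypothetical, the data is assumed to be interleaved floats (re, im, re, im, ...), and the sign vector is assumed to be applied with a multiply, which the description does not state. The AVX path is guarded by `__AVX__` so the block still compiles, using only the scalar loop, when AVX is not enabled.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical unit-stride copy of n single-precision complex elements
   stored as interleaved floats: re0, im0, re1, im1, ... */
static void sketch_ccopyv_unit(int conj_x, size_t n, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0; /* index in complex elements, not floats */

#if defined(__AVX__)
    /* +1 on real lanes; -1 on imaginary lanes when conjugating.
       Simplification: without conjugation this multiplies by 1.0f;
       a tuned kernel would more likely use a separate plain-copy loop. */
    const float s = conj_x ? -1.0f : 1.0f;
    const __m256 sign = _mm256_setr_ps(1.0f, s, 1.0f, s, 1.0f, s, 1.0f, s);

    /* Main loop: 8 registers x 4 complex elements = 32 elements per pass.
       The fixed-count inner loops are meant to be fully unrolled. */
    for (; i + 32 <= n; i += 32) {
        __m256 v[8];
        for (int r = 0; r < 8; ++r)
            v[r] = _mm256_loadu_ps(x + 2 * i + 8 * r);
        for (int r = 0; r < 8; ++r)
            _mm256_storeu_ps(y + 2 * i + 8 * r, _mm256_mul_ps(v[r], sign));
    }

    /* Leftover full registers: 4 complex elements at a time. */
    for (; i + 4 <= n; i += 4)
        _mm256_storeu_ps(y + 2 * i,
                         _mm256_mul_ps(_mm256_loadu_ps(x + 2 * i), sign));
#endif

    /* Scalar tail (and the whole copy when AVX is not enabled). */
    for (; i < n; ++i) {
        y[2 * i]     = x[2 * i];
        y[2 * i + 1] = conj_x ? -x[2 * i + 1] : x[2 * i + 1];
    }
}
```

Conjugating inside the load/store loop is what avoids a second pass over y. The zcopyv analogue would use `__m256d`, hold two elements per register, and so move 16 elements per unrolled pass.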

harsdave wants to merge 1 commit into flame:master from harsdave:amd-optimized-ccopyv-and-zcopyv-kernels
