Amd zen4 support #925

Open
harsdave wants to merge 5 commits into flame:master from harsdave:amd-zen4-support

Conversation

@harsdave
Contributor

This patch introduces the 'zen4' configuration.

Key Changes:

  • Added 'zen4' configuration directory and base make_defs.mk.
  • Implemented an optimized daddv kernel (bli_daddv_zen4_int) using
    AVX-512 intrinsics.
  • The daddv implementation uses:
    • 8x/4x/2x unrolling for unit-stride vectors to maximize vector-add throughput.
    • AVX-512 masked loads/stores for tail (fringe) cases, eliminating
      the need for scalar fallback loops when the vector length is not a
      multiple of the SIMD width.
  • Initial configuration uses 'zen' fallbacks for remaining Level-1
    kernels, which are scheduled for AVX-512 optimization in future updates.
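
For readers unfamiliar with the pattern, the unrolled main loop plus masked tail described above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: `sketch_daddv`, `sketch_daddv_avx512`, and `sketch_daddv_ref` are hypothetical names, only a 2x unroll is shown instead of 8x/4x/2x, and the runtime check with a scalar reference path is added purely so the sketch builds and runs on machines without AVX-512 (it assumes GCC or Clang).

```c
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_HAVE_AVX512 1

/* Hypothetical helper (not bli_daddv_zen4_int): y[i] += x[i], unit stride. */
__attribute__((target("avx512f")))
static void sketch_daddv_avx512(size_t n, const double *x, double *y)
{
    size_t i = 0;

    /* 2x unrolled main loop: 16 doubles per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m512d y0 = _mm512_add_pd(_mm512_loadu_pd(y + i),
                                   _mm512_loadu_pd(x + i));
        __m512d y1 = _mm512_add_pd(_mm512_loadu_pd(y + i + 8),
                                   _mm512_loadu_pd(x + i + 8));
        _mm512_storeu_pd(y + i, y0);
        _mm512_storeu_pd(y + i + 8, y1);
    }

    /* At most one remaining full vector of 8 doubles. */
    for (; i + 8 <= n; i += 8) {
        __m512d y0 = _mm512_add_pd(_mm512_loadu_pd(y + i),
                                   _mm512_loadu_pd(x + i));
        _mm512_storeu_pd(y + i, y0);
    }

    /* Fringe: 1..7 leftover elements handled by one masked load/add/store. */
    size_t rem = n - i;
    if (rem != 0) {
        __mmask8 m = (__mmask8)((1u << rem) - 1u);  /* low 'rem' bits set */
        __m512d xv = _mm512_maskz_loadu_pd(m, x + i);
        __m512d yv = _mm512_maskz_loadu_pd(m, y + i);
        _mm512_mask_storeu_pd(y + i, m, _mm512_add_pd(yv, xv));
    }
}
#endif

/* Scalar reference path, used when AVX-512F is unavailable. */
static void sketch_daddv_ref(size_t n, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += x[i];
}

/* Illustrative entry point: picks the AVX-512 path at run time if possible. */
void sketch_daddv(size_t n, const double *x, double *y)
{
#ifdef SKETCH_HAVE_AVX512
    if (__builtin_cpu_supports("avx512f")) {
        sketch_daddv_avx512(n, x, y);
        return;
    }
#endif
    sketch_daddv_ref(n, x, y);
}
```

The masked load and store only touch the lanes selected by the mask, so the fringe step does not read or write past the end of either vector.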

This commit introduces high-performance AVX-512 kernels for the SCALV and SETV
operations, targeting the AMD Zen 4 architecture across S, D, and Z precisions.

Key Changes:
- Instruction Set: Migrated core loops to AVX-512 (ZMM) intrinsics to maximize
  data throughput.
- Throughput & Unrolling: Implemented aggressive unrolling (e.g., 512 elements
  for ssetv, 48 complex elements for zscalv) to minimize loop overhead and
  saturate execution ports.
- Added fallback paths for non-unit stride (incx != 1).
- Remainder Handling: Replaced manual scalar tail loops with AVX-512 masked
  loads/stores (_mm512_mask_storeu_ps/pd) for cleaner and more efficient
  fringe-case processing.
- Precision Support: Single (s), Double (d), and Double Complex (z)
  implementations added for both SCALV and SETV.

Kernels Added:
- bli_sscalv_zen4_int, bli_dscalv_zen4_int, bli_zscalv_zen4_int
- bli_ssetv_zen4_int, bli_dsetv_zen4_int, bli_zsetv_zen4_int
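
As a rough illustration of the stride fallback and masked remainder handling listed above, a single-precision setv could look something like the sketch below. It is not the code in this commit: `sketch_ssetv` and `sketch_ssetv_avx512` are hypothetical names, the unroll factor is 2x rather than the much larger factor mentioned above, `incx` is assumed to be at least 1, and the runtime CPU check is added only so the example runs on hardware without AVX-512 (GCC or Clang assumed).

```c
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_HAVE_AVX512 1

/* Hypothetical helper (not bli_ssetv_zen4_int): x[i] = alpha, unit stride. */
__attribute__((target("avx512f")))
static void sketch_ssetv_avx512(size_t n, float alpha, float *x)
{
    const __m512 av = _mm512_set1_ps(alpha);  /* broadcast alpha to 16 lanes */
    size_t i = 0;

    /* 2x unrolled main loop: 32 floats per iteration. */
    for (; i + 32 <= n; i += 32) {
        _mm512_storeu_ps(x + i, av);
        _mm512_storeu_ps(x + i + 16, av);
    }

    /* At most one remaining full vector of 16 floats. */
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(x + i, av);

    /* Fringe: 1..15 leftover elements written with a single masked store. */
    size_t rem = n - i;
    if (rem != 0) {
        __mmask16 m = (__mmask16)((1u << rem) - 1u);  /* low 'rem' bits set */
        _mm512_mask_storeu_ps(x + i, m, av);
    }
}
#endif

/* Illustrative entry point; assumes incx >= 1. */
void sketch_ssetv(size_t n, float alpha, float *x, size_t incx)
{
#ifdef SKETCH_HAVE_AVX512
    /* Vector path only for contiguous data on CPUs that report AVX-512F. */
    if (incx == 1 && __builtin_cpu_supports("avx512f")) {
        sketch_ssetv_avx512(n, alpha, x);
        return;
    }
#endif
    /* Fallback: non-unit stride (or no AVX-512) uses a plain scalar loop. */
    for (size_t i = 0; i < n; ++i)
        x[i * incx] = alpha;
}
```

Keeping the strided case in a scalar loop is the simplest choice for a sketch; a gather/scatter-based path would be another option, but nothing in this commit message says which approach the kernels take.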