llama.cpp/ggml/src/ggml-hexagon/htp
Max Krasnyansky 95ea9e0861
Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611)
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
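The scalar shape of this op can be sketched as follows. `f32_to_f16` here is a simplified host-side converter (subnormals flushed to zero, mantissa truncated rather than rounded) standing in for the HVX conversion path; function names are illustrative, not the backend's actual symbols:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// simplified fp32 -> fp16 bit conversion (no rounding, subnormals -> 0)
static uint16_t f32_to_f16(float f) {
    uint32_t x; memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000;
    int32_t  exp  = (int32_t)((x >> 23) & 0xFF) - 127 + 15;
    uint32_t mant = x & 0x7FFFFF;
    if (exp <= 0)  return (uint16_t)sign;             // underflow: flush to signed zero
    if (exp >= 31) return (uint16_t)(sign | 0x7C00);  // overflow/inf: saturate to inf
    return (uint16_t)(sign | ((uint32_t)exp << 10) | (mant >> 13));
}

// scatter fp32 source rows into an fp16 destination at positions given by an
// i64 index per source row (an i32 variant would differ only in the idx type)
void set_rows_f32_f16_i64(uint16_t *dst, const float *src,
                          const int64_t *idx, int n_src_rows, int n_cols) {
    for (int r = 0; r < n_src_rows; r++) {
        uint16_t    *drow = dst + (size_t)idx[r] * n_cols;
        const float *srow = src + (size_t)r      * n_cols;
        for (int c = 0; c < n_cols; c++) drow[c] = f32_to_f16(srow[c]);
    }
}
```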

* hexagon: add support for SCALE fp32
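SCALE multiplies a tensor by a scalar. A reference fp32 loop looks like the sketch below; the HVX helpers mentioned later in this commit vectorize this same loop across 128-byte vectors (32 floats at a time):

```c
#include <stddef.h>

// reference scalar SCALE for fp32: dst = src * s
// (an HVX version would process one 128-byte vector per iteration)
void scale_f32_ref(float *dst, const float *src, float s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * s;
    }
}
```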

* hexagon: replace scalar fp32 -> fp16 copy with HVX

* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA

- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
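The double-buffered prefetch pattern described above can be sketched as follows. `dma_start`/`dma_wait` are hypothetical stand-ins for the backend's DMA queue (the real transfers live in htp-dma.c), modeled here with plain `memcpy` so the overlap logic is testable on a host:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct { const uint8_t *src; uint8_t *dst; size_t len; int pending; } dma_req;

static void dma_start(dma_req *r, uint8_t *dst, const uint8_t *src, size_t len) {
    r->src = src; r->dst = dst; r->len = len; r->pending = 1;  // transfer "in flight"
}
static void dma_wait(dma_req *r) {
    if (r->pending) { memcpy(r->dst, r->src, r->len); r->pending = 0; }
}

// sum all bytes of src (n_blocks * block bytes) through a 2*block scratch area:
// while block i is processed, block i+1 is already in flight into the other half
unsigned process_blocks(const uint8_t *src, size_t n_blocks, size_t block,
                        uint8_t *scratch) {
    dma_req req[2] = {0};
    unsigned sum = 0;
    dma_start(&req[0], scratch, src, block);              // prefetch block 0
    for (size_t i = 0; i < n_blocks; i++) {
        if (i + 1 < n_blocks)                             // kick off next transfer early
            dma_start(&req[(i + 1) & 1], scratch + ((i + 1) & 1) * block,
                      src + (i + 1) * block, block);
        dma_wait(&req[i & 1]);                            // wait only for the block we need
        const uint8_t *buf = scratch + (i & 1) * block;
        for (size_t j = 0; j < block; j++) sum += buf[j]; // "compute" on resident data
    }
    return sum;
}
```

The synchronization point matters: each iteration waits only on the buffer it is about to read, never on the one still being filled, which is what prevents the race conditions the commit mentions.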

* hexagon: use aligned mad_f16

* hexagon: flash_attn: use more aligned ops

* hexagon: optimize scale_f32 hvx helpers

* hexagon: unroll fa loops

* hexagon: remove unused set-rows log

* hexagon: flash_attn_ext add support for DMAing Q

- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.

* hexagon: fix handling of NaNs in HVX dot products

* hexagon: clean up scratchpad (spad) allocation in flash-attn

* hexagon: improve fp16/fp32 matmul

- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy and convert weights from DDR to VTCM.
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.
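A scalar reference for what these kernels compute is sketched below: fp16 inputs with accumulation widened to fp32 (an assumption here; the HVX intrinsics keep more elements in flight per vector). `f16_to_f32` is a simplified converter and `vec_dot_f16_ref` is an illustrative name, not the commit's actual symbol:

```c
#include <stdint.h>
#include <string.h>

// simplified fp16 -> fp32 bit conversion (subnormals treated as zero)
static float f16_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t x;
    if (exp == 0)       x = sign;                              // zero/subnormal -> zero
    else if (exp == 31) x = sign | 0x7F800000 | (mant << 13);  // inf/NaN
    else                x = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f; memcpy(&f, &x, sizeof f);
    return f;
}

// reference f16 x f16 dot product, accumulated in fp32
float vec_dot_f16_ref(const uint16_t *a, const uint16_t *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += f16_to_f32(a[i]) * f16_to_f32(b[i]);
    }
    return acc;
}
```

An `_rx2` variant, as the name suggests, would run two such rows against the same input column to amortize loads; that detail is omitted here.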

* hexagon: fix HVX_ARCH check

* hexagon: matmul cleanup and fp16 fixes

Use the aligned vec_dot_f16 for 2D matmuls and the unaligned version for 4D.

* hexagon: fix fp16 x fp16 matmuls and some minor refactoring

* hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.

* hexagon: optimize set-rows threading
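One way to read this optimization: cap the worker count at the row count so a tiny workload does not fan out across threads that would receive nothing. A sketch of that partitioning (names and exact scheme are assumptions, not the backend's code):

```c
// compute this thread's [first, last) row range, using at most one thread per row
void rows_for_thread(int n_rows, int n_threads, int ith, int *first, int *last) {
    int used = n_threads < n_rows ? n_threads : n_rows;  // cap workers at row count
    if (ith >= used) { *first = *last = 0; return; }     // surplus threads do nothing
    int per = n_rows / used;
    int rem = n_rows % used;                             // first `rem` threads take one extra row
    *first = ith * per + (ith < rem ? ith : rem);
    *last  = *first + per + (ith < rem ? 1 : 0);
}
```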

* hexagon: update adb/run-bench.sh to properly support experimental and verbose options

* hexagon: flash_attn: use aligned vectors for dot products
2026-01-06 17:38:29 -08:00
CMakeLists.txt Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
act-ops.c ggml-hexagon: optimize activation function (#18393) 2026-01-02 21:24:24 -08:00
binary-ops.c hexagon: various Op fixes (#17135) 2025-11-11 15:25:04 -08:00
cmake-toolchain.cmake Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
flash-attn-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
get-rows-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
htp-ctx.h Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
htp-dma.c ggml-hexagon: gelu optimization (#18151) 2025-12-22 10:56:52 -08:00
htp-dma.h ggml-hexagon: gelu optimization (#18151) 2025-12-22 10:56:52 -08:00
htp-msg.h Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
htp-ops.h Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
htp_iface.idl Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
hvx-exp.c ggml-hexagon: add `hex_supported_buffer` for better buffer supported check (#17212) 2025-11-23 14:26:36 -08:00
hvx-inverse.c ggml-hexagon: add `hex_supported_buffer` for better buffer supported check (#17212) 2025-11-23 14:26:36 -08:00
hvx-sigmoid.c Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
hvx-utils.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
hvx-utils.h Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
main.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
matmul-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
ops-utils.h hexagon: various Op fixes (#17135) 2025-11-11 15:25:04 -08:00
rope-ops.c ggml-hexagon: fix `rope` failure at `test-backend-ops` (#17565) 2025-12-10 14:45:43 -08:00
set-rows-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
softmax-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
unary-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
worker-pool.c Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
worker-pool.h Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00