llama.cpp/ggml/src/ggml-hexagon/htp
Max Krasnyansky 95ea9e0861
Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611)
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
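The scalar shape of this op can be sketched as follows. `f32_to_f16` here is a simplified host-side converter (subnormals flushed to zero, mantissa truncated rather than rounded) standing in for the HVX conversion path; function names are illustrative, not the backend's actual symbols:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// simplified fp32 -> fp16 bit conversion (no rounding, subnormals -> 0)
static uint16_t f32_to_f16(float f) {
    uint32_t x; memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000;
    int32_t  exp  = (int32_t)((x >> 23) & 0xFF) - 127 + 15;
    uint32_t mant = x & 0x7FFFFF;
    if (exp <= 0)  return (uint16_t)sign;             // underflow: flush to signed zero
    if (exp >= 31) return (uint16_t)(sign | 0x7C00);  // overflow/inf: saturate to inf
    return (uint16_t)(sign | ((uint32_t)exp << 10) | (mant >> 13));
}

// scatter fp32 source rows into an fp16 destination at positions given by an
// i64 index per source row (an i32 variant would differ only in the idx type)
void set_rows_f32_f16_i64(uint16_t *dst, const float *src,
                          const int64_t *idx, int n_src_rows, int n_cols) {
    for (int r = 0; r < n_src_rows; r++) {
        uint16_t    *drow = dst + (size_t)idx[r] * n_cols;
        const float *srow = src + (size_t)r      * n_cols;
        for (int c = 0; c < n_cols; c++) drow[c] = f32_to_f16(srow[c]);
    }
}
```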

* hexagon: add support for SCALE fp32
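SCALE multiplies a tensor by a scalar. A reference fp32 loop looks like the sketch below; the HVX helpers mentioned later in this commit vectorize this same loop across 128-byte vectors (32 floats at a time):

```c
#include <stddef.h>

// reference scalar SCALE for fp32: dst = src * s
// (an HVX version would process one 128-byte vector per iteration)
void scale_f32_ref(float *dst, const float *src, float s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * s;
    }
}
```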

* hexagon: replace scalar fp32 -> fp16 copy with HVX

* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA

- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
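The double-buffered prefetch pattern described above can be sketched as follows. `dma_start`/`dma_wait` are hypothetical stand-ins for the backend's DMA queue (the real transfers live in htp-dma.c), modeled here with plain `memcpy` so the overlap logic is testable on a host:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct { const uint8_t *src; uint8_t *dst; size_t len; int pending; } dma_req;

static void dma_start(dma_req *r, uint8_t *dst, const uint8_t *src, size_t len) {
    r->src = src; r->dst = dst; r->len = len; r->pending = 1;  // transfer "in flight"
}
static void dma_wait(dma_req *r) {
    if (r->pending) { memcpy(r->dst, r->src, r->len); r->pending = 0; }
}

// sum all bytes of src (n_blocks * block bytes) through a 2*block scratch area:
// while block i is processed, block i+1 is already in flight into the other half
unsigned process_blocks(const uint8_t *src, size_t n_blocks, size_t block,
                        uint8_t *scratch) {
    dma_req req[2] = {0};
    unsigned sum = 0;
    dma_start(&req[0], scratch, src, block);              // prefetch block 0
    for (size_t i = 0; i < n_blocks; i++) {
        if (i + 1 < n_blocks)                             // kick off next transfer early
            dma_start(&req[(i + 1) & 1], scratch + ((i + 1) & 1) * block,
                      src + (i + 1) * block, block);
        dma_wait(&req[i & 1]);                            // wait only for the block we need
        const uint8_t *buf = scratch + (i & 1) * block;
        for (size_t j = 0; j < block; j++) sum += buf[j]; // "compute" on resident data
    }
    return sum;
}
```

The synchronization point matters: each iteration waits only on the buffer it is about to read, never on the one still being filled, which is what prevents the race conditions the commit mentions.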

* hexagon: use aligned mad_f16

* hexagon: flash_attn: use more aligned ops

* hexagon: optimize scale_f32 hvx helpers

* hexagon: unroll fa loops

* hexagon: remove unused set-rows log

* hexagon: flash_attn_ext add support for DMAing Q

- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.

* hexagon: fix handling of NaNs in HVX dot products

* hexagon: clean up scratchpad (spad) allocation in flash-attn

* hexagon: improve fp16/fp32 matmul

- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy and convert weights from DDR to VTCM.
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.
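A scalar reference for what these kernels compute is sketched below: fp16 inputs with accumulation widened to fp32 (an assumption here; the HVX intrinsics keep more elements in flight per vector). `f16_to_f32` is a simplified converter and `vec_dot_f16_ref` is an illustrative name, not the commit's actual symbol:

```c
#include <stdint.h>
#include <string.h>

// simplified fp16 -> fp32 bit conversion (subnormals treated as zero)
static float f16_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t x;
    if (exp == 0)       x = sign;                              // zero/subnormal -> zero
    else if (exp == 31) x = sign | 0x7F800000 | (mant << 13);  // inf/NaN
    else                x = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f; memcpy(&f, &x, sizeof f);
    return f;
}

// reference f16 x f16 dot product, accumulated in fp32
float vec_dot_f16_ref(const uint16_t *a, const uint16_t *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += f16_to_f32(a[i]) * f16_to_f32(b[i]);
    }
    return acc;
}
```

An `_rx2` variant, as the name suggests, would run two such rows against the same input column to amortize loads; that detail is omitted here.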

* hexagon: fix HVX_ARCH check

* hexagon: matmul cleanup and fp16 fixes

Use the aligned vec_dot_f16 for 2D matmuls and the unaligned version for 4D.

* hexagon: fix fp16 x fp16 matmuls and some minor refactoring

* hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.

* hexagon: optimize set-rows threading
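One way to read this optimization: cap the worker count at the row count so a tiny workload does not fan out across threads that would receive nothing. A sketch of that partitioning (names and exact scheme are assumptions, not the backend's code):

```c
// compute this thread's [first, last) row range, using at most one thread per row
void rows_for_thread(int n_rows, int n_threads, int ith, int *first, int *last) {
    int used = n_threads < n_rows ? n_threads : n_rows;  // cap workers at row count
    if (ith >= used) { *first = *last = 0; return; }     // surplus threads do nothing
    int per = n_rows / used;
    int rem = n_rows % used;                             // first `rem` threads take one extra row
    *first = ith * per + (ith < rem ? ith : rem);
    *last  = *first + per + (ith < rem ? 1 : 0);
}
```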

* hexagon: update adb/run-bench.sh to properly support experimental and verbose options

* hexagon: flash_attn: use aligned vectors for dot products
2026-01-06 17:38:29 -08:00
CMakeLists.txt Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
act-ops.c ggml-hexagon: optimize activation function (#18393) 2026-01-02 21:24:24 -08:00
binary-ops.c hexagon: various Op fixes (#17135) 2025-11-11 15:25:04 -08:00
cmake-toolchain.cmake Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
flash-attn-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
get-rows-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
htp-ctx.h Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
htp-dma.c ggml-hexagon: gelu optimization (#18151) 2025-12-22 10:56:52 -08:00
htp-dma.h ggml-hexagon: gelu optimization (#18151) 2025-12-22 10:56:52 -08:00
htp-msg.h Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
htp-ops.h Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
htp_iface.idl Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
hvx-exp.c ggml-hexagon: add `hex_supported_buffer` for better buffer supported check (#17212) 2025-11-23 14:26:36 -08:00
hvx-inverse.c ggml-hexagon: add `hex_supported_buffer` for better buffer supported check (#17212) 2025-11-23 14:26:36 -08:00
hvx-sigmoid.c Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
hvx-utils.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
hvx-utils.h Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
main.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
matmul-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
ops-utils.h hexagon: various Op fixes (#17135) 2025-11-11 15:25:04 -08:00
rope-ops.c ggml-hexagon: fix `rope` failure at `test-backend-ops` (#17565) 2025-12-10 14:45:43 -08:00
set-rows-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
softmax-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
unary-ops.c Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611) 2026-01-06 17:38:29 -08:00
worker-pool.c Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
worker-pool.h Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00