llama.cpp/scripts/snapdragon/adb
Max Krasnyansky 95ea9e0861
Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611)
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
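
The op semantics are simple; below is a minimal scalar sketch of what the kernel does (the `__fp16` storage typedef and all names are assumptions for illustration, not the backend's actual HVX code):

```c
/*
 * Scalar sketch of SET_ROWS fp32 -> fp16 with either i32 or i64 row indices.
 * Illustrative only; the real kernel vectorizes the fp32 -> fp16 conversion.
 */
#include <stdint.h>

typedef __fp16 f16_t;   // assumption: clang __fp16 half-precision storage type

static void set_rows_f32_to_f16(f16_t *dst, const float *src, const void *row_idx,
                                int idx_is_i64, int64_t n_rows, int64_t row_len) {
    for (int64_t r = 0; r < n_rows; r++) {
        // the row-index tensor may be i32 or i64
        int64_t row = idx_is_i64 ? ((const int64_t *) row_idx)[r]
                                 : (int64_t) ((const int32_t *) row_idx)[r];
        const float *s = src + r   * row_len;
        f16_t       *d = dst + row * row_len;
        for (int64_t i = 0; i < row_len; i++) {
            d[i] = (f16_t) s[i];   // per-element fp32 -> fp16 conversion
        }
    }
}
```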

* hexagon: add support for SCALE fp32

* hexagon: replace scalar fp32 -> fp16 copy with HVX

* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA

- Implements double-buffered DMA prefetching for K, V, and Mask tensors (see the sketch after this list).
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
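
A minimal sketch of the double-buffering pattern for the K tensor, assuming hypothetical `dma_start`/`dma_wait` helpers that stand in for the backend's user-DMA wrappers (names, signatures, and the loop structure are illustrative only):

```c
/*
 * Double-buffered prefetch of K rows into VTCM. dma_start()/dma_wait() are
 * hypothetical stand-ins; the real code streams V and the mask the same way.
 */
#include <stddef.h>
#include <stdint.h>

#define FLASH_ATTN_BLOCK_SIZE 128
#define VTCM_ALIGN            128

static inline size_t pad128(size_t n) { return (n + VTCM_ALIGN - 1) & ~(size_t) (VTCM_ALIGN - 1); }

// hypothetical DMA helpers: copy n_rows rows of row_bytes each into VTCM,
// writing with dst_stride so every row stays 128-byte aligned
void dma_start(void *dst_vtcm, const void *src_ddr, size_t n_rows, size_t row_bytes, size_t dst_stride);
void dma_wait(void *dst_vtcm);

void stream_k_blocks(const uint8_t *k_ddr, size_t k_row_bytes, size_t n_rows, uint8_t *vtcm[2]) {
    const size_t stride = pad128(k_row_bytes);   // aligned rows for HVX loads
    const size_t blk    = FLASH_ATTN_BLOCK_SIZE;
    int cur = 0;

    dma_start(vtcm[cur], k_ddr, blk < n_rows ? blk : n_rows, k_row_bytes, stride);

    for (size_t r0 = 0; r0 < n_rows; r0 += blk) {
        const int    nxt   = cur ^ 1;
        const size_t next0 = r0 + blk;
        if (next0 < n_rows) {
            const size_t next_rows = (n_rows - next0) < blk ? (n_rows - next0) : blk;
            // kick off the next block while the current one is being consumed
            dma_start(vtcm[nxt], k_ddr + next0 * k_row_bytes, next_rows, k_row_bytes, stride);
        }
        dma_wait(vtcm[cur]);                     // current block must be resident before use
        // ... flash-attention inner loop over vtcm[cur] goes here ...
        cur = nxt;
    }
}
```

The point of the pattern is to overlap DMA latency with compute: while the inner loop consumes one VTCM block, the DMA engine is already filling the other.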

* hexagon: use aligned mad_f16

* hexagon: flash_attn: more aligned ops

* hexagon: optimize scale_f32 hvx helpers
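
For reference, SCALE on fp32 is an elementwise multiply by a constant; the scalar form below shows what the HVX helpers compute (the vectorized version handles a full 128-byte HVX vector, i.e. 32 floats, per operation):

```c
/*
 * Scalar reference of the scale_f32 helpers; for clarity only, the HVX
 * version does the same work 32 floats at a time.
 */
static void scale_f32_ref(float *dst, const float *src, float s, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] * s;
    }
}
```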

* hexagon: unroll fa loops

* hexagon: remove unused set-rows log

* hexagon: flash_attn_ext add support for DMAing Q

- Update `op_flash_attn_ext` to include the Q row size in the scratchpad allocation (see the sketch after this list).
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.
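
An illustrative view of the scratchpad sizing with the padded Q row included; all names and the exact layout are assumptions for the sketch (the real allocation lives in `op_flash_attn_ext`, and the per-thread Q DMA in `flash_attn_ext_f16_thread`):

```c
/*
 * Illustrative scratchpad sizing: every row is padded to 128 bytes so HVX
 * loads stay aligned, K/V/mask each get two blocks for double-buffering,
 * and Q gets its own slot that is DMAed in per thread.
 */
#include <stddef.h>

#define VTCM_ALIGN            128
#define FLASH_ATTN_BLOCK_SIZE 128

static inline size_t pad128(size_t n) { return (n + VTCM_ALIGN - 1) & ~(size_t) (VTCM_ALIGN - 1); }

size_t flash_attn_spad_size(size_t q_row_bytes, size_t k_row_bytes,
                            size_t v_row_bytes, size_t mask_row_bytes, size_t n_q_rows) {
    const size_t q_rows  = pad128(q_row_bytes) * n_q_rows;              // Q rows staged in VTCM
    const size_t k_block = pad128(k_row_bytes)    * FLASH_ATTN_BLOCK_SIZE;
    const size_t v_block = pad128(v_row_bytes)    * FLASH_ATTN_BLOCK_SIZE;
    const size_t m_block = pad128(mask_row_bytes) * FLASH_ATTN_BLOCK_SIZE;

    return q_rows + 2 * (k_block + v_block + m_block);                  // 2x for double-buffering
}
```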

* hexagon: fix handling of NaNs in HVX dot products

* hexagon: cleanup spad allocation in flash-attn

* hexagon: improve fp16/fp32 matmul

- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot-product intrinsics (a plain-C reference follows this list).
- Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM.
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.
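
A plain-C reference of what the two kernels compute (the `f16_t` typedef is an assumption for the sketch; the real kernels use HVX multiply-accumulate intrinsics over aligned 128-byte vectors):

```c
/*
 * Scalar reference for the fp16 dot-product kernels. vec_dot_f16_f16 handles
 * one weight row; the _rx2 variant processes two adjacent rows per call so
 * the activation vector x is loaded once for both rows.
 */
typedef __fp16 f16_t;   // assumption: clang __fp16 half-precision storage type

static float vec_dot_f16_f16_ref(int n, const f16_t *row, const f16_t *x) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += (float) row[i] * (float) x[i];
    }
    return acc;
}

static void vec_dot_f16_f16_rx2_ref(int n, float *s0, float *s1,
                                    const f16_t *r0, const f16_t *r1, const f16_t *x) {
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int i = 0; i < n; i++) {
        const float xi = (float) x[i];   // reuse the activation element for both rows
        acc0 += (float) r0[i] * xi;
        acc1 += (float) r1[i] * xi;
    }
    *s0 = acc0;
    *s1 = acc1;
}
```

Per the notes above, fp32 weights are first copied/converted to fp16 in VTCM by `quantize_fp32_f16`, so the same fp16 kernels cover both the fp16 and the fp32 weight case when VTCM capacity allows.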

* hexagon: fix HVX_ARCH check

* hexagon: matmul cleanup and fp16 fixes

Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d.

* hexagon: fix fp16 x fp16 matmuls and some minor refactoring

* hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.
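
A scalar sketch of the GET_ROWS f32 -> f32 gather, plus an illustration of the threading cap described above (names and the index type are assumptions, not the backend's actual code):

```c
/*
 * GET_ROWS f32 -> f32: gather whole source rows by index. The threading
 * tweak caps the number of workers when there are only a few rows, so
 * threads are not spawned just to sit idle.
 */
#include <stdint.h>
#include <string.h>

static void get_rows_f32(float *dst, const float *src, const int64_t *row_idx,
                         int64_t n_rows, int64_t row_len) {
    for (int64_t r = 0; r < n_rows; r++) {
        memcpy(dst + r * row_len, src + row_idx[r] * row_len, row_len * sizeof(float));
    }
}

// threading cap: never use more workers than there are rows to process
static int rows_n_workers(int max_threads, int64_t n_rows) {
    return (int) (n_rows < max_threads ? n_rows : max_threads);
}
```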

* hexagon: optimize set-rows threading

* hexagon: update adb/run-bench.sh to properly support experimental and verbose options

* hexagon: flash_attn: use aligned vectors for dot products
2026-01-06 17:38:29 -08:00
llama-cli.farf       Add experimental ggml-hexagon backend for the Hexagon NPU (#16547)                                      2025-10-22 13:47:09 -07:00
run-bench.sh         Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611)     2026-01-06 17:38:29 -08:00
run-cli.sh           ggml-hexagon: create generalized functions for cpu side op (#17500)                                     2025-12-22 23:13:24 -08:00
run-completion.sh    ggml-hexagon: create generalized functions for cpu side op (#17500)                                     2025-12-22 23:13:24 -08:00
run-mtmd.sh          ggml-hexagon: mm for mtmd (#17894)                                                                      2025-12-15 10:53:56 -08:00
run-tool.sh          Add experimental ggml-hexagon backend for the Hexagon NPU (#16547)                                      2025-10-22 13:47:09 -07:00