* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention
* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
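  A minimal scalar sketch of what this set-rows variant does, assuming the op scatters fp32 source rows into an fp16 destination at positions given by an i32 or i64 row-index tensor (the function name and signature are illustrative, not the backend's actual kernel):

  ```c
  #include <stddef.h>
  #include <stdint.h>

  typedef __fp16 f16_t; // assumes __fp16 support (e.g. hexagon-clang)

  // Scatter fp32 rows into an fp16 destination: dst[idx[r]] = convert(src[r]).
  // idx_is_i64 selects between i32 and i64 row indices.
  static void set_rows_f32_to_f16(const float * src, f16_t * dst, size_t row_len,
                                  const void * idx, size_t n_rows, int idx_is_i64) {
      for (size_t r = 0; r < n_rows; r++) {
          const size_t dst_row = idx_is_i64 ? (size_t) ((const int64_t *) idx)[r]
                                            : (size_t) ((const int32_t *) idx)[r];
          const float * s = src + r * row_len;
          f16_t       * d = dst + dst_row * row_len;
          for (size_t i = 0; i < row_len; i++) {
              d[i] = (f16_t) s[i]; // per-element fp32 -> fp16 conversion
          }
      }
  }
  ```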
* hexagon: add support for SCALE fp32
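  For reference, a scalar sketch of the SCALE op, assuming it multiplies every fp32 element by a constant; the HVX helpers vectorize this same loop, 32 floats per 128-byte vector:

  ```c
  #include <stddef.h>

  // Scalar reference: dst[i] = src[i] * scale. The HVX path vectorizes this loop.
  static void scale_f32_ref(const float * src, float * dst, size_t n, float scale) {
      for (size_t i = 0; i < n; i++) {
          dst[i] = src[i] * scale;
      }
  }
  ```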
* hexagon: replace scalar fp32 -> fp16 copy with HVX
* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA
  - Implements double-buffered DMA prefetching for the K, V, and mask tensors (sketched after this list).
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
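  A minimal sketch of the double-buffered prefetch pattern described above, assuming hypothetical `dma_start`/`dma_wait` wrappers over the backend's DMA queue; here one "block" is `FLASH_ATTN_BLOCK_SIZE` rows, each padded to 128 bytes, and V and the mask follow the same pattern as K:

  ```c
  #include <stddef.h>
  #include <stdint.h>

  // Hypothetical wrappers: start an async DDR -> VTCM copy, and block until a
  // previously started copy into the given VTCM buffer has landed.
  void dma_start(void * vtcm_dst, const void * ddr_src, size_t nbytes);
  void dma_wait (void * vtcm_dst);

  // Double-buffered streaming of K blocks into two VTCM slots (ping/pong).
  static void stream_k_blocks(const uint8_t * k_ddr, size_t block_bytes, size_t n_blocks,
                              uint8_t * vtcm_buf[2]) {
      // Prime the pipeline with block 0.
      if (n_blocks > 0) {
          dma_start(vtcm_buf[0], k_ddr, block_bytes);
      }

      for (size_t b = 0; b < n_blocks; b++) {
          const size_t cur = b & 1, nxt = cur ^ 1;

          // Kick off the next block's transfer before computing on the current one.
          if (b + 1 < n_blocks) {
              dma_start(vtcm_buf[nxt], k_ddr + (b + 1) * block_bytes, block_bytes);
          }

          // Synchronization point: block b must be fully in VTCM before the HVX
          // dot products read it, otherwise compute races the DMA engine.
          dma_wait(vtcm_buf[cur]);

          // ... aligned HVX dot products / softmax update over vtcm_buf[cur] ...
      }
  }
  ```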
* hexagon: use aligned mad_f16
* hexagon: flash_attn: use more aligned ops
* hexagon: optimize scale_f32 HVX helpers
* hexagon: unroll flash-attention loops
* hexagon: remove unused set-rows log
* hexagon: flash_attn_ext add support for DMAing Q
  - Update `op_flash_attn_ext` to include the Q row size in the scratchpad allocation.
  - Pad the Q row size to 128 bytes for alignment (see the sketch after this list).
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.
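  A sketch of the padding and scratchpad sizing described above, under the assumption that each per-thread scratchpad slice simply concatenates the padded Q row, the double-buffered K/V blocks, and the mask block (the helper names and layout are illustrative, not the actual allocation code):

  ```c
  #include <stddef.h>

  // Round a row size in bytes up to the 128-byte HVX vector width so rows stay
  // aligned once they are placed in VTCM.
  static inline size_t pad_row_128(size_t nbytes) {
      return (nbytes + 127u) & ~(size_t) 127u;
  }

  // Hypothetical per-thread scratchpad sizing: the padded Q row lives next to
  // the K/V ping-pong buffers and the mask block. flash_attn_ext_f16_thread
  // then DMAs the Q row into its slot (as in the DMA sketch above) and points
  // the dot products at the VTCM copy instead of the DDR original.
  static size_t flash_attn_spad_bytes(size_t q_row_bytes,
                                      size_t k_block_bytes, size_t v_block_bytes,
                                      size_t mask_block_bytes) {
      return pad_row_128(q_row_bytes)             // one Q row per iteration
           + 2 * (k_block_bytes + v_block_bytes)  // double-buffered K/V blocks
           + mask_block_bytes;
  }
  ```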
* hexagon: fix handling of NaNs in HVX dot products
* hexagon: clean up spad allocation in flash-attn
* hexagon: improve fp16/fp32 matmul
- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
  - Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM.
  - Updated `op_matmul` to use the optimized path when VTCM capacity allows and the broadcasting requirements are compatible (dispatch sketched after this list).
  - Implemented a fallback to the original implementation for complex broadcasting scenarios.
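  A sketch of that dispatch, with hypothetical names for the capacity check, the broadcast test, and the kernels; the real `op_matmul` checks are more involved, and only `vec_dot_f16_f16`, `vec_dot_f16_f16_rx2`, and `quantize_fp32_f16` are names taken from the change itself:

  ```c
  #include <stdbool.h>
  #include <stddef.h>

  // Hypothetical helpers standing in for the real backend routines.
  bool broadcast_is_compatible(void);                                // simple broadcast pattern?
  void quantize_fp32_f16(const float * w, void * vtcm, size_t n);    // DDR fp32 -> VTCM fp16
  void matmul_f16_vtcm(const void * w_vtcm, const float * x, float * y);  // vec_dot_f16_f16 / _rx2
  void matmul_fallback(const float * w, const float * x, float * y);      // original path

  // Take the optimized path only when the converted weights fit in VTCM and the
  // broadcasting pattern is one the fast kernels can handle.
  static void op_matmul_dispatch(const float * w, size_t n_weights,
                                 void * vtcm, size_t vtcm_capacity,
                                 const float * x, float * y) {
      const size_t w_f16_bytes = n_weights * 2; // fp16 footprint after conversion
      if (w_f16_bytes <= vtcm_capacity && broadcast_is_compatible()) {
          quantize_fp32_f16(w, vtcm, n_weights); // copy/convert weights into VTCM
          matmul_f16_vtcm(vtcm, x, y);
      } else {
          matmul_fallback(w, x, y);              // complex broadcasting scenarios
      }
  }
  ```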
* hexagon: fix HVX_ARCH check
* hexagon: matmul cleanup and fp16 fixes
Use the aligned `vec_dot_f16` for 2D matmuls and the unaligned version for 4D.
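A sketch of that split, assuming 2D matmuls stage their rows 128-byte aligned and padded while 4D (batched/broadcast) cases may hand over arbitrary row pointers; the variant names and the rank-based selection shown here are illustrative:

```c
#include <stddef.h>

typedef __fp16 f16_t; // assumes __fp16 support (e.g. hexagon-clang)

// Hypothetical aligned/unaligned fp16 dot-product variants.
float vec_dot_f16_aligned  (const f16_t * x, const f16_t * y, size_t n);
float vec_dot_f16_unaligned(const f16_t * x, const f16_t * y, size_t n);

// 2D matmuls can rely on 128-byte-aligned, padded rows and use the aligned
// variant; 4D matmuls may not, so they take the unaligned one.
static float vec_dot_f16_select(int n_dims, const f16_t * x, const f16_t * y, size_t n) {
    return n_dims == 2 ? vec_dot_f16_aligned(x, y, n)
                       : vec_dot_f16_unaligned(x, y, n);
}
```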
* hexagon: fix fp16 x fp16 matmuls and some minor refactoring
* hexagon: add support for GET_ROWS f32 -> f32
Also optimize SET_ROWS threading a bit when we have just a few rows to process.
* hexagon: optimize set-rows threading
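  A sketch of the threading tweak, assuming the work splitter caps the number of workers at the number of rows so small SET_ROWS calls don't fan out across threads that would find an empty range (names are illustrative):

  ```c
  #include <stddef.h>

  // Don't spin up more workers than there are rows to process.
  static size_t set_rows_thread_count(size_t n_rows, size_t n_threads_max) {
      if (n_rows == 0) return 1;
      return n_rows < n_threads_max ? n_rows : n_threads_max;
  }

  // Per-thread row range [first, last) for worker `ith` out of `nth`.
  static void set_rows_thread_range(size_t n_rows, size_t ith, size_t nth,
                                    size_t * first, size_t * last) {
      const size_t per_thread = (n_rows + nth - 1) / nth;
      *first = ith * per_thread < n_rows ? ith * per_thread : n_rows;
      *last  = *first + per_thread < n_rows ? *first + per_thread : n_rows;
  }
  ```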
* hexagon: update adb/run-bench.sh to properly support experimental and verbose options
* hexagon: flash_attn: use aligned vectors for dot products