llama.cpp

History

hipudding 3707b58628 CANN: add GATED_DELTA_NET op support Implement GATED_DELTA_NET for the CANN (Ascend NPU) backend using a batched approach that groups all attention heads into a single 3-D BatchMatMul per recurrence step, reducing kernel launches from O(n_seqs × H × n_tokens) to O(n_seqs × n_tokens). Key design decisions: - Use aclnnBatchMatMul (rank-3 only) with shape [H, S_v, S_v] to batch all H heads together for M×k, outer-product, and M×q steps - Pre-allocate temporary buffers (g_exp, mk, delta, outer) reused across all time steps to avoid per-step allocations - Support both scalar gate (g shape [1,H]) and KDA per-dim gate (g shape [S_v,H]) via appropriate broadcast shapes - Fall back to naive per-head scalar loop for permuted/GQA/non-F32 inputs that don't meet batched path requirements - Relax CANN precision tolerance to 1e-6 in tests to account for different FP32 accumulation order in BatchMatMul vs scalar loops		2026-03-28 06:47:56 +00:00
..
cmake	ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094 )	2025-08-07 13:45:41 +02:00
include	llama: fix llama-model-saver (#20503 )	2026-03-25 12:53:16 +02:00
src	CANN: add GATED_DELTA_NET op support	2026-03-28 06:47:56 +00:00
.gitignore	vulkan : cmake integration (#8119 )	2024-07-13 18:12:39 +02:00
CMakeLists.txt	ggml : bump version to 0.9.8 (ggml/1442)	2026-03-18 15:17:28 +02:00