llama.cpp/ggml/src
Georgi Gerganov e30f1fdf74
graph : remove redundant GDN state transposes (#20443)
* ggml : transpose fused GDN state access for coalesced memory reads (#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA:  curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU:   restructured loops for row-wise transposed access

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
  dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
  path lacks device support, disable both to prevent state layout mismatch
  between transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* llama : rever fgdn argument changes

* graph : remove GDN state transposes

* vulkan : adapt

* cuda : remove obsolete smem code

---------

Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
2026-03-13 22:12:54 +02:00
..
ggml-blas ggml: update comments for backends which have no memory to report (#20157) 2026-03-06 23:24:38 +08:00
ggml-cann CANN: Remove unnecessary wrapper for `gml_backend_buft_is_cann` (#18968) 2026-02-10 14:19:30 +08:00
ggml-cpu graph : remove redundant GDN state transposes (#20443) 2026-03-13 22:12:54 +02:00
ggml-cuda graph : remove redundant GDN state transposes (#20443) 2026-03-13 22:12:54 +02:00
ggml-hexagon hexagon: add f32 ssm_conv op (#20122) 2026-03-06 09:59:26 -08:00
ggml-hip hip: compile debug builds with -O2 on hip to avoid a compiler bug (#20392) 2026-03-12 10:37:10 +08:00
ggml-metal graph : remove redundant GDN state transposes (#20443) 2026-03-13 22:12:54 +02:00
ggml-musa CUDA: faster tile FA, add oob checks, more HSs (#16492) 2025-10-11 20:54:32 +02:00
ggml-opencl opencl: use larger workgroup size for get_rows (#20316) 2026-03-11 22:03:27 -07:00
ggml-rpc rpc : use unordered_map::reserve and emplace (#18513) 2026-01-02 12:09:36 +02:00
ggml-sycl fix op rope, add rope_back (#20293) 2026-03-11 09:53:34 +08:00
ggml-virtgpu ggml-virtgpu: improve the reliability of the code (#19846) 2026-02-26 20:00:57 +08:00
ggml-vulkan graph : remove redundant GDN state transposes (#20443) 2026-03-13 22:12:54 +02:00
ggml-webgpu ggml-webgpu: Add supports for `GGML_OP_REPEAT` (#20230) 2026-03-11 14:40:36 -07:00
ggml-zdnn ggml-zdnn : mark zDNN buffers as non-host (#18967) 2026-01-22 01:16:21 +01:00
ggml-zendnn ggml-zendnn: update code for latest ZenDNN API (#19923) 2026-02-27 08:43:41 +08:00
CMakeLists.txt hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150) 2026-01-29 12:33:21 -08:00
ggml-alloc.c ggml : make `ggml_is_view` as API (#19539) 2026-02-16 17:43:34 +02:00
ggml-backend-dl.cpp hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150) 2026-01-29 12:33:21 -08:00
ggml-backend-dl.h hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150) 2026-01-29 12:33:21 -08:00
ggml-backend-impl.h llama: use host memory if device reports 0 memory (#18587) 2026-01-09 05:34:56 +08:00
ggml-backend-reg.cpp ggml : use noexcept overload for is_regular_file in backend registration (#19452) 2026-02-10 10:57:48 +01:00
ggml-backend.cpp llama : disable graph reuse with pipeline parallelism (#20463) 2026-03-12 21:04:13 +02:00
ggml-common.h ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
ggml-impl.h ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
ggml-opt.cpp finetune: SGD optimizer, more CLI args (#13873) 2025-08-14 12:03:57 +02:00
ggml-quants.c ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
ggml-quants.h ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
ggml-threading.cpp ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-threading.h remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797) 2024-12-12 19:02:49 +01:00
ggml.c ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
ggml.cpp ggml : Print backtrace on uncaught C++ exceptions (ggml/1232) 2025-06-01 13:43:57 +03:00
gguf.cpp gguf : avoid too many file size calls (#19919) 2026-02-26 12:46:32 +02:00