llama.cpp/ggml/src
Zheyuan Chen a1cfb64530
ggml-webgpu: add vectorized flash attention (#20709)
* naive vectorized version

* add vectorized flash attention

* update vec version

* remove unused path and shader

* remove unused helper functions

* add comments

* remove pad path

* ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization

* change back to vec4

* enable multi split

* enable vec path when:
- Q->ne[1] < 20
- Q->ne[0] % 32 == 0
- V->ne[0] % 4 == 0
- K->type == f16

* update flast_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select

* enable vec path for q4 and q8

* flash-attn vec nwg=1 fast path (skip tmp/reduce staging)

* use packed f16 K loads in flash-attn vec split

* use packed f16 K loads in flash-attn vec split on host side

* tune flash-attn vec f16 VEC_NE by head dim

* cleanup

* cleanup

* keep host side clean

* cleanup host side

* change back to original host wait/submit behavior

* formatting

* reverted param-buffer pool r ecfactor

* add helper functions

* ggml-webgpu: move flash-attn vec pipeline caching back into shader lib

* ggml-webgpu: remove duplicate functions

* ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation

* ggml-webgpu: revert unrelated change

* ggml-webgpu: revert deleted comment

* disable uniformity check

* remove unnecessary change

* Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl

* Update ggml/src/ggml-webgpu/ggml-webgpu.cpp

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-04-02 10:40:42 -07:00
..
ggml-blas ggml-blas: set mkl threads from thread context (#20602) 2026-03-18 01:16:49 +08:00
ggml-cann CANN: fix multi-thread set_tensor race conditions (#20151) 2026-03-31 17:00:51 +03:00
ggml-cpu ggml : fix RWKV ops thread assignment (#21226) 2026-04-01 11:10:25 +03:00
ggml-cuda CUDA: fix FA kernel selection logic (#21271) 2026-04-01 22:28:19 +03:00
ggml-hexagon hexagon : add cumsum op support (#21246) 2026-04-01 17:44:02 -07:00
ggml-hip ggml-cuda: native bf16 flash attention for vec kernel (#20525) 2026-03-22 11:05:51 +01:00
ggml-metal metal : Fix dimension constraint violation in matmul2d descriptor (#21048) 2026-03-27 09:05:21 +02:00
ggml-musa ggml-cuda: native bf16 flash attention for vec kernel (#20525) 2026-03-22 11:05:51 +01:00
ggml-opencl opencl: fix leak in Adreno q8_0 path (#21212) 2026-04-01 12:54:58 -07:00
ggml-openvino fix(openvino): explicit memset in buffer_context allocation (#20857) 2026-03-23 08:05:37 +02:00
ggml-rpc rpc : fix misleading error log (#21184) 2026-03-30 17:05:11 +03:00
ggml-sycl sycl : fix llama_kv_cache hang when kv_cache is huge: 5GB (#21283) 2026-04-02 10:08:32 +03:00
ggml-virtgpu ggml-virtgpu: improve the reliability of the code (#19846) 2026-02-26 20:00:57 +08:00
ggml-vulkan vulkan: add noncontiguous GLU support (#21081) 2026-03-28 08:44:56 +01:00
ggml-webgpu ggml-webgpu: add vectorized flash attention (#20709) 2026-04-02 10:40:42 -07:00
ggml-zdnn ggml-zdnn : mark zDNN buffers as non-host (#18967) 2026-01-22 01:16:21 +01:00
ggml-zendnn ggml-zendnn: update code for latest ZenDNN API (#19923) 2026-02-27 08:43:41 +08:00
CMakeLists.txt ggml : add OpenVINO backend (#15307) 2026-03-14 07:56:55 +02:00
ggml-alloc.c ggml : make `ggml_is_view` as API (#19539) 2026-02-16 17:43:34 +02:00
ggml-backend-dl.cpp hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150) 2026-01-29 12:33:21 -08:00
ggml-backend-dl.h hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150) 2026-01-29 12:33:21 -08:00
ggml-backend-impl.h llama: use host memory if device reports 0 memory (#18587) 2026-01-09 05:34:56 +08:00
ggml-backend-reg.cpp ggml : add OpenVINO backend (#15307) 2026-03-14 07:56:55 +02:00
ggml-backend.cpp llama : disable graph reuse with pipeline parallelism (#20463) 2026-03-12 21:04:13 +02:00
ggml-common.h ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
ggml-impl.h llama: fix llama-model-saver (#20503) 2026-03-25 12:53:16 +02:00
ggml-opt.cpp
ggml-quants.c ggml : guard against sumq2 being 0 in IQ4_NL (#20460) 2026-03-15 10:47:28 +02:00
ggml-quants.h ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
ggml-threading.cpp
ggml-threading.h
ggml.c mtmd: Add DeepSeekOCR Support (#17400) 2026-03-25 19:57:40 +01:00
ggml.cpp
gguf.cpp llama: fix llama-model-saver (#20503) 2026-03-25 12:53:16 +02:00