llama.cpp

Gabe Goodhart 8d5a25d356	2025-07-17 11:19:30 -06:00

perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on prefill and 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

cmake	ggml-cpu : rework weak alias on apple targets (#14146)	2025-06-16 13:54:15 +08:00
include	ggml: Add initial WebGPU backend (#14521)	2025-07-16 18:18:51 +03:00
src	perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state	2025-07-17 11:19:30 -06:00
.gitignore	vulkan : cmake integration (#8119)	2024-07-13 18:12:39 +02:00
CMakeLists.txt	ggml: Add initial WebGPU backend (#14521)	2025-07-16 18:18:51 +03:00