llama.cpp


Jeff Bolz 2bbe4c2cf8 vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)		2026-01-12 12:32:13 +01:00

This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32 (for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like 2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle variants. So far this change just adds a single use case for this, compiling with the e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort to avoid enabling it unconditionally.
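
The size limits in the commit message above can be checked with a little arithmetic. The sketch below is not code from the Vulkan backend: the helper names are hypothetical, it assumes ggml's Q8_0 layout (32 weights per block, stored as one fp16 scale plus 32 int8 values, so 34 bytes), and it assumes a device that reports maxStorageBufferRange as 2^32 - 1, the largest value the 32-bit limit can hold. Real devices may report less.

```cpp
#include <cstdint>

// Assumed Q8_0 layout: 32 weights per block, 2-byte fp16 scale + 32 int8 values.
constexpr uint64_t QK8_0            = 32;
constexpr uint64_t Q8_0_BLOCK_BYTES = 2 + 32;

// Hypothetical helper: number of Q8_0 blocks in a tensor with the given dimensions.
constexpr uint64_t q8_0_num_blocks(uint64_t ne0, uint64_t ne1, uint64_t ne2) {
    return ne0 * ne1 * ne2 / QK8_0;
}

// Hypothetical helper: total size of that tensor in bytes.
constexpr uint64_t q8_0_num_bytes(uint64_t ne0, uint64_t ne1, uint64_t ne2) {
    return q8_0_num_blocks(ne0, ne1, ne2) * Q8_0_BLOCK_BYTES;
}

// Hypothetical helper mirroring the rule in the commit message: pick the
// 64-bit indexing variant only when the A matrix exceeds maxStorageBufferRange.
constexpr bool needs_64bit_indexing(uint64_t a_bytes, uint32_t max_storage_buffer_range) {
    return a_bytes > max_storage_buffer_range;
}

// Assumed device limit for this example (2^32 - 1).
constexpr uint32_t ASSUMED_MAX_STORAGE_BUFFER_RANGE = UINT32_MAX;

// The A matrix from the commit message: Q8_0, 8192 x 5120 x 128.
constexpr uint64_t A_BLOCKS = q8_0_num_blocks(8192, 5120, 128);  // 167,772,160 blocks
constexpr uint64_t A_BYTES  = q8_0_num_bytes(8192, 5120, 128);   // 5,704,253,440 bytes
```

Under these assumptions the matrix is about 5.7 GB, which is past the 32-bit byte range, while its block count (about 1.68e8) is far below 2^32. That is consistent with the commit's approach of computing block indices in 32 bits and reserving the 64-bit variant for matrices that exceed maxStorageBufferRange.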
..
cmake	ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094)	2025-08-07 13:45:41 +02:00
include	ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (#18628)	2026-01-08 08:36:42 -08:00
src	vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)	2026-01-12 12:32:13 +01:00
.gitignore	vulkan : cmake integration (#8119)	2024-07-13 18:12:39 +02:00
CMakeLists.txt	ggml : bump version to 0.9.5 (ggml/1410)	2025-12-31 18:54:43 +02:00