llama.cpp

Commit Graph

Author	SHA1	Message	Date
Ruben Ortlam	635ef78ec5	vulkan: work around Intel fp16 bug in mmq (#18814 )	2026-01-14 09:41:23 +01:00
Jeff Bolz	2bbe4c2cf8	vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678 ) This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128. This should work when the number of blocks in the A matrix is less than 2^32 (for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like 2^32*LOAD_VEC_A elements. - Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b. - Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle variants. So far this change just adds a single use case for this, compiling with the e64BitIndexingEXT flag. - Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange. 64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort to avoid enabling it unconditionally.	2026-01-12 12:32:13 +01:00
Jeff Bolz	cb14b06995	vulkan: optimize ssm_scan (#18630 ) * vulkan: optimize ssm_scan * fix warp vs subgroup naming	2026-01-08 15:16:54 +01:00
Eve	8c77a04cc7	vulkan: more mul mat optimizations (#18533 ) * q4_k * q5_k * q2_k * q4_1 * q5_1 * better buf index	2026-01-07 11:13:17 +01:00
Jeff Bolz	f1768d8f03	vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (#18582 )	2026-01-05 11:51:39 +01:00
Jeff Bolz	b37124d2d2	vulkan: handle quantize_q8_1 overflowing the max workgroup count (#18515 ) * vulkan: handle quantize_q8_1 overflowing the max workgroup count * vulkan: Fix small tile size matmul on lavapipe * fix mul_mat_id failures	2026-01-05 11:30:14 +01:00
Jeff Bolz	18ddaea2ae	vulkan: Optimize GGML_OP_CUMSUM (#18417 ) * vulkan: Optimize GGML_OP_CUMSUM There are two paths: The preexisting one that does a whole row per workgroup in a single shader, and one that splits each row into multiple blocks and does two passes. The first pass computes partials within a block, the second adds the block partials to compute the final result. The multipass shader is used when there are a small number of large rows. In the whole-row shader, handle multiple elements per invocation. * use 2 ELEM_PER_THREAD for AMD/Intel * address feedback	2026-01-02 15:32:30 -06:00
Jeff Bolz	706e3f93a6	vulkan: Implement mmvq for iq1_s/iq1_m (#18450 )	2026-01-02 20:19:04 +01:00
Jeff Bolz	be47fb9285	vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295 ) * vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron Also handle GGML_OP_SCALE at the end (nemotron, deepseek2). Fewer pipeline variants and spec constants, just use push constants. In test_topk_moe, change exp_probs_b to be 1D, matching real networks. Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified. * change test_topk_moe to allow results in arbitrary order * disable sigmoid fusion for moltenvk	2026-01-01 08:58:27 +01:00
Jeff Bolz	c9ced4910b	vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352 ) Run a preprocess to count how many times each expert is used, and use this to quickly discard workgroups that aren't needed.	2025-12-26 16:12:58 -06:00
Jeff Bolz	7ac8902133	vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349 ) * vulkan: Use BK=32 for coopmat2 mul_mat_id * vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader Disable robustness, remove the OOB check in decodeFuncB, and initialize the row_ids to zero to avoid OOB access. Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of zero and remove the '& (BN - 1)'. This allows the compiler to common some of the shared memory loads.	2025-12-26 18:15:50 +01:00
Eve	cb999704fb	vulkan: small dequantization improvements (#18380 ) * iq4_xs * quants	2025-12-26 18:12:11 +01:00
Jeff Bolz	b96b82fc85	vulkan: Support UPSCALE w/antialias (#18327 )	2025-12-26 17:00:57 +01:00
Jeff Bolz	10dc500bdb	vulkan: handle rope with large number of rows (#18306 )	2025-12-26 16:53:46 +01:00
Jeff Bolz	e3b35ddf1c	vulkan: Extend rope fusions to allow mrope (#18264 ) Extend the test-backend-ops tests as well.	2025-12-22 11:03:13 -06:00
Jeff Bolz	fd05c51cec	vulkan: fix im2col overflowing maxworkgroupcount (#18180 )	2025-12-21 10:32:58 +01:00
Jeff Bolz	cb64222b0c	vulkan: support GGML_UNARY_OP_XIELU (#18062 )	2025-12-21 10:17:58 +01:00
lovedheart	4117ae5557	Vulkan: some improvement on mul_mat_iq2_xs (#18031 ) * Some improvement on mul_mat_iq2_xs Refactor calculations for db values and grid data to optimize performance and reduce redundancy. * Fix trailing whitespace	2025-12-21 09:59:52 +01:00
Ruben Ortlam	9e6649ecf2	vulkan: fix mul_mat_vec_iq1_s formatting (#18026 )	2025-12-14 14:52:46 +01:00
Jeff Bolz	3238b1400c	vulkan: Fix data race/hang in scalar/cm1 flash attention (#17887 )	2025-12-14 09:00:00 +01:00
lovedheart	4722671641	vulkan: improve mul_mat_vec_iq1_s speed (#17874 )	2025-12-14 08:47:49 +01:00
Eve	d15d177f43	vulkan: faster q6_k matmul (#17813 ) * q6_k faster mul mat * 8 values * fix comment * switch to two at a time * start ci for .glsl files	2025-12-14 08:29:37 +01:00
Jeff Bolz	36255a2268	vulkan: support get_rows for i32 (#17941 )	2025-12-13 10:12:53 +01:00
Jeff Bolz	3229a23fa6	vulkan: support GGML_OP_DIAG (#17893 )	2025-12-13 10:07:49 +01:00
Jeff Bolz	303f8615e9	vulkan: Multi-pass softmax for large number of cols (#17892 ) When the number of cols is large, split each row across multiple workgroups. There are three phases that communicate partial results through temp buffers: (1) compute max partials (2) take max of partials, compute sum(exp(x-max)) partials (3) sum partials, compute scaled result	2025-12-13 10:04:29 +01:00
Jeff Bolz	07a10c1090	vulkan: Allow non-pow2 n_experts in topk_moe (#17872 )	2025-12-13 08:40:04 +01:00
lovedheart	08f9d3cc1d	Vulkan: improve mul_mat_vec_iq1_m (#16907 ) * Optimize Vulkan shader for matrix-vector multiplication * Revert changes on compute_outputs and main Refactor compute_outputs to handle remaining rows correctly. * Fix trailing whitespace	2025-12-07 18:40:42 +01:00
Phylliida Dev	09c7c50e64	ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (#16985 ) * Feat: Added vulkan circular tiling support * Feat: Added cpu circular * Feat: Added cuda kernels * Added tests * Added tests * Removed non-pad operations * Removed unneded changes * removed backend non pad tests * Update test-backend-ops.cpp * Fixed comment on pad test * removed trailing whitespace * Removed unneded test in test-backend-ops * Removed removed test from calls * Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp Co-authored-by: Ruben Ortlam <picard12@live.de> * Fixed alignment * Formatting Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Format pad * Format * Clang format * format * format * don't change so much stuff * clang format and update to bool * fix duplicates * don't need to fix the padding * make circular bool * duplicate again * rename vulkan to wrap around * Don't need indent * moved to const expr * removed unneded extra line break * More readable method calls * Minor wording changes * Added final newline * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Added circular pad ext tests * Gate non circular pad devices * Cleaned gating of non-circular pad devices --------- Co-authored-by: Phylliida <phylliidadev@gmail.com> Co-authored-by: Ruben Ortlam <picard12@live.de> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-06 15:07:02 +01:00
Jeff Bolz	c6c5e85979	vulkan: support solve_tri with larger N/K values (#17781 ) Split N into chunks to fit into shared memory. If K > 128, use a larger workgroup with enough invocations. Add perf tests matching qwen3next.	2025-12-06 08:56:45 +01:00
Masato Nakasaka	d8c0a7b085	vulkan: Fix mismatch in TOPK_MOE unit test (#17541 ) * Fix shader to support 2D workgroup mapping to a single subgroup * Set required_subgroup_size topk_moe shader requires static WARP_SIZE and actual subgroup size to match	2025-12-06 06:23:30 +01:00
Jeff Bolz	933414c0b6	vulkan: add more num_blocks instantiations in rms_norm (#17701 )	2025-12-05 22:08:56 +01:00
Jeff Bolz	a0f3897d53	vulkan: fix top_k bug when there are ties in the input (#17659 ) * vulkan: Reduce temporary memory usage for TOP_K - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB. * vulkan: fix top_k bug when there are ties in the input I noticed by inspection a bug in the vulkan top_k shader where if the least value in the top_k appears multiple times we could end up writing those extra copies out rather than some larger values (if the larger values are on higher numbered threads). I rewrote the test verification to handle this case, where the final index set is not necessarily the same. * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-05 22:03:19 +01:00
Acly	e15cd06a94	vulkan : support conv-2d with large output size (#17685 )	2025-12-05 21:46:39 +01:00
Jeff Bolz	61bde8e21f	vulkan: Reduce temporary memory usage for TOP_K (#17623 ) - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB.	2025-12-02 19:22:04 +01:00
Acly	385c3da5e6	vulkan : fix FA mask load with bounds check (coopmat2) (#17606 )	2025-11-30 01:03:21 +01:00
Ruben Ortlam	47a268ea50	Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (#16900 ) * vulkan: split mul_mmq_funcs for mul_mat_vecq use * add mxfp4 mmvq * add q2_k mmvq * add q3_k mmvq * add q4_k and q5_k mmvq * add q6_k mmvq * handle 4x4 quants per mmvq thread * enable MUL_MAT_ID mmvq support * enable subgroup optimizations for mul_mat_vec_id shaders * device tuning * request prealloc_y sync after quantization * fix indentation * fix llvmpipe test failures * fix mul_mat_id mmvq condition * fix unused variable warning	2025-11-29 09:37:22 +01:00
Jeff Bolz	35cf8887e1	vulkan: Implement GGML_OP_TRI (#17503 ) * vulkan: Implement GGML_OP_TRI * check types match	2025-11-28 10:07:29 +01:00
Jeff Bolz	4abef75f2c	vulkan: Implement SOLVE_TRI (#17486 ) * vulkan: Implement SOLVE_TRI * load B matrix through shared memory * use FLOAT_TYPE	2025-11-27 15:48:00 +01:00
Jeff Bolz	879d673759	vulkan: Implement top-k (#17418 ) * vulkan: Implement top-k Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10) and discards all but the top K. Repeat until only K are left. And there's a fast path when K==1 to just find the max value rather than sorting. * fix pipeline selection * vulkan: Add N-ary search algorithm for topk * microoptimizations	2025-11-26 16:45:43 +01:00
Jeff Bolz	b3b03a7baf	vulkan: Implement GGML_OP_CUMSUM (#17479 )	2025-11-26 07:08:10 +01:00
Giuseppe Scrivano	7d77f07325	vulkan: implement ADD1, ARANGE, FILL, SOFTPLUS, STEP, ROUND, CEIL, FLOOR, TRUNC (#17319 ) * vulkan: initialize array * vulkan: implement ADD1 * vulkan: implement ARANGE * vulkan: implement FILL * vulkan: implement SOFTPLUS * vulkan: implement STEP * vulkan: implement ROUND * vulkan: implement CEIL * vulkan: implement FLOOR * vulkan: implement TRUNC * docs: update Vulkan ops Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-19 17:29:45 +01:00
Jeff Bolz	1fa4551af0	vulkan: support larger argsort (#17313 ) * vulkan: support larger argsort This is an extension of the original bitonic sorting shader that puts the temporary values in global memory and when more than 1024 threads are needed it runs multiple workgroups and synchronizes through a pipelinebarrier. To improve the memory access pattern, a copy of the float value is kept with the index value. I've applied this same change to the original shared memory version of the shader, which is still used when ncols <= 1024. * Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost * reduce loop overhead * run multiple cols per invocation, to reduce barrier overhead	2025-11-19 17:25:50 +01:00
Jeff Bolz	2eba631b81	vulkan: Add copy_transpose shader (#17371 )	2025-11-19 16:50:43 +01:00
Ruben Ortlam	38e2c1b412	vulkan: add log RTE support to fix Nvidia CI (#17320 ) * vulkan: add log RTE support to fix Nvidia CI * actually use the rte shader	2025-11-17 14:37:49 -06:00
Pavels Zaicenkovs	dbed61294a	vulkan: add LOG operation support for F32 and F16 (#17183 ) * vulkan: add LOG operation support for F32 and F16 Part of #14909. * vulkan: Fix LOG operation types * docs: Update operation support documentation for Vulkan LOG operation * vulkan: fix log_f16 shader * docs: restore missing LOG test cases and regenerate ops.md	2025-11-16 22:50:09 +01:00
Jeff Bolz	24dc769f1b	vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (#17287 ) These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.	2025-11-15 19:54:23 +01:00
Ruben Ortlam	4dca015b7e	vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (#17285 )	2025-11-15 15:18:58 +01:00
Giuseppe Scrivano	1568d13c2c	vulkan: implement ABS and NEG (#17245 ) * docs: update Vulkan ops * vulkan: add NEG op * vulkan: add ABS op --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-15 12:00:29 +01:00
Jeff Bolz	234ae7d7bd	vulkan: skip all-negative-inf blocks in FA (#17186 )	2025-11-15 10:37:25 +01:00
Ruben Ortlam	a19bd6f7ce	vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (#17219 ) * vulkan: remove shell call from vulkan-shaders-gen tool * use string vector for command execution * Fix condition * use string, remove const_cast * Fix dependency file quotation on Windows --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-11-13 14:51:21 +01:00
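
The message for commit 303f8615e9 above describes a three-phase multi-pass softmax that passes partial results between phases. The following is a minimal CPU-side sketch of that idea in Python, not the Vulkan shader itself. The function name `multipass_softmax` and the `chunk` parameter are hypothetical; each chunk stands in for the slice of a row handled by one workgroup, and the row is assumed to be non-empty.

```python
import math

def multipass_softmax(row, chunk):
    """Softmax of one non-empty row, computed in three phases over chunks.

    Illustrative only: `chunk` is a hypothetical stand-in for the part of
    the row that a single workgroup would process.
    """
    chunks = [row[i:i + chunk] for i in range(0, len(row), chunk)]

    # Phase 1: each chunk computes its own max (the "max partials").
    max_partials = [max(c) for c in chunks]

    # Phase 2: take the max of the partials, then each chunk computes
    # its partial sum of exp(x - max).
    row_max = max(max_partials)
    sum_partials = [sum(math.exp(x - row_max) for x in c) for c in chunks]

    # Phase 3: sum the partials and compute the scaled result.
    total = sum(sum_partials)
    return [math.exp(x - row_max) / total for x in row]
```

Subtracting the row max before exponentiating keeps every exponent at or below zero, so the sketch does not overflow on large inputs such as `[1000.0, 1001.0]`.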


247 Commits
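
The message for commit 18ddaea2ae in the table above describes a two-pass cumulative sum: the first pass computes partials within each block, and the second adds the block partials to produce the final result. A minimal Python sketch of that scheme follows. It is an illustration under stated assumptions, not the shader code; the name `two_pass_cumsum` and the `block` parameter are hypothetical.

```python
def two_pass_cumsum(row, block):
    """Inclusive cumulative sum of `row`, computed block by block.

    Illustrative only: `block` is a hypothetical block length, standing in
    for the part of a row that one workgroup would scan.
    """
    blocks = [row[i:i + block] for i in range(0, len(row), block)]

    # Pass 1: inclusive scan inside each block, recording each block's total.
    scanned, totals = [], []
    for b in blocks:
        acc, out = 0, []
        for x in b:
            acc += x
            out.append(acc)
        scanned.append(out)
        totals.append(acc)

    # Pass 2: add the sum of all preceding blocks to every element.
    result, offset = [], 0
    for out, total in zip(scanned, totals):
        result.extend(v + offset for v in out)
        offset += total
    return result
```

For example, `two_pass_cumsum([1, 2, 3, 4, 5], 2)` scans the blocks `[1, 2]`, `[3, 4]`, and `[5]` independently, then shifts them by the offsets 0, 3, and 10.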