llama.cpp

Commit Graph

Author	SHA1	Message	Date
Jeff Bolz	61bde8e21f	vulkan: Reduce temporary memory usage for TOP_K (#17623 ) - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB.	2025-12-02 19:22:04 +01:00
Tarek Dakhran	2ba719519d	model: LFM2-VL fixes (#17577 ) * Adjust to pytorch * Add antialiasing upscale * Increase number of patches to 1024 * Handle default marker insertion for LFM2 * Switch to flag * Reformat * Cuda implementation of antialias kernel * Change placement in ops.cpp * consistent float literals * Pad only for LFM2 * Address PR feedback * Rollback default marker placement changes * Fallback to CPU implementation for antialias implementation of upscale	2025-11-30 21:57:31 +01:00
Ruben Ortlam	47a268ea50	Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (#16900 ) * vulkan: split mul_mmq_funcs for mul_mat_vecq use * add mxfp4 mmvq * add q2_k mmvq * add q3_k mmvq * add q4_k and q5_k mmvq * add q6_k mmvq * handle 4x4 quants per mmvq thread * enable MUL_MAT_ID mmvq support * enable subgroup optimizations for mul_mat_vec_id shaders * device tuning * request prealloc_y sync after quantization * fix indentation * fix llvmpipe test failures * fix mul_mat_id mmvq condition * fix unused variable warning	2025-11-29 09:37:22 +01:00
Jeff Bolz	59d8d4e963	vulkan: improve topk perf for large k, fix overflow in unit tests (#17582 )	2025-11-29 08:39:57 +01:00
Jeff Bolz	35cf8887e1	vulkan: Implement GGML_OP_TRI (#17503 ) * vulkan: Implement GGML_OP_TRI * check types match	2025-11-28 10:07:29 +01:00
Jeff Bolz	4abef75f2c	vulkan: Implement SOLVE_TRI (#17486 ) * vulkan: Implement SOLVE_TRI * load B matrix through shared memory * use FLOAT_TYPE	2025-11-27 15:48:00 +01:00
Acly	b78db3bd50	vulkan : move contiguous checks to device_supports_op (#17490 ) * vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op * im2col: remove contraints on src0 (kernel input)	2025-11-27 06:54:19 +01:00
Jeff Bolz	142df17c9c	vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (#17514 )	2025-11-27 06:32:30 +01:00
Jeff Bolz	eec1e33a9e	vulkan: allow graph_optimize for prompt processing workloads (#17475 )	2025-11-26 16:46:33 +01:00
Jeff Bolz	879d673759	vulkan: Implement top-k (#17418 ) * vulkan: Implement top-k Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10) and discards all but the top K. Repeat until only K are left. And there's a fast path when K==1 to just find the max value rather than sorting. * fix pipeline selection * vulkan: Add N-ary search algorithm for topk * microoptimizations	2025-11-26 16:45:43 +01:00
Jeff Bolz	b3b03a7baf	vulkan: Implement GGML_OP_CUMSUM (#17479 )	2025-11-26 07:08:10 +01:00
Jeff Bolz	d414db02d3	vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (#17455 )	2025-11-25 07:11:27 +01:00
Jeff Bolz	3d07caa99b	vulkan: more FA details in vk_perf_logger (#17443 )	2025-11-24 22:25:24 +01:00
Jeff Bolz	54d83bbe85	vulkan: remove a couple unnecessary switches (#17419 )	2025-11-23 06:29:40 +01:00
Jeff Bolz	f1ffbba68e	vulkan: disable async for older Intel devices (#17369 ) * vulkan: disable async for older Intel devices * update detection logic * use name string for detection	2025-11-21 09:58:17 +01:00
Giuseppe Scrivano	7d77f07325	vulkan: implement ADD1, ARANGE, FILL, SOFTPLUS, STEP, ROUND, CEIL, FLOOR, TRUNC (#17319 ) * vulkan: initialize array * vulkan: implement ADD1 * vulkan: implement ARANGE * vulkan: implement FILL * vulkan: implement SOFTPLUS * vulkan: implement STEP * vulkan: implement ROUND * vulkan: implement CEIL * vulkan: implement FLOOR * vulkan: implement TRUNC * docs: update Vulkan ops Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-19 17:29:45 +01:00
Jeff Bolz	1fa4551af0	vulkan: support larger argsort (#17313 ) * vulkan: support larger argsort This is an extension of the original bitonic sorting shader that puts the temporary values in global memory and when more than 1024 threads are needed it runs multiple workgroups and synchronizes through a pipelinebarrier. To improve the memory access pattern, a copy of the float value is kept with the index value. I've applied this same change to the original shared memory version of the shader, which is still used when ncols <= 1024. * Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost * reduce loop overhead * run multiple cols per invocation, to reduce barrier overhead	2025-11-19 17:25:50 +01:00
Jeff Bolz	2eba631b81	vulkan: Add copy_transpose shader (#17371 )	2025-11-19 16:50:43 +01:00
Ruben Ortlam	980b7cd17e	vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356 )	2025-11-19 08:46:26 +01:00
Jeff Bolz	da95bf2a85	vulkan: support noncontig i32 copy (#17328 )	2025-11-18 07:41:24 +01:00
Ruben Ortlam	38e2c1b412	vulkan: add log RTE support to fix Nvidia CI (#17320 ) * vulkan: add log RTE support to fix Nvidia CI * actually use the rte shader	2025-11-17 14:37:49 -06:00
Pavels Zaicenkovs	dbed61294a	vulkan: add LOG operation support for F32 and F16 (#17183 ) * vulkan: add LOG operation support for F32 and F16 Part of #14909. * vulkan: Fix LOG operation types * docs: Update operation support documentation for Vulkan LOG operation * vulkan: fix log_f16 shader * docs: restore missing LOG test cases and regenerate ops.md	2025-11-16 22:50:09 +01:00
Ruben Ortlam	80deff3648	vulkan: fix MMQ quantize_y condition (#17301 )	2025-11-16 19:38:17 +01:00
Jeff Bolz	24dc769f1b	vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (#17287 ) These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.	2025-11-15 19:54:23 +01:00
Giuseppe Scrivano	1568d13c2c	vulkan: implement ABS and NEG (#17245 ) * docs: update Vulkan ops * vulkan: add NEG op * vulkan: add ABS op --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-15 12:00:29 +01:00
Jeff Bolz	439342ea0b	vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (#17244 ) * vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths * set allow_misalign	2025-11-15 11:56:15 +01:00
Jeff Bolz	234ae7d7bd	vulkan: skip all-negative-inf blocks in FA (#17186 )	2025-11-15 10:37:25 +01:00
Jeff Bolz	38eaf32af1	vulkan: change graph_compute to be async and enable get_tensor_async (#17158 ) * vulkan: change graph_compute to be async and enable get_tensor_async This allows some additional CPU/GPU overlap for large pp workloads. Also seems to help a bit for token gen, maybe getting rid of a small bubble between graph_compute and get_tensor. Async set and copy functions seem to be very rarely used, so I didn't enable them because I didn't have a good way to test them. The async commands need to be ordered against each other, so put them all on the compute queue. The non-async commands still use the transfer queue. The fence for graph_compute/get_tensor_async is submitted and waited on in ggml_vk_synchronize. * fix thread safety errors * teardown context cleanly * Handle async read to non-pinned dst	2025-11-15 09:06:41 +01:00
Eve	7d019cff74	disable rms norm mul rope for chips with no fp16 rte (#17134 )	2025-11-11 12:53:30 -06:00
Ruben Ortlam	85234a4b3a	vulkan: fix validation issue introduced by #16868 (#17145 )	2025-11-10 16:59:10 +01:00
Acly	1032256ec9	cuda/vulkan : bicubic interpolation (#17022 ) * vulkan : implement upscale with bicubic interpolation * cuda : implement upscale with bicubic interpolation * tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests * adapt OpenCL backend to not support the OP in that case so tests don't fail * print scale mode & flags in test-backend-ops	2025-11-10 10:19:39 +01:00
Ruben Ortlam	392e09a608	vulkan: fix memory allocations (#17122 )	2025-11-09 16:14:41 +01:00
Ruben Ortlam	7f3e9d339c	vulkan: iGPU memory reporting fix (#17110 ) * vulkan: use all device-local heaps for memory availability reporting Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com> * use all available heaps for iGPU memory reporting * Allow multiple memory types per buffer request for devices with split heaps --------- Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-09 09:54:47 +01:00
Ruben Ortlam	8a3519b708	vulkan: fix mmq out of bounds reads (#17108 ) * vulkan: fix mmq out of bounds reads, streamline outdated matmul host code * fix mul_mat_id quantization call * Fix compiler warnings	2025-11-09 09:52:57 +01:00
Jeff Bolz	80a6cf6347	vulkan: fuse mul_mat_id + mul (#17095 ) * vulkan: fuse mul_mat_id + mul This comes up in qwen3 moe. * split mul_mat_id fusion tests into a separate class	2025-11-09 09:48:42 +01:00
Jeff Bolz	53d7d21e61	vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978 ) * vulkan: Use spec constants for conv2d s/d/p and kernel W/H Also add some additional unroll hints, which seems to help. * lock around map lookup	2025-11-08 13:24:29 -06:00
Jeff Bolz	b4e335d8dc	vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977 ) This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.	2025-11-08 08:52:15 +01:00
Jeff Bolz	d6fe40fa00	vulkan: Fix test-thread-safety crashes (#17024 ) The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the same time, which needs to hold the lock. To be safe, hold the lock for all of ggml_vk_load_shaders.	2025-11-08 08:39:45 +01:00
Acly	ac76d36201	vulkan : refactor buffer handling in vk_op_f32 (#16840 ) * vulkan : refactor/simplify buffer handling in vk_op_* functions * Combine UMA handling into ggml_vk_tensor_subbuffer	2025-11-07 21:08:50 +01:00
Jeff Bolz	a44d77126c	vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (#16919 )	2025-11-05 19:51:03 +01:00
Jeff Bolz	ad51c0a720	vulkan: remove the need for the dryrun (#16826 ) * vulkan: remove the need for the dryrun Allocate pipelines and descriptor sets when requested. Reallocate the prealloc buffers when needed, and flush any pending work before reallocating. For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed. * remove dryrun parameters	2025-11-04 13:28:17 -06:00
Jeff Bolz	5d8bb900bc	vulkan: Fix multi_add invalid descriptor usage (#16899 )	2025-11-01 06:52:14 +01:00
Jeff Bolz	2e76e01360	vulkan: fuse mul_mat+add and mul_mat_id+add_id (#16868 ) * vulkan: fuse mul_mat+add and mul_mat_id+add_id The fusion is only applied for the mat-vec mul paths. * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix 32b build --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-01 06:45:28 +01:00
Masato Nakasaka	2976b0374d	vulkan: Fix crash when FP16 mul_mat accumulation is not supported (#16796 ) * Experimenting crash fix * added assert for aborting and fixed comment * changed to check if a pipeline is empty or not * Moved function in class definition * replaced with is_empty * Modified is_empty to check only unaligned pipelines	2025-10-31 08:18:59 +01:00
JJJYmmm	d261223d24	model: add support for qwen3vl series (#16780 ) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit `f321b9fdf1`. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-30 16:19:14 +01:00
Jeff Bolz	052df28b0e	vulkan: Handle argsort with a large number of rows (#16851 )	2025-10-30 07:27:41 +01:00
Jeff Bolz	b9ce940177	vulkan: Fuse rope+set_rows (#16769 ) This pattern appears in a lot of models, the rope operation is applied right before storing into the KV cache (usually on the K tensor). Add a path to some of the rope shaders that computes the destination address based on the set_rows tensor. Compile variants of the shader with D_TYPE of f16 (the usual KV cache type). Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs the fourth for the row indices. Add fused_ops_write_mask to indicate which intermediate tensors need to write their results to memory. Skipping writing the roped K value helps to allow more nodes to run concurrently. Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It rarely starts out that way in the graph. Add new backend tests.	2025-10-29 15:13:10 -05:00
Jeff Bolz	10fcc41290	vulkan: Update topk_moe fusion to handle gpt's late softmax (#16656 ) * vulkan: Update topk_moe fusion to handle gpt's late softmax Based on #16649. * Add ggml_check_edges * Add sync logging to show fusion effects * handle clamp added in #16655 * Update ggml/src/ggml-impl.h Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-10-29 14:44:29 +01:00
Ruben Ortlam	bcf5bda6f5	Vulkan MMQ Integer Dot Refactor and K-Quant support (#16536 ) * vulkan: add mmq q2_k integer dot support * Refactor mmq caching * Reduce mmq register use * Load 4 quant blocks into shared memory in one step * Pack q2_k blocks into caches of 32 * Use 32-bit accumulators for integer dot matmul * Add q4_k mmq * Add q3_k mmq * Add q5_k mmq * Add q6_k mmq * Add mxfp4 mmq, enable MMQ MUL_MAT_ID * Fix mmv dm loads	2025-10-29 14:39:03 +01:00
Jeff Bolz	f549b0007d	vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (#16793 ) This lets the copy to the destination device use the host-visible vidmem optimization.	2025-10-29 09:53:04 +01:00

1 2 3 4 5 ...

267 Commits