llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	914dde72ba	ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511 ) * ggml : unary ops support non-cont src0 * metal : support F16 unary ops + fix ELU	2026-02-11 18:58:43 +02:00
Georgi Gerganov	9ab072ebbe	metal : extend l2_norm support for non-cont src0 (#19502 )	2026-02-11 14:53:19 +02:00
Max Krasnyansky	73cd5e1b97	hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (#19406 ) * hexagon: add ARGSORT op Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com> * hexagon: argsort reject tensors with huge rows for now * Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend * hexagon : Add GEGLU op * hexagon: fix editor config check * hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA --------- Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com> Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>	2026-02-10 23:21:12 -08:00
Georgi Gerganov	89181c0b6d	ggml : extend bin bcast for permuted src1 (#19484 ) * tests : extend bin bcast for permuted src1 * cont : extend bin support * cont : s0 is always 1 * tests : simplify	2026-02-11 07:52:00 +02:00
Georgi Gerganov	ceaa89b786	metal : consolidate unary ops (#19490 )	2026-02-11 07:51:12 +02:00
Oliver Simons	612db61886	CUDA : Update CCCL-tag for 3.2 to final release from RC (#19486 ) CCCL 3.2 has been released since it was added to llama.cpp as part of the backend-sampling PR, and it makes sense to update from RC to final released version. https://github.com/NVIDIA/cccl/releases/tag/v3.2.0	2026-02-10 22:31:19 +01:00
Nikhil Jain	57487a64c8	[WebGPU] Plug memory leaks and free resources on shutdown (#19315 ) * Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool * Free pools * Cleanup * More cleanup * Run clang-format * Fix arg-parser and tokenizer test errors that free an unallocated buffer * Fix device lost callback to not print on device teardown * Fix include and run clang-format * remove unused unused * Update binary ops --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-02-10 08:04:00 -08:00
Alberto Cabrera Pérez	c03a5a46f0	ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (#19360 ) * First working version of GEMM and GEMV * interleave loads and compute * Clang-format * Added missing fallback. Removed tested TODO. * Swap M and N to be consistent with the repack template convention	2026-02-10 10:47:45 +00:00
k4ss4n	6948adc90d	ggml : use noexcept overload for is_regular_file in backend registration (#19452 ) using noexcept std::filesystem::directory_entry::is_regular_file overload prevents abnormal termination upon throwing an error (as caused by symlinks to non-existent folders on linux) Resolves: #18560	2026-02-10 10:57:48 +01:00
Raul Torres	f0bfe54f55	CANN: Remove unnecessary wrapper for `ggml_backend_buft_is_cann` (#18968 )	2026-02-10 14:19:30 +08:00
hipudding	52e38faf8c	CANN: implement quantized MUL_MAT_ID for MoE models (#19228 ) Implement ggml_cann_mul_mat_id_quant function to support quantized matrix multiplication for Mixture of Experts (MoE) architectures on CANN backend. Key features: - Support Q4_0 and Q8_0 quantized weight formats - Use IndexSelect to dynamically route expert-specific weights based on indices - Leverage WeightQuantBatchMatmulV2 for efficient quantized computation - Handle automatic F16 type conversion for hardware compatibility - Support both per-expert and broadcast input modes Implementation details: - Extract expert weights and scales using CANN IndexSelect operation - Process each batch and expert combination independently - Create proper tensor views with correct stride for matmul operations - Automatic input/output type casting to/from F16 as needed Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).	2026-02-10 14:18:59 +08:00
Georgi Gerganov	a0d585537c	cuda : extend GGML_OP_PAD to work with non-cont src0 (#19429 ) * cuda : extend GGML_OP_PAD to work with non-cont src0 * tests : add permuted pad	2026-02-10 08:07:16 +02:00
Oliver Simons	e06088da0f	CUDA: Fix non-contig rope (#19338 ) * Rename variables + fix rope_neox Seems memory layout is shared with Vulkan so we can port fix from https://github.com/ggml-org/llama.cpp/pull/19299 * Fix rope_multi * Fix rope_vision * Fix rope_norm * Rename ne* to ne0* for consistent variable naming * cont : consistent stride names --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-02-08 15:12:51 +02:00
Georgi Gerganov	8872ad2125	metal : consolidate bin kernels (#19390 ) * metal : refactor bin kernels * cont * cont : fix cv	2026-02-07 10:35:56 +02:00
Georgi Gerganov	34ba7b5a2f	metal : fix event synchronization in cpy_tensor_async (#19402 )	2026-02-07 07:37:15 +02:00
Abhijit Ramesh	7fbd36c50c	ggml-webgpu: JIT compile binary operators and handle binding overlaps (#19310 ) * ggml webgpu: port binary operators to use pre-wgsl * Add binary.wgsl: unified shader with conditionals for all 4 ops * Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor * Remove bin_op.tmpl.wgsl and binary.wgsl (Python template) * Update CMake to generate binary operator shaders at build time * ggml-webgpu: migrate binary ops to JIT compilation with overlap handling * port binary operators from AOT to pre-wgsl JIT compilation * add src1=dst overlap handling for binary ops * use compile-time workgroup size defines instead of runtime overrides * ggml-webgpu: complete overlap handling for binary ops * add support for inplace & overlap case in binding setup * restructure conditional logic to handle all overlap cases * ensure all buffer bindings are correctly assigned for edge cases * ggml-webgpu: remove unused binary overlap cases Remove src0==src1 binary overlap case that never occurs in practice. * keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT * remove unused src0==src1 and all-same variant * refactor wgsl to eliminate duplication	2026-02-06 10:33:30 -08:00
Nechama Krashinski	537eadb1b9	sycl: add F16 support for GGML_OP_CEIL (#19306 ) * Fix SYCL CEIL operator * sycl: implement GGML_OP_CEIL	2026-02-06 23:13:44 +08:00
Jeff Bolz	1946e46f4c	vulkan: For coopmat2 FA, use fp16 accumulators for the final result (#19376 ) The cpu and cuda backends use fp16 for the VKQ accumulator type, this change does the same for vulkan. This helps particularly with large head sizes which are very register-limited. I tried this for the coopmat1 path and it slowed down a bit. I didn't try for scalar. I applied the softmax bias that the cuda backend uses to avoid overflow, although I was not able to reproduce the original bug without it.	2026-02-06 09:15:13 +01:00
Jeff Bolz	f9bd518a6b	vulkan: make FA mask/softcap enables spec constants (#19309 ) * vulkan: make FA mask/softcap enables spec constants * don't specialize for sinks * bump timeout a little bit	2026-02-06 08:49:58 +01:00
Georgi Gerganov	7fcf1ef45d	metal : skip loading all-zero mask (#19337 ) * metal : skip loading all-zero mask * cont : minor	2026-02-06 09:25:11 +02:00
Georgi Gerganov	3e21647666	cuda : cuda graphs now compare all node params (#19383 )	2026-02-06 07:55:06 +02:00
Georgi Gerganov	22cae83218	metal : adaptive CPU/GPU interleave based on number of nodes (#19369 )	2026-02-05 19:07:22 +02:00
Jeff Bolz	449ec2ab07	vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (#19281 ) Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases. Apply this optimization when the mask is relatively large (i.e. prompt processing).	2026-02-05 09:26:38 -06:00
Georgi Gerganov	7a4f97d196	metal : add diag (#19330 )	2026-02-05 10:08:45 +02:00
Oleksandr Kuvshynov	a498c75ad1	vulkan: fix GPU deduplication logic. (#19222 ) * vulkan: fix GPU deduplication logic. As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the (same uuid, same driver) logic is problematic for windows+intel igpu. Let's just avoid filtering for MoltenVK which is apple-specific, and keep the logic the same as before `88d23ad5` - just dedup based on UUID. Verified that MacOS + 4xVega still reports 4 GPUs with this version. * vulkan: only skip dedup when both drivers are moltenVk	2026-02-05 09:06:59 +01:00
Jeff Bolz	3409ab842d	vulkan: Set k_load_shmem to false when K is too large (#19301 )	2026-02-05 08:48:33 +01:00
Jeff Bolz	c342c3b93d	vulkan: fix non-contig rope (#19299 )	2026-02-05 08:38:59 +01:00
will-lms	af252d0758	metal : add missing includes (#19348 )	2026-02-05 08:05:09 +02:00
Kevin Pouget	015deb9048	ggml-virtgpu: make the code thread safe (#19204 ) * ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function * ggml-virtgpu: deprecate buffer_type is_host remoting not necessary * ggml-virtgpu: stop using static vars as cache The static init isn't thread safe. * ggml-virtgpu: protect the use of the shared memory to transfer data * ggml-virtgpu: make the remote calls thread-safe * ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory * ggml-virtgpu: add a cleanup function for consistency * ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing * fix style and ordering * Remove the static variable in apir_device_get_count * ggml-virtgpu: improve the logging * fix review minor formatting changes	2026-02-04 10:46:18 +08:00
Aman Gupta	2ceda3f662	ggml-cpu: use LUT for converting e8->f32 scales on x86 (#19288 ) * ggml-cpu: use LUT for converting e8->f32 scales on x86 * add dispatch based on macro	2026-02-04 09:43:29 +08:00
Georgi Gerganov	44008ce8f9	metal : add solve_tri (#19302 )	2026-02-03 23:43:14 +02:00
Ruben Ortlam	32b17abdb0	vulkan: disable coopmat1 fa on Nvidia Turing (#19290 )	2026-02-03 17:37:32 +01:00
Aman Gupta	8bece2eb20	CUDA: use mmvq for mul-mat-id for small batch sizes (#18958 ) * CUDA: use mmvq for mul-mat-id for small batch sizes * add mmvq too * Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs * templatize multi_token_path	2026-02-03 23:31:23 +08:00
Georgi Gerganov	c55bce4159	metal : minor cleanup (#19251 )	2026-02-03 13:43:29 +02:00
Oliver Simons	1f1e57f2bf	CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (#19053 ) By providing stride_* variables as size_t (i.e., 64-bit) the compiler can correctly unroll the [two for-loops](`557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816)`) on BW. This gives some perf for prefill/pp phase on BW, while not affecting other SMs: \| GPU \| Model \| Test \| t/s master \| t/s osimons/fix_bw_mmq_fixup_kernel \| Speedup \| \|:--------------------------------------------------------\|:----------------------\|:-------\|-------------:\|--------------------------------------:\|----------:\| \| NVIDIA RTX 6000 Ada Generation \| gpt-oss 20B MXFP4 MoE \| pp8096 \| 8404.05 \| 8375.79 \| 1.00 \| \| NVIDIA RTX 6000 Ada Generation \| llama 3B Q4_K_M \| pp8096 \| 16148.93 \| 16019.60 \| 0.99 \| \| NVIDIA RTX 6000 Ada Generation \| llama 8B Q4_0 \| pp8096 \| 8008.29 \| 7978.80 \| 1.00 \| \| NVIDIA RTX 6000 Ada Generation \| nemotron_h 9B BF16 \| pp8096 \| 4263.16 \| 4248.53 \| 1.00 \| \| NVIDIA RTX 6000 Ada Generation \| nemotron_h 9B Q4_K_M \| pp8096 \| 5165.11 \| 5157.43 \| 1.00 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| gpt-oss 20B MXFP4 MoE \| pp8096 \| 12582.80 \| 12758.37 \| 1.01 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| llama 3B Q4_K_M \| pp8096 \| 16879.10 \| 17619.47 \| 1.04 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| llama 8B Q4_0 \| pp8096 \| 10649.90 \| 10982.65 \| 1.03 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| nemotron_h 9B BF16 \| pp8096 \| 7717.73 \| 7716.22 \| 1.00 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| nemotron_h 9B Q4_K_M \| pp8096 \| 7301.90 \| 7370.38 \| 1.01 \|	2026-02-03 11:33:14 +01:00
George	e9a859db3c	ggml: added cleanups in ggml_quantize_free (#19278 ) Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.	2026-02-03 08:43:39 +02:00
Gaurav Garg	41e3f02647	cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (#19227 ) Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.	2026-02-03 08:41:02 +02:00
lhez	91ea44e89b	opencl: refactor some ops, concat, repeat, tanh and scale (#19226 ) * opencl: refactor concat * opencl: refactor repeat * opencl: refactor tanh * opencl: enable fp16 for tanh * opencl: refactor scale * opencl: fix unused variables	2026-02-02 15:54:43 -08:00
Aman Gupta	9f682fb640	ggml-cpu: FA split across kv for faster TG (#19209 ) * ggml-cpu: split across kv for faster TG * simplify sinks application * add ref impl	2026-02-03 01:19:55 +08:00
Neo Zhang	bf38346d13	Remove support for Nvidia & AMD GPU, because the oneAPI plugin for Nvidia & AMD GPU is unavailable: download/installation channels are out of work. (#19246 ) User can't build up the software for Nvidia & AMD GPU. rm the oneMath since it is only used in NV and AMD code path.	2026-02-02 21:06:21 +08:00
Tamar	4d5e972673	sycl: implement GGML_OP_TOP_K (#19242 )	2026-02-02 21:05:51 +08:00
Georgi Gerganov	6fdddb4987	metal : support virtual devices (#18919 ) * metal : support virtual devices * cont : manage buffer type context memory * metal : add events * cont : implement cpy_tensor_async	2026-02-02 14:29:44 +02:00
Johannes Gäßler	59377a6c87	ggml-backend: fix async set/get fallback sync (#19179 )	2026-02-02 10:00:05 +01:00
Christian Kastner	7a4ca3cbd9	docs : Minor cleanups (#19252 ) * Update old URLs to github.com/ggml-org/ * Bump copyrights	2026-02-02 08:38:55 +02:00
Nikhil Jain	2dc3ce2166	Remove pipeline cache mutexes (#19195 ) * Remove mutex for pipeline caches, since they are now per-thread. * Add comment * Run clang-format * Cleanup * Run CI again * Run CI once more * Run clang-format	2026-02-01 18:47:29 -08:00
Max Krasnyansky	3bc8d2cf23	Bump cmake max version (needed for Windows on Snapdragon builds) (#19188 ) * Bump max cmake version (needed for Windows on Snapdragon builds) * cmake: move max version setting into ggml/CMakeLists	2026-02-01 14:13:38 -08:00
nullname	89f10baad5	ggml-hexagon: flash-attention and reduce-sum optimizations (#19141 ) * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * hexagon: optimize reduce-sum for v75+ * hexagon: always keep row_sums in sf/fp32 * ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT * fix compiling error after rebase --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-01-30 21:14:20 -08:00
shaofeiqi	971facc38e	opencl: add optimized q8_0 mm kernel for adreno (#18871 ) * Add Q8_0 OpenCL kernel Co-authored-by: yunjie <yunjie@qti.qualcomm.com> * opencl: fix build for non-adreno * opencl: refactor q8_0 * opencl: enforce subgroup size of 64 for adreno for q8_0 * For A750 and older generations, subgroup size can be 64 or 128. This kernel assumes subgroup size 64. * opencl: suppress warning when adreno kernels are disabled --------- Co-authored-by: yunjie <yunjie@qti.qualcomm.com> Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-01-30 10:19:27 -08:00
Georgi Gerganov	dfd6106c84	cuda : fix compile warnings (whisper/0)	2026-01-30 20:09:21 +02:00
Simon Redman	13f3ebfae1	Correctly fetch q8_1 quantize pipeline in test as needed by `8a3519b` (#19194 )	2026-01-30 17:27:16 +01:00
bssrdf	ecbf01d441	add tensor type checking as part of cuda graph properties (#19186 )	2026-01-30 12:57:52 +08:00
s8322	1025fd2c09	sycl: implement GGML_UNARY_OP_SOFTPLUS (#19114 ) * sycl: add softplus unary op implementation * sycl: add softplus unary op implementation * docs(ops): mark SYCL SOFTPLUS as supported * docs: update SYCL status for SOFTPLUS	2026-01-30 12:01:38 +08:00
RachelMantel	c7358ddf64	sycl: implement GGML_OP_TRI (#19089 ) * sycl: implement GGML_OP_TRI * docs: update ops.md for SYCL TRI * docs: regenerate ops.md * docs: update SYCL support for GGML_OP_TRI	2026-01-30 12:00:49 +08:00
Zheyuan Chen	bd90fc74c3	ggml-webgpu: improve FlashAttention performance by software pipelining (#19151 ) * webgpu : pipeline flash_attn Q/K loads in WGSL * ggml-webgpu: unroll QK accumulation inner loop ggml-webgpu: vectorization * ggml-webgpu: unrolling * ggml-webgpu: remove redundant unrolling * ggml-webgpu: restore the config * ggml-webgpu: remove redundant comments * ggml-webgpu: formatting * ggml-webgpu: formatting and remove vectorization * ggml-webgpu: remove unnecessary constants * ggml-webgpu: change QKV buffer to read_write to pass validation * ggml-webgpu: add explanation for the additional bracket around Q K accumulate * Indentation and for -> if for tail * Kick off CI on wgsl only commits --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-01-29 14:05:30 -08:00
Todor Boinovski	ce38a4db47	hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150 ) * hexagon: updates to enable offloading to HTP on WoS * Update windows.md * Update windows.md * hexagon: enable -O3 optimizations * hexagon: move all _WINDOWS conditional compilation to _WIN32 * hexagon: updates to enable offloading to HTP on WoS * hexagon: use run-time vs load-time dynamic linking for cdsp driver interface * refactor htp-drv * hexagon: add run-bench.ps1 script * hexagon: htdrv refactor * hexagon: unify Android and Windows build readmes * hexagon: update README.md * hexagon: refactor htpdrv * hexagon: drv refactor * hexagon: more drv refactor * hexagon: fixes for android builds * hexagon: factor out dl into ggml-backend-dl * hexagon: add run-tool.ps1 script * hexagon: merge htp-utils in htp-drv and remove unused code * wos: no need for getopt_custom.h * wos: add missing CR in htpdrv * hexagon: ndev enforcement applies only to the Android devices * hexagon: add support for generating and signing .cat file * hexagon: add .inf file * hexagon: working auto-signing and improved windows builds * hexagon: further improve skel build * hexagon: add rough WoS guide * hexagon: updated windows guide * hexagon: improve cmake handling of certs and logging * hexagon: improve windows setup/build doc * hexagon: more windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * Update windows.md * Update windows.md * snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon Also added a power shell script to simplify build env setup. * hexagon: remove trailing whitespace and move cmake requirement to user-presets * hexagon: fix CMakeUserPresets path in workflow yaml * hexagon: introduce local version of libdl.h * hexagon: fix src1 reuse logic gpt-oss needs a bigger lookahead window. The check for src[1] itself being quantized was wrong. --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-01-29 12:33:21 -08:00
Georgi Gerganov	4fdbc1e4db	cuda : fix nkvo, offload and cuda graph node properties matching (#19165 ) * cuda : fix nkvo * cont : more robust cuda graph node property matching * cont : restore pre-leafs implementation * cont : comments + static_assert	2026-01-29 18:45:30 +02:00
yulo	f3dd7b8e68	HIP: add mmf for CDNA (#18896 ) * refactor mmf rows_per_block * speed up compile * pass cdna compile * fix cuda error * clean up mmf * f32 mmf * clean float mma * fix mmf error * faster mmf * extend tile k * fix compile error * Revert "extend tile k" This reverts commit `4d2ef3d483`. * fix smem overflow * speed up compiling mmf * speed up compile for hip * 512 block for cdna * config pad size * fix as comment * update select logic * move some code to cuh * fix as comment * correct cdna3 config --------- Co-authored-by: zhang hui <you@example.com>	2026-01-29 11:10:53 +01:00
Vishal Singh	b33df266d0	ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (#19159 )	2026-01-29 12:28:57 +08:00
Aman Gupta	3bcc990997	CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (#19126 )	2026-01-29 10:31:28 +08:00
Neo Zhang	d4964a7c66	sycl: fix norm kernels: l2_norm, group_norm, rms_norm by remove assert to support more cases (#19154 ) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-29 09:20:22 +08:00
Ruben Ortlam	f6b533d898	Vulkan Flash Attention Coopmat1 Refactor (#19075 ) * vulkan: use coopmat for flash attention pv matrix multiplication fix P loading issue * fix barrier position * remove reduction that is no longer needed * move max thread reduction into loop * remove osh padding * add bounds checks and padding * remove unused code * fix shmem sizes, loop duration and accesses * don't overwrite Qf, add new shared psh buffer instead * add missing bounds checks * use subgroup reductions * optimize * move bounds check, reduce barriers * support other Bc values and other subgroup sizes * remove D_split * replace Of register array with shared memory Ofsh array * parallelize HSV across the rowgroups * go back to Of in registers, not shmem * vectorize sfsh * don't store entire K tile in shmem * fixes * load large k tiles to shmem on Nvidia * adapt shared memory host check function to shader changes * remove Bc 32 case * remove unused variable * fix missing mask reduction tmspsh barrier * fix mask bounds check * fix rowmax f16 under/overflow to inf * fix flash_attn_cm2 BLOCK_SIZE preprocessor directives	2026-01-28 18:52:45 +01:00
Patryk Kaminski	0cd7032ca4	ggml-sycl: remove unused syclcompat header (#19140 ) The syclcompat/math.hpp is not used anymore. The change that introduced it was successfully reverted (https://github.com/ggml-org/llama.cpp/pull/17826). This include path will become obsolete and dropped in oneAPI 2026.0 effectively breaking ggml-sycl builds.	2026-01-28 23:33:54 +08:00
Oleksandr Kuvshynov	88d23ad515	vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058 ) Deduplication here relied on the fact that vulkan would return unique UUID for different physical GPUs. It is at the moment not always the case. On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total), MoltenVK would assign same UUID to pairs of GPUs, unless they are connected with Infinity Fabric. See more details here: KhronosGroup/MoltenVK#2683. The right way is to fix that in MoltenVK, but until it is fixed, llama.cpp would only recognize 2 of 4 GPUs in such configuration. The deduplication logic here is changed to only filter GPUs if UUID is same but driver is different.	2026-01-28 12:35:54 +01:00
Kevin Pouget	b7feacf7f3	ggml: new backend for Virglrenderer API Remoting acceleration (v2) (#18718 )	2026-01-28 17:49:40 +08:00
Alberto Cabrera Pérez	6ad70c5a77	ggml-cpu: arm64: Q4_K scale unroll and vectorization (#19108 )	2026-01-28 09:15:56 +02:00
Georgi Gerganov	631cbfcc7a	cuda : fix "V is K view" check for non-unified KV cache (#19145 )	2026-01-28 09:15:27 +02:00
Georgi Gerganov	2eee6c866c	CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (#19142 )	2026-01-28 09:15:11 +02:00
Nikhil Jain	06961e2876	ggml webgpu: Split shared state (webgpu_context) into global state and per-thread state (#18976 ) * Squashed commit of the following: commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755 Author: Abhijit Ramesh <abhijitramesh2k@gmail.com> Date: Mon Dec 1 18:29:00 2025 -0800 ggml webgpu: fix xielu parameter passing (#11) The XIELU operation was incorrectly using static_cast to convert float parameters to uint32_t, which converted numeric values instead of preserving IEEE 754 bit patterns. This caused incorrect values to be interpreted by the GPU shader. * Use reinterpret_cast to preserve float bit patterns when passing through uint32_t params buffer * Update WGSL shader parameter types from u32 to f32 * Re-enable XIELU support (was disabled due to numerical issues) Fixes NMSE test failures for XIELU operation on WebGPU backend. commit `5ca9b5e49e` Author: neha-ha <137219201+neha-ha@users.noreply.github.com> Date: Tue Nov 18 12:17:00 2025 -0800 Refactored pipelines and workgroup calculations (#10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:13:06 2025 -0700 formatted embed wgsl and ggml-webgpu.cpp commit `e1f6baea31` Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:08:37 2025 -0700 implemented REPL_Template support and removed bug in unary operators kernel commit `8c70b8fece` Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 15 16:14:20 2025 -0700 responded and dealt with PR comments commit `f9282c660c` Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:41:41 2025 -0700 removed unnecessary checking if node->src[1] exists for unary operators commit `4cf28d7dec` Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:32:45 2025 -0700 All operators (including xielu) working commit `74c6add176` Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 13:16:48 2025 -0700 fixed autoconfig commit `362749910b` Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 13:10:46 2025 -0700 removed vestigial files commit `cb08583337` Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:59:32 2025 -0700 abides by editor-config commit `5360e2852a` Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:45:57 2025 -0700 rms_norm double declaration bug atoned commit `7b09baa4aa` Merge: `8a6ec843` `74b8fc17` Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 11:50:03 2025 -0700 resolving merge conflicts commit `8a6ec843a5` Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 8 18:06:47 2025 -0700 unary operators pass ggml tests commit `c3ae38278a` Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 1 16:22:40 2025 -0700 neg passes backend test commit `aa1c9b2f88` Author: James Contini <jamescontini@gmail.com> Date: Tue Sep 30 23:55:27 2025 -0700 neg f16xf32xip builds and runs, haven't actually ran a model that uses neg kernel yet though Co-authored-by: James Contini <jamescontini@gmail.com> Co-authored-by: Neha Abbas <neabbas@ucsc.edu> Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com> * Remove extra code and format * Add ops documentation (finally) * ggml webgpu: add SOFTPLUS unary operator Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32 precision for intermediate calculations to prevent f16 overflow. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * Follow Vulkan backend numerical stability pattern * ggml webgpu: add EXPM1 unary operator Implements EXPM1 (exp(x) - 1) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add FLOOR unary operator Implements FLOOR (rounds down to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add CEIL unary operator Implements CEIL (rounds up to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add ROUND unary operator Implements ROUND (rounds to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add TRUNC unary operator Implements TRUNC (truncates towards zero) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS) * Updates to webgpu get_memory * Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context * Small cleanup * Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state. * Cleanups * More cleanup * Move staging_buf mutex to global context * Resolve merge * Resolve merge * Resolve merge * Clean up merge errors, delete forward declaration, and run clang-format * Rename device_init to backend_init * Move webgpu_context to backend_context * Move buffer context members into global context and refactor function calls * Run clang-format * Remove comments * Move parameter buffers to per-thread, add single memset_tensor param buf * Fix CI compilation issue * Fix builds for emscripten not supporting subgroups * cleanup * cleanup --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-01-27 20:53:36 -08:00
Vishal Singh	f2571df8b7	ggml-zendnn : update ZenDNN git tag to main branch (#19133 )	2026-01-28 06:21:36 +08:00
Johannes Gäßler	a5bb8ba4c5	CUDA: tune GLM 4.7 Flash FA kernel selection logic (#19097 )	2026-01-27 14:28:56 +01:00
Alberto Cabrera Pérez	be8890e721	ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (#18888 ) * Boilerplate for q6_K repack * q6_K repack to q6_Kx8 implementation Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * q6_K generic gemv and gemm * wip, gemm_q6_K 8x8 * Still WIP: loading of q8s, q6h and q6l * first working version of q6_K gemm * Moved q6 loads outside of sb block, Unrolled inner loop * Replaced modulo with mask * First implementation of GEMV * ggml_vdotq_s32 -> vdotq_s32 * Reduce width of accumulators in q6_K gemv * Bsums instead of calc bias. Preload scales to use vget_lane. Unroll. * Reuse scales in GEMM (same GEMV opt) * Added todos for bsum and different qh repack * Arch fallback * VSLIQ for merging qh and ql * Removed TODO, already tested * Apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Removed unused import --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-27 11:08:10 +02:00
Gaurav Garg	a83c73a18a	[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042 ) * [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline. Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size. * Set the env variable in the CUDA backend registry allocation * Add link to PR in code comment * Remove warning logs and update documentation	2026-01-27 08:52:44 +02:00
shalinib-ibm	7afdfc9b84	ggml-cpu: Enable FP16 MMA kernels on PPC (#19060 )	2026-01-27 11:52:34 +08:00
lhez	94eeb5967c	opencl: add flattened q6_K mv (#19054 ) * opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat` * opencl: clean up * opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat` * opencl: tweak the workgroup size a bit * opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat` * opencl: proper alignment for q6_K * opencl: boundary handling for flattened q6_K mv * opencl: rename q6_K mv kernel file * opencl: put flattened q6_K mv in its own file * opencl: use lower k in file name * opencl: use K in variable names	2026-01-26 19:36:24 -08:00
Johannes Gäßler	b0311c16d2	CUDA: fix padding of GQA to power of 2 in FA (#19115 )	2026-01-26 23:24:58 +01:00
Johannes Gäßler	0c21677e43	CUDA: faster FA for GQA > 1 but not power of 2 (#19092 )	2026-01-25 21:19:47 +01:00
ccbinn	0440bfd160	metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (#19088 ) Co-authored-by: chenbin11 <chenbin11@kuaishou.com>	2026-01-25 20:07:19 +02:00
Aman Gupta	bcb43163ae	ggml-cpu: Use tiled FA for prompt-processing (#19012 ) * ggml-cpu: Use tiled FA for prompt-processing the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on an AMD EPYC single-socket 64-c machine. * fix out of bounds for mask * skip rows where there are all masks * skip tile if mask is inf * store mask in worksize * check inf tile earlier	2026-01-25 23:25:58 +08:00
Georgi Gerganov	d9c6ce46f7	kv-cache : support V-less cache (#19067 ) * kv-cache : support V-less cache * cuda : better check for V_is_K_view * cuda : improve V_is_K_view check * graph : add comments * hparams : refactor	2026-01-25 15:48:56 +02:00
Johannes Gäßler	4e5b83b226	GGUF: check that tensor size is representable (#19072 )	2026-01-24 21:57:51 +01:00
Johannes Gäßler	8f91ca54ec	CUDA: re-use MLA K data for V in MMA FA (#19057 )	2026-01-24 10:09:36 +01:00
Aman Gupta	81ab64f3c8	ggml-cuda: enable cuda-graphs for `n-cpu-moe` (#18934 ) * ggml-cuda: add split-wise cuda graph * add n-cpu-moe compare_llama_bench.py * fix hip/musa builds	2026-01-24 14:25:20 +08:00
nullname	8af1f5f430	ggml-hexagon: flash-attn opt (#19025 ) * optimize flash attention kernel by improving score computation and online softmax update * wip * Refactor online softmax update in flash attention kernel for improved performance * Optimize flash attention kernel by replacing float array with HVX_Vector for score computation * wip	2026-01-23 22:02:07 -08:00
Neo Zhang	cb6caca191	[SYCL] use malloc to support both iGPU and dGPU in same time (#18992 ) * use malloc to support both iGPU and dGPU in same time * support windows --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-23 20:54:10 +08:00
Alberto Cabrera Pérez	091a46cb8d	ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (#18860 ) * Boilerplate for q5_Kx8 REPACK on ARM and fallback Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Implements make_block_q5_Kx8 by extending make_block_q4_Kx8 Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * q5_K repack gemm and gemv generics * Gemm and Gemv ARM implementations (i8mm) * Improved qh manipulation looking at non-repack vec_dot implementation * Full unroll * Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments. Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fix wrong fallback definitions of Q5_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed comments. Reverted unnecessary formatting Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed typo in generic definitions * Switching AND + Shift with Shift Insert. Better op interleaving. * Vectorize + unroll the block scales * Apply gemm optimizations to gemv * Improve bias calculation --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>	2026-01-23 09:55:08 +02:00
Georgi Gerganov	a5eaa1d6a3	mla : make the V tensor a view of K (#18986 ) * mla : pass V as a view of K to the FA op * cuda : adjust mla logic to new layout * kv-cache : fix rope shift * tests : remove comment * cuda : fix reusable_cutoff Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-01-22 22:09:01 +02:00
Johannes Gäßler	e2baf02162	CUDA: fix alignment check for FA (#19023 )	2026-01-22 20:39:25 +01:00
lhez	9c96465f99	opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970 ) * opencl: add `copy_to_contiguous` and utilize mm kernels * opencl: only copy to cont for f32 and f16 tensors * opencl: use cont mm for fallback when dst is large * opencl: use nb local to copy-to-cont * opencl: use local offset as well	2026-01-22 10:29:25 -08:00
Aman Gupta	b70d251076	CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953 )	2026-01-22 18:51:53 +08:00
shaofeiqi	5516b9c16a	opencl: add TRI op support (#18979 )	2026-01-21 22:05:54 -08:00
Aleksei Nikiforov	94242a62c0	ggml-zdnn : mark zDNN buffers as non-host (#18967 ) While buffers reside in host memory, additional transformation is needed to use buffers with zDNN. Fixes #18848	2026-01-22 01:16:21 +01:00
Jeff Bolz	bd544c94a3	vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945 ) * vulkan: Remove transfer_ctx, do everything in compute_ctx. We had a bug where a set_tensor_async (using transfer_ctx) didn't get submitted before the graph_compute (using compute_ctx) that came after it. To avoid this sort of issue, just do everything in compute_ctx. Remove transfer_cmd_pool, which was already unused. * fix crash with perf logger	2026-01-21 18:01:40 +01:00
Jeff Bolz	33f890e579	vulkan: support flash attention GQA/split_k with small batches (#18938 )	2026-01-21 17:43:43 +01:00
Masato Nakasaka	067b8d7af3	Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356 )" (#18831 ) This reverts commit `980b7cd17e`.	2026-01-21 17:13:43 +01:00
Jeff Bolz	50b7f076a5	vulkan: Use mul_mat_vec_id for small values of n (#18918 ) Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and update the indexing calculations in get_offsets. Mat-vec is faster than mat-mat for small values of n. We don't get the same reuse of the weights as in the non-ID path, but with this the cost is linear in n rather than n>1 being far slower than n==1.	2026-01-21 16:22:02 +01:00
Matthieu Coudron	37c35f0e1c	gguf: display strerrno when can't load a model (#18884 ) I've had issues loading models with llama-server: [44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' and I was sure it could access the file. Seems like --models-dir and --models-presets don't interact like I thought they would but I salvaged this snippet that helps troubleshooting [44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)	2026-01-21 08:52:46 +02:00
Oliver Simons	5bd341c9a1	CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964 ) * CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator Strided iterator was added in [CCCL 3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into [CTK 13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5) * Unindent as per code review request	2026-01-21 02:34:29 +01:00
Oliver Simons	d1e3556481	CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930 ) * CUDA: Replace `init_offsets` with iterators in argsort This is a QOL improvement, saving us the cost of materializing the iterator * Remove unnecessary include from top-k.cu	2026-01-20 20:11:01 +08:00
Adrien Gallouët	08f3f4a8a3	ggml : cleanup path_str() (#18928 ) - Remove pragmas as `std::codecvt_utf8` is not used. - Avoid implicit `strlen()`. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-20 11:42:49 +01:00
Georgi Gerganov	271191906c	metal : enable FA for MLA heads (#18950 )	2026-01-20 12:21:28 +02:00

2057 Commits