* hexagon: refactor set/get/sum-rows ops to use local context
* hexagon: refactor ROPE and Softmax Ops to use local context
Improves performance a bit by precomputing things and saving in the context.
* hexagon: refactor activation ops to use local context struct
* hexagon: refactor unary ops to use local context struct and DMA/VTCM
* hexagon: use aligned hvx_scale function
* hexagon: remove unused fields from op_context
* hexagon: rewrite ROPE to use DMA and VTCM scratchpad
* hex-rope: keep N rows in scratchpad (instead of just two)
* hex-rope: introduce rowidx cache
* hex-rope: remove unused fields
* hex-rope: rewrite dma prefetch logic to allow for multi-row fetch/compute
also removes the need for fastdiv.
* hex-rope: minor formatting
* hex-rope: use indices and unroll the loops
* hex-rope: more updates to cleanup rope-block handling
* hexagon: cleanup supported type/dims checks
* hexagon: all reduce funcs replicated across lanes
There is no need to explicitly replicate the first value.
* snapdragon: update adb and windows scripts to use ubatch-size 256
Updated Ops support handles larger ubatches.
* Improve CUDA graph capture
Currently, CUDA graphs are eagerly enabled on the first call to ggml_backend_cuda_graph_compute. If the graph properties keep changing (4+ consecutive updates), the graph is permanently disabled. This is suboptimal because:
- The first call always incurs CUDA graph capture overhead even if the graph is unstable
- Once permanently disabled, CUDA graphs never re-enable even after the graph stabilizes (e.g., switching from prompt processing to decode)
The new approach delays CUDA graph activation until warmup completes: the same cgraph must be called at least twice with matching properties before CUDA graph capture begins. This avoids wasted capture overhead on volatile graphs and allows graphs to become eligible once they stabilize.
This also fixes issues such as https://github.com/ggml-org/llama.cpp/discussions/19708
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Remove EM dashes
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* flashattention and matrix multiplication moved to new format
* clean up preprocessing
* Formatting
* remove duplicate constants
* Split large shaders into multiple static strings
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* vulkan: split mul_mat into multiple dispatches to avoid overflow
The batch dimensions can be greater than the max workgroup count limit,
in which case we need to split into multiple dispatches and pass the base
index through a push constant.
Fall back for the less common p021 and nc variants.
* address feedback
* opencl: optimize mean and sum_row kernels
* opencl: add comment for max subgroups
* opencl: format
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
When LTO enabled in build environments it forces all builds to have LTO
in place. But feature detection logic is fragile, and causing Illegal
instruction errors with lto. This disables LTO for the feature
detection code to prevent cross-module optimization from inlining
architecture-specific instructions into the score function. Without this,
LTO can cause SIGILL when loading backends on older CPUs (e.g., loading
power10 backend on power9 crashes before feature check runs).
* Updated repack.cpp
* Updated repack.cpp
* Updated repack.cpp
* Added if condition to support only vector length 256.
* Changed the format removed comments and duplicate variable
* If SVE 256 not present then was using generic function to compute, hence slowing the performance.
So added code if SVE 256 is not present then use NEON code.
* Code format change suggestion
---------
Co-authored-by: Vithule, Prashant <Prashant.Vithule@fujitsu.com>
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization
- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask
* cuda: iq2xxs: simplify sum scaling
express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`
saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small
* uint -> uint32_t
error: identifier "uint" is undefined
This commit addresses a build issue with the KleidiAI backend when
building multiple cpu backends. Commmit
3a00c98584 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable,
where the latter does handle this case (it is idempotent but
FetchContent_Populate is not).
I missed this during my review and I should not have commited without
verifying the CI failure, sorry about that.
* ggml-cpu: FA add GEMM microkernel
* add guard for sizeless vector types
* fix case where DV % GGML_F32_EPR !=0
* move memset out of the loop
* move another memset out of the loop
* use RM=4 for arm
* simd_gemm: convert everything to int
* convert everything to size_t to avoid warnings
* fixup
* add pragma for ignoring aggressive loop optimizations
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL
Fix for the bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.
The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.
* addressed code review comments
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml-hexagon: fa improvements
ggml-hexagon: optimize flash attention calculations with improved variable handling
ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32
ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements
ggml-hexagon: optimize flash attention by changing slope vector type to F16
* hexfa: fixed test-backend-ops failurs due to leftover element handling
* hexagon: refactor and optimize fa to use local context struct
* ggml-hexagon: optimize flash-attention using hvx_vec_expf
Use HVX for online softmax.
---------
Co-authored-by: chraac <chraac@gmail.com>
* fix vulkan ggml_acc only works in 3d but not 4d
* removed clamp in test_acc_block
* use the correct stride and its test case
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check
* version without boundary check
* revert back to boundary check version
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Do not mutate cgraph for fused ADDs
1. We should try to minimize in-place changes to the incoming
ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
step as we store the properties before modifying the graph in-place
in the cuda-backend
* Assert ggml_tensor is trivially copyable
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).
The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
2549 | using ARegsT = typename Impl::ARegsT;
```
Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2]. When this is actually
fixed upstream some guards can be used to detect and wrap the
version that has the fix to only apply when necessary.
Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
* hexagon: add ARGSORT op
Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
* hexagon: argsort reject tensors with huge rows for now
* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend
* hexagon : Add GEGLU op
* hexagon: fix editor config check
* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA
---------
Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>
CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, and it makes sense to update from RC to final
released version.
https://github.com/NVIDIA/cccl/releases/tag/v3.2.0
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool
* Free pools
* Cleanup
* More cleanup
* Run clang-format
* Fix arg-parser and tokenizer test errors that free an unallocated buffer
* Fix device lost callback to not print on device teardown
* Fix include and run clang-format
* remove unused unused
* Update binary ops
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* First working version of GEMM and GEMV
* interleave loads and compute
* Clang-format
* Added missing fallback. Removed tested TODO.
* Swap M and N to be consistent with the repack template convention
using noexcept std::filesystem::directory_entry::is_regular_file
overload prevents abnormal termination upon throwing an error
(as caused by symlinks to non-existent folders on linux)
Resolves: #18560
Implement ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on CANN backend.
Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes
Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed
Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
* Rename variables + fix rope_neox
Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299
* Fix rope_multi
* Fix rope_vision
* Fix rope_norm
* Rename ne* to ne0* for consistent variable naming
* cont : consistent stride names
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml webgpu: port binary operators to use pre-wgsl
* Add binary.wgsl: unified shader with conditionals for all 4 ops
* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor
* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)
* Update CMake to generate binary operator shaders at build time
* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling
* port binary operators from AOT to pre-wgsl JIT compilation
* add src1=dst overlap handling for binary ops
* use compile-time workgroup size defines instead of runtime overrides
* ggml-webgpu: complete overlap handling for binary ops
* add support for inplace & overlap case in binding setup
* restructure conditional logic to handle all overlap cases
* ensure all buffer bindings are correctly assigned for edge cases
* ggml-webgpu: remove unused binary overlap cases
Remove src0==src1 binary overlap case that never occurs in practice.
* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT
* remove unused src0==src1 and all-same variant
* refactor wgsl to eliminate duplication
The cpu and cuda backends use fp16 for the VKQ accumulator type, this change
does the same for vulkan. This helps particularly with large head sizes which
are very register-limited.
I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.
I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.
Apply this optimization when the mask is relatively large (i.e. prompt
processing).
* vulkan: fix GPU deduplication logic.
As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.
Let's just avoid filtering for MoltenVK which is apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.
Verified that MacOS + 4xVega still reports 4 GPUs with this version.
* vulkan: only skip dedup when both drivers are moltenVk
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function
* ggml-virtgpu: deprecate buffer_type is_host remoting
not necessary
* ggml-virtgpu: stop using static vars as cache
The static init isn't thread safe.
* ggml-virtgpu: protect the use of the shared memory to transfer data
* ggml-virtgpu: make the remote calls thread-safe
* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory
* ggml-virtgpu: add a cleanup function for consistency
* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing
* fix style and ordering
* Remove the static variable in apir_device_get_count
* ggml-virtgpu: improve the logging
* fix review minor formatting changes
* CUDA: use mmvq for mul-mat-id for small batch sizes
* add mmvq too
* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs
* templatize multi_token_path
Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
* Remove mutex for pipeline caches, since they are now per-thread.
* Add comment
* Run clang-format
* Cleanup
* Run CI again
* Run CI once more
* Run clang-format
* wip
* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation
* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations
* wip
* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance
* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability
* optimize vector dot product functions to use unified reduction for improved performance
* wip
* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation
* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations
* wip
* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance
* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability
* optimize vector dot product functions to use unified reduction for improved performance
* hexagon: optimize reduce-sum for v75+
* hexagon: always keep row_sums in sf/fp32
* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT
* fix compiling error after rebase
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* Add Q8_0 OpenCL kernel
Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
* opencl: fix build for non-adreno
* opencl: refactor q8_0
* opencl: enforce subgroup size of 64 for adreno for q8_0
* For A750 and older generations, subgroup size can be 64 or 128.
This kernel assumes subgroup size 64.
* opencl: suppress warning when adreno kernels are disabled
---------
Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
* sycl: add softplus unary op implementation
* sycl: add softplus unary op implementation
* docs(ops): mark SYCL SOFTPLUS as supported
* docs: update SYCL status for SOFTPLUS
* vulkan: use coopmat for flash attention p*v matrix multiplication
* fix P loading issue
* fix barrier position
* remove reduction that is no longer needed
* move max thread reduction into loop
* remove osh padding
* add bounds checks and padding
* remove unused code
* fix shmem sizes, loop duration and accesses
* don't overwrite Qf, add new shared psh buffer instead
* add missing bounds checks
* use subgroup reductions
* optimize
* move bounds check, reduce barriers
* support other Bc values and other subgroup sizes
* remove D_split
* replace Of register array with shared memory Ofsh array
* parallelize HSV across the rowgroups
* go back to Of in registers, not shmem
* vectorize sfsh
* don't store entire K tile in shmem
* fixes
* load large k tiles to shmem on Nvidia
* adapt shared memory host check function to shader changes
* remove Bc 32 case
* remove unused variable
* fix missing mask reduction tmspsh barrier
* fix mask bounds check
* fix rowmax f16 under/overflow to inf
* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
The syclcompat/math.hpp is not used anymore. The change that intrduced it was successfuly reverted (https://github.com/ggml-org/llama.cpp/pull/17826).
This include path will become obsolete and dropped in oneAPI 2026.0 effectively breaking ggml-sycl builds.
Deduplication here relied on the fact that vulkan would return unique
UUID for different physical GPUs. It is at the moment not always the case.
On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total),
MotlenVK would assign same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.
See more details here: KhronosGroup/MoltenVK#2683.
The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of 4 GPUs in such configuration.
The deduplication logic here is changed to only filter GPUs if UUID is
same but driver is different.
* Boilerplate for q6_K repack
* q6_K repack to q6_Kx8 implementation
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* q6_K generic gemv and gemm
* wip, gemm_q6_K 8x8
* Still WIP: loading of q8s, q6h and q6l
* first working version of q6_K gemm
* Moved q6 loads outside of sb block, Unrolled inner loop
* Replaced modulo with mask
* First implementation of GEMV
* ggml_vdotq_s32 -> vdotq_s32
* Reduce width of accumulators in q6_K gemv
* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.
* Reuse scales in GEMM (same GEMV opt)
* Added todos for bsum and different qh repack
* Arch fallback
* VSLIQ for merging qh adn ql
* Removed TODO, already tested
* Apply suggestions
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Removed unused import
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full
With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.
* Set the env variable in the CUDA backend registry allocation
* Add link to PR in code comment
* Remove warning logs and update documentation
* opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat`
* opencl: clean up
* opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat`
* opencl: tweak the workgroup size a bit
* opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat`
* opencl: proper alignment for q6_K
* opencl: boundary handling for flattened q6_K mv
* opencl: rename q6_K mv kernel file
* opencl: put flattened q6_K mv in its own file
* opencl: use lower k in file name
* opencl: use K in variable names
* ggml-cpu: Use tiled FA for prompt-processing
the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on a AMD EPYC single-socket 64-c machine.
* fix out of bounds for mask
* skip rows where there are all masks
* skip tile if mask is inf
* store mask in worksize
* check inf tile earlier
* Boilerplate for q5_Kx8 REPACK on ARM and fallback
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* q5_K repack gemm and gemv generics
* Gemm and Gemv ARM implementations (i8mm)
* Improved qh manipulation looking at non-repack vec_dot implementation
* Full unroll
* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fix wrong fallback definitions of Q5_K
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fixed comments. Reverted unnecessary formatting
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fixed typo in generic definitions
* Switching AND + Shift with Shift Insert. Better op interleaving.
* Vectorize + unroll the block scales
* Apply gemm optimizations to gemv
* Improve bias calculation
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* mla : pass V as a view of K to the FA op
* cuda : adjust mla logic to new layout
* kv-cache : fix rope shift
* tests : remove comment
* cuda : fix reusable_cutoff
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* opencl: add `copy_to_contiguous` and utilize mm kernels
* opencl: only copy to cont for f32 and f16 tensors
* opencl: use cont mm for fallback when dst is large
* opencl: use nb local to copy-to-cont
* opencl: use local offset as well
* vulkan: Remove transfer_ctx, do everything in compute_ctx.
We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.
Remove transfer_cmd_pool, which was already unused.
* fix crash with perf logger
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.
Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'
and I was sure it could access the file. Seems like --models-dir and
--models-presets dont interact like I thought they would but I salvaged
this snippet that helps troubleshooting
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
* CUDA: Replace `init_offsets` with iterators in argsort
This is a QOL improvement, saving us the cost of materializing the
iterator
* Remove unnecessary include from top-k.cu