* hexagon: add ARGSORT op
Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
* hexagon: argsort reject tensors with huge rows for now
* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend
* hexagon : Add GEGLU op
* hexagon: fix editor config check
* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA
---------
Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>
CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, and it makes sense to update from RC to final
released version.
https://github.com/NVIDIA/cccl/releases/tag/v3.2.0
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool
* Free pools
* Cleanup
* More cleanup
* Run clang-format
* Fix arg-parser and tokenizer test errors that free an unallocated buffer
* Fix device lost callback to not print on device teardown
* Fix include and run clang-format
* remove unused unused
* Update binary ops
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* First working version of GEMM and GEMV
* interleave loads and compute
* Clang-format
* Added missing fallback. Removed tested TODO.
* Swap M and N to be consistent with the repack template convention
using noexcept std::filesystem::directory_entry::is_regular_file
overload prevents abnormal termination upon throwing an error
(as caused by symlinks to non-existent folders on linux)
Resolves: #18560
Implement ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on CANN backend.
Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes
Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed
Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
* Rename variables + fix rope_neox
Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299
* Fix rope_multi
* Fix rope_vision
* Fix rope_norm
* Rename ne* to ne0* for consistent variable naming
* cont : consistent stride names
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml webgpu: port binary operators to use pre-wgsl
* Add binary.wgsl: unified shader with conditionals for all 4 ops
* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor
* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)
* Update CMake to generate binary operator shaders at build time
* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling
* port binary operators from AOT to pre-wgsl JIT compilation
* add src1=dst overlap handling for binary ops
* use compile-time workgroup size defines instead of runtime overrides
* ggml-webgpu: complete overlap handling for binary ops
* add support for inplace & overlap case in binding setup
* restructure conditional logic to handle all overlap cases
* ensure all buffer bindings are correctly assigned for edge cases
* ggml-webgpu: remove unused binary overlap cases
Remove src0==src1 binary overlap case that never occurs in practice.
* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT
* remove unused src0==src1 and all-same variant
* refactor wgsl to eliminate duplication
The cpu and cuda backends use fp16 for the VKQ accumulator type, this change
does the same for vulkan. This helps particularly with large head sizes which
are very register-limited.
I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.
I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.
Apply this optimization when the mask is relatively large (i.e. prompt
processing).
* vulkan: fix GPU deduplication logic.
As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.
Let's just avoid filtering for MoltenVK which is apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.
Verified that MacOS + 4xVega still reports 4 GPUs with this version.
* vulkan: only skip dedup when both drivers are moltenVk
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function
* ggml-virtgpu: deprecate buffer_type is_host remoting
not necessary
* ggml-virtgpu: stop using static vars as cache
The static init isn't thread safe.
* ggml-virtgpu: protect the use of the shared memory to transfer data
* ggml-virtgpu: make the remote calls thread-safe
* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory
* ggml-virtgpu: add a cleanup function for consistency
* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing
* fix style and ordering
* Remove the static variable in apir_device_get_count
* ggml-virtgpu: improve the logging
* fix review minor formatting changes
* CUDA: use mmvq for mul-mat-id for small batch sizes
* add mmvq too
* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs
* templatize multi_token_path
Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
* Remove mutex for pipeline caches, since they are now per-thread.
* Add comment
* Run clang-format
* Cleanup
* Run CI again
* Run CI once more
* Run clang-format
* wip
* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation
* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations
* wip
* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance
* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability
* optimize vector dot product functions to use unified reduction for improved performance
* wip
* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation
* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations
* wip
* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance
* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability
* optimize vector dot product functions to use unified reduction for improved performance
* hexagon: optimize reduce-sum for v75+
* hexagon: always keep row_sums in sf/fp32
* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT
* fix compiling error after rebase
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* Add Q8_0 OpenCL kernel
Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
* opencl: fix build for non-adreno
* opencl: refactor q8_0
* opencl: enforce subgroup size of 64 for adreno for q8_0
* For A750 and older generations, subgroup size can be 64 or 128.
This kernel assumes subgroup size 64.
* opencl: suppress warning when adreno kernels are disabled
---------
Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
* sycl: add softplus unary op implementation
* sycl: add softplus unary op implementation
* docs(ops): mark SYCL SOFTPLUS as supported
* docs: update SYCL status for SOFTPLUS
* vulkan: use coopmat for flash attention p*v matrix multiplication
* fix P loading issue
* fix barrier position
* remove reduction that is no longer needed
* move max thread reduction into loop
* remove osh padding
* add bounds checks and padding
* remove unused code
* fix shmem sizes, loop duration and accesses
* don't overwrite Qf, add new shared psh buffer instead
* add missing bounds checks
* use subgroup reductions
* optimize
* move bounds check, reduce barriers
* support other Bc values and other subgroup sizes
* remove D_split
* replace Of register array with shared memory Ofsh array
* parallelize HSV across the rowgroups
* go back to Of in registers, not shmem
* vectorize sfsh
* don't store entire K tile in shmem
* fixes
* load large k tiles to shmem on Nvidia
* adapt shared memory host check function to shader changes
* remove Bc 32 case
* remove unused variable
* fix missing mask reduction tmspsh barrier
* fix mask bounds check
* fix rowmax f16 under/overflow to inf
* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
The syclcompat/math.hpp is not used anymore. The change that intrduced it was successfuly reverted (https://github.com/ggml-org/llama.cpp/pull/17826).
This include path will become obsolete and dropped in oneAPI 2026.0 effectively breaking ggml-sycl builds.
Deduplication here relied on the fact that vulkan would return unique
UUID for different physical GPUs. It is at the moment not always the case.
On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total),
MotlenVK would assign same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.
See more details here: KhronosGroup/MoltenVK#2683.
The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of 4 GPUs in such configuration.
The deduplication logic here is changed to only filter GPUs if UUID is
same but driver is different.
* Boilerplate for q6_K repack
* q6_K repack to q6_Kx8 implementation
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* q6_K generic gemv and gemm
* wip, gemm_q6_K 8x8
* Still WIP: loading of q8s, q6h and q6l
* first working version of q6_K gemm
* Moved q6 loads outside of sb block, Unrolled inner loop
* Replaced modulo with mask
* First implementation of GEMV
* ggml_vdotq_s32 -> vdotq_s32
* Reduce width of accumulators in q6_K gemv
* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.
* Reuse scales in GEMM (same GEMV opt)
* Added todos for bsum and different qh repack
* Arch fallback
* VSLIQ for merging qh adn ql
* Removed TODO, already tested
* Apply suggestions
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Removed unused import
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full
With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.
* Set the env variable in the CUDA backend registry allocation
* Add link to PR in code comment
* Remove warning logs and update documentation
* opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat`
* opencl: clean up
* opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat`
* opencl: tweak the workgroup size a bit
* opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat`
* opencl: proper alignment for q6_K
* opencl: boundary handling for flattened q6_K mv
* opencl: rename q6_K mv kernel file
* opencl: put flattened q6_K mv in its own file
* opencl: use lower k in file name
* opencl: use K in variable names
* ggml-cpu: Use tiled FA for prompt-processing
the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on a AMD EPYC single-socket 64-c machine.
* fix out of bounds for mask
* skip rows where there are all masks
* skip tile if mask is inf
* store mask in worksize
* check inf tile earlier
* Boilerplate for q5_Kx8 REPACK on ARM and fallback
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* q5_K repack gemm and gemv generics
* Gemm and Gemv ARM implementations (i8mm)
* Improved qh manipulation looking at non-repack vec_dot implementation
* Full unroll
* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fix wrong fallback definitions of Q5_K
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fixed comments. Reverted unnecessary formatting
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fixed typo in generic definitions
* Switching AND + Shift with Shift Insert. Better op interleaving.
* Vectorize + unroll the block scales
* Apply gemm optimizations to gemv
* Improve bias calculation
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* mla : pass V as a view of K to the FA op
* cuda : adjust mla logic to new layout
* kv-cache : fix rope shift
* tests : remove comment
* cuda : fix reusable_cutoff
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* opencl: add `copy_to_contiguous` and utilize mm kernels
* opencl: only copy to cont for f32 and f16 tensors
* opencl: use cont mm for fallback when dst is large
* opencl: use nb local to copy-to-cont
* opencl: use local offset as well
* vulkan: Remove transfer_ctx, do everything in compute_ctx.
We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.
Remove transfer_cmd_pool, which was already unused.
* fix crash with perf logger
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.
Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'
and I was sure it could access the file. Seems like --models-dir and
--models-presets dont interact like I thought they would but I salvaged
this snippet that helps troubleshooting
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
* CUDA: Replace `init_offsets` with iterators in argsort
This is a QOL improvement, saving us the cost of materializing the
iterator
* Remove unnecessary include from top-k.cu