- adapt ggml-zendnn.cpp to the new lowoha::matmul interface
- update the ZenDNN git tag in CMake to the latest release (ZenDNN-2026-WW08)
- add static lib support in CMake
* ggml-virtgpu-backend: validate the consistency of the received objects
This patch adds consistency checks in the
ggml-virtgpu-backend (running on the host side) to ensure that the
data received from the guest is consistent: valid pointers, sizes,
and offsets.
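A minimal sketch of the kind of range check involved; `apir_buffer` and `apir_check_range` are hypothetical names for illustration, not the actual backend API:
```
// hypothetical types/names, for illustration only
#include <cstddef>

struct apir_buffer {
    void  * base;  // host mapping of the guest-visible buffer
    size_t  size;  // total buffer size in bytes
};

// reject guest-supplied (offset, len) pairs that fall outside the buffer;
// written as a subtraction so offset + len cannot overflow
static bool apir_check_range(const apir_buffer * buf, size_t offset, size_t len) {
    if (buf == nullptr || buf->base == nullptr) {
        return false;
    }
    return offset <= buf->size && len <= buf->size - offset;
}
```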
* ggml-virtgpu-backend: add fallback/skips for optional ggml backend methods
```
1. bck->iface.synchronize(bck)
2. buft->iface.get_alloc_size(buft, op)
3. buft->iface.get_max_size(buft)
```
These three methods are optional in the GGML interface. `get_max_size`
was already properly defaulted, but `backend synchronize` and `buft
get_alloc_size` would have segfaulted the backend if not implemented.
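A sketch of the guard pattern (illustrative, not the verbatim patch): optional `iface` entries are NULL function pointers when unimplemented, so they must be checked before being called.
```
#include "ggml-backend-impl.h"  // internal iface definitions

// no-op when the optional synchronize method is missing
static void backend_synchronize_safe(ggml_backend_t bck) {
    if (bck->iface.synchronize) {
        bck->iface.synchronize(bck);
    }
}

// fall back to the tensor's own size, the documented default
static size_t buft_get_alloc_size_safe(ggml_backend_buffer_type_t buft,
                                       const struct ggml_tensor * op) {
    if (buft->iface.get_alloc_size) {
        return buft->iface.get_alloc_size(buft, op);
    }
    return ggml_nbytes(op);
}
```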
* ggml-virtgpu-backend: fix log format missing argument
* ggml-virtgpu-backend: improve the abort message
* ggml-virtgpu-backend: more safety checks
* ggml-virtgpu-backend: new error code
* ggml-virtgpu-backend: initialize all the error codes
* ggml-virtgpu: add a missing comment generated by the code generator
* ggml-virtgpu: add the '[virtgpu]' prefix to the device/buffer names
* ggml-virtgpu: apir_device_buffer_from_ptr: improve the error message
* ggml-virtgpu: shared: make it match the latest api_remoting.h of Virglrenderer APIR
(still unmerged)
* ggml-virtgpu: update the code generator to have dispatch_command_name in a host/guest shared file
* ggml-virtgpu: REMOTE_CALL: fail if the backend returns an error
* docs/backend/VirtGPU.md: indicate that the RAM+VRAM size is limited to 64 GB with libkrun
* ggml-virtgpu: turn off clang-format header ordering for some of the files
Compilation breaks when ordered alphabetically.
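For reference, the standard clang-format escape hatch used here (the header names below are made up):
```
// clang-format off
#include "must-come-first.h"   // hypothetical: defines macros used below
#include "depends-on-first.h"  // hypothetical: breaks if sorted above
// clang-format on
```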
* ggml-virtgpu: clang-format
* ggml-virtgpu/backend/shared/api_remoting: better comments for the APIR return codes
* ggml-virtgpu: add backend documentation
Assisted-by-AI: Claude Code
* CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget
* README: add the link to docs/backend/GGML-VirtGPU/ggml-virt.md
* docs/ggml-virt: add link to testing + configuration
* Revert "CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget"
This reverts commit 8ece8e72e2.
* drop the ggml- prefix
* s/ggerganov/ggml-org
* Relocate VirtGPU.md
* reorganize the text
* turn the ASCII diagram into a Mermaid diagram
* README.md: update the link to the main doc
* cuda: revert setting CUDA_SCALE_LAUNCH_QUEUES=4x by default
Hangs were reported on Jetson Orin AGX when CUDA_SCALE_LAUNCH_QUEUES=4x was set. This reverts the previous PR (#19042) and updates the documentation to suggest setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
* sycl: add softplus unary op implementation
* docs(ops): mark SYCL SOFTPLUS as supported
* docs: update SYCL status for SOFTPLUS
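For reference, softplus(x) = log(1 + e^x), applied elementwise; a common numerically stable scalar form (the cutoff value is illustrative):
```
#include <cmath>

static float softplus(float x) {
    // for large x, log(1 + exp(x)) ~= x, and expf(x) would overflow
    return x > 20.0f ? x : log1pf(expf(x));
}
```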
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full
With pipeline parallelism, during prompt processing the CPU-side CUDA command buffer fills up, stalling the CPU. As a result, not enough work is submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size (see the sketch after this entry).
* Set the env variable in the CUDA backend registry allocation
* Add link to PR in code comment
* Remove warning logs and update documentation
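A minimal sketch of that approach, assuming POSIX `setenv` (the actual registry code may differ), taking care not to override a value the user set explicitly:
```
#include <cstdlib>

static void cuda_scale_launch_queues() {  // hypothetical helper name
#if !defined(_WIN32)
    // last argument 0: do not overwrite an existing user setting
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0);
#endif
}
```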
* hexagon: disable repack buffers if host buffers are disabled, improve handling of env vars
* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32
Factored out all hvx_copy functions into the hvx-copy.h header and reduced code duplication.
Updated the HTP ops infra to support OP_CPY.
* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs
- hex: basically all the scalar/core platform stuff (L2, DMA, basic utils)
- hvx: all the HVX-related utils, helpers, etc.
- htp: higher-level stuff such as Ops
The hvx-utils library got a round of cleanup and refactoring to reduce duplication; hvx_vec_store_a is now used where possible.
* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h
Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.
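A scalar illustration of that macro pattern; the real code uses HVX vector intrinsics, where the aligned/unaligned variants differ in their load/store instructions (here the bodies are identical):
```
#include <cmath>
#include <cstddef>

// suffix convention: _aa = aligned dst and src, _uu = both unaligned
#define HVX_SIGMOID_F32_LOOP(name)                                \
    static void name(float * dst, const float * src, size_t n) { \
        for (size_t i = 0; i < n; i++) {                          \
            dst[i] = 1.0f / (1.0f + expf(-src[i]));               \
        }                                                         \
    }

HVX_SIGMOID_F32_LOOP(hvx_sigmoid_f32_aa)
HVX_SIGMOID_F32_LOOP(hvx_sigmoid_f32_uu)
```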
* hexagon: factor out hvx-sqrt.h
* hexagon: minor update to hvx-utils.h
* hexagon: remove spurious log
* hexagon: factor out and optimize hvx_add/sub/mul
* hexagon: remove _opt variants of add/sub/mul as they are simply the fully aligned versions
* hexagon: refactor reduction functions to hvx-reduce.h
Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic.
Updated `softmax-ops.c` to use the new functions.
* hexagon: refactor the rest of the arithmetic functions to hvx-arith.h
Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use the `dst, src, ..., n` argument order and updated the call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.
* hexagon: refactor hvx_sum_of_squares_f32
- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.
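A scalar sketch of the `finalize_op` idea (illustrative; the real macro reduces HVX vectors): the loop body is shared, and the final step either returns the scalar or stores it through `dst`.
```
#include <cstddef>

#define HVX_REDUCE_LOOP_BODY(map_op, finalize_op) \
    float acc = 0.0f;                              \
    for (size_t i = 0; i < n; i++) {               \
        acc += map_op(src[i]);                     \
    }                                              \
    finalize_op(acc)

static inline float sq(float x) { return x * x; }
static inline float id(float x) { return x; }

// finalize by returning the scalar
static float hvx_reduce_sum_f32_a(const float * src, size_t n) {
    HVX_REDUCE_LOOP_BODY(id, return);
}

// finalize by storing through dst, matching the new dst, src signature
static void hvx_sum_of_squares_f32_a(float * dst, const float * src, size_t n) {
    HVX_REDUCE_LOOP_BODY(sq, *dst =);
}
```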
* hexagon: use hvx_splat instead of memset
* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML
* hexagon: fix hvx_copy_f16_f32 on v75 and older
* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL
* scripts: update snapdragon/adb scripts to enable host param
* arg: support remote preset
* proofreading
* allow one HF repo to point to multiple HF repos
* docs: mention about multiple GGUF use case
* correct clean_file_name
* download: also return HTTP status code
* fix the case where the cache file is used
* fix --offline option
* ggml-webgpu: add CEIL operation support
Add support for the CEIL unary operation in the WebGPU backend:
- Add CEIL_FUNC shader template in unary_op.wgsl
- Add 4 shader variants (f32, f16, inplace versions)
- Initialize CEIL pipelines in ggml-webgpu.cpp
- Register CEIL in supports_op function
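A sketch of the `supports_op` side, assuming the `GGML_UNARY_OP_CEIL` enum value and the usual ggml unary-op dispatch (not the verbatim backend code):
```
#include "ggml.h"

static bool webgpu_supports_ceil(const struct ggml_tensor * op) {
    if (op->op != GGML_OP_UNARY ||
        ggml_get_unary_op(op) != GGML_UNARY_OP_CEIL) {
        return false;
    }
    // f32 and f16 pipelines exist; inplace uses the *_inplace variants
    return op->type == GGML_TYPE_F32 || op->type == GGML_TYPE_F16;
}
```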
* docs: update WebGPU ops support for CEIL
* cann: implement ADD + RMS_NORM operator fusion
This commit implements operator fusion for ADD + RMS_NORM operations
in the CANN backend to reduce memory access overhead and improve
performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION
environment variable (default: false).
Changes:
- Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm
- Add ggml_cann_can_fuse() to check fusion eligibility
- Integrate fusion logic into computation graph evaluation
- Add test cases for ADD + RMS_NORM fusion
- Update documentation with new environment variable
The fusion combines ADD and RMS_NORM into a single kernel call,
which is more efficient than executing them separately.
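A hedged sketch of the eligibility check behind `ggml_cann_can_fuse()` (the backend's actual conditions are likely stricter, e.g. also checking data types and tensor layouts):
```
#include "ggml.h"

// fuse node i (ADD) with node i+1 (RMS_NORM) only when the RMS_NORM
// consumes the ADD output directly
static bool cann_can_fuse_add_rms_norm(struct ggml_cgraph * gf, int i) {
    if (i + 1 >= ggml_graph_n_nodes(gf)) {
        return false;
    }
    struct ggml_tensor * add  = ggml_graph_node(gf, i);
    struct ggml_tensor * norm = ggml_graph_node(gf, i + 1);
    return add->op  == GGML_OP_ADD      &&
           norm->op == GGML_OP_RMS_NORM &&
           norm->src[0] == add;
}
```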
* Clarify setup steps for Linux
Added note that setup steps apply to Linux as well.
* Added note for backtick replacement
* clarify that backtick replacement only applies on Linux
* clarified Linux-specific steps
Some changes are in fact needed on Linux, but they are minor.
* clarify change execution
* clarify by placing info after steps
* clarify which steps
* Make instructions consistent across OSes
* Rm whitespace
* Update docs/backend/OPENCL.md
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* Update docs/backend/OPENCL.md
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* Update docs/backend/OPENCL.md
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* add count equal for metal
* remove trailing whitespace
* updated doc ops table
* changed shmem to i32
* added multi tg and templating
* removed BLAS support from Metal docs
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add memset to set dst to 0
* metal : cleanup
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility
* refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility
* refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity
* add comment
* refactor: remove redundant buffer checks in hexagon supported operations
* wip
* add missing include to fix weak symbol warning
* add ggml_hexagon_op_generic
* refactor: simplify tensor operation initialization and buffer management in hexagon implementation
* refactor: streamline hexagon operation initialization and buffer management
* refactor: update function signatures and streamline request handling in hexagon operations
* wip
* ggml-hexagon: clean up code formatting and improve unary operation handling
* wip
* rename
* fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations
* hexagon: fix merge conflicts
* hexagon: minor cleanup for buffer support checks
* hexagon: factor out op_desc and the overall op logging
* hexagon: further simplify and cleanup op dispatch logic
* snapdragon: update adb scripts to use llama-cli and llama-completion
* fix pipeline failure
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>