* fix vulkan ggml_acc, which worked only in 3d but not in 4d
* removed clamp in test_acc_block
* use the correct stride and its test case
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion, but keep the boundary check
* version without boundary check
* revert back to boundary check version
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Do not mutate cgraph for fused ADDs
1. We should try to minimize in-place changes to the incoming
ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
step as we store the properties before modifying the graph in-place
in the cuda-backend
* Assert ggml_tensor is trivially copyable
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
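The copy-instead-of-mutate idea above can be sketched roughly as follows. This is a minimal illustration, not the backend code: `node_like`, `OP_ADD`, `OP_FUSED_ADD` and `make_fused_view` are hypothetical names, and the real ggml_tensor has many more fields.

```cpp
#include <type_traits>

// Hypothetical stand-in for a graph node (assumption, not the real ggml_tensor).
struct node_like {
    int          op;      // operation id
    const void * src[2];  // non-owning pointers to the inputs
};

// A trivially copyable type can be duplicated with a plain struct copy.
static_assert(std::is_trivially_copyable<node_like>::value,
              "node_like must be trivially copyable");

constexpr int OP_ADD       = 1;  // hypothetical op ids
constexpr int OP_FUSED_ADD = 2;

// Build a modified copy that describes the fused op.
// The node owned by the incoming graph is left untouched.
inline node_like make_fused_view(const node_like & original, const void * extra_src) {
    node_like fused = original;  // shallow copy, valid because the type is trivially copyable
    fused.op     = OP_FUSED_ADD;
    fused.src[1] = extra_src;
    return fused;
}
```

Because the caller's node never changes, any properties recorded from the graph before the fusion step still match the graph afterwards.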
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).
The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
2549 | using ARegsT = typename Impl::ARegsT;
```
Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2]. Once this is fixed
upstream, version guards can be added so that the workaround is
applied only to the affected versions.
Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
* hexagon: add ARGSORT op
Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
* hexagon: argsort reject tensors with huge rows for now
* Adding support for DIV, SQR, SQRT, SUM_ROWS ops in hexagon backend
* hexagon : Add GEGLU op
* hexagon: fix editor config check
* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA
---------
Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>
CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, so it makes sense to update from the RC to
the final release.
https://github.com/NVIDIA/cccl/releases/tag/v3.2.0
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool
* Free pools
* Cleanup
* More cleanup
* Run clang-format
* Fix arg-parser and tokenizer test errors that free an unallocated buffer
* Fix device lost callback to not print on device teardown
* Fix include and run clang-format
* remove unused unused
* Update binary ops
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* First working version of GEMM and GEMV
* interleave loads and compute
* Clang-format
* Added missing fallback. Removed tested TODO.
* Swap M and N to be consistent with the repack template convention
Using the noexcept std::filesystem::directory_entry::is_regular_file
overload prevents abnormal termination when the status query fails
(as caused by symlinks to non-existent folders on Linux)
Resolves: #18560
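A minimal sketch of the non-throwing pattern, assuming a simple directory scan. The helper name `list_regular_files` is hypothetical and is not the function touched by this change.

```cpp
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Collect the names of regular files in `dir` without throwing on
// unreadable directories or dangling symlinks.
inline std::vector<std::string> list_regular_files(const fs::path & dir) {
    std::vector<std::string> out;
    std::error_code ec;
    fs::directory_iterator it(dir, ec);  // on failure `it` is the end iterator and ec is set
    const fs::directory_iterator end;
    while (!ec && it != end) {
        std::error_code entry_ec;
        // error_code overload: reports failure through entry_ec and returns false
        if (it->is_regular_file(entry_ec)) {
            out.push_back(it->path().filename().string());
        }
        it.increment(ec);  // non-throwing advance
    }
    return out;
}
```

A symlink that points to a missing folder is simply skipped here, where the throwing overload could raise std::filesystem::filesystem_error.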
Implement ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on CANN backend.
Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes
Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed
Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
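For orientation, the routing step can be sketched as a plain float reference on the CPU. This is an illustrative assumption, not the CANN code: it skips quantization, F16 conversion and the broadcast input mode, and `mul_mat_id_ref` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// weights: n_expert matrices of size rows x cols, row-major, stored back to back.
// ids[t]:  index of the expert chosen for token t.
// x:       n_tokens input vectors of length cols.
// Returns  n_tokens output vectors of length rows.
inline std::vector<float> mul_mat_id_ref(const std::vector<float> & weights,
                                         const std::vector<int> &   ids,
                                         const std::vector<float> & x,
                                         std::size_t rows, std::size_t cols) {
    const std::size_t n_tokens = ids.size();
    assert(x.size() == n_tokens * cols);
    std::vector<float> y(n_tokens * rows, 0.0f);
    for (std::size_t t = 0; t < n_tokens; ++t) {
        // "index select": pick the weight matrix of the routed expert
        const std::size_t expert = static_cast<std::size_t>(ids[t]);
        assert((expert + 1) * rows * cols <= weights.size());
        const float * w = weights.data() + expert * rows * cols;
        for (std::size_t r = 0; r < rows; ++r) {
            for (std::size_t c = 0; c < cols; ++c) {
                y[t * rows + r] += w[r * cols + c] * x[t * cols + c];
            }
        }
    }
    return y;
}
```

Each token is processed independently with the weights of its own expert, which mirrors the per-batch, per-expert processing listed above.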
* Rename variables + fix rope_neox
The memory layout seems to be shared with Vulkan, so we can port the fix from
https://github.com/ggml-org/llama.cpp/pull/19299
* Fix rope_multi
* Fix rope_vision
* Fix rope_norm
* Rename ne* to ne0* for consistent variable naming
* cont : consistent stride names
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml webgpu: port binary operators to use pre-wgsl
* Add binary.wgsl: unified shader with conditionals for all 4 ops
* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor
* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)
* Update CMake to generate binary operator shaders at build time
* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling
* port binary operators from AOT to pre-wgsl JIT compilation
* add src1=dst overlap handling for binary ops
* use compile-time workgroup size defines instead of runtime overrides
* ggml-webgpu: complete overlap handling for binary ops
* add support for inplace & overlap case in binding setup
* restructure conditional logic to handle all overlap cases
* ensure all buffer bindings are correctly assigned for edge cases
* ggml-webgpu: remove unused binary overlap cases
Remove src0==src1 binary overlap case that never occurs in practice.
* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT
* remove unused src0==src1 and all-same variant
* refactor wgsl to eliminate duplication
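The three remaining cases could be selected along these lines. This is a hedged sketch: `binary_variant` and `pick_binary_variant` are hypothetical names, and comparing raw pointers stands in for whatever buffer and offset comparison the backend actually performs.

```cpp
// Shader variants kept after the cleanup (names taken from the list above).
enum class binary_variant { DEFAULT, INPLACE, OVERLAP };

// Decide which variant to use from the operand addresses.
inline binary_variant pick_binary_variant(const void * src0,
                                          const void * src1,
                                          const void * dst) {
    if (src0 == dst) {
        return binary_variant::INPLACE;  // dst aliases the first operand
    }
    if (src1 == dst) {
        return binary_variant::OVERLAP;  // dst aliases the second operand
    }
    return binary_variant::DEFAULT;      // three distinct buffers
}
```

In this sketch a call where all three operands alias falls into the INPLACE branch, since no dedicated all-same variant exists.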
The cpu and cuda backends use fp16 for the VKQ accumulator type, this change
does the same for vulkan. This helps particularly with large head sizes which
are very register-limited.
I tried this for the coopmat1 path and it was slightly slower. I didn't try it
for the scalar path.
I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.
Apply this optimization when the mask is relatively large (i.e. prompt
processing).
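A rough CPU sketch of such a per-block classification pass. The message does not name the two cases, so this assumes they are "all zero" and "all negative infinity". The code values, the 1-D block shape and the name `classify_mask_blocks` are illustrative assumptions as well.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed 2-bit codes (hypothetical values).
enum : std::uint8_t { MASK_MIXED = 0, MASK_ALL_ZERO = 1, MASK_ALL_NEG_INF = 2 };

// Emit one code per block of `block` consecutive mask values.
inline std::vector<std::uint8_t> classify_mask_blocks(const std::vector<float> & mask,
                                                      std::size_t block) {
    assert(block > 0 && mask.size() % block == 0);
    std::vector<std::uint8_t> codes;
    for (std::size_t b = 0; b < mask.size(); b += block) {
        bool all_zero    = true;
        bool all_neg_inf = true;
        for (std::size_t i = b; i < b + block; ++i) {
            all_zero    = all_zero    && (mask[i] == 0.0f);
            all_neg_inf = all_neg_inf && (std::isinf(mask[i]) && mask[i] < 0.0f);
        }
        codes.push_back(all_zero ? MASK_ALL_ZERO
                                 : all_neg_inf ? MASK_ALL_NEG_INF : MASK_MIXED);
    }
    return codes;
}
```

A consumer would read the code first and only load the mask values for blocks marked MASK_MIXED.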
* vulkan: fix GPU deduplication logic.
As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for Windows + Intel iGPU.
Let's just avoid filtering for MoltenVK, which is Apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.
Verified that macOS + 4xVega still reports 4 GPUs with this version.
* vulkan: only skip dedup when both drivers are MoltenVK
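The resulting rule can be sketched as below. This is a simplified illustration under stated assumptions: `gpu_info` and `dedup_gpus` are hypothetical, the driver is represented by a plain string, and the first device seen is kept, whereas the real code may prefer one driver over another.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical device record (not the Vulkan backend's own type).
struct gpu_info {
    std::array<std::uint8_t, 16> uuid;
    std::string                  driver;
};

// Drop devices whose UUID was already seen, unless both entries
// come from MoltenVK, in which case both are kept.
inline std::vector<gpu_info> dedup_gpus(const std::vector<gpu_info> & devices) {
    std::vector<gpu_info> kept;
    for (const auto & dev : devices) {
        bool duplicate = false;
        for (const auto & k : kept) {
            const bool both_moltenvk = dev.driver == "MoltenVK" && k.driver == "MoltenVK";
            if (!both_moltenvk && dev.uuid == k.uuid) {
                duplicate = true;
                break;
            }
        }
        if (!duplicate) {
            kept.push_back(dev);
        }
    }
    return kept;
}
```

With this rule, several MoltenVK devices that report the same UUID all stay visible, while the same GPU exposed through two different drivers is listed once.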
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function
* ggml-virtgpu: deprecate buffer_type is_host remoting
not necessary
* ggml-virtgpu: stop using static vars as cache
The static init isn't thread safe.
* ggml-virtgpu: protect the use of the shared memory to transfer data
* ggml-virtgpu: make the remote calls thread-safe
* ggml-virtgpu: backend: don't continue if the tensor memory couldn't be allocated
* ggml-virtgpu: add a cleanup function for consistency
* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing
* fix style and ordering
* Remove the static variable in apir_device_get_count
* ggml-virtgpu: improve the logging
* fix review minor formatting changes