CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, and it makes sense to update from the RC to
the final released version.
https://github.com/NVIDIA/cccl/releases/tag/v3.2.0
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool
* Free pools
* Cleanup
* More cleanup
* Run clang-format
* Fix arg-parser and tokenizer test errors caused by freeing an unallocated buffer
* Fix device lost callback to not print on device teardown
* Fix include and run clang-format
* remove unused code
* Update binary ops
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* support Qwen3.5 series
* remove deepstack for now, and some code cleanup
* code cleanup
* add FULL_ATTENTION_INTERVAL metadata
* code cleanup
* reorder v heads for linear attention to avoid expensive interleaved repeat
* First working version of GEMM and GEMV
* interleave loads and compute
* Clang-format
* Added missing fallback. Removed tested TODO.
* Swap M and N to be consistent with the repack template convention
Using the noexcept std::filesystem::directory_entry::is_regular_file
overload prevents abnormal termination when an error is thrown
(as caused by symlinks to non-existent folders on Linux).
Resolves: #18560
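A minimal sketch of the difference, assuming a plain directory scan over entries that may include dangling symlinks (the function below is illustrative, not the actual llama.cpp code):

```cpp
#include <filesystem>
#include <iostream>
#include <system_error>

namespace fs = std::filesystem;

// List regular files in `dir`, skipping entries that cannot be inspected
// (e.g. symlinks whose targets do not exist).
void list_regular_files(const fs::path & dir) {
    for (const fs::directory_entry & entry : fs::directory_iterator(dir)) {
        std::error_code ec;
        // The throwing overload entry.is_regular_file() can raise
        // std::filesystem::filesystem_error, which terminates the program if
        // uncaught; the noexcept overload reports the problem through `ec`
        // instead, so a broken symlink is simply skipped.
        if (!entry.is_regular_file(ec)) {
            continue;
        }
        std::cout << entry.path().string() << "\n";
    }
}
```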
Implement the ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on the CANN backend.
Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes
Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed
Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
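As a rough illustration of the routing and dequantization math described above (a CPU toy, not the CANN implementation; the real backend performs the gather with IndexSelect and the matmul with WeightQuantBatchMatmulV2 on the device):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// CPU toy: each token selects an expert by index, the expert's int8 weights
// are used with a per-row scale (a crude stand-in for Q8_0 blocks), and the
// product is accumulated in float. Only the math is meant to be indicative.
struct expert_weights {
    std::vector<int8_t> w;      // [rows x cols] quantized weights
    std::vector<float>  scale;  // one scale per row (simplified blocking)
};

// y[token] = W[expert_ids[token]] * x[token]
void mul_mat_id_toy(const std::vector<expert_weights> & experts,
                    const std::vector<float> & x,          // [n_tokens x cols]
                    const std::vector<int>   & expert_ids, // routing indices
                    std::vector<float>       & y,          // [n_tokens x rows]
                    int rows, int cols) {
    const int n_tokens = (int) expert_ids.size();
    y.assign((size_t) n_tokens * rows, 0.0f);
    for (int t = 0; t < n_tokens; ++t) {
        const expert_weights & e = experts[expert_ids[t]];  // "IndexSelect"
        for (int r = 0; r < rows; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < cols; ++c) {
                acc += (float) e.w[(size_t) r * cols + c] * x[(size_t) t * cols + c];
            }
            y[(size_t) t * rows + r] = acc * e.scale[r];    // dequantize via scale
        }
    }
}

int main() {
    // 2 experts with 2x3 weights each; 2 tokens routed to experts 1 and 0
    std::vector<expert_weights> experts = {
        { { 1, 0, 0,  0, 1, 0 }, { 0.5f, 0.5f } },
        { { 2, 2, 2,  1, 1, 1 }, { 0.1f, 0.1f } },
    };
    std::vector<float> x   = { 1, 2, 3,  4, 5, 6 };
    std::vector<int>   ids = { 1, 0 };
    std::vector<float> y;
    mul_mat_id_toy(experts, x, ids, y, 2, 3);
    for (float v : y) printf("%g ", v);
    printf("\n");
}
```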
* ci : add metal server workflows
* cont : try fix python init
* cont : move to a separate workflow that runs only on master
* cont : fix num jobs
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml-virtgpu: add backend documentation
Assisted-by-AI: Claude Code
* CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget
* README: add the link to docs/backend/GGML-VirtGPU/ggml-virt.md
* docs/ggml-virt: add link to testing + configuration
* Revert "CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget"
This reverts commit 8ece8e72e2.
* drop the ggml- prefix
* s/ggerganov/ggml-org
* Relocate VirtGPU.md
* reorganize the text
* turn the ASCII diagram into a Mermaid diagram
* README.md: update the link to the main doc
* Unified delta net handling
* Remove old methods.
* Refactor and optimize
* Adapt autoregressive version from @ymcki
* Change to decay mask approach (see the sketch below)
* Fix bad permute
* Qwen 3.5 support
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Further fixes
* Use inheritance, remove unneeded conts
* Not like this!
* Remove ggml.h explicit import
* Remove transformers, fix the views
* ACTUALLY fix views, make super calls explicit in conversion.
* Fix conversion again
* Remove extra ggml.h imports
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
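As a generic sketch of the decay-mask idea used in gated linear-attention variants (a simplified toy, not necessarily the exact formulation used here): per-token decay factors are expanded into a lower-triangular mask so the intra-chunk contribution can be computed as a masked matmul instead of a token-by-token recurrence.

```cpp
#include <cstdio>
#include <vector>

// Toy decay mask for chunked linear attention:
//   D[i][j] = prod_{k = j+1 .. i} decay[k]  for j <= i, and 0 otherwise,
// where decay[k] in (0, 1] is the per-token decay. With this mask the
// intra-chunk output can be written as (Q K^T elementwise-multiplied by D) V,
// replacing an explicit recurrence over tokens.
std::vector<float> build_decay_mask(const std::vector<float> & decay) {
    const int n = (int) decay.size();
    std::vector<float> D((size_t) n * n, 0.0f);
    for (int i = 0; i < n; ++i) {
        D[(size_t) i * n + i] = 1.0f;  // no decay from a token to itself
        for (int j = i - 1; j >= 0; --j) {
            // accumulate decay over positions j+1 .. i
            D[(size_t) i * n + j] = D[(size_t) i * n + j + 1] * decay[j + 1];
        }
    }
    return D;
}

int main() {
    std::vector<float> decay = { 1.0f, 0.9f, 0.8f, 0.5f };
    std::vector<float> D = build_decay_mask(decay);
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) printf("%6.3f ", D[i * 4 + j]);
        printf("\n");
    }
}
```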
* Rename variables + fix rope_neox
The memory layout seems to be shared with Vulkan, so we can port the fix from
https://github.com/ggml-org/llama.cpp/pull/19299
* Fix rope_multi
* Fix rope_vision
* Fix rope_norm
* Rename ne* to ne0* for consistent variable naming
* cont : consistent stride names
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* gguf-py: Bump sentencepiece version
A new version has been out for a while that addresses the issues mentioned in https://github.com/ggml-org/llama.cpp/pull/14200. There is a long chain of reasons I would like this change, but the short version is that it allows people who use both `sentencepiece` and `gguf` to take advantage of these fixes. On conda-forge, the current pin locks the `sentencepiece` version (since there is no notion of optional dependencies).
Regardless, I don't think this should be too controversial.
* review feedback
* ggml webgpu: port binary operators to use pre-wgsl
* Add binary.wgsl: unified shader with conditionals for all 4 ops
* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor
* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)
* Update CMake to generate binary operator shaders at build time
* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling
* port binary operators from AOT to pre-wgsl JIT compilation
* add src1=dst overlap handling for binary ops
* use compile-time workgroup size defines instead of runtime overrides
* ggml-webgpu: complete overlap handling for binary ops
* add support for inplace & overlap case in binding setup
* restructure conditional logic to handle all overlap cases
* ensure all buffer bindings are correctly assigned for edge cases
* ggml-webgpu: remove unused binary overlap cases
Remove src0==src1 binary overlap case that never occurs in practice.
* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT (see the sketch below)
* remove unused src0==src1 and all-same variant
* refactor wgsl to eliminate duplication
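A simplified sketch of the variant selection described above (the enum and function names are illustrative placeholders, not the actual ggml-webgpu code): the binding setup picks a shader variant based on which buffers alias.

```cpp
#include <cstdio>

// Classify buffer aliasing for a binary op dst = op(src0, src1). The actual
// backend selects a JIT-compiled pre-wgsl variant and sets up the buffer
// bindings accordingly; this only shows the case analysis.
enum class binary_variant {
    DEFAULT,  // all three buffers distinct: bind src0, src1, dst separately
    INPLACE,  // src0 == dst: the op reads from and writes to the same buffer
    OVERLAP,  // src1 == dst: src1 must be bound as the writable output buffer
};

binary_variant classify_binary(const void * src0, const void * src1, const void * dst) {
    if (src0 == dst) {
        return binary_variant::INPLACE;
    }
    if (src1 == dst) {
        return binary_variant::OVERLAP;
    }
    // src0 == src1 with a distinct dst needs no special handling: both inputs
    // can be bound read-only, so it falls through to DEFAULT (this is the
    // unused variant removed in the commits above).
    return binary_variant::DEFAULT;
}

int main() {
    float a[4], b[4], c[4];
    printf("%d\n", (int) classify_binary(a, b, c)); // DEFAULT
    printf("%d\n", (int) classify_binary(a, b, a)); // INPLACE
    printf("%d\n", (int) classify_binary(a, b, b)); // OVERLAP
}
```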