Reuse the buffer for the ggml context which is used for creating the
compute graph on the server side. This partially addresses a memory leak
created by the CUDA backend due to using buffer addresses as cache
keys.
ref: #21265
ref: #20315
* ci : add AMD CPU label to PR labeler
Add automatic labeling for PRs that modify AMD CPU (ZenDNN) backend files
* ci : rename label AMD CPU to AMD ZenDNN in labeler config
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Bump ROCm version on Linux from 7.2 to 7.2.1
Add gfx1102 target
Delete LLVM workaround since ROCm 7.2.1 has fix for ROCm 7.2 perf regression https://github.com/ROCm/rocm-systems/issues/2865
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add unit test coverage for llama_tensor_get_type
* Fix merge conflicts, add more schemas
* clang formatter changes
* Trailing whitespace
* Update name
* Start rebase
* Updating files with upstream changes prior to rebase
* Changes needed from rebase
* Update attn_qkv schema, change throw behaviour
* Fix merge conflicts
* White space
* Update with latest changes to state counters
* Revert accidental personal CLAUDE.md changes
* Change quotation mark
* Reuse metadata.name since we have it
* Move test-only stuff out of llama-quant.cpp
* Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns
* cont : inital deslop guidelines
* Cleanup based on review comments
* Continue cleanup
* Small cleanup
* Manually set proper ordering of tensors, mostly applies to gemma
* Formatting
* Update tests/test-quant-type-selection.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix merge conflicts
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* tests: allow exporting graph ops from HF file without downloading weights
* use unique_ptr for llama_context in HF metadata case
* fix missing non-required tensors falling back to type f32
* use unique pointers where possible
* use no_alloc instead of fixing f32 fallback
* fix missing space
* Relax prefill parser to allow space.
* Move changes from prefix() to parser generation
* Only allow spaces if we're not having a pure content parser next
* chat : add Granite 4.0 chat template with correct tool_call role mapping
Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite
3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`).
The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the
`assistant_tool_call` role to `<|start_of_role|>assistant<|end_of_role|><|tool_call|>`.
Without a matching C++ handler, the fallback path emits the literal role
`assistant_tool_call` which the model does not recognize, breaking tool
calling when `--jinja` is not used.
Changes:
- Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X`
(preserves existing 3.x behavior unchanged)
- Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler
- Detection: `<|start_of_role|>` + (`<tool_call>` or `<tools>`) → 4.0,
otherwise → 3.x
- Add production Granite 4.0 Jinja template
- Add tests for both 3.x and 4.0 template paths (C++ and Jinja)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Code review: follow standard format and use common logic in test-chat-template.cpp
* Rename custom_conversation variable for extra_conversation to give it a more meaningful name
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* hexagon : add cumsum op support
* hexagon: enable dma for cumsum op
* Fix line-ending
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* fix: Bypass API Key validation for static bundle assets
* refactor: All bypassed routes in `public_endpoints`
* test: Update static assets API Key test
* kleidiai: add cpu feature detection to CI run script
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I663adc3a7691a98e7dac5488962c13cc344f034a
* kleidiai: revert unrelated requirements change
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
* kleidiai: removed cpu feature detection from CI run script
* As per the maintainers' suggestion, removed cpu feature detection
from CI run script as CMake handles it already
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
---------
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
* Introduced NVFP4 generic MMQ kernel
* Added extra FP8 guard, hope to solve ci HIP failure
* Rename tiles and use HIP_FP8_AVAILABLE
* Removed remaning FP8 straggler and added const int
* Const
* Removed DECL_MMQ_CASE artifact
* Removed newline
* Removed space after else
* Changed HIP FP8 NVFP4 conversion gate
* Added new line to bottom of mmq.cu 270
* Removed extra spaces
* Removed single space in front of else on line 814
* Added NVFP4 to generate cu script so HIP can see it, further tightened logic
* Include generated mmq-instance-nvfp4.cu
* Added NVFP4 mmq to HIP Check ignore list
* Update ggml/src/ggml-cuda/mmq.cuh
Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4 in tile assert
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Added function name ending for end if
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Added function names to closing endif
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
The hybrid memory paths (`llama-memory-hybrid.cpp` and
`llama-memory-hybrid-iswa.cpp`) always used sequential equal split,
ignoring the unified KV cache flag. This caused hellaswag, winogrande,
and multiple-choice evaluations to fail on hybrid models (models with
both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with:
split_equal: sequential split is not supported when there are
coupled sequences in the input batch (you may need to use the
-kvu flag)
PR #19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically
enabling unified KV mode and setting n_parallel >= 4 for multi-choice
eval tasks. However, the hybrid memory paths were not updated.
This commit mirrors the iswa fix: use non-sequential split when KV
cache is unified (n_stream == 1), which is automatically set by
llama-perplexity for hellaswag/winogrande/multiple-choice since #19954.
Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model):
- HellaSwag: 83.0% (400 tasks)
- Winogrande: 74.5% (400 tasks)
- MMLU: 41.2%
- ARC-Challenge: 56.2%
- TruthfulQA: 37.7%
All previously failed with llama_decode() error.
The conditions cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE and cc >= GGML_CUDA_CC_TURING match all non-nvidia devices. This causes us to attempt to launch the kernel for batch sizes with larger configurations than our launch bounds on HIP devices. This pr fixes the conditionals in get_mmvq_mmid_max_batch.
Fixes#21191
* flash attention support for head dimension 512 added
* FA D=512 - match 576 configs, limit ncols2, revert vec cap
* fix HIP tile kernel build for D=512
* fix HIP tile kernel occupancy for D=512 on AMD
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* fix tile FA compilation
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Refactor llama_model_quantize_params to expose a pure C interface
* Restore comment and cleanup struct def
* Code review refactoring
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Code review refactoring
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for
old AOT template expansion