8642 Commits

Author SHA1 Message Date
Slobodan Josic 7c7d6ce5c7
[HIP] Bump ROCm version to 7.2.1 (#21066)
Bump ROCm version on Linux from 7.2 to 7.2.1
Add gfx1102 target
Delete LLVM workaround since ROCm 7.2.1 has fix for ROCm 7.2 perf regression https://github.com/ROCm/rocm-systems/issues/2865

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-03 00:59:20 +02:00
Piotr Wilkin (ilintar) 5208e2d5ba
fix: gemma 4 template (#21326) 2026-04-02 23:31:02 +02:00
Bartowski 7992aa7c8e
tests : add unit test coverage for llama_tensor_get_type (#20112)
* Add unit test coverage for llama_tensor_get_type

* Fix merge conflicts, add more schemas

* clang formatter changes

* Trailing whitespace

* Update name

* Start rebase

* Updating files with upstream changes prior to rebase

* Changes needed from rebase

* Update attn_qkv schema, change throw behaviour

* Fix merge conflicts

* White space

* Update with latest changes to state counters

* Revert accidental personal CLAUDE.md changes

* Change quotation mark

* Reuse metadata.name since we have it

* Move test-only stuff out of llama-quant.cpp

* Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns

* cont : initial deslop guidelines

* Cleanup based on review comments

* Continue cleanup

* Small cleanup

* Manually set proper ordering of tensors, mostly applies to gemma

* Formatting

* Update tests/test-quant-type-selection.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix merge conflicts

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-02 22:53:58 +02:00
Zheyuan Chen a1cfb64530
ggml-webgpu: add vectorized flash attention (#20709)
* naive vectorized version

* add vectorized flash attention

* update vec version

* remove unused path and shader

* remove unused helper functions

* add comments

* remove pad path

* ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization

* change back to vec4

* enable multi split

* enable vec path when:
- Q->ne[1] < 20
- Q->ne[0] % 32 == 0
- V->ne[0] % 4 == 0
- K->type == f16

* update flash_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select

* enable vec path for q4 and q8

* flash-attn vec nwg=1 fast path (skip tmp/reduce staging)

* use packed f16 K loads in flash-attn vec split

* use packed f16 K loads in flash-attn vec split on host side

* tune flash-attn vec f16 VEC_NE by head dim

* cleanup

* cleanup

* keep host side clean

* cleanup host side

* change back to original host wait/submit behavior

* formatting

* reverted param-buffer pool refactor

* add helper functions

* ggml-webgpu: move flash-attn vec pipeline caching back into shader lib

* ggml-webgpu: remove duplicate functions

* ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation

* ggml-webgpu: revert unrelated change

* ggml-webgpu: revert deleted comment

* disable uniformity check

* remove unnecessary change

* Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl

* Update ggml/src/ggml-webgpu/ggml-webgpu.cpp

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-04-02 10:40:42 -07:00
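
The vec-path conditions listed in the commit above can be read as a single predicate. The following is a minimal sketch under stated assumptions: `tensor_shape`, `kv_type`, and `use_flash_attn_vec_path` are hypothetical names invented for illustration, not ggml-webgpu's actual types or functions. It encodes only the four conditions as written; a later bullet in the same commit enables the vec path for q4 and q8, so the final K-type check is presumably broader than shown here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the tensor fields the conditions refer to.
// These are NOT the real ggml types.
enum class kv_type { f16, f32, q4_0, q8_0 };

struct tensor_shape {
    int64_t ne[4];  // number of elements in each dimension
    kv_type type;   // element type
};

// Returns true when all four conditions from the commit message hold:
//   Q->ne[1] < 20, Q->ne[0] % 32 == 0, V->ne[0] % 4 == 0, K->type == f16
bool use_flash_attn_vec_path(const tensor_shape & Q,
                             const tensor_shape & K,
                             const tensor_shape & V) {
    const bool few_query_rows = Q.ne[1] < 20;
    const bool q_dim_aligned  = Q.ne[0] % 32 == 0;
    const bool v_dim_vec4     = V.ne[0] % 4 == 0;
    const bool k_is_f16       = K.type == kv_type::f16;
    return few_query_rows && q_dim_aligned && v_dim_vec4 && k_is_f16;
}
```

With these definitions, a single-token decode step with a 128-wide Q and V and an f16 K would take the vec path, while a 32-row prompt batch would not.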
Ruben Ortlam 5803c8d115
tests: allow exporting graph ops from HF file without downloading weights (#21182)
* tests: allow exporting graph ops from HF file without downloading weights

* use unique_ptr for llama_context in HF metadata case

* fix missing non-required tensors falling back to type f32

* use unique pointers where possible

* use no_alloc instead of fixing f32 fallback

* fix missing space
2026-04-02 18:19:20 +02:00
Xuan-Son Nguyen 63f8fe0ef4
model, mtmd: fix gguf conversion for audio/vision mmproj (#21309)
* fix gguf conversion for audio/vision mmproj

* fix test
2026-04-02 17:10:32 +02:00
Aldehir Rojas 223373742b
common : add commentary rules for gpt-oss-20b (#21286) 2026-04-02 08:59:59 -05:00
Piotr Wilkin (ilintar) e15efe007d
Relax prefill parser to allow space. (#21240)
* Relax prefill parser to allow space.

* Move changes from prefix() to parser generation

* Only allow spaces if we're not having a pure content parser next
2026-04-02 11:29:11 +02:00
Jesus Talavera 6137c325a1
chat : add Granite 4.0 chat template with correct tool_call role mapping (#20804)
* chat : add Granite 4.0 chat template with correct tool_call role mapping

Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite
3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`).

The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the
`assistant_tool_call` role to `<|start_of_role|>assistant<|end_of_role|><|tool_call|>`.
Without a matching C++ handler, the fallback path emits the literal role
`assistant_tool_call` which the model does not recognize, breaking tool
calling when `--jinja` is not used.

Changes:
- Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X`
  (preserves existing 3.x behavior unchanged)
- Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler
- Detection: `<|start_of_role|>` + (`<tool_call>` or `<tools>`) → 4.0,
  otherwise → 3.x
- Add production Granite 4.0 Jinja template
- Add tests for both 3.x and 4.0 template paths (C++ and Jinja)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Code review: follow standard format and use common logic in test-chat-template.cpp

* Rename custom_conversation variable to extra_conversation to give it a more meaningful name

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 11:28:56 +02:00
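
The detection rule described in the Granite commit above (`<|start_of_role|>` plus either `<tool_call>` or `<tools>` selects 4.0, otherwise 3.x) can be sketched as a small string check. This is an illustration only: `granite_template`, `contains`, and `detect_granite_template` are hypothetical names, and the real detection code in llama.cpp may be structured differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum mirroring the two template ids named in the commit message.
enum class granite_template { none, granite_3_x, granite_4_0 };

static bool contains(const std::string & haystack, const std::string & needle) {
    return haystack.find(needle) != std::string::npos;
}

// Classifies a chat template source string using the rule from the commit message.
granite_template detect_granite_template(const std::string & tmpl) {
    if (!contains(tmpl, "<|start_of_role|>")) {
        return granite_template::none;        // not a Granite-style template
    }
    if (contains(tmpl, "<tool_call>") || contains(tmpl, "<tools>")) {
        return granite_template::granite_4_0; // role marker plus tool tags
    }
    return granite_template::granite_3_x;     // role marker only
}
```

Checking the role marker first keeps the existing 3.x behavior as the default, which matches the commit's stated goal of leaving 3.x unchanged.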
Georgi Gerganov 17193cce34
kv-cache : do not quantize SWA KV cache (#21277) 2026-04-02 11:54:05 +03:00
Roger Chen d6dac92bfd
Ignore Transfer-Encoding header. (#20269) 2026-04-02 10:41:19 +02:00
Georgi Gerganov dae2bf41c9 sync : ggml 2026-04-02 10:39:00 +03:00
Georgi Gerganov bc07d55922 ggml : bump version to 0.9.11 (ggml/1456) 2026-04-02 10:39:00 +03:00
Neo Zhang 4888137b17
sycl : fix llama_kv_cache hang when kv_cache is huge: 5GB (#21283) 2026-04-02 10:08:32 +03:00
Todor Boinovski fbd441c379
hexagon : add cumsum op support (#21246)
* hexagon : add cumsum op support

* hexagon: enable dma for cumsum op

* Fix line-ending

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-04-01 17:44:02 -07:00
Xuan-Son Nguyen c30e012253
contrib : rewrite AGENTS.md, make it more clear about project values (#21270)
* contrib : rewrite AGENTS.md, make it more clear about types of permitted AI usage

* permit AI for writing code
2026-04-01 23:31:51 +02:00
lhez 95a6ebabb2
opencl: fix leak in Adreno q8_0 path (#21212) 2026-04-01 12:54:58 -07:00
Aleksander Grygier 12dbf1da95
server: Bypass API Key validation for WebUI static bundle assets (#21269)
* fix: Bypass API Key validation for static bundle assets

* refactor: All bypassed routes in `public_endpoints`

* test: Update static assets API Key test
2026-04-01 21:32:15 +02:00
Johannes Gäßler 86221cf6da
CUDA: fix FA kernel selection logic (#21271) 2026-04-01 22:28:19 +03:00
Martin Klacer 6de97b9d3e
kleidiai: add CPU feature detection to CI run script (#20394)
* kleidiai: add cpu feature detection to CI run script

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I663adc3a7691a98e7dac5488962c13cc344f034a

* kleidiai: revert unrelated requirements change

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

* kleidiai: removed cpu feature detection from CI run script

 * As per the maintainers' suggestion, removed cpu feature detection
   from CI run script as CMake handles it already

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
2026-04-01 20:02:41 +03:00
Nikhil Jain 5a0ed5150a
Update Dawn version in WebGPU CI (#20784)
* Pin Dawn version

* Update docs with new Dawn commit hash
2026-04-01 09:53:05 -07:00
Aparna M P 8710e5f9b9
hexagon: improve RMS_NORM and DIV accuracy (#21251)
* hexagon-rms_norm: fix RMS_NORM for non-aligned tensor sizes

Co-authored-by: Krishna Sridhar <srsr@qti.qualcomm.com>

* hexagon-div: perform DIV in fp16 domain for lower dsp archs

---------

Co-authored-by: Krishna Sridhar <srsr@qti.qualcomm.com>
2026-04-01 08:43:08 -07:00
Jonathan 1d6d4cf7a5
fix: tool call parsing for LFM2 and LFM2.5 models (#21242)
* fix: tool call parsing for LFM2 and LFM2.5 models

* refactor: add test / break out lfm2 and lfm2.5 parsing logic
2026-04-01 16:22:44 +02:00
Georgi Gerganov 744c0c7310
llama : rotate activations for better quantization (#21038)
* llama : rotate activations for better quantization

* cont : rotate V more + refactor

* cont : rotate caches separately + support non-power-of-2 head sizes

* cont : simplify

* cont : add reference for V rotation

* cont : refactor

* cont : support context shift

* cont : consolidate

* cont : dedup + allow different types for the rotation matrix

* cont : add env variable to disable rotation

* cont : simplify attn rot kv cache logic + rename env

* cont : pre-compute the Hadamard matrices
2026-04-01 16:58:01 +03:00
Xuan-Son Nguyen 0356e33aaf
scripts: add function call test script (#21234)
* scripts: add function call test script

* add reasoning_content

* fix lint
2026-04-01 15:31:58 +02:00
Georgi Gerganov 6422036fcb sync : ggml 2026-04-01 16:03:17 +03:00
Georgi Gerganov 296bc0538b ggml : bump version to 0.9.10 (ggml/1454) 2026-04-01 16:03:17 +03:00
Neo Zhang 6b949d1078
sycl : support nvfp4 type in mul_mat (#21227) 2026-04-01 13:54:15 +03:00
Michael Wand 84f82e846c
ggml-cuda: Add generic NVFP4 MMQ kernel (#21074)
* Introduced NVFP4 generic MMQ kernel

* Added extra FP8 guard, hope to solve ci HIP failure

* Rename tiles and use HIP_FP8_AVAILABLE

* Removed remaining FP8 straggler and added const int

* Const

* Removed DECL_MMQ_CASE artifact

* Removed newline

* Removed space after else

* Changed HIP FP8 NVFP4 conversion gate

* Added new line to bottom of mmq.cu 270

* Removed extra spaces

* Removed single space in front of else on line 814

* Added NVFP4 to generate cu script so HIP can see it, further tightened logic

* Include generated mmq-instance-nvfp4.cu

* Added NVFP4 mmq to HIP Check ignore list

* Update ggml/src/ggml-cuda/mmq.cuh

Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/mmq.cuh

Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4 in tile assert

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/mmq.cuh

Added function name ending for end if

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Added function names to closing endif

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-01 12:04:58 +02:00
Ettore Di Giacinto e1cb817483
memory: respect unified KV cache in hybrid memory for eval tasks (#21224)
The hybrid memory paths (`llama-memory-hybrid.cpp` and
`llama-memory-hybrid-iswa.cpp`) always used sequential equal split,
ignoring the unified KV cache flag. This caused hellaswag, winogrande,
and multiple-choice evaluations to fail on hybrid models (models with
both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with:

  split_equal: sequential split is not supported when there are
  coupled sequences in the input batch (you may need to use the
  -kvu flag)

PR #19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically
enabling unified KV mode and setting n_parallel >= 4 for multi-choice
eval tasks. However, the hybrid memory paths were not updated.

This commit mirrors the iswa fix: use non-sequential split when KV
cache is unified (n_stream == 1), which is automatically set by
llama-perplexity for hellaswag/winogrande/multiple-choice since #19954.

Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model):
- HellaSwag: 83.0% (400 tasks)
- Winogrande: 74.5% (400 tasks)
- MMLU: 41.2%
- ARC-Challenge: 56.2%
- TruthfulQA: 37.7%
All previously failed with llama_decode() error.
2026-04-01 12:50:17 +03:00
uvos 88d5f8ffc3
CUDA/HIP: Fix kernel selection for mmvq mmid kernel to align host selection with device launch bounds (#21238)
The conditions cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE and cc >= GGML_CUDA_CC_TURING match all non-NVIDIA devices. This causes us to attempt to launch the kernel for batch sizes with larger configurations than our launch bounds on HIP devices. This PR fixes the conditionals in get_mmvq_mmid_max_batch.

Fixes #21191
2026-04-01 10:21:20 +02:00
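
Why a plain `cc >= ...` comparison can match non-NVIDIA devices is easier to see with numbers. The sketch below assumes that non-NVIDIA devices are encoded by adding a large vendor offset to the compute-capability value. All constants and function names here are illustrative assumptions, not copied from ggml-cuda, and the guard shown is only one plausible shape for the fix, not necessarily what the PR does.

```cpp
#include <cassert>

// Illustrative values (assumptions): NVIDIA compute capabilities are small
// integers, and other vendors are encoded as offset + architecture id.
constexpr int CC_VOLTA        = 700;
constexpr int CC_TURING       = 750;
constexpr int CC_ADA_LOVELACE = 890;
constexpr int CC_OFFSET_AMD   = 0x1000000;

// Simplified vendor check for this sketch.
constexpr bool is_nvidia(int cc) { return cc < CC_OFFSET_AMD; }

// Unguarded condition: any offset-encoded value is >= CC_ADA_LOVELACE,
// so every non-NVIDIA device matches by accident.
constexpr bool wide_batch_unguarded(int cc) {
    return cc == CC_VOLTA || cc >= CC_ADA_LOVELACE;
}

// Guarded condition: the architecture comparison only applies to NVIDIA devices.
constexpr bool wide_batch_guarded(int cc) {
    return is_nvidia(cc) && (cc == CC_VOLTA || cc >= CC_ADA_LOVELACE);
}
```

Under these assumptions an AMD device with id `CC_OFFSET_AMD + 0x1030` passes the unguarded check but fails the guarded one, while NVIDIA results are unchanged.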
Georgi Gerganov d43375ff7f
ggml : fix RWKV ops thread assignment (#21226) 2026-04-01 11:10:25 +03:00
Taimur Ahmad 2b86e5cae6
ggml-cpu: fix fallback for RVV kernels without zvfh (#21157)
* ggml-cpu: refactor sgemm; fix rvv checks

* ggml-cpu: refactor rvv kernels; set zvfbfwma default to off
2026-04-01 11:10:03 +03:00
Anav Prasad 88458164c7
CUDA: Add Flash Attention Support for Head Dimension 512 (#20998)
* flash attention support for head dimension 512 added

* FA D=512 - match 576 configs, limit ncols2, revert vec cap

* fix HIP tile kernel build for D=512

* fix HIP tile kernel occupancy for D=512 on AMD

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix tile FA compilation

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-01 09:07:24 +02:00
Ed Addario 4951250235
llama : refactor llama_model_quantize_params to expose a pure C interface (#20346)
* Refactor llama_model_quantize_params to expose a pure C interface

* Restore comment and cleanup struct def

* Code review refactoring

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Code review refactoring

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-01 08:43:00 +03:00
Reese Levine 82764c341a
ggml webgpu: quantized buffers to u32 + wider browser/device support (#21046)
* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs
2026-04-01 08:38:24 +03:00
Abhijit Ramesh 825eb91a66
ggml-webgpu: port all AOT operators to JIT (#20728)
* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for old AOT template expansion
2026-03-31 15:38:16 -07:00
Aleksander Grygier 0fcb3760b2
fix: Use lower-case proxy headers naming (#21235) 2026-03-31 17:47:46 +02:00
Adrien Gallouët 6307ec07d3
common : cleanup logs and modernize the progress bar (#21215)
```
$ build/bin/llama-server -hf unsloth/Qwen3.5-0.8B-GGUF
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
Downloading mmproj-BF16.gguf ——————————————————————————————————————— 100%
Downloading Qwen3.5-0.8B-Q4_K_M.gguf ——————————————————————————————— 100%
...
```

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-31 16:18:00 +02:00
hipudding 632219af73
CANN: fix multi-thread set_tensor race conditions (#20151)
* CANN: fix multi-thread set_tensor race conditions

When ollama calls ggml_backend_tensor_set from multiple threads (each
writing a different chunk of the same tensor), the CANN backend had
three concurrency issues:

1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform
   before uploading to device. Per-chunk transforms produced corrupt data.

2. ND-to-NZ weight conversion requires complete tensor data on device.
   Per-chunk conversion operated on incomplete data.

3. The global g_nz_workspaces array had unprotected concurrent access.

Fix by introducing a TensorSetTracker that accumulates write progress
per tensor. For quantized tensors, raw data is staged in a host buffer
and the transform + upload is deferred until all chunks arrive. For NZ
weights, chunks are uploaded directly but conversion is deferred. The
tracker and its staging buffer are released immediately after
post-processing completes.

Add per-device mutex to g_nz_workspaces to prevent data races.

* CANN: fix L2_NORM ignoring eps parameter

The L2_NORM implementation was not using the eps parameter from
op_params, causing incorrect results when eps is large (e.g. 10.0).
The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
Clamp step to clamp the norm to at least eps before dividing.

* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching

When ACL graph mode is enabled, the graph LRU cache checks whether a
cached graph matches the current computation graph. Previously,
GGML_OP_POOL_2D was not included in the op_params comparison, so two
POOL_2D nodes with different pooling parameters (kernel size, stride,
padding) but identical tensor shapes and addresses could incorrectly
reuse a cached graph, leading to wrong results or aclnn errors.

Add GGML_OP_POOL_2D to the list of ops that require op_params matching
in ggml_graph_node_properties::has_matching_properties().

* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

The ACL graph LRU cache was incorrectly reusing cached graphs for
operations with different tensor types or op_params, causing test
failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties
  so the cache can distinguish tensors with different types but
  identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for
  SCALE/UNARY/GLU/ROPE/POOL_2D
2026-03-31 17:00:51 +03:00
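
The deferred-upload idea from the first CANN fix above (stage each chunk, and run the full-tensor transform only once every chunk has arrived) can be sketched as follows. This is a hypothetical illustration, not the backend's actual `TensorSetTracker`: the class name `chunk_tracker`, its methods, and the assumption that chunks never overlap are all invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical sketch of a per-tensor write tracker with a host staging buffer.
class chunk_tracker {
public:
    explicit chunk_tracker(size_t total_bytes) : staging_(total_bytes), written_(0) {}

    // Copies one chunk into the staging buffer under a lock.
    // Returns true once all bytes have arrived, which is the point where the
    // caller would run the full-tensor transform and upload.
    // Assumes chunks do not overlap; out-of-range chunks are rejected.
    bool add_chunk(const void * data, size_t offset, size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (offset > staging_.size() || size > staging_.size() - offset) {
            return false;
        }
        if (size > 0) {
            std::memcpy(staging_.data() + offset, data, size);
        }
        written_ += size;
        return written_ == staging_.size();
    }

    // Staged bytes; meaningful only after add_chunk has returned true.
    const std::vector<uint8_t> & staged() const { return staging_; }

private:
    std::mutex           mutex_;
    std::vector<uint8_t> staging_;
    size_t               written_;
};
```

Because completion is judged by the byte count rather than by arrival order, threads can deliver their chunks in any order and only the thread that completes the tensor triggers post-processing.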
Xuan-Son Nguyen 4a00bbfed6
server: (webui) no more gzip compression (#21073)
* webui: no more gzip

* try changing a small line

* Revert "try changing a small line"

This reverts commit 0d7a353159.

* fix lint

* fix test

* rebuild

* split into html/css/js

* lint

* chore: update webui build output

* chore: Update git hooks script

* server: update webui build output

* chore: Update pre-commit hook

* refactor: Cleanup

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-31 15:44:26 +02:00
Aldehir Rojas 624733d631
common : gpt-oss handle builtin and unsolicited tool calls (#21213) 2026-03-31 13:52:42 +02:00
lainon1 0b6ff47996
fix: correct misspellings in code comments (#21217)
- emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp,
gemma-embedding.cpp)
- imlpemented → implemented (llama-adapter.cpp)
- interere → interfere (llama-graph.cpp)
- overridde → overridden (chat.cpp)
- stastistics → statistics (ngram-map.h)
- layed → laid (llama-kv-cache.h)
- worster → worst (llama-context.cpp)
- sequantial → sequential (llama-batch.h)
2026-03-31 13:50:51 +02:00
Seungmin Kim eec6f85d7b
CI: Enable CPU and Vulkan ARM64 Release (#21207) 2026-03-31 19:02:56 +08:00
Georgi Gerganov 9281dd135d sync : ggml 2026-03-31 14:00:41 +03:00
Georgi Gerganov 0be6c7c9ce ggml : bump version to 0.9.9 (ggml/1449) 2026-03-31 14:00:41 +03:00
Adrien Gallouët 41361c8599
common : move up common_init() and fix Windows UTF-8 logs (#21176)
The build info is now only for debug, so we avoid the duplicate
with `--version`.

The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-31 12:53:41 +02:00
Neo Zhang 62278cedde
sycl : enhance fattn perf (#21185) 2026-03-31 13:31:50 +03:00
mtmcp 90aa83c6bd
common: add bounds check in common_init_result::sampler to prevent segfault on failed model load (#21082)
* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load

* Revert a308e584ca

* Add regression test

* Remove regression test for init-fail sampler check
2026-03-31 13:04:42 +03:00
SATISH K C fcc2d598c8
fix: include API key in CORS proxy requests for MCP connections (#21193)
* fix: include API key in CORS proxy requests for MCP connections

When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.

Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.

Fixes #21167

* fix: simplify headers assignment based on reviewer suggestion

Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
2026-03-31 10:52:34 +02:00