Commit Graph

8507 Commits

Author SHA1 Message Date
Johannes Gäßler 7d732af8e5
Merge 60312f6a46 into 1fb2290a51 2026-03-23 22:04:54 +00:00
Johannes Gäßler 60312f6a46 fix model saving 2026-03-23 22:59:37 +01:00
Johannes Gäßler dd2564bc38 fix prints 2026-03-23 22:59:37 +01:00
Johannes Gäßler 445fc0bf21 refactor tests 2026-03-23 22:59:37 +01:00
Johannes Gäßler e7f31055b3 fixup 2026-03-23 22:59:37 +01:00
Johannes Gäßler c66fd8a227 roundtrip tests 2026-03-23 22:59:37 +01:00
Johannes Gäßler e8e2f634e7 fix llama-model-saver 2026-03-23 22:59:37 +01:00
Johannes Gäßler e0ee16ce77 fix tensor names 2026-03-23 22:59:37 +01:00
Johannes Gäßler f76e53108c fixup 2026-03-23 22:59:37 +01:00
Siddhesh2377 6de1857936 llama : use FILE pointer consistently, address review feedback 2026-03-23 22:59:37 +01:00
Siddhesh2377 c44d34ee73 llama : use FILE pointer instead of fd in public API 2026-03-23 22:59:37 +01:00
Siddhesh2377 2c3223177d llama : address review feedback for fd-based model loading 2026-03-23 22:59:37 +01:00
Siddhesh2377 4101758ab6 llama : add fd-based model loading via llama_model_load_from_fd 2026-03-23 22:59:37 +01:00
Max Krasnyansky 1fb2290a51
Add codeowners for scripts/snapdragon and docs/snapdragon (#20915)
* Add codeowners for scripts/snapdragon

* Also add docs/backends/snapdragon
2026-03-23 14:57:18 -07:00
lhez 1772701f99
opencl: add q6_K gemm and gemv kernels for Adreno (#20089)
* opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code

* opencl: add q6_K transpose

* opencl: fix cvt kernel name

* opencl: add call to q6_K gemv

* opencl: fix q6_K scale transpose

* opencl: fix loading for gemv q6_K, refactor

* opencl: fix transpose_8_buf kernel assignment, refactor

* opencl: refactor q6_K transpose

* opencl: add gemm_noshuffle_q6_k_f32

* opencl: fix qh loading

* opencl: refactor q6_K gemv host side, release bufs and imgs

* opencl: refactor

* opencl: fix q6_K dequant and scale selection

* opencl: workaround compiler bug, fix dump_tensor

* opencl: refactor q6_K convert kernels

* opencl: unpack transformed q6_K in get_tensor

* opencl: refactor, handle non-uniform workgroups

* opencl: support non-vector subgroup bcast
2026-03-23 12:44:18 -07:00
las7 39bf0d3c6a
rpc : RCE patch (#20908) 2026-03-23 19:54:57 +02:00
Xuan-Son Nguyen bd6992180b
contrib: add "Requirements" section to PR template (#20841)
* contrib: add "Requirements" section to PR template

* typo [no ci]

* use h2, add "Additional information"

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-03-23 16:59:02 +01:00
Davi Henrique Linhares fd18364755
devops: upgraded default oneAPI version (#20731) 2026-03-23 21:47:34 +08:00
Aleksander Grygier 11fb11b901
webui: Improve chat form positioning (#20901) 2026-03-23 14:30:55 +01:00
Geo Maciolek 35b662bb5d
docs: Fix typo in reasoning flag documentation (#20780)
Tested to verify - the typo is just in the docs, not the actual flag.
2026-03-23 21:24:55 +08:00
Georgi Gerganov f93c09e267
memory : fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887) 2026-03-23 14:08:46 +02:00
Eric Zhang 841bc203e2
docs : rerun llama-gen-docs to include new CLI args (#20892) 2026-03-23 12:33:38 +01:00
Xuan-Son Nguyen 31a5cf4c3f
server: use httplib dynamic threads (#20817)
* server: use httplib dynamic threads

* change to n_threads_http + 1024
2026-03-23 12:22:46 +01:00
Georgi Gerganov e32d243849
ai : update gh permissions (#20895) 2026-03-23 13:21:41 +02:00
Pascal c44a932cf4
webui: fix --webui-config-file settings not applied on load (#20823)
* webui: fix --webui-config-file settings not applied on load

* chore: update webui build output
2026-03-23 11:25:35 +01:00
Rashid Ul Islam 177c75852a
metal: add CONV_3D (#19927)
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* metal:add conv_3d backend

Rebased with master and resolved conflicts.

* Resolved issues related to changes in variable names

* kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-23 09:45:34 +02:00
Jhen-Jie Hong 7a0b6a635e
common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859) 2026-03-23 08:35:27 +01:00
Chenguang Li 07ff000551
CANN: add RoPE cache preload before ACL graph capture (#20747)
ACL graph capture disallows host-to-device memcpy and device memory
malloc/free on the captured stream. Pre-load the RoPE cache before
capture so that:
- Host-to-device copies and allocations run on the non-captured stream
- Cache metadata is populated and memory pool is warmed up
- During capture, only on-device computations are recorded; host-side
  and allocation branches are skipped
2026-03-23 15:24:06 +08:00
Dan Hoffman cc18f965b6
fix(openvino): explicit memset in buffer_context allocation (#20857)
* fix(openvino): explicit memset in buffer_context allocation

* minor

---------

Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-23 08:05:37 +02:00
shaofeiqi 84ffd0c192
opencl: add flattened Q4_K mv and general Q4_K mm (#20773) 2026-03-22 22:45:11 -07:00
bssrdf ec2b787ebe
mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847)
* added support for internvl's dynamic high-resolution (Qianfan-OCR needed)

* add min/max dynamic patch to gguf meta

* clean up

* simplified handling min/max dynamic patch

* reuse llava_uhd logic for slice images

* provide default values for older models

* flake8

* prevent writing 0 value to gguf

* remove duplicated resolution candidates with a better algorithm

* fix indentation

* format

* add protection from divide by zero

* change to 0 to be safe

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-03-23 01:06:30 +01:00
DorianRudolph d3ac030a5d
mtmd : fix LightOnOCR image preprocessing (#20877) 2026-03-23 01:04:14 +01:00
Xuan-Son Nguyen 49bfddeca1
server: allow router to report child instances sleep status (#20849)
* server: allow router to report child instances sleep status

* refactor

* move sleeping to state

* nits
2026-03-22 18:33:52 +01:00
Johannes Gäßler bd3f1d9d65
CUDA: fix BF16 FA compilation (#20865) 2026-03-22 17:53:33 +01:00
Sigbjørn Skjæret 23c9182ce8
jinja : refactor token advancement (#20864)
* refactor token advancement

* exercise sub-expressions
2026-03-22 17:45:10 +01:00
Evgeny Kurnevsky 81bc4d3ddc
server: fix Host header (#20843)
It should include port when it's not default.
2026-03-22 22:29:22 +08:00
Neo Zhang f40a80b4f3
support bf16 and quantized type (#20803) 2026-03-22 22:06:27 +08:00
Patrick Buckley db9d8aa428
ggml-cuda: native bf16 flash attention for vec kernel (#20525)
* ggml-cuda: native bf16 flash attention for vec and tile kernels

mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo

* ggml-cuda: address code owner review feedback

reverted tile kernel changes to avoid larger refactor

* fix ci failures on turing and hip

* fix bf16 vec kernel compile on hip v_dot2 platforms

* add comments

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-22 11:05:51 +01:00
Gaurav Garg ccb87fa3ee
[CUDA] Increase number of output elements per-thread block if the K-dimension is small (#20635)
* Increase per-thread work if the K-dimension is small

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MOEs. For example, Qwen3-30b-A3B has a K-dimension of 768, and Qwen3235B-A22B has k-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices.

This change increases the number of output elements per block for such cases.

* Limit this change to ncols_dst = 1

* tab to space
2026-03-22 16:49:35 +08:00
ddh0 3306dbaef7
misc : prefer ggml-org models in docs and examples (#20827)
* misc : prefer ggml-org models in docs and examples

Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.

* remove accidentally committed file
2026-03-21 22:00:26 +01:00
Andrea Arcangeli 990e4d9698
common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)
* grammar: add test case for nullable symbol loop

Reproduce stack overflow (or OOM) with ( [x]* )* found while adding
GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= ( [x]* )*"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: prevent stack overflow with nullable symbol loop

Fix a potential stack overflow in llama_grammar_advance_stack that
could occur when processing grammars with nullable symbols that lead
to infinite derivations of empty strings. The fix introduces cycle
detection by tracking visited stacks to prevent infinite recursion.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A20
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

* grammar: convert recursive llama_grammar_advance_stack to iterative

This change converts the function to an iterative approach using
explicit stacks, which prevents deep recursion and eliminates the risk
of stack overflow.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration

convert from recursive to interactive"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

v2: Added a `std::set` to perform tree-based lookups with O(N log N)
complexity. Testing with a parallel run of `test-grammar-integration`
shows a double-digit percentage increase in runtime. An
`unordered_set` with O(1) hashing was also evaluated, but the overhead
of constructing hash keys from pointers made it significantly slower
than the rbtree implementation that only requires an ordering
operator. The performance regression in the test suite appears
justified by the overall reduction in algorithmic complexity.

Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

* grammar: add test case for hang in repetition grammar processing

This commit adds a new test case to the grammar integration tests that
specifically targets a hang scenario in the repetition grammar parser
found while adding GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= (([^x]*){0,99}){0,99}"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: add repetition threshold check

The change introduces a maximum repetition threshold to avoid
excessive rule expansion during grammar parsing. When parsing
repetition patterns like {m,n}, the parser now calculates the
potential number of rules that would be generated and throws an error
if the product of previous rules and new rules exceeds the threshold.

A test case was added to verify the threshold is properly enforced for
deeply nested repetition patterns that would otherwise cause hangs.
2026-03-21 18:43:35 +01:00
Tom Hillbrunner 212f4521b0
context : use n_embd_out for pooled embedding extraction (#20840)
The MEAN/CLS/LAST pooling paths in encode() and decode() used
n_embd_inp() (16384 for qwen3vl with deepstack) to read from the
pooled embedding tensor, which only has n_embd_out() (4096) floats
per sequence. This caused a tensor read out of bounds assertion.

Fixes embedding mode for Qwen3-VL-Embedding models.
2026-03-21 19:35:00 +02:00
Xuan-Son Nguyen 568aec82d2
docs : explicit about banning accounts that violates policy (#19593) 2026-03-21 15:50:16 +01:00
y198 2bcdddd5e3
fix(rpc): prevent division by zero in deserialize_tensor (#20712)
rpc : prevent division by zero in deserialize_tensor

When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server. 

This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0.

(Note: This was originally reported via Security Advisory and maintainer suggested dropping a patch here).

* style: remove trailing whitespace
2026-03-21 15:59:43 +02:00
Michael Wand eac9c6ea83
Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730)
* Corrected convert script for NVFP4 naming and updated gguf constants

* Add mostly_MXFP4 to FileType

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* simplify

* set initial value [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-21 13:35:21 +02:00
Sigbjørn Skjæret 29b28a9824
ci : switch from pyright to ty (#20826)
* type fixes

* switch to ty

* tweak rules

* tweak more rules

* more tweaks

* final tweak

* use common import-not-found rule
2026-03-21 08:54:34 +01:00
Matt Corallo cea560f483
Add shader count for Intel Arc Pro B60 (#20818) 2026-03-21 05:22:51 +01:00
Piotr Wilkin (ilintar) b1c70e2e54
common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825) 2026-03-21 00:19:04 +01:00
shalinib-ibm e6ec21e62f
ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791)
Explicitly mark save_acc and add_save_Acc with always_inline
in tinyBLAS_PPC. This ensures the compiler keeps MMA accumulator
disassembly within kernel's register context, preventing un-necessary
stask spills.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2026-03-21 07:11:45 +08:00
Georgi Gerganov 4cb7e0bd61
ai : limit runtime of the agent (#20816) 2026-03-20 20:31:25 +02:00