Piotr Wilkin (ilintar)
c301172f66
jinja: support none|string ( #18995 )
...
* jinja: support none|string
* Update common/jinja/value.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-jinja.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add as_string()
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-21 19:24:37 +01:00
Hendrik Erz
3802d3c78f
fix: Use `tabular-nums` for chat message statistics ( #18915 )
...
* fix: Use `tabular-nums` for chat message statistics
* fix: Rebuild WebUI
2026-01-21 18:46:01 +01:00
Daniel Bevenius
9da3dcd753
llama : clarify nemotron-h.cpp comment about RoPE [no ci] ( #18997 )
...
This commit removes the mention of RoPE in the comment for the Q and K
computation as RoPE is not applied.
2026-01-21 18:31:34 +01:00
Jeff Bolz
bd544c94a3
vulkan: Remove transfer_ctx, do everything in compute_ctx. ( #18945 )
...
* vulkan: Remove transfer_ctx, do everything in compute_ctx.
We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.
Remove transfer_cmd_pool, which was already unused.
* fix crash with perf logger
2026-01-21 18:01:40 +01:00
Adrien Gallouët
14be5a39b1
common : improve error message when HTTPS is missing but required ( #18987 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-21 17:58:38 +01:00
손희준
fbbf3ad190
server: /v1/responses (partial) ( #18486 )
...
* from previous PR
* Make instruction (system) the first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state (also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>
2026-01-21 17:47:23 +01:00
Jeff Bolz
33f890e579
vulkan: support flash attention GQA/split_k with small batches ( #18938 )
2026-01-21 17:43:43 +01:00
Masato Nakasaka
067b8d7af3
Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash ( #17356 )" ( #18831 )
...
This reverts commit 980b7cd17e.
2026-01-21 17:13:43 +01:00
Jeff Bolz
50b7f076a5
vulkan: Use mul_mat_vec_id for small values of n ( #18918 )
...
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.
Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this change the cost is
linear in n, rather than n>1 being far slower than n==1.
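The cost argument above can be sketched in plain Python (illustrative only, not the Vulkan code): a mat-mat is just n mat-vecs over the columns of B, which is why the looped mat-vec path scales linearly in n.

```python
def mat_vec(A, v):
    # One matrix-vector product: dot each row of A with v.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def mat_mat(A, B):
    # A mat-mat is n mat-vecs, one per column of B, so cost grows linearly in n.
    n = len(B[0])
    cols = [mat_vec(A, [row[j] for row in B]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(mat_mat(A, B))  # [[19, 22], [43, 50]]
```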
2026-01-21 16:22:02 +01:00
Tarek Dakhran
ad8d85bd94
memory : add llama_memory_hybrid_iswa ( #18601 )
...
* memory : add llama_memory_hybrid_iswa
* Update src/llama-memory-hybrid-iswa.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-21 14:30:23 +02:00
Piotr Wilkin (ilintar)
12a4a47e6a
Fix GLM 4.7 Lite MoE gating func ( #18980 )
...
* Fix GLM 4.7 MoE gating func
* Update src/models/deepseek2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-01-21 12:35:20 +01:00
Matthieu Coudron
37c35f0e1c
gguf: display strerror when a model can't be loaded ( #18884 )
...
I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'
and I was sure it could access the file. It seems --models-dir and
--models-presets don't interact the way I thought they would, but I salvaged
this snippet, which helps with troubleshooting:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
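For reference, a minimal Python sketch of the same message formatting (the helper name is hypothetical; the real fix lives in the C code and uses strerror):

```python
import errno
import os

def open_error(path: str, err: int) -> str:
    # Append the human-readable errno description, as the improved log line does.
    return f"failed to open GGUF file '{path}' (errno {os.strerror(err)})"

print(open_error("mistral-7b-v0.1.Q8_0.gguf", errno.ENOENT))
```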
2026-01-21 08:52:46 +02:00
Oliver Simons
5bd341c9a1
CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator ( #18964 )
...
* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator
Strided iterator was added in [CCCL 3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into
[CTK 13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5)
* Unindent as per code review request
2026-01-21 02:34:29 +01:00
Adrien Gallouët
1c7cf94b22
common, server : use the same User-Agent by default ( #18957 )
...
This commit also ensures that if a custom User-Agent is used, it will be
the only one sent.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-20 18:28:43 +01:00
Xuan-Son Nguyen
2c1f199653
cli : fix reasoning responses in CLI ( #18961 )
...
* cli : fix reasoning responses in CLI
* fix build
* fix build (2)
2026-01-20 18:23:25 +01:00
Oliver Simons
d1e3556481
CUDA: Replace init_offsets kernel with iterators in cub-based argsort ( #18930 )
...
* CUDA: Replace `init_offsets` with iterators in argsort
This is a QOL improvement, saving us the cost of materializing the
offsets (the iterator computes them on the fly)
* Remove unnecessary include from top-k.cu
2026-01-20 20:11:01 +08:00
Adrien Gallouët
08f3f4a8a3
ggml : cleanup path_str() ( #18928 )
...
- Remove pragmas as `std::codecvt_utf8` is not used.
- Avoid implicit `strlen()`.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-20 11:42:49 +01:00
Georgi Gerganov
271191906c
metal : enable FA for MLA heads ( #18950 )
2026-01-20 12:21:28 +02:00
Daniel Bevenius
7dee9ff59a
convert : use n_groups instead of hardcoded values in reshape ( #18929 )
...
* convert : use n_groups instead of hardcoded values in reshape
This commit modifies the conversion script for NemotronHModel to use
the 'n_groups' hyperparameter, and allow Python to calculate the last
dimension, using -1, when reshaping the 'mixer.norm.weight' tensor.
* use self.n_group instead of self.hparams["n_groups"]
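A sketch of the -1 rule the reshape relies on (pure Python; the n_group value here is illustrative, not taken from the model):

```python
from math import prod

def infer_shape(numel: int, shape: tuple) -> tuple:
    # Replace a single -1 with numel divided by the product of the known dims,
    # which is what NumPy/PyTorch reshape does with a -1 dimension.
    known = prod(d for d in shape if d != -1)
    return tuple(numel // known if d == -1 else d for d in shape)

n_group = 8               # hypothetical value of the 'n_groups' hyperparameter
print(infer_shape(n_group * 512, (n_group, -1)))  # (8, 512)
```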
2026-01-20 06:55:24 +01:00
Xuan-Son Nguyen
6df686bee6
server : refactor oai_parser_opt, move it to server_chat_params ( #18937 )
...
* server_chat_params
* move chat format into CLI
* use meta whenever possible
* clean up, no more chatml fallback
2026-01-19 23:28:01 +01:00
ddh0
1706a6d7c6
convert : support Glm4MoeLite ( #18936 )
...
* initial commit for branch
* add glm-4.7-flash, move tokenizer hash
* use `glm4` pretok
* silence flake8 E302 (CI)
* apply review feedback
* add <|user|> as eog
* also add EOG `<|observation|>`
* revert llama-vocab
* inherit vocab from glm4
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-01-19 23:09:20 +01:00
Sigbjørn Skjæret
959ecf7f23
jinja : fix undefined keys and attributes and int/float as bool ( #18924 )
...
* fix undefined keys and attributes
* add falsy tests
* as_bool for integers and floats
* more falsy/truthy tests
* --typo
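The as_bool change for integers and floats presumably matches Python truthiness, where zero of any numeric type is falsy:

```python
# Zero ints and floats are falsy; any non-zero number is truthy.
print(bool(0), bool(0.0))   # False False
print(bool(0.5), bool(-1))  # True True
```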
2026-01-19 20:29:43 +01:00
Sigbjørn Skjæret
4037093c66
ci : run test-jinja -py on high perf [no ci] ( #18916 )
2026-01-19 20:29:15 +01:00
Lennart Austenfeld
18361c579c
server: fix memory reservations in populate_token_probs ( #18787 )
2026-01-19 19:13:31 +01:00
Georgi Gerganov
365a3e8c31
ggml : add ggml_build_forward_select ( #18550 )
...
* ggml : add ggml_build_forward_select
* cuda : adapt CUDA graph compat to new feature
* vulkan : update logic to handle command buffer closing
* ggml : check compute for fusion
* ggml : add comment
2026-01-19 20:03:19 +02:00
Daniel Bevenius
3d55846a5c
model-conversion : add BUILD_DIR variable to run-converted-model scripts ( #18927 )
...
This commit adds a BUILD_DIR variable to the scripts used for running
converted models.
The motivation for this is that currently the `build` directory is
hardcoded and it can be useful to specify a different build directory,
with builds for different configurations.
2026-01-19 13:12:38 +01:00
Julius Tischbein
287a33017b
llama : Extend fallback, fix fileno for dio file, exclude case that mmap uses dio file ( #18887 )
2026-01-18 18:35:57 +02:00
Francisco Herrera
293a1565dc
docs: add linux to index ( #18907 )
2026-01-18 18:03:35 +08:00
Xuan-Son Nguyen
fe44d35574
tests : add test-jinja -py option for cross-checking ( #18906 )
...
* tests : add test-jinja -py option for cross-checking
* Update tests/test-jinja.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix + add source
* SandboxedEnvironment
* fix array.map case
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-18 08:14:27 +01:00
Sigbjørn Skjæret
bbcdac0189
jinja : fix object item order (and properly implement dictsort) ( #18904 )
...
* fix object item order
* as_ordered_object
* copy whole object
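What dictsort guarantees can be shown in plain Python (Jinja's filter sorts items by key by default; this sketch ignores its case-sensitivity option):

```python
# dictsort-style ordering: items sorted by key, independent of insertion order.
d = {"b": 2, "a": 1, "c": 3}
print(sorted(d.items()))  # [('a', 1), ('b', 2), ('c', 3)]
```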
2026-01-18 03:40:06 +01:00
Sigbjørn Skjæret
d03c45c9c5
jinja : attribute support for join, map and sort ( #18883 )
...
* support negative array index and default value
* attribute support (int and str) for join, map and sort
* add tests
* update CODEOWNERS
* improve fixme sorting comment
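The attribute argument behaves like a sort key; in Python terms (a sketch of the semantics, not the C++ code):

```python
from operator import itemgetter

users = [{"name": "b"}, {"name": "a"}]
# jinja: users|sort(attribute="name") is equivalent to sorting with a key
print(sorted(users, key=itemgetter("name")))  # [{'name': 'a'}, {'name': 'b'}]
```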
2026-01-18 02:53:01 +01:00
Sigbjørn Skjæret
10c98cbdf6
jinja : add missing tojson filter for bool ( #18900 )
...
* add missing tojson for bool
* add more literal tests
2026-01-18 01:05:09 +01:00
Sigbjørn Skjæret
420960ab92
jinja : fix lexing of float literals with sign ( #18901 )
...
* fix lexing of float literals with sign
* add test
* consume_numeric
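A hypothetical sketch of what a consume_numeric step has to do (names assumed; a real lexer must also distinguish a unary sign from a binary minus, which this ignores):

```python
import re

# The sign must be consumed together with the digits so "-1.5" lexes as one
# float literal rather than "-" followed by "1.5".
_NUM = re.compile(r"[+-]?\d+(?:\.\d+)?")

def consume_numeric(src: str, pos: int):
    m = _NUM.match(src, pos)
    if not m:
        return None, pos
    text = m.group()
    value = float(text) if "." in text else int(text)
    return value, m.end()

print(consume_numeric("-1.5 + 2", 0))  # (-1.5, 4)
```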
2026-01-18 00:57:51 +01:00
Xuan-Son Nguyen
f55b033ae6
jinja: correct member access rule ( #18905 )
2026-01-18 00:48:55 +01:00
lhez
d1b4757ded
opencl: fix q6_K mv for m=1 ( #18893 )
2026-01-17 13:50:32 -08:00
Sigbjørn Skjæret
57c0beaed0
ci : add label for jinja changes ( #18903 )
2026-01-17 21:52:02 +01:00
Georgi Gerganov
2fbde785bc
kv-cache : optimize KQ mask construction ( #18842 )
...
* kv-cache : optimize KQ mask construction
* cont : add explanation + improve
* cont : fix
2026-01-17 15:42:42 +02:00
Reese Levine
a89002f07b
ggml webgpu: support for backend sampling ( #18880 )
...
* ggml webgpu: add SOFTPLUS unary operator
Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern
* ggml webgpu: add EXPM1 unary operator
Implements EXPM1 (exp(x) - 1) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add FLOOR unary operator
Implements FLOOR (rounds down to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add CEIL unary operator
Implements CEIL (rounds up to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add ROUND unary operator
Implements ROUND (rounds to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add TRUNC unary operator
Implements TRUNC (truncates towards zero) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)
* Updates to webgpu get_memory
* Add argmax
* Add argmax,cumsum,sum,sum_rows
* Add necessary CPY/GET_ROWS operators
* Support for argsort using multi-pass strategy
* Update set_rows for i32 indices, move to pre-wgsl
* Port unary operators to pre-wgsl and support FILL
* Implement PAD
* Add support for top-k
* clean up, scope pipeline init mutex
* fix newline
* Add support for log
* Update LOG for better precision, and ops doc
---------
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>
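The SOFTPLUS entry above notes f32 intermediates to avoid f16 overflow; the standard rewrite achieves the same stability, sketched here in Python (the shaders themselves are WGSL and not shown):

```python
import math

def softplus(x: float) -> float:
    # Stable form of log(1 + exp(x)): exp() never sees a large positive
    # argument, so it cannot overflow for big x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

print(round(softplus(0.0), 6))  # 0.693147 (= ln 2)
print(softplus(100.0))          # 100.0 (the naive form would overflow exp)
```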
2026-01-16 16:12:43 -08:00
Thore Koritzius
388ce82241
ggml : extend ggml_pool_1d + metal ( #16429 )
...
* chore: resolve conflicts
* feat: ggml metal impl
* fix: ggml_metal_kargs_pool_1d struct
* fix: require contiguous input
* chore: test pool_1d
* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts
* chore: add p0 and s0 to testing
* fix: allow padding for cpu and metal
* Update ggml/src/ggml-metal/ggml-metal.metal
* fix: correct single-threaded loop
* ggml : cleanup
* tests : add ne[1] != 1 tests
* fix: ne[1] handling in np
* cont : fixes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 16:59:56 +02:00
hipudding
6ba6a3c76f
docs : update ops.md for CANN backend ( #18654 )
2026-01-16 13:32:17 +01:00
Perry Naseck
0802d4cfb3
ggml-blas: hide warnings from included BLAS headers ( #18818 )
...
* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set
* ggml-blas: hide warnings from included BLAS headers
2026-01-16 13:38:25 +02:00
Tarek Dakhran
c945aaaef2
mtmd : Fix ASR for LFM2.5-Audio-1.5B ( #18876 )
2026-01-16 11:23:08 +01:00
Xuan-Son Nguyen
c15395f73c
common : implement new jinja template engine ( #18462 )
...
* jinja vm
* lexer
* add vm types
* demo
* clean up
* parser ok
* binary_expression::execute
* shadow naming
* bin ops works!
* fix map object
* add string builtins
* add more builtins
* wip
* use mk_val
* eval with is_user_input
* render gemma tmpl ok
* track input string even after transformations
* support bound functions
* keyword arguments and slicing array
* use shared_ptr for values
* add mk_stmt
* allow print source on exception
* fix negate test
* testing more templates
* mostly works
* add filter_statement
* allow func to access ctx
* add jinja-value.cpp
* impl global_from_json
* a lot of fixes
* more tests
* more fix, more tests
* more fixes
* rm workarounds
* demo: type inference
* add placeholder for tojson
* improve function args handling
* rm type inference
* no more std::regex
* trailing spaces
* make testing more flexible
* make output a bit cleaner
* (wip) redirect minja calls
* test: add --output
* fix crash on macro kwargs
* add minimal caps system
* add some workarounds
* rm caps_apply_workarounds
* get rid of preprocessing
* more fixes
* fix test-chat-template
* move test-chat-jinja into test-chat-template
* rm test-chat-jinja from cmake
* test-chat-template: use common
* fix build
* fix build (2)
* rename vm --> interpreter
* improve error reporting
* correct lstrip behavior
* add tojson
* more fixes
* disable tests for COMMON_CHAT_FORMAT_GENERIC
* make sure tojson output correct order
* add object.length
* fully functional selectattr / rejectattr
* improve error reporting
* more builtins added, more fixes
* create jinja rendering tests
* fix testing.h path
* adjust whitespace rules
* more fixes
* temporary disable test for ibm-granite
* r/lstrip behavior matched with hf.js
* minimax, glm4.5 ok
* add append and pop
* kimi-k2 ok
* test-chat passed
* fix lstrip_block
* add more jinja tests
* cast to unsigned char
* allow dict key to be numeric
* nemotron: rm windows newline
* tests ok
* fix test
* rename interpreter --> runtime
* fix build
* add more checks
* bring back generic format support
* fix Apertus
* [json.exception.out_of_range.403] key 'content' not found
* rm generic test
* refactor input marking
* add docs
* fix windows build
* clarify error message
* improved tests
* split/rsplit with maxsplit
* non-inverse maxsplit
forgot to change after simplifying
* implement separators for tojson and fix indent
* i like to move it move it
* rename null -> none
* token::eof
* some nits + comments
* add exception classes for lexer and parser
* null -> none
* rename global -> env
* rm minja
* update docs
* docs: add input marking caveats
* implement missing jinja-tests functions
* oops
* support trim filter with args, remove bogus to_json reference
* numerous argument fixes
* updated tests
* implement optional strip chars parameter
* use new chars parameter
* float filter also has default
* always leave at least one decimal in float string
* jinja : static analysis + header cleanup + minor fixes
* add fuzz test
* add string.cpp
* fix chat_template_kwargs
* nits
* fix build
* revert
* unrevert
sorry :)
* add fuzz func_args, refactor to be safer
* fix array.map()
* loosen ensure_vals max count condition, add not impl for map(int)
* hopefully fix windows
* check if empty first
* normalize newlines
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
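The split/rsplit with maxsplit entries above presumably mirror Python's semantics, where maxsplit limits the number of splits and rsplit counts them from the right:

```python
# maxsplit=2 allows at most two splits; rsplit applies them from the right.
print("a,b,c,d".split(",", 2))   # ['a', 'b', 'c,d']
print("a,b,c,d".rsplit(",", 2))  # ['a,b', 'c', 'd']
```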
2026-01-16 11:22:06 +01:00
Julius Tischbein
aa1dc3770a
Setting mmap and direct_io to false as default in llama-bench.cpp ( #18841 )
2026-01-16 09:46:51 +01:00
Raul Torres
4ea2eaac01
CANN: Remove unused `ggml_cann_get_device` function ( #18625 )
2026-01-16 16:34:09 +08:00
Chenguang Li
e20fa27a02
CANN: fix an issue where get_env was not fully renamed ( #18796 )
...
* CANN: fix an issue where get_env was not fully renamed
* ci: add cann with acl group
* ci: define use_acl_graph using GitHub Action
* ci: update cann dockerfile with acl graph
2026-01-16 16:24:04 +08:00
hipudding
baa4ba0aec
CANN: support gated linear attn ( #18653 )
...
* CANN: support gated linear attn
This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased
before merging.
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
* CANN: optimize OP gla
Optimize gla for high performance
* Remove unused comments
---------
Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain>
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
2026-01-16 16:18:49 +08:00
shaofeiqi
785a710085
OpenCL: add SOLVE_TRI op support ( #18846 )
2026-01-15 11:17:17 -08:00
Georgi Gerganov
6e7fc8a146
cuda : print less debug logs when disabling cuda graphs ( #18868 )
2026-01-15 20:53:01 +02:00
Georgi Gerganov
be8e3d9515
context : do not reserve scheduler for warmups ( #18867 )
2026-01-15 19:35:57 +02:00