Commit Graph

8035 Commits

Author SHA1 Message Date
Aleksander Grygier 80e829a248 chore: update webui build output 2026-01-12 14:49:11 +01:00
Aleksander Grygier 60ef752d0f refactor: Architecture improvements 2026-01-12 14:45:24 +01:00
Aleksander Grygier a63a421952 chore: update webui build output 2026-01-12 14:18:15 +01:00
Aleksander Grygier 58ab834b18 refactor: MCP state management + stores/clients relationship 2026-01-12 14:17:06 +01:00
Daniel Bevenius 4150da9a95
examples : add --kv-unified to batched example (#18774)
This commit adds the --kv-unified flag to the batched example. This flag
is currently specified in the README.md as required, but is currently
not available as a command line option for the batched example.

The motivation for this is that specifying this flag as the README
instructs, will lead to an error about the flag not being recognized,
and without this option the example fail with the following error:
```console
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
2026-01-12 13:47:58 +01:00
Jeff Bolz 8e2da778da
vulkan: change memory_logger to be controlled by an env var (#18769) 2026-01-12 13:32:55 +01:00
Xuan-Son Nguyen ce3bf9b1a4
server: update docs for sleeping [no ci] (#18777) 2026-01-12 13:01:24 +01:00
Jeff Bolz 2bbe4c2cf8
vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-12 12:32:13 +01:00
Aleksander Grygier 9c53bd4486 chore: update webui build output 2026-01-12 11:16:18 +01:00
Aleksander Grygier 528a560a25 fix: Distinguish streaming vs incomplete tool calls in UI 2026-01-12 11:15:58 +01:00
Aleksander Grygier aa9054367a chore: update webui build output 2026-01-12 11:10:24 +01:00
Aleksander Grygier cead02ee58 fix: Restore live reactive UI progress for tool calls 2026-01-12 11:07:56 +01:00
Aleksander Grygier c6843d0054 chore: update webui build output 2026-01-12 11:02:42 +01:00
Aleksander Grygier b5226ebd86 Merge origin/allozaur/mcp-mvp: enable streaming of tool call arguments
Resolves conflicts by:
- Keeping clean store architecture (agentic.svelte.ts delegates to client)
- Updating agentic.client.ts to use TOOL_ARGS_START/END format
- Accepting remote AgenticContent.svelte with direct JSON parsing
- Updating ChatMessageAssistant to match new AgenticContent props
2026-01-12 10:55:34 +01:00
Aleksander Grygier 01dfe0ee4c chore: update webui build output 2026-01-12 10:37:12 +01:00
Aleksander Grygier 144148125b refactor: Cleanup 2026-01-12 10:28:59 +01:00
Ruben Ortlam 1051ecd289
vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-12 07:29:35 +01:00
Xuan-Son Nguyen 0c3b7a9efe
model: fix qwen3next broken due to #18683 (#18762) 2026-01-11 21:00:10 +01:00
Ruben Ortlam 0e76501e1d
Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-11 17:33:33 +01:00
Xuan-Son Nguyen 4b060bf240
security: make it clear about subtopics in server (#18754)
* security: make it clear about subtopics in server

* exclude DoS
2026-01-11 16:51:03 +01:00
Daniel Bevenius 9789e28459
debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check (#18692)
* debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check

This commit updates the pooling check in the debug example to
also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just
LLAMA_POOLING_TYPE_NONE.

* debug : normalize both pooled and token embeddings

This commit updates debug.cpp to normalize embeddings for both pooled
and non-pooled outputs. For pooled embeddings, normalization is applied
to the single vector, and for non-pooled embeddings, normalization is
applied to each token embedding vector individually.

The motivation for this is to enable non-pooled embeddings to be
normalized which was not possible previously.
2026-01-11 16:34:41 +01:00
Georgi Gerganov 84ae04f163
tests : refactor test-backend-sampler (#18753)
* tests : use "auto", use std::string

* tests : refactor test-backend-sampler.cpp

* cmake : remove redundant declarations

* ci : use smaller model

* tests : add struct test_params

* tests : reduce logit bias 100.0f -> 10.0f
2026-01-11 17:31:03 +02:00
Xuan-Son Nguyen 506bb6e010
model: try to improve Qwen3 Next (#18683)
* qwen3next: simplify qkvz projection

* use ggml_swiglu_split

* revert swiglu_split, but remove redundant repeat()

* fix missing reshape

* rm 2 redundant transposes

* move mul_mat(k,q) to outside of chunking

* rm redundant cont

* improve g_cs_chunk

* add comments about no cont

* use std::pair instead of ggml_concat

* vectorize key_gdiff calculation

* rm unused tensor

* avoid ggml_concat inside loop

* bring back ggml_concat as it may not work on other backend

* nits
2026-01-11 12:53:33 +01:00
thom-dev-fr 79456a690a
readme : update UIs (#18751) 2026-01-11 13:46:50 +02:00
Xuan-Son Nguyen 28068af789
security: narrow down the scope of what we consider a vulnerability (#18752)
* security: narrow down the scope of what we consider a vulnerability

* fix typo
2026-01-11 12:23:36 +01:00
shaofeiqi 707cbafcaa
opencl: add SOFTPLUS op support (#18726) 2026-01-10 21:57:44 -08:00
Pascal a02acca38d fix: reset tool call state between turns 2026-01-10 19:14:13 +01:00
Pascal b7288a4dd7 webui: enable streaming of tool call arguments 2026-01-10 18:59:57 +01:00
Aman Gupta b137718878
test-backend-ops: fix mxfp4 tests on blackwell (#18736) 2026-01-11 01:12:57 +08:00
Johannes Gäßler d2ff4e23ac
HIP: adjust RDNA3.5 MMQ kernel selction logic (#18666) 2026-01-10 17:19:01 +01:00
Perry Naseck 657a2e644b
cmake : update blas logic (#18205) 2026-01-10 18:00:54 +02:00
Georgi Gerganov f307926482
server : adjust unified KV cache tests (#18716) 2026-01-10 17:51:56 +02:00
Sigbjørn Skjæret 7fdc8c893d
scripts : follow api redirects in pr2wt.sh (#18739) 2026-01-10 16:04:05 +01:00
Xuan-Son Nguyen 23f82f2420
preset: allow named remote preset (#18728)
* preset: allow named remote preset

* nits: fix docs

* cont docs
2026-01-10 15:12:29 +01:00
Aaron Teo 2656c0d265
docs(ggml): update backend ops (#18734)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-01-10 18:48:17 +08:00
Michael Wand 600a366478
Corrected: changed s13 = src1->nb[3] instead of nb[2] (#18724) 2026-01-10 10:16:07 +01:00
Adrien Gallouët ea23c15990
common : add --license to display embedded licenses (#18696)
This commit introduces a mechanism to embed all licenses directly
into the compiled binaries.

This eliminates the need to distribute separate LICENSE files alongside
the executable, making the binaries self-contained and simplifying
deployment.
2026-01-10 09:46:24 +01:00
Xuan-Son Nguyen 9ac2693a30
server: fix n_cmpl not skipping processing prompt (#18663)
* server: fix n_cmpl not skipping processing

* fix infinite loop on empty batch

* cont : init child samplers + modify child logic

* cont : cleanup

* cont : improve n_cmpl logic

- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache

* cont : remove redundant function

* cont : reduce parent checks

* fix : nullptr task dereference

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-10 00:00:41 +01:00
Simranjeet Singh a61c8bc3bf
mtmd: Add Gemma3n multimodal support with MobileNetV5 vision encoder (#18256)
* Add Gemma3nVisionModel - MobileNetV5 vision encoder convertor to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.

* Add mobilenetv5 impl

* Fix comments, remove unused vars

* Fix permute and remove transpose of projection weights

* Fix comments, remove debugging prints from hf_to_gguf

* 1. Hard-code image_mean = 0 and image_std = 1
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder

* 1. Move mobilenetv5 helpers declarations to `clip_graph_mobilenetv5` struct and definitions to mobilenetv5.cpp
2.Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size

* Remove obsolete comments

* - convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest
- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading

* - Rename tensors to v.conv..., v.blk..., v.msfa... to better align with already existing terminology

* Fix stem conv bias name

* Remove explicit handling of bias term for stem conv

* - Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer
- Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable

* clean up conversion script

* fix code style

* also preserve audio tensors

* trailing space

* split arch A and V

* rm unused gemma3 func

* fix alignment

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-01-09 23:42:38 +01:00
shaofeiqi 593da7fa49
opencl: add EXPM1 op (#18704) 2026-01-09 10:13:13 -08:00
Reese Levine 9e41884dce
Updates to webgpu get_memory (#18707) 2026-01-09 08:17:18 -08:00
Pascal ec8fd7876b
Webui/file upload (#18694)
* webui: fix restrictive file type validation

* webui: simplify file processing logic

* chore: update webui build output

* webui: remove file picker extension whitelist (1/2)

* webui: remove file picker extension whitelist (2/2)

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

* fix: update ChatForm storybook test after removing accept attribute

* chore: update webui build output

* refactor: more cleanup

* chore: update webui build output
2026-01-09 16:45:32 +01:00
Asbjørn Olling a180ba78c7
cmake: only build cli when server is enabled (#18670) 2026-01-09 16:43:26 +01:00
Georgi Gerganov 53eb9435da
server : fix timing of prompt/generation (#18713) 2026-01-09 12:59:50 +02:00
Georgi Gerganov d3435efc8a
scripts : pr2wt.sh reset to remote head (#18695)
* scripts : pr2wt.sh reset to remote head

* cont : cleaner

* cont : restore --set-upstream-to
2026-01-09 12:16:40 +02:00
Georgi Gerganov f5f8812f7c
server : use different seeds for child completions (#18700)
* server : use different seeds for child completions

* cont : handle default seed

* cont : note
2026-01-09 09:33:50 +02:00
Pascal 74b119e81e webui: prevent mobile dropdown immediate close on synthetic click 2026-01-08 22:48:56 +01:00
Xuan-Son Nguyen 8ece3836b4
common: support remote preset (#18520)
* arg: support remote preset

* proof reading

* allow one HF repo to point to multiple HF repos

* docs: mention about multiple GGUF use case

* correct clean_file_name

* download: also return HTTP status code

* fix case with cache file used

* fix --offline option
2026-01-08 22:35:40 +01:00
Aaron Teo 046d5fd44e
llama: use host memory if device reports 0 memory (#18587) 2026-01-09 05:34:56 +08:00
Masashi Yoshimura 480160d472
ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (#18628)
* Fix GGML_MEM_ALIGN to 8 for emscripten.

* Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten
2026-01-08 08:36:42 -08:00