Commit Graph

7964 Commits

Author SHA1 Message Date
Aleksander Grygier 372202632e refactor: Cleanup 2026-01-25 00:31:49 +01:00
Aleksander Grygier ba230c5cce refactor: Naming + remove redundant component 2026-01-24 23:58:17 +01:00
Aleksander Grygier f7b5f62586 refactor: Remove unused code 2026-01-24 23:45:06 +01:00
Aleksander Grygier 22d9e645aa chore: update webui build output 2026-01-24 23:39:04 +01:00
Aleksander Grygier d938994395 refactor: Cleanup 2026-01-24 23:38:37 +01:00
Aleksander Grygier fc4c392dce chore: update webui build output 2026-01-24 20:54:24 +01:00
Aleksander Grygier 79e606eb99 refactor: Constants 2026-01-24 20:52:19 +01:00
Aleksander Grygier 3d7426cdd4 refactor: Cleanup 2026-01-24 20:47:32 +01:00
Aleksander Grygier 8bf2d38da1 chore: update webui build output 2026-01-24 20:32:53 +01:00
Aleksander Grygier 14911e51fc feat: MCP Prompts implementation improvements 2026-01-24 20:30:52 +01:00
Aleksander Grygier 801ef93522 refactor: Message Height CSS Variable 2026-01-24 19:15:38 +01:00
Aleksander Grygier 13f756421c refactor: Enums 2026-01-24 18:37:43 +01:00
Pascal 85b8da45f9 fix: resolve TypeScript error in tool response content 2026-01-24 18:04:01 +01:00
Pascal 9ddc54b668 webui: enable vision in agentic tool responses
- Include images from all message roles (not just user)
- Add multipart content support for tool responses
- Images from MCP tools now accessible in same agentic turn
2026-01-24 17:58:20 +01:00
Aleksander Grygier 172e93d494 Merge remote-tracking branch 'ggml-org/master' into allozaur/mcp-mvp 2026-01-24 15:13:58 +01:00
Aleksander Grygier da9c245838 chore: update webui build output 2026-01-24 13:59:52 +01:00
Aleksander Grygier 7c4bedda87 feat: Improve formatting performance time 2026-01-24 13:58:23 +01:00
Aleksander Grygier c39c6ef436 fix: System prompt sorting 2026-01-24 13:44:41 +01:00
Aleksander Grygier 2601bf0f59 fix: Save draft message in Chat Form when adding System Prompt from new chat view 2026-01-24 13:32:49 +01:00
Aleksander Grygier a647edfc0b fix: Chat Form submission 2026-01-24 12:33:24 +01:00
Johannes Gäßler 8f91ca54ec
CUDA: re-use MLA K data for V in MMA FA (#19057) 2026-01-24 10:09:36 +01:00
Aman Gupta 81ab64f3c8
ggml-cuda: enable cuda-graphs for `n-cpu-moe` (#18934)
* ggml-cuda: add split-wise cuda graph

* add n-cpu-moe compare_llama_bench.py

* fix hip/musa builds
2026-01-24 14:25:20 +08:00
nullname 8af1f5f430
ggml-hexagon: flash-attn opt (#19025)
* optimize flash attention kernel by improving score computation and online softmax update

* wip

* Refactor online softmax update in flash attention kernel for improved performance

* Optimize flash attention kernel by replacing float array with HVX_Vector for score computation

* wip
2026-01-23 22:02:07 -08:00
Aleksander Grygier bd16b6145c chore: update webui build output 2026-01-24 01:32:36 +01:00
Aleksander Grygier 8428741034 feat: MCP Prompts WIP 2026-01-24 01:26:17 +01:00
Georgi Gerganov 557515be1e
graph : utilize `ggml_build_forward_select()` to avoid reallocations (#18898)
* graph : avoid branches between embedding and token inputs

* models : make deepstack graphs (e.g. Qwen3 VL) have constant topology

* ci : enable -DGGML_SCHED_NO_REALLOC=ON for server CI

* cont : pad token embeddings to n_embd_inp
2026-01-23 18:22:34 +02:00
Aleksander Grygier 3d88d0b6b2 chore: update webui build output 2026-01-23 15:21:56 +01:00
Aleksander Grygier 9c391d8e0d feat: UI improvements 2026-01-23 15:21:03 +01:00
Neo Zhang cb6caca191
[SYCL] use malloc to support both iGPU and dGPU in same time (#18992)
* use malloc to support both iGPU and dGPU in same time

* support windows

---------

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-23 20:54:10 +08:00
Xuan-Son Nguyen b5b8fa1c8b
chat : fix translategemma crash on common_chat_format_example (#19019) 2026-01-23 12:03:42 +01:00
Daniel Bevenius a14b960bc7
model-conversion : use BUILD_DIR variable in all scripts (#19015)
This commit modifies all the utility scripts to use an optional
BUILD_DIR variable/argument to specify the build directory.

The motivation for this is that Commit
3d55846a5c ("model-conversion : add
BUILD_DIR variable to run-converted-model scripts") introduced this
variable to the causal and embeddings scripts, but I missed the scripts
in the utils directory.
2026-01-23 09:01:36 +01:00
Alberto Cabrera Pérez 091a46cb8d
ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (#18860)
* Boilerplate for q5_Kx8 REPACK on ARM and fallback

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q5_K repack gemm and gemv generics

* Gemm and Gemv ARM implementations (i8mm)

* Improved qh manipulation looking at non-repack vec_dot implementation

* Full unroll

* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix wrong fallback definitions of Q5_K

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed comments. Reverted unnecessary formatting

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed typo in generic definitions

* Switching AND + Shift with Shift Insert. Better op interleaving.

* Vectorize + unroll the block scales

* Apply gemm optimizations to gemv

* Improve bias calculation

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
2026-01-23 09:55:08 +02:00
Aldehir Rojas a3e812811d
cli : load parser definition (#19031)
* cli : load parser definition

* cont : only unload if a parser is defined
2026-01-22 20:31:22 -06:00
Xuan-Son Nguyen 51fa458a92
server : support preserving reasoning_content in assistant message (#18994)
* support reasoning_content input

* report template caps to webui

* add docs

* rm commented code
2026-01-22 21:30:06 +01:00
Georgi Gerganov a5eaa1d6a3
mla : make the V tensor a view of K (#18986)
* mla : pass V as a view of K to the FA op

* cuda : adjust mla logic to new layout

* kv-cache : fix rope shift

* tests : remove comment

* cuda : fix reusable_cutoff

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-22 22:09:01 +02:00
Johannes Gäßler e2baf02162
CUDA: fix alignment check for FA (#19023) 2026-01-22 20:39:25 +01:00
Aman Gupta e34d6d03b2
convert_hf_to_gguf.py: refactor modify_tensors to call super (#18866) 2026-01-23 02:58:07 +08:00
lhez 9c96465f99
opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970)
* opencl: add `copy_to_contiguous` and utilize mm kernels

* opencl: only copy to cont for f32 and f16 tensors

* opencl: use cont mm for fallback when dst is large

* opencl: use nb local to copy-to-cont

* opencl: use local offset as well
2026-01-22 10:29:25 -08:00
Xuan-Son Nguyen 4e595b250a
server: do not log certain endpoints (avoid log spam) (#19028) 2026-01-22 19:24:37 +01:00
Aleksander Grygier 963711cccb chore: update webui build output 2026-01-22 18:20:55 +01:00
Aleksander Grygier 6018f85c65 feat: Architectural improvements 2026-01-22 18:19:37 +01:00
Aleksander Grygier c02e83c32a feat: Per-conversation agentic loop state 2026-01-22 17:38:51 +01:00
Georgi Gerganov 0e4ebeb057
quant : manual overrides of tensor types take precedence (#18952) 2026-01-22 16:17:06 +02:00
Aaron Teo 8b30840703
release: update github api (#19022) 2026-01-22 21:38:02 +08:00
Xuan-Son Nguyen 9eb5bfec1a
mtmd : update docs to use llama_model_n_embd_inp (#18999) 2026-01-22 14:36:32 +01:00
손희준 c6926d1d95
server: Reorder methods in `server-task.cpp` (#19016)
* Move `task_result_state::update_chat_msg` to match with header

* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header

---------

Co-authored-by: openingnow <>
2026-01-22 14:36:04 +01:00
Aman Gupta b70d251076
CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953) 2026-01-22 18:51:53 +08:00
shaofeiqi 5516b9c16a
opencl: add TRI op support (#18979) 2026-01-21 22:05:54 -08:00
Aleksei Nikiforov 94242a62c0
ggml-zdnn : mark zDNN buffers as non-host (#18967)
While buffers reside in host memory,
additional transformation is needed to use buffers with zDNN.

Fixes #18848
2026-01-22 01:16:21 +01:00
Pádraic Slattery 6b99a223e3
ci : update GitHub Actions versions [no ci] (#18935) 2026-01-22 00:57:18 +01:00