Commit Graph

8413 Commits

Author SHA1 Message Date
Aman Gupta 908a9e5a1e
CUDA: disable cuda graph when using n-cpu-moe (#18593)
* CUDA: disable cuda graph when using n-cpu-moe

* call ggml_cuda_set_device
2026-01-05 01:37:48 +08:00
Aman Gupta 5126c41c1c
ggml-cuda: remove unused params in ggml_cuda_graph (#18579) 2026-01-05 01:37:09 +08:00
Aldehir Rojas cef1d23c5a
common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342)
* grammar : add support for std::regex_search() with trigger patterns

* common : update hermes2 pro trigger to search instead of match

* common : use regex_search with anchoring for partial matching

* common : adjust regex partial tests to use new pattern

* grammar : check pattern directly instead of adding a type

* common : adjust existing patterns to match new semantics
2026-01-03 16:02:43 -06:00
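A minimal sketch of the idea behind the commit above: searching for a trigger pattern instead of wrapping it in `[\s\S]*` for a full match. This is Python's `re`, not the `std::regex` code in llama.cpp, and the trigger string and function names are hypothetical.

```python
import re

# Hypothetical trigger; the real trigger patterns live in llama.cpp's common code.
TRIGGER = r"<tool_call>"

def triggers_with_match(text: str) -> bool:
    # Full-match style: the trigger must be padded with [\s\S]* on both sides,
    # which is the construct the commit title calls problematic.
    return re.fullmatch(r"[\s\S]*" + TRIGGER + r"[\s\S]*", text) is not None

def triggers_with_search(text: str) -> bool:
    # Search style: find the trigger anywhere, with no wildcard padding.
    return re.search(TRIGGER, text) is not None
```

Both functions give the same answer for a given input; only the search form avoids the wildcard prefix.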
Georgi Gerganov c69c7ebc90
graph : fix graph reuse logic when `n_pos_per_embd > 1` (#18566) 2026-01-03 23:59:06 +02:00
Aman Gupta e57f52334b
ggml-cuda: fixes for concurrent streams (#18496) 2026-01-03 23:15:01 +08:00
Georgi Gerganov a554a1ecc7
context : fix reserve token padding to n_seqs (#18536) 2026-01-03 15:45:34 +02:00
Johannes Gäßler 0f2e42ca1d
CUDA: only allocate FA tmp buffer if needed (#18564) 2026-01-03 13:55:53 +01:00
Imad Saddik db8d1acd3a chore: update webui build output 2026-01-03 11:46:18 +01:00
pl752 9dba9f5352
(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559)
* CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta)

* CUDA: Explicitly casted some of the int alloc counts before multiplication in argsort

---------

Co-authored-by: pl752 <maximpl752@gmail.com>
2026-01-03 11:13:40 +01:00
Imad Saddik 72af4199c4 Replace <label> with <Label> 2026-01-03 09:44:26 +01:00
Imad Saddik 7295444cd2 Fix autoChatWidth checkbox to reset customChatWidth when enabled 2026-01-03 09:41:02 +01:00
Imad Saddik 6487840b0a Pass missing style prop 2026-01-03 09:39:45 +01:00
Imad Saddik 22731da153 chore: update webui build output 2026-01-03 09:23:16 +01:00
Imad Saddik eb997f61f9 Add autoChatWidth and customChatWidth to syncable parameters 2026-01-03 09:21:55 +01:00
Imad Saddik 21b35be366 chore: update webui build output 2026-01-03 09:12:29 +01:00
Imad Saddik c9b34bc00d Replace getChatWidth utility with chatWidthClasses in chat components 2026-01-03 09:10:44 +01:00
Imad Saddik b6536f6589 Format code 2026-01-03 08:41:30 +01:00
Imad Saddik 36f334f4af Put the constant into constants/chat-width.ts 2026-01-03 08:40:21 +01:00
Imad Saddik 00e6cafda6 Rename component to ChatSettingsComboboxCustomWidth 2026-01-03 08:37:09 +01:00
Imad Saddik 0143112cf0 chore: update webui build output 2026-01-03 08:35:42 +01:00
Imad Saddik d7528b41fa Merge remote-tracking branch 'upstream/master' into feat/change_chat_screen_width 2026-01-03 08:34:29 +01:00
Shouyu bcfc8c3cec
ggml-hexagon: optimize activation function (#18393)
* refactor: refactor silu

* refactor: optimize swiglu

* refactor: remove unnecessary if in swiglu

* refactor: refactor swiglu_oai

* chore: fix formatting issue
2026-01-02 21:24:24 -08:00
Jeff Bolz 18ddaea2ae
vulkan: Optimize GGML_OP_CUMSUM (#18417)
* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
2026-01-02 15:32:30 -06:00
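The two paths described in the commit message above can be sketched on the CPU as follows. This is an illustrative Python version, not the Vulkan shader; the function names and `block_size` parameter are assumptions made for the example.

```python
def cumsum_whole_row(row):
    # Single-pass path: one running sum over the entire row.
    out, acc = [], 0
    for x in row:
        acc += x
        out.append(acc)
    return out

def cumsum_two_pass(row, block_size):
    # Pass 1: compute an inclusive scan independently inside each block.
    blocks = [row[i:i + block_size] for i in range(0, len(row), block_size)]
    partials = [cumsum_whole_row(b) for b in blocks]

    # Pass 2: add the total of all preceding blocks to every element of a block.
    out, offset = [], 0
    for p in partials:
        out.extend(v + offset for v in p)
        offset += p[-1]  # last element of a block scan is that block's total
    return out
```

For example, `cumsum_two_pass([1, 2, 3, 4, 5], 2)` yields the same result as `cumsum_whole_row([1, 2, 3, 4, 5])`. In the sketch, pass 1 is independent per block, which is what lets a single large row be split across multiple workgroups.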
Jeff Bolz 706e3f93a6
vulkan: Implement mmvq for iq1_s/iq1_m (#18450) 2026-01-02 20:19:04 +01:00
Prabod 5755e52d15
model : Maincoder-1B support (#18534)
* Add Maincoder model support

* Removed SPM model vocabulary setting and MOE related GGUF parameters
Removed trailing spaces from maincoder.cpp

* removed set_vocab

* added new line

* Fix formatting

* Add a new line for PEP8
2026-01-02 20:11:59 +01:00
Georgi Gerganov f38de16341
metal : adjust extra size for FA buffer to avoid reallocations (#18545) 2026-01-02 19:02:18 +02:00
Georgi Gerganov af1e8e1a6c
graph : reduce topology branching (#18548) 2026-01-02 19:01:56 +02:00
Georgi Gerganov d84a6a98be
vocab : reduce debug logs about non-EOG control tokens (#18541)
* vocab : reduce debug logs about non-EOG control tokens

* cont : add comment
2026-01-02 16:17:33 +02:00
Chris Rohlf c6f0e832da
rpc : use unordered_map::reserve and emplace (#18513) 2026-01-02 12:09:36 +02:00
MeeMin e86f3c2221
cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433)
* ggml-cuda: fixed assertion in ggml_cuda_cpy (#18140)

* ggml-cuda: changes in data types to int64_t

* ggml-cuda: added asserts for CUDA block numbers

* ggml-cuda: changed the condition for y and z dimension
2026-01-02 00:24:20 +01:00
Sigbjørn Skjæret 169ee68ffb
model : remove modern-bert iswa template (#18529)
* remove modern-bert iswa template

* forgotten
2026-01-02 00:06:42 +01:00
tt ced765be44
model: support youtu-vl model (#18479)
* Support Youtu-VL Model

* merge code

* fix bug

* revert qwen2 code & support rsplit in minja.hpp

* update warm info

* fix annotation

* u

* revert minja.hpp

* fix

* Do not write routed_scaling_factor to gguf when routed_scaling_factor is None

* fix expert_weights_scale

* LGTM after whitespace fixes

* fix

* fix

* fix

* layers to layer_index

* enum fix

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 19:25:54 +01:00
Piotr Wilkin (ilintar) 3ccccc83f7
Add conversion support for IQuestCoderForCausalLM (#18524) 2026-01-01 18:45:55 +01:00
o7si d0a6a31470
model : add support for JinaBertModel with non-gated ffn (#18475)
* WIP: Initial commit for fixing JinaBert original FF type support

* convert: add jina-v2-de tokenizer variant for German_Semantic_V3

* convert: fix token collision in BERT phantom vocab conversion

* convert: add feed_forward_type metadata

* model: add feed_forward_type metadata for jina-bert-v2

* model: jina-bert-v2 support standard GELU FFN variant

* model: remove ffn_type, detect FFN variant from tensor dimensions

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* revert collision fix to be handled in separate PR

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:38:51 +01:00
o7si 2b2afade9f
convert : fix encoding of WPM vocab for BERT models (#18500)
* convert: avoid token collision when stripping ## prefix

* convert: use token types for BERT special tokens check

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:27:07 +01:00
HelloKS f4f5019254
model: add Solar Open model (#18511)
* model: add Solar-Open model

* vocab: add solar-open to end eog blacklist

* model: add proper llm type

* chat: basic template for solar open

* typo: fix comment about vocab

* convert: suggested changes

* convert: suggested changes

* chat: change reasoning end tag for solar-open

* llama-chat: add solar-open template
2026-01-01 18:01:43 +01:00
Anri Lombard d5574c919c
webui: fix code copy stripping XML/HTML tags (#18518)
* webui: fix code copy stripping XML/HTML tags

* webui: update static build
2026-01-01 13:44:11 +01:00
Aman Gupta 26831bded9
ggml-cuda: remove unnecessary prints on ggml_cuda_init (#18502) 2026-01-01 19:18:43 +08:00
Jeff Bolz be47fb9285
vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-01 08:58:27 +01:00
triplenom 9e10bd2eaf
llama: handle short reads in direct I/O path (#18504) 2026-01-01 10:24:43 +08:00
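For context, a short read is a read call that returns fewer bytes than requested without having reached the end of the file. The usual remedy is to loop until the request is satisfied. The sketch below shows that general technique in Python; it is not the llama.cpp implementation, and `read_exact` is a hypothetical helper name.

```python
def read_exact(stream, size):
    # Keep reading until `size` bytes have arrived; a single read() may return fewer.
    buf = bytearray()
    while len(buf) < size:
        chunk = stream.read(size - len(buf))
        if not chunk:
            # An empty result means end of file before the request was satisfied.
            raise EOFError(f"expected {size} bytes, got {len(buf)}")
        buf += chunk
    return bytes(buf)
```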
Anri Lombard 4cd162a123
chat: make tool description and parameters optional per OpenAI spec (#18478)
* chat: make tool description and parameters optional per OpenAI spec

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667

* refactor: use value() for cleaner optional field access
2025-12-31 17:21:37 -06:00
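The behaviour described in the commit above, tolerating a missing `description` or `parameters` field, can be sketched as below. This is a Python analogue of default-valued field access, not the llama.cpp parser. The function name, the chosen defaults, and treating `name` as required are assumptions made for illustration.

```python
def parse_tool_function(tool: dict) -> dict:
    fn = tool["function"]
    return {
        "name": fn["name"],  # assumed required in this sketch
        # Optional fields fall back to defaults instead of raising.
        "description": fn.get("description", ""),
        "parameters": fn.get("parameters", {}),
    }
```

With this shape, a tool definition that carries only a name, such as `{"type": "function", "function": {"name": "ping"}}`, is accepted rather than rejected.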
Georgi Gerganov 13814eb370 sync : ggml 2025-12-31 18:54:43 +02:00
Georgi Gerganov 54f67b9b66 ggml : bump version to 0.9.5 (ggml/1410) 2025-12-31 18:54:43 +02:00
Anri Lombard 33ded988ba
quantize: prevent input/output file collision (#18451)
Check if input and output files are the same before quantizing to prevent
file corruption when mmap reads from a file being written to.

Fixes #12753
2025-12-31 23:29:03 +08:00
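The check described in the commit above, refusing to run when input and output are the same file, might look like the following in Python. This is a sketch of the idea only, not the code in the quantize tool; `is_same_file` is a hypothetical helper.

```python
import os

def is_same_file(input_path: str, output_path: str) -> bool:
    # If both paths exist, compare the underlying files so that links and
    # different spellings of the same path are caught.
    if os.path.exists(input_path) and os.path.exists(output_path):
        return os.path.samefile(input_path, output_path)
    # Otherwise fall back to comparing normalized absolute paths.
    return os.path.abspath(input_path) == os.path.abspath(output_path)
```

A caller would run this before opening the output for writing and abort if it returns `True`.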
Sigbjørn Skjæret 0db8109849
convert : lint fix (#18507) 2025-12-31 14:28:21 +01:00
Henry147147 9b8329de7a
mtmd : Adding support for Nvidia Music Flamingo Model (#18470)
* Initial commit, debugging q5_k_s quant

* Made hf_to_gguf extend whisper to reduce code duplication

* addressed convert_hf_to_gguf pull request issue

---------

Co-authored-by: Henry D <henrydorsey147@gmail.com>
2025-12-31 12:13:23 +01:00
gatbontonpc 9a6369bb60
metal : add count_equal op (#18314)
* add count equal for metal

* remove trailing whitespace

* updated doc ops table

* changed shmem to i32

* added multi tg and templating

* removed BLAS support from Metal docs

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add memset to set dst to 0

* metal : cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-31 10:39:48 +02:00
Johannes Gäßler ecc343de63
CUDA: fix KQ max calculation (#18487) 2025-12-31 09:37:00 +01:00
Georgi Gerganov 01ade96e71
metal : remove BF16 x F16 kernels (#18456) 2025-12-31 09:53:48 +02:00
Aman Gupta 7bcaf815c2
sycl: add newline at the end of CMakeLists.txt (#18503) 2025-12-31 14:23:44 +08:00