Commit Graph

786 Commits

Author SHA1 Message Date
Georgi Gerganov 1bde70785d
sampling : remove redundant calls to ggml_build_forward_expand 2025-12-04 14:25:28 +02:00
Georgi Gerganov fce571ee51
sampling : simplify temp sampling 2025-12-04 14:23:02 +02:00
Daniel Bevenius ac9e164714
sampling : fix backend temp sampling to use logits masking 2025-12-04 09:39:20 +01:00
Georgi Gerganov a67ef0f47f
llama : fix sanity checks during quantization (#17721) 2025-12-04 10:33:42 +02:00
Daniel Bevenius 10bd640aae
Revert "sampling : stop short if backend sampler sampled a token"
This reverts commit 87b2719eca.
2025-12-04 08:26:33 +01:00
Daniel Bevenius c0b182f4d6
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-04 08:17:50 +01:00
Daniel Bevenius 87b2719eca
sampling : stop short if backend sampler sampled a token
This commit modifies the graph building logic to immediately continue
when a token has already been sampled by the backend sampler.

It also updates the test for backend temporary sampling to include
top-k and distribution samplers in the chain to verify that they are not
producing any logits (they are not run).
2025-12-04 08:13:49 +01:00
Georgi Gerganov cce3b2a8ad
sampling : minor cleanup 2025-12-03 15:39:44 +02:00
Herman Semenoff 37adc9c6ba
ggml, llama : use defaulted constructors/destructors (#17649) 2025-12-03 07:12:18 +01:00
Daniel Bevenius aad5a6afd7
sampling : implement temp_ext_backend sampling
This commit implements the apply function for the extended temperature
sampling.
2025-12-02 17:26:04 +01:00
Adrien Gallouët f3a9674ae8
llama : fix signed comparison warning on FreeBSD (#17497)
This ensures correct RLIM_INFINITY handling and compatibility on all platforms (32/64-bit).

    warning: comparison of integers of different signs: 'rlim_t' (aka 'long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
      488 |         if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size)) {
          |                         ~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-02 12:05:38 +01:00
Daniel Bevenius db8972e251
squash! sampling : fix backend temp sampler for zero temperature
This modifies the parent commit to simply return the most probably token
instead of masking the logits.
2025-12-02 11:53:29 +01:00
Daniel Bevenius 3e9a258c14 Merge remote-tracking branch 'upstream/master' into gpu-sampling 2025-12-02 09:26:04 +01:00
Daniel Bevenius 739b597804 sampling : fix backend temp sampler for zero temperature
This commit fixes the implementation of the temperature-based sampler
for the case when the temperature is set to zero. This now correctly
selects the most probable token by masking out all other tokens in the
logits.
2025-12-02 09:13:07 +01:00
Piotr Wilkin (ilintar) 746f9ee889
Override SSM_A op for Qwen3 Next to reduce splits (#17587)
* Override SSM_A op for Qwen3 Next to reduce splits

* New tensor mapping SSM_A_NOSCAN for SSM_A used outside of OP_SSM_SCAN context.

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-02 00:43:13 +01:00
Gilad S. 00c361fe53
fix: llama arch implementation (#17665) 2025-12-01 21:21:13 +01:00
Georgi Gerganov 88cca45bb8
sampling : fix top_p empty condition 2025-12-01 18:02:34 +02:00
Georgi Gerganov 04f2822a86
sampling : do not create empty samplers 2025-12-01 17:52:07 +02:00
Georgi Gerganov 4032ce2378
common : simplify sampler chain initialization 2025-12-01 17:11:11 +02:00
Oliver Simons 217469f07f Make backend's top_p sampler inclusive
In addition to match the algorithm proposed in the original
[paper](https://arxiv.org/abs/1904.09751), this resolves the edge-case
where `max_p is > top_p` for a single logit, where the mask would
otherwise be empty (and we thus sample from the whole vocabulary with
equal likelihood)
2025-12-01 15:28:06 +01:00
Oliver Simons ae0bb6a6da Factor out `ggml_sort` into its own function 2025-12-01 15:28:06 +01:00
Georgi Gerganov 16451d6bc3
Merge branch 'master' into HEAD 2025-12-01 14:47:50 +02:00
Xuan-Son Nguyen cd3c118908
model: support Ministral3 (#17644)
* conversion script

* support ministral 3

* maybe this is better?

* add TODO for rope_yarn_log_mul

* better ppl (tested on 14B-Instruct)

* Add Ministral3 support to Mistral format

* improve arch handling

* add sizes

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* nits

---------

Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-01 12:26:52 +01:00
Oliver Simons 8bee483c97 Fix backend_top_p_sampler
softmax(softmax) will return uniform distribution, so we should not
return the softmax but the logits instead.
2025-12-01 12:07:30 +01:00
Aman Gupta 6eea666912
llama-graph: avoid expand_forward for fusion (#17633) 2025-12-01 11:12:48 +02:00
Daniel Bevenius cf0e1475c5
sampling : lower log level for output buffer reallocations [no ci]
This commit changes the logging level for output buffer reallocations
in the llama_context::output_reserve function from INFO to DEBUG.

The motivation for this is that it currently logs to info and when
enabling verbose logging for llama-cli this will get mixed with the
output, for example:

```console
What is the capital of Sweden?output_reserve: reallocating output buffer from size 0.58 MiB to 1.74 MiB
 1. Stockholm
2\. Helsinki
Based are the options
1. Stockholm
Explanation: Stockholm is the capital of
...
```
2025-12-01 09:13:47 +01:00
Georgi Gerganov 80742cbaeb
cont : naming 2025-11-30 11:24:30 +02:00
Georgi Gerganov c187003d81
llama : naming 2025-11-30 00:05:47 +02:00
Georgi Gerganov 1760bd69b3
llama : reserve graphs with samplers 2025-11-29 23:57:25 +02:00
Georgi Gerganov ff7b0bf632
llama : call backend_init once 2025-11-29 23:09:53 +02:00
Georgi Gerganov d8d98bb4bb
Merge branch 'master' into HEAD 2025-11-29 22:38:44 +02:00
Georgi Gerganov 9028ebfea8
llama : cleanup + naming 2025-11-29 22:37:07 +02:00
Georgi Gerganov fbc8f49f3c
llama : simplify 2025-11-29 17:01:00 +02:00
Diego Devesa e072b2052e
ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (#17276)
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.

* llama : update worst-case graph for unified cache

* ci : disable op offload in some tests

* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-28 17:33:23 +02:00
Georgi Gerganov 2464d1b3fc
sampling : simplify 2025-11-28 17:21:12 +02:00
Daniel Bevenius 8cac9dee45
sampling : use logits directly for min-p filtering 2025-11-28 16:12:05 +01:00
Oliver Simons 333da805fe Add initial version for top-p sampling
As we only support static graphs for the time and we don't know the size
of the output of top-p, we have to do value-scaling same as for min-p
operator.

Further improvements can be applied to the unit-test (i.e. check for
equivalence of top_p happening on backend with top_p happening on cpu)
and also by constructing candidates and sorting those as opposed to
reversing the sort of the logits (this would be arange +
get_rows instead of argsort + get_rows)
2025-11-28 15:16:20 +01:00
Georgi Gerganov 117e2079a9
refactor : simplify and improve memory management 2025-11-28 16:09:42 +02:00
Daniel Bevenius 459b7ae7b9
squash! sampling : support intermixed backend/cpu samplers
Fix llama-save-load-state which currently fails by handling the case
when batch.logits is nullptr (like when loading state) by allocating
space for all outputs as CPU logits.
2025-11-28 13:50:47 +01:00
Piotr Wilkin (ilintar) ff55414c42
model : Qwen3 Next (#16095)
* Qwen3 Next - cleaned up version

* Whitespaces and stuff

* Correct minor errors

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Misc. fixes.

* Clean up code, add missing hybrid qualifier

* Did someone transpose the SOLVE_TRI result matrix? Perhaps...

* Whitespace

* Proper tensors for cb calls

* Use llama-graph.h vertical alignment

* BROKEN: chunking

* Set new tensors as inputs.

* Proper chunk logic

* It's the circle of life...

* More shenanigans for n_seq > 1

* Nail in the coffin?

* Fix Windows build

* Eh, one fails on Windows, the other fails on Mac... just use general capture.

* quant : cleanup

* model : cleanup

* qwen3 : cleanup

* cont : cleanup

* cont : cleanup

* ggml : revert change

* qwen3 : cleanup

* cont : cleanup

* Readd cmath

* qwen3 : fix typo

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Usual suspects

* fix my bad suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-28 12:02:56 +01:00
Daniel Bevenius 9ad6522be6
squash! sampling : support intermixed backend/cpu samplers
Add check that logits is not null which is can happen for embeddings.
2025-11-28 08:57:48 +01:00
Daniel Bevenius 74be332e24
sampling : support intermixed backend/cpu samplers
This commit updates the backend sampling implementation to support
intermixed usage of backend and CPU samplers within the same batch.

The initial implementation was developed as an all-or-nothing solution:
either perform backend sampling for the entire batch, or perform CPU
sampling for the entire batch.

The motivation for this change is to support batches with mixed
sequences. For example, we may have a backend sampler configured for
sequence 0, while sequence 1 in the same batch uses CPU sampling. This
was not supported in the initial implementation.

This issue manifested in llama-server with the webui: decoding with
backend samplers would work initially, but after changing to CPU
sampling, a slot (sequence) could still be using a backend sampler.
This meant that logits in output_reserve would not be allocated,
resulting in an error.

The solution in this commit inspects the batch to determine which
sampling modes are needed and allocates buffers accordingly. However,
there is a known inefficiency: when we have intermixed backend/CPU
samplers in the same batch, we currently copy all logits to the host,
even for sequences using backend samplers.

Added test_backend_cpu_mixed_batch to verify correct behavior with
mixed backend/CPU samplers in a single batch, including dynamic
sampler switching between decode calls.
2025-11-28 08:38:05 +01:00
Georgi Gerganov c386114922
arch : add description about LLM_TENSOR_INFOS (#17550) 2025-11-27 16:34:13 +02:00
Georgi Gerganov 6783b11fb0
models : fix LFM2 tensors (#17548) 2025-11-27 16:04:29 +02:00
Daniel Bevenius 172208afbf
sampling : add comments about backend sampler [no ci]
This commit adds a comment to llama_context's constructor explaining why
backend samplers are initialized early in the process.
2025-11-27 14:59:52 +01:00
Daniel Bevenius d9d736102b
sampling : use argmax for min-p sampling 2025-11-27 07:38:44 +01:00
Daniel Bevenius b45d504e70
sampling : add min-p backend sampler 2025-11-26 10:50:58 +01:00
Daniel Bevenius ec047e12ee
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 15:16:44 +01:00
Georgi Gerganov 583cb83416
ggml : add ggml_top_k (#17365)
* ggml : add ggml_top_k

* cont : add ggml_argsort_top_k

* metal : add top_k support

* ggml : cleanup

* tests : add virtual err() function for test_case

* ggml : add comments
2025-11-25 15:31:43 +02:00
Daniel Bevenius 2b4c7927ee
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 06:10:33 +01:00
Aaron Teo 877566d512
llama: introduce support for model-embedded sampling parameters (#17120) 2025-11-25 09:56:07 +08:00
Daniel Bevenius 134e6940ca
llama : skip output reordering for single token batches (#17466)
This commit adds a check to skip the output reordering logic when
n_outputs == 1. With a single output token, the data is trivially
sorted and the reordering code is currently doing unnecessary work
(resetting and rebuilding output_ids to the same values).

The motivation for this change is improved code clarity and avoiding
confusion when debugging. While the performance impact is probably
negligible, this unnecessary work happens on every decode call in
llama-server when processing batches with single-token outputs.
2025-11-24 21:06:17 +01:00
Daniel Bevenius a02adf4211
sampling : add assertions for contiguous tensors in async copy functions 2025-11-24 21:01:06 +01:00
Georgi Gerganov 883a87043a
samplers : add missing cont 2025-11-24 21:46:57 +02:00
Daniel Bevenius 25f33806d3
sampling : add debug log when backend sampler selects token
This commit adds a debug log statement in the llama_sampler_sample
to indicate when a backend sampler has selected a token for a given
index.

The modification helps in tracing the sampling process and understanding
the flow of control when backend samplers are used.
2025-11-24 15:03:41 +01:00
Daniel Bevenius 8eb9b4769d
sampling : remove redundant checks for stride and size [no ci] 2025-11-24 13:53:29 +01:00
Daniel Bevenius 4a90583d7d
sampling : cleanup and clarify output_reserve 2025-11-24 13:26:18 +01:00
Daniel Bevenius 7816f0bb56
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-24 07:44:06 +01:00
william pan 4902eebe33
models : Added support for RND1 Diffusion Language Model (#17433)
* Converted RND1 model to GGUF weights

* RND1 llama.cpp support v1

* RND1 llama.cpp support v2 non causal bug

* RND1 llama.cpp support v3 doccumentation

* RND1 llama.cpp support v4 clean code

* linting issues

* RND1 pr fixes v1

* RND1 pr fixes v2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Diffusion documentation edits

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-24 14:16:56 +08:00
Daniel Bevenius 9e273f7aa4
sampling : fix copying both sampled tokens and logits/probs from backend
This commit fixes the issue where both sampled tokens and logits/probs
were not being copied correctly from the backend to the host when
multiple backend samplers were used.

A test for this scenario has also been added to ensure that both types
of data are copied correctly when different backend samplers are
employed.
2025-11-23 13:12:01 +01:00
Daniel Bevenius ae23d2d2c1
sampling: clarify candidate ids usage in comments 2025-11-23 11:28:19 +01:00
Daniel Bevenius 65500d05ab
sampling : add stride variable for clarity 2025-11-23 11:27:54 +01:00
Daniel Bevenius 79b8cf2a75
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-21 16:38:32 +01:00
ubergarm 23bc779a6e
model : detect GigaChat3-10-A1.8B as deepseek lite (#17420)
* Detect GigaChat3-10-A1.8B as deepseek lite

Hardcodes checking number of layers to detect if lite version of deepseek.

* Add commnent identifying deepseek lite variants

deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B
2025-11-21 14:51:38 +01:00
Daniel Bevenius 61ffe41dc1
sampling : use pinned memory for backend sampling buffers 2025-11-21 14:02:16 +01:00
Xuan-Son Nguyen 054a45c3d3
grammar: fix regression caused by #17381 (#17412)
* grammar: fix regression caused by #17381

* more readable
2025-11-20 18:35:10 +01:00
Daniel Bevenius 0d28b16bdc
sampling : introduce sampling_info struct
This commit introduces a sampling_info struct to encapsulate all
backend sampling related data within the llama_context class.

It also updates to use more descriptive names for sampled tokens and
candidates in the backend sampler ggml data structure.
2025-11-20 14:45:56 +01:00
Piotr Wilkin (ilintar) 92c0b387a9
grammar : fix integer overflow (#17381)
* Fix DoS / integer overflow

* Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :)

* White space

* Actually, since it's unsigned, use UINT64_MAX
2025-11-20 14:47:04 +02:00
Georgi Gerganov 196f5083ef
common : more accurate sampling timing (#17382)
* common : more accurate sampling timing

* eval-callback : minor fixes

* cont : add time_meas impl

* cont : fix log msg [no ci]

* cont : fix multiple definitions of time_meas

* llama-cli : exclude chat template init from time measurement

* cont : print percentage of unaccounted time

* cont : do not reset timings
2025-11-20 13:40:10 +02:00
Daniel Bevenius ed4345bdd9 squash! common : fix regression caused by extra memory allocations during sampling
Apply the same changes to llama-sampling.cpp, llama_sampler_sample as
were applied in commit 38f408c25.
2025-11-20 07:56:33 +01:00
Daniel Bevenius 0c660e7390
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-20 06:57:24 +01:00
Daniel Bevenius 18ed4d8f96
squash! sampling : simplify backend sampling logic decode
The commit fixes a variable shadowing issue in the
`llama_context::decode` function which was introduced in a previous
refactoring.
2025-11-19 15:10:15 +01:00
Daniel Bevenius d74eb61aa7
squash! sampling : simplify backend sampling logic decode
Fix condition to check if backend actually sampled tokens, not just that
backend samplers are available.
2025-11-19 11:29:26 +01:00
Daniel Bevenius 7e98ebcc6b
sampling : simplify backend sampling logic decode
This commit tries to simplify the backend sampling logic in
llama_context::decode.
2025-11-19 09:31:33 +01:00
Daniel Bevenius 51fee29822
sampling : always populate logits for sampled probs
This commit updates common/sampler.cpp set_logits and
src/llama-sampling.cpp llama_sampler_sample to always populate the
logits field when backend sampled probabilities are available.

The motivation for this is that this ensure that CPU sampler always have
access to the logits values even when probabilites have been produced by
backend samplers.
2025-11-19 07:14:11 +01:00
Daniel Bevenius 0da7e7dccc
sampling : remove version from sampler chain
This commit removes the version field from the sampler chain and instead
used the sampler pointer itself for change detection.
2025-11-19 06:59:03 +01:00
Haiyue Wang a045492088
vocab : call reserve() for building plamo-2-translate suffix (#17343)
Test 'Q4_K_M' quantization on https://huggingface.co/pfnet/plamo-2-translate

The 'suffix_to_score' size is 193510, it needs 19 memory allocation with final
capacity 262144 to hold the value, if not preserve the memory.

Signed-off-by: Haiyue Wang <haiyuewa@163.com>
2025-11-18 18:58:22 +01:00
Daniel Bevenius 311c1a347f
sampling : ensure at most one output token per seq
This commit adds a check in the batch allocator to ensure that when
backend sampling is enabled, at most one output token is specified per
sequence.
2025-11-18 16:06:23 +01:00
Daniel Bevenius 82957a90f2
sampling : always expose sampled_ids
This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.

The motivation for this is that this enables both common/sampling.cpp
and src/llama-sampling.cpp can simplify their logic.

Not all backends samplers that process logits need to set the
sampled_tokens_id as they may not change the order of the logits, for
example the temperature sampler only scales the logits but does not
change their order. Simliar the logit bias sampler only adds bias to
specific token ids but does not change the order of the logits. In
these cases there will not be a device to host copy of the sampled
token ids, and this is the use case where having this precomputed
list is useful.
2025-11-18 15:11:59 +01:00
Georgi Gerganov 4b52e59903
graph : do not include llama-model.h 2025-11-18 13:53:25 +02:00
Daniel Bevenius 7884b0e0ac
sampling : add support for backend sampling
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to enable sampling to be performed
directly on the backend as part of the computation graph being executed,
allowing for some or all of the sampling to be done on the backend.

For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilites need to be
transferred back to system memory for further processing by CPU
samplers.

Currently the backend sampling works in a similar manner to how
pooling works, it is a function that is called by build_graph and the
sampler operations become part of the models computation graph.
2025-11-17 16:15:58 +01:00
Bartowski e1fcf8b09b
model : add AfmoeForCausalLM support (#16477)
* Add AFMOE model support

* Update to vocab

* Add model sizing

* Undo Rope change for ARCEE model

* Address review comments

* Update modeling code is_sliding -> use_rope, replace hard-coded logic

* Fix AFMOE tokenizer

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update AFMoE tokenizer class identification to be more unique

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-14 13:54:10 +01:00
Marek Hradil jr. 6cd0cf72ce
fix : Dangling pointer for non-empty trigger words in lazy grammar construction (#17048)
* fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (#17047)

* Replace 'static' workaround, with keeping variable in scope for longer

* Create std::array directly and pass into llama_grammar_init_impl

* Add back the trigger pattern

* Missed array include
2025-11-14 14:35:26 +02:00
Aman Gupta a90eb94ca9
CUDA: fuse rope + set_rows (#16884)
* CUDA: add fused rope

* move k forward_expand up

* create helper function instead of re-using params

* make assert statement more in line with comment

* rope_norm: coalesced writes to global mem
2025-11-13 08:50:01 +08:00
o7si ffb6f3d921
vocab : correct bounds check for UGM XCDA array access (#17215) 2025-11-12 23:41:02 +01:00
Mike Abbott 4a5b8aff40
cmake : add version to all shared object files (#17091)
When compiling llama.cpp in Yocto, it fails QA checks because the generated so files aren't versioned.  This applies a version to all generated so files, allowing the package to build without errors.
2025-11-11 13:19:50 +02:00
Sigbjørn Skjæret 7bef684118
models : move build_inp_out_ids outside loop (#17151)
* move build_inp_out_ids outside loop

* realign
2025-11-10 22:55:30 +01:00
Gabe Goodhart 0c74f32632
memory: Hybrid context shift (#17009)
* feat(memory): Only fail partial erasure of recurrent tail

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.

There is one potential case that this doesn't address which is the pruning
of cache to remove sensitive data from the context. This wouldn't work for
attention cache partial removal (in the middle) either since the KV state
is linearly-dependent and states in later sequence positions would still be
based on the state from the sensitive data, even if that data is no longer
cached, so I don't think this is relevant, but it is worth noting that the
semantics of this change for a partial erasure in the middle of the cache
are essentially "my context is already compressed" and not "all trace of
the removed tokens has been removed."

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(main): Check the output of seq_rm for prefix matching

This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(memory): Fix condition for partial erasure failure if p0 > pos

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: compilade <git@compilade.net>

* style: Fix extra parens

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-10 17:14:23 +02:00
Sigbjørn Skjæret 9008027aa3
hparams : add n_embd_inp() to support extended embed (#16928)
* add n_embd_full to support extended embed

* don't change output

* rename to n_embd_inp

* restore n_embd where applicable
2025-11-07 19:27:58 +01:00
Georgi Gerganov 16bcc1259d
kv-cache : pad the cache size to 256 for performance (#17046)
* kv-cache : pad the size of the small SWA cache for performance

* context : pad the total context to 256

* cont : future-proof the swa pad

* server : adjust test params to new logic
2025-11-07 20:03:25 +02:00
Johannes Gäßler aa374175c3
CUDA: fix crash on uneven context without FA (#16988) 2025-11-06 14:05:47 +01:00
Li Pengzhan 9f052478c2
model : add openPangu-Embedded (#16941)
* Model: add openPangu-Embedded

* fixed according to reviewer's comments

* fixed the chat template check condition

* Apply suggestions from code review

change the chat-template check condition and some formatting issue

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* whitespace cleanup

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-05 10:28:58 +01:00
Sigbjørn Skjæret b164259bba
chore : fix models indent after refactor (#16992) 2025-11-04 12:29:15 +01:00
Georgi Gerganov cd5e3b5754
server : support unified cache across slots (#16736)
* server : support unified context across slots

* cont : fix speculative decoding initialization

* context : fix n_ctx_per_seq computation

* server : purge slots one by one

* tests : add unified cache server tests

* llama : update per-seq context computation

* test-thread-safety : handle tiny training context of the input model

* server : fix server_tokens clear()

* server : use 4 slots + unified KV by default

* llama : add note about context size queries

* cont : update todos [no ci]

* context : do not cap the size of the context

* tests : adjust parameters to be CI friendlier

* context : add warning
2025-11-02 18:14:04 +02:00
Piotr Wilkin (ilintar) bea04522ff
refactor : llama-model.cpp (#16252)
* Sqashed: llama-model.cpp refactoring

* Fix formatting of attn / ffn / ffn_moe calls

* Fix import regression / unify spacing in models.h

* totally DID NOT miss those!

* Add missing qwen3vl(moe) models

* Add missing new .cpp files to build

* Remove extra semicolons

* Editor checker

* Update src/models/models.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-31 23:40:23 +01:00
Piotr Wilkin (ilintar) 0de0a01576
model : Minimax M2 (#16831)
* Model: Minimax M2

* Cleanup

* Cleanup pt. 2

* Cleanup pt. 3

* Update convert_hf_to_gguf_update.py - merge catch blocks

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove vocab models and test

* Remove all redundant hparam settings covered by TextModel

* Move super to start, don't set block_count

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-31 21:20:47 +01:00
Giuseppe Scrivano e58d585604
model : add Granite Hybrid nano types (#16896)
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-10-31 21:20:07 +01:00
Georgi Gerganov 8da3c0e200
batch : fix consistency checks for the input positions (#16890) 2025-10-31 13:50:33 +02:00
JJJYmmm d261223d24
model: add support for qwen3vl series (#16780)
* support qwen3vl series.

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>

* bugfix: fix the arch check for qwen3vl-moe.

* use build_ffn

* optimize deepstack structure

* optimize deepstack feature saving

* Revert "optimize deepstack feature saving" for temporal fix

This reverts commit f321b9fdf1.

* code clean

* use fused qkv in clip

* clean up / rm is_deepstack_layers for simplification

* add test model

* move test model to "big" section

* fix imrope check

* remove trailing whitespace

* fix rope fail

* metal : add imrope support

* add imrope support for sycl

* vulkan: add imrope w/o check

* fix vulkan

* webgpu: add imrope w/o check

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix tensor mapping

---------

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 16:19:14 +01:00
Tianyue-Zhao bacddc049a
model: Add support for CogVLM model (#15002)
* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for cogvlm graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-10-30 12:18:50 +01:00