Commit Graph

113 Commits

Author SHA1 Message Date
Daniel Bevenius c5d44b8525
llama : fix typo in comment [no ci] 2025-12-17 09:02:30 +01:00
Daniel Bevenius ad1b60abc4
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-16 09:45:08 +01:00
Georgi Gerganov c560316440
graph : reuse SSM graphs (#16490)
* graph : reuse hybrid graphs

* graph : reuse recurrent graphs

* graph : fix reuse check for recurrent inputs

* memory : move the recurrent state into the memory context

* Revert "memory : move the recurrent state into the memory context"

This reverts commit 00f115fe810815d4a22a6dee0acc346131e970e1.

* cont : fix build
2025-12-16 09:36:21 +02:00
Daniel Bevenius 2995341730
llama : add support for NVIDIA Nemotron 3 Nano (#18058)
* llama : add support for NVIDIA Nemotron Nano 3

This commit adds support for the NVIDIA Nemotron Nano 3 model, enabling
conversion and inference for this model.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 07:19:26 +01:00
Georgi Gerganov 0086c246ee
Merge branch 'master' into HEAD 2025-12-14 16:44:30 +02:00
Xuan-Son Nguyen 0759b09c90
graph: add f_attn_temp_offset (#18025) 2025-12-14 13:05:59 +01:00
Georgi Gerganov 609a2d0268
models : fix YaRN regression + consolidate logic (#18006)
* models : fix YaRN regression + consolidate logic

* cont : fix the fix

* cont : remove header

* cont : add header
2025-12-14 08:34:56 +02:00
Georgi Gerganov 7bed317f53
models : fix the attn_factor for mistral3 graphs + improve consistency (#17945)
* models : fix the attn_factor for mistral3 graphs

* cont : rework attn_factor correction logic

* cont : make deepseek2 consistent

* cont : add TODO

* cont : special-case DSv2

* cont : revert Mistral 3 Large changes

* cont : fix DS2 to use the original attn_factor

* cont : minor comments
2025-12-12 17:12:40 +02:00
Georgi Gerganov 4d10b78e23
Merge branch 'master' into HEAD 2025-12-11 14:42:56 +02:00
Georgi Gerganov 4dff236a52
ggml : remove GGML_KQ_MASK_PAD constant (#17910)
* ggml : remove GGML_KQ_MASK_PAD constant

* cont : remove comment
2025-12-10 20:53:16 +02:00
Georgi Gerganov 804e7e3795
graph : respect sampler order for graph reuse 2025-12-10 20:40:15 +02:00
Georgi Gerganov c02654eb7d
graph : make the compute graph constant with respect to active samplers 2025-12-10 16:19:18 +02:00
Georgi Gerganov 81cb5783c8
Merge branch 'master' into HEAD 2025-12-10 13:41:32 +02:00
Sigbjørn Skjæret c8554b66e0
graph : use fill instead of scale_bias in grouped expert selection (#17867)
* use fill instead of scale_bias in grouped expert selection

* do not explicitly use _inplace
2025-12-08 21:29:59 +01:00
Georgi Gerganov 8ef5f900db
cont : fixes 2025-12-07 15:45:00 +02:00
Georgi Gerganov 30742a6ff5
sampling : expand support (wip) 2025-12-06 16:51:56 +02:00
Georgi Gerganov 7864074fdb
sampling : fix outputs and device checks 2025-12-04 19:33:01 +02:00
Daniel Bevenius 10bd640aae
Revert "sampling : stop short if backend sampler sampled a token"
This reverts commit 87b2719eca.
2025-12-04 08:26:33 +01:00
Daniel Bevenius 87b2719eca
sampling : stop short if backend sampler sampled a token
This commit modifies the graph building logic to immediately continue
when a token has already been sampled by the backend sampler.

It also updates the test for backend temperature sampling to include
top-k and distribution samplers in the chain, verifying that they do not
produce any logits (they are not run).
2025-12-04 08:13:49 +01:00
Georgi Gerganov 4032ce2378
common : simplify sampler chain initialization 2025-12-01 17:11:11 +02:00
Georgi Gerganov 16451d6bc3
Merge branch 'master' into HEAD 2025-12-01 14:47:50 +02:00
Xuan-Son Nguyen cd3c118908
model: support Ministral3 (#17644)
* conversion script

* support ministral 3

* maybe this is better?

* add TODO for rope_yarn_log_mul

* better ppl (tested on 14B-Instruct)

* Add Ministral3 support to Mistral format

* improve arch handling

* add sizes

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* nits

---------

Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-01 12:26:52 +01:00
Aman Gupta 6eea666912
llama-graph: avoid expand_forward for fusion (#17633) 2025-12-01 11:12:48 +02:00
Georgi Gerganov c187003d81
llama : naming 2025-11-30 00:05:47 +02:00
Georgi Gerganov 1760bd69b3
llama : reserve graphs with samplers 2025-11-29 23:57:25 +02:00
Georgi Gerganov ff7b0bf632
llama : call backend_init once 2025-11-29 23:09:53 +02:00
Georgi Gerganov 9028ebfea8
llama : cleanup + naming 2025-11-29 22:37:07 +02:00
Daniel Bevenius ec047e12ee
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 15:16:44 +01:00
Georgi Gerganov 583cb83416
ggml : add ggml_top_k (#17365)
* ggml : add ggml_top_k

* cont : add ggml_argsort_top_k

* metal : add top_k support

* ggml : cleanup

* tests : add virtual err() function for test_case

* ggml : add comments
2025-11-25 15:31:43 +02:00
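A hedged sketch of using the new op; the helper name below is illustrative, but ggml_top_k itself is the function added by this commit:
```cpp
#include "ggml.h"

// Illustrative helper: per-row top-k selection over a scores matrix.
static struct ggml_tensor * select_top_k(
        struct ggml_context * ctx,
        struct ggml_tensor  * scores, // [n_expert, n_tokens], F32
        int k) {
    // ggml_top_k returns GGML_TYPE_I32 indices of the k largest values
    // in each row, shape [k, n_tokens]
    return ggml_top_k(ctx, scores, k);
}
```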
Daniel Bevenius 0d28b16bdc
sampling : introduce sampling_info struct
This commit introduces a sampling_info struct to encapsulate all
backend-sampling related data within the llama_context class.

It also adopts more descriptive names for the sampled tokens and
candidates in the backend sampler ggml data structure.
2025-11-20 14:45:56 +01:00
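The struct itself is not shown in the log; a rough sketch of what it might hold, with field names only inferred from the commit message:
```cpp
// Field names are hypothetical, inferred from the commit message above.
struct sampling_info {
    struct ggml_tensor * sampled_tokens = nullptr; // token ids already picked on the backend
    struct ggml_tensor * candidates     = nullptr; // filtered logits/probabilities left for CPU samplers
};
```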
Daniel Bevenius 0da7e7dccc
sampling : remove version from sampler chain
This commit removes the version field from the sampler chain and instead
uses the sampler pointer itself for change detection.
2025-11-19 06:59:03 +01:00
Georgi Gerganov 4b52e59903
graph : do not include llama-model.h 2025-11-18 13:53:25 +02:00
Daniel Bevenius 7884b0e0ac
sampling : add support for backend sampling
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to allow some or all of the sampling
to be performed directly on the backend, as part of the computation
graph being executed.

For example, the backend sampler chain might select/sample a token
directly, in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or to compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.

Currently, backend sampling works in a similar manner to pooling: it is
a function called by build_graph, and the sampler operations become part
of the model's computation graph.
2025-11-17 16:15:58 +01:00
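As a rough illustration of the idea (not the actual branch API), a greedy sampler can be appended to the graph so that only int32 token ids need to cross the device/host boundary; the helper name here is an assumption:
```cpp
#include "ggml.h"

// Hypothetical helper: append a greedy "sampler" to an existing graph.
static struct ggml_tensor * append_greedy_sampler(
        struct ggml_context * ctx0,
        struct ggml_cgraph  * gf,
        struct ggml_tensor  * logits) { // [n_vocab, n_tokens]
    // argmax over the vocab dimension yields one int32 id per token, so
    // only the ids need to be copied from device memory to host memory
    struct ggml_tensor * token = ggml_argmax(ctx0, logits);
    ggml_build_forward_expand(gf, token);
    return token;
}
```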
Aman Gupta a90eb94ca9
CUDA: fuse rope + set_rows (#16884)
* CUDA: add fused rope

* move k forward_expand up

* create helper function instead of re-using params

* make assert statement more in line with comment

* rope_norm: coalesced writes to global mem
2025-11-13 08:50:01 +08:00
Sigbjørn Skjæret 9008027aa3
hparams : add n_embd_inp() to support extended embed (#16928)
* add n_embd_full to support extended embed

* don't change output

* rename to n_embd_inp

* restore n_embd where applicable
2025-11-07 19:27:58 +01:00
Jan Boon d7395115ba
llama : use std::abs instead of abs (#16853) 2025-10-30 08:30:58 +02:00
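For context, the pitfall behind this one-liner: the C abs() takes an int, so passing a float silently truncates. A minimal illustration:
```cpp
#include <cstdlib> // in C, abs() takes an int
#include <cmath>   // std::abs overloads for floating point

int main() {
    float x = -0.25f;
    // abs(x) may bind to abs(int), converting x to 0 first; whether a
    // global floating-point overload exists is implementation-defined,
    // so the portable spelling is std::abs:
    float y = std::abs(x); // 0.25f
    return (int) y;
}
```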
Sigbjørn Skjæret f696428ce8
graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero (#16655)
* add missing norm topk bias

* use clamping instead, update number and add comment
2025-10-26 17:20:32 +01:00
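A sketch of the guard under stated assumptions: weights holds the selected expert weights per token, and the clamp bound shown (smallest normal f16) may differ from the value chosen in the PR:
```cpp
#include <cmath>   // INFINITY
#include "ggml.h"

// Clamping the denominator away from zero avoids 0/0 when every
// selected expert weight underflows to 0.
static struct ggml_tensor * normalize_expert_weights(
        struct ggml_context * ctx0,
        struct ggml_tensor  * weights) { // [n_expert_used, n_tokens]
    struct ggml_tensor * weights_sum = ggml_sum_rows(ctx0, weights); // [1, n_tokens]
    weights_sum = ggml_clamp(ctx0, weights_sum, 6.103515625e-5f, INFINITY);
    return ggml_div(ctx0, weights, weights_sum); // broadcasts over the rows
}
```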
Aman Gupta f77c13b91f
CUDA: General GEMV fusion (#16715) 2025-10-26 19:28:04 +08:00
Sigbjørn Skjæret 84bf3c6778
model : add BailingMoeV2 support (#16063)
* add BailingMoeV2 support

* update llm types

* undo

* undo

* update llm types

* add model collection link

* update

* almost working

* correct group selection and rename n_group_exp

* avoid large top_k and use argmax instead for now

if we had something like argmax2 that would be equivalent, but this works fine until then

* poke

* skip group selection when there are no tokens

* fix 1T conversion

* hopefully fixed expert group selection

third time's the charm?

* make expert group selection generally available

The new LLaDA2Moe model uses this method too, so make it generally available regardless of architecture.

* allow n_expert_groups to be 1 (Kimi K2)

* address review suggestions
2025-10-20 21:38:20 +02:00
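A scalar reference of the selection scheme, assuming the DeepSeek-style rule where each group is scored by the sum of its top-2 expert scores; the graph version masks out all experts of the unselected groups before the final top-k:
```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Score each expert group by its top-2 expert scores (assumes at least
// two experts per group; unselected groups get their experts masked).
static std::vector<float> group_scores(const std::vector<float> & scores, int n_group) {
    const int per_group = (int) scores.size() / n_group;
    std::vector<float> out(n_group);
    for (int g = 0; g < n_group; ++g) {
        std::vector<float> grp(scores.begin() + (size_t) g * per_group,
                               scores.begin() + (size_t) (g + 1) * per_group);
        std::partial_sort(grp.begin(), grp.begin() + 2, grp.end(), std::greater<float>());
        out[g] = grp[0] + grp[1];
    }
    return out;
}
```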
Georgi Gerganov e60f241eac
metal : FA support F32 K and V and head size = 32 (#16531)
* metal : FA support F32 K and V and head size = 32

* graph : remove obsolete comment [no ci]
2025-10-13 23:07:57 +03:00
Georgi Gerganov e38b7c6e9e
graph : support cacheless embeddings with FA and iSWA (#16528)
* graph : support cacheless embeddings with FA and iSWA

* cont : deduplicate mask creation

* cont : fix name
2025-10-13 22:42:37 +03:00
Saba Fallah e08db42595
model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (#16367)
* model: EmbeddingGemma sentence-transformers dense linear projections support

* model: add support for EmbeddingGemma SentenceTransformers dense linear projections

Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.

See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/

* model: add support for EmbeddingGemma SentenceTransformers dense linear projections

- converting model with dense-layers is optional
- introduced dense config params

* Update convert_hf_to_gguf.py

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* fixed formatting issues

* Update src/llama-graph.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* - removed pooling_type_opt, always allow overriding pooling_type
- asserts checking dense features dims

* fix python lint

* fix ubuntu gcc build warning

* - fixed thread-safety test
- moved asserts to load_hparams

* - tidying up code
- simplifying graph-context expecting both dense weights

* minor : add TODO

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-09 09:39:18 +03:00
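A sketch of the added projections, assuming two Dense modules applied after pooling (tensor names hypothetical):
```cpp
#include "ggml.h"

// Apply the two SentenceTransformers Dense projections to the pooled
// embedding; in ggml, mul_mat(W, x) computes W^T x.
static struct ggml_tensor * apply_dense_modules(
        struct ggml_context * ctx0,
        struct ggml_tensor  * emb,       // [n_embd, n_seqs] pooled embeddings
        struct ggml_tensor  * dense_2,   // [n_embd, d_dense]
        struct ggml_tensor  * dense_3) { // [d_dense, n_embd]
    emb = ggml_mul_mat(ctx0, dense_2, emb); // n_embd  -> d_dense
    emb = ggml_mul_mat(ctx0, dense_3, emb); // d_dense -> n_embd
    return emb;
}
```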
Sigbjørn Skjæret 835b2b915c
model : add GroveMoE support (#15510)
* add GroveMoE support

* remove constexpr that fails on certain compilers

* revert crude scalar div implementation, use cast

* build_attn_inp_kv_unified -> build_attn_inp_kv

* fix build_attn

* re-apply ffn_exps regex changes
2025-09-25 19:50:28 +02:00
Aman Gupta 077c94d0ca
CUDA: add a fused top-K MoE kernel (#16130)
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
2025-09-25 16:35:05 +02:00
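For reference, the unfused ggml pipeline this kernel replaces, roughly as built in build_moe_ffn (a sketch; names and shapes assumed):
```cpp
#include "ggml.h"

// softmax -> top-k -> get_rows: select n_expert_used experts per token
// and gather their routing weights.
static struct ggml_tensor * moe_select(
        struct ggml_context * ctx0,
        struct ggml_tensor  * logits, // [n_expert, n_tokens] router output
        int64_t n_expert, int64_t n_tokens, int n_expert_used) {
    struct ggml_tensor * probs = ggml_soft_max(ctx0, logits);
    struct ggml_tensor * ids   = ggml_top_k(ctx0, probs, n_expert_used); // I32 [n_expert_used, n_tokens]
    return ggml_get_rows(ctx0,
            ggml_reshape_3d(ctx0, probs, 1, n_expert, n_tokens), ids); // [1, n_expert_used, n_tokens]
}
```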
Douglas Hanley b5bd037832
llama : add support for qwen3 reranker (#15824) 2025-09-25 11:53:09 +03:00
Sigbjørn Skjæret b8e09f08b9
model : add grok-2 support (#15539)
* add grok-2 support

* type fix

* type fix

* type fix

* "fix" vocab for invalid sequences

* fix expert tensor mapping and spaces in vocab

* add chat template

* fix norm tensor mapping

* rename layer_out_norm to ffn_post_norm

* ensure ffn_post_norm is mapped

* fix experts merging

* remove erroneous FFN_GATE entry

* concatenate split tensors and add more metadata

* process all expert layers and try cat instead of hstack

* add support for community BPE vocab

* fix expert feed forward length and ffn_down concat

* commit this too

* add ffn_up/gate/down, unsure if sequence is right

* add ffn_gate/down/up to tensor names

* correct residual moe (still not working)

* mess--

* fix embedding scale being applied twice

* add built in chat template

* change beta fast for grok if default value

* remove spm vocab in favor of community bpe vocab

* change attention temp length metadata type to integer

* update attention temp length metadata

* remove comment

* replace M_SQRT2 with std::sqrt(2)

* add yarn metadata, move defaults to hparams
2025-09-14 23:00:59 +02:00
Sigbjørn Skjæret 6ab397e12b
graph : support non-contiguous Q in build_attn_mha (#15908)
* support non-contiguous Q in build_attn_mha

* Update src/llama-graph.cpp

ggml-ci

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-10 19:08:59 +02:00
Georgi Gerganov 663027fd54
context : fix n_outputs during reserve (#15858)
ggml-ci
2025-09-08 10:26:36 +03:00
Georgi Gerganov c610b6c11b
kv-cache : fix SWA checks + disable cacheless iSWA (#15811)
ggml-ci
2025-09-05 10:39:22 +03:00
Daniel Bevenius fb15d649ed
llama : add support for EmbeddingGemma 300m (#15798)
This commit adds support for EmbeddingGemma 300m. This model supports
sliding window attention (SWA), and a new swa_type is introduced to
support symmetric SWA masking.

This commit also extracts the masking logic into the function
llama_is_masked_swa in llama-impl.h, so that it can be shared by both
llm_graph_input_attn_no_cache::set_input and
llama_kv_cache::set_input_kq_mask.

With this commit the EmbeddingGemma 300m model can be converted to
GGUF and used with llama.cpp.

Once the model has been uploaded to HuggingFace, it can be used like
this:
```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
2025-09-04 18:10:29 +02:00
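A minimal sketch of the symmetric masking rule, assuming the n_swa window is split evenly around the current position (the exact bound in llama_is_masked_swa may differ by one):
```cpp
#include <cstdint>
#include <cstdlib>

// Positions farther than half the window in either direction are masked.
static bool is_masked_swa_symmetric(int32_t n_swa, int32_t p0, int32_t p1) {
    return std::abs(p1 - p0) > n_swa / 2;
}
```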