* llama : add support for NVIDIA Nemotron Nano 3
This commit adds support for the NVIDIA Nemotron Nano 3 model, enabling
the conversion and running of this model.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* CUDA: add fused rope
* move k forward_expand up
* create helper function instead of re-using params
* make assert statement more in line with comment
* rope_norm: coalesced writes to global mem
* add BailingMoeV2 support
* update llm types
* undo
* undo
* update llm types
* add model collection link
* update
* almost working
* correct group selection and rename n_group_exp
* avoid large top_k and use argmax instead for now
if we had something like argmax2 that would be equivalent, but this works fine until then
* poke
* skip group selection when there are no tokens
* fix 1T conversion
* hopefully fixed expert group selection
third time's the charm?
* make expert group selection generally available
The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture.
* allow n_expert_groups to be 1 (Kimi K2)
* address review suggestions
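The expert group selection referred to in the entries above can be illustrated with a small sketch. This is not the llama.cpp implementation: the function name and parameters are hypothetical, and it assumes that a group is scored by the sum of its two best experts (which is what the "argmax2" remark suggests, but the log does not state it outright).

```python
def select_experts(scores, n_groups, n_groups_used, n_experts_used):
    """Return the ids of the experts selected for one token (illustrative only)."""
    n_experts = len(scores)
    assert n_experts % n_groups == 0
    group_size = n_experts // n_groups

    # Assumption: a group's score is the sum of its two highest expert scores.
    group_scores = []
    for g in range(n_groups):
        members = sorted(scores[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))

    # Keep the best groups; ties go to the lower group index.
    order = sorted(range(n_groups), key=lambda g: (-group_scores[g], g))
    kept = set(order[:n_groups_used])

    # Ordinary top-k, restricted to experts whose group survived.
    candidates = [e for e in range(n_experts) if e // group_size in kept]
    candidates.sort(key=lambda e: (-scores[e], e))
    return candidates[:n_experts_used]
```

With a single group the selection degenerates to a plain top-k over all experts, which is the case the `n_expert_groups` = 1 entry allows.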
* model: EmbeddingGemma sentence-transformers dense linear projections support
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.
See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
- converting model with dense-layers is optional
- introduced dense config params
* Update convert_hf_to_gguf.py
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* fixed formatting issues
* Update src/llama-graph.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* - removed pooling_type_opt, always allow overriding pooling_type
- asserts checking dense features dims
* fix python lint
* fix ubuntu gcc build warning
* - fixed thread-safety test
- moved asserts to load_hparams
* - tidying up code
- simplifying graph-context expecting both dense weights
* minor : add TODO
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
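As a rough illustration of the dense linear projections described above, the sketch below applies two projection matrices to a pooled embedding. It is a simplification under stated assumptions: mean pooling, no bias, no activation between the two layers, and a final L2 normalization. The names `embed`, `dense_1` and `dense_2` are hypothetical and do not correspond to identifiers in the code base.

```python
import math

def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def embed(token_states, dense_1, dense_2):
    n_tokens = len(token_states)
    n_embd = len(token_states[0])

    # Dimension checks, in the spirit of the asserts mentioned in the log.
    assert all(len(row) == n_embd for row in dense_1)
    assert all(len(row) == len(dense_1) for row in dense_2)

    # Assumption: mean pooling over the token states.
    pooled = [sum(tok[i] for tok in token_states) / n_tokens for i in range(n_embd)]

    # Two bias-free linear projections applied back to back.
    out = matvec(dense_2, matvec(dense_1, pooled))

    # Assumption: the result is L2-normalized.
    norm = math.sqrt(sum(x * x for x in out))
    return [x / norm for x in out]
```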
* CUDA: add a fused top-K MoE kernel
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory
It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
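The three steps of the fused top-K MoE kernel listed above can be written out as a scalar reference for a single token. This is only a sketch of the arithmetic, not the CUDA kernel: the function name is hypothetical, and the tie-breaking rule (lowest expert index wins) is an assumption, since the log only says that tie breakers were fixed.

```python
import math

def topk_moe(logits, n_expert_used, normalize=False):
    """Reference for one token: softmax, then repeated argmax, then optional norm."""
    assert 0 < n_expert_used <= len(logits)

    # 1. softmax over the logits of this token
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. argmax reduce, repeated n_expert_used times
    remaining = list(probs)
    ids, weights = [], []
    for _ in range(n_expert_used):
        best = 0
        for i in range(1, len(remaining)):
            # Strict '>' keeps the lowest index on ties (assumed tie-break rule).
            if remaining[i] > remaining[best]:
                best = i
        ids.append(best)
        weights.append(probs[best])
        remaining[best] = -math.inf  # exclude from later rounds

    # Optional renormalization of the selected weights
    if normalize:
        s = sum(weights)
        weights = [w / s for w in weights]

    # 3. the kernel would write ids and weights to global memory here
    return ids, weights
```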
* add grok-2 support
* type fix
* type fix
* type fix
* "fix" vocab for invalid sequences
* fix expert tensor mapping and spaces in vocab
* add chat template
* fix norm tensor mapping
* rename layer_out_norm to ffn_post_norm
* ensure ffn_post_norm is mapped
* fix experts merging
* remove erroneous FFN_GATE entry
* concatenate split tensors and add more metadata
* process all expert layers and try cat instead of hstack
* add support for community BPE vocab
* fix expert feed forward length and ffn_down concat
* commit this too
* add ffn_up/gate/down, unsure if sequence is right
* add ffn_gate/down/up to tensor names
* correct residual moe (still not working)
* mess--
* fix embedding scale being applied twice
* add built in chat template
* change beta fast for grok if default value
* remove spm vocab in favor of community bpe vocab
* change attention temp length metadata type to integer
* update attention temp length metadata
* remove comment
* replace M_SQRT2 with std::sqrt(2)
* add yarn metadata, move defaults to hparams
* support non-contiguous Q in build_attn_mha
* Update src/llama-graph.cpp
ggml-ci
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit adds support for the EmbeddingGemma 300m model. This model supports
sliding window attention (SWA) and a new swa_type is introduced to
support symmetric SWA masking.
This commit also extracts the code from the function
llama_is_masked_swa in llama-impl.h, so that the logic can be shared
by both llm_graph_input_attn_no_cache::set_input and
llama_kv_cache::set_input_kq_mask.
With this commit the EmbeddingGemma 300m model can be converted
to GGUF and used with llama.cpp.
Once the model has been uploaded to HuggingFace it can be used like
this:
```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
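To make the difference between standard and symmetric SWA masking concrete, here is a minimal sketch of a masking predicate. It is written in Python for illustration and is not the shared C++ helper itself; the string constants and the exact window boundaries (half of the window on each side for the symmetric case) are assumptions.

```python
def is_masked_swa(n_swa, swa_type, p0, p1):
    """Return True if key position p0 is hidden from query position p1."""
    if swa_type == "none":
        return False
    if swa_type == "standard":
        # Causal window: only the last n_swa positions are visible.
        # Future positions are left to the regular causal mask.
        return p1 - p0 >= n_swa
    if swa_type == "symmetric":
        # Assumed: the window is centered on the query, n_swa // 2 on each side.
        half = n_swa // 2
        return abs(p1 - p0) > half
    raise ValueError(f"unknown swa_type: {swa_type}")
```

A symmetric window is the natural fit for an embedding model without a causal mask, since a token may then attend to neighbors on both sides.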
* support smallthinker
* support 20b softmax, 4b no sliding window
* new build_moe_ffn_from_probs, and can run 4b
* fix 4b rope bug
* fix python type check
* remove is_moe judge
* remove set_dense_start_swa_pattern function and modify set_swa_pattern function
* trim trailing whitespace
* remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* better whitespace
Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* use GGML_ASSERT for expert count validation
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Improve null pointer check for probs
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* use template parameter for SWA attention logic
* better whitespace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* move the creation of inp_out_ids before the layer loop
* remove redundant judge for probs
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba
(and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : use std::find for seq_nodes in llama_rs_cache
* llama : state checkpoints for recurrent models
* llama : correctly handle more edge cases for the rs cache
* llama : rename many llama_kv_cache_* functions
* llama : remove useless return value for some llama_cache_* functions
* llama : rethink recurrent state cell counts
* llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* llama : support Jamba
* llama : fix BERT inference without KV cache
* convert-hf : check for unprocessed Jamba experts
* convert-hf : support Mini-Jamba conversion
* llama : fix Jamba quantization sanity checks
* llama : sequence-length-aware batch splitting
* llama : use equal-sequence-length sub-batches for recurrent models
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
* llama : fix batch split output count for embeddings
* llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.
* llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.
* llama : avoid copies for simple batch splits
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.
* llama : fix .base() compilation error on Windows
* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it,
and this makes Mamba's conv step slightly faster.
* mamba : fix non-contiguous usage of ggml_silu
* llama : session saving and reloading for hybrid models
* convert_hf : fix Jamba conversion
* llama : fix mixed signedness comparison
* llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
* llama : begin renaming llama_past back to llama_kv_cache
* llama : remove implicit recurrent state rollbacks
* llama : partially apply clang-format style
* convert : fix jamba conv1d shape squeezing
* graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).
* model : add Jamba to Mamba-specific hparams printing
* jamba : remove redundant nullptr initializations
* model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : make falcon-h1 use shared mamba2 layer builder
* memory : avoid referring to KV in recurrent cache logs
* gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
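The idea behind the equal-sequence-length sub-batches mentioned above can be sketched as follows. This is a simplified illustration and not the actual splitting code: the function name, the dictionary input and the tuple layout are all hypothetical. Each sub-batch takes the same number of tokens from every sequence it contains, so the recurrent state update can process the sequences in lockstep.

```python
def split_equal(seq_lens, n_ubatch):
    """Split pending tokens into sub-batches of equal per-sequence length.

    seq_lens: dict mapping seq_id -> number of pending tokens
    n_ubatch: maximum number of tokens per sub-batch
    Returns a list of sub-batches, each a list of (seq_id, start, length).
    """
    assert n_ubatch >= 1
    consumed = {s: 0 for s in seq_lens}
    ubatches = []
    while True:
        active = [s for s in seq_lens if consumed[s] < seq_lens[s]]
        if not active:
            break
        # Never take more sequences than there is room for one token each.
        active = active[:n_ubatch]
        shortest = min(seq_lens[s] - consumed[s] for s in active)
        n_each = min(shortest, n_ubatch // len(active))
        ubatches.append([(s, consumed[s], n_each) for s in active])
        for s in active:
            consumed[s] += n_each
    return ubatches
```

With `n_ubatch` set to 1 every sub-batch holds a single token of a single sequence, which corresponds to the `-ub 1` edge case mentioned in the log.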
* kv-cache : use ggml_set_rows
ggml-ci
* graph : separate k and v indices
ggml-ci
* cont : remove redundant ifs
ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments
ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
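Conceptually, a set-rows operation scatters source rows into a destination buffer at given row indices. The sketch below shows that behavior on plain Python lists; it is an illustration of the concept only, with a hypothetical signature, and says nothing about how the ggml operator is implemented in any backend.

```python
def set_rows(dst, src, row_ids):
    """Copy src[i] into dst[row_ids[i]] for every i, in place."""
    assert len(src) == len(row_ids)
    for row, idx in zip(src, row_ids):
        # Bounds check, mirroring the slot index check mentioned in the log.
        assert 0 <= idx < len(dst)
        dst[idx] = list(row)
    return dst
```

For a KV cache this means the cells chosen by the slot search do not have to be contiguous: the new K (or V) rows are written to whichever cell indices were picked.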
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models
to avoid some reshapes.
Not sure if it's a good idea,
but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON
* cuda : graceful fallback for Mamba-1 models with weird embd size
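For readers unfamiliar with the recurrence that the SSM_SCAN entries above refer to, the following scalar sketch shows a Mamba-2 style scan for a single head. It is a reference written for clarity, not the ggml, Metal or CUDA implementation. The function names and argument layout are hypothetical, and it assumes a scalar `A` per head, a softplus applied to the raw step size, and no multiplication by `D` inside the scan (as the log notes, that is done elsewhere).

```python
import math

def softplus(x):
    return math.log(1.0 + math.exp(x))

def ssm_scan_head(x, dt, A, B, C, state):
    """Scan one head over T tokens.

    x:     T lists of head_dim inputs
    dt:    T raw step sizes (softplus is applied here)
    A:     scalar decay parameter for this head (typically negative)
    B, C:  T lists of d_state values
    state: head_dim x d_state recurrent state, updated in place
    """
    ys = []
    for t in range(len(x)):
        step = softplus(dt[t])
        decay = math.exp(step * A)
        y_t = []
        for i, x_i in enumerate(x[t]):
            acc = 0.0
            for n in range(len(B[t])):
                # h = h * exp(dt * A) + B * (x * dt)
                state[i][n] = state[i][n] * decay + B[t][n] * x_i * step
                # y = sum over the state dimension of h * C
                acc += state[i][n] * C[t][n]
            y_t.append(acc)
        ys.append(y_t)
    return ys, state
```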