This commit modifies the graph building logic to immediately continue
when a token has already been sampled by the backend sampler.
It also updates the test for backend temporary sampling to include
top-k and distribution samplers in the chain to verify that they are not
producing any logits (they are not run).
This commit introduces a sampling_info struct to encapsulate all
backend sampling related data within the llama_context class.
It also updates to use more descriptive names for sampled tokens and
candidates in the backend sampler ggml data structure.
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.
The motivation for this feature is to enable sampling to be performed
directly on the backend as part of the computation graph being executed,
allowing for some or all of the sampling to be done on the backend.
For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.
It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilites need to be
transferred back to system memory for further processing by CPU
samplers.
Currently the backend sampling works in a similar manner to how
pooling works, it is a function that is called by build_graph and the
sampler operations become part of the models computation graph.
* CUDA: add fused rope
* move k forward_expand up
* create helper function instead of re-using params
* make assert statement more in line with comment
* rope_norm: coalesced writes to global mem
* add BailingMoeV2 support
* update llm types
* undo
* undo
* update llm types
* add model collection link
* update
* almost working
* correct group selection and rename n_group_exp
* avoid large top_k and use argmax instead for now
if we had something like argmax2 that would be equivalent, but this works fine until then
* poke
* skip group selection when there are no tokens
* fix 1T conversion
* hopefully fixed expert group selection
third time's the charm?
* make expert group selection generally available
The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture.
* allow n_expert_groups to be 1 (Kimi K2)
* address review suggestions
* model: EmbeddingGemma sentence-transformers dense linear projections support
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.
See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
- converting model with dense-layers is optional
- introduced dense config params
* Update convert_hf_to_gguf.py
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* fixed formatting issues
* Update src/llama-graph.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* - removed pooling_type_opt, always allow overriding pooling_type
- asserts checking dense features dims
* fix python lint
* fix ubuntu gcc build warning
* - fixed thread-safety test
- moved asserts to load_hparams
* - tidying up code
- simplifying graph-context expecting both dense weights
* minor : add TODO
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* CUDA: add a fused top-K MoE kernel
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory
It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
* add grok-2 support
* type fix
* type fix
* type fix
* "fix" vocab for invalid sequences
* fix expert tensor mapping and spaces in vocab
* add chat template
* fix norm tensor mapping
* rename layer_out_norm to ffn_post_norm
* ensure ffn_post_norm is mapped
* fix experts merging
* remove erroneous FFN_GATE entry
* concatenate split tensors and add more metadata
* process all expert layers and try cat instead of hstack
* add support for community BPE vocab
* fix expert feed forward length and ffn_down concat
* commit this too
* add ffn_up/gate/down, unsure if sequence is right
* add ffn_gate/down/up to tensor names
* correct residual moe (still not working)
* mess--
* fix embedding scale being applied twice
* add built in chat template
* change beta fast for grok if default value
* remove spm vocab in favor of community bpe vocab
* change attention temp length metadata type to integer
* update attention temp length metadata
* remove comment
* replace M_SQRT2 with std::sqrt(2)
* add yarn metadata, move defaults to hparams
* support non-contiguous Q in build_attn_mha
* Update src/llama-graph.cpp
ggml-ci
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit add support for the EmbeddingGemma 300m. This model supports
sliding window attention (SWA) and a new swq_type is introduced to
support symmetric SWA masking.
This commit also extracts the code from the function
llama_is_masked_swa in llama-impl.h, so that the logic can be shared
by both llm_graph_input_attn_no_cache::set_input and
llama_kv_cache::set_input_kq_mask.
With this commit the EmbeddingGemma 300m model can be converted to
to GGUF and used with llama.cpp.
Once the model has been uploaded to HuggingFace it can be used like
this:
```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
* support smallthinker
* support 20b softmax, 4b no sliding window
* new build_moe_ffn_from_probs, and can run 4b
* fix 4b rope bug
* fix python type check
* remove is_moe judge
* remove set_dense_start_swa_pattern function and modify set_swa_pattern function
* trim trailing whitespace
* remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* better whitespace
Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* use GGML_ASSERT for expert count validation
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Improve null pointer check for probs
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* use template parameter for SWA attention logic
* better whitespace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* move the creation of inp_out_ids before the layer loop
* remove redundant judge for probs
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>