llama.cpp/src
Daniel Bevenius fb15d649ed
llama : add support for EmbeddingGemma 300m (#15798)
This commit adds support for the EmbeddingGemma 300m model. This model supports
sliding window attention (SWA), and a new swa_type is introduced to
support symmetric SWA masking.
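
The symmetric variant differs from the usual causal SWA in that the window extends in both directions around the query position, which fits a bidirectional embedding model. Below is a minimal sketch of what such a masking predicate could look like; the function name, enum values, and exact window arithmetic are assumptions for illustration, not the actual llama.cpp code:
```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

using llama_pos = int32_t;

// Illustrative stand-ins for the SWA variants (names are assumptions).
enum swa_type_sketch {
    SWA_NONE,      // no sliding window, nothing is masked
    SWA_STANDARD,  // causal window: a key may be at most n_swa - 1 positions behind the query
    SWA_SYMMETRIC, // window centered on the query, extending n_swa/2 positions to each side
};

// Return true if the key at position p0 must be masked for the query at position p1.
static bool is_masked_swa_sketch(uint32_t n_swa, swa_type_sketch type, llama_pos p0, llama_pos p1) {
    assert(p0 >= 0 && p1 >= 0);

    switch (type) {
        case SWA_NONE:
            return false;
        case SWA_STANDARD:
            // causal SWA: only the last n_swa positions up to the query are visible
            return p1 - p0 >= (llama_pos) n_swa;
        case SWA_SYMMETRIC: {
            // symmetric SWA: mask keys that are too far behind *or* too far ahead of the query
            const llama_pos half = (llama_pos) n_swa / 2;
            return p0 < p1 - half || p0 > p1 + half;
        }
    }
    return false;
}

int main() {
    // with n_swa = 4 the symmetric window covers positions [p1 - 2, p1 + 2]
    printf("%d\n", is_masked_swa_sketch(4, SWA_SYMMETRIC, 0, 5)); // 1: too far behind
    printf("%d\n", is_masked_swa_sketch(4, SWA_SYMMETRIC, 6, 5)); // 0: within the window
    printf("%d\n", is_masked_swa_sketch(4, SWA_SYMMETRIC, 8, 5)); // 1: too far ahead
    return 0;
}
```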

This commit also extracts the code from the function
llama_is_masked_swa in llama-impl.h, so that the logic can be shared
by both llm_graph_input_attn_no_cache::set_input and
llama_kv_cache::set_input_kq_mask.
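
Sharing the predicate means both call sites can fill their attention-mask buffers in the same way. A hedged sketch of that pattern, reusing is_masked_swa_sketch from the example above (the buffer layout and variable names here are assumptions, not the actual llama.cpp buffers):
```cpp
#include <cmath>   // INFINITY
#include <cstdint>
#include <vector>

// Fill an n_tokens x n_kv mask: 0.0f where the key is visible to the query,
// -INFINITY where it must be excluded from the softmax. `pos` holds the
// absolute position of each token (illustrative layout only).
static void set_kq_mask_sketch(std::vector<float> & mask,
                               const std::vector<int32_t> & pos,
                               int n_tokens, int n_kv, uint32_t n_swa) {
    mask.assign((size_t) n_tokens * n_kv, -INFINITY);

    for (int iq = 0; iq < n_tokens; ++iq) {      // query index
        for (int ik = 0; ik < n_kv; ++ik) {      // key index
            if (!is_masked_swa_sketch(n_swa, SWA_SYMMETRIC, pos[ik], pos[iq])) {
                mask[(size_t) iq*n_kv + ik] = 0.0f;  // visible: no penalty added to the logits
            }
        }
    }
}
```
In the actual change, the equivalent loops live in llm_graph_input_attn_no_cache::set_input and the KV cache's set_input_kq_mask; keeping the masking decision in one shared helper prevents the two code paths from drifting apart.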

With this commit the EmbeddingGemma 300m model can be converted to
GGUF and used with llama.cpp.

Once the model has been uploaded to Hugging Face, it can be used like
this:
```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
2025-09-04 18:10:29 +02:00
| File | Last commit | Date |
| --- | --- | --- |
| CMakeLists.txt | kv-cache : drop the "unified" prefix (#15467) | 2025-08-21 17:00:33 +03:00 |
| llama-adapter.cpp | model : jina-embeddings-v3 support (#13693) | 2025-08-28 15:49:50 +02:00 |
| llama-adapter.h | model : jina-embeddings-v3 support (#13693) | 2025-08-28 15:49:50 +02:00 |
| llama-arch.cpp | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-arch.h | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-batch.cpp | perplexity : provide a helpful hint for has_cpl case in split_equal error. (#15304) | 2025-08-14 14:03:30 +03:00 |
| llama-batch.h | llama : reuse compute graphs (#14482) | 2025-07-17 19:08:33 +03:00 |
| llama-chat.cpp | model : add support for Seed-OSS (#15490) | 2025-08-23 15:21:52 +02:00 |
| llama-chat.h | model : add support for Seed-OSS (#15490) | 2025-08-23 15:21:52 +02:00 |
| llama-context.cpp | llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (#15791) | 2025-09-04 15:40:44 +02:00 |
| llama-context.h | llama : separate compute buffer reserve from fattn check (#15696) | 2025-08-31 15:49:03 +02:00 |
| llama-cparams.cpp | cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188) | 2025-06-15 10:08:58 +03:00 |
| llama-cparams.h | llama : remove KV cache defragmentation logic (#15473) | 2025-08-22 12:22:13 +03:00 |
| llama-grammar.cpp | `server`: streaming of tool calls and thoughts when `--jinja` is on (#12379) | 2025-05-25 01:48:08 +01:00 |
| llama-grammar.h | `tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034) | 2025-03-05 13:05:13 +00:00 |
| llama-graph.cpp | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-graph.h | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-hparams.cpp | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-hparams.h | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-impl.cpp | GGUF: C++ refactor, backend support, misc fixes (#11030) | 2025-01-07 18:01:58 +01:00 |
| llama-impl.h | llama: use FA + max. GPU layers by default (#15434) | 2025-08-30 16:32:10 +02:00 |
| llama-io.cpp | llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) | 2025-03-13 12:35:44 +02:00 |
| llama-io.h | llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) | 2025-03-13 12:35:44 +02:00 |
| llama-kv-cache-iswa.cpp | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-kv-cache-iswa.h | kv-cache : support layer reuse (#15504) | 2025-08-24 13:07:07 +03:00 |
| llama-kv-cache.cpp | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-kv-cache.h | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-kv-cells.h | llama : remove KV cache defragmentation logic (#15473) | 2025-08-22 12:22:13 +03:00 |
| llama-memory-hybrid.cpp | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-memory-hybrid.h | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-memory-recurrent.cpp | kv-cache : support layer reuse (#15504) | 2025-08-24 13:07:07 +03:00 |
| llama-memory-recurrent.h | kv-cache : support layer reuse (#15504) | 2025-08-24 13:07:07 +03:00 |
| llama-memory.cpp | memory : correctly handle failure in apply() (#14438) | 2025-06-30 18:03:03 +03:00 |
| llama-memory.h | kv-cache : support layer reuse (#15504) | 2025-08-24 13:07:07 +03:00 |
| llama-mmap.cpp | llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013) | 2025-06-05 11:57:42 +02:00 |
| llama-mmap.h | llama-mmap: fix missing include (#11796) | 2025-02-10 20:58:18 +02:00 |
| llama-model-loader.cpp | nvidia nemotron nano v2 (nemotronh) (#15507) | 2025-08-28 18:39:31 -06:00 |
| llama-model-loader.h | model: support GLM 4.5 family of models (#14939) | 2025-08-04 20:29:25 +02:00 |
| llama-model-saver.cpp | llama : improve sep token handling (#14272) | 2025-06-20 14:04:09 +02:00 |
| llama-model-saver.h | llama/ggml: add LLM training support (#10544) | 2025-05-12 14:44:49 +02:00 |
| llama-model.cpp | llama : add support for EmbeddingGemma 300m (#15798) | 2025-09-04 18:10:29 +02:00 |
| llama-model.h | llama : fix incorrect model type for Gemma 270M (#15764) | 2025-09-03 13:35:49 +02:00 |
| llama-quant.cpp | convert : support non-mxfp4 HF model (#15153) | 2025-08-07 23:26:03 +02:00 |
| llama-quant.h | llama : refactor `src/llama.cpp` (#10902) | 2025-01-03 10:18:53 +02:00 |
| llama-sampling.cpp | sampling : optimize dist sampler (#15704) | 2025-09-03 18:16:26 +03:00 |
| llama-sampling.h | llama : add `llama_vocab`, functions -> methods, naming (#11110) | 2025-01-12 11:32:42 +02:00 |
| llama-vocab.cpp | model : jina-embeddings-v3 support (#13693) | 2025-08-28 15:49:50 +02:00 |
| llama-vocab.h | model : add hunyuan dense (#14878) | 2025-08-01 15:31:12 +02:00 |
| llama.cpp | llama: use FA + max. GPU layers by default (#15434) | 2025-08-30 16:32:10 +02:00 |
| unicode-data.cpp | server : better security control for public deployments (#9776) | 2024-10-08 13:27:04 +02:00 |
| unicode-data.h | llama : reduce compile time and binary size (#9712) | 2024-10-02 15:49:55 +02:00 |
| unicode.cpp | model : add Kimi-K2 support (#14654) | 2025-07-15 21:54:22 +02:00 |
| unicode.h | model : add Kimi-K2 support (#14654) | 2025-07-15 21:54:22 +02:00 |