Commit Graph

88 Commits

Author SHA1 Message Date
Diego Devesa 5a4ff43e7d
llama : disable pipeline parallelism if compute buffer allocation fails (#16748) 2025-10-27 21:51:28 +01:00
takuya kodama 7062dd8460
llama-context: only warn on pooling_type when user specified (#16674)
The unexpeced pooling_type warning was incorrectly shown when users did not
specify the --pooling-type parameter. In this case, the parameter
defaults to `LLAMA_POOLING_TYPE_UNSPECIFIED (-1)`, and the code
automatically applies the model's default pooling type.

Example of spurious warning:
```
$ llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "hello"
...
llama_init_from_model: model default pooling_type is [2], but [-1] was specified
...
```

This fix ensures the warning only appears when users explicitly specify
a pooling type that differs from the model's default (e.g., using
--pooling-type mean on a model that expects CLS pooling).
2025-10-20 10:44:21 +03:00
Saba Fallah e08db42595
model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (#16367)
* model: EmbeddingGemma sentence-transformers dense linear projections support

* model: add support for EmbeddingGemma SentenceTransformers dense linear projections

Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.

See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/

* model: add support for EmbeddingGemma SentenceTransformers dense linear projections

- converting model with dense-layers is optional
- introduced dense config params

* Update convert_hf_to_gguf.py

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* fixed formatting issues

* Update src/llama-graph.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* - removed pooling_type_opt, always allow overriding pooling_type
- asserts checking dense features dims

* fix python lint

* fix ubuntu gcc build warning

* - fixed thread-safety test
- moved asserts to load_hparams

* - tidying up code
- simplifying graph-context expecting both dense weights

* minor : add TODO

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-09 09:39:18 +03:00
Johannes Gäßler e789095502
llama: print memory breakdown on exit (#15860)
* llama: print memory breakdown on exit
2025-09-24 16:53:48 +02:00
Sigbjørn Skjæret b8e09f08b9
model : add grok-2 support (#15539)
* add grok-2 support

* type fix

* type fix

* type fix

* "fix" vocab for invalid sequences

* fix expert tensor mapping and spaces in vocab

* add chat template

* fix norm tensor mapping

* rename layer_out_norm to ffn_post_norm

* ensure ffn_post_norm is mapped

* fix experts merging

* remove erroneous FFN_GATE entry

* concatenate split tensors and add more metadata

* process all expert layers and try cat instead of hstack

* add support for community BPE vocab

* fix expert feed forward length and ffn_down concat

* commit this too

* add ffn_up/gate/down, unsure if sequence is right

* add ffn_gate/down/up to tensor names

* correct residual moe (still not working)

* mess--

* fix embedding scale being applied twice

* add built in chat template

* change beta fast for grok if default value

* remove spm vocab in favor of community bpe vocab

* change attention temp length metadata type to integer

* update attention temp length metadata

* remove comment

* replace M_SQRT2 with std::sqrt(2)

* add yarn metadata, move defaults to hparams
2025-09-14 23:00:59 +02:00
Haiyue Wang f4e664f838
context : remove redundant explicit casting to the same type (#15948)
The function 'output_reserve' return type is 'uint32_t', so need to add
explicit casting.
2025-09-12 18:16:32 +03:00
Daniel Bevenius 86587da03b
llama : check returned fn ptrs from ggml_backend_reg_get_proc_address (#15893)
This commit adds check for two function pointers returned from
ggml_backend_reg_get_proc_address.

The motivation for this is that the function pointer could be nullptr if
the get proc address function changes in the future. This is also
consistent with all the other calls to ggml_backend_reg_get_proc_address
in the code base.
2025-09-10 05:33:58 +02:00
Georgi Gerganov 663027fd54
context : fix n_outputs during reserve (#15858)
ggml-ci
2025-09-08 10:26:36 +03:00
Daniel Bevenius d1e2adba65
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (#15791)
* llama : set n_outputs to 1 to avoid 0 outputs mean-pooling

This commit modifies the llama_context constructor to set n_outputs to
1.

The motivation for this is that when using pooling, and specifically
mean pooling, for embeddings having n_outputs set to 0 can lead to the
following error:
```console
$ build/bin/llama-embedding -m models/nomic-embed-text-1.5-Q4_K_M.gguf \
   --pooling mean -p "Hello, how are you?"
...
llama_context:        CPU  output buffer size =     0.12 MiB
/home/danbev/work/ai/llama.cpp/ggml/src/ggml.c:3023: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
0x0000743c96d107e3 in __GI___wait4 (pid=292978, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
30	in ../sysdeps/unix/sysv/linux/wait4.c
196	        waitpid(child_pid, NULL, 0);
230	        ggml_print_backtrace();
3023	    GGML_ASSERT(ggml_can_mul_mat(a, b));
1823	                cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
18983	    llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
1399	    auto * gf = model.build_graph(gparams);
292	            auto * gf = graph_reserve(1, n_seqs, n_outputs, mctx.get(), true);
2329	        auto * ctx = new llama_context(*model, params);
913	    llama_context * lctx = llama_init_from_model(model, cparams);
105	    common_init_result llama_init = common_init_from_params(params);
[Inferior 1 (process 292976) detached]
Aborted (core dumped)
```

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add comment about not reserving graphs with zero outputs

* add assert in graph_reserve to ensure n_outputs >= 1

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-04 15:40:44 +02:00
Diego Devesa 274966226f
llama : fix fattn reserve call n_seqs parameter (#15699)
ggml-ci
2025-08-31 18:47:05 +03:00
Diego Devesa 9777032dcc
llama : separate compute buffer reserve from fattn check (#15696)
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.
2025-08-31 15:49:03 +02:00
Johannes Gäßler e81b8e4b7f
llama: use FA + max. GPU layers by default (#15434)
* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault
2025-08-30 16:32:10 +02:00
Georgi Gerganov 8a4280ce43
kv-cache : remove LLAMA_SET_ROWS checks (#15505)
ggml-ci
2025-08-28 12:27:02 +03:00
Georgi Gerganov 85cc1ae998
context : print graph stats for memory-less contexts (#15586)
ggml-ci
2025-08-26 12:47:00 +03:00
Georgi Gerganov 9ebebef62f
llama : remove KV cache defragmentation logic (#15473)
ggml-ci
2025-08-22 12:22:13 +03:00
Georgi Gerganov cd36b5e5c7
llama : remove deprecated llama_kv_self API (#15472)
ggml-ci
2025-08-21 19:13:45 +03:00
Georgi Gerganov 715a6db02c
kv-cache : drop the "unified" prefix (#15467)
* kv-cache : drop the "unified" prefix

ggml-ci

* cont : fix comment [no ci]
2025-08-21 17:00:33 +03:00
Georgi Gerganov 9d262f4bad
server : remove swa_full warning (#15399) 2025-08-19 08:45:26 +03:00
Georgi Gerganov d32e03f449
server : add SWA checkpoints (#15293)
* server : add SWA checkpoints

ggml-ci

* cont : server clean-up

* server : handle state restore fails

* llama : add extended llama_state_seq_ API

* server : do not make checkpoints if --swa-full

ggml-ci

* llama : remove flags value for NONE

* server : configure number of SWA checkpoints with CLI arg

ggml-ci

* args : fix scope of new argument
2025-08-14 14:59:50 +03:00
Jonathan Graehl 5cdb27e091
finetune: SGD optimizer, more CLI args (#13873)
* examples/finetune -opt SGD (stochastic gradient descent) memory opt

add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.

support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)

(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.

note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence

new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)

cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)

since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values);  tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param store weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-14 12:03:57 +02:00
compilade ee3a9fcf88
context : fix index overflow on huge outputs (#15080)
* context : fix overflow when re-ordering huge outputs

* context : fix logits size overflow for huge batches
2025-08-05 11:27:45 +02:00
Georgi Gerganov a4569c41fd
llama : enable LLAMA_SET_ROWS=1 by default (#14959)
ggml-ci
2025-08-02 17:14:21 +03:00
Georgi Gerganov ba42794c9e
graph : fix equal_seq() check (#14986)
ggml-ci
2025-08-01 06:38:12 +03:00
Daniel Bevenius ca0ef2dddb
llama : clarify comment about pp and tg graphs [no ci] (#14895)
* llama : clarify comment about pp and tg graphs [no ci]

This commit clarifies the comment in `llama-context.cpp` regarding the
prefill prompt (pp), and token generation (tg) graphs.

The motivation for this is that I've struggled to remember these and had
to look them up more than once, so I thought it would be helpful to add
a comment that makes it clear what these stand for.

* squash! llama : clarify comment about pp and tg graphs [no ci]

Change "pp" to "prompt processing".
2025-07-27 12:10:51 +02:00
Georgi Gerganov c1dbea752a
context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870)
ggml-ci
2025-07-25 14:28:06 +03:00
Georgi Gerganov e4868d16d2
context : perform output reorder lazily upon access after sync (#14853)
* context : perform output reorder after lazily upon access after sync

ggml-ci

* cont : add TODO
2025-07-24 16:31:48 +03:00
Georgi Gerganov d498af3d5a
graph : avoid huge warm-up graphs for MoE models (#14753)
* graph : avoid huge warm-up graphs for MoE models

ggml-ci

* cont : bump max nodes to 8x model tensors
2025-07-18 14:31:15 +03:00
Georgi Gerganov 8f974bc1e9
graph : refactor context to not pass gf explicitly (#14629)
ggml-ci
2025-07-18 08:29:28 +03:00
Georgi Gerganov 01612b7409
llama : reuse compute graphs (#14482)
* llama : reuse compute graphs

ggml-ci

* llama-bench : add graph reuse parameter

ggml-ci

* cont : remove the parameter and the sched resets

ggml-ci

* graph : rename update() to can_reuse()

ggml-ci

* params : remove is_same()

ggml-ci

* graph : set res->params in llm_graph_context constructor

ggml-ci

* graph : avoid set_max_nodes in llm_graph_result

ggml-ci

* kv-cache : reuse llama_context's graph result instance

ggml-ci

* context : reset the previous graph result upon memory updates

ggml-ci

* batch : llama_ubatch now carries its data instead of pointing to balloc

ggml-ci

* merge : fix build

ggml-ci

* graph : fix can_reuse() checks when flash-attention is disabled

* graph : move llm_graph_result impl in source file + debug env

ggml-ci
2025-07-17 19:08:33 +03:00
Georgi Gerganov 225e7a1438
llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Aman Gupta 9c9e4fc635
llama-context: add ability to get logits (#14672) 2025-07-14 21:01:41 +08:00
Georgi Gerganov 7b50d589a8
kv-cells : fix tracking of seq_pos (#14339)
* kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

* cont : improve error message

ggml-ci

* cont : add more comments
2025-06-23 12:27:35 +03:00
Georgi Gerganov 692e3cdd0a
memory : rename interface to llama_memory_context_i (#14296)
* memory : rename interface to llama_memory_context_i

ggml-ci

* cont : fix comments

* cont : use "mctx" for referencing a memory context

ggml-ci
2025-06-21 08:03:46 +03:00
Georgi Gerganov 4c9fdfbe15
ubatch : new splitting logic (#14217)
ggml-ci
2025-06-20 10:14:14 +03:00
Georgi Gerganov d3e64b9f49
llama : rework embeddings logic (#14208)
* llama : rework embeddings logic

ggml-ci

* cont : fix rerank

ggml-ci

* cont : engrish [no ci]

* cont : fix rerank

ggml-ci

* server : support both embeddings and completions with single model

ggml-ci

* cont : avoid embeddings_org

ggml-ci
2025-06-16 14:14:00 +03:00
Georgi Gerganov c311ac664d
cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)
ggml-ci
2025-06-15 10:08:58 +03:00
Georgi Gerganov b9912ac570
batch : auto-gen positions + verify multi-sequence input (#14177)
* batch : verify multi-sequence input batches

ggml-ci

* cont : auto-gen positions + verify multi-seq input

ggml-ci

* cont : first print debug info, then perform validation

ggml-ci

* cont : fix position auto-gen + add comments

ggml-ci
2025-06-15 09:18:37 +03:00
Georgi Gerganov 60c666347b
batch : rework llama_batch_allocr (#14153)
* batch : rework llama_batch_allocr

ggml-ci

* cont : move validation inside class

ggml-ci

* cont : move output counting to class

ggml-ci

* cont : minor

ggml-ci

* batch : add TODOs

ggml-ci
2025-06-13 13:47:55 +03:00
Georgi Gerganov f6e1a7aa87
context : simplify output counting logic during decode (#14142)
* batch : remove logits_all flag

ggml-ci

* context : simplify output counting logic during decode

ggml-ci

* cont : fix comments
2025-06-12 11:50:01 +03:00
Georgi Gerganov c3ee46fab4
batch : remove logits_all flag (#14141)
ggml-ci
2025-06-12 11:49:26 +03:00
Georgi Gerganov 9596506965
kv-cache : fix split_equal handling in unified implementation (#14130)
ggml-ci
2025-06-12 10:02:15 +03:00
compilade a20b2b05bc
context : round n_tokens to next multiple of n_seqs when reserving (#14140)
This fixes RWKV inference which otherwise failed
when the worst case ubatch.n_seq_tokens rounded to 0.
2025-06-12 02:56:04 -04:00
Georgi Gerganov 745aa5319b
llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
2025-06-06 14:11:15 +03:00
Georgi Gerganov 487a5e0401
context : fix SWA-related warning for multiple sequences (#14045) 2025-06-06 13:29:18 +03:00
Sigbjørn Skjæret d17a809ef0
llama : support multiple classifier outputs and labels (#13940) 2025-06-06 09:03:25 +02:00
Georgi Gerganov 7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API

ggml-ci

* context : fix casts

ggml-ci
2025-06-05 15:29:22 +03:00
Georgi Gerganov 9e31bec4fd
context : fix pos_min initialization upon error decode (#14008)
ggml-ci
2025-06-05 09:06:29 +03:00
Georgi Gerganov 3e63a58ef7
kv-cache : refactor the update/defrag mechanism (#13988)
* kv-cache : refactor update mechanism

ggml-ci

* memory : improve status handling

* defrag : reset head + add comments

ggml-ci

* cont : minor fixes

ggml-ci
2025-06-04 18:58:20 +03:00
Georgi Gerganov 803f8baf4f
llama : deprecate explicit kv_self defrag/update calls (#13921)
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov 3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sqeuence SWA contexts
2025-05-31 15:57:44 +03:00