llama.cpp

Daniel Bevenius d1e2adba65 llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (#15791)	2025-09-04 15:40:44 +02:00

* llama : set n_outputs to 1 to avoid 0 outputs mean-pooling

This commit modifies the llama_context constructor to set n_outputs to 1.

The motivation for this is that when using pooling, and specifically mean pooling, for embeddings having n_outputs set to 0 can lead to the following error:

```console
$ build/bin/llama-embedding -m models/nomic-embed-text-1.5-Q4_K_M.gguf \
    --pooling mean -p "Hello, how are you?"
...
llama_context: CPU output buffer size = 0.12 MiB
/home/danbev/work/ai/llama.cpp/ggml/src/ggml.c:3023: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
0x0000743c96d107e3 in __GI___wait4 (pid=292978, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
30 in ../sysdeps/unix/sysv/linux/wait4.c
196 waitpid(child_pid, NULL, 0);
230 ggml_print_backtrace();
3023 GGML_ASSERT(ggml_can_mul_mat(a, b));
1823 cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
18983 llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
1399 auto * gf = model.build_graph(gparams);
292 auto * gf = graph_reserve(1, n_seqs, n_outputs, mctx.get(), true);
2329 auto * ctx = new llama_context(model, params);
913 llama_context lctx = llama_init_from_model(model, cparams);
105 common_init_result llama_init = common_init_from_params(params);
[Inferior 1 (process 292976) detached]
Aborted (core dumped)
```

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add comment about not reserving graphs with zero outputs

* add assert in graph_reserve to ensure n_outputs >= 1

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

CMakeLists.txt	kv-cache : drop the "unified" prefix (#15467)	2025-08-21 17:00:33 +03:00
llama-adapter.cpp	model : jina-embeddings-v3 support (#13693)	2025-08-28 15:49:50 +02:00
llama-adapter.h	model : jina-embeddings-v3 support (#13693)	2025-08-28 15:49:50 +02:00
llama-arch.cpp	nvidia nemotron nano v2 (nemotronh) (#15507)	2025-08-28 18:39:31 -06:00
llama-arch.h	nvidia nemotron nano v2 (nemotronh) (#15507)	2025-08-28 18:39:31 -06:00
llama-batch.cpp	perplexity : provide a helpful hint for has_cpl case in split_equal error. (#15304)	2025-08-14 14:03:30 +03:00
llama-batch.h	llama : reuse compute graphs (#14482)	2025-07-17 19:08:33 +03:00
llama-chat.cpp	model : add support for Seed-OSS (#15490)	2025-08-23 15:21:52 +02:00
llama-chat.h	model : add support for Seed-OSS (#15490)	2025-08-23 15:21:52 +02:00
llama-context.cpp	llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (#15791)	2025-09-04 15:40:44 +02:00
llama-context.h	llama : separate compute buffer reserve from fattn check (#15696)	2025-08-31 15:49:03 +02:00
llama-cparams.cpp	cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)	2025-06-15 10:08:58 +03:00
llama-cparams.h	llama : remove KV cache defragmentation logic (#15473)	2025-08-22 12:22:13 +03:00
llama-grammar.cpp	`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)	2025-05-25 01:48:08 +01:00
llama-grammar.h	`tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)	2025-03-05 13:05:13 +00:00
llama-graph.cpp	llama: use FA + max. GPU layers by default (#15434)	2025-08-30 16:32:10 +02:00
llama-graph.h	llama: use FA + max. GPU layers by default (#15434)	2025-08-30 16:32:10 +02:00
llama-hparams.cpp	kv-cache : support layer reuse (#15504)	2025-08-24 13:07:07 +03:00
llama-hparams.h	kv-cache : support layer reuse (#15504)	2025-08-24 13:07:07 +03:00
llama-impl.cpp	GGUF: C++ refactor, backend support, misc fixes (#11030)	2025-01-07 18:01:58 +01:00
llama-impl.h	llama: use FA + max. GPU layers by default (#15434)	2025-08-30 16:32:10 +02:00
llama-io.cpp	llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)	2025-03-13 12:35:44 +02:00
llama-io.h	llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)	2025-03-13 12:35:44 +02:00
llama-kv-cache-iswa.cpp	kv-cache : support layer reuse (#15504)	2025-08-24 13:07:07 +03:00
llama-kv-cache-iswa.h	kv-cache : support layer reuse (#15504)	2025-08-24 13:07:07 +03:00
llama-kv-cache.cpp	kv-cache : fix find_slot to not search for continuous slot (#15638)	2025-08-28 17:09:05 +03:00
llama-kv-cache.h	kv-cache : remove LLAMA_SET_ROWS checks (#15505)	2025-08-28 12:27:02 +03:00
llama-kv-cells.h	llama : remove KV cache defragmentation logic (#15473)	2025-08-22 12:22:13 +03:00
llama-memory-hybrid.cpp	kv-cache : support layer reuse (#15504)	2025-08-24 13:07:07 +03:00
llama-memory-hybrid.h	kv-cache : support layer reuse (#15504)	2025-08-24 13:07:07 +03:00
llama-memory-recurrent.cpp	kv-cache : support layer reuse (#15504)	2025-08-24 13:07:07 +03:00
llama-memory-recurrent.h	kv-cache : support layer reuse (#15504)	2025-08-24 13:07:07 +03:00
llama-memory.cpp	memory : correctly handle failure in apply() (#14438)	2025-06-30 18:03:03 +03:00
llama-memory.h	kv-cache : support layer reuse (#15504)	2025-08-24 13:07:07 +03:00
llama-mmap.cpp	llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013)	2025-06-05 11:57:42 +02:00
llama-mmap.h	llama-mmap: fix missing include (#11796)	2025-02-10 20:58:18 +02:00
llama-model-loader.cpp	nvidia nemotron nano v2 (nemotronh) (#15507)	2025-08-28 18:39:31 -06:00
llama-model-loader.h	model: support GLM 4.5 family of models (#14939)	2025-08-04 20:29:25 +02:00
llama-model-saver.cpp	llama : improve sep token handling (#14272)	2025-06-20 14:04:09 +02:00
llama-model-saver.h	llama/ggml: add LLM training support (#10544)	2025-05-12 14:44:49 +02:00
llama-model.cpp	llama : fix incorrect model type for Gemma 270M (#15764)	2025-09-03 13:35:49 +02:00
llama-model.h	llama : fix incorrect model type for Gemma 270M (#15764)	2025-09-03 13:35:49 +02:00
llama-quant.cpp	convert : support non-mxfp4 HF model (#15153)	2025-08-07 23:26:03 +02:00
llama-quant.h	llama : refactor `src/llama.cpp` (#10902)	2025-01-03 10:18:53 +02:00
llama-sampling.cpp	sampling : optimize dist sampler (#15704)	2025-09-03 18:16:26 +03:00
llama-sampling.h	llama : add `llama_vocab`, functions -> methods, naming (#11110)	2025-01-12 11:32:42 +02:00
llama-vocab.cpp	model : jina-embeddings-v3 support (#13693)	2025-08-28 15:49:50 +02:00
llama-vocab.h	model : add hunyuan dense (#14878)	2025-08-01 15:31:12 +02:00
llama.cpp	llama: use FA + max. GPU layers by default (#15434)	2025-08-30 16:32:10 +02:00
unicode-data.cpp	server : better security control for public deployments (#9776)	2024-10-08 13:27:04 +02:00
unicode-data.h	llama : reduce compile time and binary size (#9712)	2024-10-02 15:49:55 +02:00
unicode.cpp	model : add Kimi-K2 support (#14654)	2025-07-15 21:54:22 +02:00
unicode.h	model : add Kimi-K2 support (#14654)	2025-07-15 21:54:22 +02:00