llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	d3e64b9f49	llama : rework embeddings logic (#14208 ) * llama : rework embeddings logic ggml-ci * cont : fix rerank ggml-ci * cont : engrish [no ci] * cont : fix rerank ggml-ci * server : support both embeddings and completions with single model ggml-ci * cont : avoid embeddings_org ggml-ci	2025-06-16 14:14:00 +03:00
Bartowski	d7da8dc83a	model : Add support for Arcee AI's upcoming AFM model (#14185 ) * Add Arcee AFM support * Add draft update code * Fix linter and update URL, may still not be final * Update src/llama-model.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Remote accidental blank line --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-06-16 01:04:06 +02:00
Ed Addario	30e5b01de2	quantize : change int to unsigned int for KV overrides (#14197 )	2025-06-15 18:53:45 +02:00
Georgi Gerganov	5fce5f948d	kv-cache : fix use-after-move of defrag info (#14189 ) ggml-ci	2025-06-15 10:52:11 +03:00
Mikko Juola	9ae4143bc6	model : add dots.llm1 architecture support (#14044 ) (#14118 ) Adds: * Dots1Model to convert_hf_to_gguf.py * Computation graph code to llama-model.cpp * Chat template to llama-chat.cpp to detect this model's template. --- The model is called "dots.llm1" (I decided to shorten it to dots1 or DOTS1 in the code generally) architecture. The only models that exist as of writing of this commit that follow this architecture are "dots.llm1.inst" and "dots.llm1.base" from here: * https://huggingface.co/rednote-hilab/dots.llm1.inst * https://huggingface.co/rednote-hilab/dots.llm1.base The model architecture is a combination of Qwen and Deepseek parts, as seen here: `ffe12627b4/src/transformers/models/dots1/modular_dots1.py`	2025-06-15 09:52:06 +02:00
Georgi Gerganov	c311ac664d	cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188 ) ggml-ci	2025-06-15 10:08:58 +03:00
Georgi Gerganov	b9912ac570	batch : auto-gen positions + verify multi-sequence input (#14177 ) * batch : verify multi-sequence input batches ggml-ci * cont : auto-gen positions + verify multi-seq input ggml-ci * cont : first print debug info, then perform validation ggml-ci * cont : fix position auto-gen + add comments ggml-ci	2025-06-15 09:18:37 +03:00
Georgi Gerganov	fb85a288d7	vocab : fix build (#14175 ) ggml-ci	2025-06-13 20:03:05 +03:00
Guy Goldenberg	3cfbbdb44e	Merge commit from fork * vocab : prevent integer overflow during load * Add static cast and GGML_ABORT --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-13 19:20:25 +03:00
Georgi Gerganov	80709b70a2	batch : add LLAMA_BATCH_DEBUG environment variable (#14172 ) * batch : add LLAMA_BATCH_DEBUG environment variable ggml-ci * cont : improve seq_id display	2025-06-13 18:35:00 +03:00
Georgi Gerganov	60c666347b	batch : rework llama_batch_allocr (#14153 ) * batch : rework llama_batch_allocr ggml-ci * cont : move validation inside class ggml-ci * cont : move output counting to class ggml-ci * cont : minor ggml-ci * batch : add TODOs ggml-ci	2025-06-13 13:47:55 +03:00
Đinh Trọng Huy	d714dadb57	pooling : make cls_b and cls_out_b optional (#14165 ) Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>	2025-06-13 11:34:08 +03:00
Georgi Gerganov	c33fe8b8c4	vocab : prevent heap overflow when vocab is too small (#14145 ) ggml-ci	2025-06-13 08:03:54 +03:00
Georgi Gerganov	f6e1a7aa87	context : simplify output counting logic during decode (#14142 ) * batch : remove logits_all flag ggml-ci * context : simplify output counting logic during decode ggml-ci * cont : fix comments	2025-06-12 11:50:01 +03:00
Georgi Gerganov	c3ee46fab4	batch : remove logits_all flag (#14141 ) ggml-ci	2025-06-12 11:49:26 +03:00
Georgi Gerganov	9596506965	kv-cache : fix split_equal handling in unified implementation (#14130 ) ggml-ci	2025-06-12 10:02:15 +03:00
compilade	a20b2b05bc	context : round n_tokens to next multiple of n_seqs when reserving (#14140 ) This fixes RWKV inference which otherwise failed when the worst case ubatch.n_seq_tokens rounded to 0.	2025-06-12 02:56:04 -04:00
Georgi Gerganov	89a184fa71	kv-cache : relax SWA masking condition (#14119 ) ggml-ci	2025-06-11 16:48:45 +03:00
Georgi Gerganov	7ae2932116	kv-cache : add LLAMA_KV_CACHE_DEBUG environment variable (#14121 )	2025-06-11 12:52:45 +03:00
compilade	dad5c44398	kv-cache : avoid modifying recurrent cells when setting inputs (#13834 ) * kv-cache : avoid modifying recurrent cells when setting inputs * kv-cache : remove inp_s_mask It was replaced with equivalent and simpler functionality with rs_z (the first zeroed state) and the already-existing inp_s_copy. * kv-cache : fix non-consecutive token pos warning for recurrent models The problem was apparently caused by how the tail cells were swapped. * graph : simplify logic for recurrent state copies * kv-cache : use cell without src refs for rs_z in recurrent cache * llama-graph : fix recurrent state copy The `state_copy` shuffle assumes everything is moved at once, which is not true when `states_extra` is copied back to the cache before copying the range of states between `head` and `head + n_seqs`. This is only a problem if any of the cells in [`head`, `head + n_seqs`) have an `src` in [`head + n_seqs`, `head + n_kv`), which does happen when `n_ubatch > 1` in the `llama-parallel` example. Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes. * llama-graph : rename n_state to state_size in build_recurrent_state This naming should reduce confusion between the state size and the number of states.	2025-06-10 18:20:14 -04:00
Sigbjørn Skjæret	3678b838bb	llama : support GEGLU for jina-bert-v2 (#14090 )	2025-06-10 18:02:08 +02:00
Georgi Gerganov	40cbf571c9	kv-cache : fix shift and defrag logic (#14081 ) * kv-cache : fix shift ggml-ci * cont : reset shift[i] ggml-ci * cont : fix defrag erasing cells that didn't move ggml-ci	2025-06-09 23:04:35 +03:00
Georgi Gerganov	201b31dc2e	graph : fix geglu (#14077 ) ggml-ci	2025-06-09 17:17:31 +03:00
Đinh Trọng Huy	91a8ee6a6f	add geglu activation function (#14074 ) Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>	2025-06-09 05:15:31 +01:00
Sigbjørn Skjæret	0974ad7a7c	llama : fix llama_model_chat_template with template name (LLM_KV with suffix) (#14050 )	2025-06-07 14:13:12 +02:00
Georgi Gerganov	745aa5319b	llama : deprecate llama_kv_self_ API (#14030 ) * llama : deprecate llama_kv_self_ API ggml-ci * llama : allow llama_memory_(nullptr) ggml-ci * memory : add flag for optional data clear in llama_memory_clear ggml-ci	2025-06-06 14:11:15 +03:00
Georgi Gerganov	487a5e0401	context : fix SWA-related warning for multiple sequences (#14045 )	2025-06-06 13:29:18 +03:00
Sigbjørn Skjæret	d17a809ef0	llama : support multiple classifier outputs and labels (#13940 )	2025-06-06 09:03:25 +02:00
Georgi Gerganov	7f37b6cf1e	memory : migrate from llama_kv_cache to more generic llama_memory (#14006 ) * memory : merge llama_kv_cache into llama_memory + new `llama_memory` API ggml-ci * context : fix casts ggml-ci	2025-06-05 15:29:22 +03:00
Diego Devesa	3a077146a4	llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013 )	2025-06-05 11:57:42 +02:00
Sigbjørn Skjæret	9f47fa5792	vocab : warn about missing mask token (#14022 )	2025-06-05 09:29:18 +02:00
Georgi Gerganov	9e31bec4fd	context : fix pos_min initialization upon error decode (#14008 ) ggml-ci	2025-06-05 09:06:29 +03:00
Georgi Gerganov	3e63a58ef7	kv-cache : refactor the update/defrag mechanism (#13988 ) * kv-cache : refactor update mechanism ggml-ci * memory : improve status handling * defrag : reset head + add comments ggml-ci * cont : minor fixes ggml-ci	2025-06-04 18:58:20 +03:00
Xuan-Son Nguyen	3ac67535c8	llama-graph : use ggml_repeat_4d (#13998 )	2025-06-04 10:11:26 +02:00
Georgi Gerganov	e0e806f52e	kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985 ) ggml-ci	2025-06-04 09:50:32 +03:00
Georgi Gerganov	5582c49c39	gemma : more consistent attention scaling for v2 and v3 (#13951 ) * gemma : fix attn scale for 27B * cont : apply scale before attn * cont : consistent attention scaling	2025-06-02 20:54:26 +03:00
Sigbjørn Skjæret	5e1c3aed40	convert : fix nomic-bert-moe mask token (#13757 )	2025-06-01 18:07:21 +02:00
Georgi Gerganov	0fc16b42e8	kv-cache : split implementation in separate sources (#13920 ) ggml-ci	2025-06-01 11:39:27 +03:00
Georgi Gerganov	803f8baf4f	llama : deprecate explicit kv_self defrag/update calls (#13921 ) ggml-ci	2025-05-31 15:58:33 +03:00
Georgi Gerganov	3600cc2886	llama : use n_swa + n_ubatch cells for SWA cache (#13833 ) * llama : use n_swa + n_ubatch cells for SWA cache ggml-ci * llama : add warning about multi-sqeuence SWA contexts	2025-05-31 15:57:44 +03:00
Georgi Gerganov	3f55f781f1	llama : auto-batch preparation (#13845 ) * llama : auto-batch ggml-ci * context : simplify if branching	2025-05-31 12:55:57 +03:00
Georgi Gerganov	12d0188c0d	kv-cache : refactor + add llama_memory_state_i (#13746 ) * kv-cache : simplify the "struct llama_kv_cache" interface ggml-ci * kv-cache : revert the (n_swa + n_ubatch) change (for next PR) ggml-ci * kv-cache : some comments ggml-ci * context : fix graph reserve for multiple sequences ggml-ci * kv-cache : fix typo [no ci] * kv-cache : fix find_slot() logic for free slots ggml-ci * llama : add TODO for deprecating the defrag API in the future * kv-cache : improve find_slot() using min/max seq pos info ggml-ci * llama : handle aborts and compute errors ggml-ci * memory : extract state into llama_memory_state ggml-ci * kv-cache : add comments ggml-ci * server : update batching logic to reset n_batch on successful decode * server : upon full re-processing, remove the sequence from the cache * kv-cache : add TODO for doing split_equal when split_simple fails ggml-ci	2025-05-31 10:24:04 +03:00
Đinh Trọng Huy	291f2b6913	llama : add support for DistilBert (#13907 ) * add distilbert * small fixes * add note for LLM_ARCH_DISTIL_BERT * Use MODEL_ARCH.BERT for DistilBert --------- Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>	2025-05-30 11:56:02 +02:00
zhangkaihuo	2c90da4c7e	llama : use llm_build_granite for minicpm (#13911 )	2025-05-30 10:31:48 +02:00
Sigbjørn Skjæret	e83ba3e460	llama : add support for jina-reranker-v2 (#13900 )	2025-05-29 21:42:31 +02:00
Sigbjørn Skjæret	6385b843a8	llama : add RobertaForSequenceClassification reranker support (#13875 )	2025-05-29 08:15:01 +02:00
Xuan-Son Nguyen	763d06edb7	llama : fix KV shift for qwen2vl (#13870 ) * llama : fix KV shift for qwen2vl * add ref to the PR	2025-05-28 22:35:31 +02:00
Đinh Trọng Huy	e0e3aa231d	llama : add support for BertForSequenceClassification reranker (#13858 ) * convert: add support for BertForSequenceClassification * add support for reranking using BertForSequenceClassification * merge checks of eos and sep * fix lint --------- Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>	2025-05-28 19:01:58 +02:00
Georgi Gerganov	34b7c0439e	cmake : add llama-cparams.cpp to build (#13832 )	2025-05-27 19:08:44 +03:00
Georgi Gerganov	81713121ee	kv-cells : track min/max used cells and per-sequence positions (#13808 ) * kv-cells : track min/max used cells and per-sequence positions ggml-ci * kv-cells : fix pos-modification updates for seq_pos ggml-ci * kv-cells : add comments ggml-ci	2025-05-27 13:49:41 +03:00
Georgi Gerganov	f9cd68398b	sampling : make sure samplers return at least 1 token (#13822 ) * sampling : min-p should always return at least one token ggml-ci * sampling : same for typical sampling * tests : sampling tests use min_keep == 0 ggml-ci	2025-05-27 12:07:52 +03:00
Georgi Gerganov	4f81b33e32	llama : validate seq id batch input (#13809 ) * llama : validate seq id batch input ggml-ci * cont : fix the fix ggml-ci	2025-05-27 09:40:59 +03:00
Georgi Gerganov	79c137f776	examples : allow extracting embeddings from decoder contexts (#13797 ) ggml-ci	2025-05-26 14:03:54 +03:00
Georgi Gerganov	de2ef53a4b	kv-cache : rework kv_cell (#13706 ) * kv-cache : rework kv_cell ggml-ci * kv-cells : use "shift" instead of "delta" consistently ggml-ci * llama : add llama_max_parallel_sequences() ggml-ci * kv-cells : update comments [no ci] * context : fail upon construction if sequences exceed max value ggml-ci * kv-cells : get_pos() -> pos_get() + comments ggml-ci * kv-cells : fix tracking of "used" cells ggml-ci	2025-05-25 16:34:36 +03:00
Piotr Jasiukajtis	4032ca4066	llama : add support for Qwen3 MoE tied word embeddings (#13768 )	2025-05-25 10:29:43 +02:00
Olivier Chafik	f5cd27b71d	`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379 ) * add common_json w/ support for truncated json healing * add common_chat_msg_diff * partial common_chat_parse * refactor parser w/ optionals * server: wire chat diffs in stream mode * fix trigger of thinking models (must happen after thoughts are closed) * fix functionary v3.2 raw python! * rename: common_chat_syntax (now contains format) * rm common_regex.at_start * don't return empty <think></think> * accommodate yet another deepseek r1 distill fantasy syntax (`<｜tool▁calls｜>`) * fix QwQ 32B tool call parsing after thoughts (hermes2) * better logs for grammar triggers * consume spaces after parse_json_tool_calls * fix required tool calls w/ thinking models that have pre-opened thinking tags * fix thinking model's initial trigger + test qwq's template * run most test_tool_call tests in stream + non-stream modes * make functionary v3.2 parsing more strict (differentiate first match from others) * send final diff from server, to close off raw python arguments * support partial content streaming in Generic mode * tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5) * Update function-calling.md * Update tool_bench.py * chat-parser: remove input from exception (llm output may contain PII) --------- Co-authored-by: ochafik <ochafik@google.com> Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>	2025-05-25 01:48:08 +01:00
0cc4m	259469c4b5	Move GLM4 f32 attention fix to the correct function (#13750 )	2025-05-24 16:49:12 +02:00
Sigbjørn Skjæret	c3a2624339	vocab : fix ugm tokenizer precision (#13743 )	2025-05-24 12:29:09 +02:00
Georgi Gerganov	d13d0f6135	hparams : initialize arrays (#13728 ) ggml-ci	2025-05-23 20:16:13 +03:00
Xuan-Son Nguyen	8a2afb7520	llama : allow custom list of swa_layers (#13726 )	2025-05-23 17:07:04 +02:00
Georgi Gerganov	8a1d206f1d	tts : fix n_ubatch + make WavTokenizer cache-less (#13713 ) ggml-ci	2025-05-22 22:21:07 +03:00
Georgi Gerganov	8e186ef0e7	hparams : support models for which all layers use SWA (#13682 ) ggml-ci	2025-05-21 20:00:49 +03:00
Georgi Gerganov	797f2ac062	kv-cache : simplify the interface (#13660 ) * kv-cache : simplify the interface ggml-ci * context : revert llama_batch_allocr position change ggml-ci	2025-05-21 15:11:13 +03:00
Georgi Gerganov	b44890df2e	model : disable SWA for Phi models (#13676 ) * model : disable SWA for Phi models ggml-ci * model : update warning message * model : print warning only if n_swa > 0 * model : fix typo	2025-05-21 13:09:21 +03:00
Georgi Gerganov	be0239693c	model : fix llama4 graph (#13663 ) ggml-ci	2025-05-20 19:21:04 +03:00
Georgi Gerganov	a4090d1174	llama : remove llama_kv_cache_view API + remove deprecated (#13653 ) ggml-ci	2025-05-20 16:13:16 +03:00
0cc4m	c9c64dee57	Set GLM4 blk..attn_output.weight, kqv_out- matmul to GGML_PREC_F32 to fix infinity values in output (#13639 )	2025-05-20 10:11:56 +02:00
Georgi Gerganov	e298d2fbd0	kv-cache : add SWA support (#13194 ) * kv-cache : prepare for SWA ggml-ci * kv-cache : initial iSWA implementation ggml-ci * kv-cache : rework error recovery logic ggml-ci * models : fix Phi-3 SWA parameters ggml-ci * model : adjust Granite to rope factor changes ggml-ci * server : check if context can do shifts ggml-ci * iswa : for now, always enable shifts (experiment) ggml-ci * kv-cache : simplify SWA logic ggml-ci * kv-cache : apply defrag when we fail to find slots for the batch ggml-ci * llama : update docs about llama_decode ggml-ci * kv-cache : update warning logs when no space for the batch is available ggml-ci * llama : add llama_kv_self_seq_pos_min() * kv-cache : keep track of partial SWA computes and print warnings * server : disallow use cases involving partial SWA context ggml-ci * llama : add param to control SWA cache size ggml-ci * minor : clean-up ggml-ci	2025-05-20 08:05:46 +03:00
Diego Devesa	5364ae4ba5	llama : print hint when loading a model when no backends are loaded (#13589 )	2025-05-16 16:38:07 +02:00
Diego Devesa	c6a2c9e741	gguf : use ggml log system (#13571 ) * gguf : use ggml log system * llama : remove unnecessary new lines in exception messages	2025-05-15 19:13:11 +02:00
Georgi Gerganov	e3a9421b78	kv-cache : fix out-of-bounds view during reserve graph (#13547 ) * kv-cache : fix reserve graph out-of-bounds access ggml-ci * cont : add comment * cont : fix comments [no ci] * cont : more correct comment [no ci]	2025-05-14 23:15:15 +03:00
Sigbjørn Skjæret	f5170c1d7a	editorconfig : fix trailing whitespace from #13542 (#13546 )	2025-05-14 21:22:49 +03:00
Gilad S.	017f10b5fa	fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542 )	2025-05-14 19:18:18 +03:00
Diego Devesa	b7d2672082	llama : fix quantize with dl backends (#13539 )	2025-05-14 16:12:36 +02:00
Gabe Goodhart	5e7d95e22e	fix: Move build_inp_pos to the top of the graph section for build_granite (#13538 ) This matches how others do it, but will still avoid the extra initialization when rope is disabled. Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-05-14 15:53:59 +03:00
Ed Addario	e5c834f718	quantize : improve tensor-type pattern matching (#13033 )	2025-05-13 19:12:31 +02:00
Gabe Goodhart	d590cd4c24	model : Granite MoE shared (#13269 ) * feat: Add GGUF conversion for granitemoeshared Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: hparam and arch plumbing for granitemoeshared Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Split MoE fused tensors for shared experts in conversion Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First WIP cut at model arch in cpp The hparam and architecture plumbing should be correct, but the implementation of the shared experts seems to still be broken. Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Cleaner (maybe more correct?) splitting for gate/up Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix the input to the shared experts I had misread that the shared experts take the inputs _before_ the standard MoE layer and was feeding the output of the MoE to the shared experts. Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Avoid architecture-specific checks for Granite MoE Shared This is a cleaner way that will allow more flexibility in architecture strings going forward. Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Split granite architectures out of llm_build_llama This helps de-clutter the llama-family graph construction and allows granite to diverge further (in preparation for Granite 4). NOTE: I removed the granite scale factors from llm_build_deci because they appear to only be there as copy-paste from llm_build_llama. The HF config does not seem to set those values: https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix compiler warning about uninitialized inp_pos This should not have been reachable, but it warns on some compliers Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Consoladate GraniteMoEShared into GraniteMoE for conversion Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side Branch: GraniteMoEShared Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-05-13 15:12:01 +02:00
Johannes Gäßler	10d2af0eaa	llama/ggml: add LLM training support (#10544 ) * llama/ggml: add LLM training support more compact progress bar llama_save_model_to_file llama_opt_param_filter ggml_graph_dup force_grads refactor ggml_opt, fix test-opt * remove logits_all * refactor CUDA implementation for ACC * reset graph at beginning of opt period	2025-05-12 14:44:49 +02:00
Georgi Gerganov	064cc596ac	context : fix state io for memory-less contexts (#13470 ) ggml-ci	2025-05-12 15:12:27 +03:00
David Huang	7f323a589f	Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386 )	2025-05-11 14:18:39 +02:00
Sigbjørn Skjæret	d2a4ef05c6	vocab : add ByteDance-Seed/Seed-Coder (#13423 )	2025-05-10 22:08:07 +02:00
Johannes Gäßler	0cf6725e9f	CUDA: FA support for Deepseek (Ampere or newer) (#13306 ) * CUDA: FA support for Deepseek (Ampere or newer) * do loop unrolling via C++ template	2025-05-09 13:34:58 +02:00
Diego Devesa	27ebfcacba	llama : do not crash if there is no CPU backend (#13395 ) * llama : do not crash if there is no CPU backend * add checks to examples	2025-05-09 13:02:07 +02:00
Xuan-Son Nguyen	3f96aeff39	llama : one-off chat template fix for Mistral-Small-2503 (#13398 ) * llama : one-off chat template fix for Mistral-Small-2503 * update readme * add mistral-v7-tekken	2025-05-09 11:17:51 +02:00
Georgi Gerganov	6562e5a4d6	context : allow cache-less context for embeddings (#13108 ) * context : allow cache-less context for embeddings ggml-ci * context : enable reranking with encode() ggml-ci * context : encode() clears embd_seq ggml-ci * examples : use llama_encode() when appropriate ggml-ci * models : nomic bert moe does not require KV cache * llama : update comments for llama_decode/llama_encode ggml-ci * context : update warning log [no ci]	2025-05-08 14:28:33 +03:00
Georgi Gerganov	51fb96b1ff	context : remove logits_all flag (#13284 ) * context : remove logits_all flag ggml-ci * llama : remove logits_all flag + reorder llama_context_params ggml-ci	2025-05-08 14:26:50 +03:00
Diego Devesa	f061021206	llama : print size and type of overridden tensors (#13364 )	2025-05-08 13:15:15 +02:00
Sigbjørn Skjæret	bc4e1128f7	llama : deci : support ffn-free with attention (#13296 )	2025-05-07 12:49:27 +02:00
piDack	6c7fd67b64	llama : support tie embedding for chatglm models (#13328 )	2025-05-07 09:23:11 +02:00
DocShotgun	ffc727203a	sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345 )	2025-05-06 22:36:24 +02:00
oobabooga	91a86a6f35	sampling : don't consider -infinity values in top_n_sigma (#13344 )	2025-05-06 20:24:15 +02:00
Xuan-Son Nguyen	2f54e348ad	llama : fix build_ffn without gate (#13336 ) * llama : fix build_ffn without gate * fix build on windows * Revert "fix build on windows" This reverts commit `fc420d3c7e`.	2025-05-06 14:25:40 +02:00
oobabooga	233461f812	sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264 ) * sampling: add Top-nσ sampler to `llama-server` and sampler ordering * revert: sampler ordering * revert: VS' crappy auto-formatting * revert: VS' crappy auto-formatting pt.2 * revert: my crappy eye sight... * sampling: add XTC to Top-nσ sampler chain * sampling: add Dyna. Temp. to Top-nσ sampler chain * sampling: actually remove Top-nσ from sampler(oops) * Integrate top_n_sigma into main sampler chain * Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA * Formatting * Lint * Exit early in the sampler if nsigma < 0 --------- Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>	2025-05-05 22:12:19 +02:00
ymcki	3bf785f3ef	llama : Llama-3_1-Nemotron-Ultra-253B-v1 support (#12843 )	2025-05-03 17:39:51 +02:00
Georgi Gerganov	a75cb30dc9	context : fix reorder logic (#13267 ) ggml-ci	2025-05-02 20:54:13 +03:00
Jared Van Bortel	2f567611c0	llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (#13245 )	2025-05-02 11:42:30 -04:00
Georgi Gerganov	c642bc014c	kv-cache : separate recurrent vs non-recurrent impl (#12799 ) * kv-cache : serparate recurrent vs non-recurrent impl (wip) ggml-ci * kv-cache : init -> contructor + add llama_memory_params ggml-ci * kv-cache : fix callback reference ggml-ci * context : llama_kv_cache -> llama_memory_i ggml-ci * context : move memory creation logic to model ggml-ci * llama : remove reference of memory during encode ggml-ci * kv-cache : hide padding details in the implementation ggml-ci * kv-cache : add ubatch_next() ggml-ci * context : simplify sbatch logic ggml-ci * kv-cache : hide defrag logic in the implementation ggml-ci * context : hide kv cache details in implementation ggml-ci * build : fix ggml-ci * cont : another fix ggml-ci * kv-cache : simplify interface (wip) ggml-ci * kv-cache : use separate KV cell structs for unified/recurrent ggml-ci * kv-cache : clean-up ggml-ci * model : better llama_model::create_model() signature ggml-ci * kv-cache : fix recurrent seq_rm() ggml-ci * kv-cache : replace `struct callbacks` with `llama_model &` ggml-ci * kv-cache : replace `struct graph_params` with `llama_context &` ggml-ci * kv-cache : fix offload check ggml-ci * context : avoid passing unique_ptr ggml-ci * kv-cache : avoid using the backends from the llama_context ref #13113 ggml-ci * kv-cache : more consistent debug logs [no ci] * kv-cache : do not pass the full llama_context for kv graphs ggml-ci * kv-cache : remove comment * kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext ggml-ci * kv-cache : fix recurrent multi-user case ggml-ci * memory : remove comments [no ci]	2025-05-02 17:48:36 +03:00
Sigbjørn Skjæret	cb06a3c363	llama : orion rope type is neox (#13261 )	2025-05-02 12:44:24 +02:00
Sigbjørn Skjæret	626083faf7	llama : plamo rope type is neox (#13260 )	2025-05-02 12:40:56 +02:00
piDack	2af6880178	llama-chat : reset glmedge chat template (#13253 ) * reset glmedge chat template * fix glmedge chat template	2025-05-02 11:06:09 +02:00

493 Commits