* llama: automatically fit args to free memory
llama-fit-params tool
* fix CI
* hints for bug reports, ensure no reallocation
* fix segfault with Vulkan
* add llama-fit-params to CI
* fix CI
* fix CI
* fix CI
* minor adjustments
* fix assignment of 1 dense layer
* fix logger not being reset on model load failure
* remove --n-gpu-layers hint on model load failure
* fix llama-fit-params verbosity
* fix edge case
* fix typo [no ci]
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph
* Update src/llama-context.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add missing const
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit changes the logging level for output buffer reallocations
in the llama_context::output_reserve function from INFO to DEBUG.
The motivation for this is that the message is currently logged at the
info level, so when verbose logging is enabled for llama-cli it gets
mixed in with the generated output, for example:
```console
What is the capital of Sweden?output_reserve: reallocating output buffer from size 0.58 MiB to 1.74 MiB
1. Stockholm
2\. Helsinki
Based are the options
1. Stockholm
Explanation: Stockholm is the capital of
...
```
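A minimal sketch of the change, assuming the LLAMA_LOG_INFO/LLAMA_LOG_DEBUG
macros used in llama.cpp (the surrounding variable names are illustrative):
```cpp
// Sketch of the log-level change inside llama_context::output_reserve:
// the reallocation message is demoted from info to debug so it no longer
// interleaves with generated text when verbose logging is enabled.

// before:
// LLAMA_LOG_INFO ("%s: reallocating output buffer from size %.2f MiB to %.2f MiB\n",
//         __func__, prev_size_mib, new_size_mib);

// after:
LLAMA_LOG_DEBUG("%s: reallocating output buffer from size %.2f MiB to %.2f MiB\n",
        __func__, prev_size_mib, new_size_mib);
```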
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.
* llama : update worst-case graph for unified cache
* ci : disable op offload in some tests
* fix spelling
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Fix llama-save-load-state, which currently fails, by handling the case
where batch.logits is nullptr (as when loading state) and allocating
space for all outputs as CPU logits.
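A hedged sketch of the idea, using the public llama_batch fields; the
actual code in llama_context differs:
```cpp
// Sketch: when the batch carries no per-token output flags
// (batch.logits == nullptr), e.g. when restoring saved state, reserve
// CPU logits for all tokens instead of dereferencing the null pointer.
uint32_t n_outputs = 0;

if (batch.logits == nullptr) {
    // no per-token flags: assume every token produces (CPU) logits
    n_outputs = batch.n_tokens;
} else {
    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        if (batch.logits[i]) {
            n_outputs++;
        }
    }
}

// ... reserve n_outputs * n_vocab floats of host logits as before ...
```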
* Qwen3 Next - cleaned up version
* Whitespaces and stuff
* Correct minor errors
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Misc. fixes.
* Clean up code, add missing hybrid qualifier
* Did someone transpose the SOLVE_TRI result matrix? Perhaps...
* Whitespace
* Proper tensors for cb calls
* Use llama-graph.h vertical alignment
* BROKEN: chunking
* Set new tensors as inputs.
* Proper chunk logic
* It's the circle of life...
* More shenanigans for n_seq > 1
* Nail in the coffin?
* Fix Windows build
* Eh, one fails on Windows, the other fails on Mac... just use general capture.
* quant : cleanup
* model : cleanup
* qwen3 : cleanup
* cont : cleanup
* cont : cleanup
* ggml : revert change
* qwen3 : cleanup
* cont : cleanup
* Readd cmath
* qwen3 : fix typo
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Usual suspects
* fix my bad suggestion
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit updates the backend sampling implementation to support
intermixed usage of backend and CPU samplers within the same batch.
The initial implementation was developed as an all-or-nothing solution:
either perform backend sampling for the entire batch, or perform CPU
sampling for the entire batch.
The motivation for this change is to support batches with mixed
sequences. For example, we may have a backend sampler configured for
sequence 0, while sequence 1 in the same batch uses CPU sampling. This
was not supported in the initial implementation.
This issue manifested in llama-server with the webui: decoding with
backend samplers would work initially, but after changing to CPU
sampling, a slot (sequence) could still be using a backend sampler.
This meant that logits in output_reserve would not be allocated,
resulting in an error.
The solution in this commit inspects the batch to determine which
sampling modes are needed and allocates buffers accordingly. However,
there is a known inefficiency: when we have intermixed backend/CPU
samplers in the same batch, we currently copy all logits to the host,
even for sequences using backend samplers.
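A rough sketch of the batch inspection (has_backend_sampler is a
hypothetical helper, not the actual API):
```cpp
// Sketch: scan the batch to see which sampling modes it needs, then
// allocate host logits and/or backend sampling buffers accordingly.
bool need_cpu_logits       = false;
bool need_backend_sampling = false;

for (int32_t i = 0; i < batch.n_tokens; ++i) {
    for (int32_t s = 0; s < batch.n_seq_id[i]; ++s) {
        const llama_seq_id seq_id = batch.seq_id[i][s];

        if (has_backend_sampler(seq_id)) {   // hypothetical helper
            need_backend_sampling = true;
        } else {
            need_cpu_logits = true;
        }
    }
}

// known inefficiency: if both flags are set (mixed batch), all logits are
// currently copied to the host, even for sequences sampled on the backend.
```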
Added test_backend_cpu_mixed_batch to verify correct behavior with
mixed backend/CPU samplers in a single batch, including dynamic
sampler switching between decode calls.
This commit adds a check to skip the output reordering logic when
n_outputs == 1. With a single output token, the data is trivially
sorted and the reordering code is currently doing unnecessary work
(resetting and rebuilding output_ids to the same values).
The motivation for this change is improved code clarity and avoiding
confusion when debugging. While the performance impact is probably
negligible, this unnecessary work happens on every decode call in
llama-server when processing batches with single-token outputs.
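A minimal sketch of the early-out, assuming the surrounding reordering
logic in llama_context:
```cpp
// Sketch: with a single output the ordering is trivially correct, so the
// reset-and-rebuild of output_ids below would only recreate the same values.
if (n_outputs > 1) {
    // ... existing logic that sorts the output data and rebuilds output_ids ...
}
```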
This commit fixes the issue where both sampled tokens and logits/probs
were not being copied correctly from the backend to the host when
multiple backend samplers were used.
A test for this scenario has also been added to ensure that both types
of data are copied correctly when different backend samplers are
employed.
This commit introduces a sampling_info struct to encapsulate all
backend sampling related data within the llama_context class.
It also updates the backend sampler ggml data structure to use more
descriptive names for the sampled tokens and candidates.
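A hedged sketch of what such an encapsulation might look like (the
field names are illustrative, not the actual definition):
```cpp
// Illustrative grouping of backend-sampling state inside llama_context.
struct sampling_info {
    // tensors produced by the backend sampler chain in the compute graph
    ggml_tensor * sampled_tokens = nullptr; // selected token ids per output
    ggml_tensor * candidates     = nullptr; // filtered logits/probabilities

    // host-side copies of the backend results
    std::vector<llama_token> sampled_token_ids;
    std::vector<float>       candidate_data;
};
```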
This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.
The motivation for this is that it allows both common/sampling.cpp and
src/llama-sampling.cpp to simplify their logic.
Not all backend samplers that process logits need to set
sampled_tokens_id, since they may not change the order of the logits;
for example, the temperature sampler only scales the logits without
changing their order. Similarly, the logit bias sampler only adds a
bias to specific token ids but does not change the order of the logits.
In these cases there is no device-to-host copy of the sampled token
ids, and this is the use case where having this precomputed list is
useful.
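A minimal sketch of the precomputation in the constructor
(full_vocab_token_ids is an assumed member name):
```cpp
// Sketch: build the identity list [0, 1, ..., n_vocab-1] once in the
// llama_context constructor so that llama_get_backend_sampled_token_ids_ith
// can always return a valid pointer, even when no device-to-host copy of
// sampled token ids was made.
const int32_t n_vocab = llama_vocab_n_tokens(vocab);   // vocab: the loaded model's vocabulary

full_vocab_token_ids.resize(n_vocab);                  // std::vector<llama_token>, assumed member
for (llama_token i = 0; i < n_vocab; ++i) {
    full_vocab_token_ids[i] = i;
}
```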
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.
The motivation for this feature is to enable sampling to be performed
directly on the backend as part of the computation graph being executed,
allowing for some or all of the sampling to be done on the backend.
For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.
It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.
Currently, backend sampling works in a similar manner to pooling: it is
a function that is called by build_graph, and the sampler operations
become part of the model's computation graph.
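A hedged sketch of the idea using plain ggml ops (greedy sampling with
temperature; this is not the actual sampler-chain implementation):
```cpp
// Sketch: append sampling ops to the graph so the backend produces the
// sampled token ids directly, analogous to how pooling ops are appended
// by build_graph. Only the small result tensor then needs to be copied
// from device to host.
static ggml_tensor * build_sample_greedy(
        ggml_context * ctx,
        ggml_cgraph  * gf,
        ggml_tensor  * logits,  // F32 [n_vocab, n_outputs]
        float          temp) {
    // temperature: scale the logits (order-preserving)
    ggml_tensor * cur = ggml_scale(ctx, logits, 1.0f/temp);

    // probability distribution (needed only if probs are read back)
    cur = ggml_soft_max(ctx, cur);

    // greedy selection: argmax over the vocab dimension -> I32 [n_outputs]
    ggml_tensor * sampled = ggml_argmax(ctx, cur);

    ggml_build_forward_expand(gf, sampled);

    return sampled;
}
```
Filtering samplers (e.g. top-k) would instead leave a reduced candidates
tensor in the graph, and only that tensor would need to be copied back to
system memory for the CPU samplers to finish.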