Commit Graph

802 Commits

Oliver Simons 0a17687c72 Make backend dist sampler use same rnds as dist sampler
We sample in double precision and cast to float to match the random numbers of
llama_sampler_dist, which uses double precision (sampling from
std::uniform_real_distribution<double> and
std::uniform_real_distribution<float> with the same rng will produce
different sequences).
2025-12-19 11:43:19 +01:00
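
A standalone sketch of the effect described above (illustrative code, not part of the commit; the std::mt19937 engine is an assumption): with the same seed, the float and double distributions consume different amounts of engine output per draw and diverge immediately, so drawing in double and casting is what keeps the two samplers bit-identical.

```cpp
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng_f(42);
    std::mt19937 rng_d(42);

    std::uniform_real_distribution<float>  dist_f(0.0f, 1.0f);
    std::uniform_real_distribution<double> dist_d(0.0, 1.0);

    for (int i = 0; i < 4; ++i) {
        const float f = dist_f(rng_f);          // float path: typically one 32-bit draw
        const float d = (float) dist_d(rng_d);  // double path: typically two draws, then cast
        printf("draw %d: float = %.8f  double-cast = %.8f\n", i, f, d);
    }
    return 0;
}
```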
Georgi Gerganov 3b3f5fed31
common : disable backend sampling when grammar is involved 2025-12-18 10:52:21 +02:00
Georgi Gerganov eefdb0da17
Merge branch 'master' into HEAD 2025-12-18 10:12:47 +02:00
Johannes Gäßler 57c1e05643
llama: offload output layer to GPU first (#18148) 2025-12-18 08:12:18 +01:00
Julius Tischbein 4d4f4cacd1
llama : Async DirectIO model loading on Linux (#18012)
* Uncached model read

* Removing additional --mmap arg

* Removing trailing whitespaces

* Adding fallback when O_DIRECT is not supported

* Remove branching in llama-model-loader.cpp and reduce code duplication in llama-mmap.cpp

* Adding [[maybe_unused]] attribute for Mac and Windows.

* File seek aligned

* Removing all branches for direct_io in llama-model-loader.cpp

* Always use alignment from llama_file

* use_mmap=true
2025-12-18 08:27:19 +02:00
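
A minimal sketch of the O_DIRECT-with-fallback pattern the bullets above describe (hypothetical code, not the actual llama-mmap.cpp implementation; the 4096-byte alignment is an assumed block size):

```cpp
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    if (argc < 2) {
        return 1;
    }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);  // uncached (DirectIO) read path
    if (fd < 0) {
        fd = open(argv[1], O_RDONLY);             // fallback when O_DIRECT is not supported
    }
    if (fd < 0) {
        return 1;
    }

    const size_t align = 4096;  // assumed logical block size
    void * buf = nullptr;
    if (posix_memalign(&buf, align, align) != 0) {
        close(fd);
        return 1;
    }

    // O_DIRECT requires the buffer address, file offset and length to all be
    // multiples of the block size - hence the "file seek aligned" bullet above
    const ssize_t n = pread(fd, buf, align, 0);
    printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}
```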
Johannes Gäßler 8dcc3662a2
llama-fit-params: fix memory print (#18136) 2025-12-17 21:10:03 +01:00
Georgi Gerganov 4301e27319
common : restore grammar-based rejection sampling (#18137)
* common : restore grammar-based rejection sampling

* sampling : allow null samplers
2025-12-17 19:46:00 +02:00
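
A generic sketch of grammar-based rejection sampling as the title uses the term (illustrative only; grammar_accepts and the greedy pick are stand-ins, not the common/sampling.cpp code): sample normally first, and only run the grammar filter over the full candidate set when the sampled token is rejected.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct candidate {
    int   id;
    float logit;
};

static bool grammar_accepts(int id) {
    return id % 3 == 0;  // stand-in for a real grammar check
}

static int sample_greedy(const std::vector<candidate> & cands) {
    return std::max_element(cands.begin(), cands.end(),
        [](const candidate & a, const candidate & b) { return a.logit < b.logit; })->id;
}

int main() {
    std::vector<candidate> cands = { {1, 0.9f}, {2, 0.5f}, {3, 0.4f} };

    int tok = sample_greedy(cands);  // fast path: no grammar work
    if (!grammar_accepts(tok)) {
        // slow path: mask every rejected token, then sample again
        for (auto & c : cands) {
            if (!grammar_accepts(c.id)) {
                c.logit = -1e30f;
            }
        }
        tok = sample_greedy(cands);
    }
    printf("token = %d\n", tok);  // 3: the first pick (1) was rejected
    return 0;
}
```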
Tarek Dakhran 982060fadc
model: fix LFM2_MOE missing tensors (#18132) 2025-12-17 12:17:11 +01:00
Daniel Bevenius c5d44b8525
llama : fix typo in comment [no ci] 2025-12-17 09:02:30 +01:00
Johannes Gäßler d0794e89d9
llama-fit-params: force disable mlock (#18103) 2025-12-17 00:50:12 +01:00
Johannes Gäßler 9dcac6cf9f
llama-fit-params: lower ctx size for multi GPU (#18101) 2025-12-17 00:49:34 +01:00
Johannes Gäßler 0e49a7b8b4
llama-fit-params: fix underflow for dense models (#18095) 2025-12-17 00:47:37 +01:00
Xuan-Son Nguyen ef83fb8601
model: fix LFM2 missing tensors (#18105) 2025-12-16 19:07:43 +01:00
Johannes Gäßler ec98e20021
llama: fix early stop in params_fit if ctx is set (#18070) 2025-12-16 14:24:00 +01:00
Xuan-Son Nguyen 7f2b2f3c77
arch: refactor LLM_TENSOR_NAMES (#18051)
* arch: refactor LLM_TENSOR_NAMES

* update docs

* typo

* fix LLM_ARCH_NEMOTRON_H_MOE

* show more meaningful error message on missing tensor

* fix and tested LLM_ARCH_NEMOTRON_H_MOE
2025-12-16 13:22:30 +01:00
Piotr Wilkin (ilintar) a5251ca11d
Optimization: Qwen3 next autoregressive pass (#17996)
* It's Qwen3 Next, the lean mean token generation machine!

* Apply patches from thread

* Remove recurrent version, only keep chunked and autoregressive

* Remove unnecessary conts and asserts

* Remove more extra conts and asserts

* Cleanup masking
2025-12-16 11:59:53 +01:00
Xuan-Son Nguyen 3d86c6c2b5
model: support GLM4V vision encoder (#18042)
* convert ok

* no deepstack

* less new tensors

* cgraph ok

* add mrope for text model

* faster patch merger

* add GGML_ROPE_TYPE_MRNORM

* add support for metal

* move glm4v to dedicated graph

* convert: add norm_embd

* clip: add debugging fn

* working correctly

* fix style

* use bicubic

* fix mrope metal

* improve cpu

* convert to neox ordering on conversion

* revert backend changes

* force stop if using old weight

* support moe variant

* fix conversion

* fix convert (2)

* Update tools/mtmd/clip-graph.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* process mrope_section on TextModel base class

* resolve merge conflict

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 11:25:26 +01:00
Daniel Bevenius ad1b60abc4
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-16 09:45:08 +01:00
Chris Peterson 2aa45ef9e3
llama: Include algorithm header needed for C++23 (#18078) 2025-12-16 09:37:55 +02:00
Georgi Gerganov c560316440
graph : reuse SSM graphs (#16490)
* graph : reuse hybrid graphs

* graph : reuse recurrent graphs

* graph : fix reuse check for recurrent inputs

* memory : move the recurrent state into the memory context

* Revert "memory : move the recurrent state into the memory context"

This reverts commit 00f115fe810815d4a22a6dee0acc346131e970e1.

* cont : fix build
2025-12-16 09:36:21 +02:00
Daniel Bevenius 2995341730
llama : add support for NVIDIA Nemotron Nano 3 (#18058)
* llama : add support for NVIDIA Nemotron Nano 3

This commit adds support for the NVIDIA Nemotron Nano 3 model, enabling
the conversion and running of this model.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 07:19:26 +01:00
HelloKS 9d52f17ae3
model : add KORMo model (#18032)
* vocab: add KORMo Tokenizer

* model: add KORMoForCausalLM

* vocab: change pretokenizer to qwen2

* lint: fix unintended line removal

* model: make qwen2 bias tensor optional

* model: use qwen2 architecture for KORMo
2025-12-15 18:51:43 +01:00
ssweens 4529c660c8
kv-cache: Fix state restore with fragmented cache (#17982)
* kv-cache : fix state restore with fragmented cache (#17527)

Change find_slot to allow non-contiguous allocation during state restore. Fixes the 'failed to find available cells in kv cache' error when restoring state to a fragmented cache.

* tests : update logic

* cleanup: tightened state_read_meta sig, added is_contiguous case

* fix: state_read_meta arg reorder loose ends

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-15 19:28:35 +02:00
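
A minimal sketch of the allocation change (hypothetical code, not the actual find_slot implementation): when restoring state, the cells no longer have to form one contiguous run, so a fragmented cache with enough scattered free cells can still accept the sequence.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// collect up to n free cells, contiguous or not; the caller checks the count
static std::vector<uint32_t> find_free_cells(const std::vector<bool> & used, uint32_t n) {
    std::vector<uint32_t> cells;
    for (uint32_t i = 0; i < used.size() && cells.size() < n; ++i) {
        if (!used[i]) {
            cells.push_back(i);
        }
    }
    return cells;
}

int main() {
    // fragmented cache: the free cells are scattered, no run of 3 exists
    const std::vector<bool> used = { true, false, true, false, false, true, false, true };
    const auto cells = find_free_cells(used, 3);
    for (const auto c : cells) {
        printf("cell %u\n", c);  // 1, 3, 4
    }
    return 0;
}
```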
Johannes Gäßler b1f3a6e5db
llama: automatically set parameters not set by the user in a way that maximizes GPU utilization (#16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-15 09:24:59 +01:00
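
A hypothetical sketch of the general fitting idea (not the actual llama-fit-params implementation; the per-layer cost and free-VRAM figures are made up): pick the largest value of a parameter whose estimated memory use still fits into free memory.

```cpp
#include <cstdint>
#include <cstdio>

// made-up cost model: each offloaded layer costs a fixed amount of VRAM
static int64_t estimate_bytes(int n_gpu_layers) {
    const int64_t per_layer = 600LL * 1024 * 1024;
    return n_gpu_layers * per_layer;
}

int main() {
    const int64_t free_vram = 16LL * 1024 * 1024 * 1024;  // assumed 16 GiB free
    const int     n_layers  = 48;

    // binary search for the largest layer count that still fits
    int lo = 0, hi = n_layers;
    while (lo < hi) {
        const int mid = (lo + hi + 1) / 2;
        if (estimate_bytes(mid) <= free_vram) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    printf("suggested -ngl %d\n", lo);  // 27 with the figures above
    return 0;
}
```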
Georgi Gerganov 0086c246ee
Merge branch 'master' into HEAD 2025-12-14 16:44:30 +02:00
Xuan-Son Nguyen 0759b09c90
graph: add f_attn_temp_offset (#18025) 2025-12-14 13:05:59 +01:00
Georgi Gerganov 22c7f85b9c
Merge branch 'master' into HEAD 2025-12-14 10:19:58 +02:00
Georgi Gerganov 609a2d0268
models : fix YaRN regression + consolidate logic (#18006)
* models : fix YaRN regression + consolidate logic

* cont : fix the fix

* cont : remove header

* cont : add header
2025-12-14 08:34:56 +02:00
Jeff Bolz 5266379bca
llama_context: synchronize before reallocating output buffer (#17974) 2025-12-13 09:19:51 -06:00
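
A generic illustration of the ordering this fix enforces (std::async stands in for the backend; this is not llama_context code): reallocating a buffer while asynchronous work may still be writing into it pulls the memory out from under the writer, so the context must synchronize first.

```cpp
#include <future>
#include <vector>

int main() {
    std::vector<float> out(1024);

    // simulates the backend asynchronously writing the previous batch's results
    auto job = std::async(std::launch::async, [&out] {
        for (auto & x : out) {
            x = 1.0f;
        }
    });

    job.wait();        // "synchronize": wait for in-flight work to finish
    out.resize(4096);  // only now is reallocating the output buffer safe
    return 0;
}
```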
Georgi Gerganov 7bed317f53
models : fix the attn_factor for mistral3 graphs + improve consistency (#17945)
* models : fix the attn_factor for mistral3 graphs

* cont : rework attn_factor correction logic

* cont : make deepseek2 consistent

* cont : add TODO

* cont : special-case DSv2

* cont : revert Mistral 3 Large changes

* cont : fix DS2 to use the original attn_factor

* cont : minor comments
2025-12-12 17:12:40 +02:00
Georgi Gerganov 4d10b78e23
Merge branch 'master' into HEAD 2025-12-11 14:42:56 +02:00
Georgi Gerganov d9f8f60618
batch : fix sequence id ownership (#17915)
* batch : fix sequence id ownership

* cont : reduce allocations
2025-12-11 14:29:47 +02:00
Georgi Gerganov ab65b47a52
tests : run backend sampler tests always on the CPU 2025-12-11 14:14:47 +02:00
Georgi Gerganov 74b112e3e7
sampling : fix greedy 2025-12-11 13:37:02 +02:00
Georgi Gerganov 8544aba37f
sampling : generic ggml op support detection 2025-12-11 13:19:43 +02:00
Georgi Gerganov d5d16651a8
cont : fix build 2025-12-11 11:27:47 +02:00
Georgi Gerganov 54e9054017
sampling : optimize logit_bias sampler 2025-12-11 11:14:39 +02:00
Georgi Gerganov 4dff236a52
ggml : remove GGML_KQ_MASK_PAD constant (#17910)
* ggml : remove GGML_KQ_MASK_PAD constant

* cont : remove comment
2025-12-10 20:53:16 +02:00
Georgi Gerganov 804e7e3795
graph : respect sampler order for graph reuse 2025-12-10 20:40:15 +02:00
Georgi Gerganov 44d5c4b592
batch : fix sequence id ownership 2025-12-10 20:35:58 +02:00
Georgi Gerganov 38882247d3
Merge branch 'master' into HEAD 2025-12-10 17:07:21 +02:00
Eric Zhang b677721819
model : Qwen3-Next-80B-A3B has 48 layers (#17898)
* model : Qwen3-Next-80B-A3B has 48 layers

* model : Add 80B-A3B type name
2025-12-10 15:22:40 +01:00
Georgi Gerganov c02654eb7d
graph : make the compute graph constant with respect to active samplers 2025-12-10 16:19:18 +02:00
Georgi Gerganov 81cb5783c8
Merge branch 'master' into HEAD 2025-12-10 13:41:32 +02:00
Georgi Gerganov 34b407b41c
sampling : use host buffer type for inputs 2025-12-09 17:53:17 +02:00
Georgi Gerganov 92ff767918
llama : require backend samplers to be of type llama_sampler_chain 2025-12-09 15:38:37 +02:00
Rhys-T 63908b631a
cmake: fix Mach-O current version number (#17877)
PR #17091 set the VERSION of various libraries to 0.0.abcd, where abcd
is the LLAMA_BUILD_NUMBER. That build number is too large to fit in the
Mach-O 'current version' field's 'micro' part, which only goes up to
255. This just sets the Mach-O current version to 0 to get it building
properly again.

Fixes #17258.
2025-12-09 13:17:41 +02:00
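
A quick illustration of the overflow (the 16/8/8-bit major.minor.micro packing is the standard Mach-O version encoding; the build number below is made up):

```cpp
#include <cstdint>
#include <cstdio>

// the 32-bit Mach-O 'current version' packs major.minor.micro as 16/8/8 bits
static uint32_t macho_version(uint32_t major, uint32_t minor, uint32_t micro) {
    return (major << 16) | (minor << 8) | micro;  // valid only if micro <= 255
}

int main() {
    printf("0.0.255 -> 0x%08x (fits)\n", macho_version(0, 0, 255));

    const uint32_t build = 6543;  // made-up LLAMA_BUILD_NUMBER-sized value
    if (build > 255) {
        // 0.0.<build> would spill into the minor field, which is why the fix
        // pins the Mach-O current version to 0 instead
        printf("0.0.%u does not fit in the 8-bit micro part\n", build);
    }
    return 0;
}
```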
Sigbjørn Skjæret 42b12b5608
model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652)
* nit, DeepSeek V1 MoE is 16B

* base type on n_ff_exp instead
2025-12-09 12:15:06 +01:00
Georgi Gerganov 560ac16f7d
server : handle unsupported cases 2025-12-09 10:55:11 +02:00
Aldehir Rojas e39502e74b
llama : add token matching support to llama-grammar (#17816)
* llama : add token support to llama-grammar

* fix inverse token comment

* refactor trigger_patterns to replay tokens instead of the entire string

* add token documentation

* fix test-llama-grammar

* improve test cases for tokens
2025-12-09 00:32:57 -06:00