llama.cpp

Commit Graph

Author	SHA1	Message	Date
Prabod	5755e52d15	model : Maincoder-1B support (#18534 ) * Add Maincoder model support * Removed SPM model vocabulary setting and MOE related GGUF parameters Removed trailing spaces from maincoder.cpp * removed set_vocab * added new line * Fix formatting * Add a new line for PEP8	2026-01-02 20:11:59 +01:00
Sigbjørn Skjæret	169ee68ffb	model : remove modern-bert iswa template (#18529 ) * remove modern-bert iswa template * forgotten	2026-01-02 00:06:42 +01:00
tt	ced765be44	model: support youtu-vl model (#18479 ) * Support Youtu-VL Model * merge code * fix bug * revert qwen2 code & support rsplit in minja.hpp * update warm info * fix annotation * u * revert minja.hpp * fix * Do not write routed_scaling_factor to gguf when routed_scaling_factor is None * fix expert_weights_scale * LGTM after whitespace fixes * fix * fix * fix * layers to layer_index * enum fix --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-01-01 19:25:54 +01:00
o7si	d0a6a31470	model : add support for JinaBertModel with non-gated ffn (#18475 ) * WIP: Initial commit for fixing JinaBert original FF type support * convert: add jina-v2-de tokenizer variant for German_Semantic_V3 * convert: fix token collision in BERT phantom vocab conversion * convert: add feed_forward_type metadata * model: add feed_forward_type metadata for jina-bert-v2 * model: jina-bert-v2 support standard GELU FFN variant * model: remove ffn_type, detect FFN variant from tensor dimensions * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * revert collision fix to be handled in separate PR --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-01-01 18:38:51 +01:00
HelloKS	f4f5019254	model: add Solar Open model (#18511 ) * model: add Solar-Open model * vocab: add solar-open to end eog blacklist * model: add proper llm type * chat: basic template for solar open * typo: fix comment about vocab * convert: sugested changes * convert: suggested changes * chat: change reasoning end tag for solar-open * llama-chat: add solar-open template	2026-01-01 18:01:43 +01:00
momonga	9c675c7140	model : Plamo3 support (#17304 ) * plamo3 * fix plamo3 * clean code * clean up the code * fix diff * clean up the code * clean up the code * clean up the code * clean up the code * clean up the code * clean up the code * add chat_template if exist * clean up the code * fix cpu-backend * chore: whitespace trim fix + typo fix * Fix: address review feedback * restore `FREQ_BASE_SWA` constant * Fix: address review feedback2 * Fix:typecheck * Fix: address review feedback3 * final cleanup --------- Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-28 17:28:31 +01:00
Johannes Gäßler	026d2ad472	llama: fix magic number of 999 for GPU layers (#18266 ) * llama: fix magic number of 999 for GPU layers * use strings for -ngl, -ngld * enacapsulate n_gpu_layers, split_mode	2025-12-27 20:18:35 +01:00
Xuan-Son Nguyen	4cbafad4f0	model: support MiMo-V2-Flash (#18328 ) * mimov2: convert ok * rename mimov2 --> mimo2 * fix conversion * runnable not incorrect * use sink * add_sliding_window_pattern * add swa and per-layer n_head_kv * correct params * somewhat working * correct gating func * nits * mimo2: wire RMS eps + MoE bias + converter guards * add co-author Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com> * use add_rope_freq_base_swa --------- Co-authored-by: Aaryan Kapoor <aaryankapoor2006@gmail.com> Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>	2025-12-24 23:07:08 +01:00
Saba Fallah	54132f1b1f	model : support for LlamaBidirectionalModel architecture (#18220 ) * model: llama-embed-nemotron * minor: python lint * changed arch-name * templated llm_build_llama to be used for both llama and llama-embed arch	2025-12-24 14:02:36 +01:00
Alessandro98-git	96e33a814e	model : fix div-by-zero for Nemotron V2 (#18309 ) * llama-model : fix Nemotron V2 crash by moving MoE parameters calculation * remove whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-23 03:04:57 +01:00
Ryan Mangeno	dfc959b886	model : Granite Embedding support (#15641 ) ModernBERT but without `head.norm` so will currently fail to convert and run any other ModernBERT models, PRs with `head.norm` support welcome! * constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only * conversion now working, hf -> gguf * working on support, now working on building graph * some cleanup * cleanup * continuing * correct tensor shape for qkv * fixed tensor mappings and working on buildin graph * tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this * cleanup * cleanup * cleanup * more cleanup * ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more * added cls token per previous modern bert attempt, still working on checking out the rest * fixed pre tokenizer and still working through previous pr * working through previous attemp, implimented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer * fixed pre tokenizer * working on swa with local and global alternating attention * some cleanup and now fails on build attn * starting to work, and some cleanup, currently failing on last layer construction in graph build * alternating rope implemented and modern bert graph build succeeds * fixed asser for equal ubatch seq * cleanup * added mask check in vocab * fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values * reuse variable * removed repeat * standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL * correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ... * more modular hparam setting * replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf_update.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-graph.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-arch.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * removed redundant hparam set * enums for model sizes * conversion for modern-bert model supported rather than just granite-small * Update src/llama-model.cpp Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> * Update src/llama-model.cpp Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> * fixed ordering of enum for freq_base_swa * fixed where I added residual, now gives much much better embeddings~ * readded cacheless logic * removing whitespace * conversion now working for swa pattern - dense every n layers * modern bert put into seperate src file * removing whitespace * fixed whitespace and newline errors in editorconfig job * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * better naming convention, n_swa_pattern -> swa_period * reusing sliding_window_pattern key rather than making new dense_every_n_layers key, and adding writing and reading support * fixing pyright type-check fail * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-hparams.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model-saver.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model-loader.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model-loader.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model-loader.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * added descriptions in llama-model * fixed tensor mappings for conversion * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * mapping name for size * nits * unused --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>	2025-12-23 00:28:19 +01:00
Johannes Gäßler	57c1e05643	llama: offload output layer to GPU first (#18148 )	2025-12-18 08:12:18 +01:00
Xuan-Son Nguyen	ef83fb8601	model: fix LFM2 missing tensors (#18105 )	2025-12-16 19:07:43 +01:00
Xuan-Son Nguyen	3d86c6c2b5	model: support GLM4V vision encoder (#18042 ) * convert ok * no deepstack * less new tensors * cgraph ok * add mrope for text model * faster patch merger * add GGML_ROPE_TYPE_MRNORM * add support for metal * move glm4v do dedicated graph * convert: add norm_embd * clip: add debugging fn * working correctly * fix style * use bicubic * fix mrope metal * improve cpu * convert to neox ordering on conversion * revert backend changes * force stop if using old weight * support moe variant * fix conversion * fix convert (2) * Update tools/mtmd/clip-graph.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * process mrope_section on TextModel base class * resolve conflict merge --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-16 11:25:26 +01:00
Daniel Bevenius	2995341730	llama : add support for NVIDIA Nemotron 3 Nano (#18058 ) * llama : add support for NVIDIA Nemotron Nano 3 This commit adds support for the NVIDIA Nemotron Nano 3 model, enabling the conversion and running of this model. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-16 07:19:26 +01:00
HelloKS	9d52f17ae3	model : add KORMo model (#18032 ) * vocab: add KORMo Tokenizer * model: add KORMoForCausalLM * vocab: change pretokenizer to qwen2 * lint: fix unintended line removal * model: make qwen2 bias tensor optional * model: use qwen2 architecture for KORMo	2025-12-15 18:51:43 +01:00
Johannes Gäßler	b1f3a6e5db	llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653 ) * llama: automatically fit args to free memory llama-fit-params tool * fix CI * hints for bug reports, ensure no reallocation * fix segfault with Vulkan * add llama-fit-params to CI * fix CI * fix CI * fix CI * minor adjustments * fix assignment of 1 dense layer * fix logger not being reset on model load failure * remove --n-gpu-layer hint on model load failure * fix llama-fit-params verbosity * fix edge case * fix typo [no ci]	2025-12-15 09:24:59 +01:00
Xuan-Son Nguyen	0759b09c90	graph: add f_attn_temp_offset (#18025 )	2025-12-14 13:05:59 +01:00
Georgi Gerganov	609a2d0268	models : fix YaRN regression + consolidate logic (#18006 ) * models : fix YaRN regression + consolidate logic * cont : fix the fix * cont : remove header * cont : add header	2025-12-14 08:34:56 +02:00
Georgi Gerganov	7bed317f53	models : fix the attn_factor for mistral3 graphs + improve consistency (#17945 ) * models : fix the attn_factor for mistral3 graphs * cont : rework attn_factor correction logic * cont : make deepseek2 consistent * cont : add TODO * cont : special-case DSv2 * cont : revert Mistral 3 Large changes * cont : fix DS2 to use the original attn_factor * cont : minor comments	2025-12-12 17:12:40 +02:00
Eric Zhang	b677721819	model : Qwen3-Next-80B-A3B has 48 layers (#17898 ) * model : Qwen3-Next-80B-A3B has 48 layers * model : Add 80B-A3B type name	2025-12-10 15:22:40 +01:00
Sigbjørn Skjæret	42b12b5608	model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652 ) * nit, DeepSeek V1 MoE is 16B * base type on n_ff_exp instead	2025-12-09 12:15:06 +01:00
philip-essential	1d2a1ab73d	model : support Rnj-1 (#17811 ) * add support for rnj1 * refactor gemma3 to support rnj-1 * address review comments	2025-12-09 04:49:03 +01:00
Xuan-Son Nguyen	4d3726278b	model: add llama 4 scaling for mistral-large (deepseek arch) (#17744 )	2025-12-07 22:29:54 +01:00
Herman Semenoff	37adc9c6ba	ggml, llama : use defaulted constructors/destructors (#17649 )	2025-12-03 07:12:18 +01:00
Piotr Wilkin (ilintar)	746f9ee889	Override SSM_A op for Qwen3 Next to reduce splits (#17587 ) * Override SSM_A op for Qwen3 Next to reduce splits * New tensor mapping SSM_A_NOSCAN for SSM_A used outside of OP_SSM_SCAN context. * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-02 00:43:13 +01:00
Gilad S.	00c361fe53	fix: llama arch implementation (#17665 )	2025-12-01 21:21:13 +01:00
Xuan-Son Nguyen	cd3c118908	model: support Ministral3 (#17644 ) * conversion script * support ministral 3 * maybe this is better? * add TODO for rope_yarn_log_mul * better ppl (tested on 14B-Instruct) * Add Ministral3 support to Mistral format * improve arch handling * add sizes * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * nits --------- Co-authored-by: Julien Denize <julien.denize@mistral.ai> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-01 12:26:52 +01:00
Piotr Wilkin (ilintar)	ff55414c42	model : Qwen3 Next (#16095 ) * Qwen3 Next - cleaned up version * Whitespaces and stuff * Correct minor errors * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Misc. fixes. * Clean up code, add missing hybrid qualifier * Did someone transpose the SOLVE_TRI result matrix? Perhaps... * Whitespace * Proper tensors for cb calls * Use llama-graph.h vertical alignment * BROKEN: chunking * Set new tensors as inputs. * Proper chunk logic * It's the circle of life... * More shenanigans for n_seq > 1 * Nail in the coffin? * Fix Windows build * Eh, one fails on Windows, the other fails on Mac... just use general capture. * quant : cleanup * model : cleanup * qwen3 : cleanup * cont : cleanup * cont : cleanup * ggml : revert change * qwen3 : cleanup * cont : cleanup * Readd cmath * qwen3 : fix typo * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Usual suspects * fix my bad suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-28 12:02:56 +01:00
Georgi Gerganov	6783b11fb0	models : fix LFM2 tensors (#17548 )	2025-11-27 16:04:29 +02:00
Aaron Teo	877566d512	llama: introduce support for model-embedded sampling parameters (#17120 )	2025-11-25 09:56:07 +08:00
william pan	4902eebe33	models : Added support for RND1 Diffusion Language Model (#17433 ) * Converted RND1 model to GGUF weights * RND1 llama.cpp support v1 * RND1 llama.cpp support v2 non causal bug * RND1 llama.cpp support v3 doccumentation * RND1 llama.cpp support v4 clean code * linting issues * RND1 pr fixes v1 * RND1 pr fixes v2 Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Diffusion documentation edits --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-24 14:16:56 +08:00
ubergarm	23bc779a6e	model : detect GigaChat3-10-A1.8B as deepseek lite (#17420 ) * Detect GigaChat3-10-A1.8B as deepseek lite Hardcodes checking number of layers to detect if lite version of deepseek. * Add commnent identifying deepseek lite variants deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B	2025-11-21 14:51:38 +01:00
Bartowski	e1fcf8b09b	model : add AfmoeForCausalLM support (#16477 ) * Add AFMOE model support * Update to vocab * Add model sizing * Undo Rope change for ARCEE model * Address review comments * Update modeling code is_sliding -> use_rope, replace hard-coded logic * Fix AFMOE tokenizer * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update AFMoE tokenizer class identification to be more unique --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-14 13:54:10 +01:00
Sigbjørn Skjæret	9008027aa3	hparams : add n_embd_inp() to support extended embed (#16928 ) * add n_embd_full to support extended embed * don't change output * rename to n_embd_inp * restore n_embd where applicable	2025-11-07 19:27:58 +01:00
Li Pengzhan	9f052478c2	model : add openPangu-Embedded (#16941 ) * Model: add openPangu-Embedded * fixed according to reviewer's comments * fixed the chat template check condition * Apply suggestions from code review change the chat-template check condition and some formatting issue Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * whitespace cleanup --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-05 10:28:58 +01:00
Georgi Gerganov	cd5e3b5754	server : support unified cache across slots (#16736 ) * server : support unified context across slots * cont : fix speculative decoding initialization * context : fix n_ctx_per_seq computation * server : purge slots one by one * tests : add unified cache server tests * llama : update per-seq context computation * test-thread-safety : handle tiny training context of the input model * server : fix server_tokens clear() * server : use 4 slots + unified KV by default * llama : add note about context size queries * cont : update todos [no ci] * context : do not cap the size of the context * tests : adjust parameters to be CI friendlier * context : add warning	2025-11-02 18:14:04 +02:00
Piotr Wilkin (ilintar)	bea04522ff	refactor : llama-model.cpp (#16252 ) * Sqashed: llama-model.cpp refactoring * Fix formatting of attn / ffn / ffn_moe calls * Fix import regression / unify spacing in models.h * totally DID NOT miss those! * Add missing qwen3vl(moe) models * Add missing new .cpp files to build * Remove extra semicolons * Editor checker * Update src/models/models.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-31 23:40:23 +01:00
Piotr Wilkin (ilintar)	0de0a01576	model : Minimax M2 (#16831 ) * Model: Minimax M2 * Cleanup * Cleanup pt. 2 * Cleanup pt. 3 * Update convert_hf_to_gguf_update.py - merge catch blocks Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Remove vocab models and test * Remove all redundant hparam settings covered by TextModel * Move super to start, don't set block_count * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/constants.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-31 21:20:47 +01:00
Giuseppe Scrivano	e58d585604	model : add Granite Hybrid nano types (#16896 ) Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-31 21:20:07 +01:00
JJJYmmm	d261223d24	model: add support for qwen3vl series (#16780 ) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit `f321b9fdf1`. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-30 16:19:14 +01:00
Tianyue-Zhao	bacddc049a	model: Add support for CogVLM model (#15002 ) * Added GGUF mappings for CogVLM model * Add tensor mapping for CogVLM visual encoder * Add CogVLM to conversion script, no vision part yet * Added CogVLM vision model to conversion script * Add graph for CogVLM CLIP model * Add graph for CogVLM * Fixes for CogVLM. Now compiles. * Model now runs * Fixes for cogvlm graph * Account for graph context change after rebase * Changes for whitespace * Changes in convert script according to comments * Switch CogVLM LLM graph to merged QKV tensor * Use rope_type variable instead of direct definition * Change CogVLM CLIP encoder to use SWIGLU * Switch CogVLM CLIP to use merged QKV * Apply rebase edits and remove ggml_cont call that is now unnecessary * clean up --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-10-30 12:18:50 +01:00
Georgi Gerganov	85a7d8677b	memory : remove KV cache size padding (#16812 ) * memory : remove KV cache size padding * cont : restore padding for n_kv tensor shape * server : use slot context size instead of training context size * server : simplify context limit logic	2025-10-28 20:19:44 +02:00
Johannes Gäßler	7a0e900e36	llama: consistent ctx <-> buf order for KV cache (#16746 )	2025-10-28 11:23:54 +01:00
Johannes Gäßler	945501f5ea	llama: fix leaked buffers for mmap + split files (#16765 )	2025-10-27 09:17:31 +01:00
Sigbjørn Skjæret	73a48c9790	convert : enable expert group selection for all models with it (#16691 )	2025-10-26 17:21:23 +01:00
Sigbjørn Skjæret	7cce4f8158	model : set res->t_embd in SmallThinker models (#16782 )	2025-10-26 16:08:52 +01:00
Shunta Saito	226f295f4d	model : set res->t_embd in PLaMo2 models (#16766 )	2025-10-25 12:26:27 +02:00
Max Krasnyansky	63d2fc46e1	Add experimental ggml-hexagon backend for the Hexagon NPU (#16547 ) * model: add support for extra bufs for all devices * hexagon: add experimental ggml-hexagon backend for the Hexagon NPU This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU. Highlights: - Supports Hexagon versions: v73, v75, v79, and v81 - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5 - Supports Q4_0, Q8_0, MXFP4, and FP32 data types - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX Note: This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from llama.cpp/ggml developer and user community. Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com> * hexagon: fix format checker errors * hexagon: update readme and cmake presets * ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions * hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input * hexagon: move ADB helper scripts into scripts/snapdragon/adb * hexagon: replace all f/printfs with GGML_LOG_... * readme: add hexagon to the list supported backends * hexagon: stack malmuts with quantized inputs only * hexagon: add TODO for fixing issues in hexagon_graph_optimize * hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC * scripts: fix lint errors * scripts: update qdc pytest script to make linter happy * hexagon: add reduce sum in fp32 * hexagon: reduce number of vector stores in matmul output * hexagon: remove the need for vdelta in reduce-multiply-x8 * hexagon: consistent use of reduce_sum_fp32 for row_sums * hexagon: some more matmul optimizations and comments Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models). We've handled those cases already but at a higher overhead. * hexagon: update cmake presets * hexagon: add OPMASK support for run-bench.sh wrapper * hexagon: update to use GGML_BACKEND_API * hexagon: remove unused logic for setting tensor flags for the views * hexagon: add asserts to set/get_tensor to make sure we handle complete tensors Same asserts as the CPU backend. * hexagon: use cpy_tensor slow path for non-host buffers * hexagon: error checks in the buffer allocator * cmake: move include(extProj) under ggml-hexagon * hexagon: don't forget to delete the backend on free * hexagon: set/get_tensor size assert apply only to quantized tensors * hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need a bit more finer log levels. * docs: typos in hexagon developer docs (libggm-...) * hexagon: overhaul error handling in the session/device allocation this should handle all failure paths in the session allocation. * hexagon: update cmake presets to enable fp16 vectors * hexagon: remove unused time_usec function * hexagon: don't forget to release buffer contexts * hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure) * hexagon: remove custom can_repeat function and use ggml_can_repeat --------- Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2025-10-22 13:47:09 -07:00
Sigbjørn Skjæret	84bf3c6778	model : add BailingMoeV2 support (#16063 ) * add BailingMoeV2 support * update llm types * undo * undo * update llm types * add model collection link * update * almost working * correct group selection and rename n_group_exp * avoid large top_k and use argmax instead for now if we had something like argmax2 that would be equivalent, but this works fine until then * poke * skip group selection when there are no tokens * fix 1T conversion * hopefully fixed expert group selection third time's the charm? * make expert group selection generally available The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture. * allow n_expert_groups to be 1 (Kimi K2) * address review suggestions	2025-10-20 21:38:20 +02:00

1 2 3 4 5

220 Commits