* mtmd: llama.cpp DeepSeekOCR support
init commit
* loading sam tensors
* mtmd: fix vision model processing
* deepseek-ocr clip-vit model impl
* mtmd: add DeepSeek-OCR LM support with standard attention
* mtmd: successfully runs DeepSeek-OCR LM in llama-cli
* mtmd: Fix RoPE type for DeepSeek-OCR LM.
* loading LM
testing Vision model loading
* sam warmup working
* sam erroneous return corrected
* clip-vit: corrected cls_embd concat
* clip-vit: model convert qkv_proj split
* corrected combining of image encoders' results
* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model
* concat image_newline and image_seperator tokens
* visual_model warmup (technically) works
* window partitioning using standard ggml ops
* sam implementation without using CPU only ops
* clip: fixed warnings
* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr
* mtmd: fix get_rel_pos
* mtmd: fixed the wrong scaler for get_rel_pos
* image encoding technically works but the output can't be checked since image decoding fails
* mtmd: minor changes
* mtmd: add native resolution support
* - image encoding debugged
- issues fixed, mainly related to wrong config values like n_patches etc.
- configs need to be corrected in the converter
* mtmd: correct token order
* - dynamic resizing
- changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4
* mtmd: quick fix token order
* mtmd: fix dangling pointer
* mtmd: SAM numerically works
* mtmd: debug CLIP-L (vit_pre_ln)
* mtmd: debug CLIP-L & first working DeepSeek-OCR model
* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work
* mtmd: simplify SAM patch embedding
* mtmd: adapt Pillow image resizing function
* mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing
* mtmd: remove --dsocr-mode argument
* mtmd: refactor code & remove unused helper functions
* mtmd: fix tensor names for image newlines and view separator
* clean up
* reverting automatically removed spaces
* reverting automatically removed spaces
* mtmd: fixed bad ocr check in Deepseek2 (LM)
* mtmd: support combined QKV projection in build_vit
* using common build_attn in sam
* corrected code branch when flash-attn is disabled
enabling usage of the --flash-attn option
* mtmd: minor fix
* minor formatting and style
* fixed flake8 lint issues
* minor editorconfig-check fixes
* minor editorconfig-check fixes
* mtmd: simplify get_rel_pos
* mtmd: make sam hparams configurable
* mtmd: add detailed comments for resize_bicubic_pillow
* mtmd: fixed wrong input setting
* mtmd: convert model in FP16
* mtmd: minor fix
* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template
* fix: test-1.jpg OCR issue with small (640) resolution
set the minimum resolution to base (1024) and the maximum to large (1280) for dynamic resolution
* minor: editorconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn
* minor: editorconfig-check fix
* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR
* quick and (potentially) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909
* refactoring, one single builder function and static helpers
* added deepseek-ocr test to tests.sh
* minor formatting fixes
* check with fixed expected results
* minor formatting
* editorconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042
* minor
- added GLM-4.6V to big tests
- added missing deps for python test
* convert: minor fix
* mtmd: format code
* convert: quick fix
* convert: quick fix
* minor python formatting
* fixed merge build issue
* merge resolved
- fixed issues in convert
- tested several deepseek models
* minor fix
* minor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
* - cleaning up commented-out code
* fixing instability issues by reintroducing resize_bicubic_pillow
* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr
* rename fc_w --> mm_fc_w
* add links to OCR discussion
* cleaner loading code
* add missing .weight to some tensors
* add default jinja template (to be used by server)
* move test model to ggml-org
* rolling back upscale change
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: bluebread <hotbread70127@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* added support for InternVL's dynamic high-resolution (needed for Qianfan-OCR)
* add min/max dynamic patch to gguf meta
* clean up
* simplified handling min/max dynamic patch
* reuse llava_uhd logic for slice images
* provide default values for older models
* flake8
* prevent writing 0 value to gguf
* remove duplicated resolution candidates with a better algorithm
* fix indentation
* format
* add protection from divide by zero
* change to 0 to be safe
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* convert : support is_causal hyperparameter
Check for the `is_causal` attribute in the Hugging Face model configuration and include it in the GGUF metadata.
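A minimal sketch of what such a check could look like inside a model class in convert_hf_to_gguf.py, assuming the gguf writer's add_causal_attention() helper is the one used for this key; the placement and surrounding code are illustrative, not the upstream diff:
```python
# Hedged sketch (not the exact upstream change): read `is_causal` from the
# HF config and write it to the GGUF metadata only when it is present.
def set_gguf_parameters(self):
    super().set_gguf_parameters()
    is_causal = self.hparams.get("is_causal")
    if is_causal is not None:
        # assumed writer helper for the {arch}.attention.causal key
        self.gguf_writer.add_causal_attention(bool(is_causal))
```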
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* style: fix F541 f-string is missing placeholders
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* convert : fix/suppress pyright errors
This commit fixes the errors that pyright generates for
convert_hf_to_gguf.py.
The motivation for this is that running pyright locally generates errors
that CI does not, and it can be difficult to spot new errors. One use
case is when working on new models which cannot be run in CI due to
privacy. Having the ability to run pyright locally would be helpful
in these cases.
In the linked issue there is mention of switching to `ty`, which I
don't know anything about, but in the meantime I would appreciate if we
could suppress these errors for now and perhaps revert this
commit later.
With this change there are no errors, but there are 4 information
messages if the `mistral_common` package is installed. The
`--level error` flag can be used to suppress them.
Resolves: https://github.com/ggml-org/llama.cpp/issues/20417
* WIP: add NVFP4 quantization support
* tests
* improve NVFP4 dot product implementation performance and fix bad super call
* typo
* Use nvfp4 kvalues
* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
* vulkan and perf fixes
* wip
* Fix metal
* fix vulkan
* Rename threshold & fix wrong scale
* Fix MOE
* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)
Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.
Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c
Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.
* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms
After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.
* quantize: add NVFP4 as a quantization type option
* Fix ggml_fp32_to_ue4m3: handle subnormal values
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.
Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.
Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
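To make the fix concrete, here is a small, self-contained Python model of the UE4M3 scale format (unsigned e4m3, bias 7). It is an illustration of the encode/decode rules described above, not the ggml C implementation, and it ignores rounding-overflow and max-value corner cases:
```python
import math

def ue4m3_to_fp32(byte: int) -> float:
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0:                              # subnormal: man * 2^-9
        return man * 2.0 ** -9
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def fp32_to_ue4m3(x: float) -> int:
    if x <= 0.0:
        return 0
    exp = math.floor(math.log2(x)) + 7
    if exp <= 0:
        # The old code clamped this whole range to 0; small scales such as
        # amax/6 ~ 0.001-0.01 live here, so encode as a subnormal instead.
        return min(7, round(x / 2.0 ** -9))
    exp = min(exp, 15)
    man = min(7, round((x / 2.0 ** (exp - 7) - 1.0) * 8.0))
    return (exp << 3) | man

s = fp32_to_ue4m3(0.005)          # falls in the subnormal range
print(s, ue4m3_to_fp32(s))        # round-trips to ~0.0059 instead of 0.0
```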
* Restore ARM NEON NVFP4 dot product implementation
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.
tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq
- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators
tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
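A hedged sketch of how a 128-entry decode table like the ue4m3_scale_lut mentioned above could be generated; the real table lives in ggml-common.h as hand-written C:
```python
# Illustrative generation of a UE4M3 -> float lookup table (7-bit code space).
def ue4m3_to_fp32(b: int) -> float:
    exp, man = (b >> 3) & 0x0F, b & 0x07
    return man * 2.0 ** -9 if exp == 0 else (1.0 + man / 8.0) * 2.0 ** (exp - 7)

ue4m3_scale_lut = [ue4m3_to_fp32(b) for b in range(128)]
assert ue4m3_scale_lut[0] == 0.0
assert ue4m3_scale_lut[1] == 2.0 ** -9   # smallest subnormal step
```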
* ARM NEON NVFP4: rearrange q8 to match nibble layout
Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.
Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.
* CPU only backend 64 super-block layout
* cleanup
* Remove unused LUT
* int
* exclude NVFP4 from unsupported ops in metal build
* remove quantization for now
* store scales as native UE4M3, preserve original model bits when possible
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* correct comment
* format
* reduce duplication and cleanup
* Address comments
* move detection to prepare_tensors
* Use math instead of const
* Move
* fix comment
* Shelf quantize tests
* Rebase and move check
* cleanup
* lint
* Update gguf-py/gguf/scripts/gguf_convert_endian.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use fallback quant config
* Simplify
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* organize
* Refactor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* fix return type
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* llama : add support for Nemotron 3 Super
This commit adds support for the Nemotron 3 Super model (120B.A12B)
enabling this model to be converted to GGUF format and run in llama.cpp.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Matt Clayton <156335168+mattjcly@users.noreply.github.com>
* WIP: Add EuroBERT support with autoformatting changes
This commit includes:
- EuroBERT model implementation for GGUF conversion
- C++ backend support for EuroBERT architecture
- Unintended autoformatting changes to Python files
Saving before reverting formatting-only changes.
* feat: add back eos assert when not last token pooling
* feat: removed duplicated code and cleanup
* feat: removed not working architectures and unnecessary check
* fix: typo
* fix: dynamic pooling config
* feat: added an example model for eurobert
* feat: proper llama-vocab implementation for jina-v5
* fix: removed unnecessary comments
* model: add JAIS-2 architecture support
Add support for the JAIS-2 family of Arabic-English bilingual models
from Inception AI (https://huggingface.co/inceptionai/Jais-2-8B-Chat).
Architecture characteristics:
- LayerNorm (not RMSNorm) with biases
- ReLU² (ReLU squared) activation function
- Separate Q/K/V projections with biases
- Simple MLP without gate projection (up -> act -> down)
- RoPE positional embeddings
- GPT-2 BPE tokenizer
Supported model sizes:
- Jais-2-8B (32 layers, 26 heads, 3328 hidden)
- Jais-2-70B (68 layers, 56 heads, 7168 hidden)
Tested with quantizations: BF16, Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K_M, Q2_K
Note: JAIS-2 requires F32 precision accumulators for numerical stability
and uses standard attention (not flash attention) on CUDA backends.
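For reference, a hedged PyTorch-style sketch of the gate-less ReLU² feed-forward block described above; module and field names are illustrative, not the llama.cpp or HF implementation:
```python
import torch
import torch.nn as nn

class Jais2MLP(nn.Module):
    """Simple MLP without gate projection: up -> ReLU^2 -> down, with biases."""
    def __init__(self, n_embd: int, n_ff: int):
        super().__init__()
        self.up = nn.Linear(n_embd, n_ff, bias=True)
        self.down = nn.Linear(n_ff, n_embd, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x))
        return self.down(h * h)   # ReLU squared activation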
* fix: run convert_hf_to_gguf_update.py for jais-2 tokenizer hash
* fix: use NEOX RoPE type for JAIS2
* fix: remove Q/K permutation (NEOX RoPE doesn't need it)
* fix: enable flash attention for JAIS2 (fixed by #19115)
* fix: add dedicated JAIS2 pre-tokenizer type and control vector support
- Add LLAMA_VOCAB_PRE_TYPE_JAIS2 with cascading whitespace regex
- Include original regex from tokenizer.json as comment
- Add build_cvec call for control vector support
* no longer necessary to override set_vocab
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : Add tokenizer from LFM2.5-Audio-1.5B
[LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) introduced a lightweight audio tokenizer.
The tokenizer is based on the LFM2 architecture and acts as an "embedding" model with
different input `n_embd` and output `n_embd_out`.
To be used in https://github.com/ggml-org/llama.cpp/pull/18641.
To convert, use:
```shell
python3 convert_hf_to_gguf.py /path/to/LFM2.5-Audio-1.5B/audio_detokenizer
```
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Formatting
* Rework check for attention layers
* Add LFM2 SWA model support
* Address PR feedback
* Set vocab to none
* Move helper function definitions to cpp file
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* full modern bert support
* added gelu op in rank pooling for modern bert
* still working on stuff, added mean calculation before classifier head
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* first layer is dense, as per modern bert research paper
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fixed set input for mean pooling to check if pooling type is ranking since modern bert does mean & rank
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* changes for tiny aya
* changes to hash
* changes to vocab
* fix some tokenizer regex edge cases
* update comment
* add some comments for regex
* Apply suggestion from @ngxson
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Move dequant_model to after the text_config merge
Add new kimi-k2.5 keys to mtmd convert
Update V_MMPROJ tensor mapping for new mm_projector.proj keys
Update V_M_IMP_NORM for new mm_projector.pre_norm key
* Fix a couple of oversights
* Add image support for Kimi-K2.5
* Revert changes to KimiVLForConditionalGeneration
* Fix an assert crash
* Fix permute accidentally swapping w / h
* Kimi-K2.5: Use merged QKV for vision
* Kimi-K2.5: pre-convert vision QK to use build_rope_2d
* Kimi-K2.5: support non-interleaved rope for vision
* Kimi-K2.5: fix min / max pixel
* Kimi-K2.5: remove v/o permutes, unnecessary
* Kimi-K2.5: update permute name to match
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Kimi-K2.5: replace build_rope_2d ggml_cont with ggml_view_3d pointers
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* support qwen3.5 series
* remove deepstack for now, and some code clean
* code clean
* add FULL_ATTENTION_INTERVAL metadata
* code clean
* reorder v heads for linear attention to avoid expensive interleaved repeat
* Unified delta net handling
* Remove old methods.
* Refactor and optimize
* Adapt autoregressive version from @ymcki
* Change to decay mask approach
* Fix bad permute
* Qwen 3.5 support
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Further fixes
* Use inheritance, remove unneeded conts
* Not like this!
* Remove ggml.h explicit import
* Remove transformers, fix the views
* ACTUALLY fix views, make super calls explicit in conversion.
* Fix conversion again
* Remove extra ggml.h imports
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* kimi linear model implementation
* kimi linear convert_hf_to_gguf
* kimi linear constants.py tensor_mapping.py
* Kimi Linear ggml.h
* kimi linear ggml-cpu
* Kimi Linear ggml-cuda
* Kimi Linear ggml.c
* kimi linear src/llama
* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning
* remove type mismatch warning
* read MoE params
* removed some hard-coded code
* removed all remaining hard-coded code
* use DeepseekV2 tokenizer
* removed unnecessary internal methods called by the old set_vocab of KimiLinear
* rewrite get_vocab for KimiLinear. Removed all kda_scan code
* removed all traces of kda_scan
* reduce OP count by 1 due to removal of kda_scan
* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache
* set n_embd_head_k/v to ensure kv cache works
* don't quantize conv1d of Kimi Linear
* Kimi Linear backend agnostic
* removed LOG_INFO
* naive chunking form implemented
* fixed some comments
* add Kimi-K2 specific tokens to be recognized as EOG
* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682
* replaced Akk and Aqk with mul_mat and clamp
* no clamp version
* Moved Aqk computation out of the loop
* fixed typo and split wkv_b into wk_b and wv_b
* MLA KV cache support
* fix trailing spaces
* moved const llama_model & model; around to follow qwen3next format and see if it can pass the -Wunused-private-field error
* fix trailing whitespace
* removed trailing whitespace in empty lines + make sure indentation is a multiple of 4
* try to make lint happy
* remove blank lines to make lint happy
* removed at least blank line containing white space
* fixed flake8 complaints locally
* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement
* removed Kimi-Linear-specific change that caused a failure in server-windows
* removed private: from kimi_linear to make build checks happy
* removed unnecessary ggml_cont before ggml_reshape
* created static function causal_conv1d to abstract similar code for q/k/v
* merged dt_bias to SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.
* reverted to original
* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.
* remove DT_B from constants.py. remove one comment line in llama-model.cpp
* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. switch the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight
* remove ssm_o_norm_b
* remove ssm_o_norm_b
* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k
* removed all ggml_cont before ggml_reshape_4d
* Whitespace
* replaced all hparams.get with find_hparams
* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp
* use is_mla to switch between different mem_hybrid types
* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC
* removed if else for required parameters kv_lora_rank and qk_rope_head_dim
* add back ggml_cont for Vcur
* minor changes
* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp
* f16 gguf cannot run without context length
* made a mistake of adding back n_ctx parsing
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* convert : yield Mamba2Model/GraniteMoeModel modify_tensors
This commit updates the `GraniteHybridModel` class' modify_tensors
function to properly delegate to `Mamba2Model.modify_tensors` and
`GraniteMoeModel.modify_tensors` using 'yield from' instead of 'return'.
The motivation for this is that modify_tensors is a generator function
(it uses 'yield from'), but the two calls above used return statements
and didn't yield anything, which means that the caller of this
function would not receive any yielded values from it. This caused
layer tensors to be silently dropped during conversion.
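A minimal standalone illustration of the pitfall (not the actual model classes): inside a generator function, `return delegate()` does not forward the delegate's values, while `yield from delegate()` does:
```python
def base_tensors():
    yield ("tensor.a", 1)
    yield ("tensor.b", 2)

def broken_modify_tensors():
    # This function contains `yield` below, so it is a generator;
    # `return base_tensors()` just ends it, silently dropping both tensors.
    return base_tensors()
    yield  # never reached, but makes this a generator like modify_tensors

def fixed_modify_tensors():
    yield from base_tensors()  # correctly re-yields every (name, tensor) pair

print(list(broken_modify_tensors()))  # []
print(list(fixed_modify_tensors()))   # [('tensor.a', 1), ('tensor.b', 2)]
```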
* Add Ministral3ForCausalLM architecture
This adds support for newer architectures like Devstral-2
* removed blank line found after function decorator
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* convert : use n_groups instead of hardcoded values in reshape
This commit modifies the conversion script for NemotronHModel to use
the 'n_groups' hyperparameter and lets Python calculate the last
dimension, using -1, when reshaping the 'mixer.norm.weight' tensor.
* use self.n_group instead of self.hparams["n_groups"]
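A hedged NumPy sketch of the reshape described above, with the group count and tensor shape chosen purely for illustration:
```python
import numpy as np

n_group = 8                                        # stands in for self.n_group
norm_w = np.random.rand(4096).astype(np.float32)   # flat 'mixer.norm.weight'

# (n_group, -1): NumPy infers 4096 // n_group = 512 for the last dimension,
# so the same reshape works for any hidden size / group count combination.
reshaped = norm_w.reshape(n_group, -1)
assert reshaped.shape == (8, 512)
```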