* mtmd: llama.cpp DeepSeekOCR support
init commit
* loading sam tensors
* mtmd: fix vision model processing
* deepseek-ocr clip-vit model impl
* mtmd: add DeepSeek-OCR LM support with standard attention
* mtmd: successfully runs DeepSeek-OCR LM in llama-cli
* mtmd: Fix RoPE type for DeepSeek-OCR LM.
* loading LM
testing Vision model loading
* sam warmup working
* sam erroneous return corrected
* clip-vit: corrected cls_embd concat
* clip-vit: model convert qkv_proj split
* corrected combining of image encoders' results
* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model
* concat image_newline and image_seperator tokens
* visual_model warmup (technically) works
* window partitioning using standard ggml ops
* sam implementation without using CPU only ops
* clip: fixed warnings
* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr
* mtmd: fix get_rel_pos
* mtmd: fixed the wrong scaler for get_rel_pos
* image encoding technically works but the output can't be checked since image decoding fails
* mtmd: minor changes
* mtmd: add native resolution support
* - image encoding debugged
- issues fixed, mainly related to wrong config values like n_patches etc.
- configs need to be corrected in the converter
* mtmd: correct token order
* - dynamic resizing
- changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4
* mtmd: quick fix token order
* mtmd: fix dangling pointer
* mtmd: SAM numerically works
* mtmd: debug CLIP-L (vit_pre_ln)
* mtmd: debug CLIP-L & first working DeepSeek-OCR model
* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work
* mtmd: simplify SAM patch embedding
* mtmd: adapt Pillow image resizing function
* mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing
* mtmd: remove --dsocr-mode argument
* mtmd: refactor code & remove unused helper functions
* mtmd: fix tensor names for image newlines and view separator
* clean up
* reverting automatically removed spaces
* reverting automatically removed spaces
* mtmd: fixed bad ocr check in Deepseek2 (LM)
* mtmd: support combined QKV projection in build_vit
* using common build_attn in sam
* corrected code branch when flash-attn is disabled
enabling usage of the --flash-attn option
* mtmd: minor fix
* minor formatting and style
* fixed flake8 lint issues
* minor editorconfig-check fixes
* minor editorconfig-check fixes
* mtmd: simplify get_rel_pos
* mtmd: make sam hparams configurable
* mtmd: add detailed comments for resize_bicubic_pillow
* mtmd: fixed wrong input setting
* mtmd: convert model in FP16
* mtmd: minor fix
* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template
* fix: test-1.jpg OCR issue with small (640) resolution
set the minimum resolution to base (1024) and the maximum to large (1280) for dynamic resolution
* minor: editorconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn
* minor: editorconfig-check fix
* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR
* quick and (potentially) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909
* refactoring, one single builder function and static helpers
* added deepseek-ocr test to tests.sh
* minor formatting fixes
* check with fixed expected results
* minor formatting
* editorconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042
* minor
- added GLM-4.6V to big tests
- added missing deps for python test
* convert: minor fix
* mtmd: format code
* convert: quick fix
* convert: quick fix
* minor python formatting
* fixed merge build issue
* merge resolved
- fixed issues in convert
- tested several deepseek models
* minor fix
* minor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
* - cleaning up commented-out code
* fixing instability issues by reintroducing resize_bicubic_pillow
* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr
* rename fc_w --> mm_fc_w
* add links to OCR discussion
* cleaner loading code
* add missing .weight to some tensors
* add default jinja template (to be used by server)
* move test model to ggml-org
* rolling back upscale change
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: bluebread <hotbread70127@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* added support for InternVL's dynamic high-resolution (needed for Qianfan-OCR)
* add min/max dynamic patch to gguf meta
* clean up
* simplified handling min/max dynamic patch
* reuse llava_uhd logic for slice images
* provide default values for older models
* flake8
* prevent writing 0 value to gguf
* remove duplicated resolution candidates with a better algorithm
* fix indentation
* format
* add protection from divide by zero
* change to 0 to be safe
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* convert : support is_causal hyperparameter
Check for the `is_causal` attribute in the Hugging Face model configuration and include it in the GGUF metadata.
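A minimal sketch of what such a check could look like inside a model class in convert_hf_to_gguf.py, assuming the gguf writer's add_causal_attention() helper is the one used for this key; the placement and surrounding code are illustrative, not the upstream diff:
```python
# Hedged sketch (not the exact upstream change): read `is_causal` from the
# HF config and write it to the GGUF metadata only when it is present.
def set_gguf_parameters(self):
    super().set_gguf_parameters()
    is_causal = self.hparams.get("is_causal")
    if is_causal is not None:
        # assumed writer helper for the {arch}.attention.causal key
        self.gguf_writer.add_causal_attention(bool(is_causal))
```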
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* style: fix F541 f-string is missing placeholders
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* convert : fix/suppress pyright errors
This commit fixes the errors that pyright generates for
convert_hf_to_gguf.py.
The motivation for this is that running pyright locally generates errors
that CI does not, and it can be difficult to spot new errors. One use
case is when working on new models which cannot be run in CI due to
privacy. Having the ability to run pyright locally would be helpful
in these cases.
In the linked issue there is mention of switching to `ty`, which I
don't know anything about, but in the meantime I would appreciate if we
could suppress these errors for now and perhaps revert this
commit later.
With this change there are no errors, but there are 4 information
messages if the `mistral_common` package is installed. The
`--level error` flag can be used to suppress them.
Resolves: https://github.com/ggml-org/llama.cpp/issues/20417
* WIP: add NVFP4 quantization support
* tests
* improve NVFP4 dot product implementation performance and fix bad super call
* typo
* Use nvfp4 kvalues
* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
* vulkan and perf fixes
* wip
* Fix metal
* fix vulkan
* Rename threshold & fix wrong scale
* Fix MOE
* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)
Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.
Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c
Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.
* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms
After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.
* quantize: add NVFP4 as a quantization type option
* Fix ggml_fp32_to_ue4m3: handle subnormal values
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.
Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.
Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
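To make the fix concrete, here is a small, self-contained Python model of the UE4M3 scale format (unsigned e4m3, bias 7). It is an illustration of the encode/decode rules described above, not the ggml C implementation, and it ignores rounding-overflow and max-value corner cases:
```python
import math

def ue4m3_to_fp32(byte: int) -> float:
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0:                              # subnormal: man * 2^-9
        return man * 2.0 ** -9
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def fp32_to_ue4m3(x: float) -> int:
    if x <= 0.0:
        return 0
    exp = math.floor(math.log2(x)) + 7
    if exp <= 0:
        # The old code clamped this whole range to 0; small scales such as
        # amax/6 ~ 0.001-0.01 live here, so encode as a subnormal instead.
        return min(7, round(x / 2.0 ** -9))
    exp = min(exp, 15)
    man = min(7, round((x / 2.0 ** (exp - 7) - 1.0) * 8.0))
    return (exp << 3) | man

s = fp32_to_ue4m3(0.005)          # falls in the subnormal range
print(s, ue4m3_to_fp32(s))        # round-trips to ~0.0059 instead of 0.0
```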
* Restore ARM NEON NVFP4 dot product implementation
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.
tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq
- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators
tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
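A hedged sketch of how a 128-entry decode table like the ue4m3_scale_lut mentioned above could be generated; the real table lives in ggml-common.h as hand-written C:
```python
# Illustrative generation of a UE4M3 -> float lookup table (7-bit code space).
def ue4m3_to_fp32(b: int) -> float:
    exp, man = (b >> 3) & 0x0F, b & 0x07
    return man * 2.0 ** -9 if exp == 0 else (1.0 + man / 8.0) * 2.0 ** (exp - 7)

ue4m3_scale_lut = [ue4m3_to_fp32(b) for b in range(128)]
assert ue4m3_scale_lut[0] == 0.0
assert ue4m3_scale_lut[1] == 2.0 ** -9   # smallest subnormal step
```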
* ARM NEON NVFP4: rearrange q8 to match nibble layout
Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.
Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.
* CPU only backend 64 super-block layout
* cleanup
* Remove unused LUT
* int
* exclude NVFP4 from unsupported ops in metal build
* remove quantization for now
* store scales as native UE4M3, preserve original model bits when possible
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* correct comment
* format
* reduce duplication and cleanup
* Address comments
* move detection to prepare_tensors
* Use math instead of const
* Move
* fix comment
* Shelf quantize tests
* Rebase and move check
* cleanup
* lint
* Update gguf-py/gguf/scripts/gguf_convert_endian.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use fallback quant config
* Simplify
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* organize
* Refactor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* fix return type
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* llama : add support for Nemotron 3 Super
This commit adds support for the Nemotron 3 Super model (120B.A12B)
enabling this model to be converted to GGUF format and run in llama.cpp.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Matt Clayton <156335168+mattjcly@users.noreply.github.com>
* WIP: Add EuroBERT support with autoformatting changes
This commit includes:
- EuroBERT model implementation for GGUF conversion
- C++ backend support for EuroBERT architecture
- Unintended autoformatting changes to Python files
Saving before reverting formatting-only changes.
* feat: add back eos assert when not last token pooling
* feat: removed duplicated code and cleanup
* feat: removed not working architectures and unnecessary check
* fix: typo
* fix: dynamic pooling config
* feat: added an example model for eurobert
* feat: proper llama-vocab implementation for jina-v5
* fix: removed unnecessary comments
* model: add JAIS-2 architecture support
Add support for the JAIS-2 family of Arabic-English bilingual models
from Inception AI (https://huggingface.co/inceptionai/Jais-2-8B-Chat).
Architecture characteristics:
- LayerNorm (not RMSNorm) with biases
- ReLU² (ReLU squared) activation function
- Separate Q/K/V projections with biases
- Simple MLP without gate projection (up -> act -> down)
- RoPE positional embeddings
- GPT-2 BPE tokenizer
Supported model sizes:
- Jais-2-8B (32 layers, 26 heads, 3328 hidden)
- Jais-2-70B (68 layers, 56 heads, 7168 hidden)
Tested with quantizations: BF16, Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K_M, Q2_K
Note: JAIS-2 requires F32 precision accumulators for numerical stability
and uses standard attention (not flash attention) on CUDA backends.
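For reference, a hedged PyTorch-style sketch of the gate-less ReLU² feed-forward block described above; module and field names are illustrative, not the llama.cpp or HF implementation:
```python
import torch
import torch.nn as nn

class Jais2MLP(nn.Module):
    """Simple MLP without gate projection: up -> ReLU^2 -> down, with biases."""
    def __init__(self, n_embd: int, n_ff: int):
        super().__init__()
        self.up = nn.Linear(n_embd, n_ff, bias=True)
        self.down = nn.Linear(n_ff, n_embd, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x))
        return self.down(h * h)   # ReLU squared activation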
* fix: run convert_hf_to_gguf_update.py for jais-2 tokenizer hash
* fix: use NEOX RoPE type for JAIS2
* fix: remove Q/K permutation (NEOX RoPE doesn't need it)
* fix: enable flash attention for JAIS2 (fixed by #19115)
* fix: add dedicated JAIS2 pre-tokenizer type and control vector support
- Add LLAMA_VOCAB_PRE_TYPE_JAIS2 with cascading whitespace regex
- Include original regex from tokenizer.json as comment
- Add build_cvec call for control vector support
* no longer necessary to override set_vocab
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : Add tokenizer from LFM2.5-Audio-1.5B
[LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) introduced a lightweight audio tokenizer.
The tokenizer is based on the LFM2 architecture and acts as an "embedding" model with
different input `n_embd` and output `n_embd_out`.
To be used in https://github.com/ggml-org/llama.cpp/pull/18641.
To convert, use:
```shell
python3 convert_hf_to_gguf.py /path/to/LFM2.5-Audio-1.5B/audio_detokenizer
```
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Formatting
* Rework check for attention layers
* Add LFM2 SWA model support
* Address PR feedback
* Set vocab to none
* Move helper function definitions to cpp file
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* full modern bert support
* added gelu op in rank pooling for modern bert
* still working on stuff, added mean calculation before classifier head
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* first layer is dense, as per modern bert research paper
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fixed set input for mean pooling to check if pooling type is ranking since modern bert does mean & rank
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* changes for tiny aya
* changes to hash
* changes to vocab
* fix some tokenizer regex edge cases
* update comment
* add some comments for regex
* Apply suggestion from @ngxson
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Move dequant_model to after the text_config merge
Add new kimi-k2.5 keys to mtmd convert
Update V_MMPROJ tensor mapping for new mm_projector.proj keys
Update V_M_IMP_NORM for new mm_projector.pre_norm key
* Fix a couple of oversights
* Add image support for Kimi-K2.5
* Revert changes to KimiVLForConditionalGeneration
* Fix an assert crash
* Fix permute accidentally swapping w / h
* Kimi-K2.5: Use merged QKV for vision
* Kimi-K2.5: pre-convert vision QK to use build_rope_2d
* Kimi-K2.5: support non-interleaved rope for vision
* Kimi-K2.5: fix min / max pixel
* Kimi-K2.5: remove v/o permutes, unnecessary
* Kimi-K2.5: update permute name to match
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Kimi-K2.5: replace build_rope_2d ggml_cont with ggml_view_3d pointers
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* support qwen3.5 series
* remove deepstack for now, and some code clean
* code clean
* add FULL_ATTENTION_INTERVAL metadata
* code clean
* reorder v heads for linear attention to avoid expensive interleaved repeat
* Unified delta net handling
* Remove old methods.
* Refactor and optimize
* Adapt autoregressive version from @ymcki
* Change to decay mask approach
* Fix bad permute
* Qwen 3.5 support
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Further fixes
* Use inheritance, remove unneeded conts
* Not like this!
* Remove ggml.h explicit import
* Remove transformers, fix the views
* ACTUALLY fix views, make super calls explicit in conversion.
* Fix conversion again
* Remove extra ggml.h imports
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* kimi linear model implementation
* kimi linear convert_hf_to_gguf
* kimi linear constants.py tensor_mapping.py
* Kimi Linear ggml.h
* kimi linear ggml-cpu
* Kimi Linear ggml-cuda
* Kimi Linear ggml.c
* kimi linear src/llama
* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning
* remove type mismatch warning
* read MoE params
* removed some hard-coded code
* removed all remaining hard-coded code
* use DeepseekV2 tokenizer
* removed unnecessary internal methods called by the old set_vocab of KimiLinear
* rewrite get_vocab for KimiLinear. Removed all kda_scan code
* removed all traces of kda_scan
* reduce OP count by 1 due to removal of kda_scan
* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache
* set n_embd_head_k/v to ensure kv cache works
* don't quantize conv1d of Kimi Linear
* Kimi Linear backend agnostic
* removed LOG_INFO
* naive chunking form implemented
* fixed some comments
* add Kimi-K2 specific tokens to be recognized as EOG
* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682
* replaced Akk and Aqk with mul_mat and clamp
* no clamp version
* Moved Aqk computation out of the loop
* fixed typo and split wkv_b into wk_b and wv_b
* MLA KV cache support
* fix trailing spaces
* moved const llama_model & model; around to follow qwen3next format and see if it can pass the -Wunused-private-field error
* fix trailing whitespace
* removed trailing whitespace in empty lines + make sure indentation is a multiple of 4
* try to make lint happy
* remove blank lines to make lint happy
* removed at least blank line containing white space
* fixed flake8 complaints locally
* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement
* removed Kimi-Linear-specific change that caused a failure in server-windows
* removed private: from kimi_linear to make build checks happy
* removed unnecessary ggml_cont before ggml_reshape
* created static function causal_conv1d to abstract similar code for q/k/v
* merged dt_bias to SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.
* reverted to original
* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.
* remove DT_B from constants.py. remove one comment line in llama-model.cpp
* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. switch the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight
* remove ssm_o_norm_b
* remove ssm_o_norm_b
* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k
* removed all ggml_cont before ggml_reshape_4d
* Whitespace
* replaced all hparams.get with find_hparams
* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp
* use is_mla to switch between different mem_hybrid types
* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC
* removed if else for required parameters kv_lora_rank and qk_rope_head_dim
* add back ggml_cont for Vcur
* minor changes
* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp
* f16 gguf cannot run without context length
* made a mistake of adding back n_ctx parsing
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* convert : yield Mamba2Model/GraniteMoeModel modify_tensors
This commit updates the `GraniteHybridModel` class' modify_tensors
function to properly delegate to `Mamba2Model.modify_tensors` and
`GraniteMoeModel.modify_tensors` using 'yield from' instead of 'return'.
The motivation for this is that modify_tensors is a generator function
(it uses 'yield from'), but the two calls above used return statements
and didn't yield anything, which means that the caller of this
function would not receive any yielded values from it. This caused
layer tensors to be silently dropped during conversion.
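A minimal standalone illustration of the pitfall (not the actual model classes): inside a generator function, `return delegate()` does not forward the delegate's values, while `yield from delegate()` does:
```python
def base_tensors():
    yield ("tensor.a", 1)
    yield ("tensor.b", 2)

def broken_modify_tensors():
    # This function contains `yield` below, so it is a generator;
    # `return base_tensors()` just ends it, silently dropping both tensors.
    return base_tensors()
    yield  # never reached, but makes this a generator like modify_tensors

def fixed_modify_tensors():
    yield from base_tensors()  # correctly re-yields every (name, tensor) pair

print(list(broken_modify_tensors()))  # []
print(list(fixed_modify_tensors()))   # [('tensor.a', 1), ('tensor.b', 2)]
```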
* Add Ministral3ForCausalLM architecture
This adds support for newer architectures like Devstral-2
* removed blank line found after function decorator
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* convert : use n_groups instead of hardcoded values in reshape
This commit modifies the conversion script for NemotronHModel to use
the 'n_groups' hyperparameter and lets Python calculate the last
dimension, using -1, when reshaping the 'mixer.norm.weight' tensor.
* use self.n_group instead of self.hparams["n_groups"]
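A hedged NumPy sketch of the reshape described above, with the group count and tensor shape chosen purely for illustration:
```python
import numpy as np

n_group = 8                                        # stands in for self.n_group
norm_w = np.random.rand(4096).astype(np.float32)   # flat 'mixer.norm.weight'

# (n_group, -1): NumPy infers 4096 // n_group = 512 for the last dimension,
# so the same reshape works for any hidden size / group count combination.
reshaped = norm_w.reshape(n_group, -1)
assert reshaped.shape == (8, 512)
```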