# Multimodal Support in llama.cpp
This directory provides multimodal capabilities for llama.cpp. Initially intended as a showcase for running LLaVA models, its scope has expanded significantly over time to include various other vision-capable models. As a result, LLaVA is no longer the only multimodal architecture supported.
> [!IMPORTANT]
>
> Multimodal support can be viewed as a sub-project within llama.cpp. It is under very heavy development, and breaking changes are expected.
The naming and structure related to multimodal support have evolved, which might cause some confusion. Here's a brief timeline to clarify:
- **#3436**: Initial support for LLaVA 1.5 was added, introducing `llava.cpp` and `clip.cpp`. The `llava-cli` binary was created for model interaction.
- **#4954**: Support for MobileVLM was added, becoming the second vision model supported. This built upon the existing `llava.cpp`, `clip.cpp`, and `llava-cli` infrastructure.
- **Expansion & Fragmentation**: Many new models were subsequently added (e.g. #7599, #10361, #12344, and others). However, `llava-cli` lacked support for the increasingly complex chat templates required by these models. This led to the creation of model-specific binaries like `qwen2vl-cli`, `minicpmv-cli`, and `gemma3-cli`. While functional, this proliferation of command-line tools became confusing for users.
- **#12849**: `libmtmd` was introduced as a replacement for `llava.cpp`. Its goals include providing a single, unified command-line interface, improving the user/developer experience (UX/DX), and supporting both audio and image inputs.
- **#13012**: `mtmd-cli` was added, consolidating the various model-specific CLIs into a single tool powered by `libmtmd`.
## Pre-quantized models

See the list of pre-quantized models here.
## How it works and what is mmproj?
Multimodal support in llama.cpp works by encoding images into embeddings using a separate model component, and then feeding these embeddings into the language model.
This approach keeps the multimodal components distinct from the core `libllama` library. Separating these allows for faster, independent development cycles. While many modern vision models are based on Vision Transformers (ViTs), their specific pre-processing and projection steps can vary significantly. Integrating this diverse complexity directly into `libllama` is currently challenging.
Consequently, running a multimodal model typically requires two GGUF files:
- The standard language model file.
- A corresponding multimodal projector (`mmproj`) file, which handles the image encoding and projection.
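Loading both files can be sketched as follows. This is only an illustration: the file names are placeholders rather than shipped artifacts, and the command is guarded so it becomes a no-op when a llama.cpp build is not on the `PATH`.

```shell
# Sketch only: MODEL and MMPROJ are placeholder file names.
MODEL=model.gguf          # the standard language model GGUF
MMPROJ=mmproj-model.gguf  # the multimodal projector GGUF for the same model
if command -v llama-mtmd-cli >/dev/null 2>&1; then
    # both files must come from the same model conversion
    llama-mtmd-cli -m "$MODEL" --mmproj "$MMPROJ" \
        --image input.jpeg -p "Describe this image."
fi
```

The projector encodes the image into embeddings, which the language model then consumes alongside the text prompt.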
## What is libmtmd?
As outlined in the history, `libmtmd` is the modern library designed to replace the original `llava.cpp` implementation for handling multimodal inputs.
Built upon `clip.cpp` (similar to `llava.cpp`), `libmtmd` offers several advantages:
- **Unified Interface**: Aims to consolidate interaction for various multimodal models.
- **Improved UX/DX**: Features a more intuitive API, inspired by the `Processor` class in the Hugging Face `transformers` library.
- **Flexibility**: Designed to support multiple input types (text, audio, images) while respecting the wide variety of chat templates used by different models.
## How to obtain mmproj

Multimodal projector (`mmproj`) files are specific to each model architecture.
For the following models, you can use `convert_hf_to_gguf.py` with the `--mmproj` flag to get the mmproj file:
- Gemma 3; see the guide here. Note: the 1B variant does not have vision support.
- SmolVLM (from HuggingFaceTB)
- SmolVLM2 (from HuggingFaceTB)
- Pixtral 12B; only works with a `transformers`-compatible checkpoint.
- Qwen 2 VL and Qwen 2.5 VL (from Qwen)
- Mistral Small 3.1 24B
- InternVL 2.5 and InternVL 3 from OpenGVLab (note: we don't support conversion of `InternVL3-*-hf` models, only the non-HF version is supported; the `InternLM2Model` text model is not supported)
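For the models above, the conversion step can be sketched like this. The checkpoint directory and output file names are hypothetical, and the commands are skipped when no such checkpoint exists locally:

```shell
# Hypothetical local Hugging Face checkpoint directory.
HF_DIR=./SmolVLM-Instruct
if [ -d "$HF_DIR" ]; then
    # standard language model GGUF
    python convert_hf_to_gguf.py "$HF_DIR" --outfile model.gguf
    # multimodal projector GGUF (--mmproj selects the projector/encoder part)
    python convert_hf_to_gguf.py "$HF_DIR" --mmproj --outfile mmproj-model.gguf
fi
```

Running the converter twice, once with and once without `--mmproj`, yields the two GGUF files described above.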
For older models, please refer to the relevant guide for instructions on how to obtain or create them:
NOTE: conversion scripts are located under `tools/mtmd/legacy-models`