* convert ok
* no deepstack
* less new tensors
* cgraph ok
* add mrope for text model
* faster patch merger
* add GGML_ROPE_TYPE_MRNORM
* add support for metal
* move glm4v do dedicated graph
* convert: add norm_embd
* clip: add debugging fn
* working correctly
* fix style
* use bicubic
* fix mrope metal
* improve cpu
* convert to neox ordering on conversion
* revert backend changes
* force stop if using old weight
* support moe variant
* fix conversion
* fix convert (2)
* Update tools/mtmd/clip-graph.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* process mrope_section on TextModel base class
* resolve conflict merge
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : add support for NVIDIA Nemotron Nano 3
This commit adds support for the NVIDIA Nemotron Nano 3 model, enabling
the conversion and running of this model.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* [model] add glm-asr support
* fix format for ci
* fix convert format for ci
* update glm_asr convert script & use build_ffn for glm_asr clip & use build_stack for padding and review
* check root architecture for convert hf script
* fix conficlt with upstream
* fix convert script for glm asr & format clip-impl
* format
* restore hparams text
* improved conversion
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix convert_hf_to_gguf.py failing with --mistral-format using later mistral-common versions.
* use get_one_valid_tokenizer_file from mistral-common if available and fallback to old logic otherwise.
* use file name instead of file path for get_one_valid_tokenizer_file.
* fix --mistral-format tokenizer file failing for tokenizers in subdirectories.
* move get_one_valid_tokenizer_file import to avoid nested try-except.
gguf_new_metadata.py reads data from reader.
Reader doesn't byteswap tensors to native endianness.
But writer does expect tensors in native endianness to convert them
into requested endianness.
There are two ways to fix this: update reader and do conversion to native endianness and back,
or skip converting endianness in writer in this particular USE-case.
gguf_editor_gui.py doesn't allow editing or viewing tensor data.
Let's go with skipping excessive byteswapping.
If eventually capability to view or edit tensor data is added,
tensor data should be instead byteswapped when reading it.
* Qwen3 Next - cleaned up version
* Whitespaces and stuff
* Correct minor errors
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Misc. fixes.
* Clean up code, add missing hybrid qualifier
* Did someone transpose the SOLVE_TRI result matrix? Perhaps...
* Whitespace
* Proper tensors for cb calls
* Use llama-graph.h vertical alignment
* BROKEN: chunking
* Set new tensors as inputs.
* Proper chunk logic
* It's the circle of life...
* More shenanigans for n_seq > 1
* Nail in the coffin?
* Fix Windows build
* Eh, one fails on Windows, the other fails on Mac... just use general capture.
* quant : cleanup
* model : cleanup
* qwen3 : cleanup
* cont : cleanup
* cont : cleanup
* ggml : revert change
* qwen3 : cleanup
* cont : cleanup
* Readd cmath
* qwen3 : fix typo
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Usual suspects
* fix my bad suggestion
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix convert_hf_to_gguf.py script on s390x
Assume converted model data is originally little-endian.
Byteswap data on s390x after reading it to put values in correct presentation
for any transformation needed, like calculating weight tensors.
Then byteswap data to little-endian before passing it to GGUFWriter while
GGUFWriter will byteswap data back to big endian if big endian output is requested.
byteswap(inplace=True) calls don't work with lazy tensor and array wrappers.
Use byteswap with copying data to workaround this behaviour.
* Make GGUFWriter accept tensors in native endianness instead of little-endian
With this change if no byteswapping is actually needed, 2 excessive byteswaps can be omitted on s390x
* Fix byteswapping in convert_hf_to_gguf.py for remote models
* convert : parse safetensors directly
* gguf-py : order safetensors tensors by name
Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.
* convert : rename from_safetensors_meta to from_local_tensor
For consistency with from_remote_tensor
* convert : fix no-lazy dtypes from direct safetensors
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* add BailingMoeV2 support
* update llm types
* undo
* undo
* update llm types
* add model collection link
* update
* almost working
* correct group selection and rename n_group_exp
* avoid large top_k and use argmax instead for now
if we had something like argmax2 that would be equivalent, but this works fine until then
* poke
* skip group selection when there are no tokens
* fix 1T conversion
* hopefully fixed expert group selection
third time's the charm?
* make expert group selection generally available
The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture.
* allow n_expert_groups to be 1 (Kimi K2)
* address review suggestions
BF16 requires special handling in this script
while it's a 2-bytes data, but view is 1-byte by default.
Switch to correct view before attempting byteswapping.
With this change correctly byteswapping models like
Meta-Llama-3-8B-Instruct-bf16-GGUF
should be possible.
* model: EmbeddingGemma sentence-transformers dense linear projections support
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.
See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
- converting model with dense-layers is optional
- introduced dense config params
* Update convert_hf_to_gguf.py
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* fixed formatting issues
* Update src/llama-graph.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* - removed pooling_type_opt, always allow overriding pooling_type
- asserts checking dense features dims
* fix python lint
* fix ubuntu gcc build warning
* - fixed thread-safety test
- moved asserts to load_hparams
* - tidying up code
- simplifying graph-context expecting both dense weights
* minor : add TODO
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat: Add granite-docling conversion using trillion pretokenizer
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling vocab pre enum
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use granite-docling pre
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add clip_is_idefics3
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Allow multi-token boundary sequences for image templating
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tiling support for idefices3 in clip.cpp
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Partial support for full templating for idefics3 in mtmd
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Fully working image preprocessing for idefics3 w/ resize and slicing
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the longest side instead of size * scale_factor
For Granite Docling, these come out to the same value, but that was just a
conicidence.
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow batch encoding and remove clip_is_idefics3
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove unnecessary conditionals for empty token vectors
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use image_manipulation util
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add test model
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* First attempt
* No permute during convert (fixes qk tensors), proper norm application.
* RoPE = NeoX
* Coherence!
* Migrate xielu params from tensors to hyperparameters
* Simple CUDA kernel
* Revert stupid LLM refactorings
* Chat template support
* configchecker / flake8 errors
* Reorder unary.cu
* I do conclude that LLMs are, in fact, stupid.
* Fix after merge
* Final newline
* Make xIELU an UNARY_OP
* Final newline
* Correctly account for parameter shift
* Argh.
* Update ggml/src/ggml-cpu/unary-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Refactor: remove unused methods, inline and factorize softplus, add const modifiers
* Revert CUDA changes, implement xIELU as a separate OP
* Pesky newline
* Add float2half / half2float for F16 inputs/outputs
* CUDA variants, attempt 2
* Actually, attempt 3
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Missing convert header
* Proper formula and reference for xIELU in the comments.
* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add tensor mappings for Apertus to global list instead
* Fix lazy on scalars
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add comment about the constraints on positive/negative alpha
* Change `softplus` to `ggml_softplus`
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>