* requirements : update transformers to 5.5.0
This commit updates the transformers dependency to version 5.5.0.
The motivation for this is that transformers 5.5.0 includes support for
Gemma4 and is required to be able to convert Gemma4 models. This is also
causing issues for user of gguf-my-repo.
Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/202
* fix huggingface_hub version
* set version of transformers to 5.5.0
* convert : add ty ignore directives to convert_hf_to_gguf.py
This commit adds `ty: ignore` directives to transformers tokenizers
field/methods to avoid type check errors. There might be better ways to
handle this and perhaps this can be done in a follow up commit.
The motivation for this is that it looks like in transformers 5.5.0
AutoTokenizer.from_pretrained can return generic tokenizer types or None
and the type checker now produces an error when the conversion script
accesses field like tokenizer.vocab.
* convert : add ty ignore to suppress type check errors
* convert : remove incorrect type ignores
* convert : fix remaining python checks
I was running a newer version of ty locally but I've switched to
version 0.0.26 which is what CI uses and I was then able to reproduce
the errors. Sorry about the noise.
* update transformers version to 5.5.1
* WIP: Add EuroBERT support with autoformatting changes
This commit includes:
- EuroBERT model implementation for GGUF conversion
- C++ backend support for EuroBERT architecture
- Unintended autoformatting changes to Python files
Saving before reverting formatting-only changes.
* feat: add back eos assert when not last token pooling
* feat: removed duplicated code and cleanup
* feat: removed not working architectures and unnecessary check
* fix: typo
* fix: dynamic pooling config
* feat: added an example model for eurobert
* feat: proper llama-vocab implementation for jina-v5
* fix: removed unnecessary comments
* model: add JAIS-2 architecture support
Add support for the JAIS-2 family of Arabic-English bilingual models
from Inception AI (https://huggingface.co/inceptionai/Jais-2-8B-Chat).
Architecture characteristics:
- LayerNorm (not RMSNorm) with biases
- ReLU² (ReLU squared) activation function
- Separate Q/K/V projections with biases
- Simple MLP without gate projection (up -> act -> down)
- RoPE positional embeddings
- GPT-2 BPE tokenizer
Supported model sizes:
- Jais-2-8B (32 layers, 26 heads, 3328 hidden)
- Jais-2-70B (68 layers, 56 heads, 7168 hidden)
Tested with quantizations: BF16, Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K_M, Q2_K
Note: JAIS-2 requires F32 precision accumulators for numerical stability
and uses standard attention (not flash attention) on CUDA backends.
* fix: run convert_hf_to_gguf_update.py for jais-2 tokenizer hash
* fix: use NEOX RoPE type for JAIS2
* fix: remove Q/K permutation (NEOX RoPE doesn't need it)
* fix: enable flash attention for JAIS2 (fixed by #19115)
* fix: add dedicated JAIS2 pre-tokenizer type and control vector support
- Add LLAMA_VOCAB_PRE_TYPE_JAIS2 with cascading whitespace regex
- Include original regex from tokenizer.json as comment
- Add build_cvec call for control vector support
* no longer necessary to override set_vocab
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* changes for tiny aya
* changes to hash
* changes to vocab
* fix some tokenizer regex edge cases
* update comment
* add some comments for regex
* Apply suggestion from @ngxson
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* support qwen3.5 series
* remove deepstack for now, and some code clean
* code clean
* add FULL_ATTENTION_INTERVAL metadata
* code clean
* reorder v heads for linear attention to avoid expensive interleaved repeat
* model: add Solar-Open model
* vocab: add solar-open to end eog blacklist
* model: add proper llm type
* chat: basic template for solar open
* typo: fix comment about vocab
* convert: sugested changes
* convert: suggested changes
* chat: change reasoning end tag for solar-open
* llama-chat: add solar-open template
ModernBERT but without `head.norm` so will currently fail to convert and run any other ModernBERT models, PRs with `head.norm` support welcome!
* constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only
* conversion now working, hf -> gguf
* working on support, now working on building graph
* some cleanup
* cleanup
* continuing
* correct tensor shape for qkv
* fixed tensor mappings and working on buildin graph
* tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this
* cleanup
* cleanup
* cleanup
* more cleanup
* ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more
* added cls token per previous modern bert attempt, still working on checking out the rest
* fixed pre tokenizer and still working through previous pr
* working through previous attemp, implimented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer
* fixed pre tokenizer
* working on swa with local and global alternating attention
* some cleanup and now fails on build attn
* starting to work, and some cleanup, currently failing on last layer construction in graph build
* alternating rope implemented and modern bert graph build succeeds
* fixed asser for equal ubatch seq
* cleanup
* added mask check in vocab
* fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values
* reuse variable
* removed repeat
* standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL
* correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...
* more modular hparam setting
* replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf_update.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* removed redundant hparam set
* enums for model sizes
* conversion for modern-bert model supported rather than just granite-small
* Update src/llama-model.cpp
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
* Update src/llama-model.cpp
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
* fixed ordering of enum for freq_base_swa
* fixed where I added residual, now gives much much better embeddings~
* readded cacheless logic
* removing whitespace
* conversion now working for swa pattern - dense every n layers
* modern bert put into seperate src file
* removing whitespace
* fixed whitespace and newline errors in editorconfig job
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* better naming convention, n_swa_pattern -> swa_period
* reusing sliding_window_pattern key rather than making new dense_every_n_layers key, and adding writing and reading support
* fixing pyright type-check fail
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-hparams.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-saver.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* added descriptions in llama-model
* fixed tensor mappings for conversion
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* mapping name for size
* nits
* unused
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
* add BailingMoeV2 support
* update llm types
* undo
* undo
* update llm types
* add model collection link
* update
* almost working
* correct group selection and rename n_group_exp
* avoid large top_k and use argmax instead for now
if we had something like argmax2 that would be equivalent, but this works fine until then
* poke
* skip group selection when there are no tokens
* fix 1T conversion
* hopefully fixed expert group selection
third time's the charm?
* make expert group selection generally available
The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture.
* allow n_expert_groups to be 1 (Kimi K2)
* address review suggestions
* feat: Add granite-docling conversion using trillion pretokenizer
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling vocab pre enum
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use granite-docling pre
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add clip_is_idefics3
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Allow multi-token boundary sequences for image templating
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tiling support for idefices3 in clip.cpp
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Partial support for full templating for idefics3 in mtmd
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Fully working image preprocessing for idefics3 w/ resize and slicing
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the longest side instead of size * scale_factor
For Granite Docling, these come out to the same value, but that was just a
conicidence.
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow batch encoding and remove clip_is_idefics3
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove unnecessary conditionals for empty token vectors
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use image_manipulation util
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add test model
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* add grok-2 support
* type fix
* type fix
* type fix
* "fix" vocab for invalid sequences
* fix expert tensor mapping and spaces in vocab
* add chat template
* fix norm tensor mapping
* rename layer_out_norm to ffn_post_norm
* ensure ffn_post_norm is mapped
* fix experts merging
* remove erroneous FFN_GATE entry
* concatenate split tensors and add more metadata
* process all expert layers and try cat instead of hstack
* add support for community BPE vocab
* fix expert feed forward length and ffn_down concat
* commit this too
* add ffn_up/gate/down, unsure if sequence is right
* add ffn_gate/down/up to tensor names
* correct residual moe (still not working)
* mess--
* fix embedding scale being applied twice
* add built in chat template
* change beta fast for grok if default value
* remove spm vocab in favor of community bpe vocab
* change attention temp length metadata type to integer
* update attention temp length metadata
* remove comment
* replace M_SQRT2 with std::sqrt(2)
* add yarn metadata, move defaults to hparams
* Add Arcee AFM support
* Add draft update code
* Fix linter and update URL, may still not be final
* Update src/llama-model.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Remote accidental blank line
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script
* Added Phi-4-mini-instruct support
* Update regex per ngxson
* Change the vocab base to Xenova/gpt-4o
* fix conversion update script
* no need to check longrope
* minor style fix
* fix python style
---------
Co-authored-by: Nicholas Sparks <nisparks@microsoft.com>