This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.
This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.
- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.
64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
* vulkan: Enable and optimize large matmul parameter combination for AMD
* limit tuning to AMD GPUs with coopmat support
* use tx_m values instead of _l
* debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check
This commit updates the pooling check in the debug example to
also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just
LLAMA_POOLING_TYPE_NONE.
* debug : normalize both pooled and token embeddings
This commit updates debug.cpp to normalize embeddings for both pooled
and non-pooled outputs. For pooled embeddings, normalization is applied
to the single vector, and for non-pooled embeddings, normalization is
applied to each token embedding vector individually.
The motivation for this is to enable non-pooled embeddings to be
normalized which was not possible previously.
* qwen3next: simplify qkvz projection
* use ggml_swiglu_split
* revert swiglu_split, but remove redundant repeat()
* fix missing reshape
* rm 2 redundant transposes
* move mul_mat(k,q) to outside of chunking
* rm redundant cont
* improve g_cs_chunk
* add comments about no cont
* use std::pair instead of ggml_concat
* vectorize key_gdiff calculation
* rm unused tensor
* avoid ggml_concat inside loop
* bring back ggml_concat as it may not work on other backend
* nits
This commit introduces a mechanism to embed all licenses directly
into the compiled binaries.
This eliminates the need to distribute separate LICENSE files alongside
the executable, making the binaries self-contained and simplifying
deployment.
* Add Gemma3nVisionModel - MobileNetV5 vision encoder convertor to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
* Add mobilenetv5 impl
* Fix comments, remove unused vars
* Fix permute and remove transpose of projection weights
* Fix comments, remove debugging prints from hf_to_gguf
* 1. Hard-code image_mean = 0 and image_std = 1
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder
* 1. Move mobilenetv5 helpers declarations to `clip_graph_mobilenetv5` struct and definitions to mobilenetv5.cpp
2.Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size
* Remove obsolete comments
* - convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest
- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading
* - Rename tensors to v.conv..., v.blk..., v.msfa... to better align with already existing terminology
* Fix stem conv bias name
* Remove explicit handling of bias term for stem conv
* - Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer
- Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable
* clean up conversion script
* fix code style
* also preserve audio tensors
* trailing space
* split arch A and V
* rm unused gemma3 func
* fix alignment
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* arg: support remote preset
* proof reading
* allow one HF repo to point to multiple HF repos
* docs: mention about multiple GGUF use case
* correct clean_file_name
* download: also return HTTP status code
* fix case with cache file used
* fix --offline option