Ed Addario
8d97eee557
Improve layer 0 stats
2025-11-17 17:52:15 +00:00
Ed Addario
bf9823afa7
Minor refactoring
2025-11-17 14:51:12 +00:00
Ed Addario
cdc7caea97
Remove unreachable logic
2025-11-17 14:46:45 +00:00
Ed Addario
658c6a8303
Enforce tensor structure when aggregating multiple imatrix files
2025-11-17 14:46:21 +00:00
Ed Addario
a2b86d7fd9
Minor refactoring
2025-11-17 14:14:05 +00:00
Ed Addario
1f3db496cc
Calculate layer_sum only for legacy
2025-11-17 13:36:28 +00:00
Ed Addario
76566b83de
Enforce same-size between compared tensors
2025-11-17 13:28:35 +00:00
Ed Addario
fb2b09a43c
Skip experts with zero count (unused)
2025-11-17 13:06:37 +00:00
Ed Addario
63cbcc6dfc
Refactor legacy determination
2025-11-17 13:05:34 +00:00
Ed Addario
ae1cbc707b
Warn if problem with previous layer
2025-11-17 13:04:16 +00:00
Ed Addario
5384a11b94
Initialise layer and tensor variables
2025-11-17 13:00:47 +00:00
Ed Addario
559ae9ab89
Refactor legacy imatrix handling
2025-11-17 10:19:34 +00:00
Ed Addario
b2b7175e19
Fix bug when vectors are zero
2025-11-06 15:12:09 +00:00
Ed Addario
8bd9d87d3e
Merge branch 'master' into imatrix
2025-10-31 23:19:54 +00:00
Piotr Wilkin (ilintar)
bea04522ff
refactor : llama-model.cpp ( #16252 )
...
* Squashed: llama-model.cpp refactoring
* Fix formatting of attn / ffn / ffn_moe calls
* Fix import regression / unify spacing in models.h
* totally DID NOT miss those!
* Add missing qwen3vl(moe) models
* Add missing new .cpp files to build
* Remove extra semicolons
* Editor checker
* Update src/models/models.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-31 23:40:23 +01:00
Piotr Wilkin (ilintar)
0de0a01576
model : Minimax M2 ( #16831 )
...
* Model: Minimax M2
* Cleanup
* Cleanup pt. 2
* Cleanup pt. 3
* Update convert_hf_to_gguf_update.py - merge catch blocks
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Remove vocab models and test
* Remove all redundant hparam settings covered by TextModel
* Move super to start, don't set block_count
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/constants.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-31 21:20:47 +01:00
Giuseppe Scrivano
e58d585604
model : add Granite Hybrid nano types ( #16896 )
...
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-10-31 21:20:07 +01:00
Johannes Gäßler
31c511a968
CUDA: Volta tensor core support for MMF ( #16843 )
...
* CUDA: Volta tensor core support for MMF
* more generic checks for hardware support
* Update ggml/src/ggml-cuda/mmf.cuh
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2025-10-31 15:57:19 +01:00
Georgi Gerganov
6d39015a74
sync : ggml
2025-10-31 16:26:28 +02:00
Aman Gupta
4146d6a1a6
CUDA: add expert reduce kernel ( #16857 )
...
* CUDA: add expert reduce kernel
* contiguous checks, better formatting, use std::vector instead of array
* use vector empty instead of size
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-10-31 20:05:07 +08:00
Georgi Gerganov
8da3c0e200
batch : fix consistency checks for the input positions ( #16890 )
2025-10-31 13:50:33 +02:00
Georgi Gerganov
c22473b580
server : don't print user inputs to console ( #16871 )
2025-10-31 10:54:19 +02:00
Daniel Bevenius
0f715b4e75
server : fix typos in server.cpp comments [no ci] ( #16883 )
2025-10-31 09:51:26 +01:00
Jeff Bolz
d2d931f173
vulkan: disable spirv-opt for rope shaders ( #16872 )
2025-10-31 08:34:47 +01:00
Masato Nakasaka
2976b0374d
vulkan: Fix crash when FP16 mul_mat accumulation is not supported ( #16796 )
...
* Experimental crash fix
* added assert for aborting and fixed comment
* changed to check if a pipeline is empty or not
* Moved function in class definition
* replaced with is_empty
* Modified is_empty to check only unaligned pipelines
2025-10-31 08:18:59 +01:00
Ruben Ortlam
d2a2673dd1
vulkan: fix shmem overrun in mmq id shader ( #16873 )
...
* vulkan: fix shmem overrun in mmq id shader
* metal : fix mul_mm_id
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-31 08:14:49 +01:00
l3utterfly
13002a0896
ggml-hexagon: respect input size when getting/setting tensor data ( #16836 )
...
* respect input size when getting/setting tensor data
allows partial repacking/copying when the requested get/set size is smaller than the actual tensor
* Removed duplicate repack_mxfp4_mxfp4x4x2 function
2025-10-30 21:46:31 -07:00
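The partial get/set behaviour this commit describes can be sketched as a clamped copy: never transfer more bytes than the tensor actually holds past the given offset. This is a minimal illustration with hypothetical names, not the actual ggml-hexagon code:

```cpp
#include <algorithm>
#include <cstring>

// Hypothetical sketch: copy at most `size` bytes starting at `offset`,
// clamped to the tensor's actual byte size, so a partial get/set never
// reads or writes past the end of the buffer.
size_t clamped_copy(void * dst, const void * src,
                    size_t tensor_nbytes, size_t offset, size_t size) {
    if (offset >= tensor_nbytes) {
        return 0; // nothing left to copy
    }
    const size_t n = std::min(size, tensor_nbytes - offset);
    std::memcpy(dst, (const char *) src + offset, n);
    return n; // number of bytes actually copied
}
```

Respecting the requested size rather than the full tensor size is what makes partial repacking possible.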
Sigbjørn Skjæret
6eb208d17e
ci : enable free-disk-space on cuda docker build ( #16877 )
2025-10-31 00:34:27 +01:00
lhez
9984cbb61d
opencl: fix boundary handling for mul_mm ( #16875 )
2025-10-30 16:00:20 -07:00
Ed Addario
ce046dcee8
Save statistics to imatrix
2025-10-30 22:43:46 +00:00
RodriMora
ce18efeaf1
convert : update transformers requirements ( #16866 )
...
* Update requirements-convert_legacy_llama.txt
Updated requirements to support Qwen3-VL in transformers version 4.57.1
* Update requirements/requirements-convert_legacy_llama.txt
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 23:15:03 +01:00
chansikpark
16724b5b68
server : bump request URI max length to 32768 ( #16862 )
2025-10-30 20:22:23 +02:00
Georgi Gerganov
b52edd2558
server : remove n_past ( #16818 )
...
* server : remove n_past
* server : replace slot.n_prompt_tokens() with slot.task->n_tokens()
* server : fixes + clean-up
* cont : fix context shift
* server : add server_tokens::pos_next()
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* server : fix pos_next() usage
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-10-30 18:42:57 +02:00
Max Krasnyansky
517b7170e1
cpu: introduce chunking for repack matmuls and enable matmul-id chunking on ARM64 ( #16833 )
...
Very similar implementation to the flash-attention chunking, with similar benefits.
2025-10-30 09:06:13 -07:00
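The chunking this commit refers to can be sketched as dynamic work distribution: rows are grouped into fixed-size chunks and each thread claims the next chunk from a shared atomic counter, which balances uneven per-row cost better than a static split. A simplified single-function sketch with illustrative names (not the actual ggml repack code):

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

// Hypothetical sketch of chunked work stealing: threads repeatedly claim
// the next chunk index from a shared atomic counter and process the rows
// in that chunk, until all chunks are consumed.
void process_rows_chunked(int n_rows, int chunk_size,
                          std::atomic<int> & next_chunk,
                          std::vector<int> & processed) {
    const int n_chunks = (n_rows + chunk_size - 1) / chunk_size;
    for (;;) {
        const int chunk = next_chunk.fetch_add(1);
        if (chunk >= n_chunks) {
            break; // no work left
        }
        const int row0 = chunk * chunk_size;
        const int row1 = std::min(row0 + chunk_size, n_rows);
        for (int r = row0; r < row1; ++r) {
            processed.push_back(r); // stand-in for the per-row matmul
        }
    }
}
```

Run concurrently from several threads (each with its own output buffer), every row is processed exactly once regardless of how the chunks interleave.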
Shagun Bera
835e918d84
common: fix typo in cli help text ( #16864 )
2025-10-30 17:47:31 +02:00
JJJYmmm
d261223d24
model: add support for qwen3vl series ( #16780 )
...
* support qwen3vl series.
Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
* bugfix: fix the arch check for qwen3vl-moe.
* use build_ffn
* optimize deepstack structure
* optimize deepstack feature saving
* Revert "optimize deepstack feature saving" for temporal fix
This reverts commit f321b9fdf1.
* code clean
* use fused qkv in clip
* clean up / rm is_deepstack_layers for simplification
* add test model
* move test model to "big" section
* fix imrope check
* remove trailing whitespace
* fix rope fail
* metal : add imrope support
* add imrope support for sycl
* vulkan: add imrope w/o check
* fix vulkan
* webgpu: add imrope w/o check
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix tensor mapping
---------
Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 16:19:14 +01:00
Max Krasnyansky
dcca0d3ab8
cpu: introduce chunking for flash attention ( #16829 )
...
Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop
on top that handles the chunks.
2025-10-30 14:26:05 +02:00
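The refactor described above — factoring the core loop into a one-chunk function with an outer loop over chunks — can be sketched as follows. The names and the trivial per-chunk body are illustrative; the real flash-attention kernel does far more per chunk:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-chunk body: handles a contiguous range [i0, i1).
float fa_one_chunk(const std::vector<float> & scores, int i0, int i1) {
    float acc = 0.0f;
    for (int i = i0; i < i1; ++i) {
        acc += scores[i];
    }
    return acc;
}

// Outer loop that walks the chunks, mirroring the structure this commit
// introduces around the factored-out inner loop.
float fa_chunked(const std::vector<float> & scores, int chunk_size) {
    float acc = 0.0f;
    const int n = (int) scores.size();
    for (int i0 = 0; i0 < n; i0 += chunk_size) {
        acc += fa_one_chunk(scores, i0, std::min(i0 + chunk_size, n));
    }
    return acc;
}
```

The split makes the chunk a natural unit of work to hand out to threads, which is what enables the load balancing the commit cites.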
Tianyue-Zhao
bacddc049a
model: Add support for CogVLM model ( #15002 )
...
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-10-30 12:18:50 +01:00
Sigbjørn Skjæret
229bf68628
cuda : fix argsort with 64k+ rows ( #16849 )
2025-10-30 08:56:28 +01:00
Jan Boon
d7395115ba
llama : use std::abs instead of abs ( #16853 )
2025-10-30 08:30:58 +02:00
Jeff Bolz
052df28b0e
vulkan: Handle argsort with a large number of rows ( #16851 )
2025-10-30 07:27:41 +01:00
Oliver Simons
8b11deea46
Hide latency of bias and gate-loading ( #16847 )
...
This is realised by loading them into registers before computation of
the dot-product, effectively batching them together with said
dot-product. As a lot of threads are alive here, the warp scheduler has
enough threads available to effectively hide the cost of additionally
loading those two floats.
2025-10-30 11:34:15 +08:00
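The scheduling idea in this commit — issue the bias and gate loads before the dot-product so their latency overlaps the arithmetic — can be shown in a scalar sketch. On the GPU the warp scheduler does the actual hiding; this C++ version only illustrates the reordering, with hypothetical names:

```cpp
// Hypothetical sketch: the bias/gate loads are issued *before* the
// dot-product loop, so (on a GPU) their memory latency is hidden behind
// the arithmetic instead of stalling the thread after it.
float fused_dot(const float * x, const float * w, int n,
                const float * bias, const float * gate, int col) {
    // load bias and gate first -- the values sit in registers while the
    // dot-product below keeps the execution units busy
    const float b = bias[col];
    const float g = gate[col];
    float dot = 0.0f;
    for (int i = 0; i < n; ++i) {
        dot += x[i] * w[i];
    }
    return g * (dot + b);
}
```

The result is identical either way; only the instruction ordering (and hence how well the loads overlap with compute) changes.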
Jeff Bolz
b9ce940177
vulkan: Fuse rope+set_rows ( #16769 )
...
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).
Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).
Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.
Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.
Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.
Add new backend tests.
2025-10-29 15:13:10 -05:00
Xuan-Son Nguyen
3464bdac37
llama: fix ASAN error with M-RoPE ( #16848 )
2025-10-29 20:11:39 +01:00
Ed Addario
7d8819f57a
Improve compute_layer_statistics() processing of mismatched tensor sizes
2025-10-29 18:36:01 +00:00
Ed Addario
006e7ef991
Improve compute_vector_statistics() processing of mismatched tensor sizes
2025-10-29 18:35:39 +00:00
Ed Addario
2a6f5d7e60
Refactor variable names
2025-10-29 18:32:47 +00:00
Xuan-Son Nguyen
e3af5563bd
llama: store mrope data in KV cell ( #16825 )
...
* llama: store mrope data in KV cell
* correct x,y ordering
* address review comments
* add consistency checks
* Update src/llama-kv-cache.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add TODO
* fix asan error
* kv-cells : improve ext handling
* cont : fix headers
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-29 18:09:18 +01:00
Jeff Bolz
10fcc41290
vulkan: Update topk_moe fusion to handle gpt's late softmax ( #16656 )
...
* vulkan: Update topk_moe fusion to handle gpt's late softmax
Based on #16649 .
* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in #16655
* Update ggml/src/ggml-impl.h
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-10-29 14:44:29 +01:00
Ruben Ortlam
bcf5bda6f5
Vulkan MMQ Integer Dot Refactor and K-Quant support ( #16536 )
...
* vulkan: add mmq q2_k integer dot support
* Refactor mmq caching
* Reduce mmq register use
* Load 4 quant blocks into shared memory in one step
* Pack q2_k blocks into caches of 32
* Use 32-bit accumulators for integer dot matmul
* Add q4_k mmq
* Add q3_k mmq
* Add q5_k mmq
* Add q6_k mmq
* Add mxfp4 mmq, enable MMQ MUL_MAT_ID
* Fix mmv dm loads
2025-10-29 14:39:03 +01:00