Commit Graph

7998 Commits

Author SHA1 Message Date
Aman Gupta 3db6e5ef22 add permuted test-case 2026-02-13 14:18:41 +01:00
Aman Gupta 2f0ac21d4b cuda: add support for non-contig q,k,v 2026-02-13 14:12:09 +01:00
Aman Gupta 15d83e0c87 cpu: support for non-contig q,k,v 2026-02-12 21:51:51 +05:30
Aman Gupta 54ea122385 simplify impl + add cuda code 2026-02-11 17:26:39 +01:00
Aman Gupta 86833eb747 repalce qwen3next 2026-02-11 11:47:54 +05:30
Aman Gupta c7edcf22ec add eps 2026-02-11 11:46:27 +05:30
Aman Gupta ffe3e82c8b fix computation 2026-02-11 11:34:47 +05:30
Aman Gupta 4be60a28b6 tranpose 2026-02-11 11:11:44 +05:30
Aman Gupta 8666c546f9 ggml: add GATED_DELTA_NET op 2026-02-11 10:52:49 +05:30
Xuan-Son Nguyen 9a96352729
test: fix IMROPE perf test case (#19465) 2026-02-10 14:37:50 +01:00
Alberto Cabrera Pérez c03a5a46f0
ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (#19360)
* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
2026-02-10 10:47:45 +00:00
k4ss4n 6948adc90d
ggml : use noexcept overload for is_regular_file in backend registration (#19452)
using noexcept std::filesystem::directory_entry::is_regular_file
overload prevents abnormal termination upon throwing an error
(as caused by symlinks to non-existent folders on linux)

Resolves: #18560
2026-02-10 10:57:48 +01:00
Piotr Wilkin (ilintar) 854b09f0d7
convert : move experts permutation from Qwen2MoeModel to Qwen3VLMoeTextModel (#19445)
* Add special case for Qwen3VLMoe

* Fix down path, remove arrows and checkmarks

* ws

* Moved to Qwen3VL

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-10 09:01:37 +01:00
Daniel Bevenius 66d403c480
tts : fix typos in README.md [no ci] (#19463) 2026-02-10 07:30:41 +01:00
Raul Torres f0bfe54f55
CANN: Remove unnecessary wrapper for `gml_backend_buft_is_cann` (#18968) 2026-02-10 14:19:30 +08:00
hipudding 52e38faf8c
CANN: implement quantized MUL_MAT_ID for MoE models (#19228)
Implement ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
2026-02-10 14:18:59 +08:00
Georgi Gerganov a0d585537c
cuda : extend GGML_OP_PAD to work with non-cont src0 (#19429)
* cuda : extend GGML_OP_PAD to work with non-cont src0

* tests : add permuted pad
2026-02-10 08:07:16 +02:00
Xuan-Son Nguyen 98e57ca422
chat: fix case where template accepts type content only (#19419)
* chat: fix case where template accepts type content only

* rm stray log

* reuse render_message_to_json
2026-02-09 22:14:12 +01:00
Tarek Dakhran 262364e31d
mtmd: Implement tiling for LFM2-VL (#19454) 2026-02-09 17:30:32 +01:00
손희준 820ebfa6f4
Server: log when converting requests to chat completions format (#19457)
* Log converting requests

* Print as debug instead of info [no ci]

---------

Co-authored-by: openingnow <>
2026-02-09 16:22:57 +01:00
Sascha Rogmann 292f6908cd
spec : remove check rate (#19377)
* spec: remove parameter spec-ngram-check-rate

* spec : renamed statistics vars

* spec : add n_call_begin, n_call_accept

* spec : don't enable key-map-stats
2026-02-09 15:30:50 +02:00
Georgi Gerganov 81ddc60cb3
ci : add metal server workflows (#19293)
* ci : add metal server workflows

* cont : try fix python init

* cont : move to a separate workflow that runs only on master

* cont : fix num jobs

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-09 15:09:30 +02:00
Georgi Gerganov 972f323e73
revert : "[Model] Qwen3.5 dense and MoE support (no vision) (#19435)" (#19453)
This reverts commit 39bf692af1.
2026-02-09 14:57:51 +02:00
Kevin Pouget f5e7734ff2
ggml-virtgpu: add backend documentation (#19354)
* ggml-virtgpu: add backend documentation

Assisted-by-AI: Claude Code

* CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget

* README: add the link to docs/backend/GGML-VirtGPU/ggml-virt.md

* docs/ggml-virt: add link to testing + configuration

* Revert "CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget"

This reverts commit 8ece8e72e2.

* drop the ggml- prefix

* s/ggerganov/ggml-org

* Relocate VirtGPU.md

* reorganize the text

* turn turn the ascii diagram into a mermaid

* README.md: update the link to the main doc
2026-02-09 20:15:42 +08:00
Hugo 1e8924fd65
cmake : add variable to skip installing tests (#19370)
When packaging downstream, there's usually little point in installing
test. The default behaviour remains the same.
2026-02-09 07:12:02 +01:00
Piotr Wilkin (ilintar) 39bf692af1
[Model] Qwen3.5 dense and MoE support (no vision) (#19435)
* Unified delta net handling

* Remove old methods.

* Refactor and optimize

* Adapt autoregressive version from @ymcki

* Change to decay mask approach

* Fix bad permute

* Qwen 3.5 support

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes

* Use inheritance, remove unneeded conts

* Not like this!

* Remove ggml.h explicit import

* Remove transformers, fix the views

* ACTUALLY fix views, make super calls explicit in conversion.

* Fix conversion again

* Remove extra ggml.h imports

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-09 00:24:08 +01:00
Oliver Simons e06088da0f
CUDA: Fix non-contig rope (#19338)
* Rename variables + fix rope_neox

Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299

* Fix rope_multi

* Fix rope_vision

* Fix rope_norm

* Rename ne* to ne0* for consistent variable naming

* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-08 15:12:51 +02:00
Adrien Gallouët 5fa1c190d9
rpc : update from common.cpp (#19400)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-08 09:06:45 +01:00
Georgi Gerganov eb449cdfa4
server : improve context checkpoint logic (#19408) 2026-02-08 09:40:04 +02:00
ddh0 5999b50eb0
llama-quantize : cleanup `--help` output (#19317)
* cleanup `llama-quantize --help` output

some much needed TLC

* remove future argument

oops, spoiler

* cleanup of cleanup
2026-02-08 09:22:38 +02:00
Sigbjørn Skjæret 9a5f57795c
ci : remove server job from webui and move slow test (#19424)
* remove server job from webui and move slow test

* use pip-install option
2026-02-08 01:20:00 +01:00
Georgi Gerganov 96441c955e
ci : use -j param correctly when building with sanitizers (#19411)
* ci : use less jobs when building with sanitizers

* cont : fix nproc

* cont : fix the fix

* cont : simplify
2026-02-07 23:50:47 +01:00
Georgi Gerganov 8872ad2125
metal : consolidate bin kernels (#19390)
* metal : refactor bin kernels

* cont

* cont : fix cv
2026-02-07 10:35:56 +02:00
Georgi Gerganov 34ba7b5a2f
metal : fix event synchronization in cpy_tensor_async (#19402) 2026-02-07 07:37:15 +02:00
forforever73 b83111815e
model : support Step3.5-Flash (#19283)
* Support Step3.5-Flash

* fix: norm.weight + 1 (HF zero_centered=true)

* step35: simplify GGUF conversion + drop redundant rope KVs

* Address review feedback

* rename limits -> clamp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* rename swiglu limits -> swiglu clamp in LLM_KV

* avoid CI fail

* Apply suggestions from code review

* Apply suggestions from code review

* disabled KV shifting for LLM_ARCH_STEP35

* Apply suggestions from code review

* mistakenly removed cmath

* add model size && apply missed suggestion

* assert partial_rotary_factors

* fix CI errors:

* load freq_base_swa

---------

Co-authored-by: lvyichen <lvyichen@stepfun.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-06 21:06:14 +01:00
Alex Trotta 3228e77287
gguf-py : bump sentencepiece version (#19319)
* gguf-py: Bump sentencepiece version

There's a new version that's been out for a while that addresses the issues mentioned in https://github.com/ggml-org/llama.cpp/pull/14200. There's a long chain of reasons I would like this change, but the short version is that it allows people who use both `sentencepiece` and `gguf` to take advantage of these fixes. On conda-forge, currently, it locks the version (since there is no notion of optional dependencies).

Regardless, I don't think this should be too controversial.

* review feedback
2026-02-06 21:05:19 +01:00
Abhijit Ramesh 7fbd36c50c
ggml-webgpu: JIT compile binary operators and handle binding overlaps (#19310)
* ggml webgpu: port binary operators to use pre-wgsl

* Add binary.wgsl: unified shader with conditionals for all 4 ops

* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor

* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)

* Update CMake to generate binary operator shaders at build time

* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling

* port binary operators from AOT to pre-wgsl JIT compilation

* add src1=dst overlap handling for binary ops

* use compile-time workgroup size defines instead of runtime overrides

* ggml-webgpu: complete overlap handling for binary ops

* add support for inplace & overlap case in binding setup

* restructure conditional logic to handle all overlap cases

* ensure all buffer bindings are correctly assigned for edge cases

* ggml-webgpu: remove unused binary overlap cases

Remove src0==src1 binary overlap case that never occurs in practice.

* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT

* remove unused src0==src1 and all-same variant

* refactor wgsl to eliminate duplication
2026-02-06 10:33:30 -08:00
Nechama Krashinski 537eadb1b9
sycl: add F16 support for GGML_OP_CEIL (#19306)
* Fix SYCL CEIL operator

* sycl: implement GGML_OP_CEIL
2026-02-06 23:13:44 +08:00
Jeff Bolz db6adb3c88
tests: reduce number of FA test permutations (#19381)
Only test non-F16 for head size 64 and 72 (one a multiple of QK, one not).
2026-02-06 08:50:30 -06:00
Georgi Gerganov dfde5993ea
common : add common_speculative_is_compat() (#19270)
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b62a.

* spec : check if the target context is compatible for spec decoding
2026-02-06 16:47:22 +02:00
Lasse Lauwerys 06bf3796f4
unicode : MSVC regex fix (#19340)
* Fix model loading regex error

* Change comments

* Use const_iterator and remove specializations

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
2026-02-06 15:56:13 +02:00
ymcki 3688c4f504
Kimi-Linear support (backend agnostic + MLA KV cache) (#18755)
* kimi linear model implementation

* kimi linear convert_hf_to_gguf

* kimi linear constants.py tensor_mapping.py

* Kimi Linear ggml.h

* kimi linear ggml-cpu

* Kimi Linear ggml-cuda

* Kimi Linear ggml.c

* kimi linear src/llama

* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning

* remove type mismatch warning

* read MoE params

* removed some hard coded code

* removed all hard code

* use DeepseekV2 tokenizer

* removed unnecessary internal methods called by the old set_vocab of KimiLinear

* rewrite get_vocab for KimiLinear. Removed all kda_scan code

* removed all traces of kda_scan

* reduce OP count by 1 due to removal of kda_scan

* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache

* set n_embd_head_k/v to ensure kv cache works

* don't quantize conv1d of Kimi Linear

* Kimi Linear backend agnostic

* removed LOG_INFO

* naive chunking form implemented

* fixed some comments

* add Kimi-K2 specific tokens to be recognized as EOG

* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682

* replaced Akk and Aqk with mul_mat and clamp

* no clamp version

* Moved Aqk computation out of the loop

* fixed typo and split wkv_b into wk_b and wv_b

* MLA KV cache support

* fix trailing spaces

* moved const llama_model & model; around to follow qwen3next format and see if it cna pass the -Wunused-private-field error

* fix trailing whitespace

* removed traling whitespaces in empty line + make sure indentation is multiple of 4

* try to make lint happy

* remove blank lines to make lint happy

* removed at least blank line containing white space

* fixed flake8 complaints locally

* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement

* removed Kimi-Linear specific change that causes failure at server-windows

* removed private: from kimi_linear to make build checks happy

* removed unnecessary ggml_cont before ggml_reshape

* created static function causal_conv1d to abtract similar code for q/k/v

* merged dt_bias to SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.

* reverted to original

* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.

* remove DT_B from constants.py. remove one comment line in llama-model.cpp

* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. switch the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight

* remove ssm_o_norm_b

* remove ssm_o_norm_b

* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k

* removed all ggml_cont b4 ggml_reshape_4d

* Whitespace

* replaced all hparams.get with find_hparams

* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp

* use is_mla to switch between different mem_hybrid types

* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC

* removed if else for required parameters kv_lora_rank and qk_rope_head_dim

* add back ggml_cont for Vcur

* minor changes

* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp

* f16 gguf cannot run without context length

* made a mistake of adding back n_ctx parsing

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-06 11:39:58 +01:00
Jeff Bolz 1946e46f4c
vulkan: For coopmat2 FA, use fp16 accumulators for the final result (#19376)
The cpu and cuda backends use fp16 for the VKQ accumulator type, this change
does the same for vulkan. This helps particularly with large head sizes which
are very register-limited.

I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.

I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
2026-02-06 09:15:13 +01:00
Jeff Bolz f9bd518a6b
vulkan: make FA mask/softcap enables spec constants (#19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-06 08:49:58 +01:00
Georgi Gerganov 7fcf1ef45d
metal : skip loading all-zero mask (#19337)
* metal : skip loading all-zero mask

* cont : minor
2026-02-06 09:25:11 +02:00
Daniel Bevenius e696cfc016
llama : rename llama-sampling to llama-sampler (#19363)
This commit addresses the TODO in llama-sampling.h to rename that header
and the implementation to llama-sampler.
2026-02-06 07:26:54 +01:00
Georgi Gerganov 3e21647666
cuda : cuda graphs now compare all node params (#19383) 2026-02-06 07:55:06 +02:00
Georgi Gerganov 22cae83218
metal : adaptive CPU/GPU interleave based on number of nodes (#19369) 2026-02-05 19:07:22 +02:00
Jeff Bolz 449ec2ab07
vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (#19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-05 09:26:38 -06:00
Georgi Gerganov 3795cc1e89
benches : update models + numbers (#19359)
* bench : update script

* benches : update numbers
2026-02-05 14:34:07 +02:00