* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.
* llama : update worst-case graph for unified cache
* ci : disable op offload in some tests
* fix spelling
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
As we only support static graphs for the time being and we don't know the
size of the output of top-p, we have to do the same value-scaling as for the
min-p operator.
Further improvements can be made to the unit test (i.e. checking that top_p
on the backend is equivalent to top_p on the CPU), and to the implementation
by constructing candidates and sorting those as opposed to reversing the
sort of the logits (this would be arange + get_rows instead of
argsort + get_rows).
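A minimal CPU sketch of this fixed-shape, value-scaling approach (plain C++ illustration, not the actual ggml graph): instead of truncating the candidate list, logits outside the nucleus are pushed to -INFINITY so the output keeps the full vocabulary size.
```
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Mask-based top-p: keep the smallest set of tokens whose cumulative
// probability reaches p, push everything else to -INFINITY.
void top_p_mask(std::vector<float> & logits, float p) {
    const size_t n = logits.size();

    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(), [&](int a, int b) { return logits[a] > logits[b]; });

    // softmax over the sorted logits
    std::vector<float> probs(n);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        probs[i] = std::exp(logits[idx[i]] - logits[idx[0]]);
        sum += probs[i];
    }

    // mask everything after the cumulative probability crosses p
    float cum  = 0.0f;
    bool  keep = true;
    for (size_t i = 0; i < n; ++i) {
        if (!keep) {
            logits[idx[i]] = -INFINITY;
        }
        cum += probs[i] / sum;
        if (cum >= p) {
            keep = false; // tokens past this point fall outside the nucleus
        }
    }
}
```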
This commit updates the backend sampling implementation to support
intermixed usage of backend and CPU samplers within the same batch.
The initial implementation was developed as an all-or-nothing solution:
either perform backend sampling for the entire batch, or perform CPU
sampling for the entire batch.
The motivation for this change is to support batches with mixed
sequences. For example, we may have a backend sampler configured for
sequence 0, while sequence 1 in the same batch uses CPU sampling. This
was not supported in the initial implementation.
This issue manifested in llama-server with the webui: decoding with
backend samplers would work initially, but after changing to CPU
sampling, a slot (sequence) could still be using a backend sampler.
This meant that logits in output_reserve would not be allocated,
resulting in an error.
The solution in this commit inspects the batch to determine which
sampling modes are needed and allocates buffers accordingly. However,
there is a known inefficiency: when we have intermixed backend/CPU
samplers in the same batch, we currently copy all logits to the host,
even for sequences using backend samplers.
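A minimal sketch of the batch inspection described above (hypothetical helper names, not the actual llama.cpp code):
```
#include <functional>
#include <vector>

// Hypothetical sketch: decide which output buffers a batch needs. The
// predicate stands in for however the context tracks per-sequence backend
// sampler chains; it is not a real llama.cpp API.
struct output_plan {
    bool need_backend_sampling = false; // at least one sequence samples on the backend
    bool need_host_logits      = false; // at least one sequence samples on the CPU
};

output_plan plan_outputs(const std::vector<int> & seq_ids,
                         const std::function<bool(int)> & has_backend_sampler) {
    output_plan plan;
    for (int seq_id : seq_ids) {
        if (has_backend_sampler(seq_id)) {
            plan.need_backend_sampling = true;
        } else {
            plan.need_host_logits = true;
        }
    }
    // If both flags end up set, the batch is mixed; currently all logits are
    // still copied to the host in that case, even for sequences that were
    // sampled on the backend.
    return plan;
}
```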
Added test_backend_cpu_mixed_batch to verify correct behavior with
mixed backend/CPU samplers in a single batch, including dynamic
sampler switching between decode calls.
* vulkan: Implement top-k
Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discard all but the top K; this repeats until only K elements are left. There is
also a fast path when K==1 that simply finds the max value rather than sorting
(see the pass-schedule sketch after this group of changes).
* fix pipeline selection
* vulkan: Add N-ary search algorithm for topk
* microoptimizations
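The multi-pass scheme described above can be illustrated with a small host-side sketch of the pass schedule (block size and element counts are illustrative; the K==1 fast path is omitted):
```
#include <cstdio>

// Sketch of the multi-pass top-k schedule: each pass sorts blocks of
// block_size elements per workgroup and keeps only the top k of each block,
// until a single block remains.
int main() {
    const int k          = 40;
    const int block_size = 1024; // 2^N elements sorted per workgroup
    int       n          = 200000; // candidates remaining

    int pass = 0;
    while (n > block_size) {
        const int workgroups = (n + block_size - 1) / block_size;
        n = workgroups * k; // each workgroup discards all but its top k
        printf("pass %d: %d workgroups, %d candidates remain\n", ++pass, workgroups, n);
    }
    printf("final pass: 1 workgroup sorts %d elements and keeps the top %d\n", n, k);
    return 0;
}
```
With 200000 elements, a block size of 1024 and K=40, this settles in three passes.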
This commit fixes the issue where both sampled tokens and logits/probs
were not being copied correctly from the backend to the host when
multiple backend samplers were used.
A test for this scenario has also been added to ensure that both types
of data are copied correctly when different backend samplers are
employed.
* vulkan: support larger argsort
This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory; when more than 1024 threads are needed
it runs multiple workgroups and synchronizes through a pipeline barrier.
To improve the memory access pattern, a copy of the float value is kept with
the index value (sketched below). I've applied this same change to the original
shared memory version of the shader, which is still used when ncols <= 1024.
* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost
* reduce loop overhead
* run multiple cols per invocation, to reduce barrier overhead
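A plain C++ sketch of the (value, index) pairs used for the improved memory access pattern (an illustrative reference of the compare-exchange, not the actual GLSL):
```
#include <cstdint>

// Carrying a copy of the float next to its index keeps each compare-exchange
// to one contiguous 8-byte load/store per element instead of a gather back
// into the source row. Names are illustrative.
struct sort_elem {
    float    val;
    uint32_t idx;
};

// One bitonic compare-exchange step between positions i and j of the scratch
// array (j = i ^ stride in the usual bitonic indexing).
void compare_exchange(sort_elem * scratch, int i, int j, bool ascending) {
    const bool in_order = ascending ? (scratch[i].val <= scratch[j].val)
                                    : (scratch[i].val >= scratch[j].val);
    if (!in_order) {
        const sort_elem tmp = scratch[i];
        scratch[i] = scratch[j];
        scratch[j] = tmp;
    }
    // In the shader, passes whose stride crosses workgroup boundaries are
    // synchronized through a pipeline barrier between dispatches rather than
    // a workgroup barrier.
}
```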
* Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition
* Argh.
* Making CISC happy ;)
* Integrate CONT tests
* Use loopy loop
* Skip new tests for (B)F16 for now.
* Add files via upload
* fix unit test
* fix crashes for --reasoning-format=none
* Patch buggy official MiniMax-M2 chat template
* add upstream minja fix: https://github.com/ochafik/minja/pull/7
* Fix <think> token not generated
* add test copied from https://github.com/ggml-org/llama.cpp/pull/16946
* cleanup
* Hopefully fixes the compilation error on CI
* Delete chat template patching since it’s fixed by upstream Minja
* Remove unneeded MiniMax-M2 template patch
https://github.com/ochafik/minja/pull/7#issuecomment-3480356100
* Add proper handling of optional parameters with test
merged tests from: 23d4bb75c4
* Fix making all tool parameters optional
* Move xml tool parser to separate file
* cleanup & add tests for GLM4.5
* add streaming tests & enhancement & cleanups
Add streaming test for both GLM 4.5 and minimax-m2.
Cleanup for preserved_tokens.
Cleanup for grammar rule name.
Enhance the parser's stability.
* cleanup & add support for Kimi-K2 Qwen3-Coder Apriel-1.5 Xiaomi-MiMo
* apply suggestions from reviewers
* fix a misuse of data.grammar_lazy
* fix grammar when a tool has no arguments
* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6. Insert additional stops for Kimi-K2
* update chat.cpp
* fix grammar for GLM 4.5/4.6
* Try fix Jinja template for GLM
* Try fix GLM-4.6.jinja
* Update common/chat-parser-xml-toolcall.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* improve chat template for GLM, rename Kimi-K2 template to Kimi-K2-Thinking
* Improve Kimi-K2 chat template
* Fix unit test
* Fix "Invalid tool call arguments passed" in a rare case.
In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Argsort is currently used for top-k. We optimize argsort in two ways:
1. Use `DeviceRadixSort` for the single-row/sequence case to parallelize it
across the SMs
2. Use `DeviceSegmentedSort` for the multi-row/sequence case, as this is the
correct entrypoint (the function chooses between different execution paths,
one of which is `DeviceSegmentedRadixSort`, and picks the best one according
to heuristics):
https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview
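A sketch of the two CUB entry points (not the exact llama.cpp kernel code): keys are the logits, values are the element indices, both sorted in descending order; the temporary storage is assumed to have been queried and allocated beforehand via the usual nullptr call.
```
#include <cub/cub.cuh>

// Illustrative dispatch between the two CUB sorters. Error handling omitted.
void argsort_desc(const float * d_keys_in, float * d_keys_out,
                  const int   * d_vals_in, int   * d_vals_out,
                  int ncols, int nrows,
                  const int * d_row_offsets, // nrows + 1 segment offsets
                  void * d_temp, size_t temp_bytes, cudaStream_t stream) {
    if (nrows == 1) {
        // Single row/sequence: DeviceRadixSort parallelizes one large sort
        // across the whole GPU.
        cub::DeviceRadixSort::SortPairsDescending(
            d_temp, temp_bytes,
            d_keys_in, d_keys_out, d_vals_in, d_vals_out,
            ncols, 0, sizeof(float) * 8, stream);
    } else {
        // Multiple rows/sequences: DeviceSegmentedSort picks the best
        // per-segment path (including DeviceSegmentedRadixSort) heuristically.
        cub::DeviceSegmentedSort::SortPairsDescending(
            d_temp, temp_bytes,
            d_keys_in, d_keys_out, d_vals_in, d_vals_out,
            ncols * nrows, nrows,
            d_row_offsets, d_row_offsets + 1, stream);
    }
}
```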
Some perf numbers for an RTX PRO 6000:
On the kernel level, tested with
`GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run
```
After:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run
```
---
On the model level, tested with
`llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is
the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling`
Before:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print: load time = 18215.58 ms
llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second)
llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second)
llama_perf_context_print: total time = 857.62 ms / 206 tokens
```
After:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print: load time = 18366.92 ms
llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second)
llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second)
llama_perf_context_print: total time = 683.65 ms / 206 tokens
```
This commit enables all existing backend sampler tests in
test-backend-sampler. Previously, some tests were disabled because some
ggml operation implementations were missing.
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.
The motivation for this feature is to allow some or all of the sampling to
be performed directly on the backend as part of the computation graph being
executed.
For example, the backend sampler chain might select/sample a token
directly, in which case only the sampled token needs to be transferred
from device memory to host memory.
It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.
Currently, backend sampling works in a similar manner to pooling: it is a
function called by build_graph, and the sampler operations become part of
the model's computation graph.
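As a minimal illustration of the idea (assumed helper and variable names, greedy sampling only; not the actual llama.cpp graph-building code):
```
#include "ggml.h"

// Append a greedy sampler to the model graph so that only the sampled token
// id has to be read back from the device after graph compute.
static struct ggml_tensor * append_greedy_sampler(
        struct ggml_context * ctx,
        struct ggml_cgraph  * gf,
        struct ggml_tensor  * logits) { // [n_vocab, n_outputs]
    // Greedy sampling as a graph op: one i32 token id per output row.
    struct ggml_tensor * sampled = ggml_argmax(ctx, logits);
    ggml_build_forward_expand(gf, sampled);
    return sampled; // read back with ggml_backend_tensor_get() after compute
}
```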
* extract rotate_pairs logic from ggml_compute_forward_rope_f32
* templateify ggml_compute_forward_rope_f32 and _f16
* abort when rope type not supported, remove GLM from test-rope
* add imrope branch to switch
* add rope tests for perf
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
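For reference, the rotation that the extracted rotate_pairs helper applies per dimension pair is the standard RoPE 2-D rotation; a minimal sketch with assumed names (not the exact ops.cpp signature):
```
#include <cmath>

// Rotate one (x0, x1) pair by the angle theta, as done for every rotated
// dimension pair in RoPE. Signature and names are assumptions.
void rotate_pair(float theta, float x0, float x1, float & dst0, float & dst1) {
    const float cos_theta = cosf(theta);
    const float sin_theta = sinf(theta);

    dst0 = x0 * cos_theta - x1 * sin_theta;
    dst1 = x0 * sin_theta + x1 * cos_theta;
}
```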
* vulkan : implement upscale with bicubic interpolation
* cuda : implement upscale with bicubic interpolation
* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests
* adapt OpenCL backend to not support the OP in that case so tests don't fail
* print scale mode & flags in test-backend-ops
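For reference, the bicubic interpolation added above is built from a cubic convolution weight; a minimal sketch of that weight function (the coefficient a = -0.75 is an assumption matching the common PyTorch-style variant, not necessarily the exact value used by these kernels):
```
#include <cmath>

// Cubic convolution weight w(t): each output pixel is a weighted sum of a
// 4x4 neighborhood, with weights w(dx)*w(dy) computed from the distance to
// the sample points. a = -0.75 is an assumed value, not confirmed from the
// vulkan/cuda kernels.
float bicubic_weight(float t, float a = -0.75f) {
    t = fabsf(t);
    if (t <= 1.0f) {
        return (a + 2.0f)*t*t*t - (a + 3.0f)*t*t + 1.0f;
    }
    if (t < 2.0f) {
        return a*t*t*t - 5.0f*a*t*t + 8.0f*a*t - 4.0f*a;
    }
    return 0.0f;
}
```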
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
* WIP
* added a cpy kernel specific to transposed tensors which uses shared memory (smem) to avoid uncoalesced accesses; test cases also added showing improved memory bandwidth (a minimal tile-transpose sketch follows this commit)
* added BF16 support
* more strict check to make sure src0 is a transpose
* reformulated to handle more complicated transpose cases
* bring back 2D transpose for higher performance
* allow build on windows
* transpose copy more shapes
* minor tweak
* final clean up
* restore some test cases
* keep only the kernel for the true transposed case; updated with review suggestions
* make CI happy
* remove headers not needed
* reduced bank conflicts for fp16 and bf16
* add missing const*
* now bank conflicts free
* use padding instead of swizzling
---------
Co-authored-by: bssrdf <bssrdf@gmail.com>
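A minimal CUDA sketch of the padded shared-memory tile transpose this commit describes (illustrative only, not the llama.cpp kernel): the +1 padding column removes shared-memory bank conflicts on the transposed access while keeping both the global load and the global store coalesced.
```
#define TILE_DIM 32

// Launch with dim3(TILE_DIM, TILE_DIM) threads per block and one block per
// TILE_DIM x TILE_DIM tile of the source.
__global__ void transpose_tile_f32(const float * __restrict__ src,
                                   float * __restrict__ dst,
                                   int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // padding instead of swizzling

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = src[y * width + x]; // coalesced load
    }

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x; // swap block indices for the store
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width) {
        dst[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced store
    }
}
```
Padding the tile by one column shifts each row into a different bank pattern, which is why it can replace the earlier swizzling approach.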
* tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage
* tests: init gf and filter out fusion tests for support mode
* tests: filter out fusion cases before calling eval_support
* tests: filter out fusion cases from show_test_coverage as well, fix lint
* clip : use FA
* cont : add warning about unsupported ops
* implement "auto" mode for clip flash attn
* clip : print more detailed op support info during warmup
* cont : remove obsolete comment [no ci]
* improve debugging message
* trailing space
* metal : remove stray return
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : support unified context across slots
* cont : fix speculative decoding initialization
* context : fix n_ctx_per_seq computation
* server : purge slots one by one
* tests : add unified cache server tests
* llama : update per-seq context computation
* test-thread-safety : handle tiny training context of the input model
* server : fix server_tokens clear()
* server : use 4 slots + unified KV by default
* llama : add note about context size queries
* cont : update todos [no ci]
* context : do not cap the size of the context
* tests : adjust parameters to be CI friendlier
* context : add warning
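A sketch of the per-sequence context computation these commits point at (an assumption inferred from the commit titles, not the actual llama.cpp code): with a unified KV cache all slots share the full context window, otherwise the context is split across the configured number of sequences.
```
// Assumed sketch, not the actual implementation: how a per-sequence context
// size could be derived from the total context and the unified-KV setting.
int n_ctx_per_seq_sketch(int n_ctx, int n_seq_max, bool kv_unified) {
    // unified KV cache: all slots share one buffer of n_ctx tokens
    // split KV cache  : each of the n_seq_max slots gets an equal share
    return kv_unified ? n_ctx : n_ctx / n_seq_max;
}
```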