Commit Graph

7438 Commits

Author SHA1 Message Date
Daniel Bevenius 9ad6522be6
squash! sampling : support intermixed backend/cpu samplers
Add check that logits is not null which is can happen for embeddings.
2025-11-28 08:57:48 +01:00
Daniel Bevenius 74be332e24
sampling : support intermixed backend/cpu samplers
This commit updates the backend sampling implementation to support
intermixed usage of backend and CPU samplers within the same batch.

The initial implementation was developed as an all-or-nothing solution:
either perform backend sampling for the entire batch, or perform CPU
sampling for the entire batch.

The motivation for this change is to support batches with mixed
sequences. For example, we may have a backend sampler configured for
sequence 0, while sequence 1 in the same batch uses CPU sampling. This
was not supported in the initial implementation.

This issue manifested in llama-server with the webui: decoding with
backend samplers would work initially, but after changing to CPU
sampling, a slot (sequence) could still be using a backend sampler.
This meant that logits in output_reserve would not be allocated,
resulting in an error.

The solution in this commit inspects the batch to determine which
sampling modes are needed and allocates buffers accordingly. However,
there is a known inefficiency: when we have intermixed backend/CPU
samplers in the same batch, we currently copy all logits to the host,
even for sequences using backend samplers.

Added test_backend_cpu_mixed_batch to verify correct behavior with
mixed backend/CPU samplers in a single batch, including dynamic
sampler switching between decode calls.
2025-11-28 08:38:05 +01:00
yulo 6bca76ff5e
HIP: enable mul_mat_f for RDNA4 (#17437)
* enable mmf for rdna4

* move some mmvf to mmf

* revert lds128 for wmma loading

* Revert "revert lds128 for wmma loading"

This reverts commit db9ae8b6b4.

* Revert "enable mmf for rdna4"

This reverts commit 698c9f2418.

* Revert "move some mmvf to mmf"

This reverts commit 99b92bd665.

* enable mul_mat for rdna4

---------

Co-authored-by: zhang hui <you@example.com>
2025-11-28 08:24:30 +01:00
Piotr Wilkin (ilintar) cd0e3a7a3b
SOLVE_TRI CUDA kernel for small matrices (#17457) 2025-11-28 12:15:32 +08:00
Neo Zhang Jianyu efaaccdd69
refactor pad_reflect_1d to make the UT case pass (#17204)
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
2025-11-28 08:50:56 +08:00
Oliver Simons f9889cf1c7 Fix top-k comp & behavior for non-CUB path
Some changes were made in 5ea3be265b
which were incomplete. In the case of non-CUB, bitonic sort and its
limitations of ncols < 1024 have to apply, similar to argsort.cu
2025-11-27 16:40:41 +01:00
Jeff Bolz 4abef75f2c
vulkan: Implement SOLVE_TRI (#17486)
* vulkan: Implement SOLVE_TRI

* load B matrix through shared memory

* use FLOAT_TYPE
2025-11-27 15:48:00 +01:00
Georgi Gerganov c386114922
arch : add description about LLM_TENSOR_INFOS (#17550) 2025-11-27 16:34:13 +02:00
Daniel Bevenius e9d070980b
sampling : remove backend sampling chain from common_sampler
This commit removes the backend sampling chain from the common_sampler
structure and related functions.

The motivation for this change is that the backend samplers are not
currently set on the context, and if they are they would cause the
a graph reallocation to occur. Instead, the intialization is handled
like it currently is by llama_context's constructor.
2025-11-27 15:28:37 +01:00
Georgi Gerganov 6783b11fb0
models : fix LFM2 tensors (#17548) 2025-11-27 16:04:29 +02:00
Daniel Bevenius 172208afbf
sampling : add comments about backend sampler [no ci]
This commit adds a comment to llama_context's constructor explaining why
backend samplers are initialized early in the process.
2025-11-27 14:59:52 +01:00
matt23654 909072abcf
cuda : fix UMA detection on discrete GPUs. (#17537) 2025-11-27 13:35:35 +02:00
Alberto Cabrera Pérez cd8370b408
ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) (#17494)
* Enabled q4_K_4x8 path

* Fixed generic Q4_K 8x4 implementation

* wip: dotprod gemm

* Working arm q4_K dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Undo acc rename

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Q4_K arm dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix: q4_qs reinterpret from uint to int

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Removed comments

* Fixed macro guards

* Fixed unused vars in generic implementation

* Fixed unused vars in 8x4 repack

* Fixed unused vars in generic implementation, unneeded comment

* Missing arch fallback for x86

* minor : style

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-27 13:25:14 +02:00
Eric Curtin d21a76ac38
devops: Add build-essential to Ubuntu 26.04 image (#17531)
This is no longer passing the build, needs more packages.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-11-27 18:35:47 +08:00
Aleksei Nikiforov 4fcd87cf7c
gguf-py : skip endian-conversion of MXFP4 data (#17523)
* gguf_convert_endian.py: skip MXFP4 data

* Use gguf.constants.GGML_QUANT_SIZES to determine block sizes
2025-11-27 11:35:38 +01:00
Daniel Bevenius 5ea3be265b
cuda : fix top-k compilation when CUB is unavailable
This commit adds a macro guard around argsort_f32_i32_cuda_cub usage
in the top-k fallback path, falling back to bitonic sort when
GGML_CUDA_USE_CUB is not defined.

The motivation for this is that some environments like AMD HIP
do not have CUB available, causing compilation failure.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/19728226426/job/56523606840#step:6:208
2025-11-27 09:40:13 +01:00
Daniel Bevenius 51107a0b63
sampling : fix temperature check to allow zero temperature
This commit modifies the temperature sampling check to allow a
temperature value of zero. Previously, the check only allowed
positive temperature values, which excluded the valid case of
zero temperature.

The motivation for this is to enable a zero temperature setting which is
also currently causing the following test to fail:
```console
(venv) $ cd tools/server/tests
(venv) $ ./tests.sh unit/test_basic.py::test_load_split_model
```
2025-11-27 09:18:43 +01:00
Daniel Bevenius d9d736102b
sampling : use argmax for min-p sampling 2025-11-27 07:38:44 +01:00
Acly b78db3bd50
vulkan : move contiguous checks to device_supports_op (#17490)
* vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op

* im2col: remove contraints on src0 (kernel input)
2025-11-27 06:54:19 +01:00
Jeff Bolz 142df17c9c
vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (#17514) 2025-11-27 06:32:30 +01:00
Xuan-Son Nguyen e509411cf1
server: enable jinja by default, update docs (#17524)
* server: enable jinja by default, update docs

* fix tests
2025-11-27 01:02:50 +01:00
lhez 7cba58bbea
opencl: add sqr, sqrt, mean and ssm_conv (#17476)
* opencl: add sqr

* opencl: add sqrt

* opencl: add mean

* opencl: add ssm_conv

* opencl: add missing cl_khr_fp16

* opencl: do sqrt in f32 then convert to f16 for better precision
2025-11-26 13:29:58 -08:00
Alberto Cabrera Pérez 5449367b21
Fix chunks being too small with small matrix sizes (#17526) 2025-11-26 13:14:54 -08:00
Han Qingzhe 1d594c295c
clip: (minicpmv) fix resampler kq_scale (#17516)
* debug:"solve minicpmv precision problem"

* “debug minicpmv”

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-11-26 21:44:07 +01:00
Daniel Bevenius 7c2bfb352e
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-26 17:52:29 +01:00
Daniel Bevenius 90a3aff2c2
cuda : fix editorconfig-checker warning 2025-11-26 17:44:04 +01:00
Jeff Bolz eec1e33a9e
vulkan: allow graph_optimize for prompt processing workloads (#17475) 2025-11-26 16:46:33 +01:00
Jeff Bolz 879d673759
vulkan: Implement top-k (#17418)
* vulkan: Implement top-k

Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.

* fix pipeline selection

* vulkan: Add N-ary search algorithm for topk

* microoptimizations
2025-11-26 16:45:43 +01:00
Daniel Bevenius 0f7805f32a
common : add get_active_samplers function to check enabled samplers
This commit adds a function to check if a sampler is actually enabled,
meaning that it does not have values that disables its effect. This is
then used by the backend samplers initialization to avoid considering
samplers that are not enabled when determining the split point between
them.

The motivation for this is that this allows the default sampler chain
for `--samplers` to be used and any sampler that is not enabled will not
cause the backend samplers to be skipped.
For example, before this change if the penalties sampler was included in
the samplers list but had default values that disable it, it would cause
the backend samplers to be skipped entirely.

This commit also contains some refactoring to remove some code
duplication.
2025-11-26 15:46:33 +01:00
Oliver Simons 4fea191c66 Use `FetchContent` over CPM as it's bundled with CMake
Thanks @ggerganov for the suggestion
2025-11-26 15:30:37 +01:00
xctan 6ab4e50d9c
ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16 (#17448)
* ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16

* ggml-cpu : dedup scalar impl

* Update ggml/src/ggml-cpu/vec.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-26 15:33:05 +02:00
Adrien Gallouët 2336cc4784
cmake : use EXCLUDE_FROM_ALL to avoid patch-boringssl.cmake (#17520)
We have to separate the code path starting 3.28 because
`FetchContent_Populate` is now deprecated and will be completely removed
in a future version.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-26 15:15:21 +02:00
Adrien Gallouët e6923caaec
ggml : fix ARM feature verification (#17519)
On arm64 with `cmake` version 3.31.6, the final feature verification fails:

    -- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs
    -- Performing Test GGML_MACHINE_SUPPORTS_dotprod
    -- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
    -- Performing Test GGML_MACHINE_SUPPORTS_i8mm
    -- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
    -- Performing Test GGML_MACHINE_SUPPORTS_sve
    -- Performing Test GGML_MACHINE_SUPPORTS_sve - Success
    -- Performing Test GGML_MACHINE_SUPPORTS_sme
    -- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
    -- Performing Test GGML_MACHINE_SUPPORTS_nosme
    -- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
    -- Checking for ARM features using flags:
    --   -U__ARM_FEATURE_SME
    --   -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
    -- Performing Test HAVE_DOTPROD
    -- Performing Test HAVE_DOTPROD - Failed
    -- Performing Test HAVE_SVE
    -- Performing Test HAVE_SVE - Failed
    -- Performing Test HAVE_MATMUL_INT8
    -- Performing Test HAVE_MATMUL_INT8 - Failed
    -- Performing Test HAVE_FMA
    -- Performing Test HAVE_FMA - Success
    -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC
    -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed
    -- Performing Test HAVE_SME
    -- Performing Test HAVE_SME - Failed
    -- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme

We need to explicitly replace `;` with spaces from the list to make
`CMAKE_REQUIRED_FLAGS` work correctly...

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-26 15:14:41 +02:00
Jiacheng (Jason) Chen 3e18dba9fd
HIP: Patch failed testcase in WMMA-MMQ kernels for RDNA 4 (#17502)
* patch failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) for enabling WMMA on RDNA4

* Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162
2025-11-26 11:18:48 +01:00
Daniel Bevenius b45d504e70
sampling : add min-p backend sampler 2025-11-26 10:50:58 +01:00
hipudding eeb5605de2
CANN: Add MROPE and IMROPE support (#17401)
* CANN: ROPE supports both MROPE and IMROPE.

1. Optimize the caching logic of rope_cache_init.
2. Add support for mRoPE and i-mRoPE.

Note that on Ascend 910B devices, it is necessary to disable FA
in CLIP and disable NZ-format conversion. These two issues are
still under investigation.

* Resolve review comments
2025-11-26 16:44:19 +08:00
o7si f3a848a3b1
chore: upgrade cpp-httplib from v0.27.0 to v0.28.0 (#17513) 2025-11-26 09:21:06 +02:00
Jeff Bolz b3b03a7baf
vulkan: Implement GGML_OP_CUMSUM (#17479) 2025-11-26 07:08:10 +01:00
Oliver Simons f23b306cc5 CUDA: Add top-k implementation 2025-11-25 15:25:25 +01:00
Daniel Bevenius ec047e12ee
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 15:16:44 +01:00
Georgi Gerganov 583cb83416
ggml : add ggml_top_k (#17365)
* ggml : add ggml_top_k

* cont : add ggml_argsort_top_k

* metal : add top_k support

* ggml : cleanup

* tests : add virtual err() function for test_case

* ggml : add comments
2025-11-25 15:31:43 +02:00
Aleksei Nikiforov 05872ac885
convert : fix big-endian conversion (#17431)
* Fix convert_hf_to_gguf.py script on s390x

Assume converted model data is originally little-endian.
Byteswap data on s390x after reading it to put values in correct presentation
for any transformation needed, like calculating weight tensors.

Then byteswap data to little-endian before passing it to GGUFWriter while
GGUFWriter will byteswap data back to big endian if big endian output is requested.

byteswap(inplace=True) calls don't work with lazy tensor and array wrappers.
Use byteswap with copying data to workaround this behaviour.

* Make GGUFWriter accept tensors in native endianness instead of little-endian

With this change if no byteswapping is actually needed, 2 excessive byteswaps can be omitted on s390x

* Fix byteswapping in convert_hf_to_gguf.py for remote models
2025-11-25 14:18:16 +01:00
Daniel Bevenius 9e5e09d087
sampling : remove backend-dist option (wip)
This commit removes the `--backend-dist` option and instead uses the
configured --samplers chain to determine which samplers run on the
backend.

Backend sampling is still enabled using With `--backend_sampling`, and
the sampler chain, either explictly specified using `--samplers` or the
default, is automatically analyzed to determine which samplers can run
on the backend. The system finds the longest contiguous chain of
backend supported samplers from the start of the sampler sequence.
For example:

* If the chain is `top-k -> temperature -> top-p`, and both `top-k` and
  `temperature` are backend-supported but `top-p` is not, then `top-k`
  and `temperature` will run on the backend, while `top-p` and
  subsequent samplers run on the CPU.

* If all configured samplers are supported, the final distribution
  sampling will also happen on the backend, transferring only the
  sampled token IDs back to the host.

* If the sampler chain starts with an unsupported sampler (e.g.,
  `penalties`), all sampling runs on the CPU. Note that this is
  currently the case with the default sampler so to use backend sampling
  it is required to specify a sampler chain. See below for an example.

The following shows how llama-cli can be run with backend sampling:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature'
```
In this case the all sampling will happen on the backend since both
`top_k` and `temperature` are supported backend samplers.

To enable a partial backend sampling (hybrid sampling), for example
running `top_k` and `temperature` on the backend and `typ_p` on the CPU
the following sampler chain could be specified:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature;top_p'
```

If this looks good then I'll follow up with updates the llama-cli and
llama-server documentation to reflect these changes.
2025-11-25 14:01:23 +01:00
Diego Devesa 55ab25caf5
codeowners : remove slaren (#17492) 2025-11-25 13:00:23 +01:00
TianHao324 064c90d843
CANN: supports out_prod operator for F32 and F16 (#17406)
Co-authored-by: tianhao <tianhao42@huawei.com>
2025-11-25 17:39:06 +08:00
Daniel Bevenius 53dca56d9b Merge remote-tracking branch 'upstream/master' into gpu-sampling 2025-11-25 08:20:50 +01:00
Daniel Bevenius 0f17ccdee7 examples : add info about hybrid sampling in batched [no ci] 2025-11-25 08:13:23 +01:00
Pascal b1846f1c8e
webui: add rehype plugin to restore HTML in Markdown table cells (#17477)
* webui: add rehype plugin to restore HTML in Markdown table cells

The remark/rehype pipeline neutralizes inline HTML as literal text
(remarkLiteralHtml) so that XML/HTML snippets in LLM responses display
as-is instead of being rendered. This causes <br> and <ul> markup in
table cells to show as plain text.

This plugin traverses the HAST post-conversion, parses whitelisted HTML
patterns (<br>, <ul><li>) from text nodes, and replaces them with actual
HAST element nodes. For lists, adjacent siblings must be combined first
as the AST fragmentation breaks pattern matching.

Strict validation rejects malformed markup, keeping it as raw text.

* chore: update webui build output
2025-11-25 08:01:02 +01:00
Jeff Bolz d414db02d3
vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (#17455) 2025-11-25 07:11:27 +01:00
Daniel Bevenius 2b4c7927ee
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 06:10:33 +01:00