Commit Graph

7175 Commits

Author SHA1 Message Date
Georgi Gerganov b26c7069fb
common : initialize backend samplers 2025-11-24 20:25:44 +02:00
Georgi Gerganov e2d4f0829c
llama-cli : fix dangling reference to sampler config 2025-11-24 19:51:32 +02:00
Daniel Bevenius d0bea21a3c
examples : update batched to use backend sampling
This commit updates the batched example to demonstrate how to use
backend samplers.
2025-11-24 16:37:22 +01:00
Daniel Bevenius 25f33806d3
sampling : add debug log when backend sampler selects token
This commit adds a debug log statement in the llama_sampler_sample
to indicate when a backend sampler has selected a token for a given
index.

The modification helps in tracing the sampling process and understanding
the flow of control when backend samplers are used.
2025-11-24 15:03:41 +01:00
Daniel Bevenius 8eb9b4769d
sampling : remove redundant checks for stride and size [no ci] 2025-11-24 13:53:29 +01:00
Daniel Bevenius 4a90583d7d
sampling : cleanup and clarify output_reserve 2025-11-24 13:26:18 +01:00
Daniel Bevenius d88ba1813c
common : remove build-info.cpp from commit [no ci]
This file was generated during the build process and should not be
included in previous commits.
2025-11-24 09:31:14 +01:00
Daniel Bevenius 7816f0bb56
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-24 07:44:06 +01:00
Daniel Bevenius 50d21aa4a4
tests : cleanup test-backend-sampler.cpp 2025-11-24 07:18:39 +01:00
william pan 4902eebe33
models : Added support for RND1 Diffusion Language Model (#17433)
* Converted RND1 model to GGUF weights

* RND1 llama.cpp support v1

* RND1 llama.cpp support v2 non causal bug

* RND1 llama.cpp support v3 doccumentation

* RND1 llama.cpp support v4 clean code

* linting issues

* RND1 pr fixes v1

* RND1 pr fixes v2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Diffusion documentation edits

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-24 14:16:56 +08:00
Max Krasnyansky 923ae3c619
hexagon: add support for ROPE_NEOX (#17458) 2025-11-23 18:55:56 -08:00
Raul Torres 01ad35e6d6
CANN: Define `cann_graph_update_required` before macro (#17434)
**Description of the problem**

`cann_graph_update_required` is redundantly defined and
initialized as `false` inside two mutually exclusive macro branches.

**Proposed solution**

Define it right before the macro so that it could serve both
branches.
2025-11-24 10:02:52 +08:00
M. Mediouni fcb013847c
ggml-hexagon: Initial Hexagon v68/v69 support (#17394)
* ggml-hexagon: fix build error with GCC

Add stdexcept include to fix GCC build errors

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>

* ggml-hexagon: check VTCM acquire failures

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>

* ggml-hexagon: disable destination bypass on older than v73

v68 errors out if having bypass enabled when the VTCM is the destination.

At least on v68 this made things actually work... not a proper fix though, so to look at later...

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>

* ggml-hexagon: add initial v68/v69 support

v68 is the Hexagon revision notably used on the Snapdragon 8cx
Gen 3 and the QCM6490.

Also add support for v69.

8MB isn't a supported page size, so relax asked for page size constraint
for HAP_compute_res_attr_set_vtcm_param_v2 to optimal.

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>

---------

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
2025-11-23 16:54:49 -08:00
nullname d5bc1ad110
ggml-hexagon: add `hex_supported_buffer` for better buffer supported check (#17212)
* hexagon: add buffer support checks for hexagon sessions

* refactor: simplify buffer support checks in hexagon operations

* hexagon: update buffer support checks to use tensor structure

* refactor: streamline buffer initialization for DSP queue in hexagon operations

* refactor: simplify buffer initialization in DSP queue for hexagon operations

* refactor: optimize hex_supported_buffer function by fold expression

* wip

* refactor: simplify dspqueue_buffers_init function and its usage in hexagon operations

* fix: improve nan handling at hvx_vec_fast_sigmoid_fp32_guard

* refactor: optimize hvx_vec_inverse_fp32_guard for better nan handling

* refactor: update hvx_vec_fast_sigmoid_fp32_guard to use adjusted exponent limits

* refactor: modify hvx_vec_fast_sigmoid_fp32_guard to accept parameters for improved flexibility

* refactor: update hvx_vec_exp_fp32_guard to accept max_exp and inf parameters to save some instructions

* refactor: move hvx_vec_inverse_fp32_guard implementation to hvx-inverse.c for better perf
2025-11-23 14:26:36 -08:00
Pascal 0c7220db56
webui: minor settings reorganization and add disable autoscroll option (#17452)
* webui: added a dedicated 'Display' settings section that groups visualization options

* webui: added a Display setting to toggle automatic chat scrolling

* chore: update webui build output
2025-11-23 18:42:00 +01:00
Daniel Bevenius 9e273f7aa4
sampling : fix copying both sampled tokens and logits/probs from backend
This commit fixes the issue where both sampled tokens and logits/probs
were not being copied correctly from the backend to the host when
multiple backend samplers were used.

A test for this scenario has also been added to ensure that both types
of data are copied correctly when different backend samplers are
employed.
2025-11-23 13:12:01 +01:00
Daniel Bevenius ae23d2d2c1
sampling: clarify candidate ids usage in comments 2025-11-23 11:28:19 +01:00
Daniel Bevenius 65500d05ab
sampling : add stride variable for clarity 2025-11-23 11:27:54 +01:00
Sigbjørn Skjæret 96ac5a2329
cuda : support non-contiguous i32 to i32 copy (#17326)
* support non-contiguous i32 to i32 copy

* add tests

* rename cpy_flt to cpy_scalar and reindent params
2025-11-23 11:13:34 +01:00
Eric Curtin bc809e9c53
vulkan: Update docker image to Ubuntu 26.04 to enable glslc features (#17439)
26.04 provides these

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-11-23 10:29:36 +01:00
Jeff Bolz 54d83bbe85
vulkan: remove a couple unnecessary switches (#17419) 2025-11-23 06:29:40 +01:00
Adrien Gallouët 4949ac0f18
ci : switch to BoringSSL on Server workflow (#17441)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-22 21:38:19 +01:00
Masato Nakasaka 3f3a4fb9c3
Revive MUL_MAT_ID to perf testing (#17397) 2025-11-22 10:55:43 +01:00
yulo 028f93ef98
HIP: RDNA4 tensor core support for MMF (#17077)
* mmf for rdna4

* align the padding for rdna4

* forbit mul_mat_f for rdna4

* fix as comment

* remove device kernels

* add constexpr for early return

* update based on review comment

* change based on the review comment

* pass compile error

* keep code consistency

---------

Co-authored-by: zhang hui <you@example.com>
2025-11-22 00:03:24 +01:00
lhez 8e9ddba610
opencl: refine condition for kqv mm (#17392) 2025-11-21 14:34:48 -08:00
Daniel Bevenius 79b8cf2a75
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-21 16:38:32 +01:00
ubergarm 23bc779a6e
model : detect GigaChat3-10-A1.8B as deepseek lite (#17420)
* Detect GigaChat3-10-A1.8B as deepseek lite

Hardcodes checking number of layers to detect if lite version of deepseek.

* Add commnent identifying deepseek lite variants

deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B
2025-11-21 14:51:38 +01:00
Daniel Bevenius 9b2439347f
common, tools : refactor model loading to support backend samplers
This commit refactors the model loading process in common/common.cpp
to enable backend sampler to be configure prior to the llama_context
creation.

The motivation for this change is that just being able to set/reset the
backend samplers after the llama_context has been created will cause a
resize to occur in llama_context::output_reserve which we want to avoid.
2025-11-21 14:26:52 +01:00
Daniel Bevenius 61ffe41dc1
sampling : use pinned memory for backend sampling buffers 2025-11-21 14:02:16 +01:00
Adrien Gallouët 28175f857d
cmake : add option to build and link BoringSSL (#17205)
* cmake: add option to build and link BoringSSL

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : fix typo

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : disable boringssl test and asm by default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : skip bssl

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : disable fips

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : fix cmake --install

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ci : use boringssl for windows and mac

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-21 11:46:45 +01:00
Adrien Gallouët 9cc4080441
ci : start using OpenSSL (#17235)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-21 11:45:00 +01:00
Jeff Bolz f1ffbba68e
vulkan: disable async for older Intel devices (#17369)
* vulkan: disable async for older Intel devices

* update detection logic

* use name string for detection
2025-11-21 09:58:17 +01:00
Raul Torres 2370665e56
CANN: Refactor `evaluate_and_capture_cann_graph` (#17333)
* CANN: Refactor `evaluate_and_capture_cann_graph`

**Description of the problem**

* `matched_graph` is obtained even if graph mode is disabled.
* End of graph capture and graph replay are unnecessarily placed in different `if` blocks.

**Proposed solution**

* Obtain `matched_graph` only if graph mode is enabled.
* Place end of graph capture and graph reply inside the same `if` block.
* Unify graph related comments.

* Remove trailing whitespace
2025-11-21 16:23:29 +08:00
Daniel Bevenius c1625620f6
sampling : return early if backend sampling is disabled 2025-11-21 08:47:31 +01:00
nullname 21d31e0810
ggml-hexagon: fix swiglu failure at `test-backend-ops` (#17344)
* refactor: use hvx_vec_exp_fp32_guard_inf for overflow handling in hvx_exp_f32

* feat: add fast sigmoid function with overflow guard for fp32

* refactor: replace hvx_vec_inverse_fp32 with hvx_vec_inverse_fp32_guard_inf for improved overflow handling

* feat: enhance hvx_add_scalar_f32 with overflow handling using infinity guard

* wip

* add HVX_Vector_Alias

wip

* wip

* fix: improve handling of src1 tensor in glu_swiglu_fp32_per_thread function

* fix nc

* wip

* wip

* handle nan at inverse

* wip

* fix neg

* wip

* rename

* fix hvx_vec_inverse_fp32_guard_inf to handle infinity and NaN cases correctly

* wip

* fix hvx_vec_inverse_fp32_guard_inf to handle NaN cases correctly

* wip

* wip

* wip

* fix output sign
2025-11-20 15:45:05 -08:00
Daniel Han dd0f321941
readme : add Unsloth exporting to GGUF in tools (#17411) 2025-11-20 20:07:36 +01:00
Xuan-Son Nguyen 054a45c3d3
grammar: fix regression caused by #17381 (#17412)
* grammar: fix regression caused by #17381

* more readable
2025-11-20 18:35:10 +01:00
Daniel Bevenius 0d28b16bdc
sampling : introduce sampling_info struct
This commit introduces a sampling_info struct to encapsulate all
backend sampling related data within the llama_context class.

It also updates to use more descriptive names for sampled tokens and
candidates in the backend sampler ggml data structure.
2025-11-20 14:45:56 +01:00
Aleksander Grygier 4c91f2633f
Improved file naming & structure for UI components (#17405)
* refactor: Component iles naming & structure

* chore: update webui build output

* refactor: Dialog titles + components namig

* chore: update webui build output

* refactor: Imports

* chore: update webui build output
2025-11-20 14:07:31 +01:00
Piotr Wilkin (ilintar) 92c0b387a9
grammar : fix integer overflow (#17381)
* Fix DoS / integer overflow

* Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :)

* White space

* Actually, since it's unsigned, use UINT64_MAX
2025-11-20 14:47:04 +02:00
Georgi Gerganov 2286a360ff sync : ggml 2025-11-20 14:10:44 +02:00
YangLe 1d321e592b metal : fix compile on macos 11 (whisper/3533) 2025-11-20 14:10:44 +02:00
Georgi Gerganov 196f5083ef
common : more accurate sampling timing (#17382)
* common : more accurate sampling timing

* eval-callback : minor fixes

* cont : add time_meas impl

* cont : fix log msg [no ci]

* cont : fix multiple definitions of time_meas

* llama-cli : exclude chat template init from time measurement

* cont : print percentage of unaccounted time

* cont : do not reset timings
2025-11-20 13:40:10 +02:00
o7si 5088b435d4
convert : fix TypeError when loading base model remotely in convert_lora_to_gguf (#17385)
* fix: TypeError when loading base model remotely in convert_lora_to_gguf

* refactor: simplify base model loading using cache_dir from HuggingFace

* Update convert_lora_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* feat: add remote_hf_model_id to trigger lazy mode in LoRA converter

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-20 12:30:12 +01:00
Piotr Wilkin (ilintar) 845f200b28
ggml : Fix transposed SOLVE_TRI result (#17323)
* Did someone transpose the SOLVE_TRI result matrix? Perhaps...

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-20 12:58:21 +02:00
Scott Fudally a7784a8b1d
DGX Spark: UMA support (#17368)
* DGX Spark: UMA support

* Updates from PR feedback

* More PR feedback cleanup

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Remove trailing whitespace

* Update ggml/src/ggml-cuda/ggml-cuda.cu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-20 12:32:02 +02:00
Adrien Gallouët 79bb743512
ggml : remove useless and error-prone variadic macros (#17399)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-20 11:18:27 +01:00
sudhiarm 3ae282a06f
kleidiai: fix zero-size array declaration (#17240) 2025-11-20 11:45:49 +02:00
Daniel Bevenius ed4345bdd9 squash! common : fix regression caused by extra memory allocations during sampling
Apply the same changes to llama-sampling.cpp, llama_sampler_sample as
were applied in commit 38f408c25.
2025-11-20 07:56:33 +01:00
ixgbe 5be353ec4a
ggml-cpu:add RISC-V RVV (Zvfh) optimization for FP16 vector scaling (#17314)
* ggml-cpu:add RISC-V RVV (Zvfh) optimization for FP16 vector scaling

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

* fix comment

* fix comment 2

---------

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-20 08:09:18 +02:00