7011 Commits

Author SHA1 Message Date
Gabe Goodhart 0c74f32632
memory: Hybrid context shift (#17009)
* feat(memory): Only fail partial erasure of recurrent tail

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.

There is one potential case that this doesn't address: pruning the cache to
remove sensitive data from the context. That wouldn't work for partial
removal from the middle of an attention cache either, since the KV state is
linearly dependent and states at later sequence positions would still be
based on the state from the sensitive data, even if that data is no longer
cached. I don't think this is relevant, but it is worth noting that the
semantics of this change for a partial erasure in the middle of the cache
are essentially "my context is already compressed" and not "all trace of
the removed tokens has been removed."
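
The rule described above can be illustrated with a minimal sketch. This is not the code from the PR: `recurrent_seq_state` and `seq_rm_sketch` are hypothetical names, and the model assumes a single sequence whose only stored state is the one at its final position.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical, simplified model of a recurrent cache entry for one sequence.
// Only the state as of the final token is kept, so `pos` is all we track.
struct recurrent_seq_state {
    int32_t pos = -1; // position of the last token folded into the state, -1 if empty
};

// Try to erase positions [p0, p1). A negative p0 means "from the start",
// a negative p1 means "to the end" (an assumption of this sketch).
// Fails only when the range cuts into the tail without covering the whole sequence.
bool seq_rm_sketch(recurrent_seq_state & s, int32_t p0, int32_t p1) {
    if (p0 < 0) { p0 = 0; }
    if (p1 < 0) { p1 = std::numeric_limits<int32_t>::max(); }

    if (s.pos < 0) {
        return true; // nothing is stored, so nothing can fail
    }

    const bool covers_tail = p0 <= s.pos && s.pos < p1;
    if (!covers_tail) {
        return true; // the final token is untouched: no memory to remove
    }
    if (p0 == 0) {
        s.pos = -1;  // the whole sequence is erased, so the state is dropped
        return true;
    }
    return false;    // partial erasure of the tail: the state cannot be rewound
}
```

With `s.pos == 9`, erasing `[2, 5)` or `[12, end)` reports success and leaves the state alone, erasing `[5, end)` fails, and erasing `[0, end)` succeeds and empties the entry.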

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(main): Check the output of seq_rm for prefix matching

This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.
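
A minimal sketch of that fallback, under stated assumptions: `reuse_prefix`, `try_remove_tail`, and `clear_all` are hypothetical names standing in for the real cache calls, which are not reproduced here.

```cpp
#include <cstddef>
#include <functional>

// Returns how many cached tokens can still be reused after trimming the cache.
//   try_remove_tail(n_keep): hypothetical callback that erases every cached
//                            position >= n_keep and returns false if it cannot.
//   clear_all():             hypothetical callback that empties the cache.
size_t reuse_prefix(size_t n_matching,
                    const std::function<bool(size_t)> & try_remove_tail,
                    const std::function<void()> & clear_all) {
    if (try_remove_tail(n_matching)) {
        return n_matching; // mismatching tail removed, prefix is reusable
    }
    // The tail could not be removed (e.g. a recurrent state updated in place),
    // so the only safe option is to start from an empty cache.
    clear_all();
    return 0;
}
```

Returning 0 on the fallback path matters: the caller must re-evaluate the prompt from the start instead of skipping tokens it believes are still cached.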

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(memory): Fix condition for partial erasure failure if p0 > pos

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: compilade <git@compilade.net>

* style: Fix extra parens

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-10 17:14:23 +02:00
Georgi Gerganov c27efd2bd1
metal : enable tensor API for A19 (#17087) 2025-11-10 15:38:42 +02:00
fj-y-saito df70bedda7
arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… (#15277)
* add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K

* Surround SVE function with compiler directive

* fix compile switch

* fix coding style

* ggml : fix indent

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-10 15:12:59 +02:00
Georgi Gerganov f914544b16
batched-bench : add "separate text gen" mode (#17103) 2025-11-10 12:59:29 +02:00
Xuan-Son Nguyen 4b13a684c5
mtmd: fix patch_size initialized to random value in audio models (#17128)
* mtmd: fix patch_size initialized to random value in audio models

* add default hparams
2025-11-10 11:41:05 +01:00
Georgi Gerganov 9898b57cbe
editorconfig : ignore benches/ (#17140)
[no ci]
2025-11-10 12:17:19 +02:00
Acly 1032256ec9
cuda/vulkan : bicubic interpolation (#17022)
* vulkan : implement upscale with bicubic interpolation

* cuda : implement upscale with bicubic interpolation

* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests

* adapt OpenCL backend to not support the OP in that case so tests don't fail

* print scale mode & flags in test-backend-ops
2025-11-10 10:19:39 +01:00
Georgi Gerganov 15274c0c50
benches : add eval results (#17139)
[no ci]
2025-11-10 10:44:10 +02:00
Georgi Gerganov b8595b16e6
mtmd : fix embedding size for image input (#17123) 2025-11-09 18:31:02 +02:00
Ruben Ortlam 392e09a608
vulkan: fix memory allocations (#17122) 2025-11-09 16:14:41 +01:00
compilade 802cef44bf
convert : parse safetensors directly (#15667)
* convert : parse safetensors directly

* gguf-py : order safetensors tensors by name

Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.

* convert : rename from_safetensors_meta to from_local_tensor

For consistency with from_remote_tensor

* convert : fix no-lazy dtypes from direct safetensors
2025-11-09 09:49:40 -05:00
compilade 1c07c0c68c
convert : handle compressed-tensors quant method (#17069)
* convert : handle compressed-tensors quant method

* convert : handle int-quantized models

* convert : handle naive-quantized models

* gguf-py : __pos__ is also unary

* convert : fix flake8 lint

* convert : use F32 for dequant of pack-quantized tensors
2025-11-09 09:45:50 -05:00
Georgi Gerganov cb1adf8851
server : handle failures to restore host cache (#17078)
* server : handle failures to restore host cache

* server : add tests for the prompt cache
2025-11-09 14:27:05 +02:00
Georgi Gerganov ef1d826997
benches : add folder with benchmarks (#16931)
* benches : add folder with benchmarks

* benches : update dgx-spark bench
2025-11-09 12:53:29 +02:00
Eric Curtin 86fde91e62
Switch to using Ubuntu 25.10 vulkan/mesa (#16497)
Because "Ubuntu packages to be discontinued in Vulkan SDK"

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-11-09 10:25:38 +01:00
Ruben Ortlam 7f3e9d339c
vulkan: iGPU memory reporting fix (#17110)
* vulkan: use all device-local heaps for memory availability reporting

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>

* use all available heaps for iGPU memory reporting

* Allow multiple memory types per buffer request for devices with split heaps

---------

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-09 09:54:47 +01:00
Ruben Ortlam 8a3519b708
vulkan: fix mmq out of bounds reads (#17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings
2025-11-09 09:52:57 +01:00
Jeff Bolz 80a6cf6347
vulkan: fuse mul_mat_id + mul (#17095)
* vulkan: fuse mul_mat_id + mul

This comes up in qwen3 moe.

* split mul_mat_id fusion tests into a separate class
2025-11-09 09:48:42 +01:00
Georgi Gerganov 0750a59903
metal : retain src and dst buffers during async ops (#17101) 2025-11-09 08:28:51 +02:00
Xuan-Son Nguyen aa3b7a90b4
arg: add --cache-list argument to list cached models (#17073)
* arg: add --cache-list argument to list cached models

* new manifest naming format

* improve naming

* Update common/arg.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-08 21:54:14 +01:00
chansikpark 333f2595a3
webui: fix keyboard shortcuts for new chat & edit chat title (#17007) 2025-11-08 20:52:35 +01:00
Jeff Bolz 53d7d21e61
vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978)
* vulkan: Use spec constants for conv2d s/d/p and kernel W/H

Also add some additional unroll hints, which seems to help.

* lock around map lookup
2025-11-08 13:24:29 -06:00
Aidan eeee367de5
server: fix correct time_ms calculation in prompt_progress (#17093)
* fix: correct time_ms calculation in send_partial_response

The time_ms field was incorrectly calculated. The division was happening
before the subtraction leading to incorrect values.

Before: (ggml_time_us() - slot.t_start_process_prompt / 1000)
After: (ggml_time_us() - slot.t_start_process_prompt) / 1000
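
The effect of the misplaced parenthesis can be seen in a small standalone sketch. The function names are hypothetical and the timestamps are plain integers in microseconds rather than values from `ggml_time_us()`.

```cpp
#include <cstdint>

// Buggy form: division binds tighter than subtraction, so only the start
// timestamp is scaled and the result is neither microseconds nor milliseconds.
constexpr int64_t elapsed_ms_buggy(int64_t now_us, int64_t start_us) {
    return now_us - start_us / 1000;
}

// Fixed form: subtract first, then convert the difference to milliseconds.
constexpr int64_t elapsed_ms_fixed(int64_t now_us, int64_t start_us) {
    return (now_us - start_us) / 1000;
}

// 3 seconds elapsed between start = 2 s and now = 5 s.
static_assert(elapsed_ms_fixed(5000000, 2000000) == 3000, "fixed form gives 3000 ms");
static_assert(elapsed_ms_buggy(5000000, 2000000) == 4998000, "buggy form is far off");
```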

* docs : document time_ms field in prompt_progress
2025-11-08 15:12:11 +02:00
Aman Gupta 64fe17fbb8
Revert "CUDA: add expert reduce kernel (#16857)" (#17100) 2025-11-08 21:05:19 +08:00
Aman Gupta c1b187688d
CUDA: skip fusion for repeating adds in bias (#17080) 2025-11-08 16:58:05 +08:00
SavicStefan b8a5cfd11a
vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (#16636)
Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>
2025-11-08 09:28:22 +01:00
Aleksei Nikiforov 08416ebe7f
ggml: disable vxe for cross-compilation by default (#16966)
Otherwise compilation will fail due to enabling -mvx -mzvector
and not setting corresponding -march options.
2025-11-08 16:00:20 +08:00
Jeff Bolz b4e335d8dc
vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977)
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
2025-11-08 08:52:15 +01:00
Jeff Bolz d6fe40fa00
vulkan: Fix test-thread-safety crashes (#17024)
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
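
A minimal sketch of the pattern, assuming a cache keyed by an integer: `pipeline_cache_sketch` and its string payload are hypothetical and stand in for the real pipeline objects. The point is that the lookup and the insertion happen under the same lock.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a lazily populated pipeline map.
struct pipeline_cache_sketch {
    std::mutex mtx;
    std::map<int, std::string> pipelines;

    // Both find() and emplace() run while the lock is held, so one thread can
    // never search the map while another is inserting into it.
    const std::string & get_or_create(int key) {
        std::lock_guard<std::mutex> guard(mtx);
        auto it = pipelines.find(key);
        if (it == pipelines.end()) {
            it = pipelines.emplace(key, "pipeline-" + std::to_string(key)).first;
        }
        return it->second; // std::map references stay valid across later inserts
    }
};
```

Locking only the insertion would not be enough, because an unsynchronized `find()` racing with `emplace()` is still a data race on the container.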
2025-11-08 08:39:45 +01:00
Johannes Gäßler e14e842e87
CUDA: fix MMQ stream-k fixup ne1 indices (#17089) 2025-11-08 08:26:18 +01:00
Reese Levine 647b960bd8
ggml webgpu: faster matrix multiplication/matrix-vector multiplication (#17031)
* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings
2025-11-07 19:27:20 -08:00
bssrdf 299f5d782c
CUDA: properly handle nb00=nb02 case for cpy (#17081) 2025-11-07 23:41:58 +01:00
Acly ac76d36201
vulkan : refactor buffer handling in vk_op_f32 (#16840)
* vulkan : refactor/simplify buffer handling in vk_op_* functions

* Combine UMA handling into ggml_vk_tensor_subbuffer
2025-11-07 21:08:50 +01:00
Johannes Gäßler 6515610506
CUDA: fix should_use_mmvf for ne11 == 1 (#17085)
* CUDA: fix should_use_mmvf for ne11 == 1

* Apply suggestion from @am17an

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2025-11-07 20:53:14 +01:00
Georgi Gerganov 7956bb4d7f
bench : cache the llama_context state at computed depth (#16944)
* bench : cache llama_context state at depth

* cont : handle failures to restore the old state

* cont : print information when the state is being reused
2025-11-07 21:23:11 +02:00
Sigbjørn Skjæret 9008027aa3
hparams : add n_embd_inp() to support extended embed (#16928)
* add n_embd_full to support extended embed

* don't change output

* rename to n_embd_inp

* restore n_embd where applicable
2025-11-07 19:27:58 +01:00
Georgi Gerganov 16bcc1259d
kv-cache : pad the cache size to 256 for performance (#17046)
* kv-cache : pad the size of the small SWA cache for performance

* context : pad the total context to 256

* cont : future-proof the swa pad

* server : adjust test params to new logic
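
As a rough illustration of what padding to 256 means, here is a sketch that rounds a size up to the next multiple. `pad_up` is a hypothetical helper, and rounding up (rather than down) is an assumption based on the word "pad".

```cpp
#include <cstdint>

// Round n up to the nearest multiple of `multiple` (multiple must be > 0).
constexpr uint32_t pad_up(uint32_t n, uint32_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

static_assert(pad_up(1000, 256) == 1024, "1000 is padded to the next multiple");
static_assert(pad_up(1024, 256) == 1024, "exact multiples are unchanged");
```

A requested size that is not a multiple of 256 therefore comes out slightly larger than asked for, which is presumably why the server test parameters needed adjusting.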
2025-11-07 20:03:25 +02:00
Adrien Gallouët 9eb9a1331d
Revert "ggml-cpu: detect correct cpu flags for arm64 (#16229) (#16239)" (#17084)
This reverts commit 7c23f3f0d4.
2025-11-07 18:34:05 +02:00
iron 7c23f3f0d4
ggml-cpu: detect correct cpu flags for arm64 (#16229) (#16239)
When using GCC 9 and GCC 12 on the arm64 platform of Ubuntu 20.04,
the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags,
which results in compilation failures for certain extended instructions,
but the correct CPU flags can be obtained by using gcc -march.

Signed-off-by: lizhenneng <lizhenneng@kylinos.cn>
Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>
2025-11-07 08:18:14 -08:00
Georgi Gerganov 8c0d6bb455
server : print the samplers chain for each request (#17070) 2025-11-07 12:24:47 +02:00
Xuan-Son Nguyen 5c9a18e674
common: move download functions to download.(cpp|h) (#17059)
* common: move download functions to download.(cpp|h)

* rm unused includes

* minor cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-07 11:23:34 +01:00
xctan 7f09a680af
ggml-cpu : optimize RVV q2_k and q3_k kernels (#16887) 2025-11-06 18:12:45 +02:00
Johannes Gäßler aa374175c3
CUDA: fix crash on uneven context without FA (#16988) 2025-11-06 14:05:47 +01:00
Georgi Gerganov 5b180c3d60
metal : initial Metal4 tensor API support (#16634)
* metal : rework mat-mat multiplication

* metal : initial Metal4 support

* cont

* metal : detect tensor support

* cont : better ifdefs

* metal : support tensors in mul_mm_id

* metal : add env for disabling tensor API

* tests : restore

* metal : remove unused constants

* metal : fix check for bfloat tensor support

* cont : handle API incompatibilities

* cont : handle even more incompatibilities

* metal : use tensor API only on M5 and later
2025-11-06 14:45:10 +02:00
Georgi Gerganov b7f9010d24
server : disable checkpoints with mtmd (#17045) 2025-11-06 12:09:29 +02:00
Xuan-Son Nguyen 4882f0ff78
clip: implement minicpm-v sinusoidal embd using GGML (#17036)
* clip: implement minicpm-v sinusoidal embd using GGML

* fix repeat op
2025-11-06 11:02:54 +01:00
YehuditE 9d7c518d64
sycl: add CONCAT operator support (#16047)
* sycl: add CONCAT operator support

* cleanup: remove stray lines added by mistake

* fix: code format issues in concat.cpp and tests/test-backend-ops.cpp

* chore: fix editorconfig violations

* cleanup: drop unnecessary i16 type support

* docs: update sycl-csv and regenerate ops.md

* update docs/ops.md

* fix: adapt to upstream master changes after rebase

* fix: remove empty files

* fix: drop whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-06 11:02:33 +01:00
Johannes Gäßler 22c8c3c6ad
docs: explain CUDA 11 compilation [no ci] (#16824) 2025-11-06 08:14:35 +01:00
l3utterfly 6db3d1ffe6
ggml-hexagon: graceful fallback for older socs where rpcmem_alloc2 and FASTRPC_GET_URI are unsupported (#16987)
* support older socs where FASTRPC_GET_URI is unsupported

* added graceful fallback when FASTRPC_GET_URI call fails

* use weak symbols instead of loading libcdsprpc.so dynamically

* Add weak pragma for rpcmem_alloc2

* Remove weak declaration for rpcmem_alloc2 in ggml-hexagon.cpp

Removed weak declaration for rpcmem_alloc2.

* Enforce ndev to 1 for archs below v75

Force ndev to 1 for SoC architectures lower than v75.
2025-11-05 21:46:38 -08:00
bssrdf 230d1169e5
improve CUDA cpy memory bandwidth when copying transposed tensor (#16841)
* WIP

* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added showing improved memory bandwidth

* added BF16 support

* more strict check to make sure src0 is a transpose

* reformulated to handle more complicated transpose cases

* bring back 2D transpose for higher performance

* allow build on windows

* transpose copy more shapes

* minor tweak

* final clean up

* restore some test cases

* keep only the kernel for true transposed case; updated with review suggestions

* make CI happy

* remove headers not needed

* reduced bank conflicts for fp16 and bf16

* add missing const*

* now bank conflicts free

* use padding instead of swizzling
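
Why padding removes bank conflicts in a shared-memory transpose can be checked with plain index arithmetic. The sketch below runs on the CPU and is not the kernel from this PR. It assumes 32 banks of one word each and a 32-row tile, and `distinct_banks_for_column_read` is a hypothetical helper.

```cpp
#include <array>

constexpr int N_BANKS = 32; // assumption: 32 shared-memory banks, one word wide

// Count how many distinct banks are touched when 32 threads each read one
// element of the same column from a tile whose rows are `row_stride` words long.
int distinct_banks_for_column_read(int row_stride, int col) {
    std::array<bool, N_BANKS> used{};
    int n_distinct = 0;
    for (int row = 0; row < N_BANKS; ++row) {
        const int bank = (row * row_stride + col) % N_BANKS;
        if (!used[bank]) {
            used[bank] = true;
            ++n_distinct;
        }
    }
    return n_distinct;
}
```

With an unpadded stride of 32 every row of a column maps to the same bank, so the reads serialize. With one word of padding (stride 33) the 32 rows land in 32 different banks.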

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
2025-11-05 21:55:04 +01:00