Commit Graph

7031 Commits

Author SHA1 Message Date
TecJesh 655cddd174
CANN: Add L2_NORM op support (#16856)
* update L2_NORM op support

* update L2_NORM op support

* remove extra whitespace
2025-11-12 15:11:42 +08:00
Neo Zhang Jianyu 5da7664960
[SYCL]fix ci crash about SSM_CONV (#17169)
* fix ci crash

* Update ggml-sycl.cpp

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-12 14:44:29 +08:00
Raul Torres 23a46ce972
CANN: GGML_CANN_ACL_GRAPH works only USE_ACL_GRAPH enabled (#16861)
The documentation should state that `GGML_CANN_ACL_GRAPH` is only effective if `USE_ACL_GRAPH` was enabled at compilation time.
2025-11-12 14:37:52 +08:00
Max Krasnyansky c273d75375
hexagon: various Op fixes (#17135)
* hexagon: explicitly check for ops with zero nrows

llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows.
Somehow other backends seems to handle this without obvious explicit checks.
In the hexagon case we need to check explicitly and skip them.

* hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL

Co-authored-by: chraac <chraac@gmail.com>

* hexagon: use fastdiv in ADD_ID

* hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs

---------

Co-authored-by: chraac <chraac@gmail.com>
2025-11-11 15:25:04 -08:00
Eve 7d019cff74
disable rms norm mul rope for chips with no fp16 rte (#17134) 2025-11-11 12:53:30 -06:00
sudhiarm 3fe36c3238
ci: add Arm-hosted Graviton4 runner (#17021)
* ci: add Arm-hosted Graviton4 runner

* ci: add missing dependencies for graviton4 build

* ci: enable LFS checkout on graviton4

* ci: move git-lfs install to dependencies in Graviton4 workflow
2025-11-11 17:58:05 +02:00
Xuan-Son Nguyen 1d45b4228f
vendor: split httplib to cpp/h files (#17150)
* vendor: split httplib to cpp/h files

* move defines

* include httplib if curl is not used

* add TODO

* fix build ios

* fix build visionos instead
2025-11-11 13:32:58 +01:00
ixgbe ca4844062b
ggml-cpu : add RISC-V RVV (Zvfh) optimization for FP16 to FP32 conversion (#17161)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-11 13:41:51 +02:00
duduta 73460f6278
ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (#16805)
* extract rotate_pairs logic from ggml_compute_forward_rope_f32

* templateify ggml_compute_forward_rope_f32 and _f16

* abort when rope type not supported, remove GLM from test-rope

* add imrope branch to switch

* add rope tests for perf

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-11 13:33:24 +02:00
Charles Xu 8c583242ad
kleidiai: add optimized per-channel kernels for Q8_0 (#16993) 2025-11-11 13:20:31 +02:00
Mike Abbott 4a5b8aff40
cmake : add version to all shared object files (#17091)
When compiling llama.cpp in Yocto, it fails QA checks because the generated so files aren't versioned.  This applies a version to all generated so files, allowing the package to build without errors.
2025-11-11 13:19:50 +02:00
Nicolas B. Pierron d2d626938a
Install rpc-server when GGML_RPC is ON. (#17149) 2025-11-11 10:53:59 +00:00
levkropp 2fc392ce35
convert : register UMT5Model architecture for T5 conversion (#17160)
Register UMT5Model as a supported architecture variant for T5 model conversion.
This allows the conversion to work for models downloaded with AutoModel.
2025-11-11 09:38:30 +01:00
lhez ece0f5c177
opencl: add fastdiv and use it in set_rows, ported from cuda (#17090)
* opencl: add fastdiv for mm q8_0

* opencl: use uint4 for fastdiv vals

* opencl: use fastdiv for set_rows

* opencl: do not use fastdiv for q8_0 mm
2025-11-10 15:00:13 -08:00
Sigbjørn Skjæret 7bef684118
models : move build_inp_out_ids outside loop (#17151)
* move build_inp_out_ids outside loop

* realign
2025-11-10 22:55:30 +01:00
Max Krasnyansky 395e286bc9
cpu: skip NOPs to avoid barriers (#17133)
* cpu: skip NOPs to avoid barriers

* cpu: use ggml_op_is_empty
2025-11-10 12:44:49 -08:00
Georgi Gerganov 13730c183b
metal : cap threadgroups size of set_rows (#17146) 2025-11-10 21:33:35 +02:00
Adrien Gallouët 967eb4b2bf
ggml-cpu : inspect -march and -mcpu to found the CPU (#16333)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-10 21:03:36 +02:00
Ruben Ortlam f117be185e
vulkan: check glslc executable string (#17144) 2025-11-10 16:59:26 +01:00
Ruben Ortlam 85234a4b3a
vulkan: fix validation issue introduced by #16868 (#17145) 2025-11-10 16:59:10 +01:00
Gabe Goodhart 0c74f32632
memory: Hybrid context shift (#17009)
* feat(memory): Only fail partial erasure of recurrent tail

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.

There is one potential case that this doesn't address which is the pruning
of cache to remove sensitive data from the context. This wouldn't work for
attention cache partial removal (in the middle) either since the KV state
is linearly-dependent and states in later sequence positions would still be
based on the state from the sensitive data, even if that data is no longer
cached, so I don't think this is relevant, but it is worth noting that the
semantics of this change for a partial erasure in the middle of the cache
are essentially "my context is already compressed" and not "all trace of
the removed tokens has been removed."

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(main): Check the output of seq_rm for prefix matching

This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(memory): Fix condition for partial erasure failure if p0 > pos

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: compilade <git@compilade.net>

* style: Fix extra parens

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-10 17:14:23 +02:00
Georgi Gerganov c27efd2bd1
metal : enable tensor API for A19 (#17087) 2025-11-10 15:38:42 +02:00
fj-y-saito df70bedda7
arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… (#15277)
* add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K

* Surround SVE function with compiler directive

* fix compile switch

* fix coding style

* ggml : fix indent

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-10 15:12:59 +02:00
Georgi Gerganov f914544b16
batched-bench : add "separate text gen" mode (#17103) 2025-11-10 12:59:29 +02:00
Xuan-Son Nguyen 4b13a684c5
mtmd: fix patch_size initialized to random value in audio models (#17128)
* mtmd: fix patch_size initialized to random value in audio models

* add default hparams
2025-11-10 11:41:05 +01:00
Georgi Gerganov 9898b57cbe
editorconfig : ignore benches/ (#17140)
[no ci]
2025-11-10 12:17:19 +02:00
Acly 1032256ec9
cuda/vulkan : bicubic interpolation (#17022)
* vulkan : implement upscale with bicubic interpolation

* cuda : implement upscale with bicubic interpolation

* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests

* adapt OpenCL backend to not support the OP in that case so tests don't fail

* print scale mode & flags in test-backend-ops
2025-11-10 10:19:39 +01:00
Georgi Gerganov 15274c0c50
benches : add eval results (#17139)
[no ci]
2025-11-10 10:44:10 +02:00
Georgi Gerganov b8595b16e6
mtmd : fix embedding size for image input (#17123) 2025-11-09 18:31:02 +02:00
Ruben Ortlam 392e09a608
vulkan: fix memory allocations (#17122) 2025-11-09 16:14:41 +01:00
compilade 802cef44bf
convert : parse safetensors directly (#15667)
* convert : parse safetensors directly

* gguf-py : order safetensors tensors by name

Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.

* convert : rename from_safetensors_meta to from_local_tensor

For consistency with from_remote_tensor

* convert : fix no-lazy dtypes from direct safetensors
2025-11-09 09:49:40 -05:00
compilade 1c07c0c68c
convert : handle compressed-tensors quant method (#17069)
* convert : handle compressed-tensors quant method

* convert : handle int-quantized models

* convert : handle naive-quantized models

* gguf-py : __pos__ is also unary

* convert : fix flake8 lint

* convert : use F32 for dequant of pack-quantized tensors
2025-11-09 09:45:50 -05:00
Georgi Gerganov cb1adf8851
server : handle failures to restore host cache (#17078)
* server : handle failures to restore host cache

* server : add tests for the prompt cache
2025-11-09 14:27:05 +02:00
Georgi Gerganov ef1d826997
benches : add folder with benchmarks (#16931)
* benches : add folder with benchmarks

* benches : update dgx-spark bench
2025-11-09 12:53:29 +02:00
Eric Curtin 86fde91e62
Switch to using Ubuntu 25.10 vulkan/mesa (#16497)
Because "Ubuntu packages to be discontinued in Vulkan SDK"

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-11-09 10:25:38 +01:00
Ruben Ortlam 7f3e9d339c
vulkan: iGPU memory reporting fix (#17110)
* vulkan: use all device-local heaps for memory availability reporting

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>

* use all available heaps for iGPU memory reporting

* Allow multiple memory types per buffer request for devices with split heaps

---------

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-09 09:54:47 +01:00
Ruben Ortlam 8a3519b708
vulkan: fix mmq out of bounds reads (#17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings
2025-11-09 09:52:57 +01:00
Jeff Bolz 80a6cf6347
vulkan: fuse mul_mat_id + mul (#17095)
* vulkan: fuse mul_mat_id + mul

This comes up in qwen3 moe.

* split mul_mat_id fusion tests into a separate class
2025-11-09 09:48:42 +01:00
Georgi Gerganov 0750a59903
metal : retain src and dst buffers during async ops (#17101) 2025-11-09 08:28:51 +02:00
Xuan-Son Nguyen aa3b7a90b4
arg: add --cache-list argument to list cached models (#17073)
* arg: add --cache-list argument to list cached models

* new manifest naming format

* improve naming

* Update common/arg.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-08 21:54:14 +01:00
chansikpark 333f2595a3
webui: fix keyboard shortcuts for new chat & edit chat title (#17007) 2025-11-08 20:52:35 +01:00
Jeff Bolz 53d7d21e61
vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978)
* vulkan: Use spec constants for conv2d s/d/p and kernel W/H

Also add some additional unroll hints, which seems to help.

* lock around map lookup
2025-11-08 13:24:29 -06:00
Aidan eeee367de5
server: fix correct time_ms calculation in prompt_progress (#17093)
* fix: correct time_ms calculation in send_partial_response

The time_ms field was incorrectly calculated. The division was happening
before the subtraction leading to incorrect values.

Before: (ggml_time_us() - slot.t_start_process_prompt / 1000) After:
(ggml_time_us() - slot.t_start_process_prompt) / 1000

* docs : document time_ms field in prompt_progress
2025-11-08 15:12:11 +02:00
Aman Gupta 64fe17fbb8
Revert "CUDA: add expert reduce kernel (#16857)" (#17100) 2025-11-08 21:05:19 +08:00
Aman Gupta c1b187688d
CUDA: skip fusion for repeating adds in bias (#17080) 2025-11-08 16:58:05 +08:00
SavicStefan b8a5cfd11a
vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (#16636)
Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>
2025-11-08 09:28:22 +01:00
Aleksei Nikiforov 08416ebe7f
ggml: disable vxe for cross-compilation by default (#16966)
Otherwise compilation will fail due to enabling -mvx -mzvector
and not setting corresponding -march options.
2025-11-08 16:00:20 +08:00
Jeff Bolz b4e335d8dc
vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977)
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
2025-11-08 08:52:15 +01:00
Jeff Bolz d6fe40fa00
vulkan: Fix test-thread-safety crashes (#17024)
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
2025-11-08 08:39:45 +01:00
Johannes Gäßler e14e842e87
CUDA: fix MMQ stream-k fixup ne1 indices (#17089) 2025-11-08 08:26:18 +01:00