Commit Graph

7263 Commits

Author SHA1 Message Date
Jeff Bolz 38eaf32af1
vulkan: change graph_compute to be async and enable get_tensor_async (#17158)
* vulkan: change graph_compute to be async and enable get_tensor_async

This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.

Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.

The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.

The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.

* fix thread safety errors

* teardown context cleanly

* Handle async read to non-pinned dst
2025-11-15 09:06:41 +01:00
Xuan-Son Nguyen 9b17d74ab7
mtmd: add mtmd_log_set (#17268) 2025-11-14 15:56:19 +01:00
Bartowski e1fcf8b09b
model : add AfmoeForCausalLM support (#16477)
* Add AFMOE model support

* Update to vocab

* Add model sizing

* Undo Rope change for ARCEE model

* Address review comments

* Update modeling code is_sliding -> use_rope, replace hard-coded logic

* Fix AFMOE tokenizer

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update AFMoE tokenizer class identification to be more unique

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-14 13:54:10 +01:00
Marek Hradil jr. 6cd0cf72ce
fix : Dangling pointer for non-empty trigger words in lazy grammar construction (#17048)
* fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (#17047)

* Replace 'static' workaround, with keeping variable in scope for longer

* Create std::array directly and pass into llama_grammar_init_impl

* Add back the trigger pattern

* Missed array include
2025-11-14 14:35:26 +02:00
Georgi Gerganov d396b43748
server : fix "can batch with" bug (#17263) 2025-11-14 14:03:45 +02:00
Georgi Gerganov 45c6ef7307
metal : support argsort for ne00 > 1024 (#17247)
* metal : refactor argsort

* cont : sort chunks

* cont : merge sorted buckets

* cont : cleanup
2025-11-14 09:36:06 +02:00
Georgi Gerganov 2606b0adab
metal : make the FA extra sizes consistent (#17143) 2025-11-14 09:13:34 +02:00
ixgbe 307772fcda
readme : add RVV,ZVFH,ZFH,ZICBOP support for RISC-V (#17259)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-14 09:12:56 +02:00
Aleksander Grygier f1bad23f88
Better UX for handling multiple attachments in WebUI (#17246) 2025-11-14 01:19:08 +01:00
Alberto Cabrera Pérez becc4816dd
ggml-cpu: handle 3d tensors in repack mat_mul (#17241)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries

* Address performance regression in Qwen and llama.cpp due to chunking
2025-11-13 12:53:00 -08:00
Xuan-Son Nguyen c4abcb2457
server: fixing naming conflict res_error (#17243) 2025-11-13 20:53:47 +01:00
Piotr Wilkin (ilintar) 389ac78b26
ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (#17063)
* Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Code review

* Whitespace

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* This is actually sigmoid, duh.

* Add CONST, remove TRI_KEEP, other changes from review

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Remove extra script

* Update ggml/src/ggml.c

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* moving changes from laptop [no ci]

* pre-rebase

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Refactor tests

* ggml : cleanup

* cont : fix ggml_fill srcs

* tests : add note

* ggml : add ggml_fill_inplace

* ggml : add asserts

* ggml : fix ggml_fill constant cast

* cont : ggml_tri minor

* Use TENSOR_LOCALS

* Fix regression from #14596, regenerate

* Don't make commits at night...

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-13 20:54:47 +02:00
Ruben Ortlam a19bd6f7ce
vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (#17219)
* vulkan: remove shell call from vulkan-shaders-gen tool

* use string vector for command execution

* Fix condition

* use string, remove const_cast

* Fix dependency file quotation on Windows

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-11-13 14:51:21 +01:00
Diego Devesa dd091e52f8
sched : fix reserve ignoring user tensor assignments (#17232) 2025-11-13 13:14:02 +01:00
ixgbe 1215dde7b0
ggml-cpu : add RISC-V vector intrinsic support for silu and cvar operations (#17227)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-13 13:13:32 +01:00
bagheera 0cfb19166b
metal: accelerated conv2d (#17175)
* metal: accelerated conv2d

* cont : cleanup

---------

Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-13 13:32:44 +02:00
Georgi Gerganov 2776db6c81
Revert "ggml-cpu: handle 3d tensors in repack mat_mul (#17030)" (#17233)
This reverts commit 1c398dc9ec.
2025-11-13 12:59:37 +02:00
Diego Devesa 879dec341a
ggml-cpu : use template for argsort (#17222) 2025-11-13 10:59:05 +02:00
TecJesh 97d5117217
CANN: Add cross_entropy_loss op support (#16886)
* update L2_NORM op support

* update L2_NORM op support

* remove extra whitespace

* cann: update cross_entropy_loss op support

* remove trailing whitespaces

* rebase the latest code in the main repository and remove the l2_norm operator that already exists in another pull request.

* undo the l2_norm operator deletion
2025-11-13 09:39:51 +08:00
Aman Gupta a90eb94ca9
CUDA: fuse rope + set_rows (#16884)
* CUDA: add fused rope

* move k forward_expand up

* create helper function instead of re-using params

* make assert statement more in line with comment

* rope_norm: coalesced writes to global mem
2025-11-13 08:50:01 +08:00
Neo Zhang Jianyu 07751f8d44
update SYCL support OPs (#17208)
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
2025-11-13 08:42:23 +08:00
o7si ffb6f3d921
vocab : correct bounds check for UGM XCDA array access (#17215) 2025-11-12 23:41:02 +01:00
Johannes Gäßler 5d6838b74f
CUDA: static assert to prevent misuse of memcpy_1 (#17198) 2025-11-12 23:13:55 +01:00
Mike Abbott 92bb442ad9
docker : preserve .so symlinks for docker container builds (#17214) 2025-11-12 20:33:55 +01:00
Georgi Gerganov 374fe09cdd
ggml : use std::sort in ggml_argsort CPU implementation (#17211)
* ggml : use std::sort in ggml_argsort CPU implementation

* cont : add missing header
2025-11-12 20:43:38 +02:00
Aleksander Grygier 8e878f0cb4
Update packages + upgrade Storybook to v10 (#17201)
* chore: Update packages + upgrade Storybook to v10

* fix: Increase timeout for UI tests
2025-11-12 19:01:48 +01:00
Xuan-Son Nguyen 00c94083b3
server: (refactor) implement generator-based API for task results (#17174)
* server: (refactor) implement generator-based API for task results

* improve

* moving some code

* fix "Response ended prematurely"

* add sink.done before return false

* rm redundant check

* rm unused var

* rename generator --> reader
2025-11-12 18:50:52 +01:00
Xuan-Son Nguyen 017eceed61
ci: add check vendor job (#17179)
* ci: add check vendor job

* use dev version of miniaudio

* move to dedicated workflow, only run on related files changed
2025-11-12 14:56:02 +01:00
Xuan-Son Nguyen ee8dd5c658
server: move res_error/res_ok to static function (#17167) 2025-11-12 14:17:24 +01:00
Alberto Cabrera Pérez 1c398dc9ec
ggml-cpu: handle 3d tensors in repack mat_mul (#17030)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries
2025-11-12 14:52:19 +02:00
Adrien Gallouët 52cf111b31
cmake : cleanup (#17199) 2025-11-12 14:48:30 +02:00
Adrien Gallouët 78010a0d52
cmake : move OpenSSL linking to vendor/cpp-httplib (#17177)
* cmake : move OpenSSL linking to vendor/cpp-httplib

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* bring back httplib 0.27.0

* add -DLLAMA_HTTPLIB

* update cmake config for visionos

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-11-12 12:32:50 +01:00
TecJesh 655cddd174
CANN: Add L2_NORM op support (#16856)
* update L2_NORM op support

* update L2_NORM op support

* remove extra whitespace
2025-11-12 15:11:42 +08:00
Neo Zhang Jianyu 5da7664960
[SYCL]fix ci crash about SSM_CONV (#17169)
* fix ci crash

* Update ggml-sycl.cpp

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-12 14:44:29 +08:00
Raul Torres 23a46ce972
CANN: GGML_CANN_ACL_GRAPH works only USE_ACL_GRAPH enabled (#16861)
The documentation should state that `GGML_CANN_ACL_GRAPH` is only effective if `USE_ACL_GRAPH` was enabled at compilation time.
2025-11-12 14:37:52 +08:00
Max Krasnyansky c273d75375
hexagon: various Op fixes (#17135)
* hexagon: explicitly check for ops with zero nrows

llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows.
Somehow other backends seems to handle this without obvious explicit checks.
In the hexagon case we need to check explicitly and skip them.

* hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL

Co-authored-by: chraac <chraac@gmail.com>

* hexagon: use fastdiv in ADD_ID

* hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs

---------

Co-authored-by: chraac <chraac@gmail.com>
2025-11-11 15:25:04 -08:00
Eve 7d019cff74
disable rms norm mul rope for chips with no fp16 rte (#17134) 2025-11-11 12:53:30 -06:00
sudhiarm 3fe36c3238
ci: add Arm-hosted Graviton4 runner (#17021)
* ci: add Arm-hosted Graviton4 runner

* ci: add missing dependencies for graviton4 build

* ci: enable LFS checkout on graviton4

* ci: move git-lfs install to dependencies in Graviton4 workflow
2025-11-11 17:58:05 +02:00
Xuan-Son Nguyen 1d45b4228f
vendor: split httplib to cpp/h files (#17150)
* vendor: split httplib to cpp/h files

* move defines

* include httplib if curl is not used

* add TODO

* fix build ios

* fix build visionos instead
2025-11-11 13:32:58 +01:00
ixgbe ca4844062b
ggml-cpu : add RISC-V RVV (Zvfh) optimization for FP16 to FP32 conversion (#17161)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-11 13:41:51 +02:00
duduta 73460f6278
ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (#16805)
* extract rotate_pairs logic from ggml_compute_forward_rope_f32

* templateify ggml_compute_forward_rope_f32 and _f16

* abort when rope type not supported, remove GLM from test-rope

* add imrope branch to switch

* add rope tests for perf

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-11 13:33:24 +02:00
Charles Xu 8c583242ad
kleidiai: add optimized per-channel kernels for Q8_0 (#16993) 2025-11-11 13:20:31 +02:00
Mike Abbott 4a5b8aff40
cmake : add version to all shared object files (#17091)
When compiling llama.cpp in Yocto, it fails QA checks because the generated so files aren't versioned.  This applies a version to all generated so files, allowing the package to build without errors.
2025-11-11 13:19:50 +02:00
Nicolas B. Pierron d2d626938a
Install rpc-server when GGML_RPC is ON. (#17149) 2025-11-11 10:53:59 +00:00
levkropp 2fc392ce35
convert : register UMT5Model architecture for T5 conversion (#17160)
Register UMT5Model as a supported architecture variant for T5 model conversion.
This allows the conversion to work for models downloaded with AutoModel.
2025-11-11 09:38:30 +01:00
lhez ece0f5c177
opencl: add fastdiv and use it in set_rows, ported from cuda (#17090)
* opencl: add fastdiv for mm q8_0

* opencl: use uint4 for fastdiv vals

* opencl: use fastdiv for set_rows

* opencl: do not use fastdiv for q8_0 mm
2025-11-10 15:00:13 -08:00
Sigbjørn Skjæret 7bef684118
models : move build_inp_out_ids outside loop (#17151)
* move build_inp_out_ids outside loop

* realign
2025-11-10 22:55:30 +01:00
Max Krasnyansky 395e286bc9
cpu: skip NOPs to avoid barriers (#17133)
* cpu: skip NOPs to avoid barriers

* cpu: use ggml_op_is_empty
2025-11-10 12:44:49 -08:00
Georgi Gerganov 13730c183b
metal : cap threadgroups size of set_rows (#17146) 2025-11-10 21:33:35 +02:00
Adrien Gallouët 967eb4b2bf
ggml-cpu : inspect -march and -mcpu to found the CPU (#16333)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-10 21:03:36 +02:00