Commit Graph

7331 Commits

Author SHA1 Message Date
Ruben Ortlam 80deff3648
vulkan: fix MMQ quantize_y condition (#17301) 2025-11-16 19:38:17 +01:00
Eve 8b1c339bd2
ci : revert #16249 (#17303)
* Delete .github/workflows/build-amd.yml

* Update build.yml
2025-11-16 19:09:17 +01:00
Georgi Gerganov 416e7c7f47
metal : remove obosolete asserts (#17295) 2025-11-16 09:50:26 +02:00
Georgi Gerganov 5b2093becc
server : handle context overflow during decode (#17267)
* server : handle context overflow during decode

* server : minor refactor
2025-11-16 09:23:37 +02:00
lhez 52e5d421f1
opencl: fix rms_norm_mul (#17250)
* opencl: use subgrroup reduce for reduction in rms_norm_mul

* opencl: add comment about workgroup size
2025-11-15 17:40:14 -08:00
shaofeiqi 4db5641210
opencl: add kernel to handle mat mul in attention to improve encoding speed (#17181)
* Add mul_mm_f16_f32_kq_kqv kernel

* Add ggml_cl_mul_mat_kq_kqv_adreno func

* fix whitespace

* remove unused variable

* remove redundant

* refactor and clean up

* remove trailing whitespace
2025-11-15 17:33:10 -08:00
shani-f 72bd7321a7
sycl : unify unary kernels with a generic implementation and enable wide operator support (#17213)
* SYCL: add generic unary op implementation for multiple ops (ABS/SGN/…); unify non-contiguous access

* SYCL: update documentation and sycl.csv to reflect new unary op support

* update ops.md after syncing SYCL.csv changes

* Fix SYCL.csv merge conflict

* Update ops.md after fixing SYCL.csv conflicts

* Fix SYCL.csv tail after merge conflict and regenerate ops.md

* Fix line endings and final newline in SYCL.csv

* Remove TOPK_MOE entries from SYCL.csv as requested

* Update ops.md after removing TOPK_MOE from SYCL.csv

* Regenerated SYCL.csv and synced ops.md with upstream

* Update ops.md using create_ops_docs.py
2025-11-16 00:52:42 +01:00
Aleksander Grygier 22e1ce2f81
webui: Fix clickability around chat processing statistics UI (#17278)
* fix: Better pointer events handling in chat processing info elements

* chore: update webui build output
2025-11-15 22:41:41 +01:00
Pascal 1411d9275a
webui: add OAI-Compat Harmony tool-call streaming visualization and persistence in chat UI (#16618)
* webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI

- Purely visual and diagnostic change, no effect on model context, prompt
  construction, or inference behavior

- Captured assistant tool call payloads during streaming and non-streaming
  completions, and persisted them in chat state and storage for downstream use

- Exposed parsed tool call labels beneath the assistant's model info line
  with graceful fallback when parsing fails

- Added tool call badges beneath assistant responses that expose JSON tooltips
  and copy their payloads when clicked, matching the existing model badge styling

- Added a user-facing setting to toggle tool call visibility to the Developer
  settings section directly under the model selector option

* webui: remove scroll listener causing unnecessary layout updates (model selector)

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: npm run format & update webui build output

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-11-15 21:09:32 +01:00
Sigbjørn Skjæret 662192e1dc
convert : remove unnecessary chat template patching (#17289) 2025-11-15 20:58:59 +01:00
Jeff Bolz 24dc769f1b
vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (#17287)
These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.
2025-11-15 19:54:23 +01:00
Ruben Ortlam 4dca015b7e
vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (#17285) 2025-11-15 15:18:58 +01:00
Sigbjørn Skjæret 9a8860cf5d
convert : use all parts in safetensors index (#17286) 2025-11-15 14:12:39 +01:00
Sigbjørn Skjæret 9d3ef4809f
convert : set expert gating func in base class (#17279) 2025-11-15 14:06:24 +01:00
Ankur Verma c7b7db0445
mtmd-cli: Avoid logging to stdout for model loading messages in mtmd-cli (#17277) 2025-11-15 12:41:16 +01:00
Giuseppe Scrivano 1568d13c2c
vulkan: implement ABS and NEG (#17245)
* docs: update Vulkan ops

* vulkan: add NEG op

* vulkan: add ABS op

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-15 12:00:29 +01:00
Jeff Bolz 439342ea0b
vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (#17244)
* vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths

* set allow_misalign
2025-11-15 11:56:15 +01:00
Jeff Bolz 234ae7d7bd
vulkan: skip all-negative-inf blocks in FA (#17186) 2025-11-15 10:37:25 +01:00
Jeff Bolz 38eaf32af1
vulkan: change graph_compute to be async and enable get_tensor_async (#17158)
* vulkan: change graph_compute to be async and enable get_tensor_async

This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.

Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.

The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.

The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.

* fix thread safety errors

* teardown context cleanly

* Handle async read to non-pinned dst
2025-11-15 09:06:41 +01:00
Xuan-Son Nguyen 9b17d74ab7
mtmd: add mtmd_log_set (#17268) 2025-11-14 15:56:19 +01:00
Bartowski e1fcf8b09b
model : add AfmoeForCausalLM support (#16477)
* Add AFMOE model support

* Update to vocab

* Add model sizing

* Undo Rope change for ARCEE model

* Address review comments

* Update modeling code is_sliding -> use_rope, replace hard-coded logic

* Fix AFMOE tokenizer

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update AFMoE tokenizer class identification to be more unique

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-14 13:54:10 +01:00
Marek Hradil jr. 6cd0cf72ce
fix : Dangling pointer for non-empty trigger words in lazy grammar construction (#17048)
* fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (#17047)

* Replace 'static' workaround, with keeping variable in scope for longer

* Create std::array directly and pass into llama_grammar_init_impl

* Add back the trigger pattern

* Missed array include
2025-11-14 14:35:26 +02:00
Georgi Gerganov d396b43748
server : fix "can batch with" bug (#17263) 2025-11-14 14:03:45 +02:00
Georgi Gerganov 45c6ef7307
metal : support argsort for ne00 > 1024 (#17247)
* metal : refactor argsort

* cont : sort chunks

* cont : merge sorted buckets

* cont : cleanup
2025-11-14 09:36:06 +02:00
Georgi Gerganov 2606b0adab
metal : make the FA extra sizes consistent (#17143) 2025-11-14 09:13:34 +02:00
ixgbe 307772fcda
readme : add RVV,ZVFH,ZFH,ZICBOP support for RISC-V (#17259)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-14 09:12:56 +02:00
Aleksander Grygier f1bad23f88
Better UX for handling multiple attachments in WebUI (#17246) 2025-11-14 01:19:08 +01:00
Alberto Cabrera Pérez becc4816dd
ggml-cpu: handle 3d tensors in repack mat_mul (#17241)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries

* Address performance regression in Qwen and llama.cpp due to chunking
2025-11-13 12:53:00 -08:00
Xuan-Son Nguyen c4abcb2457
server: fixing naming conflict res_error (#17243) 2025-11-13 20:53:47 +01:00
Piotr Wilkin (ilintar) 389ac78b26
ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (#17063)
* Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Code review

* Whitespace

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* This is actually sigmoid, duh.

* Add CONST, remove TRI_KEEP, other changes from review

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Remove extra script

* Update ggml/src/ggml.c

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* moving changes from laptop [no ci]

* pre-rebase

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Refactor tests

* ggml : cleanup

* cont : fix ggml_fill srcs

* tests : add note

* ggml : add ggml_fill_inplace

* ggml : add asserts

* ggml : fix ggml_fill constant cast

* cont : ggml_tri minor

* Use TENSOR_LOCALS

* Fix regression from #14596, regenerate

* Don't make commits at night...

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-13 20:54:47 +02:00
Ruben Ortlam a19bd6f7ce
vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (#17219)
* vulkan: remove shell call from vulkan-shaders-gen tool

* use string vector for command execution

* Fix condition

* use string, remove const_cast

* Fix dependency file quotation on Windows

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-11-13 14:51:21 +01:00
Diego Devesa dd091e52f8
sched : fix reserve ignoring user tensor assignments (#17232) 2025-11-13 13:14:02 +01:00
ixgbe 1215dde7b0
ggml-cpu : add RISC-V vector intrinsic support for silu and cvar operations (#17227)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-13 13:13:32 +01:00
bagheera 0cfb19166b
metal: accelerated conv2d (#17175)
* metal: accelerated conv2d

* cont : cleanup

---------

Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-13 13:32:44 +02:00
Georgi Gerganov 2776db6c81
Revert "ggml-cpu: handle 3d tensors in repack mat_mul (#17030)" (#17233)
This reverts commit 1c398dc9ec.
2025-11-13 12:59:37 +02:00
Diego Devesa 879dec341a
ggml-cpu : use template for argsort (#17222) 2025-11-13 10:59:05 +02:00
TecJesh 97d5117217
CANN: Add cross_entropy_loss op support (#16886)
* update L2_NORM op support

* update L2_NORM op support

* remove extra whitespace

* cann: update cross_entropy_loss op support

* remove trailing whitespaces

* rebase the latest code in the main repository and remove the l2_norm operator that already exists in another pull request.

* undo the l2_norm operator deletion
2025-11-13 09:39:51 +08:00
Aman Gupta a90eb94ca9
CUDA: fuse rope + set_rows (#16884)
* CUDA: add fused rope

* move k forward_expand up

* create helper function instead of re-using params

* make assert statement more in line with comment

* rope_norm: coalesced writes to global mem
2025-11-13 08:50:01 +08:00
Neo Zhang Jianyu 07751f8d44
update SYCL support OPs (#17208)
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
2025-11-13 08:42:23 +08:00
o7si ffb6f3d921
vocab : correct bounds check for UGM XCDA array access (#17215) 2025-11-12 23:41:02 +01:00
Johannes Gäßler 5d6838b74f
CUDA: static assert to prevent misuse of memcpy_1 (#17198) 2025-11-12 23:13:55 +01:00
Mike Abbott 92bb442ad9
docker : preserve .so symlinks for docker container builds (#17214) 2025-11-12 20:33:55 +01:00
Georgi Gerganov 374fe09cdd
ggml : use std::sort in ggml_argsort CPU implementation (#17211)
* ggml : use std::sort in ggml_argsort CPU implementation

* cont : add missing header
2025-11-12 20:43:38 +02:00
Aleksander Grygier 8e878f0cb4
Update packages + upgrade Storybook to v10 (#17201)
* chore: Update packages + upgrade Storybook to v10

* fix: Increase timeout for UI tests
2025-11-12 19:01:48 +01:00
Xuan-Son Nguyen 00c94083b3
server: (refactor) implement generator-based API for task results (#17174)
* server: (refactor) implement generator-based API for task results

* improve

* moving some code

* fix "Response ended prematurely"

* add sink.done before return false

* rm redundant check

* rm unused var

* rename generator --> reader
2025-11-12 18:50:52 +01:00
Xuan-Son Nguyen 017eceed61
ci: add check vendor job (#17179)
* ci: add check vendor job

* use dev version of miniaudio

* move to dedicated workflow, only run on related files changed
2025-11-12 14:56:02 +01:00
Xuan-Son Nguyen ee8dd5c658
server: move res_error/res_ok to static function (#17167) 2025-11-12 14:17:24 +01:00
Alberto Cabrera Pérez 1c398dc9ec
ggml-cpu: handle 3d tensors in repack mat_mul (#17030)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries
2025-11-12 14:52:19 +02:00
Adrien Gallouët 52cf111b31
cmake : cleanup (#17199) 2025-11-12 14:48:30 +02:00
Adrien Gallouët 78010a0d52
cmake : move OpenSSL linking to vendor/cpp-httplib (#17177)
* cmake : move OpenSSL linking to vendor/cpp-httplib

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* bring back httplib 0.27.0

* add -DLLAMA_HTTPLIB

* update cmake config for visionos

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-11-12 12:32:50 +01:00