Commit Graph

6551 Commits

Author SHA1 Message Date
Diego Devesa 360d6533db
ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797)
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type

ggml-backend : add device id to device props

llama : only use iGPU devices if there are no GPU devices

llama : do not use multiple devices from different backends with the same device id
2025-09-11 22:47:38 +02:00
Johannes Gäßler 0e6ff0046f
CUDA: larger SRAM reads for tile FA, AMD FP16 dot (#15927)
* CUDA: larger SRAM reads for tile FA, AMD FP16 dot

* fix logic for availability of v_dot2_f32_f16
2025-09-11 21:19:58 +02:00
ddh0 df082f5630
nitpick : correct MB to MiB (#15934)
MB was incorrectly used for 1024 x 1024 bytes instead of MiB
2025-09-11 19:12:34 +02:00
Daniel Bevenius 24a6734daf
ggml-cpu : add check for ARM MATMUL_INT8/i8mm support (#15922)
This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling
MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when
the target hardware actually supports them.

The motivation for this is to fix ggml CI build failures where the
feature detection correctly identifies that i8mm is not supported,
adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are
still enabled, causing the compiler to attempt to use vmmlaq_s32
intrinsics without i8mm support.

Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499
2025-09-11 14:39:12 +01:00
Charles Xu 2b3efea9a4
kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed (#15614)
* kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed

* removes the Whisper-specific check for GET_ROWS support
2025-09-11 12:45:40 +02:00
hipudding c0389dba43
CANN: Disable acl_graph for prefill stage (#15933)
Since the prefill length is not fixed, graphs constructed for the
prefill stage cannot be reused. For this reason, ACL graph
execution is disabled by default during prefill.
2025-09-11 15:59:37 +08:00
Oliver Simons 00681dfc16
CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (#15872)
* Add fastdiv and fastmodulo to k_bin_bcast kernel

* Address review comments

* `prod_` instead of `prod` suffix

* Add test case for `k_bin_bcast_unravel` in CUDA backend
2025-09-10 22:04:03 +02:00
Jie Fu (傅杰) 4f658855fa
llama : support T5 models with unequal number of encoder-decoder layers (#15909)
* Extend the support of T5 models with different encoder-decoder layers

Signed-off-by: Jie Fu <jiefu@tencent.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-arch.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Rename n_dec_layer --> dec_n_layer

Signed-off-by: Jie Fu <jiefu@tencent.com>

* Adapt to cases when dec_n_layer > n_layer

Signed-off-by: Jie Fu <jiefu@tencent.com>

---------

Signed-off-by: Jie Fu <jiefu@tencent.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-10 20:51:51 +02:00
Sigbjørn Skjæret 6ab397e12b
graph : support non-contiguous Q in build_attn_mha (#15908)
* support non-contiguous Q in build_attn_mha

* Update src/llama-graph.cpp

ggml-ci

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-10 19:08:59 +02:00
Daniel Bevenius 9de447d94e
ggml-cpu : fix padding in ggml_timestep_embedding (#15917)
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:

Elements 0-6 are set to cosine values
Elements 7-13 are set to sine values
Element 14 is left uninitialized (contains garbage)
Element 15 is correctly set to zero

This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero as well
as the last element.

Resolves: https://github.com/ggml-org/ggml/issues/1324
2025-09-10 17:31:40 +02:00
Georgi Gerganov 0f0a3c2851
metal : make the backend async (#15906)
* metal : make the backend async

ggml-ci

* cont : add comments, extend op offload, clean up

ggml-ci

* metal : fix batch size for MUL_MAT_ID

* metal : remove deprecated ggml_backend_metal_buffer_from_ptr

* metal : create only metal buffers, no wrapping of host memory

ggml-ci

* metal : restore .alloc_buffer for buffer_from_ptr_type

ggml-ci

* metal : remove broken implementation of GGML_OP_SET

ggml-ci

* metal : clean-up loose ends, ready for tests

ggml-ci

* metal : support both private and shared buffers

ggml-ci

* metal : enable private buffers + add global device queue

* metal : disable host buffer to prevent races

ggml-ci

* metal : avoid extra copy during set_tensor

ggml-ci

* metal : use separate buffer types for shread and private Metal buffers

ggml-ci

* metal : simplify synchronization logic

ggml-ci

* metal : fix build

ggml-ci

* metal : do not implement cpy_tensor

ggml-ci

* metal : separate implementations for shared and private buffers

ggml-ci
2025-09-10 17:52:35 +03:00
Daniel Bevenius 33daece86b
ci : add caching for ROCm installation in release workflow (#15924)
This commit applies the same caching to the release workflow which
currently exists for the main CI workflow that was introduced in Commit
ff02caf9ee ("ci : cache ROCm installation
in windows-latest-cmake-hip (#15887)").
2025-09-10 15:39:57 +02:00
Daniel Bevenius e7b6d83b52
tests : filter out no-ops from coverage report (#15900)
* tests : filter out no-ops from coverage report

This commit is a follow-up commit for #15745 to address the feedback on
how no-op operations should be filtered out from the coverage report.

The feedback regarding the UNARY and GLU sub-operations not being
handled I not exactly sure what should be done. They are included in the
coverage, for example ABS, ELU, EXP, GELU, GEGLU, GEGLU_ERF etc are in
the list of covered operations:
```console
$ ./build/bin/test-backend-ops --show-coverage
Operations covered by tests (89):
  ✓ ABS
  ✓ ACC
  ✓ ADD
  ✓ ADD1
  ✓ ADD_ID
  ✓ ARANGE
  ✓ ARGMAX
  ✓ ARGSORT
  ✓ CLAMP
  ✓ CONCAT
  ✓ CONV_2D
  ✓ CONV_2D_DW
  ✓ CONV_3D
  ✓ CONV_TRANSPOSE_1D
  ✓ CONV_TRANSPOSE_2D
  ✓ COS
  ✓ COUNT_EQUAL
  ✓ CPY
  ✓ CROSS_ENTROPY_LOSS
  ✓ CROSS_ENTROPY_LOSS_BACK
  ✓ DIAG_MASK_INF
  ✓ DIV
  ✓ DUP
  ✓ ELU
  ✓ EXP
  ✓ FLASH_ATTN_EXT
  ✓ GATED_LINEAR_ATTN
  ✓ GEGLU
  ✓ GEGLU_ERF
  ✓ GEGLU_QUICK
  ✓ GELU
  ✓ GELU_ERF
  ✓ GELU_QUICK
  ✓ GET_ROWS
  ✓ GET_ROWS_BACK
  ✓ GROUP_NORM
  ✓ HARDSIGMOID
  ✓ HARDSWISH
  ✓ IM2COL
  ✓ IM2COL_3D
  ✓ L2_NORM
  ✓ LEAKY_RELU
  ✓ LOG
  ✓ MEAN
  ✓ MUL
  ✓ MUL_MAT
  ✓ MUL_MAT_ID
  ✓ NEG
  ✓ NORM
  ✓ OPT_STEP_ADAMW
  ✓ OPT_STEP_SGD
  ✓ OUT_PROD
  ✓ PAD
  ✓ PAD_REFLECT_1D
  ✓ POOL_2D
  ✓ REGLU
  ✓ RELU
  ✓ REPEAT
  ✓ REPEAT_BACK
  ✓ RMS_NORM
  ✓ RMS_NORM_BACK
  ✓ ROLL
  ✓ ROPE
  ✓ ROPE_BACK
  ✓ RWKV_WKV6
  ✓ RWKV_WKV7
  ✓ SCALE
  ✓ SET
  ✓ SET_ROWS
  ✓ SGN
  ✓ SIGMOID
  ✓ SILU
  ✓ SILU_BACK
  ✓ SIN
  ✓ SOFT_MAX
  ✓ SOFT_MAX_BACK
  ✓ SQR
  ✓ SQRT
  ✓ SSM_CONV
  ✓ SSM_SCAN
  ✓ STEP
  ✓ SUB
  ✓ SUM
  ✓ SUM_ROWS
  ✓ SWIGLU
  ✓ SWIGLU_OAI
  ✓ TANH
  ✓ TIMESTEP_EMBEDDING
  ✓ UPSCALE

Operations without tests (14):
  ✗ ADD_REL_POS
  ✗ CUSTOM
  ✗ DIAG
  ✗ DIAG_MASK_ZERO
  ✗ FLASH_ATTN_BACK
  ✗ GET_REL_POS
  ✗ IM2COL_BACK
  ✗ MAP_CUSTOM1
  ✗ MAP_CUSTOM2
  ✗ MAP_CUSTOM3
  ✗ POOL_1D
  ✗ POOL_2D_BACK
  ✗ WIN_PART
  ✗ WIN_UNPART

Coverage Summary:
  Total operations: 103
  Tested operations: 89
  Untested operations: 14
  Coverage: 86.4%
```

Refs: https://github.com/ggml-org/llama.cpp/pull/15745

* use of ggml_op enum values instead of strcmp
2025-09-10 14:17:09 +02:00
j-k 2cfef4d117
media : add transparent icon svg and png [no ci] (#15891) 2025-09-10 14:51:28 +03:00
Jesse 09e72a037c
gitignore : Ignore vim swap files in tests (#15901) 2025-09-10 14:28:47 +03:00
Chenguang Li 10d8b2b6b0
CANN: Add ROPE sin/cos cache for reuse (#15912)
* CANN: Add ROPE sin/cos cache for reuse

Introduce sin/cos caching mechanism in ROPE to avoid redundant
computation across layers. The cache is built on the first layer
per device and reused by subsequent layers if parameters match.

- Added sin_cache / cos_cache pointers and position_length tracking
- Introduced cache validity flags and properties:
  (ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
- Accelerates ROPE by eliminating repeated sin/cos generation

This change reduces overhead in multi-layer scenarios while
preserving correctness by verifying parameter consistency.

Co-authored-by: hipudding <huafengchun@gmail.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-10 18:42:00 +08:00
Chenguang Li 28b5f190ef
CANN: implement LRU cache for ACL graphs (#15814)
* CANN: implement LRU cache for ACL graphs in CANN backend

- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.

* fix typo

* The LRU cache capacity can be configured via an env variable

Signed-off-by: noemotiovon <757486878@qq.com>

* refactory acl graph

* refactory && fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-10 15:29:12 +08:00
Daniel Bevenius 86587da03b
llama : check returned fn ptrs from ggml_backend_reg_get_proc_address (#15893)
This commit adds check for two function pointers returned from
ggml_backend_reg_get_proc_address.

The motivation for this is that the function pointer could be nullptr if
the get proc address function changes in the future. This is also
consistent with all the other calls to ggml_backend_reg_get_proc_address
in the code base.
2025-09-10 05:33:58 +02:00
Daniel Bevenius ff02caf9ee
ci : cache ROCm installation in windows-latest-cmake-hip (#15887)
This commit adds caching of the ROCm installation for the windows-latest-cmake-hip job. 

The motivation for this is that the installation can sometimes hang and/or not complete properly leaving an invalid installation which later fails the build. By caching the installation hopefully we can keep a good installation available in the cache and avoid the installation step.

Refs: https://github.com/ggml-org/llama.cpp/pull/15365
2025-09-10 05:23:19 +02:00
Ruben Ortlam ae355f6f71
vulkan: throw the oom error instead of no memory type found (#15905) 2025-09-09 22:26:03 +02:00
Jeff Bolz 4f63cd705c
vulkan: Fix OOB accesses in soft_max_back (#15861) 2025-09-09 14:41:15 +02:00
Johannes Gäßler 17bc5a815f
HIP: use v_dot2_f32_f16 instruction for FA (#15884) 2025-09-09 14:04:43 +02:00
lksj92hs ed54e32558
Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (#15886) 2025-09-09 14:01:15 +02:00
Aman Gupta a972faebed
CUDA: Add mul_mat_id support for the mmf kernel (#15767)
* CUDA: Add mul_mat_id support the mmf

Add support for mul_mat_id for bs < 16

* Review: use warp_size, fix should_use_mmf condition

* Launch one block per expert, stride along n_expert_used

* templatize mul_mat_id

* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids

* Reduce compile times by dividing mmf into f16, bf16 and f32 variants

* Divide mmf by ncols_dst

* Add missing files

* Fix MUSA/HIP builds
2025-09-09 14:38:02 +08:00
Johannes Gäßler 550cf726e1
CUDA: fix GET_ROWS for large tensors (#15882) 2025-09-09 08:11:01 +02:00
Georgi Gerganov c252ce67c4
contrib : add notes about merging PRs (#15881)
* contrib : add notes about merging PRs

* Update CONTRIBUTING.md

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-09 08:42:10 +03:00
Daniel Bevenius 70cd37dbbe
requirements : update transformers/torch for Embedding Gemma (#15828)
* requirements : update transformers/torch for Embedding Gemma

This commit updates the requirements to support converting
Embedding Gemma 300m models.

The motivation for this change is that during development I had a local
copy of the transformers package which is what I used for converting
the models. This was a mistake on my part and I should have also updated
my transformers version to the official release.

I had checked the requirements/requirements-convert_legacy_llama.txt
file and noted that the version was >=4.45.1,<5.0.0 and came to the
conculusion that no updated would be needed, this assumed that
Embedding Gemma would be in a transformers release at the time
Commit fb15d649ed ("llama : add support
for EmbeddingGemma 300m (#15798)) was merged. So anyone wanting to
convert themselves would be able to do so. However, Embedding Gemma is
a preview release and this commit updates the requirements to use this
preview release.

* resolve additional python dependencies

* fix pyright errors in tokenizer test and remove unused import
2025-09-09 06:06:52 +02:00
Piotr Wilkin (ilintar) acc1b008cf
model-conversion : add extra debugging support for model conversion (#15877)
* feat: Extra debugging support for model conversion - added BF16 support for llama-callback-eval and support for dumping intermediate steps in run-org-model.py
2025-09-09 06:05:55 +02:00
Aldehir Rojas 7057faf64b
json : support `enum` values within `allOf` (#15830) 2025-09-08 16:14:32 -05:00
j-k fe1c92cd7b
media : add llama1 icon (#15878)
Add svg and png based off llama1-icon.svg
2025-09-08 21:57:01 +03:00
Jeff Bolz e68aa10d8f
vulkan: sort graph to allow more parallel execution (#15850)
* vulkan: sort graph to allow more parallel execution

Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.

With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
2025-09-09 02:10:07 +08:00
Aman Gupta 0a16bf52e6
CUDA: generate_cu_files.py - add missing mxfp4 (#15880) 2025-09-09 01:23:46 +08:00
Jesse 88021565f0
chat : Deepseek V3.1 reasoning and tool calling support (OpenAI Style) (#15533)
* Add DeepSeek V3.1 thinking mode support

- Added COMMON_CHAT_FORMAT_DEEPSEEK_V3_1 enum value
- Created common_chat_params_init_deepseek_v3_1() function (currently uses R1 implementation)
- Created common_chat_parse_deepseek_v3_1() function that handles V3.1 thinking format:
  - Extracts reasoning content before '</think>' tag into reasoning_content
  - Extracts regular content after '</think>' tag into content
  - No opening '<think>' tag in V3.1 format
- Added detection logic for V3.1 templates based on pattern: 'message['prefix'] is defined and message['prefix'] and thinking'
- Added V3.1 case to parsing switch statement

This addresses the issue where V3.1 outputs reasoning content followed by '</think>' and then regular content without the opening '<think>' tag.

* Another attempt by V3.1 non-thinking

* Fix test, but it's not asserting anything.

* Ignore vim swap files in tests dir

* Update the test

* Try using try_find_literal instead of regex

* passing test

* Revert "Try using try_find_literal instead of regex"

This reverts commit c50d887ec2.

* Remove unnecessary change

* Remove comment

* Add code to handle non-thinking mode.

* Try to set message['prefix'] when thinking is enabled.

* This fixes reasoning, but breaks normal content. We need state in the
chat parser.

* DeepSeek V3.1 thinking is now the default. Disable with `--reasoning-budget 0`.

* Simplify (DeepSeek V3.1 reasoning)

* Fix sign inversion bug

* Add some tool calling code (not working).

* Tool calls working in non-reasoning mode.

* Attempt a unit test for tool call parsing.

* Passing test

* Add tests for both happy path and broken fenced DeepSeek V3.1 tool call variants.

* Passing DeepSeek V3.1 tool call tests, but model is not working.

* Revert assistance response prefill change. Not my monkeys.

* Add fenced_thinking unit test variant. Passes, but thinking tool calling
still isn't working for some reason.

* Tests pass in reasoning mode. Also e2e tool test passes.

* Make a copy of the parse_json_tool_calls function for deepseek-v3.1 so
as to not accidentally introduce regressions.

* Fix thinking_forced_open logic. tool calling broken. Need to add another
test case.

* That's what I get for cargo culting a newline.

* Add multi tool call test for deepseek v3.1 non-reasoning

* Move test, remove .gitignore change

* Place deepseek-v3.1 reasoning test directly into existing reasoning
function per CISC's request.

* Address whitespace CI failure.

* Merge two assert_equals per CISC's request.

* Add DeepSeek-V3.1 tests to tests/test-chat.cpp per CISC's request.

* Merge deepseek V3.1 and regular parse_json_tool_calls() function
behaviors by adding optional update_cursor argument.

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* DeepSeek V3.1 fix reasoning_format none

* Strip grammar down to strictly what we expect based on model card. Throw
out parts we cargo culted from R1 that don't make sense.

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* DeepSeek V3.1 - Add edge case where thinking is forced open, there is
tool calling in the reasoning content, but then the model just stops the
output without closing the </think> tag, so it's not a partial. In this
case, use the tool call in the reasoning content.

* DeepSeek V3.1 - simplify update_cursor

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix indent

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-08 16:59:48 +02:00
Xuan-Son Nguyen 56920f5665
server : bring back timings_per_token (#15879) 2025-09-08 16:50:05 +02:00
Georgi Gerganov b0d52998b9
cuda : fix supports_op condition for get_rows when number of blocks is too large (#15868)
* cuda : fix supports_op condition for get_rows when src1->ne2 > 1

ggml-ci

* ggml : add comment about ggml_get_rows

ggml-ci

* cuda : add FIXME [no ci]

* cuda : update support condition

ggml-ci
2025-09-08 13:56:51 +03:00
Georgi Gerganov f28d4f4ac9
metal : refactor + optimize (#15857)
* metal : refactor

ggml-ci

* cont : refactor FA-vec kernel

* cont : print metal library load time

* minor : warn to debug + bettern kernel names

ggml-ci

* metal : optimize mul_mv q8_0

ggml-ci

* metal : simplify FA pipeline creation functions

ggml-ci

* metal : improve naming consistency

* metal : safer function constants offsets

ggml-ci

* metal : comments

ggml-ci
2025-09-08 13:34:56 +03:00
Xuan-Son Nguyen 9fcb29f22f
ggml: allow casting between f32 and i32 (#15783)
* ggml: allow casting between f32 and i32

* fix cuda

* add vulkan

* fix CPU non-cont

* add non-cont test case

* add note

* extend test number range

* correct note

* add cont version for vulkan
2025-09-08 12:33:01 +02:00
Sigbjørn Skjæret 5ef22d281d
CUDA: non-contiguous src0 not supported for PAD (#15869) 2025-09-08 12:55:44 +03:00
Daniel Bevenius 233d773d02
convert : force setting sliding_window from original config (#15867)
* convert : force setting sliding_window from original config

This commit modifies the set_gguf_parameters method for EmbeddingGemma
so that it reads the sliding_window parameter from the original model
config.json and uses that value.

The motivation for this change is that the Gemma3TextConfig
constructor adjusts the sliding_window value, which can lead to
inconsistencies when converting models as we expects this value to
match the original model's configuration.

Refs: bb45d3631e/src/transformers/models/gemma3/configuration_gemma3.py (L230)

* fix flake8 error

* add link to huggingface PR
2025-09-08 09:44:34 +02:00
Georgi Gerganov a885dcff11
batched-bench : fix llama_synchronize usage during prompt processing (#15835)
ggml-ci
2025-09-08 10:27:07 +03:00
Georgi Gerganov 663027fd54
context : fix n_outputs during reserve (#15858)
ggml-ci
2025-09-08 10:26:36 +03:00
Georgi Gerganov cf0e3ba150
model : avoid ggml_cont_3d for fused QKV weights (#15662)
* model : avoid ggml_cont_3d for fused QKV weights

ggml-ci

* kv-cache : make cpy_k and cpy_v implementation more readable

ggml-ci

* cont : add comments

ggml-ci

* cont : minor fix [no ci]

* cont : one more fix

* cont : clarity

ggml-ci

* kv-cache : require contiguous heads of k_cur and v_cur

ggml-ci
2025-09-08 10:25:33 +03:00
Jeff Bolz d413dca003
tests: large sizes for get_rows (#15687) 2025-09-07 23:23:41 -05:00
Chenguang Li 85ca66a746
CANN: Stream sync between devices for acl_graph (#15809)
* CANN: Switch to stream synchronization

Switch to stream synchronization because events are not effective.

Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: add Comments

---------

Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-08 10:03:29 +08:00
Jeff Bolz 3976dfbe00
vulkan: support im2col_3d (#15795) 2025-09-07 13:50:26 -05:00
Aaron Teo d36e61c580
ggml-cpu: clean up s390x SIMD (#15855)
* ggml-cpu: clean up s390x simd

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0da4b6aa07)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 02:18:28 +08:00
Jeff Bolz c97b5e5854
vulkan: Support pad_ext (#15794) 2025-09-07 19:00:49 +02:00
Jeff Bolz 267e99867f
vulkan: Use larger loads in scalar/coopmat1 matmul (#15729)
I think glslang will translate an access like x[i][1].z to
OpAccessChain ... x, i, 1, 2
OpLoad float16_t ...

rather than loading all of x[i] in a single OpLoad. Change the
code to explicitly load the vector/matrix.
2025-09-07 18:53:07 +02:00
Daniel Bevenius 3b15924d71
ggml WebGPU: remove userdata from request adapter callback (#15527)
* ggml WebGPU: remove userdata from request adapter callback

This commit removes the `userdata` parameter from the WebGPU request
adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function
captures the `webgpu_context` directly.

The motivation for this change is to simplify the code and improve
readability.

* inline the callback lambda into the RequestAdapter call

This commit removes the callback lambda variable and inlines it directly
into the RequestAdapter call.
2025-09-07 11:19:45 +03:00
Johannes Gäßler 79bc429262
CUDA: faster tile FA (Pascal/AMD), headsize 256 (#15769) 2025-09-07 00:26:28 +02:00