Aman Gupta
efe3a90996
CUDA cmake: add `-lineinfo` for easier debug ( #15260 )
2025-08-12 17:21:45 +08:00
Chenguang Li
bbd57b7eaf
CANN: GGML_OP_CPY optimization ( #15070 )
Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-12 16:12:13 +08:00
R0CKSTAR
25ff6f7659
musa: fix failures in test-backend-ops for mul_mat_id op ( #15236 )
* musa: fix failures in test-backend-ops for mul_mat_id op
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-12 10:02:51 +08:00
hipudding
be48528b06
CANN: Add broadcast for softmax and FA ( #15208 )
* refactor softmax
* fix fa
* fix mask shape
* format
* add comments
* Remove whitespace
2025-08-11 22:50:31 +08:00
rainred
cf9e5648a7
mtmd : Fix MinicpmV model converter and clip to avoid using hardcoded values. ( #14750 )
* Fix MinicpmV model converter and clip to avoid using hardcoded values.
* Code update for pr/14750
* Remove unused field, update script path in docs.
* Add version 5 for fallback code.
---------
Co-authored-by: lzhang <zhanglei@modelbest.cn>
2025-08-11 16:12:12 +02:00
Xuan-Son Nguyen
fba5c0d680
chat : hotfix gpt-oss jinja raising an exception ( #15243 )
* chat : hotfix gpt-oss jinja raising an exception
* fix
2025-08-11 15:31:35 +02:00
Xuan-Son Nguyen
53d0a12658
server : allow specifying reasoning_format in HTTP request ( #15238 )
2025-08-11 14:48:41 +02:00
Zagaj
27093afe78
readme : update infra list ( #15234 )
2025-08-11 15:27:54 +03:00
Georgi Gerganov
228f724d9c
kv-cache : fix seq_rm with seq_id == -1 ( #15226 )
* kv-cache : fix seq_rm with seq_id == -1
ggml-ci
* cont : iterate over streams
ggml-ci
2025-08-11 13:58:24 +03:00
Daniel Bevenius
cd3069dfcb
kv-cache : log (debug) all streams in find_slot ( #15176 )
This commit updates `llama_kv_cache_unified::find_slot` to log
information for all streams when debug is enabled.
The motivation for this change is that, when a non-unified
kv-cache is used, only one stream is currently logged because the
code uses `seq_to_stream[1]`.
2025-08-11 11:21:19 +02:00
Sigbjørn Skjæret
50e81bdf5d
convert : fix merge conflicts ( #15229 )
2025-08-11 11:15:44 +02:00
Daniel Bevenius
1ebbaddff2
perplexity : update comments/error msg to use decode [no ci] ( #15227 )
This commit updates comments and error messages to use "decode" instead
of "eval" in perplexity.cpp.
The motivation for this is that `llama_eval` was renamed to
`llama_decode` a while ago, but the comments and error messages
still referred to "eval". This change ensures consistency and clarity.
2025-08-11 11:21:24 +03:00
Julien Denize
a3a7874272
convert : improve Mistral models integration ( #14737 )
* Improve Mistral models integration with llama.cpp
* Revert changes and fix gguf
* Revert change
* refactor convert_mistral_to_gguf.py in convert_hf_to_gguf.py
* Revert collateral
* Rename model name
* refactor
* revert
* remove duplicate
* Remove duplication code
* Fixes
* Fix flake issues
* Apply comments
* Apply comments
* Apply comments
* Fix remote
* add default chat template
* Revert
* nit
2025-08-11 10:07:49 +02:00
Charles Xu
002cb1bb33
kleidiai: fix unsigned overflow bug ( #15150 )
* kleidiai: fix unsigned overflow bug
* address review comments
2025-08-11 09:59:26 +02:00
David Zhao
79c1160b07
cuda: refactored ssm_scan and use CUB ( #13291 )
* cuda: refactored ssm_scan to use CUB
* fixed compilation error when not using CUB
* assign L to constant and use size_t instead of int
* deduplicated functions
* change min blocks per mp to 1
* Use cub load and store warp transpose
* suppress clang warning
2025-08-09 20:29:43 +02:00
Ed Addario
89051cda35
Update README.md
2025-08-09 14:49:44 +01:00
Ed Addario
dcac206f8e
Add --activation-statistics logic to avoid doubling the imatrix size by default
2025-08-09 14:49:25 +01:00
Aman Gupta
34c9d765bf
CUDA: add attention sinks for tile and wmma ( #15178 )
* CUDA: add attention sinks for tile and wmma
* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
2025-08-09 20:00:24 +08:00
Ed Addario
6fe51e12f1
Fix typo in ECS formula
2025-08-09 09:12:23 +01:00
Ed Addario
94679635c0
Merge branch 'master' into imatrix
2025-08-09 01:29:44 +01:00
Ed Addario
59af5034f7
Update README.md
2025-08-09 01:26:23 +01:00
compilade
e54d41befc
gguf-py : add Numpy MXFP4 de/quantization support ( #15111 )
* gguf-py : add MXFP4 de/quantization support
* ggml-quants : handle zero amax for MXFP4
2025-08-08 17:48:26 -04:00
Johannes Gäßler
4850b52aed
server-bench: external OAI servers, sqlite ( #15179 )
* server-bench: external OAI servers, sqlite
* Update scripts/server-bench.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update scripts/server-bench.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update scripts/server-bench.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* raise_for_status
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-08 23:04:36 +02:00
AN Long
cd6983d56d
ggml : fix field name when creating a new ggml_backend ( #14944 )
2025-08-08 14:37:22 +02:00
Olivier Chafik
6c7e9a5440
vendor: sync minja ( #15161 )
* vendor: sync minja
* Update minja.hpp
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-08 10:45:18 +01:00
Johannes Gäßler
1425f587a8
CUDA: attention sinks for mma FlashAttention ( #15157 )
2025-08-08 08:19:58 +02:00
lhez
aaa3d07ae7
opencl: support sink in `soft_max` (attn sinks) ( #15152 )
2025-08-07 21:47:03 -07:00
Xuan-Son Nguyen
50aa938901
convert : support non-mxfp4 HF model ( #15153 )
* convert : support non-mxfp4 HF model
* rm redundant check
* disable debug check
2025-08-07 23:26:03 +02:00
Jeff Bolz
c4f53563df
vulkan: support fattn sinks ( #15126 )
2025-08-07 22:44:20 +02:00
Jeff Bolz
a0552c8bee
vulkan: Add env var to disable host visible vidmem ( #15109 )
2025-08-07 22:07:11 +02:00
Ed Addario
c5ecdaa1a1
Add Euclidean–Cosine Score (ECS)
2025-08-07 19:04:49 +01:00
Ed Addario
5bb2def02d
Add --activation-statistics parameter
2025-08-07 17:41:21 +01:00
RunningLeon
99acbc9921
llama : Support intern-s1 ( #14875 )
* support internvl
* support interns1
* resolve comments
* put interns1 in tensor mapping
* resolve comment
* move tokenizer changes to sub class
2025-08-07 18:20:40 +02:00
uvos
7ad67ba9fe
HIP: add cmake option to enable compiler output of kernel resource usage metrics ( #15103 )
2025-08-07 16:44:14 +02:00
Ed Addario
dadd90ef73
Rename report heading
2025-08-07 14:07:48 +01:00
Christian Kastner
9a96389544
ggml: Skip backend library linking code when GGML_BACKEND_DL=ON ( #15094 )
Any available libraries are found and loaded dynamically at runtime.
2025-08-07 13:45:41 +02:00
Ed Addario
e0d6471340
Reverse conditional logic to match convention
2025-08-07 12:04:52 +01:00
Ed Addario
3e9d53c61e
Refactor variable names
2025-08-07 12:03:24 +01:00
Ed Addario
c7959edff5
Merge branch 'master' into imatrix
2025-08-07 11:51:33 +01:00
Johannes Gäßler
1d72c84188
CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 ( #15131 )
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
2025-08-07 10:53:21 +02:00
Johannes Gäßler
20638e4f16
scripts: fix crash when --tool is not set ( #15133 )
2025-08-07 08:50:30 +02:00
Daniel Bevenius
36d3f00e14
requirements : fix PyTorch uint64 compatibility ( #15134 )
This commit addresses an issue with the convert_hf_to_gguf script
which is currently failing with:
```console
AttributeError: module 'torch' has no attribute 'uint64'
```
This occurred because safetensors expects torch.uint64 to be available
in the public API, but PyTorch 2.2.x apparently provides only limited
support for unsigned types beyond uint8: the torch.uint64 dtype exists
but is not exposed in the standard torch namespace
(see pytorch/pytorch#58734).
PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving
the compatibility issue with safetensors. This also required torchvision
to be updated to 0.19.0 for compatibility.
Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: https://github.com/pytorch/pytorch/issues/58734
2025-08-07 05:31:48 +02:00
Reese Levine
5fd160bbd9
ggml: Add basic SET_ROWS support in WebGPU ( #15137 )
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
2025-08-06 15:14:40 -07:00
rmatif
756cfea826
fix profiling crash ( #15072 )
2025-08-06 14:17:51 -07:00
lhez
e725a1a982
opencl: add `swiglu_oai` and `add_id` ( #15121 )
* opencl: add `swiglu-oai`
* opencl: add `add_id`
* opencl: add missing `add_id.cl`
2025-08-06 12:12:17 -07:00
Sachin Desai
3db4da56a5
chat : support Granite model reasoning and tool call ( #14864 )
2025-08-06 20:27:30 +02:00
Juk Armstrong
476aa3fd57
Fixed name `-override-tensors` to `-override-tensor` ( #15129 )
2025-08-06 17:28:48 +01:00
Diego Devesa
0d8831543c
ggml : fix fallback to CPU for unsupported ops ( #15118 )
2025-08-06 14:37:35 +02:00
Sigbjørn Skjæret
65c797c4fa
chat : fix yandex chat template ( #15116 )
2025-08-06 13:26:49 +02:00
stevenkuang
25726898e8
chat : fix hunyuan auto-detection ( #15114 )
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
2025-08-06 11:48:30 +02:00