Commit Graph

6144 Commits

Author SHA1 Message Date
Ed Addario 2117c4e54b
Update aggregated statistic report layout 2025-08-03 16:38:02 +01:00
Ed Addario a6155a8125
Add compute_layer_statistics() function 2025-08-03 16:35:03 +01:00
Gabriel Larson 83bc2f288c
model : add text-only support for Kimi-VL (and find special tokens in text_config) (#15051)
* basic kimi-vl textmodel conversion

* check config["text_config"] for special tokens
2025-08-03 16:56:25 +02:00
Ed Addario be60469f25
Refactor function names 2025-08-03 15:10:17 +01:00
Jeff Bolz 6c7a441161
vulkan: Use coopmat2 for conv2d (#14982) 2025-08-03 14:23:57 +02:00
Ed Addario fce05aac9e
Refactor lambda into compute_tensor_averages() function 2025-08-03 13:03:21 +01:00
Ed Addario 5324558132
Update table layout 2025-08-03 10:28:47 +01:00
Ed Addario 4d1325e1eb
Refactor variables 2025-08-03 10:28:23 +01:00
Ed Addario a32a2ecbed
Reformat report layout 2025-08-03 00:51:33 +01:00
Ed Addario 4c01f51ae1
Remove inactive 2025-08-03 00:51:12 +01:00
lhez 5c0eb5ef54
opencl: fix adreno compiler detection logic (#15029) 2025-08-02 19:51:18 +02:00
Ed Addario fc8f92596f
Update table display 2025-08-02 16:46:27 +01:00
Ed Addario ee2509f563
Adjust threshold 2025-08-02 16:45:56 +01:00
Ed Addario 9b841eb696
Compute l2 norm 2025-08-02 16:45:09 +01:00
Ed Addario b7fb362d8e
Compute cosine similarity based on activations 2025-08-02 16:43:49 +01:00
Ed Addario cce514a392
Compute entropy for activations 2025-08-02 16:40:40 +01:00
Ed Addario 9744a4a1c6
Determine calculation mode 2025-08-02 16:36:12 +01:00
Ed Addario 78ddb475de
Fix problem when GGUF does not have in_sum 2025-08-02 16:31:21 +01:00
Johannes Gäßler 03d4698218
CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (#15035) 2025-08-02 16:37:08 +02:00
leejet 3303c19b16
cuda: make im2col a little faster (#15025) 2025-08-02 17:15:36 +03:00
Daniel Bevenius 4fdea540bd
kv-cache : skip alignment of n_stream in kv-cache log msg [no ci] (#15040)
This commit removes the right alignment of the `n_stream` value in the
log message in the `llama_kv_cache_unified` constructor.

The motivation for this change is to enhance the readability of the log
message. Currently the output looks like this:
```console
llama_kv_cache_unified: size = 2048.00 MiB (  4096 cells,  32 layers,  1/ 1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
Notice that the `n_stream` value is right aligned, which makes it a
little harder to read.

With the change in this commit the output will look like this:
```console
llama_kv_cache_unified: size = 2048.00 MiB (  4096 cells,  32 layers, 1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
2025-08-02 17:14:57 +03:00
Georgi Gerganov a4569c41fd
llama : enable LLAMA_SET_ROWS=1 by default (#14959)
ggml-ci
2025-08-02 17:14:21 +03:00
Georgi Gerganov 15e92fd337
cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (#15038)
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1

ggml-ci

* cont : fix cont types

ggml-ci

* cont : adopt variable names and comment from the other branch
2025-08-02 17:13:05 +03:00
Sigbjørn Skjæret 2bf3fbf0b5
ci : check that pre-tokenizer hashes are up-to-date (#15032)
* torch is not required for convert_hf_to_gguf_update

* add --check-missing parameter

* check that pre-tokenizer hashes are up-to-date
2025-08-02 14:39:01 +02:00
Douglas Hanley 711d5e6fe6
convert : fix Qwen3-Embedding pre-tokenizer hash (#15030) 2025-08-02 12:51:02 +02:00
Jhen-Jie Hong f738989dcb
chat : fix multiple tool_calls on hermes-2-pro (#14962) 2025-08-02 18:04:48 +08:00
Jeff Bolz 4cb208c93c
vulkan: coopmat2 mul_mat optimizations (#14934)
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
  interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
2025-08-02 11:21:37 +02:00
R0CKSTAR 3025b621d1
llama-bench: rename DB table name from test to llama_bench (#15003)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-02 17:20:40 +08:00
Jeff Bolz ec0b18802c
vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015) 2025-08-02 10:48:30 +02:00
Douglas Hanley 339bd0268c
model : support Qwen3-Embedding (#15023) 2025-08-02 10:44:50 +02:00
Johannes Gäßler f906275537
server: enable token array inputs for OAI API (#15001) 2025-08-02 10:12:41 +02:00
Jeff Bolz a9f7541ec2
vulkan: optimizations for direct convolution (#14933)
* vulkan: optimizations for direct convolution

- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
  the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.

* Three tile sizes for CONV_2D, and a heuristic to choose

* reallow collectives for pre-Turing

* make SHMEM_PAD a spec constant

* fixes for intel perf - no shmem padding, placeholder shader core count

* shader variants with/without unrolling

* 0cc4m's fixes for AMD perf

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-02 09:57:04 +02:00
Johannes Gäßler 9c35706b98
CUDA: fix MMQ nwarps for AMD with warp_size==32 (#15014) 2025-08-01 20:47:32 +02:00
l-austenfeld c76b420e4c
vendor : update vendored copy of google/minja (#15011)
* vendor : update vendored copy of google/minja

Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>

* Re-remove trailing whitespace

Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>

* Remove another trailing whitespace

Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>

---------

Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>
2025-08-01 16:59:06 +02:00
stevenkuang 0f5ccd6fd1
model : add hunyuan dense (#14878)
* support hunyuan_v1_dense

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* update hunyuan_moe to hunyuan_v1_moe

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* fix rope alpha assert and bos token

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* add blank line

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* Revert "update hunyuan_moe to hunyuan_v1_moe"

This reverts commit aa973ca219.

* use hunyuan_dense instead of hunyuan_v1_dense

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* fix hunyuan_moe chat template

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* remove leftover code

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* update hunyuan dense chat template

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* fix hunyuan dense vocab and chat template

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

---------

Signed-off-by: stevenkuang <stevenkuang@tencent.com>
2025-08-01 15:31:12 +02:00
lhez 1c872f71fb
opencl: add f16 for `add`, `sub`, `mul`, `div` (#14984) 2025-08-01 13:15:44 +02:00
Srihari-mcw baad94885d
ggml : Q2k interleaving implementation - x86/x64 SIMD (#14373)
* Initial Q2_K Block Interleaving Implementation

* Addressed review comments and clean up of the code

* Post rebase fixes

* Initial CI/CD fixes

* Update declarations in arch-fallback.h

* Changes for GEMV Q2_K in arch-fallback.h

* Enable repacking only on AVX-512 machines

* Update comments in repack.cpp

* Address q2k comments

---------

Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
2025-08-01 09:20:33 +03:00
Georgi Gerganov ba42794c9e
graph : fix equal_seq() check (#14986)
ggml-ci
2025-08-01 06:38:12 +03:00
diannao 2860d479b4
docker : add cann build pipeline (#14591)
* docker: add cann build pipeline

* docker: add cann build pipeline

* docker: fix cann devops

* cann : fix multi card hccl

* Update ggml/src/ggml-cann/ggml-cann.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update ggml-cann.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-08-01 10:02:34 +08:00
R0CKSTAR 484b2091ce
compare-commits.sh: support both llama-bench and test-backend-ops (#14392)
* compare-commits.sh: support both llama-bench and test-backend-ops

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Speed up the build by specifying -j 12

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Remove build_number from test-backend-ops db

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Apply suggestion from @JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Refine tool selection logic

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-01 08:47:27 +08:00
Ed Addario 2097f038b0
Refactor variable names 2025-07-31 20:46:40 +01:00
Ed Addario daf2dd7880
quantize : skip tensor override when in fallback mode (#14995) 2025-07-31 21:32:18 +02:00
Diego Devesa a06ed5feae
llama : add simple option to enable CPU for MoE weights (--cpu-moe) (#14992) 2025-07-31 20:15:41 +02:00
Aman Gupta 784524053d
Fix params bug in diffusion example (#14993) 2025-08-01 01:22:58 +08:00
Diego Devesa d6818d06a6
llama : allow other bufts when overriding to CPU, add --no-repack option (#14990) 2025-07-31 18:11:34 +02:00
Ruben Ortlam e08a98826b
Vulkan: Fix minor debug mode issues (#14899)
* vulkan: fix debug mode issues

* vulkan: remove broken check_results GGML_OP_SET_ROWS support
2025-07-31 17:46:54 +02:00
tc-mb 952a47f455
mtmd : support MiniCPM-V 4.0 (#14983)
* support minicpm-v 4

* add md

* support MiniCPM-o 4.0

* add default location

* temp rm MiniCPM-o 4.0

* fix code

* fix "minicpmv_projector" default path
2025-07-31 17:22:17 +02:00
Csaba Kecskemeti 36e5fe7bcd
MODEL_TENSOR.SSM_DT_NORM was defined twice (#14991)
* MODEL_TENSOR.SSM_DT_NORM was defined twice, and the second definition overwrote the jamba model's layer name

* correct order
2025-07-31 10:59:49 -04:00
g2mt 94933c8c2e
server : implement universal assisted decoding (#12635)
* llama-server : implement universal assisted decoding

* Erase prompt tail for kv-cache

* set vocab_dft_compatible in common_speculative

* rename ctx_main to ctx_tgt

* move vocab_dft_compatible to spec struct

* clear mem_dft, remove mem

* detokenize id_last for incompatible models

* update comment

* add --spec-replace flag

* accept special tokens when translating between draft/main models

* Escape spec-replace

* clamp draft result size to params.n_draft

* fix comment

* clean up code

* restore old example

* log common_speculative_are_compatible in speculative example

* fix

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-31 14:25:23 +02:00
Dongliang Wei c1dacaa99b
llama : merge build_moe_ffn_from_probs function into build_moe_ffn (#14968) 2025-07-31 14:12:20 +02:00