Commit Graph

6326 Commits

Author SHA1 Message Date
ExtReMLapin 2f28a1c601
Merge branch 'ggml-org:master' into fix_qwen_reasoning_tool_calling_required 2025-08-28 20:37:20 +02:00
mnehete32 c97dc09391
CUDA: add conv2d (#15635)
* CUDA: add conv2d

* CUDA: conv2d - correct formatting and added const
2025-08-28 20:33:03 +02:00
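As a reference for what a direct conv2d kernel computes, here is a minimal host-side C++ sketch (hypothetical, not the CUDA kernel from the PR), assuming an NCHW-style layout, stride 1, and no padding:

```cpp
#include <vector>
#include <cstddef>

// Naive direct 2D convolution (cross-correlation, as in most ML frameworks).
// Shapes: input  [C_in][H][W], kernel [C_out][C_in][KH][KW],
//         output [C_out][H - KH + 1][W - KW + 1]. Stride 1, no padding.
std::vector<float> conv2d_ref(const std::vector<float> & x, const std::vector<float> & k,
                              int cin, int h, int w, int cout, int kh, int kw) {
    const int oh = h - kh + 1, ow = w - kw + 1;
    std::vector<float> y((size_t) cout * oh * ow, 0.0f);
    for (int oc = 0; oc < cout; ++oc)
        for (int oy = 0; oy < oh; ++oy)
            for (int ox = 0; ox < ow; ++ox) {
                float acc = 0.0f;
                for (int ic = 0; ic < cin; ++ic)
                    for (int ky = 0; ky < kh; ++ky)
                        for (int kx = 0; kx < kw; ++kx)
                            acc += x[((size_t) ic*h + oy + ky)*w + ox + kx] *
                                   k[(((size_t) oc*cin + ic)*kh + ky)*kw + kx];
                y[((size_t) oc*oh + oy)*ow + ox] = acc;
            }
    return y;
}
```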
Aaron Teo 6c442f42ff
ggml-cpu: fix invalid hsum build in debug s390x (#15634)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-08-28 22:39:27 +08:00
compilade 73804145ab
ggml : fix SSM_SCAN for n_groups > 1 (#15625) 2025-08-28 10:11:36 -04:00
Georgi Gerganov c8d0d14e77
kv-cache : fix find_slot to not search for continuous slot (#15638)
ggml-ci
2025-08-28 17:09:05 +03:00
Sigbjørn Skjæret 84ab83cc0b
model : jina-embeddings-v3 support (#13693)
* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* fix vocab parsing with only tokenizer.json

* set mask token lstrip attribute

* additional unk_token_id fallback just in case [no ci]

* revert vocab_size() change [no ci]

* merge tensor loading into general bert

* rope

* add lora embedding and loading (non-functional)

* export separate lora ggufs instead

* add adapter metadata api

* use std::string

* convert_hf_to_lora compatibility

* fix assert

* apply suggestions from review

* apply suggestion from review
2025-08-28 15:49:50 +02:00
Aman Gupta 55042b3692
scripts: add sqlite3 check for compare-commits.sh (#15633) 2025-08-28 19:23:22 +08:00
Georgi Gerganov 8a4280ce43
kv-cache : remove LLAMA_SET_ROWS checks (#15505)
ggml-ci
2025-08-28 12:27:02 +03:00
Aleksei Nikiforov 64387f6e95
gguf-py: byteswapping improvements (#12851)
* gguf-py: implement byteswapping for Q4_0

This is needed to byteswap the Mistral model.

Also restore the original shapes after byteswapping tensors.
This is not needed at the moment, but do it in case
they're used in the future.

* Rework byteswapping code in gguf-py

Move out details from byteswapping tensor blocks code
2025-08-28 16:56:41 +08:00
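For context on why byteswapping must be quant-type aware: in a Q4_0 block only the fp16 scale is multi-byte; the packed 4-bit quants are single bytes and endianness-neutral. A minimal C++ sketch of the idea (the actual change is in gguf-py, which is Python):

```cpp
#include <cstdint>
#include <cstddef>

// Layout of a Q4_0 block in GGUF: one fp16 scale followed by 16 bytes
// holding 32 packed 4-bit quants (QK4_0 = 32).
struct block_q4_0 {
    uint16_t d;       // fp16 scale, stored here as raw bits
    uint8_t  qs[16];  // packed 4-bit quants: endianness-neutral
};

// Byteswap every block's scale in place; the quant bytes are untouched.
void byteswap_q4_0(block_q4_0 * blocks, size_t n_blocks) {
    for (size_t i = 0; i < n_blocks; ++i) {
        const uint16_t d = blocks[i].d;
        blocks[i].d = (uint16_t) ((d >> 8) | (d << 8));
    }
}
```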
Joshua Cogliati d35a1e8c41
cli : change log to warning to explain reason for stopping (#15604)
* Change to warn instead of debug, to explain the reason for stopping.

* Update tools/main/main.cpp

Fix printing --2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-28 10:48:20 +03:00
Daniel Bevenius 46d9caa27a
model-conversion : add mmproj conversion target (#15628)
This commit adds a new target to the Makefile for converting models that
are multimodal. This target converts the original model and also creates
the mmproj GGUF model.

The motivation for this change is that for models that are multimodal,
for example those that contain a vision encoder, we will often want to
upload both the quantized model and the vision encoder model to
HuggingFace.

Example usage:
```console
$ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/
...
The environment variable CONVERTED_MODEL can be set to this path using:
export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf
The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf
```
The converted original model can then be quantized, and after that both
the quantized model and the mmproj file can be uploaded to
HuggingFace.

Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main
2025-08-28 09:26:48 +02:00
matiaslin 5a0e3ef6f0
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (#15622)
Prior to this change, we faced undefined cublasLt references when
attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux.

We add linking with CUDA::cublasLt_static when CUDA version is greater
than 10.1.
2025-08-28 02:32:36 +02:00
Johannes Gäßler fbef0fad7a
server: higher timeout for tests (#15621) 2025-08-27 20:58:09 +02:00
Georgi Gerganov da54f9f1a2
presets : add qwen3-30B-a3b FIM (#15616) 2025-08-27 15:48:07 +03:00
uvos 47373271f9
HIP: Enable support for ggml_backend_cuda_register_host_buffer (#15615) 2025-08-27 13:58:54 +02:00
Georgi Gerganov 1bded5a3b3
kv-cache : better estimate of n_kv for multi-sequence batches (#15610)
ggml-ci
2025-08-27 13:55:12 +03:00
Chenguang Li 1e7489745a
CANN: refactor mask handling and improve performance in FA (#15561)
* CANN(flash-attn): refactor mask handling and improve performance

1. Refactored the mask computation in Flash Attention, unifying the logic so prefill and decode are no longer handled separately.
2. Optimized performance in non-alibi scenarios by removing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when the dim is not divisible by 16.

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: fix review

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: Optimize FA BNSD to BSND

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-27 17:21:41 +08:00
xctan 1cf123a343
ggml-cpu : add basic RVV support for vector f32 ops (#15057)
* ggml-cpu : add basic RVV support for vector f32 ops

* ggml-cpu : add RVV support for f32 softmax
2025-08-27 16:44:22 +08:00
Daniel Bevenius fcca2182a1
common : add -m to bash completion for --model [no ci] (#15591)
This commit updates the bash completion script to include the -m
short option for the --model argument.

The motivation for this is that currently tab completion only works for the
full --model option, and it is nice to have it work for the short option
as well.
2025-08-27 10:28:53 +02:00
rmatif 86076f92de
OpenCL: add fused group_norm/norm, mul, add (#15314)
* add fused group_norm/norm, mul, add

* fix spacing

* revert rms_norm logic

* fix trailing whitespace
2025-08-26 23:36:05 -07:00
Diego Devesa bcbddcd54f
tests : fix test-opt with GGML_BACKEND_DL (#15599) 2025-08-26 22:14:38 +02:00
Akarshan Biswas 8b69686136
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (#15592)
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.

This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
2025-08-27 00:27:49 +05:30
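The shape of the fix, as a self-contained C++ sketch with illustrative names (the real check lives in ggml_backend_sycl_device_supports_op):

```cpp
#include <cstdint>

// Sub-group size assumed here; matches the WARP_SIZE used by the kernel.
constexpr int64_t WARP_SIZE = 32;

// Minimal stand-in for the tensor metadata the check inspects.
struct tensor_info { int64_t ne0; }; // first dimension of the tensor

// Previously the backend unconditionally claimed support; now it only
// claims support when the row size is a multiple of the sub-group size,
// so GGML_ASSERT(ncols % WARP_SIZE == 0) can never fire at run time.
bool supports_rms_norm(const tensor_info & op) {
    return op.ne0 % WARP_SIZE == 0;
}
```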
fidoriel 8ce3ff1d91
mtmd : fix mtmd ios build (#15579) 2025-08-26 20:05:50 +02:00
Eve 44b1efa41a
tests: add performance test for mul mat id (#15543) 2025-08-26 15:42:49 +00:00
shalinib-ibm a6a58d6478
llamafile: PowerPC Sgemm Optimization (#15558)
This patch improves GEMM for FP32 Data Type on PowerPC

* Implements GEMM on large blocks with configurable block sizes mc, nc, kc (default: 256, 256, 256).
* Packing function optimized to access blocks as per the memory layout.
* GEMM optimized to work on larger blocks.
* Isolated packing from GEMM operations for better MMA utilization.

Verified functionality and correctness using llama-cli and a standalone
test case (performs matmul and compares the final matrix C result with the base).

Minor code refactoring changes:
* Replaced a macro with an inline function.
* Made code indentation consistent with 4 spaces.

Performance Testing:

Observed 50%~70% improvement in prompt processing speed, measured using
llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with
the Mistral-7b-Instruct-v0.3 model.

| model            | size      | params | backend | threads | test   | patch (t/s) | base (t/s) |
| ---------------- | --------- | ------ | ------- | ------- | ------ | ----------- | ---------- |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp512  | 98.58       | 60.3       |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp1024 | 95.88       | 57.36      |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp2048 | 85.46       | 53.26      |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp4096 | 68.66       | 45.78      |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp6144 | 57.35       | 40.44      |

25~30% improvement in llama-batched-bench with Meta-Llama3-8B in
prompt processing speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch
sizes (1, 2, 4, 8, 16).

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-08-26 23:35:25 +08:00
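The blocking scheme described above, reduced to a minimal C++ sketch with hypothetical names (the actual patch also packs panels to match the memory layout for MMA utilization and is considerably more involved):

```cpp
#include <algorithm>
#include <cstddef>

// Cache-blocked SGEMM: C += A * B, with A [M x K], B [K x N], C [M x N],
// all row-major. The mc/nc/kc block sizes bound the working set so the
// blocks stay cache-resident; the patch defaults all three to 256.
void sgemm_blocked(const float * A, const float * B, float * C,
                   int M, int N, int K,
                   int mc = 256, int nc = 256, int kc = 256) {
    for (int jc = 0; jc < N; jc += nc)
        for (int pc = 0; pc < K; pc += kc)
            for (int ic = 0; ic < M; ic += mc) {
                const int nb = std::min(nc, N - jc);
                const int kb = std::min(kc, K - pc);
                const int mb = std::min(mc, M - ic);
                // Inner kernel on one mb x kb x nb block; the real code
                // operates on panels packed as per the memory layout.
                for (int i = 0; i < mb; ++i)
                    for (int p = 0; p < kb; ++p) {
                        const float a = A[(size_t) (ic + i)*K + pc + p];
                        for (int j = 0; j < nb; ++j)
                            C[(size_t) (ic + i)*N + jc + j] +=
                                a * B[(size_t) (pc + p)*N + jc + j];
                    }
            }
}
```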
Georgi Gerganov 0373486dbc
graph : fix assert in memory-less build_attn (#15590)
ggml-ci
2025-08-26 17:45:17 +03:00
Daniel Bevenius 62cef26ac5
model-conversion : add qat-q4 quantization targets (#15588)
This commit adds two targets to the Makefile for quantizing
Quantization Aware Trained (QAT) models to Q4_0 format.

The motivation for this is that it sets the token embedding and the
output tensor data types to Q8_0 instead of the default Q6_K. This is
something we wish to enforce for QAT Q4_0 models that are to be
uploaded to ggml-org on HuggingFace to guarantee the best quality.
2025-08-26 16:12:29 +02:00
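What the targets enforce can also be expressed through llama.cpp's C API; a sketch assuming the output_tensor_type and token_embedding_type fields of llama_model_quantize_params:

```cpp
#include "llama.h"

// Sketch: quantize a QAT model to Q4_0 while pinning the token embedding
// and output tensors to Q8_0 (instead of the default Q6_K), mirroring
// what the new Makefile targets are described to enforce.
int main(int argc, char ** argv) {
    if (argc < 3) return 1; // argv[1]: input GGUF, argv[2]: output GGUF

    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype                = LLAMA_FTYPE_MOSTLY_Q4_0;
    params.token_embedding_type = GGML_TYPE_Q8_0;
    params.output_tensor_type   = GGML_TYPE_Q8_0;

    // Returns 0 on success.
    return llama_model_quantize(argv[1], argv[2], &params) == 0 ? 0 : 1;
}
```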
Johannes Gäßler 8f5afa94c4
CUDA: return -1 for nonexistent compiled arch (#15587) 2025-08-26 16:01:20 +02:00
CNE FICHEPOIL Pierre e62cd709e7 removed `?` from grammar as it doesn't crash on Linux, probably worth its own issue 2025-08-26 15:01:36 +02:00
CNE FICHEPOIL Pierre 0e558302a2 fix thinking-content eating closing think tag | ref #8953 2025-08-26 14:48:26 +02:00
Georgi Gerganov b3964c1e89
metal : optimize FA vec for large sequences and BS <= 8 (#15566)
* metal : optimize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci
2025-08-26 14:22:14 +03:00
Xuan-Son Nguyen 79a546220c
mtmd : support Kimi VL model (#15458)
* convert : fix tensor naming conflict for llama 4 vision

* convert ok

* support kimi vision model

* clean up

* fix style

* fix calc number of output tokens

* refactor resize_position_embeddings

* add test case

* rename build fn

* correct a small bug
2025-08-26 12:54:19 +02:00
Pierre F 352274e116 if enable_thinking is enabled but the hermes model doesn't support it, just disable it instead of GGML_ABORT 2025-08-26 12:39:50 +02:00
Pierre F 6d5f561896 reverted changes done to grammar_lazy for hermes 2 2025-08-26 12:30:57 +02:00
Georgi Gerganov 85cc1ae998
context : print graph stats for memory-less contexts (#15586)
ggml-ci
2025-08-26 12:47:00 +03:00
Georgi Gerganov 1d8d83deaa
metal : improve `MUL_MAT_ID` (#15541)
* metal : mul_mm_id remove hdst

* metal : remove mul_mm_id hsrc1

* metal : mul_mm_id simplify + add test

* metal : opt mul_mm_id map0

* metal : optimize mul_mm_id id gathering

* metal : mul/div opt

* metal : optimize mul_mm_id_map0

ggml-ci
2025-08-26 12:46:15 +03:00
tc-mb c4e9239064
model : support MiniCPM-V 4.5 (#15575) 2025-08-26 10:05:55 +02:00
Sigbjørn Skjæret 39842a7f73
gguf-py : remove erroneous FFN_GATE entry (#15583) 2025-08-26 09:08:08 +02:00
Sigbjørn Skjæret 0fd90db585
metal : remove contiguous assertion for src0 in IM2COL (#15577)
* remove contiguous assertion for src0 in IM2COL

* add contiguous check in supports_op
2025-08-26 09:51:43 +03:00
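One plausible reading of the two bullets is that the contiguity check moves from a kernel assertion into supports_op, so non-contiguous inputs are reported as unsupported instead of aborting. A hedged C++ sketch with illustrative names (the real code uses ggml_is_contiguous on ggml_tensor):

```cpp
#include <cstdint>

// Minimal stand-in for a ggml-style tensor: ne = shape, nb = byte strides.
struct tensor {
    int64_t ne[4];
    int64_t nb[4];
};

// True when the elements are laid out densely in memory, which is the
// condition ggml_is_contiguous checks for an f32 tensor.
bool is_contiguous_f32(const tensor & t) {
    if (t.nb[0] != sizeof(float)) return false;
    for (int i = 1; i < 4; ++i)
        if (t.nb[i] != t.nb[i - 1] * t.ne[i - 1]) return false;
    return true;
}

// supports_op sketch: reject a non-contiguous input for IM2COL up front,
// letting the scheduler fall back, rather than asserting inside the kernel.
bool supports_im2col(const tensor & src) {
    return is_contiguous_f32(src);
}
```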
Yoshi_likes_e4 4c37636b3e
Add a warning for special devices (#15563)
* Add warning

* Print the devices names

* Add newlines

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Fix vector names

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-26 08:15:33 +02:00
Jeff Bolz 34bdbbd7c2
vulkan: Remove splitting for mul_mat_id (#15568)
row_ids only needs to hold the BN rows for the current tile.
2025-08-26 06:42:44 +02:00
Qeeweew 74f52f77f2
CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (#15451)
* CUDA: optimize get_int_from_table_16

* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs

* revise documentation

---------

Co-authored-by: xix <xiapc@outlook.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-25 23:21:22 +02:00
lhez f7207b0415
opencl: fix support ops condition for `rms_norm` (#15560) 2025-08-25 14:18:09 -07:00
Pierre F bb5e352ad9 also apply the hotcrashfix here, just in case 2025-08-25 20:26:38 +02:00
Pierre F 86493dd92b fixed a really weird grammar crash: `Unexpected empty grammar stack after accepting piece:` 2025-08-25 20:13:45 +02:00
Pierre F dbae9214b9 qwen hermes tool calling : fixed grammar rules names 2025-08-25 19:45:19 +02:00
Pierre F 79e4a7bdbc hermes 2 pro tool calling, better support for thinking (thinking tag already opened grammar) 2025-08-25 18:53:46 +02:00
ExtReMLapin de07a432ed
Merge branch 'ggml-org:master' into fix_qwen_reasoning_tool_calling_required 2025-08-25 18:19:34 +02:00
Ruben Ortlam 4d917cd4f6
vulkan: fix min subgroup 16 condition for mmid subgroup optimization (#15565) 2025-08-25 17:56:59 +02:00
Jeff Bolz 886b97a5d6
tests: Generate unique input values for count_equal (#15487)
This avoids backend-dependent behavior for argmax that leads to intermittent failures.
2025-08-25 10:47:16 -05:00
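The idea, that making every input value distinct removes argmax ties and thus backend-dependent tie-breaking, can be sketched as follows (a hypothetical helper, not the actual test code):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Fill a test input with guaranteed-unique values: a shuffled permutation
// of 0..n-1. With no duplicate values, argmax has a single well-defined
// winner, so count_equal cannot diverge between backends on tie-breaking.
std::vector<float> unique_test_input(size_t n, uint32_t seed) {
    std::vector<float> v(n);
    std::iota(v.begin(), v.end(), 0.0f); // 0, 1, 2, ... all distinct
    std::mt19937 rng(seed);
    std::shuffle(v.begin(), v.end(), rng);
    return v;
}
```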