Commit Graph

6744 Commits

Author SHA1 Message Date
Jeff Bolz b97c9edc59
vulkan: Allow fallback to sysmem memory when vidmem is full (#15649)
* vulkan: Allow fallback to sysmem memory when vidmem is full

* vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK
2025-08-31 08:30:54 +02:00
Jeff Bolz 94e82c7ead
vulkan: clamp matmul and FA results to the max finite value (#15652)
* vulkan: clamp matmul and FA results to the max finite value

* only clamp for fp16
2025-08-31 08:27:57 +02:00
Charles Xu 4d74393bcc
ggml: update kleidiai to v1.13.0 (#15663) 2025-08-31 00:03:42 +08:00
Diego Devesa dd892555b0
Update build.md to remove MSVC arm64 notes (#15684)
Removed information about MSVC compiler limitations for arm64 builds.
2025-08-30 23:51:28 +08:00
Johannes Gäßler e81b8e4b7f
llama: use FA + max. GPU layers by default (#15434)
* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault
2025-08-30 16:32:10 +02:00
Johannes Gäßler 38ad381f9f
CUDA: use FP32 arithmetic for conv2d (#15683) 2025-08-30 16:20:32 +02:00
Jeff Bolz 696fccf354
vulkan: Skip syncing for prealloc_y when it is reused (#15544) 2025-08-30 11:11:22 +02:00
Chenguang Li ef476916bb
CANN: FIx compiler warnings (#15661)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-30 10:18:35 +08:00
Sergey Alirzaev d82f6aa34a
server : removed obsolete doc (#15670)
completing a4090d1174
2025-08-30 00:12:53 +02:00
Johannes Gäßler 3d16b29c3b
scripts: strip "AMD Instinct" from GPU name (#15668) 2025-08-29 22:04:08 +02:00
ExtReMLapin 792b44f2ed
server : add documentation for `parallel_tool_calls` param (#15647)
Co-authored-by: Pierre F <no@p.e>
2025-08-29 20:25:40 +03:00
ryan-mangeno c73eb685fd added cls token per previous modern bert attempt, still working on checking out the rest 2025-08-29 12:15:31 -04:00
Aman Gupta 81017865ee
CUDA: fix bug in rms_norm fusion (#15660)
* CUDA: fix bug in rms_norm fusion

* Fix bug for OP_REPEAT

* Fix index for add
2025-08-29 21:30:06 +08:00
Piotr Wilkin (ilintar) 60e5eee31f
chat : Seed OSS thinking + tool call support (#15552)
* Reasoning and tool-calling support for Seed OSS

* Fix grammar and partial parsing

* Whitespace

* New chat template

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove unused 'purge_healing_marker' helper

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-29 14:53:41 +02:00
Aman Gupta 009b709d6e
CUDA: fuse adds, fuse add with rms norm (#15631)
* CUDA: fused add with rms_norm_mul

* Non-broadcast fuse works

* Add fused adds

* format

* Remove n_fuse from template params

* Address review comments

* Move template inside binbcast
2025-08-29 11:35:58 +08:00
Gabe Goodhart e8d99dd0b6
nvidia nemotron nano v2 (nemotronh) (#15507)
* feat: Add NEMOTRONH to python arch enum

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add NEMOTRONH to c++ arch enum

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add NEMOTRONH to llama-arch layer map

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First pass at conversion for nemotronh

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add a verbose log for each tensor loaded

This is really helpful for diagnosing mismatches between the expected and
received tensors

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First (broken) pass at nemotronh model architecture

It generates tokens, just not valid ones!

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Explicitly enable add_bos_token during conversion

The `tokenizer.json`/`tokenizer_config.json` in the model are a bit
contradictory. In the config, add_bos_token is set to False, but the
tokenizer model itself has a post_processor that adds the BOS token via
type: TemplateProcessing

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Only allocate attention cache for attention layers (not non-recurrent)

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Move residual add to after every block

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the correct norm tensor for the MLP blocks

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Nemotron-H: MLP gate cleanup (pass NULL for unused gate)

This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise.

* SSM: respect ssm_dt_rank for dt_dim when provided

Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fallback to max(64, n_embd/16).

* fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)

* Rename nemotronh to nemotron_h for consistency

- Update architecture name from NEMOTRONH to NEMOTRON_H in constants.py
- Change architecture string from 'nemotronh' to 'nemotron_h' in all files
- Update enum LLM_ARCH_NEMOTRONH to LLM_ARCH_NEMOTRON_H
- Update class name llm_build_nemotronh to llm_build_nemotron_h
- Consistent naming with underscore convention (nemotron_h vs nemotronh)

* feat: Support conversion for older NemotronH models

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Maicon Domingues <dominguesm@outlook.com>
Co-authored-by: weatherman <fxdstudios@gmail.com>
2025-08-28 18:39:31 -06:00
Gabe Goodhart a8bca68f72
fix: Compute the full sum in llama-eval-callback, not just the sum of printed values (#15637)
This makes it much easier to compare between llama.cpp and transformers!

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-08-28 15:27:36 -05:00
mnehete32 c97dc09391
CUDA: add conv2d (#15635)
* CUDA: add conv2d

* CUDA: conv2d - correct formatting and added const
2025-08-28 20:33:03 +02:00
ryan-mangeno 2a1c75047c ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more 2025-08-28 12:59:42 -04:00
ryan-mangeno 853f344cfe more cleanup 2025-08-28 12:47:10 -04:00
ryan-mangeno 40249dd5ec cleanup 2025-08-28 12:37:02 -04:00
ryan-mangeno 9805635c12 cleanup 2025-08-28 12:36:26 -04:00
ryan-mangeno 8f328431a1 cleanup 2025-08-28 12:33:52 -04:00
ryan-mangeno bffe3c9092 tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this 2025-08-28 11:15:10 -04:00
Aaron Teo 6c442f42ff
ggml-cpu: fix invalid hsum build in debug s390x (#15634)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-08-28 22:39:27 +08:00
compilade 73804145ab
ggml : fix SSM_SCAN for n_groups > 1 (#15625) 2025-08-28 10:11:36 -04:00
Georgi Gerganov c8d0d14e77
kv-cache : fix find_slot to not search for continuous slot (#15638)
ggml-ci
2025-08-28 17:09:05 +03:00
Sigbjørn Skjæret 84ab83cc0b
model : jina-embeddings-v3 support (#13693)
* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* fix vocab parsing with only tokenizer.json

* set mask token lstrip attribute

* additional unk_token_id fallback just in case [no ci]

* revert vocab_size() change [no ci]

* merge tensor loading into general bert

* rope

* add lora embedding and loading (non-functional)

* export separate lora ggufs instead

* add adapter metadata api

* use std::string

* convert_hf_to_lora compatibility

* fix assert

* apply suggestions from review

* apply suggestion from review
2025-08-28 15:49:50 +02:00
Aman Gupta 55042b3692
scripts: add sqlite3 check for compare-commits.sh (#15633) 2025-08-28 19:23:22 +08:00
Georgi Gerganov 8a4280ce43
kv-cache : remove LLAMA_SET_ROWS checks (#15505)
ggml-ci
2025-08-28 12:27:02 +03:00
Aleksei Nikiforov 64387f6e95
gguf-py: byteswapping improvements (#12851)
* gguf-py: implement byteswapping for Q4_0

This is needed to byteswap Mistral model.

Also restore original shapes after byteswapping tensors.
It is not needed at the moment, but do it in case
they'd be used in future.

* Rework byteswapping code in gguf-py

Move out details from byteswapping tensor blocks code
2025-08-28 16:56:41 +08:00
Joshua Cogliati d35a1e8c41
cli : change log to warning to explain reason for stopping (#15604)
* Change to warn instead of debug, to explain reason for stopping.

* Update tools/main/main.cpp

Fix printing --2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-28 10:48:20 +03:00
Daniel Bevenius 46d9caa27a
model-conversion : add mmproj conversion target (#15628)
This commit adds a new target to the Makefile for converting models that
are multimodal. This target will convert the original model and in
addition also create the mmproj GGUF model.

The motivation for this change is that for models that are multimodal,
for example those that contain a vision encoders, we will often want to
upload both the quantized model and the vision encoder model to
HuggingFace.

Example usage:
```console
$ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/
...
The environment variable CONVERTED_MODEL can be set to this path using:
export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf
The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf
```
The converted original model can then be quantized, and after that both
the quantized model and the mmproj file can then be uploaded to
HuggingFace.

Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main
2025-08-28 09:26:48 +02:00
matiaslin 5a0e3ef6f0
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (#15622)
Prior to this change, we faced undefined cublasLt references when
attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux.

We add linking with CUDA::cublasLt_static when CUDA version is greater
than 10.1.
2025-08-28 02:32:36 +02:00
ryan-mangeno 18c0c23ed8 fixed tensor mappings and working on buildin graph 2025-08-27 15:32:20 -04:00
Johannes Gäßler fbef0fad7a
server: higher timeout for tests (#15621) 2025-08-27 20:58:09 +02:00
Georgi Gerganov da54f9f1a2
presets : add qwen3-30B-a3b FIM (#15616) 2025-08-27 15:48:07 +03:00
uvos 47373271f9
HIP: Enable support for ggml_backend_cuda_register_host_buffer (#15615) 2025-08-27 13:58:54 +02:00
Georgi Gerganov 1bded5a3b3
kv-cache : better estimate of n_kv for multi-sequence batches (#15610)
ggml-ci
2025-08-27 13:55:12 +03:00
Chenguang Li 1e7489745a
CANN: refactor mask handling and improve performance in FA (#15561)
* CANN(flash-attn): refactor mask handling and improve performance

1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: fix review

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: Optimization FA BNSD to BSND

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-27 17:21:41 +08:00
xctan 1cf123a343
ggml-cpu : add basic RVV support for vector f32 ops (#15057)
* ggml-cpu : add basic RVV support for vector f32 ops

* ggml-cpu : add RVV support for f32 softmax
2025-08-27 16:44:22 +08:00
Daniel Bevenius fcca2182a1
common : add -m to bash completion for --model [no ci] (#15591)
This commit updates the bash completion script to include the -m
short option for the --model argument.

The motivation for this is that currently tab completion only works the
full --model option, and it is nice to have it work for the short option
as well.
2025-08-27 10:28:53 +02:00
rmatif 86076f92de
OpenCL: add fused group_norm/norm, mul, add (#15314)
* add fused group_norm/norm, mul, add

* fix spacing

* revert rms_norm logic

* fix trailing whitespace
2025-08-26 23:36:05 -07:00
Diego Devesa bcbddcd54f
tests : fix test-opt with GGML_BACKEND_DL (#15599) 2025-08-26 22:14:38 +02:00
Akarshan Biswas 8b69686136
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (#15592)
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused an GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.

This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
2025-08-27 00:27:49 +05:30
fidoriel 8ce3ff1d91
mtmd : fix mtmd ios build (#15579) 2025-08-26 20:05:50 +02:00
ryan-mangeno 4ceb828112 correct tensor shape for qkv 2025-08-26 13:03:14 -04:00
ryan-mangeno cc3d7abab4 continuing 2025-08-26 12:38:38 -04:00
ryan-mangeno 41b6864333 cleanup 2025-08-26 12:33:11 -04:00
Eve 44b1efa41a
tests: add performance test for mul mat id (#15543) 2025-08-26 15:42:49 +00:00