* vulkan: fix GPU deduplication logic.
As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.
Let's just avoid filtering for MoltenVK which is apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.
Verified that MacOS + 4xVega still reports 4 GPUs with this version.
* vulkan: only skip dedup when both drivers are moltenVk
* codeowners : add danbev for examples/debug
* Add @pwilkin to CODEOWNERS for debug
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
This commit adds a new python script that can be used to print tensors
information from a tensor in a safetensors model.
The motivation for this is that during model conversion work it can
sometimes be useful to verify the shape of tensors in the original
model. While it is possible to print the tensors when loading the model
this can be slow when working with larger models.
With this script it is possible to quickly query tensor shapes.
Example usage:
```console
(venv) $ ./scripts/utils/tensor-info.py --help
usage: tensor-info.py [-h] [-m MODEL_PATH] [-l] [tensor_name]
Print tensor information from a safetensors model
positional arguments:
tensor_name Name of the tensor to inspect
options:
-h, --help show this help message and exit
-m MODEL_PATH, --model-path MODEL_PATH
Path to the model directory (default: MODEL_PATH environment variable)
-l, --list List unique tensor patterns in the model (layer numbers replaced with #)
```
Listing tensor names:
```console
(venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m -l
embed_tokens.weight
layers.#.input_layernorm.weight
layers.#.mlp.down_proj.weight
layers.#.mlp.gate_proj.weight
layers.#.mlp.up_proj.weight
layers.#.post_attention_layernorm.weight
layers.#.post_feedforward_layernorm.weight
layers.#.pre_feedforward_layernorm.weight
layers.#.self_attn.k_norm.weight
layers.#.self_attn.k_proj.weight
layers.#.self_attn.o_proj.weight
layers.#.self_attn.q_norm.weight
layers.#.self_attn.q_proj.weight
layers.#.self_attn.v_proj.weight
norm.weight
```
Printing a specific tensor's information:
```console
(venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m layers.0.input_layernorm.weight
Tensor: layers.0.input_layernorm.weight
File: model.safetensors
Shape: [768]
```
* completion : simplify batch (embd) processing
This commit simplifies the processing of embd by removing the for loop
that currently exists which uses params.n_batch as its increment. This
commit also removes the clamping of n_eval as the size of embd is always
at most the size of params.n_batch.
The motivation is to clarify the code as it is currently a little
confusing when looking at this for loop in isolation and thinking that
it can process multiple batches.
* add an assert to verify n_eval is not greater than n_batch
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function
* ggml-virtgpu: deprecate buffer_type is_host remoting
not necessary
* ggml-virtgpu: stop using static vars as cache
The static init isn't thread safe.
* ggml-virtgpu: protect the use of the shared memory to transfer data
* ggml-virtgpu: make the remote calls thread-safe
* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory
* ggml-virtgpu: add a cleanup function for consistency
* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing
* fix style and ordering
* Remove the static variable in apir_device_get_count
* ggml-virtgpu: improve the logging
* fix review minor formatting changes
* CUDA: use mmvq for mul-mat-id for small batch sizes
* add mmvq too
* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs
* templatize multi_token_path
Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
* jinja : add missing 'in' test to template engine (#19004)
The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".
This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.
Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.
Includes test cases for all three containment types plus
reject/select filter usage.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* reuse test_is_in in binary op
---------
Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Experimenting with AI, my environment gets messy fast and it's not
always easy to know what model my software is trying to load. This helps
with troubleshooting.
before:
Error: {
code = 400,
message = "model not found",
type = "invalid_request_error"
}
After:
Error: {
code = 400,
message = "model 'toto' not found",
type = "invalid_request_error"
}
This commit adds a debug option to the model conversion script to enable
using the Python debugger (pdb) during model conversion.
The motivation for this is that I've found myself adding this a few
times now and it would be quicker to have this flag as an option and a
makefile target/recipe for it.
* Remove mutex for pipeline caches, since they are now per-thread.
* Add comment
* Run clang-format
* Cleanup
* Run CI again
* Run CI once more
* Run clang-format
The allowUnfreePredicate in pkgsCuda was wrapping p.meta.license in a
list unconditionally. This fails when meta.license is already a list
of licenses, as it creates a nested list and then tries to access
.free and .shortName on the inner list.
Use lib.toList instead, which correctly handles both cases:
- Single license attrset -> wraps in list
- List of licenses -> returns unchanged