Commit Graph

7971 Commits

Author SHA1 Message Date
Piotr Wilkin c730c343de Fix number partial parsing issue 2026-02-06 14:35:05 +01:00
Piotr Wilkin 8c1d1bae36 More edge cases 2026-02-05 21:40:56 +01:00
Piotr Wilkin 8e87ba402f Fix pesky issue on optional trailing arguments in function calls for TAGGED format 2026-02-05 21:40:56 +01:00
Piotr Wilkin 2726bd7090 Remove [[noreturn]] as it causes compilation problems on Mac. 2026-02-05 21:40:56 +01:00
Piotr Wilkin 55719ad155 We don't like segfaults (or failing tests). 2026-02-05 21:40:56 +01:00
Piotr Wilkin 725dc1bf2d Fix minor regressions, add [[noreturn]] attrib 2026-02-05 21:40:56 +01:00
Piotr Wilkin 1bcedc2bbb Fix incorrect coercion of strings to non-string types during parsing 2026-02-05 21:40:56 +01:00
Piotr Wilkin fa52b43c2a Feeding the hungry editor checker god. 2026-02-05 21:40:56 +01:00
Piotr Wilkin 0fba5187c0 Fix error in argument processing 2026-02-05 21:40:56 +01:00
Piotr Wilkin 88614e6730 Revert bad change, fix some templates and most tests 2026-02-05 21:40:56 +01:00
Piotr Wilkin c40c56e580 More robust reasoning detection 2026-02-05 21:40:56 +01:00
Piotr Wilkin 9a5d559e8e Fix reasoning detection 2026-02-05 21:40:56 +01:00
Piotr Wilkin 08c403efcd Quick vibe-coded fix for proper object printing 2026-02-05 21:40:56 +01:00
Piotr Wilkin ad74de7548 Missed this. 2026-02-05 21:40:56 +01:00
Piotr Wilkin 0d4179c8aa ANOTHER GIANT POST-FIXUP SQUISH 2026-02-05 21:40:56 +01:00
Piotr Wilkin 31274f9bd1 THE GIANT AUTOPARSER SQUISH 2026-02-05 21:40:56 +01:00
Piotr Wilkin 65ba390a26 Make call IDs nine-character 2026-02-05 21:40:56 +01:00
Piotr Wilkin 16f756b4c5 Fix sanitizer warnings 2026-02-05 21:40:56 +01:00
Piotr Wilkin 76647dee2e Fix bad typo 2026-02-05 21:40:56 +01:00
Piotr Wilkin 27f21d4d13 Add workaround for templates requiring non-null content 2026-02-05 21:40:56 +01:00
Georgi Gerganov 22cae83218
metal : adaptive CPU/GPU interleave based on number of nodes (#19369) 2026-02-05 19:07:22 +02:00
Jeff Bolz 449ec2ab07
vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (#19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-05 09:26:38 -06:00
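For readers unfamiliar with the technique, here is a minimal C++ sketch of the per-block mask classification described in the commit above. All names are illustrative assumptions; the actual implementation is a Vulkan preprocessing pass, not host code.
```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical 2-bit codes written out per mask block (illustrative names).
enum mask_code : uint8_t {
    MASK_MIXED       = 0, // block must be loaded and applied as usual
    MASK_ALL_ZERO    = 1, // mask is a no-op for this block, skip loading it
    MASK_ALL_NEG_INF = 2, // block is fully masked out, skip loading it
};

// Classify one block of the FA mask. Run once as a preprocessing pass when
// the mask is relatively large (prompt processing), so the main kernel can
// branch on the 2-bit code instead of reading the whole mask block.
static mask_code classify_block(const float * block, size_t n) {
    bool all_zero = true;
    bool all_ninf = true;
    for (size_t i = 0; i < n; ++i) {
        all_zero = all_zero && block[i] == 0.0f;
        all_ninf = all_ninf && std::isinf(block[i]) && block[i] < 0.0f;
    }
    if (all_zero) return MASK_ALL_ZERO;
    if (all_ninf) return MASK_ALL_NEG_INF;
    return MASK_MIXED;
}
```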
Georgi Gerganov 3795cc1e89
benches : update models + numbers (#19359)
* bench : update script

* benches : update numbers
2026-02-05 14:34:07 +02:00
Sigbjørn Skjæret b828e18c75
docker : fix vulkan build (#19352) 2026-02-05 11:10:39 +01:00
Adrien Gallouët a4ea7a188f
vendor : update BoringSSL to 0.20260204.0 (#19333)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-05 09:53:35 +01:00
Georgi Gerganov 7a4f97d196
metal : add diag (#19330) 2026-02-05 10:08:45 +02:00
Oleksandr Kuvshynov a498c75ad1
vulkan: fix GPU deduplication logic. (#19222)
* vulkan: fix GPU deduplication logic.

As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same UUID, same driver) logic is problematic for Windows + Intel iGPU.

Let's just avoid filtering for MoltenVK, which is Apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.

Verified that macOS + 4xVega still reports 4 GPUs with this version.

* vulkan: only skip dedup when both drivers are MoltenVK
2026-02-05 09:06:59 +01:00
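A minimal C++ sketch of the resulting dedup rule, assuming a simplified device record (the real code reads Vulkan physical-device properties):
```cpp
#include <array>
#include <cstdint>
#include <string>
#include <vector>

// Simplified device record; the real code uses VkPhysicalDeviceIDProperties
// (deviceUUID) and the driver name reported by the Vulkan driver.
struct gpu_info {
    std::array<uint8_t, 16> uuid;
    std::string             driver; // e.g. "MoltenVK"
};

// Dedup based on UUID, except when both devices use MoltenVK: MoltenVK can
// report the same UUID for distinct GPUs (e.g. macOS + 4x Vega), so those
// are kept.
static std::vector<gpu_info> dedup_gpus(const std::vector<gpu_info> & devs) {
    std::vector<gpu_info> out;
    for (const auto & d : devs) {
        bool dup = false;
        for (const auto & o : out) {
            const bool both_mvk = d.driver == "MoltenVK" && o.driver == "MoltenVK";
            if (!both_mvk && d.uuid == o.uuid) {
                dup = true;
                break;
            }
        }
        if (!dup) {
            out.push_back(d);
        }
    }
    return out;
}
```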
Jeff Bolz 3409ab842d
vulkan: Set k_load_shmem to false when K is too large (#19301) 2026-02-05 08:48:33 +01:00
Jeff Bolz c342c3b93d
vulkan: fix non-contig rope (#19299) 2026-02-05 08:38:59 +01:00
will-lms af252d0758
metal : add missing includes (#19348) 2026-02-05 08:05:09 +02:00
Sigbjørn Skjæret 11fb327bf3
vendor : add missing llama_add_compile_flags (#19322)
* add missing llama_add_compile_flags

* disable all warnings for ssl, crypto and fipsmodule
2026-02-05 02:27:38 +01:00
Aaron Teo e6e934c5ea
vendor: update cpp-httplib version (#19313)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-02-05 05:15:03 +08:00
Daniel Bevenius b536eb0233
codeowners : add danbev for examples/debug (#19332)
* codeowners : add danbev for examples/debug

* Add @pwilkin to CODEOWNERS for debug

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-04 20:20:40 +01:00
Xuan-Son Nguyen e0c93af2a0
debug: make common_debug_print_tensor readable (#19331)
* debug: make common_debug_print_tensor readable

* editorconfig
2026-02-04 17:55:31 +01:00
Georgi Gerganov 423bee462b
ci : fix sanitize workflow to enable ggml sanitizers too (#19323) 2026-02-04 15:12:03 +02:00
Xuan-Son Nguyen 8abcc70a74
model: (qwen3next) correct vectorized key_gdiff calculation (#19324)
* model: (qwen3next) correct vectorized key_gdiff calculation

* move transpose to outside of loop
2026-02-04 13:09:58 +01:00
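The second bullet is plain loop-invariant hoisting; a generic C++ sketch of the pattern (not the actual qwen3next code):
```cpp
#include <vector>

// Transpose an n x n row-major matrix (helper for the sketch below).
static std::vector<float> transpose(const std::vector<float> & m, int n) {
    std::vector<float> t(m.size());
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            t[j * n + i] = m[i * n + j];
        }
    }
    return t;
}

static void process(const std::vector<float> & key, int n, int n_steps) {
    // Hoisted: the transpose does not depend on the loop variable, so it is
    // computed once here instead of once per iteration inside the loop.
    const std::vector<float> key_t = transpose(key, n);
    for (int s = 0; s < n_steps; ++s) {
        // ... use key_t in the per-step calculation
        (void) key_t;
    }
}
```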
Georgi Gerganov eaba92c3dc
tests : add non-cont, inplace rope tests (#19296)
* tests : add non-cont, inplace rope tests

* cont : exercise dim 3

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* cont : more dim3 exercises

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2026-02-04 12:45:21 +02:00
Daniel Bevenius 6ab881b7c3
model-conversion : add tensor-info.py utility (#18954)
This commit adds a new Python script that can be used to print tensor
information from a safetensors model.

The motivation for this is that during model conversion work it can
sometimes be useful to verify the shape of tensors in the original
model. While it is possible to print the tensors when loading the model,
this can be slow when working with larger models. With this script it is
possible to quickly query tensor shapes.

Example usage:
```console
(venv) $ ./scripts/utils/tensor-info.py --help
usage: tensor-info.py [-h] [-m MODEL_PATH] [-l] [tensor_name]

Print tensor information from a safetensors model

positional arguments:
  tensor_name           Name of the tensor to inspect

options:
  -h, --help            show this help message and exit
  -m MODEL_PATH, --model-path MODEL_PATH
                        Path to the model directory (default: MODEL_PATH environment variable)
  -l, --list            List unique tensor patterns in the model (layer numbers replaced with #)
```

Listing tensor names:
```console
(venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m -l
embed_tokens.weight
layers.#.input_layernorm.weight
layers.#.mlp.down_proj.weight
layers.#.mlp.gate_proj.weight
layers.#.mlp.up_proj.weight
layers.#.post_attention_layernorm.weight
layers.#.post_feedforward_layernorm.weight
layers.#.pre_feedforward_layernorm.weight
layers.#.self_attn.k_norm.weight
layers.#.self_attn.k_proj.weight
layers.#.self_attn.o_proj.weight
layers.#.self_attn.q_norm.weight
layers.#.self_attn.q_proj.weight
layers.#.self_attn.v_proj.weight
norm.weight
```

Printing a specific tensor's information:
```console
(venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m layers.0.input_layernorm.weight
Tensor: layers.0.input_layernorm.weight
File:   model.safetensors
Shape:  [768]
```
2026-02-04 10:40:53 +01:00
Georgi Gerganov d838c22bb3
spec : fix the check-rate logic of ngram-simple (#19261)
* spec : fix the check-rate logic of ngram-simple

* cont : refactor + fix checks
2026-02-04 10:39:53 +02:00
Daniel Bevenius 25f40ca65f
completion : simplify batch (embd) processing (#19286)
* completion : simplify batch (embd) processing

This commit simplifies the processing of embd by removing the existing
for loop that uses params.n_batch as its increment. It also removes the
clamping of n_eval, as the size of embd is always at most params.n_batch.

The motivation is to clarify the code: viewed in isolation, the for loop
suggests that it can process multiple batches, which it never does.

* add an assert to verify n_eval is not greater than n_batch
2026-02-04 05:43:28 +01:00
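A minimal before/after sketch of the change described above, using the variable names from the commit message (the surrounding decode call is elided):
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: a for loop that steps by n_batch and clamps n_eval, even though
// embd never holds more than n_batch tokens.
static void eval_before(const std::vector<int> & embd, int n_batch) {
    for (size_t i = 0; i < embd.size(); i += n_batch) {
        int n_eval = (int) (embd.size() - i);
        if (n_eval > n_batch) {
            n_eval = n_batch; // clamp removed by this change
        }
        // ... decode embd[i, i + n_eval)
        (void) n_eval;
    }
}

// After: a single decode of the whole buffer, with an assert documenting
// the invariant that made the loop and the clamp unnecessary.
static void eval_after(const std::vector<int> & embd, int n_batch) {
    const int n_eval = (int) embd.size();
    assert(n_eval <= n_batch);
    // ... decode embd[0, n_eval)
}
```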
Kevin Pouget 015deb9048
ggml-virtgpu: make the code thread safe (#19204)
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function

* ggml-virtgpu: deprecate buffer_type is_host remoting

not necessary

* ggml-virtgpu: stop using static vars as cache

The static init isn't thread safe.

* ggml-virtgpu: protect the use of the shared memory to transfer data

* ggml-virtgpu: make the remote calls thread-safe

* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory

* ggml-virtgpu: add a cleanup function for consistency

* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing

* fix style and ordering

* Remove the static variable in apir_device_get_count

* ggml-virtgpu: improve the logging

* fix review minor formatting changes
2026-02-04 10:46:18 +08:00
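A minimal C++ sketch of the two patterns the bullet list above addresses: replacing mutable static caches and serializing access to the shared transfer memory. All names are hypothetical, not the actual ggml-virtgpu symbols.
```cpp
#include <mutex>

// Hypothetical connection state. Instead of caching results in function-local
// statics (whose mutation from multiple threads is racy), state lives in an
// explicit object guarded by a mutex.
struct virtgpu_conn {
    std::mutex   mtx;       // serializes remote calls and shared-memory use
    void       * shmem;     // shared memory region used to transfer data
    int          dev_count; // cached once at init, read-only afterwards
};

// Every remote call takes the lock for the full request/response round trip,
// since the shared memory region is a single communication channel.
static int remote_call(virtgpu_conn & conn /*, request args ... */) {
    std::lock_guard<std::mutex> lock(conn.mtx);
    // ... write the request into conn.shmem, submit it to the host,
    //     wait for completion, then read the response back
    return 0;
}
```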
Aman Gupta 2ceda3f662
ggml-cpu: use LUT for converting e8->f32 scales on x86 (#19288)
* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro
2026-02-04 09:43:29 +08:00
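A scalar C++ sketch of the LUT idea, assuming the e8 scale is an e8m0-style biased exponent (scale = 2^(e - 127)); the actual x86 kernel operates on vectors and its details differ:
```cpp
#include <cmath>
#include <cstdint>

// 256-entry table: one float per possible e8 byte value.
static float e8_to_f32_lut[256];

static void init_e8_lut() {
    for (int i = 0; i < 256; ++i) {
        e8_to_f32_lut[i] = std::ldexp(1.0f, i - 127); // 2^(i - 127)
    }
}

// Hot path: a single table load replaces per-element exponent math.
static inline float e8_to_f32(uint8_t e) {
    return e8_to_f32_lut[e];
}
```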
Georgi Gerganov 44008ce8f9
metal : add solve_tri (#19302) 2026-02-03 23:43:14 +02:00
Georgi Gerganov 6a9bf2f788
ci : add sanitizer runs for server (#19291) 2026-02-03 22:41:20 +02:00
Georgi Gerganov faa1bc26ee
sampling : delegate input allocation to the scheduler (#19266)
* sampling : delegate input allocation to the scheduler

* graph : compute backend samplers only if needed
2026-02-03 22:16:16 +02:00
Ruben Ortlam 32b17abdb0
vulkan: disable coopmat1 fa on Nvidia Turing (#19290) 2026-02-03 17:37:32 +01:00
Aman Gupta 8bece2eb20
CUDA: use mmvq for mul-mat-id for small batch sizes (#18958)
* CUDA: use mmvq for mul-mat-id for small batch sizes

* add mmvq too

* Fix perf issue on Ampere. Use mmvf mm-id only for non-NVIDIA GPUs

* templatize multi_token_path
2026-02-03 23:31:23 +08:00
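A rough sketch of the dispatch idea described in the bullets above; the threshold, the enum names, and the exact selection logic are assumptions, not the actual values in the CUDA backend:
```cpp
// Candidate kernels for mul-mat-id (illustrative names).
enum class mmid_kernel {
    MMVQ, // quantized mat-vec path, fastest for very small batches
    MMVF, // float mat-vec path
    MMQ,  // general quantized mat-mat path
};

static mmid_kernel pick_mul_mat_id_kernel(int n_tokens, bool is_nvidia) {
    const int small_batch_threshold = 8; // hypothetical cutoff
    if (n_tokens <= small_batch_threshold) {
        // Per the commit: mmvf mm-id regressed on Ampere, so it is used
        // only for non-NVIDIA GPUs; NVIDIA GPUs take the mmvq path.
        return is_nvidia ? mmid_kernel::MMVQ : mmid_kernel::MMVF;
    }
    return mmid_kernel::MMQ;
}
```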
Sigbjørn Skjæret a6fd8ca1fe
models : remove unnecessary cont in openelm (#19289) 2026-02-03 14:20:57 +01:00
Georgi Gerganov c55bce4159
metal : minor cleanup (#19251) 2026-02-03 13:43:29 +02:00
Oliver Simons 1f1e57f2bf
CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (#19053)
By providing the stride_* variables as size_t (i.e., 64-bit), the compiler can
correctly unroll the [two for-loops](557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816))
on Blackwell (BW). This gives some perf gains for the prefill/pp phase on BW, while
not affecting other SMs:

| GPU                                                     | Model                 | Test   |   t/s master |   t/s osimons/fix_bw_mmq_fixup_kernel |   Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | pp8096 |      8404.05 |                               8375.79 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | pp8096 |     16148.93 |                              16019.60 |      0.99 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | pp8096 |      8008.29 |                               7978.80 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | pp8096 |      4263.16 |                               4248.53 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | pp8096 |      5165.11 |                               5157.43 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 |     12582.80 |                              12758.37 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | pp8096 |     16879.10 |                              17619.47 |      1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | pp8096 |     10649.90 |                              10982.65 |      1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | pp8096 |      7717.73 |                               7716.22 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | pp8096 |      7301.90 |                               7370.38 |      1.01 |
2026-02-03 11:33:14 +01:00
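An illustrative plain-C++ reduction of the mechanism: with 32-bit unsigned strides the compiler must preserve possible index wrap-around modulo 2^32 and may refuse to fully unroll the inner loops; widening the strides to size_t removes that obstacle. This is a sketch of the idea under those assumptions, not the mul_mat_q_stream_k_fixup code itself.
```cpp
#include <cstddef>

// With 'unsigned int' strides the index arithmetic may wrap, which the
// compiler must model; with size_t strides on a 64-bit target the bounds
// are statically clear and both fixed-trip-count loops can be unrolled.
static void fixup_add(float * dst, const float * partial,
                      size_t stride_row, size_t stride_col) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            dst[i * stride_row + j * stride_col] += partial[i * 4 + j];
        }
    }
}
```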