llama.cpp

Commit Graph

Author	SHA1	Message	Date
Aman Gupta	81017865ee	CUDA: fix bug in rms_norm fusion (#15660 ) * CUDA: fix bug in rms_norm fusion * Fix bug for OP_REPEAT * Fix index for add	2025-08-29 21:30:06 +08:00
Piotr Wilkin (ilintar)	60e5eee31f	chat : Seed OSS thinking + tool call support (#15552 ) * Reasoning and tool-calling support for Seed OSS * Fix grammar and partial parsing * Whitespace * New chat template * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Remove unused 'purge_healing_marker' helper --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-08-29 14:53:41 +02:00
Aman Gupta	009b709d6e	CUDA: fuse adds, fuse add with rms norm (#15631 ) * CUDA: fused add with rms_norm_mul * Non-broadcast fuse works * Add fused adds * format * Remove n_fuse from template params * Address review comments * Move template inside binbcast	2025-08-29 11:35:58 +08:00
Gabe Goodhart	e8d99dd0b6	nvidia nemotron nano v2 (nemotronh) (#15507 ) * feat: Add NEMOTRONH to python arch enum * feat: Add NEMOTRONH to c++ arch enum * feat: Add NEMOTRONH to llama-arch layer map * feat: First pass at conversion for nemotronh * feat: Add a verbose log for each tensor loaded This is really helpful for diagnosing mismatches between the expected and received tensors * feat: First (broken) pass at nemotronh model architecture It generates tokens, just not valid ones! * fix: Explicitly enable add_bos_token during conversion The `tokenizer.json`/`tokenizer_config.json` in the model are a bit contradictory. In the config, add_bos_token is set to False, but the tokenizer model itself has a post_processor that adds the BOS token via type: TemplateProcessing * fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers * fix: Only allocate attention cache for attention layers (not non-recurrent) * fix: Move residual add to after every block * fix: Use the correct norm tensor for the MLP blocks * Nemotron-H: MLP gate cleanup (pass NULL for unused gate) This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise. * SSM: respect ssm_dt_rank for dt_dim when provided Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fallback to max(64, n_embd/16). * fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage) * Rename nemotronh to nemotron_h for consistency - Update architecture name from NEMOTRONH to NEMOTRON_H in constants.py - Change architecture string from 'nemotronh' to 'nemotron_h' in all files - Update enum LLM_ARCH_NEMOTRONH to LLM_ARCH_NEMOTRON_H - Update class name llm_build_nemotronh to llm_build_nemotron_h - Consistent naming with underscore convention (nemotron_h vs nemotronh) * feat: Support conversion for older NemotronH models https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409 Branch: gabe-l-hart/nvidia-nemotron-nano-15409 --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Maicon Domingues <dominguesm@outlook.com> Co-authored-by: weatherman <fxdstudios@gmail.com>	2025-08-28 18:39:31 -06:00
Gabe Goodhart	a8bca68f72	fix: Compute the full sum in llama-eval-callback, not just the sum of printed values (#15637 ) This makes it much easier to compare between llama.cpp and transformers! https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409 Branch: gabe-l-hart/nvidia-nemotron-nano-15409 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-08-28 15:27:36 -05:00
mnehete32	c97dc09391	CUDA: add conv2d (#15635 ) * CUDA: add conv2d * CUDA: conv2d - correct formatting and added const	2025-08-28 20:33:03 +02:00
Aaron Teo	6c442f42ff	ggml-cpu: fix invalid hsum build in debug s390x (#15634 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-08-28 22:39:27 +08:00
compilade	73804145ab	ggml : fix SSM_SCAN for n_groups > 1 (#15625 )	2025-08-28 10:11:36 -04:00
Georgi Gerganov	c8d0d14e77	kv-cache : fix find_slot to not search for continuous slot (#15638 ) ggml-ci	2025-08-28 17:09:05 +03:00
Sigbjørn Skjæret	84ab83cc0b	model : jina-embeddings-v3 support (#13693 ) * initial jina-embeddings-v3 support * initial jina-embeddings-v3 support * initial jina-embeddings-v3 support * fix vocab parsing with only tokenizer.json * set mask token lstrip attribute * additional unk_token_id fallback just in case [no ci] * revert vocab_size() change [no ci] * merge tensor loading into general bert * rope * add lora embedding and loading (non-functional) * export separate lora ggufs instead * add adapter metadata api * use std::string * convert_hf_to_lora compatibility * fix assert * apply suggestions from review * apply suggestion from review	2025-08-28 15:49:50 +02:00
Aman Gupta	55042b3692	scripts: add sqlite3 check for compare-commits.sh (#15633 )	2025-08-28 19:23:22 +08:00
Georgi Gerganov	8a4280ce43	kv-cache : remove LLAMA_SET_ROWS checks (#15505 ) ggml-ci	2025-08-28 12:27:02 +03:00
Aleksei Nikiforov	64387f6e95	gguf-py: byteswapping improvements (#12851 ) * gguf-py: implement byteswapping for Q4_0 This is needed to byteswap Mistral model. Also restore original shapes after byteswapping tensors. It is not needed at the moment, but do it in case they'd be used in future. * Rework byteswapping code in gguf-py Move out details from byteswapping tensor blocks code	2025-08-28 16:56:41 +08:00
Joshua Cogliati	d35a1e8c41	cli : change log to warning to explain reason for stopping (#15604 ) * Change to warn instead of debug, to explain reason for stopping. * Update tools/main/main.cpp Fix printing --2 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-08-28 10:48:20 +03:00
Daniel Bevenius	46d9caa27a	model-conversion : add mmproj conversion target (#15628 ) This commit adds a new target to the Makefile for converting models that are multimodal. This target will convert the original model and in addition also create the mmproj GGUF model. The motivation for this change is that for models that are multimodal, for example those that contain a vision encoder, we will often want to upload both the quantized model and the vision encoder model to HuggingFace. Example usage: ```console $ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/ ... The environment variable CONVERTED_MODEL can be set to this path using: export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf ``` The converted original model can then be quantized, and after that both the quantized model and the mmproj file can then be uploaded to HuggingFace. Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main	2025-08-28 09:26:48 +02:00
matiaslin	5a0e3ef6f0	cuda: Add cublasLt_static linking when GGML_STATIC is enabled (#15622 ) Prior to this change, we faced undefined cublasLt references when attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux. We add linking with CUDA::cublasLt_static when CUDA version is greater than 10.1.	2025-08-28 02:32:36 +02:00
Johannes Gäßler	fbef0fad7a	server: higher timeout for tests (#15621 )	2025-08-27 20:58:09 +02:00
Georgi Gerganov	da54f9f1a2	presets : add qwen3-30B-a3b FIM (#15616 )	2025-08-27 15:48:07 +03:00
uvos	47373271f9	HIP: Enable support for ggml_backend_cuda_register_host_buffer (#15615 )	2025-08-27 13:58:54 +02:00
Georgi Gerganov	1bded5a3b3	kv-cache : better estimate of n_kv for multi-sequence batches (#15610 ) ggml-ci	2025-08-27 13:55:12 +03:00
Chenguang Li	1e7489745a	CANN: refactor mask handling and improve performance in FA (#15561 ) * CANN(flash-attn): refactor mask handling and improve performance 1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode. 2. Optimized performance in non-alibi scenarios by reducing one repeat operation. 3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16. Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: fix review Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: Optimization FA BNSD to BSND Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-27 17:21:41 +08:00
xctan	1cf123a343	ggml-cpu : add basic RVV support for vector f32 ops (#15057 ) * ggml-cpu : add basic RVV support for vector f32 ops * ggml-cpu : add RVV support for f32 softmax	2025-08-27 16:44:22 +08:00
Daniel Bevenius	fcca2182a1	common : add -m to bash completion for --model [no ci] (#15591 ) This commit updates the bash completion script to include the -m short option for the --model argument. The motivation for this is that currently tab completion only works with the full --model option, and it is nice to have it work for the short option as well.	2025-08-27 10:28:53 +02:00
rmatif	86076f92de	OpenCL: add fused group_norm/norm, mul, add (#15314 ) * add fused group_norm/norm, mul, add * fix spacing * revert rms_norm logic * fix trailing whitespace	2025-08-26 23:36:05 -07:00
Diego Devesa	bcbddcd54f	tests : fix test-opt with GGML_BACKEND_DL (#15599 )	2025-08-26 22:14:38 +02:00
Akarshan Biswas	8b69686136	SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (#15592 ) The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp. This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.	2025-08-27 00:27:49 +05:30
fidoriel	8ce3ff1d91	mtmd : fix mtmd ios build (#15579 )	2025-08-26 20:05:50 +02:00
Eve	44b1efa41a	tests: add performance test for mul mat id (#15543 )	2025-08-26 15:42:49 +00:00
shalinib-ibm	a6a58d6478	llamafile: PowerPC Sgemm Optimization (#15558 ) This patch improves GEMM for the FP32 data type on PowerPC. It implements GEMM on large blocks with configurable block sizes mc, nc, kc (default: 256, 256, 256). The packing function is optimized to access blocks as per memory layout. GEMM is optimized to work on larger blocks. Packing is isolated from GEMM operations for better MMA utilization. Functionality and correctness were verified using llama-cli and a standalone test case (performs matmul and compares the final matrix C result with base). Minor code refactoring changes: replace macro with inline function; code indent made consistent with 4 spaces. Performance testing: observed 50% ~ 70% improvement in prompt processing speed measured using llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with the Mistral-7b-Instruct-v0.3 model. Results for llama 8B all F32 (size 29.92 GiB, params 8.03 B, backend CPU, 20 threads), given as test: patch vs. base: pp512: 98.58 vs. 60.3; pp1024: 95.88 vs. 57.36; pp2048: 85.46 vs. 53.26; pp4096: 68.66 vs. 45.78; pp6144: 57.35 vs. 40.44. 25 ~ 30% improvement in llama-batched-bench with Meta-Llama3-8B in prompt processing speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch sizes (1, 2, 4, 8, 16). Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-08-26 23:35:25 +08:00
Georgi Gerganov	0373486dbc	graph : fix assert in memory-less build_attn (#15590 ) ggml-ci	2025-08-26 17:45:17 +03:00
Daniel Bevenius	62cef26ac5	model-conversion : add qat-q4 quantization targets (#15588 ) This commit adds two targets to the Makefile for quantizing Quantization Aware Trained (QAT) models to Q4_0 format. The motivation for this is that this sets the token embedding and the output tensors data types to Q8_0 instead of the default Q6_K. This is something that we wish to enforce for QAT Q4_0 models that are to be uploaded to ggml-org on Huggingface to guarantee the best quality.	2025-08-26 16:12:29 +02:00
Johannes Gäßler	8f5afa94c4	CUDA: return -1 for nonexistent compiled arch (#15587 )	2025-08-26 16:01:20 +02:00
Georgi Gerganov	b3964c1e89	metal : optimize FA vec for large sequences and BS <= 8 (#15566 ) * metal : optimize FA vec for large heads and sequences * metal : adjust small-batch mul mv kernels ggml-ci * batched-bench : fix total speed computation ggml-ci * cont : add comments ggml-ci	2025-08-26 14:22:14 +03:00
Xuan-Son Nguyen	79a546220c	mtmd : support Kimi VL model (#15458 ) * convert : fix tensor naming conflict for llama 4 vision * convert ok * support kimi vision model * clean up * fix style * fix calc number of output tokens * refactor resize_position_embeddings * add test case * rename build fn * correct a small bug	2025-08-26 12:54:19 +02:00
Georgi Gerganov	85cc1ae998	context : print graph stats for memory-less contexts (#15586 ) ggml-ci	2025-08-26 12:47:00 +03:00
Georgi Gerganov	1d8d83deaa	metal : improve `MUL_MAT_ID` (#15541 ) * metal : mul_mm_id remove hdst * metal : remove mul_mm_id hsrc1 * metal : mul_mm_id simplify + add test * metal : opt mul_mm_id map0 * metal : optimize mul_mm_id id gathering * metal : mul/div opt * metal : optimize mul_mm_id_map0 ggml-ci	2025-08-26 12:46:15 +03:00
tc-mb	c4e9239064	model : support MiniCPM-V 4.5 (#15575 )	2025-08-26 10:05:55 +02:00
Sigbjørn Skjæret	39842a7f73	gguf-py : remove erroneous FFN_GATE entry (#15583 )	2025-08-26 09:08:08 +02:00
Sigbjørn Skjæret	0fd90db585	metal : remove contiguous assertion for src0 in IM2COL (#15577 ) * remove contiguous assertion for src0 in IM2COL * add contiguous check in supports_op	2025-08-26 09:51:43 +03:00
Yoshi_likes_e4	4c37636b3e	Add a warning for special devices (#15563 ) * Add warning * Print the devices names * Add newlines * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix vector names --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-26 08:15:33 +02:00
Jeff Bolz	34bdbbd7c2	vulkan: Remove splitting for mul_mat_id (#15568 ) row_ids only needs to hold the BN rows for the current tile.	2025-08-26 06:42:44 +02:00
Qeeweew	74f52f77f2	CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (#15451 ) * CUDA: optimize get_int_from_table_16 * CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs * revise documentation --------- Co-authored-by: xix <xiapc@outlook.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-25 23:21:22 +02:00
lhez	f7207b0415	opencl: fix support ops condition for `rms_norm` (#15560 )	2025-08-25 14:18:09 -07:00
Ruben Ortlam	4d917cd4f6	vulkan: fix min subgroup 16 condition for mmid subgroup optimization (#15565 )	2025-08-25 17:56:59 +02:00
Jeff Bolz	886b97a5d6	tests: Generate unique input values for count_equal (#15487 ) This avoids backend-dependent behavior for argmax that leads to intermittent failures.	2025-08-25 10:47:16 -05:00
Ihar Hrachyshka	111f8d06f0	metal: fix regression when no metal devices are present (#15531 )	2025-08-25 18:27:34 +03:00
Johannes Gäßler	5eff6ec9b1	CUDA: MoE helper in device code, better tile sizes (#15525 ) * CUDA: MoE helper in device code, better tile sizes * reduce superfluous CUDA blocks	2025-08-25 17:23:40 +02:00
Daniel Bevenius	dfd9b5f6c7	model-conversion : set pooling type to none in logits.cpp (#15564 ) This commit explicitly sets the pooling type to 'none' in the logits.cpp to support models that have a pooling type specified. The motivation for this is that some models may have a pooling type set in the model file (.gguf file) and for this specific case where we only want to extract logits, we need to ensure that no pooling is used so that we are comparing raw logits and not pooled embeddings.	2025-08-25 15:00:43 +02:00
Daniel Bevenius	5a6bc6b1a6	model-conversion : add model card template for embeddings [no ci] (#15557 ) * model-conversion: add model card template for embeddings [no ci] This commit adds a separate model card template (model repository README.md template) for embedding models. The motivation for this is that the server command for the embedding model is a little different and some additional information can be useful in the model card for embedding models which might not be directly relevant for causal models. * squash! model-conversion: add model card template for embeddings [no ci] Fix pyright lint error. * remove --pooling override and clarify embd_normalize usage	2025-08-25 14:25:25 +02:00
Georgi Gerganov	6b64f74b55	batched-bench : fix unified KV cache handling + pp timing (#15562 ) * batched-bench : fix unified KV cache handling + pp timing * cont : run dummy token only with split KV cache	2025-08-25 13:56:43 +03:00
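
Commit a8bca68f72 above changes llama-eval-callback to compute the sum over the whole tensor instead of only over the printed values. The following is a minimal Python sketch of that distinction, not the actual C++ code: the function names `summarize` and `preview_only_sum`, the `n_show` parameter, and the modeling of a tensor as a flat list of numbers are all assumptions made for illustration.

```python
ELLIPSIS = "..."  # placeholder marking the elided middle of the preview


def summarize(values, n_show=3):
    """Return (preview, full_sum) for a flat list of numbers.

    Only the first and last `n_show` values appear in the preview, but the
    sum covers every element, so two runs can be compared by the sum alone.
    """
    if len(values) <= 2 * n_show:
        preview = list(values)
    else:
        preview = list(values[:n_show]) + [ELLIPSIS] + list(values[-n_show:])
    full_sum = sum(values)  # sum over ALL elements, not just the preview
    return preview, full_sum


def preview_only_sum(values, n_show=3):
    """Sum only the values that would be printed (the behavior the commit title describes as replaced)."""
    preview, _ = summarize(values, n_show)
    return sum(v for v in preview if v is not ELLIPSIS)
```

For a tensor longer than `2 * n_show` elements the two sums generally differ, which is why a sum over only the printed values is a poor basis for comparing against another implementation.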

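Commit 64387f6e95 above adds byteswapping for Q4_0 tensors in gguf-py. As a rough sketch of why a quantized type needs block-aware handling, the snippet below assumes the Q4_0 block layout of one 2-byte fp16 scale followed by 16 bytes of packed 4-bit values (18 bytes per block). Only the multi-byte scale is endian-sensitive, so only it is swapped. The function name `byteswap_q4_0` is hypothetical, and this is not the gguf-py implementation.

```python
Q4_0_BLOCK_SIZE = 18  # assumed layout: 2-byte fp16 scale + 16 bytes of packed 4-bit quants
SCALE_SIZE = 2


def byteswap_q4_0(data: bytes) -> bytes:
    """Swap the endianness of the fp16 scale in every Q4_0 block.

    The packed 4-bit quants are single bytes and are left untouched.
    """
    if len(data) % Q4_0_BLOCK_SIZE != 0:
        raise ValueError("data length is not a whole number of Q4_0 blocks")
    out = bytearray(data)
    for off in range(0, len(out), Q4_0_BLOCK_SIZE):
        # Reverse the two bytes of the scale in place.
        out[off:off + SCALE_SIZE] = out[off:off + SCALE_SIZE][::-1]
    return bytes(out)
```

Applying the function twice returns the original bytes, which gives a simple round-trip check for any converted buffer.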

6318 Commits

All Branches