llama.cpp

Commit Graph

Author	SHA1	Message	Date
Gabe Goodhart	c08002a198	chat : Granite Docling stopping (#16438 ) * fix: Fix duplicate fake image before token on first slice Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use double-newline before overview image Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove incorrect newline at the end of granite chat template gen prompt There should not be one, even for the language models. Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * tests: Remove bad newline from granite chat template test (legacy) Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-10-06 18:59:40 +02:00
Sigbjørn Skjæret	3a002afafa	ci : refactor sdk caching to minimize storage (#16414 ) * refactor sdk caching to minimize storage * use correct action * add myself as owner to /.github/actions/ [no ci]	2025-10-06 17:40:21 +02:00
Georgi Gerganov	a23b9bdbd3	ggml : fix unaligned access in AMX code (#16315 )	2025-10-06 16:05:27 +03:00
Daniel Bevenius	04e632a4aa	ci : remove missing reranker model files (#16444 ) This commit removes jina-reranker-v1-tiny-en model files that are no longer present on Hugging Face. The motivation for this is that it clears 404 errors from the CI logs, which can be a little confusing when looking at the logs for the first time. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649	2025-10-06 14:56:59 +02:00
Daniel Bevenius	a80ff183ab	ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE (#16443 ) This commit updates the leftover handling in ggml_vec_scale_f32. The motivation for this is that the code currently incorrectly assumes there would be fewer than ggml_f32_epr leftover elements. However, since the main loop processes 2 * ggml_f32_epr elements per iteration, there can be up to (2 * ggml_f32_epr - 1) leftover elements. The original single-pass leftover code could only process ggml_f32_epr elements, leaving some elements unscaled. Example scenario with 256-bit SVE: ``` ggml_f32_epr = 8 (elements per register) ggml_f32_step = 16 (two registers per iteration) n = 25 np = 16 leftovers = 9 elements (16-24) Original : processes only elements 16-23, misses element 24 This commit : loop processes elements 16-23, then element 24 ``` Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630	2025-10-06 14:17:12 +02:00
Yuannan	1d49ca3759	nix : removed metal for nix (#16118 )	2025-10-06 12:29:56 +03:00
Oleksandr Kuvshynov	c5fef0fcea	server: update readme to mention n_past_max metric (#16436 ) https://github.com/ggml-org/llama.cpp/pull/15361 added a new exported metric, but I missed this doc.	2025-10-06 10:53:31 +03:00
Gabe Goodhart	ca71fb9b36	model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206 ) * feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tiling support for idefics3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a coincidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * add test model --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-10-05 14:57:47 +02:00
Reese Levine	35266573b9	ggml webgpu: actually add softmax, fix rms_norm offset (#16400 ) * implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit	2025-10-04 20:59:31 -07:00
Eve	86df2c9ae4	vulkan: use a more appropriate amount of threads when generating shaders (#16418 ) * use a more flexible amount of threads * fix windows compile and 0 thread case * nominmax	2025-10-04 22:04:27 +02:00
Radoslav Gerganov	f39283960b	rpc : check src buffer when copying tensor (#16421 ) Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.	2025-10-04 16:22:45 +03:00
Radoslav Gerganov	898acba681	rpc : add support for multiple devices (#16276 ) * rpc : add support for multiple devices Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. closes: #15210 * fixes * use ggml_backend_reg_t * address review comments * fix llama-bench backend report * address review comments, change device naming * fix cmd order	2025-10-04 12:49:16 +03:00
Acly	e29acf74fe	vulkan : incremental shader builds (#16341 ) * vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-04 11:42:56 +02:00
Pascal	128d522c04	chat : support Magistral thinking (#16413 ) * feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls * feat: new flow in the chat template test suite for Magistral	2025-10-03 21:51:48 +03:00
ddh0	f6dcda3900	server : context checkpointing for hybrid and recurrent models (#16382 ) * initial commit for branch 3 * generalize `swa_checkpoint` to `ctx_checkpoint` this extends `llama-server`'s SWA checkpointing logic to include hybrid/recurrent models such as Jamba, Granite * oops * disable debug prints * keep backwards compat with `--swa-checkpoints` Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update prompt re-processing message * fix off-by-one error per GG * keep `seq_rm` log per GG Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server : fix checkpoint logic to support recurrent caches * server : cleanup and fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-03 21:34:51 +03:00
Georgi Gerganov	606a73f531	metal : fix loop bound in ggml_mem_ranges (#16412 )	2025-10-03 19:18:56 +03:00
Sigbjørn Skjæret	946f71ed9a	llama : fix shapes for bert/mpt q/k norm (#16409 )	2025-10-03 14:40:25 +02:00
Acly	638d330246	ggml : fix graph reallocation with multiple chunks (#16396 ) reallocation is needed if a single chunk grows in size, even if total allocation size stays the same or is lower	2025-10-03 13:49:08 +02:00
Aleksander Grygier	84c8e305e8	Fix missing messages on sibling navigation (#16408 ) * fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs * chore: update webui build output * chore: update webui build output	2025-10-03 12:51:40 +02:00
Jeff Bolz	2aaf0a2a20	vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#16354 ) * vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers. The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed beyond that limit. This allows > 4GB buffers to be allocated on some implementations (e.g. NVIDIA) and tensors this large can be used for im2col and mul_mat. For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange. I'm not sure this check is ideal, but we always use these buffers as a single full size binding and the limit may be smaller than maxMemoryAllocationSize or maxBufferSize, so I think this is reasonable. Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range. The maxStorageBufferRange may be smaller than the maxBufferSize or maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and it's invalid usage if VK_WHOLE_SIZE computes a range larger than maxStorageBufferRange. With this change, it should be possible to generate videos using wan networks in stable-diffusion.cpp. * vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull	2025-10-03 12:50:46 +02:00
Jeff Bolz	0e1f838556	vulkan: Fix FA coopmat1 invalid array indexing (#16365 ) When computing sinks, the cm1 shader was looping r from 0 to Br rather than to rows_per_thread. I must have copied this from the scalar path (where it is correct), and somehow it wasn't causing failures on current drivers.	2025-10-03 11:52:46 +02:00
Daniel Bevenius	ad126479c2	ci : change macos-13 to macos-15-intel (#16401 ) This commit updates the macos-13 runners to macos-15-intel. The motivation for this change is that the macos-13 runners are scheduled to be retired on 2025-12-04. Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/	2025-10-03 11:45:16 +02:00
Aleksander Grygier	77233277c9	Capture model name only after first token (streaming) or completed request (#16405 ) * feat: Capture model name only after first token (streaming) or completed request (non-streaming) * chore: update webui build output * chore: update webui build output	2025-10-03 11:30:39 +02:00
Jeff Bolz	e308efda8e	vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316 )	2025-10-03 10:33:08 +02:00
Aleksander Grygier	136bda78c5	webui : Fix messages payload sent to chat completions (#16402 ) * fix: Include just the currently active message branches instead of all in chat completions request * chore: Build webui static output * chore: Formatting * chore: update webui build output	2025-10-03 10:11:34 +03:00
Pascal	5113efd34c	fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (#16356 ) Use <svelte:window bind:innerHeight> instead of manual resize listener Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-10-03 08:01:31 +02:00
Sigbjørn Skjæret	d64c8104f0	test-barrier : do not use more threads than physically available (#16389 ) * do not use more threads than physically available * ensure n_threads > 0 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-02 20:10:12 +02:00
Reese Levine	ef07a40906	ggml webgpu: add support for soft_max, optimize rms_norm (#16357 ) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-02 11:00:31 -07:00
Piotr Wilkin (ilintar)	34fcc5a4ac	model : Apertus model implementation (#15852 ) * First attempt * No permute during convert (fixes qk tensors), proper norm application. * RoPE = NeoX * Coherence! * Migrate xielu params from tensors to hyperparameters * Simple CUDA kernel * Revert stupid LLM refactorings * Chat template support * configchecker / flake8 errors * Reorder unary.cu * I do conclude that LLMs are, in fact, stupid. * Fix after merge * Final newline * Make xIELU an UNARY_OP * Final newline * Correctly account for parameter shift * Argh. * Update ggml/src/ggml-cpu/unary-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Refactor: remove unused methods, inline and factorize softplus, add const modifiers * Revert CUDA changes, implement xIELU as a separate OP * Pesky newline * Add float2half / half2float for F16 inputs/outputs * CUDA variants, attempt 2 * Actually, attempt 3 * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Missing convert header * Proper formula and reference for xIELU in the comments. * Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add tensor mappings for Apertus to global list instead * Fix lazy on scalars * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Add comment about the constraints on positive/negative alpha * Change `softplus` to `ggml_softplus` --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-02 20:43:22 +03:00
R0CKSTAR	91a2a56556	musa: update compile flags (#16265 ) Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>	2025-10-02 16:29:56 +03:00
Sigbjørn Skjæret	72ee736c44	ci : fix ubuntu-latest-cmake-rpc (disable ccache) (#16388 )	2025-10-02 13:51:36 +02:00
Eve	f09aefaa84	ci: update vulkan ci (#16294 )	2025-10-02 10:10:07 +02:00
Georgi Gerganov	bbd32bc038	ci : fix clean-up of old logs (#16381 )	2025-10-02 10:35:43 +03:00
Neo Zhang Jianyu	2be72c2b12	SYCL: Update to oneAPI 2025.2 (#16371 ) * update oneapi to 2025.2, use deep-learning-essentials to replace base-tool * update to 2025.2 use deeplearn essi to replace base toolkit * add missed dll * add deep learning essentials * add sycl-ls --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-10-02 10:16:25 +03:00
uvos	95ce098544	HIP: add IMbackK to codeowner (#16375 )	2025-10-02 05:52:59 +02:00
uvos	c8dedc9999	CI: reenable cdna in rocm docker builds (#16376 )	2025-10-01 23:32:39 +02:00
uvos	e95fec640f	HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0 (#16221 ) * HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0 rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA * CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn	2025-10-01 23:09:25 +02:00
Shunta Saito	ded67b9444	llama : parameter conversion and loading fixes for PLaMo2 variants (#16075 ) * Fix to use hidden_size_per_head * Fix num heads * Fix array * Fix loading weights * Support old GGUF converted by the previous version of llama.cpp * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Move shared parameter definitions to the outside of loop * Not calculating n_embd_head_k,v by n_embd / n_head --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-01 23:08:15 +02:00
uvos	1fe4e38cc2	ci: Properly install rocwmma for hip builds (#16305 ) * CI: Properly install rocwmma for hip builds on windows we now install rocwmma from ubuntu packages * CI: update linux rocm docker build to use rocm 7.0	2025-10-01 20:18:03 +02:00
Adrien Gallouët	4201deae9c	common: introduce http.h for httplib-based client (#16373 ) * common: introduce http.h for httplib-based client This change moves cpp-httplib based URL parsing and client setup into a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`. It is an iteration towards removing libcurl, while intentionally minimizing changes to existing code to guarantee the same behavior when `LLAMA_CURL` is used. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * tools : add missing WIN32_LEAN_AND_MEAN Signed-off-by: Adrien Gallouët <adrien@gallouet.fr> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>	2025-10-01 20:22:18 +03:00
Aleksander Grygier	764799279f	Conversation action dialogs as singletons from Chat Sidebar + apply conditional rendering for Actions Dropdown for Chat Conversation Items (#16369 ) * fix: Render Conversation action dialogs as singletons from Chat Sidebar level * chore: update webui build output * fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup * chore: Update webui static build * fix: Always truncate conversation names * chore: Update webui static build	2025-10-01 18:18:10 +02:00
Aleksander Grygier	2a9b63383a	Improve code block color theming (#16325 ) * feat: Improve code block theming * chore: update webui build output * chore: Update webui static build	2025-10-01 15:54:42 +02:00
Sigbjørn Skjæret	1104ca1a1c	ci : use registry cache for docker builds (#16366 )	2025-10-01 14:09:52 +02:00
Aleksander Grygier	4f1575921c	Add optional setting for showing "Model used:" information (#16337 ) * feat: Add a setting to include model name used to generate the message * feat: UI improvements * feat: Save model info along with the database message entry creation * chore: Build webui static output	2025-10-01 12:08:16 +02:00
Eve	132d673554	vulkan: make ggml_vk_default_dispatcher support older vulkan headers (#16345 ) * make ggml_vk_default_dispatcher support older vulkan headers * simplify with using	2025-10-01 09:56:36 +02:00
Aleksander Grygier	aa9538a63a	webui: Remove running `llama-server` within WebUI `dev.sh` script (#16363 )	2025-10-01 08:40:26 +03:00
Bartowski	e74c92e842	model : support GLM 4.6 (make a few NextN/MTP tensors not required) (#16359 ) * Make a few GLM tensors not required layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6 resulting in the model not loading after conversion/quantization, this marks those tensors as not required which makes it work * Update llama-model.cpp layer.nextn.shared_head_norm also not required in case of future models	2025-09-30 22:24:36 +02:00
Sigbjørn Skjæret	b2ba81dbe0	ci : fix ccache key for ubuntu-cpu-cmake (#16355 ) * fix ccache key for ubuntu-cpu-cmake * set it for release as well [no ci]	2025-09-30 21:41:42 +02:00
Adrien Gallouët	bf6f3b3a19	common : disable progress bar without a tty (#16352 ) * common : disable progress bar without a tty Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add missing headers Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-09-30 20:52:41 +03:00
lhez	7c156df414	opencl: support pad_ext (#15888 )	2025-09-30 10:45:45 -07:00
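
The leftover-handling fix described in commit a80ff183ab (#16443) can be illustrated with a minimal scalar sketch. This is not the actual ggml SVE code: the function names `vec_scale_single_pass`, `vec_scale_looped`, and `scale_lanes` are hypothetical, and a plain loop over `epr` lanes stands in for one predicated SVE register operation. It only reproduces the arithmetic from the commit message (epr = 8, step = 16, n = 25).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one vector register operation:
 * scales up to `count` lanes starting at x[start]. */
static void scale_lanes(float *x, size_t start, size_t count, float v) {
    for (size_t i = 0; i < count; ++i) {
        x[start + i] *= v;
    }
}

/* Behaviour described as the original: the main loop handles 2*epr
 * elements per iteration, then ONE leftover pass of at most epr elements. */
void vec_scale_single_pass(float *x, size_t n, float v, size_t epr) {
    const size_t step = 2 * epr;
    const size_t np   = n - n % step;  /* elements covered by the main loop */
    for (size_t i = 0; i < np; i += step) {
        scale_lanes(x, i,       epr, v);
        scale_lanes(x, i + epr, epr, v);
    }
    if (np < n) {
        const size_t left = n - np;    /* may be as large as 2*epr - 1 */
        scale_lanes(x, np, left < epr ? left : epr, v);
    }
}

/* Behaviour described as the fix: loop over the leftovers in chunks of
 * epr until every element has been processed. */
void vec_scale_looped(float *x, size_t n, float v, size_t epr) {
    const size_t step = 2 * epr;
    const size_t np   = n - n % step;
    for (size_t i = 0; i < np; i += step) {
        scale_lanes(x, i,       epr, v);
        scale_lanes(x, i + epr, epr, v);
    }
    for (size_t i = np; i < n; i += epr) {
        const size_t left = n - i;
        scale_lanes(x, i, left < epr ? left : epr, v);
    }
}
```

Calling both functions on a 25-element array with `epr = 8` shows the difference the commit message describes: the single-pass variant leaves element 24 unscaled, while the looped variant scales elements 16-23 and then element 24.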

6699 Commits