llama.cpp

Commit Graph

Author	SHA1	Message	Date
Ryan Mangeno	94e7ece5cd	Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 12:01:47 -04:00
Ryan Mangeno	b442b43303	Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 12:01:31 -04:00
Ryan Mangeno	43332bf829	Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 12:01:15 -04:00
Ryan Mangeno	89431b6ba6	Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 12:00:56 -04:00
Ryan Mangeno	da3a1c90f8	Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 11:56:59 -04:00
Ryan Mangeno	2ea286237f	Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 11:56:40 -04:00
Ryan Mangeno	952c302128	Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 11:48:32 -04:00
Ryan Mangeno	72f1f513e0	Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 11:48:19 -04:00
Ryan Mangeno	e3ac2ae531	Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 11:48:02 -04:00
Ryan Mangeno	4187cf5a94	Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 11:47:39 -04:00
Ryan Mangeno	97e1de457c	Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 11:47:23 -04:00
Ryan Mangeno	ff9f8c2bfd	Update convert_hf_to_gguf_update.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 11:46:58 -04:00
Ryan Mangeno	3976d77524	Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-10 11:46:42 -04:00
ryan-mangeno	f362878b1c	merge	2025-10-05 11:43:47 -04:00
Gabe Goodhart	ca71fb9b36	model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206 ) * feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tiling support for idefics3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a coincidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * add test model --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-10-05 14:57:47 +02:00
Reese Levine	35266573b9	ggml webgpu: actually add softmax, fix rms_norm offset (#16400 ) * implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit	2025-10-04 20:59:31 -07:00
Eve	86df2c9ae4	vulkan: use a more appropriate amount of threads when generating shaders (#16418 ) * use a more flexible amount of threads * fix windows compile and 0 thread case * nominmax	2025-10-04 22:04:27 +02:00
ryan-mangeno	3bbf6716b4	replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion	2025-10-04 13:02:30 -04:00
Radoslav Gerganov	f39283960b	rpc : check src buffer when copying tensor (#16421 ) Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.	2025-10-04 16:22:45 +03:00
Radoslav Gerganov	898acba681	rpc : add support for multiple devices (#16276 ) * rpc : add support for multiple devices Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. closes: #15210 * fixes * use ggml_backend_reg_t * address review comments * fix llama-bench backend report * address review comments, change device naming * fix cmd order	2025-10-04 12:49:16 +03:00
Acly	e29acf74fe	vulkan : incremental shader builds (#16341 ) * vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-04 11:42:56 +02:00
Pascal	128d522c04	chat : support Magistral thinking (#16413 ) * feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls * feat: new flow in the chat template test suite for Magistral	2025-10-03 21:51:48 +03:00
ddh0	f6dcda3900	server : context checkpointing for hybrid and recurrent models (#16382 ) * initial commit for branch 3 * generalize `swa_checkpoint` to `ctx_checkpoint` this extends `llama-server`'s SWA checkpointing logic to include hybrid/recurrent models such as Jamba, Granite * oops * disable debug prints * keep backwards compat with `--swa-checkpoints` Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update prompt re-processing message * fix off-by-one error per GG * keep `seq_rm` log per GG Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server : fix checkpoint logic to support recurrent caches * server : cleanup and fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-03 21:34:51 +03:00
Georgi Gerganov	606a73f531	metal : fix loop bound in ggml_mem_ranges (#16412 )	2025-10-03 19:18:56 +03:00
Sigbjørn Skjæret	946f71ed9a	llama : fix shapes for bert/mpt q/k norm (#16409 )	2025-10-03 14:40:25 +02:00
Acly	638d330246	ggml : fix graph reallocation with multiple chunks (#16396 ) reallocation is needed if a single chunk grows in size, even if total allocation size stays the same or is lower	2025-10-03 13:49:08 +02:00
Aleksander Grygier	84c8e305e8	Fix missing messages on sibling navigation (#16408 ) * fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs * chore: update webui build output * chore: update webui build output	2025-10-03 12:51:40 +02:00
Jeff Bolz	2aaf0a2a20	vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#16354 ) * vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers. The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed beyond that limit. This allows > 4GB buffers to be allocated on some implementations (e.g. NVIDIA) and tensors this large can be used for im2col and mul_mat. For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange. I'm not sure this check is ideal, but we always use these buffers as a single full size binding and the limit may be smaller than maxMemoryAllocationSize or maxBufferSize, so I think this is reasonable. Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range. The maxStorageBufferRange may be smaller than the maxBufferSize or maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and it's invalid usage if VK_WHOLE_SIZE computes a range larger than maxStorageBufferRange. With this change, it should be possible to generate videos using wan networks in stable-diffusion.cpp. * vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull	2025-10-03 12:50:46 +02:00
Jeff Bolz	0e1f838556	vulkan: Fix FA coopmat1 invalid array indexing (#16365 ) When computing sinks, the cm1 shader was looping r from 0 to Br rather than to rows_per_thread. I must have copied this from the scalar path (where it is correct), and somehow it wasn't causing failures on current drivers.	2025-10-03 11:52:46 +02:00
Daniel Bevenius	ad126479c2	ci : change macos-13 to macos-15-intel (#16401 ) This commit updates the macos-13 runners to macos-15-intel. The motivation for this change is that the macos-13 runners are scheduled to be retired on 2025-12-04. Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/	2025-10-03 11:45:16 +02:00
Aleksander Grygier	77233277c9	Capture model name only after first token (streaming) or completed request (#16405 ) * feat: Capture model name only after first token (streaming) or completed request (non-streaming) * chore: update webui build output * chore: update webui build output	2025-10-03 11:30:39 +02:00
Jeff Bolz	e308efda8e	vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316 )	2025-10-03 10:33:08 +02:00
Aleksander Grygier	136bda78c5	webui : Fix messages payload sent to chat completions (#16402 ) * fix: Include just the currently active message branches instead of all in chat completions request * chore: Build webui static output * chore: Formatting * chore: update webui build output	2025-10-03 10:11:34 +03:00
Pascal	5113efd34c	fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (#16356 ) Use <svelte:window bind:innerHeight> instead of manual resize listener Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-10-03 08:01:31 +02:00
Sigbjørn Skjæret	d64c8104f0	test-barrier : do not use more threads than physically available (#16389 ) * do not use more threads than physically available * ensure n_threads > 0 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-02 20:10:12 +02:00
Reese Levine	ef07a40906	ggml webgpu: add support for soft_max, optimize rms_norm (#16357 ) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-02 11:00:31 -07:00
Piotr Wilkin (ilintar)	34fcc5a4ac	model : Apertus model implementation (#15852 ) * First attempt * No permute during convert (fixes qk tensors), proper norm application. * RoPE = NeoX * Coherence! * Migrate xielu params from tensors to hyperparameters * Simple CUDA kernel * Revert stupid LLM refactorings * Chat template support * configchecker / flake8 errors * Reorder unary.cu * I do conclude that LLMs are, in fact, stupid. * Fix after merge * Final newline * Make xIELU an UNARY_OP * Final newline * Correctly account for parameter shift * Argh. * Update ggml/src/ggml-cpu/unary-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Refactor: remove unused methods, inline and factorize softplus, add const modifiers * Revert CUDA changes, implement xIELU as a separate OP * Pesky newline * Add float2half / half2float for F16 inputs/outputs * CUDA variants, attempt 2 * Actually, attempt 3 * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Missing convert header * Proper formula and reference for xIELU in the comments. * Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add tensor mappings for Apertus to global list instead * Fix lazy on scalars * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Add comment about the constraints on positive/negative alpha * Change `softplus` to `ggml_softplus` --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-02 20:43:22 +03:00
R0CKSTAR	91a2a56556	musa: update compile flags (#16265 ) Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>	2025-10-02 16:29:56 +03:00
Sigbjørn Skjæret	72ee736c44	ci : fix ubuntu-latest-cmake-rpc (disable ccache) (#16388 )	2025-10-02 13:51:36 +02:00
Eve	f09aefaa84	ci: update vulkan ci (#16294 )	2025-10-02 10:10:07 +02:00
Georgi Gerganov	bbd32bc038	ci : fix clean-up of old logs (#16381 )	2025-10-02 10:35:43 +03:00
Neo Zhang Jianyu	2be72c2b12	SYCL: Update to oneAPI 2025.2 (#16371 ) * update oneapi to 2025.2, use deep-learning-essentials to replace base-tool * update to 2025.2 use deeplearn essi to replace base toolkit * add missed dll * add deep learning essentials * add sycl-ls --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-10-02 10:16:25 +03:00
uvos	95ce098544	HIP: add IMbackK to codeowner (#16375 )	2025-10-02 05:52:59 +02:00
uvos	c8dedc9999	CI: reenable cdna in rocm docker builds (#16376 )	2025-10-01 23:32:39 +02:00
uvos	e95fec640f	HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0 (#16221 ) * HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0 rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA * CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn	2025-10-01 23:09:25 +02:00
Shunta Saito	ded67b9444	llama : parameter conversion and loading fixes for PLaMo2 variants (#16075 ) * Fix to use hidden_size_per_head * Fix num heads * Fix array * Fix loading weights * Support old GGUF converted by the previous version of llama.cpp * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Move shared parameter definitions to the outside of loop * Not calculating n_embd_head_k,v by n_embd / n_head --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-01 23:08:15 +02:00
ryan-mangeno	61a0b03fd6	more modular hparam setting	2025-10-01 15:30:11 -04:00
ryan-mangeno	33eed315a3	correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...	2025-10-01 14:19:44 -04:00
uvos	1fe4e38cc2	ci: Properly install rocwmma for hip builds (#16305 ) * CI: Properly install rocwmma for hip builds on windows we now install rocwmma from ubuntu packages * CI: update linux rocm docker build to use rocm 7.0	2025-10-01 20:18:03 +02:00
ryan-mangeno	46f21826b3	merge	2025-10-01 14:08:08 -04:00


6744 Commits