llama.cpp

Commit Graph

Author	SHA1	Message	Date
Kevin Hannon	c014c3f83a	docs: add information about openvino in the docker page (#20743 )	2026-03-19 15:08:47 +08:00
Chenguang Li	7f2cbd9a4d	CANN: handle in-place ROPE on non-contiguous f32 tensors (#20274 ) RotaryPositionEmbedding on CANN fails when src and dst share the same non-contiguous buffer (inplace + view), because the operator overwrites source data before it is fully read. Add a branch that detects this case and uses contiguous temporary buffers: copy src to temp, run ROPE into another temp, then copy back to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1, inplace=1). Signed-off-by: noemotiovon <757486878@qq.com>	2026-03-19 14:05:01 +08:00
Masashi Yoshimura	509a31d00f	ggml-webgpu: Update the `RMS_NORM` preprocessor and add `L2_NORM` (#20665 ) * Update the preprocessor of RMS_NORM and add L2_NORM. * Fix the name of rms_norm to row_norm.	2026-03-18 21:08:59 -07:00
Masashi Yoshimura	ea01d196d7	ggml-webgpu: Add supports for `DIAG` and `TRI` (#20664 ) * Add supports for DIAG and TRI. * Remove extra ttype and add a comment for TRI op.	2026-03-18 21:08:35 -07:00
Chenguang Li	07ba6d275b	CANN: support flash attention for head dim not multiple of 16, fix ALiBi slope offset (#20031 ) - Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2, then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp). - Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with 48 heads); fixes buffer overflow and large numerical errors in those cases.	2026-03-19 11:02:42 +08:00
Michael Grau	6729d4920c	model : add control vector support where missing (#20653 ) * Add control vector functions to qwen3.5 and qwen-next models * Add missing cvec compatibility to the rest of the models * Adjust comments and formatting * cleanup * whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-18 23:25:12 +01:00
Sigbjørn Skjæret	d13d60af1d	gguf-py : cleaner way to get the first key (#20727 )	2026-03-18 23:21:42 +01:00
crsawyer	5744d7ec43	Rebuild index.html.gz (#20724 )	2026-03-18 18:49:57 +01:00
Reese Levine	8ced5f41f9	Move to no timeout for WaitAny in graph submission to avoid deadlocks in some cases on llvm-pipe backends (#20618 )	2026-03-18 10:23:47 -07:00
Shaw Nguyen	78d550b541	ggml-cpu/x86: fix unused changemask warning in repack (#20692 )	2026-03-18 18:45:06 +02:00
Georgi Gerganov	4efd326e71	sync : ggml	2026-03-18 15:17:28 +02:00
Georgi Gerganov	b08f7322ee	ggml : bump version to 0.9.8 (ggml/1442)	2026-03-18 15:17:28 +02:00
Georgi Gerganov	79187f2fb8	ggml : restore ggml_type_sizef() to avoid major version bump (ggml/1441)	2026-03-18 15:17:28 +02:00
Julien Chaumond	48e61238e1	webui: improve tooltip wording for attachment requirements (#20688 ) * webui: improve tooltip wording for attachment requirements Co-Authored-By: Claude <Agents+claude@huggingface.co> * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Claude <Agents+claude@huggingface.co>	2026-03-18 14:01:02 +01:00
Pop Flamingo	312cf03328	llama : re-enable manual LoRA adapter free (#19983 ) * Re-enable manual LoRA adapter free * Remove stale "all adapters must be loaded before context creation" comments	2026-03-18 12:03:26 +02:00
Masato Nakasaka	f4049ad735	tests : fix test-jinja-py Windows failures by bypassing command-line args [no ci] (#20483 ) * Fix errors occurring on Windows * Reverted fix; #20365 will take care of the CRLF issue * Changed to write directly to stdin * Prevent fclose from happening twice	2026-03-18 10:43:31 +01:00
Aldehir Rojas	5e8910a0db	common : rework gpt-oss parser (#20393 ) * common : rework gpt-oss parser * cont : fix gpt-oss tests * cont : add structured output test * cont : rename final to final_msg	2026-03-18 10:41:25 +01:00
Aaron Teo	fe00a84b4b	tests: enable kv_unified to prevent cuda oom error on rtx 2060 (#20645 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-18 17:40:22 +08:00
Aleksander Grygier	7ab321d40d	webui: Fix duplicated messages on q param (#20715 ) * fix: Remove duplicate message sending on `?q` param * chore: update webui build output	2026-03-18 10:32:43 +01:00
uvos	7533a7d509	HIP : ignore return of hipMemAdvise [no ci] (#20696 )	2026-03-18 09:53:13 +01:00
Andreas Obersteiner	a69d54f990	context : fix graph not resetting when control vector changes (#20381 )	2026-03-18 08:10:13 +02:00
Krishna Sridhar	cf23ee2447	hexagon: add neg, exp, sigmoid, softplus ops, cont, repeat ops (#20701 ) Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear attention layers. These ops follow the existing unary-ops pattern with VTCM DMA double-buffering. - neg: negate via scale by -1.0 - exp: uses existing hvx_exp_f32 HVX intrinsics - sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics - softplus: log(1 + exp(x)) scalar fallback - CONT reuses the existing CPY infrastructure since making a tensor contiguous is equivalent to a same-type copy. - REPEAT implements tiled memory copy with multi-threaded execution via the worker pool, supporting f32 and f16 types. The kernel parallelizes across output rows and uses memcpy for each tile. Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-03-17 15:34:36 -07:00
Ruben Ortlam	892e3c333a	vulkan: disable mmvq on Intel Windows driver (#20672 ) * vulkan: disable mmvq on Intel Windows driver * improve comment	2026-03-17 21:51:43 +01:00
Kevin Hannon	ee4801e5a6	ggml-blas: set mkl threads from thread context (#20602 ) * ggml blas: set mkl threads from thread context * add code to run blas locally	2026-03-18 01:16:49 +08:00
Piotr Wilkin (ilintar)	d2ecd2d1cf	common/parser: add `--skip-chat-parsing` to force a pure content parser. (#20289 ) * Add `--force-pure-content` to force a pure content parser. * Update common/arg.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Change parameter name [no ci] --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-17 16:16:43 +01:00
Taimur Ahmad	054d8b0f24	ggml-cpu: fix RVV checks in quants and repacking (#20682 ) * ggml-cpu: refactor quants.c; add rvv check * ggml-cpu: refactor; disable generic fallback	2026-03-17 16:03:40 +02:00
Sigbjørn Skjæret	ab0bb93748	ci : bump ccache [no ci] (#20679 ) * bump ccache * forgotten * disable for s390x * disable also for ppc64le	2026-03-17 14:54:31 +01:00
Ruben Ortlam	3a5cb629b1	vulkan: async and event fixes (#20518 ) * vulkan: fix event wait submission, event command buffer reset * fix event command buffer reset validation error * also reset command buffers before reuse * use timeline semaphores instead of fences for event_synchronize * don't use initializer list for semaphore wait info * use multiple events to avoid reset issues * fix event reuse issue with multiple vectors * add semaphore wait condition also if compute_ctx already exists * remove event pending stage	2026-03-17 14:27:23 +01:00
Georgi Gerganov	8cc2d81264	server : fix ctx checkpoint invalidation (#20671 )	2026-03-17 15:21:14 +02:00
Justin Bradford	627670601a	kleidiai : fix MUL_MAT support for batched (3D) inputs (#20620 ) * kleidiai : fix MUL_MAT support for batched (3D) inputs The supports_op() check incorrectly rejected MUL_MAT operations with 3D inputs (ne[2] > 1), but the actual compute_forward_qx() implementation handles batched inputs correctly via a loop over ne12. This caused models with Q4_0/Q8_0 weights to crash during graph scheduling when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during loading (tested with 2D inputs) but the runtime used 3D inputs. Also relax the buffer check to allow supports_op() to be called during weight loading when src[0]->buffer is NULL. Fixes #20608 * Kleidiai support_ops should only return true for 3D inputs, not also 4D	2026-03-17 14:03:54 +02:00
Ruben Ortlam	740a447fc3	vulkan: allow graphics queue only through env var (#20599 ) * vulkan: avoid graphics queue on non-RADV AMD drivers * avoid graphics queues on small GPUs * change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE * reenable transfer queue if graphics queue is not used	2026-03-17 10:09:59 +01:00
Neo Zhang	b6c83aad55	[SYCL] enhance UPSCALE to support all UT cases (#20637 ) * [SYCL] enhance UPSCALE to support more cases * rm test case result of SYCL1	2026-03-17 10:01:52 +08:00
Piotr Wilkin (ilintar)	2e4a6edd4a	tools/server: support refusal content for Responses API (#20285 ) * Support refusal content for Responses API * Update tools/server/server-common.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tools/server/server-common.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-17 01:42:04 +01:00
Xuan-Son Nguyen	d34ff7eb5b	model: mistral small 4 support (#20649 ) * model: mistral small 4 support * fix test * fix test (2) * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * change newline --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-17 00:31:14 +01:00
Georgi Gerganov	45172df4d6	ci : disable AMX jobs (#20654 ) [no ci]	2026-03-16 22:38:59 +02:00
Georgi Gerganov	9b342d0a9f	benches : add Nemotron 3 Nano on DGX Spark (#20652 ) [no ci]	2026-03-16 21:50:43 +02:00
Sigbjørn Skjæret	55e87026f7	tests : write to binary buffer to avoid newline translation in jinja -py [no ci] (#20365 )	2026-03-16 20:40:22 +01:00
Martin Klacer	cf21cdf36c	kleidiai: add data type check to get_tensor_traits (#20639 ) * kleidiai: add data type check to get_tensor_traits * Added check for F16 data type into get_tensor_traits path with input data not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8) Signed-off-by: Martin Klacer <martin.klacer@arm.com> Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7 * updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp updated kleidiai.cpp file as per suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-16 21:25:54 +02:00
Sigbjørn Skjæret	0ed992973b	ci : update labeler (#20629 )	2026-03-16 20:24:20 +01:00
Aldehir Rojas	1bbec6a75d	jinja : add capability check for object args (#20612 )	2026-03-16 17:43:14 +01:00
Georgi Gerganov	f47a246a08	sync : ggml	2026-03-16 17:22:06 +02:00
Georgi Gerganov	c0ccbd1f86	ggml : try fix arm build (whisper/0)	2026-03-16 17:22:06 +02:00
David366AI	f6da02c3f2	ggml : extend im2col f16 (ggml/1434) * examples/yolo: fix load_model memory leak * fix/issue-1433 ggml_compute_forward_im2col_f16 assert error * fix/issue-1433	2026-03-16 17:22:06 +02:00
Pascal	dddca026bf	webui: add model information dialog to router mode (#20600 ) * webui: add model information dialog to router mode * webui: add "Available models" section header in model list * webui: remove nested scrollbar from chat template in model info dialog * chore: update webui build output * feat: UI improvements * refactor: Cleaner rendering + UI docs * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-16 15:38:11 +01:00
Aman Gupta	3c8521c4f5	llama-graph: replace cont with reshape for alpha in qwen35 (#20640 )	2026-03-16 22:07:13 +08:00
Aleksander Grygier	67a2209fab	webui: Add MCP CORS Proxy detection logic & UI (#20167 ) * refactor: MCP store cleanup * feat: Add MCP proxy availability detection * fix: Sidebar icon * chore: update webui build output * chore: Formatting * chore: update webui build output * chore: Update package lock * chore: update webui build output * chore: update webui build output * chore: update webui build output	2026-03-16 13:05:36 +01:00
Pascal	d65c4f2dc9	Fix model selector locked to first loaded model with multiple models (#20580 ) * webui: fix model selector being locked to first loaded model When multiple models are loaded, the auto-select effect would re-fire on every loadedModelIds change, overriding the user's manual model selection. Guard with selectedModelId so auto-select only kicks in when no model is chosen yet. * chore: update webui build output	2026-03-16 12:04:06 +01:00
Woof Dog	d8c331c0af	webui: use date in more human readable exported filename (#19939 ) * webui: use date in exported filename Move conversation naming and export to utils update index.html.gz * webui: move literals to message export constants file * webui: move export naming and download back to the conversation store * chore: update webui build output * webui: add comments to some constants * chore: update webui build output	2026-03-16 11:18:13 +01:00
Ruben Ortlam	46dba9fce8	vulkan: fix flash attention dot product precision (#20589 )	2026-03-16 10:45:49 +01:00
Sigbjørn Skjæret	de8f01c2d7	model : wire up Nemotron-H tensors for NVFP4 support (#20561 ) * wire up Nemotron-H tensors for NVFP4 support * add ssm tensors * alignment	2026-03-16 09:19:16 +01:00

8421 Commits
