Commit Graph

8467 Commits

Author SHA1 Message Date
Georgi Gerganov 79187f2fb8 ggml : restore ggml_type_sizef() to avoid major version bump (ggml/1441) 2026-03-18 15:17:28 +02:00
Julien Chaumond 48e61238e1
webui: improve tooltip wording for attachment requirements (#20688)
* webui: improve tooltip wording for attachment requirements

Co-Authored-By: Claude <Agents+claude@huggingface.co>

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Claude <Agents+claude@huggingface.co>
2026-03-18 14:01:02 +01:00
Pop Flamingo 312cf03328
llama : re-enable manual LoRA adapter free (#19983)
* Re-enable manual LoRA adapter free

* Remove stale "all adapters must be loaded before context creation" comments
2026-03-18 12:03:26 +02:00
Masato Nakasaka f4049ad735
tests : fix test-jinja-py Windows failures by bypassing command-line args [no ci] (#20483)
* Fix errors occurring on Windows

* Reverted fix

#20365 will take care of the CRLF issue

* Changed to write directly to stdin

* Prevent fclose from happening twice
2026-03-18 10:43:31 +01:00
Aldehir Rojas 5e8910a0db
common : rework gpt-oss parser (#20393)
* common : rework gpt-oss parser

* cont : fix gpt-oss tests

* cont : add structured output test

* cont : rename final to final_msg
2026-03-18 10:41:25 +01:00
Aaron Teo fe00a84b4b
tests: enable kv_unified to prevent cuda oom error on rtx 2060 (#20645)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-03-18 17:40:22 +08:00
Aleksander Grygier 7ab321d40d
webui: Fix duplicated messages on q param (#20715)
* fix: Remove duplicate message sending on `?q` param

* chore: update webui build output
2026-03-18 10:32:43 +01:00
uvos 7533a7d509
HIP : ignore return of hipMemAdvise [no ci] (#20696) 2026-03-18 09:53:13 +01:00
Andreas Obersteiner a69d54f990
context : fix graph not resetting when control vector changes (#20381) 2026-03-18 08:10:13 +02:00
Krishna Sridhar cf23ee2447
hexagon: add neg, exp, sigmoid, softplus, cont, repeat ops (#20701)
Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.

- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback (see the sketch below)
- CONT reuses the existing CPY infrastructure since making a tensor
  contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
  the worker pool, supporting f32 and f16 types. The kernel parallelizes
  across output rows and uses memcpy for each tile.
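
A scalar softplus fallback of this kind can be sketched as follows. This is a minimal illustration, not the actual hexagon kernel; the overflow guard threshold is an assumption:

    #include <cmath>

    // Minimal sketch of a scalar softplus fallback: y = log(1 + exp(x)).
    // log1p is more accurate than log(1 + e) when exp(x) is small.
    static void softplus_f32_scalar(const float * x, float * y, int n) {
        for (int i = 0; i < n; i++) {
            // guard (an assumption, not from the kernel): for large x,
            // exp(x) overflows and softplus(x) ~= x anyway
            y[i] = x[i] > 20.0f ? x[i] : std::log1p(std::exp(x[i]));
        }
    }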

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-17 15:34:36 -07:00
Ruben Ortlam 892e3c333a
vulkan: disable mmvq on Intel Windows driver (#20672)
* vulkan: disable mmvq on Intel Windows driver

* improve comment
2026-03-17 21:51:43 +01:00
Kevin Hannon ee4801e5a6
ggml-blas: set mkl threads from thread context (#20602)
* ggml blas: set mkl threads from thread context

* add code to run blas locally
2026-03-18 01:16:49 +08:00
Piotr Wilkin (ilintar) d2ecd2d1cf
common/parser: add `--skip-chat-parsing` to force a pure content parser. (#20289)
* Add `--force-pure-content` to force a pure content parser.

* Update common/arg.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Change parameter name [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Taimur Ahmad 054d8b0f24
ggml-cpu: fix RVV checks in quants and repacking (#20682)
* ggml-cpu: refactor quants.c; add rvv check

* ggml-cpu: refactor; disable generic fallback
2026-03-17 16:03:40 +02:00
Sigbjørn Skjæret ab0bb93748
ci : bump ccache [no ci] (#20679)
* bump ccache

* forgotten

* disable for s390x

* disable also for ppc64le
2026-03-17 14:54:31 +01:00
Ruben Ortlam 3a5cb629b1
vulkan: async and event fixes (#20518)
* vulkan: fix event wait submission, event command buffer reset

* fix event command buffer reset validation error

* also reset command buffers before reuse

* use timeline semaphores instead of fences for event_synchronize

* don't use initializer list for semaphore wait info

* use multiple events to avoid reset issues

* fix event reuse issue with multiple vectors

* add semaphore wait condition also if compute_ctx already exists

* remove event pending stage
2026-03-17 14:27:23 +01:00
Georgi Gerganov 8cc2d81264
server : fix ctx checkpoint invalidation (#20671) 2026-03-17 15:21:14 +02:00
Justin Bradford 627670601a
kleidiai : fix MUL_MAT support for batched (3D) inputs (#20620)
* kleidiai : fix MUL_MAT support for batched (3D) inputs

The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.

This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.

Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.

Fixes #20608

* Kleidiai supports_op should only return true for 3D inputs, not 4D
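
The shape gate described above amounts to roughly the following sketch, using a simplified stand-in for ggml_tensor (illustrative only; the real supports_op() in kleidiai.cpp also checks types, strides, and buffer kinds):

    // Simplified stand-in for the batched MUL_MAT shape check.
    struct tensor_shape {
        long ne[4]; // dimensions, ggml convention: ne[2]/ne[3] are batch dims
    };

    static bool supports_mul_mat_batched(const tensor_shape & src0,
                                         const tensor_shape & src1) {
        // note: per the commit, the real check was also relaxed to run
        // while src[0]->buffer is still NULL during weight loading
        if (src0.ne[3] > 1 || src1.ne[3] > 1) {
            return false; // 4D inputs are not handled
        }
        return true; // 2D, or 3D batched via the loop over ne12
    }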
2026-03-17 14:03:54 +02:00
Ruben Ortlam 740a447fc3
vulkan: allow graphics queue only through env var (#20599)
* vulkan: avoid graphics queue on non-RADV AMD drivers

* avoid graphics queues on small GPUs

* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE

* reenable transfer queue if graphics queue is not used
2026-03-17 10:09:59 +01:00
Neo Zhang b6c83aad55
[SYCL] enhance UPSCALE to support all UT cases (#20637)
* [SYCL] enhance UPSCALE to support more cases

* rm test case result of SYCL1
2026-03-17 10:01:52 +08:00
Piotr Wilkin (ilintar) 2e4a6edd4a
tools/server: support refusal content for Responses API (#20285)
* Support refusal content for Responses API

* Update tools/server/server-common.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tools/server/server-common.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 01:42:04 +01:00
Xuan-Son Nguyen d34ff7eb5b
model: mistral small 4 support (#20649)
* model: mistral small 4 support

* fix test

* fix test (2)

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* change newline

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 00:31:14 +01:00
Georgi Gerganov 45172df4d6
ci : disable AMX jobs (#20654)
[no ci]
2026-03-16 22:38:59 +02:00
Georgi Gerganov 9b342d0a9f
benches : add Nemotron 3 Nano on DGX Spark (#20652)
[no ci]
2026-03-16 21:50:43 +02:00
Sigbjørn Skjæret 55e87026f7
tests : write to binary buffer to avoid newline translation in jinja -py [no ci] (#20365) 2026-03-16 20:40:22 +01:00
Martin Klacer cf21cdf36c
kleidiai: add data type check to get_tensor_traits (#20639)
* kleidiai: add data type check to get_tensor_traits

* Added a check for the F16 data type in the get_tensor_traits path for input data
   not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8)

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7

* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp

updated kleidiai.cpp file as per suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-16 21:25:54 +02:00
Sigbjørn Skjæret 0ed992973b
ci : update labeler (#20629) 2026-03-16 20:24:20 +01:00
Aldehir Rojas 1bbec6a75d
jinja : add capability check for object args (#20612) 2026-03-16 17:43:14 +01:00
Georgi Gerganov f47a246a08 sync : ggml 2026-03-16 17:22:06 +02:00
Georgi Gerganov c0ccbd1f86 ggml : try fix arm build (whisper/0) 2026-03-16 17:22:06 +02:00
David366AI f6da02c3f2 ggml : extend im2col f16 (ggml/1434)
* examples/yolo: fix load_model memory leak

* fix/issue-1433 ggml_compute_forward_im2col_f16 assert error

* fix/issue-1433
2026-03-16 17:22:06 +02:00
Pascal dddca026bf
webui: add model information dialog to router mode (#20600)
* webui: add model information dialog to router mode

* webui: add "Available models" section header in model list

* webui: remove nested scrollbar from chat template in model info dialog

* chore: update webui build output

* feat: UI improvements

* refactor: Cleaner rendering + UI docs

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-16 15:38:11 +01:00
Aman Gupta 3c8521c4f5
llama-graph: replace cont with reshape for alpha in qwen35 (#20640) 2026-03-16 22:07:13 +08:00
Aleksander Grygier 67a2209fab
webui: Add MCP CORS Proxy detection logic & UI (#20167)
* refactor: MCP store cleanup

* feat: Add MCP proxy availability detection

* fix: Sidebar icon

* chore: update webui build output

* chore: Formatting

* chore: update webui build output

* chore: Update package lock

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output
2026-03-16 13:05:36 +01:00
Pascal d65c4f2dc9
Fix model selector locked to first loaded model with multiple models (#20580)
* webui: fix model selector being locked to first loaded model

When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.

* chore: update webui build output
2026-03-16 12:04:06 +01:00
Woof Dog d8c331c0af
webui: use date in more human readable exported filename (#19939)
* webui: use date in exported filename

Move conversation naming and export to utils

update index.html.gz

* webui: move literals to message export constants file

* webui: move export naming and download back to the conversation store

* chore: update webui build output

* webui: add comments to some constants

* chore: update webui build output
2026-03-16 11:18:13 +01:00
Ruben Ortlam 46dba9fce8
vulkan: fix flash attention dot product precision (#20589) 2026-03-16 10:45:49 +01:00
Sigbjørn Skjæret de8f01c2d7
model : wire up Nemotron-H tensors for NVFP4 support (#20561)
* wire up Nemotron-H tensors for NVFP4 support

* add ssm tensors

* alignment
2026-03-16 09:19:16 +01:00
Richard Davison 079e5a45f0
convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization (#20539)
* support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization

* cleanup

* fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-16 09:18:47 +01:00
Masato Nakasaka d3936498a3
common : fix iterator::end() dereference (#20445) 2026-03-16 08:50:38 +02:00
Aman Gupta 34818ea6c0
CUDA: GDN hide memory latency (#20537) 2026-03-16 11:41:45 +08:00
Tim Burke 8036edc99a ggml: eliminate hot-path heap allocations and fix tiled MXFP multihead dequant
Replace per-row/per-tile std::vector heap allocations with stack buffers
in set_rows, one_chunk, and tiled flash attention paths. Fix tiled path
to use per-head SoA extraction (matching one_chunk) instead of dequanting
the full multihead region per token.
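
The allocation pattern being removed looks roughly like this before/after sketch (names and the size bound are hypothetical, not the actual ggml code):

    #include <vector>

    // Before: a heap allocation on every row of the hot loop.
    static void rows_heap(float * dst, const float * src, int n_rows, int row_size) {
        for (int r = 0; r < n_rows; r++) {
            std::vector<float> tmp(row_size); // allocator traffic per row
            for (int i = 0; i < row_size; i++) tmp[i] = src[r * row_size + i];
            for (int i = 0; i < row_size; i++) dst[r * row_size + i] = tmp[i];
        }
    }

    // After: one stack buffer reused across rows (assumes a known upper
    // bound on row_size; real code must assert or guard this).
    static void rows_stack(float * dst, const float * src, int n_rows, int row_size) {
        float tmp[256];
        for (int r = 0; r < n_rows; r++) {
            for (int i = 0; i < row_size; i++) tmp[i] = src[r * row_size + i];
            for (int i = 0; i < row_size; i++) dst[r * row_size + i] = tmp[i];
        }
    }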
2026-03-15 22:55:34 -04:00
Tim Burke b8e8d291d1 ggml: refactor x86 AVX2 and ARM NEON MXFP dequant — shared traits and helpers
Add mxfp_dequant_traits_t to ggml-common.h as single source of truth for
MXFP IEEE-754 reconstruction parameters. Define static const instances for
all 4 formats (E4M3, E5M2, E2M3, E3M2), ready for CUDA/Metal/Vulkan reuse.

Extract shared dequant and FP6 unpack helpers on both architectures,
replacing duplicated inline code and macros. Net -215 lines.
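
A traits struct of this kind could take the following shape. The field names are assumptions, not the actual ggml-common.h definition; the bias values follow the OCP FP8/FP6 element formats:

    // Hypothetical sketch: per-format parameters for rebuilding an
    // IEEE-754 f32 from a packed FP8/FP6 element.
    struct mxfp_dequant_traits_t {
        int exp_bits;  // exponent field width (e.g. 4 for E4M3)
        int mant_bits; // mantissa field width (e.g. 3 for E4M3)
        int exp_bias;  // exponent bias of the element format
    };

    static const mxfp_dequant_traits_t MXFP_E4M3 = { 4, 3,  7 };
    static const mxfp_dequant_traits_t MXFP_E5M2 = { 5, 2, 15 };
    static const mxfp_dequant_traits_t MXFP_E2M3 = { 2, 3,  1 };
    static const mxfp_dequant_traits_t MXFP_E3M2 = { 3, 2,  3 };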
2026-03-15 21:37:02 -04:00
Tim Burke c913ab36d2 fix buffer overflows for large DK and multi-head MXFP flash attention
- Increase q_mxfp_buf from 512 to 2048 bytes (supports DK up to 1024 with MXFP8)
- Replace fixed k_soa[4096]/v_soa[4096] stack arrays with dynamically sized vectors
- Replace fixed k_head_soa[320]/v_head_soa[320] with dynamically sized vectors
- Add soa_bytes divisibility assertion in test init
2026-03-15 20:30:12 -04:00
Tim Burke f603c036ec Comment consistency pass and cleanup. 2026-03-15 20:14:52 -04:00
Tim Burke c2f2ff7814 ggml: optimize CPU MXFP flash attention hot loop
- Per-head dequant: multihead MXFP now extracts only the needed head's
  SoA blocks (e.g. 20 bytes for mxfp4 DK=128) into a stack buffer and
  dequants DK elements, instead of dequanting all heads (nek2*DK).
  For 8 KV heads this is 8x less dequant work per KV position.

- Hoist loop invariants (sketched below): base pointer offsets (k_base, v_base),
  per-head SoA byte offsets, and multihead row bases are computed once
  per query row instead of per KV position in the inner loop.

- Precompute SoA addressing in mxfp_fa_params_init: qs_per_block,
  blocks_per_head, head_qs_bytes, and head_e8m0_offset are calculated
  once at init rather than derived per iteration.

- Move thread-local buffer pointers (VKQ32, V32, VKQ16, Q_q) and
  v_is_f16 check outside the ir loop.
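
A generic sketch of the hoisting pattern, with all names hypothetical: addressing that depends only on the query row, here the per-head SoA byte offset, is computed once per row instead of being rederived at every KV position.

    #include <cstddef>

    static void fa_rows(const unsigned char * k_data, size_t row_stride,
                        size_t head_qs_bytes, int n_heads,
                        int n_q_rows, int n_kv) {
        for (int iq = 0; iq < n_q_rows; iq++) {
            // hoisted: invariant across the inner KV loop
            const size_t head_off = (size_t)(iq % n_heads) * head_qs_bytes;
            const unsigned char * k_base = k_data + head_off;
            for (int ikv = 0; ikv < n_kv; ikv++) {
                const unsigned char * k_row = k_base + (size_t) ikv * row_stride;
                (void) k_row; // ... dequant one head's blocks, dot with query iq ...
            }
        }
    }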
2026-03-15 19:49:27 -04:00
Tim Burke a51ff77fae ggml: address PR review — fix buffer overflows, add assertions, normalize MXFP6 naming
Fix potential buffer overflows flagged in PR #20609 review:
- set_rows: replace fixed float tmp[1024] with std::vector for large n_embd_k_gqa
- tiled FA: size q_mxfp_buf with ggml_row_size guard instead of fixed 1024
- one_chunk FA: pre-allocate k/v dequant buffers from mxfp.{k,v}_soa_elems
  instead of hard-coded float[4096] stack arrays
- kv-cache: assert n_embd_k_gqa % qk == 0 before integer division
- test init: assert soa_bytes % block_size == 0

Normalize MXFP6 function naming to match MXFP8 convention (short form
without element format suffix): mxfp6_e2m3 → mxfp6 in all function
identifiers across 14 files. Format-specific items (type enums, traits,
lookup tables, constants) retain their _e2m3 suffix.
2026-03-15 18:57:50 -04:00
Tim Burke 5c3a9523ef Merge remote-tracking branch 'origin/mxfp-flash-attention' into mxfp-flash-attention 2026-03-15 18:00:11 -04:00
Piotr Wilkin (ilintar) 9e2e2198b0
tools/cli: fix disable reasoning (#20606) 2026-03-15 22:40:53 +01:00
Tim Burke d8c9f9c7f6 ggml: MXFP flash attention with SoA layout (CPU scalar reference)
Add MXFP KV cache quantization for flash attention using Struct-of-Arrays
(SoA) memory layout exclusively. Three MX types: MXFP4 (E2M1), MXFP8
(E4M3), MXFP6 (E2M3), implementing the OCP Microscaling v1.0 spec.

SoA layout stores [qs contiguous][e8m0 contiguous] per row, enabling
aligned memory access patterns for GPU backends. All functions in the
flash attention pipeline — set_rows quantization, Q preprocessing, K/V
dequantization — use SoA end-to-end. The existing AoS block layout
remains for MUL_MAT weight quantization (untouched).
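
Addressing into such a row can be sketched as follows (a minimal illustration of the [qs][e8m0] split described above; struct and byte counts are assumptions, not the actual macros):

    #include <cstddef>
    #include <cstdint>

    // One SoA row: all quantized mantissa bytes first, then all e8m0
    // scales. Block size 32 matches the OCP MX spec.
    struct soa_row {
        const uint8_t * base;      // start of the row
        size_t qs_bytes_per_block; // e.g. 16 for MXFP4 (32 x 4-bit)
        size_t n_blocks;           // blocks of 32 elements in the row
    };

    static const uint8_t * soa_qs(const soa_row & r, size_t block) {
        return r.base + block * r.qs_bytes_per_block;
    }

    static const uint8_t * soa_e8m0(const soa_row & r, size_t block) {
        // scales live after all qs bytes
        return r.base + r.n_blocks * r.qs_bytes_per_block + block;
    }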

Q preprocessing applies Walsh-Hadamard rotation (block-32) before
quantize/dequant round-trip, distributing outlier energy across the
shared exponent group. This is essential for perplexity:
  MXFP8: +0.22 PPL without rotation
  MXFP6: +3.34 PPL without rotation
Hadamard is skipped for MLA models (DK != DV) where V is a view of K.
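
For reference, an in-place fast Walsh-Hadamard transform over a block of 32 floats looks like the sketch below, normalized by 1/sqrt(32) so the rotation is orthonormal. This is the standard butterfly formulation, not necessarily the actual ggml kernel, which may fuse normalization or reorder differently:

    #include <cmath>

    static void hadamard32(float * x) {
        for (int len = 1; len < 32; len <<= 1) {
            for (int i = 0; i < 32; i += len << 1) {
                for (int j = i; j < i + len; j++) {
                    const float a = x[j];
                    const float b = x[j + len];
                    x[j]       = a + b;
                    x[j + len] = a - b;
                }
            }
        }
        const float scale = 1.0f / std::sqrt(32.0f);
        for (int i = 0; i < 32; i++) {
            x[i] *= scale;
        }
    }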

Shared infrastructure in ggml-common.h:
- Block structures (block_mxfp8: 33B, block_mxfp6: 25B per 32 elements)
- E8M0 MSE-optimal scale search with ±1 range
- Canonical element converters (FP8 E4M3/E5M2, FP6 E2M3/E3M2)
- FP6 tight packing (4 six-bit values in 3 bytes, 25% savings; sketched below)
- IEEE-754 bit reconstruction constants for SIMD backends
- SoA layout macros, portable bit cast, type property queries
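
The FP6 packing arithmetic referenced above fits four 6-bit values into 24 bits, i.e. three bytes instead of four, hence the 25% saving. A sketch, with the bit order an assumption that may differ from the actual ggml layout:

    #include <cstdint>

    static void pack_fp6x4(const uint8_t v[4], uint8_t out[3]) {
        const uint32_t bits = (uint32_t)(v[0] & 0x3F)
                            | (uint32_t)(v[1] & 0x3F) << 6
                            | (uint32_t)(v[2] & 0x3F) << 12
                            | (uint32_t)(v[3] & 0x3F) << 18;
        out[0] = (uint8_t)(bits);
        out[1] = (uint8_t)(bits >> 8);
        out[2] = (uint8_t)(bits >> 16);
    }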

CPU implementation:
- Scalar reference + ARM NEON + x86 AVX2 optimized paths
- Both FA paths supported: one_chunk (scalar) and tiled (SIMD GEMM)
- Split-KV path extended for single-query decode
- Generic vec_dot via dequant-to-float for MUL_MAT compatibility
- Arch fallbacks for loongarch, powerpc, riscv, s390, wasm

KV cache integration:
- set_rows writes SoA with optional Hadamard (op_params[0] flag)
- K cache block-aligned to 16 for CUDA cp.async compatibility
- CLI: --cache-type-k/v with short aliases (mxfp4, mxfp6, mxfp8)

Tests:
- Flash attention: all 3 types at D=64/128, mixed K/V (mxfp8+mxfp4)
- SET_ROWS: Hadamard rotation for all types
- SoA-aware test initialization and comparison for MXFP tensors
- Quantize functions coverage for all types

Rename GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends
(CPU, OpenCL, SYCL) for consistency with the MX type family naming.
2026-03-15 17:33:19 -04:00