Commit Graph

8467 Commits

Author SHA1 Message Date
Georgi Gerganov 79187f2fb8 ggml : restore ggml_type_sizef() to avoid major version bump (ggml/1441) 2026-03-18 15:17:28 +02:00
Julien Chaumond 48e61238e1
webui: improve tooltip wording for attachment requirements (#20688)
* webui: improve tooltip wording for attachment requirements

Co-Authored-By: Claude <Agents+claude@huggingface.co>

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Claude <Agents+claude@huggingface.co>
2026-03-18 14:01:02 +01:00
Pop Flamingo 312cf03328
llama : re-enable manual LoRA adapter free (#19983)
* Re-enable manual LoRA adapter free

* Remove stale "all adapters must be loaded before context creation" comments
2026-03-18 12:03:26 +02:00
Masato Nakasaka f4049ad735
tests : fix test-jinja-py Windows failures by bypassing command-line args [no ci] (#20483)
* Fix errors occurring on Windows

* Reverted fix

#20365 will take care of the CRLF issue

* Changed to write directly to stdin

* Prevent fclose from happening twice
2026-03-18 10:43:31 +01:00
Aldehir Rojas 5e8910a0db
common : rework gpt-oss parser (#20393)
* common : rework gpt-oss parser

* cont : fix gpt-oss tests

* cont : add structured output test

* cont : rename final to final_msg
2026-03-18 10:41:25 +01:00
Aaron Teo fe00a84b4b
tests: enable kv_unified to prevent cuda oom error on rtx 2060 (#20645)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-03-18 17:40:22 +08:00
Aleksander Grygier 7ab321d40d
webui: Fix duplicated messages on q param (#20715)
* fix: Remove duplicate message sending on `?q` param

* chore: update webui build output
2026-03-18 10:32:43 +01:00
uvos 7533a7d509
HIP : ignore return of hipMemAdvise [no ci] (#20696) 2026-03-18 09:53:13 +01:00
Andreas Obersteiner a69d54f990
context : fix graph not resetting when control vector changes (#20381) 2026-03-18 08:10:13 +02:00
Krishna Sridhar cf23ee2447
hexagon: add neg, exp, sigmoid, softplus, cont, repeat ops (#20701)
Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.

- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback (see the sketch below)
- CONT reuses the existing CPY infrastructure since making a tensor
  contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
  the worker pool, supporting f32 and f16 types. The kernel parallelizes
  across output rows and uses memcpy for each tile.
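
A scalar softplus fallback of this kind can be sketched as follows. This is a minimal illustration, not the actual hexagon kernel; the overflow guard threshold is an assumption:

    #include <cmath>

    // Minimal sketch of a scalar softplus fallback: y = log(1 + exp(x)).
    // log1p is more accurate than log(1 + e) when exp(x) is small.
    static void softplus_f32_scalar(const float * x, float * y, int n) {
        for (int i = 0; i < n; i++) {
            // guard (an assumption, not from the kernel): for large x,
            // exp(x) overflows and softplus(x) ~= x anyway
            y[i] = x[i] > 20.0f ? x[i] : std::log1p(std::exp(x[i]));
        }
    }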

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-17 15:34:36 -07:00
Ruben Ortlam 892e3c333a
vulkan: disable mmvq on Intel Windows driver (#20672)
* vulkan: disable mmvq on Intel Windows driver

* improve comment
2026-03-17 21:51:43 +01:00
Kevin Hannon ee4801e5a6
ggml-blas: set mkl threads from thread context (#20602)
* ggml blas: set mkl threads from thread context

* add code to run blas locally
2026-03-18 01:16:49 +08:00
Piotr Wilkin (ilintar) d2ecd2d1cf
common/parser: add `--skip-chat-parsing` to force a pure content parser. (#20289)
* Add `--force-pure-content` to force a pure content parser.

* Update common/arg.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Change parameter name [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Taimur Ahmad 054d8b0f24
ggml-cpu: fix RVV checks in quants and repacking (#20682)
* ggml-cpu: refactor quants.c; add rvv check

* ggml-cpu: refactor; disable generic fallback
2026-03-17 16:03:40 +02:00
Sigbjørn Skjæret ab0bb93748
ci : bump ccache [no ci] (#20679)
* bump ccache

* forgotten

* disable for s390x

* disable also for ppc64le
2026-03-17 14:54:31 +01:00
Ruben Ortlam 3a5cb629b1
vulkan: async and event fixes (#20518)
* vulkan: fix event wait submission, event command buffer reset

* fix event command buffer reset validation error

* also reset command buffers before reuse

* use timeline semaphores instead of fences for event_synchronize

* don't use initializer list for semaphore wait info

* use multiple events to avoid reset issues

* fix event reuse issue with multiple vectors

* add semaphore wait condition also if compute_ctx already exists

* remove event pending stage
2026-03-17 14:27:23 +01:00
Georgi Gerganov 8cc2d81264
server : fix ctx checkpoint invalidation (#20671) 2026-03-17 15:21:14 +02:00
Justin Bradford 627670601a
kleidiai : fix MUL_MAT support for batched (3D) inputs (#20620)
* kleidiai : fix MUL_MAT support for batched (3D) inputs

The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.

This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.

Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.

Fixes #20608

* Kleidiai supports_op should only return true for 3D inputs, not 4D
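
The shape gate described above amounts to roughly the following sketch, using a simplified stand-in for ggml_tensor (illustrative only; the real supports_op() in kleidiai.cpp also checks types, strides, and buffer kinds):

    // Simplified stand-in for the batched MUL_MAT shape check.
    struct tensor_shape {
        long ne[4]; // dimensions, ggml convention: ne[2]/ne[3] are batch dims
    };

    static bool supports_mul_mat_batched(const tensor_shape & src0,
                                         const tensor_shape & src1) {
        // note: per the commit, the real check was also relaxed to run
        // while src[0]->buffer is still NULL during weight loading
        if (src0.ne[3] > 1 || src1.ne[3] > 1) {
            return false; // 4D inputs are not handled
        }
        return true; // 2D, or 3D batched via the loop over ne12
    }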
2026-03-17 14:03:54 +02:00
Ruben Ortlam 740a447fc3
vulkan: allow graphics queue only through env var (#20599)
* vulkan: avoid graphics queue on non-RADV AMD drivers

* avoid graphics queues on small GPUs

* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE

* reenable transfer queue if graphics queue is not used
2026-03-17 10:09:59 +01:00
Neo Zhang b6c83aad55
[SYCL] enhance UPSCALE to support all UT cases (#20637)
* [SYCL] enhance UPSCALE to support more cases

* rm test case result of SYCL1
2026-03-17 10:01:52 +08:00
Piotr Wilkin (ilintar) 2e4a6edd4a
tools/server: support refusal content for Responses API (#20285)
* Support refusal content for Responses API

* Update tools/server/server-common.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tools/server/server-common.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 01:42:04 +01:00
Xuan-Son Nguyen d34ff7eb5b
model: mistral small 4 support (#20649)
* model: mistral small 4 support

* fix test

* fix test (2)

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* change newline

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 00:31:14 +01:00
Georgi Gerganov 45172df4d6
ci : disable AMX jobs (#20654)
[no ci]
2026-03-16 22:38:59 +02:00
Georgi Gerganov 9b342d0a9f
benches : add Nemotron 3 Nano on DGX Spark (#20652)
[no ci]
2026-03-16 21:50:43 +02:00
Sigbjørn Skjæret 55e87026f7
tests : write to binary buffer to avoid newline translation in jinja -py [no ci] (#20365) 2026-03-16 20:40:22 +01:00
Martin Klacer cf21cdf36c
kleidiai: add data type check to get_tensor_traits (#20639)
* kleidiai: add data type check to get_tensor_traits

* Added a check for the F16 data type in the get_tensor_traits path for input data
   not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8)

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7

* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp

updated kleidiai.cpp file as per suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-16 21:25:54 +02:00
Sigbjørn Skjæret 0ed992973b
ci : update labeler (#20629) 2026-03-16 20:24:20 +01:00
Aldehir Rojas 1bbec6a75d
jinja : add capability check for object args (#20612) 2026-03-16 17:43:14 +01:00
Georgi Gerganov f47a246a08 sync : ggml 2026-03-16 17:22:06 +02:00
Georgi Gerganov c0ccbd1f86 ggml : try fix arm build (whisper/0) 2026-03-16 17:22:06 +02:00
David366AI f6da02c3f2 ggml : extend im2col f16 (ggml/1434)
* examples/yolo: fix load_model memory leak

* fix/issue-1433 ggml_compute_forward_im2col_f16 assert error

* fix/issue-1433
2026-03-16 17:22:06 +02:00
Pascal dddca026bf
webui: add model information dialog to router mode (#20600)
* webui: add model information dialog to router mode

* webui: add "Available models" section header in model list

* webui: remove nested scrollbar from chat template in model info dialog

* chore: update webui build output

* feat: UI improvements

* refactor: Cleaner rendering + UI docs

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-16 15:38:11 +01:00
Aman Gupta 3c8521c4f5
llama-graph: replace cont with reshape for alpha in qwen35 (#20640) 2026-03-16 22:07:13 +08:00
Aleksander Grygier 67a2209fab
webui: Add MCP CORS Proxy detection logic & UI (#20167)
* refactor: MCP store cleanup

* feat: Add MCP proxy availability detection

* fix: Sidebar icon

* chore: update webui build output

* chore: Formatting

* chore: update webui build output

* chore: Update package lock

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output
2026-03-16 13:05:36 +01:00
Pascal d65c4f2dc9
Fix model selector locked to first loaded model with multiple models (#20580)
* webui: fix model selector being locked to first loaded model

When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.

* chore: update webui build output
2026-03-16 12:04:06 +01:00
Woof Dog d8c331c0af
webui: use date in more human readable exported filename (#19939)
* webui: use date in exported filename

Move conversation naming and export to utils

update index.html.gz

* webui: move literals to message export constants file

* webui: move export naming and download back to the conversation store

* chore: update webui build output

* webui: add comments to some constants

* chore: update webui build output
2026-03-16 11:18:13 +01:00
Ruben Ortlam 46dba9fce8
vulkan: fix flash attention dot product precision (#20589) 2026-03-16 10:45:49 +01:00
Sigbjørn Skjæret de8f01c2d7
model : wire up Nemotron-H tensors for NVFP4 support (#20561)
* wire up Nemotron-H tensors for NVFP4 support

* add ssm tensors

* alignment
2026-03-16 09:19:16 +01:00
Richard Davison 079e5a45f0
convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization (#20539)
* support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization

* cleanup

* fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-16 09:18:47 +01:00
Masato Nakasaka d3936498a3
common : fix iterator::end() dereference (#20445) 2026-03-16 08:50:38 +02:00
Aman Gupta 34818ea6c0
CUDA: GDN hide memory latency (#20537) 2026-03-16 11:41:45 +08:00
Tim Burke 8036edc99a ggml: eliminate hot-path heap allocations and fix tiled MXFP multihead dequant
Replace per-row/per-tile std::vector heap allocations with stack buffers
in set_rows, one_chunk, and tiled flash attention paths. Fix tiled path
to use per-head SoA extraction (matching one_chunk) instead of dequanting
the full multihead region per token.
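
The allocation pattern being removed looks roughly like this before/after sketch (names and the size bound are hypothetical, not the actual ggml code):

    #include <vector>

    // Before: a heap allocation on every row of the hot loop.
    static void rows_heap(float * dst, const float * src, int n_rows, int row_size) {
        for (int r = 0; r < n_rows; r++) {
            std::vector<float> tmp(row_size); // allocator traffic per row
            for (int i = 0; i < row_size; i++) tmp[i] = src[r * row_size + i];
            for (int i = 0; i < row_size; i++) dst[r * row_size + i] = tmp[i];
        }
    }

    // After: one stack buffer reused across rows (assumes a known upper
    // bound on row_size; real code must assert or guard this).
    static void rows_stack(float * dst, const float * src, int n_rows, int row_size) {
        float tmp[256];
        for (int r = 0; r < n_rows; r++) {
            for (int i = 0; i < row_size; i++) tmp[i] = src[r * row_size + i];
            for (int i = 0; i < row_size; i++) dst[r * row_size + i] = tmp[i];
        }
    }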
2026-03-15 22:55:34 -04:00
Tim Burke b8e8d291d1 ggml: refactor x86 AVX2 and ARM NEON MXFP dequant — shared traits and helpers
Add mxfp_dequant_traits_t to ggml-common.h as single source of truth for
MXFP IEEE-754 reconstruction parameters. Define static const instances for
all 4 formats (E4M3, E5M2, E2M3, E3M2), ready for CUDA/Metal/Vulkan reuse.

Extract shared dequant and FP6 unpack helpers on both architectures,
replacing duplicated inline code and macros. Net -215 lines.
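
A traits struct of this kind could take the following shape. The field names are assumptions, not the actual ggml-common.h definition; the bias values follow the OCP FP8/FP6 element formats:

    // Hypothetical sketch: per-format parameters for rebuilding an
    // IEEE-754 f32 from a packed FP8/FP6 element.
    struct mxfp_dequant_traits_t {
        int exp_bits;  // exponent field width (e.g. 4 for E4M3)
        int mant_bits; // mantissa field width (e.g. 3 for E4M3)
        int exp_bias;  // exponent bias of the element format
    };

    static const mxfp_dequant_traits_t MXFP_E4M3 = { 4, 3,  7 };
    static const mxfp_dequant_traits_t MXFP_E5M2 = { 5, 2, 15 };
    static const mxfp_dequant_traits_t MXFP_E2M3 = { 2, 3,  1 };
    static const mxfp_dequant_traits_t MXFP_E3M2 = { 3, 2,  3 };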
2026-03-15 21:37:02 -04:00
Tim Burke c913ab36d2 fix buffer overflows for large DK and multi-head MXFP flash attention
- Increase q_mxfp_buf from 512 to 2048 bytes (supports DK up to 1024 with MXFP8)
- Replace fixed k_soa[4096]/v_soa[4096] stack arrays with dynamically sized vectors
- Replace fixed k_head_soa[320]/v_head_soa[320] with dynamically sized vectors
- Add soa_bytes divisibility assertion in test init
2026-03-15 20:30:12 -04:00
Tim Burke f603c036ec Comment consistency pass and cleanup. 2026-03-15 20:14:52 -04:00
Tim Burke c2f2ff7814 ggml: optimize CPU MXFP flash attention hot loop
- Per-head dequant: multihead MXFP now extracts only the needed head's
  SoA blocks (e.g. 20 bytes for mxfp4 DK=128) into a stack buffer and
  dequants DK elements, instead of dequanting all heads (nek2*DK).
  For 8 KV heads this is 8x less dequant work per KV position.

- Hoist loop invariants (sketched below): base pointer offsets (k_base, v_base),
  per-head SoA byte offsets, and multihead row bases are computed once
  per query row instead of per KV position in the inner loop.

- Precompute SoA addressing in mxfp_fa_params_init: qs_per_block,
  blocks_per_head, head_qs_bytes, and head_e8m0_offset are calculated
  once at init rather than derived per iteration.

- Move thread-local buffer pointers (VKQ32, V32, VKQ16, Q_q) and
  v_is_f16 check outside the ir loop.
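
A generic sketch of the hoisting pattern, with all names hypothetical: addressing that depends only on the query row, here the per-head SoA byte offset, is computed once per row instead of being rederived at every KV position.

    #include <cstddef>

    static void fa_rows(const unsigned char * k_data, size_t row_stride,
                        size_t head_qs_bytes, int n_heads,
                        int n_q_rows, int n_kv) {
        for (int iq = 0; iq < n_q_rows; iq++) {
            // hoisted: invariant across the inner KV loop
            const size_t head_off = (size_t)(iq % n_heads) * head_qs_bytes;
            const unsigned char * k_base = k_data + head_off;
            for (int ikv = 0; ikv < n_kv; ikv++) {
                const unsigned char * k_row = k_base + (size_t) ikv * row_stride;
                (void) k_row; // ... dequant one head's blocks, dot with query iq ...
            }
        }
    }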
2026-03-15 19:49:27 -04:00
Tim Burke a51ff77fae ggml: address PR review — fix buffer overflows, add assertions, normalize MXFP6 naming
Fix potential buffer overflows flagged in PR #20609 review:
- set_rows: replace fixed float tmp[1024] with std::vector for large n_embd_k_gqa
- tiled FA: size q_mxfp_buf with ggml_row_size guard instead of fixed 1024
- one_chunk FA: pre-allocate k/v dequant buffers from mxfp.{k,v}_soa_elems
  instead of hard-coded float[4096] stack arrays
- kv-cache: assert n_embd_k_gqa % qk == 0 before integer division
- test init: assert soa_bytes % block_size == 0

Normalize MXFP6 function naming to match MXFP8 convention (short form
without element format suffix): mxfp6_e2m3 → mxfp6 in all function
identifiers across 14 files. Format-specific items (type enums, traits,
lookup tables, constants) retain their _e2m3 suffix.
2026-03-15 18:57:50 -04:00
Tim Burke 5c3a9523ef Merge remote-tracking branch 'origin/mxfp-flash-attention' into mxfp-flash-attention 2026-03-15 18:00:11 -04:00
Piotr Wilkin (ilintar) 9e2e2198b0
tools/cli: fix disable reasoning (#20606) 2026-03-15 22:40:53 +01:00
Tim Burke d8c9f9c7f6 ggml: MXFP flash attention with SoA layout (CPU scalar reference)
Add MXFP KV cache quantization for flash attention using Struct-of-Arrays
(SoA) memory layout exclusively. Three MX types: MXFP4 (E2M1), MXFP8
(E4M3), MXFP6 (E2M3), implementing the OCP Microscaling v1.0 spec.

SoA layout stores [qs contiguous][e8m0 contiguous] per row, enabling
aligned memory access patterns for GPU backends. All functions in the
flash attention pipeline — set_rows quantization, Q preprocessing, K/V
dequantization — use SoA end-to-end. The existing AoS block layout
remains for MUL_MAT weight quantization (untouched).
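
Addressing into such a row can be sketched as follows (a minimal illustration of the [qs][e8m0] split described above; struct and byte counts are assumptions, not the actual macros):

    #include <cstddef>
    #include <cstdint>

    // One SoA row: all quantized mantissa bytes first, then all e8m0
    // scales. Block size 32 matches the OCP MX spec.
    struct soa_row {
        const uint8_t * base;      // start of the row
        size_t qs_bytes_per_block; // e.g. 16 for MXFP4 (32 x 4-bit)
        size_t n_blocks;           // blocks of 32 elements in the row
    };

    static const uint8_t * soa_qs(const soa_row & r, size_t block) {
        return r.base + block * r.qs_bytes_per_block;
    }

    static const uint8_t * soa_e8m0(const soa_row & r, size_t block) {
        // scales live after all qs bytes
        return r.base + r.n_blocks * r.qs_bytes_per_block + block;
    }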

Q preprocessing applies Walsh-Hadamard rotation (block-32) before
quantize/dequant round-trip, distributing outlier energy across the
shared exponent group. This is essential for perplexity:
  MXFP8: +0.22 PPL without rotation
  MXFP6: +3.34 PPL without rotation
Hadamard is skipped for MLA models (DK != DV) where V is a view of K.
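
For reference, an in-place fast Walsh-Hadamard transform over a block of 32 floats looks like the sketch below, normalized by 1/sqrt(32) so the rotation is orthonormal. This is the standard butterfly formulation, not necessarily the actual ggml kernel, which may fuse normalization or reorder differently:

    #include <cmath>

    static void hadamard32(float * x) {
        for (int len = 1; len < 32; len <<= 1) {
            for (int i = 0; i < 32; i += len << 1) {
                for (int j = i; j < i + len; j++) {
                    const float a = x[j];
                    const float b = x[j + len];
                    x[j]       = a + b;
                    x[j + len] = a - b;
                }
            }
        }
        const float scale = 1.0f / std::sqrt(32.0f);
        for (int i = 0; i < 32; i++) {
            x[i] *= scale;
        }
    }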

Shared infrastructure in ggml-common.h:
- Block structures (block_mxfp8: 33B, block_mxfp6: 25B per 32 elements)
- E8M0 MSE-optimal scale search with ±1 range
- Canonical element converters (FP8 E4M3/E5M2, FP6 E2M3/E3M2)
- FP6 tight packing (4 six-bit values in 3 bytes, 25% savings; sketched below)
- IEEE-754 bit reconstruction constants for SIMD backends
- SoA layout macros, portable bit cast, type property queries
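
The FP6 packing arithmetic referenced above fits four 6-bit values into 24 bits, i.e. three bytes instead of four, hence the 25% saving. A sketch, with the bit order an assumption that may differ from the actual ggml layout:

    #include <cstdint>

    static void pack_fp6x4(const uint8_t v[4], uint8_t out[3]) {
        const uint32_t bits = (uint32_t)(v[0] & 0x3F)
                            | (uint32_t)(v[1] & 0x3F) << 6
                            | (uint32_t)(v[2] & 0x3F) << 12
                            | (uint32_t)(v[3] & 0x3F) << 18;
        out[0] = (uint8_t)(bits);
        out[1] = (uint8_t)(bits >> 8);
        out[2] = (uint8_t)(bits >> 16);
    }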

CPU implementation:
- Scalar reference + ARM NEON + x86 AVX2 optimized paths
- Both FA paths supported: one_chunk (scalar) and tiled (SIMD GEMM)
- Split-KV path extended for single-query decode
- Generic vec_dot via dequant-to-float for MUL_MAT compatibility
- Arch fallbacks for loongarch, powerpc, riscv, s390, wasm

KV cache integration:
- set_rows writes SoA with optional Hadamard (op_params[0] flag)
- K cache block-aligned to 16 for CUDA cp.async compatibility
- CLI: --cache-type-k/v with short aliases (mxfp4, mxfp6, mxfp8)

Tests:
- Flash attention: all 3 types at D=64/128, mixed K/V (mxfp8+mxfp4)
- SET_ROWS: Hadamard rotation for all types
- SoA-aware test initialization and comparison for MXFP tensors
- Quantize functions coverage for all types

Rename GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends
(CPU, OpenCL, SYCL) for consistency with the MX type family naming.
2026-03-15 17:33:19 -04:00