* cleanup: consolidate MXFP type aliases, fix SoA linker bug on 5 platforms
- Add GGML_TYPE_MXFP8 and GGML_TYPE_MXFP6 short aliases (matching
existing GGML_TYPE_MXFP4 pattern) and use short names consistently
throughout the codebase instead of mixing long/short forms.
- Fix missing SoA dequant symbols (dequantize_row_mxfp{4,8,6}_soa_cpu)
  on loongarch, powerpc, riscv, s390, and wasm by adding proper aliases
  to each arch section in arch-fallback.h (see the sketch after this
  list). Previously these were only defined under GGML_CPU_GENERIC,
  causing linker failures on those platforms when using MXFP flash
  attention.
- Remove 10 files from the PR diff:
- 5 arch stub files replaced by arch-fallback.h aliases
- 5 rename-only files (sycl, opencl, repack, llama-quant) reverted
since the GGML_TYPE_MXFP4 compat alias handles them
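A sketch of the per-arch aliases (riscv shown; the same lines go in the
loongarch, powerpc, s390, and wasm sections). The symbol names are from
this description; the direction of the macro mapping is an assumption:

```cpp
// arch-fallback.h, inside the arch-specific section:
// route the arch entry points to the generic implementations so the
// SoA dequant symbols always resolve at link time
#define dequantize_row_mxfp4_soa_cpu dequantize_row_mxfp4_soa_generic
#define dequantize_row_mxfp8_soa_cpu dequantize_row_mxfp8_soa_generic
#define dequantize_row_mxfp6_soa_cpu dequantize_row_mxfp6_soa_generic
```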
* cleanup: DRY FP6 unpack, extract mxfp_kv_params + mxfp_dequant_head helper
- FP6 unpack: x86 and ARM SIMD versions now call ggml_mxfp_unpack_fp6x4()
from ggml-common.h instead of duplicating the scalar bit manipulation.
- Extract mxfp_kv_params sub-struct from mxfp_fa_params: the 7 symmetric
  K/V fields (dequantize, multihead, soa_elems, qs_per_block,
  head_qs_bytes, head_e8m0_offset, blocks_per_head) are now in a reusable
  struct accessed as mxfp.k and mxfp.v (struct and helper are both
  sketched after this list).
- Add mxfp_dequant_head() helper: replaces 4 instances of the multihead
SoA extraction pattern (2x memcpy + dequant, with multihead/single-head
branching) with a single function call. Future backends get the pattern
for free.
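A hedged sketch of the extracted struct and helper; only the names come
from this description, the field types and the helper signature are
assumptions:

```cpp
#include <cstddef>
#include <cstdint>

// the reusable K/V sub-struct with the 7 symmetric fields
struct mxfp_kv_params {
    void (*dequantize)(const uint8_t * qs, const uint8_t * e8m0,
                       float * y, int64_t n);  // SoA row dequant
    bool    multihead;         // multiple heads share one SoA row
    int64_t soa_elems;         // elements per SoA row
    int32_t qs_per_block;      // packed quant bytes per 32-elem block
    size_t  head_qs_bytes;     // quant bytes per head
    size_t  head_e8m0_offset;  // where a head's e8m0 scales start
    int64_t blocks_per_head;   // 32-elem blocks per head
};

struct mxfp_fa_params {
    mxfp_kv_params k;  // accessed as mxfp.k
    mxfp_kv_params v;  // accessed as mxfp.v
};

// one call replacing the 4 copies of the multihead SoA extraction
// (2x memcpy + dequant with multihead/single-head branching)
void mxfp_dequant_head(const mxfp_kv_params & p, const uint8_t * row,
                       int32_t head, float * dst);
```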
* cleanup: extract mxfp_kv_params_init to DRY the K/V init blocks
The K and V initialization in mxfp_fa_params_init were structurally
identical 10-line blocks differing only by tensor/dimension. Extract
into mxfp_kv_params_init(type, D, nb2, ne2) so future MXFP formats
get the multihead SoA addressing logic automatically.
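With the helper in place, the call site reduces to two symmetric lines,
roughly (a sketch; the exact argument sources are assumptions):

```cpp
// K and V differ only by tensor and head dimension
params.k = mxfp_kv_params_init(K->type, DK, K->nb[2], K->ne[2]);
params.v = mxfp_kv_params_init(V->type, DV, V->nb[2], V->ne[2]);
```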
* cleanup: generic MSE round-trip, replace magic buffer sizes with constants
- Remove mse_error_fp8_e4m3 and mse_error_fp6_e2m3: these were identical
round-trip functions differing only by converter. mxfp_compute_e8m0_mse
now uses to_elem/to_float directly when mse_error is NULL (FP8/FP6).
MXFP4 keeps its custom decision-tree MSE. New formats get MSE for free
by just setting to_elem/to_float in their traits.
- Replace magic 1024/1088 buffer sizes in flash attention with named
constants MXFP_FA_MAX_D and MXFP_FA_SOA_BUF. One place to change if
max head dimension grows.
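A sketch of the constants; the values are the old magic numbers, the
exact breakdown of the second one is an assumption:

```cpp
#define MXFP_FA_MAX_D   1024  // max head dimension handled by the FA buffers
#define MXFP_FA_SOA_BUF 1088  // MXFP_FA_MAX_D plus room for SoA e8m0 scales
```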
* cleanup: remove dead AoS vec_dot for MXFP8/MXFP6, unify SoA impls
MXFP8 and MXFP6 are KV-cache-only types that use SoA layout for flash
attention. The AoS vec_dot functions (scalar generic, AVX2, NEON) were
dead code — no matmul path uses them.
Removed:
- ggml_vec_dot_mxfp{8,6}_q8_0 from scalar, x86, ARM, quants.h
- ggml_vec_dot_mxfp_q8_0_impl shared helper
- arch-fallback.h aliases for vec_dot mxfp8/mxfp6 (12 lines)
- vec_dot/vec_dot_type registration in ggml-cpu.c
Also unified SoA quantize/dequant: the separate mxfp8_soa_impl and
mxfp6_soa_impl functions (4 functions, ~80 lines) are replaced by two
generic functions (quantize_row_mxfp_soa_impl, dequantize_row_mxfp_soa_impl)
that use traits->bits_per_elem and traits->qs_per_block to handle both
byte-aligned (FP8) and 6-bit packed (FP6) formats. New MXFP formats
get SoA for free by setting these trait fields.
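A hedged sketch of what the trait-driven SoA dequant can look like; all
names and types here are illustrative, not the actual ggml API, and
little-endian bit packing is assumed:

```cpp
#include <cstdint>
#include <cmath>

struct mxfp_traits_sketch {
    int bits_per_elem;                // 8 for FP8, 6 for FP6
    int qs_per_block;                 // bytes per 32-elem block: 32 or 24
    float (*to_float)(uint8_t elem);  // canonical element converter
};

// e8m0 is a pure power-of-two scale: 2^(e - 127), no sign, no mantissa
static float e8m0_to_fp32_sketch(uint8_t e) {
    return ldexpf(1.0f, (int) e - 127);
}

// one body covers byte-aligned FP8 and 6-bit packed FP6 by extracting
// elements at arbitrary bit offsets
static void dequantize_row_mxfp_soa_sketch(const mxfp_traits_sketch * tr,
                                           const uint8_t * qs,
                                           const uint8_t * e8m0,
                                           float * y, int64_t k) {
    for (int64_t b = 0; b < k/32; ++b) {
        const float d = e8m0_to_fp32_sketch(e8m0[b]);
        const uint8_t * q = qs + b*tr->qs_per_block;
        for (int j = 0; j < 32; ++j) {
            const int bit = j*tr->bits_per_elem;
            uint32_t v = q[bit >> 3];
            if ((bit & 7) + tr->bits_per_elem > 8) {
                v |= (uint32_t) q[(bit >> 3) + 1] << 8;  // value straddles a byte
            }
            v = (v >> (bit & 7)) & ((1u << tr->bits_per_elem) - 1);
            y[b*32 + j] = d*tr->to_float((uint8_t) v);
        }
    }
}
```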
* cleanup: remove all AoS MXFP8/MXFP6 quantize/dequant — SoA only
MXFP8 and MXFP6 are KV-cache-only types. All quantization and
dequantization goes through the SoA (Struct-of-Arrays) path for flash
attention. The AoS (block_mxfp8/block_mxfp6 struct) implementations
were dead code that should never have been added.
Removed:
- quantize_row_mxfp{8,6}_impl, dequantize_row_mxfp{8,6}_impl
- quantize_row_mxfp{8,6}_ref, dequantize_row_mxfp{8,6}
- quantize_mxfp{8,6} (ggml_quantize_chunk wrappers)
- All declarations from ggml-quants.h and quants.h
- to_float/from_float_ref registrations from ggml.c type traits
- from_float registration from ggml-cpu.c CPU traits
Block struct definitions (block_mxfp8, block_mxfp6) are retained for
sizeof() in type traits and validate_row_data.
* cleanup: fail fast in ggml_quantize_chunk for KV-cache-only types
Add explicit GGML_ABORT for MXFP8/MXFP6 in ggml_quantize_chunk —
these are KV-cache-only types that use SoA layout via from_float_soa.
Attempting AoS quantization through this entry point is a bug.
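Roughly (a fragment sketch; the exact message is an assumption):

```cpp
// in ggml_quantize_chunk()
switch (type) {
    // ... existing cases ...
    case GGML_TYPE_MXFP8:
    case GGML_TYPE_MXFP6:
        // KV-cache-only SoA types: reaching this AoS entry point is a bug
        GGML_ABORT("MXFP8/MXFP6 quantize via from_float_soa, not ggml_quantize_chunk");
    // ...
}
```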
* misc : prefer ggml-org models in docs and examples
Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.
* remove accidentally committed file
* server: (doc) clarify in-scope and out-scope features
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Two bugs in `server_models::load()` that affect router mode reliability:
**Bug 1: Deadlock when child process crashes**
When a child process is killed (e.g., SIGKILL from OS code signature
validation), the monitoring thread deadlocks on `stopping_thread.join()`
because the stopping_thread's wait predicate (`is_stopping`) is never
satisfied — the model name was never inserted into `stopping_models`.
`update_status()` is never reached and the model stays stuck in LOADING
state permanently.
Fix: extend the stopping_thread's wait predicate to also wake when the
child process is no longer alive (`!subprocess_alive()`). When woken by
a dead child, the thread skips the shutdown sequence and returns
immediately. The original `stopping_models.erase()` logic is preserved
for normal unloads.
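A hedged sketch of the predicate change; `is_stopping`, `subprocess_alive`,
and `stopping_models` are from the description above, the surrounding
variable names are assumptions:

```cpp
// stopping_thread: wake on a normal stop request OR a dead child,
// instead of blocking forever on is_stopping alone
std::unique_lock<std::mutex> lock(mutex);
cv.wait(lock, [&] {
    return is_stopping(model_name) || !subprocess_alive();
});
if (!subprocess_alive()) {
    return;  // child crashed: skip the shutdown sequence
}
stopping_models.erase(model_name);  // original normal-unload path preserved
// ... proceed with the graceful shutdown sequence ...
```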
**Bug 2: TOCTOU race bypasses `--models-max` (ref #20137)**
`unload_lru()` is called outside the mutex, then `load()` acquires the
lock afterward. Under concurrent requests, multiple threads observe
capacity and all proceed to load, exceeding the limit.
Fix: re-check capacity under the lock after `unload_lru()` returns.
If another thread filled the slot in the window between `unload_lru()`
and the lock acquisition, reject with an error instead of silently
exceeding the limit.
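A hedged sketch of the re-check; all names here are assumptions:

```cpp
// server_models::load(): unload_lru() still runs outside the mutex
unload_lru();
std::lock_guard<std::mutex> lk(mutex);
if (loaded_count() >= models_max) {
    // another thread took the freed slot between unload_lru() and the
    // lock acquisition: reject instead of exceeding --models-max
    return send_error("no model slot available, try again later");
}
// ... proceed with the load while still holding the lock ...
```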
* tests : fix fetch_server_test_models.py
* server: to_json_oaicompat cached_tokens
Adds OpenAI- and Anthropic-compatible information about the
number of cached prompt tokens used in a response.
* webui: make server the source of truth for sampling defaults
* webui: fix Custom badge for sampling parameters
* webui: log user overrides after server sync
* chore: update webui build output
* fix: Default values for sampling settings config object
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* add tests for model id parser
* add test case having activated params
* add structured tests for model id parser
* add ToDo
* feat: Improve model parsing logic + tests
* chore: update webui build output
---------
Co-authored-by: bluemoehre <bluemoehre@gmx.de>
* webui: fix model selector being locked to first loaded model
When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.
* chore: update webui build output
* webui: use date in exported filename
Move conversation naming and export to utils
update index.html.gz
* webui: move literals to message export constants file
* webui: move export naming and download back to the conversation store
* chore: update webui build output
* webui: add comments to some constants
* chore: update webui build output
Add MXFP KV cache quantization for flash attention using Struct-of-Arrays
(SoA) memory layout exclusively. Three MX types: MXFP4 (E2M1), MXFP8
(E4M3), MXFP6 (E2M3), implementing the OCP Microscaling v1.0 spec.
SoA layout stores [qs contiguous][e8m0 contiguous] per row, enabling
aligned memory access patterns for GPU backends. All functions in the
flash attention pipeline — set_rows quantization, Q preprocessing, K/V
dequantization — use SoA end-to-end. The existing AoS block layout
remains for MUL_MAT weight quantization (untouched).
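Per-row addressing under this layout looks roughly like the following
(a sketch; helper names are hypothetical):

```cpp
#include <cstdint>

// row of n elements (n % 32 == 0): all packed quants first, then all
// e8m0 block scales, as described above
static inline const uint8_t * soa_row_qs(const uint8_t * row, int64_t block,
                                         int qs_per_block) {
    return row + block*qs_per_block;          // quants contiguous up front
}

static inline uint8_t soa_row_e8m0(const uint8_t * row, int64_t n,
                                   int qs_per_block, int64_t block) {
    return row[(n/32)*qs_per_block + block];  // scales packed after all quants
}
```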
Q preprocessing applies Walsh-Hadamard rotation (block-32) before
quantize/dequant round-trip, distributing outlier energy across the
shared exponent group. This is essential for perplexity:
MXFP8: +0.22 PPL without rotation
MXFP6: +3.34 PPL without rotation
Hadamard is skipped for MLA models (DK != DV) where V is a view of K.
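For reference, the textbook in-place form of a block-32 Walsh-Hadamard
rotation (a sketch; the actual kernel and its normalization may differ):

```cpp
// butterfly WHT over 32 floats, orthonormal after the 1/sqrt(32) scale;
// spreads an outlier across all 32 elements sharing one e8m0 exponent
static void hadamard32_inplace(float * x) {
    for (int len = 1; len < 32; len <<= 1) {
        for (int i = 0; i < 32; i += len << 1) {
            for (int j = i; j < i + len; ++j) {
                const float a = x[j];
                const float b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
        }
    }
    for (int j = 0; j < 32; ++j) {
        x[j] *= 0.17677669529663687f;  // 1/sqrt(32)
    }
}
```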
Shared infrastructure in ggml-common.h:
- Block structures (block_mxfp8: 33B, block_mxfp6: 25B per 32 elements;
  sketched after this list)
- E8M0 MSE-optimal scale search with ±1 range
- Canonical element converters (FP8 E4M3/E5M2, FP6 E2M3/E3M2)
- FP6 tight packing (4 six-bit values in 3 bytes, 25% savings)
- IEEE-754 bit reconstruction constants for SIMD backends
- SoA layout macros, portable bit cast, type property queries
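The two block layouts implied by the sizes above (field order assumed
to follow the existing block_mxfp4 convention):

```cpp
#include <cstdint>

// 32 elements per block in both cases
typedef struct {
    uint8_t e;       // shared e8m0 block exponent
    uint8_t qs[32];  // 32 x FP8 E4M3, one byte each
} block_mxfp8;       // 1 + 32 = 33 bytes

typedef struct {
    uint8_t e;       // shared e8m0 block exponent
    uint8_t qs[24];  // 32 x FP6 E2M3, packed 4 values per 3 bytes
} block_mxfp6;       // 1 + 24 = 25 bytes
```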
CPU implementation:
- Scalar reference + ARM NEON + x86 AVX2 optimized paths
- Both FA paths supported: one_chunk (scalar) and tiled (SIMD GEMM)
- Split-KV path extended for single-query decode
- Generic vec_dot via dequant-to-float for MUL_MAT compatibility
- Arch fallbacks for loongarch, powerpc, riscv, s390, wasm
KV cache integration:
- set_rows writes SoA with optional Hadamard (op_params[0] flag)
- K cache block-aligned to 16 for CUDA cp.async compatibility
- CLI: --cache-type-k/v with short aliases (mxfp4, mxfp6, mxfp8)
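Example invocation (hypothetical model; mixed K/V as exercised by the
tests below):
llama-cli -m model.gguf --flash-attn on --cache-type-k mxfp8 --cache-type-v mxfp4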
Tests:
- Flash attention: all 3 types at D=64/128, mixed K/V (mxfp8+mxfp4)
- SET_ROWS: Hadamard rotation for all types
- SoA-aware test initialization and comparison for MXFP tensors
- Quantize functions coverage for all types
Rename GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends
(CPU, OpenCL, SYCL) for consistency with the MX type family naming.
llama-perplexity -hf unsloth/Qwen3-0.6B-GGUF:Q4_K_M -f winogrande-debiased-eval.csv --winogrande
winogrande_score : tokenizing selected tasks
winogrande_score : calculating winogrande score over selected tasks.
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 46
failed to decode the batch, n_batch = 2048, ret = 1
winogrande_score: llama_decode() failed
same for hellaswag:
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 99
failed to decode the batch, n_batch = 2048, ret = 1
hellaswag_score: llama_decode() failed
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* llama : fix pooling assertion crash in chunked GDN detection path
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).
Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.
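In effect (a sketch; trailing arguments elided, names taken from the
description):

```cpp
// chunked GDN detection in sched_reserve():
//   before: graph_reserve(16*n_seqs, n_seqs, /*n_outputs=*/n_seqs,    ...)
//   after:  graph_reserve(16*n_seqs, n_seqs, /*n_outputs=*/16*n_seqs, ...)
// n_outputs now equals n_tokens, matching the pp/tg worst-case calls
```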
Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.
* server : add mean pooling tests to embedding test suite
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.
These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
---------
Co-authored-by: Domenico Crupi <domenico@zerovolt.it>
* server: reset kill-switch on client error
If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is returned and no tokens are generated.
However, since no tokens are generated, update_slots() increments n_empty_consecutive; after 3 such requests in a row, the server terminates.
Reset the counter on client error so a misbehaving client cannot trigger the server kill switch.
* moved counter reset as per recommendation
* cont : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit renames the function `mtmd_get_audio_bitrate` to
`mtmd_get_audio_sample_rate` to better reflect its purpose.
The motivation for this is that the function currently returns the audio
sample rate, not the bitrate (sample_rate × bit_depth × channels), and
that is how it is used in the code as well.
This is a breaking change, but I believe mtmd is still in an
experimental/development phase, so it might be alright to simply rename.
* quantize : imatrix-fail early + code cleanup
* fix manual override printing
it's in the preliminary loop now, so needs to be on its own line
* revert header changes per ggerganov
* remove old #includes
* clarify naming
rename `tensor_quantization` to `tensor_typo_option` to describe its
functionality
* fix per barto
* Parse port numbers from MCP server URLs
* Pass scheme to http proxy for determining whether to use SSL
* Fix download on non-standard port and re-add port to logging
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>