* cleanup: consolidate MXFP type aliases, fix SoA linker bug on 5 platforms
- Add GGML_TYPE_MXFP8 and GGML_TYPE_MXFP6 short aliases (matching
existing GGML_TYPE_MXFP4 pattern) and use short names consistently
throughout the codebase instead of mixing long/short forms.
- Fix missing SoA dequant symbols (dequantize_row_mxfp{4,8,6}_soa_cpu)
on loongarch, powerpc, riscv, s390, and wasm by adding proper aliases
to each arch section in arch-fallback.h. Previously these were only
defined under GGML_CPU_GENERIC, causing linker failures on those
platforms when using MXFP flash attention.
- Remove 10 files from the PR diff:
- 5 arch stub files replaced by arch-fallback.h aliases
- 5 rename-only files (sycl, opencl, repack, llama-quant) reverted
since the GGML_TYPE_MXFP4 compat alias handles them
* cleanup: DRY FP6 unpack, extract mxfp_kv_params + mxfp_dequant_head helper
- FP6 unpack: x86 and ARM SIMD versions now call ggml_mxfp_unpack_fp6x4()
from ggml-common.h instead of duplicating the scalar bit manipulation.
- Extract mxfp_kv_params sub-struct from mxfp_fa_params: the 7 symmetric
K/V fields (dequantize, multihead, soa_elems, qs_per_block,
head_qs_bytes, head_e8m0_offset, blocks_per_head) are now in a reusable
struct accessed as mxfp.k and mxfp.v.
- Add mxfp_dequant_head() helper: replaces 4 instances of the multihead
SoA extraction pattern (2x memcpy + dequant, with multihead/single-head
branching) with a single function call. Future backends get the pattern
for free.
* cleanup: extract mxfp_kv_params_init to DRY the K/V init blocks
The K and V initialization in mxfp_fa_params_init were structurally
identical 10-line blocks differing only by tensor/dimension. Extract
into mxfp_kv_params_init(type, D, nb2, ne2) so future MXFP formats
get the multihead SoA addressing logic automatically.
* cleanup: generic MSE round-trip, replace magic buffer sizes with constants
- Remove mse_error_fp8_e4m3 and mse_error_fp6_e2m3: these were identical
round-trip functions differing only by converter. mxfp_compute_e8m0_mse
now uses to_elem/to_float directly when mse_error is NULL (FP8/FP6).
MXFP4 keeps its custom decision-tree MSE. New formats get MSE for free
by just setting to_elem/to_float in their traits.
- Replace magic 1024/1088 buffer sizes in flash attention with named
constants MXFP_FA_MAX_D and MXFP_FA_SOA_BUF. One place to change if
max head dimension grows.
* cleanup: remove dead AoS vec_dot for MXFP8/MXFP6, unify SoA impls
MXFP8 and MXFP6 are KV-cache-only types that use SoA layout for flash
attention. The AoS vec_dot functions (scalar generic, AVX2, NEON) were
dead code — no matmul path uses them.
Removed:
- ggml_vec_dot_mxfp{8,6}_q8_0 from scalar, x86, ARM, quants.h
- ggml_vec_dot_mxfp_q8_0_impl shared helper
- arch-fallback.h aliases for vec_dot mxfp8/mxfp6 (12 lines)
- vec_dot/vec_dot_type registration in ggml-cpu.c
Also unified SoA quantize/dequant: the separate mxfp8_soa_impl and
mxfp6_soa_impl functions (4 functions, ~80 lines) are replaced by two
generic functions (quantize_row_mxfp_soa_impl, dequantize_row_mxfp_soa_impl)
that use traits->bits_per_elem and traits->qs_per_block to handle both
byte-aligned (FP8) and 6-bit packed (FP6) formats. New MXFP formats
get SoA for free by setting these trait fields.
* cleanup: remove all AoS MXFP8/MXFP6 quantize/dequant — SoA only
MXFP8 and MXFP6 are KV-cache-only types. All quantization and
dequantization goes through the SoA (Struct-of-Arrays) path for flash
attention. The AoS (block_mxfp8/block_mxfp6 struct) implementations
were dead code that should never have been added.
Removed:
- quantize_row_mxfp{8,6}_impl, dequantize_row_mxfp{8,6}_impl
- quantize_row_mxfp{8,6}_ref, dequantize_row_mxfp{8,6}
- quantize_mxfp{8,6} (ggml_quantize_chunk wrappers)
- All declarations from ggml-quants.h and quants.h
- to_float/from_float_ref registrations from ggml.c type traits
- from_float registration from ggml-cpu.c CPU traits
Block struct definitions (block_mxfp8, block_mxfp6) are retained for
sizeof() in type traits and validate_row_data.
* cleanup: fail fast in ggml_quantize_chunk for KV-cache-only types
Add explicit GGML_ABORT for MXFP8/MXFP6 in ggml_quantize_chunk —
these are KV-cache-only types that use SoA layout via from_float_soa.
Attempting AoS quantization through this entry point is a bug.
* misc : prefer ggml-org models in docs and examples
Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.
* remove accidentally committed file
Address GHSA-q9j6-4hhc-rq9p and GHSA-2q4c-9gq5-5vfp.
The three-iterator overload of std::equal in value_array_t::equivalent()
and value_object_t::equivalent() reads past the end of the shorter
container when comparing arrays or objects of different lengths.
Use the four-iterator overload (C++14) which checks both range lengths.
Found-by: Pwno
* chat : fix out_of_range crash in throw path (#20424 regression)
#20424 introduced effective_input = generation_prompt + input, but the
throw path uses input.substr(result.end) where result.end is a position
within effective_input. Every thinking model with a non-empty
generation_prompt crashes with std::out_of_range instead of the intended
error message.
Test crashes on unpatched master, passes with fix:
cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
cmake --build build --target test-chat
./build/bin/test-chat
* Update test-chat.cpp
* Update test-chat.cpp
* Update test-chat.cpp
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
Add MXFP KV cache quantization for flash attention using Struct-of-Arrays
(SoA) memory layout exclusively. Three MX types: MXFP4 (E2M1), MXFP8
(E4M3), MXFP6 (E2M3), implementing the OCP Microscaling v1.0 spec.
SoA layout stores [qs contiguous][e8m0 contiguous] per row, enabling
aligned memory access patterns for GPU backends. All functions in the
flash attention pipeline — set_rows quantization, Q preprocessing, K/V
dequantization — use SoA end-to-end. The existing AoS block layout
remains for MUL_MAT weight quantization (untouched).
Q preprocessing applies Walsh-Hadamard rotation (block-32) before
quantize/dequant round-trip, distributing outlier energy across the
shared exponent group. This is essential for perplexity:
MXFP8: +0.22 PPL without rotation
MXFP6: +3.34 PPL without rotation
Hadamard is skipped for MLA models (DK != DV) where V is a view of K.
Shared infrastructure in ggml-common.h:
- Block structures (block_mxfp8: 33B, block_mxfp6: 25B per 32 elements)
- E8M0 MSE-optimal scale search with ±1 range
- Canonical element converters (FP8 E4M3/E5M2, FP6 E2M3/E3M2)
- FP6 tight packing (4 six-bit values in 3 bytes, 25% savings)
- IEEE-754 bit reconstruction constants for SIMD backends
- SoA layout macros, portable bit cast, type property queries
CPU implementation:
- Scalar reference + ARM NEON + x86 AVX2 optimized paths
- Both FA paths supported: one_chunk (scalar) and tiled (SIMD GEMM)
- Split-KV path extended for single-query decode
- Generic vec_dot via dequant-to-float for MUL_MAT compatibility
- Arch fallbacks for loongarch, powerpc, riscv, s390, wasm
KV cache integration:
- set_rows writes SoA with optional Hadamard (op_params[0] flag)
- K cache block-aligned to 16 for CUDA cp.async compatibility
- CLI: --cache-type-k/v with short aliases (mxfp4, mxfp6, mxfp8)
Tests:
- Flash attention: all 3 types at D=64/128, mixed K/V (mxfp8+mxfp4)
- SET_ROWS: Hadamard rotation for all types
- SoA-aware test initialization and comparison for MXFP tensors
- Quantize functions coverage for all types
Rename GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends
(CPU, OpenCL, SYCL) for consistency with the MX type family naming.
* tests: allow loading test-backend-ops tests from json
* add error threshold based on op
* add error when file cannot be read
* add graph operator json extraction tool
* add nb parameter for non-contiguous input tensors
* fix view check
* only use view if non-contiguous/permuted, use C++ random instead of rand()
* replace internal API calls with public llama_graph_reserve call
* reduce test description length
* fix nb[0] not getting set for view
* add name to tests
* fix inplace error
* use text file instead of json
* move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/
* fix missing declaration
* use pragma once
* fix indent
* fix Windows build
* Parse port numbers from MCP server URLs
* Pass scheme to http proxy for determining whether to use SSL
* Fix download on non-standard port and re-add port to logging
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* common : handle incomplete UTF-8 at end of input in PEG parser
* cont : if reached end prematurely, emit needs_more_input to propagate partial output
* cont: refactor peg parse context to add lenient flag
* cont : remove partial flag, keep lenient flag
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
* Allow reshuffled arguments in tagged argument parser format tool calls.
* Remove shuffle just keep the optional parsers in any order
* Remove unnecessary import
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags
* llama : remove write/read of output ids/logits/embeddings
This commit removes the write/read of output ids, logits and
embeddings from the llama context state.
Refs: https://github.com/ggml-org/llama.cpp/pull/18862#issuecomment-3756330941
* completion : add replying of session state
This commit updates the session handing in the completion tool to handle
the that logits are no longer stored in the session file. Instead, we
need to replay the last token to get the logits for sampling.
* common : add common_prompt_batch_decode function
This commit adds a new function which is responsible for decoding prompt
and optionally handle the saving for session data.
* update save-state.cpp to use llama_state_load_file
This commit updates the save-load-state example to utilize the new
llama_state_load_file function for loading the model state from a file.
And it also replays the last token after loading since this state is now
stored before the last token is processed.
* examples : set n_seq_max = 2 for ctx3
This commit updates the save-load-state example to set the n_seq_max
parameter to 2 when initializing the ctx3 context.
The motivation for this change is that using 1 as n_parallel/n_seq_max
the context only supports one sequence, but the test laster tries to
use a second sequence which results in the following error:
```console
main : loaded state with 4 tokens
main : seq 0 copied, 225760 bytes
main : kv cache cleared
find_slot: seq_id=1 >= n_seq_max=1 Try using a bigger --parallel value
state_read_meta: failed to find available cells in kv cache
```
This seems to only happen for recurrent/hybrid models.