* hex-dma: make chained dma the default to handle newer models
This also includes some new instrumentation that we can remove later.
* hexagon: add uint32 dump helper
* hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv
ssm-conv uses HVX gather instruction and that instruction cannot handle cases where the base+offset
spans page boundaries.
* hexagon: update ssm-conv to make base-addr compute a bit easier to read
* hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB)
* hex-bin: fix incorrect stride logic
* hexagon: make sure repack buffs are dumped for verbose > 2
* hex-bin: consistently use dma_queue_push even for dummy dst transactions
* hex-dma: start using 2d-wide mode on v75 and up
The removes the need to deal with the 16-bit limitaion for the strides.
* hex-bin: cleanup kernel selection logic
* hex-bin: cleanup binary op core and fix transposed tensor handling
* snapdragon: update run-bench to use larger ubatch and fa-on
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal:add conv_3d backend
Rebased with master and resolved conflicts.
* Resolved issues related to changes in variable names
* kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ACL graph capture disallows host-to-device memcpy and device memory
malloc/free on the captured stream. Pre-load the RoPE cache before
capture so that:
- Host-to-device copies and allocations run on the non-captured stream
- Cache metadata is populated and memory pool is warmed up
- During capture, only on-device computations are recorded; host-side
and allocation branches are skipped
* fix(openvino): explicit memset in buffer_context allocation
* minor
---------
Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix: correct tiled flash attention SoA pointer math for multihead MXFP
The cleanup refactoring (c919bc471) extracted mxfp_dequant_head as a
shared helper but failed to update the tiled path's data pointers.
The helper expects the full SoA row base (no per-head offset), but the
tiled path was passing a pointer that already included ik2*nbk2, causing
a double head offset that produced NaN during prefill.
Add mxfp_row_ptr helper to centralize the multihead-aware pointer
calculation across both one_chunk and tiled paths. Verified with 16-chunk
perplexity on gpt-oss-20b: all four configs (f16, mxfp4, mxfp6, mxfp8)
produce exact matches with the known-good commit (23e88631c).
* perf: reduce E8M0 MSE search range from ±2 to ±1
The base estimate round(log2(amax)) is always within 1 step of optimal.
Empirically verified across 30K blocks and 6 distributions: ±1 and ±2
never disagree. This reduces the scale search from 5 to 3 candidates
(40% fewer inner loop iterations) with zero quality impact.
* perf: eliminate redundant work in MXFP quantize and flash attention
- mse_error_mxfp4: use passed inv_scale instead of recomputing 1/d
- mxfp_compute_e8m0_mse: hoist loop-invariant traits branch out of inner loop
- tiled V path: dequant directly to V32 tile, remove intermediate memcpy and dead buffer
* cleanup: fix comments, unify Hadamard condition, simplify E8M0 helpers
- EMAX_OFFSET comments: fix ceil/floor labels to match actual values
- Hadamard flag: unify write path (llama-kv-cache.cpp) and read path
(ops.cpp) to both use DK==DV condition instead of is_mla()
- E8M0 helpers in ggml-impl.h: simplify to match ggml-common.h style,
add cross-reference comment
* fix: MXFP8/6 flash attention tests crash on init
The view base tensors for K/V don't get named "k"/"v" but inherit the
MXFP type. The name-based filter in initialize_tensors missed them,
falling through to init_tensor_uniform which calls quantize_chunk and
aborts for KV-cache-only types. Fix by checking ggml_is_type_mxfp() for
all tensors, matching the pattern set_rows tests already use.
* test: expand MXFP set_rows coverage
- Add MXFP8/MXFP6 to all_types for non-Hadamard set_rows coverage
- Expand Hadamard set_rows tests: add views, broadcast, and multi-head configs
- Coverage: 18 → 51 MXFP set_rows tests
* perf: add AVX2 Hadamard for x86 (matches existing ARM NEON path)
* cleanup: DRY MXFP4 quantize/dequant with shared per-block helpers
Extract quantize_block_mxfp4 and dequantize_block_mxfp4 as shared
helpers used by both AoS (quantize_row_mxfp4_ref, dequantize_row_mxfp4)
and SoA (quantize_row_mxfp4_soa, dequantize_row_mxfp4_soa) paths.
Eliminates duplicated per-block logic while keeping layout-specific
pointer arithmetic in the callers.
* feat: add MXFP8/MXFP6 AoS quantize/dequant (full type support)
Extract quantize_block_mxfp / dequantize_block_mxfp per-block helpers
from the SoA generic impl and use them to build AoS row functions for
MXFP8 (E4M3) and MXFP6 (E2M3). Register to_float and from_float_ref
in type traits, add quantize_chunk dispatch, replacing the GGML_ABORT.
MXFP8 and MXFP6 are no longer KV-cache-only — they can now be used
as general quantization types. The SoA impl is also DRY'd to delegate
to the same per-block helpers.
* cleanup: remove dead soa_elems field from mxfp_kv_params
Computed but never read — leftover from an earlier design.
* feat: add MXFP8/MXFP6 vec_dot and full CPU type support
Add scalar vec_dot_mxfp8_q8_0 and vec_dot_mxfp6_q8_0 implementations,
register from_float + vec_dot + vec_dot_type in CPU traits, and add
fallback remaps for all architectures. MXFP8/6 are now fully tested:
AoS quantization error, reference match, and dot product accuracy all
pass in test-quantize-fns.
* perf: remove E8M0 MSE search — base estimate is perplexity-optimal
The MSE search over ±1 candidates around round(log2(amax)) was found to
HURT perplexity by 4-37 PPL points across all MXFP configs on gpt-oss-20b.
The base estimate alone (no search) produces better attention patterns
because minimizing per-block reconstruction error is not the same as
minimizing attention score distortion through softmax.
Removes mse_error_mxfp4, mse_error field from traits, MSE_RANGE constant,
and the entire search loop. E8M0 computation is now a single amax scan +
integer bit extraction — no inner loop, no function pointers. This also
simplifies future GPU/Metal implementations.
* perf: fuse Hadamard rotation into SoA quantize (one pass, no temp buffer)
Add quantize_row_mxfp{4,8,6}_soa_hadamard that apply Hadamard and
quantize block-by-block with a 32-float stack buffer. Eliminates the
std::vector heap allocation and 2 extra memory passes over the full row.
set_rows now dispatches to the fused path when Hadamard is enabled,
falling through to the unfused quantize for non-Hadamard types.
This pattern maps directly to a CUDA kernel: global memory read →
register Hadamard → register quantize → global memory write.
* cleanup: consistent MXFP type names and variable naming
- Rename type_name "mxfp8_e4m3" → "mxfp8", "mxfp6_e2m3" → "mxfp6"
to match "mxfp4". Only one variant of each exists — the suffix was
unnecessary disambiguation that implied alternatives.
- Remove redundant MXFP shortcuts from arg.cpp (fallback loop handles
all types via ggml_type_name matching).
- Rename kv_is_f32_f16_or_mxfp → k_is_f32_f16_or_mxfp (only checks K).
* perf: fuse Q preprocessing round-trip (no SoA buffer needed)
Add mxfp{4,8,6}_hadamard_roundtrip and mxfp{4,8,6}_roundtrip functions
that apply quantization error to float values without materializing SoA
bytes. Replaces the 3-step Q preprocessing (Hadamard → quantize to SoA
buffer → dequant from SoA buffer) with a single pass through per-block
round-trip helpers.
Eliminates the Q_q intermediate buffer and two function pointer calls
from the flash attention hot path. Maps directly to CUDA: Q stays in
registers, Hadamard butterfly + quantize error applied in-place.
* fix: clamp E8M0 = 255 to 254 in decode (fixes CI NaN failures)
E8M0 = 255 means NaN per MX spec, but our encode path already clamps
to 254. When test data contains random E8M0 = 255 bytes, the decode
produces Inf, and Inf * 0.0 = NaN, causing GET_ROWS and CPY tests to
fail on MXFP6 (and potentially MXFP4/8).
Fix: clamp 255 → 254 in both E8M0 decode functions:
- ggml_e8m0_to_fp32 / ggml_e8m0_to_fp32_half (ggml-impl.h)
- ggml_mxfp_e8m0_to_fp32 / ggml_mxfp_e8m0_to_fp32_half (ggml-common.h)
These are unfortunately duplicated across two headers because
ggml-common.h compiles for CUDA (__device__) while ggml-impl.h serves
CPU-only callers that don't include ggml-common.h.
* ggml-cuda: native bf16 flash attention for vec and tile kernels
mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo
* ggml-cuda: address code owner review feedback
reverted tile kernel changes to avoid larger refactor
* fix ci failures on turing and hip
* fix bf16 vec kernel compile on hip v_dot2 platforms
* add comments
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Increase per-thread work if the K-dimension is small
With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MOEs. For example, Qwen3-30b-A3B has a K-dimension of 768, and Qwen3235B-A22B has k-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices.
This change increases the number of output elements per block for such cases.
* Limit this change to ncols_dst = 1
* tab to space
* cleanup: consolidate MXFP type aliases, fix SoA linker bug on 5 platforms
- Add GGML_TYPE_MXFP8 and GGML_TYPE_MXFP6 short aliases (matching
existing GGML_TYPE_MXFP4 pattern) and use short names consistently
throughout the codebase instead of mixing long/short forms.
- Fix missing SoA dequant symbols (dequantize_row_mxfp{4,8,6}_soa_cpu)
on loongarch, powerpc, riscv, s390, and wasm by adding proper aliases
to each arch section in arch-fallback.h. Previously these were only
defined under GGML_CPU_GENERIC, causing linker failures on those
platforms when using MXFP flash attention.
- Remove 10 files from the PR diff:
- 5 arch stub files replaced by arch-fallback.h aliases
- 5 rename-only files (sycl, opencl, repack, llama-quant) reverted
since the GGML_TYPE_MXFP4 compat alias handles them
* cleanup: DRY FP6 unpack, extract mxfp_kv_params + mxfp_dequant_head helper
- FP6 unpack: x86 and ARM SIMD versions now call ggml_mxfp_unpack_fp6x4()
from ggml-common.h instead of duplicating the scalar bit manipulation.
- Extract mxfp_kv_params sub-struct from mxfp_fa_params: the 7 symmetric
K/V fields (dequantize, multihead, soa_elems, qs_per_block,
head_qs_bytes, head_e8m0_offset, blocks_per_head) are now in a reusable
struct accessed as mxfp.k and mxfp.v.
- Add mxfp_dequant_head() helper: replaces 4 instances of the multihead
SoA extraction pattern (2x memcpy + dequant, with multihead/single-head
branching) with a single function call. Future backends get the pattern
for free.
* cleanup: extract mxfp_kv_params_init to DRY the K/V init blocks
The K and V initialization in mxfp_fa_params_init were structurally
identical 10-line blocks differing only by tensor/dimension. Extract
into mxfp_kv_params_init(type, D, nb2, ne2) so future MXFP formats
get the multihead SoA addressing logic automatically.
* cleanup: generic MSE round-trip, replace magic buffer sizes with constants
- Remove mse_error_fp8_e4m3 and mse_error_fp6_e2m3: these were identical
round-trip functions differing only by converter. mxfp_compute_e8m0_mse
now uses to_elem/to_float directly when mse_error is NULL (FP8/FP6).
MXFP4 keeps its custom decision-tree MSE. New formats get MSE for free
by just setting to_elem/to_float in their traits.
- Replace magic 1024/1088 buffer sizes in flash attention with named
constants MXFP_FA_MAX_D and MXFP_FA_SOA_BUF. One place to change if
max head dimension grows.
* cleanup: remove dead AoS vec_dot for MXFP8/MXFP6, unify SoA impls
MXFP8 and MXFP6 are KV-cache-only types that use SoA layout for flash
attention. The AoS vec_dot functions (scalar generic, AVX2, NEON) were
dead code — no matmul path uses them.
Removed:
- ggml_vec_dot_mxfp{8,6}_q8_0 from scalar, x86, ARM, quants.h
- ggml_vec_dot_mxfp_q8_0_impl shared helper
- arch-fallback.h aliases for vec_dot mxfp8/mxfp6 (12 lines)
- vec_dot/vec_dot_type registration in ggml-cpu.c
Also unified SoA quantize/dequant: the separate mxfp8_soa_impl and
mxfp6_soa_impl functions (4 functions, ~80 lines) are replaced by two
generic functions (quantize_row_mxfp_soa_impl, dequantize_row_mxfp_soa_impl)
that use traits->bits_per_elem and traits->qs_per_block to handle both
byte-aligned (FP8) and 6-bit packed (FP6) formats. New MXFP formats
get SoA for free by setting these trait fields.
* cleanup: remove all AoS MXFP8/MXFP6 quantize/dequant — SoA only
MXFP8 and MXFP6 are KV-cache-only types. All quantization and
dequantization goes through the SoA (Struct-of-Arrays) path for flash
attention. The AoS (block_mxfp8/block_mxfp6 struct) implementations
were dead code that should never have been added.
Removed:
- quantize_row_mxfp{8,6}_impl, dequantize_row_mxfp{8,6}_impl
- quantize_row_mxfp{8,6}_ref, dequantize_row_mxfp{8,6}
- quantize_mxfp{8,6} (ggml_quantize_chunk wrappers)
- All declarations from ggml-quants.h and quants.h
- to_float/from_float_ref registrations from ggml.c type traits
- from_float registration from ggml-cpu.c CPU traits
Block struct definitions (block_mxfp8, block_mxfp6) are retained for
sizeof() in type traits and validate_row_data.
* cleanup: fail fast in ggml_quantize_chunk for KV-cache-only types
Add explicit GGML_ABORT for MXFP8/MXFP6 in ggml_quantize_chunk —
these are KV-cache-only types that use SoA layout via from_float_soa.
Attempting AoS quantization through this entry point is a bug.
* cleanup : hoist mxfp soa functions
* fix: CI failures — CUDA __device__ init, Metal MXFP supports_op, SoA test assert
Three fixes for CI failures:
1. Remove <cmath> from CUDA/HIP/MUSA section of ggml-common.h — the include
causes NAN/INFINITY to become non-constexpr, breaking __device__ static
table initialization for the MXFP LUTs.
2. Add MXFP type guards to Metal's supports_op: MXFP8/MXFP6 have no Metal
shaders yet (reject all ops), MXFP4 has AoS shaders (MUL_MAT, GET_ROWS)
but no SoA/flash attention support yet (reject FLASH_ATTN_EXT, SET_ROWS).
3. Replace strict assert in test-backend-ops init_tensor_mxfp_soa with a
conditional fallback — when ne2 is not divisible by heads_per_region,
fall back to per-head SoA init instead of crashing.
* fix : correct guard for mxfp cpu dequant functions
* fix: CUDA MXFP LUT init and MXFP flash attention SoA test layout
- Add per-platform GGML_TABLE_NAN/GGML_TABLE_INFINITY macros for MXFP
LUTs — uses __uint_as_float on CUDA to avoid MSVC non-constexpr INFINITY
- Fix init_tensor_mxfp_soa to detect multihead SoA from tensor strides,
matching the KV cache layout for permuted flash attention tests
* fix: CUDA MXFP LUT init — use __builtin_nanf/__builtin_inff for constexpr device tables
CUDA/HIP/MUSA __device__ static tables require constexpr initializers.
Standard NAN/INFINITY macros may expand to non-constexpr expressions
(e.g. MSVC: (float)(1e+300), nvcc: __uint_as_float is not constexpr
for static init). Previous fix attempted __uint_as_float for nvcc and
__builtin_bit_cast for clang — neither worked universally.
Use __builtin_nanf("") and __builtin_inff() which are constexpr on
all target compilers (nvcc, clang for HIP/MUSA, GCC, MSVC). Define
once before the platform #if chain instead of per-platform copies.
* fix: correct E5M2 LUT precision and add converter-vs-LUT validation tests
The kvalues_mxfp8_e5m2 LUT had 50 values with insufficient decimal
precision, causing bitwise mismatches against the IEEE-754 element
converter. Regenerated from ggml_mxfp_fp8_e5m2_to_float() with %.9e
precision for exact float round-trip on all 256 entries.
Also consolidates GGML_TABLE_NAN/GGML_TABLE_INFINITY into a single
definition using __builtin_nanf/__builtin_inff (constexpr on all
target compilers), and adds LUT validation tests to test-quantize-fns
that verify all 5 MXFP element converters match their canonical LUT
values (FP4 E2M1: 16, FP6 E2M3: 64, FP6 E3M2: 64, FP8 E4M3: 256,
FP8 E5M2: 256 — 656 total values verified).
* fix: MSVC compat for GGML_TABLE_NAN/INFINITY — use builtins only on GCC/Clang/nvcc
MSVC does not support __builtin_nanf/__builtin_inff. Use standard
NAN/INFINITY macros on MSVC (which work for regular static tables),
and compiler builtins only on GCC/Clang/nvcc (needed for CUDA
__device__ table constexpr initialization).
* fix: handle nvcc+MSVC host — check __CUDACC__ before _MSC_VER for NAN/INF macros
When nvcc uses MSVC as the host compiler, both _MSC_VER and __CUDACC__
are defined. The previous fix checked _MSC_VER first, giving nvcc the
MSVC NAN/INFINITY macros which are not constexpr for __device__ tables.
Add __CUDACC__ exclusion so nvcc gets __builtin_nanf/__builtin_inff.
* cleanup: remove AoS MXFP6/MXFP8 dequant code — these types are KV-cache-only (SoA)
MXFP6 (E2M3) and MXFP8 (E4M3) exist only for KV cache flash attention,
which uses SoA (Struct-of-Arrays) layout. The AoS dequant functions
(NEON, AVX2, CPU dispatch, generic wrappers) were incorrectly added
and are dead code — no model stores weights in these formats.
Removed:
- AoS NEON dequant: dequantize_row_mxfp{6,8}_neon, _cpu dispatch
- AoS AVX2 dequant: dequantize_row_mxfp{6,8}_avx2, _cpu dispatch
- AoS generic wrappers: dequantize_row_mxfp{6,8}_cpu_generic
- AoS fallback defines in arch-fallback.h
- CPU traits .to_float entries for MXFP6/MXFP8
- MXFP6/MXFP8 from all_types[] in test-backend-ops (no AoS tests)
Kept (correct SoA code):
- All *_soa_* functions (NEON, AVX2, generic, dispatch)
- CPU traits .from_float_soa / .to_float_soa
- Flash attention and SET_ROWS Hadamard test cases
- Scalar reference dequant in ggml-quants.c (test-quantize-fns roundtrip)
- MXFP4 AoS code (upstream model weight support, untouched)
Fixes ARM64 CI failure: GET_ROWS(mxfp6_e2m3) was testing dead AoS code
that had a NEON bug. The test no longer runs because the type is
correctly excluded from AoS test paths.
* test: guard all MXFP types must have SoA traits for flash attention
All MXFP flash attention uses SoA layout exclusively. Test validates:
- ALL MXFP types (MXFP4, MXFP6, MXFP8) have from_float_soa and to_float_soa
- MXFP6/MXFP8 (KV-cache-only) do NOT have AoS CPU to_float
Prevents regression: if someone adds AoS dequant back for MXFP6/MXFP8,
or removes SoA traits from any MXFP type, CI will catch it.
* test: add Hadamard, SoA cross-check, E8M0, and layout offset tests
* test: add MXFP converter edge cases, FP6 packing, E8M0 known-answer tests
Add comprehensive tests to catch the bugs backend implementers hit most:
- Element converter edge cases: subnormals, max finite, saturation, NaN, sign
- FP6 pack/unpack exhaustive round-trip with known-answer byte verification
- E8M0 known-answer decode + HALF vs FULL scale distinction
- E8M0 rounding boundary at sqrt(2) threshold (catches floor-only bugs)
- Converter exhaustive round-trip: quantize(dequantize(i))==i for all formats
- Consolidate duplicate SoA switches into single table in test-backend-ops
* test: add AoS/SoA cross-check, Hadamard pipeline, format spec, and mxfp_rmse
- MXFP4 AoS vs SoA cross-check: two independent code paths, bitwise match
- Full Hadamard pipeline roundtrip: H→quantize→dequant→H for all 3 types
- mxfp_rmse helper: computes sqrt(sum/n), with named pipeline constants
- Block size consistency: verify QK_MXFP{4,8,6} == 32
- EMAX_OFFSET vs format max: validate constants produce valid E8M0
- Edge case LUT validation: expected_bits verified against canonical LUTs
- FP4 E2M1 exhaustive converter round-trip (16/16)
* cleanup: tighten MXFP test comments to match repo conventions
* fix: platform-specific NaN/Infinity for GPU device table initializers
FP8 E4M3/E5M2 LUTs contain NaN/Inf which cannot be constexpr-initialized
in __device__ tables on any CUDA/HIP/MUSA version. No GPU backend uses
these LUTs (they use converter functions instead), so guard them out of
GPU builds entirely. Simplify GGML_TABLE_NAN/INFINITY to CPU-only macros.
rpc : prevent division by zero in deserialize_tensor
When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server.
This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0.
(Note: This was originally reported via Security Advisory and maintainer suggested dropping a patch here).
* style: remove trailing whitespace
Explicitly mark save_acc and add_save_Acc with always_inline
in tinyBLAS_PPC. This ensures the compiler keeps MMA accumulator
disassembly within kernel's register context, preventing un-necessary
stask spills.
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* vulkan: change gated_delta_net to shard a column across a subgroup
This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an
LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).
This fixes a perf regression from the transposing of the values in memory
(!20443).
* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
* CANN: add BF16 support for core operators
Add BF16 (bfloat16) type support to the CANN backend for the following
operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and
OUT_PROD. This enables BF16 models to run on Ascend NPUs.
* CANN: skip NZ weight format for BF16 and add 310P compile guards
NZ weight format conversion does not support BF16 tensors, skip it
in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID
and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P
guards for all BF16 operator support since 310P does not support BF16.
* migrate(vtcm): unify VTCM management for HMX merge
- Add HMX fields to htp_context (#ifdef HTP_HAS_HMX): hmx_enabled,
hmx_dma, vtcm_scratch_size, exp2_table
- Add HTP_VTCM_SESSION_HOLD CMake option (default ON): hold VTCM for
entire session instead of per-op acquire/release
- Add vtcm_op_acquire/vtcm_op_release inline wrappers: no-op in
session-hold mode, delegate in per-op mode
- Add VTCM tail reservation for precompute tables (256KB, 64KB aligned)
in htp_iface_start under HTP_HAS_HMX
- Add HMX init/cleanup hooks in htp_iface_start/stop
- Add precompute table recovery in vtcm_acquire after VTCM preemption
- Do NOT migrate vtcm_mgr from htp-ops-lib (replaced by tail reservation)
* migrate(repack): replace x4x2 with HMX tile-permuted super-block format
- Add hmx_block_q4_0/q8_0 struct definitions (scales-first + sequential quants)
- Implement forward repack: repack_q4_0_to_hmx_superblock, repack_q8_0_to_hmx_superblock, repack_f16_to_tile_permuted
- Implement inverse repack for get_tensor debug verification
- Route set_tensor/get_tensor via opt_arch >= 73 to HMX path, else existing HVX x4x2
- MXFP4 on v73+ falls back to HVX x4x2 repack (not memcpy)
- Extend supports_op: add IQ4_NL for v73+, F16 tile alignment checks
- Tail blocks (K not multiple of 256): repack to x4x2 via pad-repack-truncate
- Add CMake GGML_HEXAGON_HMX_TAIL_HVX option (default ON); OFF rejects non-256-aligned K in supports_op
* migrate(dma): add dma_queue_push_1d() convenience wrapper for HMX ops
Add 1D linear DMA transfer helper to hex-dma.h for upcoming HMX op
migration. Reuses existing dma_queue_flush() for sync points instead
of adding redundant dma_queue_drain().
* migrate(hmx): reorganize HMX files into htp/hmx/ and simplify HMX locking
Move all 14 HMX-related files from htp/ to htp/hmx/ subdirectory for
cleaner separation between HVX and HMX code. Simplify HMX hardware
locking by replacing the two-level lock design (SHARED HAP lock +
custom asm spin-lock) with direct HAP_compute_res_hmx_lock/unlock
on the existing vtcm_rctx, which already has HMX capability.
Key changes:
- Create htp/hmx/ subdirectory with all HMX infrastructure and ops
- Replace hmx_mgr_ctx_id + spin-lock with HAP_compute_res_hmx_lock(vtcm_rctx)
- Remove hmx_manager_enable/disable_execution() (SHARED lock no longer needed)
- Add hmx_set_vtcm_state() call in main.c (was missing, caused null globals)
- Update main.c includes to use hmx/ prefix
- Clean up duplicate declarations from hmx-worker-pool.h
* migrate(hmx-infra): consolidate HMX infrastructure into htp_context
- Remove hmx-mgr.c/h: eliminate global HMX state singleton, thread htp_context through all HMX ops
- Remove hmx-worker-pool.c/h: replace separate HMX worker pool with main worker_pool API (worker_pool_run_func)
- Replace hmx_unit_acquire/release with direct HAP_compute_res_hmx_lock/unlock on ctx->vtcm_rctx
- Remove HTP_VTCM_SESSION_HOLD compile option: always use per-op vtcm_acquire/release
- Remove hmx_dma from htp_context: HMX ops use ctx->dma[0] instead of separate DMA queue
- Simplify main.c init/cleanup: remove hmx_manager_setup/reset and vtcm_op_acquire/release wrappers
- Delete upstream llama.cpp AGENTS.md (not applicable to fork)
* migrate(flash-attn): remove HTP_EXP2_TABLE_COPIES, use single exp2 table
- Remove HTP_EXP2_TABLE_COPIES compile definition and CMake cache variable
- Remove table duplication loop in precompute-table.c
- Remove worker_index % N sub-table indexing in hmx-flash-attn-ops.c
- Fix table_size to 65536 (single 64 KB copy) in main.c
The exp2 lookup table is read-only; concurrent VTCM reads do not cause
bank conflicts, so duplicating the table wastes 192 KB of VTCM for no
benefit.
* migrate(dsp-main): add HMX priority dispatch in packet_callback
- Add proc_hmx_matmul_req() wrapper for HMX mat_mul (F16 and quantized types)
- Add proc_hmx_flash_attn_req() wrapper for HMX simple_flash_attn (FP16 only, falls back to HVX for non-FP16)
- Add proc_hmx_rms_norm_req() wrapper using hvx_rms_norm_f32
- Route MUL_MAT, FLASH_ATTN_EXT, RMS_NORM through HMX path when ctx->hmx_enabled
- Split RMS_NORM and SCALE into separate case blocks for independent dispatch
- All HMX wrappers guarded by #ifdef HTP_HAS_HMX
* migrate(cmake-dsp): add HMX source files and -mhmx for v73+ skels
Add HTP_VTCM_SESSION_HOLD option (default ON) and v73+ HMX build
integration: compile hmx-matmul-ops, hmx-flash-attn-ops,
hmx-rms-norm-ops and precompute-table into v73/v75/v79/v81 skels
with -mhmx flag and HTP_HAS_HMX=1 definition. v68/v69 skels remain
unchanged.
* migrate(hmx-ops): fix compile errors in HMX ops for ggml struct compatibility
- hmx-matmul-ops.c: include ggml-common.h for block_q4_0/block_q8_0 definitions
- hmx-matmul-ops.c: rename quants->qs, scale->d to match upstream ggml field names
- hmx-flash-attn-ops.c: suppress -Wunused-function/-Wunused-variable warnings
- hmx-flash-attn-ops.c: inline ctx->n_threads, remove unused n_workers variable
* hmx: set Q/O element type to fp16 for flash attention
The llama.cpp integration passes fp16 Q/O tensors, so qo_fp32_element
should be false to match the actual data layout.
* hexagon: unify HMX weight format to x4x2, add IQ4_NL and DSP-side fallback
Remove the v73+ HMX-specific super-block/tile-permuted weight format
and unify all architectures on the HVX x4x2 packed format. The DSP now
decides at runtime whether to use the HMX or HVX matmul path based on
dimension constraints (M%32, N%32, K%256 alignment), rather than the
host rejecting ops in supports_op. This simplifies the host repack
logic, eliminates ~400 lines of HMX super-block code, and adds IQ4_NL
quantization support across host and DSP.
Key changes:
- Remove hmx_block_q4_0/q8_0 types, repack functions, and F16 tile
permutation (ggml-hexagon.cpp, hmx-quants.h)
- Simplify set_tensor/get_tensor to always use x4x2 repack, add IQ4_NL
- Force is_host=false so tensor copies go through format conversion
- Add HTP_TYPE_IQ4_NL to DSP message protocol (htp-msg.h)
- Rewrite DSP dequantizers to work directly on x4x2 layout
(hmx-matmul-ops.c)
- Fix mxclracc.hf placement: clear per output tile, not once globally
- Move HMX eligibility checks to DSP proc_hmx_matmul_req (main.c)
- Remove dma_queue_push_1d wrapper, use 2D DMA for weight sub-blocks
- Add VTCM allocation overflow asserts
- Remove GGML_HEXAGON_HMX_TAIL_HVX build option (CMakeLists.txt)
* Enhance HMX debugging capabilities with new tile dumping functions
- Introduced hmx_dump_tile_mem and hmx_dump_fp32_tile_region for improved memory layout visualization of tile data.
- Updated hmx_dump_tile_rows to provide raw memory output for debugging.
- Added debug logging for activation and weight tile pairs during processing to facilitate troubleshooting.
- Refined existing macros for dumping HVX vector values to streamline debugging output.
These changes aim to enhance the debugging experience for HMX matmul operations, ensuring better visibility into data handling and transformations.
* OK for small mat mul
* hexagon: fix UDMA roiwidth 16-bit overflow in HMX matmul DMA transfers
The UDMA descriptor roiwidth field is 16-bit (max 65535), but large matrix
DMA transfers (e.g. 32×2304 = 73728 bytes) exceeded this limit, causing
truncated transfers and NaN results. Fix by using 2D DMA (per-row stride ×
n_rows) instead of 1D (total_size × 1) for all 4 DMA push calls in both
x4x2 and fp16 weight paths.
Also includes:
- Use standard vlut16 instead of _nomatch variant for dequantization
- Add per-tile vscatter drain barrier for correctness
- Add compile-time HMX_DEBUG_TRACE_VALUES instrumentation (disabled by default)
* hexagon: remove HMX RMS norm fallback and re-enable matmul pipeline
Remove hmx-rms-norm-ops.c as the HVX RMS norm offers no benefit over
the generic unary path. Re-enable DMA pipeline mode for QK matmul.
* hexagon: guard all HMX matmul DMA transfers against UDMA 16-bit field overflow
All UDMA type1 descriptor fields (roiwidth, roiheight, srcstride, dststride)
are 16-bit (max 65535). Commit 40d2a9cc fixed roiwidth overflow in the
non-pipeline path by switching from 1D to 2D DMA, but the pipeline path
(3 call sites) was left unchanged and still used 1D DMA with
chunk_size = n_cols * row_stride as roiwidth, which overflows for any
practical matrix size when the pipeline is active.
Add a local hmx_dma_push_safe() helper that transparently handles overflow:
- Fast path (zero overhead): all params fit in 16 bits -> direct call.
- Contiguous block: reshapes into a single 2D descriptor with sub_width
that fits in 16 bits, preserving async DMA behavior.
- Stride overflow: row-by-row fallback for future large-k models where
per-row stride itself exceeds 65535.
Convert all 8 external dma_queue_push calls in hmx-matmul-ops.c to use
the safe helper, including the 3 pipeline sites (1D -> 2D fix), the
FP16 and x4x2 weight paths, qweight_fetch sub-block DMA, and the
output-stationary activation fetch.
* hexagon: multithread activation/output transfer and add HMX matmul fallback
- Replace single-threaded transfer_activation_chunk_fp32_to_fp16 with
transfer_activation_chunk_multithread across all HMX matmul paths
- Add multi-threaded transfer_output_chunk_multithread for FP16-to-FP32
output store, following the same worker pool pattern
- Rename transfer_activation_chunk_no_prefetch back to
transfer_activation_chunk_fp32_to_fp16 and clean up stale comments
- Add HVX fallback in proc_hmx_matmul_req when HMX matmul returns error
* [todo]: dynamic alloc vtcm, cause prefill regression.
* hexagon: constrain HMX mxmem tile load region to avoid VTCM bank boundary faults
Set activation/weight mxmem Rt to 2047 for single-tile loads and document the 4MB VTCM bank boundary constraint, preventing precise bus errors when dynamic VTCM allocation places tiles near bank edges.
* hexagon: split unaligned-M HMX matmul into HMX+HVX phases
- keep HMX for the 32-aligned head rows and process tail rows with HVX
- force re-quantization for HVX tail after HMX phase to avoid stale VTCM state
- preserve fallback behavior when N is unaligned or no aligned M rows exist
* hexagon: batch-4 Q4_0 dequantize fast path and remove debug traces
Add dequantize_x4x2_q4_0_x4groups_hvx() that processes 4 contiguous
K-tiles with a single vmemu + vlut16 per row, reducing per-tile overhead.
The dequantize loop now takes the batch-4 path when 4 aligned K-tiles
are available within the same column tile, falling back to the original
single-tile path otherwise.
Also removes HMX_DEBUG_TRACE_VALUES instrumentation blocks that are no
longer needed.
* hexagon: abort on DSP error and fix HMX-to-HVX fallback quantize flag
Promote DSP response error from log to GGML_ABORT for fail-fast
behavior. Clear SKIP_QUANTIZE flag when falling back from HMX to HVX
matmul so the HVX path correctly re-quantizes activations.
* hexagon: support batch matmul. This fix perplexity issue
The problem comes from Grouped-Query Attention(GQA). Strides between batches are not well respected
TODO: optimize batch matmul to reuse weights between batches.
* hexagon: reuse weights in fp16 batch matmul
* hexagon: remove unused HMX flash attention operations and precomputation table, remove the log system for test
* hexagon: remove unused HVX math helpers, debug infrastructure, and stale build options
* hexagon: fix HMX not enabled due to missing force_hvx parameter in IDL
* hexagon: remove the unnecessary changes not related to HMX
* hexagon: bypass HMX by default
* hexagon: add upstream repo link to htp-ops-lib ported file headers
* hexagon: restore host buffer support
* hexagon: add HMX=1 option for the adb scripts
* hex-hmx: improve DMA pipelining
* hex-hmx: further improvements to dma pipelining
* hex-hmx: minor cleanup
* hex-hmx: move hmx lock out of inner loops/calls
* hex-hmx: remove unnecessary state and wrappers
* hex-hmx: remove hmx dir and unify f32 to f16 conversions
* hex-hmx: further unify hvx conversions
* hex-hmx: revert f16 converter to the original for now
* hex-hmx: minor cleanup for f16 to f32 converter
* hex-mm: replace incorrect fp16-to-fp32 hmx converter and reformated related code
* hex-dma: move chanied dma push into hex-dma.h header and update hmx-mm
* hex-mm: use hex_is_aligned instead of a duplicated hmx_is_aligned
* hex-mm: use hvx_vec_splat_f16 in the hmx code
* hex-mm: use VLEN and HTP types in hmx-code
* hex-mm: remove duplicate QK and defs
* hexagon: pre-shuffle quants before vlut16
* hexagon: enable HMX by default
* hex-mm: code indent fixes for hmx-matmul
* hexagon: update hex-utils to include align/smin/etc helpers and use that in hmx mm
* hex-mm: more formatting fixes
* hex-mm: minor naming updates in hmx code
* hex-mm: remove leftover from rebase conflict
* Fix the incorrect indents
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
RotaryPositionEmbedding on CANN fails when src and dst share the same
non-contiguous buffer (inplace + view), because the operator overwrites
source data before it is fully read.
Add a branch that detects this case and uses contiguous temporary
buffers: copy src to temp, run ROPE into another temp, then copy back
to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1,
inplace=1).
Signed-off-by: noemotiovon <757486878@qq.com>
- Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by
padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2,
then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp).
- Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of
sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with
48 heads); fixes buffer overflow and large numerical errors in those cases.
Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.
- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback
- CONT reuses the existing CPY infrastructure since making a tensor
contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
the worker pool, supporting f32 and f16 types. The kernel parallelizes
across output rows and uses memcpy for each tile.
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* kleidiai : fix MUL_MAT support for batched (3D) inputs
The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.
This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.
Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.
Fixes#20608
* Kleidiai support_ops should only return true for 3D inputs, not also 4D
* vulkan: avoid graphics queue on non-RADV AMD drivers
* avoid graphics queues on small GPUs
* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE
* reenable transfer queue if graphics queue is not used
* kleidiai: add data type check to get_tensor_traits
* Added check for F16 data type into get_tensor_traits path with input data
not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8)
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7
* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
updated kleidiai.cpp file as per suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>