* cleanup : hoist mxfp soa functions
* fix: CI failures — CUDA __device__ init, Metal MXFP supports_op, SoA test assert
Three fixes for CI failures:
1. Remove <cmath> from CUDA/HIP/MUSA section of ggml-common.h — the include
causes NAN/INFINITY to become non-constexpr, breaking __device__ static
table initialization for the MXFP LUTs.
2. Add MXFP type guards to Metal's supports_op: MXFP8/MXFP6 have no Metal
shaders yet (reject all ops), MXFP4 has AoS shaders (MUL_MAT, GET_ROWS)
but no SoA/flash attention support yet (reject FLASH_ATTN_EXT, SET_ROWS).
3. Replace strict assert in test-backend-ops init_tensor_mxfp_soa with a
conditional fallback — when ne2 is not divisible by heads_per_region,
fall back to per-head SoA init instead of crashing.
* fix : correct guard for mxfp cpu dequant functions
* fix: CUDA MXFP LUT init and MXFP flash attention SoA test layout
- Add per-platform GGML_TABLE_NAN/GGML_TABLE_INFINITY macros for MXFP
LUTs — uses __uint_as_float on CUDA to avoid MSVC non-constexpr INFINITY
- Fix init_tensor_mxfp_soa to detect multihead SoA from tensor strides,
matching the KV cache layout for permuted flash attention tests
* fix: CUDA MXFP LUT init — use __builtin_nanf/__builtin_inff for constexpr device tables
CUDA/HIP/MUSA __device__ static tables require constexpr initializers.
Standard NAN/INFINITY macros may expand to non-constexpr expressions
(e.g. MSVC: (float)(1e+300), nvcc: __uint_as_float is not constexpr
for static init). Previous fix attempted __uint_as_float for nvcc and
__builtin_bit_cast for clang — neither worked universally.
Use __builtin_nanf("") and __builtin_inff() which are constexpr on
all target compilers (nvcc, clang for HIP/MUSA, GCC, MSVC). Define
once before the platform #if chain instead of per-platform copies.
* fix: correct E5M2 LUT precision and add converter-vs-LUT validation tests
The kvalues_mxfp8_e5m2 LUT had 50 values with insufficient decimal
precision, causing bitwise mismatches against the IEEE-754 element
converter. Regenerated from ggml_mxfp_fp8_e5m2_to_float() with %.9e
precision for exact float round-trip on all 256 entries.
Also consolidates GGML_TABLE_NAN/GGML_TABLE_INFINITY into a single
definition using __builtin_nanf/__builtin_inff (constexpr on all
target compilers), and adds LUT validation tests to test-quantize-fns
that verify all 5 MXFP element converters match their canonical LUT
values (FP4 E2M1: 16, FP6 E2M3: 64, FP6 E3M2: 64, FP8 E4M3: 256,
FP8 E5M2: 256 — 656 total values verified).
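A minimal sketch of the regeneration step (the converter's exact signature is an assumption; only the function name and the %.9e format come from this commit):

```cpp
#include <cstdint>
#include <cstdio>

extern "C" float ggml_mxfp_fp8_e5m2_to_float(uint8_t v); // repo converter, signature assumed

int main() {
    // %.9e prints enough digits that every entry round-trips exactly to float
    for (int i = 0; i < 256; ++i) {
        printf("%.9ef,%c", ggml_mxfp_fp8_e5m2_to_float((uint8_t) i), (i % 8 == 7) ? '\n' : ' ');
    }
}
```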
* fix: MSVC compat for GGML_TABLE_NAN/INFINITY — use builtins only on GCC/Clang/nvcc
MSVC does not support __builtin_nanf/__builtin_inff. Use standard
NAN/INFINITY macros on MSVC (which work for regular static tables),
and compiler builtins only on GCC/Clang/nvcc (needed for CUDA
__device__ table constexpr initialization).
* fix: handle nvcc+MSVC host — check __CUDACC__ before _MSC_VER for NAN/INF macros
When nvcc uses MSVC as the host compiler, both _MSC_VER and __CUDACC__
are defined. The previous fix checked _MSC_VER first, giving nvcc the
MSVC NAN/INFINITY macros which are not constexpr for __device__ tables.
Add __CUDACC__ exclusion so nvcc gets __builtin_nanf/__builtin_inff.
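A minimal sketch of the resulting guard order (the exact chain in ggml-common.h may differ):

```cpp
// __CUDACC__ must be tested before _MSC_VER: under nvcc with an MSVC host both
// are defined, and only the builtins are constexpr for __device__ tables
#if defined(__CUDACC__) || defined(__GNUC__) || defined(__clang__)
#define GGML_TABLE_NAN      __builtin_nanf("")
#define GGML_TABLE_INFINITY __builtin_inff()
#else // plain MSVC: the standard macros work for regular static tables
#define GGML_TABLE_NAN      NAN
#define GGML_TABLE_INFINITY INFINITY
#endif
```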
* cleanup: remove AoS MXFP6/MXFP8 dequant code — these types are KV-cache-only (SoA)
MXFP6 (E2M3) and MXFP8 (E4M3) exist only for KV cache flash attention,
which uses SoA (Struct-of-Arrays) layout. The AoS dequant functions
(NEON, AVX2, CPU dispatch, generic wrappers) were incorrectly added
and are dead code — no model stores weights in these formats.
Removed:
- AoS NEON dequant: dequantize_row_mxfp{6,8}_neon, _cpu dispatch
- AoS AVX2 dequant: dequantize_row_mxfp{6,8}_avx2, _cpu dispatch
- AoS generic wrappers: dequantize_row_mxfp{6,8}_cpu_generic
- AoS fallback defines in arch-fallback.h
- CPU traits .to_float entries for MXFP6/MXFP8
- MXFP6/MXFP8 from all_types[] in test-backend-ops (no AoS tests)
Kept (correct SoA code):
- All *_soa_* functions (NEON, AVX2, generic, dispatch)
- CPU traits .from_float_soa / .to_float_soa
- Flash attention and SET_ROWS Hadamard test cases
- Scalar reference dequant in ggml-quants.c (test-quantize-fns roundtrip)
- MXFP4 AoS code (upstream model weight support, untouched)
Fixes ARM64 CI failure: GET_ROWS(mxfp6_e2m3) was testing dead AoS code
that had a NEON bug. The test no longer runs because the type is
correctly excluded from AoS test paths.
* test: guard that all MXFP types have SoA traits for flash attention
All MXFP flash attention uses SoA layout exclusively. Test validates:
- ALL MXFP types (MXFP4, MXFP6, MXFP8) have from_float_soa and to_float_soa
- MXFP6/MXFP8 (KV-cache-only) do NOT have AoS CPU to_float
Prevents regression: if someone adds AoS dequant back for MXFP6/MXFP8,
or removes SoA traits from any MXFP type, CI will catch it.
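A minimal sketch of the guard (enum and field names follow this series; the traits accessor is assumed):

```cpp
for (ggml_type t : { GGML_TYPE_MXFP4_E2M1, GGML_TYPE_MXFP6_E2M3, GGML_TYPE_MXFP8_E4M3 }) {
    const auto * traits = ggml_get_type_traits_cpu(t);
    GGML_ASSERT(traits->from_float_soa != nullptr); // every MXFP type needs SoA converters
    GGML_ASSERT(traits->to_float_soa   != nullptr);
}
for (ggml_type t : { GGML_TYPE_MXFP6_E2M3, GGML_TYPE_MXFP8_E4M3 }) {
    GGML_ASSERT(ggml_get_type_traits_cpu(t)->to_float == nullptr); // KV-cache-only: no AoS dequant
}
```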
* test: add Hadamard, SoA cross-check, E8M0, and layout offset tests
* test: add MXFP converter edge cases, FP6 packing, E8M0 known-answer tests
Add comprehensive tests to catch the bugs backend implementers hit most:
- Element converter edge cases: subnormals, max finite, saturation, NaN, sign
- FP6 pack/unpack exhaustive round-trip with known-answer byte verification
- E8M0 known-answer decode + HALF vs FULL scale distinction
- E8M0 rounding boundary at sqrt(2) threshold (catches floor-only bugs; see the sketch after this list)
- Converter exhaustive round-trip: quantize(dequantize(i))==i for all formats
- Consolidate duplicate SoA switches into single table in test-backend-ops
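E8M0 stores only a power-of-two exponent, so round-to-nearest in log space flips to the next exponent exactly at sqrt(2) * 2^e; a floor-only implementation never takes that branch. A minimal sketch (helper name hypothetical; the real encoding uses a biased uint8):

```cpp
#include <cmath>
#include <cstdint>

// round-to-nearest E8M0 exponent for a positive scale: the log-space midpoint
// between 2^e and 2^(e+1) is sqrt(2) * 2^e
static int8_t e8m0_exp_nearest(float s) {
    int e = (int) floorf(log2f(s));
    if (s / ldexpf(1.0f, e) >= sqrtf(2.0f)) {
        e += 1; // a floor-only implementation misses this and under-scales
    }
    return (int8_t) e;
}
```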
* test: add AoS/SoA cross-check, Hadamard pipeline, format spec, and mxfp_rmse
- MXFP4 AoS vs SoA cross-check: two independent code paths, bitwise match
- Full Hadamard pipeline roundtrip: H→quantize→dequant→H for all 3 types
- mxfp_rmse helper: computes sqrt(sum/n), with named pipeline constants
- Block size consistency: verify QK_MXFP{4,8,6} == 32
- EMAX_OFFSET vs format max: validate constants produce valid E8M0
- Edge case LUT validation: expected_bits verified against canonical LUTs
- FP4 E2M1 exhaustive converter round-trip (16/16)
* cleanup: tighten MXFP test comments to match repo conventions
* fix: platform-specific NaN/Infinity for GPU device table initializers
FP8 E4M3/E5M2 LUTs contain NaN/Inf which cannot be constexpr-initialized
in __device__ tables on any CUDA/HIP/MUSA version. No GPU backend uses
these LUTs (they use converter functions instead), so guard them out of
GPU builds entirely. Simplify GGML_TABLE_NAN/INFINITY to CPU-only macros.
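A sketch of the resulting header fragment (GGML_COMMON_DECL_* and GGML_TABLE_BEGIN/END are the existing ggml-common.h selector and table macros; the table body is elided here):

```cpp
#if !defined(GGML_COMMON_DECL_CUDA) && !defined(GGML_COMMON_DECL_HIP) && !defined(GGML_COMMON_DECL_MUSA)
// NaN/Inf entries cannot be constexpr-initialized in __device__ tables,
// and the GPU backends dequantize via the element converters anyway
GGML_TABLE_BEGIN(float, kvalues_mxfp8_e5m2, 256)
    // ... 256 entries generated from ggml_mxfp_fp8_e5m2_to_float()
GGML_TABLE_END()
#endif
```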
rpc : prevent division by zero in deserialize_tensor
When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server.
This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0.
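A minimal sketch of the check, placed at the top of `deserialize_tensor` (`ggml_blck_size` is the public accessor):

```cpp
// deprecated/removed types (e.g. 4 and 5) report a block size of 0;
// bail out before ggml_row_size() divides by it (SIGFPE)
if (ggml_blck_size((ggml_type) tensor->type) == 0) {
    return nullptr;
}
```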
(Note: this was originally reported via a Security Advisory, and the maintainer suggested submitting the patch here.)
* style: remove trailing whitespace
Explicitly mark save_acc and add_save_acc with always_inline
in tinyBLAS_PPC. This ensures the compiler keeps the MMA accumulator
handling within the kernel's register context, preventing unnecessary
stack spills.
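A sketch of the annotation (GCC/Clang attribute syntax; the helpers' real signatures are not shown):

```cpp
__attribute__((always_inline)) inline void save_acc(/* accumulator, tile coords */) {
    // store the MMA accumulator tile to the output
}
__attribute__((always_inline)) inline void add_save_acc(/* accumulator, tile coords */) {
    // accumulate into the existing output, then store
}
```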
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* vulkan: change gated_delta_net to shard a column across a subgroup
This is based on https://github.com/ggml-org/llama.cpp/pull/20391. I used an
LLM to port the CUDA code to Vulkan and guided it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, the unknown mapping
of subgroup to invocation id, using subgroupAdd optionally, etc.).
This fixes a perf regression from transposing the values in memory
(!20443).
* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
* CANN: add BF16 support for core operators
Add BF16 (bfloat16) type support to the CANN backend for the following
operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and
OUT_PROD. This enables BF16 models to run on Ascend NPUs.
* CANN: skip NZ weight format for BF16 and add 310P compile guards
NZ weight format conversion does not support BF16 tensors, skip it
in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID
and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P
guards for all BF16 operator support since 310P does not support BF16.
* migrate(vtcm): unify VTCM management for HMX merge
- Add HMX fields to htp_context (#ifdef HTP_HAS_HMX): hmx_enabled,
hmx_dma, vtcm_scratch_size, exp2_table
- Add HTP_VTCM_SESSION_HOLD CMake option (default ON): hold VTCM for
entire session instead of per-op acquire/release
- Add vtcm_op_acquire/vtcm_op_release inline wrappers (sketched after this
list): no-op in session-hold mode, delegate in per-op mode
- Add VTCM tail reservation for precompute tables (256KB, 64KB aligned)
in htp_iface_start under HTP_HAS_HMX
- Add HMX init/cleanup hooks in htp_iface_start/stop
- Add precompute table recovery in vtcm_acquire after VTCM preemption
- Do NOT migrate vtcm_mgr from htp-ops-lib (replaced by tail reservation)
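A minimal sketch of the wrappers (vtcm_release's signature is assumed to mirror vtcm_acquire):

```cpp
static inline void vtcm_op_acquire(struct htp_context * ctx) {
#ifdef HTP_VTCM_SESSION_HOLD
    (void) ctx; // session-hold mode: VTCM is held for the entire session
#else
    vtcm_acquire(ctx); // per-op mode: delegate
#endif
}

static inline void vtcm_op_release(struct htp_context * ctx) {
#ifdef HTP_VTCM_SESSION_HOLD
    (void) ctx;
#else
    vtcm_release(ctx);
#endif
}
```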
* migrate(repack): replace x4x2 with HMX tile-permuted super-block format
- Add hmx_block_q4_0/q8_0 struct definitions (scales-first + sequential quants)
- Implement forward repack: repack_q4_0_to_hmx_superblock, repack_q8_0_to_hmx_superblock, repack_f16_to_tile_permuted
- Implement inverse repack for get_tensor debug verification
- Route set_tensor/get_tensor via opt_arch >= 73 to HMX path, else existing HVX x4x2
- MXFP4 on v73+ falls back to HVX x4x2 repack (not memcpy)
- Extend supports_op: add IQ4_NL for v73+, F16 tile alignment checks
- Tail blocks (K not multiple of 256): repack to x4x2 via pad-repack-truncate
- Add CMake GGML_HEXAGON_HMX_TAIL_HVX option (default ON); OFF rejects non-256-aligned K in supports_op
* migrate(dma): add dma_queue_push_1d() convenience wrapper for HMX ops
Add 1D linear DMA transfer helper to hex-dma.h for upcoming HMX op
migration. Reuses existing dma_queue_flush() for sync points instead
of adding redundant dma_queue_drain().
* migrate(hmx): reorganize HMX files into htp/hmx/ and simplify HMX locking
Move all 14 HMX-related files from htp/ to htp/hmx/ subdirectory for
cleaner separation between HVX and HMX code. Simplify HMX hardware
locking by replacing the two-level lock design (SHARED HAP lock +
custom asm spin-lock) with direct HAP_compute_res_hmx_lock/unlock
on the existing vtcm_rctx, which already has HMX capability.
Key changes:
- Create htp/hmx/ subdirectory with all HMX infrastructure and ops
- Replace hmx_mgr_ctx_id + spin-lock with HAP_compute_res_hmx_lock(vtcm_rctx)
- Remove hmx_manager_enable/disable_execution() (SHARED lock no longer needed)
- Add hmx_set_vtcm_state() call in main.c (was missing, caused null globals)
- Update main.c includes to use hmx/ prefix
- Clean up duplicate declarations from hmx-worker-pool.h
* migrate(hmx-infra): consolidate HMX infrastructure into htp_context
- Remove hmx-mgr.c/h: eliminate global HMX state singleton, thread htp_context through all HMX ops
- Remove hmx-worker-pool.c/h: replace separate HMX worker pool with main worker_pool API (worker_pool_run_func)
- Replace hmx_unit_acquire/release with direct HAP_compute_res_hmx_lock/unlock on ctx->vtcm_rctx
- Remove HTP_VTCM_SESSION_HOLD compile option: always use per-op vtcm_acquire/release
- Remove hmx_dma from htp_context: HMX ops use ctx->dma[0] instead of separate DMA queue
- Simplify main.c init/cleanup: remove hmx_manager_setup/reset and vtcm_op_acquire/release wrappers
- Delete upstream llama.cpp AGENTS.md (not applicable to fork)
* migrate(flash-attn): remove HTP_EXP2_TABLE_COPIES, use single exp2 table
- Remove HTP_EXP2_TABLE_COPIES compile definition and CMake cache variable
- Remove table duplication loop in precompute-table.c
- Remove worker_index % N sub-table indexing in hmx-flash-attn-ops.c
- Fix table_size to 65536 (single 64 KB copy) in main.c
The exp2 lookup table is read-only; concurrent VTCM reads do not cause
bank conflicts, so duplicating the table wastes 192 KB of VTCM for no
benefit.
* migrate(dsp-main): add HMX priority dispatch in packet_callback
- Add proc_hmx_matmul_req() wrapper for HMX mat_mul (F16 and quantized types)
- Add proc_hmx_flash_attn_req() wrapper for HMX simple_flash_attn (FP16 only, falls back to HVX for non-FP16)
- Add proc_hmx_rms_norm_req() wrapper using hvx_rms_norm_f32
- Route MUL_MAT, FLASH_ATTN_EXT, RMS_NORM through HMX path when ctx->hmx_enabled
- Split RMS_NORM and SCALE into separate case blocks for independent dispatch
- All HMX wrappers guarded by #ifdef HTP_HAS_HMX
* migrate(cmake-dsp): add HMX source files and -mhmx for v73+ skels
Add HTP_VTCM_SESSION_HOLD option (default ON) and v73+ HMX build
integration: compile hmx-matmul-ops, hmx-flash-attn-ops,
hmx-rms-norm-ops and precompute-table into v73/v75/v79/v81 skels
with -mhmx flag and HTP_HAS_HMX=1 definition. v68/v69 skels remain
unchanged.
* migrate(hmx-ops): fix compile errors in HMX ops for ggml struct compatibility
- hmx-matmul-ops.c: include ggml-common.h for block_q4_0/block_q8_0 definitions
- hmx-matmul-ops.c: rename quants->qs, scale->d to match upstream ggml field names
- hmx-flash-attn-ops.c: suppress -Wunused-function/-Wunused-variable warnings
- hmx-flash-attn-ops.c: inline ctx->n_threads, remove unused n_workers variable
* hmx: set Q/O element type to fp16 for flash attention
The llama.cpp integration passes fp16 Q/O tensors, so qo_fp32_element
should be false to match the actual data layout.
* hexagon: unify HMX weight format to x4x2, add IQ4_NL and DSP-side fallback
Remove the v73+ HMX-specific super-block/tile-permuted weight format
and unify all architectures on the HVX x4x2 packed format. The DSP now
decides at runtime whether to use the HMX or HVX matmul path based on
dimension constraints (M%32, N%32, K%256 alignment), rather than the
host rejecting ops in supports_op. This simplifies the host repack
logic, eliminates ~400 lines of HMX super-block code, and adds IQ4_NL
quantization support across host and DSP.
Key changes:
- Remove hmx_block_q4_0/q8_0 types, repack functions, and F16 tile
permutation (ggml-hexagon.cpp, hmx-quants.h)
- Simplify set_tensor/get_tensor to always use x4x2 repack, add IQ4_NL
- Force is_host=false so tensor copies go through format conversion
- Add HTP_TYPE_IQ4_NL to DSP message protocol (htp-msg.h)
- Rewrite DSP dequantizers to work directly on x4x2 layout
(hmx-matmul-ops.c)
- Fix mxclracc.hf placement: clear per output tile, not once globally
- Move HMX eligibility checks to DSP proc_hmx_matmul_req (main.c)
- Remove dma_queue_push_1d wrapper, use 2D DMA for weight sub-blocks
- Add VTCM allocation overflow asserts
- Remove GGML_HEXAGON_HMX_TAIL_HVX build option (CMakeLists.txt)
* Enhance HMX debugging capabilities with new tile dumping functions
- Introduce hmx_dump_tile_mem and hmx_dump_fp32_tile_region for visualizing the memory layout of tile data.
- Update hmx_dump_tile_rows to provide raw memory output for debugging.
- Add debug logging for activation and weight tile pairs during processing.
- Refine existing macros for dumping HVX vector values to streamline debugging output.
* OK for small mat mul
* hexagon: fix UDMA roiwidth 16-bit overflow in HMX matmul DMA transfers
The UDMA descriptor roiwidth field is 16-bit (max 65535), but large matrix
DMA transfers (e.g. 32×2304 = 73728 bytes) exceeded this limit, causing
truncated transfers and NaN results. Fix by using 2D DMA (per-row stride ×
n_rows) instead of 1D (total_size × 1) for all 4 DMA push calls in both
x4x2 and fp16 weight paths.
Also includes:
- Use standard vlut16 instead of _nomatch variant for dequantization
- Add per-tile vscatter drain barrier for correctness
- Add compile-time HMX_DEBUG_TRACE_VALUES instrumentation (disabled by default)
* hexagon: remove HMX RMS norm fallback and re-enable matmul pipeline
Remove hmx-rms-norm-ops.c as the HVX RMS norm offers no benefit over
the generic unary path. Re-enable DMA pipeline mode for QK matmul.
* hexagon: guard all HMX matmul DMA transfers against UDMA 16-bit field overflow
All UDMA type1 descriptor fields (roiwidth, roiheight, srcstride, dststride)
are 16-bit (max 65535). Commit 40d2a9cc fixed roiwidth overflow in the
non-pipeline path by switching from 1D to 2D DMA, but the pipeline path
(3 call sites) was left unchanged and still used 1D DMA with
chunk_size = n_cols * row_stride as roiwidth, which overflows for any
practical matrix size when the pipeline is active.
Add a local hmx_dma_push_safe() helper that transparently handles overflow:
- Fast path (zero overhead): all params fit in 16 bits -> direct call.
- Contiguous block: reshapes into a single 2D descriptor with sub_width
that fits in 16 bits, preserving async DMA behavior.
- Stride overflow: row-by-row fallback for future large-k models where
per-row stride itself exceeds 65535.
Convert all 8 external dma_queue_push calls in hmx-matmul-ops.c to use
the safe helper, including the 3 pipeline sites (1D -> 2D fix), the
FP16 and x4x2 weight paths, qweight_fetch sub-block DMA, and the
output-stationary activation fetch.
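A sketch of the wrapper; dma_queue_push's real signature is assumed here (2D transfer: width bytes per row, n_rows rows, per-row strides, all 16-bit descriptor fields):

```cpp
#include <cstdint>

static inline void hmx_dma_push_safe(struct dma_queue * q, void * dst, const void * src,
                                     uint32_t width, uint32_t n_rows,
                                     uint32_t src_stride, uint32_t dst_stride) {
    if (width <= 65535 && n_rows <= 65535 && src_stride <= 65535 && dst_stride <= 65535) {
        dma_queue_push(q, dst, src, width, n_rows, src_stride, dst_stride); // fast path
        return;
    }
    if (width == src_stride && width == dst_stride) {
        // contiguous block: reshape into a 2D descriptor whose width fits in 16 bits
        // (the resulting row count is assumed to fit as well)
        uint64_t total = (uint64_t) width * n_rows;
        uint32_t sub_width = 32768;
        while (sub_width > 1 && total % sub_width != 0) {
            sub_width >>= 1;
        }
        dma_queue_push(q, dst, src, sub_width, (uint32_t) (total / sub_width), sub_width, sub_width);
        return;
    }
    // per-row stride exceeds 65535: fall back to one push per row
    for (uint32_t r = 0; r < n_rows; ++r) {
        dma_queue_push(q, (char *) dst + (uint64_t) r * dst_stride,
                       (const char *) src + (uint64_t) r * src_stride, width, 1, width, width);
    }
}
```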
* hexagon: multithread activation/output transfer and add HMX matmul fallback
- Replace single-threaded transfer_activation_chunk_fp32_to_fp16 with
transfer_activation_chunk_multithread across all HMX matmul paths
- Add multi-threaded transfer_output_chunk_multithread for FP16-to-FP32
output store, following the same worker pool pattern
- Rename transfer_activation_chunk_no_prefetch back to
transfer_activation_chunk_fp32_to_fp16 and clean up stale comments
- Add HVX fallback in proc_hmx_matmul_req when HMX matmul returns error
* [todo]: dynamic VTCM allocation causes a prefill regression.
* hexagon: constrain HMX mxmem tile load region to avoid VTCM bank boundary faults
Set activation/weight mxmem Rt to 2047 for single-tile loads and document the 4MB VTCM bank boundary constraint, preventing precise bus errors when dynamic VTCM allocation places tiles near bank edges.
* hexagon: split unaligned-M HMX matmul into HMX+HVX phases
- keep HMX for the 32-aligned head rows and process tail rows with HVX
- force re-quantization for HVX tail after HMX phase to avoid stale VTCM state
- preserve fallback behavior when N is unaligned or no aligned M rows exist
* hexagon: batch-4 Q4_0 dequantize fast path and remove debug traces
Add dequantize_x4x2_q4_0_x4groups_hvx() that processes 4 contiguous
K-tiles with a single vmemu + vlut16 per row, reducing per-tile overhead.
The dequantize loop now takes the batch-4 path when 4 aligned K-tiles
are available within the same column tile, falling back to the original
single-tile path otherwise.
Also removes HMX_DEBUG_TRACE_VALUES instrumentation blocks that are no
longer needed.
* hexagon: abort on DSP error and fix HMX-to-HVX fallback quantize flag
Promote DSP response error from log to GGML_ABORT for fail-fast
behavior. Clear SKIP_QUANTIZE flag when falling back from HMX to HVX
matmul so the HVX path correctly re-quantizes activations.
* hexagon: support batch matmul; this fixes a perplexity issue
The problem comes from Grouped-Query Attention (GQA): strides between batches were not respected.
TODO: optimize batch matmul to reuse weights between batches.
* hexagon: reuse weights in fp16 batch matmul
* hexagon: remove unused HMX flash attention operations and precompute table; remove the test logging system
* hexagon: remove unused HVX math helpers, debug infrastructure, and stale build options
* hexagon: fix HMX not being enabled due to a missing force_hvx parameter in the IDL
* hexagon: remove the unnecessary changes not related to HMX
* hexagon: bypass HMX by default
* hexagon: add upstream repo link to htp-ops-lib ported file headers
* hexagon: restore host buffer support
* hexagon: add HMX=1 option for the adb scripts
* hex-hmx: improve DMA pipelining
* hex-hmx: further improvements to dma pipelining
* hex-hmx: minor cleanup
* hex-hmx: move hmx lock out of inner loops/calls
* hex-hmx: remove unnecessary state and wrappers
* hex-hmx: remove hmx dir and unify f32 to f16 conversions
* hex-hmx: further unify hvx conversions
* hex-hmx: revert f16 converter to the original for now
* hex-hmx: minor cleanup for f16 to f32 converter
* hex-mm: replace incorrect fp16-to-fp32 hmx converter and reformat related code
* hex-dma: move chained dma push into hex-dma.h header and update hmx-mm
* hex-mm: use hex_is_aligned instead of a duplicated hmx_is_aligned
* hex-mm: use hvx_vec_splat_f16 in the hmx code
* hex-mm: use VLEN and HTP types in hmx-code
* hex-mm: remove duplicate QK and defs
* hexagon: pre-shuffle quants before vlut16
* hexagon: enable HMX by default
* hex-mm: code indent fixes for hmx-matmul
* hexagon: update hex-utils to include align/smin/etc helpers and use that in hmx mm
* hex-mm: more formatting fixes
* hex-mm: minor naming updates in hmx code
* hex-mm: remove leftover from rebase conflict
* Fix the incorrect indents
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
RotaryPositionEmbedding on CANN fails when src and dst share the same
non-contiguous buffer (inplace + view), because the operator overwrites
source data before it is fully read.
Add a branch that detects this case and uses contiguous temporary
buffers: copy src to temp, run ROPE into another temp, then copy back
to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1,
inplace=1).
Signed-off-by: noemotiovon <757486878@qq.com>
- Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by
padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2,
then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp).
- Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of
sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with
48 heads); fixes buffer overflow and large numerical errors in those cases.
Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.
- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback
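A minimal sketch of the scalar fallback (using log1pf and a large-x shortcut for numerical safety is an assumption, not from this commit):

```cpp
#include <cmath>

static inline float softplus_f32(float x) {
    if (x > 20.0f) {
        return x; // for x > 20, log(1 + exp(x)) == x to float precision
    }
    return log1pf(expf(x)); // log(1 + exp(x)), log1pf for accuracy near 0
}
```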
- CONT reuses the existing CPY infrastructure since making a tensor
contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
the worker pool, supporting f32 and f16 types. The kernel parallelizes
across output rows and uses memcpy for each tile.
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* kleidiai : fix MUL_MAT support for batched (3D) inputs
The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.
This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.
Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.
Fixes #20608
* kleidiai: supports_op should only return true for 3D inputs, not for 4D as well
* vulkan: avoid graphics queue on non-RADV AMD drivers
* avoid graphics queues on small GPUs
* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE
* reenable transfer queue if graphics queue is not used
* kleidiai: add data type check to get_tensor_traits
* Add a check for the F16 data type in the get_tensor_traits path for input data
not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/Q8)
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7
* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
updated kleidiai.cpp file as per suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Replace per-row/per-tile std::vector heap allocations with stack buffers
in set_rows, one_chunk, and tiled flash attention paths. Fix tiled path
to use per-head SoA extraction (matching one_chunk) instead of dequanting
the full multihead region per token.
Add mxfp_dequant_traits_t to ggml-common.h as single source of truth for
MXFP IEEE-754 reconstruction parameters. Define static const instances for
all 4 formats (E4M3, E5M2, E2M3, E3M2), ready for CUDA/Metal/Vulkan reuse.
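A sketch of the struct; the field names are assumptions, since the commit only states that it carries IEEE-754 reconstruction parameters for the four formats:

```cpp
typedef struct {
    int exp_bits;  // element exponent width
    int mant_bits; // element mantissa width
    int exp_bias;  // element exponent bias (2^(exp_bits-1) - 1)
} mxfp_dequant_traits_t;

// standard OCP MX element parameters
static const mxfp_dequant_traits_t mxfp_traits_e4m3 = { 4, 3,  7 };
static const mxfp_dequant_traits_t mxfp_traits_e5m2 = { 5, 2, 15 };
static const mxfp_dequant_traits_t mxfp_traits_e2m3 = { 2, 3,  1 };
static const mxfp_dequant_traits_t mxfp_traits_e3m2 = { 3, 2,  3 };
```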
Extract shared dequant and FP6 unpack helpers on both architectures,
replacing duplicated inline code and macros. Net -215 lines.
- Increase q_mxfp_buf from 512 to 2048 bytes (supports DK up to 1024 with MXFP8)
- Replace fixed k_soa[4096]/v_soa[4096] stack arrays with dynamically sized vectors
- Replace fixed k_head_soa[320]/v_head_soa[320] with dynamically sized vectors
- Add soa_bytes divisibility assertion in test init
- Per-head dequant: multihead MXFP now extracts only the needed head's
SoA blocks (e.g. 20 bytes for mxfp4 DK=128) into a stack buffer and
dequants DK elements, instead of dequanting all heads (nek2*DK).
For 8 KV heads this is 8x less dequant work per KV position.
- Hoist loop invariants: base pointer offsets (k_base, v_base),
per-head SoA byte offsets, and multihead row bases are computed once
per query row instead of per KV position in the inner loop.
- Precompute SoA addressing in mxfp_fa_params_init: qs_per_block,
blocks_per_head, head_qs_bytes, and head_e8m0_offset are calculated
once at init rather than derived per iteration.
- Move thread-local buffer pointers (VKQ32, V32, VKQ16, Q_q) and
v_is_f16 check outside the ir loop.
Fix potential buffer overflows flagged in PR #20609 review:
- set_rows: replace fixed float tmp[1024] with std::vector for large n_embd_k_gqa
- tiled FA: size q_mxfp_buf with ggml_row_size guard instead of fixed 1024
- one_chunk FA: pre-allocate k/v dequant buffers from mxfp.{k,v}_soa_elems
instead of hard-coded float[4096] stack arrays
- kv-cache: assert n_embd_k_gqa % qk == 0 before integer division
- test init: assert soa_bytes % block_size == 0
Normalize MXFP6 function naming to match MXFP8 convention (short form
without element format suffix): mxfp6_e2m3 → mxfp6 in all function
identifiers across 14 files. Format-specific items (type enums, traits,
lookup tables, constants) retain their _e2m3 suffix.
Add MXFP KV cache quantization for flash attention using Struct-of-Arrays
(SoA) memory layout exclusively. Three MX types: MXFP4 (E2M1), MXFP8
(E4M3), MXFP6 (E2M3), implementing the OCP Microscaling v1.0 spec.
SoA layout stores [qs contiguous][e8m0 contiguous] per row, enabling
aligned memory access patterns for GPU backends. All functions in the
flash attention pipeline — set_rows quantization, Q preprocessing, K/V
dequantization — use SoA end-to-end. The existing AoS block layout
remains for MUL_MAT weight quantization (untouched).
Q preprocessing applies Walsh-Hadamard rotation (block-32) before
quantize/dequant round-trip, distributing outlier energy across the
shared exponent group. This is essential for perplexity:
MXFP8: +0.22 PPL without rotation
MXFP6: +3.34 PPL without rotation
Hadamard is skipped for MLA models (DK != DV) where V is a view of K.
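The rotation is the standard fast Walsh-Hadamard butterfly over each 32-element group; a minimal unnormalized sketch (the repo's normalization and SIMD layout are not shown):

```cpp
// in-place Walsh-Hadamard transform over one block of 32 floats (butterfly form)
static void hadamard_block32(float * x) {
    for (int len = 1; len < 32; len <<= 1) {
        for (int i = 0; i < 32; i += 2 * len) {
            for (int j = i; j < i + len; ++j) {
                const float a = x[j];
                const float b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
        }
    }
}
```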
Shared infrastructure in ggml-common.h:
- Block structures (block_mxfp8: 33B, block_mxfp6: 25B per 32 elements; sketched after this list)
- E8M0 MSE-optimal scale search with ±1 range
- Canonical element converters (FP8 E4M3/E5M2, FP6 E2M3/E3M2)
- FP6 tight packing (4 six-bit values in 3 bytes, 25% savings)
- IEEE-754 bit reconstruction constants for SIMD backends
- SoA layout macros, portable bit cast, type property queries
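The stated sizes pin down the layouts: 32 one-byte FP8 codes plus a shared E8M0 byte give 33, and 32 six-bit FP6 codes pack into 24 bytes plus the scale for 25. A sketch with assumed field names, following the existing block_mxfp4 convention:

```cpp
#include <cstdint>

typedef struct {
    uint8_t e;      // shared E8M0 scale
    uint8_t qs[32]; // 32 x FP8 (E4M3) codes, one byte each
} block_mxfp8;      // 33 bytes per 32 elements

typedef struct {
    uint8_t e;      // shared E8M0 scale
    uint8_t qs[24]; // 32 x FP6 (E2M3) codes, 4 values packed per 3 bytes
} block_mxfp6;      // 25 bytes per 32 elements
```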
CPU implementation:
- Scalar reference + ARM NEON + x86 AVX2 optimized paths
- Both FA paths supported: one_chunk (scalar) and tiled (SIMD GEMM)
- Split-KV path extended for single-query decode
- Generic vec_dot via dequant-to-float for MUL_MAT compatibility
- Arch fallbacks for loongarch, powerpc, riscv, s390, wasm
KV cache integration:
- set_rows writes SoA with optional Hadamard (op_params[0] flag)
- K cache block-aligned to 16 for CUDA cp.async compatibility
- CLI: --cache-type-k/v with short aliases (mxfp4, mxfp6, mxfp8)
Tests:
- Flash attention: all 3 types at D=64/128, mixed K/V (mxfp8+mxfp4)
- SET_ROWS: Hadamard rotation for all types
- SoA-aware test initialization and comparison for MXFP tensors
- Quantize functions coverage for all types
Rename GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends
(CPU, OpenCL, SYCL) for consistency with the MX type family naming.