* cleanup : hoist mxfp soa functions
* fix: CI failures — CUDA __device__ init, Metal MXFP supports_op, SoA test assert
Three fixes for CI failures:
1. Remove <cmath> from CUDA/HIP/MUSA section of ggml-common.h — the include
causes NAN/INFINITY to become non-constexpr, breaking __device__ static
table initialization for the MXFP LUTs.
2. Add MXFP type guards to Metal's supports_op: MXFP8/MXFP6 have no Metal
shaders yet (reject all ops), MXFP4 has AoS shaders (MUL_MAT, GET_ROWS)
but no SoA/flash attention support yet (reject FLASH_ATTN_EXT, SET_ROWS).
3. Replace strict assert in test-backend-ops init_tensor_mxfp_soa with a
conditional fallback — when ne2 is not divisible by heads_per_region,
fall back to per-head SoA init instead of crashing.
* fix : correct guard for mxfp cpu dequant functions
* fix: CUDA MXFP LUT init and MXFP flash attention SoA test layout
- Add per-platform GGML_TABLE_NAN/GGML_TABLE_INFINITY macros for MXFP
LUTs — uses __uint_as_float on CUDA to avoid MSVC non-constexpr INFINITY
- Fix init_tensor_mxfp_soa to detect multihead SoA from tensor strides,
matching the KV cache layout for permuted flash attention tests
* fix: CUDA MXFP LUT init — use __builtin_nanf/__builtin_inff for constexpr device tables
CUDA/HIP/MUSA __device__ static tables require constexpr initializers.
Standard NAN/INFINITY macros may expand to non-constexpr expressions
(e.g. MSVC: (float)(1e+300), nvcc: __uint_as_float is not constexpr
for static init). Previous fix attempted __uint_as_float for nvcc and
__builtin_bit_cast for clang — neither worked universally.
Use __builtin_nanf("") and __builtin_inff() which are constexpr on
all target compilers (nvcc, clang for HIP/MUSA, GCC, MSVC). Define
once before the platform #if chain instead of per-platform copies.
* fix: correct E5M2 LUT precision and add converter-vs-LUT validation tests
The kvalues_mxfp8_e5m2 LUT had 50 values with insufficient decimal
precision, causing bitwise mismatches against the IEEE-754 element
converter. Regenerated from ggml_mxfp_fp8_e5m2_to_float() with %.9e
precision for exact float round-trip on all 256 entries.
Also consolidates GGML_TABLE_NAN/GGML_TABLE_INFINITY into a single
definition using __builtin_nanf/__builtin_inff (constexpr on all
target compilers), and adds LUT validation tests to test-quantize-fns
that verify all 5 MXFP element converters match their canonical LUT
values (FP4 E2M1: 16, FP6 E2M3: 64, FP6 E3M2: 64, FP8 E4M3: 256,
FP8 E5M2: 256 — 656 total values verified).
* fix: MSVC compat for GGML_TABLE_NAN/INFINITY — use builtins only on GCC/Clang/nvcc
MSVC does not support __builtin_nanf/__builtin_inff. Use standard
NAN/INFINITY macros on MSVC (which work for regular static tables),
and compiler builtins only on GCC/Clang/nvcc (needed for CUDA
__device__ table constexpr initialization).
* fix: handle nvcc+MSVC host — check __CUDACC__ before _MSC_VER for NAN/INF macros
When nvcc uses MSVC as the host compiler, both _MSC_VER and __CUDACC__
are defined. The previous fix checked _MSC_VER first, giving nvcc the
MSVC NAN/INFINITY macros which are not constexpr for __device__ tables.
Add __CUDACC__ exclusion so nvcc gets __builtin_nanf/__builtin_inff.
* cleanup: remove AoS MXFP6/MXFP8 dequant code — these types are KV-cache-only (SoA)
MXFP6 (E2M3) and MXFP8 (E4M3) exist only for KV cache flash attention,
which uses SoA (Struct-of-Arrays) layout. The AoS dequant functions
(NEON, AVX2, CPU dispatch, generic wrappers) were incorrectly added
and are dead code — no model stores weights in these formats.
Removed:
- AoS NEON dequant: dequantize_row_mxfp{6,8}_neon, _cpu dispatch
- AoS AVX2 dequant: dequantize_row_mxfp{6,8}_avx2, _cpu dispatch
- AoS generic wrappers: dequantize_row_mxfp{6,8}_cpu_generic
- AoS fallback defines in arch-fallback.h
- CPU traits .to_float entries for MXFP6/MXFP8
- MXFP6/MXFP8 from all_types[] in test-backend-ops (no AoS tests)
Kept (correct SoA code):
- All *_soa_* functions (NEON, AVX2, generic, dispatch)
- CPU traits .from_float_soa / .to_float_soa
- Flash attention and SET_ROWS Hadamard test cases
- Scalar reference dequant in ggml-quants.c (test-quantize-fns roundtrip)
- MXFP4 AoS code (upstream model weight support, untouched)
Fixes ARM64 CI failure: GET_ROWS(mxfp6_e2m3) was testing dead AoS code
that had a NEON bug. The test no longer runs because the type is
correctly excluded from AoS test paths.
* test: guard all MXFP types must have SoA traits for flash attention
All MXFP flash attention uses SoA layout exclusively. Test validates:
- ALL MXFP types (MXFP4, MXFP6, MXFP8) have from_float_soa and to_float_soa
- MXFP6/MXFP8 (KV-cache-only) do NOT have AoS CPU to_float
Prevents regression: if someone adds AoS dequant back for MXFP6/MXFP8,
or removes SoA traits from any MXFP type, CI will catch it.
* test: add Hadamard, SoA cross-check, E8M0, and layout offset tests
* test: add MXFP converter edge cases, FP6 packing, E8M0 known-answer tests
Add comprehensive tests to catch the bugs backend implementers hit most:
- Element converter edge cases: subnormals, max finite, saturation, NaN, sign
- FP6 pack/unpack exhaustive round-trip with known-answer byte verification
- E8M0 known-answer decode + HALF vs FULL scale distinction
- E8M0 rounding boundary at sqrt(2) threshold (catches floor-only bugs)
- Converter exhaustive round-trip: quantize(dequantize(i))==i for all formats
- Consolidate duplicate SoA switches into single table in test-backend-ops
* test: add AoS/SoA cross-check, Hadamard pipeline, format spec, and mxfp_rmse
- MXFP4 AoS vs SoA cross-check: two independent code paths, bitwise match
- Full Hadamard pipeline roundtrip: H→quantize→dequant→H for all 3 types
- mxfp_rmse helper: computes sqrt(sum/n), with named pipeline constants
- Block size consistency: verify QK_MXFP{4,8,6} == 32
- EMAX_OFFSET vs format max: validate constants produce valid E8M0
- Edge case LUT validation: expected_bits verified against canonical LUTs
- FP4 E2M1 exhaustive converter round-trip (16/16)
* cleanup: tighten MXFP test comments to match repo conventions
* fix: platform-specific NaN/Infinity for GPU device table initializers
FP8 E4M3/E5M2 LUTs contain NaN/Inf which cannot be constexpr-initialized
in __device__ tables on any CUDA/HIP/MUSA version. No GPU backend uses
these LUTs (they use converter functions instead), so guard them out of
GPU builds entirely. Simplify GGML_TABLE_NAN/INFINITY to CPU-only macros.
* ggml : transpose fused GDN state access for coalesced memory reads (#20436)
The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.
Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access
Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.
All GATED_DELTA_NET backend-ops tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags
- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
path lacks device support, disable both to prevent state layout mismatch
between transposed (fused) and non-transposed (unfused) formats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* llama : rever fgdn argument changes
* graph : remove GDN state transposes
* vulkan : adapt
* cuda : remove obsolete smem code
---------
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
* llama : enable chunked fused GDN path
* models : avoid Q and K repeats when using fused GDA
* cont : fix comment
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix the fix
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix
* metal : add GDN kernel (#20361)
* metal : add Metal backend for GGML_OP_GATED_DELTA_NET
Add a fused Metal kernel for the gated delta net recurrence op
(#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.
Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.
Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
tg128: 170 -> 213 t/s (+25%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* metal : validate contiguity of all input tensors in supports_op
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* metal : add algorithm equivalence comment for GDA decay path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* cont : unslop + optimize
* cont : clean-up
---------
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* CUDA: AR gated delta net improvements (#20391)
* Add FastDiv to gated_delta_net_cuda
* Shard columns across warps
This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).
* Remove unneded include in gated_delta_net.cu
* Improve comments
* Apply code-formating
* Make sharding HIP-compatible
1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
2. Add test with partial warp to test sum reduction on CUDA
* Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
* Rename variables
* Enable GDN also for prefill, move TODO for chunked_GDN
* Actually remove the TODO from 2068908975
* Get warp size at runtime
warp_size is not known at compile time in hip host code.
* Don't expose ggml_cuda_get_physical_warp_size on host
---------
Co-authored-by: uvos <devnull@uvos.xyz>
* llama : refactor llm_build_delta_net_base API
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
* WIP: add NVFP4 quantization support
* tests
* improve NVFP4 dot product implementation performance and fix bad super call
* typo
* Use nvfp4 kvalues
* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
* vulcal and perf fixes
* wip
* Fix metal
* fix vulcan
* Rename threshold & fix wrong scale
* Fix MOE
* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)
Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.
Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c
Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.
* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms
After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.
* quantize: add NVFP4 as a quantization type option
* Fix ggml_fp32_to_ue4m3: handle subnormal values
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.
Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.
Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
* Restore ARM NEON NVFP4 dot product implementation
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.
tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq
- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators
tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
* ARM NEON NVFP4: rearrange q8 to match nibble layout
Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.
Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.
* CPU only backend 64 super-block layout
* cleanup
* Remove unused LUT
* int
* exclude NVFP4 from unsupported ops in metal build
* remove quantization for now
* store scales as native UE4M3, preserve original model bits when possible
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* correct comment
* format
* reduce duplication and cleanup
* Address comments
* move detection to prepare_tensors
* Use math instead of const
* Move
* fix comment
* Shelf quantize tests
* Rebase and move check
* cleanup
* lint
* Update gguf-py/gguf/scripts/gguf_convert_endian.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use fallback quant config
* Simplify
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* organize
* Refactor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* fix return type
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* llama : add support for Nemotron 3 Super
This commit adds support for the Nemotron 3 Super model (120B.A12B)
enabling this model to be converted to GGUF format and run in llama.cpp.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Matt Clayton <156335168+mattjcly@users.noreply.github.com>
Replace GGML_ABORT("fatal error") in ggml_metal_synchronize() with
error flag + return. This aligns synchronize error handling with
graph_compute, which already returns GGML_STATUS_FAILED for the same
condition.
When a command buffer fails (e.g., iOS GPU access revocation during
backgrounding, macOS eGPU disconnect, OOM), the backend enters an
error state instead of killing the host process. Subsequent
graph_compute calls return GGML_STATUS_FAILED immediately. Recovery
requires recreating the backend.
Failed extra command buffers are properly released on the error path
to avoid Metal object leaks.
Enable mul_mv_ext small-batch kernels (BS 2-8) for BF16, Q2_K,
and Q3_K quantization types. These types previously fell through
to the slower single-row mul_mv path.
BF16 uses the float4 dequantize path (like F16). Q2_K and Q3_K
use the float4x4 K-quant path (like Q4_K/Q5_K/Q6_K).
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32
* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx
* cann: forward declaration of device context struct
* cann: move offload op check after device context declaration
* cuda: fix whitespace
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* add count equal for metal
* remove trailing whitespace
* updated doc ops table
* changed shmem to i32
* added multi tg and templating
* removed BLAS support from Metal docs
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add memset to set dst to 0
* metal : cleanup
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal: use shared buffers on eGPU
With #15906, I noticed on important regression when using metal backend on eGPU.
This commit restore the previous behavior and add an option to force its activation.
* metal: use shared buffers on eGPU
* metal: use shared buffers on eGPU
* feat: Add a batched version of ssm_conv
This was done using Claude Code. It found a number of optimizations around
how the threads were organized, resulting in a huge performance boost!
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Optimized SSM_SCAN kernel for metal
This used Claude Code and resulted in a modest performance improvement
while maintaining correctness.
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: Add test-backend-ops perf tests for SSM_CONV
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: Real representitive tests for SSM_CONV
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use function constant for ssm_conv batch size
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: backend op tests for ssm_scan from granite4 1b-h
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: remove commented out templates
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: float4 version of ssm_conv_batched
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing ggml_metal_cv_free
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Feat: Added vulkan circular tiling support
* Feat: Added cpu circular
* Feat: Added cuda kernels
* Added tests
* Added tests
* Removed non-pad operations
* Removed unneded changes
* removed backend non pad tests
* Update test-backend-ops.cpp
* Fixed comment on pad test
* removed trailing whitespace
* Removed unneded test in test-backend-ops
* Removed removed test from calls
* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp
Co-authored-by: Ruben Ortlam <picard12@live.de>
* Fixed alignment
* Formatting
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Format pad
* Format
* Clang format
* format
* format
* don't change so much stuff
* clang format and update to bool
* fix duplicates
* don't need to fix the padding
* make circular bool
* duplicate again
* rename vulkan to wrap around
* Don't need indent
* moved to const expr
* removed unneded extra line break
* More readable method calls
* Minor wording changes
* Added final newline
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Added circular pad ext tests
* Gate non circular pad devices
* Cleaned gating of non-circular pad devices
---------
Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat(wip): Port initial TRI impl from pervious work
The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove argument for constant val override
This was added in the original draft, but later removed. With this, the
kernel now passes tests.
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Move the ttype conditional to templating to avoid conditional in kernel
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Type fixes
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat: Add softplus for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add EXPM1 for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add FILL for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unused arguments
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use select instead of branch for softplus non-vec
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>