llama.cpp

Commit Graph

Author	SHA1	Message	Date
Tim Burke	aba3778ca8	Merge `ccea34ba41` into `312d870a89`	2026-03-24 05:35:01 +02:00
Aldehir Rojas	312d870a89	common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912 )	2026-03-23 22:21:47 -05:00
Max Krasnyansky	7cadbfce10	hexagon: general DMA and Binary Op fixes for large strides (#20918 ) * hex-dma: make chained dma the default to handle newer models This also includes some new instrumentation that we can remove later. * hexagon: add uint32 dump helper * hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv ssm-conv uses HVX gather instruction and that instruction cannot handle cases where the base+offset spans page boundaries. * hexagon: update ssm-conv to make base-addr compute a bit easier to read * hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB) * hex-bin: fix incorrect stride logic * hexagon: make sure repack buffs are dumped for verbose > 2 * hex-bin: consistently use dma_queue_push even for dummy dst transactions * hex-dma: start using 2d-wide mode on v75 and up This removes the need to deal with the 16-bit limitation for the strides. * hex-bin: cleanup kernel selection logic * hex-bin: cleanup binary op core and fix transposed tensor handling * snapdragon: update run-bench to use larger ubatch and fa-on	2026-03-23 15:33:49 -07:00
Max Krasnyansky	1fb2290a51	Add codeowners for scripts/snapdragon and docs/snapdragon (#20915 ) * Add codeowners for scripts/snapdragon * Also add docs/backends/snapdragon	2026-03-23 14:57:18 -07:00
lhez	1772701f99	opencl: add q6_K gemm and gemv kernels for Adreno (#20089 ) * opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code * opencl: add q6_K transpose * opencl: fix cvt kernel name * opencl: add call to q6_K gemv * opencl: fix q6_K scale transpose * opencl: fix loading for gemv q6_K, refactor * opencl: fix transpose_8_buf kernel assignment, refactor * opencl: refactor q6_K transpose * opencl: add gemm_noshuffle_q6_k_f32 * opencl: fix qh loading * opencl: refactor q6_K gemv host side, release bufs and imgs * opencl: refactor * opencl: fix q6_K dequant and scale selection * opencl: workaround compiler bug, fix dump_tensor * opencl: refactor q6_K convert kernels * opencl: unpack transformed q6_K in get_tensor * opencl: refactor, handle non-uniform workgroups * opencl: support non-vector subgroup bcast	2026-03-23 12:44:18 -07:00
las7	39bf0d3c6a	rpc : RCE patch (#20908 )	2026-03-23 19:54:57 +02:00
Xuan-Son Nguyen	bd6992180b	contrib: add "Requirements" section to PR template (#20841 ) * contrib: add "Requirements" section to PR template * typo [no ci] * use h2, add "Additional information" --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-03-23 16:59:02 +01:00
Davi Henrique Linhares	fd18364755	devops: upgraded default oneAPI version (#20731 )	2026-03-23 21:47:34 +08:00
Aleksander Grygier	11fb11b901	webui: Improve chat form positioning (#20901 )	2026-03-23 14:30:55 +01:00
Geo Maciolek	35b662bb5d	docs: Fix typo in reasoning flag documentation (#20780 ) Tested to verify - the typo is just in the docs, not the actual flag.	2026-03-23 21:24:55 +08:00
Georgi Gerganov	f93c09e267	memory : fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887 )	2026-03-23 14:08:46 +02:00
Eric Zhang	841bc203e2	docs : rerun llama-gen-docs to include new CLI args (#20892 )	2026-03-23 12:33:38 +01:00
Xuan-Son Nguyen	31a5cf4c3f	server: use httplib dynamic threads (#20817 ) * server: use httplib dynamic threads * change to n_threads_http + 1024	2026-03-23 12:22:46 +01:00
Georgi Gerganov	e32d243849	ai : update gh permissions (#20895 )	2026-03-23 13:21:41 +02:00
Pascal	c44a932cf4	webui: fix --webui-config-file settings not applied on load (#20823 ) * webui: fix --webui-config-file settings not applied on load * chore: update webui build output	2026-03-23 11:25:35 +01:00
Rashid Ul Islam	177c75852a	metal: add CONV_3D (#19927 ) * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * metal:add conv_3d backend Rebased with master and resolved conflicts. * Resolved issues related to changes in variable names * kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-23 09:45:34 +02:00
Jhen-Jie Hong	7a0b6a635e	common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859 )	2026-03-23 08:35:27 +01:00
Chenguang Li	07ff000551	CANN: add RoPE cache preload before ACL graph capture (#20747 ) ACL graph capture disallows host-to-device memcpy and device memory malloc/free on the captured stream. Pre-load the RoPE cache before capture so that: - Host-to-device copies and allocations run on the non-captured stream - Cache metadata is populated and memory pool is warmed up - During capture, only on-device computations are recorded; host-side and allocation branches are skipped	2026-03-23 15:24:06 +08:00
Dan Hoffman	cc18f965b6	fix(openvino): explicit memset in buffer_context allocation (#20857 ) * fix(openvino): explicit memset in buffer_context allocation * minor --------- Co-authored-by: Dan Hoffman <dhoffman@cyket.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-23 08:05:37 +02:00
shaofeiqi	84ffd0c192	opencl: add flattened Q4_K mv and general Q4_K mm (#20773 )	2026-03-22 22:45:11 -07:00
Tim Burke	ccea34ba41	perf : multiple fixes and enhancements, remove MSE search, expand test coverage * fix: correct tiled flash attention SoA pointer math for multihead MXFP The cleanup refactoring (`c919bc471`) extracted mxfp_dequant_head as a shared helper but failed to update the tiled path's data pointers. The helper expects the full SoA row base (no per-head offset), but the tiled path was passing a pointer that already included ik2nbk2, causing a double head offset that produced NaN during prefill. Add mxfp_row_ptr helper to centralize the multihead-aware pointer calculation across both one_chunk and tiled paths. Verified with 16-chunk perplexity on gpt-oss-20b: all four configs (f16, mxfp4, mxfp6, mxfp8) produce exact matches with the known-good commit (`23e88631c`). perf: reduce E8M0 MSE search range from ±2 to ±1 The base estimate round(log2(amax)) is always within 1 step of optimal. Empirically verified across 30K blocks and 6 distributions: ±1 and ±2 never disagree. This reduces the scale search from 5 to 3 candidates (40% fewer inner loop iterations) with zero quality impact. * perf: eliminate redundant work in MXFP quantize and flash attention - mse_error_mxfp4: use passed inv_scale instead of recomputing 1/d - mxfp_compute_e8m0_mse: hoist loop-invariant traits branch out of inner loop - tiled V path: dequant directly to V32 tile, remove intermediate memcpy and dead buffer * cleanup: fix comments, unify Hadamard condition, simplify E8M0 helpers - EMAX_OFFSET comments: fix ceil/floor labels to match actual values - Hadamard flag: unify write path (llama-kv-cache.cpp) and read path (ops.cpp) to both use DK==DV condition instead of is_mla() - E8M0 helpers in ggml-impl.h: simplify to match ggml-common.h style, add cross-reference comment * fix: MXFP8/6 flash attention tests crash on init The view base tensors for K/V don't get named "k"/"v" but inherit the MXFP type. 
The name-based filter in initialize_tensors missed them, falling through to init_tensor_uniform which calls quantize_chunk and aborts for KV-cache-only types. Fix by checking ggml_is_type_mxfp() for all tensors, matching the pattern set_rows tests already use. * test: expand MXFP set_rows coverage - Add MXFP8/MXFP6 to all_types for non-Hadamard set_rows coverage - Expand Hadamard set_rows tests: add views, broadcast, and multi-head configs - Coverage: 18 → 51 MXFP set_rows tests * perf: add AVX2 Hadamard for x86 (matches existing ARM NEON path) * cleanup: DRY MXFP4 quantize/dequant with shared per-block helpers Extract quantize_block_mxfp4 and dequantize_block_mxfp4 as shared helpers used by both AoS (quantize_row_mxfp4_ref, dequantize_row_mxfp4) and SoA (quantize_row_mxfp4_soa, dequantize_row_mxfp4_soa) paths. Eliminates duplicated per-block logic while keeping layout-specific pointer arithmetic in the callers. * feat: add MXFP8/MXFP6 AoS quantize/dequant (full type support) Extract quantize_block_mxfp / dequantize_block_mxfp per-block helpers from the SoA generic impl and use them to build AoS row functions for MXFP8 (E4M3) and MXFP6 (E2M3). Register to_float and from_float_ref in type traits, add quantize_chunk dispatch, replacing the GGML_ABORT. MXFP8 and MXFP6 are no longer KV-cache-only — they can now be used as general quantization types. The SoA impl is also DRY'd to delegate to the same per-block helpers. * cleanup: remove dead soa_elems field from mxfp_kv_params Computed but never read — leftover from an earlier design. * feat: add MXFP8/MXFP6 vec_dot and full CPU type support Add scalar vec_dot_mxfp8_q8_0 and vec_dot_mxfp6_q8_0 implementations, register from_float + vec_dot + vec_dot_type in CPU traits, and add fallback remaps for all architectures. MXFP8/6 are now fully tested: AoS quantization error, reference match, and dot product accuracy all pass in test-quantize-fns. 
* perf: remove E8M0 MSE search — base estimate is perplexity-optimal The MSE search over ±1 candidates around round(log2(amax)) was found to HURT perplexity by 4-37 PPL points across all MXFP configs on gpt-oss-20b. The base estimate alone (no search) produces better attention patterns because minimizing per-block reconstruction error is not the same as minimizing attention score distortion through softmax. Removes mse_error_mxfp4, mse_error field from traits, MSE_RANGE constant, and the entire search loop. E8M0 computation is now a single amax scan + integer bit extraction — no inner loop, no function pointers. This also simplifies future GPU/Metal implementations. * perf: fuse Hadamard rotation into SoA quantize (one pass, no temp buffer) Add quantize_row_mxfp{4,8,6}_soa_hadamard that apply Hadamard and quantize block-by-block with a 32-float stack buffer. Eliminates the std::vector heap allocation and 2 extra memory passes over the full row. set_rows now dispatches to the fused path when Hadamard is enabled, falling through to the unfused quantize for non-Hadamard types. This pattern maps directly to a CUDA kernel: global memory read → register Hadamard → register quantize → global memory write. * cleanup: consistent MXFP type names and variable naming - Rename type_name "mxfp8_e4m3" → "mxfp8", "mxfp6_e2m3" → "mxfp6" to match "mxfp4". Only one variant of each exists — the suffix was unnecessary disambiguation that implied alternatives. - Remove redundant MXFP shortcuts from arg.cpp (fallback loop handles all types via ggml_type_name matching). - Rename kv_is_f32_f16_or_mxfp → k_is_f32_f16_or_mxfp (only checks K). * perf: fuse Q preprocessing round-trip (no SoA buffer needed) Add mxfp{4,8,6}_hadamard_roundtrip and mxfp{4,8,6}_roundtrip functions that apply quantization error to float values without materializing SoA bytes. 
Replaces the 3-step Q preprocessing (Hadamard → quantize to SoA buffer → dequant from SoA buffer) with a single pass through per-block round-trip helpers. Eliminates the Q_q intermediate buffer and two function pointer calls from the flash attention hot path. Maps directly to CUDA: Q stays in registers, Hadamard butterfly + quantize error applied in-place. * fix: clamp E8M0 = 255 to 254 in decode (fixes CI NaN failures) E8M0 = 255 means NaN per MX spec, but our encode path already clamps to 254. When test data contains random E8M0 = 255 bytes, the decode produces Inf, and Inf * 0.0 = NaN, causing GET_ROWS and CPY tests to fail on MXFP6 (and potentially MXFP4/8). Fix: clamp 255 → 254 in both E8M0 decode functions: - ggml_e8m0_to_fp32 / ggml_e8m0_to_fp32_half (ggml-impl.h) - ggml_mxfp_e8m0_to_fp32 / ggml_mxfp_e8m0_to_fp32_half (ggml-common.h) These are unfortunately duplicated across two headers because ggml-common.h compiles for CUDA (__device__) while ggml-impl.h serves CPU-only callers that don't include ggml-common.h.	2026-03-22 20:12:09 -04:00
bssrdf	ec2b787ebe	mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847 ) * added support for internvl's dynamic high-resolution (Qianfan-OCR needed) * add min/max dynamic patch to gguf meta * clean up * simplified handling min/max dynamic patch * reuse llava_uhd logic for slice images * provide default values for older models * flake8 * prevent writing 0 value to gguf * remove duplicated resolution candidates with a better algorithm * fix indentation * format * add protection from divide by zero * change to 0 to be safe --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-03-23 01:06:30 +01:00
DorianRudolph	d3ac030a5d	mtmd : fix LightOnOCR image preprocessing (#20877 )	2026-03-23 01:04:14 +01:00
Xuan-Son Nguyen	49bfddeca1	server: allow router to report child instances sleep status (#20849 ) * server: allow router to report child instances sleep status * refactor * move sleeping to state * nits	2026-03-22 18:33:52 +01:00
Johannes Gäßler	bd3f1d9d65	CUDA: fix BF16 FA compilation (#20865 )	2026-03-22 17:53:33 +01:00
Sigbjørn Skjæret	23c9182ce8	jinja : refactor token advancement (#20864 ) * refactor token advancement * exercise sub-expressions	2026-03-22 17:45:10 +01:00
Evgeny Kurnevsky	81bc4d3ddc	server: fix Host header (#20843 ) It should include port when it's not default.	2026-03-22 22:29:22 +08:00
Neo Zhang	f40a80b4f3	support bf16 and quantized type (#20803 )	2026-03-22 22:06:27 +08:00
Patrick Buckley	db9d8aa428	ggml-cuda: native bf16 flash attention for vec kernel (#20525 ) * ggml-cuda: native bf16 flash attention for vec and tile kernels mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo * ggml-cuda: address code owner review feedback reverted tile kernel changes to avoid larger refactor * fix ci failures on turing and hip * fix bf16 vec kernel compile on hip v_dot2 platforms * add comments --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-22 11:05:51 +01:00
Gaurav Garg	ccb87fa3ee	[CUDA] Increase number of output elements per-thread block if the K-dimension is small (#20635 ) * Increase per-thread work if the K-dimension is small With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MOEs. For example, Qwen3-30b-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices. This change increases the number of output elements per block for such cases. * Limit this change to ncols_dst = 1 * tab to space	2026-03-22 16:49:35 +08:00
Tim Burke	c919bc471b	cleanup : remove unused untested code and improve consistency * cleanup: consolidate MXFP type aliases, fix SoA linker bug on 5 platforms - Add GGML_TYPE_MXFP8 and GGML_TYPE_MXFP6 short aliases (matching existing GGML_TYPE_MXFP4 pattern) and use short names consistently throughout the codebase instead of mixing long/short forms. - Fix missing SoA dequant symbols (dequantize_row_mxfp{4,8,6}_soa_cpu) on loongarch, powerpc, riscv, s390, and wasm by adding proper aliases to each arch section in arch-fallback.h. Previously these were only defined under GGML_CPU_GENERIC, causing linker failures on those platforms when using MXFP flash attention. - Remove 10 files from the PR diff: - 5 arch stub files replaced by arch-fallback.h aliases - 5 rename-only files (sycl, opencl, repack, llama-quant) reverted since the GGML_TYPE_MXFP4 compat alias handles them * cleanup: DRY FP6 unpack, extract mxfp_kv_params + mxfp_dequant_head helper - FP6 unpack: x86 and ARM SIMD versions now call ggml_mxfp_unpack_fp6x4() from ggml-common.h instead of duplicating the scalar bit manipulation. - Extract mxfp_kv_params sub-struct from mxfp_fa_params: the 7 symmetric K/V fields (dequantize, multihead, soa_elems, qs_per_block, head_qs_bytes, head_e8m0_offset, blocks_per_head) are now in a reusable struct accessed as mxfp.k and mxfp.v. - Add mxfp_dequant_head() helper: replaces 4 instances of the multihead SoA extraction pattern (2x memcpy + dequant, with multihead/single-head branching) with a single function call. Future backends get the pattern for free. * cleanup: extract mxfp_kv_params_init to DRY the K/V init blocks The K and V initialization in mxfp_fa_params_init were structurally identical 10-line blocks differing only by tensor/dimension. Extract into mxfp_kv_params_init(type, D, nb2, ne2) so future MXFP formats get the multihead SoA addressing logic automatically. 
* cleanup: generic MSE round-trip, replace magic buffer sizes with constants - Remove mse_error_fp8_e4m3 and mse_error_fp6_e2m3: these were identical round-trip functions differing only by converter. mxfp_compute_e8m0_mse now uses to_elem/to_float directly when mse_error is NULL (FP8/FP6). MXFP4 keeps its custom decision-tree MSE. New formats get MSE for free by just setting to_elem/to_float in their traits. - Replace magic 1024/1088 buffer sizes in flash attention with named constants MXFP_FA_MAX_D and MXFP_FA_SOA_BUF. One place to change if max head dimension grows. * cleanup: remove dead AoS vec_dot for MXFP8/MXFP6, unify SoA impls MXFP8 and MXFP6 are KV-cache-only types that use SoA layout for flash attention. The AoS vec_dot functions (scalar generic, AVX2, NEON) were dead code — no matmul path uses them. Removed: - ggml_vec_dot_mxfp{8,6}_q8_0 from scalar, x86, ARM, quants.h - ggml_vec_dot_mxfp_q8_0_impl shared helper - arch-fallback.h aliases for vec_dot mxfp8/mxfp6 (12 lines) - vec_dot/vec_dot_type registration in ggml-cpu.c Also unified SoA quantize/dequant: the separate mxfp8_soa_impl and mxfp6_soa_impl functions (4 functions, ~80 lines) are replaced by two generic functions (quantize_row_mxfp_soa_impl, dequantize_row_mxfp_soa_impl) that use traits->bits_per_elem and traits->qs_per_block to handle both byte-aligned (FP8) and 6-bit packed (FP6) formats. New MXFP formats get SoA for free by setting these trait fields. * cleanup: remove all AoS MXFP8/MXFP6 quantize/dequant — SoA only MXFP8 and MXFP6 are KV-cache-only types. All quantization and dequantization goes through the SoA (Struct-of-Arrays) path for flash attention. The AoS (block_mxfp8/block_mxfp6 struct) implementations were dead code that should never have been added. 
Removed: - quantize_row_mxfp{8,6}_impl, dequantize_row_mxfp{8,6}_impl - quantize_row_mxfp{8,6}_ref, dequantize_row_mxfp{8,6} - quantize_mxfp{8,6} (ggml_quantize_chunk wrappers) - All declarations from ggml-quants.h and quants.h - to_float/from_float_ref registrations from ggml.c type traits - from_float registration from ggml-cpu.c CPU traits Block struct definitions (block_mxfp8, block_mxfp6) are retained for sizeof() in type traits and validate_row_data. * cleanup: fail fast in ggml_quantize_chunk for KV-cache-only types Add explicit GGML_ABORT for MXFP8/MXFP6 in ggml_quantize_chunk — these are KV-cache-only types that use SoA layout via from_float_soa. Attempting AoS quantization through this entry point is a bug.	2026-03-22 02:44:56 -04:00
Tim Burke	0e3304fbca	Merge remote-tracking branch 'origin/master' into mxfp-flash-attention	2026-03-22 02:14:05 -04:00
Tim Burke	ad2fa9035a	test : add testing and fixes * cleanup : hoist mxfp soa functions * fix: CI failures — CUDA __device__ init, Metal MXFP supports_op, SoA test assert Three fixes for CI failures: 1. Remove <cmath> from CUDA/HIP/MUSA section of ggml-common.h — the include causes NAN/INFINITY to become non-constexpr, breaking __device__ static table initialization for the MXFP LUTs. 2. Add MXFP type guards to Metal's supports_op: MXFP8/MXFP6 have no Metal shaders yet (reject all ops), MXFP4 has AoS shaders (MUL_MAT, GET_ROWS) but no SoA/flash attention support yet (reject FLASH_ATTN_EXT, SET_ROWS). 3. Replace strict assert in test-backend-ops init_tensor_mxfp_soa with a conditional fallback — when ne2 is not divisible by heads_per_region, fall back to per-head SoA init instead of crashing. * fix : correct guard for mxfp cpu dequant functions * fix: CUDA MXFP LUT init and MXFP flash attention SoA test layout - Add per-platform GGML_TABLE_NAN/GGML_TABLE_INFINITY macros for MXFP LUTs — uses __uint_as_float on CUDA to avoid MSVC non-constexpr INFINITY - Fix init_tensor_mxfp_soa to detect multihead SoA from tensor strides, matching the KV cache layout for permuted flash attention tests * fix: CUDA MXFP LUT init — use __builtin_nanf/__builtin_inff for constexpr device tables CUDA/HIP/MUSA __device__ static tables require constexpr initializers. Standard NAN/INFINITY macros may expand to non-constexpr expressions (e.g. MSVC: (float)(1e+300), nvcc: __uint_as_float is not constexpr for static init). Previous fix attempted __uint_as_float for nvcc and __builtin_bit_cast for clang — neither worked universally. Use __builtin_nanf("") and __builtin_inff() which are constexpr on all target compilers (nvcc, clang for HIP/MUSA, GCC, MSVC). Define once before the platform #if chain instead of per-platform copies. 
* fix: correct E5M2 LUT precision and add converter-vs-LUT validation tests The kvalues_mxfp8_e5m2 LUT had 50 values with insufficient decimal precision, causing bitwise mismatches against the IEEE-754 element converter. Regenerated from ggml_mxfp_fp8_e5m2_to_float() with %.9e precision for exact float round-trip on all 256 entries. Also consolidates GGML_TABLE_NAN/GGML_TABLE_INFINITY into a single definition using __builtin_nanf/__builtin_inff (constexpr on all target compilers), and adds LUT validation tests to test-quantize-fns that verify all 5 MXFP element converters match their canonical LUT values (FP4 E2M1: 16, FP6 E2M3: 64, FP6 E3M2: 64, FP8 E4M3: 256, FP8 E5M2: 256 — 656 total values verified). * fix: MSVC compat for GGML_TABLE_NAN/INFINITY — use builtins only on GCC/Clang/nvcc MSVC does not support __builtin_nanf/__builtin_inff. Use standard NAN/INFINITY macros on MSVC (which work for regular static tables), and compiler builtins only on GCC/Clang/nvcc (needed for CUDA __device__ table constexpr initialization). * fix: handle nvcc+MSVC host — check __CUDACC__ before _MSC_VER for NAN/INF macros When nvcc uses MSVC as the host compiler, both _MSC_VER and __CUDACC__ are defined. The previous fix checked _MSC_VER first, giving nvcc the MSVC NAN/INFINITY macros which are not constexpr for __device__ tables. Add __CUDACC__ exclusion so nvcc gets __builtin_nanf/__builtin_inff. * cleanup: remove AoS MXFP6/MXFP8 dequant code — these types are KV-cache-only (SoA) MXFP6 (E2M3) and MXFP8 (E4M3) exist only for KV cache flash attention, which uses SoA (Struct-of-Arrays) layout. The AoS dequant functions (NEON, AVX2, CPU dispatch, generic wrappers) were incorrectly added and are dead code — no model stores weights in these formats. 
Removed: - AoS NEON dequant: dequantize_row_mxfp{6,8}_neon, _cpu dispatch - AoS AVX2 dequant: dequantize_row_mxfp{6,8}_avx2, _cpu dispatch - AoS generic wrappers: dequantize_row_mxfp{6,8}_cpu_generic - AoS fallback defines in arch-fallback.h - CPU traits .to_float entries for MXFP6/MXFP8 - MXFP6/MXFP8 from all_types[] in test-backend-ops (no AoS tests) Kept (correct SoA code): - All _soa_ functions (NEON, AVX2, generic, dispatch) - CPU traits .from_float_soa / .to_float_soa - Flash attention and SET_ROWS Hadamard test cases - Scalar reference dequant in ggml-quants.c (test-quantize-fns roundtrip) - MXFP4 AoS code (upstream model weight support, untouched) Fixes ARM64 CI failure: GET_ROWS(mxfp6_e2m3) was testing dead AoS code that had a NEON bug. The test no longer runs because the type is correctly excluded from AoS test paths. * test: guard all MXFP types must have SoA traits for flash attention All MXFP flash attention uses SoA layout exclusively. Test validates: - ALL MXFP types (MXFP4, MXFP6, MXFP8) have from_float_soa and to_float_soa - MXFP6/MXFP8 (KV-cache-only) do NOT have AoS CPU to_float Prevents regression: if someone adds AoS dequant back for MXFP6/MXFP8, or removes SoA traits from any MXFP type, CI will catch it. 
* test: add Hadamard, SoA cross-check, E8M0, and layout offset tests * test: add MXFP converter edge cases, FP6 packing, E8M0 known-answer tests Add comprehensive tests to catch the bugs backend implementers hit most: - Element converter edge cases: subnormals, max finite, saturation, NaN, sign - FP6 pack/unpack exhaustive round-trip with known-answer byte verification - E8M0 known-answer decode + HALF vs FULL scale distinction - E8M0 rounding boundary at sqrt(2) threshold (catches floor-only bugs) - Converter exhaustive round-trip: quantize(dequantize(i))==i for all formats - Consolidate duplicate SoA switches into single table in test-backend-ops * test: add AoS/SoA cross-check, Hadamard pipeline, format spec, and mxfp_rmse - MXFP4 AoS vs SoA cross-check: two independent code paths, bitwise match - Full Hadamard pipeline roundtrip: H→quantize→dequant→H for all 3 types - mxfp_rmse helper: computes sqrt(sum/n), with named pipeline constants - Block size consistency: verify QK_MXFP{4,8,6} == 32 - EMAX_OFFSET vs format max: validate constants produce valid E8M0 - Edge case LUT validation: expected_bits verified against canonical LUTs - FP4 E2M1 exhaustive converter round-trip (16/16) * cleanup: tighten MXFP test comments to match repo conventions * fix: platform-specific NaN/Infinity for GPU device table initializers FP8 E4M3/E5M2 LUTs contain NaN/Inf which cannot be constexpr-initialized in __device__ tables on any CUDA/HIP/MUSA version. No GPU backend uses these LUTs (they use converter functions instead), so guard them out of GPU builds entirely. Simplify GGML_TABLE_NAN/INFINITY to CPU-only macros.	2026-03-22 01:07:55 -04:00
ddh0	3306dbaef7	misc : prefer ggml-org models in docs and examples (#20827 ) * misc : prefer ggml-org models in docs and examples Prefer referring to known-good quantizations under ggml-org rather than 3rd-party uploaders. * remove accidentally committed file	2026-03-21 22:00:26 +01:00
Tim Burke	dd263ff567	mxfp traits : ensure mxfp soa quant and dequant functions are tested	2026-03-21 15:09:49 -04:00
Tim Burke	5bb05ed21c	Comment consistency pass and cleanup.	2026-03-21 13:52:54 -04:00
Andrea Arcangeli	990e4d9698	common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604 ) * grammar: add test case for nullable symbol loop Reproduce stack overflow (or OOM) with ( [x]* )* found while adding GBNF support to ripgrep-edit. llama-server reproducer: curl \ -X POST \ -d '{ "messages": [{ "role": "user", "content": "write yes" }], "grammar": "root ::= ( [x]* )" }' \ -H "Content-Type: application/json" \ http://localhost:8811/v1/chat/completions grammar: prevent stack overflow with nullable symbol loop Fix a potential stack overflow in llama_grammar_advance_stack that could occur when processing grammars with nullable symbols that lead to infinite derivations of empty strings. The fix introduces cycle detection by tracking visited stacks to prevent infinite recursion. rg-edit regexp: llama_grammar_advance_stack rg-edit extra-args: -A20 rg-edit directive: """Rewrite: fix the following segfault: [..] ⚫ Testing segfault. Grammar: root ::= ( [x]* )* root ::= ( [x]* )* Segmentation fault build/bin/test-grammar-integration""" gptel-context: (("~/llama.cpp/src/llama-grammar.cpp") ("~/llama.cpp/tests/test-grammar-integration.cpp") ("~/llama.cpp/grammars/./list.gbnf") ("~/llama.cpp/grammars/./json_arr.gbnf") ("~/llama.cpp/grammars/./json.gbnf") ("~/llama.cpp/grammars/./japanese.gbnf") ("~/llama.cpp/grammars/./english.gbnf") ("~/llama.cpp/grammars/./chess.gbnf") ("~/llama.cpp/grammars/./c.gbnf") ("~/llama.cpp/grammars/./arithmetic.gbnf") ("~/llama.cpp/grammars/./README.md")) * grammar: convert recursive llama_grammar_advance_stack to iterative This change converts the function to an iterative approach using explicit stacks, which prevents deep recursion and eliminates the risk of stack overflow. rg-edit regexp: llama_grammar_advance_stack rg-edit extra-args: -A30 rg-edit directive: """Rewrite: fix the following segfault: [..] ⚫ Testing segfault. 
Grammar: root ::= ( [x]* )* root ::= ( [x]* )* Segmentation fault build/bin/test-grammar-integration convert from recursive to interactive""" gptel-context: (("~/llama.cpp/src/llama-grammar.cpp") ("~/llama.cpp/tests/test-grammar-integration.cpp") ("~/llama.cpp/grammars/./list.gbnf") ("~/llama.cpp/grammars/./json_arr.gbnf") ("~/llama.cpp/grammars/./json.gbnf") ("~/llama.cpp/grammars/./japanese.gbnf") ("~/llama.cpp/grammars/./english.gbnf") ("~/llama.cpp/grammars/./chess.gbnf") ("~/llama.cpp/grammars/./c.gbnf") ("~/llama.cpp/grammars/./arithmetic.gbnf") ("~/llama.cpp/grammars/./README.md")) v2: Added a `std::set` to perform tree-based lookups with O(N log N) complexity. Testing with a parallel run of `test-grammar-integration` shows a double-digit percentage increase in runtime. An `unordered_set` with O(1) hashing was also evaluated, but the overhead of constructing hash keys from pointers made it significantly slower than the rbtree implementation that only requires an ordering operator. The performance regression in the test suite appears justified by the overall reduction in algorithmic complexity. Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> * grammar: add test case for hang in repetition grammar processing This commit adds a new test case to the grammar integration tests that specifically targets a hang scenario in the repetition grammar parser found while adding GBNF support to ripgrep-edit. llama-server reproducer: curl \ -X POST \ -d '{ "messages": [{ "role": "user", "content": "write yes" }], "grammar": "root ::= (([^x]){0,99}){0,99}" }' \ -H "Content-Type: application/json" \ http://localhost:8811/v1/chat/completions grammar: add repetition threshold check The change introduces a maximum repetition threshold to avoid excessive rule expansion during grammar parsing. 
When parsing repetition patterns like {m,n}, the parser now calculates the potential number of rules that would be generated and throws an error if the product of previous rules and new rules exceeds the threshold. A test case was added to verify the threshold is properly enforced for deeply nested repetition patterns that would otherwise cause hangs.	2026-03-21 18:43:35 +01:00
Tom Hillbrunner	212f4521b0	context : use n_embd_out for pooled embedding extraction (#20840 ) The MEAN/CLS/LAST pooling paths in encode() and decode() used n_embd_inp() (16384 for qwen3vl with deepstack) to read from the pooled embedding tensor, which only has n_embd_out() (4096) floats per sequence. This caused a tensor read out of bounds assertion. Fixes embedding mode for Qwen3-VL-Embedding models.	2026-03-21 19:35:00 +02:00
Xuan-Son Nguyen	568aec82d2	docs : explicit about banning accounts that violate policy (#19593 )	2026-03-21 15:50:16 +01:00
y198	2bcdddd5e3	fix(rpc): prevent division by zero in deserialize_tensor (#20712 ) rpc : prevent division by zero in deserialize_tensor When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server. This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0. (Note: This was originally reported via Security Advisory and maintainer suggested dropping a patch here). * style: remove trailing whitespace	2026-03-21 15:59:43 +02:00
Michael Wand	eac9c6ea83	Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730 ) * Corrected convert script for NVFP4 naming and updated gguf constants * Add mostly_MXFP4 to FileType Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * simplify * set initial value [no ci] --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-21 13:35:21 +02:00
Sigbjørn Skjæret	29b28a9824	ci : switch from pyright to ty (#20826 ) * type fixes * switch to ty * tweak rules * tweak more rules * more tweaks * final tweak * use common import-not-found rule	2026-03-21 08:54:34 +01:00
Matt Corallo	cea560f483	Add shader count for Intel Arc Pro B60 (#20818 )	2026-03-21 05:22:51 +01:00
Tim Burke	23e88631c4	fix: gate tiled GEMM and split-KV paths to preserve q8_0/q4_0 vec_dot semantics	2026-03-20 23:40:15 -04:00
Piotr Wilkin (ilintar)	b1c70e2e54	common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825 )	2026-03-21 00:19:04 +01:00
shalinib-ibm	e6ec21e62f	ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791 ) Explicitly mark save_acc and add_save_Acc with always_inline in tinyBLAS_PPC. This ensures the compiler keeps MMA accumulator disassembly within the kernel's register context, preventing unnecessary stack spills. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2026-03-21 07:11:45 +08:00
Tim Burke	358bd71b52	Merge branch 'master' into mxfp-flash-attention	2026-03-20 18:49:51 -04:00
Georgi Gerganov	4cb7e0bd61	ai : limit runtime of the agent (#20816 )	2026-03-20 20:31:25 +02:00
James O'Leary	149b2493c0	common : fix typo in debug log ('extracft' -> 'extract') (#20807 )	2026-03-20 18:23:18 +01:00
Georgi Gerganov	b31b30f31d	ai : do not run bash commands in the prompt (#20810 )	2026-03-20 19:06:33 +02:00
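
The E8M0 decode clamp described in commit ccea34ba41 ("clamp E8M0 = 255 to 254 in decode") can be illustrated with a minimal sketch. It assumes the standard MX convention that an E8M0 byte e encodes the scale 2^(e - 127), with e = 255 reserved for NaN. The function names `e8m0_decode_raw` and `e8m0_decode_clamped` are hypothetical and are not the ggml helpers named in the commit; the sketch only shows why an unclamped 255 yields Inf (and then NaN once multiplied by zero), and how clamping avoids that.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical illustration only; not the ggml implementation.
// Unclamped decode: treats every byte as a plain biased exponent.
// For e == 255 this asks for 2^128, which overflows float to +Inf.
static float e8m0_decode_raw(uint8_t e) {
    return std::ldexp(1.0f, static_cast<int>(e) - 127);
}

// Clamped decode, following the commit's description: 255 is mapped
// to 254 so the decoded scale is always finite (at most 2^127).
static float e8m0_decode_clamped(uint8_t e) {
    const int biased = (e == 255) ? 254 : static_cast<int>(e);
    return std::ldexp(1.0f, biased - 127);
}
```

With random test bytes, the raw decode of 255 gives Inf, and Inf multiplied by a zero element gives NaN, which is the failure mode the commit message reports; the clamped version returns the same value for 254 and 255.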

8513 Commits