llama.cpp/tests
Tim Burke ad2fa9035a
test : add testing and fixes
* cleanup: hoist mxfp soa functions

* fix: CI failures — CUDA __device__ init, Metal MXFP supports_op, SoA test assert

Three fixes for CI failures:

1. Remove <cmath> from the CUDA/HIP/MUSA section of ggml-common.h: the
   include causes NAN/INFINITY to become non-constexpr, breaking
   __device__ static table initialization for the MXFP LUTs.

2. Add MXFP type guards to Metal's supports_op: MXFP8/MXFP6 have no
   Metal shaders yet (reject all ops), while MXFP4 has AoS shaders
   (MUL_MAT, GET_ROWS) but no SoA/flash attention support yet (reject
   FLASH_ATTN_EXT, SET_ROWS); see the sketch after this list.

3. Replace the strict assert in test-backend-ops init_tensor_mxfp_soa
   with a conditional fallback: when ne2 is not divisible by
   heads_per_region, fall back to per-head SoA init instead of crashing.
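
A minimal sketch of fix 2's guard, assuming this branch's type names
(GGML_TYPE_MXFP6_E2M3 and GGML_TYPE_MXFP8_E4M3 are hypothetical here;
GGML_TYPE_MXFP4 and the op enums are upstream):

    static bool supports_mxfp_op(ggml_type type, ggml_op op) {
        switch (type) {
            case GGML_TYPE_MXFP6_E2M3:
            case GGML_TYPE_MXFP8_E4M3:
                return false; // no Metal shaders yet
            case GGML_TYPE_MXFP4:
                // AoS shaders only: no SoA/flash attention support yet
                return op != GGML_OP_FLASH_ATTN_EXT &&
                       op != GGML_OP_SET_ROWS;
            default:
                return true; // non-MXFP types: defer to existing checks
        }
    }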

* fix: correct guard for mxfp cpu dequant functions

* fix: CUDA MXFP LUT init and MXFP flash attention SoA test layout

- Add per-platform GGML_TABLE_NAN/GGML_TABLE_INFINITY macros for MXFP
  LUTs — uses __uint_as_float on CUDA to avoid MSVC non-constexpr INFINITY
- Fix init_tensor_mxfp_soa to detect multihead SoA from tensor strides,
  matching the KV cache layout for permuted flash attention tests
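
One plausible shape of the stride check (illustrative only; the real
test reads the strides set up by the KV cache layout):

    // a multihead SoA region packs several heads per row, so the row
    // stride no longer equals the contiguous single-head row size
    static bool is_multihead_soa(const ggml_tensor * t) {
        return t->nb[1] != ggml_row_size(t->type, t->ne[0]);
    }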

* fix: CUDA MXFP LUT init — use __builtin_nanf/__builtin_inff for constexpr device tables

CUDA/HIP/MUSA __device__ static tables require constexpr initializers.
The standard NAN/INFINITY macros may expand to non-constexpr
expressions (e.g. MSVC expands INFINITY to (float)(1e+300), and nvcc's
__uint_as_float is not constexpr for static init). The previous fix
attempted __uint_as_float for nvcc and __builtin_bit_cast for clang;
neither worked universally.

Use __builtin_nanf("") and __builtin_inff() which are constexpr on
all target compilers (nvcc, clang for HIP/MUSA, GCC, MSVC). Define
once before the platform #if chain instead of per-platform copies.
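
Concretely, the single shared definition:

    #define GGML_TABLE_NAN      __builtin_nanf("")
    #define GGML_TABLE_INFINITY __builtin_inff()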

* fix: correct E5M2 LUT precision and add converter-vs-LUT validation tests

The kvalues_mxfp8_e5m2 LUT had 50 values with insufficient decimal
precision, causing bitwise mismatches against the IEEE-754 element
converter. Regenerated from ggml_mxfp_fp8_e5m2_to_float() with %.9e
precision for exact float round-trip on all 256 entries.
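
A hypothetical one-off generator fragment in the spirit of that
regeneration (the converter's signature is assumed; the NaN/Inf
entries would still need hand-patching):

    // print all 256 entries with enough digits for an exact
    // float round-trip
    for (int i = 0; i < 256; ++i) {
        printf("%.9ef,%c", ggml_mxfp_fp8_e5m2_to_float((uint8_t) i),
               i % 8 == 7 ? '\n' : ' ');
    }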

Also consolidates GGML_TABLE_NAN/GGML_TABLE_INFINITY into a single
definition using __builtin_nanf/__builtin_inff (constexpr on all
target compilers), and adds LUT validation tests to test-quantize-fns
that verify all 5 MXFP element converters match their canonical LUT
values (FP4 E2M1: 16, FP6 E2M3: 64, FP6 E3M2: 64, FP8 E4M3: 256,
FP8 E5M2: 256 — 656 total values verified).
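
A minimal sketch of one such check; the compare is bitwise so the NaN
entries still validate:

    for (int i = 0; i < 256; ++i) {
        const float ref = ggml_mxfp_fp8_e5m2_to_float((uint8_t) i);
        // memcmp, not ==: NaN != NaN would reject correct entries
        GGML_ASSERT(memcmp(&ref, &kvalues_mxfp8_e5m2[i],
                           sizeof(float)) == 0);
    }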

* fix: MSVC compat for GGML_TABLE_NAN/INFINITY — use builtins only on GCC/Clang/nvcc

MSVC does not support __builtin_nanf/__builtin_inff. Use standard
NAN/INFINITY macros on MSVC (which work for regular static tables),
and compiler builtins only on GCC/Clang/nvcc (needed for CUDA
__device__ table constexpr initialization).

* fix: handle nvcc+MSVC host — check __CUDACC__ before _MSC_VER for NAN/INF macros

When nvcc uses MSVC as the host compiler, both _MSC_VER and __CUDACC__
are defined. The previous fix checked _MSC_VER first, giving nvcc the
MSVC NAN/INFINITY macros, which are not constexpr for __device__
tables. Add a __CUDACC__ exclusion so nvcc gets
__builtin_nanf/__builtin_inff.
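
The resulting shape of the chain (a sketch, not the verbatim header):

    #if defined(_MSC_VER) && !defined(__CUDACC__) && !defined(__clang__)
    // host-only MSVC: the standard macros work for regular static tables
    #define GGML_TABLE_NAN      NAN
    #define GGML_TABLE_INFINITY INFINITY
    #else
    // GCC/Clang/nvcc: constexpr builtins, valid in __device__ tables
    #define GGML_TABLE_NAN      __builtin_nanf("")
    #define GGML_TABLE_INFINITY __builtin_inff()
    #endif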

* cleanup: remove AoS MXFP6/MXFP8 dequant code — these types are KV-cache-only (SoA)

MXFP6 (E2M3) and MXFP8 (E4M3) exist only for KV cache flash attention,
which uses SoA (Struct-of-Arrays) layout. The AoS dequant functions
(NEON, AVX2, CPU dispatch, generic wrappers) were incorrectly added
and are dead code — no model stores weights in these formats.

Removed:
- AoS NEON dequant: dequantize_row_mxfp{6,8}_neon, _cpu dispatch
- AoS AVX2 dequant: dequantize_row_mxfp{6,8}_avx2, _cpu dispatch
- AoS generic wrappers: dequantize_row_mxfp{6,8}_cpu_generic
- AoS fallback defines in arch-fallback.h
- CPU traits .to_float entries for MXFP6/MXFP8
- MXFP6/MXFP8 from all_types[] in test-backend-ops (no AoS tests)

Kept (correct SoA code):
- All *_soa_* functions (NEON, AVX2, generic, dispatch)
- CPU traits .from_float_soa / .to_float_soa
- Flash attention and SET_ROWS Hadamard test cases
- Scalar reference dequant in ggml-quants.c (test-quantize-fns roundtrip)
- MXFP4 AoS code (upstream model weight support, untouched)

Fixes an ARM64 CI failure: GET_ROWS(mxfp6_e2m3) was testing dead AoS code
that had a NEON bug. The test no longer runs because the type is
correctly excluded from AoS test paths.

* test: guard that all MXFP types have SoA traits for flash attention

All MXFP flash attention uses the SoA layout exclusively. The test validates:
- ALL MXFP types (MXFP4, MXFP6, MXFP8) have from_float_soa and to_float_soa
- MXFP6/MXFP8 (KV-cache-only) do NOT have AoS CPU to_float

Prevents regression: if someone adds AoS dequant back for MXFP6/MXFP8,
or removes SoA traits from any MXFP type, CI will catch it.
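
A hedged sketch of the guard, assuming this branch's CPU trait fields
and hypothetical MXFP6/MXFP8 type names:

    for (ggml_type type : {GGML_TYPE_MXFP4, GGML_TYPE_MXFP6_E2M3,
                           GGML_TYPE_MXFP8_E4M3}) {
        const auto * traits = ggml_get_type_traits_cpu(type);
        // every MXFP type must round-trip through the SoA path
        GGML_ASSERT(traits->from_float_soa != nullptr);
        GGML_ASSERT(traits->to_float_soa   != nullptr);
        // KV-cache-only types must not grow an AoS dequant path back
        if (type != GGML_TYPE_MXFP4) {
            GGML_ASSERT(traits->to_float == nullptr);
        }
    }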

* test: add Hadamard, SoA cross-check, E8M0, and layout offset tests

* test: add MXFP converter edge cases, FP6 packing, E8M0 known-answer tests

Add comprehensive tests to catch the bugs backend implementers hit most:
- Element converter edge cases: subnormals, max finite, saturation, NaN, sign
- FP6 pack/unpack exhaustive round-trip with known-answer byte verification
- E8M0 known-answer decode + HALF vs FULL scale distinction
- E8M0 rounding boundary at sqrt(2) threshold (catches floor-only bugs; see sketch below)
- Converter exhaustive round-trip: quantize(dequantize(i))==i for all formats
- Consolidate duplicate SoA switches into single table in test-backend-ops
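
The sqrt(2) item is the subtlest; a sketch of the known-answer check,
with a hypothetical ggml_mxfp_e8m0_from_float() that rounds to the
nearest power of two (E8M0 stores a biased exponent, bias 127):

    // round-to-nearest flips at sqrt(2)*2^k, not at 2^k:
    //   1.4142 < sqrt(2) -> 2^0, biased exponent 127
    //   1.4143 > sqrt(2) -> 2^1, biased exponent 128
    // a floor-only implementation returns 127 for both
    GGML_ASSERT(ggml_mxfp_e8m0_from_float(1.4142f) == 127);
    GGML_ASSERT(ggml_mxfp_e8m0_from_float(1.4143f) == 128);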

* test: add AoS/SoA cross-check, Hadamard pipeline, format spec, and mxfp_rmse

- MXFP4 AoS vs SoA cross-check: two independent code paths, bitwise match
- Full Hadamard pipeline roundtrip: H→quantize→dequant→H for all 3 types
- mxfp_rmse helper: computes sqrt(sum/n), with named pipeline constants (see sketch below)
- Block size consistency: verify QK_MXFP{4,8,6} == 32
- EMAX_OFFSET vs format max: validate constants produce valid E8M0
- Edge case LUT validation: expected_bits verified against canonical LUTs
- FP4 E2M1 exhaustive converter round-trip (16/16)
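
A minimal sketch of the mxfp_rmse contract (the signature is assumed;
the commit only pins down sqrt(sum/n)):

    static double mxfp_rmse(const float * x, const float * y, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; ++i) {
            const double d = (double) x[i] - (double) y[i];
            sum += d*d;
        }
        return sqrt(sum/n); // RMSE over n element pairs
    }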

* cleanup: tighten MXFP test comments to match repo conventions

* fix: platform-specific NaN/Infinity for GPU device table initializers

FP8 E4M3/E5M2 LUTs contain NaN/Inf values that cannot be
constexpr-initialized in __device__ tables on any CUDA/HIP/MUSA
version. No GPU backend uses these LUTs (they use the converter
functions instead), so guard them out of GPU builds entirely.
Simplify GGML_TABLE_NAN/INFINITY to CPU-only macros.
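
The shape of the guard, assuming ggml-common.h's existing
GGML_COMMON_DECL_* sections:

    #if !defined(GGML_COMMON_DECL_CUDA) && \
        !defined(GGML_COMMON_DECL_HIP)  && \
        !defined(GGML_COMMON_DECL_MUSA)
    // FP8 LUTs contain NaN/Inf entries: compile for CPU builds only
    GGML_TABLE_BEGIN(float, kvalues_mxfp8_e5m2, 256)
        // ... 256 entries generated from the element converter ...
    GGML_TABLE_END()
    #endif
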
2026-03-22 01:07:55 -04:00
peg-parser common: consolidate PEG string parsers (#20263) 2026-03-10 00:29:21 +01:00
.gitignore common : introduce composable PEG parser combinators for chat parsing (#17136) 2025-12-03 12:45:32 +02:00
CMakeLists.txt test : add testing and fixes 2026-03-22 01:07:55 -04:00
export-graph-ops.cpp test-backend-ops: allow loading tests from file and parsing model operators into file (#19896) 2026-03-12 13:26:00 +01:00
get-model.cpp ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
get-model.h ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
gguf-model-data.cpp tests : model metadata loading from huggingface (#19796) 2026-02-28 10:44:38 +01:00
gguf-model-data.h tests : model metadata loading from huggingface (#19796) 2026-02-28 10:44:38 +01:00
run-json-schema-to-grammar.mjs llama : move end-user examples to tools directory (#13249) 2025-05-02 20:27:13 +02:00
test-alloc.cpp chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
test-arg-parser.cpp ci, tests : use cmake to download models and remove libcurl dependency (#18791) 2026-01-14 07:46:27 +01:00
test-autorelease.cpp docs : Minor cleanups (#19252) 2026-02-02 08:38:55 +02:00
test-backend-ops.cpp test : add testing and fixes 2026-03-22 01:07:55 -04:00
test-backend-sampler.cpp tests: enable kv_unified to prevent cuda oom error on rtx 2060 (#20645) 2026-03-18 17:40:22 +08:00
test-barrier.cpp Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes (#17748) 2025-12-10 12:32:23 -08:00
test-c.c ggml : remove kompute backend (#14501) 2025-07-03 07:48:32 +03:00
test-chat-auto-parser.cpp common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
test-chat-peg-parser.cpp common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
test-chat-template.cpp Autoparser - complete refactoring of parser architecture (#18675) 2026-03-06 21:01:00 +01:00
test-chat.cpp common/parser : fix out_of_range crash in throw path (#20424 regression) (#20777) 2026-03-20 02:37:22 +01:00
test-double-float.cpp ggml : minor naming changes (#8433) 2024-07-12 10:46:02 +03:00
test-gbnf-validator.cpp cmake : do not include ./src as public for libllama (#13062) 2025-04-24 16:00:10 +03:00
test-gguf-model-data.cpp tests : model metadata loading from huggingface (#19796) 2026-02-28 10:44:38 +01:00
test-gguf.cpp ggml/gguf : prevent integer overflows (#19856) 2026-02-24 20:17:11 +02:00
test-grammar-integration.cpp grammar: Fix grammar root symbol check (#19761) 2026-03-12 12:04:56 +01:00
test-grammar-llguidance.cpp tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
test-grammar-parser.cpp llama : add token matching support to llama-grammar (#17816) 2025-12-09 00:32:57 -06:00
test-jinja.cpp tests : fix test-jinja-py Windows failures by bypassing command-line args [no ci] (#20483) 2026-03-18 10:43:31 +01:00
test-json-partial.cpp common : handle unicode during partial json parsing (#16526) 2025-10-12 16:18:47 +03:00
test-json-schema-to-grammar.cpp examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968) 2026-03-10 14:38:18 +01:00
test-llama-archs.cpp model: mistral small 4 support (#20649) 2026-03-17 00:31:14 +01:00
test-llama-grammar.cpp llama : add token matching support to llama-grammar (#17816) 2025-12-09 00:32:57 -06:00
test-log.cpp common : use common_ prefix for common library functions (#9805) 2024-10-10 22:57:42 +02:00
test-lora-conversion-inference.sh cli: new CLI experience (#17824) 2025-12-10 15:28:59 +01:00
test-model-load-cancel.cpp llama : update llama_model API names (#11063) 2025-01-06 10:55:18 +02:00
test-mtmd-c-api.c mtmd : add C public API (#13184) 2025-05-04 23:43:42 +02:00
test-opt.cpp tests : fix test-opt with GGML_BACKEND_DL (#15599) 2025-08-26 22:14:38 +02:00
test-peg-parser.cpp Autoparser - complete refactoring of parser architecture (#18675) 2026-03-06 21:01:00 +01:00
test-quantize-fns.cpp test : add testing and fixes 2026-03-22 01:07:55 -04:00
test-quantize-perf.cpp ci: run the x64 and arm ci on the github machines instead (#16183) 2025-09-25 08:06:06 +03:00
test-quantize-stats.cpp server: introduce API for serving / loading / unloading multiple models (#17470) 2025-12-01 19:41:04 +01:00
test-reasoning-budget.cpp common/parser: handle reasoning budget (#20297) 2026-03-11 10:26:12 +01:00
test-regex-partial.cpp common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342) 2026-01-03 16:02:43 -06:00
test-rope.cpp ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (#16805) 2025-11-11 13:33:24 +02:00
test-sampling.cpp sampling : optimize samplers by reusing bucket sort (#15665) 2025-08-31 20:41:02 +03:00
test-state-restore-fragmented.cpp kv-cache: Fix state restore fragmented cache (#17982) 2025-12-15 19:28:35 +02:00
test-thread-safety.cpp server : support unified cache across slots (#16736) 2025-11-02 18:14:04 +02:00
test-tokenizer-0.cpp tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
test-tokenizer-0.py py : logging and flake8 suppression refactoring (#7081) 2024-05-05 08:07:48 +03:00
test-tokenizer-0.sh model : add Jina Embeddings v5 Nano (partial EuroBERT) support (#19826) 2026-02-26 12:14:09 +01:00
test-tokenizer-1-bpe.cpp tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
test-tokenizer-1-spm.cpp tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
test-tokenizer-random.py requirements : update transformers/torch for Embedding Gemma (#15828) 2025-09-09 06:06:52 +02:00
test-tokenizers-repo.sh devops: add s390x & ppc64le CI (#15925) 2025-09-27 02:03:33 +08:00
testing.h common : implement new jinja template engine (#18462) 2026-01-16 11:22:06 +01:00