llama.cpp

Commit Graph

Author	SHA1	Message	Date
Richard Davison	5eae9cb1d9	ggml : add NVFP4 quantization type support (#19769 ) * WIP: add NVFP4 quantization support * tests * improve NVFP4 dot product implementation performance and fix bad super call * typo * Use nvfp4 kvalues * vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table * vulcal and perf fixes * wip * Fix metal * fix vulcan * Rename threshold & fix wrong scale * Fix MOE * Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD) Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently. Reverted files: - ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh - ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp - ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/* - ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained. * Fix arch-fallback.h: add NVFP4 generic fallback for all platforms After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions. * quantize: add NVFP4 as a quantization type option * Fix ggml_fp32_to_ue4m3: handle subnormal values Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range. Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32. Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15). * Restore ARM NEON NVFP4 dot product implementation Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products. tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup * Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq - Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop - Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32 - Accumulate with vfmaq_f32 into float32x4_t vector accumulators tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed) * ARM NEON NVFP4: rearrange q8 to match nibble layout Alternative approach: rearrange q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates vcombine_s8(vget_low, vget_low) shuffles. Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions. * CPU only backend 64 super-block layout * cleanup * Remove unused LUT * int * exclude NVFP4 from unsupported ops in metal build * remove quantization for now * store scales as native UE4M3, preserve original model bits when possible * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * correct comment * format * reduce duplication and cleanup * Address comments * move detection to prepare_tensors * Use math instead of const * Move * fix comment * Shelf quantize tests * Rebase and move check * cleanup * lint * Update gguf-py/gguf/scripts/gguf_convert_endian.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Use fallback quant config * Simplify Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * organize * Refactor * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * fix return type --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-11 21:02:54 +01:00
Aleksei Nikiforov	d82b7a7c1d	gguf-py : fix passing non-native endian tensors (editor-gui and new-metadata) (#17553 ) gguf_new_metadata.py reads data from reader. Reader doesn't byteswap tensors to native endianness. But writer does expect tensors in native endianness to convert them into requested endianness. There are two ways to fix this: update reader and do conversion to native endianness and back, or skip converting endianness in writer in this particular USE-case. gguf_editor_gui.py doesn't allow editing or viewing tensor data. Let's go with skipping excessive byteswapping. If eventually capability to view or edit tensor data is added, tensor data should be instead byteswapped when reading it.	2025-11-28 20:53:01 +01:00
Aleksei Nikiforov	4fcd87cf7c	gguf-py : skip endian-conversion of MXFP4 data (#17523 ) * gguf_convert_endian.py: skip MXFP4 data * Use gguf.constants.GGML_QUANT_SIZES to determine block sizes	2025-11-27 11:35:38 +01:00
Aleksei Nikiforov	7adc79c032	gguf-py : add support for endian conversion of BF16 data (#16594 ) BF16 requires special handling in this script while it's a 2-bytes data, but view is 1-byte by default. Switch to correct view before attempting byteswapping. With this change correctly byteswapping models like Meta-Llama-3-8B-Instruct-bf16-GGUF should be possible.	2025-10-15 22:43:08 +02:00
Aleksei Nikiforov	64387f6e95	gguf-py: byteswapping improvements (#12851 ) * gguf-py: implement byteswapping for Q4_0 This is needed to byteswap Mistral model. Also restore original shapes after byteswapping tensors. It is not needed at the moment, but do it in case they'd be used in future. * Rework byteswapping code in gguf-py Move out details from byteswapping tensor blocks code	2025-08-28 16:56:41 +08:00
Sigbjørn Skjæret	e5bebe5251	gguf-py : add --chat-template-file to gguf_new_metadata (#15075 )	2025-08-04 21:01:48 +02:00
Ed Addario	c81f4192f9	gguf-py : dump bpw per layer and model in markdown mode (#14703 )	2025-07-16 00:04:42 +02:00
Sigbjørn Skjæret	2b131621e6	gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561 )	2025-05-29 15:36:05 +02:00
Daniel Tang	07ad2b6db3	gguf-py : fix disconnect-before-connect in editor-gui (#13569 ) The bug caused a crash upon load with venvs created with --system-site-packages to use python3-pyside6.qtwidgets=python3-pyside6.qtwidgets=6.6.2-4 from Kubuntu 24.10.	2025-05-15 18:47:10 +02:00
compilade	a7366faa5b	gguf-py : avoid requiring pyside6 for other scripts (#13036 ) - gguf-py : remove gguf-py/gguf/scripts/__init__.py because it's not needed Implicit namespaces are supported since Python 3.3 (https://peps.python.org/pep-0420/), and the entrypoints in pyproject.toml can directly refer to the main functions.	2025-05-05 22:27:31 -04:00
Chris Thompson	aff9d107b0	gguf-py : GGUF Editor GUI - Python + Qt6 (#12930 )	2025-04-18 20:30:41 +02:00
Sigbjørn Skjæret	69050a11be	Refactor gguf scripts to improve metadata handling (#11909 ) * Refactor gguf scripts to improve metadata handling Added contents method to ReaderField class Added endianess property to GGUFReader class * update scripts * fix import * remove unused import * attempt to work around flake and pyright errors * second attempt * give up, ignore type * bump version * apply newbyteorder fixes	2025-02-26 08:04:48 -05:00
Aleksei Nikiforov	651adf4b66	gguf_convert_endian.py: implement byteswapping for q4_k and q6_k (#11349 )	2025-02-24 11:27:01 +00:00
Georgi Gerganov	68ff663a04	repo : update links to new url (#11886 ) * repo : update links to new url ggml-ci * cont : more urls ggml-ci	2025-02-15 16:40:57 +02:00
Vinesh Janarthanan	c05e8c9934	gguf-py: fixed local detection of gguf package (#11180 ) * updated path to gguf package for non-installed setups * added reader.py to readme * Bumped gguf version to 0.15.0	2025-01-11 11:42:31 +02:00
Vinesh Janarthanan	8a1d9c25fa	gguf-py : move scripts directory (#11116 ) * Moved scripts dir and fixed pyproject.toml * updated readme * fixed README urls * bump pypi gguf to v0.14.0 * retrigger ci * empty commit - trigger ci	2025-01-08 20:54:58 +02:00

16 Commits