* WIP: add NVFP4 quantization support
* tests
* improve NVFP4 dot product implementation performance and fix bad super call
* typo
* Use nvfp4 kvalues
* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
* Vulkan and perf fixes
* wip
* Fix metal
* fix Vulkan
* Rename threshold & fix wrong scale
* Fix MoE
* Shelve backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)
Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.
Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c
Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.
* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms
After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.
* quantize: add NVFP4 as a quantization type option
* Fix ggml_fp32_to_ue4m3: handle subnormal values
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.
Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.
Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
* Restore ARM NEON NVFP4 dot product implementation
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.
tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq
- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators
tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
* ARM NEON NVFP4: rearrange q8 to match nibble layout
Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.
Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.
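Putting the last three commits together, a hedged sketch of the per-block inner loop: table-lookup decode of the packed nibbles, an int8 dot product against q8 data rearranged to match the nibble layout, and FP32 vector accumulation. The block layout and the codebook convention (FP4 values stored as integers scaled by 2, with the 0.5 folded into d) are simplifying assumptions; requires dotprod support (e.g. -march=armv8.2-a+dotprod).

```cpp
#include <arm_neon.h>
#include <cstdint>

// One QK=16 block of the NVFP4 x Q8 dot product (sketch, not the ggml code):
//   lut - 16-entry FP4 codebook as scaled int8 values
//   qs  - 16 packed 4-bit NVFP4 codes (8 bytes)
//   y   - q8 values rearranged: the 8 low-nibble partners first, then the 8 high-nibble partners
//   d   - combined block scale (UE4M3 scale * q8 scale * 0.5)
static inline float32x4_t nvfp4_block_sketch(float32x4_t acc, const int8_t lut[16],
                                             const uint8_t qs[8], int8x16_t y, float d) {
    const int8x16_t tbl = vld1q_s8(lut);
    const uint8x8_t q   = vld1_u8(qs);
    // build one 16-lane index vector: 8 low nibbles, then 8 high nibbles
    const uint8x16_t idx = vcombine_u8(vand_u8(q, vdup_n_u8(0x0F)), vshr_n_u8(q, 4));
    const int8x16_t  v   = vqtbl1q_s8(tbl, idx);             // decode via table lookup, no branches
    const int32x4_t  p   = vdotq_s32(vdupq_n_s32(0), v, y);  // int8 sdot into int32 lanes
    return vfmaq_f32(acc, vdupq_n_f32(d), vcvtq_f32_s32(p)); // scale once per block, FP32 accumulate
}
```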
* CPU-only backend: 64 super-block layout
* cleanup
* Remove unused LUT
* int
* exclude NVFP4 from unsupported ops in metal build
* remove quantization for now
* store scales as native UE4M3, preserve original model bits when possible
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* correct comment
* format
* reduce duplication and cleanup
* Address comments
* move detection to prepare_tensors
* Use math instead of const
* Move
* fix comment
* Shelve quantize tests
* Rebase and move check
* cleanup
* lint
* Update gguf-py/gguf/scripts/gguf_convert_endian.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use fallback quant config
* Simplify
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* organize
* Refactor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* fix return type
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
* Adding --direct-io flag for model loading
* Fixing read_raw() calls
* Fixing Windows read_raw_at
* Changing type off_t to size_t for Windows and renaming functions
* disable direct I/O when mmap is explicitly enabled
* Use read_raw_unsafe when upload_backend is available; not functional on some devices with Vulkan and SYCL
* Fall back to std::fread in case O_DIRECT fails due to a bad address
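The O_DIRECT path in these commits degrades gracefully twice: if the open itself fails (filesystem does not support O_DIRECT) it falls back to cached I/O, and if a later read fails with a bad address (typically an unaligned destination buffer) it falls back to std::fread. A POSIX-only sketch of the first fallback, with invented names; the real llama-mmap code also covers Windows:

```cpp
#include <fcntl.h>
#include <unistd.h>

static int open_for_read(const char * path, bool try_direct) {
#ifdef O_DIRECT
    if (try_direct) {
        const int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd != -1) {
            return fd;  // uncached reads; offsets, sizes, and buffers must be aligned
        }
        // open failed (e.g. EINVAL: O_DIRECT unsupported here) -> use cached I/O
    }
#endif
    return open(path, O_RDONLY);
}
```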
* Windows: remove const keywords and unused functions
* Update src/llama-mmap.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ModernBERT, but without `head.norm`, so it will currently fail to convert and run any other ModernBERT models; PRs adding `head.norm` support are welcome!
* constants and tensor mappings for ModernBERT support; model not supported yet, but working on getting encoder-only conversion to work
* conversion now working, HF -> GGUF
* working on support, now building the graph
* some cleanup
* cleanup
* continuing
* correct tensor shape for qkv
* fixed tensor mappings and working on building the graph
* tensor debugging now works (llama-eval-callback); instead of simulating the gate split with views, GEGLU is now used, which does exactly this
* cleanup
* cleanup
* cleanup
* more cleanup
* ubatch issues: the assert checking equal seqs in llama-graph.cpp when building attention keeps failing; setting the ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but this needs to be looked into more
* added CLS token per the previous ModernBERT attempt, still working through the rest
* fixed pre-tokenizer and still working through the previous PR
* working through the previous attempt: implemented more accurate conversion per that attempt, and added local sliding-window attention that alternates every third layer
* fixed pre-tokenizer
* working on SWA with alternating local and global attention
* some cleanup; now fails on building attention
* starting to work, plus some cleanup; currently failing on last-layer construction in graph build
* alternating RoPE implemented and ModernBERT graph build succeeds
* fixed assert for equal ubatch seqs
* cleanup
* added mask check in vocab
* fixed alternating RoPE: hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same; set them to the correct values
* reuse variable
* removed repeat
* the standard SWA method can be used instead of adding a new enum value LLAMA_SWA_TYPE_LOCAL
* correct SWA layer indexing: should be 0, 3, 6, ... instead of 1, 4, 7, ...
* more modular hparam setting
* replaced attn out norm with ffn_norm, and the cosine similarity between HF embeddings and llama.cpp embeddings went way up, from 0.05 to 0.24; replaced the cacheless KV with SWA per the TODO from the previous conversion
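The alternation these commits converge on, as a sketch: following the upstream ModernBERT convention that every swa_period-th layer (0, 3, 6, ...) uses global attention and the rest use the sliding window, with the RoPE base chosen per branch from the two hparams named above. The struct is an illustrative subset, and the exact predicate in the final code may differ.

```cpp
struct sketch_hparams {                  // illustrative subset, not llama.cpp's llama_hparams
    int   swa_period;                    // e.g. 3: one global layer every 3 layers
    float rope_freq_base_train;          // RoPE base for global layers
    float rope_freq_base_train_swa;      // RoPE base for sliding-window layers
};

static float rope_base_for_layer(const sketch_hparams & hp, int il, bool & is_swa) {
    // pattern anchored at layer 0 per the indexing fix (0, 3, 6, ... not 1, 4, 7, ...)
    is_swa = (il % hp.swa_period) != 0;
    return is_swa ? hp.rope_freq_base_train_swa : hp.rope_freq_base_train;
}
```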
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf_update.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* removed redundant hparam set
* enums for model sizes
* conversion supported for ModernBERT models rather than just granite-small
* Update src/llama-model.cpp
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
* Update src/llama-model.cpp
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
* fixed ordering of enum for freq_base_swa
* fixed where the residual is added; now gives much better embeddings
* re-added cacheless logic
* removing whitespace
* conversion now working for the SWA pattern - dense every n layers
* ModernBERT put into separate src file
* removing whitespace
* fixed whitespace and newline errors in editorconfig job
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* better naming convention, n_swa_pattern -> swa_period
* reusing sliding_window_pattern key rather than making new dense_every_n_layers key, and adding writing and reading support
* fixing pyright type-check failure
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-hparams.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-saver.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* added descriptions in llama-model
* fixed tensor mappings for conversion
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* mapping name for size
* nits
* unused
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
* Uncached model read
* Removing additional --mmap arg
* Removing trailing whitespaces
* Adding fallback when O_DIRECT is not supported
* Remove branching in llama-model-loader.cpp and reduce code duplications in llama-mmap.cpp
* Adding the maybe_unused keyword for Mac and Windows.
* File seek aligned
* Removing all branches for direct_io in llama-model-loader.cpp
* Always use alignment from llama_file
* use_mmap=true
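"File seek aligned" and "Always use alignment from llama_file" point at the core O_DIRECT constraint: file offset, transfer size, and buffer address must all be multiples of the required alignment. A hedged sketch of reading an arbitrary range through an aligned bounce buffer (helper names invented):

```cpp
#include <cstdlib>
#include <cstring>
#include <unistd.h>

static bool read_at_aligned(int fd, void * dst, size_t len, off_t off, size_t align) {
    const off_t  a_off = off - (off % (off_t) align);              // round the offset down
    const size_t head  = (size_t) (off - a_off);
    const size_t a_len = (head + len + align - 1) / align * align; // round the size up
    void * buf = nullptr;
    if (posix_memalign(&buf, align, a_len) != 0) {                 // aligned bounce buffer
        return false;
    }
    const ssize_t n  = pread(fd, buf, a_len, a_off);
    const bool    ok = n >= 0 && (size_t) n >= head + len;         // short reads near EOF are fine
    if (ok) {
        memcpy(dst, (const char *) buf + head, len);               // copy out the requested slice
    }
    free(buf);
    return ok;
}
```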
* llama: automatically fit args to free memory
llama-fit-params tool
* fix CI
* hints for bug reports, ensure no reallocation
* fix segfault with Vulkan
* add llama-fit-params to CI
* fix CI
* fix CI
* fix CI
* minor adjustments
* fix assignment of 1 dense layer
* fix logger not being reset on model load failure
* remove --n-gpu-layer hint on model load failure
* fix llama-fit-params verbosity
* fix edge case
* fix typo [no ci]
* First attempt
* No permute during convert (fixes qk tensors), proper norm application.
* RoPE = NeoX
* Coherence!
* Migrate xielu params from tensors to hyperparameters
* Simple CUDA kernel
* Revert stupid LLM refactorings
* Chat template support
* configchecker / flake8 errors
* Reorder unary.cu
* I do conclude that LLMs are, in fact, stupid.
* Fix after merge
* Final newline
* Make xIELU an UNARY_OP
* Final newline
* Correctly account for parameter shift
* Argh.
* Update ggml/src/ggml-cpu/unary-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Refactor: remove unused methods, inline and factorize softplus, add const modifiers
* Revert CUDA changes, implement xIELU as a separate OP
* Pesky newline
* Add float2half / half2float for F16 inputs/outputs
* CUDA variants, attempt 2
* Actually, attempt 3
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Missing convert header
* Proper formula and reference for xIELU in the comments.
* Modify unary-ops.cpp to add the functor-based logic alongside the template system to retain optimizations
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add tensor mappings for Apertus to global list instead
* Fix lazy on scalars
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add comment about the constraints on positive/negative alpha
* Change `softplus` to `ggml_softplus`
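For reference, softplus(x) = log(1 + e^x) is the piece these commits inline and factor out; a minimal numerically stable sketch (the cutoff and exact form of ggml_softplus may differ):

```cpp
#include <cmath>

static inline float softplus_sketch(float x) {
    // for large x, log(1 + e^x) ~= x, and expf(x) would overflow anyway
    if (x > 20.0f) return x;
    return log1pf(expf(x));  // log1pf keeps precision when expf(x) is tiny
}
```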
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Add NEMOTRONH to python arch enum
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add NEMOTRONH to c++ arch enum
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add NEMOTRONH to llama-arch layer map
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First pass at conversion for nemotronh
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add a verbose log for each tensor loaded
This is really helpful for diagnosing mismatches between the expected and
received tensors
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First (broken) pass at nemotronh model architecture
It generates tokens, just not valid ones!
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Explicitly enable add_bos_token during conversion
The `tokenizer.json`/`tokenizer_config.json` in the model are a bit
contradictory. In the config, add_bos_token is set to False, but the
tokenizer model itself has a post_processor that adds the BOS token via
type: TemplateProcessing
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
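relu2 here is squared ReLU; as a plain scalar reference for what LLM_FFN_RELU_SQR applies elementwise:

```cpp
// squared ReLU: f(x) = max(0, x)^2
static inline float relu2(float x) {
    return x > 0.0f ? x * x : 0.0f;
}
```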
* fix: Only allocate attention cache for attention layers (not non-recurrent)
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Move residual add to after every block
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the correct norm tensor for the MLP blocks
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Nemotron-H: MLP gate cleanup (pass NULL for unused gate)
This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise.
* SSM: respect ssm_dt_rank for dt_dim when provided
Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fall back to max(64, n_embd/16) otherwise.
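The selection rule above, restated as a minimal helper (parameter names are illustrative):

```cpp
#include <algorithm>

static int pick_dt_dim(int ssm_dt_rank, int n_embd) {
    return ssm_dt_rank > 0
        ? ssm_dt_rank                 // honor GGUF time_step_rank when provided
        : std::max(64, n_embd / 16);  // heuristic fallback otherwise
}
```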
* fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)
* Rename nemotronh to nemotron_h for consistency
- Update architecture name from NEMOTRONH to NEMOTRON_H in constants.py
- Change architecture string from 'nemotronh' to 'nemotron_h' in all files
- Update enum LLM_ARCH_NEMOTRONH to LLM_ARCH_NEMOTRON_H
- Update class name llm_build_nemotronh to llm_build_nemotron_h
- Consistent naming with underscore convention (nemotron_h vs nemotronh)
* feat: Support conversion for older NemotronH models
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Maicon Domingues <dominguesm@outlook.com>
Co-authored-by: weatherman <fxdstudios@gmail.com>
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
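A hedged sketch of what a llama_opt_param_filter callback can look like: return true for tensors that should receive gradients. The exact signature should be checked against llama.h; the name-matching policy below is purely illustrative.

```cpp
#include <cstring>
#include "ggml.h"

// hypothetical policy: fine-tune only tensors whose name contains "output"
static bool train_only_output(const struct ggml_tensor * t, void * /*userdata*/) {
    return strstr(t->name, "output") != nullptr;
}
```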
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
* model : print tensor size during load
* cont : fix units MB -> MiB
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* impl::load: change bpe_ranks from std::map to std::unordered_map, reducing impl::load time by 30%
* llama_model_loader::init_mapping: replace `new llama_mmap` with `std::make_unique<llama_mmap>` for cleaner code; halves the time of running init_mappings
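A sketch of the container change above: bpe_ranks is keyed by a pair of strings, so the hashed container needs a pair hash (the combiner below is the usual boost-style one; struct and alias names are illustrative).

```cpp
#include <string>
#include <unordered_map>
#include <utility>

struct pair_hash {
    size_t operator()(const std::pair<std::string, std::string> & p) const {
        const size_t h1 = std::hash<std::string>{}(p.first);
        const size_t h2 = std::hash<std::string>{}(p.second);
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2)); // boost-style combine
    }
};

// O(1) average lookup replaces std::map's O(log n) in the hot load path
using bpe_rank_map =
    std::unordered_map<std::pair<std::string, std::string>, int, pair_hash>;
```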
* Update src/llama-vocab.cpp
---------
Co-authored-by: lexasub <empty@empty.ru>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* GGUF: C++ refactor, backend support, misc fixes
remove ggml_tensor.backend
update CODEOWNERS [no ci]
remove gguf_get_data from API
revise GGUF API data types