Ed Addario
030ed3c909
Merge branch 'master' into imatrix
2025-08-05 21:58:00 +01:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss (#15091)
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1, see sketch after this entry)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11, see sketch after this entry)
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4 (see sketch after this entry)
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com>
change kvalues_mxfp4 table to match e2m1 (#6)
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13, see sketch after this entry)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix "Unknown reasoning format"
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
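For orientation, a minimal sketch of the attention-sink softmax added in the gpt-oss commit above, assuming the publicly described formulation in which a learned per-head sink logit joins the normalization but emits no value (reference code, not the ggml kernel):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One row of attention scores: the sink competes in the denominator, so all
// attention weights are damped and the row sums to less than 1.
std::vector<float> softmax_with_sink(const std::vector<float> & scores, float sink) {
    float max_val = sink;
    for (float s : scores) max_val = std::max(max_val, s);

    double denom = std::exp((double) (sink - max_val)); // the sink's share
    std::vector<float> probs(scores.size());
    for (size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - max_val);
        denom   += probs[i];
    }
    for (float & p : probs) p = (float) (p / denom);
    return probs;
}
```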
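Likewise, a hedged sketch of the fused swiglu_oai activation; alpha = 1.702 and limit = 7.0 follow the public gpt-oss model description and are assumptions here, as is the exact clamping:

```cpp
#include <algorithm>
#include <cmath>

// out = (up + 1) * gate * sigmoid(alpha * gate), with clamped inputs
float swiglu_oai(float gate, float up, float alpha = 1.702f, float limit = 7.0f) {
    gate = std::min(gate, limit);          // gate is clamped from above only
    up   = std::clamp(up, -limit, limit);  // up is clamped on both sides
    const float glu = gate / (1.0f + std::exp(-alpha * gate)); // gate * sigmoid(alpha * gate)
    return (up + 1.0f) * glu;
}
```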
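The mxfp4 lines above refer to the OCP microscaling format: blocks of 32 E2M1 (4-bit) values sharing one E8M0 power-of-two scale. A reference dequantizer under that assumption; the block layout is illustrative, not ggml's exact struct:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// E2M1 lookup table; the sign bit is the high bit of each nibble.
static const float E2M1[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

std::vector<float> dequant_mxfp4_block(uint8_t e8m0, const uint8_t packed[16]) {
    // E8M0 is a bare biased exponent, so the scale is an exact power of two;
    // ldexp is the "e8m0 conversion instead of powf" the log mentions.
    const float scale = std::ldexp(1.0f, (int) e8m0 - 127);
    std::vector<float> out(32);
    for (int i = 0; i < 16; ++i) {
        out[2*i + 0] = E2M1[packed[i] & 0x0F] * scale;
        out[2*i + 1] = E2M1[packed[i] >> 4]   * scale;
    }
    return out;
}
```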
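Finally, ggml_add_id: given its use for MoE biases here ("keep biases on CPU with --cpu-moe"), the semantics are plausibly "add a bias row selected per id", sketched as plain reference code (an assumption, not the ggml implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// rows: n_rows x n_cols, modified in place (cf. "allow in-place ggml_add_id")
// bias: n_expert x n_cols; ids: one expert id per row
void add_id(std::vector<float> & rows, size_t n_cols,
            const std::vector<float> & bias, const std::vector<int32_t> & ids) {
    for (size_t r = 0; r < ids.size(); ++r) {
        const float * b = bias.data() + (size_t) ids[r] * n_cols;
        for (size_t c = 0; c < n_cols; ++c) {
            rows[r * n_cols + c] += b[c];
        }
    }
}
```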
Sigbjørn Skjæret
f324a3b715
chat : only remove double bos/eos if added (#15086)
* only remove double bos/eos if added
* fix tests
2025-08-05 20:43:36 +02:00
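A sketch of the guard this commit describes, with hypothetical names: a leading duplicate BOS is stripped only when the tokenizer itself added one, so a BOS that was genuinely part of the prompt survives.

```cpp
#include <cstdint>
#include <vector>

typedef int32_t llama_token; // stand-in for the real typedef

void trim_double_bos(std::vector<llama_token> & toks, llama_token bos, bool add_bos) {
    // only act when the first BOS is known to be tokenizer-inserted
    if (add_bos && toks.size() >= 2 && toks[0] == bos && toks[1] == bos) {
        toks.erase(toks.begin());
    }
}
```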
Georgi Gerganov
be42642581
readme : update hot topics (#15097)
2025-08-05 20:19:33 +03:00
Romain Biessy
3306ceabf0
sycl: fix mul_mat selection (#15092)
2025-08-05 18:39:55 +02:00
Ed Addario
88854c9179
Refactor legacy mode
2025-08-05 14:16:45 +01:00
Juk Armstrong
c81de6e107
Fix `glm4moe` bug (#15088)
2025-08-05 13:56:44 +01:00
Ed Addario
4c3fea89d6
Update report layout
2025-08-05 13:32:59 +01:00
Ed Addario
49996a19da
Refactor variable names
2025-08-05 13:32:46 +01:00
Ed Addario
aea9b31db5
Make ZD Score two-tailed
2025-08-05 12:57:13 +01:00
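Assuming "two-tailed" carries its standard meaning for a z-score-based statistic, deviations in both directions now count:

```latex
p = 2\,\bigl(1 - \Phi(\lvert z \rvert)\bigr)
```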
Alex Wu
22f060c9c4
webui: fix markdown table (#15081)
* webui: fix markdown table
* webui: fix table display with themes
2025-08-05 13:56:44 +02:00
Ed Addario
906548a00a
Update aggregated sum of squared activations per layer
2025-08-05 12:06:19 +01:00
compilade
ee3a9fcf88
context : fix index overflow on huge outputs (#15080)
* context : fix overflow when re-ordering huge outputs
* context : fix logits size overflow for huge batches
2025-08-05 11:27:45 +02:00
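The class of bug being fixed, illustratively (not the actual patch): a 32-bit product such as row * n_vocab wraps once outputs get large, so the index is widened before the multiply.

```cpp
#include <cstddef>
#include <cstdint>

size_t logits_offset(int32_t row, int32_t n_vocab) {
    // cast before multiplying: e.g. 70000 * 128000 already exceeds INT32_MAX
    return (size_t) row * (size_t) n_vocab;
}
```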
Ed Addario
b37393423d
Compute aggregated (per layer) l2 norm
2025-08-05 08:54:57 +01:00
Ed Addario
5e40cf4f1c
Do not resize if in_sum is null
2025-08-05 00:18:53 +01:00
Diego Devesa
ec428b02c3
llama : add --n-cpu-moe option (#15077)
* llama : add --n-cpu-moe option
Keeps the MoE weights of the first N layers on the CPU
2025-08-05 01:05:36 +02:00
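Usage sketch (invocation illustrative; only the flag itself is from the commit): `llama-cli -m model.gguf --n-cpu-moe 10` would keep the expert tensors of the first 10 layers in system memory while everything else offloads as usual.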
compilade
19f68fa5a4
imatrix : warn when GGUF imatrix is saved without .gguf suffix (#15076)
* imatrix : add warning when suffix is not .gguf for GGUF imatrix
* imatrix : only warn about suffix when output format is unspecified
2025-08-04 23:26:52 +02:00
Ed Addario
adbff66394
Merge branch 'master' into imatrix
2025-08-04 22:16:10 +01:00
Ed Addario
c39c4e2a33
Refactor variable name
2025-08-04 22:15:50 +01:00
Christian Kastner
41613437ff
cmake: Add GGML_BACKEND_DIR option (#15074)
* cmake: Add GGML_BACKEND_DIR option
This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.
* Fix phrasing
2025-08-04 21:29:14 +02:00
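Configuration sketch (path illustrative; the two options are from the commit): `cmake -DGGML_BACKEND_DL=ON -DGGML_BACKEND_DIR=/usr/lib/ggml` would make a distribution build search that directory for backends at runtime.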
Sigbjørn Skjæret
e5bebe5251
gguf-py : add --chat-template-file to gguf_new_metadata (#15075)
2025-08-04 21:01:48 +02:00
Sam
ef0144c087
model: support GLM 4.5 family of models (#14939)
* model: Add GLM 4.5 (#14921)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Merge in PR suggestions
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: Add GLM 4.5 family of models (#14921)
1. Updated tensor_mapping.py with NextN tensor mappings
- Added proper tensor mappings for all NextN/MTP tensors in /Users/samm/git/llama.cpp/gguf-py/gguf/tensor_mapping.py
- Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm
2. Added num_nextn_predict_layers configuration
- Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
- Added num_nextn_predict_layers field to llama_hparams struct
- Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter
- Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
- Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
- Updated conversion script to extract and write this parameter from HuggingFace config
3. Added FIM tokens for GLM4_MOE
- Added GLM-4.5's FIM tokens to llama-vocab.cpp:
- <|code_prefix|> for FIM_PRE
- <|code_suffix|> for FIM_SUF
- <|code_middle|> for FIM_MID
4. Removed manual NextN tensor handling
- Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
- NextN tensors are now handled automatically through the proper tensor mapping system
* glm 4.5 update tensor names
* model: glm 4.5 apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: glm 4.5 apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: glm 4.5 apply suggestions from code review
* Apply suggestions from code review
* patch broken chat template
* typings fix
* add TENSOR_SKIP flag
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update src/llama-model-loader.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-08-04 20:29:25 +02:00
Sigbjørn Skjæret
2721257e3e
quantize : fix confusing error message if ftype is invalid (#15071)
2025-08-04 18:11:02 +02:00
Reese Levine
587d0118f5
ggml: WebGPU backend host improvements and style fixing (#14978)
* Add parameter buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
2025-08-04 08:52:43 -07:00
Jeff Bolz
5aa1105da2
vulkan: fix build when using glslang that does not support coopmat2 (#15062)
2025-08-04 07:09:19 +02:00
compilade
d31192b4ee
imatrix : use GGUF by default (#14842)
* imatrix : use GGUF by default
* imatrix : use GGUF regardless of the output filename
The legacy format can only be produced with --output-format dat
2025-08-03 22:00:05 +02:00
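Per the commit body, the legacy format now needs an explicit flag regardless of filename, e.g. `llama-imatrix -m model.gguf -f calibration.txt --output-format dat` (invocation illustrative; the flag itself is from the commit).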
compilade
0a2f5496be
imatrix : fix 3d activation handling for hybrid and recurrent models (#14994)
* imatrix : use a single count for dense 3d tensors
* imatrix : fix 3d activations when model tensor is 2d
* imatrix : fix 3d tensor counts
2025-08-03 21:49:13 +02:00
compilade
11a3811164
memory : handle kv_unified for hybrid models (#15050)
2025-08-03 21:43:07 +02:00
Csaba Kecskemeti
97366dc6ab
vocab : JetBrains Mellum pre-tokenizer (#15045)
2025-08-03 21:38:18 +02:00
Ed Addario
f1c2a4ca3f
Fix printing l2 norm when calc_mode = 1
2025-08-03 17:14:46 +01:00
Ed Addario
90cb1be99d
Minor cosmetic changes
2025-08-03 16:57:27 +01:00
Ed Addario
2117c4e54b
Update aggregated statistic report layout
2025-08-03 16:38:02 +01:00
Ed Addario
a6155a8125
Add compute_layer_statistics() function
2025-08-03 16:35:03 +01:00
Gabriel Larson
83bc2f288c
model : add text-only support for Kimi-VL (and find special tokens in text_config) (#15051)
* basic kimi-vl textmodel conversion
* check config["text_config"] for special tokens
2025-08-03 16:56:25 +02:00
Ed Addario
be60469f25
Refactor function names
2025-08-03 15:10:17 +01:00
Jeff Bolz
6c7a441161
vulkan: Use coopmat2 for conv2d (#14982)
2025-08-03 14:23:57 +02:00
Ed Addario
fce05aac9e
Refactor lambda into compute_tensor_averages() function
2025-08-03 13:03:21 +01:00
Ed Addario
5324558132
Update table layout
2025-08-03 10:28:47 +01:00
Ed Addario
4d1325e1eb
Refactor variables
2025-08-03 10:28:23 +01:00
Ed Addario
a32a2ecbed
Reformat report layout
2025-08-03 00:51:33 +01:00
Ed Addario
4c01f51ae1
Remove inactive
2025-08-03 00:51:12 +01:00
lhez
5c0eb5ef54
opencl: fix adreno compiler detection logic (#15029)
2025-08-02 19:51:18 +02:00
Ed Addario
fc8f92596f
Update table display
2025-08-02 16:46:27 +01:00
Ed Addario
ee2509f563
Adjust threshold
2025-08-02 16:45:56 +01:00
Ed Addario
9b841eb696
Compute l2 norm
2025-08-02 16:45:09 +01:00
Ed Addario
b7fb362d8e
Compute cosine similarity based on activations
2025-08-02 16:43:49 +01:00
Ed Addario
cce514a392
Compute entropy for activations
2025-08-02 16:40:40 +01:00
Ed Addario
9744a4a1c6
Determine calculation mode
2025-08-02 16:36:12 +01:00
Ed Addario
78ddb475de
Fix problem when GGUF does not have in_sum
2025-08-02 16:31:21 +01:00
Johannes Gäßler
03d4698218
CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (#15035)
2025-08-02 16:37:08 +02:00