Commit Graph

678 Commits

Author SHA1 Message Date
Ed Addario b748a1efa7
Fix typo 2025-09-21 22:03:54 +01:00
Ed Addario 896cdc2121
Refactor potential overflow 2025-09-21 22:03:36 +01:00
Ed Addario fecc472c61
Fix typos in variable names 2025-09-21 17:26:38 +01:00
Ed Addario e92db008bc
Refactor quantisation checks into their own function 2025-09-21 17:20:48 +01:00
Ed Addario 814f6b66be
Minor general refactoring 2025-09-21 16:45:09 +01:00
Ed Addario 0d5f18303e
Refactor lagrange_penalty() 2025-09-21 16:22:00 +01:00
Ed Addario 9a1656eb97
Refactor pareto optimise and convexify 2025-09-21 16:21:35 +01:00
Ed Addario 1a3e9ea4c8
Refactor estimate_error() 2025-09-21 16:21:00 +01:00
Ed Addario a7ee915e19
Refactor trimmed_sum() 2025-09-21 16:20:06 +01:00
Ed Addario b09662f86a
Refactor estimate_lambda() 2025-09-21 16:19:49 +01:00
Ed Addario 17be7615ce
Refactor candidate types build 2025-09-21 16:19:28 +01:00
Ed Addario 08146fd67f
Refactor side_data() and copy_or_broadcast() 2025-09-21 16:19:03 +01:00
Ed Addario 7386d4eadd
Refactor row sampling 2025-09-21 16:18:26 +01:00
Ed Addario b6c008fd8a
Refactor helper lambdas 2025-09-21 16:04:13 +01:00
Ed Addario b433fd9547
Refactor last budget pass 2025-09-21 13:43:09 +01:00
Ed Addario c466c53808
Refactor pareto pruning and convexification 2025-09-21 13:42:54 +01:00
Ed Addario 6b8cedf3bc
Refactor estimate_lambda() 2025-09-21 13:42:31 +01:00
Ed Addario bdefdb673c
Refactor copy_or_broadcast() 2025-09-21 13:42:07 +01:00
Ed Addario e8e2aed17a
Refactor row sampling 2025-09-21 13:41:44 +01:00
Ed Addario 9e74f83411
Replace --bpw-bias flag with --no-bias 2025-09-20 23:06:37 +01:00
Ed Addario ab02bb1f3e
Merge branch 'master' into quantize 2025-09-20 21:41:25 +01:00
Ed Addario a36946997e
Replace fast_bias() with per-slice version and remove precise_bias() 2025-09-20 21:36:54 +01:00
Ed Addario 14fae69a7b
General refactoring 2025-09-20 21:31:31 +01:00
Georgi Gerganov e58174cecb
llama : bump max seq limit from 64 to 256 (#15916)
ggml-ci
2025-09-18 12:47:56 +03:00
Xuan-Son Nguyen 8f8f2274ee
convert : add Llama4ForCausalLM (#16042)
* convert : add Llama4ForCausalLM

* handle swa

* half working version

* fix use_kq_norm

* fix use_kq_norm
2025-09-17 19:18:21 +02:00
Jie Fu (傅杰) 745cbcf2fe
llama-quant : fix the verification of attention layers for encoder-decoder models (#16023)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-17 09:30:55 +02:00
Shane A 85286f3548
model : add OLMo3 support (#16015)
* Add HF to gguf conversion logic for Olmo3

* Add Olmo3 implementation

* Update rope comment

* Fix indentation

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-17 09:01:58 +02:00
Aman Gupta 6d758839ff
Add LLaDA-7b-MoE diffusion model (#16003) 2025-09-16 10:38:28 +08:00
Ed Addario ad70fca5b2
Merge branch 'quantize' of https://github.com/EAddario/llama.cpp into quantize 2025-09-15 07:42:37 +01:00
Ed Addario 9b857e3984
Merge branch 'ggml-org:master' into quantize 2025-09-14 23:35:43 +01:00
Ed Addario c709e1a335
Fix MoE tensor estimation 2025-09-14 22:38:27 +01:00
Sigbjørn Skjæret b8e09f08b9
model : add grok-2 support (#15539)
* add grok-2 support

* type fix

* type fix

* type fix

* "fix" vocab for invalid sequences

* fix expert tensor mapping and spaces in vocab

* add chat template

* fix norm tensor mapping

* rename layer_out_norm to ffn_post_norm

* ensure ffn_post_norm is mapped

* fix experts merging

* remove erroneous FFN_GATE entry

* concatenate split tensors and add more metadata

* process all expert layers and try cat instead of hstack

* add support for community BPE vocab

* fix expert feed forward length and ffn_down concat

* commit this too

* add ffn_up/gate/down, unsure if sequence is right

* add ffn_gate/down/up to tensor names

* correct residual moe (still not working)

* mess--

* fix embedding scale being applied twice

* add built in chat template

* change beta fast for grok if default value

* remove spm vocab in favor of community bpe vocab

* change attention temp length metadata type to integer

* update attention temp length metadata

* remove comment

* replace M_SQRT2 with std::sqrt(2)

* add yarn metadata, move defaults to hparams
2025-09-14 23:00:59 +02:00
Ed Addario 8503d59ee4
Increase IQ options 2025-09-13 11:49:18 +01:00
Ed Addario 2b516068e2
"Convexify" candidate list 2025-09-13 09:41:52 +01:00
Ed Addario 12e816b511
Replace greedy allocator with lagrangian relaxation 2025-09-13 09:24:23 +01:00
Ed Addario 7d85993f26
Minor refactoring 2025-09-13 08:44:41 +01:00
Ed Addario 4dff85fbe5
Improve precise_lambda() efficiency 2025-09-13 08:41:37 +01:00
Ed Addario bc8762f27f
Capture surrounding function name 2025-09-13 08:33:22 +01:00
Ed Addario 886536d80a
Increase error type precision 2025-09-13 08:27:23 +01:00
Haiyue Wang f4e664f838
context : remove redundant explicit casting to the same type (#15948)
The return type of the function 'output_reserve' is 'uint32_t', so there is no need to add
explicit casting.
2025-09-12 18:16:32 +03:00
Diego Devesa 360d6533db
ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797)
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type

ggml-backend : add device id to device props

llama : only use iGPU devices if there are no GPU devices

llama : do not use multiple devices from different backends with the same device id
2025-09-11 22:47:38 +02:00
ddh0 df082f5630
nitpick : correct MB to MiB (#15934)
MB was incorrectly used for 1024 x 1024 bytes instead of MiB
2025-09-11 19:12:34 +02:00
Ed Addario f0f07bd6ba
Merge branch 'master' into quantize 2025-09-10 21:16:04 +01:00
Jie Fu (傅杰) 4f658855fa
llama : support T5 models with unequal number of encoder-decoder layers (#15909)
* Extend the support of T5 models with different encoder-decoder layers

Signed-off-by: Jie Fu <jiefu@tencent.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-arch.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Rename n_dec_layer --> dec_n_layer

Signed-off-by: Jie Fu <jiefu@tencent.com>

* Adapt to cases when dec_n_layer > n_layer

Signed-off-by: Jie Fu <jiefu@tencent.com>

---------

Signed-off-by: Jie Fu <jiefu@tencent.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-10 20:51:51 +02:00
Sigbjørn Skjæret 6ab397e12b
graph : support non-contiguous Q in build_attn_mha (#15908)
* support non-contiguous Q in build_attn_mha

* Update src/llama-graph.cpp

ggml-ci

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-10 19:08:59 +02:00
Ed Addario 04c07b3272
Add better control over MSE and directional bias computation 2025-09-10 18:00:56 +01:00
Daniel Bevenius 86587da03b
llama : check returned fn ptrs from ggml_backend_reg_get_proc_address (#15893)
This commit adds checks for two function pointers returned from
ggml_backend_reg_get_proc_address.

The motivation for this is that the function pointer could be nullptr if
the get proc address function changes in the future. This is also
consistent with all the other calls to ggml_backend_reg_get_proc_address
in the code base.
2025-09-10 05:33:58 +02:00
Georgi Gerganov 663027fd54
context : fix n_outputs during reserve (#15858)
ggml-ci
2025-09-08 10:26:36 +03:00
Georgi Gerganov cf0e3ba150
model : avoid ggml_cont_3d for fused QKV weights (#15662)
* model : avoid ggml_cont_3d for fused QKV weights

ggml-ci

* kv-cache : make cpy_k and cpy_v implementation more readable

ggml-ci

* cont : add comments

ggml-ci

* cont : minor fix [no ci]

* cont : one more fix

* cont : clarity

ggml-ci

* kv-cache : require contiguous heads of k_cur and v_cur

ggml-ci
2025-09-08 10:25:33 +03:00
Ed Addario 7d04050b23
Merge branch 'master' into quantize 2025-09-06 13:09:41 +01:00