Commit Graph

498 Commits

Author SHA1 Message Date
ibrahimkhadraoui 67b2664290 cleaning unused hparams 2025-07-08 10:20:17 +04:00
younesbelkada d2f46f18ac moe cleanups 2025-07-07 17:36:22 +04:00
younesbelkada 68cb7845e9 more cleanups 2025-07-07 17:34:20 +04:00
Younes B fd203302aa
Update src/llama-model-loader.cpp 2025-07-07 17:29:50 +04:00
younesbelkada 084873c215 some cleanups 2025-07-07 17:28:08 +04:00
younesbelkada 632861e6c1 some cleanups 2025-07-07 17:27:34 +04:00
younesbelkada f74e266f04 fix comment 2025-07-07 17:23:47 +04:00
ibrahimkhadraoui 042e5ff90b cleaning debug quant 2025-07-07 17:21:54 +04:00
ibrahimkhadraoui 624699c53f cleaning debugging stuff 2025-07-07 17:20:24 +04:00
ibrahimkhadraoui 935d46fab0 changed ROPE_TYPE 2025-07-07 17:01:54 +04:00
ibrahimkhadraoui ae937f442c rm unused key 2025-07-07 16:57:36 +04:00
ibrahimkhadraoui 53446f7e42 rm unused MAMBA_CHUNK_SIZE 2025-07-07 15:29:56 +04:00
ibrahimkhadraoui 0ad3502839 rm extra space 2025-07-07 15:26:46 +04:00
younesbelkada a9f3a63dc1 injected mup 2025-07-07 15:00:25 +04:00
ibrahimkhadraoui b3bc1fb237 Merge branch 'add-fh1-rebased' of https://github.com/tiiuae/llama.cpp-public into add-fh1-rebased 2025-07-07 14:36:55 +04:00
ibrahimkhadraoui 286e1fa569 fix rope_theta 2025-07-07 14:36:51 +04:00
ibrahimkhadraoui 49d7420964 inp_out_ids moved outside of layers loop 2025-07-07 14:18:48 +04:00
ibrahimkhadraoui 8c50893820 added some cb functions for debugging purposes 2025-07-07 14:10:45 +04:00
Younes B 6c39e775dd
fix conversion and d_inner 2025-07-07 10:56:49 +02:00
ibrahimkhadraoui 7a25441e13 fixed multipliers 2025-07-04 17:41:03 +04:00
ibrahimkhadraoui 15138df48f small fix ffn_norm 2025-07-04 15:37:40 +04:00
younesbelkada 22de62cf56 fix 2025-07-04 15:02:14 +04:00
younesbelkada cce35498d5 pre-norm -> norm 2025-07-04 14:58:33 +04:00
younesbelkada 50eadc7b33 fixes 2025-07-04 14:47:31 +04:00
younesbelkada 14c37ec047 more cleaning on python code 2025-07-03 18:09:30 +04:00
younesbelkada fdd5cff4ba minor fix 2025-07-03 17:12:05 +04:00
younesbelkada 0c93ef6a9c more fixes 2025-07-03 15:26:33 +04:00
younesbelkada 03568c9358 fix 2025-07-03 15:10:18 +04:00
younesbelkada 71a6848e2d another fix 2025-07-03 15:08:23 +04:00
younesbelkada f897efdaf6 push more fixes 2025-07-03 15:05:01 +04:00
younesbelkada 991de6cbe4 v1 2025-07-03 14:49:56 +04:00
Georgi Gerganov a70c8a0c4b
kv-cache : use ggml_set_rows (#14285)
* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
2025-07-03 10:53:35 +03:00
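
As context for the commit above: ggml_set_rows scatters source rows into a destination tensor at positions given by a row-index vector, which lets the KV cache write new entries into the slots chosen by find_slot without a separate copy pass. Below is a minimal plain-C++ sketch of that scatter-by-index semantics; names and memory layout are illustrative, not the ggml API.

```cpp
#include <cstdint>
#include <vector>

// Illustrative scatter-by-row-index, the semantics the kv-cache relies on:
// for each source row r, copy it into dst at row idx[r].
void set_rows(std::vector<float> & dst, int64_t n_embd,
              const std::vector<float> & src,
              const std::vector<int64_t> & idx) {
    const int64_t n_rows = (int64_t) idx.size();
    for (int64_t r = 0; r < n_rows; ++r) {
        const int64_t d = idx[r]; // destination slot chosen by the cache
        for (int64_t c = 0; c < n_embd; ++c) {
            dst[d*n_embd + c] = src[r*n_embd + c];
        }
    }
}
```
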
compilade 5d46babdc2
llama : initial Mamba-2 support (#9126)
* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This would otherwise cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON.

* cuda : graceful fallback for Mamba-1 models with weird embd size
2025-07-02 13:10:24 -04:00
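
For readers unfamiliar with the SSM_SCAN operation this commit touches so heavily: the selective state-space recurrence decays the hidden state, writes in a gated input, and reads the output through a projection. A hedged scalar sketch of one channel of the Mamba-style recurrence follows; all names are illustrative and the layout is greatly simplified relative to the actual SIMD/Metal/CUDA kernels.

```cpp
#include <cmath>
#include <vector>

// Scalar sketch of one channel of a selective SSM scan (Mamba-style).
// h: d_state hidden values; A: per-state decay; B, C: input/output
// projections for this timestep; dt: timestep gate; x: input value.
// The D * x skip term is left to the caller, matching the
// "avoid multiply by D in GGML_OP_SSM_SCAN" change above.
float ssm_scan_step(std::vector<float> & h,
                    const std::vector<float> & A,
                    const std::vector<float> & B,
                    const std::vector<float> & C,
                    float dt, float x) {
    float y = 0.0f;
    for (size_t n = 0; n < h.size(); ++n) {
        h[n] = std::exp(dt * A[n]) * h[n] + dt * B[n] * x; // decay + write
        y   += C[n] * h[n];                                // read-out
    }
    return y;
}
```
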
Georgi Gerganov 745f11fed0
memory : correctly handle failure in apply() (#14438)
ggml-ci
2025-06-30 18:03:03 +03:00
Sigbjørn Skjæret a0535ffa0d
ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
* implement unary REGLU/GEGLU/SWIGLU cpu ops

* relax constraints

* duplicate shape of source

* fix ggml_vec_geglu_f16

* special case gated ops

* implement unary REGLU/GEGLU/SWIGLU cuda ops

* tighten constraints again

* refactor into GGML_GLU_OP

* metal : add glu kernels

ggml-ci

* add CUDA_GLU_BLOCK_SIZE [no ci]

* more constraints and use 64bit ints

ggml-ci

* 64bit multiplication [no ci]

* implement swapped variants (cpu/cuda)

* update comment [no ci]

ggml-ci

* Vulkan: Add GLU ops and shaders

* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate

* ggml : implement GLU for split up/gate (#14181)

* implement GLU for split up/gate

* add tests for ggml_glu_split

* Vulkan: Implement glu_split logic and shader support

* add split to logging [no ci]

* SYCL: refactor element_size ops and add split up and gate support to gated kernels

* SYCL: switch GEGLU to use tanh approximation

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>

* GGML: increase OP count in assertion

* Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

* vulkan: Increase workgroup size for GLU, for performance (#14345)

* vulkan: Increase workgroup size for GLU, for performance

* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup

* merge fix

* metal : add support for split and swap

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-06-29 11:04:10 +02:00
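
The gated-linear-unit family added in this PR follows one pattern: split the up-projection into a gate half and a value half, activate the gate, and multiply. Reference scalar definitions of the three variants, using the standard formulas (including the tanh GELU approximation the SYCL backend switched to); these are not the ggml kernels themselves.

```cpp
#include <cmath>

// Standard gated activations on a (gate, up) pair of values.
inline float reglu (float g, float u) { return (g > 0.0f ? g : 0.0f) * u; }       // ReLU gate
inline float swiglu(float g, float u) { return (g / (1.0f + std::exp(-g))) * u; } // SiLU gate
inline float geglu (float g, float u) {                                           // tanh-approx GELU gate
    const float c = 0.7978845608028654f; // sqrt(2/pi)
    return 0.5f * g * (1.0f + std::tanh(c * (g + 0.044715f * g*g*g))) * u;
}
```
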
Weizhao Ouyang 566c16fcce
model : add support for ERNIE 4.5 0.3B model (#14408)
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang <weizhao.ouyang@arm.com>
2025-06-28 16:08:21 +02:00
Georgi Gerganov 72babea5de
graph : make llm_graph_context destructor virtual (#14410)
ggml-ci
2025-06-27 21:42:02 +03:00
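
This one-line change addresses the pitfall noted in the Mamba-2 commit (5d46babdc2) above: deleting a subclass through a base-class pointer whose destructor is not virtual is undefined behavior, which AddressSanitizer reports as a mismatched new/delete size once the subclass adds fields. A minimal C++ illustration with hypothetical class names:

```cpp
#include <vector>

struct graph_context {                  // stand-in for llm_graph_context
    virtual ~graph_context() = default; // without "virtual", the delete
                                        // below is undefined behavior
};

struct mamba_graph : graph_context {    // stand-in for a model subclass
    std::vector<float> extra_state;     // extra fields are only safe once
                                        // the base destructor is virtual
};

int main() {
    graph_context * g = new mamba_graph();
    delete g; // calls ~mamba_graph() only because ~graph_context is virtual
}
```
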
Georgi Gerganov 43678060c1
recurrent : call balloc split_reset() in init_batch() (#14414)
ggml-ci
2025-06-27 17:55:45 +03:00
Xuan-Son Nguyen 8846aace49
model : gemma3n text-only (#14400)
* gemma3n

* add llm_graph_input_one
2025-06-26 20:34:02 +03:00
Sigbjørn Skjæret b25346221d
llama : return mistral-v7-tekken as default template only (#14390) 2025-06-26 15:01:14 +02:00
Georgi Gerganov 62af464227
batch : fix check for empty sequences in memory (#14364)
* batch : fix check for empty sequences in memory

ggml-ci

* cont : reuse the var

ggml-ci
2025-06-24 18:26:30 +03:00
Molly Sophia 72c6bc3f3d
llama : better rwkv chat template and add missing `inputs.use_jinja` setting (#14336)
* llama-cli : add missing `inputs.use_jinja` setting

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama : better legacy chat template for rwkv

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-06-23 19:56:19 +08:00
Georgi Gerganov 7b50d589a8
kv-cells : fix tracking of seq_pos (#14339)
* kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

* cont : improve error message

ggml-ci

* cont : add more comments
2025-06-23 12:27:35 +03:00
Ed Addario fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) (#13037) 2025-06-22 23:16:26 +02:00
Georgi Gerganov 692e3cdd0a
memory : rename interface to llama_memory_context_i (#14296)
* memory : rename interface to llama_memory_context_i

ggml-ci

* cont : fix comments

* cont : use "mctx" for referencing a memory context

ggml-ci
2025-06-21 08:03:46 +03:00
Sigbjørn Skjæret 22015b2092
lint : remove trailing whitespace (#14304) 2025-06-20 16:37:44 +02:00
Ruikai Peng dd6e6d0b6a
vocab : prevent tokenizer overflow (#14301)
* vocab : prevent stack overflow in tokenize

* vocab : return error instead of aborting on oversized token count

* vocab : INT32_MIN from llama_tokenize on overflow
2025-06-20 07:13:06 -07:00
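
With this change, callers can detect an oversized tokenization instead of crashing. A hedged sketch of defensive handling around llama_tokenize follows, assuming the llama.h convention that a negative return is the negated required buffer size and, after this PR, that INT32_MIN signals an int32 overflow; the initial buffer guess and flag values are illustrative.

```cpp
#include <climits>
#include <string>
#include <vector>
#include "llama.h"

// Tokenize with a retry on undersized buffers; returns false on overflow.
// Assumes the post-#14301 convention: negative result = -(required size),
// INT32_MIN = token count does not fit in int32_t at all.
static bool tokenize_safe(const llama_vocab * vocab, const std::string & text,
                          std::vector<llama_token> & out) {
    out.resize(text.size() + 2); // rough initial guess
    int32_t n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                               out.data(), (int32_t) out.size(),
                               /*add_special*/ true, /*parse_special*/ false);
    if (n == INT32_MIN) {
        return false; // token count overflows int32_t
    }
    if (n < 0) {
        out.resize(-n); // buffer was too small; -n is the required size
        n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                           out.data(), (int32_t) out.size(), true, false);
        if (n < 0) {
            return false;
        }
    }
    out.resize(n);
    return true;
}
```
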
Sigbjørn Skjæret 88fc854b4b
llama : improve sep token handling (#14272) 2025-06-20 14:04:09 +02:00
Georgi Gerganov 812939a9e9
model : more uniform output id handling (#14275)
* model : more uniform output id handling

ggml-ci

* cont : revert n_outputs < n_tokens optimization

ggml-ci

* cont : fix out_ids initialization

ggml-ci
2025-06-20 10:50:27 +03:00
Georgi Gerganov 4c9fdfbe15
ubatch : new splitting logic (#14217)
ggml-ci
2025-06-20 10:14:14 +03:00