Commit Graph

498 Commits

Author SHA1 Message Date
ibrahimkhadraoui 67b2664290 cleaning unused hparams 2025-07-08 10:20:17 +04:00
younesbelkada d2f46f18ac moe cleanups 2025-07-07 17:36:22 +04:00
younesbelkada 68cb7845e9 more cleanups 2025-07-07 17:34:20 +04:00
Younes B fd203302aa
Update src/llama-model-loader.cpp 2025-07-07 17:29:50 +04:00
younesbelkada 084873c215 some cleanups 2025-07-07 17:28:08 +04:00
younesbelkada 632861e6c1 some cleanups 2025-07-07 17:27:34 +04:00
younesbelkada f74e266f04 fix comment 2025-07-07 17:23:47 +04:00
ibrahimkhadraoui 042e5ff90b cleaning debug quant 2025-07-07 17:21:54 +04:00
ibrahimkhadraoui 624699c53f cleaning debugging stuff 2025-07-07 17:20:24 +04:00
ibrahimkhadraoui 935d46fab0 changed ROPE_TYPE 2025-07-07 17:01:54 +04:00
ibrahimkhadraoui ae937f442c rm unused key 2025-07-07 16:57:36 +04:00
ibrahimkhadraoui 53446f7e42 rm unused MAMBA_CHUNK_SIZE 2025-07-07 15:29:56 +04:00
ibrahimkhadraoui 0ad3502839 rm extra space 2025-07-07 15:26:46 +04:00
younesbelkada a9f3a63dc1 injected mup 2025-07-07 15:00:25 +04:00
ibrahimkhadraoui b3bc1fb237 Merge branch 'add-fh1-rebased' of https://github.com/tiiuae/llama.cpp-public into add-fh1-rebased 2025-07-07 14:36:55 +04:00
ibrahimkhadraoui 286e1fa569 fix rope_theta 2025-07-07 14:36:51 +04:00
ibrahimkhadraoui 49d7420964 inp_out_ids moved outside of layers loop 2025-07-07 14:18:48 +04:00
ibrahimkhadraoui 8c50893820 added some cb functions for debugging purposes 2025-07-07 14:10:45 +04:00
Younes B 6c39e775dd
fix conversion and d_inner 2025-07-07 10:56:49 +02:00
ibrahimkhadraoui 7a25441e13 fixed multipliers 2025-07-04 17:41:03 +04:00
ibrahimkhadraoui 15138df48f small fix ffn_norm 2025-07-04 15:37:40 +04:00
younesbelkada 22de62cf56 fix 2025-07-04 15:02:14 +04:00
younesbelkada cce35498d5 pre-norm -> norm 2025-07-04 14:58:33 +04:00
younesbelkada 50eadc7b33 fixes 2025-07-04 14:47:31 +04:00
younesbelkada 14c37ec047 more cleaning on python code 2025-07-03 18:09:30 +04:00
younesbelkada fdd5cff4ba minor fix 2025-07-03 17:12:05 +04:00
younesbelkada 0c93ef6a9c more fixes 2025-07-03 15:26:33 +04:00
younesbelkada 03568c9358 fix 2025-07-03 15:10:18 +04:00
younesbelkada 71a6848e2d another fix 2025-07-03 15:08:23 +04:00
younesbelkada f897efdaf6 push more fixes 2025-07-03 15:05:01 +04:00
younesbelkada 991de6cbe4 v1 2025-07-03 14:49:56 +04:00
Georgi Gerganov a70c8a0c4b
kv-cache : use ggml_set_rows (#14285)
* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
2025-07-03 10:53:35 +03:00
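
As context for the commit above: ggml_set_rows scatters source rows into a destination tensor at positions given by a row-index vector, which lets the KV cache write new entries into the slots chosen by find_slot without a separate copy pass. Below is a minimal plain-C++ sketch of that scatter-by-index semantics; names and memory layout are illustrative, not the ggml API.

```cpp
#include <cstdint>
#include <vector>

// Illustrative scatter-by-row-index, the semantics the kv-cache relies on:
// for each source row r, copy it into dst at row idx[r].
void set_rows(std::vector<float> & dst, int64_t n_embd,
              const std::vector<float> & src,
              const std::vector<int64_t> & idx) {
    const int64_t n_rows = (int64_t) idx.size();
    for (int64_t r = 0; r < n_rows; ++r) {
        const int64_t d = idx[r]; // destination slot chosen by the cache
        for (int64_t c = 0; c < n_embd; ++c) {
            dst[d*n_embd + c] = src[r*n_embd + c];
        }
    }
}
```
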
compilade 5d46babdc2
llama : initial Mamba-2 support (#9126)
* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This would otherwise cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON.

* cuda : graceful fallback for Mamba-1 models with weird embd size
2025-07-02 13:10:24 -04:00
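
For readers unfamiliar with the SSM_SCAN operation this commit touches so heavily: the selective state-space recurrence decays the hidden state, writes in a gated input, and reads the output through a projection. A hedged scalar sketch of one channel of the Mamba-style recurrence follows; all names are illustrative and the layout is greatly simplified relative to the actual SIMD/Metal/CUDA kernels.

```cpp
#include <cmath>
#include <vector>

// Scalar sketch of one channel of a selective SSM scan (Mamba-style).
// h: d_state hidden values; A: per-state decay; B, C: input/output
// projections for this timestep; dt: timestep gate; x: input value.
// The D * x skip term is left to the caller, matching the
// "avoid multiply by D in GGML_OP_SSM_SCAN" change above.
float ssm_scan_step(std::vector<float> & h,
                    const std::vector<float> & A,
                    const std::vector<float> & B,
                    const std::vector<float> & C,
                    float dt, float x) {
    float y = 0.0f;
    for (size_t n = 0; n < h.size(); ++n) {
        h[n] = std::exp(dt * A[n]) * h[n] + dt * B[n] * x; // decay + write
        y   += C[n] * h[n];                                // read-out
    }
    return y;
}
```
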
Georgi Gerganov 745f11fed0
memory : correctly handle failure in apply() (#14438)
ggml-ci
2025-06-30 18:03:03 +03:00
Sigbjørn Skjæret a0535ffa0d
ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
* implement unary REGLU/GEGLU/SWIGLU cpu ops

* relax constraints

* duplicate shape of source

* fix ggml_vec_geglu_f16

* special case gated ops

* implement unary REGLU/GEGLU/SWIGLU cuda ops

* tighten constraints again

* refactor into GGML_GLU_OP

* metal : add glu kernels

ggml-ci

* add CUDA_GLU_BLOCK_SIZE [no ci]

* more constraints and use 64bit ints

ggml-ci

* 64bit multiplication [no ci]

* implement swapped variants (cpu/cuda)

* update comment [no ci]

ggml-ci

* Vulkan: Add GLU ops and shaders

* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate

* ggml : implement GLU for split up/gate (#14181)

* implement GLU for split up/gate

* add tests for ggml_glu_split

* Vulkan: Implement glu_split logic and shader support

* add split to logging [no ci]

* SYCL: refactor element_size ops and add split up and gate support to gated kernels

* SYCL: switch GEGLU to use tanh approximation

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>

* GGML: increase OP count in assertion

* Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

* vulkan: Increase workgroup size for GLU, for performance (#14345)

* vulkan: Increase workgroup size for GLU, for performance

* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup

* merge fix

* metal : add support for split and swap

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-06-29 11:04:10 +02:00
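
The gated-linear-unit family added in this PR follows one pattern: split the up-projection into a gate half and a value half, activate the gate, and multiply. Reference scalar definitions of the three variants, using the standard formulas (including the tanh GELU approximation the SYCL backend switched to); these are not the ggml kernels themselves.

```cpp
#include <cmath>

// Standard gated activations on a (gate, up) pair of values.
inline float reglu (float g, float u) { return (g > 0.0f ? g : 0.0f) * u; }       // ReLU gate
inline float swiglu(float g, float u) { return (g / (1.0f + std::exp(-g))) * u; } // SiLU gate
inline float geglu (float g, float u) {                                           // tanh-approx GELU gate
    const float c = 0.7978845608028654f; // sqrt(2/pi)
    return 0.5f * g * (1.0f + std::tanh(c * (g + 0.044715f * g*g*g))) * u;
}
```
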
Weizhao Ouyang 566c16fcce
model : add support for ERNIE 4.5 0.3B model (#14408)
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang <weizhao.ouyang@arm.com>
2025-06-28 16:08:21 +02:00
Georgi Gerganov 72babea5de
graph : make llm_graph_context destructor virtual (#14410)
ggml-ci
2025-06-27 21:42:02 +03:00
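
This one-line change addresses the pitfall noted in the Mamba-2 commit (5d46babdc2) above: deleting a subclass through a base-class pointer whose destructor is not virtual is undefined behavior, which AddressSanitizer reports as a mismatched new/delete size once the subclass adds fields. A minimal C++ illustration with hypothetical class names:

```cpp
#include <vector>

struct graph_context {                  // stand-in for llm_graph_context
    virtual ~graph_context() = default; // without "virtual", the delete
                                        // below is undefined behavior
};

struct mamba_graph : graph_context {    // stand-in for a model subclass
    std::vector<float> extra_state;     // extra fields are only safe once
                                        // the base destructor is virtual
};

int main() {
    graph_context * g = new mamba_graph();
    delete g; // calls ~mamba_graph() only because ~graph_context is virtual
}
```
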
Georgi Gerganov 43678060c1
recurrent : call balloc split_reset() in init_batch() (#14414)
ggml-ci
2025-06-27 17:55:45 +03:00
Xuan-Son Nguyen 8846aace49
model : gemma3n text-only (#14400)
* gemma3n

* add llm_graph_input_one
2025-06-26 20:34:02 +03:00
Sigbjørn Skjæret b25346221d
llama : return mistral-v7-tekken as default template only (#14390) 2025-06-26 15:01:14 +02:00
Georgi Gerganov 62af464227
batch : fix check for empty sequences in memory (#14364)
* batch : fix check for empty sequences in memory

ggml-ci

* cont : reuse the var

ggml-ci
2025-06-24 18:26:30 +03:00
Molly Sophia 72c6bc3f3d
llama : better rwkv chat template and add missing `inputs.use_jinja` setting (#14336)
* llama-cli : add missing `inputs.use_jinja` setting

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama : better legacy chat template for rwkv

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-06-23 19:56:19 +08:00
Georgi Gerganov 7b50d589a8
kv-cells : fix tracking of seq_pos (#14339)
* kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

* cont : improve error message

ggml-ci

* cont : add more comments
2025-06-23 12:27:35 +03:00
Ed Addario fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) (#13037) 2025-06-22 23:16:26 +02:00
Georgi Gerganov 692e3cdd0a
memory : rename interface to llama_memory_context_i (#14296)
* memory : rename interface to llama_memory_context_i

ggml-ci

* cont : fix comments

* cont : use "mctx" for referencing a memory context

ggml-ci
2025-06-21 08:03:46 +03:00
Sigbjørn Skjæret 22015b2092
lint : remove trailing whitespace (#14304) 2025-06-20 16:37:44 +02:00
Ruikai Peng dd6e6d0b6a
vocab : prevent tokenizer overflow (#14301)
* vocab : prevent stack overflow in tokenize

* vocab : return error instead of aborting on oversized token count

* vocab : INT32_MIN from llama_tokenize on overflow
2025-06-20 07:13:06 -07:00
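
With this change, callers can detect an oversized tokenization instead of crashing. A hedged sketch of defensive handling around llama_tokenize follows, assuming the llama.h convention that a negative return is the negated required buffer size and, after this PR, that INT32_MIN signals an int32 overflow; the initial buffer guess and flag values are illustrative.

```cpp
#include <climits>
#include <string>
#include <vector>
#include "llama.h"

// Tokenize with a retry on undersized buffers; returns false on overflow.
// Assumes the post-#14301 convention: negative result = -(required size),
// INT32_MIN = token count does not fit in int32_t at all.
static bool tokenize_safe(const llama_vocab * vocab, const std::string & text,
                          std::vector<llama_token> & out) {
    out.resize(text.size() + 2); // rough initial guess
    int32_t n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                               out.data(), (int32_t) out.size(),
                               /*add_special*/ true, /*parse_special*/ false);
    if (n == INT32_MIN) {
        return false; // token count overflows int32_t
    }
    if (n < 0) {
        out.resize(-n); // buffer was too small; -n is the required size
        n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                           out.data(), (int32_t) out.size(), true, false);
        if (n < 0) {
            return false;
        }
    }
    out.resize(n);
    return true;
}
```
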
Sigbjørn Skjæret 88fc854b4b
llama : improve sep token handling (#14272) 2025-06-20 14:04:09 +02:00
Georgi Gerganov 812939a9e9
model : more uniform output id handling (#14275)
* model : more uniform output id handling

ggml-ci

* cont : revert n_outputs < n_tokens optimization

ggml-ci

* cont : fix out_ids initialization

ggml-ci
2025-06-20 10:50:27 +03:00
Georgi Gerganov 4c9fdfbe15
ubatch : new splitting logic (#14217)
ggml-ci
2025-06-20 10:14:14 +03:00