* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models
to avoid some reshapes.
Not sure if it's a good idea,
but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when runnning Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON
* cuda : graceful fallback for Mamba-1 models with weird embd size
* implement unary REGLU/GEGLU/SWIGLU cpu ops
* relax constraints
* duplicate shape of source
* fix ggml_vec_geglu_f16
* special case gated ops
* implement unary REGLU/GEGLU/SWIGLU cuda ops
* tighten constraints again
* refactor into GGML_GLU_OP
* metal : add glu kernels
ggml-ci
* add CUDA_GLU_BLOCK_SIZE [no ci]
* more constraints and use 64bit ints
ggml-ci
* 64bit multiplication [no ci]
* implement swapped variants (cpu/cuda)
* update comment [no ci]
ggml-ci
* Vulkan: Add GLU ops and shaders
* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
* ggml : implement GLU for split up/gate (#14181)
* implement GLU for split up/gate
* add tests for ggml_glu_split
* Vulkan: Implement glu_split logic and shader support
* add split to logging [no ci]
* SYCL: refactor element_size ops and add split up and gate support to gated kernels
* SYCL: switch GEGLU to use tanh approximation
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
* GGML: increase OP count in assertion
* Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
* vulkan: Increase workgroup size for GLU, for performance (#14345)
* vulkan: Increase workgroup size for GLU, for performance
* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
* merge fix
* metal : add support for split and swap
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* ggml : add ggml_set_rows
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.
ref: #8366
* use I64 for indices
* ggml : add repeat impl for i64
* ggml : add ggml_is_contiguous_rows
* ggml : ggml_set_rows support broadcast
* ggml : ggml_set_rows support quantized dst
ggml-ci
* ggml : support GGML_TYPE_F32 ".from_float" trait
* ggml : ggml_set_rows update comment + better index name
* tests : add ggml_set_rows
* metal : add ggml_set_rows implementation
ggml-ci
* ggml : simplify forward_dup_f32
* ggml : fix supports_op
* tests : add comment to set_rows
* ggml : leave the repeat_i64 for a separate PR
ggml-ci
* ggml : set_rows use std::min instead of MIN
* ggml : better error message for set_rows unsupported type
* metal : perform op->type check only once
* tests : more consistent implementation + more tests
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
* metal : use FA-vec kernel up to batch size 20
ggml-ci
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used
ggml-ci
* server : add test that exercises embeddings with FA enabled
ggml-ci
* ggml : FA with different K, V head sizes (CPU)
ggml-ci
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes
ggml-ci
* metal : add FA vector kernels for heads K 192 and V 128
ggml-ci
* ggml : restrict op on other backends to equal head sizes
ggml-ci
* metal : optimize FA-vec kernel
ggml-ci
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition
ggml-ci
* metal : fix comments + remove unnecessary addition
ggml-ci
* metal : avoid too much shared memory usage with mul_mat_id
ggml-ci
This commit updates the compilation of default.metallib to skip the
intermediate .air (Apple Intermediate Representation) file.
The motivation for this change is to simplify the custom command a
little and avoid generating and then removing the .air file.
* metal : refactor im2col parameters into a struct
* metal: Change im2col offset types from int32_t to uint64_t to support larger memory offsets
* metal : refactor sum_rows parameters into a struct
* metal : refactor soft_max parameters into a struct
* metal : refactor diag_mask_inf parameters into a struct
* metal : refactor ssm_conv parameters into a struct
* metal : refactor ssm_scan parameters into a struct
* metal : refactor get_rows parameters into a struct
* metal : refactor group_norm parameters into a struct
* metal : refactor conv_transpose_1d parameters into a struct
* metal : refactor upscale parameters into a struct
* metal : refactor pad parameters into a struct
* metal : refactor pad_reflect_1d parameters into a struct
* metal : refactor arange parameters into a struct
* metal : refactor timestep_embedding parameters into a struct
* metal : refactor argsort parameters into a struct
* metal : refactor leaky_relu parameters into a struct
* metal : refactor pool_2d parameters into a struct
* metal : fix trailing whitespace
---------
Co-authored-by: alexju <alexju@tencent.com>
This commit updates the custom command to build the default.metallib
file to use the correct path to ../ggml-common.h by using the variable
METALLIB_COMMON.
The motivation for this change is that currently when building and
specifying GGML_METAL_EMBED_LIBRARY=OFF the following error is
generated:
```console
[ 11%] Linking CXX shared library ../../bin/libggml.dylib
[ 11%] Built target ggml
make[2]: *** No rule to make target `ggml/src/ggml-metal/ggml-common.h', needed by `bin/default.metallib'. Stop.
make[1]: *** [ggml/src/ggml-metal/CMakeFiles/ggml-metal-lib.dir/all] Error 2
```
With the above change the build could progress but there was a follow
on error about not being able to find the ggml-common.h file in
ggml-metal.metal where is was included as a relative path:
```console
[ 11%] Compiling Metal kernels
/Users/danbev/work/llama.cpp/build/bin/ggml-metal.metal:6:10: error: '../ggml-common.h' file not found, did you mean 'ggml-common.h'?
^~~~~~~~~~~~~~~~~~
"ggml-common.h"
1 error generated.
```
Removing the relative path then allowed the build to complete
successfully.
-- it might happen if ggml is loaded from 2 separate libraries since each one of them will expose the class. This is more of a guard since we want to use only Metal as embedded library and don't care about the other case.
* Support fp16 unary operations in the CUDA backend
* cpu: increase fp16 support for unary operators in the CPU backend
* cuda: increase fp16 support for unary operators in the CUDA backend
* Add test cases for fp16 unary operators
* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing
* metal: fix PR comments for unary op support after fp16 unary tests
This fixes segmentation fault error when running tests when no metal
devices are available (for example, when not linked with Core Graphics
framework or otherwise).