* optimise GGML_OP_SUM
* add non-contiguous tests by permuting the input
* change tests to require full contiguity of OP_SUM
* cuda : add check GGML_OP_SUM
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* scaffold to support opt step adamw on metal (not written so far)
* add opt-step-adamw kernel for metal
* pass op->src[4] as a separate buffer to the pipeline
* add bounds check to opt-step-adamw kernel
* complete scaffold for GGML_OP_SUM
* naive GGML_OP_SUM kernel
* remove unwanted comment
* change OP_SUM capability gate
* Add has_simdgroup_reduction to both ops to pass CI
* metal : better unroll in the FA kernels
* metal : index FA blocks
* tests : restore [no ci]
* metal : prevent division by zero in FA kernels
* metal : fix -INF detection logic
* metal : pad K, V and Mask when needed
* cont : simplify
* cuda : add TODO about KV padding requirement
* metal : add comments
* metal : remove mask padding requirement
* metal : ssm_scan minor opts
* metal : get_rows optimize
* metal : cpy optimize
* metal : ssm_conv opt
* metal : ssm_scan simplify
* metal : ssm_Scan opt
* metal : support mul_mm with src1->type == GGML_TYPE_F16
* metal : support mul_mm_id with src1->type == GGML_TYPE_F16
[no ci]
* metal : mul_mm support ne00 % 32 != 0
* metal : support mul_mm_id with ne00 % 32 != 0
* cont : remove unnecessary unrolls
* cont : simplify data loading
* metal : optimize mul_mm when output bounds checks are not needed
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* metal : improve naming
* metal : refactor device
ggml-ci
* cont : props
ggml-ci
* metal : apply ggml_mem_ranges_t
ggml-ci
* metal : remove GGML_METAL_USE_BF16
ggml-ci
* metal : refactor device buffer
ggml-ci
* cont : fix naming
* metal : sync before destroying the backend
ggml-ci
* metal : refactor context
ggml-ci
* metal : migrate ggml-metal.m to ggml-metal.cpp
ggml-ci
* metal : adjust ops API
ggml-ci
* metal : use C++ to store piplienes
ggml-ci
* metal : migrate ops to separate functions
ggml-ci
* metal : add ggml_metal_library_t
ggml-ci
* metal : improve naming
ggml-ci
* metal : cleanp
ggml-ci
* metal : add support for GGML_OP_LOG
ggml-ci
* metal : fix error handling
ggml-ci