* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code looks more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32
* kv-cache : use ggml_set_rows
ggml-ci
* graph : separate k and v indices
ggml-ci
* cont : remove redundant ifs
ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments
ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
* SYCL: disable faulty fp16 CPU exponent for now
* Revert "SYCL: disable faulty fp16 CPU exponent for now"
This reverts commit ed0aab1ec3.
* SYCL: disable faulty fp16 CPU exponent for now
* Fix logic of disabling exponent kernel
* implement unary REGLU/GEGLU/SWIGLU cpu ops
* relax constraints
* duplicate shape of source
* fix ggml_vec_geglu_f16
* special case gated ops
* implement unary REGLU/GEGLU/SWIGLU cuda ops
* tighten constraints again
* refactor into GGML_GLU_OP
* metal : add glu kernels
ggml-ci
* add CUDA_GLU_BLOCK_SIZE [no ci]
* more constraints and use 64bit ints
ggml-ci
* 64bit multiplication [no ci]
* implement swapped variants (cpu/cuda)
* update comment [no ci]
ggml-ci
* Vulkan: Add GLU ops and shaders
* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
* ggml : implement GLU for split up/gate (#14181)
* implement GLU for split up/gate
* add tests for ggml_glu_split
* Vulkan: Implement glu_split logic and shader support
* add split to logging [no ci]
* SYCL: refactor element_size ops and add split up and gate support to gated kernels
* SYCL: switch GEGLU to use tanh approximation
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
* GGML: increase OP count in assertion
* Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
* vulkan: Increase workgroup size for GLU, for performance (#14345)
* vulkan: Increase workgroup size for GLU, for performance
* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
* merge fix
* metal : add support for split and swap
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* Add header and namespace to use enqueue_functions extension
* Convert submit and parallel_for to use new extension in convert.cpp
* Convert submit and parallel_for to use extension in ggml-sycl.cpp
* Convert submit and parallel_for to use extension in gla.cpp
* Convert submit and parallel_for in mmq.cpp
* Convert submit and parallel_for in mmvq.cpp
* Convert submit and parallel_for in remaining files
* Convert all simple parallel_for to nd_launch from enqueue_functions
extension
* Wrapping extension in general function
Create a general function that enable the enqueue_functions extension if
it is enable in the compiler, otherwise call the general SYCL function
to launch kernels.
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
Update oneMath commit to merged PR https://github.com/uxlfoundation/oneMath/pull/669
which adds SYCL-Graph support for recording CUDA BLAS commands.
With this change the `MUL_MAT` tests now pass on DPC++ CUDA backends with SYCL-Graph
enabled. Prior to this change, an error would be thrown.
```
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2
UR CUDA ERROR:
Value: 700
Name: CUDA_ERROR_ILLEGAL_ADDRESS
Description: an illegal memory access was encountered
Function: operator()
Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154
Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```
* Add Reorder to Q6_K mmvq implementation
* Address PR comments: clean up comments
* Remove unused parameter after refactoring q4_k
* Adding inline to function and removing unnecessary reference to int
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* SYCL: Implement few same quantized type copy kernels
* Use memcpy for copying contiguous tensors
ggml-ci
* feat(sycl): add contiguous tensor copy support and device checks
Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.
* refactor: replace specific block copy functions with template
The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.
* Exclude BF16 support for COPY tensors for now
ggml-ci
* perf: adjust SYCL copy kernel block sizes for efficiency
Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
* SYCL: Add mrope kernel
* feat: Optimize rope operations with vectorization
Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.
* Use ceil_div
* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support
ggml-ci
* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows
ggml-ci
* Revert "Swap grid dims of nsamples and nrows"
This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.
* restore not required changes
ggml-ci
* address review comments: change it to more like SYCL
* Use a common function to calculate offset
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
* Remove mmap workaround on windows
After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.
* Update llama-bench README
SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
* sycl : Implemented reorder Q4_0 mmvq
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* sycl : Fixed mmvq being called when reorder is disabled
* sycl : Improved comments in the quants header
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* Use static_assert
* safe_div -> ceil_div
* Clarify qi comment
* change the reorder tensor from init to execute OP
* dbg
* Undo changes to test-backend-ops
* Refactor changes on top of q4_0 reorder fix
* Missing Reverts
* Refactored opt_for_reorder logic to simplify code path
* Explicit inlining and unroll
* Renamed mul_mat_algo enum for consistency
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
* ggml : remove MSVC warnings pragmas
This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.
* whisper : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
* SYCL: Add all missing unary kernels
ggml-ci
* decouple kernel launch range from data size using strided loop
* use ciel_div helper for num_blocks
ggml-ci
* clean auto imported header files
The current usage of the SYCL-Graph extension checks for
the `sycl_ext_oneapi_graph` device aspect. However, it is also
possible to support `sycl_ext_oneapi_limied_graph` devices that
don't support update
* SYCL: Add fp16 support to some elementwise OP kernels
* remove comment
ggml-ci
* Use static_cast directly
* remove not needed cast from tanh
* Use static cast and remove unneeded castings
* Adjust device_support_op for unary OPs
* Use cast_data and typed_data struct to deduplicate casting code
* Rename oneMKL Interface to oneMath
* Use oneMath for Intel vendor
* Rename occurences to mkl
* clang-format
* Silence verbose warnings
* Set oneMath HIP_TARGETS
* Fix silence warnings
* Remove step to build oneMath from build instructions
* Use fixed oneMath version
* Remove INTEL_CPU
* Fold CMake oneDNN conditions
* Use Intel oneMKL for Intel devices
* Improve CMake message
* Link against MKL::MKL_SYCL::BLAS only
* Move oneMath documentation to Nvidia and AMD sections
* Upgrade init_tensor API to return a ggml_status
To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.
* misc fixes
---------
Co-authored-by: slaren <slarengh@gmail.com>
* opt performance by reorder for Intel GPU
* detect hw type and save opt feature, and print opt feature
* correct name
* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed
* add env variable GGML_SYCL_DISABLE_OPT for debug
* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT
* add performance data
* mv getrows functions to separeted files
* fix global variables
---------
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
Implemented ggml_sycl_op_soft_max() F16 src1(mask) support for which a pragma deprecation warning was added during #5021.
To do this, had to decouple it from ggml_sycl_op_flatten which always considered src1 to be of fp32 type(many OP functions are dependent on it).
* SYCL: SOFTMAX F16 mask support and other fixes
* test-backend-ops: Add F16 mask test cases
* Implement host pool for matrix_info
Creating a new memory pool on the host to store memory location for
matrix_info needed to launch gemm_batch from oneMKL/oneMath.
Removing complex support in gemm_batch since it is not used in llama.cpp
* Remove unnecessary headers and cast
* Reorder member variable to avoid warning on initialization
* Formatting
* Remove unused variable
* Address PR review feedback - remove warning
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* SYCL: refactor ggml_sycl_compute_forward
* SYCL: add back GGML_USED(dst) to ggml_sycl_cpy
* SYCL: add function name to noop debug
* SYCL: Some device info print refactoring and add details of XMX availability
* SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6
* Revert "SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6"
This reverts commit f62dc45f31.
* Reland: Use get_multi_ptr instead of deprecated get_pointer in wkv6
* Migrate to tensor->buffer for checking backend buffer type: 1
* SYCL: common.cpp try to migrate away from tensor->backend
* SYCL: fix assertions and add proper comments
* SYCL: remove extra space
* SYCL: Add back static to ggml_backend_buffer_is_sycl_split function
* SYCL: Add pragma directive to suppress warning spam
* SYCL: Integrate debug logs with GGML_LOG and other fixes
* Revert "SYCL: Integrate debug logs with GGML_LOG and other fixes"
This reverts commit 2607b7de0f.
Let's keep the current SYCL specific logging mechanism for now
* SYCL: Use GGML_SYCL_DEBUG after reverting
* SYCL: reg_get_proc_address func, update to the current func signature
* SYCL: Refactor SYCL buffer checks in ggml_sycl_cpy_tensor_2d
* Try to reduce some unused and typecast warnings
* Reduce compiler warnings step 2
* add a newline at the end of the file
* Initialize nreduce as size_t
* [SYCL] Remove pragma directives from mmq.cpp
* SYCL: mmq add condition to prevent blocks_per_tile_x_row variable from becoming 0
* SYCL softmax: Initialize nreduce as size_t
* ggml-sycl.cpp: fix some trailing whitespaces
* SYCL: remove the unused variables instead of commenting it out
* SYCL poo2d kernel: set NAN for invalid pooling op
* SYCL gemm.hpp: remove pragma directives
* SYCL gemm.hpp: use const cast to properly support dnnl::memory
* SYCL: wkv6 remove a comment
* SYCL: clean comments step 2
* SYCL: clean comments and variables step 3
* SYCL: Use GGML_UNUSED for unused variables
* SYCL: remove extra empty lines and a comment
* Remove TODO
* cleanup spaces
* add a stdout for unsupported op
* use sycl printf over fprintf
* remove prints for CI
* SYCL ggml-sycl: pool2D use sycl::nan and remove if-else block
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
- clean Q4_0_N_M and IQ4_0_N_M
- remove from "file" tensor type
- allow only with dynamic repack
- extract cpu extra bufts and convert to C++
- hbm
- "aarch64"
- more generic use of extra buffer
- generalise extra_supports_op
- new API for "cpu-accel":
- amx
- aarch64
* clang-format
* Clean Q4_0_N_M ref
Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend
Move to compile time selection to backend to avoid latency at run time.
Add it to all mkl gemm calls and only for NVIDIA backend.
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* Formatting
* Address PR comments to increase readibility
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* Add option to set the SYCL architecture for all targets
* Convert GGML_SYCL_HIP_TARGET to the more generic GGML_SYCL_ARCH option
* Document that setting GGML_SYCL_ARCH can improve the performance
* sycl: Use syclcompat::dp4a
* Using the syclcompat version allow the compiler to optimize the
operation with native function
* Update news section
* Update CI Windows oneAPI version to 2025.0
* Reword doc
* Call syclcompat::dp4a inside dpct::dp4a
This reverts commit 90cb61d692.
* Fixes broken build for the SYCL CUDA backend caused by non-explicit gemm call in outprod (merged in with RWKV6 in
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration #10133)
* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops
* Fixes asserts in norm to fix debug builds.
* rwkv6: rename to wkv6
* rwkv6: support avx2 avx512 armv8 armv9
* rwkv6: update cuda file name
* rwkv6: rename params
* wkv on sycl
* sycl: add some ops
* sycl: Enhance OP support judgment
* wkv6: drop armv9 and tranfer to GGML style
ggml-ci
* sync : ggml
* update the function to use appropriate types
* fix define error
* Update ggml/src/ggml-cpu.c
* add appropriate asserts
* move element-wise functions outside
* put the declaration outside the loop
* rewrite to be more inline with the common pattern for distributing threads
* use recommended way GGML_TENSOR_LOCALS
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Plamen Minev <pacominev@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
* fallback mmvq to mul_mat
* mmvq in cuda path
* Update ggml/src/ggml-sycl.cpp
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
---------
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
* ggml : move rope type enum to ggml.h
This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.
The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.
Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.
* squash! ggml : move rope type enum to ggml.h
This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and back the llama_rope_type enum.
I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.
* squash! ggml : move rope type enum to ggml.h
This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.
* squash! ggml : move rope type enum to ggml.h
This commit contains a suggestion enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.
* squash! ggml : move rope type enum to ggml.h
This commit fixes the editorconfig-checker warnings.
* squash! ggml : move rope type enum to ggml.h
Update comment for ggml_rope function.
* Revert "squash! ggml : move rope type enum to ggml.h"
This reverts commit 6261222bd0.
* squash! ggml : move rope type enum to ggml.h
Add GGML_ROPE_TYPE_NEOX to rope_common.comp.
* remove extra line
---------
Co-authored-by: slaren <slarengh@gmail.com>