Commit Graph

941 Commits

Author SHA1 Message Date
Shawn yang eb3949938e
CUDA: add a prop in ggml_cuda_device_infor for distinguish iGPU or dGPU in cuda (#13856) (#13895)
* 1.  add "integrated" in ggml_cuda_device_info for distinguish whether it is Intergrate_gpu or discrete_gpu
2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted code indentation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fixed incorrect setting of variable types

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the judgment logic

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add a defensive security assert

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the support judgment logic.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* revoke the suggest commit changes due to it's not applicable in jetson_device

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add parentheses to enforce operator precedence​

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fix ci bug: add a spaces

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-31 08:48:04 +02:00
Johannes Gäßler e562eece7c
CUDA: fix typo in FlashAttention code (#13926) 2025-05-30 21:22:03 +02:00
Diego Devesa b47ab7b8e9
sched : avoid changing cur_copy when a graph is already allocated (#13922) 2025-05-30 18:56:19 +02:00
Diego Devesa df0c0c7d02
cuda : prevent using split buffers with 3d/4d matrices (#13919) 2025-05-30 16:37:18 +02:00
Akarshan Biswas b49a8ff96b
SYCL: Add mrope kernel (#13755)
* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
2025-05-30 19:40:57 +05:30
Christian Kastner ec9e0301fe
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890) 2025-05-30 01:28:54 +02:00
Yibo Cai 54a2c7a8cd
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
2025-05-29 14:39:20 +03:00
Christian Kastner 21fcc21ad5
cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function
2025-05-29 12:50:25 +02:00
Vineel Abhinav dd8ba93416
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-29 12:18:43 +03:00
Vineel Abhinav 1b8fb8152d
ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE
2025-05-29 09:01:33 +03:00
Johannes Gäßler a68247439b
CUDA: fix FA tg at long context for CC >= 8.9 (#13852) 2025-05-28 13:33:37 +02:00
leo-pony 1e8659e65a
CANN: Add SOC TYPE printing in cmake configuration (#13837) 2025-05-28 11:54:20 +08:00
lhez a3c30846e4
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (#13787)
* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`
2025-05-27 12:56:08 -07:00
lhez 1701d4c54f
opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (#13790) 2025-05-27 12:53:14 -07:00
Jeff Bolz bef8176387
vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Also change it to be controlled by an env var rather than cmake flag
2025-05-27 18:39:07 +02:00
Akarshan Biswas f3101a8cc6
SYCL: add gelu_erf kernel (#13749)
* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>

* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
2025-05-27 20:52:59 +05:30
Xuan-Son Nguyen a8ea03d8ad
ggml : add ggml_repeat_4d (#13824) 2025-05-27 15:53:55 +02:00
xctan 05f6ac6283
ggml : riscv: add xtheadvector support (#13720)
* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage
2025-05-27 16:21:36 +03:00
Christian Kastner 7fe03e7446
ggml-cpu: x86 feature detection is specific to x86 (#13811) 2025-05-27 13:18:39 +02:00
Diego Devesa 952f3953c1
ggml : allow CUDA graphs when using pipeline parallelism (#13814) 2025-05-27 13:05:18 +02:00
Georgi Gerganov 4265a87b59
cuda : avoid cuGetErrorString (#13791)
ggml-ci
2025-05-26 22:14:52 +03:00
Akarshan Biswas 6f180b915c
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)
* SYCL: Add non contiguous input support to norm kernel

* refactor and add RMS_NORM non contiguous input support

ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels

* Swap grid dims of nsamples and nrows

ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore not required changes
ggml-ci

* address review comments: change it to more like SYCL

* Use a common function to calculate offset

* remove wrap around logic for handling broadcasts

* remove static from calculate_offset fn and use ceil_div
2025-05-26 21:10:36 +05:30
Romain Biessy 9012eb9b45
sycl: Add more debug prints (#13640) 2025-05-26 10:28:53 +02:00
Jeff Bolz fef693dc6b
vulkan: mark IM2COL as supporting non-contig (#13783) 2025-05-26 06:02:07 +02:00
Bizhao Shi 2d38b6e400
CANN: Add the basic supports of Flash Attention kernel (#13627)
* cann: add the basic FA support

* cann: update the readme

* cann: update the FlashAttention with PSEShift

* cann: update the input parameters in FA

* cann: update the alibi with max_bias

* cann: add the constrints of softcap

* cann: update the docs CANN.md

* cann: update the docs CANN.md

* cann: fix typo of CANN.md

* cann: add some comments and update the CANN.md

* cann: update the CANN.md

* cann: update the inner precise for fusedInferAttention

* cann: update the constraints of flash_attn_ext on ggml-cann.cpp

* cann: clean the whitespace

* cann: clean the whitespace

* cann: add a new endline
2025-05-26 10:20:18 +08:00
Akarshan Biswas 515fdbf7ed
SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)
Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d8b0.

ggml-ci
2025-05-25 10:08:37 +03:00
Diego Devesa 2bd1b30f69
ggml-cpu : set openmp wait time if not set (#13758) 2025-05-24 22:26:47 +02:00
Xuan-Son Nguyen 4c32832c59
ggml : add ggml_gelu_erf() CUDA kernel (#13719)
* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon
2025-05-24 13:06:47 +02:00
Johannes Gäßler ffd0eae60b
CUDA: fix race condition in FA vector kernels (#13742) 2025-05-24 11:46:19 +02:00
Chenguang Li faaaff5f94
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <757486878@qq.com>

* codestyle adjustment

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-23 16:47:53 +08:00
Xuan-Son Nguyen e16c4731c7
ggml : fix the order of ggml_unary_op (#13718) 2025-05-23 08:12:48 +02:00
Jeff Bolz 1dcd01960c
vulkan: support CPY from any type to itself (#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
2025-05-23 06:45:02 +02:00
Jeff Bolz c10ed6cbcc
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696) 2025-05-23 06:33:45 +02:00
Judd a127ff1780
use LOG_WARN to replace `std::cerr` (#13657) 2025-05-23 06:33:08 +02:00
Nicolò Scipione d394a9aedc
sycl : Remove waits from function calls (#13702)
* removes the waits in async memcpy functions
2025-05-22 12:54:43 +01:00
Ewan Crawford 6b56a64690
SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587)
Currently on a CUDA backend to SYCL when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there
are two operations that throw an exception from the blocking
waits during queue recording.

* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu (L2458-L2458))
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
2025-05-22 16:24:09 +08:00
Henry Linjamäki a4e8912dfd
opencl: Add support for multiple devices (#12622)
* opencl: Add support for multiple devices

... but limited to one platform. A platform with a GPU will be preferred.

Additionally:

* Filter out devices that lack capabilities needed by the backend
  implementation (half support, OpenCL 2.0+, etc).

* Make ggml_backend_opencl_reg() thread-safe.

* fixup: fix an error in sync_with_other_backends

... when there is only one OpenCL device available.
2025-05-21 16:21:45 -07:00
Henry Linjamäki edbf42edfd
opencl: fix couple crashes (#12795)
* opencl: fix couple crashes

* fix kernel launches failed on devices which do not support
  non-uniform work-groups. When non-uniform work-groups are not
  supported, set `local_work_size` to NULL (= let driver choose the
  work-group sizes). This patch does not cover everything - just the
  cases tested by test-backend-ops.

* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
  being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
2025-05-21 13:21:17 -07:00
Xuan-Son Nguyen cf4cb59e64
ggml : add ggml_gelu_erf() (#13667)
* ggml : add ggml_gelu_na (not approximated)

* fix naming order

* rename na --> erf

* apply review suggesions

* revert naming order
2025-05-21 16:26:33 +02:00
R0CKSTAR 33983057d0
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647)
* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-05-21 09:58:49 +08:00
Eve fb1cab201c
vulkan: fix warnings (#13626)
* small fixes

* remove ifdef
2025-05-20 21:35:16 +00:00
Johannes Gäßler b69f1647f9
CUDA: skip fully masked-out KV in FA vec kernel (#13584)
* CUDA: skip fully masked-out KV in FA vec kernel
2025-05-20 14:45:07 +02:00
Svetlozar Georgiev 4245e622e0
sycl: disable reorder for sycl mulmat (#13536) 2025-05-20 11:34:15 +02:00
Georgi Gerganov c00a2634be
metal : fix typo in FA kernel comments (#13651) 2025-05-20 10:41:40 +03:00
Nicolò Scipione f7c9429c85
sycl : Overcoming workaround for mmap() allocation on Windows (#13482)
* Remove mmap workaround on windows

After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.

* Update llama-bench README

SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
2025-05-20 08:54:43 +08:00
0cc4m 8960efd0a6
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607) 2025-05-19 17:54:08 +02:00
Johannes Gäßler 6c35981a64 mnist: fix segmentation fault (ggml/1227) 2025-05-19 13:29:56 +03:00
Diego Devesa 8b5e19aea6 ggml : fix apple OS check in ggml_print_backtrace (ggml/1229) 2025-05-19 13:29:56 +03:00
Daniel Tang 60aea028b5 ggml : Fix missing backtrace on Linux (ggml/1228)
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
2025-05-19 13:29:56 +03:00
Chenguang Li 33d7aed4a8
CANN: Support MOE Model MUL_MAT_ID (#13042)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:21:17 +08:00
Gilad S. e3a7cf6c5b
cmake: use the current build config for vulkan-shaders-gen (#13595)
* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`
2025-05-17 15:26:43 -03:00
Jeff Bolz 2f5a4e1e09
vulkan: move common FA code to flash_attn_base.comp (#13556)
* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix
2025-05-17 09:14:55 +02:00
Jeff Bolz 4f41ee11d6
vulkan: use scalar FA rather than coopmat2 when N==1 (#13554) 2025-05-17 08:35:47 +02:00
Georgi Gerganov 654a67794f
metal : add FA-vec kernel for head size 64 (#13583)
ggml-ci
2025-05-16 20:32:58 +03:00
Łukasz Ślusarczyk 0a338ed013
sycl : fixed compilation warnings (#13582) 2025-05-16 18:15:29 +08:00
Diego Devesa c6a2c9e741
gguf : use ggml log system (#13571)
* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages
2025-05-15 19:13:11 +02:00
Atharva Dubey 02cdd2d8b0
sycl: simplify bin_bcast_kernel (#13383) 2025-05-15 17:39:52 +02:00
Svetlozar Georgiev 64bb51cf90
sycl: reordered Q4_K MMVQ (#13109) 2025-05-15 17:35:44 +02:00
Łukasz Ślusarczyk 9c404ed54c
sycl: use oneDNN for matrices multiplication (#12972) 2025-05-15 16:53:41 +02:00
Yibo Cai 5ab5d5fb25
arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
2025-05-14 21:53:52 +02:00
Johannes Gäßler 4696d56749
CUDA: fix crash on large batch size for quant. MoE (#13537) 2025-05-14 16:41:02 +02:00
Johannes Gäßler 6da34fa276
CUDA: faster Deepseek FA, add Turing support (#13435) 2025-05-14 16:08:20 +02:00
bandoti 09d13d94fb
cmake: simplify vulkan shader test logic (#13263) 2025-05-14 07:53:57 -03:00
Jeff Bolz 24e86cae72
vulkan: KHR_coopmat flash attention (#13506)
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
2025-05-14 11:55:26 +02:00
Jeff Bolz ab3971f2a0
vulkan: workaround FA compile failures on macos (#13517) 2025-05-14 06:15:50 +02:00
Georgi Gerganov f0995d28ce
metal : use FA-vec kernel up to batch size 20 (#13496)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci
2025-05-13 18:04:39 +03:00
Georgi Gerganov c252e0c409
metal : optimize multi-sequence FA vec kernel (#13493)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci
2025-05-13 18:04:00 +03:00
Dan Johansson 4f711afed5
ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-05-13 18:02:28 +03:00
lhez f0d46ef157
opencl: remove unnecessary assert for `add` (#13257) 2025-05-12 13:13:49 -07:00
Johannes Gäßler 10d2af0eaa
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
Dan Johansson a71a4075cd
ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)
* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* * code review fixes

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* * adds a comment that clarifies barrier usage

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

---------

Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
2025-05-12 13:06:19 +02:00
Johannes Gäßler 95e18884fc
CUDA: fix misaligned synchronization in FA (#13469) 2025-05-12 10:51:21 +02:00
Xuan-Son Nguyen df8491922f
ggml : add mrope kernel for metal (#13457) 2025-05-12 10:29:13 +02:00
Atharva Dubey 14492144c2
enable dpcpp nightly builds with libraries (#13406) 2025-05-12 13:15:32 +08:00
Johannes Gäßler 7474e00b34
CUDA: fix crash with partial offloading of MoE (#13439) 2025-05-11 16:09:33 +02:00
David Huang 7f323a589f
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386) 2025-05-11 14:18:39 +02:00
Johannes Gäßler 0208355f42
CUDA: fix race conditions FlashAttention kernels (#13438) 2025-05-10 22:22:48 +02:00
Johannes Gäßler d8919424f1
CUDA: fix FlashAttention on Turing (#13415) 2025-05-10 09:16:52 +02:00
Jeff Bolz dc1d2adfc0
vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
2025-05-10 08:07:07 +02:00
Alberto Cabrera Pérez 17512a94d6
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)
* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
2025-05-09 16:34:08 +01:00
Georgi Gerganov 611aa914ef
metal : optimize MoE for large batches (#13388)
ggml-ci
2025-05-09 15:14:56 +03:00
Johannes Gäßler 0cf6725e9f
CUDA: FA support for Deepseek (Ampere or newer) (#13306)
* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template
2025-05-09 13:34:58 +02:00
Johannes Gäßler 5c86c9ed3e
CUDA: fix crash on large batch size for MoE models (#13384) 2025-05-09 12:14:04 +02:00
Radoslav Gerganov b486ba05bf
rpc : add rpc_msg_set_tensor_hash_req (#13353)
* rpc : add rpc_msg_set_tensor_hash_req

Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which
makes the code cleaner.

* fix
2025-05-09 10:31:07 +03:00
Jeff Bolz 02115dcd9a
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.
2025-05-09 09:23:41 +02:00
Alberto Cabrera Pérez 8733e0cf6e
sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)

* Fixed wrong static_cast inside kernel
2025-05-08 10:08:01 +01:00
Daniel Bevenius 13b0a04597 whisper: remove MSVC warnings pragmas (whisper/3090)
* ggml : remove MSVC warnings pragmas

This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.

* whisper : remove MSVC warning pragmas

This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
2025-05-07 17:28:36 +03:00
Jared Tweed bba9d945c1 cmake : removed stdc++fs (whisper/3097)
* removed stdc++fs

* kept line, but removed stdc++fs
2025-05-07 17:28:36 +03:00
R0CKSTAR 1f73301b63
cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-07 09:48:23 +02:00
Johannes Gäßler 141a908a59
CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135) 2025-05-06 23:35:51 +02:00
Akarshan Biswas 1e333d5bba
SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254)
* SYCL: Do not set tensor extras when reorder optimize is disabled

* SYCL: Disable reorder optimize by default
2025-05-06 20:27:06 +05:30
Johannes Gäßler 2356fb1d53
CUDA: fix bad asserts for partial offload (#13337) 2025-05-06 13:58:51 +02:00
Johannes Gäßler 15a28ec8c7
CUDA: fix --split-mode row for MMQ (#13323) 2025-05-06 08:36:46 +02:00
Johannes Gäßler 9070365020
CUDA: fix logic for clearing padding with -ngl 0 (#13320) 2025-05-05 22:32:13 +02:00
Akarshan Biswas 66645a5285
SYCL: Disable mul_mat kernels for noncontiguous tensor b (#13308)
ggml-ci
2025-05-05 13:39:10 +05:30
Diego Devesa 9fdfcdaedd
rpc : use backend registry, support dl backends (#13304) 2025-05-04 21:25:43 +02:00
Aaron Teo 6eb7d25c70
ggml : activate s390x simd for Q3_K (#13301)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-05-04 19:49:12 +02:00
Johannes Gäßler 93c4e23905
CUDA: fix race condition in MMQ stream-k fixup (#13299) 2025-05-04 14:16:39 +02:00
Johannes Gäßler 8afbd96818
CUDA: fix race condition in MMQ ids_dst (#13294) 2025-05-04 13:58:38 +02:00
Jeff Bolz 8ae5ebcf85
vulkan: Additional type support for unary, binary, and copy (#13266)
Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
2025-05-04 07:17:16 +02:00