nullname
c23ab465c0
feat: perf opt part4 ( #43 )
...
* wip
* refactor: rewrite dequantize_row_q4_0 with intrinsics (see the reference sketch after this list)
* log for debug
* fix q4 intrinsic
* small opt
* wip
* wip
* add vtcm_quota_size
* add perf log for hexagon-npu backend
* wip
* add log
* sync after a specific op
* increase worker thread priority
* fix unbalanced thread slice
* small slice to fit in vtcm cache
* limit the supported row element size
* opt 4_0 dequant
* fix q4 dequant
* add power_utils
* add rms_norm
* wip
* enable rms_norm f32
* fix rms_norm with param
* fix compiling flags
* use float
* fix small row size
* vectorized rms norm
* wip
* read 2 vectors
* rename
* add perf log on update
* also set handles for empty tensors
* merge some rpc functions
* opt param update
* wip
* print more log
* add struct for update param config
* add npu_device_graph_set_tensor_with_param
* merge tensor and params update
* wip
* wip
* make as template to reuse
* vectorize dequantize_row_q8_0
* opt
* avoid using union to store q data
* wip
* wip
* wip
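For readers following the dequantize_row_q4_0 / dequantize_row_q8_0 work above, a scalar C++ reference sketch of ggml's q4_0 layout (32 elements per block, one fp16 scale, 4-bit values offset by 8). The fp16 helper is a simplified stand-in that flushes subnormals, not the backend's actual converter:
```
#include <cstdint>
#include <cstring>

constexpr int QK4_0 = 32;

struct block_q4_0 {
    uint16_t d;               // scale, stored as an IEEE fp16 bit pattern
    uint8_t  qs[QK4_0 / 2];   // nibbles: low = element j, high = element j + 16
};

// simplified fp16 -> fp32 (subnormals flushed to zero in this sketch)
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t em   = h & 0x7FFF;
    uint32_t bits;
    if      (em >= 0x7C00) bits = sign | 0x7F800000u | ((uint32_t)(em & 0x03FF) << 13); // inf/nan
    else if (em >= 0x0400) bits = sign | ((em + 0x1C000u) << 13);                       // normal
    else                   bits = sign;                                                 // flush
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

static void dequantize_row_q4_0_ref(const block_q4_0 *x, float *y, int64_t k) {
    const int64_t nb = k / QK4_0;
    for (int64_t i = 0; i < nb; ++i) {
        const float d = fp16_to_fp32(x[i].d);
        for (int j = 0; j < QK4_0 / 2; ++j) {
            const int q0 = (x[i].qs[j] & 0x0F) - 8; // low nibble
            const int q1 = (x[i].qs[j] >> 4)   - 8; // high nibble
            y[i * QK4_0 + j]             = q0 * d;
            y[i * QK4_0 + j + QK4_0 / 2] = q1 * d;
        }
    }
}
```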
2025-05-28 00:00:42 +08:00
nullname
2306f82a58
fix compiling error
2025-05-27 06:35:41 +00:00
hongruichen
54b3021e0c
Merge branch 'master' into dev-refactoring
2025-05-27 10:21:33 +08:00
Georgi Gerganov
4265a87b59
cuda : avoid cuGetErrorString ( #13791 )
...
ggml-ci
2025-05-26 22:14:52 +03:00
Akarshan Biswas
6f180b915c
SYCL: Add non contiguous support in RMS_NORM and NORM kernels ( #13611 )
...
* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support
ggml-ci
* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows
ggml-ci
* Revert "Swap grid dims of nsamples and nrows"
This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.
* restore not required changes
ggml-ci
* address review comments: change it to more like SYCL
* Use a common function to calculate offset (see the sketch after this list)
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
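A hedged sketch of the two helpers these bullets describe (names assumed from the bullets, not copied from the SYCL code): ggml tensors carry byte strides nb[0..3], so a single offset function covers contiguous and non-contiguous inputs alike.
```
#include <cstddef>
#include <cstdint>

// element byte offset from ggml-style byte strides nb[0..3];
// works for contiguous and non-contiguous layouts alike
static size_t calculate_offset(const size_t nb[4],
                               int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return i0 * nb[0] + i1 * nb[1] + i2 * nb[2] + i3 * nb[3];
}

// ceil_div as referenced above: how many blocks of size b cover a elements
static int64_t ceil_div(int64_t a, int64_t b) {
    return (a + b - 1) / b;
}
```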
2025-05-26 21:10:36 +05:30
Romain Biessy
9012eb9b45
sycl: Add more debug prints ( #13640 )
2025-05-26 10:28:53 +02:00
Jeff Bolz
fef693dc6b
vulkan: mark IM2COL as supporting non-contig ( #13783 )
2025-05-26 06:02:07 +02:00
Bizhao Shi
2d38b6e400
CANN: Add the basic supports of Flash Attention kernel ( #13627 )
...
* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constraints of softcap
* cann: update the docs CANN.md
* cann: update the docs CANN.md
* cann: fix typo of CANN.md
* cann: add some comments and update the CANN.md
* cann: update the CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext on ggml-cann.cpp
* cann: clean the whitespace
* cann: clean the whitespace
* cann: add a new endline
2025-05-26 10:20:18 +08:00
Akarshan Biswas
515fdbf7ed
SYCL: revert "sycl: simplify bin_bcast_kernel ( #13383 )" ( #13752 )
...
Temporarily reverted due to failing fp16 DIV operation
This reverts commit 02cdd2d8b0.
ggml-ci
2025-05-25 10:08:37 +03:00
Diego Devesa
2bd1b30f69
ggml-cpu : set openmp wait time if not set ( #13758 )
2025-05-24 22:26:47 +02:00
Xuan-Son Nguyen
4c32832c59
ggml : add ggml_gelu_erf() CUDA kernel ( #13719 )
...
* ggml : add ggml_gelu_erf() CUDA kernel
* missing semicolon
2025-05-24 13:06:47 +02:00
Johannes Gäßler
ffd0eae60b
CUDA: fix race condition in FA vector kernels ( #13742 )
2025-05-24 11:46:19 +02:00
Chenguang Li
faaaff5f94
CANN: Support MUL_MAT_ID for q8_0 and q4_0 ( #13705 )
...
* [CANN] Support MUL_MAT_ID Q8 && Q4
Signed-off-by: noemotiovon <757486878@qq.com>
* codestyle adjustment
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-23 16:47:53 +08:00
Xuan-Son Nguyen
e16c4731c7
ggml : fix the order of ggml_unary_op ( #13718 )
2025-05-23 08:12:48 +02:00
Jeff Bolz
1dcd01960c
vulkan: support CPY from any type to itself ( #13695 )
...
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
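Illustratively (names here are assumptions, not the backend's API): for a same-type copy the payload is opaque bytes, so the existing f16 shader can simply be dispatched over nbytes / 2 units.
```
#include <cstdint>

// number of 2-byte units the f16 copy shader must move for a same-type copy
// of a tensor with `nelements` logical elements packed in blocks of
// `block_size` elements occupying `type_size` bytes each (illustrative only)
static int64_t f16_copy_units(int64_t nelements, int64_t block_size, int64_t type_size) {
    const int64_t nbytes = nelements / block_size * type_size;
    return nbytes / 2;
}
```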
2025-05-23 06:45:02 +02:00
Jeff Bolz
c10ed6cbcc
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it ( #13696 )
2025-05-23 06:33:45 +02:00
Judd
a127ff1780
use LOG_WARN to replace `std::cerr` ( #13657 )
2025-05-23 06:33:08 +02:00
Nicolò Scipione
d394a9aedc
sycl : Remove waits from function calls ( #13702 )
...
* removes the waits in async memcpy functions
2025-05-22 12:54:43 +01:00
Ewan Crawford
6b56a64690
SYCL: Avoid using with SYCL-Graph for unsupported nodes ( #13587 )
...
Currently, running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0`
on the SYCL backend with a CUDA device hits two operations that throw an
exception from blocking waits during queue recording.
* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074
We've noticed that `ggml-cuda.cu` has the
`check_node_graph_compatibility_and_refresh_copy_ops` method
(ggml/src/ggml-cuda/ggml-cuda.cu, L2458) for checking whether a graph can be
used even when graphs are enabled. This PR takes a similar approach, adding a
method to `ggml-sycl.cpp` that checks whether a graph can be used for the
recorded operations even when the user has asked for graphs to be enabled.
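A simplified sketch of that check (the function name and structure are assumptions, not the actual ggml-sycl.cpp code):
```
#include "ggml.h"

// opt out of SYCL-Graph capture when the graph contains ops whose current
// implementation issues blocking waits during queue recording
static bool graph_is_sycl_graph_safe(struct ggml_cgraph * cgraph) {
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        switch (ggml_graph_node(cgraph, i)->op) {
            case GGML_OP_CONCAT:     // blocking waits while recording
            case GGML_OP_MUL_MAT_ID: // blocking copy to host memory
                return false;        // fall back to eager submission
            default:
                break;
        }
    }
    return true;
}
```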
2025-05-22 16:24:09 +08:00
Henry Linjamäki
a4e8912dfd
opencl: Add support for multiple devices ( #12622 )
...
* opencl: Add support for multiple devices
... but limited to one platform. A platform with a GPU will be preferred.
Additionally:
* Filter out devices that lack capabilities needed by the backend
implementation (half support, OpenCL 2.0+, etc).
* Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends
... when there is only one OpenCL device available.
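A minimal sketch of the capability filter described above, using standard OpenCL queries (the real backend checks more properties than shown here):
```
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <string>

static std::string device_info_str(cl_device_id dev, cl_device_info what) {
    size_t n = 0;
    clGetDeviceInfo(dev, what, 0, nullptr, &n);  // query required size
    std::string s(n, '\0');
    clGetDeviceInfo(dev, what, n, s.data(), nullptr);
    return s;
}

// require OpenCL 2.0+ and half-precision support, as the commit notes
static bool device_is_usable(cl_device_id dev) {
    const std::string ver = device_info_str(dev, CL_DEVICE_VERSION);
    const std::string ext = device_info_str(dev, CL_DEVICE_EXTENSIONS);
    return ver.find("OpenCL 1.") == std::string::npos &&
           ext.find("cl_khr_fp16") != std::string::npos;
}
```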
2025-05-21 16:21:45 -07:00
Henry Linjamäki
edbf42edfd
opencl: fix a couple of crashes ( #12795 )
...
* opencl: fix a couple of crashes
* fix kernel launches failing on devices that do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (= let the driver choose the
work-group sizes). This patch does not cover everything - just the
cases tested by test-backend-ops.
* fix sub-buffer creation failing due to `cl_buffer_region::origin` not
being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
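Sketches of the two fixes (illustrative helpers, not the backend code):
```
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// 1) when non-uniform work-groups are unsupported, pass NULL for
//    local_work_size so the driver picks sizes that divide the global size
static cl_int launch_1d(cl_command_queue q, cl_kernel k, size_t gws,
                        size_t lws, bool non_uniform_wg_supported) {
    const size_t *plws = non_uniform_wg_supported ? &lws : nullptr;
    return clEnqueueNDRangeKernel(q, k, 1, nullptr, &gws, plws,
                                  0, nullptr, nullptr);
}

// 2) CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in *bits*; round a sub-buffer
//    origin down to that alignment so clCreateSubBuffer does not fail
//    (the caller compensates with an in-kernel offset)
static size_t aligned_origin(cl_device_id dev, size_t origin) {
    cl_uint align_bits = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                    sizeof(align_bits), &align_bits, nullptr);
    const size_t align = align_bits / 8;  // bits -> bytes
    return origin - origin % align;
}
```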
2025-05-21 13:21:17 -07:00
Xuan-Son Nguyen
cf4cb59e64
ggml : add ggml_gelu_erf() ( #13667 )
...
* ggml : add ggml_gelu_na (not approximated; see the sketch after this list)
* fix naming order
* rename na --> erf
* apply review suggestions
* revert naming order
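For reference, the exact (erf-based) GELU this commit adds, next to the usual tanh approximation; these are scalar sketches of the textbook formulas, not ggml's kernels:
```
#include <cmath>

// exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
static float gelu_erf_ref(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f)); // 1/sqrt(2)
}

// common tanh approximation, for contrast
static float gelu_tanh_ref(float x) {
    return 0.5f * x * (1.0f + std::tanh(0.7978845608f     // sqrt(2/pi)
                       * (x + 0.044715f * x * x * x)));
}
```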
2025-05-21 16:26:33 +02:00
R0CKSTAR
33983057d0
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy ( #13647 )
...
* musa: fix build warning (unused parameter)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: upgrade MUSA SDK version to rc4.0.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Update ggml/src/ggml-cuda/cpy.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-05-21 09:58:49 +08:00
Eve
fb1cab201c
vulkan: fix warnings ( #13626 )
...
* small fixes
* remove ifdef
2025-05-20 21:35:16 +00:00
Johannes Gäßler
b69f1647f9
CUDA: skip fully masked-out KV in FA vec kernel ( #13584 )
...
* CUDA: skip fully masked-out KV in FA vec kernel
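The idea, as a plain C++ sketch (the actual CUDA kernel differs): a KV tile whose mask entries are all -inf contributes exp(-inf) = 0 to the softmax, so it can be skipped before K/V are even loaded.
```
#include <cmath>

static bool tile_fully_masked(const float *mask, int tile_start, int tile_size) {
    for (int j = 0; j < tile_size; ++j) {
        if (mask[tile_start + j] != -INFINITY) {
            return false; // at least one position is visible
        }
    }
    return true; // whole tile masked out: skip it
}
```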
2025-05-20 14:45:07 +02:00
Svetlozar Georgiev
4245e622e0
sycl: disable reorder for sycl mulmat ( #13536 )
2025-05-20 11:34:15 +02:00
Georgi Gerganov
c00a2634be
metal : fix typo in FA kernel comments ( #13651 )
2025-05-20 10:41:40 +03:00
Nicolò Scipione
f7c9429c85
sycl : Overcoming workaround for mmap() allocation on Windows ( #13482 )
...
* Remove mmap workaround on Windows
After some testing I found that mmap is supported on Windows and for
many GPUs on Linux. Therefore I removed the workaround for Windows since
it is not necessary.
* Update llama-bench README
The SYCL backend had introduced a workaround that allows llama-bench to
run without specifying the `--mmap 0` flag
2025-05-20 08:54:43 +08:00
0cc4m
8960efd0a6
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence ( #13607 )
2025-05-19 17:54:08 +02:00
Johannes Gäßler
6c35981a64
mnist: fix segmentation fault (ggml/1227)
2025-05-19 13:29:56 +03:00
Diego Devesa
8b5e19aea6
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
2025-05-19 13:29:56 +03:00
Daniel Tang
60aea028b5
ggml : Fix missing backtrace on Linux (ggml/1228)
...
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
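A sketch of the pattern (simplified; real code synchronizes parent and child, e.g. with a pipe): under ptrace_scope = 1 a child may only trace its parent if the parent opts in via PR_SET_PTRACER.
```
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

static void print_backtrace_via_child(void) {
    const pid_t child = fork();
    if (child == 0) {
        // child: attach to the parent and dump its stack, or walk the
        // parent's symbols itself (as this commit does)
        _exit(0); // placeholder for the actual backtrace work
    }
    prctl(PR_SET_PTRACER, child, 0, 0, 0); // allow this child to ptrace us
    waitpid(child, nullptr, 0);            // wait for the dump to finish
}
```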
2025-05-19 13:29:56 +03:00
Chenguang Li
33d7aed4a8
CANN: Support MOE Model MUL_MAT_ID ( #13042 )
...
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:21:17 +08:00
Gilad S.
e3a7cf6c5b
cmake: use the current build config for vulkan-shaders-gen ( #13595 )
...
* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`
2025-05-17 15:26:43 -03:00
Jeff Bolz
2f5a4e1e09
vulkan: move common FA code to flash_attn_base.comp ( #13556 )
...
* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix
2025-05-17 09:14:55 +02:00
Jeff Bolz
4f41ee11d6
vulkan: use scalar FA rather than coopmat2 when N==1 ( #13554 )
2025-05-17 08:35:47 +02:00
Georgi Gerganov
654a67794f
metal : add FA-vec kernel for head size 64 ( #13583 )
...
ggml-ci
2025-05-16 20:32:58 +03:00
nullname
295f7f5957
feat: perf opt part3 ( #42 )
...
* add f16 support to element-wise ops
* wip
* Revert "wip"
This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.
* qf32 for mul
* wip
* Revert "wip"
This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.
* disable fp16 add/sub
* template trick
* wip
* add f16 mulmat
* add log
* fix view-like ops
* add log
* fix f16 mulmat
* add quant type
* wip
* add l2fetch
* add vtcm_mem
* wip
* fix fetch
* use vtcm cache in mulmat
* revert vtcm cache
* cache plane
* small opt for plane cache
* cache plane for some element wise op
* wip
* enable fetch even on vtcm
* wip
* copy sysMonApp
* small opt
* init ltu
* add compute_params
* add op common header
* move vtcm_mem allocation to compute_param
* fallback to memcache when vtcm allocate failed
* pre-calculate quantize type
* wip
* try fix test failure
* try fix mulmat nan
* fix inf in mulmat
* remove debug logs
* wip
* small refactoring on the dequant row func
* fix typo
* improve logging
* add q4_0 and q8_0
* wip
* wip
* build hexagon libs in cmake
* wip
* fix qnn only build flag
* fix typo
* fix todo
* wip
* wip
* add to_float
* use to_float directly instead of ltu
* wip
* cache f16_to_f32 table into vtcm
* print tensor dims at log
* init device in supports_op_impl
* revert cache ltu
* wip
* wip
* fix graph calc issues by validating the cache manually after each op
* add cache invalidate func
* enable cache fallback only in quantize tensors
* add option to disable quantized tensors
* propagate the asan flag to npu build
* fix asan option
* wip
* invalidate tensors after finished
* implement backend_buffer_reset
* wip
* wip
* refactoring plane cache mechanism
* wip
* split row elements across thread
* use table for f16 to f32 conversion (see the sketch after this list)
* sync after each op
* small refactoring to invalidate l2 cache
* wip
* opt on float fetching
* unroll for loop manually
* reduce vtcm usage
* add perf tracking for npu
* print dimensions for profiler log
* wip
* wip
* wip
* add sub proc tracker
* fix typo
* print pcycles
* wip
* wip
* prefetch rows
* add l2fetch_row
* small tweak based on perf tracer
* opt l2 fetching
* wip
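The f16-to-f32 table mentioned above, as a hedged sketch: all 65536 half values are precomputed once (a 256 KiB table, which a later bullet caches in VTCM), and conversion becomes a single indexed load. The `convert` parameter stands in for a bit-exact fp16 -> fp32 helper.
```
#include <cstdint>

static float f16_to_f32_table[1 << 16];

// `convert` is an assumed bit-exact fp16 -> fp32 helper
static void init_f16_table(float (*convert)(uint16_t)) {
    for (uint32_t h = 0; h < (1u << 16); ++h) {
        f16_to_f32_table[h] = convert((uint16_t) h);
    }
}

static inline float to_float(uint16_t h) {
    return f16_to_f32_table[h]; // one indexed load per conversion
}
```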
2025-05-16 19:57:33 +08:00
Łukasz Ślusarczyk
0a338ed013
sycl : fixed compilation warnings ( #13582 )
2025-05-16 18:15:29 +08:00
Diego Devesa
c6a2c9e741
gguf : use ggml log system ( #13571 )
...
* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages
2025-05-15 19:13:11 +02:00
Atharva Dubey
02cdd2d8b0
sycl: simplify bin_bcast_kernel ( #13383 )
2025-05-15 17:39:52 +02:00
Svetlozar Georgiev
64bb51cf90
sycl: reordered Q4_K MMVQ ( #13109 )
2025-05-15 17:35:44 +02:00
Łukasz Ślusarczyk
9c404ed54c
sycl: use oneDNN for matrices multiplication ( #12972 )
2025-05-15 16:53:41 +02:00
Yibo Cai
5ab5d5fb25
arm64: optimize q6_k_q8_k kernel with i8mm ( #13519 )
...
This PR improves the q6_k_q8_k GEMM kernel with the arm64 i8mm instruction.
Tested on Neoverse-N2 with a Llama 3 8B Q6_K quantized model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 78.52 | 109.18 | 18.63 | 18.88 |
| 128 | 128 | 2 | 84.62 | 123.94 | 34.54 | 36.92 |
| 128 | 128 | 4 | 84.36 | 122.49 | 52.65 | 61.32 |
| 128 | 128 | 8 | 90.52 | 138.87 | 63.46 | 84.41 |
| 128 | 128 | 16 | 90.11 | 138.56 | 71.04 | 101.33 |
| 128 | 128 | 32 | 89.81 | 137.79 | 75.14 | 110.47 |
---------------------------------------------------------------------
```
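The building block behind this kernel is the i8mm smmla instruction, exposed as the vmmlaq_s32 intrinsic: one call accumulates a 2x2 int32 tile, acc[i][j] += dot(a_row_i, b_row_j), from two 2x8 int8 operands. A minimal sketch (the q6_k x q8_k kernel tiles its quantized dot products over such calls and then applies the block scales):
```
#if defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>

// one smmla: acc (2x2 int32) += a (2x8 int8) * b^T (2x8 int8)
static int32x4_t mmla_2x2(int32x4_t acc, int8x16_t a_rows, int8x16_t b_rows) {
    return vmmlaq_s32(acc, a_rows, b_rows);
}
#endif // build with e.g. -march=armv8.6-a+i8mm
```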
2025-05-14 21:53:52 +02:00
Johannes Gäßler
4696d56749
CUDA: fix crash on large batch size for quant. MoE ( #13537 )
2025-05-14 16:41:02 +02:00
Johannes Gäßler
6da34fa276
CUDA: faster Deepseek FA, add Turing support ( #13435 )
2025-05-14 16:08:20 +02:00
bandoti
09d13d94fb
cmake: simplify vulkan shader test logic ( #13263 )
2025-05-14 07:53:57 -03:00
Jeff Bolz
24e86cae72
vulkan: KHR_coopmat flash attention ( #13506 )
...
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
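For orientation, a scalar sketch of the two multiplies in one flash-attention tile (toy code, with no online rescaling or normalization): S = Q*K^T is what this shader maps onto coopmat1, while O = P*V stays scalar here.
```
#include <algorithm>
#include <cmath>
#include <vector>

// Br x Bc tile, head dimension d; Q is Br x d, K and V are Bc x d
static void fa_tile_ref(const float *Q, const float *K, const float *V,
                        float *O, int Br, int Bc, int d) {
    std::vector<float> S(Br * Bc);
    for (int i = 0; i < Br; ++i)                 // S = Q * K^T (coopmat here)
        for (int j = 0; j < Bc; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += Q[i*d + k] * K[j*d + k];
            S[i*Bc + j] = s;
        }
    for (int i = 0; i < Br; ++i) {               // row-wise exp(s - max) -> P
        float m = S[i*Bc];
        for (int j = 1; j < Bc; ++j) m = std::max(m, S[i*Bc + j]);
        for (int j = 0; j < Bc; ++j) S[i*Bc + j] = std::exp(S[i*Bc + j] - m);
    }
    for (int i = 0; i < Br; ++i)                 // O = P * V (scalar in this PR)
        for (int k = 0; k < d; ++k) {
            float o = 0.0f;
            for (int j = 0; j < Bc; ++j) o += S[i*Bc + j] * V[j*d + k];
            O[i*d + k] = o;
        }
}
```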
2025-05-14 11:55:26 +02:00
Jeff Bolz
ab3971f2a0
vulkan: workaround FA compile failures on macos ( #13517 )
2025-05-14 06:15:50 +02:00
Georgi Gerganov
f0995d28ce
metal : use FA-vec kernel up to batch size 20 ( #13496 )
...
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
* metal : use FA-vec kernel up to batch size 20
ggml-ci
2025-05-13 18:04:39 +03:00