Commit Graph

56 Commits

Author SHA1 Message Date
lhez 79cc0f2daf
opencl: add basic support for q4_1 (#19534)
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
2026-02-12 14:52:37 -08:00
lhez 4d3daf80f8
opencl: add general Q6_K mm and Q4_K mv (#19347)
* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace
2026-02-11 10:33:13 -08:00
lhez 91ea44e89b
opencl: refactor some ops, concat, repeat, tanh and scale (#19226)
* opencl: refactor concat

* opencl: refactor repeat

* opencl: refactor tanh

* opencl: enable fp16 for tanh

* opencl: refactor scale

* opencl: fix unused variables
2026-02-02 15:54:43 -08:00
shaofeiqi 971facc38e
opencl: add optimized q8_0 mm kernel for adreno (#18871)
* Add Q8_0 OpenCL kernel

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>

* opencl: fix build for non-adreno

* opencl: refactor q8_0

* opencl: enforce subgroup size of 64 for adreno for q8_0

* For A750 and older generations, subgroup size can be 64 or 128.
  This kernel assumes subgroup size 64.

* opencl: suppress warning when adreno kernels are disabled

---------

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-01-30 10:19:27 -08:00
lhez 94eeb5967c
opencl: add flattened q6_K mv (#19054)
* opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat`

* opencl: clean up

* opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat`

* opencl: tweak the workgroup size a bit

* opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat`

* opencl: proper alignment for q6_K

* opencl: boundary handling for flattened q6_K mv

* opencl: rename q6_K mv kernel file

* opencl: put flattened q6_K mv in its own file

* opencl: use lower k in file name

* opencl: use K in variable names
2026-01-26 19:36:24 -08:00
shaofeiqi 5516b9c16a
opencl: add TRI op support (#18979) 2026-01-21 22:05:54 -08:00
lhez d1b4757ded
opencl: fix q6_K mv for m=1 (#18893) 2026-01-17 13:50:32 -08:00
shaofeiqi 785a710085
OpenCL: add SOLVE_TRI op support (#18846) 2026-01-15 11:17:17 -08:00
shaofeiqi 707cbafcaa
opencl: add SOFTPLUS op support (#18726) 2026-01-10 21:57:44 -08:00
shaofeiqi 593da7fa49
opencl: add EXPM1 op (#18704) 2026-01-09 10:13:13 -08:00
shaofeiqi 568371a726
opencl: add FILL op support (#18682) 2026-01-07 22:04:50 -08:00
lhez eb492bf43f
opencl: unpack q4_0 for adreno in get_tensor (#18278) 2025-12-22 10:19:01 -08:00
lhez 7cba58bbea
opencl: add sqr, sqrt, mean and ssm_conv (#17476)
* opencl: add sqr

* opencl: add sqrt

* opencl: add mean

* opencl: add ssm_conv

* opencl: add missing cl_khr_fp16

* opencl: do sqrt in f32 then convert to f16 for better precision
2025-11-26 13:29:58 -08:00
lhez 52e5d421f1
opencl: fix rms_norm_mul (#17250)
* opencl: use subgrroup reduce for reduction in rms_norm_mul

* opencl: add comment about workgroup size
2025-11-15 17:40:14 -08:00
shaofeiqi 4db5641210
opencl: add kernel to handle mat mul in attention to improve encoding speed (#17181)
* Add mul_mm_f16_f32_kq_kqv kernel

* Add ggml_cl_mul_mat_kq_kqv_adreno func

* fix whitespace

* remove unused variable

* remove redundant

* refactor and clean up

* remove trailing whitespace
2025-11-15 17:33:10 -08:00
lhez ece0f5c177
opencl: add fastdiv and use it in set_rows, ported from cuda (#17090)
* opencl: add fastdiv for mm q8_0

* opencl: use uint4 for fastdiv vals

* opencl: use fastdiv for set_rows

* opencl: do not use fastdiv for q8_0 mm
2025-11-10 15:00:13 -08:00
lhez c5023daf60
opencl: support imrope (#16914)
* opencl: support imrope

* opencl: fix whitespace
2025-11-03 11:47:57 -08:00
lhez 9984cbb61d
opencl: fix boundary handling for mul_mm (#16875) 2025-10-30 16:00:20 -07:00
Shawn Gu 81387858f1
opencl: transposed gemm/gemv moe kernel with mxfp4,f32 (#16602)
* opencl: transposed gemm/gemv moe kernel with mxfp4,f32

* add restore kernel for moe transpose

* fix trailing whitespaces

* resolve compilation warnings
2025-10-17 17:55:32 -07:00
lhez 0cb7a0683b
opencl: add q8_0 mm support (#16469)
* opencl: add mm_q8_0_f32

* opencl: fix data loading for incomplete tile

* opencl: use q8_0 mm for larger matrix

* opencl: add some tests to cover the path
2025-10-15 10:51:04 -07:00
lhez d93f8439b0
opencl: fix FA for f32 (#16584) 2025-10-15 10:48:28 -07:00
lhez 7c156df414
opencl: support pad_ext (#15888) 2025-09-30 10:45:45 -07:00
lhez d1c84a662d
opencl: support ne3 in get_rows (#15866) 2025-09-30 09:55:13 -07:00
Sigbjørn Skjæret 3ecb2f671a
ggml : implement set_rows with i32 index (#16159)
* implement set_rows with i32 index

* template fix

* test quantized path

warnings--

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* forgotten name change

* deduplicate cuda/sycl and test-fix

* indent++

* vulkan: support set_rows with i32 index type (#16162)

* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-09-22 19:13:00 +02:00
lhez c4510dc937
opencl: initial `q8_0` mv support (#15732) 2025-09-21 14:48:44 -07:00
Shawn Gu 3edd87cd05
opencl: optimize mxfp4 kernels (#16037)
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2025-09-18 12:03:34 -07:00
Daniel Bevenius 3913f8730e
ggml : fix padding in timestep embedding kernels (#15932)
* ggml : remove adding extra dim timestep embedding

This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.

The motivation for this change is that this introduces an unnecessary
dimension when the dimension is odd, which caused an issue in the
kernels which were not expecting this extra dimension and it resulted in
uninitialized memory for the second to last dimension.

* ggml-cuda : fix padding in timestep embedding kernel

This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.

* ggml-metal : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel

* ggml-opencl : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-sycl : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-vulkan : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-cpu : fix padding in timestep embedding function

This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
2025-09-16 15:25:57 +02:00
rmatif 97669e4073
opencl: add attn sinks support for FA kernels (#15706) 2025-09-01 23:26:53 -07:00
rmatif 86076f92de
OpenCL: add fused group_norm/norm, mul, add (#15314)
* add fused group_norm/norm, mul, add

* fix spacing

* revert rms_norm logic

* fix trailing whitespace
2025-08-26 23:36:05 -07:00
rmatif 912ff8c119
OpenCL: add initial FA support (#14987)
* add F16/F16 fa support

* fix kernel init

* use mad instead of fma

* use inline function

* mark FA with sinks as unsupported for now

* add pragma unroll to loops
2025-08-16 01:05:55 -07:00
lhez e2c1bfff53
opencl: add initial mxfp4 support via mv (#15270)
* opencl: add reference `mul_mv_mxfp4_f32`

* opencl: add reference `mul_mv_id` for mxfp4

* Q4_0 tranpose fix for Adreno

---------

Co-authored-by: shawngu-quic <shawngu@qti.qualcomm.com>
2025-08-15 09:52:14 -07:00
rmatif 60a7658810
opencl: allow mixed f16/f32 `add` (#15140) 2025-08-12 02:42:41 -07:00
lhez aaa3d07ae7
opencl: support sink in `soft_max` (attn sinks) (#15152) 2025-08-07 21:47:03 -07:00
lhez e725a1a982
opencl: add `swiglu_oai` and `add_id` (#15121)
* opencl: add `swiglu-oai`

* opencl: add `add_id`

* opencl: add missing `add_id.cl`
2025-08-06 12:12:17 -07:00
lhez 1c872f71fb
opencl: add f16 for `add`, `sub`, `mul`, `div` (#14984) 2025-08-01 13:15:44 +02:00
lhez 6e6725459a
opencl: add `mul_mat_f32_f32_l4_lm` and `mul_mat_f16_f32_l4_lm` (#14809) 2025-07-30 14:56:55 -07:00
lhez ce111d39d6
opencl: add fused `rms_norm_mul` (#14841)
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
2025-07-25 17:12:13 +02:00
Sigbjørn Skjæret 38d3af1b73
opencl: fix `im2col` when `KW!=KH` (#14803) 2025-07-21 13:55:10 -07:00
rmatif 6c9ee3b17e
opencl: add conv2d kernel (#14403)
* add conv2d kernel

* fix trailing whitespace

* whitespace fixe

* handle f16 input and f16 kernel, more opt

* resolve conflicts

* use enqueue_ndrange_kernel
2025-07-21 10:03:19 -07:00
rmatif 6bdda13981
opencl: add tiled mul_mat_f16_f32 (#14535)
* add tiled mul_mat_f16_f32

* fix trailing whitespace

* add insightful comments
2025-07-10 14:58:12 -07:00
lhez 0b8855775c
opencl: add `set_rows` for `f16` and `f32` (#14547)
* opencl: add `set_rows` for `f16` and `f32`

* opencl: better choose workgroup size for `set_rows`
2025-07-10 11:48:52 -07:00
Xuan-Son Nguyen 98bab638fb
ggml : add ggml_scale_bias (#14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code looks more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-09 18:16:12 +02:00
Sigbjørn Skjæret 6681688146
opencl: add GELU_ERF (#14476) 2025-07-04 23:24:56 -07:00
Sigbjørn Skjæret 28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445) 2025-07-03 23:07:22 +02:00
lhez bee28421be
opencl : broadcast for soft_max (#14510) 2025-07-03 20:22:24 +02:00
lhez 603e43dc91
opencl : update upscale to support align corners (#14488) 2025-07-02 09:07:42 +02:00
lhez 79b33b2317
opencl : add GEGLU, REGLU, SWIGLU (#14456) 2025-07-01 09:19:16 +02:00
lhez 4c763c8d1b
opencl: add `mul_mv_id_q4_0_f32_8x_flat` (#14003) 2025-06-10 16:55:58 -07:00
rmatif bfb1e012a0
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)
* add concat, pad, repeat, tsembd, tanh, upscale

* small fixes
2025-06-02 16:53:36 -07:00
lhez a3c30846e4
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (#13787)
* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`
2025-05-27 12:56:08 -07:00