Commit Graph

1358 Commits

Author SHA1 Message Date
bssrdf f95664c76c make tensor core path available for cc 7.5 and above 2025-11-01 14:35:44 -04:00
bssrdf c1f67c19e0 make CI happy 2025-10-29 23:23:21 -04:00
bssrdf 2b5351a898 make CI happy 2025-10-29 23:17:36 -04:00
bssrdf c141ce3533 make CI happy 2025-10-29 22:56:27 -04:00
bssrdf 1f3d5eb8e9 prevent CI compile failure 2025-10-29 22:47:03 -04:00
bssrdf 70132278cb more clean up 2025-10-29 21:57:12 -04:00
bssrdf a3b4d8d31e clean up 2025-10-29 21:46:15 -04:00
bssrdf 55859a86aa remove implicit op and related calls; replace conv_2d with conv_2d_implicit kernel 2025-10-29 21:36:03 -04:00
bssrdf 2dfbbee73f clean up 2025-10-29 13:19:35 -04:00
bssrdf 1e568252b5 switch to default conv2d interface 2025-10-29 12:11:26 -04:00
bssrdf 4b1920e9e7 reduced bank conflicts for output 2025-10-29 10:40:52 -04:00
bssrdf 75dde410a8 WIP: minor tweak 2025-10-28 14:41:48 -04:00
bssrdf 3ea524e9c4 WIP: almost working 2025-10-27 23:10:19 -04:00
bssrdf 6d12288037 WIP: fixed a bug in cpy transpose index computation 2025-10-27 17:32:03 -04:00
bssrdf a3784e17ad WIP: debugging cpy transpose 2025-10-27 15:09:03 -04:00
bssrdf cc327f5224 added a specialization for cuda copy op when tensor is transposed 2025-10-27 11:23:27 -04:00
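
A copy specialization for transposed tensors typically stages a tile in shared memory so that both the global read and the global write stay coalesced. A minimal sketch of that standard pattern, with the usual +1 padding column to avoid shared-memory bank conflicts (an assumed illustration, not the kernel from this commit):

```cpp
#define TILE 32

// Classic tiled transpose copy: launch with dim3(TILE, TILE) threads per
// block and ceil(cols/TILE) x ceil(rows/TILE) blocks. The extra column of
// padding makes reads down a tile column hit distinct shared-memory banks.
__global__ void cpy_transpose_f32(const float * src, float * dst, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x; // column in src
    int y = blockIdx.y * TILE + threadIdx.y; // row in src
    if (x < cols && y < rows) {
        tile[threadIdx.y][threadIdx.x] = src[(size_t) y * cols + x];
    }
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // column in dst (src row index)
    y = blockIdx.x * TILE + threadIdx.y;     // row in dst (src column index)
    if (x < rows && y < cols) {
        dst[(size_t) y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }
}
```
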
bssrdf c68fe36ae2 WIP: cleanup; enhanced test case 2025-10-25 21:57:39 -04:00
bssrdf 475f9879c5 WIP: fixed another bug 2025-10-25 20:24:14 -04:00
bssrdf 396f55831c WIP: bug fix 2025-10-25 18:14:12 -04:00
bssrdf 610e41ae2d still debugging 2025-10-25 11:10:39 -04:00
bssrdf 980ddc1e87 properly use __CUDA_ARCH__ to protect the tensor path 2025-10-24 21:56:58 -04:00
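
The guard referenced here follows the standard CUDA pattern: the host checks the device's compute capability before launching the specialized kernel, while `__CUDA_ARCH__` gates the device code at compile time so builds targeting older architectures never reference tensor-core intrinsics. A sketch of the device-side guard using the wmma API, with the cc 7.5 threshold from these commits (illustrative only, not the actual conv2d kernel):

```cpp
#include <cuda_fp16.h>
#include <mma.h>

// One-warp 16x16x16 FP16 MMA, compiled only for cc >= 7.5; other
// architectures get a stub so the translation unit still compiles.
__global__ void mma_16x16x16(const half * A, const half * B, float * C) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 750
    using namespace nvcuda;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
#else
    (void) A; (void) B; (void) C; // never launched on cc < 7.5
#endif
}
```
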
bssrdf 24b553204b WIP: fixed another bug 2025-10-24 16:53:40 -04:00
bssrdf 6c90c20cb1 WIP: bug fix 2025-10-24 15:33:57 -04:00
bssrdf be25be8ed3 WIP: debugging tensor core kernel 2025-10-24 14:24:26 -04:00
bssrdf 80a996cfc0 WIP: tensor core code compiled ok 2025-10-24 11:41:11 -04:00
bssrdf 2715341c1d WIP: output 2025-10-23 21:29:45 -04:00
bssrdf 66f6d16265 WIP 2025-10-23 13:52:26 -04:00
bssrdf 215ebf6526 WIP 2025-10-22 15:56:55 -04:00
bssrdf 1b69ed44c6 WIP 2025-10-21 17:15:26 -04:00
bssrdf f931ad883f WIP 2025-10-21 17:12:50 -04:00
bssrdf f0a480cc22 WIP 2025-10-21 15:43:35 -04:00
bssrdf 6a1f8b4d57 change padding size back to 4 2025-10-15 14:21:04 -04:00
bssrdf ac77b8d0e0 change padding size to 1; added padding to input smem 2025-10-15 14:07:24 -04:00
bssrdf 3f99818925 unroll some loops 2025-10-15 12:46:46 -04:00
bssrdf b70cca2ea3 add support for both NCHW and NHWC layouts 2025-10-14 14:24:35 -04:00
bssrdf 3e2f722d11 fixed missing dilation 2025-10-14 11:12:55 -04:00
bssrdf 2237722056 added block variants; to be debugged 2025-10-14 11:02:10 -04:00
bssrdf 16b0f0ae3c work in progress 2025-10-13 18:41:30 -04:00
bssrdf 0ca43582e8 reorder register tile loop 2025-10-08 13:52:56 -04:00
bssrdf 53a2ccbe12 minor update and add direct conv in benchmarking 2025-09-24 21:48:20 -04:00
bssrdf 2ec76aa8f3 Merge branch 'master' into conv2d-implicit 2025-09-10 22:04:20 -04:00
Oliver Simons 00681dfc16
CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (#15872)
* Add fastdiv and fastmodulo to k_bin_bcast kernel

* Address review comments

* `prod_` instead of `prod` suffix

* Add test case for `k_bin_bcast_unravel` in CUDA backend
2025-09-10 22:04:03 +02:00
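
For context: the shapes and strides the broadcast kernel divides by are constant for an entire launch, so each integer division can be replaced by a precomputed multiply-high and shift. A self-contained host-side sketch of the round-up magic-number scheme (Granlund–Montgomery); this reconstructs the idea, not the actual `k_bin_bcast` code, where the multiply-high would be `__umulhi`:

```cpp
#include <cassert>
#include <cstdint>

// Constants for dividing 32-bit values by a fixed divisor d:
//   q = (umulhi(n, mp) + n) >> L
struct fastdiv_consts {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift, ceil(log2(d))
    uint32_t d;  // original divisor, kept for fastmodulo
};

static fastdiv_consts init_fastdiv(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L;
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1 always fits in 32 bits
    const uint32_t mp = (uint32_t)(((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
    return {mp, L, d};
}

static uint32_t fastdiv(uint32_t n, fastdiv_consts c) {
    const uint32_t hi = (uint32_t)(((uint64_t) n * c.mp) >> 32); // __umulhi(n, c.mp) on device
    return (uint32_t)(((uint64_t) hi + n) >> c.L);               // 64-bit add avoids overflow
}

static uint32_t fastmodulo(uint32_t n, fastdiv_consts c) {
    return n - fastdiv(n, c) * c.d;
}

// Exhaustive spot-check against ordinary / and %.
int main() {
    for (uint32_t d = 1; d < 2000; ++d) {
        const fastdiv_consts c = init_fastdiv(d);
        for (uint32_t n = 0; n < 1000000; n += 97) {
            assert(fastdiv(n, c) == n / d);
            assert(fastmodulo(n, c) == n % d);
        }
    }
    return 0;
}
```
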
bssrdf 735886b099 merged with upstream master 2025-09-10 12:28:59 -04:00
Daniel Bevenius 9de447d94e
ggml-cpu : fix padding in ggml_timestep_embedding (#15917)
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently, if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:

Elements 0-6 are set to cosine values
Elements 7-13 are set to sine values
Element 14 is left uninitialized (contains garbage)
Element 15 is correctly set to zero

This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero, in
addition to the last element.

Resolves: https://github.com/ggml-org/ggml/issues/1324
2025-09-10 17:31:40 +02:00
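
A standalone reconstruction of the indexing described above (based on the commit message, not the ggml source; for odd dim the output buffer is assumed to hold dim + 1 floats, matching the "element 15" in the dim=15 example):

```cpp
#include <math.h>

// With dim = 15 and half = 7: [0..6] cosines, [7..13] sines, [14] the first
// unused slot, [15] the last element. The old code zeroed only index dim,
// leaving index 2*half uninitialized.
void timestep_embedding_f32(float * embed_data, int dim, float timestep, int max_period) {
    const int half = dim / 2;
    for (int i = 0; i < half; ++i) {
        const float freq = expf(-logf((float) max_period) * i / half);
        const float arg  = timestep * freq;
        embed_data[i]        = cosf(arg);
        embed_data[half + i] = sinf(arg);
    }
    if (dim % 2 != 0) {
        embed_data[2 * half] = 0.0f; // the fix: zero the first unused element
        embed_data[dim]      = 0.0f; // ... as well as the last one
    }
}
```
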
Georgi Gerganov 0f0a3c2851
metal : make the backend async (#15906)
* metal : make the backend async

ggml-ci

* cont : add comments, extend op offload, clean up

ggml-ci

* metal : fix batch size for MUL_MAT_ID

* metal : remove deprecated ggml_backend_metal_buffer_from_ptr

* metal : create only metal buffers, no wrapping of host memory

ggml-ci

* metal : restore .alloc_buffer for buffer_from_ptr_type

ggml-ci

* metal : remove broken implementation of GGML_OP_SET

ggml-ci

* metal : clean-up loose ends, ready for tests

ggml-ci

* metal : support both private and shared buffers

ggml-ci

* metal : enable private buffers + add global device queue

* metal : disable host buffer to prevent races

ggml-ci

* metal : avoid extra copy during set_tensor

ggml-ci

* metal : use separate buffer types for shared and private Metal buffers

ggml-ci

* metal : simplify synchronization logic

ggml-ci

* metal : fix build

ggml-ci

* metal : do not implement cpy_tensor

ggml-ci

* metal : separate implementations for shared and private buffers

ggml-ci
2025-09-10 17:52:35 +03:00
Chenguang Li 10d8b2b6b0
CANN: Add ROPE sin/cos cache for reuse (#15912)
* CANN: Add ROPE sin/cos cache for reuse

Introduce sin/cos caching mechanism in ROPE to avoid redundant
computation across layers. The cache is built on the first layer
per device and reused by subsequent layers if parameters match.

- Added sin_cache / cos_cache pointers and position_length tracking
- Introduced cache validity flags and properties:
  (ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
- Accelerates ROPE by eliminating repeated sin/cos generation

This change reduces overhead in multi-layer scenarios while
preserving correctness by verifying parameter consistency.

Co-authored-by: hipudding <huafengchun@gmail.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-10 18:42:00 +08:00
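
The reuse condition described above amounts to comparing the parameters the cached tables were built with against the current layer's. A hypothetical host-side sketch (names invented; the real backend fills device buffers rather than std::vector, and applies the full YaRN frequency schedule rather than this toy one):

```cpp
#include <math.h>
#include <stdint.h>
#include <vector>

// Hypothetical sketch of the reuse check described above; names are invented
// and the tables live in std::vector here, not in device memory.
struct rope_sin_cos_cache {
    std::vector<float> sin_cache, cos_cache;
    int64_t position_length = 0;
    bool    valid = false;
    // parameters the cached tables were built with
    float ext_factor = 0.0f, theta_scale = 0.0f, freq_scale = 0.0f, attn_factor = 0.0f;
    bool  is_neox = false;
};

// Built on the first layer per device, reused by later layers when the
// parameters match. Returns true if the tables had to be (re)built.
bool rope_prepare_sin_cos(rope_sin_cos_cache & c, int64_t n_pos, int n_dims,
                          float ext_factor, float theta_scale, float freq_scale,
                          float attn_factor, bool is_neox) {
    if (c.valid && c.position_length == n_pos &&
        c.ext_factor == ext_factor && c.theta_scale == theta_scale &&
        c.freq_scale == freq_scale && c.attn_factor == attn_factor &&
        c.is_neox == is_neox) {
        return false; // cache hit: skip redundant sin/cos generation
    }
    c.sin_cache.assign((size_t) (n_pos * n_dims), 0.0f);
    c.cos_cache.assign((size_t) (n_pos * n_dims), 0.0f);
    for (int64_t p = 0; p < n_pos; ++p) {
        float theta = (float) p * freq_scale; // toy schedule: the real kernel
        for (int i = 0; i < n_dims; ++i) {    // also applies ext_factor (YaRN)
            c.sin_cache[(size_t) (p * n_dims + i)] = sinf(theta) * attn_factor;
            c.cos_cache[(size_t) (p * n_dims + i)] = cosf(theta) * attn_factor;
            theta *= theta_scale;
        }
    }
    c.ext_factor  = ext_factor;  c.theta_scale     = theta_scale;
    c.freq_scale  = freq_scale;  c.attn_factor     = attn_factor;
    c.is_neox     = is_neox;     c.position_length = n_pos;
    c.valid = true;
    return true;
}
```
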
Chenguang Li 28b5f190ef
CANN: implement LRU cache for ACL graphs (#15814)
* CANN: implement LRU cache for ACL graphs in CANN backend

- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.

* fix typo

* The LRU cache capacity can be configured via an env variable

Signed-off-by: noemotiovon <757486878@qq.com>

* refactor acl graph

* refactor && fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-10 15:29:12 +08:00
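
The cache described above is the textbook list-based LRU: new graphs are pushed to the front, a hit is spliced back to the front, and the least recently used entry at the back is evicted when the (env-configurable) capacity is exceeded. A minimal generic sketch with the push / move_to_front / clear operations named above (payload type templated here; the real cache stores ggml_cann_graph objects):

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <utility>

// Minimal LRU cache sketch; front of the list = most recently used.
template <typename T>
struct lru_cache {
    size_t capacity; // e.g. read once from an environment variable
    std::list<std::unique_ptr<T>> items;

    explicit lru_cache(size_t cap) : capacity(cap) {}

    // Insert a new entry as most recently used, evicting from the back
    // (the least recently used entry) when capacity is exceeded.
    void push(std::unique_ptr<T> item) {
        if (!items.empty() && items.size() >= capacity) {
            items.pop_back();
        }
        items.push_front(std::move(item));
    }

    // On a cache hit, mark the entry as most recently used; splice moves
    // the list node without copying or invalidating iterators.
    void move_to_front(typename std::list<std::unique_ptr<T>>::iterator it) {
        items.splice(items.begin(), items, it);
    }

    void clear() { items.clear(); }
};
```
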
Ruben Ortlam ae355f6f71
vulkan: throw the oom error instead of no memory type found (#15905) 2025-09-09 22:26:03 +02:00
Jeff Bolz 4f63cd705c
vulkan: Fix OOB accesses in soft_max_back (#15861) 2025-09-09 14:41:15 +02:00
Johannes Gäßler 17bc5a815f
HIP: use v_dot2_f32_f16 instruction for FA (#15884) 2025-09-09 14:04:43 +02:00