bssrdf
c141ce3533
make CI happy
2025-10-29 22:56:27 -04:00
bssrdf
1f3d5eb8e9
prevent CI compile failure
2025-10-29 22:47:03 -04:00
bssrdf
70132278cb
more clean up
2025-10-29 21:57:12 -04:00
bssrdf
a3b4d8d31e
clean up
2025-10-29 21:46:15 -04:00
bssrdf
55859a86aa
remove implicit op and related calls; replace conv_2d with conv_2d_implicit kernel
2025-10-29 21:36:03 -04:00
bssrdf
2dfbbee73f
clean up
2025-10-29 13:19:35 -04:00
bssrdf
1e568252b5
switch to default conv2d interface
2025-10-29 12:11:26 -04:00
bssrdf
4b1920e9e7
reduced bank conflicts for output
2025-10-29 10:40:52 -04:00
bssrdf
75dde410a8
WIP: minor tweak
2025-10-28 14:41:48 -04:00
bssrdf
3ea524e9c4
WIP: almost working
2025-10-27 23:10:19 -04:00
bssrdf
6d12288037
WIP: fixed a bug in cpy transpose index computation
2025-10-27 17:32:03 -04:00
bssrdf
a3784e17ad
WIP: debugging cpy transpose
2025-10-27 15:09:03 -04:00
bssrdf
cc327f5224
added a specialization for cuda copy op when tensor is transposed
2025-10-27 11:23:27 -04:00
bssrdf
30990788e8
WIP
2025-10-27 08:29:20 -04:00
bssrdf
c68fe36ae2
WIP: cleanup; enhanced test case
2025-10-25 21:57:39 -04:00
bssrdf
475f9879c5
WIP: fixed another bug
2025-10-25 20:24:14 -04:00
bssrdf
396f55831c
WIP: bug fix
2025-10-25 18:14:12 -04:00
bssrdf
610e41ae2d
still debugging
2025-10-25 11:10:39 -04:00
bssrdf
c45df12ee7
this case is broken; to be debugged
2025-10-24 22:40:34 -04:00
bssrdf
980ddc1e87
properly use __CUDA_ARCH__ to protect the tensor path
2025-10-24 21:56:58 -04:00
bssrdf
24b553204b
WIP: fixed another bug
2025-10-24 16:53:40 -04:00
bssrdf
6c90c20cb1
WIP: bug fix
2025-10-24 15:33:57 -04:00
bssrdf
be25be8ed3
WIP: debugging tensor core kernel
2025-10-24 14:24:26 -04:00
bssrdf
80a996cfc0
WIP: tensor core code compiled ok
2025-10-24 11:41:11 -04:00
bssrdf
2715341c1d
WIP: output
2025-10-23 21:29:45 -04:00
bssrdf
66f6d16265
WIP
2025-10-23 13:52:26 -04:00
bssrdf
215ebf6526
WIP
2025-10-22 15:56:55 -04:00
bssrdf
1b69ed44c6
WIP
2025-10-21 17:15:26 -04:00
bssrdf
f931ad883f
WIP
2025-10-21 17:12:50 -04:00
bssrdf
f0a480cc22
WIP
2025-10-21 15:43:35 -04:00
bssrdf
15484c9bd6
turn on tests for implicit conv2d
2025-10-17 22:16:16 -04:00
bssrdf
6a1f8b4d57
change padding size back to 4
2025-10-15 14:21:04 -04:00
bssrdf
ac77b8d0e0
change padding size to 1; added padding to input smem
2025-10-15 14:07:24 -04:00
bssrdf
3f99818925
unroll some loops
2025-10-15 12:46:46 -04:00
bssrdf
b70cca2ea3
add support for both NCHW and NHWC layouts
2025-10-14 14:24:35 -04:00
bssrdf
3e2f722d11
fixed missing dilation
2025-10-14 11:12:55 -04:00
bssrdf
2237722056
added block variants; to be debugged
2025-10-14 11:02:10 -04:00
bssrdf
16b0f0ae3c
work in progress
2025-10-13 18:41:30 -04:00
bssrdf
0ca43582e8
reorder register tile loop
2025-10-08 13:52:56 -04:00
bssrdf
c6255442bb
minor updates
2025-10-08 13:38:16 -04:00
bssrdf
53a2ccbe12
minor update; add direct conv to benchmarking
2025-09-24 21:48:20 -04:00
bssrdf
2ec76aa8f3
Merge branch 'master' into conv2d-implicit
2025-09-10 22:04:20 -04:00
Oliver Simons
00681dfc16
CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance ( #15872 )
...
* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
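As a rough illustration of the fastdiv/fastmodulo idea referenced above: replace a hardware integer division by a runtime-constant divisor with a precomputed reciprocal multiply. This is a generic Lemire-style host-side sketch, not the helper added in this PR; all names below are made up.
```cpp
#include <cstdint>

// Precompute the magic constant M = ceil(2^64 / d) once per divisor (requires d >= 2).
static inline uint64_t fastdiv_compute_m(uint32_t d) {
    return UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;
}

// High 64 bits of a 64x32-bit product (uses the GCC/Clang __uint128_t extension).
static inline uint64_t mul128_hi(uint64_t lowbits, uint32_t d) {
    return (uint64_t) (((__uint128_t) lowbits * d) >> 64);
}

// n / d without an integer-divide instruction.
static inline uint32_t fastdiv_u32(uint32_t n, uint64_t M) {
    return (uint32_t) mul128_hi(M, n);
}

// n % d without an integer-divide instruction.
static inline uint32_t fastmod_u32(uint32_t n, uint64_t M, uint32_t d) {
    const uint64_t lowbits = M * n;   // intentional wrap-around multiply
    return (uint32_t) mul128_hi(lowbits, d);
}
```
In a broadcast kernel the divisors (tensor extents/strides) are fixed for the whole launch, so the magic constants can be computed once on the host and passed in; each division then costs one wide multiply on the device instead of a divide.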
2025-09-10 22:04:03 +02:00
Jie Fu (傅杰)
4f658855fa
llama : support T5 models with unequal number of encoder-decoder layers ( #15909 )
...
* Extend the support of T5 models with different encoder-decoder layers
Signed-off-by: Jie Fu <jiefu@tencent.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/constants.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-arch.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-hparams.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Rename n_dec_layer --> dec_n_layer
Signed-off-by: Jie Fu <jiefu@tencent.com>
* Adapt to cases when dec_n_layer > n_layer
Signed-off-by: Jie Fu <jiefu@tencent.com>
---------
Signed-off-by: Jie Fu <jiefu@tencent.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-10 20:51:51 +02:00
Sigbjørn Skjæret
6ab397e12b
graph : support non-contiguous Q in build_attn_mha ( #15908 )
...
* support non-contiguous Q in build_attn_mha
* Update src/llama-graph.cpp
ggml-ci
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-10 19:08:59 +02:00
bssrdf
735886b099
merged with upstream master
2025-09-10 12:28:59 -04:00
Daniel Bevenius
9de447d94e
ggml-cpu : fix padding in ggml_timestep_embedding ( #15917 )
...
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:
Elements 0-6 are set to cosine values
Elements 7-13 are set to sine values
Element 14 is left uninitialized (contains garbage)
Element 15 is correctly set to zero
This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero, in
addition to the last element.
Resolves: https://github.com/ggml-org/ggml/issues/1324
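A minimal sketch of the index change described above, assuming one output row of dim floats laid out as [cos | sin | padding]; this is an illustration only, not the actual ggml source (the real tensor also reserves an extra trailing pad element for odd dims, omitted here):
```cpp
#include <cmath>

// Fill one timestep-embedding row and zero the padding slot. For odd dim the
// first unwritten index is 2 * half (14 when dim == 15); zeroing
// embed_data[dim] instead would skip it.
static void timestep_embedding_row(float * embed_data, int dim, float timestep, int max_period) {
    const int half = dim / 2;                                          // dim = 15 -> half = 7
    for (int j = 0; j < half; ++j) {
        const float arg = timestep * std::exp(-std::log((float) max_period) * j / half);
        embed_data[j]        = std::cos(arg);                          // indices 0 .. half-1
        embed_data[j + half] = std::sin(arg);                          // indices half .. 2*half-1
    }
    if (dim % 2 != 0) {
        embed_data[2 * half] = 0.0f;                                   // fixed: was embed_data[dim]
    }
}
```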
2025-09-10 17:31:40 +02:00
Georgi Gerganov
0f0a3c2851
metal : make the backend async ( #15906 )
...
* metal : make the backend async
ggml-ci
* cont : add comments, extend op offload, clean up
ggml-ci
* metal : fix batch size for MUL_MAT_ID
* metal : remove deprecated ggml_backend_metal_buffer_from_ptr
* metal : create only metal buffers, no wrapping of host memory
ggml-ci
* metal : restore .alloc_buffer for buffer_from_ptr_type
ggml-ci
* metal : remove broken implementation of GGML_OP_SET
ggml-ci
* metal : clean-up loose ends, ready for tests
ggml-ci
* metal : support both private and shared buffers
ggml-ci
* metal : enable private buffers + add global device queue
* metal : disable host buffer to prevent races
ggml-ci
* metal : avoid extra copy during set_tensor
ggml-ci
* metal : use separate buffer types for shared and private Metal buffers
ggml-ci
* metal : simplify synchronization logic
ggml-ci
* metal : fix build
ggml-ci
* metal : do not implement cpy_tensor
ggml-ci
* metal : separate implementations for shared and private buffers
ggml-ci
2025-09-10 17:52:35 +03:00
Daniel Bevenius
33daece86b
ci : add caching for ROCm installation in release workflow ( #15924 )
...
This commit applies to the release workflow the same caching that
already exists for the main CI workflow, introduced in commit
ff02caf9ee ("ci : cache ROCm installation
in windows-latest-cmake-hip (#15887)").
2025-09-10 15:39:57 +02:00
Daniel Bevenius
e7b6d83b52
tests : filter out no-ops from coverage report ( #15900 )
...
* tests : filter out no-ops from coverage report
This commit is a follow-up to #15745 to address the feedback on how
no-op operations should be filtered out from the coverage report.
As for the feedback that the UNARY and GLU sub-operations are not
being handled, I am not exactly sure what should be done: they are
already included in the coverage, e.g. ABS, ELU, EXP, GELU, GEGLU,
GEGLU_ERF etc. appear in the list of covered operations:
```console
$ ./build/bin/test-backend-ops --show-coverage
Operations covered by tests (89):
✓ ABS
✓ ACC
✓ ADD
✓ ADD1
✓ ADD_ID
✓ ARANGE
✓ ARGMAX
✓ ARGSORT
✓ CLAMP
✓ CONCAT
✓ CONV_2D
✓ CONV_2D_DW
✓ CONV_3D
✓ CONV_TRANSPOSE_1D
✓ CONV_TRANSPOSE_2D
✓ COS
✓ COUNT_EQUAL
✓ CPY
✓ CROSS_ENTROPY_LOSS
✓ CROSS_ENTROPY_LOSS_BACK
✓ DIAG_MASK_INF
✓ DIV
✓ DUP
✓ ELU
✓ EXP
✓ FLASH_ATTN_EXT
✓ GATED_LINEAR_ATTN
✓ GEGLU
✓ GEGLU_ERF
✓ GEGLU_QUICK
✓ GELU
✓ GELU_ERF
✓ GELU_QUICK
✓ GET_ROWS
✓ GET_ROWS_BACK
✓ GROUP_NORM
✓ HARDSIGMOID
✓ HARDSWISH
✓ IM2COL
✓ IM2COL_3D
✓ L2_NORM
✓ LEAKY_RELU
✓ LOG
✓ MEAN
✓ MUL
✓ MUL_MAT
✓ MUL_MAT_ID
✓ NEG
✓ NORM
✓ OPT_STEP_ADAMW
✓ OPT_STEP_SGD
✓ OUT_PROD
✓ PAD
✓ PAD_REFLECT_1D
✓ POOL_2D
✓ REGLU
✓ RELU
✓ REPEAT
✓ REPEAT_BACK
✓ RMS_NORM
✓ RMS_NORM_BACK
✓ ROLL
✓ ROPE
✓ ROPE_BACK
✓ RWKV_WKV6
✓ RWKV_WKV7
✓ SCALE
✓ SET
✓ SET_ROWS
✓ SGN
✓ SIGMOID
✓ SILU
✓ SILU_BACK
✓ SIN
✓ SOFT_MAX
✓ SOFT_MAX_BACK
✓ SQR
✓ SQRT
✓ SSM_CONV
✓ SSM_SCAN
✓ STEP
✓ SUB
✓ SUM
✓ SUM_ROWS
✓ SWIGLU
✓ SWIGLU_OAI
✓ TANH
✓ TIMESTEP_EMBEDDING
✓ UPSCALE
Operations without tests (14):
✗ ADD_REL_POS
✗ CUSTOM
✗ DIAG
✗ DIAG_MASK_ZERO
✗ FLASH_ATTN_BACK
✗ GET_REL_POS
✗ IM2COL_BACK
✗ MAP_CUSTOM1
✗ MAP_CUSTOM2
✗ MAP_CUSTOM3
✗ POOL_1D
✗ POOL_2D_BACK
✗ WIN_PART
✗ WIN_UNPART
Coverage Summary:
Total operations: 103
Tested operations: 89
Untested operations: 14
Coverage: 86.4%
```
Refs: https://github.com/ggml-org/llama.cpp/pull/15745
* use of ggml_op enum values instead of strcmp
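The last bullet refers to comparing `ggml_op` enum values directly rather than matching operation-name strings. A rough sketch of how such a filter might look; the exact set of no-op operations treated this way is an assumption here:
```cpp
#include "ggml.h"

// Sketch: decide whether an op is a no-op that should be excluded from the
// coverage report by comparing enum values instead of calling strcmp on names.
static bool op_is_noop(enum ggml_op op) {
    switch (op) {
        case GGML_OP_NONE:
        case GGML_OP_VIEW:       // layout-only ops: nothing to compute or test
        case GGML_OP_RESHAPE:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;
        default:
            return false;
    }
}
```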
2025-09-10 14:17:09 +02:00