bssrdf
|
d9a48580fc
|
use a better criterion to use split-k
|
2025-11-05 13:58:25 -05:00 |
bssrdf
|
688de6d7d8
|
fixed bug; split-k is now working
|
2025-11-05 13:47:38 -05:00 |
bssrdf
|
6f44f47113
|
added split-k mode for skinny mnk shapes
|
2025-11-05 13:04:37 -05:00 |
bssrdf
|
275c08d25d
|
add more sd-like test cases
|
2025-11-04 15:16:31 -05:00 |
bssrdf
|
00a49c2fc1
|
another CI fix
|
2025-11-03 19:49:56 -05:00 |
bssrdf
|
8572313000
|
remove trailing blank
|
2025-11-03 19:45:22 -05:00 |
bssrdf
|
27881fbe7b
|
fixes for CI
|
2025-11-03 19:43:55 -05:00 |
bssrdf
|
fa9e415c9b
|
minor update of test case
|
2025-11-03 15:48:57 -05:00 |
bssrdf
|
f95664c76c
|
make tensor core path available for cc 7.5 and above
|
2025-11-01 14:35:44 -04:00 |
bssrdf
|
417cfc3cc6
|
added a test case to directly compare im2col and implicit gemm
|
2025-10-31 19:57:28 -04:00 |
bssrdf
|
c1f67c19e0
|
make CI happy
|
2025-10-29 23:23:21 -04:00 |
bssrdf
|
2b5351a898
|
make CI happy
|
2025-10-29 23:17:36 -04:00 |
bssrdf
|
c141ce3533
|
make CI happy
|
2025-10-29 22:56:27 -04:00 |
bssrdf
|
1f3d5eb8e9
|
prevent CI compile failure
|
2025-10-29 22:47:03 -04:00 |
bssrdf
|
70132278cb
|
more clean up
|
2025-10-29 21:57:12 -04:00 |
bssrdf
|
a3b4d8d31e
|
clean up
|
2025-10-29 21:46:15 -04:00 |
bssrdf
|
55859a86aa
|
remove implicit op and related calls; replace conv_2d with conv_2d_implicit kernel
|
2025-10-29 21:36:03 -04:00 |
bssrdf
|
2dfbbee73f
|
clean up
|
2025-10-29 13:19:35 -04:00 |
bssrdf
|
1e568252b5
|
switch to default conv2d interface
|
2025-10-29 12:11:26 -04:00 |
bssrdf
|
4b1920e9e7
|
reduced bank conflicts for output
|
2025-10-29 10:40:52 -04:00 |
bssrdf
|
75dde410a8
|
WIP: minor tweak
|
2025-10-28 14:41:48 -04:00 |
bssrdf
|
3ea524e9c4
|
WIP: almost working
|
2025-10-27 23:10:19 -04:00 |
bssrdf
|
6d12288037
|
WIP: fixed a bug in cpy transpose index computation
|
2025-10-27 17:32:03 -04:00 |
bssrdf
|
a3784e17ad
|
WIP: debugging cpy transpose
|
2025-10-27 15:09:03 -04:00 |
bssrdf
|
cc327f5224
|
added a specialization for cuda copy op when tensor is transposed
|
2025-10-27 11:23:27 -04:00 |
bssrdf
|
30990788e8
|
WIP
|
2025-10-27 08:29:20 -04:00 |
bssrdf
|
c68fe36ae2
|
WIP: cleanup; enhanced test case
|
2025-10-25 21:57:39 -04:00 |
bssrdf
|
475f9879c5
|
WIP: fixed another bug
|
2025-10-25 20:24:14 -04:00 |
bssrdf
|
396f55831c
|
WIP: bug fix
|
2025-10-25 18:14:12 -04:00 |
bssrdf
|
610e41ae2d
|
still debugging
|
2025-10-25 11:10:39 -04:00 |
bssrdf
|
c45df12ee7
|
this case is broken; to be debugged
|
2025-10-24 22:40:34 -04:00 |
bssrdf
|
980ddc1e87
|
properly use __CUDA_ARCH__ to protect the tensor path
|
2025-10-24 21:56:58 -04:00 |
bssrdf
|
24b553204b
|
WIP: fixed another bug
|
2025-10-24 16:53:40 -04:00 |
bssrdf
|
6c90c20cb1
|
WIP: bug fix
|
2025-10-24 15:33:57 -04:00 |
bssrdf
|
be25be8ed3
|
WIP: debugging tensor core kernel
|
2025-10-24 14:24:26 -04:00 |
bssrdf
|
80a996cfc0
|
WIP: tensor core code compiled ok
|
2025-10-24 11:41:11 -04:00 |
bssrdf
|
2715341c1d
|
WIP: output
|
2025-10-23 21:29:45 -04:00 |
bssrdf
|
66f6d16265
|
WIP
|
2025-10-23 13:52:26 -04:00 |
bssrdf
|
215ebf6526
|
WIP
|
2025-10-22 15:56:55 -04:00 |
bssrdf
|
1b69ed44c6
|
WIP
|
2025-10-21 17:15:26 -04:00 |
bssrdf
|
f931ad883f
|
WIP
|
2025-10-21 17:12:50 -04:00 |
bssrdf
|
f0a480cc22
|
WIP
|
2025-10-21 15:43:35 -04:00 |
bssrdf
|
15484c9bd6
|
turn on tests for implicit conv2d
|
2025-10-17 22:16:16 -04:00 |
bssrdf
|
6a1f8b4d57
|
change padding size back to 4
|
2025-10-15 14:21:04 -04:00 |
bssrdf
|
ac77b8d0e0
|
change padding size to 1; added padding to input smem
|
2025-10-15 14:07:24 -04:00 |
bssrdf
|
3f99818925
|
unroll some loops
|
2025-10-15 12:46:46 -04:00 |
bssrdf
|
b70cca2ea3
|
add support for both NCHW and NHWC layouts
|
2025-10-14 14:24:35 -04:00 |
bssrdf
|
3e2f722d11
|
fixed missing dilation
|
2025-10-14 11:12:55 -04:00 |
bssrdf
|
2237722056
|
added block variants; to be debugged
|
2025-10-14 11:02:10 -04:00 |
bssrdf
|
16b0f0ae3c
|
work in progress
|
2025-10-13 18:41:30 -04:00 |