bssrdf | 3ea524e9c4 | WIP: almost working | 2025-10-27 23:10:19 -04:00
bssrdf | a3784e17ad | WIP: debugging cpy transpose | 2025-10-27 15:09:03 -04:00
bssrdf | 30990788e8 | WIP | 2025-10-27 08:29:20 -04:00
bssrdf | c68fe36ae2 | WIP: cleanup; enhanced test case | 2025-10-25 21:57:39 -04:00
bssrdf | 475f9879c5 | WIP: fixed another bug | 2025-10-25 20:24:14 -04:00
bssrdf | 396f55831c | WIP: bug fix | 2025-10-25 18:14:12 -04:00
bssrdf | 610e41ae2d | still debugging | 2025-10-25 11:10:39 -04:00
bssrdf | c45df12ee7 | this case is broken; to be debugged | 2025-10-24 22:40:34 -04:00
bssrdf | 980ddc1e87 | properly use __CUDA_ARCH__ to protect the tensor path | 2025-10-24 21:56:58 -04:00
bssrdf | 24b553204b | WIP: fixed another bug | 2025-10-24 16:53:40 -04:00
bssrdf | 6c90c20cb1 | WIP: bug fix | 2025-10-24 15:33:57 -04:00
bssrdf | be25be8ed3 | WIP: debugging tensor core kernel | 2025-10-24 14:24:26 -04:00
bssrdf | 3f99818925 | unroll some loops | 2025-10-15 12:46:46 -04:00
bssrdf | b70cca2ea3 | add support for both NCHW and NHWC layouts | 2025-10-14 14:24:35 -04:00
bssrdf | 2237722056 | added block variants; to be debugged | 2025-10-14 11:02:10 -04:00
bssrdf | c6255442bb | minor updates | 2025-10-08 13:38:16 -04:00
bssrdf | 53a2ccbe12 | minor update and add direct conv in benchmarking | 2025-09-24 21:48:20 -04:00
bssrdf | 83a3b7d6a9 | Refactor conv2d_implicit_kernel for improved bitwise operations; add test for implicit convolution | 2025-09-06 17:26:19 -04:00