bssrdf | ecbbdb6608 | reducing integer ops | 2025-11-14 13:05:31 -05:00
bssrdf | 0cb1ff419a | move some register to const memory space | 2025-11-14 12:02:13 -05:00
bssrdf | b015e4b7dc | WIP: fixed bugs now results are correct | 2025-11-14 11:10:34 -05:00
bssrdf | 7d99222a61 | WIP: debugging | 2025-11-13 22:08:41 -05:00
bssrdf | 0939511846 | change mac loop to match cutlass | 2025-11-13 15:45:43 -05:00
bssrdf | fac6f0adc3 | add missing batch index bounds check | 2025-11-10 20:05:39 -05:00
bssrdf | a1fb3c1509 | fixed a bug now split-k can choose a better split factor | 2025-11-08 16:45:59 -05:00
bssrdf | 68ccd2a899 | refactor cuda core code path | 2025-11-06 09:54:01 -05:00
bssrdf | 09e3a5f07d | try to reduce index calculation | 2025-11-05 22:02:57 -05:00
bssrdf | 688de6d7d8 | fixed bug now split-k is working | 2025-11-05 13:47:38 -05:00
bssrdf | 6f44f47113 | added split-k mode for skinny mnk shapes | 2025-11-05 13:04:37 -05:00
bssrdf | c1f67c19e0 | make CI happy | 2025-10-29 23:23:21 -04:00
bssrdf | 2b5351a898 | make CI happy | 2025-10-29 23:17:36 -04:00
bssrdf | 2dfbbee73f | clean up | 2025-10-29 13:19:35 -04:00
bssrdf | 980ddc1e87 | properly use __CUDA_ARCH__ to protect the tensor path | 2025-10-24 21:56:58 -04:00
bssrdf | 6c90c20cb1 | WIP: bug fix | 2025-10-24 15:33:57 -04:00
bssrdf | be25be8ed3 | WIP: debugging tensor core kernel | 2025-10-24 14:24:26 -04:00
bssrdf | 80a996cfc0 | WIP: tensore code compiled ok | 2025-10-24 11:41:11 -04:00
bssrdf | 66f6d16265 | WIP | 2025-10-23 13:52:26 -04:00
bssrdf | 215ebf6526 | WIP | 2025-10-22 15:56:55 -04:00
bssrdf | f931ad883f | WIP | 2025-10-21 17:12:50 -04:00
bssrdf | b70cca2ea3 | add support for both NCHW and NHWC layouts | 2025-10-14 14:24:35 -04:00
bssrdf | 2237722056 | added block variants; to be debugged | 2025-10-14 11:02:10 -04:00
bssrdf | 16b0f0ae3c | work in progress | 2025-10-13 18:41:30 -04:00
bssrdf | 8a589317b6 | Add implicit GEMM convolution operation for 2D tensors in CUDA | 2025-09-02 22:47:41 -04:00