Commit Graph

25 Commits

Author SHA1 Message Date
bssrdf ecbbdb6608 reducing integer ops 2025-11-14 13:05:31 -05:00
bssrdf 0cb1ff419a move some register to const memory space 2025-11-14 12:02:13 -05:00
bssrdf b015e4b7dc WIP: fixed bugs now results are correct 2025-11-14 11:10:34 -05:00
bssrdf 7d99222a61 WIP: debugging 2025-11-13 22:08:41 -05:00
bssrdf 0939511846 change mac loop to match cutlass 2025-11-13 15:45:43 -05:00
bssrdf fac6f0adc3 add missing batch index bounds check 2025-11-10 20:05:39 -05:00
bssrdf a1fb3c1509 fixed a bug now split-k can choose a better split factor 2025-11-08 16:45:59 -05:00
bssrdf 68ccd2a899 refactor cuda core code path 2025-11-06 09:54:01 -05:00
bssrdf 09e3a5f07d try to reduce index calculation 2025-11-05 22:02:57 -05:00
bssrdf 688de6d7d8 fixed bug now split-k is working 2025-11-05 13:47:38 -05:00
bssrdf 6f44f47113 added split-k mode for skinny mnk shapes 2025-11-05 13:04:37 -05:00
bssrdf c1f67c19e0 make CI happy 2025-10-29 23:23:21 -04:00
bssrdf 2b5351a898 make CI happy 2025-10-29 23:17:36 -04:00
bssrdf 2dfbbee73f clean up 2025-10-29 13:19:35 -04:00
bssrdf 980ddc1e87 properly use __CUDA_ARCH__ to protect the tensor path 2025-10-24 21:56:58 -04:00
bssrdf 6c90c20cb1 WIP: bug fix 2025-10-24 15:33:57 -04:00
bssrdf be25be8ed3 WIP: debugging tensor core kernel 2025-10-24 14:24:26 -04:00
bssrdf 80a996cfc0 WIP: tensore code compiled ok 2025-10-24 11:41:11 -04:00
bssrdf 66f6d16265 WIP 2025-10-23 13:52:26 -04:00
bssrdf 215ebf6526 WIP 2025-10-22 15:56:55 -04:00
bssrdf f931ad883f WIP 2025-10-21 17:12:50 -04:00
bssrdf b70cca2ea3 add support for both NCHW and NHWC layouts 2025-10-14 14:24:35 -04:00
bssrdf 2237722056 added block variants; to be debugged 2025-10-14 11:02:10 -04:00
bssrdf 16b0f0ae3c work in progress 2025-10-13 18:41:30 -04:00
bssrdf 8a589317b6 Add implicit GEMM convolution operation for 2D tensors in CUDA 2025-09-02 22:47:41 -04:00