Commit Graph

6524 Commits

Author SHA1 Message Date
bssrdf a2db92f41c make CI happy 2025-11-08 20:33:05 -05:00
bssrdf 6106e9068b make CI happy 2025-11-08 19:35:29 -05:00
bssrdf a3fb36fb71 make split-k condition check more robust 2025-11-08 18:47:12 -05:00
bssrdf a1fb3c1509 fixed a bug now split-k can choose a better split factor 2025-11-08 16:45:59 -05:00
bssrdf 9cbc099493 broken for some test cases 2025-11-08 14:51:45 -05:00
bssrdf 64ead3fd4f remove commented code 2025-11-07 23:21:30 -05:00
bssrdf 414bb8d9ed further reduce index swizzling computation cycles 2025-11-07 23:20:46 -05:00
bssrdf 8809af79a8 now bank conflicts free and performance get a bit boosted too 2025-11-07 22:11:21 -05:00
bssrdf 949eca4cba swizzling working, may still have room to optimize 2025-11-07 19:20:12 -05:00
bssrdf 76885c7697 WIP: debugging 2025-11-07 17:44:00 -05:00
bssrdf df88b2c917 trying to get rid of remaining bank conflicts; also fixed a bug for split-k condition check 2025-11-07 15:38:36 -05:00
bssrdf 4e9ebe92e0 minor update 2025-11-06 22:31:28 -05:00
bssrdf ba70ad8e59 added test cases exactly replicating sdxl unet steps 2025-11-06 20:35:37 -05:00
bssrdf 28b7094750 Merge branch 'refactor-cuda-core-path' into conv2d-implicit 2025-11-06 11:05:06 -05:00
bssrdf 311213d209 make sure there are enough channels for split-k 2025-11-06 10:21:49 -05:00
bssrdf 68ccd2a899 refactor cuda core code path 2025-11-06 09:54:01 -05:00
bssrdf 09e3a5f07d try to reduce index calculation 2025-11-05 22:02:57 -05:00
bssrdf d9a48580fc use a better criterion to use split-k 2025-11-05 13:58:25 -05:00
bssrdf 688de6d7d8 fixed bug now split-k is working 2025-11-05 13:47:38 -05:00
bssrdf 6f44f47113 added split-k mode for skinny mnk shapes 2025-11-05 13:04:37 -05:00
bssrdf 275c08d25d add more sd like test cases 2025-11-04 15:16:31 -05:00
bssrdf 00a49c2fc1 another CI fix 2025-11-03 19:49:56 -05:00
bssrdf 8572313000 remove trailing blank 2025-11-03 19:45:22 -05:00
bssrdf 27881fbe7b fixes for CI 2025-11-03 19:43:55 -05:00
bssrdf fa9e415c9b minor update of test case 2025-11-03 15:48:57 -05:00
bssrdf f95664c76c make tensor core path available for cc 7.5 and above 2025-11-01 14:35:44 -04:00
bssrdf 417cfc3cc6 added a test case to directly compare im2col and implicit gemm 2025-10-31 19:57:28 -04:00
bssrdf c1f67c19e0 make CI happy 2025-10-29 23:23:21 -04:00
bssrdf 2b5351a898 make CI happy 2025-10-29 23:17:36 -04:00
bssrdf c141ce3533 make CI happy 2025-10-29 22:56:27 -04:00
bssrdf 1f3d5eb8e9 prevent CI compile failure 2025-10-29 22:47:03 -04:00
bssrdf 70132278cb more clean up 2025-10-29 21:57:12 -04:00
bssrdf a3b4d8d31e clean up 2025-10-29 21:46:15 -04:00
bssrdf 55859a86aa remove implicit op and related calls; replace conv_2d with conv_2d_implicit kernel 2025-10-29 21:36:03 -04:00
bssrdf 2dfbbee73f clean up 2025-10-29 13:19:35 -04:00
bssrdf 1e568252b5 switch to default conv2d interface 2025-10-29 12:11:26 -04:00
bssrdf 4b1920e9e7 reduced bank conflicts for output 2025-10-29 10:40:52 -04:00
bssrdf 75dde410a8 WIP: minor tweak 2025-10-28 14:41:48 -04:00
bssrdf 3ea524e9c4 WIP: almost working 2025-10-27 23:10:19 -04:00
bssrdf 6d12288037 WIP: fixed a bug in cpy transpose index computation 2025-10-27 17:32:03 -04:00
bssrdf a3784e17ad WIP: debugging cpy transpose 2025-10-27 15:09:03 -04:00
bssrdf cc327f5224 added a specialization for cuda copy op when tensor is transposed 2025-10-27 11:23:27 -04:00
bssrdf 30990788e8 WIP 2025-10-27 08:29:20 -04:00
bssrdf c68fe36ae2 WIP: cleanup; enhanced test case 2025-10-25 21:57:39 -04:00
bssrdf 475f9879c5 WIP: fixed another bug 2025-10-25 20:24:14 -04:00
bssrdf 396f55831c WIP: bug fix 2025-10-25 18:14:12 -04:00
bssrdf 610e41ae2d still debugging 2025-10-25 11:10:39 -04:00
bssrdf c45df12ee7 this case is broken; to be debugged 2025-10-24 22:40:34 -04:00
bssrdf 980ddc1e87 properly use __CUDA_ARCH__ to protect the tensor path 2025-10-24 21:56:58 -04:00
bssrdf 24b553204b WIP: fixed another bug 2025-10-24 16:53:40 -04:00