bssrdf
1fdcb05dc8
increase maximum split factor to 16; use better heuristics to choose split-K factor, reducing tail effect
2025-11-10 11:47:56 -05:00
bssrdf
5ed2c1b787
reduce bank conflicts in filter transpose
2025-11-09 00:51:51 -05:00
bssrdf
8e0e944b70
reduced uncoalesced global access in filter transpose
2025-11-09 00:14:56 -05:00
bssrdf
a2db92f41c
make CI happy
2025-11-08 20:33:05 -05:00
bssrdf
a3fb36fb71
make split-k condition check more robust
2025-11-08 18:47:12 -05:00
bssrdf
a1fb3c1509
fixed a bug; now split-k can choose a better split factor
2025-11-08 16:45:59 -05:00
bssrdf
9cbc099493
broken for some test cases
2025-11-08 14:51:45 -05:00
bssrdf
414bb8d9ed
further reduce index swizzling computation cycles
2025-11-07 23:20:46 -05:00
bssrdf
8809af79a8
now bank-conflict free, and performance gets a bit of a boost too
2025-11-07 22:11:21 -05:00
bssrdf
949eca4cba
swizzling working, may still have room to optimize
2025-11-07 19:20:12 -05:00
bssrdf
df88b2c917
trying to get rid of remaining bank conflicts; also fixed a bug for split-k condition check
2025-11-07 15:38:36 -05:00
bssrdf
4e9ebe92e0
minor update
2025-11-06 22:31:28 -05:00
bssrdf
ba70ad8e59
added test cases exactly replicating sdxl unet steps
2025-11-06 20:35:37 -05:00
bssrdf
311213d209
make sure there are enough channels for split-k
2025-11-06 10:21:49 -05:00
bssrdf
09e3a5f07d
try to reduce index calculation
2025-11-05 22:02:57 -05:00
bssrdf
688de6d7d8
fixed bug; now split-k is working
2025-11-05 13:47:38 -05:00
bssrdf
6f44f47113
added split-k mode for skinny mnk shapes
2025-11-05 13:04:37 -05:00
bssrdf
275c08d25d
add more SD-like test cases
2025-11-04 15:16:31 -05:00
bssrdf
00a49c2fc1
another CI fix
2025-11-03 19:49:56 -05:00
bssrdf
8572313000
remove trailing blank
2025-11-03 19:45:22 -05:00
bssrdf
27881fbe7b
fixes for CI
2025-11-03 19:43:55 -05:00
bssrdf
fa9e415c9b
minor update of test case
2025-11-03 15:48:57 -05:00
bssrdf
417cfc3cc6
added a test case to directly compare im2col and implicit gemm
2025-10-31 19:57:28 -04:00
bssrdf
70132278cb
more clean up
2025-10-29 21:57:12 -04:00
bssrdf
a3b4d8d31e
clean up
2025-10-29 21:46:15 -04:00
bssrdf
55859a86aa
remove implicit op and related calls; replace conv_2d with conv_2d_implicit kernel
2025-10-29 21:36:03 -04:00
bssrdf
2dfbbee73f
clean up
2025-10-29 13:19:35 -04:00
bssrdf
1e568252b5
switch to default conv2d interface
2025-10-29 12:11:26 -04:00
bssrdf
4b1920e9e7
reduced bank conflicts for output
2025-10-29 10:40:52 -04:00
bssrdf
75dde410a8
WIP: minor tweak
2025-10-28 14:41:48 -04:00
bssrdf
3ea524e9c4
WIP: almost working
2025-10-27 23:10:19 -04:00
bssrdf
a3784e17ad
WIP: debugging cpy transpose
2025-10-27 15:09:03 -04:00
bssrdf
30990788e8
WIP
2025-10-27 08:29:20 -04:00
bssrdf
c68fe36ae2
WIP: cleanup; enhanced test case
2025-10-25 21:57:39 -04:00
bssrdf
475f9879c5
WIP: fixed another bug
2025-10-25 20:24:14 -04:00
bssrdf
396f55831c
WIP: bug fix
2025-10-25 18:14:12 -04:00
bssrdf
610e41ae2d
still debugging
2025-10-25 11:10:39 -04:00
bssrdf
c45df12ee7
this case is broken; to be debugged
2025-10-24 22:40:34 -04:00
bssrdf
980ddc1e87
properly use __CUDA_ARCH__ to protect the tensor path
2025-10-24 21:56:58 -04:00
bssrdf
24b553204b
WIP: fixed another bug
2025-10-24 16:53:40 -04:00
bssrdf
6c90c20cb1
WIP: bug fix
2025-10-24 15:33:57 -04:00
bssrdf
be25be8ed3
WIP: debugging tensor core kernel
2025-10-24 14:24:26 -04:00
bssrdf
15484c9bd6
turn on tests for implicit conv2d
2025-10-17 22:16:16 -04:00
bssrdf
3f99818925
unroll some loops
2025-10-15 12:46:46 -04:00
bssrdf
b70cca2ea3
add support for both NCHW and NHWC layouts
2025-10-14 14:24:35 -04:00
bssrdf
2237722056
added block variants; to be debugged
2025-10-14 11:02:10 -04:00
bssrdf
c6255442bb
minor updates
2025-10-08 13:38:16 -04:00
bssrdf
53a2ccbe12
minor update and add direct conv in benchmarking
2025-09-24 21:48:20 -04:00
bssrdf
2ec76aa8f3
Merge branch 'master' into conv2d-implicit
2025-09-10 22:04:20 -04:00
Oliver Simons
00681dfc16
CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (#15872)
* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
2025-09-10 22:04:03 +02:00