bssrdf
|
11bd9806bf
|
add/fix GGML_UNUSED
|
2025-11-14 17:01:24 -05:00 |
bssrdf
|
e4fbece606
|
various small optimizations
|
2025-11-14 13:51:07 -05:00 |
bssrdf
|
ecbbdb6608
|
reducing integer ops
|
2025-11-14 13:05:31 -05:00 |
bssrdf
|
b4530b4f8b
|
disable m16n8k16 mma for ampere for now
|
2025-11-14 12:11:52 -05:00 |
bssrdf
|
0cb1ff419a
|
move some register to const memory space
|
2025-11-14 12:02:13 -05:00 |
bssrdf
|
b015e4b7dc
|
WIP: fixed bugs; results are now correct
|
2025-11-14 11:10:34 -05:00 |
bssrdf
|
7d99222a61
|
WIP: debugging
|
2025-11-13 22:08:41 -05:00 |
bssrdf
|
63c53fe1f1
|
WIP: move rs loop into block-k-loop following cutlass
|
2025-11-13 18:44:32 -05:00 |
bssrdf
|
8bfb7ed2f2
|
restore smem pointer at the end of every rs loop
|
2025-11-13 16:32:27 -05:00 |
bssrdf
|
0939511846
|
change mac loop to match cutlass
|
2025-11-13 15:45:43 -05:00 |
bssrdf
|
9f498d29f1
|
only enable m16n8k16 on ampere or above
|
2025-11-12 11:55:15 -05:00 |
bssrdf
|
ea438d8b0e
|
trying to reduce integer ops; simplify code
|
2025-11-12 11:32:27 -05:00 |
bssrdf
|
c33e4301dc
|
m16n8k16 mma works; to be cleaned up
|
2025-11-12 10:26:01 -05:00 |
bssrdf
|
fac6f0adc3
|
add missing batch index bounds check
|
2025-11-10 20:05:39 -05:00 |
bssrdf
|
a660d4d45d
|
get rid of a convert unary kernel call and fuse the type cast into conv epilogue
|
2025-11-10 12:39:50 -05:00 |
bssrdf
|
1fdcb05dc8
|
increase maximum split factor to 16; use better heuristics to choose split-K factor, reducing tail effect
|
2025-11-10 11:47:56 -05:00 |
bssrdf
|
496c3599c6
|
add loop unrolling
|
2025-11-09 09:23:14 -05:00 |
bssrdf
|
5ed2c1b787
|
reduce bank conflicts in filter transpose
|
2025-11-09 00:51:51 -05:00 |
bssrdf
|
8e0e944b70
|
reduced uncoalesced global access in filter transpose
|
2025-11-09 00:14:56 -05:00 |
bssrdf
|
a2db92f41c
|
make CI happy
|
2025-11-08 20:33:05 -05:00 |
bssrdf
|
6106e9068b
|
make CI happy
|
2025-11-08 19:35:29 -05:00 |
bssrdf
|
a3fb36fb71
|
make split-k condition check more robust
|
2025-11-08 18:47:12 -05:00 |
bssrdf
|
a1fb3c1509
|
fixed a bug; split-k can now choose a better split factor
|
2025-11-08 16:45:59 -05:00 |
bssrdf
|
9cbc099493
|
broken for some test cases
|
2025-11-08 14:51:45 -05:00 |
bssrdf
|
64ead3fd4f
|
remove commented code
|
2025-11-07 23:21:30 -05:00 |
bssrdf
|
414bb8d9ed
|
further reduce index swizzling computation cycles
|
2025-11-07 23:20:46 -05:00 |
bssrdf
|
8809af79a8
|
now free of bank conflicts, and performance gets a small boost too
|
2025-11-07 22:11:21 -05:00 |
bssrdf
|
949eca4cba
|
swizzling working, may still have room to optimize
|
2025-11-07 19:20:12 -05:00 |
bssrdf
|
76885c7697
|
WIP: debugging
|
2025-11-07 17:44:00 -05:00 |
bssrdf
|
df88b2c917
|
trying to get rid of remaining bank conflicts; also fixed a bug for split-k condition check
|
2025-11-07 15:38:36 -05:00 |
bssrdf
|
4e9ebe92e0
|
minor update
|
2025-11-06 22:31:28 -05:00 |
bssrdf
|
ba70ad8e59
|
added test cases exactly replicating sdxl unet steps
|
2025-11-06 20:35:37 -05:00 |
bssrdf
|
28b7094750
|
Merge branch 'refactor-cuda-core-path' into conv2d-implicit
|
2025-11-06 11:05:06 -05:00 |
bssrdf
|
311213d209
|
make sure there are enough channels for split-k
|
2025-11-06 10:21:49 -05:00 |
bssrdf
|
68ccd2a899
|
refactor cuda core code path
|
2025-11-06 09:54:01 -05:00 |
bssrdf
|
09e3a5f07d
|
try to reduce index calculation
|
2025-11-05 22:02:57 -05:00 |
bssrdf
|
d9a48580fc
|
use a better criterion for enabling split-k
|
2025-11-05 13:58:25 -05:00 |
bssrdf
|
688de6d7d8
|
fixed bug; split-k is now working
|
2025-11-05 13:47:38 -05:00 |
bssrdf
|
6f44f47113
|
added split-k mode for skinny mnk shapes
|
2025-11-05 13:04:37 -05:00 |
bssrdf
|
275c08d25d
|
add more SD-like test cases
|
2025-11-04 15:16:31 -05:00 |
bssrdf
|
00a49c2fc1
|
another CI fix
|
2025-11-03 19:49:56 -05:00 |
bssrdf
|
8572313000
|
remove trailing blank
|
2025-11-03 19:45:22 -05:00 |
bssrdf
|
27881fbe7b
|
fixes for CI
|
2025-11-03 19:43:55 -05:00 |
bssrdf
|
fa9e415c9b
|
minor update of test case
|
2025-11-03 15:48:57 -05:00 |
bssrdf
|
f95664c76c
|
make tensor core path available for cc 7.5 and above
|
2025-11-01 14:35:44 -04:00 |
bssrdf
|
417cfc3cc6
|
added a test case to directly compare im2col and implicit gemm
|
2025-10-31 19:57:28 -04:00 |
bssrdf
|
c1f67c19e0
|
make CI happy
|
2025-10-29 23:23:21 -04:00 |
bssrdf
|
2b5351a898
|
make CI happy
|
2025-10-29 23:17:36 -04:00 |
bssrdf
|
c141ce3533
|
make CI happy
|
2025-10-29 22:56:27 -04:00 |
bssrdf
|
1f3d5eb8e9
|
prevent CI compile failure
|
2025-10-29 22:47:03 -04:00 |