* CUDA: experimental native mxfp4 support for blackwell * optimize load_tiles * optimize quantize_mxfp4 * cleanup * first pass review: formatting * use interleaved layout for mma * mmq: add assert for size * use __nv_fp4x4_e2m1 * use iter_k as 512, cleanup * Use 1200 as blackwell instead of 1000 * address review comments * mmq: fix stride * quantize.cu: use reference impl of e8m0 scale * address review comments * add 120f-virtual + minor fixes --------- Co-authored-by: Aman Gupta <aman> |
||
|---|---|---|
| .. | ||
| cmake | ||
| include | ||
| src | ||
| .gitignore | ||
| CMakeLists.txt | ||