- Tunes l_warptile to match m_warptile (64x64) for ARM GPUs, fixing low occupancy on medium-sized matrices.
- Re-enables FP16/BF16 support for ARM, as the tiling fix resolves the performance regression.
- Adds comments clarifying the UMA memory allocation fallback strategy.

| Name |
|---|
| cmake |
| include |
| src |
| .gitignore |
| CMakeLists.txt |