Llama 4 vision models use a head dimension of D=88. Previously, this fell back to unoptimized operations, causing massive VRAM bloat and slow inference. This adds D=88 to the CUDA flash attention tile backend. It explicitly excludes 88 from the Turing/Volta/WMMA/MMA Tensor Core checks to prevent memory misalignment/segfaults, forcing the fallback to the TILE kernel. Also updates generate_cu_files.py to dynamically generate the required template instance.
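The generator change can be sketched roughly as follows. Note this is an illustrative sketch only: the constant names, template text, and the `DECL_FATTN_TILE_CASE` macro are hypothetical placeholders, not the actual contents of generate_cu_files.py.

```python
# Hypothetical sketch of generating one .cu template instance per
# supported head size, with 88 added for Llama 4 vision models.
# All identifiers here are illustrative, not the real generator's.
HEAD_SIZES = [64, 80, 88, 96, 112, 128, 256]

TEMPLATE = """// This file has been autogenerated, do not edit manually.

#include "../fattn-tile.cuh"

DECL_FATTN_TILE_CASE({head_size});
"""

def instance_files(head_sizes):
    # Map each head size to a generated source file name and its contents.
    return {
        f"fattn-tile-instance-d{d}.cu": TEMPLATE.format(head_size=d)
        for d in sorted(head_sizes)
    }

files = instance_files(HEAD_SIZES)
```

Driving the instance list from a single table like this is what lets a new head size (here 88) be added in one place rather than by hand-writing another .cu file.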