The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change does the same for Vulkan. This helps particularly with large head sizes, which are very register-limited. I tried this for the coopmat1 path and it was slightly slower. I didn't try it for the scalar path. I applied the softmax bias that the CUDA backend uses to avoid overflow, although I was not able to reproduce the original bug without it.
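To illustrate why a softmax bias can help an fp16 accumulator, here is a minimal sketch of a streaming (online-softmax) attention update for one query and scalar values. It is not the shader code from this change: `to_half`, `attend`, and the bias value `log(8)` are hypothetical names and numbers chosen for the example, and fp16 is emulated by round-tripping through Python's `struct` half-precision format. The idea shown is that adding a constant to the running maximum scales every softmax weight down by the same factor, which shrinks the accumulator without changing the normalized result.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round x to IEEE binary16; values out of range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def attend(scores, values, bias=0.0):
    """Streaming softmax(scores) . values with an emulated fp16 accumulator.

    `bias` is added to each score before taking the running max, so every
    weight exp(s - m) is at most exp(-bias). The value used below is an
    assumption for illustration only.
    """
    m = -math.inf  # running (biased) maximum
    l = 0.0        # softmax denominator, kept in full precision
    acc = 0.0      # VKQ-style accumulator, rounded to half after each step
    for s, v in zip(scores, values):
        m_new = max(m, s + bias)
        scale = math.exp(m - m_new)  # rescales old state; 0.0 on the first step
        p = math.exp(s - m_new)      # unnormalized softmax weight
        l = l * scale + p
        acc = to_half(to_half(acc * scale) + to_half(p * v))
        m = m_new
    return acc / l


scores = [0.0] * 128
values = [1024.0] * 128

unbiased = attend(scores, values)                   # accumulator exceeds the fp16 range
biased = attend(scores, values, bias=math.log(8.0)) # accumulator stays finite
print(unbiased, biased)
```

With equal scores the exact answer is simply the common value, so the two calls differ only in whether the half-precision accumulator overflowed along the way. The trade-off is that the bias also pushes very small weights closer to underflow, so it exchanges range at the bottom for headroom at the top.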