ggml-webgpu: improve flashAttention performance by software pipelining (#19151)
* webgpu : pipeline flash_attn Q/K loads in WGSL
* ggml-webgpu: unroll Q*K accumulation inner loop
* ggml-webgpu: vectorization
* ggml-webgpu: unrolling
* ggml-webgpu: remove redundant unrolling
* ggml-webgpu: restore the config
* ggml-webgpu: remove redundant comments
* ggml-webgpu: formatting
* ggml-webgpu: formatting and remove vectorization
* ggml-webgpu: remove unnecessary constants
* ggml-webgpu: change QKV buffer to read_write to pass validation
* ggml-webgpu: add explanation for the additional bracket around Q K accumulate
* Indentation and for -> if for tail
* Kick off CI on wgsl only commits
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>