Commit Graph

21 Commits

Author SHA1 Message Date
Georgi Gerganov c70bfd7bcb cuda : "constexpr dim3" -> "const dim3" ggml-ci 2024-04-22 20:31:23 +03:00
Georgi Gerganov 5408d55506 cuda : uint -> uint32_t 2024-04-22 19:12:06 +03:00
Johannes Gäßler 87968de9a9 fix KQ FP32 precision fpr parallel_blocks > 1 2024-04-18 13:15:32 +02:00
Johannes Gäßler 0bc67dd1c8 Calculate KQ as FP32 if KQV has GGML_PREC_F32 2024-04-18 13:15:32 +02:00
Johannes Gäßler a5b0e2dea0 store temp KQ in registers 2024-04-18 13:15:32 +02:00
Johannes Gäßler ef9e1593f3 flush softmax exp below threshold to 0 2024-04-18 13:15:32 +02:00
Johannes Gäßler 6a3b84236d fix flash_attn_vec_f16 race condition 2024-04-18 13:15:32 +02:00
Johannes Gäßler 34f93bbb39 CUDA: refactor host code, dyn. par. blocks 2024-04-18 13:15:32 +02:00
Johannes Gäßler ee19a4ab7e fix KV cache padding, NaN from INFINITY (#6438) 2024-04-02 17:26:22 +02:00
Johannes Gäßler c63dfdf765 fix cmake build 2024-04-02 13:48:13 +03:00
Johannes Gäßler bb0d51accd fix excessive KQ_b loads 2024-04-02 13:48:13 +03:00
Johannes Gäßler e1ecd3b129 fix compile warnings 2024-04-02 13:48:13 +03:00
Johannes Gäßler 3f777acf06 Multiple parallel blocks for batch size 1 2024-04-02 13:48:13 +03:00
Johannes Gäßler 68d793bee8 no ncols == 64 2024-04-02 13:48:13 +03:00
Johannes Gäßler cca6d027a3 4 warps, 256 stride for all D 2024-04-02 13:48:13 +03:00
Johannes Gäßler 269374ed81 adjust kernel selection logic 2024-04-02 13:48:13 +03:00
Johannes Gäßler 81da919864 no vec for hs, no hs==256 ncols==32 for Volta 2024-04-02 13:48:13 +03:00
Johannes Gäßler d59ac670bf 16 cols for Phi-2 2024-04-02 13:48:13 +03:00
Johannes Gäßler 75aa7b4b18 CUDA: faster FlashAttention, kernel for bs == 1 2024-04-02 13:48:13 +03:00
Georgi Gerganov 6be02b5969 cuda : fix build 2024-03-27 10:31:52 +02:00
Georgi Gerganov 013721df2b Merge branch 'master' into gg/flash-attn 2024-03-27 10:24:09 +02:00