Johannes Gäßler | ef9e1593f3 | flush softmax exp below threshold to 0 | 2024-04-18 13:15:32 +02:00
Johannes Gäßler | 6a3b84236d | fix flash_attn_vec_f16 race condition | 2024-04-18 13:15:32 +02:00
Johannes Gäßler | 34f93bbb39 | CUDA: refactor host code, dyn. par. blocks | 2024-04-18 13:15:32 +02:00
Johannes Gäßler | ee19a4ab7e | fix KV cache padding, NaN from INFINITY (#6438) | 2024-04-02 17:26:22 +02:00
Johannes Gäßler | c63dfdf765 | fix cmake build | 2024-04-02 13:48:13 +03:00
Johannes Gäßler | bb0d51accd | fix excessive KQ_b loads | 2024-04-02 13:48:13 +03:00
Johannes Gäßler | e1ecd3b129 | fix compile warnings | 2024-04-02 13:48:13 +03:00
Johannes Gäßler | 3f777acf06 | Multiple parallel blocks for batch size 1 | 2024-04-02 13:48:13 +03:00
Johannes Gäßler | 68d793bee8 | no ncols == 64 | 2024-04-02 13:48:13 +03:00
Johannes Gäßler | cca6d027a3 | 4 warps, 256 stride for all D | 2024-04-02 13:48:13 +03:00
Johannes Gäßler | 269374ed81 | adjust kernel selection logic | 2024-04-02 13:48:13 +03:00
Johannes Gäßler | 81da919864 | no vec for hs, no hs==256 ncols==32 for Volta | 2024-04-02 13:48:13 +03:00
Johannes Gäßler | d59ac670bf | 16 cols for Phi-2 | 2024-04-02 13:48:13 +03:00
Johannes Gäßler | 75aa7b4b18 | CUDA: faster FlashAttention, kernel for bs == 1 | 2024-04-02 13:48:13 +03:00
Georgi Gerganov | 6be02b5969 | cuda : fix build | 2024-03-27 10:31:52 +02:00
Georgi Gerganov | 013721df2b | Merge branch 'master' into gg/flash-attn | 2024-03-27 10:24:09 +02:00