* ggml-hexagon: fa improvements
ggml-hexagon: optimize flash attention calculations with improved variable handling
ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32
ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements
ggml-hexagon: optimize flash attention by changing slope vector type to F16
* hexfa: fixed test-backend-ops failures due to leftover element handling
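For reference, a hedged sketch of the leftover-element pattern these fixes address (illustrative names, shown with float for brevity; the real kernels operate on fp16 HVX vectors): the main loop covers full vector-width chunks, and the tail must be handled without stale lanes leaking into the result.
```
// Illustrative only, not the actual HVX kernel; VEC_W stands in for the
// number of elements per HVX vector.
#include <cstddef>

constexpr std::size_t VEC_W = 64; // e.g. one 128-byte HVX vector of fp16

static float dot_ref(const float * x, const float * y, std::size_t n) {
    float sum = 0.0f;
    std::size_t i = 0;
    // main loop: full vector-width chunks (a real kernel uses HVX intrinsics)
    for (; i + VEC_W <= n; i += VEC_W) {
        for (std::size_t j = 0; j < VEC_W; ++j) {
            sum += x[i + j] * y[i + j];
        }
    }
    // leftover elements: fewer than VEC_W remain; a vector kernel must mask
    // or zero the unused lanes here rather than read past n
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```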
* hexagon: refactor and optimize fa to use local context struct
* ggml-hexagon: optimize flash-attention using hvx_vec_expf
Use HVX for online softmax.
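For reference, a minimal scalar sketch of online softmax, the pattern the hvx_vec_expf path vectorizes: keep a running max and rescale the running denominator whenever the max grows, so the scores can be normalized in a single streaming pass.
```
#include <cmath>

static void online_softmax(const float * s, float * p, int n) {
    float m = -INFINITY; // running maximum
    float d = 0.0f;      // running denominator
    for (int i = 0; i < n; ++i) {
        const float m_new = s[i] > m ? s[i] : m;
        d = d * std::exp(m - m_new) + std::exp(s[i] - m_new); // rescale old sum
        m = m_new;
    }
    for (int i = 0; i < n; ++i) {
        p[i] = std::exp(s[i] - m) / d;
    }
}
```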
---------
Co-authored-by: chraac <chraac@gmail.com>
* common : remove legacy .json to .etag migration code
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : simplify common_download_file_single_online
This commit also forces a redownload if the file exists
but has no .etag file.
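A hedged sketch of the intended check (names are illustrative, not the actual common_download_file_single_online internals):
```
#include <filesystem>
#include <string>

static bool should_redownload(const std::string & path) {
    namespace fs = std::filesystem;
    // missing file -> download; file present but no .etag -> force redownload
    return !fs::exists(path) || !fs::exists(path + ".etag");
}
```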
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* fix vulkan ggml_acc, which only worked in 3D but not 4D
* removed clamp in test_acc_block
* use the correct stride and add a test case for it
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion, except keep the boundary check
* version without boundary check
* revert back to boundary check version
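For reference, a hedged scalar sketch of the intended ggml_acc semantics (not the Vulkan shader): src1 accumulates into a strided 4-D view of src0, so all four dimensions, the byte strides nb1..nb3, and the view offset must be honored rather than only the first three dimensions.
```
#include <cstddef>
#include <cstdint>

static void acc_f32_ref(const float * src0, const float * src1, float * dst,
                        const int64_t ne1[4],                  // src1 dims
                        std::size_t nb1, std::size_t nb2,      // view strides
                        std::size_t nb3, std::size_t offset) { // in bytes
    for (int64_t i3 = 0; i3 < ne1[3]; ++i3)
    for (int64_t i2 = 0; i2 < ne1[2]; ++i2)
    for (int64_t i1 = 0; i1 < ne1[1]; ++i1)
    for (int64_t i0 = 0; i0 < ne1[0]; ++i0) {
        const std::size_t off = offset + i3*nb3 + i2*nb2 + i1*nb1 + i0*sizeof(float);
        const int64_t     j   = ((i3*ne1[2] + i2)*ne1[1] + i1)*ne1[0] + i0;
        *(float *)((char *) dst + off) =
            *(const float *)((const char *) src0 + off) + src1[j];
    }
}
```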
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Do not mutate cgraph for fused ADDs
1. We should try to minimize in-place changes to the incoming
ggml_cgraph where possible (those should happen in graph_optimize).
2. Modifying the graph in-place leads to an additional, unnecessary
graph-capture step, since the CUDA backend stores the graph
properties before the in-place modification.
* Assert ggml_tensor is trivially copyable
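Roughly the assertion in question (a sketch; the exact placement in ggml may differ), since snapshotting node properties by memcpy is only sound while ggml_tensor stays trivially copyable:
```
#include <type_traits>
#include "ggml.h"

static_assert(std::is_trivially_copyable<ggml_tensor>::value,
              "ggml_tensor must remain trivially copyable");
```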
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Using the same conversion function ensures consistent matching between
the regex pattern and the text.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
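A hedged illustration of the idea (helper names are hypothetical, not the actual llama.cpp functions): run both the pattern and the text through one shared UTF-8 decoder, so a given byte sequence maps to the same code units on both sides of the match.
```
#include <regex>
#include <string>

// minimal UTF-8 decoder (no validation; assumes 32-bit wchar_t) used for
// BOTH the pattern and the text
static std::wstring decode_utf8(const std::string & s) {
    std::wstring out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char c = s[i];
        int n; wchar_t cp;
        if      (c < 0x80) { cp = c;        n = 1; }
        else if (c < 0xE0) { cp = c & 0x1F; n = 2; }
        else if (c < 0xF0) { cp = c & 0x0F; n = 3; }
        else               { cp = c & 0x07; n = 4; }
        for (int k = 1; k < n && i + k < s.size(); ++k) {
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        out.push_back(cp);
        i += n;
    }
    return out;
}

static bool matches(const std::string & pattern, const std::string & text) {
    const std::wregex  re  (decode_utf8(pattern)); // same conversion ...
    const std::wstring wtxt(decode_utf8(text));    // ... for both inputs
    return std::regex_search(wtxt, re);
}
```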
* Updated documentation
The model is no longer a parameter.
* llama : fix trailing whitespace in comment
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* feat: Enable adding System Prompt per-chat
* fix: Save draft message in Chat Form when adding System Prompt from new chat view
* fix: Proper system message deletion logic
* chore: Formatting
* chore: update webui build output
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 that causes compilation failures on devices without
native fp16 support (CDNA devices).
The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
2549 | using ARegsT = typename Impl::ARegsT;
```
Add a workaround that explicitly declares the types and casts when
compiling with HIP and ROCWMMA_FATTN [2]. Once this is actually
fixed upstream, version guards can be added so that the workaround
only applies where necessary.
Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
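For reference, a hedged sketch of the workaround's shape (not the exact patch; the fragment parameters are illustrative): declare the fragments with rocWMMA's own fp16 type instead of __half, and cast the __half buffers at the load/store boundary, sidestepping the ambiguous amdgcn_mfma<__half, ...> partial specialization.
```
#if defined(GGML_HIP_ROCWMMA_FATTN)
#include <rocwmma/rocwmma.hpp>

using fattn_half = rocwmma::float16_t; // _Float16; resolves unambiguously

__device__ void mma16x16x16(const __half * A, const __half * B, __half * C) {
    rocwmma::fragment<rocwmma::matrix_a,    16, 16, 16, fattn_half, rocwmma::row_major> a;
    rocwmma::fragment<rocwmma::matrix_b,    16, 16, 16, fattn_half, rocwmma::col_major> b;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, fattn_half>                     c;

    rocwmma::fill_fragment(c, static_cast<fattn_half>(0.0f));
    // __half and _Float16 share the same bit layout, so cast at the boundary
    rocwmma::load_matrix_sync(a, reinterpret_cast<const fattn_half *>(A), 16);
    rocwmma::load_matrix_sync(b, reinterpret_cast<const fattn_half *>(B), 16);
    rocwmma::mma_sync(c, a, b, c);
    rocwmma::store_matrix_sync(reinterpret_cast<fattn_half *>(C), c, 16,
                               rocwmma::mem_row_major);
}
#endif
```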