* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization
- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask
* cuda: iq2xxs: simplify sum scaling
express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`
saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small
* uint -> uint32_t
error: identifier "uint" is undefined
This option was introduced as a workaround because cpp-httplib could not
build on visionOS. Since it has been fixed and now compiles on all platforms,
we can remove it and simplify many things.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
This commit addresses a build issue with the KleidiAI backend when
building multiple cpu backends. Commmit
3a00c98584 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable,
where the latter does handle this case (it is idempotent but
FetchContent_Populate is not).
I missed this during my review and I should not have commited without
verifying the CI failure, sorry about that.
* ggml-cpu: FA add GEMM microkernel
* add guard for sizeless vector types
* fix case where DV % GGML_F32_EPR !=0
* move memset out of the loop
* move another memset out of the loop
* use RM=4 for arm
* simd_gemm: convert everything to int
* convert everything to size_t to avoid warnings
* fixup
* add pragma for ignoring aggressive loop optimizations
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL
Fix for the bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.
The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.
* addressed code review comments
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Refactoring to use new llama_put_adapter_loras
* cont : alternative lora API
---------
Co-authored-by: Jake Chavis <jakechavis6@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml: added cleanups in ggml_quantize_free
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.
* mmap: Fix Windows handle lifetime
Move hMapping from local variable to member variable so it stays alive for the entire lifetime of the mapping.
The file mapping handle must remain valid until UnmapViewOfFile is called.
Fixes cleanup order in destructor.
* Update llama-mmap.cpp
* Update llama-mmap.cpp
Remove trailing whitespace from line 567
* ggml-hexagon: fa improvements
ggml-hexagon: optimize flash attention calculations with improved variable handling
ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32
ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements
ggml-hexagon: optimize flash attention by changing slope vector type to F16
* hexfa: fixed test-backend-ops failurs due to leftover element handling
* hexagon: refactor and optimize fa to use local context struct
* ggml-hexagon: optimize flash-attention using hvx_vec_expf
Use HVX for online softmax.
---------
Co-authored-by: chraac <chraac@gmail.com>
* common : remove legacy .json to .etag migration code
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : simplify common_download_file_single_online
This commit also force a redownload if the file exists
but has no .etag file.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* fix vulkan ggml_acc only works in 3d but not 4d
* removed clamp in test_acc_block
* use the correct stride and its test case
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check
* version without boundary check
* revert back to boundary check version
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Do not mutate cgraph for fused ADDs
1. We should try to minimize in-place changes to the incoming
ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
step as we store the properties before modifying the graph in-place
in the cuda-backend
* Assert ggml_tensor is trivially copyable
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Using the same conversion function ensures a consistent matching between
the regex pattern and the text.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Updated documentation
Model is no longer a parameter
* llama : fix trailing whitespace in comment
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* feat: Enable adding System Prompt per-chat
* fix: Save draft message in Chat Form when adding System Prompt from new chat view
* fix: Proper system message deletion logic
* chore: Formatting
* chore: update webui build output