* lora: make sure the model keeps track of associated adapters
* deprecate llama_adapter_lora_free
* minor : std::unordered_set over std::set (a minimal sketch follows this commit)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
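A minimal sketch of the bookkeeping idea (type and member names here are assumptions for illustration, not the actual llama.cpp declarations): the model owns a set of its live adapters so that freeing the model can invalidate them too, and `std::unordered_set` fits because only membership and O(1) insert/erase matter, never iteration order.
```cpp
#include <unordered_set>

struct llama_adapter_lora;  // opaque here; the real definition lives in llama.cpp

// Hypothetical shape of the tracking: the model knows its associated adapters.
struct llama_model {
    std::unordered_set<llama_adapter_lora *> lora_adapters;
};

// On adapter creation:  model.lora_adapters.insert(adapter);
// On model free: any adapter still in the set is invalidated with the model,
// which is what lets a standalone llama_adapter_lora_free be deprecated.
```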
* hexagon: disable repack buffers if host buffers are disabled, improve handling of env vars
* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32
Factored out all hvx_copy functions into the hvx-copy.h header and reduced code duplication.
Update HTP ops infra to support OP_CPY
* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs
hex is basically all scalar/core platform stuff (L2, DMA, basic utils)
hvx is all HVX-related utils, helpers, etc.
htp is higher-level stuff like Ops, etc.
hvx-utils library got a nice round of cleanup and refactoring to reduce duplication
use hvx_vec_store_a where possible
* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h
Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.
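For illustration, a minimal sketch of the aligned/unaligned macro pattern (a scalar stand-in: the real hvx-sigmoid.h operates on HVX vector registers via Q6 intrinsics, and all names below are placeholders):
```cpp
#include <cmath>

typedef float vec_t;  // stand-in for an HVX vector register

static inline vec_t vec_load_a (const float * p)    { return *p; }  // aligned load
static inline vec_t vec_load_u (const float * p)    { return *p; }  // unaligned load
static inline void  vec_store_a(float * p, vec_t v) { *p = v; }     // aligned store
static inline void  vec_store_u(float * p, vec_t v) { *p = v; }     // unaligned store
static inline vec_t vec_sigmoid(vec_t x) { return 1.0f / (1.0f + std::exp(-x)); }

// One loop body, stamped out once per alignment combination; callers such as
// act-ops.c pick the fully aligned _aa variant when both pointers are aligned.
#define HVX_SIGMOID_F32(suffix, LOAD, STORE)                                       \
    static void hvx_sigmoid_f32_##suffix(float * dst, const float * src, int n) { \
        for (int i = 0; i < n; i++) {                                             \
            STORE(dst + i, vec_sigmoid(LOAD(src + i)));                           \
        }                                                                         \
    }

HVX_SIGMOID_F32(aa, vec_load_a, vec_store_a)  // src and dst aligned
HVX_SIGMOID_F32(uu, vec_load_u, vec_store_u)  // src and dst unaligned
```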
* hexagon: factor out hvx-sqrt.h
* hexagon: minor update to hvx-utils.h
* hexagon: remove spurious log
* hexagon: factor out and optimize hvx_add/sub/mul
* hexagon: remove _opt variants of add/sub/mul as they were simply the fully aligned versions
* hexagon: refactor reduction functions to hvx-reduce.h
Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic.
Updated `softmax-ops.c` to use the new functions.
* hexagon: refactor the rest of arithmetic functions to hvx-arith.h
Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use the `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested (it is refactored into `hvx-reduce.h` in the next commit).
* hexagon: refactor hvx_sum_of_squares_f32
- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op` (a minimal sketch of this pattern follows the list).
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.
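A minimal sketch of that returning-vs-storing finalize pattern (scalar stand-in; macro and function names are illustrative, not the actual hvx-reduce.h contents):
```cpp
#define ACC_ADD(acc, x)   ((acc) + (x))
#define ACC_SQADD(acc, x) ((acc) + (x) * (x))
#define FIN_RETURN(acc)   return (acc)
#define FIN_STORE(acc)    *dst = (acc)

// The loop body is shared; only the accumulate and finalize steps vary.
#define REDUCE_LOOP_BODY(ACC_OP, INIT, FINALIZE_OP) \
    float acc = (INIT);                             \
    for (int i = 0; i < n; i++) {                   \
        acc = ACC_OP(acc, src[i]);                  \
    }                                               \
    FINALIZE_OP(acc)

// Returning variant: finalize is a plain `return`.
static float reduce_sum_f32(const float * src, int n) {
    REDUCE_LOOP_BODY(ACC_ADD, 0.0f, FIN_RETURN);
}

// Storing variant: finalize writes through dst, matching the new
// `dst, src` signature of hvx_sum_of_squares_f32.
static void sum_of_squares_f32(float * dst, const float * src, int n) {
    REDUCE_LOOP_BODY(ACC_SQADD, 0.0f, FIN_STORE);
}
```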
* hexagon: use hvx_splat instead of memset
* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML
* hexagon: fix hvx_copy_f16_f32 on v75 and older
* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL
* scripts: update snapdragon/adb scripts to enable host param
* CUDA: Refactor and expose two_stage_warp_reduce_* functions
* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it
Moving smem out of the `__device__` function into the `__global__` function
allows for explicit smem reuse, as neither the compiler nor the CUDA runtime
seems to free it afterwards (`cudaFuncSetAttribute` fails when not accounting
for it once per call to two_stage_warp_reduce).
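For context, a minimal sketch of the two-stage shape (an assumption-laden illustration, not the real common.cuh helper, which is templated and later renamed to block_reduce in this log):
```cuda
#include <cuda_runtime.h>

// stage 1: butterfly reduction within one warp via shuffles, no smem needed
__device__ static float warp_reduce_sum(float x) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, offset);
    }
    return x;
}

// stage 2 uses shared memory that the *caller* owns: declaring it in the
// kernel instead of here lets the same bytes be reused across calls, and the
// launch only has to account for the allocation once. Assumes blockDim.x is a
// multiple of 32; callers must __syncthreads() between calls that reuse smem.
__device__ static float block_reduce_sum(float x, float * smem) {
    const int lane = threadIdx.x % 32;
    const int warp = threadIdx.x / 32;
    x = warp_reduce_sum(x);          // per-warp partial
    if (lane == 0) smem[warp] = x;   // publish one partial per warp
    __syncthreads();
    x = lane < (int) blockDim.x / 32 ? smem[lane] : 0.0f;
    return warp_reduce_sum(x);       // every thread ends up with the block sum
}

__global__ void sum_rows(const float * x, float * out) {
    extern __shared__ float smem[];  // owned by the kernel, sized at launch
    float v = x[blockIdx.x * blockDim.x + threadIdx.x];
    v = block_reduce_sum(v, smem);
    if (threadIdx.x == 0) out[blockIdx.x] = v;
}
```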
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Use two_stage_warp_reduce in group_norm_f32
* Use two_stage_warp_reduce in rms_norm_f32
* Fix smem calculation which expects bytes
* Make `two_stage_warp_reduce` accept all values warp_reduce accepts
Also integrate it into the norm_f32 function
* Use two_stage_warp_reduce in l2_norm_f32
* Use type traits for block reduction for better legibility
Also address other requests by @am17an, such as variable renaming
* Make norm tests cover all cuda paths
* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK
Unit tests passed locally; let's see if they pass in the CI as well
* Use `enum class` for `block_reduce_method`
This is more type-safe than plain enum
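A short illustration of the gain (the enumerator names are made up, not the actual common.cuh ones): scoped enumerators neither convert to `int` nor accept one implicitly, so passing the wrong kind of constant fails at compile time.
```cpp
enum class block_reduce_method { warp, full_block };  // illustrative enumerators

static void reduce_dispatch(block_reduce_method m) { (void) m; }

int main() {
    reduce_dispatch(block_reduce_method::warp);  // OK: scoped and type-checked
    // reduce_dispatch(0);                       // error: int does not convert
    // int i = block_reduce_method::warp;        // error: no implicit int conversion
}
```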
* Rename variables as suggested in code review by @am17an
* Rename two_stage_warp_reduce -> block_reduce
* Fix trailing whitespace in common.cuh
* Make condition of static_assert type-dependent
This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:
https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
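The usual fix looks like this (a generic illustration, not the exact common.cuh change): route the condition through a dependent variable template so the assert cannot be evaluated until the offending instantiation actually exists.
```cpp
#include <type_traits>

template <typename> inline constexpr bool dependent_false_v = false;

template <typename T>
T block_reduce(T v) {  // illustrative signature
    if constexpr (std::is_same_v<T, float>) {
        return v;      // supported path
    } else {
        // A plain static_assert(false, ...) here may be rejected while merely
        // parsing the template; the dependent condition delays evaluation
        // until an unsupported T is actually instantiated.
        static_assert(dependent_false_v<T>, "block_reduce: unsupported type");
    }
}
```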
* Inline definitions
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Haiku OS does not support RLIMIT_MEMLOCK, similar to visionOS/tvOS.
Skip the resource limit check on Haiku to allow mlock functionality
to work without compile errors.
Tested on Haiku with NVIDIA RTX 3080 Ti using Vulkan backend.
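A sketch of the shape of the guard (the helper below is assumed for illustration; the real check lives in llama.cpp's mlock wrapper next to the existing visionOS/tvOS special case):
```cpp
#include <cstddef>
#if !defined(__HAIKU__)
#include <sys/resource.h>
#endif

// Can an mlock of `want` bytes be expected to fit under the resource limits?
static bool mlock_limit_ok(size_t want) {
#if defined(__HAIKU__)
    (void) want;
    return true;  // Haiku has no RLIMIT_MEMLOCK: skip the check, just try mlock()
#else
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        return true;  // could not query the limit: attempt the lock anyway
    }
    return lim.rlim_cur == RLIM_INFINITY || want <= (size_t) lim.rlim_cur;
#endif
}
```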
* ci, tests : use cmake to download models and remove libcurl dependency
* llama_dl_model -> llama_download_model
* use EXPECTED_HASH for robust model downloading
* Move llama_download_model to cmake/common.cmake
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
This commit removes the `-c, --ctx-size N` option from the llama-server
command in the model card template for causal models.
The motivation for this is that -c 0 is the default and specifying it
is redundant.
* fix: Remove unnecessary `h` loops where `h` was only ever 0
Branch: CleanUpT5InputBuilders
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary padding loop that is never hit anymore
The upper bound used to be GGML_PAD(n_tokens, GGML_KQ_MASK_PAD), but the
padding was removed in https://github.com/ggml-org/llama.cpp/pull/17910,
leaving the loop dead.
Branch: CleanUpT5InputBuilders
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* server : add arg for disabling prompt caching
Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.
* address review comments
* address review comments
This commit adds the --kv-unified flag to the batched example. The README.md
specifies this flag as required, but it was not previously available as a
command-line option for the batched example.
The motivation for this is that passing the flag as the README instructs
leads to an error about the flag not being recognized, while omitting it
makes the example fail with the following error:
```console
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.
This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2); for mul_mm the limit is likely around
2^32 * LOAD_VEC_A elements.
- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.
  64-bit indexing has some cost (around 3-5% in MoE models), so it's worth the
  effort to avoid enabling it unconditionally.
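Worked through for the failing shape above (QK8_0 = 32 elements and 34 bytes per Q8_0 block are ggml constants; the conclusion matches the fix: element indices need 64 bits, block indices do not, and the byte size is what exceeds maxStorageBufferRange):
```cpp
#include <cstdint>

constexpr uint64_t ne0 = 8192, ne1 = 5120, ne2 = 128;  // A matrix: Q8_0 8192 x 5120 x 128
constexpr uint64_t QK8_0  = 32;                        // elements per Q8_0 block
constexpr uint64_t elems  = ne0 * ne1 * ne2;           // 5'368'709'120
constexpr uint64_t blocks = elems / QK8_0;             // 167'772'160
constexpr uint64_t bytes  = blocks * 34;               // ~5.7 GB > maxStorageBufferRange (at most 4 GiB)

static_assert(elems  >  (uint64_t(1) << 32), "element indexing overflows 32 bits");
static_assert(blocks <= (uint64_t(1) << 32), "block indexing still fits in 32 bits");
```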
* vulkan: Enable and optimize large matmul parameter combination for AMD
* limit tuning to AMD GPUs with coopmat support
* use tx_m values instead of _l
* debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check
This commit updates the pooling check in the debug example to
also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just
LLAMA_POOLING_TYPE_NONE.
* debug : normalize both pooled and token embeddings
This commit updates debug.cpp to normalize embeddings for both pooled
and non-pooled outputs. For pooled embeddings, normalization is applied
to the single vector, and for non-pooled embeddings, normalization is
applied to each token embedding vector individually.
The motivation for this is to enable non-pooled embeddings to be
normalized which was not possible previously.
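A minimal sketch of the per-vector step (an assumed helper, not the actual debug.cpp code); pooled output runs it once on the single vector, non-pooled output once per token row:
```cpp
#include <cmath>
#include <cstddef>

// L2-normalize one embedding vector of length n, in place.
static void normalize_l2(float * v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += (double) v[i] * v[i];
    }
    const double norm = std::sqrt(sum);
    if (norm > 0.0) {
        for (size_t i = 0; i < n; i++) {
            v[i] = (float) (v[i] / norm);
        }
    }
}
```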
* qwen3next: simplify qkvz projection
* use ggml_swiglu_split
* revert swiglu_split, but remove redundant repeat()
* fix missing reshape
* rm 2 redundant transposes
* move mul_mat(k,q) to outside of chunking
* rm redundant cont
* improve g_cs_chunk
* add comments about no cont
* use std::pair instead of ggml_concat
* vectorize key_gdiff calculation
* rm unused tensor
* avoid ggml_concat inside loop
* bring back ggml_concat, as the alternative may not work on other backends
* nits