Commit Graph

7773 Commits

Author SHA1 Message Date
suhyun-hwang 0bffc5c960 refactor: remove redundant int() cast in sliding_window_pattern 2026-01-15 21:08:02 +09:00
suhyun-hwang bfc92e954b refactor: remove rope_parameters from VaetkiModel 2026-01-15 21:03:05 +09:00
suhyun-hwang 75323b3e08 refactor: simplify sliding window pattern handling in VaetkiModel 2026-01-15 21:03:05 +09:00
suhyun-hwang 5d0870207a refactor: clean up VaetkiModel class 2026-01-15 21:03:05 +09:00
suhyun-hwang ad04d34047 chore: remove some missed traces 2026-01-15 21:03:05 +09:00
suhyun-hwang 56c89a1216 add: VAETKI tokenizer implementation 2026-01-15 21:03:05 +09:00
suhyun-hwang ca85717886 revert: remove VAETKI tokenizer implementation 2026-01-15 21:03:05 +09:00
suhyun-hwang 487909ae0e refactor: simplify VaetkiModel set_gguf_parameters 2026-01-15 20:59:11 +09:00
suhyun-hwang ab233049dc refactor: remove redundant deepseek2 compatibility code 2026-01-15 20:59:11 +09:00
Xuan Son Nguyen d85a08830b use min_pixels/max_pixels from preproc config 2026-01-15 20:59:11 +09:00
Xuan Son Nguyen 89db71702b add min/max pixels gguf metadata 2026-01-15 20:59:11 +09:00
suhyun-hwang 8bbeab0616 fix: use tensor stride for fused QKV support in vaetki 2026-01-15 20:59:11 +09:00
suhyun-hwang c947c74a4c fix: restore QKV splitting for VAETKI (fused QKV not working) 2026-01-15 20:59:11 +09:00
suhyun-hwang 808642295b fix: use mm_ffn_down_w for VAETKI projector embedding size 2026-01-15 20:59:11 +09:00
suhyun-hwang c13747b93d refactor: remove manual QKV splitting (handled by build_vit) 2026-01-15 20:59:11 +09:00
suhyun-hwang 566128ffb7 refactor: use standard tensor naming for VAETKI projector 2026-01-15 20:59:11 +09:00
suhyun-hwang 8657eceda5 fix: use num_patches instead of n_pos for position array size 2026-01-15 20:59:11 +09:00
suhyun-hwang c9e44c7451 style: add whitespace around arithmetic operators 2026-01-15 20:59:11 +09:00
suhyun-hwang 025ce711b6 Add VaetkiVisionModel mmproj converter with Rice ViT support 2026-01-15 20:59:11 +09:00
suhyun-hwang 96294c6ad9 refactor: simplify partial RoPE with weight reordering 2026-01-15 20:59:11 +09:00
suhyun-hwang db84faff3a fix: correct VAETKI model type naming 2026-01-15 20:59:11 +09:00
suhyun-hwang 5d08f3e87b feat: VAETKI dynamic image size support 2026-01-15 20:59:11 +09:00
suhyun-hwang d61a3f817c refactor: use build_vit for VAETKI vision encoder 2026-01-15 20:59:11 +09:00
suhyun-hwang 9d531ea9d5 refactor: move class_pos_emb to VAETKI case 2026-01-15 20:59:11 +09:00
suhyun-hwang c5e9eac8c5 refactor: merge VAETKI positions case with QWEN2VL 2026-01-15 20:59:11 +09:00
suhyun-hwang d8e8b77c44 fix: add VAETKI pre-tokenizer hash 2026-01-15 20:59:11 +09:00
suhyun-hwang 4358557fe7 fix: sliding_window_pattern type error 2026-01-15 20:58:38 +09:00
suhyun-hwang b267aada03 mtmd : add VAETKI vision encoder support 2026-01-15 20:58:38 +09:00
suhyun-hwang 488cdee96f model : add VAETKI architecture support 2026-01-15 20:58:04 +09:00
shalinib-ibm 8cc0ba957b
ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837) 2026-01-15 17:31:18 +08:00
Xuan-Son Nguyen a7e6ddb8bd
lora: make sure model keep track of associated adapters (#18490)
* lora: make sure model keep track of associated adapters

* deprecate llama_adapter_lora_free

* minor : std::unordered_set over std::set

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-15 10:24:28 +01:00
Sigbjørn Skjæret 2a13180100
model-loader : support bool array sliding window pattern (#18850) 2026-01-15 10:12:46 +01:00
Adrien Gallouët ec997b4f2b
tests : download models only when running ctest (#18843)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-15 09:47:29 +01:00
Max Krasnyansky cff777f226
hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822)
* hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars

* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32

Factor out all hvx_copy functions into the hvx-copy.h header and reduce code duplication.
Update HTP ops infra to support OP_CPY

* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs

hex is basically all scalar/core platform stuff (L2, DMA, basic utils)
hvx is all hvx related utils, helpers, etc
htp is higher level stuff like Ops, etc

hvx-utils library got a nice round of cleanup and refactoring to reduce duplication

use hvx_vec_store_a where possible

* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h

Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.

* hexagon: factor out hvx-sqrt.h

* hexagon: minor update to hvx-utils.h

* hexagon: remove spurious log

* hexagon: factor out and optimize hvx_add/sub/mul

* hexagon: remove _opt variants of add/sub/mul as they are simply fully aligned versions

* hexagon: refactor reduction functions to hvx-reduce.h

Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic.
Updated `softmax-ops.c` to use the new functions.

* hexagon: refactor the rest of arithmetic functions to hvx-arith.h

Moved `hvx_sum_of_squares_f32`, `hvx_min_scalar_f32`, and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` to use `dst, src, ..., n` argument order. Updated call sites in `act-ops.c`.

Refactor Hexagon HVX arithmetic functions (min, clamp) to hvx-arith.h

Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.

* hexagon: refactor hvx_sum_of_squares_f32

- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.

* hexagon: use hvx_splat instead of memset

* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML

* hexagon: fix hvx_copy_f16_f32 on v75 and older

* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL

* scripts: update snapdragon/adb scripts to enable host param
2026-01-14 21:46:12 -08:00
Oliver Simons 36f0132464
CUDA: Factor out and re-use `block_reduce` function (#18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of the `__device__` function into the `__global__` function
allows for explicit smem reuse, as neither the compiler nor the CUDA runtime
seems to free it afterwards (`cudaFuncSetAttribute` fails when not accounting
for it once for each call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-15 10:44:54 +08:00
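
The commit above combines three ideas: a two-stage reduction, an `enum class` to select the reduction method, and a `static_assert` whose condition depends on a template parameter. A minimal CPU-side sketch of how they can fit together follows. It is an emulation in plain C++, not the CUDA code from the PR, and every name in it (`reduce_method`, `reduce_op`, `dependent_false`, `block_reduce_emulated`) is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical names throughout; this emulates the idea on the CPU.
constexpr std::size_t WARP_SIZE = 32;

// A scoped enum: values do not convert implicitly to int.
enum class reduce_method { SUM, MAX };

// Always false, but dependent on T, so the compiler can only evaluate it
// once the template is actually instantiated.
template <typename T>
struct dependent_false : std::false_type {};

// Primary template: reached only for a method without a specialization.
template <reduce_method method, typename T>
struct reduce_op {
    static_assert(dependent_false<T>::value, "unsupported reduce_method");
};

template <typename T>
struct reduce_op<reduce_method::SUM, T> {
    static T identity() { return T(0); }
    static T combine(T a, T b) { return a + b; }
};

template <typename T>
struct reduce_op<reduce_method::MAX, T> {
    static T identity() { return std::numeric_limits<T>::lowest(); }
    static T combine(T a, T b) { return std::max(a, b); }
};

template <reduce_method method, typename T>
T block_reduce_emulated(const std::vector<T> & values) {
    using op = reduce_op<method, T>;

    // Stage 1: reduce each group of WARP_SIZE values to one partial result.
    std::vector<T> partials;
    for (std::size_t start = 0; start < values.size(); start += WARP_SIZE) {
        const std::size_t end = std::min(start + WARP_SIZE, values.size());
        T acc = op::identity();
        for (std::size_t i = start; i < end; ++i) {
            acc = op::combine(acc, values[i]);
        }
        partials.push_back(acc);
    }

    // Stage 2: reduce the per-group partial results to the final value.
    T result = op::identity();
    for (const T & p : partials) {
        result = op::combine(result, p);
    }
    return result;
}
```

Calling `block_reduce_emulated<reduce_method::SUM>(v)` on a `std::vector<float>` sums it in two stages. Writing `static_assert(false, ...)` in the primary template instead could be rejected by some compilers while parsing the template, which is the kind of build error the commit message describes.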
Piotr Wilkin (ilintar) d98b548120
Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914)
* Extract common debugging functions; plug eval-callback and mtmd's MTMD_DEBUG_GRAPH with same functionality

* Move to common

* Remove unneeded header

* Unlink from common

* chore: update webui build output

* Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code.

* Revert change to webapp

* Post-merge adjust

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Apply code review changes

* Remove changes to server-context

* Remove mtmd.h include

* Remove utility functions from header

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Rename functions

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-01-14 20:29:35 +01:00
Junwon Hwang 8fb7175576
model : clean up and fix EXAONE-MoE configuration (#18840)
* Fix mismatch of EXAONE-MoE configuration

* ensure gating func is set, cleanup

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-14 19:38:21 +01:00
Adrien Gallouët 516a4ca9b5
refactor : remove libcurl, use OpenSSL when available (#18828) 2026-01-14 18:02:47 +01:00
Jeff Bolz 3e4bb29666
vulkan: Check maxStorageBufferRange in supports_op (#18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-14 10:59:05 +01:00
Aman Gupta 47f9612492
llama-model: fix unfortunate typo (#18832) 2026-01-14 17:55:15 +08:00
Daniel Bevenius 01cbdfd7eb
CUDA : fix typo in clang pragma comment [no ci] (#18830) 2026-01-14 10:31:49 +01:00
Ruben Ortlam 635ef78ec5
vulkan: work around Intel fp16 bug in mmq (#18814) 2026-01-14 09:41:23 +01:00
Perry Naseck 7d587e5544
ggml-metal: do not copy headers for embedded, use current binary dir for embedded (#18705) 2026-01-14 09:22:25 +02:00
Daniel Benjaminsson d34aa07193
mmap: add Haiku support by skipping RLIMIT_MEMLOCK check (#18819)
Haiku OS does not support RLIMIT_MEMLOCK, similar to visionOS/tvOS.
Skip the resource limit check on Haiku to allow mlock functionality
to work without compile errors.

Tested on Haiku with NVIDIA RTX 3080 Ti using Vulkan backend.
2026-01-14 09:11:05 +02:00
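
A minimal sketch of the kind of platform guard the commit above describes, assuming a POSIX system that provides `<sys/resource.h>`. The helper name `memlock_limit_known` is hypothetical and is not the function changed in the PR. It only illustrates skipping the `RLIMIT_MEMLOCK` query at compile time where the constant is unavailable.

```cpp
#include <sys/resource.h>

// Hypothetical helper: reports whether the soft RLIMIT_MEMLOCK value could be
// queried, and stores it in soft_limit when it could.
bool memlock_limit_known(rlim_t & soft_limit) {
#if defined(__HAIKU__)
    // Per the commit message, Haiku does not support RLIMIT_MEMLOCK, so the
    // check is skipped entirely rather than referencing the missing constant.
    (void) soft_limit;
    return false;
#else
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        return false; // query failed; the caller should not rely on a limit
    }
    soft_limit = lim.rlim_cur;
    return true;
#endif
}
```

A caller would treat a `false` return as "no limit information" and proceed without the resource limit check.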
Adrien Gallouët f709c7a33f
ci, tests : use cmake to download models and remove libcurl dependency (#18791)
* ci, tests : use cmake to download models and remove libcurl dependency
* llama_dl_model -> llama_download_model
* use EXPECTED_HASH for robust model downloading
* Move llama_download_model to cmake/common.cmake

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-14 07:46:27 +01:00
ddh0 6e36299b47
llama : print_info alignment fix (#18708)
* fix text spacing in print_info

* align all
2026-01-14 00:05:11 +01:00
Junwon Hwang 60591f01d4
model : add EXAONE MoE (#18543)
* Add EXAONE MoE implementations

Co-authored-by: Junwon Hwang <nuclear1221@gmail.com>

* Address PR feedback

* Address PR feedback

* [WIP] Add MTP for EXAONE-MoE

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

---------

Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai>
2026-01-13 23:28:38 +01:00
Georgi Gerganov e4832e3ae4
vocab : fix attribute overrides for harmony (#18806)
* vocab : fix attribute overrides for harmony

* cont : add warning log
2026-01-13 17:40:13 +02:00
Ruben Ortlam 960e5e3b46
llama-mmap: fix direct-io loading fallback EOF exception (#18801) 2026-01-13 15:57:07 +01:00
Daniel Bevenius 20ca2e12c4
model-conversion : remove -c 0 from model card template [no ci] (#18807)
This commit removes the `-c, --ctx-size N` option from the llama-server
command in the model card template for causal models.

The motivation for this is that -c 0 is the default and specifying it
is redundant.
2026-01-13 14:13:10 +01:00