1518 Commits

Author SHA1 Message Date
xiaofei a0f7016d17
rpc : fix cache directory initialization (#13188)
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
2025-04-30 09:29:22 +03:00
matteo e2e1ddb93a
server : Prefilling assistant message in openai compatible API (#13174)
* Prefilling assistant message in openai compatible API

* fixed indentation

* fixed code convention

* simplify method usage

* no more than one assistant message at end of messages

* merge checks into prefill code

* Update examples/server/utils.hpp

---------

Co-authored-by: matteo <matteo@naspc.lan>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-04-29 20:33:10 +02:00
Alberto Cabrera Pérez 5a63980117
llama-bench: fixed size of fields to correctly map to values (#13183) 2025-04-29 17:24:36 +02:00
Xuan-Son Nguyen 00e3e5a194
mtmd : add qwen2vl and qwen2.5vl (#13141)
* llava : add clip_n_output_tokens, deprecate clip_n_patches

* mtmd : add qwen2vl and qwen2.5vl

* decode_embd_batch::set_position_...

* working version

* deprecate llama-qwen2vl-cli

* correct order W, H of clip_embd_nbytes_by_img

* edit existing line in hot topics
2025-04-29 11:47:04 +02:00
Xuan-Son Nguyen eaea325324
clip : fix model size display (#13153) 2025-04-28 21:23:19 +02:00
Vishal Agarwal 1831f538f7
llama-bench: add `-d` depth arg (#13096)
* add depth param

* update llama-bench README and add depth param

* llama-bench: default params for depth arg for faster execution

* Update examples/llama-bench/README.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix buffer print ub

* use user provided args

* remove extra whitespaces

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-04-28 16:50:39 +02:00
Xuan-Son Nguyen 4e87962e34
mtmd : fix glm-edge redundant token count (#13139)
* mtmd : fix glm-edge redundant token count

* fix chat template

* temporarily disable GLMEdge test chat tmpl
2025-04-28 16:12:56 +02:00
Xuan-Son Nguyen d2b2031e5f
llama : (mrope) allow using normal 1D position for text token (#13138)
* llama : (mrope) use normal position for text token

* rm n_pos_per_embd from llm_graph_input_attn_temp
2025-04-28 14:20:56 +02:00
Xuan-Son Nguyen 5fa9e63be8
clip : refactor set input for cgraph + fix qwen2.5vl input (#13136)
* clip : refactor set input for cgraph

* more strict assert

* minicpmv : use clip_n_mmproj_embd instead of copying the same code everywhere

* split qwen2 and qwen2.5 code blocks

* minor style fix
2025-04-28 12:18:59 +02:00
4onen c0a97b762e
llama-bench : Add `--override-tensors` arg (#12922)
* Add --override-tensors option to llama-bench

* Correct llama-bench --override-tensors to --override-tensor

* llama-bench: Update --override-tensors parsing to match --tensor-split, appear in test matrix.

* Make new llama-bench util functions static to fix Ubuntu CI

* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)
2025-04-27 23:48:26 +02:00
LostRuins Concedo 59e991c23c
Fixes Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402 as has_qwen2vl_merger migration was incomplete (#13133) 2025-04-27 12:43:37 +02:00
HimariO ca2bb89eac
clip : Add Qwen2.5VL support (#12402)
* implement vision model architecture, gguf converter

* handle window attention inputs

* add debug utils

* fix a few incorrect tensor memory layouts

* move position id remap out of ggml to avoid int32 cuda operations

* cleaning up

* ignore transformers Qwen2_5_xxx type check

* remove rarely used `qwen2vl-cli` debug functions

* remove commented-out code blocks

* fix attn weight scaling after rebase

* add `PROJECTOR_TYPE_QWEN2_5_VL`

* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`

* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`

* remove `attn_window_size` from gguf

* fix model conversion

* clean up

* fix merging problem

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-27 10:10:34 +02:00
Xuan-Son Nguyen 4753791e70
clip : improve projector naming (#13118)
* clip : improve projector naming

* no more kv has_llava_projector

* rm unused kv

* rm more unused
2025-04-26 22:39:47 +02:00
frob d5fe4e81bd
grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-04-26 10:10:20 +02:00
Xuan-Son Nguyen edb18b6e8f
clip : fix pixtral on some GPU backends (#13097)
* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO
2025-04-25 14:31:42 +02:00
Xuan-Son Nguyen 13be08daf9
clip : remove boi/eoi embeddings for GLM-edge model (#13081) 2025-04-24 22:17:04 +02:00
Georgi Gerganov 226251ed56
embeddings : fix batch sizes (#13076)
ggml-ci
2025-04-24 22:29:22 +03:00
Georgi Gerganov 13b4548877
cmake : do not include ./src as public for libllama (#13062)
* cmake : do not include ./src as public for libllama

ggml-ci

* cmake : rework tests

ggml-ci

* llguidance : remove unicode include

ggml-ci

* cmake : make c++17 private

ggml-ci
2025-04-24 16:00:10 +03:00
Xuan-Son Nguyen 7c727fbe39
arg : add --no-mmproj-offload (#13093)
* arg : add --no-mmproj-offload

* Update common/arg.cpp
2025-04-24 14:04:14 +02:00
Xuan-Son Nguyen 80982e815e
arg : clean up handling --mmproj with -hf (#13082)
* arg : clean up handling --mmproj with -hf

* rm change about no_mmproj

* Revert "rm change about no_mmproj"

This reverts commit 2cac8e0efb.

* handle no_mmproj explicitly

* skip download mmproj on examples not using it
2025-04-24 12:14:13 +02:00
pl752 5630406959
llama-mtmd-cli: Sigint rework in mtmd vision example (#13080)
* Sigint rework in mtmd vision example

* Applied suggestions on mtmd-cli PR

* Forgot to invert one of the conditions

* Update examples/llava/mtmd-cli.cpp

* Removed redundant exit check

---------

Co-authored-by: pl752 <maximpl752@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-04-23 23:32:35 +02:00
Xuan-Son Nguyen ecda2ec4b3
mtmd : Support Pixtral 12B (#13065)
* add pixtral text model (vision is wip)

* cgraph ok, just missing 2D RoPE

* fix bad rebase

* first working version

* fix problem with img_break token

* support dynamic image size

* update docs

* update test script
2025-04-23 20:21:59 +02:00
Radoslav Gerganov 2cca6c01e4
rpc : add command line option for number of threads for the CPU backend (#13060)
closes #13051
2025-04-23 10:32:49 +03:00
Xuan-Son Nguyen dc39a5e7a8
mtmd : support SmolVLM (version 1 and 2) (#13050)
* mtmd : support SmolVLM (version 1 and 2)

* correct chat template

* fix n_patches

* scale_factor is an int

* add more models to test
2025-04-22 16:24:54 +02:00
Xuan-Son Nguyen 243453533e
llava : update documentations (#13055)
* llava : update documentations

* fix typo
2025-04-22 10:37:00 +02:00
Xuan-Son Nguyen 84a9bf2fc2
mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012)
* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-21 15:32:58 +02:00
Xuan-Son Nguyen 2016f07bd1
convert : experimental support for `--mmproj` flag (#13023)
* convert : experimental support for `--mmproj` flag

* fix bad ctrl+f replace

* fix style

* split into subclasses TextModel and VisionModel

* rename Model --> ModelBase

* small fix

* correct CLIP_VISION arch name (because existing GGUF already use it)

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* fix Mistral3Model

* fix typo

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2025-04-20 23:29:36 +02:00
Jeffrey Morgan 6602304814
llava: fix errors in clip.h on certain compilers (#13030) 2025-04-20 12:15:41 +02:00
Xuan-Son Nguyen 37b9f0d29d
clip : refactor, add `image_manipulation` and `llava_uhd` classes (#13011)
* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include
2025-04-19 09:15:45 +02:00
Daniel Tang 6408210082
main : Fix Ctrl+D/newline handling (#12951)
This restores the behavior from #491. This does not affect Ctrl+D's ability to
terminate --multiline-input lines (#1040).

This also actually implements #587: "If the user wants the text to end in a
newline, this should be accomplished by explicitly adding a newline by using
\ followed by return, then returning control by pressing return again."

Fixes #12949
2025-04-18 22:02:55 +02:00
Xuan-Son Nguyen 35370ba945
server : use std::move whenever possible (#12936)
* server : use std::move whenever possible

* use r-value ref

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* make task creation scoped

* restore std::move

* fix task_id not set correctly

* apply changes from suggestion

Co-authored-by: ggerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-18 19:58:12 +02:00
Xuan-Son Nguyen b9154ecff9
mtmd : add methods to access `mtmd_image_tokens` (#12906)
* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* fix prompt_modified

* rm redundant data member
2025-04-18 10:04:51 +02:00
Radoslav Gerganov 2db9ba1464
rpc : add RPC_CMD_HELLO (#12955)
Add RPC_CMD_HELLO for getting the version of the protocol implemented by
the server. Follow the semantic versioning rules at https://semver.org

Hopefully this brings a better user experience when we make breaking
changes at the protocol level and avoids issues like #12465
2025-04-18 10:13:42 +03:00
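Illustration for #12955 (not part of the commit): a minimal sketch of how a client might use a version reported by the server, assuming the semantic versioning rule that a change in the major number signals a breaking change. The `proto_version` struct, its field names, and `proto_compatible` are hypothetical and are not taken from the rpc sources.

```cpp
#include <cstdint>

// Hypothetical protocol version triple (names are illustrative only).
struct proto_version {
    std::uint8_t ver_major;
    std::uint8_t ver_minor;
    std::uint8_t ver_patch;
};

// Under semantic versioning, a different major number means a breaking
// change, so this sketch accepts a peer only when the major numbers match.
// A stricter policy could also compare minor versions.
constexpr bool proto_compatible(const proto_version & client, const proto_version & server) {
    return client.ver_major == server.ver_major;
}

// Same major version, newer minor/patch on the server: still compatible.
static_assert(proto_compatible(proto_version{1, 0, 0}, proto_version{1, 2, 3}), "same major");
```

A client could call such a check right after the handshake and refuse to continue on a mismatch instead of failing later with a harder-to-diagnose error.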
Russyyds d6d2c2ab8c
Add performance print for gemma3 in example (#12929) 2025-04-14 19:18:20 +02:00
Neo Zhang Jianyu 81c7e64fc2
disable curl lib check, this action was missed by commit bd3f59f812 (#12761) (#12937) 2025-04-14 18:19:07 +08:00
Ed Addario 71e90e8813
quantize: Handle user-defined quantization levels for additional tensors (#12511)
* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Minor refactoring as per the contributors' coding guidelines

* Update descriptions to match existing style

* Minor refactoring as per the contributors' guidelines

* Implement general --tensor-type instead of tensor-specific command option

* Fix implied type bug

* Restore missing #includes

* Add regex capability for tensor selection

* Refactor function name and update ALLOWED_TENSOR_TYPE

* Add missing #include

* Handle edge case when tensor name is cls.output

* Minor logging improvement
2025-04-13 21:29:28 +03:00
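Illustration for #12511 (not part of the commit): the message mentions a general `--tensor-type` option with regex-based tensor selection. The sketch below shows one way such a selection could work, assuming each override is a regex over tensor names paired with a type name and that the first match wins. The `tensor_override` alias, `pick_tensor_type`, and the first-match policy are assumptions of this sketch, not the quantize tool's actual implementation.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical override entry: {regex over tensor names, type name to apply}.
using tensor_override = std::pair<std::string, std::string>;

// Return the type of the first override whose pattern matches `tensor_name`,
// or `fallback` when no pattern matches.
inline std::string pick_tensor_type(const std::vector<tensor_override> & overrides,
                                    const std::string & tensor_name,
                                    const std::string & fallback) {
    assert(!tensor_name.empty());
    for (const auto & ov : overrides) {
        // regex_search matches anywhere in the name, so "attn_v" selects
        // "blk.0.attn_v.weight" without needing a full-name pattern.
        if (std::regex_search(tensor_name, std::regex(ov.first))) {
            return ov.second;
        }
    }
    return fallback;
}
```

With overrides `{"attn_v", "q8_0"}` and `{"ffn_.*", "q4_k"}`, a tensor named `blk.0.attn_v.weight` would get the first type, while `output.weight` would fall through to the fallback.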
Prajwal B Mehendarkar bc091a4dc5
common : Define cache directory on AIX (#12915) 2025-04-12 17:33:39 +02:00
Matt Clayton e59ea539b8
llava: Fix cpu-only clip image encoding segfault (#12907)
* llava: Fix cpu-only clip image encoding

* clip : no smart ptr for ggml_backend_t

* Fix for backend_ptr push_back

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-12 07:29:03 +02:00
Georgi Gerganov c94085df28
server : add VSCode's Github Copilot Chat support (#12896)
* server : add VSCode's Github Copilot Chat support

* cont : update handler name
2025-04-11 23:37:41 +03:00
yuri@FreeBSD e8a62631b3
rpc : Set cache directory in rpc-server.cpp on FreeBSD (#12903) 2025-04-11 22:04:14 +02:00
tastelikefeet b2034c2b55
contrib: support modelscope community (#12664)
* support download from modelscope

* support login

* remove comments

* add arguments

* fix code

* fix win32

* test passed

* fix readme

* revert readme

* change to MODEL_ENDPOINT

* revert tail line

* fix readme

* refactor model endpoint

* remove blank line

* fix header

* fix as comments

* update comment

* update readme

---------

Co-authored-by: tastelikefeet <yuze.zyz@alibaba-inc/com>
2025-04-11 14:01:56 +02:00
Xuan-Son Nguyen 0c50923944
clip : use smart pointer (⚠️ breaking change) (#12869)
* clip : use smart pointers

* fix warmup

* add forward declaration

* missing include

* fix include (2)

* composite

* simplify batch ptr

* fix conflict
2025-04-11 12:09:39 +02:00
Xuan-Son Nguyen 8b9cc7cdd8
llava : introduce libmtmd (#12849)
* wip llava2

* migrated gemma3 to llava2

* add timings

* correct pre/postfix

* fix missing include

* fix compilation unused var warn

* update llava2_tokenize

* change name llava2 --> mtmd

* improve api

* refine helpers

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-10 22:57:16 +02:00
Plamen Minev 381603a775
ci: detach common from the library (#12827)
* fix: detach common from the library

* fix: building chat test template
2025-04-09 10:11:11 +02:00
Xuan-Son Nguyen 65a69e6e1b
clip : do not print ftype (#12832) 2025-04-09 10:09:53 +02:00
Matt Clayton b32efad2bc
llava: improve clip_ctx destructor to not memleak load_image_size (#12834) 2025-04-08 22:01:58 +02:00
Georgi Gerganov a19b5cef16
llama : fix FA when KV cache is not used (i.e. embeddings) (#12825)
* ggml : FA supports F32 V

* graph : cast KV to F16 when the KV cache is not used

ggml-ci

* server : add test that exercises embeddings with FA enabled

ggml-ci
2025-04-08 19:54:51 +03:00
Xuan-Son Nguyen 78a1ba0a4f
server : fix thread.join() on exit (#12831) 2025-04-08 18:37:06 +02:00
dm4 2dabf759e7
llava: add more helper functions to check projector types in clip context (#12824)
Signed-off-by: dm4 <sunrisedm4@gmail.com>
2025-04-08 15:49:13 +02:00
characharm 8ca6e1c3a4
server : webui : Improve Chat Input with Auto-Sizing Textarea (#12785)
* Update ChatScreen.tsx

* useAutosizeTextarea.ts

useAutosizeTextarea to encapsulate the logic.

* Implement responsive auto-sizing chat textarea

Replaces the manual textarea resizing with an automatic height adjustment based on content.

- `useChatTextarea` hook to manage textarea state and auto-sizing logic via refs, preserving the optimization
- Textarea now grows vertically up to a maximum height (`lg:max-h-48`) on large screens (lg breakpoint and up).
- Disables auto-sizing and enables manual vertical resizing (`resize-vertical`) on smaller screens for better mobile usability.
- Aligns the "Send" button to the bottom of the textarea (`items-end`) for consistent positioning during resize.

* -update compressed index.html.gz after npm run build
-refactor: replace OptimizedTextareaValue with AutosizeTextareaApi in VSCode context hook

* chore: normalize line endings to LF
refactor: AutosizeTextareaApi -> chatTextareaApi

* refactor: Rename interface to PascalCase

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-08 11:14:59 +02:00
stduhpf 4ccea213bc
hellaswag: display estimated score confidence interval (#12797) 2025-04-07 18:47:08 +03:00
Xuan-Son Nguyen bd3f59f812
cmake : enable curl by default (#12761)
* cmake : enable curl by default

* no curl if no examples

* fix build

* fix build-linux-cross

* add windows-setup-curl

* fix

* shell

* fix path

* fix windows-latest-cmake*

* run: include_directories

* LLAMA_RUN_EXTRA_LIBS

* sycl: no llama_curl

* no test-arg-parser on windows

* clarification

* try riscv64 / arm64

* windows: include libcurl inside release binary

* add msg

* fix mac / ios / android build

* will this fix xcode?

* try clearing the cache

* add bunch of licenses

* revert clear cache

* fix xcode

* fix xcode (2)

* fix typo
2025-04-07 13:35:19 +02:00
Sergey Fedorov f1e3eb4249
common : fix includes in arg.cpp and gemma3-cli.cpp (#12766)
* arg.cpp: add a missing include

* gemma3-cli.cpp: fix cinttypes include
2025-04-05 17:46:00 +02:00
Xuan-Son Nguyen 0364178ca2
clip : refactor clip_init, add tests (#12757)
* refactor clip_init

* fix loading file

* fix style

* test ok

* better test with report

* add missing headers

* clarify

* add KEY_MM_PATCH_MERGE_TYPE

* remove bool has_* pattern

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* use ggml_soft_max_ext

* refactor logging system

* add minicpm-v-o 2.6 for testing

* use nullptr everywhere

* fix Yi-VL model

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-05 17:17:40 +02:00
Nauful Shaikh b772394297
server : webui : Upgrade daisyui, tailwindcss. (#12735)
* Upgrade daisyui, tailwindcss.

* Switch to all themes.

* Revert a change.

* Update formatting.

* Install packages before npm build.

* Revert "Install packages before npm build."

This reverts commit 336c5147e6.

* Add index.html.gz

* run build

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-04 16:09:52 +02:00
nick huang 23106f94ea
gguf-split : --merge now respects --dry-run option (#12681)
* gguf-split now respects dry-run option

* removing trailing space
2025-04-04 16:09:12 +02:00
Georgi Gerganov a10b36c91a
llama : refactor kv cache guard (#12695)
* llama : refactor kv cache guard

ggml-ci

* cont : fix comment [no ci]

* llama : fix kv_cache restore logic

ggml-ci

* context : simplify kv cache updates

ggml-ci

* cont : better name [no ci]

* llama : fix llama_decode return code when could not find KV slot

ggml-ci

* context : change log err -> warn [no ci]

* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
Xuan-Son Nguyen 42eb248f46
common : remove json.hpp from common.cpp (#12697)
* common : remove json.hpp from common.cpp

* fix comment
2025-04-02 09:58:34 +02:00
Xuan-Son Nguyen 267c1399f1
common : refactor downloading system, handle mmproj with -hf option (#12694)
* (wip) refactor downloading system [no ci]

* fix all examples

* fix mmproj with -hf

* gemma3: update readme

* only handle mmproj in llava example

* fix multi-shard download

* windows: fix problem with std::min and std::max

* fix 2
2025-04-01 23:44:05 +02:00
Sigbjørn Skjæret 1a85949067
llava : proper description fix (#12668) 2025-03-31 11:28:30 +02:00
Sigbjørn Skjæret f52d59d771
llava : fix clip loading GGUFs with missing description (#12660) 2025-03-31 11:07:07 +02:00
marcoStocchi 52de2e5949
tts : remove printfs (#12640)
* tts.cpp : llama tokens console output is done using LOG_INF instead of printf(). Therefore the options '--log-disable' and '--log-file' now have a uniform effect on all output.
2025-03-31 11:20:30 +03:00
Benson Wong 5d01670266
server : include speculative decoding stats when timings_per_token is enabled (#12603)
* Include speculative decoding stats when timings_per_token is true

New fields added to the `timings` object:

  - draft_n           : number of draft tokens generated
  - draft_accepted_n  : number of draft tokens accepted
  - draft_accept_ratio: ratio of accepted/generated

* Remove redundant draft_accept_ratio var

* add draft acceptance rate to server console output
2025-03-28 10:05:44 +02:00
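Illustration for #12603 (not part of the commit): the acceptance ratio described in the message is the number of accepted draft tokens divided by the number generated. The sketch below computes it from the two counters named in the commit message; the `draft_stats` struct and the choice to return 0.0 when nothing was drafted are assumptions of this sketch.

```cpp
// Counters named after the `timings` fields listed in the commit message.
struct draft_stats {
    int draft_n;           // number of draft tokens generated
    int draft_accepted_n;  // number of draft tokens accepted
};

// Acceptance ratio = accepted / generated. Returns 0.0 when no draft tokens
// were generated (a convention chosen for this sketch to avoid dividing by zero).
constexpr double draft_accept_ratio(const draft_stats & s) {
    return s.draft_n == 0 ? 0.0 : static_cast<double>(s.draft_accepted_n) / s.draft_n;
}

// 6 of 8 draft tokens accepted gives a ratio of 0.75.
static_assert(draft_accept_ratio(draft_stats{8, 6}) == 0.75, "6 of 8 accepted");
```

Because the ratio can be derived from the two counters, a consumer of the `timings` object can compute it on its own side when only the counters are available.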
Radoslav Gerganov ef03229ff4
rpc : update README for cache usage (#12620) 2025-03-28 09:44:13 +02:00
Radoslav Gerganov ab6ab8f809
rpc : send hash when tensor data is above some fixed threshold (#12496)
* rpc : send hash when tensor data is above some fixed threshold

ref #10095

* rpc : put cache under $HOME/.cache/llama.cpp

* try to fix win32 build

* another try to fix win32 build

* remove llama as dependency
2025-03-28 08:18:04 +02:00
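Illustration for #12496 (not part of the commit): the idea in the title is that tensor data above a fixed size is identified by a hash first, so the receiving side can reuse a locally cached copy instead of receiving the bytes again. The sketch below shows that decision under stated assumptions: the FNV-1a hash, the 10 MiB cutoff, and all names (`fnv1a_64`, `HASH_THRESHOLD_BYTES`, `choose_transfer`) are hypothetical, since the commit message gives neither the hash function nor the threshold value.

```cpp
#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a over a byte buffer (hash choice is an assumption of this sketch).
constexpr std::uint64_t fnv1a_64(const char * data, std::size_t len) {
    std::uint64_t h = 0xcbf29ce484222325ULL;      // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 0x100000001b3ULL;                    // FNV prime
    }
    return h;
}

// Hypothetical cutoff; the real value is not stated in the commit message.
constexpr std::size_t HASH_THRESHOLD_BYTES = 10u * 1024u * 1024u;

enum class transfer_kind { full_data, hash_only };

// Tensors strictly above the threshold are announced by hash so the server
// can check its cache; smaller ones are sent directly.
constexpr transfer_kind choose_transfer(std::size_t n_bytes) {
    return n_bytes > HASH_THRESHOLD_BYTES ? transfer_kind::hash_only
                                          : transfer_kind::full_data;
}
```

On a cache miss the sender would still have to fall back to sending the full data, so the hash path only saves bandwidth for tensors the server has already seen.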
Piotr 2099a9d5db
server : Support listening on a unix socket (#12613)
* server : Bump cpp-httplib to include AF_UNIX windows support

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

* server : Allow running the server example on a unix socket

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

---------

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-03-27 23:41:04 +01:00
Ivy233 02082f1519
clip: Fix llama-llava-clip-quantize-cli quantization error under CUDA backend (#12566)
* [Fix] Compiling clip-quantize-cli and running it in a CUDA environment causes ggml_fp16_to_fp32 to report an error when it tries to access video memory, so quantization has to run on the CPU backend.
After the fix, it automatically runs on the CPU backend and is no longer bound to CUDA.

* [Fix] Roll back the signature and implementation of clip_model_load, and change the call in clip_model_quantize to clip_init.
2025-03-26 15:06:04 +01:00
Eric Curtin ef19c71769
run: de-duplicate fmt and format functions and optimize (#11596) 2025-03-25 18:46:11 +01:00
Marius Gerdes 77f9c6bbe5
server : Add verbose output to OAI compatible chat endpoint. (#12246)
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
2025-03-23 19:30:26 +01:00
marcoStocchi ea1518e839
llama-tts : avoid crashes related to bad model file paths (#12482) 2025-03-21 11:12:45 +02:00
Woof Dog e04643063b
webui : Prevent rerendering on textarea input (#12299)
* webui: Make textarea uncontrolled to eliminate devastating lag

* Update index.html.gz

* use signal-style implementation

* rm console log

* no duplicated savedInitValue set

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-03-20 15:57:43 +01:00
Georgi Gerganov c6af2161b2
speculative : fix seg fault in certain cases (#12454) 2025-03-18 19:35:11 +02:00
Georgi Gerganov 810e0af3f5
server : fix warmup draft cache type (#12446)
ggml-ci
2025-03-18 12:05:42 +02:00
Sigbjørn Skjæret 60c902926c
docs : bring llama-cli conversation/template docs up-to-date (#12426) 2025-03-17 21:14:32 +01:00
marcoStocchi f4c3dd5daa
llama-tts : add '-o' option (#12398)
* added -o option to specify an output file name

* llama-tts returns ENOENT in case of file write error

note : PR #12042 is closed as superseded by this one.
2025-03-15 17:23:11 +01:00
Eric Curtin 9f2250ba72
Add CLI arg to llama-run to adjust the number of threads used (#12370)
We default to 4; sometimes we want to adjust this manually

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-03-14 16:41:20 +00:00
Victor add2a3aa5a
server: fix "--grammar-file" parameter (#12285) 2025-03-14 11:21:17 +01:00
Georgi Gerganov e0dbec0bc6
llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)
* llama : refactor llama_context, llama_kv_cache, llm_build_context

ggml-ci

* graph : don't mutate the KV cache during defrag

ggml-ci

* context : reduce virtuals + remove test function

ggml-ci

* context : move interface implementation to source file + factory

ggml-ci

* graph : move KV cache build functions to llama_context impl

ggml-ci

* graph : remove model reference from build_pooling

ggml-ci

* graph : remove llama_model reference

ggml-ci

* kv_cache : provide rope factors

ggml-ci

* graph : rework inputs to use only unique_ptr, remove attn input abstraction

ggml-ci

* context : remove llama_context_i abstraction

ggml-ci

* context : clean-up

ggml-ci

* graph : clean-up

ggml-ci

* llama : remove redundant keywords (struct, enum)

ggml-ci

* model : adapt gemma3

ggml-ci

* graph : restore same attention ops as on master

ggml-ci

* llama : remove TODO + fix indent

ggml-ci
2025-03-13 12:35:44 +02:00
Ishaan Gandhi 2048b5913d
server : fix crash when using verbose output with input tokens that are not in printable range (#12178) (#12338)
* Fix DOS index bug

* Remove new APIs

* remove extra line

* Remove from API

* Add extra newline

* Update examples/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-13 11:10:05 +01:00
Daniel Bevenius 80a02aa858
llama.swiftui : fix xcframework dir in README [no ci] (#12353)
This commit fixes the path to the xcframework in the README file which I
had forgotten to change after renaming the build directory.
2025-03-12 13:45:32 +01:00
Xuan-Son Nguyen 7841fc723e
llama : Add Gemma 3 support (+ experimental vision capability) (#12343)
* llama : Add Gemma 3 text-only support

* fix python coding style

* fix compile on ubuntu

* python: fix style

* fix ubuntu compile

* fix build on ubuntu (again)

* fix ubuntu build, finally

* clip : Experimental support for Gemma 3 vision (#12344)

* clip : Experimental support for Gemma 3 vision

* fix build

* PRId64
2025-03-12 09:30:24 +01:00
Xuan-Son Nguyen 96e1280839
clip : bring back GPU support (#12322)
* clip : bring back GPU support

* use n_gpu_layers param

* fix double free

* ggml_backend_init_by_type

* clean up
2025-03-11 09:20:16 +01:00
marcoStocchi 6ef79a67ca
common : refactor '-o' option (#12278)
As discussed in PR 'llama-tts : add -o option' (#12042):

* common_params : 'out_file' string is the only output file name parameter left in common_params. It's intended to be used in all example programs implementing an '-o' option.

* cvector-generator, export-lora, imatrix : default output filenames moved from 'common_params' to the 'main()' of each example program.
2025-03-10 13:34:13 +02:00
Olivier Chafik be421fc429
`tool-call`: ensure there's always a non-empty tool call id (#12292) 2025-03-10 09:45:29 +00:00
Olivier Chafik 2b3a25c212
`sampler`: fixes trigger tokens + lazy grammars (fix typo cast from token to string) (#12291)
* Fix typo in lazy grammar handling (fixes trigger tokens)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-10 09:44:42 +00:00
tc-mb 8352cdc87b
llava : fix bug in minicpm-v code (#11513)
* fix bug in minicpm-v code

* update readme of minicpm-v
2025-03-10 10:33:24 +02:00
Georgi Gerganov 7ab364390f
server : infill gen ends on new line (#12254) 2025-03-07 20:54:30 +02:00
Sigbjørn Skjæret 8fad3c7a7c
server : Log original chat template parsing error (#12233) 2025-03-07 11:15:33 +01:00
Aaron Teo e9b2f84f14
llava: add big-endian conversion for image encoder (#12218)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-03-06 09:33:21 +01:00
Han Yin 57b6abf85a
android : fix KV cache log message condition (#12212) 2025-03-06 08:22:49 +02:00
Olivier Chafik 669912d9a5
`tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)
* sampler: turn lazy grammar trigger words to regexes

* add scripts/tool_bench.sh & .py

* constrain llama json output regardless of function name if matches at beginning

* update relaxed newline space rule in grammar tests

* support add_generation_prompt query parameter (useful for /apply_template)

* Update src/llama-grammar.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-05 13:05:13 +00:00
Clauszy 06a92a193a
server : fix cache reuse logic (#12161)
The first kv shift offsets the positions of all tokens after head_c.
When using llama_kv_cache_seq_rm next, using head_c will remove the valid tokens because their positions have already been offset.
2025-03-05 09:25:45 +02:00
Daniel Bevenius a057897ad4
llama : add xcframework build script (#11996)
* llama : add xcframework build script

This commit adds a script to build an XCFramework for Apple
iOS, macOS, visionOS, and tvOS platforms.

The generated XCFramework can then be added to a project and used in
the same way as a regular framework. The llama.swiftui example project
has been updated to use the XCFramework and can be started using the
following command:
```console
$ open examples/llama.swiftui/llama.swiftui.xcodeproj/
```

Refs: https://github.com/ggml-org/llama.cpp/issues/10747

* examples : remove llama.cpp (source dir ref) from project.pbxproj

This commit removes the reference to llama.cpp from the project.pbxproj
file since Package.swift has been removed.

* ci : updated build.yml to use build-xcframework.sh

* ci : add xcframework build to github releases

This commit adds the ability to create a GitHub release with the
xcframework build artifact.

* scripts : add apple app validation scripts

This commit adds scripts that can validate the iOS, macOS, tvOS, and
visionOS applications. The scripts create a simple test app project,
copy the llama.xcframework to the test project, build and archive the
app, create an IPA from the archive, and validate the IPA using altool.

The motivation for this is to provide some basic validation and
hopefully avoid having to manually validate apps in Xcode.

* llama : remove Package.swift

This commit removes the Package.swift file, as we are now building an
XCFramework for the project.

* llama : remove Sources and spm-headers directories

* llama : use TargetConditionals.h for visionOS/tvOS
2025-03-05 06:30:31 +01:00
mgroeber9110 5bbe6a9fe9
ggml : portability fixes for VS 2017 (#12150)
* Add include files for std::min/max and std::toupper/tolower

* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined

* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode

* win32: only use __restrict in MSVC if C11/C17 support is not enabled

---------

Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
2025-03-04 18:53:26 +02:00
Sigbjørn Skjæret 56d7a9f812
main: allow preloading conversation with -p and add -st / --single-turn (#12145)
* Add chat template formatting to -no-cnv

* only enable prompt formatting if explicitly enabled

* add -st / --single-turn

* add --single-turn and -p in conversation mode

* fix -sys + -p

* reword warning

* small readability change and fix (long) outdated example usage

* only activate single turn in conversation mode
2025-03-04 12:19:39 -04:00
Olivier Chafik 1a24c4621f
`server`: fix deadly typo in response_format.json_schema.schema handling (#12168) 2025-03-04 08:24:07 +02:00
dm4 c43af9276b
tts: add speaker file support (#12048)
* tts: add speaker file support

Signed-off-by: dm4 <sunrisedm4@gmail.com>

* tts: handle outetts-0.3

* tts : add new line in error message

---------

Signed-off-by: dm4 <sunrisedm4@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-03 15:09:29 +02:00
Eric Curtin c950a1f692
Adding UTF-8 support to llama.cpp (#12111)
For emojis, non-alpha characters, etc.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-03-03 12:44:56 +00:00
Xuan-Son Nguyen 7b69003af7
webui : add ?m=... and ?q=... params (#12148)
* webui : add ?m=... and ?q=... params

* also clear prefilledMessage variable

* better approach

* fix comment

* test: bump timeout on GITHUB_ACTION
2025-03-03 11:42:45 +01:00
Sigbjørn Skjæret 14dec0c2f2
main: use jinja chat template system prompt by default (#12118)
* Use jinja chat template system prompt by default

* faster conditional order

* remove nested ternary

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-03-02 14:53:48 +01:00