Commit Graph

715 Commits

Author SHA1 Message Date
Daniel Bevenius ebfe545cf9
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-30 07:59:02 +01:00
o7si daa242dfc8
common: fix return value check for setpriority (#18412)
* common: fix return value check for setpriority

* tools: add logging for process priority setting
2025-12-29 11:07:49 +02:00
o7si 60f17f56da
rpc: fix segfault on invalid endpoint format (#18387)
* rpc: fix segfault on invalid endpoint format

* rpc: add error log for failed endpoint connection
2025-12-28 12:34:41 +02:00
Daniel Bevenius 82c2600585
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-28 07:34:17 +01:00
Johannes Gäßler 026d2ad472
llama: fix magic number of 999 for GPU layers (#18266)
* llama: fix magic number of 999 for GPU layers

* use strings for -ngl, -ngld

* enacapsulate n_gpu_layers, split_mode
2025-12-27 20:18:35 +01:00
Xuan-Son Nguyen f5acfb2ffa
server: (router) add stop-timeout option (#18350)
* server: (router) add stop-timeout option

* also allow stop while loading

* add docs

* unload_lru: also wait for unload to complete
2025-12-24 23:47:49 +01:00
Georgi Gerganov 0ce03597e8
Merge branch 'master' into HEAD 2025-12-24 10:33:21 +02:00
ddh0 10355dc7d0
common: add `LLAMA_ARG_OVERRIDE_TENSOR` env var for `-ot` arg (#18267) 2025-12-24 14:19:12 +08:00
Johannes Gäßler 147a521636
tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
Daniel Bevenius f1310ab904
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-22 06:46:54 +01:00
Aldehir Rojas 9496bbb808
common : reorganize includes to prioritize vendored deps (#18222) 2025-12-20 21:43:21 -06:00
Xuan-Son Nguyen ddcb75dd8a
server: add auto-sleep after N seconds of idle (#18228)
* implement sleeping at queue level

* implement server-context suspend

* add test

* add docs

* optimization: add fast path

* make sure to free llama_init

* nits

* fix use-after-free

* allow /models to be accessed during sleeping, fix use-after-free

* don't allow accessing /models during sleep, it is not thread-safe

* fix data race on accessing props and model_meta

* small clean up

* trailing whitespace

* rm outdated comments
2025-12-21 02:24:42 +01:00
Xuan-Son Nguyen 9e39a1e6a9
server: support load model on startup, support preset-only options (#18206)
* server: support autoload model, support preset-only options

* add docs

* load-on-startup

* fix

* Update common/arg.cpp

Co-authored-by: Pascal <admin@serveurperso.com>

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-20 09:25:27 +01:00
Pascal 14931a826e
arg: fix order to use short form before long form (#18196)
* arg: fix order to use short form before long form

* arg: update doc

* arg: update test-arg-parser

* arg: address review feedback from ngxson

simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified

* arg: update doc
2025-12-19 18:01:56 +01:00
Xuan-Son Nguyen 98c1c7a7bf
presets: refactor, allow cascade presets from different sources, add global section (#18169)
* presets: refactor, allow cascade presets from different sources

* update docs

* fix neg arg handling

* fix empty mmproj

* also filter out server-controlled args before to_ini()

* skip loading custom_models if not specified

* fix unset_reserved_args

* fix crash on windows
2025-12-19 12:08:20 +01:00
Oliver Simons 1750917420 Fix different RNG-states between backend-sampling and llama-sampling
By default, we perform a warm-up step where the ggml_cgraph is computed
once. For backend-sampling, this graph contains the sampler, and thus
the RNG state of the backend's dist sampler is advanced once.

Solution to this is to reset the samplers after the warmup has finished
2025-12-19 11:42:10 +01:00
Daniel Bevenius bc5195c585
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-19 09:38:01 +01:00
Xuan-Son Nguyen 8ea958d4d9
model : add ASR support for LFM2-Audio-1.5B (conformer) (#18106)
* ASR with LFM2-Audio-1.5B

* Set rope_theta

* Fix comment

* Remove rope_theta setting

* Address PR feedback

* rename functions to conformer

* remove some redundant ggml_cont

* fix missing tensor

* add prefix "a." for conv tensors

* remove redundant reshape

* clean up

* add test model

---------

Co-authored-by: Tarek Dakhran <tarek@liquid.ai>
2025-12-19 00:18:01 +01:00
Xuan-Son Nguyen 4d1316c440
arg: fix ASAN error on sampler_type_names empty (#18167) 2025-12-18 14:30:32 +01:00
Georgi Gerganov 3b3f5fed31
common : disable backend sampling when grammar is involved 2025-12-18 10:52:21 +02:00
Georgi Gerganov eefdb0da17
Merge branch 'master' into HEAD 2025-12-18 10:12:47 +02:00
Pascal 6ce3d85796
server: (webui) add --webui-config (#18028)
* server/webui: add server-side WebUI config support

Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.

Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes

Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls

Addresses feedback from @ngxson and @ggerganov

* server: address review feedback from ngxson

* server: regenerate README with llama-gen-docs
2025-12-17 21:45:45 +01:00
Georgi Gerganov 4301e27319
common : restore grammar-based rejection sampling (#18137)
* common : restart grammar-based rejection sampling

* sampling : allow null samplers
2025-12-17 19:46:00 +02:00
Johannes Gäßler a2c199e479
common: clarify instructions for bug reports (#18134) 2025-12-17 18:44:13 +01:00
Pascal 487674fbb3
common: fix --override-kv to support comma-separated values (#18056)
* common: fix --override-kv to support comma-separated values

* Update common/arg.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* common: deprecate repeated arguments, suggest comma-separated values

* common: add comma escape support for --override-kv

* common: optimize duplicate detection with insert().second

Co-authored-by: personalmountains <46615898+personalmountains@users.noreply.github.com>

* common: migrate all repeated args to comma-separated syntax

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: personalmountains <46615898+personalmountains@users.noreply.github.com>
2025-12-17 11:36:23 +02:00
TrevorS 4b2a4778f8
arg: allow -kvu flag for llama-perplexity (#18117)
The -kvu (--kv-unified) flag is required for hellaswag and winogrande
benchmarks which use coupled sequences. Without unified KV cache,
these benchmarks fail with:

  split_equal: sequential split is not supported when there are
  coupled sequences in the input batch (you may need to use the -kvu flag)

This change adds LLAMA_EXAMPLE_PERPLEXITY to the allowed examples for
the -kvu argument, enabling its use with llama-perplexity.
2025-12-17 08:33:02 +02:00
Xuan-Son Nguyen 7b1db3d3b7
arg: clarify auto kvu/np being set on server (#17997)
* arg: clarify auto kvu/np being set on server

* improve docs

* use invalid_argument
2025-12-16 12:01:27 +01:00
Aldehir Rojas c05aa69f32
common : add nemotron 3 parsing (#18077)
* common : expose json-schema functionality to extract type info

* common : fix peg parser negation during needs_more_input

* common : add some defensive measures in constructed peg parser

* common : add nemotron nano 3 support

* common : add nemotron nano 3 tests

* remove debug line
2025-12-16 04:05:23 -06:00
Daniel Bevenius ad1b60abc4
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-16 09:45:08 +01:00
Johannes Gäßler b1f3a6e5db
llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-15 09:24:59 +01:00
Xuan-Son Nguyen 52392291b2
preset: handle negated arg, reverse the meaning if needed (#18041) 2025-12-14 22:08:10 +01:00
Georgi Gerganov 22c7f85b9c
Merge branch 'master' into HEAD 2025-12-14 10:19:58 +02:00
Georgi Gerganov 254098a279
common : refactor common_sampler + grammar logic changes (#17937)
* common : refactor common_sampler + grammar logic changes

* tests : increase max_tokens to get needed response

* batched : fix uninitialized samplers
2025-12-14 10:11:13 +02:00
Xuan-Son Nguyen 4d5ae24c0a
arg: fix common_params_parse not accepting negated arg (#17991) 2025-12-13 12:53:37 +01:00
Sigbjørn Skjæret 8e4d678528
common : skip model validation when --completion-bash is requested (#17975) 2025-12-13 08:40:50 +01:00
Sigbjørn Skjæret 2bc94e7928
add llama-completion to completion-bash executables (#17976) 2025-12-13 08:35:50 +01:00
Xuan-Son Nguyen 380b4c984e
common: support negated args (#17919)
* args: support negated args

* update docs

* fix typo

* add more neg options

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* rm duplicated arg

* fix LLAMA_ARG_NO_HOST

* add test

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-12 23:58:53 +01:00
Xuan-Son Nguyen 54a0fee4b7
arg: add -mm and -mmu as short form of --mmproj and --mmproj-url (#17958)
* arg: add -mm and -mmu as short form of --mmproj and --mmproj-url

* correct order

* update docs
2025-12-12 14:06:06 +01:00
Adrien Gallouët b8ee22cfde
common : add minimalist multi-thread progress bar (#17602)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 12:44:35 +01:00
Georgi Gerganov 4d10b78e23
Merge branch 'master' into HEAD 2025-12-11 14:42:56 +02:00
Xuan-Son Nguyen 34a6d86982
cli: enable jinja by default (#17911)
* cli: enable jinja by default

* Update common/arg.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-10 22:19:42 +01:00
Pascal f32ca51bfe
server: add presets (config) when using multiple models (#17859)
* llama-server: recursive GGUF loading

Replace flat directory scan with recursive traversal using
std::filesystem::recursive_directory_iterator. Support for
nested vendor/model layouts (e.g. vendor/model/*.gguf).
Model name now reflects the relative path within --models-dir
instead of just the filename. Aggregate files by parent
directory via std::map before constructing local_model

* server : router config POC (INI-based per-model settings)

* server: address review feedback from @aldehir and @ngxson

PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>

* server: adopt aldehir's line-oriented PEG parser

Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule

Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs

Co-authored-by: aldehir <aldehir@users.noreply.github.com>

* server: fix CLI/env duplication in child processes

Children now receive minimal CLI args (executable, model, port, alias)
instead of inheriting all router args. Global settings pass through
LLAMA_ARG_* environment variables only, eliminating duplicate config
warnings.

Fixes: Router args like -ngl, -fa were passed both via CLI and env,
causing 'will be overwritten' warnings on every child spawn

* add common/preset.cpp

* fix compile

* cont

* allow custom-path models

* add falsey check

* server: fix router model discovery and child process spawning

- Sanitize model names: replace / and \ with _ for display
- Recursive directory scan with relative path storage
- Convert relative paths to absolute when spawning children
- Filter router control args from child processes
- Refresh args after port assignment for correct port value
- Fallback preset lookup for compatibility
- Fix missing argv[0]: store server binary path before base_args parsing

* Revert "server: fix router model discovery and child process spawning"

This reverts commit e3832b42eeea7fcb108995966c7584479f745857.

* clarify about "no-" prefix

* correct render_args() to include binary path

* also remove arg LLAMA_ARG_MODELS_PRESET for child

* add co-author for ini parser code

Co-authored-by: aldehir <hello@alde.dev>

* also set LLAMA_ARG_HOST

* add CHILD_ADDR

* Remove dead code

---------

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: aldehir <hello@alde.dev>
2025-12-10 22:18:21 +01:00
Georgi Gerganov 38882247d3
Merge branch 'master' into HEAD 2025-12-10 17:07:21 +02:00
Xuan-Son Nguyen 6c2131773c
cli: new CLI experience (#17824)
* wip

* wip

* fix logging, add display info

* handle commands

* add args

* wip

* move old cli to llama-completion

* rm deprecation notice

* move server to a shared library

* move ci to llama-completion

* add loading animation

* add --show-timings arg

* add /read command, improve LOG_ERR

* add args for speculative decoding, enable show timings by default

* add arg --image and --audio

* fix windows build

* support reasoning_content

* fix llama2c workflow

* color default is auto

* fix merge conflicts

* properly fix color problem

Co-authored-by: bandoti <bandoti@users.noreply.github.com>

* better loading spinner

* make sure to clean color on force-exit

* also clear input files on "/clear"

* simplify common_log_flush

* add warning in mtmd-cli

* implement console writter

* fix data race

* add attribute

* fix llama-completion and mtmd-cli

* add some notes about console::log

* fix compilation

---------

Co-authored-by: bandoti <bandoti@users.noreply.github.com>
2025-12-10 15:28:59 +01:00
Georgi Gerganov 81cb5783c8
Merge branch 'master' into HEAD 2025-12-10 13:41:32 +02:00
Aldehir Rojas 2fbe3b7bb7
common : add parser for ministral/mistral large 3/devstral 2 (#17713) 2025-12-09 17:31:04 -06:00
Xuan-Son Nguyen 4e842d5120
console: allow using arrow left/right, home/end keys and history mode (#17836)
* console: allow using arrow left/right to edit the line (with UTF-8 support)

* console: fix arrow keys on Windows using private-use Unicode

* console: add Home/End key support for Windows and Linux

* console: add basic Up/Down history navigation

* fix build

* console: allow using arrow left/right to edit the line (with UTF-8 support)

* console: fix arrow keys on Windows using private-use Unicode

* console: add Home/End key support for Windows and Linux

* console: add basic Up/Down history navigation

* console: remove unreachable wc == 0 check after VK switch

* console: add Ctrl+Left/Right word navigation

- Add KEY_CTRL_ARROW_LEFT and KEY_CTRL_ARROW_RIGHT codes
- Windows: detect CTRL modifier via dwControlKeyState
- Linux: parse ANSI sequences with modifier (1;5D/C)
- Implement move_word_left/right with space-skipping logic
- Refactor escape sequence parsing to accumulate params

* console: add Delete key support

- Windows: VK_DELETE detection
- Linux: ESC[3~ sequence parsing
- Forward character deletion with UTF-8 support

* console: implement bash-style history editing

- Edit any history line during UP/DOWN navigation, edits persist
- Pressing Enter appends edited version as new history entry
- Original line stay untouched in their positions

* clean up

* better history impl

* fix decode_utf8

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-09 11:53:59 +01:00
Georgi Gerganov f3beb22b17
sampling : handle n_probs case 2025-12-08 21:30:10 +02:00
Georgi Gerganov 6d38db5dfe
Merge branch 'master' into HEAD 2025-12-08 17:55:24 +02:00
hksdpc255 636fc17a37
Fix Kimi-K2 tool-call parsing issues (#17376)
* Fix kimi-k2 parsing

* fix template & add more tests for kimi-k2

* Another fix for Kimi-K2 chat template.

* enable allow_toolcall_in_think for Kimi-K2

* Refine key-value separator and value end format

* Enable tool call in think for kimi-k2

* allow_toolcall_in_think is now tested with Kimi-K2

* Remove outdated TODO comment in XML tool call parser

Removed TODO comment about untested tool call feature.

* Rename function from "utf8_truncate_safe" to "utf8_truncate_safe_len"
2025-12-08 14:32:04 +01:00