Oleksandr Kuvshynov
408616adbd
server : [easy] fix per round speculative decode logging ( #18211 )
...
Currently we always log 0 drafted tokens, because slot.drafted is cleared before the log statement runs.
To reproduce:
Run llama-server with Devstral-2 as the main model, the smaller Devstral-Small-2 as the draft model (-md), and verbose logging enabled:
```
% ./build/bin/llama-server -v \
    -m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
    -md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
    -c 8192 2> /tmp/llama.cpp.debug
```
Check the log:
```
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 741
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 746
slot update_slots: id 3 | task 0 | accepted 16/0 draft tokens, new n_tokens = 763
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 775
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 778
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 783
slot update_slots: id 3 | task 0 | accepted 8/0 draft tokens, new n_tokens = 792
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 795
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 797
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 799
slot update_slots: id 3 | task 0 | accepted 0/0 draft tokens, new n_tokens = 800
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 803
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 805
slot update_slots: id 3 | task 0 | accepted 6/0 draft tokens, new n_tokens = 812
slot update_slots: id 3 | task 0 | accepted 3/0 draft tokens, new n_tokens = 816
```
After the fix, the per-round logging is correct:
```
slot update_slots: id 3 | task 0 | accepted 7/8 draft tokens, new n_tokens = 654
slot update_slots: id 3 | task 0 | accepted 1/2 draft tokens, new n_tokens = 656
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 659
slot update_slots: id 3 | task 0 | accepted 1/16 draft tokens, new n_tokens = 661
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 664
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 681
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 698
slot update_slots: id 3 | task 0 | accepted 3/4 draft tokens, new n_tokens = 702
slot update_slots: id 3 | task 0 | accepted 5/12 draft tokens, new n_tokens = 708
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 725
slot update_slots: id 3 | task 0 | accepted 1/1 draft tokens, new n_tokens = 727
slot update_slots: id 3 | task 0 | accepted 8/16 draft tokens, new n_tokens = 736
```
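For reference, a minimal sketch of the fix pattern (simplified, with placeholder names rather than the actual server code): the draft count has to be captured before the slot's draft state is cleared, otherwise the denominator is always 0.
```cpp
#include <cstdio>
#include <vector>

// Hypothetical, simplified per-round log; the real code lives in the server's
// update_slots loop and uses the slot's own fields and logging macros.
static void log_round(std::vector<int> & drafted, size_t n_accepted, int n_tokens) {
    const size_t n_draft = drafted.size(); // capture BEFORE clearing, otherwise this is always 0
    drafted.clear();                       // reset draft state for the next round
    std::printf("accepted %zu/%zu draft tokens, new n_tokens = %d\n",
                n_accepted, n_draft, n_tokens);
}
```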
2025-12-20 10:57:40 +01:00
Aman Gupta
cc0a04343e
server: friendlier error msg when ctx < input ( #18174 )
...
* llama-server: friendlier error msg when ctx < input
This PR adds formatted strings to the server's send_error function
* llama-server: use string_format inline
* fix test
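The gist, as a hedged sketch (the helper below is a stand-in for the project's string_format, and the exact wording of the message is an assumption): the error now reports the actual sizes instead of a generic failure.
```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the common string_format helper mentioned above.
static std::string string_format(const char * fmt, ...) {
    char buf[512];
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    return std::string(buf);
}

// Sketch of a friendlier message when the input does not fit the context.
static std::string ctx_too_small_msg(int n_prompt, int n_ctx) {
    return string_format(
        "the request exceeds the available context size (prompt: %d tokens, context: %d tokens); "
        "increase the context size or shorten the input",
        n_prompt, n_ctx);
}
```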
2025-12-19 12:10:00 +01:00
Pascal
6ce3d85796
server: (webui) add --webui-config ( #18028 )
...
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
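A minimal sketch of the backend flow above, assuming nlohmann::json (already used by the server) and invented member names; it only illustrates "parse once, cache, expose on /props":
```cpp
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>

using json = nlohmann::json;

struct webui_config_cache {
    json settings = json::object(); // parsed once at load time, reused for every /props response

    // inline_json comes from --webui-config; --webui-config-file would be read into a string first
    void load(const std::string & inline_json) {
        if (inline_json.empty()) {
            return;
        }
        try {
            settings = json::parse(inline_json);
        } catch (const std::exception & e) {
            throw std::runtime_error(std::string("invalid --webui-config JSON: ") + e.what());
        }
    }

    // merged into the /props payload so the WebUI can pick up server-side defaults
    json to_props() const {
        return json{{"webui_settings", settings}};
    }
};
```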
2025-12-17 21:45:45 +01:00
Georgi Gerganov
254098a279
common : refactor common_sampler + grammar logic changes ( #17937 )
...
* common : refactor common_sampler + grammar logic changes
* tests : increase max_tokens to get needed response
* batched : fix uninitialized samplers
2025-12-14 10:11:13 +02:00
Xuan-Son Nguyen
6c2131773c
cli: new CLI experience ( #17824 )
...
* wip
* wip
* fix logging, add display info
* handle commands
* add args
* wip
* move old cli to llama-completion
* rm deprecation notice
* move server to a shared library
* move ci to llama-completion
* add loading animation
* add --show-timings arg
* add /read command, improve LOG_ERR
* add args for speculative decoding, enable show timings by default
* add arg --image and --audio
* fix windows build
* support reasoning_content
* fix llama2c workflow
* color default is auto
* fix merge conflicts
* properly fix color problem
Co-authored-by: bandoti <bandoti@users.noreply.github.com>
* better loading spinner
* make sure to clean color on force-exit
* also clear input files on "/clear"
* simplify common_log_flush
* add warning in mtmd-cli
* implement console writer
* fix data race
* add attribute
* fix llama-completion and mtmd-cli
* add some notes about console::log
* fix compilation
---------
Co-authored-by: bandoti <bandoti@users.noreply.github.com>
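Several bullets above (console writer, data race fix, color cleanup on exit) amount to serializing terminal output from multiple threads; a hedged sketch of that pattern, with invented names rather than the actual console code:
```cpp
#include <cstdio>
#include <mutex>
#include <string>

// Hypothetical console writer: all output goes through one lock so spinner frames,
// log lines and generated text do not interleave mid-line.
class console_writer {
public:
    void write(const std::string & text) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::fputs(text.c_str(), stdout);
        std::fflush(stdout);
    }

    // reset terminal colors, including on forced-exit paths
    void reset_color() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::fputs("\033[0m", stdout);
        std::fflush(stdout);
    }

private:
    std::mutex mutex_;
};
```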
2025-12-10 15:28:59 +01:00
Xuan-Son Nguyen
951520ddb0
server: delegate result_state creation to server_task ( #17835 )
...
* server: delegate result_state creation to server_task
* remove unused states
* add more docs
2025-12-08 17:04:38 +01:00
Xuan-Son Nguyen
f896d2c34f
server: improve speed of speculative decoding ( #17808 )
...
* server: improve speed of speculative decoding
* fix small draft case
* add link to the PR
* server : fix generation time measurement
* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)
* server : add comment
* add PR to docs
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:35:28 +01:00
Georgi Gerganov
2bc96931d2
server : make cache_reuse configurable per request ( #17858 )
2025-12-08 12:43:12 +02:00
Xuan-Son Nguyen
c42712b056
server: support multiple generations from one prompt (OAI "n" option) ( #17775 )
...
* backend support
* server: support multiple generations from one prompt (OAI "n" option)
* fix invalid batch
* format oai
* clean up
* disable ctx shift
* add test
* update comments
* fix style
* add n_cmpl to docs [no ci]
* allow using both n_cmpl and n
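For reference, a hedged example of such a request on the OAI-compatible chat endpoint (model name and values are placeholders; n_cmpl is the native alias mentioned above):
```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Hypothetical request body asking for two completions of the same prompt.
const json request = {
    {"model",    "any-loaded-model"},
    {"messages", json::array({
        json{{"role", "user"}, {"content", "Name a prime number."}}
    })},
    {"n", 2},   // standard OpenAI field; n_cmpl is accepted as an equivalent
};
```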
2025-12-06 15:54:38 +01:00
Xuan-Son Nguyen
c4c10bfb86
server: move msg diffs tracking to HTTP thread ( #17740 )
...
* server: move msg diffs tracking to HTTP thread
* wip
* tool call tests ok
* minor : style
* cont : fix
* move states to server_response_reader
* add safe-guard
* fix
* fix 2
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-04 15:46:08 +01:00
Xuan-Son Nguyen
13628d8bdb
server: add --media-path for local media files ( #17697 )
...
* server: add --media-path for local media files
* remove unused fn
2025-12-02 22:49:20 +01:00
Xuan-Son Nguyen
5d6bd842ea
server: remove default "gpt-3.5-turbo" model name ( #17668 )
...
* server: remove default "gpt-3.5-turbo" model name
* do not reflect back model name from request
* fix test
2025-12-02 11:38:57 +01:00
Xuan-Son Nguyen
ecf74a8417
mtmd: add mtmd_context_params::warmup option ( #17652 )
...
* mtmd: add mtmd_context_params::warmup option
* reuse the common_params::warmup
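A hedged usage sketch: mtmd_context_params_default and mtmd_init_from_file are existing mtmd API calls, the warmup field is the one added here, and the surrounding function and paths are placeholders.
```cpp
#include "mtmd.h"

// Hypothetical caller: skip the warmup decode when initializing the multimodal projector.
static mtmd_context * init_projector(const char * mmproj_path, const llama_model * text_model) {
    mtmd_context_params mparams = mtmd_context_params_default();
    mparams.warmup = false; // option added by this commit; callers would pass common_params::warmup here
    return mtmd_init_from_file(mmproj_path, text_model, mparams);
}
```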
2025-12-01 21:32:25 +01:00
Xuan-Son Nguyen
ab49f094d2
server: move server-context to its own cpp|h ( #17595 )
...
* git mv
* add server-context.h
* add server-context.h
* clean up headers
* cont : cleanup
* also expose server_response_reader (to be used by CLI)
* fix windows build
* decouple server_routes and server_http
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-29 22:04:44 +01:00