* Prevent crash if TTFT > 300 sec, timeout boosted to 90 days
* server : allow configurable HTTP timeouts for child models
* server : pass needed timeouts from params only
---------
Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net>
* webui: apply webui_settings on first load
The webui_settings from /props were not applied on initial load
when default_generation_settings.params was null.
Now syncs whenever serverProps is available, regardless of params;
works for both single-model and router modes.
* chore: update webui build output
* server: prevent data race from HTTP threads
* fix params
* fix default_generation_settings
* nits: make handle_completions_impl look less strange
* stricter const
* fix GGML_ASSERT(idx < states.size())
* move index to be managed by server_response_reader
* http: make sure req & res lifecycle are tied together
* fix compile
* fix buggy index handling
* fix data race for lora endpoint
* nits: fix shadow variable
* nits: revert redundant changes
* nits: correct naming for json_webui_settings
* implement sleeping at queue level
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified
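A minimal sketch of the simplified ordering check described above, assuming each option's spellings are held as a list of strings. The helper name `short_form_first` is hypothetical and not taken from the arg parser; as the note says, only the first and last entries are compared.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true when the first spelling of an option is no
// longer than the last one, i.e. the short form is listed before the long form.
// Middle entries of sets with 3+ spellings are intentionally not inspected.
bool short_form_first(const std::vector<std::string> & args) {
    if (args.size() < 2) {
        return true; // a single spelling is trivially ordered
    }
    return args.front().length() <= args.back().length();
}
```

With this check, `{"-s", "--seed"}` passes and `{"--draft-max", "--draft"}` fails, while a set whose middle entry is out of order still passes.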
* arg: update doc
* llama-server: friendlier error msg when ctx < input
This PR adds formatted strings to the server's send_error function
* llama-server: use string_format inline
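As an illustration of building a formatted error string when the input exceeds the context size, here is a sketch under stated assumptions: `format_message` is a local printf-style helper written for this example (not the server's `string_format`), and the message wording and parameter names are hypothetical.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>

// Illustrative printf-style formatter (an assumption for this sketch).
std::string format_message(const char * fmt, ...) {
    va_list ap;
    va_list ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);
    // First pass: measure the required length without writing anything.
    const int size = std::vsnprintf(nullptr, 0, fmt, ap);
    va_end(ap);
    if (size < 0) {
        va_end(ap2);
        return std::string();
    }
    // Second pass: write into a buffer with room for the terminating NUL.
    std::vector<char> buf(static_cast<size_t>(size) + 1);
    std::vsnprintf(buf.data(), buf.size(), fmt, ap2);
    va_end(ap2);
    return std::string(buf.data(), static_cast<size_t>(size));
}

// Hypothetical message text: reports both numbers so the user can see the gap.
std::string context_exceeded_message(int n_prompt_tokens, int n_ctx) {
    return format_message(
        "request (%d tokens) exceeds the available context size (%d tokens)",
        n_prompt_tokens, n_ctx);
}
```

Including both the request size and the context size in the message lets the user decide whether to shorten the input or raise the context size.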
* fix test
* presets: refactor, allow cascade presets from different sources
* update docs
* fix neg arg handling
* fix empty mmproj
* also filter out server-controlled args before to_ini()
* skip loading custom_models if not specified
* fix unset_reserved_args
* fix crash on windows
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
* server: fix crash when batch > ubatch with embeddings (#12836)
Fixes #12836 where the server crashes with GGML_ASSERT failure when
running with embeddings enabled and n_batch > n_ubatch.
Root cause: Embeddings use non-causal attention which requires all
tokens to be processed within a single ubatch. When n_batch > n_ubatch,
the server attempts to split processing, causing assertion failure.
Solution:
- Add parameter validation in main() after common_params_parse()
- When embeddings enabled and n_batch > n_ubatch:
  - Log warnings explaining the issue
  - Automatically set n_batch = n_ubatch
  - Prevent server crash
This follows the approach suggested by @ggerganov in issue #12836.
Note: This supersedes stalled PR #12940 which attempted a runtime fix
in the old examples/server/server.cpp location. This implementation
validates at startup in tools/server/server.cpp (current location).
Testing:
- Build: Compiles successfully
- Validation triggers: Warns when -b > -ub with --embedding
- Auto-correction works: Adjusts n_batch = n_ubatch
- No false positives: Valid params don't trigger warnings
- Verified on macOS M3 Pro with embedding model
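The startup validation described in this commit can be sketched roughly as follows. This is a simplified illustration, not the actual code in tools/server/server.cpp: the `batch_params` struct and the `clamp_batch_for_embeddings` function are hypothetical stand-ins for the relevant fields of the parsed parameters, and the warning text is invented for the example.

```cpp
#include <cstdio>

// Hypothetical stand-in for the subset of server parameters involved.
struct batch_params {
    bool embedding = false;
    int  n_batch   = 2048;
    int  n_ubatch  = 512;
};

// Returns true if n_batch had to be adjusted.
// Non-causal attention needs all tokens in a single ubatch, so when
// embeddings are enabled n_batch must not exceed n_ubatch.
bool clamp_batch_for_embeddings(batch_params & params) {
    if (!params.embedding || params.n_batch <= params.n_ubatch) {
        return false; // valid configuration, no warning
    }
    std::fprintf(stderr,
        "warning: embeddings enabled with n_batch (%d) > n_ubatch (%d)\n",
        params.n_batch, params.n_ubatch);
    std::fprintf(stderr,
        "warning: setting n_batch = n_ubatch = %d\n", params.n_ubatch);
    params.n_batch = params.n_ubatch;
    return true;
}
```

Running the check once after argument parsing, rather than at request time, keeps the fix out of the hot path and surfaces the adjustment in the startup log.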
* Update tools/server/server.cpp
---------
Co-authored-by: ytian218 <ytian218@bloomberg.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* convert ok
* no deepstack
* less new tensors
* cgraph ok
* add mrope for text model
* faster patch merger
* add GGML_ROPE_TYPE_MRNORM
* add support for metal
* move glm4v to dedicated graph
* convert: add norm_embd
* clip: add debugging fn
* working correctly
* fix style
* use bicubic
* fix mrope metal
* improve cpu
* convert to neox ordering on conversion
* revert backend changes
* force stop if using old weight
* support moe variant
* fix conversion
* fix convert (2)
* Update tools/mtmd/clip-graph.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* process mrope_section on TextModel base class
* resolve conflict merge
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Pass disabled state to the file attachments button and the model
selector button.
* Update index.html.gz
* Fix model info card in non-router mode.
* Update index.html.gz