llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	2652e745ef	webui : fix lint	2025-12-14 16:45:07 +02:00
Georgi Gerganov	22c7f85b9c	Merge branch 'master' into HEAD	2025-12-14 10:19:58 +02:00
Georgi Gerganov	254098a279	common : refactor common_sampler + grammar logic changes (#17937 ) * common : refactor common_sampler + grammar logic changes * tests : increase max_tokens to get needed response * batched : fix uninitialized samplers	2025-12-14 10:11:13 +02:00
Sergey Fedorov	4ed2bae50d	server-models.cpp: add missing <filesystem> (#18000 ) Fixes: https://github.com/ggml-org/llama.cpp/issues/17999	2025-12-13 22:02:43 +01:00
Xuan-Son Nguyen	4d5ae24c0a	arg: fix common_params_parse not accepting negated arg (#17991 )	2025-12-13 12:53:37 +01:00
Xuan-Son Nguyen	380b4c984e	common: support negated args (#17919 ) * args: support negated args * update docs * fix typo * add more neg options * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * rm duplicated arg * fix LLAMA_ARG_NO_HOST * add test --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-12 23:58:53 +01:00
Xuan-Son Nguyen	e39a2ce66d	clip: move model cgraphs into their own files (#17965 ) * clip: move model cgraphs into their own files * more explicit enums * fix linux build * fix naming * missing headers * nits: add comments for contributors	2025-12-12 21:14:48 +01:00
Xuan-Son Nguyen	17158965ac	mtmd: explicitly forbidden inclusion of private header and libcommon (#17946 )	2025-12-12 15:16:06 +01:00
Aleksander Grygier	12280ae905	webui: Fix parsing non-LaTeX occurrencies of `$` or `$` (#17810 ) * fix: Improve latex protection logic to prevent turning non-latex `\(` into `$` * chore: update webui build output	2025-12-12 15:13:36 +01:00
Xuan-Son Nguyen	54a0fee4b7	arg: add -mm and -mmu as short form of --mmproj and --mmproj-url (#17958 ) * arg: add -mm and -mmu as short form of --mmproj and --mmproj-url * correct order * update docs	2025-12-12 14:06:06 +01:00
Pascal	a81a569577	Add a search field on model selector / improve mobile display (#17765 ) * webui: add search field to model selector and fixes mobile viewport overflow * webui: simplify model search style and code * refacor: Search Input component & consistent UI for Models Selector search * feat: Use Popover component + improve interactions * fix: Fetching props for only loaded models in ROUTER mode * webui: prevent models selector popover from overflowing viewport Use Floating UI's auto-positioning with 50dvh height limit and proper collision detection instead of forcing top positioning. Fixes overflow on desktop and mobile keyboard issues * webui: keep search field near trigger in models selector Place search at the 'near end' (closest to trigger) by swapping layout with CSS flexbox order based on popover direction. Prevents input from moving during typing as list shrinks * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-12-11 18:21:21 +01:00
Piotr Wilkin (ilintar)	53ecd4fdb9	SOLVE_TRI extension to more dimensions (#17793 ) * Extended TRI * Fix whitespace * chore: update webui build output * Just use cuBLAS for everything... * Merge both versions * Remove incorrect imports causing failures for CI * Still failing... remove all direct cublas imports and rely on common imports from "common.cuh" * Defines for hipBlas * Aaaand MUSA defines... * I hate this job... * Stupid typo... * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-11 17:20:43 +01:00
Georgi Gerganov	4d10b78e23	Merge branch 'master' into HEAD	2025-12-11 14:42:56 +02:00
Xuan-Son Nguyen	c6b2c9310c	mtmd: some small clean up (#17909 ) * clip: add support for fused qkv in build_vit * use bulid_ffn whenever possible * fix internvl * mtmd-cli: move image to beginning * test script: support custom args	2025-12-10 22:20:06 +01:00
Xuan-Son Nguyen	34a6d86982	cli: enable jinja by default (#17911 ) * cli: enable jinja by default * Update common/arg.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-10 22:19:42 +01:00
Pascal	f32ca51bfe	server: add presets (config) when using multiple models (#17859 ) * llama-server: recursive GGUF loading Replace flat directory scan with recursive traversal using std::filesystem::recursive_directory_iterator. Support for nested vendor/model layouts (e.g. vendor/model/.gguf). Model name now reflects the relative path within --models-dir instead of just the filename. Aggregate files by parent directory via std::map before constructing local_model server : router config POC (INI-based per-model settings) * server: address review feedback from @aldehir and @ngxson PEG parser usage improvements: - Simplify parser instantiation (remove arena indirection) - Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping) - Fix last line without newline bug (+ operator instead of <<) - Remove redundant end position check Feature scope: - Remove auto-reload feature (will be separate PR per @ngxson) - Keep config.ini auto-creation and template generation - Preserve per-model customization logic Co-authored-by: aldehir <aldehir@users.noreply.github.com> Co-authored-by: ngxson <ngxson@users.noreply.github.com> * server: adopt aldehir's line-oriented PEG parser Complete rewrite of INI parser grammar and visitor: - Use p.chars(), p.negate(), p.any() instead of p.until() - Support end-of-line comments (key=value # comment) - Handle EOF without trailing newline correctly - Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]) - Simplified visitor (no pending state, no trim needed) - Grammar handles whitespace natively via eol rule Business validation preserved: - Reject section names starting with LLAMA_ARG_ - Accept only keys starting with LLAMA_ARG_* - Require explicit section before key-value pairs Co-authored-by: aldehir <aldehir@users.noreply.github.com> * server: fix CLI/env duplication in child processes Children now receive minimal CLI args (executable, model, port, alias) instead of inheriting all router args. Global settings pass through LLAMA_ARG_* environment variables only, eliminating duplicate config warnings. Fixes: Router args like -ngl, -fa were passed both via CLI and env, causing 'will be overwritten' warnings on every child spawn * add common/preset.cpp * fix compile * cont * allow custom-path models * add falsey check * server: fix router model discovery and child process spawning - Sanitize model names: replace / and \ with _ for display - Recursive directory scan with relative path storage - Convert relative paths to absolute when spawning children - Filter router control args from child processes - Refresh args after port assignment for correct port value - Fallback preset lookup for compatibility - Fix missing argv[0]: store server binary path before base_args parsing * Revert "server: fix router model discovery and child process spawning" This reverts commit e3832b42eeea7fcb108995966c7584479f745857. * clarify about "no-" prefix * correct render_args() to include binary path * also remove arg LLAMA_ARG_MODELS_PRESET for child * add co-author for ini parser code Co-authored-by: aldehir <hello@alde.dev> * also set LLAMA_ARG_HOST * add CHILD_ADDR * Remove dead code --------- Co-authored-by: aldehir <aldehir@users.noreply.github.com> Co-authored-by: ngxson <ngxson@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: aldehir <hello@alde.dev>	2025-12-10 22:18:21 +01:00
Georgi Gerganov	4dff236a52	ggml : remove GGML_KQ_MASK_PAD constant (#17910 ) * ggml : remove GGML_KQ_MASK_PAD constant * cont : remove comment	2025-12-10 20:53:16 +02:00
Georgi Gerganov	38882247d3	Merge branch 'master' into HEAD	2025-12-10 17:07:21 +02:00
Xuan-Son Nguyen	6c2131773c	cli: new CLI experience (#17824 ) * wip * wip * fix logging, add display info * handle commands * add args * wip * move old cli to llama-completion * rm deprecation notice * move server to a shared library * move ci to llama-completion * add loading animation * add --show-timings arg * add /read command, improve LOG_ERR * add args for speculative decoding, enable show timings by default * add arg --image and --audio * fix windows build * support reasoning_content * fix llama2c workflow * color default is auto * fix merge conflicts * properly fix color problem Co-authored-by: bandoti <bandoti@users.noreply.github.com> * better loading spinner * make sure to clean color on force-exit * also clear input files on "/clear" * simplify common_log_flush * add warning in mtmd-cli * implement console writter * fix data race * add attribute * fix llama-completion and mtmd-cli * add some notes about console::log * fix compilation --------- Co-authored-by: bandoti <bandoti@users.noreply.github.com>	2025-12-10 15:28:59 +01:00
Georgi Gerganov	0ecee8be37	server : reconnect the backend_sampling setting in the WebUI	2025-12-10 15:42:20 +02:00
Georgi Gerganov	81cb5783c8	Merge branch 'master' into HEAD	2025-12-10 13:41:32 +02:00
Aldehir Rojas	2fbe3b7bb7	common : add parser for ministral/mistral large 3/devstral 2 (#17713 )	2025-12-09 17:31:04 -06:00
Rhys-T	63908b631a	cmake: fix Mach-O current version number (#17877 ) PR #17091 set the VERSION of various libraries to 0.0.abcd, where abcd is the LLAMA_BUILD_NUMBER. That build number is too large to fit in the Mach-O 'current version' field's 'micro' part, which only goes up to 255. This just sets the Mach-O current version to 0 to get it building properly again. Fixes #17258.	2025-12-09 13:17:41 +02:00
Georgi Gerganov	560ac16f7d	server : handle unsupported cases	2025-12-09 10:55:11 +02:00
Georgi Gerganov	f3beb22b17	sampling : handle n_probs case	2025-12-08 21:30:10 +02:00
Xuan-Son Nguyen	951520ddb0	server: delegate result_state creation to server_task (#17835 ) * server: delegate result_state creation to server_task * remove unued states * add more docs	2025-12-08 17:04:38 +01:00
Georgi Gerganov	6d38db5dfe	Merge branch 'master' into HEAD	2025-12-08 17:55:24 +02:00
Xuan-Son Nguyen	f896d2c34f	server: improve speed of speculative decoding (#17808 ) * server: improve speed of speculative decoding * fix small draft case * add link to the PR * server : fix generation time measurement * server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros) * server : add comment * add PR to docs --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-08 14:35:28 +01:00
Xuan-Son Nguyen	37a4f63244	server : add development documentation (#17760 ) * first draft * rewrite * update & remove duplicated sections	2025-12-08 13:54:58 +01:00
Georgi Gerganov	2bc96931d2	server : make cache_reuse configurable per request (#17858 )	2025-12-08 12:43:12 +02:00
Georgi Gerganov	42125f0e10	tests : check temp back to 0.0	2025-12-07 15:54:49 +02:00
Georgi Gerganov	8ef5f900db	cont : fixes	2025-12-07 15:45:00 +02:00
Vishal Singh	017761daf5	ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690 ) * ggml-zennn: add ZenDNN backend support * ggml-zendnn : address ZenDNN backend review fixes and suggestions * docs : apply blockquote syntax to ZenDNN docs --------- Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>	2025-12-07 00:13:33 +08:00
Georgi Gerganov	fdac9686f7	Merge branch 'master' into HEAD	2025-12-06 16:55:33 +02:00
Xuan-Son Nguyen	c42712b056	server: support multiple generations from one prompt (OAI "n" option) (#17775 ) * backend support * server: support multiple generations from one prompt (OAI "n" option) * fix invalid batch * format oai * clean up * disable ctx shift * add test * update comments * fix style * add n_cmpl to docs [no ci] * allowing using both n_cmpl and n	2025-12-06 15:54:38 +01:00
Georgi Gerganov	30742a6ff5	sampling : expand support (wip)	2025-12-06 16:51:56 +02:00
Aleksander Grygier	a28e3c7567	webui: Stop generation from chat sidebar (#17806 ) * feat: Add stop generation button for Conversation Item * chore: update webui build output	2025-12-06 13:29:15 +01:00
Aleksander Grygier	e31b5c55c3	webui: Fix context available value in Multi-model Router mode (#17804 ) * fix: Use context size from `/props?model=...` in ROUTER mode * chore: update webui build output	2025-12-06 13:23:29 +01:00
Aleksander Grygier	21f24f27a9	webui: Per-conversation system message with UI displaying, edition & branching (#17275 ) * feat: Per-conversation system message with optional display in UI, edition and branching (WIP) * chore: update webui build output	2025-12-06 13:19:05 +01:00
Oliver Simons	7668999518	Merge branch 'master' into gpu-sampling Let's keep `master's` cumsum implementation for it's likely better AMD perf and add back pure-CUB-implementation in follow-up commit	2025-12-05 14:41:08 +01:00
Xuan-Son Nguyen	9d0229967a	server: strip content-length header on proxy (#17734 )	2025-12-04 16:32:57 +01:00
Georgi Gerganov	6958d41366	sampling : check backend support during init	2025-12-04 17:29:08 +02:00
Xuan-Son Nguyen	c4c10bfb86	server: move msg diffs tracking to HTTP thread (#17740 ) * server: move msg diffs tracking to HTTP thread * wip * tool call tests ok * minor : style * cont : fix * move states to server_response_reader * add safe-guard * fix * fix 2 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-04 15:46:08 +01:00
Daniel Bevenius	c0b182f4d6	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-12-04 08:17:50 +01:00
Adrien Gallouët	ef75a89fdb	build : move _WIN32_WINNT definition to headers (#17736 ) Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds, This caused "macro redefined" warnings with toolchains that define the version. This also removes the `GGML_WIN_VER` variable as it is no longer needed. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-04 07:04:02 +01:00
Piotr Wilkin (ilintar)	c6d1a00aa7	Add a couple of file types to the text section (#17670 ) * Add a couple of file types to the text section * Format + regenerate index * Rebuild after rebase	2025-12-03 21:45:06 +01:00
Aleksander Grygier	e9f9483464	Use OpenAI-compatible `/v1/models` endpoint by default (#17689 ) * refactor: Data fetching via stores * chore: update webui build output * refactor: Use OpenAI compat `/v1/models` endpoint by default to list models * chore: update webui build output * chore: update webui build output	2025-12-03 20:49:09 +01:00
Andika Wasisto	41c5e02f42	webui: Fix zero pasteLongTextToFileLen to disable conversion being overridden (#17445 ) * webui: Fix zero pasteLongTextToFileLen to disable conversion being overridden Zero pasteLongTextToFileLen should disable the conversion, but it was overwritten with 2500. * Apply suggestions from code review * Update webui build	2025-12-03 20:45:17 +01:00
Pascal	e7c2cf1356	server: add router multi-model tests (#17704 ) (#17722 ) * llama-server: add router multi-model tests (#17704) Add 4 test cases for model router: - test_router_unload_model: explicit model unloading - test_router_models_max_evicts_lru: LRU eviction with --models-max - test_router_no_models_autoload: --no-models-autoload flag behavior - test_router_api_key_required: API key authentication Tests use async model loading with polling and graceful skip when insufficient models available for eviction testing. utils.py changes: - Add models_max, models_dir, no_models_autoload attributes to ServerProcess - Handle JSONDecodeError for non-JSON error responses (fallback to text) * llama-server: update test models to new HF repos * add offline * llama-server: fix router LRU eviction test and add preloading Fix eviction test: load 2 models first, verify state, then load 3rd to trigger eviction. Previous logic loaded all 3 at once, causing first model to be evicted before verification could occur. Add module fixture to preload models via ServerPreset.load_all() and mark test presets as offline to use cached models * llama-server: fix split model download on Windows --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-12-03 15:10:37 +01:00
Adrien Gallouët	1257491047	server : fix bad fmt, size() is a size_type (#17735 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-03 15:47:22 +02:00

440 Commits