llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	d8d98bb4bb	Merge branch 'master' into HEAD	2025-11-29 22:38:44 +02:00
o7si	3ce7a65c2f	server: fix: /metrics endpoint returning JSON-escaped Prometheus format (#17386 ) * fix: /metrics endpoint returning JSON-escaped Prometheus format * mod: remove string overload from ok() method	2025-11-28 19:14:00 +01:00
Georgi Gerganov	117e2079a9	refactor : simplify and improve memory management	2025-11-28 16:09:42 +02:00
Fredrik Hultin	ddf9f94389	server : add Anthropic Messages API support (#17570 ) * server : add Anthropic Messages API support * remove -@pytest.mark.slow from tool calling/jinja tests * server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py * server : removed redundant n field logic in anthropic_params_from_json * server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream() * server : refactor Anthropic API to use OAI conversion * make sure basic test always go first * clean up * clean up api key check, add test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-11-28 12:57:04 +01:00
Daniel Bevenius	9e5e09d087	sampling : remove backend-dist option (wip) This commit removes the `--backend-dist` option and instead uses the configured `--samplers` chain to determine which samplers run on the backend. Backend sampling is still enabled with `--backend-sampling`, and the sampler chain, either explicitly specified using `--samplers` or the default, is automatically analyzed to determine which samplers can run on the backend. The system finds the longest contiguous chain of backend-supported samplers from the start of the sampler sequence. For example: * If the chain is `top-k -> temperature -> top-p`, and both `top-k` and `temperature` are backend-supported but `top-p` is not, then `top-k` and `temperature` will run on the backend, while `top-p` and subsequent samplers run on the CPU. * If all configured samplers are supported, the final distribution sampling will also happen on the backend, transferring only the sampled token IDs back to the host. * If the sampler chain starts with an unsupported sampler (e.g., `penalties`), all sampling runs on the CPU. Note that this is currently the case with the default sampler chain, so to use backend sampling a sampler chain must be specified. See below for an example. The following shows how llama-cli can be run with backend sampling: ```console $ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \ --prompt 'What is the capital of Sweden?' \ -n 20 \ -no-cnv \ --verbose-prompt \ -ngl 40 \ --backend-sampling \ --samplers 'top_k;temperature' ``` In this case all sampling will happen on the backend since both `top_k` and `temperature` are supported backend samplers. To enable partial backend sampling (hybrid sampling), for example running `top_k` and `temperature` on the backend and `top_p` on the CPU, the following sampler chain could be specified: ```console $ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \ --prompt 'What is the capital of Sweden?' \ -n 20 \ -no-cnv \ --verbose-prompt \ -ngl 40 \ --backend-sampling \ --samplers 'top_k;temperature;top_p' ``` If this looks good then I'll follow up with updates to the llama-cli and llama-server documentation to reflect these changes.	2025-11-25 14:01:23 +01:00
Daniel Bevenius	2b4c7927ee	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-25 06:10:33 +01:00
Xuan-Son Nguyen	b8372eecd9	server: split server.cpp code into server/common/task/queue (#17362 ) * add server-task, server-common * add server-queue * rm redundant includes * move enum stop_type to server-task * server : headers cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-24 14:41:53 +01:00
Daniel Bevenius	0c660e7390	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-20 06:57:24 +01:00
Xuan-Son Nguyen	0de8878c96	server: split HTTP into its own interface (#17216 ) * server: split HTTP into its own interface * move server-http and httplib to its own file * add the remaining endpoints * fix exception/error handling * renaming * missing header * fix missing windows header * fix error responses from http layer * fix slot save/restore handler * fix case where only one stream chunk is returned * add NOMINMAX * do not call sink.write on empty data * use safe_json_to_str for SSE * clean up * add some comments * improve usage of next() * bring back the "server is listening on" message * more generic handler * add req.headers * move the chat template print to init() * add req.path * cont : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-17 22:05:44 +01:00
Daniel Bevenius	f1f3e68511	server : add backend sampling options/configuration	2025-11-17 16:16:05 +01:00
Georgi Gerganov	5b2093becc	server : handle context overflow during decode (#17267 ) * server : handle context overflow during decode * server : minor refactor	2025-11-16 09:23:37 +02:00
Xuan-Son Nguyen	9b17d74ab7	mtmd: add mtmd_log_set (#17268 )	2025-11-14 15:56:19 +01:00
Georgi Gerganov	d396b43748	server : fix "can batch with" bug (#17263 )	2025-11-14 14:03:45 +02:00
Xuan-Son Nguyen	c4abcb2457	server: fixing naming conflict res_error (#17243 )	2025-11-13 20:53:47 +01:00
Xuan-Son Nguyen	00c94083b3	server: (refactor) implement generator-based API for task results (#17174 ) * server: (refactor) implement generator-based API for task results * improve * moving some code * fix "Response ended prematurely" * add sink.done before return false * rm redundant check * rm unused var * rename generator --> reader	2025-11-12 18:50:52 +01:00
Xuan-Son Nguyen	ee8dd5c658	server: move res_error/res_ok to static function (#17167 )	2025-11-12 14:17:24 +01:00
Georgi Gerganov	cb1adf8851	server : handle failures to restore host cache (#17078 ) * server : handle failures to restore host cache * server : add tests for the prompt cache	2025-11-09 14:27:05 +02:00
Aidan	eeee367de5	server: fix correct time_ms calculation in prompt_progress (#17093 ) * fix: correct time_ms calculation in send_partial_response The time_ms field was incorrectly calculated. The division was happening before the subtraction leading to incorrect values. Before: (ggml_time_us() - slot.t_start_process_prompt / 1000) After: (ggml_time_us() - slot.t_start_process_prompt) / 1000 * docs : document time_ms field in prompt_progress	2025-11-08 15:12:11 +02:00
Georgi Gerganov	8c0d6bb455	server : print the samplers chain for each request (#17070 )	2025-11-07 12:24:47 +02:00
Georgi Gerganov	b7f9010d24	server : disable checkpoints with mtmd (#17045 )	2025-11-06 12:09:29 +02:00
Georgi Gerganov	13b339bcd9	server : do not default to multiple slots with speculative decoding (#17017 ) * server : do not default to multiple slots with speculative decoding * cont : fix	2025-11-05 14:32:55 +02:00
Georgi Gerganov	66d8eccd42	server : do context shift only while generating (#17000 )	2025-11-04 19:21:36 +02:00
Georgi Gerganov	48bd26501b	server : add props.model_alias (#16943 ) * server : add props.model_alias * webui : npm run format	2025-11-03 14:38:23 +01:00
Xuan-Son Nguyen	070ff4d535	mtmd: add --image-min/max-tokens (#16921 )	2025-11-03 11:11:18 +01:00
Georgi Gerganov	2f966b8ed8	clip : use FA (#16837 ) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-11-02 21:21:48 +01:00
Georgi Gerganov	cd5e3b5754	server : support unified cache across slots (#16736 ) * server : support unified context across slots * cont : fix speculative decoding initialization * context : fix n_ctx_per_seq computation * server : purge slots one by one * tests : add unified cache server tests * llama : update per-seq context computation * test-thread-safety : handle tiny training context of the input model * server : fix server_tokens clear() * server : use 4 slots + unified KV by default * llama : add note about context size queries * cont : update todos [no ci] * context : do not cap the size of the context * tests : adjust parameters to be CI friendlier * context : add warning	2025-11-02 18:14:04 +02:00
Georgi Gerganov	c22473b580	server : don't print user inputs to console (#16871 )	2025-10-31 10:54:19 +02:00
Daniel Bevenius	0f715b4e75	server : fix typos in server.cpp comments [no ci] (#16883 )	2025-10-31 09:51:26 +01:00
Georgi Gerganov	b52edd2558	server : remove n_past (#16818 ) * server : remove n_past * server : replace slot.n_prompt_tokens() with slot.task->n_tokens() * server : fixes + clean-up * cont : fix context shift * server : add server_tokens::pos_next() Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * server : fix pos_next() usage Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>	2025-10-30 18:42:57 +02:00
Georgi Gerganov	85a7d8677b	memory : remove KV cache size padding (#16812 ) * memory : remove KV cache size padding * cont : restore padding for n_kv tensor shape * server : use slot context size instead of training context size * server : simplify context limit logic	2025-10-28 20:19:44 +02:00
Johannes Gäßler	0bf47a1dbb	server: add memory breakdown print (#16740 )	2025-10-23 21:30:17 +02:00
matteo	8cf6b42d46	server : send partial stop string when <EOG> is reached (#15007 )	2025-10-23 12:32:24 +03:00
Georgi Gerganov	17304cbcc1	server : fix img token logs (#16595 )	2025-10-15 16:53:12 +03:00
Georgi Gerganov	554fd578a5	server : fix mtmd checkpoints (#16591 )	2025-10-15 11:51:27 +02:00
Georgi Gerganov	bc07349a7f	server : dynamic token limit for prompt cache (#16560 ) * server : dynamic token limit for prompt cache * cont : print estimated token limit	2025-10-14 08:48:50 +03:00
Yann Follet	31d0ff1869	server / ranking : add sorting and management of top_n (#16403 ) * server / ranking : add sorting and management of top_n * make it backward compatible: if no top_n is given, all results are returned. Here is a script for testing: ```script URL=${1:-http://127.0.0.1:8181} curl "$URL/v1/rerank" -H "Content-Type: application/json" \ -d '{ "model": "M", "query": "What is the recipe to make bread ?", "return_text" : true, "texts" : true, "top_n": 6, "documents": [ "voici la recette pour faire du pain, il faut de la farine de l eau et du levain et du sel", "it is a bear", "bread recipe : floor, water, yest, salt", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.", "here is the ingedients to bake bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt", "recipe to make cookies : floor, eggs, water, chocolat", "here is the recipe to make bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt", "il fait tres beau aujourd hui", "je n ai pas faim, je ne veux pas manger", "je suis a paris" ] }' | jq ``` * use resize() instead of for(...) 
* simplify top_n init since no need to return error result to test : ./tests.sh unit/test_rerank.py -v -x ==================================================== test session starts ===================================================== platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3 cachedir: .pytest_cache rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests configfile: pytest.ini plugins: anyio-4.11.0 collected 8 items unit/test_rerank.py::test_rerank PASSED [ 12%] unit/test_rerank.py::test_rerank_tei_format PASSED [ 25%] unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 37%] unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 50%] unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 62%] unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 75%] unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 87%] unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [100%] ===================================================== 8 passed in 4.31s ====================================================== * add rerank top_n unit test here is the result : ./tests.sh unit/test_rerank.py -v -x =================================================================== test session starts =================================================================== platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3 cachedir: .pytest_cache rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests configfile: pytest.ini plugins: anyio-4.11.0 collected 16 items unit/test_rerank.py::test_rerank PASSED [ 6%] unit/test_rerank.py::test_rerank_tei_format PASSED [ 12%] unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 18%] unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 25%] 
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 31%] unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 37%] unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 43%] unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [ 50%] unit/test_rerank.py::test_rerank_top_n[None-4] PASSED [ 56%] unit/test_rerank.py::test_rerank_top_n[2-2] PASSED [ 62%] unit/test_rerank.py::test_rerank_top_n[4-4] PASSED [ 68%] unit/test_rerank.py::test_rerank_top_n[99-4] PASSED [ 75%] unit/test_rerank.py::test_rerank_tei_top_n[None-4] PASSED [ 81%] unit/test_rerank.py::test_rerank_tei_top_n[2-2] PASSED [ 87%] unit/test_rerank.py::test_rerank_tei_top_n[4-4] PASSED [ 93%] unit/test_rerank.py::test_rerank_tei_top_n[99-4] PASSED [100%] =================================================================== 16 passed in 8.84s =================================================================== * editor config check fix	2025-10-11 16:39:04 +03:00
Georgi Gerganov	e60f01d941	server : fix division by zero when reporting stats (#16501 )	2025-10-10 22:15:05 +03:00
Radoslav Gerganov	68ee98ae18	server : return HTTP 400 if prompt exceeds context length (#16486 ) In streaming mode when prompt exceeds context length, the server returns HTTP 200 status code with a JSON error in the body. This is very confusing and inconsistent with all other inference engines which return HTTP 4xx error in this case. This patch fixes this problem and makes the server return HTTP 400 in such cases.	2025-10-10 16:11:07 +02:00
Radoslav Gerganov	cdb6da468c	server : log requests to /v1/completions (#16495 )	2025-10-10 13:22:27 +03:00
Georgi Gerganov	d00cbea63c	server : host-memory prompt caching (#16391 ) * minor : code style * server : fix prompt similarity calculation * server : initial host-memory prompt caching * cont * server : refactor * cont * cont : make the server task of the slot const * cont : minor [no ci] * server : cache prompts and checkpoints only for completion tasks * server : improve prompt caching logic * cont : fix check for number of cached prompts [no ci] * server : improve caching logic, add -cram CLI arg * server : print prompt mismatch info * cont : better naming [no ci] * server : improve prompt cache loading logic * server : add option to debug the slot contents (#16482) * server : add option to debug the slot contents * Update tools/server/server.cpp --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * server : add option to disable prompt cache --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>	2025-10-09 18:54:51 +03:00
issixx	d2ee056e1d	server : fix cancel pending task (#16467 ) Co-authored-by: DevAI <DevAI@gmail.com>	2025-10-08 11:20:18 +03:00
Georgi Gerganov	7fdd16b432	server : improve context checkpoint logic (#16440 )	2025-10-08 10:57:29 +03:00
Georgi Gerganov	df1b612e29	server : add `/v1/health` endpoint (#16461 ) * server : add /v1/health endpoint * cont : update readme	2025-10-07 15:57:14 +03:00
ddh0	f6dcda3900	server : context checkpointing for hybrid and recurrent models (#16382 ) * initial commit for branch 3 * generalize `swa_checkpoint` to `ctx_checkpoint` this extends `llama-server`'s SWA checkpointing logic to include hybrid/recurrent models such as Jamba, Granite * oops * disable debug prints * keep backwards compat with `--swa-checkpoints` Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update prompt re-processing message * fix off-by-one error per GG * keep `seq_rm` log per GG Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server : fix checkpoint logic to support recurrent caches * server : cleanup and fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-03 21:34:51 +03:00
Isaac McFadyen	e0539eb6ae	webui: switch to hash-based routing (alternative of #16079 ) (#16157 ) * Switched web UI to hash-based routing * Added hash to missed goto function call * Removed outdated SPA handling code * Fixed broken sidebar home link	2025-09-26 18:36:48 +03:00
Douglas Hanley	b5bd037832	llama : add support for qwen3 reranker (#15824 )	2025-09-25 11:53:09 +03:00
Benni	459c0c2c1a	server: fix SSE and OpenAI compatibility for error messages when streaming (#16109 ) * server: fix SSE and OpenAI compatibility for error messages when streaming * server: remove obsolete event parameter and use required data fieldname instead	2025-09-20 07:56:30 +02:00
Radoslav Gerganov	2b6b55a59f	server : include usage statistics only when user requests them (#16052 ) * server : include usage statistics only when user requests them When serving the OpenAI compatible API, we should check if {"stream_options": {"include_usage": true}} is set in the request when deciding whether we should send usage statistics closes: #16048 * add unit test	2025-09-18 10:36:57 +00:00
Aleksander Grygier	a7a98e0fff	SvelteKit-based WebUI (#14839 )	2025-09-17 19:29:13 +02:00
Sigbjørn Skjæret	6c019cb04e	server : only attempt to enable thinking if using jinja (#15967 )	2025-09-14 21:17:04 +02:00


119 Commits
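
Note on commit 3ce7a65c2f (JSON-escaped /metrics output): the difference between a JSON-escaped body and the plain-text body a Prometheus scraper expects can be sketched as below. This is only an illustration in Python, not the server's C++ code; the metric name, the value, and both helper functions are made up for the example.

```python
import json

# Illustrative Prometheus text exposition payload.
# The metric name and value are hypothetical.
PROMETHEUS_TEXT = (
    "# HELP requests_total Number of requests handled.\n"
    "# TYPE requests_total counter\n"
    "requests_total 42\n"
)


def respond_as_json_string(body: str) -> str:
    """Serialize the body as a JSON string: quoted, with newlines escaped."""
    return json.dumps(body)


def respond_as_plain_text(body: str) -> str:
    """Return the body unchanged, one metric line per physical line."""
    return body


if __name__ == "__main__":
    print(respond_as_json_string(PROMETHEUS_TEXT))  # a single quoted line
    print(respond_as_plain_text(PROMETHEUS_TEXT))   # three separate lines
```

The JSON-escaped form is a single quoted line containing literal `\n` sequences, which a line-oriented text parser cannot split into individual samples.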
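
Note on commit 9e5e09d087 (backend sampling): the rule described there, taking the longest contiguous run of backend-supported samplers from the start of the chain and leaving the rest on the CPU, can be sketched as follows. This is a minimal Python illustration, not the llama.cpp implementation; the `BACKEND_SUPPORTED` set and both function names are assumptions made for the example.

```python
from typing import List, Set, Tuple

# Hypothetical set of backend-supported samplers, chosen to match the
# examples in the commit message. The real set is defined by llama.cpp.
BACKEND_SUPPORTED: Set[str] = {"top_k", "temperature"}


def parse_samplers(arg: str) -> List[str]:
    """Split a ';'-separated --samplers value into sampler names."""
    return [name for name in arg.split(";") if name]


def split_sampler_chain(
    chain: List[str], supported: Set[str] = BACKEND_SUPPORTED
) -> Tuple[List[str], List[str]]:
    """Return (backend_part, cpu_part).

    backend_part is the longest prefix of `chain` whose samplers are all
    supported; everything from the first unsupported sampler onward stays
    on the CPU, even if a later sampler would be supported.
    """
    split = 0
    for name in chain:
        if name not in supported:
            break
        split += 1
    return chain[:split], chain[split:]


if __name__ == "__main__":
    print(split_sampler_chain(parse_samplers("top_k;temperature;top_p")))
    print(split_sampler_chain(parse_samplers("penalties;top_k")))
```

With the assumed set, `top_k;temperature;top_p` yields a hybrid split, while a chain that starts with `penalties` runs entirely on the CPU, matching the cases listed in the commit message.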
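
Note on commit eeee367de5 (time_ms in prompt_progress): the bug is one of operator precedence, since the division bound to the start timestamp alone instead of to the elapsed time. A small Python sketch with made-up timestamps shows the effect; integer division is used to mirror integer microsecond arithmetic, and the function names are hypothetical.

```python
def time_ms_buggy(now_us: int, start_us: int) -> int:
    """Before the fix: only start_us is divided by 1000."""
    return now_us - start_us // 1000


def time_ms_fixed(now_us: int, start_us: int) -> int:
    """After the fix: the elapsed microseconds are divided by 1000."""
    return (now_us - start_us) // 1000


if __name__ == "__main__":
    now_us, start_us = 5_000_000, 2_000_000  # 3 seconds elapsed
    print(time_ms_buggy(now_us, start_us))   # 4998000
    print(time_ms_fixed(now_us, start_us))   # 3000
```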
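
Note on commit 31d0ff1869 (rerank top_n): the behavior described there, sorting results by score, truncating to `top_n`, and returning everything when `top_n` is absent, can be sketched as below. The result field names (`index`, `score`) and the function name are assumptions for illustration, and a non-negative `top_n` is assumed.

```python
from typing import Dict, List, Optional, Union

Result = Dict[str, Union[int, float]]


def apply_top_n(results: List[Result], top_n: Optional[int] = None) -> List[Result]:
    """Sort results by descending score and keep at most top_n of them.

    If top_n is None, all results are returned (backward-compatible case).
    A top_n larger than the number of results is clamped to that number.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    if top_n is None:
        return ranked
    return ranked[: min(top_n, len(ranked))]


if __name__ == "__main__":
    docs = [
        {"index": 0, "score": 0.10},
        {"index": 1, "score": 0.90},
        {"index": 2, "score": 0.50},
        {"index": 3, "score": 0.70},
    ]
    print([r["index"] for r in apply_top_n(docs, 2)])   # [1, 3]
    print(len(apply_top_n(docs, 99)))                   # 4
```

The four cases in the usage example line up with the parametrized test IDs in the commit message (`None-4`, `2-2`, `4-4`, `99-4`), read here as "requested top_n, expected result count" for four documents.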
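
Note on commit 2b6b55a59f (usage statistics): the check described there, sending usage statistics for a streaming request only when `stream_options.include_usage` is true, can be sketched as follows. This is a hedged Python illustration of the decision only; the function name is hypothetical and the handling of malformed values is an assumption.

```python
from typing import Any, Dict


def wants_usage(request: Dict[str, Any]) -> bool:
    """Return True only if the request body sets
    {"stream_options": {"include_usage": true}}.

    Missing keys, a non-object stream_options, or any value other than
    the boolean true are treated as "do not include usage" (assumption).
    """
    stream_options = request.get("stream_options")
    if not isinstance(stream_options, dict):
        return False
    return stream_options.get("include_usage") is True


if __name__ == "__main__":
    print(wants_usage({"stream": True, "stream_options": {"include_usage": True}}))  # True
    print(wants_usage({"stream": True}))                                             # False
```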