llama.cpp

Commit Graph

Author	SHA1	Message	Date
Pascal	5113efd34c	fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (#16356 ) Use <svelte:window bind:innerHeight> instead of manual resize listener Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-10-03 08:01:31 +02:00
Aleksander Grygier	764799279f	Conversation action dialogs as singletons from Chat Sidebar + apply conditional rendering for Actions Dropdown for Chat Conversation Items (#16369 ) * fix: Render Conversation action dialogs as singletons from Chat Sidebar level * chore: update webui build output * fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup * chore: Update webui static build * fix: Always truncate conversation names * chore: Update webui static build	2025-10-01 18:18:10 +02:00
Aleksander Grygier	2a9b63383a	Improve code block color theming (#16325 ) * feat: Improve code block theming * chore: update webui build output * chore: Update webui static build	2025-10-01 15:54:42 +02:00
Aleksander Grygier	4f1575921c	Add optional setting for showing "Model used:" information (#16337 ) * feat: Add a setting to include model name used to generate the message * feat: UI improvements * feat: Save model info along with the database message entry creation * chore: Build webui static output	2025-10-01 12:08:16 +02:00
Aleksander Grygier	aa9538a63a	webui: Remove running `llama-server` within WebUI `dev.sh` script (#16363 )	2025-10-01 08:40:26 +03:00
Pascal	16b0ca0d2e	Chatapi ignore empty sampling (#16330 ) * fix: skip empty sampling fields instead of coercing to 0 in chat API options * chore: update webui build output	2025-09-30 19:18:54 +02:00
Pascal	5f7e166cbf	Fix thinking blocks with quotes + add handling `[THINK]...[/THINK]` blocks (#16326 ) * fix: prevent reasoning blocks with quotes from being truncated * chore: update webui build output * feat: Improve thinking content parsing * test: Adds ChatMessage component stories for different thinking blocks * chore: update webui build output * fix: ChatMessage story fix --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-09-29 18:49:47 +02:00
Aleksander Grygier	3a2bdcda0b	Improve Mobile UI for dialogs and action dropdowns (#16222 ) * fix: Always show conversation item actions * feat: Improve Alert Dialog and Dialog mobile UI * feat: Add settings reset to default confirmation * fix: Close Edit dialog on save * chore: update webui build output * webui: implement proper z-index system and scroll management - Add CSS variable for centralized z-index control - Fix dropdown positioning with Settings dialog conflicts - Prevent external scroll interference with proper event handling - Clean up hardcoded z-index values for maintainable architecture * webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides * feat: Use `dvh` instead of computed px height for dialogs max height on mobile * chore: update webui build output * feat: Improve Settings fields UI * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Pascal <admin@serveurperso.com>	2025-09-29 10:37:20 +02:00
Pascal	66bb7985c3	fix: preserved zero values in chat settings inputs and textareas by switching to nullish coalescing for field values and default placeholders (#16312 )	2025-09-29 09:08:41 +02:00
Imad Saddik	2811c65286	Fixed a few typos in the README of the LLaMA.cpp HTTP Server [no ci] (#16297 )	2025-09-28 13:04:46 +02:00
Aleksander Grygier	4807e8f96a	Show message actions by default (#16289 )	2025-09-27 19:56:40 +02:00
Adrien Gallouët	234e2ff8ed	server : remove old LLAMA_SERVER_SSL (#16290 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-09-27 19:17:08 +03:00
Aleksander Grygier	807e8c6d31	Enhance text file detection logic for file attachments (#16199 ) * feat: Enhances text file detection logic * chore: Build static `webui` output * chore: update webui build output	2025-09-26 19:25:29 +02:00
Aleksander Grygier	1a18927894	Allow viewing conversations even when llama server is down (#16255 ) * webui: allow viewing conversations and sending messages even if llama-server is down - Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable. - Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation. - Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data. * feat: Add UI for `props` endpoint unavailable + cleanup logic * webui: extend cached props fallback to offline errors Treat connection failures (refused, DNS, timeout, fetch) the same way as server 5xx so the warning banner shows up when cache is available, instead of falling back to a full error screen. * webui: Left the chat form enabled when a server warning is present so operators can keep sending messages e.g., to restart the backend over llama-swap, even while cached /props data is in use * chore: update webui build output --------- Co-authored-by: Pascal <admin@serveurperso.com>	2025-09-26 18:35:42 +02:00
Isaac McFadyen	e0539eb6ae	webui: switch to hash-based routing (alternative of #16079 ) (#16157 ) * Switched web UI to hash-based routing * Added hash to missed goto function call * Removed outdated SPA handling code * Fixed broken sidebar home link	2025-09-26 18:36:48 +03:00
Aleksander Grygier	5d0a40f390	Always show message actions for mobile UI + improvements for user message sizing (#16076 )	2025-09-26 15:59:07 +02:00
Daniel Bevenius	d0991da39d	server : add support for external server for tests (#16243 ) This commit adds support for using an externally started llama-server instance for the server tests. This can be enabled by setting the DEBUG_EXTERNAL environment variable. The motivation for this is to allow debugging of the server itself when investigating a test failure. Instructions for how to do this are added to the README.md file in the tests directory.	2025-09-25 11:36:47 +02:00
Douglas Hanley	b5bd037832	llama : add support for qwen3 reranker (#15824 )	2025-09-25 11:53:09 +03:00
Quentin Bramas	138c87ce8b	webui : fix handling incomplete chunks (#16107 )	2025-09-22 11:53:13 +03:00
Benni	459c0c2c1a	server: fix SSE and OpenAI compatibility for error messages when streaming (#16109 ) * server: fix SSE and OpenAI compatibility for error messages when streaming * server: remove obsolete event parameter and use required data fieldname instead	2025-09-20 07:56:30 +02:00
Aleksander Grygier	4067f07fc5	feat: Improve mobile UI for Settings Dialog (#16084 ) * feat: Improve mobile UI for Settings Dialog * chore: update webui build output * fix: Linting errors * chore: update webui build output	2025-09-19 09:52:27 +02:00
Radoslav Gerganov	2b6b55a59f	server : include usage statistics only when the user requests them (#16052 ) * server : include usage statistics only when the user requests them When serving the OpenAI-compatible API, we should check whether {"stream_options": {"include_usage": true}} is set in the request when deciding whether to send usage statistics closes: #16048 * add unit test	2025-09-18 10:36:57 +00:00
Aleksander Grygier	a7a98e0fff	SvelteKit-based WebUI (#14839 )	2025-09-17 19:29:13 +02:00
Sigbjørn Skjæret	6c019cb04e	server : only attempt to enable thinking if using jinja (#15967 )	2025-09-14 21:17:04 +02:00
Georgi Gerganov	f088b6a84f	server : adjust prompt similarity thold + add logs (#15913 ) ggml-ci	2025-09-12 17:02:55 +03:00
Daniel Bevenius	70cd37dbbe	requirements : update transformers/torch for Embedding Gemma (#15828 ) * requirements : update transformers/torch for Embedding Gemma This commit updates the requirements to support converting Embedding Gemma 300m models. The motivation for this change is that during development I had a local copy of the transformers package which is what I used for converting the models. This was a mistake on my part and I should have also updated my transformers version to the official release. I had checked the requirements/requirements-convert_legacy_llama.txt file and noted that the version was >=4.45.1,<5.0.0 and came to the conclusion that no update would be needed; this assumed that Embedding Gemma would be in a transformers release at the time commit `fb15d649ed` ("llama : add support for EmbeddingGemma 300m (#15798)") was merged, so anyone wanting to convert the model themselves would be able to do so. However, Embedding Gemma is a preview release and this commit updates the requirements to use this preview release. * resolve additional python dependencies * fix pyright errors in tokenizer test and remove unused import	2025-09-09 06:06:52 +02:00
Aldehir Rojas	7057faf64b	json : support `enum` values within `allOf` (#15830 )	2025-09-08 16:14:32 -05:00
Xuan-Son Nguyen	56920f5665	server : bring back timings_per_token (#15879 )	2025-09-08 16:50:05 +02:00
Xuan-Son Nguyen	3c3635d2f2	server : speed up tests (#15836 ) * server : speed up tests * clean up * restore timeout_seconds in some places * flake8 * explicit offline	2025-09-06 14:45:24 +02:00
Xuan-Son Nguyen	61bdfd5298	server : implement prompt processing progress report in stream mode (#15827 ) * server : implement `return_progress` * add timings.cache_n * add progress.time_ms * add test * fix test for chat/completions * readme: add docs on timings * use ggml_time_us Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-09-06 13:35:04 +02:00
Gabe Goodhart	fd621880f3	aLoRA Support (#15327 ) * feat: Add python-side constants and conversion for adapter.lora.invocation_string Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for adapter.lora.invocation_string Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parse invocation string for adapters from GGUF Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(python): Update conversion to alora_invocation_tokens This is the preferred method in PEFT which is the source of ground truth https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(cpp): Update to alora_invocation_tokens on c++ side Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add C APIs to get alora invocation token array from lora Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Initial implementation of alora cache logic in server This does not yet do the part to identify the invocation tokens and only apply the lora adapter afterwards, but it does seem to produce correct results if the invocation tokens are the beginning of the uncached input. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Identify alora invocation sequences This currently limits to a single enabled alora per slot. Multiple aloras with different invocation sequences would be possible, but it would require a more complex integration of the adapter toggling and is not really a well studied case for alora since it's unclear if one alora can reuse cache from previous prefill computed with a different alora. 
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Only reuse cache for tokens before the alora invocation start This is a bit of an edge case, but theoretically a user could try the same query with the alora disabled (just using the base model), then retry with the alora. The cached tokens from the first pass should be invalid. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Handle un-cached tokens that come before the alora activation The solution is to only fill up to the token before the invocation start in the batch if there are any tokens to be prefilled between those pulled from cache and the invocation start. When this is detected, the alora is temporarily disabled with a scale of 0.0, then immediately re-enabled after it has been initialized for the internal graph. Since the batch does not complete the prompt tokens, the remaining prompt tokens are handled in the next task, pulling all of the non-alora tokens from cache and proceeding with prefill for the alora tokens. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use || instead of 'or' Too much python 🤦 Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix off-by-one for limiting cached tokens to before alora start This was the cause of the inconsistent results from the dummy test script with and without the turn that runs the prompt without the adapter before running it with the adapter. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Support backwards-compatibility for "invocation_string" in adapter_config.json While this has been replaced in the PEFT PR in favor of alora_invocation_tokens, the existing adapters in the ibm-granite org on HF use "invocation_string," so this will enable backwards compatibility and enable testing now (before PEFT PR changes have percolated everywhere). 
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove duplicate logging Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-05 17:32:39 -06:00
Gabe Goodhart	5fac79cbc7	Thinking model disabled assistant prefill (#15404 ) * feat: Set enable_thinking IFF not disabled and supported Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix inverted logic condition for prefill error Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Always parse the enable_thinking kwarg to overwrite the default value From what I can tell, this started as a Qwen3-specific keyword, but since `chat.cpp` translates inputs.enable_thinking to the right thinking kwarg for the given model, this is now more of a standardized kwarg, so it should always override the default value when sent as part of the chat_template_kwargs field in the API. Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Don't limit template expansion check to jinja With the use_jinja check, non-jinja models would enable thinking and always fail assistant prefill Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add the error text to json type errors in json_value Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Explicitly reject string values for "enable_thinking" There are too many possible "truthy" / "falsy" strings and too many ambiguous strings that don't have a clear truthy/falsy value, so the simplest thing to do here is to reject the request. Ideally, this would be a 422 (Unprocessable Entity), but right now it's coming back as a 500. 
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Move logic for detecting template enable_thinking support to common Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use raw pointer for common chat template function Branch: gabe-l-hart/thinking-model-disabled-agent-prefill Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-09-05 14:31:24 -06:00
Xuan-Son Nguyen	a68d914426	server: add exceed_context_size_error type (#15780 ) * server: add exceed_context_size_error type * change error code to 400	2025-09-04 11:50:23 +02:00
Georgi Gerganov	e92d53b29e	sampling : optimize samplers by reusing bucket sort (#15665 ) * sampling : optimize sorting using bucket sort in more places ggml-ci * sampling : do not sort in dist sampler ggml-ci * sampling : avoid heap allocations for sort buffers ggml-ci * common : add option to sort sampling candidates by probability ggml-ci * sampling : revert the change for preserving sort buffers * sampling : use std::copy instead of memcpy * sampling : clarify purpose of partial sort helpers ggml-ci * cont : remove wrong comment [no ci] * common : update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-31 20:41:02 +03:00
Georgi Gerganov	0d161f021a	server : enable /slots by default and make it secure (#15630 ) * server : enable /slots by default and make it secure ggml-ci * server : fix tests to pass `--no-slots` when necessary * server : extend /props with info about enabled endpoints	2025-08-31 20:11:58 +03:00
Johannes Gäßler	e81b8e4b7f	llama: use FA + max. GPU layers by default (#15434 ) * llama: use max. GPU layers by default, auto -fa * ggml-backend: abort instead of segfault	2025-08-30 16:32:10 +02:00
Sergey Alirzaev	d82f6aa34a	server : removed obsolete doc (#15670 ) completing `a4090d1174`	2025-08-30 00:12:53 +02:00
ExtReMLapin	792b44f2ed	server : add documentation for `parallel_tool_calls` param (#15647 ) Co-authored-by: Pierre F <no@p.e>	2025-08-29 20:25:40 +03:00
Sigbjørn Skjæret	84ab83cc0b	model : jina-embeddings-v3 support (#13693 ) * initial jina-embeddings-v3 support * initial jina-embeddings-v3 support * initial jina-embeddings-v3 support * fix vocab parsing with only tokenizer.json * set mask token lstrip attribute * additional unk_token_id fallback just in case [no ci] * revert vocab_size() change [no ci] * merge tensor loading into general bert * rope * add lora embedding and loading (non-functional) * export separate lora ggufs instead * add adapter metadata api * use std::string * convert_hf_to_lora compatibility * fix assert * apply suggestions from review * apply suggestion from review	2025-08-28 15:49:50 +02:00
Johannes Gäßler	fbef0fad7a	server: higher timeout for tests (#15621 )	2025-08-27 20:58:09 +02:00
Georgi Gerganov	9ebebef62f	llama : remove KV cache defragmentation logic (#15473 ) ggml-ci	2025-08-22 12:22:13 +03:00
65a	4afb0a746f	server : Support multimodal completion and embeddings prompts in JSON format (#15108 ) - Use server_tokens in more places in server and util.cpp - Convert most functions that used llama_tokens to server_tokens - Modify input tokenizer to handle JSON objects as subprompts - Break out MTMD prompt parsing into utility function - Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types - Add capability to model endpoint to indicate if client can send multimodal data - Add tests.	2025-08-22 10:10:14 +02:00
stduhpf	1b0db8f6e0	server : fix webui (#15462 ) * Fix webui crash after streaming * build webui	2025-08-21 08:19:22 +03:00
teo	1bc664a26a	server: fix OpenAI API compatibility for usage statistics in chat streams (#15444 )	2025-08-21 00:10:08 +02:00
xiaobing318	1a99c2d948	cmake : fix target include directories (#15450 ) * Update docker.yml Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed * feat:Modify the header file include path 1. There's no llava directory in the tools directory. 2. Because the command `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically include the `mtmd` directory as a search path for header files. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava)` or use `target_include_directories(${TARGET} PRIVATE ../mtmd)` to explicitly require the `llama-server` target to use header files from `mtmd`. * Restore the docker.yml file	2025-08-20 13:32:05 +03:00
Georgi Gerganov	d2fcd91cf9	server : disable context shift by default (#15416 ) * server : disable context shift by default ggml-ci * server : make scope of test parameters local	2025-08-19 16:46:37 +03:00
davidef	d1d8241600	server : fix incoming tasks not processed in order (#15395 )	2025-08-18 17:51:42 +03:00
Oleksandr Kuvshynov	e5155e6986	server : export max observed n_past value (#15361 ) Add tracking for high watermark cache usage and make it available in /metrics endpoint. Use-case: Tracking largest needed cache usage under realistic workload to better understand memory requirements and be able to adjust cache size/quantization for model/cache accordingly.	2025-08-18 00:28:58 +02:00
Diego Devesa	f75b830647	chat : include kwargs in template example (#15309 )	2025-08-14 10:28:29 -07:00
Aldehir Rojas	b204a5a234	gpt-oss: implement harmony parsing (#15181 ) * model : add harmony parser for gpt-oss * gpt-oss : fix grammar trigger from causing empty stack * gpt-oss: tweak the grammar trigger again * gpt-oss : add support for recipient in role header * gpt-oss : fix ungrouped tool calls in grammar * gpt-oss : loosen function name matching during parse * gpt-oss : clean up workarounds * gpt-oss : add template tests * gpt-oss : simulate thinking and tool call tags * gpt-oss : undo think tags when reasoning_format is none * gpt-oss : set special tokens back to user defined * gpt-oss : update openai-gpt-oss template * server : filter out harmony thought messages * gpt-oss : simplify parsing	2025-08-14 17:23:11 +03:00
Georgi Gerganov	d32e03f449	server : add SWA checkpoints (#15293 ) * server : add SWA checkpoints ggml-ci * cont : server clean-up * server : handle state restore fails * llama : add extended llama_state_seq_ API * server : do not make checkpoints if --swa-full ggml-ci * llama : remove flags value for NONE * server : configure number of SWA checkpoints with CLI arg ggml-ci * args : fix scope of new argument	2025-08-14 14:59:50 +03:00
Sigbjørn Skjæret	b3e16665e1	server : enable -td and -tbd parameters (#15172 )	2025-08-13 15:43:00 +02:00
Copilot	d8914fc47e	common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191 ) * Checkpoint from VS Code for coding agent session * Initial plan * Fix typo in --override-tensor-draft flag implementation * Add null termination for speculative tensor buffer overrides * Apply suggestions from code review * Apply suggestions from code review * Extract tensor override parsing logic to common function (addresses @slaren's feedback) * Apply suggestions from code review * Apply suggestions --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-08-13 12:44:40 +02:00
Aldehir Rojas	e885445bc1	server : filter out harmony thought messages (#15278 )	2025-08-13 12:28:21 +02:00
Xuan-Son Nguyen	53d0a12658	server : allow specifying reasoning_format in HTTP request (#15238 )	2025-08-11 14:48:41 +02:00
Georgi Gerganov	fd1234cb46	llama : add gpt-oss (#15091 ) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup 
ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-05 22:10:36 +03:00
Alex Wu	22f060c9c4	webui: fix markdown table (#15081 ) * webui: fix markdown table * webui: fix table display with themes	2025-08-05 13:56:44 +02:00
Johannes Gäßler	f906275537	server: enable token array inputs for OAI API (#15001 )	2025-08-02 10:12:41 +02:00
g2mt	94933c8c2e	server : implement universal assisted decoding (#12635 ) * llama-server : implement universal assisted decoding * Erase prompt tail for kv-cache * set vocab_dft_compatible in common_speculative * rename ctx_main to ctx_tgt * move vocab_dft_compatible to spec struct * clear mem_dft, remove mem * detokenize id_last for incompatible models * update comment * add --spec-replace flag * accept special tokens when translating between draft/main models * Escape spec-replace * clamp draft result to size to params.n_draft * fix comment * clean up code * restore old example * log common_speculative_are_compatible in speculative example * fix * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-31 14:25:23 +02:00
Lukas Straub	a9f77a8be3	server : add openai-style logit_bias support (#14946 ) Signed-off-by: Lukas Straub <lukasstraub2@web.de>	2025-07-31 14:08:23 +02:00
Daniel Bevenius	41e78c567e	server : add support for `embd_normalize` parameter (#14964 ) This commit adds support for the `embd_normalize` parameter in the server code. The motivation for this is that currently if the server is started with a pooling type that is not `none`, then Euclidean/L2 normalization will be the normalization method used for embeddings. However, this is not always the desired behavior, and users may want to use other normalization (or none) and this commit allows that. Example usage: ```console curl --request POST \ --url http://localhost:8080/embedding \ --header "Content-Type: application/json" \ --data '{"input": "Hello world today", "embd_normalize": -1}' ```	2025-07-30 18:07:11 +02:00
Molly Sophia	adef81781a	server : allow setting `--reverse-prompt` arg (#14799 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-07-22 09:24:22 +08:00
IsaacDynamo	b4efd77f8a	server : add parse_special option to /tokenize endpoint (#14783 )	2025-07-21 10:24:51 +03:00
Georgi Gerganov	6ffd4e9c44	server : pre-calculate EOG logit biases (#14721 ) ggml-ci	2025-07-16 14:04:12 +03:00
Georgi Gerganov	538cc77f7f	server : fix handling of the ignore_eos flag (#14710 ) ggml-ci	2025-07-16 12:13:57 +03:00
Johannes Gäßler	5cae766541	scripts: synthetic prompt mode for server-bench.py (#14695 )	2025-07-16 09:33:28 +02:00
Johannes Gäßler	494c5899cb	scripts: benchmark for HTTP server throughput (#14668 ) * scripts: benchmark for HTTP server throughput * fix server connection reset	2025-07-14 13:14:30 +02:00
Douglas Hanley	0c1df14b5f	server : fix pooled embedding output (#14645 )	2025-07-12 13:21:02 +03:00
Alawode Oluwandabira	17a1f0d2d4	server: Add ability to mount server at prefix (#14544 ) * Add server_prefix * Correct server path env * Rename cli flag to --api-prefix * Change all to api_prefix	2025-07-08 11:47:33 +03:00
Sigbjørn Skjæret	ddef99522d	server : fix assistant prefilling when content is an array (#14360 )	2025-07-05 09:17:14 +02:00
Vedran Miletić	e9b6350e61	scripts : make the shell scripts cross-platform (#14341 )	2025-06-30 10:17:18 +02:00
matteo	caf5681fcb	server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196 ) * initial commit for handling extra template kwargs * enable_thinking and assistant prefill cannot be enabled at the same time * can set chat_template_kwargs in command line * added doc * fixed formatting * add support for extra context in generic template init * coding standard: common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * coding standard: common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestions from code review coding standard: cosmetic changes Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix merge conflict * chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context) * normalize environment variable name * simplify code * prefill cannot be used with thinking models * compatibility with the new reasoning-budget parameter * fix prefill for non thinking models --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>	2025-06-29 20:02:53 +02:00
Renat	83790b0e7e	server : fix appearance of the chats list context menu for Safari (#14322 )	2025-06-29 19:29:57 +02:00
Nigel Bosch	1b809cee22	server : move no API key doc to /health (#14352 )	2025-06-24 10:59:11 +02:00
Georgi Gerganov	7b50d589a8	kv-cells : fix tracking of seq_pos (#14339 ) * kv-cells : fix tracking of seq_pos during cache reuse ggml-ci * cont : improve error message ggml-ci * cont : add more comments	2025-06-23 12:27:35 +03:00
Sigbjørn Skjæret	88fc854b4b	llama : improve sep token handling (#14272 )	2025-06-20 14:04:09 +02:00
Georgi Gerganov	4c9fdfbe15	ubatch : new splitting logic (#14217 ) ggml-ci	2025-06-20 10:14:14 +03:00
aa956	d67341dc18	server : add server parameters for draft model cache type (#13782 ) Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>	2025-06-19 16:01:03 +03:00
Georgi Gerganov	89fea80d29	server : fix incorrect usage of llama_get_embeddings() (#14225 ) * server : fix incorrect usage of llama_get_embeddings() ggml-ci * cont : fix the fix ggml-ci	2025-06-16 22:33:27 +03:00
Georgi Gerganov	d3e64b9f49	llama : rework embeddings logic (#14208 ) * llama : rework embeddings logic ggml-ci * cont : fix rerank ggml-ci * cont : engrish [no ci] * cont : fix rerank ggml-ci * server : support both embeddings and completions with single model ggml-ci * cont : avoid embeddings_org ggml-ci	2025-06-16 14:14:00 +03:00
Eric Curtin	cd355eda7d	server : When listening on a unix domain socket don't print http:// and port (#14180 ) Instead show something like this: main: server is listening on file.sock - starting the main loop Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-06-15 23:36:22 +02:00
Georgi Gerganov	ffad043973	server : fix SWA condition for full context reprocess (#14163 ) ggml-ci	2025-06-13 11:18:25 +03:00
Georgi Gerganov	7d516443dd	server : re-enable SWA speculative decoding (#14131 ) ggml-ci	2025-06-12 11:51:38 +03:00
Aman	7781e5fe99	webui: Wrap long numbers instead of infinite horizontal scroll (#14062 ) * webui: Wrap long numbers instead of infinite horizontal scroll * Use tailwind class * update index.html.gz	2025-06-11 16:42:25 +02:00
Taylor	2baf07727f	server : pass default --keep argument (#14120 )	2025-06-11 13:43:43 +03:00
Juk Armstrong	3a12db23b6	Fixed spec timings to: accepted/tested instead of accepted/drafted (#14104 )	2025-06-10 16:48:07 +01:00
R0CKSTAR	dc0623fddb	webui: fix sidebar being covered by main content (#14082 ) * webui: fix sidebar being covered by main content Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * webui: update index.html.gz Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-06-09 12:01:17 +02:00
Georgi Gerganov	87d34b381d	server : fix LRU check (#14079 ) ggml-ci	2025-06-09 12:57:58 +03:00
Georgi Gerganov	745aa5319b	llama : deprecate llama_kv_self_ API (#14030 ) * llama : deprecate llama_kv_self_ API ggml-ci * llama : allow llama_memory_(nullptr) ggml-ci * memory : add flag for optional data clear in llama_memory_clear ggml-ci	2025-06-06 14:11:15 +03:00
Georgi Gerganov	3637576288	server : disable speculative decoding for SWA models (#13970 ) * server : use swa-full fo draft context ggml-ci * server : disable speculative decoding for SWA models	2025-06-02 21:34:40 +03:00
Olivier Chafik	c9bbc77931	`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933 ) * server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat * update unit/test_tool_call.py::test_thoughts	2025-06-02 10:15:44 -07:00
Georgi Gerganov	3600cc2886	llama : use n_swa + n_ubatch cells for SWA cache (#13833 ) * llama : use n_swa + n_ubatch cells for SWA cache ggml-ci * llama : add warning about multi-sqeuence SWA contexts	2025-05-31 15:57:44 +03:00
igardev	c7e0a2054b	webui : Replace alert and confirm with custom modals. (#13711 ) * Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons. * use Modal Provider to simplify the use of confirm and alert modals. * Increase the z index of the modal dialogs. * Update index.html.gz * also add showPrompt * rebuild --------- Co-authored-by: igardev <ivailo.gardev@akros.ch> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-05-31 11:56:08 +02:00
Georgi Gerganov	3f55f781f1	llama : auto-batch preparation (#13845 ) * llama : auto-batch ggml-ci * context : simplify if branching	2025-05-31 12:55:57 +03:00
Xuan-Son Nguyen	51fa76f172	mtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917 ) * mtmd : fix missing public header * no object * apply suggestion from Georgi * rm mtmd-helper, merge it to mtmd * missing vendor include dir	2025-05-31 10:14:29 +02:00
Georgi Gerganov	12d0188c0d	kv-cache : refactor + add llama_memory_state_i (#13746 ) * kv-cache : simplify the "struct llama_kv_cache" interface ggml-ci * kv-cache : revert the (n_swa + n_ubatch) change (for next PR) ggml-ci * kv-cache : some comments ggml-ci * context : fix graph reserve for multiple sequences ggml-ci * kv-cache : fix typo [no ci] * kv-cache : fix find_slot() logic for free slots ggml-ci * llama : add TODO for deprecating the defrag API in the future * kv-cache : improve find_slot() using min/max seq pos info ggml-ci * llama : handle aborts and compute errors ggml-ci * memory : extract state into llama_memory_state ggml-ci * kv-cache : add comments ggml-ci * server : update batching logic to reset n_batch on successful decode * server : upon full re-processing, remove the sequence from the cache * kv-cache : add TODO for doing split_equal when split_simple fails ggml-ci	2025-05-31 10:24:04 +03:00
Georgi Gerganov	53f925074d	sync : vendor (#13901 ) * sync : vendor ggml-ci * cont : fix httplib version ggml-ci * cont : fix lint * cont : fix lint * vendor : move to common folder /vendor ggml-ci * cont : fix lint * cont : move httplib to /vendor + use json_fwd.hpp ggml-ci * cont : fix server build ggml-ci * cont : add missing headers ggml-ci * cont : header clean-up ggml-ci	2025-05-30 16:25:45 +03:00
Xuan-Son Nguyen	10961339b2	mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866 ) * mtmd : move helpers to dedicated library * fix server build * rm leftover cmakelist code	2025-05-28 22:35:22 +02:00
Đinh Trọng Huy	e0e3aa231d	llama : add support for BertForSequenceClassification reranker (#13858 ) * convert: add support for BertForSequenceClassification * add support for reranking using BertForSequenceClassification * merge checks of eos and sep * fix lint --------- Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>	2025-05-28 19:01:58 +02:00
Sky	c962ae3382	server: fix remove 'image_url'/'input_audio' json-object effectlly for 'llama_params' in multimodal-model-mode (#13853 ) [fix]: remove 'image_url'/'input_audio' effectlly for 'llama_params' in multimodal-model-mode	2025-05-28 16:33:54 +02:00

188 Commits