llama.cpp

Commit Graph

Author	SHA1	Message	Date
Oliver Simons	244880ae3a	CUDA: Use standard-compliant preprocessor for MSVC builds Workarounds of https://github.com/NVIDIA/cccl/pull/6791 will not be backported to CCCL 3.2, only the diagnostics/error messages will: https://github.com/NVIDIA/cccl/pull/6827	2025-12-02 11:23:14 +01:00
Oliver Simons	559d058dd2	CUDA: Move cccl fetch to after cuda has been enabled in CMakeLists.txt This will allow cccl to set build flags for the CUDA compiler, required e.g. for MSVC compat, see also https://github.com/NVIDIA/cccl/pull/6791	2025-12-02 11:23:14 +01:00
Daniel Bevenius	3e9a258c14	Merge remote-tracking branch 'upstream/master' into gpu-sampling	2025-12-02 09:26:04 +01:00
Daniel Bevenius	739b597804	sampling : fix backend temp sampler for zero temperature This commit fixes the implementation of the temperature-based sampler for the case when the temperature is set to zero. This now correctly selects the most probable token by masking out all other tokens in the logits.	2025-12-02 09:13:07 +01:00
Aman Gupta	ed32089927	ggml-cuda: reorder only relevant nodes (#17639 )	2025-12-02 12:36:31 +08:00
Aaron Teo	7b6d745364	release: fix duplicate libs, store symbolic links (#17299 )	2025-12-02 11:52:05 +08:00
Neo Zhang Jianyu	98bd9ab1e4	enhance argsort for UT (#17573 ) Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2025-12-02 08:56:46 +08:00
Piotr Wilkin (ilintar)	746f9ee889	Override SSM_A op for Qwen3 Next to reduce splits (#17587 ) * Override SSM_A op for Qwen3 Next to reduce splits * New tensor mapping SSM_A_NOSCAN for SSM_A used outside of OP_SSM_SCAN context. * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-02 00:43:13 +01:00
Jeff Bolz	9810cb8247	ops.md: update vulkan support (#17661 )	2025-12-01 15:26:21 -06:00
Xuan-Son Nguyen	ecf74a8417	mtmd: add mtmd_context_params::warmup option (#17652 ) * mtmd: add mtmd_context_params::warmup option * reuse the common_params::warmup	2025-12-01 21:32:25 +01:00
Gilad S.	00c361fe53	fix: llama arch implementation (#17665 )	2025-12-01 21:21:13 +01:00
Xuan-Son Nguyen	ec18edfcba	server: introduce API for serving / loading / unloading multiple models (#17470 ) * server: add model management and proxy * fix compile error * does this fix windows? * fix windows build * use subprocess.h, better logging * add test * fix windows * feat: Model/Router server architecture WIP * more stable * fix unsafe pointer * also allow terminate loading model * add is_active() * refactor: Architecture improvements * tmp apply upstream fix * address most problems * address thread safety issue * address review comment * add docs (first version) * address review comment * feat: Improved UX for model information, modality interactions etc * chore: update webui build output * refactor: Use only the message data `model` property for displaying model used info * chore: update webui build output * add --models-dir param * feat: New Model Selection UX WIP * chore: update webui build output * feat: Add auto-mic setting * feat: Attachments UX improvements * implement LRU * remove default model path * better --models-dir * add env for args * address review comments * fix compile * refactor: Chat Form Submit component * ad endpoint docs * Merge remote-tracking branch 'webui/allozaur/server_model_management_v1_2' into xsn/server_model_maagement_v1_2 Co-authored-by: Aleksander <aleksander.grygier@gmail.com> * feat: Add copy to clipboard to model name in model info dialog * feat: Model unavailable UI state for model selector * feat: Chat Form Actions UI logic improvements * feat: Auto-select model from last assistant response * chore: update webui build output * expose args and exit_code in API * add note * support extra_args on loading model * allow reusing args if auto_load * typo docs * oai-compat /models endpoint * cleaner * address review comments * feat: Use `model` property for displaying the `repo/model-name` naming format * refactor: Attachments data * chore: update webui build output * refactor: Enum imports * feat: Improve Model Selector 
responsiveness * chore: update webui build output * refactor: Cleanup * refactor: Cleanup * refactor: Formatters * chore: update webui build output * refactor: Copy To Clipboard Icon component * chore: update webui build output * refactor: Cleanup * chore: update webui build output * refactor: UI badges * chore: update webui build output * refactor: Cleanup * refactor: Cleanup * chore: update webui build output * add --models-allow-extra-args for security * nits * add stdin_file * fix merge * fix: Retrieve lost setting after resolving merge conflict * refactor: DatabaseStore -> DatabaseService * refactor: Database, Conversations & Chat services + stores architecture improvements (WIP) * refactor: Remove redundant settings * refactor: Multi-model business logic WIP * chore: update webui build output * feat: Switching models logic for ChatForm or when regenerating messges + modality detection logic * chore: update webui build output * fix: Add `untrack` inside chat processing info data logic to prevent infinite effect * fix: Regenerate * feat: Remove redundant settigns + rearrange * fix: Audio attachments * refactor: Icons * chore: update webui build output * feat: Model management and selection features WIP * chore: update webui build output * refactor: Improve server properties management * refactor: Icons * chore: update webui build output * feat: Improve model loading/unloading status updates * chore: update webui build output * refactor: Improve API header management via utility functions * remove support for extra args * set hf_repo/docker_repo as model alias when posible * refactor: Remove ConversationsService * refactor: Chat requests abort handling * refactor: Server store * tmp webui build * refactor: Model modality handling * chore: update webui build output * refactor: Processing state reactivity * fix: UI * refactor: Services/Stores syntax + logic improvements Refactors components to access stores directly instead of using exported getter functions. 
This change centralizes store access and logic, simplifying component code and improving maintainability by reducing the number of exported functions and promoting direct store interaction. Removes exported getter functions from `chat.svelte.ts`, `conversations.svelte.ts`, `models.svelte.ts` and `settings.svelte.ts`. * refactor: Architecture cleanup * feat: Improve statistic badges * feat: Condition available models based on modality + better model loading strategy & UX * docs: Architecture documentation * feat: Update logic for PDF as Image * add TODO for http client * refactor: Enhance model info and attachment handling * chore: update webui build output * refactor: Components naming * chore: update webui build output * refactor: Cleanup * refactor: DRY `getAttachmentDisplayItems` function + fix UI * chore: update webui build output * fix: Modality detection improvement for text-based PDF attachments * refactor: Cleanup * docs: Add info comment * refactor: Cleanup * re * refactor: Cleanup * refactor: Cleanup * feat: Attachment logic & UI improvements * refactor: Constants * feat: Improve UI sidebar background color * chore: update webui build output * refactor: Utils imports + move types to `app.d.ts` * test: Fix Storybook mocks * chore: update webui build output * test: Update Chat Form UI tests * refactor: Tooltip Provider from core layout * refactor: Tests to separate location * decouple server_models from server_routes * test: Move demo test to tests/server * refactor: Remove redundant method * chore: update webui build output * also route anthropic endpoints * fix duplicated arg * fix invalid ptr to shutdown_handler * server : minor * rm unused fn * add ?autoload=true\|false query param * refactor: Remove redundant code * docs: Update README documentations + architecture & data flow diagrams * fix: Disable autoload on calling server props for the model * chore: update webui build output * fix ubuntu build * fix: Model status reactivity * fix: Modality 
detection for MODEL mode * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-01 19:41:04 +01:00
Daniel Bevenius	988261b18d	examples : remove outdated backend sampling section This commit removes the outdated section about using backend samplers from the README.md file in the examples/batched.	2025-12-01 18:20:41 +01:00
Georgi Gerganov	88cca45bb8	sampling : fix top_p empty condition	2025-12-01 18:02:34 +02:00
Georgi Gerganov	04f2822a86	sampling : do not create empty samplers	2025-12-01 17:52:07 +02:00
Georgi Gerganov	4032ce2378	common : simplify sampler chain initialization	2025-12-01 17:11:11 +02:00
Oliver Simons	217469f07f	Make backend's top_p sampler inclusive In addition to match the algorithm proposed in the original [paper](https://arxiv.org/abs/1904.09751), this resolves the edge-case where `max_p is > top_p` for a single logit, where the mask would otherwise be empty (and we thus sample from the whole vocabulary with equal likelihood)	2025-12-01 15:28:06 +01:00
Oliver Simons	ae0bb6a6da	Factor out `ggml_sort` into its own function	2025-12-01 15:28:06 +01:00
Xuan-Son Nguyen	7733409734	common: improve verbosity level definitions (#17630 ) * common: improve verbosity level definitions * string_format * update autogen docs	2025-12-01 14:38:13 +01:00
Georgi Gerganov	16451d6bc3	Merge branch 'master' into HEAD	2025-12-01 14:47:50 +02:00
Xuan-Son Nguyen	cd3c118908	model: support Ministral3 (#17644 ) * conversion script * support ministral 3 * maybe this is better? * add TODO for rope_yarn_log_mul * better ppl (tested on 14B-Instruct) * Add Ministral3 support to Mistral format * improve arch handling * add sizes * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * nits --------- Co-authored-by: Julien Denize <julien.denize@mistral.ai> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-01 12:26:52 +01:00
Oliver Simons	8bee483c97	Fix backend_top_p_sampler softmax(softmax) will return uniform distribution, so we should not return the softmax but the logits instead.	2025-12-01 12:07:30 +01:00
Georgi Gerganov	649495c9d9	metal : add FA head size 48 (#17619 )	2025-12-01 12:49:53 +02:00
Georgi Gerganov	90c72a614a	ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (#17617 )	2025-12-01 12:49:33 +02:00
Aman Gupta	6eea666912	llama-graph: avoid expand_forward for fusion (#17633 )	2025-12-01 11:12:48 +02:00
Daniel Bevenius	cf0e1475c5	sampling : lower log level for output buffer reallocations [no ci] This commit changes the logging level for output buffer reallocations in the llama_context::output_reserve function from INFO to DEBUG. The motivation for this is that it currently logs to info and when enabling verbose logging for llama-cli this will get mixed with the output, for example: ```console What is the capital of Sweden?output_reserve: reallocating output buffer from size 0.58 MiB to 1.74 MiB 1. Stockholm 2. Helsinki Based are the options 1. Stockholm Explanation: Stockholm is the capital of ... ```	2025-12-01 09:13:47 +01:00
Xuan-Son Nguyen	ff90508d68	contributing: update guidelines for AI-generated code (#17625 ) * contributing: update guidelines for AI-generated code * revise	2025-11-30 22:51:34 +01:00
Adrien Gallouët	0a4aeb927d	cmake : add option to build and link LibreSSL (#17552 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-11-30 22:14:32 +01:00
Tarek Dakhran	2ba719519d	model: LFM2-VL fixes (#17577 ) * Adjust to pytorch * Add antialiasing upscale * Increase number of patches to 1024 * Handle default marker insertion for LFM2 * Switch to flag * Reformat * Cuda implementation of antialias kernel * Change placement in ops.cpp * consistent float literals * Pad only for LFM2 * Address PR feedback * Rollback default marker placement changes * Fallback to CPU implementation for antialias implementation of upscale	2025-11-30 21:57:31 +01:00
Xuan-Son Nguyen	7f8ef50cce	clip: fix nb calculation for qwen3-vl (#17594 )	2025-11-30 15:33:55 +01:00
Xuan-Son Nguyen	3c136b21a3	cli: add migration warning (#17620 )	2025-11-30 15:32:43 +01:00
Adrien Gallouët	beb1f0c503	common : throttle download progress output to reduce IO flush (#17427 ) This change limits progress updates to approximately every 0.1% of the file size to minimize stdio overhead. Also fixes compiler warnings regarding __func__ in lambdas. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-11-30 14:22:44 +02:00
Aaron Teo	def5404f26	common: add LLAMA_LOG_FILE env var (#17609 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-11-30 12:12:32 +01:00
Georgi Gerganov	80742cbaeb	cont : naming	2025-11-30 11:24:30 +02:00
Gilad S.	fa0465954f	ggml: fix: macOS build with `-DGGML_BACKEND_DL=ON` (#17581 )	2025-11-30 10:00:59 +08:00
ddh0	5a6241feb0	common: update env var name (#17588 )	2025-11-30 09:59:25 +08:00
Aman Gupta	c7af376c29	CUDA: add stream-based concurrency (#16991 ) * CUDA: add stream-based concurrency * HIP: fix hipStreamWaitEvent define and nodiscard warnings * ggml-cuda: fix fusion inside stream * ggml-cuda: fix bug w.r.t first stream launch * ggml-cuda: format * ggml-cuda: improve assert message * ggml-cuda: use lambda instead of duplicating code * ggml-cuda: add some more comments * ggml-cuda: add more detailed comments about concurrency * ggml-cuda: rename + remove unused var * ggml-cuda: fix condition for stream launch * ggml-cuda: address review comments, add destructor * common.cuh: add is_valid for concurrent events * common.cuh: make comment better * update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * common.cuh: fix lower_bound condition + remove join_node data from write_ranges * ggml-cuda: fix overlap condition + shadowing parameter --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-11-30 08:17:55 +08:00
Mahekk Shaikh	00425e2ed1	cuda : add error checking for cudaMemcpyAsync in argsort (#17599 ) * cuda : add error checking for cudaMemcpyAsync in argsort (#12836) * fix indentation	2025-11-30 08:16:28 +08:00
Acly	385c3da5e6	vulkan : fix FA mask load with bounds check (coopmat2) (#17606 )	2025-11-30 01:03:21 +01:00
Georgi Gerganov	c187003d81	llama : naming	2025-11-30 00:05:47 +02:00
Georgi Gerganov	1760bd69b3	llama : reserve graphs with samplers	2025-11-29 23:57:25 +02:00
Georgi Gerganov	467746e3ad	Merge branch 'master' into HEAD	2025-11-29 23:17:25 +02:00
Georgi Gerganov	ff7b0bf632	llama : call backend_init once	2025-11-29 23:09:53 +02:00
Xuan-Son Nguyen	ab49f094d2	server: move server-context to its own cpp|h (#17595 ) * git mv * add server-context.h * add server-context.h * clean up headers * cont : cleanup * also expose server_response_reader (to be used by CLI) * fix windows build * decouple server_routes and server_http --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-29 22:04:44 +01:00
Georgi Gerganov	d8d98bb4bb	Merge branch 'master' into HEAD	2025-11-29 22:38:44 +02:00
Georgi Gerganov	9028ebfea8	llama : cleanup + naming	2025-11-29 22:37:07 +02:00
Haiyue Wang	8c32d9d96d	server: explicitly set the function name in lambda (#17538 ) As [1] explained, the real debug message will be like: "res operator(): operator() : queue result stop" Set the name explicitly, the message is easy for debugging: "res operator(): recv : queue result stop" The left "operator()" is generated by 'RES_DBG() ... __func__' [1]: https://clang.llvm.org/extra/clang-tidy/checks/bugprone/lambda-function-name.html Signed-off-by: Haiyue Wang <haiyuewa@163.com>	2025-11-29 18:43:29 +01:00
Igor Smirnov	0874693b44	common : fix json schema with '\' in literals (#17307 ) * Fix json schema with '\' in literals * Add "literal string with escapes" test	2025-11-29 17:06:32 +01:00
Georgi Gerganov	fbc8f49f3c	llama : simplify	2025-11-29 17:01:00 +02:00
Neo Zhang	7d2add51d8	sycl : support to malloc memory on device more than 4GB, update the doc and script (#17566 ) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2025-11-29 14:59:44 +02:00

7307 Commits

All Branches