Removes heavy penalty checks (repetition, frequency, presence, DRY) from
`common_sampler_sample_speculative`. The specialized speculative sampler now
uses a pure argmax (greedy) approach, which significantly reduces CPU overhead
during the drafting phase and improves overall tokens per second.
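A minimal sketch of what the pure-argmax pick looks like; `pick_greedy` is illustrative, not the actual `common_sampler_sample_speculative` implementation:

```cpp
#include <cstdint>

typedef int32_t llama_token; // mirrors llama.h; declared here so the sketch stands alone

// Hypothetical helper: pick the draft token with a single linear scan over
// the logits. With the penalty samplers gone, this scan is essentially all
// the CPU work the drafting phase pays for per token.
static llama_token pick_greedy(const float * logits, int n_vocab) {
    llama_token best = 0;
    for (int i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}
```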
Adds a new `mtp` boolean to `llama_model_params`. When set to false (default):
1. The loader skips loading MTP-specific tensors (NextN layers) using `TENSOR_SKIP`.
2. The KV cache size calculation excludes the MTP layer (`n_layer_kv_from_start`).
This reduces VRAM usage and load time for users running GLM-4.5/4.6 in standard generation mode.
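A hedged usage sketch: the `mtp` field follows this change's description (it is not upstream llama.cpp API), while the surrounding calls are the stock public loading API:

```cpp
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    mparams.mtp = false; // default: skip NextN/MTP tensors and their KV cache slice

    // With mtp == false the loader marks the NextN tensors with TENSOR_SKIP and
    // sizes the KV cache without the MTP layer, saving VRAM and load time.
    llama_model * model = llama_model_load_from_file("glm-4.6.gguf", mparams);
    if (model) {
        llama_model_free(model);
    }
}
```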
GLM-4.6 models exclude specific MTP tensors (`embed_tokens` and `shared_head_head`), implying weight tying with the main model. Previously, this caused a crash when building the graph.
This commit adds a fallback mechanism to use the main model's token embeddings and output head when the MTP-specific tensors are missing.
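A hedged sketch of that fallback; the struct and helper below are illustrative, not the actual loader code:

```cpp
// A missing MTP tensor is interpreted as weight tying with the main model.
struct ggml_tensor; // opaque, as in ggml

struct mtp_weights {
    ggml_tensor * embd_tokens;      // may be absent in GLM-4.6 GGUFs
    ggml_tensor * shared_head_head; // may be absent in GLM-4.6 GGUFs
};

static void resolve_mtp_fallbacks(mtp_weights & mtp,
                                  ggml_tensor * tok_embd,  // main model embeddings
                                  ggml_tensor * output) {  // main model output head
    // Instead of crashing at graph build on a null tensor, fall back to the
    // main model's tied weights.
    if (!mtp.embd_tokens)      { mtp.embd_tokens      = tok_embd; }
    if (!mtp.shared_head_head) { mtp.shared_head_head = output; }
}
```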
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Dec 7 23:00:29 2025 -0300
speculative (feat): implement recursive MTP drafting for GLM-4.5
commit bdf72d9552e3da64ffc85f175664713388752914
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 16:10:16 2025 -0300
sampling (feat): optimize speculative drafting with fast-path selection
commit a91980a8f3475a6bbac0a64d8be06dd4b613020e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 15:18:19 2025 -0300
mtp (chore): clean old code
commit 6de0ecf55db8567db4faa99b0152b72c9e854548
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 14:40:13 2025 -0300
mtp (feat): add mtp arg
commit ea77394183b8e6c368af969b8274039a54b11486
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 13:47:54 2025 -0300
mtp-graph (fix): move llama_get_logits_ith outside the loop
commit 15dff208958fb66802f20ec53ce5fcaff133edb7
Merge: 171346c74 cae85fe53
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:44:41 2025 -0300
Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache
commit cae85fe531
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:42:31 2025 -0300
mtp-batch(fix): avoid logits for mtp kv cache operations
commit 171346c742c310bbcfbd786b61250638ccf8b44d
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 12 16:33:01 2025 -0300
mtp-graph(feat): Reactivate graph reuse only for main model path
commit 0127c6beeb
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 22:20:54 2025 -0300
mtp-batch(chore): Remove final MTP debug logs and dead code
commit 4bcc9e261e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:51:22 2025 -0300
mtp-batch(fix): Correctly advance cache head and add MTP documentation
commit b4cbe030ac
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:37:40 2025 -0300
mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs
commit a99709d0c1
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 17:24:34 2025 -0300
mtp-batch(refactor): Extract decode context and MTP input logic into helper methods
commit 913af8f48d
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 16:44:28 2025 -0300
mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum
commit 6f74ba3807
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 22:27:18 2025 -0300
mtp-batch (fix): prevent mtp draft from polluting the cache
commit 5e1d719bef
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 15:21:23 2025 -0300
mtp-batch (feat): Create and manage sinfo for MTP
commit febd8235d2
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 5 14:43:40 2025 -0300
mtp-batch (wip): fix how to warmup kv cache for MTP
commit 67c6c069e0
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 19:42:32 2025 -0300
mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption
commit 75dc25e6fe
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 17:17:00 2025 -0300
mtp-batch (wip): organize batch for mtp cache
commit 3da7e7f330
Author: samuel <samueloliveira32df@gmail.com>
Date: Tue Sep 23 22:45:11 2025 -0300
mtp-batch (fix): warm mtp cache for small batch size
commit df64508b93
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:55:41 2025 -0300
mtp-batch (wip): merge glm graphs
commit 042eb8a829
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:29:00 2025 -0300
mtp-batch (wip): merge mtp and model graph
commit 1318b2de82
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 14 10:22:59 2025 -0300
mtp-batch (wip): move mtp execution to batch format
commit c6237c71ff
Merge: 9fab53e438 8742ce0e39
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sat Sep 13 02:57:01 2025 -0400
Merge pull request #1 from SamuelOliveirads/glm4-moe-mtp
feat: implemented sampling for MTP
commit 8742ce0e39
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 6 00:21:18 2025 -0300
feat: apply logits + greedy sampler
commit 5a5bce8577
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 17:56:14 2025 -0300
fix: add sample acceptance
commit 07670a22c6
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 13:25:21 2025 -0300
feat: implemented sampling for MTP
commit 9fab53e438
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Sep 2 17:14:09 2025 -0400
fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch
commit 98bc0c6bf2
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 26 01:26:51 2025 -0400
replace standard sampler with greedy sampler for mtp draft
commit 471e026327
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 23:10:56 2025 -0400
fixed vram leak
commit d72f9d5691
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 01:50:34 2025 -0400
kludge-y kv cache management of mtp layer
commit 382135aa36
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 21:54:45 2025 -0400
fixed mtp kv cache update sequencing after prompt processing
commit 6870f9790c
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 04:59:36 2025 -0400
added proper KV cache management for MTP layers and slightly refactored
commit 6e9bafc7a7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Fri Aug 15 23:13:56 2025 -0400
failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable
commit cf0f7c0448
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Wed Aug 13 02:21:17 2025 -0400
broad thrust of the mtp implementation
commit 03231da69e
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 12 01:03:59 2025 -0400
add model member function to build mtp graph, to be called from speculative.cpp
commit 1f477b3755
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 20:54:45 2025 -0400
make nextn weights loadable without a crash
commit e434f87cc7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 01:21:47 2025 -0400
some work towards building mtp layer graph
commit db60623e79
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 10 23:52:54 2025 -0400
added getter for nextn layer count and server slot has_mtp property
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host-visible vidmem I think the cost of
allocating/mapping vidmem shifts and becomes more expensive, so I don't see a
benefit by default. With GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1, however, I do
see a significant improvement in model loading time.
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both the CUDA and Vulkan backends, which assumed that the
input to argsort and the input to get_rows are the same tensor (sketched
below). I'd like to optimize this graph in a follow-up change, but for now
just get it functional.
CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
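A rough sketch of the graph pattern that exposed the bug, using the public ggml API; the tensor names and the bias input are illustrative:

```cpp
#include "ggml.h"

// argsort ranks the *biased* scores while get_rows gathers from the
// *original* probabilities, so a fused top-k kernel must treat them as two
// independent inputs rather than one shared pointer.
static ggml_tensor * select_expert_weights(ggml_context * ctx,
                                           ggml_tensor * probs,   // routing probabilities
                                           ggml_tensor * bias) {  // exp_probs_b-style bias
    ggml_tensor * scores = ggml_add(ctx, probs, bias);
    ggml_tensor * idx    = ggml_argsort(ctx, scores, GGML_SORT_ORDER_DESC);
    return ggml_get_rows(ctx, probs, idx); // input differs from argsort's input
}
```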
* Some improvements to mul_mat_iq2_xs
Refactor calculations for db values and grid data to optimize performance and reduce redundancy.
* Fix trailing whitespace
* implement sleeping at queue level (see the sketch after this list)
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
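A minimal illustration of what sleeping at the queue level could look like; the `task_queue` struct and its methods are hypothetical, not the server's actual types:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical task-queue suspend: while asleep, workers block in
// wait_if_sleeping() so the heavyweight server context can be torn down;
// wake() restores service.
struct task_queue {
    std::mutex mtx;
    std::condition_variable cv;
    bool sleeping = false;

    void sleep() {
        std::lock_guard<std::mutex> l(mtx);
        sleeping = true;
    }
    void wake() {
        { std::lock_guard<std::mutex> l(mtx); sleeping = false; }
        cv.notify_all();
    }
    void wait_if_sleeping() {
        std::unique_lock<std::mutex> l(mtx);
        cv.wait(l, [&] { return !sleeping; });
    }
};
```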
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
simplified to check first.length() <= last.length() only (sketched below)
fixed ordering for --sampler-seq, --rerank, and --draft
note: middle positions in sets of 3+ args are not verified
* arg: update doc
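The ordering check reduces to comparing spelling lengths; a small sketch, assuming each argument's spellings are stored as a vector of strings (the helper name is made up):

```cpp
#include <string>
#include <vector>

// For {"-m", "--model"} the first spelling must not be longer than the last,
// so the short form is listed before the long form. Middle entries of sets
// with 3+ spellings are deliberately left unchecked.
static bool short_form_first(const std::vector<std::string> & args) {
    return args.empty() || args.front().length() <= args.back().length();
}
```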
* llama-server: friendlier error msg when ctx < input
This PR adds formatted strings to the server's send_error function (see the sketch after this list)
* llama-server: use string_format inline
* fix test
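A self-contained sketch of the formatted error message; the stand-in `string_format` mimics the server's real helper of the same name, and the message text is invented:

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Minimal stand-in so the snippet compiles on its own; llama.cpp's common
// code already provides its own string_format.
static std::string string_format(const char * fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    char buf[256];
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    return buf;
}

int main() {
    int n_prompt = 9000, n_ctx = 4096;
    // Hypothetical wording: tell the user *by how much* the input exceeds
    // the context window instead of a bare "input too large".
    std::string msg = string_format(
        "the request exceeds the available context size (%d > %d), "
        "try increasing the context size", n_prompt, n_ctx);
    printf("%s\n", msg.c_str());
}
```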
* presets: refactor, allow cascade presets from different sources (see the sketch after this list)
* update docs
* fix neg arg handling
* fix empty mmproj
* also filter out server-controlled args before to_ini()
* skip loading custom_models if not specified
* fix unset_reserved_args
* fix crash on windows
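One plausible reading of "cascade presets": later, more specific sources override earlier ones. A sketch with hypothetical types:

```cpp
#include <map>
#include <string>
#include <vector>

using preset = std::map<std::string, std::string>;

// Apply preset sources in priority order: entries from later (more specific)
// sources overwrite entries from earlier ones.
static preset merge_presets(const std::vector<preset> & sources) {
    preset out;
    for (const auto & src : sources) {          // lowest priority first
        for (const auto & kv : src) {
            out[kv.first] = kv.second;          // later source wins
        }
    }
    return out;
}
```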
This commit adds a --verbose flag to the run-org-model.py script to
enable or disable detailed debug output, such as input and output
tensors for each layer. Debug utilities (summarize, debug_hook,
setup_rope_debug) have been moved to utils/common.py.
The motivation for this is that the detailed debug output can be useful
for diagnosing issues with model conversion or execution, but it can
also produce a large amount of output that may not always be needed.
The script will also be further cleaned/refactored in follow-up commits.
This implements a variation of the perf logger where, rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap (sketched below). This can be useful for understanding
whether individual operations need to be optimized or whether the group is
already running efficiently.
GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).
GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
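A schematic of the concurrent mode's idea, not the Vulkan backend's code: the clock starts at one existing sync point and stops at the next, so overlapping ops are measured as a group:

```cpp
#include <chrono>
#include <cstdio>

// Schematic only: instead of a barrier + timer around every op, measure the
// span between two synchronization points that the backend already has, so
// concurrent ops are timed as the group they actually execute as.
struct group_timer {
    std::chrono::steady_clock::time_point t0;

    void begin() { t0 = std::chrono::steady_clock::now(); }
    void end(const char * label, int n_ops) {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - t0).count();
        printf("%s: %d overlapping ops in %lld us\n", label, n_ops, (long long) us);
    }
};
```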
* Uncached model read
* Removing additional --mmap arg
* Removing trailing whitespaces
* Adding fallback when O_DIRECT is not supported (see the sketch after this list)
* Remove branching in llama-model-loader.cpp and reduce code duplication in llama-mmap.cpp
* Adding the [[maybe_unused]] attribute for Mac and Windows.
* File seek aligned
* Removing all branches for direct_io in llama-model-loader.cpp
* Always use alignment from llama_file
* use_mmap=true
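A sketch of the open-time fallback on POSIX; the helper name is illustrative, and the real llama-mmap.cpp handles more platforms plus the aligned-I/O details:

```cpp
#include <fcntl.h>

// Try O_DIRECT first to bypass the page cache; if the platform or filesystem
// rejects it (commonly EINVAL at open time), fall back to a normal buffered
// open. O_DIRECT additionally requires aligned offsets/buffers, hence the
// aligned file seek mentioned above.
static int open_uncached(const char * path) {
#ifdef O_DIRECT
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        return fd;
    }
#endif
    return open(path, O_RDONLY); // fallback when O_DIRECT is unsupported
}
```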
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance (see the sketch below)
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
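A sketch of the parse-once pattern described under "Backend changes"; the holder type is hypothetical, while nlohmann::json matches what the server already uses:

```cpp
#include <cstdio>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// --webui-config is parsed a single time at model load and cached, so /props
// can serve the settings with no per-request JSON work.
struct webui_config_holder {
    json webui_settings = json::object();

    bool load(const std::string & inline_json) {
        try {
            webui_settings = json::parse(inline_json);
            return true;
        } catch (const std::exception & e) {
            // Router mode wraps parsing in try/catch so a bad config fails
            // loudly at startup instead of crashing a /props handler later.
            fprintf(stderr, "invalid --webui-config: %s\n", e.what());
            return false;
        }
    }
};
```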