Removes heavy penalty checks (repetition, frequency, presence, DRY) from
`common_sampler_sample_speculative`. The specialized speculative sampler now
uses a pure argmax (greedy) approach, which significantly reduces CPU overhead
during the drafting phase and improves overall tokens per second.
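A minimal sketch of what the pure-argmax pick looks like; `pick_greedy` is illustrative, not the actual `common_sampler_sample_speculative` implementation:

```cpp
#include <cstdint>

typedef int32_t llama_token; // mirrors llama.h; declared here so the sketch stands alone

// Hypothetical helper: pick the draft token with a single linear scan over
// the logits. With the penalty samplers gone, this scan is essentially all
// the CPU work the drafting phase pays for per token.
static llama_token pick_greedy(const float * logits, int n_vocab) {
    llama_token best = 0;
    for (int i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}
```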
Adds a new `mtp` boolean to `llama_model_params`. When set to false (default):
1. The loader skips loading MTP-specific tensors (NextN layers) using `TENSOR_SKIP`.
2. The KV cache size calculation excludes the MTP layer (`n_layer_kv_from_start`).
This reduces VRAM usage and load time for users running GLM-4.5/4.6 in standard generation mode.
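A hedged usage sketch: the `mtp` field follows this change's description (it is not upstream llama.cpp API), while the surrounding calls are the stock public loading API:

```cpp
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    mparams.mtp = false; // default: skip NextN/MTP tensors and their KV cache slice

    // With mtp == false the loader marks the NextN tensors with TENSOR_SKIP and
    // sizes the KV cache without the MTP layer, saving VRAM and load time.
    llama_model * model = llama_model_load_from_file("glm-4.6.gguf", mparams);
    if (model) {
        llama_model_free(model);
    }
}
```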
GLM-4.6 models exclude specific MTP tensors (`embed_tokens` and `shared_head_head`), implying weight tying with the main model. Previously, this caused a crash when building the graph.
This commit adds a fallback mechanism to use the main model's token embeddings and output head when the MTP-specific tensors are missing.
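A hedged sketch of that fallback; the struct and helper below are illustrative, not the actual loader code:

```cpp
// A missing MTP tensor is interpreted as weight tying with the main model.
struct ggml_tensor; // opaque, as in ggml

struct mtp_weights {
    ggml_tensor * embd_tokens;      // may be absent in GLM-4.6 GGUFs
    ggml_tensor * shared_head_head; // may be absent in GLM-4.6 GGUFs
};

static void resolve_mtp_fallbacks(mtp_weights & mtp,
                                  ggml_tensor * tok_embd,  // main model embeddings
                                  ggml_tensor * output) {  // main model output head
    // Instead of crashing at graph build on a null tensor, fall back to the
    // main model's tied weights.
    if (!mtp.embd_tokens)      { mtp.embd_tokens      = tok_embd; }
    if (!mtp.shared_head_head) { mtp.shared_head_head = output; }
}
```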
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Dec 7 23:00:29 2025 -0300
speculative (feat): implement recursive MTP drafting for GLM-4.5
commit bdf72d9552e3da64ffc85f175664713388752914
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 16:10:16 2025 -0300
sampling (feat): optimize speculative drafting with fast-path selection
commit a91980a8f3475a6bbac0a64d8be06dd4b613020e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 15:18:19 2025 -0300
mtp (chore): clean old code
commit 6de0ecf55db8567db4faa99b0152b72c9e854548
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 14:40:13 2025 -0300
mtp (feat): add mtp arg
commit ea77394183b8e6c368af969b8274039a54b11486
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 13:47:54 2025 -0300
mtp-graph (fix): move llama_get_logits_ith outside the loop
commit 15dff208958fb66802f20ec53ce5fcaff133edb7
Merge: 171346c74 cae85fe53
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:44:41 2025 -0300
Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache
commit cae85fe531
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:42:31 2025 -0300
mtp-batch(fix): avoid logits for mtp kv cache operations
commit 171346c742c310bbcfbd786b61250638ccf8b44d
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 12 16:33:01 2025 -0300
mtp-graph(feat): Reactivate graph reuse only for main model path
commit 0127c6beeb
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 22:20:54 2025 -0300
mtp-batch(chore): Remove final MTP debug logs and dead code
commit 4bcc9e261e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:51:22 2025 -0300
mtp-batch(fix): Correctly advance cache head and add MTP documentation
commit b4cbe030ac
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:37:40 2025 -0300
mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs
commit a99709d0c1
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 17:24:34 2025 -0300
mtp-batch(refactor): Extract decode context and MTP input logic into helper methods
commit 913af8f48d
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 16:44:28 2025 -0300
mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum
commit 6f74ba3807
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 22:27:18 2025 -0300
mtp-batch (fix): prevent mtp draft from polluting the cache
commit 5e1d719bef
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 15:21:23 2025 -0300
mtp-batch (feat): Create and manage sinfo for MTP
commit febd8235d2
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 5 14:43:40 2025 -0300
mtp-batch (wip): fix how to warmup kv cache for MTP
commit 67c6c069e0
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 19:42:32 2025 -0300
mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption
commit 75dc25e6fe
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 17:17:00 2025 -0300
mtp-batch (wip): organize batch for mtp cache
commit 3da7e7f330
Author: samuel <samueloliveira32df@gmail.com>
Date: Tue Sep 23 22:45:11 2025 -0300
mtp-batch (fix): warm mtp cache for small batch size
commit df64508b93
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:55:41 2025 -0300
mtp-batch (wip): merge glm graphs
commit 042eb8a829
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:29:00 2025 -0300
mtp-batch (wip): merge mtp and model graph
commit 1318b2de82
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 14 10:22:59 2025 -0300
mtp-batch (wip): move mtp execution to batch format
commit c6237c71ff
Merge: 9fab53e438 8742ce0e39
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sat Sep 13 02:57:01 2025 -0400
Merge pull request #1 from SamuelOliveirads/glm4-moe-mtp
feat: implemented sampling for MTP
commit 8742ce0e39
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 6 00:21:18 2025 -0300
feat: apply logits + greedy sampler
commit 5a5bce8577
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 17:56:14 2025 -0300
fix: add sample acceptance
commit 07670a22c6
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 13:25:21 2025 -0300
feat: implemented sampling for MTP
commit 9fab53e438
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Sep 2 17:14:09 2025 -0400
fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch
commit 98bc0c6bf2
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 26 01:26:51 2025 -0400
replace standard sampler with greedy sampler for mtp draft
commit 471e026327
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 23:10:56 2025 -0400
fixed vram leak
commit d72f9d5691
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 01:50:34 2025 -0400
kludge-y kv cache management of mtp layer
commit 382135aa36
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 21:54:45 2025 -0400
fixed mtp kv cache update sequencing after prompt processing
commit 6870f9790c
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 04:59:36 2025 -0400
added proper KV cache management for MTP layers and slightly refactored
commit 6e9bafc7a7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Fri Aug 15 23:13:56 2025 -0400
failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable
commit cf0f7c0448
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Wed Aug 13 02:21:17 2025 -0400
broad thrust of the mtp implementation
commit 03231da69e
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 12 01:03:59 2025 -0400
add model member function to build mtp graph, to be called from speculative.cpp
commit 1f477b3755
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 20:54:45 2025 -0400
make nextn weights loadable without a crash
commit e434f87cc7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 01:21:47 2025 -0400
some work towards building mtp layer graph
commit db60623e79
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 10 23:52:54 2025 -0400
added getter for nextn layer count and server slot has_mtp property
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host-visible vidmem I think the cost of
allocating/mapping vidmem shifts and becomes more expensive, so I don't see a
benefit by default. With GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1, however, I do
see a significant improvement in model loading time.
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both the CUDA and Vulkan backends, which assumed that the
input to argsort and the input to get_rows are the same tensor (sketched
below). I'd like to optimize this graph in a follow-up change, but for now
just get it functional.
CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
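A rough sketch of the graph pattern that exposed the bug, using the public ggml API; the tensor names and the bias input are illustrative:

```cpp
#include "ggml.h"

// argsort ranks the *biased* scores while get_rows gathers from the
// *original* probabilities, so a fused top-k kernel must treat them as two
// independent inputs rather than one shared pointer.
static ggml_tensor * select_expert_weights(ggml_context * ctx,
                                           ggml_tensor * probs,   // routing probabilities
                                           ggml_tensor * bias) {  // exp_probs_b-style bias
    ggml_tensor * scores = ggml_add(ctx, probs, bias);
    ggml_tensor * idx    = ggml_argsort(ctx, scores, GGML_SORT_ORDER_DESC);
    return ggml_get_rows(ctx, probs, idx); // input differs from argsort's input
}
```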
* Some improvements to mul_mat_iq2_xs
Refactor calculations for db values and grid data to optimize performance and reduce redundancy.
* Fix trailing whitespace
* implement sleeping at queue level (see the sketch after this list)
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
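A minimal illustration of what sleeping at the queue level could look like; the `task_queue` struct and its methods are hypothetical, not the server's actual types:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical task-queue suspend: while asleep, workers block in
// wait_if_sleeping() so the heavyweight server context can be torn down;
// wake() restores service.
struct task_queue {
    std::mutex mtx;
    std::condition_variable cv;
    bool sleeping = false;

    void sleep() {
        std::lock_guard<std::mutex> l(mtx);
        sleeping = true;
    }
    void wake() {
        { std::lock_guard<std::mutex> l(mtx); sleeping = false; }
        cv.notify_all();
    }
    void wait_if_sleeping() {
        std::unique_lock<std::mutex> l(mtx);
        cv.wait(l, [&] { return !sleeping; });
    }
};
```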
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
simplified to check first.length() <= last.length() only (sketched below)
fixed ordering for --sampler-seq, --rerank, and --draft
note: middle positions in sets of 3+ args are not verified
* arg: update doc
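The ordering check reduces to comparing spelling lengths; a small sketch, assuming each argument's spellings are stored as a vector of strings (the helper name is made up):

```cpp
#include <string>
#include <vector>

// For {"-m", "--model"} the first spelling must not be longer than the last,
// so the short form is listed before the long form. Middle entries of sets
// with 3+ spellings are deliberately left unchecked.
static bool short_form_first(const std::vector<std::string> & args) {
    return args.empty() || args.front().length() <= args.back().length();
}
```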
* llama-server: friendlier error msg when ctx < input
This PR adds formatted strings to the server's send_error function (see the sketch after this list)
* llama-server: use string_format inline
* fix test
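A self-contained sketch of the formatted error message; the stand-in `string_format` mimics the server's real helper of the same name, and the message text is invented:

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Minimal stand-in so the snippet compiles on its own; llama.cpp's common
// code already provides its own string_format.
static std::string string_format(const char * fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    char buf[256];
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    return buf;
}

int main() {
    int n_prompt = 9000, n_ctx = 4096;
    // Hypothetical wording: tell the user *by how much* the input exceeds
    // the context window instead of a bare "input too large".
    std::string msg = string_format(
        "the request exceeds the available context size (%d > %d), "
        "try increasing the context size", n_prompt, n_ctx);
    printf("%s\n", msg.c_str());
}
```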
* presets: refactor, allow cascade presets from different sources (see the sketch after this list)
* update docs
* fix neg arg handling
* fix empty mmproj
* also filter out server-controlled args before to_ini()
* skip loading custom_models if not specified
* fix unset_reserved_args
* fix crash on windows
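One plausible reading of "cascade presets": later, more specific sources override earlier ones. A sketch with hypothetical types:

```cpp
#include <map>
#include <string>
#include <vector>

using preset = std::map<std::string, std::string>;

// Apply preset sources in priority order: entries from later (more specific)
// sources overwrite entries from earlier ones.
static preset merge_presets(const std::vector<preset> & sources) {
    preset out;
    for (const auto & src : sources) {          // lowest priority first
        for (const auto & kv : src) {
            out[kv.first] = kv.second;          // later source wins
        }
    }
    return out;
}
```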
This commit adds a --verbose flag to the run-org-model.py script to
enable or disable detailed debug output, such as input and output
tensors for each layer. Debug utilities (summarize, debug_hook,
setup_rope_debug) have been moved to utils/common.py.
The motivation for this is that the detailed debug output can be useful
for diagnosing issues with model conversion or execution, but it can
also produce a large amount of output that may not always be needed.
The script will also be further cleaned/refactored in follow-up commits.
This implements a variation of the perf logger where, rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap (sketched below). This can be useful for understanding
whether individual operations need to be optimized or whether the group is
already running efficiently.
GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).
GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
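A schematic of the concurrent mode's idea, not the Vulkan backend's code: the clock starts at one existing sync point and stops at the next, so overlapping ops are measured as a group:

```cpp
#include <chrono>
#include <cstdio>

// Schematic only: instead of a barrier + timer around every op, measure the
// span between two synchronization points that the backend already has, so
// concurrent ops are timed as the group they actually execute as.
struct group_timer {
    std::chrono::steady_clock::time_point t0;

    void begin() { t0 = std::chrono::steady_clock::now(); }
    void end(const char * label, int n_ops) {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - t0).count();
        printf("%s: %d overlapping ops in %lld us\n", label, n_ops, (long long) us);
    }
};
```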
* Uncached model read
* Removing additional --mmap arg
* Removing trailing whitespaces
* Adding fallback when O_DIRECT is not supported (see the sketch after this list)
* Remove branching in llama-model-loader.cpp and reduce code duplication in llama-mmap.cpp
* Adding the [[maybe_unused]] attribute for Mac and Windows.
* File seek aligned
* Removing all branches for direct_io in llama-model-loader.cpp
* Always use alignment from llama_file
* use_mmap=true
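A sketch of the open-time fallback on POSIX; the helper name is illustrative, and the real llama-mmap.cpp handles more platforms plus the aligned-I/O details:

```cpp
#include <fcntl.h>

// Try O_DIRECT first to bypass the page cache; if the platform or filesystem
// rejects it (commonly EINVAL at open time), fall back to a normal buffered
// open. O_DIRECT additionally requires aligned offsets/buffers, hence the
// aligned file seek mentioned above.
static int open_uncached(const char * path) {
#ifdef O_DIRECT
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        return fd;
    }
#endif
    return open(path, O_RDONLY); // fallback when O_DIRECT is unsupported
}
```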
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance (see the sketch below)
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
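A sketch of the parse-once pattern described under "Backend changes"; the holder type is hypothetical, while nlohmann::json matches what the server already uses:

```cpp
#include <cstdio>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// --webui-config is parsed a single time at model load and cached, so /props
// can serve the settings with no per-request JSON work.
struct webui_config_holder {
    json webui_settings = json::object();

    bool load(const std::string & inline_json) {
        try {
            webui_settings = json::parse(inline_json);
            return true;
        } catch (const std::exception & e) {
            // Router mode wraps parsing in try/catch so a bad config fails
            // loudly at startup instead of crashing a /props handler later.
            fprintf(stderr, "invalid --webui-config: %s\n", e.what());
            return false;
        }
    }
};
```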