Adds a new `mtp` boolean to `llama_model_params`. When set to false (default):
1. The loader skips loading MTP-specific tensors (NextN layers) using `TENSOR_SKIP`.
2. The KV cache size calculation excludes the MTP layer (`n_layer_kv_from_start`).
This reduces VRAM usage and load time for users running GLM-4.5/4.6 in standard generation mode.
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Dec 7 23:00:29 2025 -0300
speculative (feat): implement recursive MTP drafting for GLM-4.5
commit bdf72d9552e3da64ffc85f175664713388752914
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 16:10:16 2025 -0300
sampling (feat): optimize speculative drafting with fast-path selection
commit a91980a8f3475a6bbac0a64d8be06dd4b613020e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 15:18:19 2025 -0300
mtp (chore): clean old code
commit 6de0ecf55db8567db4faa99b0152b72c9e854548
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 14:40:13 2025 -0300
mtp (feat): add mtp arg
commit ea77394183b8e6c368af969b8274039a54b11486
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 13:47:54 2025 -0300
mtp-graph (fix): move llama_get_logits_ith outside the loop
commit 15dff208958fb66802f20ec53ce5fcaff133edb7
Merge: 171346c74 cae85fe53
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:44:41 2025 -0300
Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache
commit cae85fe531
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:42:31 2025 -0300
mtp-batch(fix): avoid logits for mtp kv cache operations
commit 171346c742c310bbcfbd786b61250638ccf8b44d
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 12 16:33:01 2025 -0300
mtp-graph(feat): Reactivate graph reuse only for main model path
commit 0127c6beeb
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 22:20:54 2025 -0300
mtp-batch(chore): Remove final MTP debug logs and dead code
commit 4bcc9e261e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:51:22 2025 -0300
mtp-batch(fix): Correctly advance cache head and add MTP documentation
commit b4cbe030ac
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:37:40 2025 -0300
mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs
commit a99709d0c1
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 17:24:34 2025 -0300
mtp-batch(refactor): Extract decode context and MTP input logic into helper methods
commit 913af8f48d
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 16:44:28 2025 -0300
mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum
commit 6f74ba3807
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 22:27:18 2025 -0300
mtp-batch (fix): prevent mtp draft from polluting the cache
commit 5e1d719bef
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 15:21:23 2025 -0300
mtp-batch (feat): Create and manage sinfo for MTP
commit febd8235d2
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 5 14:43:40 2025 -0300
mtp-batch (wip): fix how to warmup kv cache for MTP
commit 67c6c069e0
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 19:42:32 2025 -0300
mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption
commit 75dc25e6fe
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 17:17:00 2025 -0300
mtp-batch (wip): organize batch for mtp cache
commit 3da7e7f330
Author: samuel <samueloliveira32df@gmail.com>
Date: Tue Sep 23 22:45:11 2025 -0300
mtp-batch (fix): warm mtp cache for small batch size
commit df64508b93
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:55:41 2025 -0300
mtp-batch (wip): merge glm graphs
commit 042eb8a829
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:29:00 2025 -0300
mtp-batch (wip): merge mtp and model graph
commit 1318b2de82
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 14 10:22:59 2025 -0300
mtp-batch (wip): move mtp execution to batch format
commit c6237c71ff
Merge: 9fab53e438742ce0e3
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sat Sep 13 02:57:01 2025 -0400
Merge pull request #1 from SamuelOliveirads/glm4-moe-mtp
feat: implemented sampling for MTP
commit 8742ce0e39
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 6 00:21:18 2025 -0300
feat: apply logits + greedy sampler
commit 5a5bce8577
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 17:56:14 2025 -0300
fix: add sample acceptance
commit 07670a22c6
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 13:25:21 2025 -0300
feat: implemented sampling for MTP
commit 9fab53e438
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Sep 2 17:14:09 2025 -0400
fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch
commit 98bc0c6bf2
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 26 01:26:51 2025 -0400
replace standard sampler with greedy sampler for mtp draft
commit 471e026327
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 23:10:56 2025 -0400
fixed vram leak
commit d72f9d5691
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 01:50:34 2025 -0400
kludge-y kv cache management of mtp layer
commit 382135aa36
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 21:54:45 2025 -0400
fixed mtp kv cache update sequencing after prompt processing
commit 6870f9790c
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 04:59:36 2025 -0400
added proper KV cache management for MTP layers and slightly refactored
commit 6e9bafc7a7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Fri Aug 15 23:13:56 2025 -0400
failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable
commit cf0f7c0448
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Wed Aug 13 02:21:17 2025 -0400
broad thrust of the mtp implementation
commit 03231da69e
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 12 01:03:59 2025 -0400
add model member function to build mtp graph, to be called from speculative.cpp
commit 1f477b3755
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 20:54:45 2025 -0400
make nextn weights loadable without a crash
commit e434f87cc7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 01:21:47 2025 -0400
some work towards building mtp layer graph
commit db60623e79
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 10 23:52:54 2025 -0400
added getter for nextn layer count and server slot has_mtp property
* implement sleeping at queue level
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
* llama: automatically fit args to free memory
llama-fit-params tool
* fix CI
* hints for bug reports, ensure no reallocation
* fix segfault with Vulkan
* add llama-fit-params to CI
* fix CI
* fix CI
* fix CI
* minor adjustments
* fix assignment of 1 dense layer
* fix logger not being reset on model load failure
* remove --n-gpu-layer hint on model load failure
* fix llama-fit-params verbosity
* fix edge case
* fix typo [no ci]
* llama-server: recursive GGUF loading
Replace flat directory scan with recursive traversal using
std::filesystem::recursive_directory_iterator. Support for
nested vendor/model layouts (e.g. vendor/model/*.gguf).
Model name now reflects the relative path within --models-dir
instead of just the filename. Aggregate files by parent
directory via std::map before constructing local_model
* server : router config POC (INI-based per-model settings)
* server: address review feedback from @aldehir and @ngxson
PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check
Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic
Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
* server: adopt aldehir's line-oriented PEG parser
Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule
Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs
Co-authored-by: aldehir <aldehir@users.noreply.github.com>
* server: fix CLI/env duplication in child processes
Children now receive minimal CLI args (executable, model, port, alias)
instead of inheriting all router args. Global settings pass through
LLAMA_ARG_* environment variables only, eliminating duplicate config
warnings.
Fixes: Router args like -ngl, -fa were passed both via CLI and env,
causing 'will be overwritten' warnings on every child spawn
* add common/preset.cpp
* fix compile
* cont
* allow custom-path models
* add falsey check
* server: fix router model discovery and child process spawning
- Sanitize model names: replace / and \ with _ for display
- Recursive directory scan with relative path storage
- Convert relative paths to absolute when spawning children
- Filter router control args from child processes
- Refresh args after port assignment for correct port value
- Fallback preset lookup for compatibility
- Fix missing argv[0]: store server binary path before base_args parsing
* Revert "server: fix router model discovery and child process spawning"
This reverts commit e3832b42eeea7fcb108995966c7584479f745857.
* clarify about "no-" prefix
* correct render_args() to include binary path
* also remove arg LLAMA_ARG_MODELS_PRESET for child
* add co-author for ini parser code
Co-authored-by: aldehir <hello@alde.dev>
* also set LLAMA_ARG_HOST
* add CHILD_ADDR
* Remove dead code
---------
Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: aldehir <hello@alde.dev>
Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds,
This caused "macro redefined" warnings with toolchains that define the version.
This also removes the `GGML_WIN_VER` variable as it is no longer needed.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* arg: add --cache-list argument to list cached models
* new manifest naming format
* improve naming
* Update common/arg.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* minor : code style
* server : fix prompt similarity calculation
* server : initial host-memory prompt caching
* cont
* server : refactor
* cont
* cont : make the server task of the slot const
* cont : minor [no ci]
* server : cache prompts and checkpoints only for completion tasks
* server : improve prompt caching logic
* cont : fix check for number of cached prompts [no ci]
* server : improve caching logic, add -cram CLI arg
* server : print prompt mismatch info
* cont : better naming [no ci]
* server : improve prompt cache loading logic
* server : add option to debug the slot contents (#16482)
* server : add option to debug the slot contents
* Update tools/server/server.cpp
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* server : add option to disable prompt cache
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing
- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
* refactor: implement streaming-aware universal reasoning parser
Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.
- Rework try_parse_reasoning() to track whitespace, partial tags, and
multiple reasoning segments, allowing proper separation of reasoning_content
and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments
The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.
Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
* refactor: address review feedback from allozaur
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)
- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
* refactor: address review feedback from ngxson
* debug: say goodbye to curl -N, hello one-click raw stream
- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story
- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
* npm run format
* chat-parser: address review feedback from ngxson
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* implement --no-host to disable host buffer
* fix equal_mparams
* move no-host enumeration order together with other model params
---------
Co-authored-by: slaren <slarengh@gmail.com>
* initial commit for branch 3
* generalize `swa_checkpoint` to `ctx_checkpoint`
this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite
* oops
* disable debug prints
* keep backwards compat with `--swa-checkpoints`
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update prompt re-processing message
* fix off-by-one error per GG
* keep `seq_rm` log per GG
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix checkpoint logic to support recurrent caches
* server : cleanup and fixes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add grok-2 support
* type fix
* type fix
* type fix
* "fix" vocab for invalid sequences
* fix expert tensor mapping and spaces in vocab
* add chat template
* fix norm tensor mapping
* rename layer_out_norm to ffn_post_norm
* ensure ffn_post_norm is mapped
* fix experts merging
* remove erroneous FFN_GATE entry
* concatenate split tensors and add more metadata
* process all expert layers and try cat instead of hstack
* add support for community BPE vocab
* fix expert feed forward length and ffn_down concat
* commit this too
* add ffn_up/gate/down, unsure if sequence is right
* add ffn_gate/down/up to tensor names
* correct residual moe (still not working)
* mess--
* fix embedding scale being applied twice
* add built in chat template
* change beta fast for grok if default value
* remove spm vocab in favor of community bpe vocab
* change attention temp length metadata type to integer
* update attention temp length metadata
* remove comment
* replace M_SQRT2 with std::sqrt(2)
* add yarn metadata, move defaults to hparams
To pull and run models via: llama-server -dr gemma3
Add some validators and sanitizers for Docker Model urls and metadata
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
* server : enable /slots by default and make it secure
ggml-ci
* server : fix tests to pass `--no-slots` when necessary
* server : extend /props with info about enabled endpoints
* server : add SWA checkpoints
ggml-ci
* cont : server clean-up
* server : handle state restore fails
* llama : add extended llama_state_seq_ API
* server : do not make checkpoints if --swa-full
ggml-ci
* llama : remove flags value for NONE
* server : configure number of SWA checkpoints with CLI arg
ggml-ci
* args : fix scope of new argument
* examples/finetune -opt SGD (stochastic gradient descent) memory opt
add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.
support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)
llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)
(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')
-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.
note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence
new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)
cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)
since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Checkpoint from VS Code for coding agent session
* Initial plan
* Fix typo in --override-tensor-draft flag implementation
* Add null termination for speculative tensor buffer overrides
* Apply suggestions from code review
* Apply suggestions from code review
* Extract tensor override parsing logic to common function (addresses @slaren's feedback)
* Apply suggestions from code review
* Apply suggestions
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* llama-server : implement universal assisted decoding
* Erase prompt tail for kv-cache
* set vocab_dft_compatible in common_speculative
* rename ctx_main to ctx_tgt
* move vocab_dft_compatible to spec struct
* clear mem_dft, remove mem
* detokenize id_last for incompatible models
* update comment
* add --spec-replace flag
* accept special tokens when translating between draft/main models
* Escape spec-replace
* clamp draft result to size to params.n_draft
* fix comment
* clean up code
* restore old example
* log common_speculative_are_compatible in speculative example
* fix
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add support for Llada-8b: diffusion model
* Add README
* Fix README and convert_hf_to_gguf
* convert_hf_to_gguf.py: address review comments
* Make everything in a single example
* Remove model-specific sampling
* Remove unused argmax
* Remove braced initializers, improve README.md a bit
* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps
* Remove adding the mask token
* Move add_add_bos_token to set_vocab
* use add_bool in gguf_writer.py