* model: add Solar-Open model
* vocab: add solar-open to end eog blacklist
* model: add proper llm type
* chat: basic template for solar open
* typo: fix comment about vocab
* convert: sugested changes
* convert: suggested changes
* chat: change reasoning end tag for solar-open
* llama-chat: add solar-open template
* chat: make tool description and parameters optional per OpenAI spec
Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.
Attempts to fix#17667
* refactor: use value() for cleaner optional field access
Removes heavy penalty checks (repetition, frequency, presence, DRY) from
`common_sampler_sample_speculative`.
The specialized speculative sampler now uses a pure ArgMax (Greedy) approach.
This significantly reduces CPU overhead during the drafting phase, which
improves overall tokens per second.
Adds a new `mtp` boolean to `llama_model_params`. When set to false (default):
1. The loader skips loading MTP-specific tensors (NextN layers) using `TENSOR_SKIP`.
2. The KV cache size calculation excludes the MTP layer (`n_layer_kv_from_start`).
This reduces VRAM usage and load time for users running GLM-4.5/4.6 in standard generation mode.
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Dec 7 23:00:29 2025 -0300
speculative (feat): implement recursive MTP drafting for GLM-4.5
commit bdf72d9552e3da64ffc85f175664713388752914
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 16:10:16 2025 -0300
sampling (feat): optimize speculative drafting with fast-path selection
commit a91980a8f3475a6bbac0a64d8be06dd4b613020e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 15:18:19 2025 -0300
mtp (chore): clean old code
commit 6de0ecf55db8567db4faa99b0152b72c9e854548
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 14:40:13 2025 -0300
mtp (feat): add mtp arg
commit ea77394183b8e6c368af969b8274039a54b11486
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 13:47:54 2025 -0300
mtp-graph (fix): move llama_get_logits_ith outside the loop
commit 15dff208958fb66802f20ec53ce5fcaff133edb7
Merge: 171346c74 cae85fe53
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:44:41 2025 -0300
Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache
commit cae85fe531
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:42:31 2025 -0300
mtp-batch(fix): avoid logits for mtp kv cache operations
commit 171346c742c310bbcfbd786b61250638ccf8b44d
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 12 16:33:01 2025 -0300
mtp-graph(feat): Reactivate graph reuse only for main model path
commit 0127c6beeb
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 22:20:54 2025 -0300
mtp-batch(chore): Remove final MTP debug logs and dead code
commit 4bcc9e261e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:51:22 2025 -0300
mtp-batch(fix): Correctly advance cache head and add MTP documentation
commit b4cbe030ac
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:37:40 2025 -0300
mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs
commit a99709d0c1
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 17:24:34 2025 -0300
mtp-batch(refactor): Extract decode context and MTP input logic into helper methods
commit 913af8f48d
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 16:44:28 2025 -0300
mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum
commit 6f74ba3807
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 22:27:18 2025 -0300
mtp-batch (fix): prevent mtp draft from polluting the cache
commit 5e1d719bef
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 15:21:23 2025 -0300
mtp-batch (feat): Create and manage sinfo for MTP
commit febd8235d2
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 5 14:43:40 2025 -0300
mtp-batch (wip): fix how to warmup kv cache for MTP
commit 67c6c069e0
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 19:42:32 2025 -0300
mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption
commit 75dc25e6fe
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 17:17:00 2025 -0300
mtp-batch (wip): organize batch for mtp cache
commit 3da7e7f330
Author: samuel <samueloliveira32df@gmail.com>
Date: Tue Sep 23 22:45:11 2025 -0300
mtp-batch (fix): warm mtp cache for small batch size
commit df64508b93
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:55:41 2025 -0300
mtp-batch (wip): merge glm graphs
commit 042eb8a829
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:29:00 2025 -0300
mtp-batch (wip): merge mtp and model graph
commit 1318b2de82
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 14 10:22:59 2025 -0300
mtp-batch (wip): move mtp execution to batch format
commit c6237c71ff
Merge: 9fab53e438742ce0e3
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sat Sep 13 02:57:01 2025 -0400
Merge pull request #1 from SamuelOliveirads/glm4-moe-mtp
feat: implemented sampling for MTP
commit 8742ce0e39
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 6 00:21:18 2025 -0300
feat: apply logits + greedy sampler
commit 5a5bce8577
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 17:56:14 2025 -0300
fix: add sample acceptance
commit 07670a22c6
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 13:25:21 2025 -0300
feat: implemented sampling for MTP
commit 9fab53e438
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Sep 2 17:14:09 2025 -0400
fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch
commit 98bc0c6bf2
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 26 01:26:51 2025 -0400
replace standard sampler with greedy sampler for mtp draft
commit 471e026327
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 23:10:56 2025 -0400
fixed vram leak
commit d72f9d5691
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 01:50:34 2025 -0400
kludge-y kv cache management of mtp layer
commit 382135aa36
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 21:54:45 2025 -0400
fixed mtp kv cache update sequencing after prompt processing
commit 6870f9790c
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 04:59:36 2025 -0400
added proper KV cache management for MTP layers and slightly refactored
commit 6e9bafc7a7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Fri Aug 15 23:13:56 2025 -0400
failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable
commit cf0f7c0448
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Wed Aug 13 02:21:17 2025 -0400
broad thrust of the mtp implementation
commit 03231da69e
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 12 01:03:59 2025 -0400
add model member function to build mtp graph, to be called from speculative.cpp
commit 1f477b3755
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 20:54:45 2025 -0400
make nextn weights loadable without a crash
commit e434f87cc7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 01:21:47 2025 -0400
some work towards building mtp layer graph
commit db60623e79
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 10 23:52:54 2025 -0400
added getter for nextn layer count and server slot has_mtp property
* implement sleeping at queue level
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified
* arg: update doc
* presets: refactor, allow cascade presets from different sources
* update docs
* fix neg arg handling
* fix empty mmproj
* also filter out server-controlled args before to_ini()
* skip loading custom_models if not specified
* fix unset_reserved_args
* fix crash on windows
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
The -kvu (--kv-unified) flag is required for hellaswag and winogrande
benchmarks which use coupled sequences. Without unified KV cache,
these benchmarks fail with:
split_equal: sequential split is not supported when there are
coupled sequences in the input batch (you may need to use the -kvu flag)
This change adds LLAMA_EXAMPLE_PERPLEXITY to the allowed examples for
the -kvu argument, enabling its use with llama-perplexity.
* common : expose json-schema functionality to extract type info
* common : fix peg parser negation during needs_more_input
* common : add some defensive measures in constructed peg parser
* common : add nemotron nano 3 support
* common : add nemotron nano 3 tests
* remove debug line
* llama: automatically fit args to free memory
llama-fit-params tool
* fix CI
* hints for bug reports, ensure no reallocation
* fix segfault with Vulkan
* add llama-fit-params to CI
* fix CI
* fix CI
* fix CI
* minor adjustments
* fix assignment of 1 dense layer
* fix logger not being reset on model load failure
* remove --n-gpu-layer hint on model load failure
* fix llama-fit-params verbosity
* fix edge case
* fix typo [no ci]
* llama-server: recursive GGUF loading
Replace flat directory scan with recursive traversal using
std::filesystem::recursive_directory_iterator. Support for
nested vendor/model layouts (e.g. vendor/model/*.gguf).
Model name now reflects the relative path within --models-dir
instead of just the filename. Aggregate files by parent
directory via std::map before constructing local_model
* server : router config POC (INI-based per-model settings)
* server: address review feedback from @aldehir and @ngxson
PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check
Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic
Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
* server: adopt aldehir's line-oriented PEG parser
Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule
Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs
Co-authored-by: aldehir <aldehir@users.noreply.github.com>
* server: fix CLI/env duplication in child processes
Children now receive minimal CLI args (executable, model, port, alias)
instead of inheriting all router args. Global settings pass through
LLAMA_ARG_* environment variables only, eliminating duplicate config
warnings.
Fixes: Router args like -ngl, -fa were passed both via CLI and env,
causing 'will be overwritten' warnings on every child spawn
* add common/preset.cpp
* fix compile
* cont
* allow custom-path models
* add falsey check
* server: fix router model discovery and child process spawning
- Sanitize model names: replace / and \ with _ for display
- Recursive directory scan with relative path storage
- Convert relative paths to absolute when spawning children
- Filter router control args from child processes
- Refresh args after port assignment for correct port value
- Fallback preset lookup for compatibility
- Fix missing argv[0]: store server binary path before base_args parsing
* Revert "server: fix router model discovery and child process spawning"
This reverts commit e3832b42eeea7fcb108995966c7584479f745857.
* clarify about "no-" prefix
* correct render_args() to include binary path
* also remove arg LLAMA_ARG_MODELS_PRESET for child
* add co-author for ini parser code
Co-authored-by: aldehir <hello@alde.dev>
* also set LLAMA_ARG_HOST
* add CHILD_ADDR
* Remove dead code
---------
Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: aldehir <hello@alde.dev>
* console: allow using arrow left/right to edit the line (with UTF-8 support)
* console: fix arrow keys on Windows using private-use Unicode
* console: add Home/End key support for Windows and Linux
* console: add basic Up/Down history navigation
* fix build
* console: allow using arrow left/right to edit the line (with UTF-8 support)
* console: fix arrow keys on Windows using private-use Unicode
* console: add Home/End key support for Windows and Linux
* console: add basic Up/Down history navigation
* console: remove unreachable wc == 0 check after VK switch
* console: add Ctrl+Left/Right word navigation
- Add KEY_CTRL_ARROW_LEFT and KEY_CTRL_ARROW_RIGHT codes
- Windows: detect CTRL modifier via dwControlKeyState
- Linux: parse ANSI sequences with modifier (1;5D/C)
- Implement move_word_left/right with space-skipping logic
- Refactor escape sequence parsing to accumulate params
* console: add Delete key support
- Windows: VK_DELETE detection
- Linux: ESC[3~ sequence parsing
- Forward character deletion with UTF-8 support
* console: implement bash-style history editing
- Edit any history line during UP/DOWN navigation, edits persist
- Pressing Enter appends edited version as new history entry
- Original line stay untouched in their positions
* clean up
* better history impl
* fix decode_utf8
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* Fix kimi-k2 parsing
* fix template & add more tests for kimi-k2
* Another fix for Kimi-K2 chat template.
* enable allow_toolcall_in_think for Kimi-K2
* Refine key-value separator and value end format
* Enable tool call in think for kimi-k2
* allow_toolcall_in_think is now tested with Kimi-K2
* Remove outdated TODO comment in XML tool call parser
Removed TODO comment about untested tool call feature.
* Rename function from "utf8_truncate_safe" to "utf8_truncate_safe_len"
This commit skips the model validation check when the user specifies the
--help option.
The motivation for this is that currently and error is thrown before the
--help could be processed. Now skips validation if params.usage is set,
allowing help to display without requiring --model.
Resolves: https://github.com/ggml-org/llama.cpp/issues/17754
Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds,
This caused "macro redefined" warnings with toolchains that define the version.
This also removes the `GGML_WIN_VER` variable as it is no longer needed.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : implement parser combinators to simplify chat parsing
* add virtual destructor to parser_base
* fix memory leak from circular references of rules
* implement gbnf grammar building
* remove unused private variable
* create a base visitor and implement id assignment as a visitor
* fix const ref for grammar builder
* clean up types, friend classes, and class declarations
* remove builder usage from until_parser
* Use a counter class to help assign rule ids
* cache everything
* add short description for each parser
* create a type for the root parser
* implement repetition parser
* Make optional, one_or_more, and zero_or_more subclasses of repetition
* improve context constructor
* improve until parsing and add benchmarks
* remove cached() pattern, cache in parser_base with specialized parsing functions for each parser
* improve json parsing performance to better match legacy parsing
* fix const auto * it for windows
* move id assignment to classes instead of using a visitor
* create named rules in the command r7b example
* use '.' for any in GBNF
* fix parens around choices in gbnf grammar
* add convenience operators to turn strings to literals
* add free-form operators for const char * to simplify defining literals
* simplify test case parser
* implement semantic actions
* remove groups in favor of actions and a scratchpad
* add built in actions for common operations
* add actions to command r7b example
* use std::default_searcher for platforms that don't have bm
* improve parser_type handling and add cast helper
* add partial result type to better control when to run actions
* fix bug in until()
* run actions on partial results by default
* use common_chat_msg for result
* add qwen3 example wip
* trash partial idea and simplify
* move action arguments to a struct
* implement aho-corasick matcher for until_parser and to build exclusion grammars
* use std::string for input, since std::string_view is incompatible with std::regex
* Refactor tests
* improve qwen3 example
* implement sax-style parsing and refactor
* fix json string in test
* rename classes to use common_chat_ prefix
* remove is_ suffix from functions
* rename from id_counter to just counter
* Final refactored tests
* Fix executable name and editorconfig-checker
* Third time's the charm...
* add trigger parser to begin lazy grammar rule generation
* working lazy grammar
* refactor json rules now that we check for reachability
* reduce pointer usage
* print out grammars in example
* rename to chat-peg-parser* and common_chat_peg_parser*
* Revert unrelated changes
* New macros for CMakeLists to enable multi-file compilations
* starting unicode support
* add unicode support to char_parser
* use unparsed args as additional sources
* Refactor tests to new harness
* Fix CMakeLists
* fix rate calculation
* add unicode tests
* fix trailing whitespace and line endings
skip-checks: true
* Helpers + rewrite qwen3 with helpers
* Fix whitespace
* extract unicode functions to separate file
* refactor parse unicode function
* fix compiler error
* improve construction of sequence/choice parsers
* be less clever
* add make_parser helper function
* expand usage of make_parser, alias common_chat_msg_peg_parser_builder to builder in source
* lower bench iterations
* add unicode support to until_parser
* add unicode support to json_string_parser
* clean up unicode tests
* reduce unicode details to match src/unicode.cpp
* simplify even further
* remove unused functions
* fix type
* reformat char class parsing
* clean up json string parser
* clean up + fix diagnostics
* reorder includes
* compact builder functions
* replace action_parser with capture_parser, rename env to semantics
* rename env to semantics
* clean up common_chat_parse_context
* move type() to below constant
* use default constructor for common_chat_peg_parser
* make all operators functions for consistency
* fix compilation errors in test-optional.cpp
* simplify result values
* rename json_string_unquoted to json_string_content
* Move helper to separate class, add separate explicit and helper classes
* Whitespace
* Change + to append()
* Reformat
* Add extra helpers, tests and Minimax example
* Add some extra optional debugging prints + real example of how to use them
* fix bug in repetitions when min_count = 0 reports failures
* dump rule in debug
* fix token accumulation and assert parsing never fails
* indent debug by depth
* use LOG_* in tests so logs sync up with test logs
* - Add selective testing
- Refactor all messaging to use LOG_ERR
- Fix lack of argument / tool name capturing
- Temporary fix for double event capture
* refactor rule() and introduce ref()
* clean up visitor
* clean up indirection in root parser w.r.t rules
* store shared ptr directly in parser classes
* replace aho-corasick automation with a simple trie
* Reset prev for qwen3 helper example variant
* refactor to use value semantics with std::variant/std::visit
* simplify trie_matcher result
* fix linting issues
* add annotations to rules
* revert test workaround
* implement serializing the parser
* remove redundant parsers
* remove tests
* gbnf generation fixes
* remove LOG_* use in tests
* update gbnf tests to test entire grammar
* clean up gbnf generation and fix a few bugs
* fix typo in test output
* remove implicit conversion rules
* improve test output
* rename trie_matcher to trie
* simplify trie to just know if a node is the end of a word
* remove common_chat_ prefix and ensure a common_peg_ prefix to all types
* rename chat-peg-parser -> peg-parser
* promote chat-peg-parser-helper to chat-peg-parser
* checkpoint
* use a static_assert to ensure we handle every branch
* inline trivial peg parser builders
* use json strings for now
* implement basic and native chat peg parser builders/extractors
* resolve refs to their rules
* remove packrat caching (for now)
* update tests
* compare parsers with incremental input
* benchmark both complete and incremental parsing
* add raw string generation from json schema
* add support for string schemas in gbnf generation
* fix qwen example to include \n
* tidy up example
* rename extractor to mapper
* rename ast_arena to ast
* place basic tests into one
* use gbnf_format_literal from json-schema-to-grammar
* integrate parser with common/chat and server
* clean up schema and serialization
* add json-schema raw string tests
* clean up json creation and remove capture parser
* trim spaces from reasoning and content
* clean up redundant rules and comments
* rename input_is_complete to is_partial to match rest of project
* simplify json rules
* remove extraneous file
* remove comment
* implement += and |= operators
* add comments to qwen3 implementation
* reorder arguments to common_chat_peg_parse
* remove commented outdated tests
* add explicit copy constructor
* fix operators and constness
* wip: update test-chat for qwen3-coder
* bring json parser closer to json-schema-to-grammar rules
* trim trailing space for most things
* fix qwen3 coder rules w.r.t. trailing spaces
* group rules
* do not trim trailing space from string args
* tweak spacing of qwen3 grammar
* update qwen3-coder tests
* qwen3-coder small fixes
* place parser in common_chat_syntax to simplify invocation
* use std::set to collect rules to keep order predictable for tests
* initialize parser to make certain platforms happy
* revert back to std::unordered_set, sort rule names at the end instead
* uncomment rest of chat tests
* define explicit default constructor
* improve arena init and server integration
* fix chat test
* add json_member()
* add a comprehensive native example
* clean up example qwen test and add response_format example to native test
* make build_peg_parser accept std::function instead of template
* change peg parser parameters into const ref
* push tool call on tool open for constructed parser
* add parsing documentation
* clean up some comments
* add json schema support to qwen3-coder
* add id initializer in tests
* remove grammar debug line from qwen3-coder
* refactor qwen3-coder to use sequence over operators
* only call common_chat_peg_parse if appropriate format
* simplify qwen3-coder space handling
* revert qwen3-coder implementation
* revert json-schema-to-grammar changes
* remove unnecessary forward declaration
* small adjustment to until_parser
* rename C/C++ files to use dashes
* codeowners : add aldehir to peg-parser and related files
---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>