Commit Graph

17 Commits

Author SHA1 Message Date
Aaron Lee d10a5a4a5b clean up mtp sample typing after rebase 2025-12-21 17:53:27 -05:00
samuel fe2baf5e2d Squashed commit of the following:
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Dec 7 23:00:29 2025 -0300

    speculative (feat): implement recursive MTP drafting for GLM-4.5

commit bdf72d9552e3da64ffc85f175664713388752914
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 16:10:16 2025 -0300

    sampling (feat): optimize speculative drafting with fast-path selection

commit a91980a8f3475a6bbac0a64d8be06dd4b613020e
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 15:18:19 2025 -0300

    mtp (chore): clean old code

commit 6de0ecf55db8567db4faa99b0152b72c9e854548
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 14:40:13 2025 -0300

    mtp (feat): add mtp arg

commit ea77394183b8e6c368af969b8274039a54b11486
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 13:47:54 2025 -0300

    mtp-graph (fix): move llama_get_logits_ith outside the loop

commit 15dff208958fb66802f20ec53ce5fcaff133edb7
Merge: 171346c74 cae85fe53
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 16 13:44:41 2025 -0300

    Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache

commit cae85fe531
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 16 13:42:31 2025 -0300

    mtp-batch(fix): avoid logits for mtp kv cache operations

commit 171346c742c310bbcfbd786b61250638ccf8b44d
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Oct 12 16:33:01 2025 -0300

    mtp-graph(feat): Reactivate graph reuse only for main model path

commit 0127c6beeb
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Oct 11 22:20:54 2025 -0300

    mtp-batch(chore): Remove final MTP debug logs and dead code

commit 4bcc9e261e
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Oct 11 18:51:22 2025 -0300

    mtp-batch(fix): Correctly advance cache head and add MTP documentation

commit b4cbe030ac
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Oct 11 18:37:40 2025 -0300

    mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs

commit a99709d0c1
Author: samuel <samueloliveira32df@gmail.com>
Date:   Fri Oct 10 17:24:34 2025 -0300

    mtp-batch(refactor): Extract decode context and MTP input logic into helper methods

commit 913af8f48d
Author: samuel <samueloliveira32df@gmail.com>
Date:   Fri Oct 10 16:44:28 2025 -0300

    mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum

commit 6f74ba3807
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 9 22:27:18 2025 -0300

    mtp-batch (fix): prevent mtp draft from polluting the cache

commit 5e1d719bef
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 9 15:21:23 2025 -0300

    mtp-batch (feat): Create and manage sinfo for MTP

commit febd8235d2
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Oct 5 14:43:40 2025 -0300

    mtp-batch (wip): fix how to warmup kv cache for MTP

commit 67c6c069e0
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Sep 27 19:42:32 2025 -0300

    mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption

commit 75dc25e6fe
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Sep 27 17:17:00 2025 -0300

    mtp-batch (wip): organize batch for mtp cache

commit 3da7e7f330
Author: samuel <samueloliveira32df@gmail.com>
Date:   Tue Sep 23 22:45:11 2025 -0300

    mtp-batch (fix): warm mtp cache for small batch size

commit df64508b93
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Sep 21 21:55:41 2025 -0300

    mtp-batch (wip): merge glm graphs

commit 042eb8a829
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Sep 21 21:29:00 2025 -0300

    mtp-batch (wip): merge mtp and model graph

commit 1318b2de82
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Sep 14 10:22:59 2025 -0300

    mtp-batch (wip): move mtp execution to batch format

commit c6237c71ff
Merge: 9fab53e43 8742ce0e3
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sat Sep 13 02:57:01 2025 -0400

    Merge pull request #1 from SamuelOliveirads/glm4-moe-mtp

    feat: implemented sampling for MTP

commit 8742ce0e39
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Sep 6 00:21:18 2025 -0300

    feat: apply logits + greedy sampler

commit 5a5bce8577
Author: samuel <samueloliveira32df@gmail.com>
Date:   Wed Sep 3 17:56:14 2025 -0300

    fix: add sample acceptance

commit 07670a22c6
Author: samuel <samueloliveira32df@gmail.com>
Date:   Wed Sep 3 13:25:21 2025 -0300

    feat: implemented sampling for MTP

commit 9fab53e438
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Sep 2 17:14:09 2025 -0400

    fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch

commit 98bc0c6bf2
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 26 01:26:51 2025 -0400

    replace standard sampler with greedy sampler for mtp draft

commit 471e026327
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 19 23:10:56 2025 -0400

    fixed vram leak

commit d72f9d5691
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 19 01:50:34 2025 -0400

    kludge-y kv cache management of mtp layer

commit 382135aa36
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sun Aug 17 21:54:45 2025 -0400

    fixed mtp kv cache update sequencing after prompt processing

commit 6870f9790c
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sun Aug 17 04:59:36 2025 -0400

    added proper KV cache management for MTP layers and slightly refactored

commit 6e9bafc7a7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Fri Aug 15 23:13:56 2025 -0400

    failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable

commit cf0f7c0448
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Wed Aug 13 02:21:17 2025 -0400

    broad thrust of the mtp implementation

commit 03231da69e
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 12 01:03:59 2025 -0400

    add model member function to build mtp graph, to be called from speculative.cpp

commit 1f477b3755
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Mon Aug 11 20:54:45 2025 -0400

    make nextn weights loadable without a crash

commit e434f87cc7
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Mon Aug 11 01:21:47 2025 -0400

    some work towards building mtp layer graph

commit db60623e79
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sun Aug 10 23:52:54 2025 -0400

    added getter for nextn layer count and server slot has_mtp property
2025-12-21 17:23:35 -05:00
Xuan-Son Nguyen ddcb75dd8a
server: add auto-sleep after N seconds of idle (#18228)
* implement sleeping at queue level

* implement server-context suspend

* add test

* add docs

* optimization: add fast path

* make sure to free llama_init

* nits

* fix use-after-free

* allow /models to be accessed during sleeping, fix use-after-free

* don't allow accessing /models during sleep, it is not thread-safe

* fix data race on accessing props and model_meta

* small clean up

* trailing whitespace

* rm outdated comments
2025-12-21 02:24:42 +01:00
Oleksandr Kuvshynov 408616adbd
server : [easy] fix per round speculative decode logging (#18211)
Currently we always log 0, as we clear slot.drafted before.

To reproduce:
Run llama-server with devstral-2 as main model and devstral-2-small as
md, and verbose logging:

```
% ./build/bin/llama-server -v  \
  -m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
  -md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
  -c 8192 2> /tmp/llama.cpp.debug

Check the log:

slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new
n_tokens = 741
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new
n_tokens = 746
slot update_slots: id  3 | task 0 | accepted 16/0 draft tokens, new
n_tokens = 763
slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new
n_tokens = 775
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 778
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new
n_tokens = 783
slot update_slots: id  3 | task 0 | accepted 8/0 draft tokens, new
n_tokens = 792
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 795
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 797
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 799
slot update_slots: id  3 | task 0 | accepted 0/0 draft tokens, new
n_tokens = 800
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 803
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 805
slot update_slots: id  3 | task 0 | accepted 6/0 draft tokens, new
n_tokens = 812
slot update_slots: id  3 | task 0 | accepted 3/0 draft tokens, new
n_tokens = 816
```

After the fix, get correct per round logging:

```
slot update_slots: id  3 | task 0 | accepted 7/8 draft tokens, new
n_tokens = 654
slot update_slots: id  3 | task 0 | accepted 1/2 draft tokens, new
n_tokens = 656
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new
n_tokens = 659
slot update_slots: id  3 | task 0 | accepted 1/16 draft tokens, new
n_tokens = 661
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new
n_tokens = 664
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 681
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 698
slot update_slots: id  3 | task 0 | accepted 3/4 draft tokens, new
n_tokens = 702
slot update_slots: id  3 | task 0 | accepted 5/12 draft tokens, new
n_tokens = 708
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 725
slot update_slots: id  3 | task 0 | accepted 1/1 draft tokens, new
n_tokens = 727
slot update_slots: id  3 | task 0 | accepted 8/16 draft tokens, new
n_tokens = 736
```
2025-12-20 10:57:40 +01:00
Aman Gupta cc0a04343e
server: friendlier error msg when ctx < input (#18174)
* llama-server: friendlier error msg when ctx < input

This PR adds formatted strings to the server's send_error function

* llama-server: use string_format inline

* fix test
2025-12-19 12:10:00 +01:00
Pascal 6ce3d85796
server: (webui) add --webui-config (#18028)
* server/webui: add server-side WebUI config support

Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.

Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes

Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls

Addresses feedback from @ngxson and @ggerganov

* server: address review feedback from ngxson

* server: regenerate README with llama-gen-docs
2025-12-17 21:45:45 +01:00
Georgi Gerganov 254098a279
common : refactor common_sampler + grammar logic changes (#17937)
* common : refactor common_sampler + grammar logic changes

* tests : increase max_tokens to get needed response

* batched : fix uninitialized samplers
2025-12-14 10:11:13 +02:00
Xuan-Son Nguyen 6c2131773c
cli: new CLI experience (#17824)
* wip

* wip

* fix logging, add display info

* handle commands

* add args

* wip

* move old cli to llama-completion

* rm deprecation notice

* move server to a shared library

* move ci to llama-completion

* add loading animation

* add --show-timings arg

* add /read command, improve LOG_ERR

* add args for speculative decoding, enable show timings by default

* add arg --image and --audio

* fix windows build

* support reasoning_content

* fix llama2c workflow

* color default is auto

* fix merge conflicts

* properly fix color problem

Co-authored-by: bandoti <bandoti@users.noreply.github.com>

* better loading spinner

* make sure to clean color on force-exit

* also clear input files on "/clear"

* simplify common_log_flush

* add warning in mtmd-cli

* implement console writter

* fix data race

* add attribute

* fix llama-completion and mtmd-cli

* add some notes about console::log

* fix compilation

---------

Co-authored-by: bandoti <bandoti@users.noreply.github.com>
2025-12-10 15:28:59 +01:00
Xuan-Son Nguyen 951520ddb0
server: delegate result_state creation to server_task (#17835)
* server: delegate result_state creation to server_task

* remove unued states

* add more docs
2025-12-08 17:04:38 +01:00
Xuan-Son Nguyen f896d2c34f
server: improve speed of speculative decoding (#17808)
* server: improve speed of speculative decoding

* fix small draft case

* add link to the PR

* server : fix generation time measurement

* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)

* server : add comment

* add PR to docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:35:28 +01:00
Georgi Gerganov 2bc96931d2
server : make cache_reuse configurable per request (#17858) 2025-12-08 12:43:12 +02:00
Xuan-Son Nguyen c42712b056
server: support multiple generations from one prompt (OAI "n" option) (#17775)
* backend support

* server: support multiple generations from one prompt (OAI "n" option)

* fix invalid batch

* format oai

* clean up

* disable ctx shift

* add test

* update comments

* fix style

* add n_cmpl to docs [no ci]

* allowing using both n_cmpl and n
2025-12-06 15:54:38 +01:00
Xuan-Son Nguyen c4c10bfb86
server: move msg diffs tracking to HTTP thread (#17740)
* server: move msg diffs tracking to HTTP thread

* wip

* tool call tests ok

* minor : style

* cont : fix

* move states to server_response_reader

* add safe-guard

* fix

* fix 2

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-04 15:46:08 +01:00
Xuan-Son Nguyen 13628d8bdb
server: add --media-path for local media files (#17697)
* server: add --media-path for local media files

* remove unused fn
2025-12-02 22:49:20 +01:00
Xuan-Son Nguyen 5d6bd842ea
server: remove default "gpt-3.5-turbo" model name (#17668)
* server: remove default "gpt-3.5-turbo" model name

* do not reflect back model name from request

* fix test
2025-12-02 11:38:57 +01:00
Xuan-Son Nguyen ecf74a8417
mtmd: add mtmd_context_params::warmup option (#17652)
* mtmd: add mtmd_context_params::warmup option

* reuse the common_params::warmup
2025-12-01 21:32:25 +01:00
Xuan-Son Nguyen ab49f094d2
server: move server-context to its own cpp|h (#17595)
* git mv

* add server-context.h

* add server-context.h

* clean up headers

* cont : cleanup

* also expose server_response_reader (to be used by CLI)

* fix windows build

* decouple server_routes and server_http

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-29 22:04:44 +01:00