lhez
eb492bf43f
opencl: unpack q4_0 for adreno in get_tensor ( #18278 )
2025-12-22 10:19:01 -08:00
Jeff Bolz
e3b35ddf1c
vulkan: Extend rope fusions to allow mrope ( #18264 )
Extend the test-backend-ops tests as well.
2025-12-22 11:03:13 -06:00
Xuan-Son Nguyen
6ce863c803
server: prevent data race from HTTP threads ( #18263 )
* server: prevent data race from HTTP threads
* fix params
* fix default_generation_settings
* nits: make handle_completions_impl look less strange
* stricter const
* fix GGML_ASSERT(idx < states.size())
* move index to be managed by server_response_reader
* http: make sure req & res lifecycle are tied together
* fix compile
* fix buggy index handling
* fix data race for lora endpoint
* nits: fix shadow variable
* nits: revert redundant changes
* nits: correct naming for json_webui_settings
2025-12-22 14:23:34 +01:00
Xuan-Son Nguyen
3997c78e33
server: fix data race in to_json_anthropic ( #18283 )
2025-12-22 13:21:43 +01:00
Mattt
ee74642982
release: update release workflow to store XCFramework as Zip file ( #18284 )
* Update release workflow to store XCFramework as Zip file
* Add comments to document Zip file requirement for XCFramework
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-22 20:11:46 +08:00
Aaron Teo
a28310488c
convert: rework ftype heuristics ( #18214 )
* convert: rework ftype heuristics
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
convert: fix type-check
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
convert: bring back heuristics comment
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* convert: revert to using first tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* convert: rework heuristics logic
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* convert: rm redundant float32 check
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-22 20:03:49 +08:00
Xuan-Son Nguyen
86af848153
server: (docs) remove mention about extra_args ( #18262 )
2025-12-22 12:22:01 +01:00
Johannes Gäßler
147a521636
tool/ex/tests: consistently free ctx, then model ( #18168 )
2025-12-22 11:00:37 +01:00
Jeff Bolz
e1f15b454f
vulkan: Implement set_tensor_async and the event interfaces ( #18047 )
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896 . This works and the
loads themselves are faster, but with host-visible vidmem I think the cost shifts
to allocating/mapping vidmem and becomes more expensive, so I don't see a
benefit by default. With GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1, however, I do see a
significant improvement in model loading time.
2025-12-21 21:52:09 +01:00
Johannes Gäßler
0e1ccf15c7
llama: fix RPC for -fit on ( #18233 )
2025-12-21 19:33:08 +01:00
Xuan-Son Nguyen
5e25ddebff
move copilot instructions to AGENTS.md ( #18259 )
* move copilot --> agents.md
* agents: add AI usage disclosure
* refine
2025-12-21 19:09:21 +01:00
Jeff Bolz
fd05c51cec
vulkan: fix im2col overflowing maxworkgroupcount ( #18180 )
2025-12-21 10:32:58 +01:00
Jeff Bolz
b365c3ff01
vulkan/cuda: fix topk_moe with exp_probs_b ( #18071 )
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.
CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
2025-12-21 10:27:34 +01:00
Jeff Bolz
cb64222b0c
vulkan: support GGML_UNARY_OP_XIELU ( #18062 )
2025-12-21 10:17:58 +01:00
Jeff Bolz
6eb7081860
vulkan: in graph_optimize, try to group ADD operations ( #18060 )
I saw the adds not staying together in the new nemotron 3 nano model.
2025-12-21 10:05:08 +01:00
lovedheart
4117ae5557
Vulkan: some improvement on mul_mat_iq2_xs ( #18031 )
* Some improvement on mul_mat_iq2_xs
Refactor calculations for db values and grid data to optimize performance and reduce redundancy.
* Fix trailing whitespace
2025-12-21 09:59:52 +01:00
Daniel Bevenius
65e96a2464
docs : fix links in parsing.md ( #18245 )
This commit corrects the links in the parsing.md which currently result
in 404 errors.
2025-12-21 09:35:40 +01:00
Aldehir Rojas
9496bbb808
common : reorganize includes to prioritize vendored deps ( #18222 )
2025-12-20 21:43:21 -06:00
Xuan-Son Nguyen
ddcb75dd8a
server: add auto-sleep after N seconds of idle ( #18228 )
* implement sleeping at queue level
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
2025-12-21 02:24:42 +01:00
Jeff Bolz
52ab19df63
tests: Avoid floating point precision false positives in SUM ( #17471 )
* tests: Avoid floating point precision false positives in SUM
* also apply to test_mean
2025-12-20 13:46:46 -06:00
Jeff Bolz
5182dd64cd
test-backend-ops: improve msvc build time ( #18209 )
2025-12-20 13:45:45 -06:00
Imad Saddik
642a4d68b8
chore: update webui build output
2025-12-20 15:44:02 +01:00
Imad Saddik
9b87eaf898
Updated the disabled text
2025-12-20 15:42:21 +01:00
Imad Saddik
fac6ef71a8
chore: update webui build output
2025-12-20 15:27:35 +01:00
Imad Saddik
37be01ac4c
Applied formatting
2025-12-20 15:26:23 +01:00
Imad Saddik
689b7a5bd6
Removed console log and refactored the code
2025-12-20 15:25:41 +01:00
Imad Saddik
6db71973ab
Applied formatting
2025-12-20 15:18:41 +01:00
Imad Saddik
b519235e0a
Added the ability to set custom pixel values in the combobox
2025-12-20 15:17:52 +01:00
Imad Saddik
3be006dcaf
Display the custom chat width combobox in the settings with just presets
2025-12-20 14:29:15 +01:00
Aadeshveer Singh
10b4f82d44
Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context ( #18212 )
2025-12-20 19:28:57 +08:00
Imad Saddik
14929a77b0
Added the command and popover components
2025-12-20 11:10:06 +01:00
Oleksandr Kuvshynov
408616adbd
server : [easy] fix per round speculative decode logging ( #18211 )
Currently we always log 0, because we clear slot.drafted beforehand.
To reproduce, run llama-server with devstral-2 as the main model,
devstral-2-small as the draft model (-md), and verbose logging:
```
% ./build/bin/llama-server -v \
-m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
-md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
-c 8192 2> /tmp/llama.cpp.debug
```
Check the log:
```
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 741
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 746
slot update_slots: id 3 | task 0 | accepted 16/0 draft tokens, new n_tokens = 763
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 775
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 778
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 783
slot update_slots: id 3 | task 0 | accepted 8/0 draft tokens, new n_tokens = 792
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 795
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 797
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 799
slot update_slots: id 3 | task 0 | accepted 0/0 draft tokens, new n_tokens = 800
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 803
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 805
slot update_slots: id 3 | task 0 | accepted 6/0 draft tokens, new n_tokens = 812
slot update_slots: id 3 | task 0 | accepted 3/0 draft tokens, new n_tokens = 816
```
After the fix, we get correct per-round logging:
```
slot update_slots: id 3 | task 0 | accepted 7/8 draft tokens, new n_tokens = 654
slot update_slots: id 3 | task 0 | accepted 1/2 draft tokens, new n_tokens = 656
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 659
slot update_slots: id 3 | task 0 | accepted 1/16 draft tokens, new n_tokens = 661
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 664
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 681
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 698
slot update_slots: id 3 | task 0 | accepted 3/4 draft tokens, new n_tokens = 702
slot update_slots: id 3 | task 0 | accepted 5/12 draft tokens, new n_tokens = 708
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 725
slot update_slots: id 3 | task 0 | accepted 1/1 draft tokens, new n_tokens = 727
slot update_slots: id 3 | task 0 | accepted 8/16 draft tokens, new n_tokens = 736
```
2025-12-20 10:57:40 +01:00
Imad Saddik
416bb35130
Used the new getChatWidth function in ChatProcessingInfo
2025-12-20 09:28:55 +01:00
Imad Saddik
62614a5faa
Used the new getChatWidth function in all chat messages components that need it
2025-12-20 09:27:49 +01:00
Xuan-Son Nguyen
9e39a1e6a9
server: support load model on startup, support preset-only options ( #18206 )
* server: support autoload model, support preset-only options
* add docs
* load-on-startup
* fix
* Update common/arg.cpp
Co-authored-by: Pascal <admin@serveurperso.com>
---------
Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-20 09:25:27 +01:00
Imad Saddik
dccfcc02eb
Used the new getChatWidth function in ChatForm
2025-12-20 09:09:45 +01:00
Imad Saddik
33d8d0f461
Performed formatting
2025-12-20 09:07:54 +01:00
Imad Saddik
61d99bbd88
Used the new chat width logic in ChatScreen and ChatWarning
2025-12-20 09:07:04 +01:00
Imad Saddik
b6dbbcc1fb
Moved and renamed the width-classes.ts file
2025-12-20 09:04:13 +01:00
Imad Saddik
d784cf9bea
Renamed the settings keys and added a new field in the settings
2025-12-20 08:51:27 +01:00
Imad Saddik
fe680a932b
Added support for custom width presets and renamed the constants
2025-12-20 08:49:46 +01:00
Imad Saddik
1cccfaea0f
Added new records to SETTING_CONFIG_DEFAULT
2025-12-20 08:25:01 +01:00
Sigbjørn Skjæret
74e05131e9
ci : remove non-windows zip artifacts ( #18201 )
* remove non-windows zip artifacts
* add cuda dll links
2025-12-19 22:29:46 +01:00
Sigbjørn Skjæret
f74747d886
ci : only save ccache on master ( #18207 )
2025-12-19 22:29:37 +01:00
Alfred
ce734a8a2f
ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations ( #17977 )
* feat: implement real Q8_0
* feat: adding cmake option for configuring FP32 quantize group size
* typo: set() shall be used
---------
Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>
2025-12-19 09:42:28 -08:00
Pascal
14931a826e
arg: fix order to use short form before long form ( #18196 )
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified
* arg: update doc
2025-12-19 18:01:56 +01:00
Julius Tischbein
f99ef53d2a
llama : Changing off_t to size_t for Windows ( #18204 )
2025-12-19 16:42:46 +02:00
Aman Gupta
cc0a04343e
server: friendlier error msg when ctx < input ( #18174 )
* llama-server: friendlier error msg when ctx < input
This PR adds formatted strings to the server's send_error function
* llama-server: use string_format inline
* fix test
2025-12-19 12:10:00 +01:00
Xuan-Son Nguyen
98c1c7a7bf
presets: refactor, allow cascade presets from different sources, add global section ( #18169 )
* presets: refactor, allow cascade presets from different sources
* update docs
* fix neg arg handling
* fix empty mmproj
* also filter out server-controlled args before to_ini()
* skip loading custom_models if not specified
* fix unset_reserved_args
* fix crash on windows
2025-12-19 12:08:20 +01:00
Aleksander Grygier
acb73d8340
webui: Add editing attachments in user messages ( #18147 )
* feat: Enable editing attachments in user messages
* feat: Improvements for data handling & UI
* docs: Update Architecture diagrams
* chore: update webui build output
* refactor: Exports
* chore: update webui build output
* feat: Add handling paste for Chat Message Edit Form
* chore: update webui build output
* refactor: Cleanup
* chore: update webui build output
2025-12-19 11:14:07 +01:00