Herman Semenoff
37adc9c6ba
ggml, llama : use defaulted constructors/destructors ( #17649 )
2025-12-03 07:12:18 +01:00
Marcos Del Sol Vives
16cc3c606e
build: document how to compile with Vulkan using Debian/Ubuntu packages ( #17688 )
2025-12-03 08:25:11 +08:00
Xuan-Son Nguyen
13628d8bdb
server: add --media-path for local media files ( #17697 )
...
* server: add --media-path for local media files
* remove unused fn
2025-12-02 22:49:20 +01:00
Xuan-Son Nguyen
a96283adc4
mtmd: fix --no-warmup ( #17695 )
2025-12-02 22:48:08 +01:00
Ali Tariq
4eba8d9451
ci : RVV1.0 builds with tests ( #16682 )
...
* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Removed apt prompt
* Added RISC-V specific tests with corrections
Corrections included:
1. Changed the test names from Debian to Ubuntu, as it is more stable than Debian Trixie
2. Added an explicit compiler in the cmake command, as GCC versions below 14 have been recorded
to throw errors with rvv1.0 and some other extensions
3. Added dependencies which are not installed by default on RISC-V Ubuntu 24.04
4. Separate ccache directory for each job, as the ccache results differ and sharing one may cause ccache to not work
* Resolved the merge conflict and cleaned up run.sh
* Update ci/run.sh
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Removed previously added build ci for RISC-V
* Removed trailing whitespaces
* corrected build name
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cleanup
* Enabled build tests (1)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Enabled build tests (2)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* enable openssl
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-02 21:46:10 +01:00
Jeff Bolz
61bde8e21f
vulkan: Reduce temporary memory usage for TOP_K ( #17623 )
...
- Compute the row size for the temp buffer based on the output of the first pass.
- Update the shader addressing math to use the output row size.
- Pass the output row size as "ncols_output"; what used to be "ncols_output" is now "k".
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
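A rough sanity check on the figures above (a sketch only: the 16-bytes-per-candidate size is an illustrative assumption, and the first-pass output of 31250 elements is back-solved from the reported 500KB, not read from the shader):

```python
# Hypothetical back-of-the-envelope check of the TOP_K temp-buffer savings.
# The bytes-per-candidate figure and the first-pass output size are
# assumptions for illustration, not values taken from the Vulkan shader.

BYTES_PER_CANDIDATE = 16  # assumed: value + index, padded

def temp_buffer_bytes(row_size, n_rows=1):
    # Temp buffer scales with the row size it is dimensioned for.
    return row_size * BYTES_PER_CANDIDATE * n_rows

# Before: the temp row size followed the full source row (200000 elements).
before = temp_buffer_bytes(200000)  # 3.2 MB, matching the commit message
# After: the temp row size follows the first pass's much smaller output.
after = temp_buffer_bytes(31250)    # 0.5 MB under the assumed first-pass size
```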
2025-12-02 19:22:04 +01:00
xiaobing318
e251e5ebbe
cmake : add utf8 compilation options for msvc ( #17682 )
2025-12-02 19:50:57 +02:00
Chad Voegele
c4357dcc35
Server: Change Invalid Schema from Server Error (500) to User Error (400) ( #17572 )
...
* Make invalid schema a user error (400)
* Move invalid_argument exception handler to ex_wrapper
* Fix test
* Simplify test back to original pattern
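The idea of the change, sketched in a hypothetical wrapper (the names `ex_wrapper` and the dict-shaped responses are illustrative, not the llama.cpp server API): client-input failures such as an invalid schema map to HTTP 400, while everything else stays a 500.

```python
# Illustrative sketch: route errors caused by bad client input to 400
# and reserve 500 for genuine server-side faults.

def ex_wrapper(handler):
    def wrapped(request):
        try:
            return handler(request)
        except ValueError as e:    # stands in for std::invalid_argument (bad schema)
            return {"status": 400, "error": str(e)}
        except Exception as e:     # anything unexpected is a server error
            return {"status": 500, "error": str(e)}
    return wrapped
```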
2025-12-02 17:33:50 +01:00
Daniel Bevenius
aad5a6afd7
sampling : implement temp_ext_backend sampling
...
This commit implements the apply function for the extended temperature
sampling.
2025-12-02 17:26:04 +01:00
Adrien Gallouët
e148380c7c
ggml : use svcntb() for SVE vector length detection ( #17474 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-02 18:21:11 +02:00
TianHao324
a2b0fe8d37
CANN: Disable Ger operator of OUT_PROD on 310p device ( #17563 )
2025-12-02 20:35:23 +08:00
Daniel Bevenius
7f3a72a8ed
ggml : remove redundant n_copies check when setting input/output ( #17612 )
...
This commit removes a redundant check for sched->n_copies > 1 when
setting input and output flags on tensor copies in
ggml_backend_sched_split_graph.
The motivation for this change is to clarify the code, as the outer if
statement already performs this check.
2025-12-02 12:52:45 +01:00
Eric Curtin
b9a37717b0
codeowners : remove ericcurtin ( #17658 )
...
Taking a break from llama.cpp. I wasn't around at the start of llama.cpp
but I want to thank @ggerganov and @slaren for creating a neat community
here.
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-12-02 12:18:15 +01:00
Daniel Bevenius
2595818a68
Merge remote-tracking branch 'upstream/master' into backend-sampling
2025-12-02 12:07:01 +01:00
Adrien Gallouët
f3a9674ae8
llama : fix signed comparison warning on FreeBSD ( #17497 )
...
This ensures correct RLIM_INFINITY handling and compatibility on all platforms (32/64-bit).
warning: comparison of integers of different signs: 'rlim_t' (aka 'long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
488 | if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size)) {
| ~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-02 12:05:38 +01:00
Daniel Bevenius
db8972e251
squash! sampling : fix backend temp sampler for zero temperature
...
This modifies the parent commit to simply return the most probable token
instead of masking the logits.
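Both behaviors select the same token at zero temperature; a minimal sketch of the equivalence, with illustrative names rather than the llama.cpp sampler API:

```python
import math

def sample_zero_temp_by_masking(logits):
    # Parent-commit approach: mask out everything except the best logit,
    # so the resulting distribution is one-hot on that token.
    best = max(range(len(logits)), key=lambda i: logits[i])
    masked = [x if i == best else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

def sample_zero_temp_by_argmax(logits):
    # Squashed approach: just return the most probable token directly.
    return max(range(len(logits)), key=lambda i: logits[i])
```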
2025-12-02 11:53:29 +01:00
Xuan-Son Nguyen
2c453c6c77
convert: add error message for mistral3 quantized weight ( #17686 )
2025-12-02 11:48:31 +01:00
Xuan-Son Nguyen
5d6bd842ea
server: remove default "gpt-3.5-turbo" model name ( #17668 )
...
* server: remove default "gpt-3.5-turbo" model name
* do not reflect back model name from request
* fix test
2025-12-02 11:38:57 +01:00
Oliver Simons
516af33ca6
CUDA: Update CCCL's rc candidate
2025-12-02 11:23:14 +01:00
Oliver Simons
244880ae3a
CUDA: Use standard-compliant preprocessor for MSVC builds
...
Workarounds of https://github.com/NVIDIA/cccl/pull/6791 will not be
backported to CCCL 3.2, only the diagnostics/error messages will:
https://github.com/NVIDIA/cccl/pull/6827
2025-12-02 11:23:14 +01:00
Oliver Simons
559d058dd2
CUDA: Move cccl fetch to after cuda has been enabled in CMakeLists.txt
...
This will allow cccl to set build flags for the CUDA compiler, required
e.g. for MSVC compat, see also
https://github.com/NVIDIA/cccl/pull/6791
2025-12-02 11:23:14 +01:00
senhtry
fd3abe849e
server: fixing naming conflict res_error in server-models.cpp ( #17679 )
2025-12-02 11:18:39 +01:00
Xuan-Son Nguyen
682e6658bb
server: explicitly set exec path when create new instance ( #17669 )
...
* Revert "rm unused fn"
This reverts commit f2dbe9c087.
* server: explicitly set exec path when create new instance
* put back TODO
* only call get_server_exec_path() once
* add fallback logic
2025-12-02 10:25:11 +01:00
Adrien Gallouët
4574f2949e
ci : skip winget update when not in ggml-org ( #17465 )
...
Prevent forks from generating daily failure notifications.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-02 10:15:01 +01:00
Adrien Gallouët
ab6726eeff
ggml : add fallback definition for HWCAP2_SVE2 ( #17683 )
...
This aligns with the other HWCAP2 feature flags.
See #17528
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-02 10:41:26 +02:00
Daniel Bevenius
3e9a258c14
Merge remote-tracking branch 'upstream/master' into gpu-sampling
2025-12-02 09:26:04 +01:00
Aleksander Grygier
cee92af553
Add context info to server error ( #17663 )
...
* fix: Add context info to server error
* chore: update webui build output
2025-12-02 09:20:57 +01:00
Daniel Bevenius
739b597804
sampling : fix backend temp sampler for zero temperature
...
This commit fixes the implementation of the temperature-based sampler
for the case when the temperature is set to zero. This now correctly
selects the most probable token by masking out all other tokens in the
logits.
2025-12-02 09:13:07 +01:00
Aman Gupta
ed32089927
ggml-cuda: reorder only relevant nodes ( #17639 )
2025-12-02 12:36:31 +08:00
Aaron Teo
7b6d745364
release: fix duplicate libs, store symbolic links ( #17299 )
2025-12-02 11:52:05 +08:00
Neo Zhang Jianyu
98bd9ab1e4
enhance argsort for UT ( #17573 )
...
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2025-12-02 08:56:46 +08:00
Piotr Wilkin (ilintar)
746f9ee889
Override SSM_A op for Qwen3 Next to reduce splits ( #17587 )
...
* Override SSM_A op for Qwen3 Next to reduce splits
* New tensor mapping SSM_A_NOSCAN for SSM_A used outside of OP_SSM_SCAN context.
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-02 00:43:13 +01:00
Jeff Bolz
9810cb8247
ops.md: update vulkan support ( #17661 )
2025-12-01 15:26:21 -06:00
Xuan-Son Nguyen
ecf74a8417
mtmd: add mtmd_context_params::warmup option ( #17652 )
...
* mtmd: add mtmd_context_params::warmup option
* reuse the common_params::warmup
2025-12-01 21:32:25 +01:00
Gilad S.
00c361fe53
fix: llama arch implementation ( #17665 )
2025-12-01 21:21:13 +01:00
Xuan-Son Nguyen
ec18edfcba
server: introduce API for serving / loading / unloading multiple models ( #17470 )
...
* server: add model management and proxy
* fix compile error
* does this fix windows?
* fix windows build
* use subprocess.h, better logging
* add test
* fix windows
* feat: Model/Router server architecture WIP
* more stable
* fix unsafe pointer
* also allow terminate loading model
* add is_active()
* refactor: Architecture improvements
* tmp apply upstream fix
* address most problems
* address thread safety issue
* address review comment
* add docs (first version)
* address review comment
* feat: Improved UX for model information, modality interactions etc
* chore: update webui build output
* refactor: Use only the message data `model` property for displaying model used info
* chore: update webui build output
* add --models-dir param
* feat: New Model Selection UX WIP
* chore: update webui build output
* feat: Add auto-mic setting
* feat: Attachments UX improvements
* implement LRU
* remove default model path
* better --models-dir
* add env for args
* address review comments
* fix compile
* refactor: Chat Form Submit component
* add endpoint docs
* Merge remote-tracking branch 'webui/allozaur/server_model_management_v1_2' into xsn/server_model_maagement_v1_2
Co-authored-by: Aleksander <aleksander.grygier@gmail.com>
* feat: Add copy to clipboard to model name in model info dialog
* feat: Model unavailable UI state for model selector
* feat: Chat Form Actions UI logic improvements
* feat: Auto-select model from last assistant response
* chore: update webui build output
* expose args and exit_code in API
* add note
* support extra_args on loading model
* allow reusing args if auto_load
* typo docs
* oai-compat /models endpoint
* cleaner
* address review comments
* feat: Use `model` property for displaying the `repo/model-name` naming format
* refactor: Attachments data
* chore: update webui build output
* refactor: Enum imports
* feat: Improve Model Selector responsiveness
* chore: update webui build output
* refactor: Cleanup
* refactor: Cleanup
* refactor: Formatters
* chore: update webui build output
* refactor: Copy To Clipboard Icon component
* chore: update webui build output
* refactor: Cleanup
* chore: update webui build output
* refactor: UI badges
* chore: update webui build output
* refactor: Cleanup
* refactor: Cleanup
* chore: update webui build output
* add --models-allow-extra-args for security
* nits
* add stdin_file
* fix merge
* fix: Retrieve lost setting after resolving merge conflict
* refactor: DatabaseStore -> DatabaseService
* refactor: Database, Conversations & Chat services + stores architecture improvements (WIP)
* refactor: Remove redundant settings
* refactor: Multi-model business logic WIP
* chore: update webui build output
* feat: Switching models logic for ChatForm or when regenerating messages + modality detection logic
* chore: update webui build output
* fix: Add `untrack` inside chat processing info data logic to prevent infinite effect
* fix: Regenerate
* feat: Remove redundant settings + rearrange
* fix: Audio attachments
* refactor: Icons
* chore: update webui build output
* feat: Model management and selection features WIP
* chore: update webui build output
* refactor: Improve server properties management
* refactor: Icons
* chore: update webui build output
* feat: Improve model loading/unloading status updates
* chore: update webui build output
* refactor: Improve API header management via utility functions
* remove support for extra args
* set hf_repo/docker_repo as model alias when possible
* refactor: Remove ConversationsService
* refactor: Chat requests abort handling
* refactor: Server store
* tmp webui build
* refactor: Model modality handling
* chore: update webui build output
* refactor: Processing state reactivity
* fix: UI
* refactor: Services/Stores syntax + logic improvements
Refactors components to access stores directly instead of using exported getter functions.
This change centralizes store access and logic, simplifying component code and improving maintainability by reducing the number of exported functions and promoting direct store interaction.
Removes exported getter functions from `chat.svelte.ts`, `conversations.svelte.ts`, `models.svelte.ts` and `settings.svelte.ts`.
* refactor: Architecture cleanup
* feat: Improve statistic badges
* feat: Condition available models based on modality + better model loading strategy & UX
* docs: Architecture documentation
* feat: Update logic for PDF as Image
* add TODO for http client
* refactor: Enhance model info and attachment handling
* chore: update webui build output
* refactor: Components naming
* chore: update webui build output
* refactor: Cleanup
* refactor: DRY `getAttachmentDisplayItems` function + fix UI
* chore: update webui build output
* fix: Modality detection improvement for text-based PDF attachments
* refactor: Cleanup
* docs: Add info comment
* refactor: Cleanup
* re
* refactor: Cleanup
* refactor: Cleanup
* feat: Attachment logic & UI improvements
* refactor: Constants
* feat: Improve UI sidebar background color
* chore: update webui build output
* refactor: Utils imports + move types to `app.d.ts`
* test: Fix Storybook mocks
* chore: update webui build output
* test: Update Chat Form UI tests
* refactor: Tooltip Provider from core layout
* refactor: Tests to separate location
* decouple server_models from server_routes
* test: Move demo test to tests/server
* refactor: Remove redundant method
* chore: update webui build output
* also route anthropic endpoints
* fix duplicated arg
* fix invalid ptr to shutdown_handler
* server : minor
* rm unused fn
* add ?autoload=true|false query param
* refactor: Remove redundant code
* docs: Update README documentations + architecture & data flow diagrams
* fix: Disable autoload on calling server props for the model
* chore: update webui build output
* fix ubuntu build
* fix: Model status reactivity
* fix: Modality detection for MODEL mode
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-01 19:41:04 +01:00
Daniel Bevenius
988261b18d
examples : remove outdated backend sampling section
...
This commit removes the outdated section about using backend samplers
from the README.md file in the examples/batched.
2025-12-01 18:20:41 +01:00
Georgi Gerganov
88cca45bb8
sampling : fix top_p empty condition
2025-12-01 18:02:34 +02:00
Georgi Gerganov
04f2822a86
sampling : do not create empty samplers
2025-12-01 17:52:07 +02:00
Georgi Gerganov
4032ce2378
common : simplify sampler chain initialization
2025-12-01 17:11:11 +02:00
Oliver Simons
217469f07f
Make backend's top_p sampler inclusive
...
In addition to matching the algorithm proposed in the original
[paper](https://arxiv.org/abs/1904.09751), this resolves the edge case
where `max_p > top_p` for a single logit, in which the mask would
otherwise be empty (and we would thus sample from the whole vocabulary
with equal likelihood)
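The inclusive cutoff keeps the token whose cumulative probability first crosses `top_p`, so the kept set is never empty; a minimal sketch (illustrative names, not the backend shader):

```python
import math

def top_p_inclusive(logits, top_p):
    # softmax with max-subtraction for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # sort token indices by probability, descending
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)  # inclusive: the crossing token is kept
        cum += probs[i]
        if cum >= top_p:
            break
    return kept
```

With one dominant logit whose probability already exceeds `top_p`, an exclusive cutoff would keep nothing; the inclusive variant keeps that single token.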
2025-12-01 15:28:06 +01:00
Oliver Simons
ae0bb6a6da
Factor out `ggml_sort` into its own function
2025-12-01 15:28:06 +01:00
Xuan-Son Nguyen
7733409734
common: improve verbosity level definitions ( #17630 )
...
* common: improve verbosity level definitions
* string_format
* update autogen docs
2025-12-01 14:38:13 +01:00
Georgi Gerganov
16451d6bc3
Merge branch 'master' into HEAD
2025-12-01 14:47:50 +02:00
Xuan-Son Nguyen
cd3c118908
model: support Ministral3 ( #17644 )
...
* conversion script
* support ministral 3
* maybe this is better?
* add TODO for rope_yarn_log_mul
* better ppl (tested on 14B-Instruct)
* Add Ministral3 support to Mistral format
* improve arch handling
* add sizes
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* nits
---------
Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-01 12:26:52 +01:00
Oliver Simons
8bee483c97
Fix backend_top_p_sampler
...
softmax(softmax) will return a near-uniform distribution, so we should
not return the softmax but the logits instead.
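A quick illustration of why, using plain softmax (a sketch, not llama.cpp's actual kernels): softmax outputs lie in [0, 1], so a second exponentiation can separate probabilities by at most a factor of e, no matter how peaked the original logits were.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

logits = [10.0, 0.0, 0.0]
once = softmax(logits)    # sharply peaked: first entry is ~1.0
twice = softmax(once)     # flattened: all entries within a factor of e
```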
2025-12-01 12:07:30 +01:00
Georgi Gerganov
649495c9d9
metal : add FA head size 48 ( #17619 )
2025-12-01 12:49:53 +02:00
Georgi Gerganov
90c72a614a
ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler ( #17617 )
2025-12-01 12:49:33 +02:00
Aman Gupta
6eea666912
llama-graph: avoid expand_forward for fusion ( #17633 )
2025-12-01 11:12:48 +02:00
Daniel Bevenius
cf0e1475c5
sampling : lower log level for output buffer reallocations [no ci]
...
This commit changes the logging level for output buffer reallocations
in the llama_context::output_reserve function from INFO to DEBUG.
The motivation for this is that the message is currently logged at INFO
level, so when verbose logging is enabled for llama-cli it gets mixed
into the generated output, for example:
```console
What is the capital of Sweden?output_reserve: reallocating output buffer from size 0.58 MiB to 1.74 MiB
1. Stockholm
2\. Helsinki
Based are the options
1. Stockholm
Explanation: Stockholm is the capital of
...
```
2025-12-01 09:13:47 +01:00