Pascal
14931a826e
arg: fix order to use short form before long form ( #18196 )
...
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified
* arg: update doc
2025-12-19 18:01:56 +01:00
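The ordering rule described in commit 14931a826e (short form before long form, checked as `first.length() <= last.length()`) can be sketched as below. This is only an illustration in Python, not the code from arg.cpp; the helper names and the option names `-a`, `--alpha`, `-b`, `--beta` are hypothetical. As the commit notes, middle positions in sets of three or more aliases are not checked.

```python
def short_form_first(names: list[str]) -> bool:
    """Return True when the first alias is no longer than the last one."""
    # Same simplified comparison as in the commit message:
    # first.length() <= last.length(). Middle aliases are not inspected.
    return len(names[0]) <= len(names[-1])


def misordered(options: list[list[str]]) -> list[list[str]]:
    """Collect alias lists whose first alias is longer than the last."""
    return [names for names in options if names and not short_form_first(names)]


if __name__ == "__main__":
    # Hypothetical alias sets; the second one lists the long form first.
    options = [["-a", "--alpha"], ["--beta", "-b"]]
    print(misordered(options))
```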
Oliver Simons
1da013c66e
Build with CCCL 3.2 for CUDA backends
...
Gives best perf for backend-sampling on CUDA. Flag can be removed once
CCCL 3.2 is bundled within CTK and that CTK version is used in llama.cpp.
2025-12-19 16:10:51 +01:00
Julius Tischbein
f99ef53d2a
llama : Changing off_t to size_t for Windows ( #18204 )
2025-12-19 16:42:46 +02:00
Oliver Simons
b5ec0fd76c
Update CCCL version to v3.2.0-rc2
2025-12-19 13:42:27 +01:00
Aman Gupta
cc0a04343e
server: friendlier error msg when ctx < input ( #18174 )
...
* llama-server: friendlier error msg when ctx < input
This PR adds formatted strings to the server's send_error function
* llama-server: use string_format inline
* fix test
2025-12-19 12:10:00 +01:00
Xuan-Son Nguyen
98c1c7a7bf
presets: refactor, allow cascade presets from different sources, add global section ( #18169 )
...
* presets: refactor, allow cascade presets from different sources
* update docs
* fix neg arg handling
* fix empty mmproj
* also filter out server-controlled args before to_ini()
* skip loading custom_models if not specified
* fix unset_reserved_args
* fix crash on windows
2025-12-19 12:08:20 +01:00
Oliver Simons
0a17687c72
Make backend dist sampler use same rnd's as dist sampler
...
We sample in double precision and cast to float to match rnd numbers of
llama_sampler_dist which uses double precision (sampling from
std::uniform_real_distribution<double> and
std::uniform_real_distribution<float> with same rng will produce
different sequences).
2025-12-19 11:43:19 +01:00
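The point made in commit 0a17687c72 is that drawing in double precision and then casting is not the same as drawing single-precision values directly from the same generator. The sketch below is a Python analogy only, assuming CPython's `random.Random`; it does not reproduce `std::uniform_real_distribution`, and the function names are hypothetical. It shows that two same-seeded generators agree when both take the "double, then cast" path, while a path that builds each value from 32 random bits consumes the generator differently and diverges.

```python
import random
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def draw_double_then_cast(rng: random.Random, n: int) -> list[float]:
    # Sample in double precision, then narrow each value to float32.
    return [to_float32(rng.random()) for _ in range(n)]


def draw_from_32_bits(rng: random.Random, n: int) -> list[float]:
    # Build each value from 32 random bits instead. This advances the
    # generator at a different rate than rng.random() does.
    return [to_float32(rng.getrandbits(32) / 2**32) for _ in range(n)]


if __name__ == "__main__":
    seed = 1234
    a = draw_double_then_cast(random.Random(seed), 4)
    b = draw_double_then_cast(random.Random(seed), 4)
    c = draw_from_32_bits(random.Random(seed), 4)
    print("same path matches:", a == b)
    print("different path matches:", a == c)
```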
Oliver Simons
1750917420
Fix different RNG-states between backend-sampling and llama-sampling
...
By default, we perform a warm-up step where the ggml_cgraph is computed
once. For backend-sampling, this graph contains the sampler, and thus
the RNG state of the backend's dist sampler is advanced once.
The solution is to reset the samplers after the warm-up has finished.
2025-12-19 11:42:10 +01:00
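The warm-up problem from commit 1750917420 can be reproduced with a toy model: one extra draw during warm-up shifts every later sample unless the sampler is reset afterwards. This is a minimal sketch, assuming a seeded sampler with a `reset()` method; `DistSampler` and `run` are hypothetical stand-ins, not llama.cpp types.

```python
import random


class DistSampler:
    """Toy stand-in for a seeded sampler; not the llama.cpp type."""

    def __init__(self, seed: int) -> None:
        self.seed = seed
        self.rng = random.Random(seed)

    def sample(self) -> float:
        return self.rng.random()

    def reset(self) -> None:
        # Restore the RNG to its freshly seeded state.
        self.rng = random.Random(self.seed)


def run(seed: int, warmup: bool, reset_after_warmup: bool, n: int = 3) -> list[float]:
    sampler = DistSampler(seed)
    if warmup:
        sampler.sample()  # the warm-up pass advances the RNG once
        if reset_after_warmup:
            sampler.reset()  # undo that advance before real sampling
    return [sampler.sample() for _ in range(n)]


if __name__ == "__main__":
    reference = run(42, warmup=False, reset_after_warmup=False)
    print(run(42, warmup=True, reset_after_warmup=False) == reference)
    print(run(42, warmup=True, reset_after_warmup=True) == reference)
```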
Aleksander Grygier
acb73d8340
webui: Add editing attachments in user messages ( #18147 )
...
* feat: Enable editing attachments in user messages
* feat: Improvements for data handling & UI
* docs: Update Architecture diagrams
* chore: update webui build output
* refactor: Exports
* chore: update webui build output
* feat: Add handling paste for Chat Message Edit Form
* chore: update webui build output
* refactor: Cleanup
* chore: update webui build output
2025-12-19 11:14:07 +01:00
Daniel Bevenius
bc5195c585
Merge remote-tracking branch 'upstream/master' into backend-sampling
2025-12-19 09:38:01 +01:00
Daniel Bevenius
0a271d82b4
model-conversion : add verbose flag in run-org-model.py ( #18194 )
...
This commit adds a --verbose flag to the run-org-model.py script to
enable or disable detailed debug output, such as input and output
tensors for each layer. Debug utilities (summarize, debug_hook,
setup_rope_debug) have been moved to utils/common.py.
The motivation for this is that the detailed debug output can be useful
for diagnosing issues with model conversion or execution, but it can
also produce a large amount of output that may not always be needed.
The script will also be further cleaned/refactored in follow-up commits.
2025-12-19 08:43:16 +01:00
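A `--verbose` switch of the kind added in commit 0a271d82b4 is commonly wired up with `argparse`. The sketch below only illustrates that pattern; it is not the contents of run-org-model.py, and `build_parser` and `debug_log` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the original model")
    # Off by default so that large per-layer dumps are opt-in.
    parser.add_argument("--verbose", action="store_true",
                        help="print detailed debug output for each layer")
    return parser


def debug_log(verbose: bool, message: str) -> bool:
    """Print the message only in verbose mode; return whether it was printed."""
    if verbose:
        print(message)
    return verbose


if __name__ == "__main__":
    args = build_parser().parse_args()
    debug_log(args.verbose, "layer 0: input/output tensor summary would go here")
```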
Naco Siren
52fc7fee8a
android: fix missing screenshots for Android.md ( #18156 )
...
* Android basic sample app layout polish
* Add missing screenshots and polish android README doc
* Replace file blobs with URLs served by GitHub pages service.
2025-12-19 09:32:04 +02:00
Jeff Bolz
cdbada8d10
vulkan: Add perf logger mode with concurrency ( #17944 )
...
This implements a variation of the perf logger where rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap. This can be useful to help understand whether
individual operations need to be optimized, or if the group is already running
efficiently.
GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).
GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
2025-12-19 06:36:46 +01:00
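The relationship between the environment variables named in commit cdbada8d10 can be written out as a small decision function. This is a sketch of the rule as stated in the message, not of the Vulkan backend itself. It assumes that GGML_VK_PERF_LOGGER counts as enabled whenever it is present in the environment, which the commit message does not specify; `vk_logger_modes` is a hypothetical helper.

```python
def vk_logger_modes(env: dict[str, str]) -> dict[str, bool]:
    """Derive which logging modes are active from environment variables."""
    # Assumption: the perf logger is on whenever the variable is present.
    perf = "GGML_VK_PERF_LOGGER" in env
    return {
        "perf_logger": perf,
        # The concurrent mode only takes effect when the perf logger is on.
        "perf_logger_concurrent": perf and env.get("GGML_VK_PERF_LOGGER_CONCURRENT") == "1",
        "sync_logger": env.get("GGML_VK_SYNC_LOGGER") == "1",
    }


if __name__ == "__main__":
    import os
    print(vk_logger_modes(dict(os.environ)))
```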
Xuan-Son Nguyen
8ea958d4d9
model : add ASR support for LFM2-Audio-1.5B (conformer) ( #18106 )
...
* ASR with LFM2-Audio-1.5B
* Set rope_theta
* Fix comment
* Remove rope_theta setting
* Address PR feedback
* rename functions to conformer
* remove some redundant ggml_cont
* fix missing tensor
* add prefix "a." for conv tensors
* remove redundant reshape
* clean up
* add test model
---------
Co-authored-by: Tarek Dakhran <tarek@liquid.ai>
2025-12-19 00:18:01 +01:00
Pascal
f9ec8858ed
webui: display prompt processing stats ( #18146 )
...
* webui: display prompt processing stats
* feat: Improve UI of Chat Message Statistics
* chore: update webui build output
* refactor: Post-review improvements
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-18 17:55:03 +01:00
Taimur Ahmad
f716588e63
ggml-cpu: extend support for RVV floating-point kernels ( #17318 )
...
* cmake: add BF16 RVV flag for ggml-cpu
* ggml-cpu: add floating-point conversion kernels
* ggml: add floating-point kernels
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
* ggml-cpu: fix lmul in vec_dot_bf16
* ggml-cpu: change redsum to lmul 4, fix leftover
---------
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2025-12-18 16:02:09 +02:00
Xuan-Son Nguyen
4d1316c440
arg: fix ASAN error on sampler_type_names empty ( #18167 )
2025-12-18 14:30:32 +01:00
Sigbjørn Skjæret
ec7b9329ae
gguf-py : use copy-on-write mode for localtensor ( #18162 )
2025-12-18 13:45:38 +01:00
yulo
54189c0d39
remove i_major_dual ( #18157 )
...
Co-authored-by: zhang hui <you@example.com>
2025-12-18 12:50:56 +01:00
Aleksander Grygier
9ce64aed7d
webui: Fix selecting generated output issues during active streaming ( #18091 )
...
* draft: incremental markdown rendering with stable blocks
* refactor: Logic improvements
* refactor: DRY Markdown post-processing logic
* refactor: ID generation improvements
* fix: Remove runes
* refactor: Clean up & add JSDocs
* chore: update webui static output
* fix: Add tick to prevent race conditions for rendering Markdown blocks
Suggestion from @ServeurpersoCom
Co-authored-by: Pascal <admin@serveurperso.com>
* chore: Run `npm audit fix`
* chore: update webui static output
* feat: Improve performance using global counter & id instead of UUID
* refactor: Enhance Markdown rendering with link and code features
* chore: update webui static output
* fix: Code block content extraction
* chore: update webui static output
* chore: update webui static output
---------
Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-18 11:13:52 +01:00
Kim S.
900316da4e
webui: fix chat screen shadow width ( #18010 )
...
* webui: fix chat screen shadow width
* chore: add index.html.gz
2025-12-18 11:08:42 +01:00
Georgi Gerganov
3b3f5fed31
common : disable backend sampling when grammar is involved
2025-12-18 10:52:21 +02:00
Georgi Gerganov
eefdb0da17
Merge branch 'master' into HEAD
2025-12-18 10:12:47 +02:00
Johannes Gäßler
57c1e05643
llama: offload output layer to GPU first ( #18148 )
2025-12-18 08:12:18 +01:00
Sigbjørn Skjæret
9cff4cc554
convert : sort and use file parts from model index if present ( #18043 )
...
* keep file part order from model index
* treat index as authoritative
* sort index parts
2025-12-18 07:54:54 +01:00
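The behaviour listed in commit 9cff4cc554 (treat the model index as authoritative and sort its parts) might look roughly like the sketch below. It assumes the Hugging Face style index layout, where a `weight_map` object maps tensor names to shard file names, and a default index file name of `model.safetensors.index.json`; the function names are hypothetical and this is not the code from the convert script.

```python
import json
from pathlib import Path
from typing import Optional


def parts_from_weight_map(weight_map: dict[str, str]) -> list[str]:
    """Return the distinct shard file names in sorted order."""
    return sorted(set(weight_map.values()))


def parts_from_index(model_dir, index_name: str = "model.safetensors.index.json") -> Optional[list[str]]:
    """Read shard names from the index file, or return None if it is absent."""
    index_path = Path(model_dir) / index_name
    if not index_path.is_file():
        return None  # caller falls back to scanning the directory
    with index_path.open(encoding="utf-8") as f:
        return parts_from_weight_map(json.load(f)["weight_map"])


if __name__ == "__main__":
    print(parts_from_index("."))
```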
Julius Tischbein
4d4f4cacd1
llama : Async DirectIO model loading on Linux ( #18012 )
...
* Uncached model read
* Removing additional --mmap arg
* Removing trailing whitespaces
* Adding fallback when O_DIRECT is not supported
* Remove branching in llama-model-loader.cpp and reduce code duplications in llama-mmap.cpp
* Adding maybe unused keyword for Mac and Windows.
* File seek aligned
* Removing all branches for direct_io in llama-model-loader.cpp
* Always use alignment from llama_file
* use_mmap=true
2025-12-18 08:27:19 +02:00
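Two of the steps listed in commit 4d4f4cacd1, the fallback when O_DIRECT is not supported and the aligned file seek, can be sketched in a few lines. This is an illustration under assumptions, not the llama-mmap.cpp implementation: the helper names are hypothetical, the alignment is passed in by the caller, and the actual read (which needs an aligned buffer) is left out.

```python
import os


def align_down(offset: int, alignment: int) -> int:
    """Round an offset down to the previous multiple of alignment."""
    return offset - (offset % alignment)


def align_up(size: int, alignment: int) -> int:
    """Round a size up to the next multiple of alignment."""
    return -(-size // alignment) * alignment


def aligned_window(offset: int, size: int, alignment: int) -> tuple[int, int]:
    """Return (start, length) of the aligned region covering the request."""
    start = align_down(offset, alignment)
    end = align_up(offset + size, alignment)
    return start, end - start


def open_uncached(path: str) -> tuple[int, bool]:
    """Open for reading with O_DIRECT when possible; report whether it was used."""
    direct_flag = getattr(os, "O_DIRECT", 0)  # not defined on macOS or Windows
    if direct_flag:
        try:
            return os.open(path, os.O_RDONLY | direct_flag), True
        except OSError:
            pass  # e.g. the filesystem rejects O_DIRECT
    return os.open(path, os.O_RDONLY), False


if __name__ == "__main__":
    print(aligned_window(5000, 100, 4096))
```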
Shouyu
0a0bba05e8
ggml-hexagon: swiglu_oai operation ( #18114 )
...
* snapshot: debug ggml-hexagon swiglu-oai
* fix: fix hvx_min_scalar_f32
* feat: working swiglu-oai
* chore: fix formatting issue
2025-12-17 13:38:21 -08:00
Sigbjørn Skjæret
5166aaf868
convert : force patch_merger tensors to f16/f32 ( #18124 )
2025-12-17 22:15:53 +01:00
Pascal
6ce3d85796
server: (webui) add --webui-config ( #18028 )
...
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
2025-12-17 21:45:45 +01:00
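The backend idea in commit 6ce3d85796 (parse the JSON once at model load, cache it, and serve the cached value from /props) follows a common pattern, sketched below. The class is a hypothetical Python stand-in for the C++ server context, and the `theme` key in the usage lines is an invented example setting, not one of the 14 settings the commit mentions.

```python
import json


class ServerContext:
    """Toy stand-in that caches a parsed WebUI config."""

    def __init__(self, webui_config_json: str = "") -> None:
        self._raw = webui_config_json
        self.webui_settings: dict = {}
        self.parse_count = 0

    def load_model(self) -> None:
        # Parse once here so that later /props requests do no JSON work.
        if self._raw:
            parsed = json.loads(self._raw)  # raises json.JSONDecodeError on bad input
            if not isinstance(parsed, dict):
                raise ValueError("webui config must be a JSON object")
            self.webui_settings = parsed
            self.parse_count += 1

    def props(self) -> dict:
        # Serve the cached settings; nothing is re-parsed per request.
        return {"webui_settings": self.webui_settings}


if __name__ == "__main__":
    ctx = ServerContext('{"theme": "dark"}')  # hypothetical setting name
    ctx.load_model()
    print(ctx.props(), ctx.props(), ctx.parse_count)
```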
Xuan-Son Nguyen
e85e9d7637
server: (router) disable SSL on child process ( #18141 )
2025-12-17 21:39:08 +01:00
Johannes Gäßler
8dcc3662a2
llama-fit-params: fix memory print ( #18136 )
2025-12-17 21:10:03 +01:00
Kim S.
d37fc93505
webui: fix chat header width when sidebar is closed ( #17981 )
...
* webui: fix chat header width when sidebar is closed
* chore: add index.html.gz
2025-12-17 20:05:45 +01:00
Shouyu
4470a0764a
ggml-hexagon: gelu operation ( #17921 )
...
* feat: initial support for gelu using sigmoid approximation
* snapshot: faster gelu using polynomial approximation
* test: disable l2-block prefetch in polynomial approximation
* Revert "test: disable l2-block prefetch in polynomial approximation"
This reverts commit 72339994d4.
* Revert "snapshot: faster gelu using polynomial approximation"
This reverts commit 2a787a61d1.
* debug: temporarily disable unnecessary log message for debug purpose
* Feat: optimized unaligned sigmoid_f32
* Feat: larger l2prefetch block
* feat: apply unaligned-load optimization on mul and mul_scalar
* Revert "debug: temporarily disable unnecessary log message for debug purpose"
This reverts commit 84f2f23aa9.
* refactor: cleanup commented unused code
* chore: reformat code with clang-formatter to pass cli test
* Revert "chore: reformat code with clang-formatter to pass cli test"
This reverts commit 952877ec24.
* fix: fix loop overflow
* chore: fix formatting ci error
2025-12-17 10:39:32 -08:00
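Commit 4470a0764a mentions GELU via a sigmoid approximation. A widely used form is x * sigmoid(1.702 * x); whether the Hexagon kernel uses exactly this constant is an assumption here. The sketch compares that form with the exact erf-based definition on a few points.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU defined through the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gelu_sigmoid(x: float) -> float:
    """Sigmoid-based approximation; 1.702 is the commonly cited constant."""
    return x * sigmoid(1.702 * x)


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 2.0, 3.0):
        diff = abs(gelu_exact(x) - gelu_sigmoid(x))
        print(f"x={x:+.1f} exact={gelu_exact(x):+.4f} approx={gelu_sigmoid(x):+.4f} diff={diff:.4f}")
```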
Georgi Gerganov
4301e27319
common : restore grammar-based rejection sampling ( #18137 )
...
* common : restart grammar-based rejection sampling
* sampling : allow null samplers
2025-12-17 19:46:00 +02:00
Johannes Gäßler
a2c199e479
common: clarify instructions for bug reports ( #18134 )
2025-12-17 18:44:13 +01:00
HonestQiao
15dd67d869
model: fix GLM-ASR-Nano-2512 load error ( #18130 ) ( #18142 )
2025-12-17 16:34:35 +01:00
Daniel Bevenius
981475fedc
tests : add --device option support to backend sampler tests
...
This commit adds support for specifying a device to run the test on.
2025-12-17 15:31:21 +01:00
Xuan-Son Nguyen
bde461de8c
server: (router) allow child process to report status via stdout ( #18110 )
...
* server: (router) allow child process to report status via stdout
* apply suggestions
2025-12-17 14:54:11 +01:00
Piotr Wilkin (ilintar)
8faa87db02
Extend run-org-model.py, add (a) batching (b) loading prompt from file (c) multimodal capacity ( #18034 )
2025-12-17 14:21:51 +01:00
Daniel Bevenius
a519aea35c
tests : fix batch token position tracking in test_backend_sampler.cpp
2025-12-17 13:49:39 +01:00
Johannes Gäßler
6f1f6a961a
Github: ask for -v logs for params_fit [no ci] ( #18128 )
2025-12-17 13:46:48 +01:00
Alberto Cabrera Pérez
669696e00d
ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) ( #18096 )
...
* wip: skeleton for q8_0 repack
* q8_0 repack GEMV implementations
* GEMM implementations
* Formatting
* Fixed format consistency of repack gemm and gemv declarations
* gemv and gemm generic location consistent with declarations
* Removed incorrect unused-variable statements
* Cleanup, consistent style
* Missing generic fallbacks for x86 and powerpc
2025-12-17 13:39:13 +02:00
Tarek Dakhran
982060fadc
model: fix LFM2_MOE missing tensors ( #18132 )
2025-12-17 12:17:11 +01:00
Daniel Bevenius
cc31e6a20e
tests : extract batch info update to separate method
2025-12-17 11:53:15 +01:00
Daniel Bevenius
76a1b7fe8c
tests : remove vocab member from test_model_context
...
Also includes some minor cleanups related to nullptr checks.
2025-12-17 11:48:41 +01:00
Daniel Bevenius
9845996919
tests : use smart pointers for model and context
2025-12-17 11:26:05 +01:00
Daniel Bevenius
9a9ea2f6b1
tests : use smart pointers for backend samplers
2025-12-17 11:08:08 +01:00
Sigbjørn Skjæret
6853bee680
ci : clean up webui jobs ( #18116 )
...
* clean up webui jobs
* refined step control
* forgot dependencies
* apparently always() is needed
2025-12-17 10:45:40 +01:00
Pascal
487674fbb3
common: fix --override-kv to support comma-separated values ( #18056 )
...
* common: fix --override-kv to support comma-separated values
* Update common/arg.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* common: deprecate repeated arguments, suggest comma-separated values
* common: add comma escape support for --override-kv
* common: optimize duplicate detection with insert().second
Co-authored-by: personalmountains <46615898+personalmountains@users.noreply.github.com>
* common: migrate all repeated args to comma-separated syntax
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: personalmountains <46615898+personalmountains@users.noreply.github.com>
2025-12-17 11:36:23 +02:00
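Commit 487674fbb3 describes three pieces: comma-separated values, an escape for literal commas, and duplicate detection in the style of `insert().second`. A minimal sketch follows, assuming a backslash is the escape character (the commit message does not say which character is used); the function names are hypothetical and this is not the code from common/arg.cpp.

```python
def split_escaped(value: str, sep: str = ",", esc: str = "\\") -> list[str]:
    """Split on sep, treating esc followed by sep as a literal separator."""
    parts: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == esc and i + 1 < len(value) and value[i + 1] == sep:
            current.append(sep)  # escaped separator stays inside the item
            i += 2
            continue
        if ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def find_duplicates(items: list[str]) -> list[str]:
    """Return items seen more than once, using a single set lookup each."""
    seen: set[str] = set()
    duplicates: list[str] = []
    for item in items:
        if item in seen:  # analogous to insert().second being false
            duplicates.append(item)
        else:
            seen.add(item)
    return duplicates


if __name__ == "__main__":
    print(split_escaped(r"key1=a\,b,key2=c"))
    print(find_duplicates(["key1", "key2", "key1"]))
```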
yulo
acec774ef6
HIP: Refactor mma for RDNA and CDNA ( #17990 )
...
* mma.cuh for rdna4
* mma for rdna3
* mmq for rdna4
* mmq for rdna3
* align i-major and j-major
* cdna
* fix cuda error
* add missing tile of mfma
* fix j-major wrong ne on CDNA
* fix grammar and empty spaces
---------
Co-authored-by: zhang hui <you@example.com>
2025-12-17 09:34:54 +01:00