Georgi Gerganov
6ecba0d0d0
fix 5
2025-12-30 14:53:52 +02:00
Georgi Gerganov
94bfa7803e
fix 4
2025-12-30 14:15:04 +02:00
Georgi Gerganov
3e0a3e865b
fix 3
2025-12-30 14:06:42 +02:00
Georgi Gerganov
bd48a0ac10
fix 2
2025-12-30 14:02:58 +02:00
Georgi Gerganov
ab6f1122a4
fix
2025-12-30 14:02:09 +02:00
Georgi Gerganov
faad7d4743
test
2025-12-30 14:00:36 +02:00
Georgi Gerganov
23e8bb4077
arg : add shorthand for --backend-sampling
2025-12-30 13:56:22 +02:00
Daniel Bevenius
ebfe545cf9
Merge remote-tracking branch 'upstream/master' into backend-sampling
2025-12-30 07:59:02 +01:00
Xuan-Son Nguyen
51a48720b8
webui: fix prompt progress ETA calculation ( #18468 )
...
* webui: fix prompt progress ETA calculation
* handle case done === 0
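A minimal sketch of the guarded ETA math, written in C++ for illustration (the webui itself is TypeScript); the function and parameter names are assumptions, not the actual patch:
```cpp
#include <optional>

// A guarded ETA estimate: with zero tokens done there is no rate to
// extrapolate from, so report "unknown" instead of dividing by zero.
// All names here are illustrative.
std::optional<double> eta_seconds(int done, int total, double elapsed_s) {
    if (done <= 0 || total <= 0 || elapsed_s <= 0.0 || done >= total) {
        return std::nullopt; // no measurable rate yet, or nothing left to do
    }
    const double rate = done / elapsed_s;  // tokens per second so far
    return (total - done) / rate;          // seconds for the remaining tokens
}
```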
2025-12-29 21:42:11 +01:00
Pascal
c9a3b40d65
Webui/prompt processing progress ( #18300 )
...
* webui: display prompt preprocessing progress
* webui: add percentage/ETA and exclude cached tokens from progress
Address review feedback from ngxson
* webui: add minutes and first chunk (0%) case
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: address review feedback from allozaur
* chore: update webui build output
* webui: address review feedback from allozaur
* nit
* chore: update webui build output
* feat: Enhance chat processing state
* feat: Improve chat processing statistics UI
* chore: update webui build output
* feat: Add live generation statistics to processing state hook
* feat: Persist prompt processing stats in hook for better UX
* refactor: Enhance ChatMessageStatistics for live stream display
* feat: Implement enhanced live chat statistics into assistant message
* chore: update webui build output
* fix: Proper tab for each stage of prompt processing/generation
* chore: update webui build output
* fix: Improved ETA calculation & display logic
* chore: update webui build output
* feat: Simplify logic & remove ETA from prompt progress
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
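A minimal sketch of the cache-aware progress fraction described above, in C++ for illustration (the webui is TypeScript); the counter names are assumptions:
```cpp
// Tokens already in the prompt cache are excluded from both numerator and
// denominator, so the bar only reflects work actually done this turn.
// Names are illustrative, not the actual webui code.
float prompt_progress(int n_processed, int n_cached, int n_total) {
    const int n_todo = n_total - n_cached;
    if (n_todo <= 0) {
        return 1.0f; // fully cached prompt: nothing left to process
    }
    const int n_done = n_processed - n_cached;
    return n_done <= 0 ? 0.0f : (float) n_done / (float) n_todo;
}
```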
2025-12-29 19:32:21 +01:00
Johannes Gäßler
0bd1212a43
CUDA: fix replacement of bad archs in CMake ( #18457 )
2025-12-29 17:58:20 +01:00
wbtek
5b1248c9af
server : cmdline arg -to changes HTTP read timeout from the current 600 sec default ( #18279 )
...
* Prevent crash if TTFT >300sec, boosted to 90 days
* server : allow configurable HTTP timeouts for child models
* server : pass needed timeouts from params only
---------
Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net>
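A minimal sketch of wiring such a timeout into the server, assuming cpp-httplib (which llama-server uses for HTTP); the wrapper and parameter names are assumptions, not the actual patch:
```cpp
#include "httplib.h"

// Apply a configurable timeout to the HTTP server's read/write paths so a
// slow time-to-first-token no longer hits a hard-coded 600 sec cutoff.
// Wrapper name is illustrative.
void apply_http_timeouts(httplib::Server & svr, int timeout_s) {
    svr.set_read_timeout (timeout_s, 0); // seconds, microseconds
    svr.set_write_timeout(timeout_s, 0);
}
```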
2025-12-29 17:12:48 +01:00
Xuan-Son Nguyen
3595ae5963
contributing: tighten AI usage policy ( #18388 )
...
* contributing: tighten AI usage policy
* refactor AGENTS.md
* proofreading
* update contributing
* add claude.md
* add trailing newline
* add note about dishonest practices
* rm point about dishonest
* rm requirement watermarking
* add .gemini/settings.json
* allow initially AI-generated content
* revise
* Update CONTRIBUTING.md
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* improve
* trailing space
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* update
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-29 16:01:32 +01:00
Naco Siren
c1366056f6
android: routine maintenance - Dec 2025 ( #18338 )
...
* Fix `msg` typo
* Fix thread safety in destroy() to support aborting generation in lifecycle callbacks.
* UI polish: stack new messages from below; fix GGUF margin not in viewport
* Bug fixes: rare race condition when the main thread updates the view while the default thread updates messages at the same time; user input not disabled during generation.
* Bump dependency versions; deprecate outdated DSL usage.
2025-12-29 15:51:13 +02:00
Georgi Gerganov
2a85f720b8
server : handle closed connection for tasks ( #18459 )
2025-12-29 15:34:41 +02:00
Daniel Bevenius
7cbec34a63
model-conversion : add device option to embd run orig model ( #18386 )
...
This commit adds a device selection option to the original model
embedding script. Users can now specify the device (cpu, cuda,
mps, auto) via command-line arguments. It also restructures the code
to be better organized.
2025-12-29 13:37:02 +01:00
Héctor Estrada Moreno
0c8986403b
retrieval : use at most n_seq_max chunks ( #18400 )
2025-12-29 13:21:13 +02:00
o7si
daa242dfc8
common: fix return value check for setpriority ( #18412 )
...
* common: fix return value check for setpriority
* tools: add logging for process priority setting
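A minimal sketch of the corrected check; the wrapper name and log messages are assumptions, but the setpriority(2) return convention is standard POSIX:
```cpp
#include <sys/resource.h>
#include <cerrno>
#include <cstring>
#include <cstdio>

// setpriority(2) returns 0 on success and -1 on failure (with errno set),
// so the result must be compared against -1, not treated as a boolean or
// as the new priority value. Wrapper name is illustrative.
static bool set_process_priority(int prio) {
    if (setpriority(PRIO_PROCESS, 0, prio) == -1) {
        fprintf(stderr, "failed to set process priority to %d: %s\n",
                prio, strerror(errno));
        return false;
    }
    fprintf(stderr, "process priority set to %d\n", prio);
    return true;
}
```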
2025-12-29 11:07:49 +02:00
Johannes Gäßler
e70e640db3
CUDA: Blackwell features for non-native builds ( #18436 )
2025-12-29 09:35:42 +01:00
Aman Gupta
5fa66c6e67
cuda: fix race condition in cumsum ( #18448 )
...
* ggml-cuda: fix race condition in cumsum
* remove unnecessary sync_threads
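An illustrative CUDA sketch of this bug class, not the actual cumsum kernel: a value published through shared memory needs a barrier before other threads read it, while a barrier after the last shared-memory use is dead weight.
```cpp
// Generic pattern: thread 0 writes to shared memory; without the
// __syncthreads() barrier, other threads may read a stale value (a race).
__global__ void add_first(const float * in, float * out, int n) {
    __shared__ float first;
    if (threadIdx.x == 0) {
        first = in[0];
    }
    __syncthreads(); // ensures the write to 'first' is visible to all threads
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] + first;
    }
}
```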
2025-12-29 14:07:17 +08:00
Tim Neumann
382808c14b
ci : re-enable rocm build on amd64 ( #18439 )
...
This was disabled in #9340 due to a compiler crash, but it seems to build now, as confirmed by the latest comments in #11913 .
I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages, `full`, `server` and `light`).
A quick attempt at building an arm64 image failed. Since none of the other images are built for arm, I only enabled the amd64 one.
The `runs_on` option was added to match the other entries.
2025-12-29 00:29:23 +01:00
uvos
4ffc47cb20
HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated ( #18202 )
2025-12-28 20:12:55 +01:00
momonga
9c675c7140
model : Plamo3 support ( #17304 )
...
* plamo3
* fix plamo3
* clean code
* clean up the code
* fix diff
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* add chat_template if it exists
* clean up the code
* fix cpu-backend
* chore: whitespace trim fix + typo fix
* Fix: address review feedback
* restore `FREQ_BASE_SWA` constant
* Fix: address review feedback2
* Fix: typecheck
* Fix: address review feedback3
* final cleanup
---------
Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 17:28:31 +01:00
Aman Gupta
07a0c4ba92
Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON ( #18413 )" ( #18426 )
2025-12-28 20:53:36 +08:00
o7si
60f17f56da
rpc: fix segfault on invalid endpoint format ( #18387 )
...
* rpc: fix segfault on invalid endpoint format
* rpc: add error log for failed endpoint connection
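A hedged sketch of the defensive parsing idea; this is not the actual RPC code, and all names are illustrative:
```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Validate the host:port shape up front and report a clear error instead
// of indexing past a missing separator (the segfault class fixed here).
static bool parse_endpoint(const std::string & endpoint, std::string & host, int & port) {
    const size_t pos = endpoint.rfind(':');
    if (pos == std::string::npos || pos == 0 || pos + 1 == endpoint.size()) {
        fprintf(stderr, "invalid endpoint '%s', expected host:port\n", endpoint.c_str());
        return false;
    }
    host = endpoint.substr(0, pos);
    char * end = nullptr;
    const long p = strtol(endpoint.c_str() + pos + 1, &end, 10);
    if (*end != '\0' || p <= 0 || p > 65535) {
        fprintf(stderr, "invalid port in endpoint '%s'\n", endpoint.c_str());
        return false;
    }
    port = (int) p;
    return true;
}
```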
2025-12-28 12:34:41 +02:00
Johannes Gäßler
f8d561eb87
llama-fit-params: fix step size for last device ( #18415 )
2025-12-28 10:52:09 +01:00
Johannes Gäßler
e59efe6a78
github: update issue templates [no ci] ( #18410 )
...
* github: update issue templates [no ci]
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 10:50:56 +01:00
Xuan-Son Nguyen
cffa5c46ea
mtmd: clarify that we no longer accept AI-generated PRs ( #18406 )
2025-12-28 09:57:04 +01:00
Boian Berberov
94de74e7b1
cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` ( #18186 )
...
* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`
* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`
- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`
Resolves: #17966
2025-12-28 09:33:29 +02:00
Daniel Bevenius
060c0a585e
ggml : include cub/cub.cuh instead of block_scan.cuh
...
This commit updates the include directive in cumsum.cu to use
cub/cub.cuh instead of cub/block/block_scan.cuh.
The motivation for this change is that without it compilation fails
with the following error:
```console
/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(196): error: name followed by "::" must be a class or namespace name
cub::DeviceScan::InclusiveSum(nullptr,
^
/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(207): error: name followed by "::" must be a class or namespace name
cub::DeviceScan::InclusiveSum((void *) tmp_alloc.get(), tmp_size, src, dst, ne, stream);
^
2 errors detected in the compilation of "/llama.cpp/ggml/src/ggml-cuda/cumsum.cu".
gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:317: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cumsum.cu.o] Error 2
```
Commit 83b3b1c271 ("cuda: optimize
cumsum cub path (#18362 )") updated the include directive, replacing
device_scan.cuh, which caused this issue.
This commit uses the cub/cub.cuh umbrella header, which is consistent
with other files in the ggml-cuda directory such as mean.cu, sum.cu, etc.
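For reference, a minimal sketch of the two-phase cub::DeviceScan::InclusiveSum pattern that the umbrella header makes available; the buffer handling here is an assumption, not the actual cumsum.cu code:
```cpp
#include <cub/cub.cuh> // umbrella header: pulls in DeviceScan, BlockScan, ...

// First call with a null scratch buffer only queries the required size;
// the second call performs the actual inclusive prefix sum.
static cudaError_t inclusive_sum_f32(const float * src, float * dst, int ne, cudaStream_t stream) {
    void * tmp     = nullptr;
    size_t tmp_size = 0;
    cub::DeviceScan::InclusiveSum(tmp, tmp_size, src, dst, ne, stream);
    cudaError_t err = cudaMallocAsync(&tmp, tmp_size, stream);
    if (err != cudaSuccess) {
        return err;
    }
    err = cub::DeviceScan::InclusiveSum(tmp, tmp_size, src, dst, ne, stream);
    cudaFreeAsync(tmp, stream);
    return err;
}
```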
2025-12-28 08:03:04 +01:00
Daniel Bevenius
82c2600585
Merge remote-tracking branch 'upstream/master' into backend-sampling
2025-12-28 07:34:17 +01:00
QDelta
4fd59e8427
ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON ( #18413 )
2025-12-28 09:33:14 +08:00
lhez
08566977a7
opencl: allow resizing transpose buffers ( #18384 )
...
* opencl: allow resizing transpose buffers instead of using fixed sizes
* opencl: remove commented code
2025-12-27 15:51:14 -08:00
Johannes Gäßler
a4bf35889e
llama-fit-params: fix overflow check ( #18354 )
2025-12-27 20:20:45 +01:00
Johannes Gäßler
026d2ad472
llama: fix magic number of 999 for GPU layers ( #18266 )
...
* llama: fix magic number of 999 for GPU layers
* use strings for -ngl, -ngld
* encapsulate n_gpu_layers, split_mode
2025-12-27 20:18:35 +01:00
Aman Gupta
06705fdcb3
ggml-cuda: Use same regex for GGML_NATIVE=OFF ( #18407 )
2025-12-27 19:56:27 +08:00
Johannes Gäßler
a52dc60ba3
llama_fit_params: return enum for fail vs. error ( #18374 )
2025-12-27 09:59:19 +01:00
Johannes Gäßler
9045c9afe5
llama-fit-params: fix Gemma 3 calculation ( #18372 )
2025-12-27 09:56:04 +01:00
Jeff Bolz
c9ced4910b
vulkan: preprocess mul_mat_id experts and discard workgroups more quickly ( #18352 )
...
Run a preprocessing pass to count how many times each expert is used, and
use this to quickly discard workgroups that aren't needed.
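An illustrative CPU-side version of the counting pass (the real implementation is a Vulkan shader); names and layout are assumptions:
```cpp
#include <cstdint>
#include <vector>

// Tally how often each expert id is routed to, so that workgroups assigned
// to unused experts can be discarded immediately.
static std::vector<uint32_t> count_expert_usage(const std::vector<uint32_t> & row_ids, uint32_t n_expert) {
    std::vector<uint32_t> counts(n_expert, 0);
    for (const uint32_t id : row_ids) {
        if (id < n_expert) {
            counts[id]++;
        }
    }
    return counts; // a workgroup for expert e can exit early when counts[e] == 0
}
```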
2025-12-26 16:12:58 -06:00
Jeff Bolz
7ac8902133
vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader ( #18349 )
...
* vulkan: Use BK=32 for coopmat2 mul_mat_id
* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader
Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.
Don't slice/offset the B matrix at ic * BN only to adjust the coordinate back
down to the range [0, BN) in decodeFuncB. Instead, just slice with a row offset
of zero and remove the '& (BN - 1)'. This allows the compiler to combine some
of the shared memory loads.
2025-12-26 18:15:50 +01:00
Jeff Bolz
9bf20d8ac3
vulkan: Use BK=32 for coopmat2 mul_mat_id ( #18332 )
2025-12-26 18:15:02 +01:00
Eve
cb999704fb
vulkan: small dequantization improvements ( #18380 )
...
* iq4_xs
* quants
2025-12-26 18:12:11 +01:00
Jeff Bolz
b96b82fc85
vulkan: Support UPSCALE w/antialias ( #18327 )
2025-12-26 17:00:57 +01:00
Jeff Bolz
10dc500bdb
vulkan: handle rope with large number of rows ( #18306 )
2025-12-26 16:53:46 +01:00
o7si
4893cc07bb
server : fix crash when seq_rm fails for hybrid/recurrent models ( #18391 )
...
* server : fix crash when seq_rm fails for hybrid/recurrent models
* server : add allow_processing param to clear_slot
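A hedged sketch of the fallback, assuming the llama_memory_seq_rm() API; the wrapper is illustrative, not the actual server code:
```cpp
#include "llama.h"

// Recurrent and hybrid caches cannot remove a partial token range, so when
// removal fails the whole sequence is wiped and the caller must reprocess
// the prompt, instead of crashing on an inconsistent cache.
static bool slot_truncate(llama_memory_t mem, llama_seq_id seq_id, llama_pos p0) {
    if (!llama_memory_seq_rm(mem, seq_id, p0, -1)) {
        llama_memory_seq_rm(mem, seq_id, -1, -1); // drop the entire sequence
        return false; // caller must re-process the full prompt for this slot
    }
    return true;
}
```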
2025-12-26 16:35:29 +01:00
Francisco Herrera
af3be131c0
docs: added note for pre-SYCL Intel hardware ( #18016 )
...
Specify that it's for pre-SYCL hardware
2025-12-26 10:34:30 +08:00
0Marble
b07cda687c
CANN: implement the SSM_CONV operator ( #17737 )
...
* CANN: implement SSM_CONV operator
Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
* CANN: remove custom error limit for SSM_CONV
* CANN: merge SSM_CONV tensor shape/strides into one line
---------
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
2025-12-26 09:12:04 +08:00
Aman Gupta
85c40c9b02
ggml-cuda: fix regex for arch list ( #18371 )
...
* ggml-cuda: fix regex for arch list
* make regex exact
2025-12-26 01:35:14 +08:00
Aman Gupta
83b3b1c271
cuda: optimize cumsum cub path ( #18362 )
...
* cuda: optimize cumsum cub path
* remove heavy perf test
2025-12-25 23:55:38 +08:00
Aman Gupta
b0fb0f0aee
ggml-cuda: fix blackwell native builds ( #18361 )
...
* ggml-cuda: fix blackwell native builds
Replace 12x with 12xa in native architectures
* replace for GGML_NATIVE=OFF too
* only replace for native
* remove 120f-virtual for default compilation
---------
Co-authored-by: Aman Gupta <aman>
2025-12-25 22:12:11 +08:00