Commit Graph

7747 Commits

Author SHA1 Message Date
Georgi Gerganov 6ecba0d0d0
fix 5 2025-12-30 14:53:52 +02:00
Georgi Gerganov 94bfa7803e
fix 4 2025-12-30 14:15:04 +02:00
Georgi Gerganov 3e0a3e865b
fix 3 2025-12-30 14:06:42 +02:00
Georgi Gerganov bd48a0ac10
fix2 2025-12-30 14:02:58 +02:00
Georgi Gerganov ab6f1122a4
fix 2025-12-30 14:02:09 +02:00
Georgi Gerganov faad7d4743
test 2025-12-30 14:00:36 +02:00
Georgi Gerganov 23e8bb4077
arg : add shorthand for --backend-sampling 2025-12-30 13:56:22 +02:00
Daniel Bevenius ebfe545cf9
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-30 07:59:02 +01:00
Xuan-Son Nguyen 51a48720b8
webui: fix prompt progress ETA calculation (#18468)
* webui: fix prompt progress ETA calculation

* handle case done === 0
2025-12-29 21:42:11 +01:00
Pascal c9a3b40d65
Webui/prompt processing progress (#18300)
* webui: display prompt preprocessing progress

* webui: add percentage/ETA and exclude cached tokens from progress

Address review feedback from ngxson

* webui: add minutes and first chunk (0%) case

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: address review feedback from allozaur

* chore: update webui build output

* webui: address review feedback from allozaur

* nit

* chore: update webui build output

* feat: Enhance chat processing state

* feat: Improve chat processing statistics UI

* chore: update webui build output

* feat: Add live generation statistics to processing state hook

* feat: Persist prompt processing stats in hook for better UX

* refactor: Enhance ChatMessageStatistics for live stream display

* feat: Implement enhanced live chat statistics into assistant message

* chore: update webui build output

* fix: Proper tab for each stage of prompt processing/generation

* chore: update webui build output

* fix: Improved ETA calculation & display logic

* chore: update webui build output

* feat: Simplify logic & remove ETA from prompt progress

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-29 19:32:21 +01:00
Johannes Gäßler 0bd1212a43
CUDA: fix replacment of bad archs in CMake (#18457) 2025-12-29 17:58:20 +01:00
wbtek 5b1248c9af
server : Cmdline arg -to changes http read timeout from current 600sec default (#18279)
* Prevent crash if TTFT >300sec, boosted to 90 days

* server : allow configurable HTTP timeouts for child models

* server : pass needed timeouts from params only

---------

Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net>
2025-12-29 17:12:48 +01:00
Xuan-Son Nguyen 3595ae5963
contributing: tighten AI usage policy (#18388)
* contributing: tighten AI usage policy

* refactor AGENTS.md

* proofreading

* update contributing

* add claude.md

* add trailing newline

* add note about dishonest practices

* rm point about dishonest

* rm requirement watermarking

* add .gemini/settings.json

* allow initially AI-generated content

* revise

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* improve

* trailing space

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* update

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-29 16:01:32 +01:00
Naco Siren c1366056f6
android: routine maintenance - Dec 2025 (#18338)
* Fix `msg` typo

* Fix thread safety in destroy() to support generation abortion in lifecycle callbacks.

* UI polish: stack new message change from below; fix GGUF margin not in view port

* Bug fixes: rare racing condition when main thread updating view and and default thread updating messages at the same time; user input not disabled during generation.

* Bump dependencies' versions; Deprecated outdated dsl usage.
2025-12-29 15:51:13 +02:00
Georgi Gerganov 2a85f720b8
server : handle closed connection for tasks (#18459) 2025-12-29 15:34:41 +02:00
Daniel Bevenius 7cbec34a63
model-conversion : add device option to embd run orig model (#18386)
This commit refactors the original model embedding script to include a
device selection option. Users can now specify the device (cpu, cuda,
mps, auto) via command-line arguments. It also refactors the code to be
more structured.
2025-12-29 13:37:02 +01:00
Héctor Estrada Moreno 0c8986403b
retrieval : use at most n_seq_max chunks (#18400) 2025-12-29 13:21:13 +02:00
o7si daa242dfc8
common: fix return value check for setpriority (#18412)
* common: fix return value check for setpriority

* tools: add logging for process priority setting
2025-12-29 11:07:49 +02:00
Johannes Gäßler e70e640db3
CUDA: Blackwell features for non-native builds (#18436) 2025-12-29 09:35:42 +01:00
Aman Gupta 5fa66c6e67
cuda: fix race condition in cumsum (#18448)
* ggml-cuda: fix race condition in cumsum

* remove unneccesary sync_threads
2025-12-29 14:07:17 +08:00
Tim Neumann 382808c14b
ci : re-enable rocm build on amd64 (#18439)
This was disabled in #9340 due to compiler crash, but seems to build now as confirmed by the latest comments in #11913.

I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages, `full`, `server` and `light`).

A quick attempt at trying to build an arm64 image failed. Since none of the other images are build for arm, I only enabled the amd64 one.

The `runs_on` option was added to match the other entries.
2025-12-29 00:29:23 +01:00
uvos 4ffc47cb20
HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202) 2025-12-28 20:12:55 +01:00
momonga 9c675c7140
model : Plamo3 support (#17304)
* plamo3

* fix plamo3

* clean code

* clean up the code

* fix diff

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* add chat_template if exist

* clean up the code

* fix cpu-backend

* chore: whitespace trim fix + typo fix

* Fix: address review feedback

* restore `FREQ_BASE_SWA` constant

* Fix: address review feedback2

* Fix:typecheck

* Fix: address review feedback3

* final cleanup

---------

Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 17:28:31 +01:00
Aman Gupta 07a0c4ba92
Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (#18426) 2025-12-28 20:53:36 +08:00
o7si 60f17f56da
rpc: fix segfault on invalid endpoint format (#18387)
* rpc: fix segfault on invalid endpoint format

* rpc: add error log for failed endpoint connection
2025-12-28 12:34:41 +02:00
Johannes Gäßler f8d561eb87
llama-fit-params: fix step size for last device (#18415) 2025-12-28 10:52:09 +01:00
Johannes Gäßler e59efe6a78
github: update issue templates [no ci] (#18410)
* github: update issue templates [no ci]

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 10:50:56 +01:00
Xuan-Son Nguyen cffa5c46ea
mtmd: clarify that we no longer accept AI-generated PRs (#18406) 2025-12-28 09:57:04 +01:00
Boian Berberov 94de74e7b1
cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (#18186)
* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`

Resolves: #17966
2025-12-28 09:33:29 +02:00
Daniel Bevenius 060c0a585e
ggml : include cub/cub.cuh instead of block_scan.cuh
This commit updates the include directive in cumsum.cu to use
cub/cub.cuh instead of cub/block/block_scan.cuh.

The motivation of this change is that without it compilation fails
with the following error:
```console
/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(196): error: name followed by "::" must be a class or namespace name
      cub::DeviceScan::InclusiveSum(nullptr,
           ^

/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(207): error: name followed by "::" must be a class or namespace name
      cub::DeviceScan::InclusiveSum((void *) tmp_alloc.get(), tmp_size, src, dst, ne, stream);
           ^

2 errors detected in the compilation of "/llama.cpp/ggml/src/ggml-cuda/cumsum.cu".
gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:317: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cumsum.cu.o] Error 2
```
Commit 83b3b1c271 ("cuda: optimize
cumsum cub path (#18362)") updated the include directive replacing
device_scan.cuh which is causing this issue.

This commit uses cub/cub.cuh umbrella header which is consistent with
other files in the ggml-cuda directory like mean.cu, sum.cu, etc.
2025-12-28 08:03:04 +01:00
Daniel Bevenius 82c2600585
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-28 07:34:17 +01:00
QDelta 4fd59e8427
ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413) 2025-12-28 09:33:14 +08:00
lhez 08566977a7
opencl: allow resizing transpose buffers (#18384)
* opencl: allow resizing transpose buffers instead of using fixed sizes

* opencl: remove commented code
2025-12-27 15:51:14 -08:00
Johannes Gäßler a4bf35889e
llama-fit-params: fix overflow check (#18354) 2025-12-27 20:20:45 +01:00
Johannes Gäßler 026d2ad472
llama: fix magic number of 999 for GPU layers (#18266)
* llama: fix magic number of 999 for GPU layers

* use strings for -ngl, -ngld

* enacapsulate n_gpu_layers, split_mode
2025-12-27 20:18:35 +01:00
Aman Gupta 06705fdcb3
ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407) 2025-12-27 19:56:27 +08:00
Johannes Gäßler a52dc60ba3
llama_fit_params: return enum for fail vs. error (#18374) 2025-12-27 09:59:19 +01:00
Johannes Gäßler 9045c9afe5
llama-fit-params: fix Gemma 3 calculation (#18372) 2025-12-27 09:56:04 +01:00
Jeff Bolz c9ced4910b
vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352)
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
2025-12-26 16:12:58 -06:00
Jeff Bolz 7ac8902133
vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
2025-12-26 18:15:50 +01:00
Jeff Bolz 9bf20d8ac3
vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332) 2025-12-26 18:15:02 +01:00
Eve cb999704fb
vulkan: small dequantization improvements (#18380)
* iq4_xs

* quants
2025-12-26 18:12:11 +01:00
Jeff Bolz b96b82fc85
vulkan: Support UPSCALE w/antialias (#18327) 2025-12-26 17:00:57 +01:00
Jeff Bolz 10dc500bdb
vulkan: handle rope with large number of rows (#18306) 2025-12-26 16:53:46 +01:00
o7si 4893cc07bb
server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)
* server : fix crash when seq_rm fails for hybrid/recurrent models

* server : add allow_processing param to clear_slot
2025-12-26 16:35:29 +01:00
Francisco Herrera af3be131c0
docs: added note for pre SYCL Intel hardware (#18016)
Specify that it's for pre sycl hardware
2025-12-26 10:34:30 +08:00
0Marble b07cda687c
CANN: implement the SSM_CONV operator (#17737)
* CANN: implement SSM_CONV operator

Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

* CANN: remove custom error limit for SSM_CONV

* CANN: merge SSM_CONV tensor shape/strides into one line

---------

Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
2025-12-26 09:12:04 +08:00
Aman Gupta 85c40c9b02
ggml-cuda: fix regex for arch list (#18371)
* ggml-cuda: fix regex for arch list

* make regex exact
2025-12-26 01:35:14 +08:00
Aman Gupta 83b3b1c271
cuda: optimize cumsum cub path (#18362)
* cuda: optimize cumsum cub path

* remove heavy perf test
2025-12-25 23:55:38 +08:00
Aman Gupta b0fb0f0aee
ggml-cuda: fix blackwell native builds (#18361)
* ggml-cuda: fix blackwell native builds

Replace 12x in native architectures by 12xa

* replace for GGML_NATIVE=OFF too

* only replace for native

* remove 120f-virtual for default compilation

---------

Co-authored-by: Aman Gupta <aman>
2025-12-25 22:12:11 +08:00