Georgi Gerganov
2a85f720b8
server : handle closed connection for tasks ( #18459 )
2025-12-29 15:34:41 +02:00
Daniel Bevenius
7cbec34a63
model-conversion : add device option to embd run orig model ( #18386 )
...
This commit adds a device selection option to the original-model embedding
script. Users can now specify the device (cpu, cuda, mps, auto) via a
command-line argument. The code is also refactored into a more structured
layout.
2025-12-29 13:37:02 +01:00
Héctor Estrada Moreno
0c8986403b
retrieval : use at most n_seq_max chunks ( #18400 )
2025-12-29 13:21:13 +02:00
o7si
daa242dfc8
common: fix return value check for setpriority ( #18412 )
...
* common: fix return value check for setpriority
* tools: add logging for process priority setting
2025-12-29 11:07:49 +02:00
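On POSIX systems, setpriority() returns 0 on success and -1 on error, so the return value has to be checked directly (with errno consulted for the reason) rather than being ignored. A minimal sketch of that pattern for the fix above; the helper name set_process_priority and the logging are illustrative, not the actual code in common/:

```cpp
#include <sys/resource.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Illustrative sketch: set the priority of the current process and report
// failure, mirroring the kind of return-value check the commit describes.
static bool set_process_priority(int prio) {
    errno = 0;
    if (setpriority(PRIO_PROCESS, 0, prio) != 0) {
        std::fprintf(stderr, "failed to set process priority to %d: %s\n",
                     prio, std::strerror(errno));
        return false;
    }
    std::fprintf(stderr, "process priority set to %d\n", prio);
    return true;
}
```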
Johannes Gäßler
e70e640db3
CUDA: Blackwell features for non-native builds ( #18436 )
2025-12-29 09:35:42 +01:00
Aman Gupta
5fa66c6e67
cuda: fix race condition in cumsum ( #18448 )
...
* ggml-cuda: fix race condition in cumsum
* remove unnecessary sync_threads
2025-12-29 14:07:17 +08:00
Tim Neumann
382808c14b
ci : re-enable rocm build on amd64 ( #18439 )
...
This was disabled in #9340 due to a compiler crash, but it seems to build now, as confirmed by the latest comments in #11913.
I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages: `full`, `server`, and `light`).
A quick attempt at building an arm64 image failed. Since none of the other images are built for arm, I only enabled the amd64 one.
The `runs_on` option was added to match the other entries.
2025-12-29 00:29:23 +01:00
uvos
4ffc47cb20
HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated ( #18202 )
2025-12-28 20:12:55 +01:00
momonga
9c675c7140
model : Plamo3 support ( #17304 )
...
* plamo3
* fix plamo3
* clean code
* clean up the code
* fix diff
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* add chat_template if it exists
* clean up the code
* fix cpu-backend
* chore: whitespace trim fix + typo fix
* Fix: address review feedback
* restore `FREQ_BASE_SWA` constant
* Fix: address review feedback2
* Fix:typecheck
* Fix: address review feedback3
* final cleanup
---------
Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 17:28:31 +01:00
Aman Gupta
07a0c4ba92
Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON ( #18413 )" ( #18426 )
2025-12-28 20:53:36 +08:00
o7si
60f17f56da
rpc: fix segfault on invalid endpoint format ( #18387 )
...
* rpc: fix segfault on invalid endpoint format
* rpc: add error log for failed endpoint connection
2025-12-28 12:34:41 +02:00
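Segfaults of the kind described above typically come from splitting an endpoint string like "host:port" without checking that the separator is actually present before using the result. A hedged sketch of defensive parsing; the names are illustrative and this is not the actual RPC code:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Illustrative sketch: parse "host:port" and reject malformed input
// instead of proceeding with an invalid split position.
static bool parse_endpoint(const std::string & endpoint, std::string & host, int & port) {
    const size_t pos = endpoint.rfind(':');
    if (pos == std::string::npos || pos == 0 || pos + 1 >= endpoint.size()) {
        std::fprintf(stderr, "invalid endpoint format: '%s' (expected host:port)\n", endpoint.c_str());
        return false;
    }
    host = endpoint.substr(0, pos);

    char * end = nullptr;
    const long p = std::strtol(endpoint.c_str() + pos + 1, &end, 10);
    if (end == nullptr || *end != '\0' || p <= 0 || p > 65535) {
        std::fprintf(stderr, "invalid port in endpoint: '%s'\n", endpoint.c_str());
        return false;
    }
    port = (int) p;
    return true;
}
```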
Johannes Gäßler
f8d561eb87
llama-fit-params: fix step size for last device ( #18415 )
2025-12-28 10:52:09 +01:00
Johannes Gäßler
e59efe6a78
github: update issue templates [no ci] ( #18410 )
...
* github: update issue templates [no ci]
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 10:50:56 +01:00
Xuan-Son Nguyen
cffa5c46ea
mtmd: clarify that we no longer accept AI-generated PRs ( #18406 )
2025-12-28 09:57:04 +01:00
Boian Berberov
94de74e7b1
cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` ( #18186 )
...
* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`
* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`
- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`
Resolves: #17966
2025-12-28 09:33:29 +02:00
QDelta
4fd59e8427
ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON ( #18413 )
2025-12-28 09:33:14 +08:00
lhez
08566977a7
opencl: allow resizing transpose buffers ( #18384 )
...
* opencl: allow resizing transpose buffers instead of using fixed sizes
* opencl: remove commented code
2025-12-27 15:51:14 -08:00
Johannes Gäßler
a4bf35889e
llama-fit-params: fix overflow check ( #18354 )
2025-12-27 20:20:45 +01:00
Johannes Gäßler
026d2ad472
llama: fix magic number of 999 for GPU layers ( #18266 )
...
* llama: fix magic number of 999 for GPU layers
* use strings for -ngl, -ngld
* encapsulate n_gpu_layers, split_mode
2025-12-27 20:18:35 +01:00
Aman Gupta
06705fdcb3
ggml-cuda: Use same regex for GGML_NATIVE=OFF ( #18407 )
2025-12-27 19:56:27 +08:00
Johannes Gäßler
a52dc60ba3
llama_fit_params: return enum for fail vs. error ( #18374 )
2025-12-27 09:59:19 +01:00
Johannes Gäßler
9045c9afe5
llama-fit-params: fix Gemma 3 calculation ( #18372 )
2025-12-27 09:56:04 +01:00
Jeff Bolz
c9ced4910b
vulkan: preprocess mul_mat_id experts and discard workgroups more quickly ( #18352 )
...
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
2025-12-26 16:12:58 -06:00
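The idea is a two-pass scheme: first count how often each expert appears in the routing output, then discard any per-expert work whose count is zero instead of launching it and bailing out late. A CPU-side sketch of the counting pass under assumed names; the real change lives in the Vulkan mul_mat_id shaders:

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: count expert usage from the per-token routing ids
// (n_tokens * n_expert_used entries), then use the counts to decide which
// per-expert workgroups actually have work to do.
std::vector<uint32_t> count_expert_usage(const std::vector<int32_t> & expert_ids, uint32_t n_experts) {
    std::vector<uint32_t> counts(n_experts, 0);
    for (int32_t id : expert_ids) {
        if (id >= 0 && (uint32_t) id < n_experts) {
            counts[id]++;
        }
    }
    return counts;
}

// A workgroup assigned to expert `e` can be discarded immediately when
// counts[e] == 0, without scanning the routing data itself.
bool expert_has_work(const std::vector<uint32_t> & counts, uint32_t e) {
    return counts[e] > 0;
}
```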
Jeff Bolz
7ac8902133
vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader ( #18349 )
...
* vulkan: Use BK=32 for coopmat2 mul_mat_id
* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader
Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.
Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
2025-12-26 18:15:50 +01:00
Jeff Bolz
9bf20d8ac3
vulkan: Use BK=32 for coopmat2 mul_mat_id ( #18332 )
2025-12-26 18:15:02 +01:00
Eve
cb999704fb
vulkan: small dequantization improvements ( #18380 )
...
* iq4_xs
* quants
2025-12-26 18:12:11 +01:00
Jeff Bolz
b96b82fc85
vulkan: Support UPSCALE w/antialias ( #18327 )
2025-12-26 17:00:57 +01:00
Jeff Bolz
10dc500bdb
vulkan: handle rope with large number of rows ( #18306 )
2025-12-26 16:53:46 +01:00
o7si
4893cc07bb
server : fix crash when seq_rm fails for hybrid/recurrent models ( #18391 )
...
* server : fix crash when seq_rm fails for hybrid/recurrent models
* server : add allow_processing param to clear_slot
2025-12-26 16:35:29 +01:00
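For recurrent and hybrid models, removing only part of a sequence from memory can fail, so the caller has to check the result and fall back to clearing the sequence rather than assuming success. A hedged sketch of that pattern, assuming the llama_memory_seq_rm API; the actual server logic around clear_slot and allow_processing is more involved:

```cpp
#include "llama.h"

// Illustrative sketch: try to drop cache entries past `keep_until`; if the
// memory type does not support partial removal (recurrent/hybrid models),
// clear the whole sequence and signal that the prompt must be reprocessed.
bool trim_cache(llama_context * ctx, llama_seq_id seq_id, llama_pos keep_until) {
    llama_memory_t mem = llama_get_memory(ctx);
    if (!llama_memory_seq_rm(mem, seq_id, keep_until, -1)) {
        // partial removal failed: drop the entire sequence instead of crashing
        llama_memory_seq_rm(mem, seq_id, -1, -1);
        return false; // caller reprocesses the prompt from scratch
    }
    return true;
}
```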
Francisco Herrera
af3be131c0
docs: added note for pre SYCL Intel hardware ( #18016 )
...
Specify that it's for pre-SYCL Intel hardware
2025-12-26 10:34:30 +08:00
0Marble
b07cda687c
CANN: implement the SSM_CONV operator ( #17737 )
...
* CANN: implement SSM_CONV operator
Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
* CANN: remove custom error limit for SSM_CONV
* CANN: merge SSM_CONV tensor shape/strides into one line
---------
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
2025-12-26 09:12:04 +08:00
Aman Gupta
85c40c9b02
ggml-cuda: fix regex for arch list ( #18371 )
...
* ggml-cuda: fix regex for arch list
* make regex exact
2025-12-26 01:35:14 +08:00
Aman Gupta
83b3b1c271
cuda: optimize cumsum cub path ( #18362 )
...
* cuda: optimize cumsum cub path
* remove heavy perf test
2025-12-25 23:55:38 +08:00
Aman Gupta
b0fb0f0aee
ggml-cuda: fix blackwell native builds ( #18361 )
...
* ggml-cuda: fix blackwell native builds
Replace 12x in native architectures by 12xa
* replace for GGML_NATIVE=OFF too
* only replace for native
* remove 120f-virtual for default compilation
---------
Co-authored-by: Aman Gupta <aman>
2025-12-25 22:12:11 +08:00
Penglin Cai
e68c19b0fd
CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 ( #17934 )
...
* CONV_TRANSPOSE_1D kernel_size>255
* remove condition check
* fix the bug of type conversion
* removing trailing whitespaces
* fix: return true in the switch case
2025-12-25 16:46:09 +08:00
Aadeshveer Singh
c54bba869d
ggml : optimize cuda cumsum fallback kernel ( #18343 )
2025-12-25 12:11:13 +08:00
Xuan-Son Nguyen
f5acfb2ffa
server: (router) add stop-timeout option ( #18350 )
...
* server: (router) add stop-timeout option
* also allow stop while loading
* add docs
* unload_lru: also wait for unload to complete
2025-12-24 23:47:49 +01:00
Xuan-Son Nguyen
4cbafad4f0
model: support MiMo-V2-Flash ( #18328 )
...
* mimov2: convert ok
* rename mimov2 --> mimo2
* fix conversion
* runnable not incorrect
* use sink
* add_sliding_window_pattern
* add swa and per-layer n_head_kv
* correct params
* somewhat working
* correct gating func
* nits
* mimo2: wire RMS eps + MoE bias + converter guards
* add co-author
Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>
* use add_rope_freq_base_swa
---------
Co-authored-by: Aaryan Kapoor <aaryankapoor2006@gmail.com>
Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>
2025-12-24 23:07:08 +01:00
Aadeshveer Singh
c184284230
fit-params : fix race condition in fit-params output ( #18276 )
2025-12-24 15:57:38 +01:00
Aman Gupta
c8a2417d7b
CUDA: experimental native mxfp4 support for blackwell ( #17906 )
...
* CUDA: experimental native mxfp4 support for blackwell
* optimize load_tiles
* optimize quantize_mxfp4
* cleanup
* first pass review: formatting
* use interleaved layout for mma
* mmq: add assert for size
* use __nv_fp4x4_e2m1
* use iter_k as 512, cleanup
* Use 1200 as blackwell instead of 1000
* address review comments
* mmq: fix stride
* quantize.cu: use reference impl of e8m0 scale
* address review comments
* add 120f-virtual + minor fixes
---------
Co-authored-by: Aman Gupta <aman>
2025-12-24 22:28:26 +08:00
Saba Fallah
54132f1b1f
model : support for LlamaBidirectionalModel architecture ( #18220 )
...
* model: llama-embed-nemotron
* minor: python lint
* changed arch-name
* templated llm_build_llama to be used for both llama and llama-embed arch
2025-12-24 14:02:36 +01:00
Jeff Bolz
2a9ea2020c
vulkan: fix command buffer corruption in ggml_backend_vk_event_wait ( #18302 )
2025-12-24 12:36:34 +01:00
Wang Weixuan
ce7a6dc0fc
CANN : refactor ACL graph cache ( #17752 )
...
Move the graph property checking code into methods of LRU cache.
Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>
2025-12-24 17:50:24 +08:00
Jesse Ikonen
1ce0126b18
docs: Fix typos in SYCL documentation ( #18269 )
2025-12-24 17:19:47 +08:00
Ruben Ortlam
7f459c98e7
vulkan: use fewer FA rows for small cache runs ( #18280 )
2025-12-24 08:59:14 +01:00
TianHao324
cf2ffc02bc
CANN: Uses yarn_ramp cache in ROPE ( #17725 )
2025-12-24 14:55:33 +08:00
ddh0
10355dc7d0
common: add `LLAMA_ARG_OVERRIDE_TENSOR` env var for `-ot` arg ( #18267 )
2025-12-24 14:19:12 +08:00
Xuan-Son Nguyen
5ee4e43f26
server: return_progress to also report 0% processing state ( #18305 )
2025-12-23 21:49:05 +01:00
Pascal
5b6c9bc0f3
webui: apply webui_settings on first load ( #18223 )
...
* webui: apply webui_settings on first load
The webui_settings from /props were not applied on initial load
when default_generation_settings.params was null.
Now they sync whenever serverProps is available, regardless of params;
this works for both single-model and router modes.
* chore: update webui build output
2025-12-23 15:48:03 +01:00
Xuan-Son Nguyen
849d021104
server: fix crash with model not having BOS/EOS ( #18321 )
2025-12-23 14:39:36 +01:00