Héctor Estrada Moreno
0c8986403b
retrieval : use at most n_seq_max chunks ( #18400 )
2025-12-29 13:21:13 +02:00
o7si
daa242dfc8
common: fix return value check for setpriority ( #18412 )
...
* common: fix return value check for setpriority
* tools: add logging for process priority setting
2025-12-29 11:07:49 +02:00
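Context for the fix: `setpriority` returns 0 on success and -1 on error with `errno` set, so the result has to be compared against 0 rather than interpreted as a priority value. A minimal sketch of the correct pattern, with an illustrative helper name and log message (not the code from the PR):
```cpp
#include <sys/resource.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: returns false and logs when the priority cannot be set.
static bool set_process_priority(int prio) {
    if (setpriority(PRIO_PROCESS, 0, prio) != 0) {  // 0 = success, -1 = error
        fprintf(stderr, "failed to set process priority %d: %s\n", prio, strerror(errno));
        return false;
    }
    return true;
}
```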
Johannes Gäßler
e70e640db3
CUDA: Blackwell features for non-native builds ( #18436 )
2025-12-29 09:35:42 +01:00
Aman Gupta
5fa66c6e67
cuda: fix race condition in cumsum ( #18448 )
...
* ggml-cuda: fix race condition in cumsum
* remove unnecessary sync_threads
2025-12-29 14:07:17 +08:00
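Background on this class of bug: in a shared-memory scan every thread both reads and writes the same buffer, so a barrier is required between the read and write phases of each step. A generic Hillis-Steele sketch, not the kernel from the PR, showing where the barriers belong:
```cpp
// Block-level inclusive cumsum; assumes a single block of <= 256 threads
// and n <= blockDim.x (illustrative only).
__global__ void block_cumsum(const float * x, float * y, const int n) {
    __shared__ float buf[256];
    const int i = threadIdx.x;
    buf[i] = i < n ? x[i] : 0.0f;
    __syncthreads();                     // all writes visible before any read
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        const float v = i >= offset ? buf[i - offset] : 0.0f;
        __syncthreads();                 // read phase done before writes start
        if (i >= offset) {
            buf[i] += v;
        }
        __syncthreads();                 // write phase done before next read
    }
    if (i < n) {
        y[i] = buf[i];
    }
}
```
Dropping either in-loop barrier yields exactly the read/write race the first bullet fixes; the second bullet removes a barrier that guards no such hazard.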
Tim Neumann
382808c14b
ci : re-enable rocm build on amd64 ( #18439 )
...
This was disabled in #9340 due to a compiler crash, but it seems to build now, as confirmed by the latest comments in #11913.
I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages: `full`, `server`, and `light`).
A quick attempt to build an arm64 image failed. Since none of the other images are built for arm, I only enabled the amd64 one.
The `runs_on` option was added to match the other entries.
2025-12-29 00:29:23 +01:00
uvos
4ffc47cb20
HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated ( #18202 )
2025-12-28 20:12:55 +01:00
momonga
9c675c7140
model : Plamo3 support ( #17304 )
...
* plamo3
* fix plamo3
* clean code
* clean up the code
* fix diff
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* add chat_template if it exists
* clean up the code
* fix cpu-backend
* chore: whitespace trim fix + typo fix
* Fix: address review feedback
* restore `FREQ_BASE_SWA` constant
* Fix: address review feedback2
* Fix:typecheck
* Fix: address review feedback3
* final cleanup
---------
Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 17:28:31 +01:00
Aman Gupta
07a0c4ba92
Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON ( #18413 )" ( #18426 )
2025-12-28 20:53:36 +08:00
o7si
60f17f56da
rpc: fix segfault on invalid endpoint format ( #18387 )
...
* rpc: fix segfault on invalid endpoint format
* rpc: add error log for failed endpoint connection
2025-12-28 12:34:41 +02:00
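The crash class here is a parser that assumes a well-formed `host:port` string. A defensive sketch of the idea (a hypothetical helper, not the PR's code):
```cpp
#include <cstdlib>
#include <optional>
#include <string>

struct rpc_endpoint {
    std::string host;
    int         port;
};

// Reject malformed input instead of indexing past a missing ':'.
static std::optional<rpc_endpoint> parse_endpoint(const std::string & s) {
    const size_t pos = s.rfind(':');
    if (pos == std::string::npos || pos == 0 || pos + 1 >= s.size()) {
        return std::nullopt;  // missing host, missing ':', or missing port
    }
    char * end = nullptr;
    const long port = strtol(s.c_str() + pos + 1, &end, 10);
    if (*end != '\0' || port < 1 || port > 65535) {
        return std::nullopt;  // trailing garbage or port out of range
    }
    return rpc_endpoint{s.substr(0, pos), (int) port};
}
```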
Johannes Gäßler
f8d561eb87
llama-fit-params: fix step size for last device ( #18415 )
2025-12-28 10:52:09 +01:00
Johannes Gäßler
e59efe6a78
github: update issue templates [no ci] ( #18410 )
...
* github: update issue templates [no ci]
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 10:50:56 +01:00
Xuan-Son Nguyen
cffa5c46ea
mtmd: clarify that we no longer accept AI-generated PRs ( #18406 )
2025-12-28 09:57:04 +01:00
Boian Berberov
94de74e7b1
cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` ( #18186 )
...
* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`
* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`
- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`
Resolves: #17966
2025-12-28 09:33:29 +02:00
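These extra variants are only built when the CPU backend is compiled as dynamically loadable modules. A typical configuration, assuming the usual release-build flags:
```console
$ cmake -B build -DGGML_BACKEND_DL=ON -DGGML_NATIVE=OFF -DGGML_CPU_ALL_VARIANTS=ON
$ cmake --build build --config Release
```
At runtime, ggml then loads the best variant supported by the host CPU.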
QDelta
4fd59e8427
ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON ( #18413 )
2025-12-28 09:33:14 +08:00
lhez
08566977a7
opencl: allow resizing transpose buffers ( #18384 )
...
* opencl: allow resizing transpose buffers instead of using fixed sizes
* opencl: remove commented code
2025-12-27 15:51:14 -08:00
Johannes Gäßler
a4bf35889e
llama-fit-params: fix overflow check ( #18354 )
2025-12-27 20:20:45 +01:00
Johannes Gäßler
026d2ad472
llama: fix magic number of 999 for GPU layers ( #18266 )
...
* llama: fix magic number of 999 for GPU layers
* use strings for -ngl, -ngld
* encapsulate n_gpu_layers, split_mode
2025-12-27 20:18:35 +01:00
Aman Gupta
06705fdcb3
ggml-cuda: Use same regex for GGML_NATIVE=OFF ( #18407 )
2025-12-27 19:56:27 +08:00
Johannes Gäßler
a52dc60ba3
llama_fit_params: return enum for fail vs. error ( #18374 )
2025-12-27 09:59:19 +01:00
Johannes Gäßler
9045c9afe5
llama-fit-params: fix Gemma 3 calculation ( #18372 )
2025-12-27 09:56:04 +01:00
Jeff Bolz
c9ced4910b
vulkan: preprocess mul_mat_id experts and discard workgroups more quickly ( #18352 )
...
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
2025-12-26 16:12:58 -06:00
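The counting idea behind the preprocess can be sketched independently of the shader. A CPU analog (names such as `expert_ids` are assumptions; the real pass runs in a Vulkan shader over the mul_mat_id routing data):
```cpp
#include <cstdint>
#include <vector>

// Count how many rows route to each expert; a workgroup assigned to expert e
// can then exit immediately when counts[e] == 0 instead of scanning all rows.
static std::vector<uint32_t> count_expert_use(const std::vector<uint32_t> & expert_ids,
                                              const uint32_t n_expert) {
    std::vector<uint32_t> counts(n_expert, 0);
    for (const uint32_t id : expert_ids) {
        counts[id]++;
    }
    return counts;
}
```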
Jeff Bolz
7ac8902133
vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader ( #18349 )
...
* vulkan: Use BK=32 for coopmat2 mul_mat_id
* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader
Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.
Don't slice/offset the B matrix to ic * BN only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead, just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
2025-12-26 18:15:50 +01:00
Jeff Bolz
9bf20d8ac3
vulkan: Use BK=32 for coopmat2 mul_mat_id ( #18332 )
2025-12-26 18:15:02 +01:00
Eve
cb999704fb
vulkan: small dequantization improvements ( #18380 )
...
* iq4_xs
* quants
2025-12-26 18:12:11 +01:00
Jeff Bolz
b96b82fc85
vulkan: Support UPSCALE w/antialias ( #18327 )
2025-12-26 17:00:57 +01:00
Jeff Bolz
10dc500bdb
vulkan: handle rope with large number of rows ( #18306 )
2025-12-26 16:53:46 +01:00
o7si
4893cc07bb
server : fix crash when seq_rm fails for hybrid/recurrent models ( #18391 )
...
* server : fix crash when seq_rm fails for hybrid/recurrent models
* server : add allow_processing param to clear_slot
2025-12-26 16:35:29 +01:00
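Background: recurrent and hybrid-memory models cannot erase an arbitrary token range from their state, so a partial `seq_rm` reports failure instead of silently succeeding. A sketch of the defensive pattern, assuming the current `llama_memory_seq_rm` API (the server's real recovery logic is more involved):
```cpp
#include "llama.h"

// Hypothetical recovery path: recurrent/hybrid memory may refuse to erase a
// token range, so fall back to clearing the whole sequence and re-processing.
static bool clear_from(llama_memory_t mem, llama_seq_id seq_id, llama_pos p0) {
    if (llama_memory_seq_rm(mem, seq_id, p0, -1)) {
        return true;   // partial removal worked; the cached prefix is reusable
    }
    llama_memory_seq_rm(mem, seq_id, -1, -1);  // full clear always succeeds
    return false;      // caller must re-process the prompt from scratch
}
```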
Francisco Herrera
af3be131c0
docs: added note for pre-SYCL Intel hardware ( #18016 )
...
Specify that it's for pre-SYCL hardware
2025-12-26 10:34:30 +08:00
0Marble
b07cda687c
CANN: implement the SSM_CONV operator ( #17737 )
...
* CANN: implement SSM_CONV operator
Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
* CANN: remove custom error limit for SSM_CONV
* CANN: merge SSM_CONV tensor shape/strides into one line
---------
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
2025-12-26 09:12:04 +08:00
Aman Gupta
85c40c9b02
ggml-cuda: fix regex for arch list ( #18371 )
...
* ggml-cuda: fix regex for arch list
* make regex exact
2025-12-26 01:35:14 +08:00
Aman Gupta
83b3b1c271
cuda: optimize cumsum cub path ( #18362 )
...
* cuda: optimize cumsum cub path
* remove heavy perf test
2025-12-25 23:55:38 +08:00
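For reference, the CUB fast path of a cumsum amounts to a device-wide inclusive scan. A standalone sketch of the call pattern, using CUB's documented query-then-run protocol for temporary storage (llama.cpp's actual integration differs):
```cpp
#include <cub/cub.cuh>

// Inclusive prefix sum of n floats on the given stream (illustrative wrapper;
// error checking omitted).
static void cumsum_f32_cub(const float * d_in, float * d_out, int n, cudaStream_t stream) {
    void * d_tmp     = nullptr;
    size_t tmp_bytes = 0;
    // First call only computes the required temporary storage size.
    cub::DeviceScan::InclusiveSum(d_tmp, tmp_bytes, d_in, d_out, n, stream);
    cudaMallocAsync(&d_tmp, tmp_bytes, stream);
    cub::DeviceScan::InclusiveSum(d_tmp, tmp_bytes, d_in, d_out, n, stream);
    cudaFreeAsync(d_tmp, stream);
}
```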
Aman Gupta
b0fb0f0aee
ggml-cuda: fix blackwell native builds ( #18361 )
...
* ggml-cuda: fix blackwell native builds
Replace 12x in native architectures with 12xa
* replace for GGML_NATIVE=OFF too
* only replace for native
* remove 120f-virtual for default compilation
---------
Co-authored-by: Aman Gupta <aman>
2025-12-25 22:12:11 +08:00
Penglin Cai
e68c19b0fd
CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 ( #17934 )
...
* CONV_TRANSPOSE_1D kernel_size>255
* remove condition check
* fix a type conversion bug
* remove trailing whitespace
* fix: return true in the switch case
2025-12-25 16:46:09 +08:00
Aadeshveer Singh
c54bba869d
ggml : optimize cuda cumsum fallback kernel ( #18343 )
2025-12-25 12:11:13 +08:00
Xuan-Son Nguyen
f5acfb2ffa
server: (router) add stop-timeout option ( #18350 )
...
* server: (router) add stop-timeout option
* also allow stop while loading
* add docs
* unload_lru: also wait for unload to complete
2025-12-24 23:47:49 +01:00
Xuan-Son Nguyen
4cbafad4f0
model: support MiMo-V2-Flash ( #18328 )
...
* mimov2: convert ok
* rename mimov2 --> mimo2
* fix conversion
* runnable but incorrect
* use sink
* add_sliding_window_pattern
* add swa and per-layer n_head_kv
* correct params
* somewhat working
* correct gating func
* nits
* mimo2: wire RMS eps + MoE bias + converter guards
* add co-author
Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>
* use add_rope_freq_base_swa
---------
Co-authored-by: Aaryan Kapoor <aaryankapoor2006@gmail.com>
Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>
2025-12-24 23:07:08 +01:00
Aadeshveer Singh
c184284230
fit-params : fix race condition in fit-params output ( #18276 )
2025-12-24 15:57:38 +01:00
Aman Gupta
c8a2417d7b
CUDA: experimental native mxfp4 support for blackwell ( #17906 )
...
* CUDA: experimental native mxfp4 support for blackwell
* optimize load_tiles
* optimize quantize_mxfp4
* cleanup
* first pass review: formatting
* use interleaved layout for mma
* mmq: add assert for size
* use __nv_fp4x4_e2m1
* use iter_k as 512, cleanup
* Use 1200 as blackwell instead of 1000
* address review comments
* mmq: fix stride
* quantize.cu: use reference impl of e8m0 scale
* address review comments
* add 120f-virtual + minor fixes
---------
Co-authored-by: Aman Gupta <aman>
2025-12-24 22:28:26 +08:00
Saba Fallah
54132f1b1f
model : support for LlamaBidirectionalModel architecture ( #18220 )
...
* model: llama-embed-nemotron
* minor: python lint
* changed arch-name
* templated llm_build_llama to be used for both llama and llama-embed arch
2025-12-24 14:02:36 +01:00
Jeff Bolz
2a9ea2020c
vulkan: fix command buffer corruption in ggml_backend_vk_event_wait ( #18302 )
2025-12-24 12:36:34 +01:00
Wang Weixuan
ce7a6dc0fc
CANN : refactor ACL graph cache ( #17752 )
...
Move the graph property checking code into methods of the LRU cache.
Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>
2025-12-24 17:50:24 +08:00
Jesse Ikonen
1ce0126b18
docs: Fix typos in SYCL documentation ( #18269 )
2025-12-24 17:19:47 +08:00
Ruben Ortlam
7f459c98e7
vulkan: use fewer FA rows for small cache runs ( #18280 )
2025-12-24 08:59:14 +01:00
TianHao324
cf2ffc02bc
CANN: Uses yarn_ramp cache in ROPE ( #17725 )
2025-12-24 14:55:33 +08:00
ddh0
10355dc7d0
common: add `LLAMA_ARG_OVERRIDE_TENSOR` env var for `-ot` arg ( #18267 )
2025-12-24 14:19:12 +08:00
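Like other `LLAMA_ARG_*` variables, this one mirrors its CLI flag, so the value presumably uses the same `<tensor-regex>=<buffer-type>` syntax as `-ot`. A hypothetical invocation (the pattern shown is illustrative, not taken from the PR):
```console
$ LLAMA_ARG_OVERRIDE_TENSOR="\.ffn_.*_exps\.=CPU" llama-server -m model.gguf
```
With a pattern like this, expert tensors stay in host memory without having to repeat the `-ot` flag on every command line.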
Xuan-Son Nguyen
5ee4e43f26
server: return_progress to also report 0% processing state ( #18305 )
2025-12-23 21:49:05 +01:00
Pascal
5b6c9bc0f3
webui: apply webui_settings on first load ( #18223 )
...
* webui: apply webui_settings on first load
The webui_settings from /props were not applied on initial load when
default_generation_settings.params was null.
Now the settings sync whenever serverProps is available, regardless of
params, and this works for both single-model and router modes.
* chore: update webui build output
2025-12-23 15:48:03 +01:00
Xuan-Son Nguyen
849d021104
server: fix crash with model not having BOS/EOS ( #18321 )
2025-12-23 14:39:36 +01:00
Daniel Bevenius
8e3ead6e4d
model-conversion : add device option to run-org-model.py ( #18318 )
...
* model-conversion : add device option to run-org-model.py
This commit refactors the `run-org-model.py` script to include a
`--device` argument that allows users to specify the device on which to
run the model (e.g., cpu, cuda, mps, auto).
It also extracts a few common functions in preparation for future changes
that will remove some code duplication that currently exists in the
embedding scripts.
The Makefile has also been updated to pass the device argument, for
example:
```console
(venv) $ make causal-verify-logits DEVICE=cpu
```
* fix error handling and remove parser reference
This commit fixes the error handling which previously referenced an
undefined 'parser' variable.
2025-12-23 14:07:25 +01:00
Chris Rohlf
12ee1763a6
rpc : add check for rpc buffer type ( #18242 )
2025-12-23 11:56:49 +02:00