Commit Graph

5872 Commits

Author SHA1 Message Date
Xuan Son Nguyen a67685e0e1 Merge commit 'refs/pull/14417/head' of github.com:ggerganov/llama.cpp into xsn/ggml_scale_bias 2025-07-09 14:13:30 +02:00
Xuan Son Nguyen ebbad7796d add x param to ggml_vec_mad1_f32 2025-07-09 14:11:53 +02:00
Xuan Son Nguyen 60b03ff968 ggml-ci 2025-07-09 12:18:49 +02:00
Xuan Son Nguyen 533016efa5 Merge commit 'refs/pull/14417/head' of github.com:ggerganov/llama.cpp into xsn/ggml_scale_bias 2025-07-09 12:18:36 +02:00
Xuan Son Nguyen cd1703a3bc use scalar for __ARM_FEATURE_SVE 2025-07-09 12:16:40 +02:00
Xuan Son Nguyen 34bacc8365 ggml-ci 2025-07-09 12:09:36 +02:00
Xuan Son Nguyen 4ea74b04e5 make code looks more consistent 2025-07-09 12:07:05 +02:00
Xuan Son Nguyen 0d70ca81e8 use memcpy for op params 2025-07-09 12:05:34 +02:00
Xuan Son Nguyen 50c678f6da rm __ARM_FEATURE_SVE 2025-07-09 11:56:48 +02:00
Xuan Son Nguyen 563aca0b56 vDSP_vsmsa 2025-07-09 11:55:56 +02:00
Xuan Son Nguyen 265cb43538 fix cann compile error 2025-07-09 11:52:58 +02:00
Xuan Son Nguyen c8d89317c9 suggestions from coderabbit 2025-07-09 00:06:53 +02:00
Xuan Son Nguyen b22708fd90 fix cuda 2025-07-09 00:00:44 +02:00
Xuan Son Nguyen 4d0195324e will this fix cpu? 2025-07-09 00:00:31 +02:00
Xuan Son Nguyen 0e51a0a8b0 opencl 2025-07-08 23:36:47 +02:00
Xuan Son Nguyen 477a97ad87 cann (placeholder) 2025-07-08 23:34:15 +02:00
Xuan Son Nguyen 782b58fa06 vulkan 2025-07-08 23:31:04 +02:00
Xuan Son Nguyen a28df6f00c sycl 2025-07-08 23:27:32 +02:00
Xuan Son Nguyen 92a8738452 add CUDA 2025-07-08 23:26:21 +02:00
Xuan Son Nguyen e427af75fb add more simd 2025-07-08 23:19:16 +02:00
Xuan Son Nguyen a5ccf168f1 ggml_vec_mad1_f32 2025-07-08 23:13:42 +02:00
Xuan Son Nguyen 7af3fd98a1 Merge branch 'master' into xsn/ggml_scale_bias 2025-07-08 23:02:15 +02:00
Jeff Bolz 6efcd65945
vulkan: optimize flash attention split_k_reduce (#14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-08 20:11:42 +02:00
stevenkuang 699f4392a3
model : fix hunyuan moe chat template (#14584)
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
2025-07-08 18:29:29 +02:00
Xuan-Son Nguyen 08382869a2
model : add SmolLM3 (#14581)
* Init - first pass.

* Model -> ModelBase.

* fix errors in conversion.

* Update the graph.

* up.

* up.

* wip

* cgraph ok

* rm redundant code

---------

Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>
2025-07-08 18:07:01 +02:00
compilade bb4f7a9e4e
memory : fix broken batch splits for recurrent cache (#14575)
Splits producing more than one ubatch per batch for recurrent models
were broken with #14512.

This fixes it by moving the completeness check after the ubatch split loop.
2025-07-08 18:37:47 +03:00
Jeff Bolz b8eeb8741d
vulkan : fix rope with partial rotation and non-cont src (#14582) 2025-07-08 15:21:21 +02:00
Alawode Oluwandabira 17a1f0d2d4
server: Add ability to mount server at prefix (#14544)
* Add server_prefix

* Correct server path env

* Rename cli flag to --api-prefix

* Change all to api_prefix
2025-07-08 11:47:33 +03:00
Xuan-Son Nguyen 8f22dc0a53
model : add hunyuan moe (#14425)
* model : add hunyuan moe

* tokenizer ok

* fix tensor name

* cgraph init

* chat template

* wip

* almost working

* skip embed, fix bos

* cleanup

* yarn scaling

* cleanup

* correct rope type

* failed token fix

* ntk alpha freq_base

* tokenization working

* cleanup and pr changes

* vocab_size sanity check

* ntk alpha generic

* Update convert_hf_to_gguf.py

* Apply suggestions from code review

* fix regression

* fix style

---------

Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
2025-07-08 11:24:06 +03:00
Jeff Bolz 53903ae6fa
vulkan: increase timeout for CI (#14574) 2025-07-08 09:38:31 +02:00
Georgi Gerganov 4d0dcd4a06
cuda : fix rope with partial rotation and non-cont src (#14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-08 10:15:21 +03:00
Aman Gupta 75c91de6e9
CUDA: add bilinear interpolation for upscale (#14563) 2025-07-08 10:11:18 +08:00
R0CKSTAR 68155c66f0
musa: fix build warnings (unused variable) (#14561)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-08 07:58:30 +08:00
Sigbjørn Skjæret e1a7059053
llama : fix incorrect minicpm3 v_states shape (#14571) 2025-07-07 23:35:35 +02:00
Sigbjørn Skjæret 12f55c302b
llama : remove ggml_cont where possible (#14568) 2025-07-07 21:35:08 +02:00
Aman Gupta b9c3eefde1
CUDA: add bf16 and i32 to getrows (#14529) 2025-07-07 21:45:43 +08:00
Eve 6491d6e4f1
vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
2025-07-06 12:29:36 +02:00
Jeff Bolz e592be1575
vulkan: fix rms_norm+mul fusion (#14545)
The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.
2025-07-06 10:08:16 +02:00
Jeff Bolz a0374a67e2
vulkan: Handle updated FA dim2/3 definition (#14518)
* vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.

* handle null mask for gqa

* allow gqa with dim3>1
2025-07-05 09:26:04 +02:00
Sigbjørn Skjæret ddef99522d
server : fix assistant prefilling when content is an array (#14360) 2025-07-05 09:17:14 +02:00
Sigbjørn Skjæret 6681688146
opencl: add GELU_ERF (#14476) 2025-07-04 23:24:56 -07:00
Georgi Gerganov bac8bed248
eval-callback : check for empty input (#14539) 2025-07-05 07:18:09 +03:00
R0CKSTAR b81510a7b7
test-backend-ops: add support for specifying output format (#14368)
* test-backend-ops: add support for specifying output format

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Add build_commit and build_number in test_result

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* refactor

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Get build commit from ggml_commit()

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Merge errors into test_operation_info && address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* remove visitor nonsense

* remove visitor comment

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

---------

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2025-07-05 12:10:53 +08:00
Georgi Gerganov ef797db357
metal : disable fast math in all quantize kernels (#14528)
ggml-ci
2025-07-04 19:19:09 +03:00
Georgi Gerganov 67d1ef23c6
batch : add optional for sequential equal split (#14511)
ggml-ci
2025-07-04 09:08:59 +03:00
Georgi Gerganov 7b50f7c025
graph : prepare for 4D mask (#14515)
ggml-ci
2025-07-04 09:05:36 +03:00
Georgi Gerganov c79184d2d1
batch : add n_used count (#14512)
ggml-ci
2025-07-04 09:04:59 +03:00
luyhcsu 499a8f5a78
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
2025-07-04 11:50:07 +08:00
Sigbjørn Skjæret 28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445) 2025-07-03 23:07:22 +02:00
lhez bee28421be
opencl : broadcast for soft_max (#14510) 2025-07-03 20:22:24 +02:00