Xuan Son Nguyen
a67685e0e1
Merge commit 'refs/pull/14417/head' of github.com:ggerganov/llama.cpp into xsn/ggml_scale_bias
2025-07-09 14:13:30 +02:00
Xuan Son Nguyen
ebbad7796d
add x param to ggml_vec_mad1_f32
2025-07-09 14:11:53 +02:00
Xuan Son Nguyen
60b03ff968
ggml-ci
2025-07-09 12:18:49 +02:00
Xuan Son Nguyen
533016efa5
Merge commit 'refs/pull/14417/head' of github.com:ggerganov/llama.cpp into xsn/ggml_scale_bias
2025-07-09 12:18:36 +02:00
Xuan Son Nguyen
cd1703a3bc
use scalar for __ARM_FEATURE_SVE
2025-07-09 12:16:40 +02:00
Xuan Son Nguyen
34bacc8365
ggml-ci
2025-07-09 12:09:36 +02:00
Xuan Son Nguyen
4ea74b04e5
make code look more consistent
2025-07-09 12:07:05 +02:00
Xuan Son Nguyen
0d70ca81e8
use memcpy for op params
2025-07-09 12:05:34 +02:00
Xuan Son Nguyen
50c678f6da
rm __ARM_FEATURE_SVE
2025-07-09 11:56:48 +02:00
Xuan Son Nguyen
563aca0b56
vDSP_vsmsa
2025-07-09 11:55:56 +02:00
Xuan Son Nguyen
265cb43538
fix cann compile error
2025-07-09 11:52:58 +02:00
Xuan Son Nguyen
c8d89317c9
suggestions from coderabbit
2025-07-09 00:06:53 +02:00
Xuan Son Nguyen
b22708fd90
fix cuda
2025-07-09 00:00:44 +02:00
Xuan Son Nguyen
4d0195324e
will this fix cpu?
2025-07-09 00:00:31 +02:00
Xuan Son Nguyen
0e51a0a8b0
opencl
2025-07-08 23:36:47 +02:00
Xuan Son Nguyen
477a97ad87
cann (placeholder)
2025-07-08 23:34:15 +02:00
Xuan Son Nguyen
782b58fa06
vulkan
2025-07-08 23:31:04 +02:00
Xuan Son Nguyen
a28df6f00c
sycl
2025-07-08 23:27:32 +02:00
Xuan Son Nguyen
92a8738452
add CUDA
2025-07-08 23:26:21 +02:00
Xuan Son Nguyen
e427af75fb
add more simd
2025-07-08 23:19:16 +02:00
Xuan Son Nguyen
a5ccf168f1
ggml_vec_mad1_f32
2025-07-08 23:13:42 +02:00
Xuan Son Nguyen
7af3fd98a1
Merge branch 'master' into xsn/ggml_scale_bias
2025-07-08 23:02:15 +02:00
Jeff Bolz
6efcd65945
vulkan: optimize flash attention split_k_reduce ( #14554 )
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-08 20:11:42 +02:00
stevenkuang
699f4392a3
model : fix hunyuan moe chat template ( #14584 )
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
2025-07-08 18:29:29 +02:00
Xuan-Son Nguyen
08382869a2
model : add SmolLM3 ( #14581 )
* Init - first pass.
* Model -> ModelBase.
* fix errors in conversion.
* Update the graph.
* up.
* up.
* wip
* cgraph ok
* rm redundant code
---------
Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>
2025-07-08 18:07:01 +02:00
compilade
bb4f7a9e4e
memory : fix broken batch splits for recurrent cache ( #14575 )
Splits producing more than one ubatch per batch for recurrent models
were broken by #14512.
This fixes it by moving the completeness check after the ubatch split loop.
2025-07-08 18:37:47 +03:00
Jeff Bolz
b8eeb8741d
vulkan : fix rope with partial rotation and non-cont src ( #14582 )
2025-07-08 15:21:21 +02:00
Alawode Oluwandabira
17a1f0d2d4
server: Add ability to mount server at prefix ( #14544 )
* Add server_prefix
* Correct server path env
* Rename cli flag to --api-prefix
* Change all to api_prefix
2025-07-08 11:47:33 +03:00
Xuan-Son Nguyen
8f22dc0a53
model : add hunyuan moe ( #14425 )
* model : add hunyuan moe
* tokenizer ok
* fix tensor name
* cgraph init
* chat template
* wip
* almost working
* skip embed, fix bos
* cleanup
* yarn scaling
* cleanup
* correct rope type
* failed token fix
* ntk alpha freq_base
* tokenization working
* cleanup and pr changes
* vocab_size sanity check
* ntk alpha generic
* Update convert_hf_to_gguf.py
* Apply suggestions from code review
* fix regression
* fix style
---------
Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
2025-07-08 11:24:06 +03:00
Jeff Bolz
53903ae6fa
vulkan: increase timeout for CI ( #14574 )
2025-07-08 09:38:31 +02:00
Georgi Gerganov
4d0dcd4a06
cuda : fix rope with partial rotation and non-cont src ( #14580 )
* cuda : fix rope non-cont
ggml-ci
* cont : fix multi-rope + add test
ggml-ci
* sycl : try fix
ggml-ci
* cont : fix sycl + clean-up cuda
ggml-ci
2025-07-08 10:15:21 +03:00
Aman Gupta
75c91de6e9
CUDA: add bilinear interpolation for upscale ( #14563 )
2025-07-08 10:11:18 +08:00
R0CKSTAR
68155c66f0
musa: fix build warnings (unused variable) ( #14561 )
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-08 07:58:30 +08:00
Sigbjørn Skjæret
e1a7059053
llama : fix incorrect minicpm3 v_states shape ( #14571 )
2025-07-07 23:35:35 +02:00
Sigbjørn Skjæret
12f55c302b
llama : remove ggml_cont where possible ( #14568 )
2025-07-07 21:35:08 +02:00
Aman Gupta
b9c3eefde1
CUDA: add bf16 and i32 to getrows ( #14529 )
2025-07-07 21:45:43 +08:00
Eve
6491d6e4f1
vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) ( #14485 )
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260
Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
2025-07-06 12:29:36 +02:00
Jeff Bolz
e592be1575
vulkan: fix rms_norm+mul fusion ( #14545 )
The fused operation was grabbing the epsilon value from the wrong place.
Add an env var to disable fusion.
Add some missing checks for supported shapes/types.
Handle fused rms_norm+mul in check_results.
2025-07-06 10:08:16 +02:00
Jeff Bolz
a0374a67e2
vulkan: Handle updated FA dim2/3 definition ( #14518 )
* vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
2025-07-05 09:26:04 +02:00
Sigbjørn Skjæret
ddef99522d
server : fix assistant prefilling when content is an array ( #14360 )
2025-07-05 09:17:14 +02:00
Sigbjørn Skjæret
6681688146
opencl: add GELU_ERF ( #14476 )
2025-07-04 23:24:56 -07:00
Georgi Gerganov
bac8bed248
eval-callback : check for empty input ( #14539 )
2025-07-05 07:18:09 +03:00
R0CKSTAR
b81510a7b7
test-backend-ops: add support for specifying output format ( #14368 )
* test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* refactor
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* remove visitor nonsense
* remove visitor comment
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
---------
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2025-07-05 12:10:53 +08:00
Georgi Gerganov
ef797db357
metal : disable fast math in all quantize kernels ( #14528 )
ggml-ci
2025-07-04 19:19:09 +03:00
Georgi Gerganov
67d1ef23c6
batch : add optional for sequential equal split ( #14511 )
ggml-ci
2025-07-04 09:08:59 +03:00
Georgi Gerganov
7b50f7c025
graph : prepare for 4D mask ( #14515 )
ggml-ci
2025-07-04 09:05:36 +03:00
Georgi Gerganov
c79184d2d1
batch : add n_used count ( #14512 )
ggml-ci
2025-07-04 09:04:59 +03:00
luyhcsu
499a8f5a78
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator ( #14002 )
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
2025-07-04 11:50:07 +08:00
Sigbjørn Skjæret
28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops ( #14445 )
2025-07-03 23:07:22 +02:00
lhez
bee28421be
opencl : broadcast for soft_max ( #14510 )
2025-07-03 20:22:24 +02:00