llama.cpp

Commit Graph

Author	SHA1	Message	Date
compilade	a57d1bcb3c	cuda : support Falcon-H1 state size for SSM_SCAN (#14602 )	2025-07-09 23:54:38 -04:00
Xuan-Son Nguyen	98bab638fb	ggml : add ggml_scale_bias (#14417 ) * ggml : add ggml_scale_bias * ggml_vec_mad1_f32 * add more simd * add CUDA * sycl * vulkan * cann (placeholder) * opencl * will this fix cpu? * fix cuda * suggestions from coderabbit * fix cann compile error * vDSP_vsmsa * rm __ARM_FEATURE_SVE * use memcpy for op params * make code look more consistent * use scalar for __ARM_FEATURE_SVE * add x param to ggml_vec_mad1_f32	2025-07-09 18:16:12 +02:00
Miaoqian Lin	26a48ad699	ggml : prevent integer overflow in gguf tensor size calculation (#14595 )	2025-07-09 14:33:53 +02:00
Jeff Bolz	6efcd65945	vulkan: optimize flash attention split_k_reduce (#14554 ) * vulkan: allow FA split_k with smaller KV values * vulkan: spread split_k_reduce work across more threads k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).	2025-07-08 20:11:42 +02:00
Jeff Bolz	b8eeb8741d	vulkan : fix rope with partial rotation and non-cont src (#14582 )	2025-07-08 15:21:21 +02:00
Georgi Gerganov	4d0dcd4a06	cuda : fix rope with partial rotation and non-cont src (#14580 ) * cuda : fix rope non-cont ggml-ci * cont : fix multi-rope + add test ggml-ci * sycl : try fix ggml-ci * cont : fix sycl + clean-up cuda ggml-ci	2025-07-08 10:15:21 +03:00
Aman Gupta	75c91de6e9	CUDA: add bilinear interpolation for upscale (#14563 )	2025-07-08 10:11:18 +08:00
R0CKSTAR	68155c66f0	musa: fix build warnings (unused variable) (#14561 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-08 07:58:30 +08:00
Aman Gupta	b9c3eefde1	CUDA: add bf16 and i32 to getrows (#14529 )	2025-07-07 21:45:43 +08:00
Eve	6491d6e4f1	vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485 ) Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260 Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>	2025-07-06 12:29:36 +02:00
Jeff Bolz	e592be1575	vulkan: fix rms_norm+mul fusion (#14545 ) The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.	2025-07-06 10:08:16 +02:00
Jeff Bolz	a0374a67e2	vulkan: Handle updated FA dim2/3 definition (#14518 ) * vulkan: Handle updated FA dim2/3 definition Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit. * handle null mask for gqa * allow gqa with dim3>1	2025-07-05 09:26:04 +02:00
Sigbjørn Skjæret	6681688146	opencl: add GELU_ERF (#14476 )	2025-07-04 23:24:56 -07:00
Georgi Gerganov	ef797db357	metal : disable fast math in all quantize kernels (#14528 ) ggml-ci	2025-07-04 19:19:09 +03:00
luyhcsu	499a8f5a78	CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002 ) Co-authored-by: luyuhong <luyuhong@kylinos.cn>	2025-07-04 11:50:07 +08:00
Sigbjørn Skjæret	28657a8229	ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445 )	2025-07-03 23:07:22 +02:00
lhez	bee28421be	opencl : broadcast for soft_max (#14510 )	2025-07-03 20:22:24 +02:00
Jeff Bolz	2b72bedec1	vulkan: support mixed/deepseekR1 FA head sizes (#14509 ) * vulkan: better parameterize FA by head sizes * vulkan: support mixed/deepseekR1 FA head sizes	2025-07-03 20:21:14 +02:00
Johannes Gäßler	c8c4495b8d	ggml: backward pass for split swiglu (#14483 )	2025-07-03 17:05:18 +02:00
Nicolò Scipione	7b63a71a6b	Fix conditional enabling following arch checks for ggml-sycl (#14504 ) Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-07-03 11:00:03 +02:00
Georgi Gerganov	a70c8a0c4b	kv-cache : use ggml_set_rows (#14285 ) * kv-cache : use ggml_set_rows ggml-ci * graph : separate k and v indices ggml-ci * cont : remove redundant ifs ggml-ci * kv-cache : improve find_slot impl * kv-cache : bounds-check when accessing slot_info indices * kv-cache : add comments ggml-ci * ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends ggml-ci	2025-07-03 10:53:35 +03:00
Georgi Gerganov	9067487c44	ggml : fix FA mask dim 2 and 3 (#14505 ) * ggml : fix FA mask dim 2 and 3 ggml-ci * backends : unsupport batched FA in CUDA and Vulkan ggml-ci * vulkan : disable FA for mask->ne[2] != 1	2025-07-03 10:46:57 +03:00
Georgi Gerganov	d4cdd9c1c3	ggml : remove kompute backend (#14501 ) ggml-ci	2025-07-03 07:48:32 +03:00
Aman Gupta	55c2646b45	CUDA: add dynamic shared mem to softmax, refactor general usage (#14497 )	2025-07-03 07:45:11 +08:00
compilade	5d46babdc2	llama : initial Mamba-2 support (#9126 ) * llama : initial Mamba-2 support * ggml : SIMD ggml_ssm_scan for Mamba-2 * ggml : improve ggml_mul speed when masking recurrent states * llama : support running Mamba-Codestral-7B-v0.1 * llama : fix Mamba-2 conv state saving * ggml : make the ggml_mul fast broadcast path more consistently formatted * llama : remove unused variable * llama : add missing break * convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly. * llama : avoid redundant state copy for Mamba 1 and 2 * metal : attempt to adapt SSM_SCAN for Mamba-2 * metal : fix SSM_SCAN pipeline scope * metal : use log and exp instead of log1pf and expf in SSM_SCAN * metal : remove unused arguments for SSM_SCAN The max index is 31, so trimming the arguments is necessary. * metal : add back n_seqs to SSM_SCAN args Whoops, this is needed for the offset in the concatenated output. * metal : fix SSM_SCAN state head offset * metal : fix wrong number of tokens per sequence in SSM_SCAN * ggml : remove unused fast broadcast path in GGML_MUL This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity. * ggml : avoid multiply by D in GGML_OP_SSM_SCAN This makes the weight buft detection in src/llama.cpp simpler. * convert : transpose Mamba-2 A, D and reshape SSM_NORM This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks * convert : fix flake8 lint * metal : fix confusion between ; and , * metal : add missing args for nb references in ssm_scan_f32_group * metal : single-user mamba2 inference works * kv-cache : remove const_cast when setting inputs for s_copy And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy. * convert : avoid AutoConfig for Mamba and Mamba2 hparams * kv-cache : allow context shift for recurrent models * graph : fix recurrent state copies when avoiding copies Works, but using lambda functions might not be that clean. * ggml : fix mamba2 ssm scan when compiled with SVE * ggml-cpu : reorder SVE FMA for consistency with other SIMD arches * cuda : implement ssm scan for Mamba2 There is still room for improvement, but it works! * cuda : adapt Mamba1 ssm scan to shape changes from Mamba2 * mamba : fix mismatched new and delete size for llm_build_mamba Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled -DGGML_SANITIZE_ADDRESS=ON * cuda : graceful fallback for Mamba-1 models with weird embd size	2025-07-02 13:10:24 -04:00
Daniel Bevenius	c46944aa25	ggml : add version function to get lib version (ggml/1286) * ggml : add version function to get lib version This commit adds a function `ggml_version()` to the ggml library that returns the version of the library as a string. The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used. Usage: ```c printf("GGML version: %s\n", ggml_version()); ``` Output: ```console GGML version: 0.0.2219 ``` * ggml : add ggml_commit() --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-02 20:08:45 +03:00
Aman Gupta	55a1c5a5fd	CUDA: add softmax broadcast (#14475 ) * CUDA: add softmax broadcast * Pass by const ref * Review: Use blockDims for indexing, remove designated initializers * Add TODO for noncontiguous input/output	2025-07-02 15:48:33 +03:00
Johannes Gäßler	12a81af45f	CUDA: broadcasting for FlashAttention mask (#14500 )	2025-07-02 15:48:33 +03:00
Jeff Bolz	8875523eb3	vulkan: support softmax/FA batch and broadcast (#14449 )	2025-07-02 15:48:33 +03:00
Georgi Gerganov	ec68e84c32	ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435 ) ggml-ci	2025-07-02 15:48:33 +03:00
zhouwg	307e79d33d	opencl : fix possible buffer overflow in dump_tensor (#14490 )	2025-07-02 14:38:10 +02:00
Eric Zhang	c8a4e470f6	opencl : skip empty nodes on cgraph compute (#14491 )	2025-07-02 13:00:04 +02:00
lhez	603e43dc91	opencl : update upscale to support align corners (#14488 )	2025-07-02 09:07:42 +02:00
Björn Ganster	68b3cd6514	ggml : Callback before abort (#14481 ) * Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed. * Return previous callback to allow callback chaining * style fixes --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-07-02 08:19:31 +03:00
Georgi Gerganov	de56944147	ci : disable fast-math for Metal GHA CI (#14478 ) * ci : disable fast-math for Metal GHA CI ggml-ci * cont : remove -g flag ggml-ci	2025-07-01 18:04:08 +03:00
Chenguang Li	343b6e94b6	CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411 ) * [CANN]update to aclnnGroupedMatmulV2 Signed-off-by: noemotiovon <757486878@qq.com> * Support MUL_MAT_ID on 310p Signed-off-by: noemotiovon <757486878@qq.com> * fix editorconfig Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-07-01 16:47:30 +08:00
Jeff Bolz	6a746cf9c4	vulkan: Split large mul_mat_id to fit in shared memory (#14451 )	2025-07-01 10:43:08 +02:00
Sigbjørn Skjæret	eff5e45443	add GELU_ERF (#14455 )	2025-07-01 10:14:21 +02:00
Georgi Gerganov	a6a47958a1	ggml : remove trailing whitespace (#0 )	2025-07-01 11:06:39 +03:00
Acly	431b2c24f3	ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285) * add "align corners" mode for bilinear upscale, and allow downscaling * add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag * test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners	2025-07-01 11:06:39 +03:00
Daniel Bevenius	497be7c01d	ggml-quants : rename best_mad to best_error (ggml/1283) This commit renames the variable `best_mad` to `best_error` in the `make_qkx2_quants` function. The motivation for this is that the name `best_mad` can be somewhat confusing if mean absolute deviation (MAD) is not in use.	2025-07-01 11:06:39 +03:00
lhez	79b33b2317	opencl : add GEGLU, REGLU, SWIGLU (#14456 )	2025-07-01 09:19:16 +02:00
Aman Gupta	0a5a3b5cdf	Add Conv2d for CPU (#14388 ) * Conv2D: Add CPU version * Half decent * Tiled approach for F32 * remove file * Fix tests * Support F16 operations * add assert about size * Review: further formatting fixes, add assert and use CPU version of fp32->fp16	2025-06-30 23:57:04 +08:00
Georgi Gerganov	5dd942de59	metal : disable fast-math for some cpy kernels (#14460 ) * metal : disable fast-math for some cpy kernels ggml-ci * cont : disable for q4_1 ggml-ci * cont : disable for iq4_nl ggml-ci	2025-06-30 17:04:05 +03:00
Romain Biessy	a7417f5594	ggml-cpu: sycl: Re-enable exp f16 (#14462 )	2025-06-30 14:52:02 +02:00
xiaobing318	c839a2da1a	cmake : Remove redundant include path in CMakeLists.txt (#14452 ) * Update docker.yml Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed * Remove redundant include path in CMakeLists.txt The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths. * Enable scheduled Docker image builds Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.	2025-06-30 12:48:24 +03:00
Akarshan Biswas	f47c1d7106	SYCL: disable faulty fp16 exp kernel (#14395 ) * SYCL: disable faulty fp16 CPU exponent for now * Revert "SYCL: disable faulty fp16 CPU exponent for now" This reverts commit `ed0aab1ec3`. * SYCL: disable faulty fp16 CPU exponent for now * Fix logic of disabling exponent kernel	2025-06-29 21:07:58 +05:30
Sigbjørn Skjæret	a5d1fb6212	ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443 )	2025-06-29 14:38:10 +02:00
Sigbjørn Skjæret	a0535ffa0d	ggml : implement REGLU/GEGLU/SWIGLU ops (#14158 ) * implement unary REGLU/GEGLU/SWIGLU cpu ops * relax constraints * duplicate shape of source * fix ggml_vec_geglu_f16 * special case gated ops * implement unary REGLU/GEGLU/SWIGLU cuda ops * tighten constraints again * refactor into GGML_GLU_OP * metal : add glu kernels ggml-ci * add CUDA_GLU_BLOCK_SIZE [no ci] * more constraints and use 64bit ints ggml-ci * 64bit multiplication [no ci] * implement swapped variants (cpu/cuda) * update comment [no ci] ggml-ci * Vulkan: Add GLU ops and shaders * SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate * ggml : implement GLU for split up/gate (#14181) * implement GLU for split up/gate * add tests for ggml_glu_split * Vulkan: Implement glu_split logic and shader support * add split to logging [no ci] * SYCL: refactor element_size ops and add split up and gate support to gated kernels * SYCL: switch GEGLU to use tanh approximation --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> * GGML: increase OP count in assertion * Refactor: Optimize SYCL element-wise operations with unary function inlining This commit refactors the SYCL element-wise operations to improve performance by: - Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead. - Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic. - Replacing direct kernel calls with calls to these inlined functions. - Using `__dpct_inline__` to encourage compiler inlining. - Minor code cleanup and consistency improvements. The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices. 
* vulkan: Increase workgroup size for GLU, for performance (#14345) * vulkan: Increase workgroup size for GLU, for performance * vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup * merge fix * metal : add support for split and swap ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-06-29 11:04:10 +02:00
Jeff Bolz	bd9c981d72	vulkan: Add fusion support for RMS_NORM+MUL (#14366 ) * vulkan: Add fusion support for RMS_NORM+MUL - Add a use_count to ggml_tensor, so we can detect if an output is used more than once. - Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor. - Add detection logic and basic fusion logic in ggml-vulkan. - Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test. * extract some common fusion logic * fix -Winconsistent-missing-override * move ggml_can_fuse to a common function * build fix * C and C++ versions of can_fuse * move use count to the graph to avoid data races and double increments when used in multiple threads * use hash table lookup to find node index * change use_counts to be indexed by hash table slot * minimize hash lookups style fixes * last node doesn't need single use. fix type. handle mul operands being swapped. * remove redundant parameter --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-06-29 09:43:36 +02:00
Aman Gupta	27208bf657	CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361 ) * CUDA: add bf16 and f32 support to cublas_mul_mat_batched * Review: add type traits and make function more generic * Review: make check more explicit, add back comments, and fix formatting * Review: fix formatting, remove useless type conversion, fix naming for bools	2025-06-29 01:30:53 +08:00
Jeff Bolz	63a7bb3c7e	vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378 )	2025-06-28 17:36:40 +02:00
Jeff Bolz	00d5282c7f	vulkan: lock accesses of pinned_memory vector (#14333 )	2025-06-28 17:17:09 +02:00
Xinpeng Dou	b25e92774e	fix async_mode bug (#14432 )	2025-06-28 17:35:41 +08:00
Jeff Bolz	ceb1bf5a34	vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427 ) This setting needs to be passed through to vulkan-shaders-gen	2025-06-27 22:35:30 -05:00
Radoslav Gerganov	8d94219a4a	ggml : add ggml_set_rows (#14274 ) * ggml : add ggml_set_rows Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'. ref: #8366 * use I64 for indices * ggml : add repeat impl for i64 * ggml : add ggml_is_contiguous_rows * ggml : ggml_set_rows support broadcast * ggml : ggml_set_rows support quantized dst ggml-ci * ggml : support GGML_TYPE_F32 ".from_float" trait * ggml : ggml_set_rows update comment + better index name * tests : add ggml_set_rows * metal : add ggml_set_rows implementation ggml-ci * ggml : simplify forward_dup_f32 * ggml : fix supports_op * tests : add comment to set_rows * ggml : leave the repeat_i64 for a separate PR ggml-ci * ggml : set_rows use std::min instead of MIN * ggml : better error message for set_rows unsupported type * metal : perform op->type check only once * tests : more consistent implementation + more tests ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-27 16:41:40 +03:00
bandoti	a01047b041	cmake: regen vulkan shaders when shaders-gen sources change (#14398 ) * Add shaders-gen sources as target deps	2025-06-26 13:46:53 -03:00
Georgi Gerganov	e8215dbb96	metal : add special-case mat-vec mul for ne00 == 4 (#14385 ) ggml-ci	2025-06-26 15:51:19 +03:00
Georgi Gerganov	5783ae4359	metal : batch rows copy in a single threadgroup (#14384 ) * metal : batch rows copy in a single threadgroup ggml-ci * metal : handle some edge cases when threadgroup size is not a power of 2 ggml-ci	2025-06-26 15:50:15 +03:00
R0CKSTAR	716301d1b0	musa: enable fp16 mma (all) and cublas on qy2 (#13842 ) * musa: enable fp16 mma (all) and cublas on qy2 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-06-26 12:11:59 +08:00
Aaron Teo	60ef23d6c1	ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317 ) * ggml-cpu: add nnpa compile flag Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `4a9f60c201`) * ggml-cpu: add fp16->fp32 nnpa first Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `8d4a7987f9`) * ggml-cpu: add fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `0ff0d65162`) * ggml-cpu: better variable names Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `2f58bbcbb8`) * docs: update s390x docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `01b929491b`) * ggml-cpu: add debugging prints to see if dlf16 is correct Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix print vs printf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix float placeholder Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: ensure fp16 and fp32 load and stores are called Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fp16 load ensured to hit Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove sigint from fp16 store for some reason, the function is not getting a hit when debugged with gdb. 
we will need to investigate further Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: nnpa switch to vec_xst test Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to vec_xst for 4 element loops also Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rework noop Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove noop, general code cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: clarify variable naming Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add breakpoint for debugging Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: test fix for conversion failure Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: disable fp32->fp16 nnpa conversions for now there are some conversion failures in nnpa that requires the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to elif macro Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: reattempt fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix typo Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: reattempt fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix compiler types Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: change to typedef vector types Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add 4 element loops for fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: clarified vector naming Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back fp32->fp16 store nnpa Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add nnpa macro check in ggml-impl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add missing __func__ Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: diagnose why __NNPA__ macro is not being defined Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: import vecintrin.h to fix compiler errors Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update macro tests Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move s390x typedef to own header file Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: move s390x typedef to own header file" This reverts commit `157f856c34`. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to importing ggml-cpu-impl instead Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix macro declaration Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: test more macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add debug prints Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bruteforce macro definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move macro definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add ggml-impl.h to cmakelists Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to private macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move s390x typedef to own header file Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `157f856c34`) * ggml-cpu: move things around Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back compile macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to quotes for import Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add compiler error macro Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add s390x detection in ggml-src Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back compile definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: undo cmakelists work Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: move s390x typedef to own header file" This reverts commit `18d79e1a30`. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove typedefs.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove typedef from cmakelists Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add ggml-impl.h future notes Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add todo comment for future reference Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: clarify naming of dlf16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove unnecessary target compile definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: update broken huggingface link for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix duplicate func names during compile Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: fix duplicate func names during compile" This reverts commit `fbb733451f`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu" This reverts commit `bd288e8fa5`. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: refactor fp16<->fp32 simd to ggml-cpu Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix missing simd-mappings.h import in quants.c Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix missing simd-mappings.h within repack Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix amx mmq missing simd-mappings.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: attempt at fixing loongarch failing build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move nnpa together with other fp16<->fp32 simd Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix wrong refactor of ggml-base ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: remove dependency on ggml-cpu from ggml-base Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove mistaken fallback macro fallback logic was already implemented but i was too sleepy to realise Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move ggml_table_f32_f16 to ggml-cpu ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures" This reverts commit `32a3533564`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml: move ggml_table_f32_f16 to ggml-cpu" This reverts commit `9e40d984ad`. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move ggml_table_f32_f16 to ggml-cpu ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `9e40d984ad`) * ggml: move ggml_table_f32_f16 to ggml-cpu.c Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: extern c ggml_table_f32_f16 + chore docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h we rely on the variable declaration in ggml-cpu.c instead Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h" This reverts commit `f71b21d2f7`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back ggml_table_f32_f16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: bring back ggml_table_f32_f16" This reverts commit `2dce119178`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * fix ggml time initialization * fix f32_f16 table init * remove extra line --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: slaren <slarengh@gmail.com>	2025-06-25 23:49:04 +02:00
Sigbjørn Skjæret	b193d53069	ggml : do not output unprintable characters on GGUF load failure (#14381 )	2025-06-25 23:26:51 +02:00
Anton Mitkov	2bf9d539dd	sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973 )	2025-06-25 18:09:55 +02:00
lhez	73e53dc834	opencl: ref count `ggml_backend_opencl_context` and refactor profiling (#14254 ) * Move profiling info into `ggml_backend_opencl_context` * Add `enqueue_ndrange_kernel` to launch kernel	2025-06-24 11:46:25 -07:00
uvos	0142961a2e	CUDA/HIP: optimize mmv paths taken for HIP devices (#14324 ) Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-06-24 01:12:56 +02:00
Johannes Gäßler	defe2158dd	CUDA: mul_mat_v support for batch sizes > 1 (#14262 ) * CUDA: mul_mat_v support for batch sizes > 1 * use 64 bit math for initial offset calculation	2025-06-23 13:11:31 +02:00
uvos	af3373f1ad	HIP: enable vec fattn on RDNA4 (#14323 )	2025-06-22 16:51:23 +02:00
Aman Gupta	aa064b2eb7	CUDA: add mean operation (#14313 ) * CUDA: add mean operation * add back sum_rows_f32_cuda * Review: early exit if col!=0	2025-06-22 12:39:54 +08:00
Markus Tavenrath	bb16041cae	Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792 ) * Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled. * remove #ifdef for debug utils and add queue marker.	2025-06-21 08:17:12 +02:00
Georgi Gerganov	67ae5312e2	metal : fix thread-safety (#14300 ) ggml-ci	2025-06-21 08:04:18 +03:00
Acly	b7147673f2	Add `ggml_roll` (ggml/1274) * ggml : add ggml_roll * use set/get_op_params & std::min	2025-06-20 21:02:47 +03:00
Aman Gupta	c959f462a0	CUDA: add conv_2d_transpose (#14287 ) * CUDA: add conv_2d_transpose * remove direct include of cuda_fp16 * Review: add brackets for readability, remove ggml_set_param and add asserts	2025-06-20 22:48:24 +08:00
Nicolò Scipione	8308f98c7f	sycl: add usage of enqueue_functions extension (#14244 ) * Add header and namespace to use enqueue_functions extension * Convert submit and parallel_for to use new extension in convert.cpp * Convert submit and parallel_for to use extension in ggml-sycl.cpp * Convert submit and parallel_for to use extension in gla.cpp * Convert submit and parallel_for in mmq.cpp * Convert submit and parallel_for in mmvq.cpp * Convert submit and parallel_for in remaining files * Convert all simple parallel_for to nd_launch from enqueue_functions extension * Wrapping extension in general function Create a general function that enables the enqueue_functions extension if it is enabled in the compiler, otherwise call the general SYCL function to launch kernels. --------- Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-06-20 15:07:21 +02:00
Christian Kastner	6369be0735	Implement GGML_CPU_ALL_VARIANTS for PowerPC (#14286 ) * Add PowerPC feature detection and scoring * ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC * ggml-cpu: Delay some initializations until function is called When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU. --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-06-20 14:17:32 +02:00
Diego Devesa	e28c1b93fd	cuda : synchronize graph capture and cublas handle destruction (#14288 ) Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread	2025-06-20 13:57:36 +02:00
Georgi Gerganov	d27b3ca175	ggml : fix repack work size for mul_mat_id (#14292 ) ggml-ci	2025-06-20 11:19:15 +03:00
Charles Xu	9230dbe2c7	ggml: Update KleidiAI to v1.9.0 (#14277 )	2025-06-20 10:51:01 +03:00
Aman Gupta	9eaa51e7f0	CUDA: add conv_2d_dw (#14265 ) * CUDA: add conv_2d_dw * better naming * simplify using template * Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const	2025-06-20 09:50:24 +08:00
Diego Devesa	8f71d0f3e8	ggml-cpu : remove unnecessary arm feature detection (#14281 ) Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.	2025-06-19 21:24:14 +02:00
fanyang	456af35eb7	build : suppress gcc15 compile warnings (#14261 ) * Change _contains_any() substrs to std::string_view and fix the find comparison logic.	2025-06-19 14:49:48 +02:00
Anton Mitkov	600e3e9b50	sycl: Cleanup codepaths in Get Rows in sycl backend (#14215 ) Addresses unused reorder path	2025-06-19 11:40:21 +01:00
Aaron Teo	faed5a5f5d	llamafile : support s390x SIMD instruction set (#14273 )	2025-06-19 11:48:54 +02:00
0cc4m	10bb545c5b	Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249 )	2025-06-19 09:15:42 +02:00
Georgi Gerganov	ed3290ab34	metal : add mean kernel (#14267 ) * metal : add mean kernel ggml-ci * cont : dedup implementation ggml-ci	2025-06-19 08:05:21 +03:00
Aaron Teo	50d2227953	ggml-cpu: reduce asm calls for hsum (#14037 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-06-18 18:10:08 +01:00
Aaron Teo	6231c5cd6d	ggml-cpu: fix uncaught underscore terminators (#14023 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-06-18 18:06:49 +01:00
Charles Xu	ef035803eb	ggml: Add Apple support for GGML_CPU_ALL_VARIANTS (#14258 )	2025-06-18 12:40:07 +01:00
Daniel Bevenius	dd8e59f443	ggml : disable warnings for tests when using MSVC (ggml/1273) * ggml : disable warnings for tests when using MSVC This commit disables warnings for tests on windows when using MSVC. The motivation for this is that this brings the build output more inline with what Linux/MacOS systems produce. There is still one warning generated for the tests which is: ```console Building Custom Rule C:/ggml/tests/CMakeLists.txt cl : command line warning D9025: overriding '/DNDEBUG' with '/UNDEBUG' [C:\ggml\build\tests\test-arange.vcxproj] test-arange.cpp test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe ``` * ggml : fix typo in tests disable list	2025-06-18 09:59:21 +03:00
Daniel Bevenius	bbe98d2784	ggml : remove unused ggml_context_container (ggml/1272) This commit removes the unused `ggml_context_container` structure from the ggml library. It looks like the usage of this struct was removed in Commit 4757fe18d56ec11bf9c07feaca6e9d5b5357e7f4 ("ggml : alloc ggml_contexts on the heap (whisper/2525)"). The motivation for this change is to improve code clarity/readability.	2025-06-18 09:59:21 +03:00
Daniel Bevenius	c2056ed6d4	examples : include examples in msvc disable warn (ggml/1270) This commit adds the examples to the "list" of targets for which MSVC warnings are ignored. The motivation for this is that currently the examples generate a number of warnings that are ignored/disabled for the core ggml project. This makes for a cleaner output when building.	2025-06-18 09:59:21 +03:00
bandoti	c46503014d	cmake: remove shader-gen step-targets from ggml-vulkan (#14226 ) * Remove step-targets from vulkan-shaders-gen * Unset DESTDIR when building vulkan-shaders-gen	2025-06-17 22:33:25 +02:00
xctan	860a9e4eef	ggml-cpu : remove the weak alias trick (#14221 )	2025-06-17 12:58:32 +03:00
R0CKSTAR	fe9d60e74a	musa: fix build warning (unused variable) (#14231 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-06-17 17:48:08 +08:00
Diego Devesa	6adc3c3ebc	llama : add thread safety test (#14035 ) * llama : add thread safety test * llamafile : remove global state * llama : better LLAMA_SPLIT_MODE_NONE logic when main_gpu < 0 GPU devices are not used --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-16 08:11:43 -07:00
bandoti	0dbcabde8c	cmake: clean up external project logic for vulkan-shaders-gen (#14179 ) * Remove install step for vulkan-shaders-gen * Add install step to normalize msvc with make * Regenerate modified shaders at build-time	2025-06-16 10:32:13 -03:00
uvos	7d6d91babf	HIP: disable rocwmma on gfx12 by default until rocm 7.0 (#14202 )	2025-06-16 13:47:38 +02:00
Charles Xu	3ba0d843c6	ggml: Add Android support for GGML_CPU_ALL_VARIANTS (#14206 )	2025-06-16 11:47:57 +02:00
Jeff Bolz	c89c2d1ab9	vulkan: mutex around vkQueueSubmit (#14127 ) This fixes the remaining crash in test-thread-safety on my system.	2025-06-16 08:21:08 +02:00
xctan	3555b3004b	ggml-cpu : rework weak alias on apple targets (#14146 ) * ggml-cpu : rework weak alias on apple targets * fix powerpc detection * fix ppc detection * fix powerpc detection on darwin	2025-06-16 13:54:15 +08:00
uvos	e54b394082	CUDA/HIP: fix ssm_scan on devices where warp size is not 32 (#14196 )	2025-06-15 17:30:13 +02:00

1082 Commits