llama.cpp

Commit Graph

Author	SHA1	Message	Date
Chenguang Li	3479efd112	CANN: Improve device ID handling and aclnnArange checks (#16752 ) * cann: improve device ID handling and aclnnArange checks - Stop relying on CANN's internal device ID retrieval; use a global variable instead. - Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions. * cann: use thread local var	2025-10-28 10:54:53 +08:00
Chenguang Li	7a50cf388a	CANN: format code using .clang-format (#15863 ) This commit applies .clang-format rules to all source files under the ggml-cann directory to ensure consistent coding style and readability. The .clang-format option `SortIncludes: false` has been set to disable automatic reordering of include directives. No functional changes are introduced. Co-authored-by: hipudding <huafengchun@gmail.com>	2025-10-16 16:41:11 +08:00
Chenguang Li	56fc38b965	CANN: fix CPU memory leak in CANN backend (#16549 ) This commit fixes a CPU-side memory leak issue in the CANN backend, which occurred when intermediate aclTensorList objects were not properly released after operator execution. The leak happened during repeated invocations of CANN ops (e.g., FlashAttention), leading to increasing host memory usage over time. Proper resource cleanup (aclDestroyTensorList and related release logic) has been added to ensure that all temporary tensors are correctly freed.	2025-10-13 17:01:24 +08:00
hipudding	f9bc66c3eb	CANN: Update several operators to support FP16 data format (#16251 ) Many Ascend operators internally use FP16 precision for computation. If input data is in FP32, it must first be cast to FP16 before computation, and then cast back to FP32 after computation, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less workload compared to FP32, leading to noticeable efficiency improvements. In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5b model shows correct accuracy and about 10% performance gain in concurrent scenarios. Co-authored-by: noemotiovon <757486878@qq.com>	2025-10-13 08:52:22 +08:00
Chenguang Li	aa4711d369	CANN: Improve ACL graph matching (#16166 ) * CANN: improve ACL graph matching Record `ne` and `nb` information for src tensors and include them in the graph matching check. This enhances the robustness of ACL graph matching by preventing incorrect matches when src tensors share the same data address but differ in shape or stride. * CANN: add op_params match	2025-10-09 15:50:25 +08:00
Jeff Bolz	c0b45097c3	rename optimize_graph to graph_optimize (#16082 )	2025-09-18 13:46:17 -05:00
Chenguang Li	62c3b645c5	CANN: Remove print (#16044 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-18 09:26:33 +08:00
Chenguang Li	d5fabe3682	CANN: Optimize ggml_cann_set_device (#15935 ) * CANN: Fix ggml_cann_set_device to avoid redundant device switches - Added a check to skip aclrtSetDevice if the current device is already set. - Prevents unnecessary context switches while keeping thread/device consistency. * CANN: add device default id	2025-09-17 14:33:08 +08:00
hipudding	c0389dba43	CANN: Disable acl_graph for prefill stage (#15933 ) Since the prefill length is not fixed, graphs constructed for the prefill stage cannot be reused. For this reason, ACL graph execution is disabled by default during prefill.	2025-09-11 15:59:37 +08:00
Chenguang Li	10d8b2b6b0	CANN: Add ROPE sin/cos cache for reuse (#15912 ) * CANN: Add ROPE sin/cos cache for reuse Introduce sin/cos caching mechanism in ROPE to avoid redundant computation across layers. The cache is built on the first layer per device and reused by subsequent layers if parameters match. - Added sin_cache / cos_cache pointers and position_length tracking - Introduced cache validity flags and properties: (ext_factor, theta_scale, freq_scale, attn_factor, is_neox) - Accelerates ROPE by eliminating repeated sin/cos generation This change reduces overhead in multi-layer scenarios while preserving correctness by verifying parameter consistency. Co-authored-by: hipudding <huafengchun@gmail.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-10 18:42:00 +08:00
Chenguang Li	28b5f190ef	CANN: implement LRU cache for ACL graphs (#15814 ) * CANN: implement LRU cache for ACL graphs in CANN backend - Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects. - Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded. - Updated push, move_to_front, and clear methods to manage cached graphs efficiently. - Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend. * fix typo * The LRU cache capacity can be configured via an env variable Signed-off-by: noemotiovon <757486878@qq.com> * refactor acl graph * refactor && fix review comments Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-10 15:29:12 +08:00
Jeff Bolz	e68aa10d8f	vulkan: sort graph to allow more parallel execution (#15850 ) * vulkan: sort graph to allow more parallel execution Add a backend proc to allow the backend to modify the graph. The vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes, doesn't change the contents of any of them. With #15489, this reduces the number of synchronizations needed. * call optimize_graph per-split	2025-09-09 02:10:07 +08:00
Chenguang Li	85ca66a746	CANN: Stream sync between devices for acl_graph (#15809 ) * CANN: Switch to stream synchronization Switch to stream synchronization because events are not effective. Co-authored-by: hipudding <huafengchun@gmail.com> * CANN: add Comments --------- Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-08 10:03:29 +08:00
Chenguang Li	c1c354e44c	CANN: Refactor ND to NZ workspace to be per-device (#15763 ) * CANN:Refactor ND to NZ workspace to be per-device in Ascend backend - Replaced the previous single global ND→NZ workspace with a per-device cache using unordered_map keyed by device ID. - Functions `release_nz_workspace`, `relloc_nz_workspace`, and `get_nz_workspace` now manage workspace independently for each device, preventing memory conflicts in multi-device / pipeline parallel scenarios. - This change fixes potential precision issues caused by workspace overwrites when multiple devices perform ND→NZ conversions concurrently. Co-authored-by: hipudding <huafengchun@gmail.com> * refactor Signed-off-by: noemotiovon <757486878@qq.com> * rename Signed-off-by: noemotiovon <757486878@qq.com> * fix review comments Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-04 20:20:14 +08:00
leejet	0a1b3982cd	ggml: add ops for WAN video model (cuda && cpu) (#15669 ) * add conv3d support * add ggml_pad_ext for cpu & cuda backend * cuda/cpu: add im2col_3d support * cuda: make im2col a little faster * fix cuda pad/scale/im2col3d * make im2col_3d faster * gguf: support loading tensors which n_dims > GGML_MAX_DIMS * fix cuda get_rows * avoid ggml_conv_3d conflict * correct GGML_OP_COUNT assertion * avoid build failure * avoid build failure on MacOS * cuda: remove unnecessary MIN define * fix cpu im2col_3d * adjust the code style * cuda: use simpler loop in get_rows * add test_im2col_3d to test-backend-ops * test-backend-ops.cpp: remove trailing whitespace * cpu: im2col_3d support non continuous src Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * fix test_im2col_3d * remove unused variables * cuda: get_rows: dfloat2 -> float2 * add test_pad_ext to test-backend-ops.cpp * add gguf_init_from_file_ext impl * Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS" This reverts commit `d8377a0a37`. * Revert "add gguf_init_from_file_ext impl" This reverts commit `d9f1d13208`. * update ggml_backend_vk_device_supports_op * fix ggml_backend_vk_device_supports_op * update other backend supports op for ggml_pad_ext * metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-04 10:38:49 +02:00
hipudding	5421f63ab0	CANN: Fix precision issue on 310I DUO multi-devices (#15784 )	2025-09-04 15:12:30 +08:00
Chenguang Li	239b60e898	CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (#15760 ) Fixes #15330 Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation. Co-authored-by: Yuchuan <yuchuan-cao@users.noreply.github.com>	2025-09-04 11:03:02 +08:00
hipudding	5eae934883	CANN: Add RoPE contiguous check for 310I DUO device (#15735 )	2025-09-03 16:46:01 +08:00
hipudding	f6da8cb86a	CANN: Mask unsupported TRANSPOSE_1D operator (#15733 ) CANN currently does not support kernels larger than 255. This change disables such cases.	2025-09-03 14:08:22 +08:00
Chenguang Li	8a2234ea0c	CANN: Fix type float_t to float (#15736 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-03 10:43:53 +08:00
hipudding	9961d244f2	CANN: Resolve soft_max precision issue (#15730 ) Previously, the slope tensor was set to fp16 to improve efficiency. While this worked correctly in FA, it caused precision issues in soft_max. This change applies different data types for different operators to balance both accuracy and performance.	2025-09-02 17:12:37 +08:00
Chenguang Li	2f853687b3	CANN: Support eager execution mode under ACL graph compilation (#15712 ) * [CANN] Support eager execution mode under ACL graph compilation Add support for running operators in eager mode while ACL graph compilation is enabled. This allows bypassing graph execution and directly submitting ops, which is useful for debugging and reducing graph build overhead in certain scenarios. Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> * rename to acl_graph_mode Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-02 14:07:48 +08:00
hipudding	ef2af57ddf	CANN: Support ext_factor in rope (#15710 )	2025-09-02 14:05:23 +08:00
hipudding	b9382c3877	CANN: Optimize MUL_MAT_ID (#15658 )	2025-09-01 08:57:23 +08:00
hipudding	3dc7397a27	CANN: fix RoPE cache issue on multi-device (#15629 ) * CANN: fix RoPE cache issue on multi-device RoPE cache only needs to be computed once per token. However, in multi-device scenarios, not every device starts computation from layer 0, which may lead to unallocated memory issues and precision errors. This commit records the first layer of each device to avoid the above issues. * CANN: Optimize first-layer detection method * CANN: Remove trailing whitespace * CANN: Only cache the data that can be determined as unchanged through the parameters. * CANN: Update function comment	2025-09-01 08:57:00 +08:00
Chenguang Li	ef476916bb	CANN: Fix compiler warnings (#15661 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-30 10:18:35 +08:00
Georgi Gerganov	8a4280ce43	kv-cache : remove LLAMA_SET_ROWS checks (#15505 ) ggml-ci	2025-08-28 12:27:02 +03:00
Chenguang Li	1e7489745a	CANN: refactor mask handling and improve performance in FA (#15561 ) * CANN(flash-attn): refactor mask handling and improve performance 1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode. 2. Optimized performance in non-alibi scenarios by reducing one repeat operation. 3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16. Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: fix review Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: Optimization FA BNSD to BSND Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-27 17:21:41 +08:00
Chenguang Li	c247d06f38	CANN: ROPE cache sin/cos repeat (#15501 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-25 10:32:21 +08:00
Chenguang Li	a0f98dd604	CANN: Optimize RMS_NORM using cache (#15419 ) * [CANN] Optimize RMS_NORM using cache Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> * fix review comment Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-22 14:12:07 +08:00
SHUAI YANG	a6d3cfe7fa	CANN: optimize rope operator (#15335 ) * optimize rope ops * amendment * delete trailing whitespace * change the variable name	2025-08-19 21:28:22 +08:00
Chenguang Li	bbd57b7eaf	CANN: GGML_OP_CPY optimization (#15070 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-12 16:12:13 +08:00
hipudding	be48528b06	CANN: Add broadcast for softmax and FA (#15208 ) * refactor softmax * fix fa * fix mask shape * format * add comments * Remove whitespace	2025-08-11 22:50:31 +08:00
Chenguang Li	2241453252	CANN: add support for ACL Graph (#15065 ) * feat(cann): add optional support for ACL Graph execution This commit adds support for executing ggml computational graphs using Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be enabled at compile time using the CMake option: -DUSE_CANN_GRAPH=ON By default, ACL graph execution is disabled, and the fallback path uses node-by-node execution. Key additions: - CMake option to toggle graph mode - Graph capture and execution logic using - Tensor property matching to determine whether graph update is required - Safe fallback and logging if the environment variable LLAMA_SET_ROWS is unset or invalid This prepares the backend for performance improvements in repetitive graph execution scenarios on Ascend devices. Signed-off-by: noemotiovon <757486878@qq.com> * Fix review comments Signed-off-by: noemotiovon <757486878@qq.com> * rename USE_CANN_GRAPH to USE_ACL_GRAPH Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-06 14:12:42 +08:00
Georgi Gerganov	fd1234cb46	llama : add gpt-oss (#15091 ) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-05 22:10:36 +03:00
diannao	2860d479b4	docker : add cann build pipeline (#14591 ) * docker: add cann build pipeline * docker: add cann build pipeline * docker: fix cann devops * cann : fix multi card hccl * Update ggml/src/ggml-cann/ggml-cann.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update ggml-cann.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-08-01 10:02:34 +08:00
hipudding	11490b3672	CANN: Improve loading efficiency after converting weights to NZ format. (#14985 ) * CANN: Improve loading efficiency after converting weights to NZ format. * CANN: fix typo	2025-07-31 19:47:20 +08:00
hipudding	204f2cf168	CANN: Add ggml_set_rows (#14943 )	2025-07-29 22:36:43 +08:00
hipudding	11dd5a44eb	CANN: Implement GLU ops (#14884 ) Implement REGLU, GEGLU, SWIGLU ops according to #14158	2025-07-26 17:56:18 +08:00
chen fan	14c28dfc50	CANN: weight format to NZ for Ascend310P3 (#14407 ) * weight format to nz for 310p * remove quant weight format to nz * clean code * fix * make the conditions for converting weights to NZ format consistent * clean code	2025-07-23 11:58:00 +08:00
Georgi Gerganov	05fec5bd29	ggml : add build-time message to remind about ggml_set_rows (#14661 ) ggml-ci	2025-07-13 10:36:33 +03:00
Xuan-Son Nguyen	98bab638fb	ggml : add ggml_scale_bias (#14417 ) * ggml : add ggml_scale_bias * ggml_vec_mad1_f32 * add more simd * add CUDA * sycl * vulkan * cann (placeholder) * opencl * will this fix cpu? * fix cuda * suggestions from coderabbit * fix cann compile error * vDSP_vsmsa * rm __ARM_FEATURE_SVE * use memcpy for op params * make code looks more consistent * use scalar for __ARM_FEATURE_SVE * add x param to ggml_vec_mad1_f32	2025-07-09 18:16:12 +02:00
luyhcsu	499a8f5a78	CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002 ) Co-authored-by: luyuhong <luyuhong@kylinos.cn>	2025-07-04 11:50:07 +08:00
Georgi Gerganov	a70c8a0c4b	kv-cache : use ggml_set_rows (#14285 ) * kv-cache : use ggml_set_rows ggml-ci * graph : separate k and v indices ggml-ci * cont : remove redundant ifs ggml-ci * kv-cache : improve find_slot impl * kv-cache : bounds-check when accessing slot_info indices * kv-cache : add comments ggml-ci * ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends ggml-ci	2025-07-03 10:53:35 +03:00
Georgi Gerganov	ec68e84c32	ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435 ) ggml-ci	2025-07-02 15:48:33 +03:00
Chenguang Li	343b6e94b6	CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411 ) * [CANN]update to aclnnGroupedMatmulV2 Signed-off-by: noemotiovon <757486878@qq.com> * Support MUL_MAT_ID on 310p Signed-off-by: noemotiovon <757486878@qq.com> * fix editorconfig Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-07-01 16:47:30 +08:00
Xinpeng Dou	b25e92774e	fix async_mode bug (#14432 )	2025-06-28 17:35:41 +08:00
Xinpeng Dou	e21d2d4ae2	CANN: Simplify the environment variable setting(#13104 ) * Simplify the environment variable setting to specify the memory pool type. * Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options. * update * fix CI * update * delete whitespace * fix according to review * update CANN.md * update CANN.md	2025-06-09 19:47:39 +08:00
leo-pony	1e8659e65a	CANN: Add SOC TYPE printing in cmake configuration (#13837 )	2025-05-28 11:54:20 +08:00
Bizhao Shi	2d38b6e400	CANN: Add the basic supports of Flash Attention kernel (#13627 ) * cann: add the basic FA support * cann: update the readme * cann: update the FlashAttention with PSEShift * cann: update the input parameters in FA * cann: update the alibi with max_bias * cann: add the constraints of softcap * cann: update the docs CANN.md * cann: update the docs CANN.md * cann: fix typo of CANN.md * cann: add some comments and update the CANN.md * cann: update the CANN.md * cann: update the inner precise for fusedInferAttention * cann: update the constraints of flash_attn_ext on ggml-cann.cpp * cann: clean the whitespace * cann: clean the whitespace * cann: add a new endline	2025-05-26 10:20:18 +08:00
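
Commit 28b5f190ef above describes an LRU cache for ACL graphs with push, move_to_front, and clear operations. The general technique might look roughly like the sketch below. This is not the backend's code: `graph_sketch`, `graph_lru_cache_sketch`, the string key, and the `find` method are hypothetical names introduced for illustration, and the real `ggml_cann_graph` holds ACL handles and tensor properties that are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a captured graph (assumption, not the real type).
struct graph_sketch {
    std::string key;  // assumed identifier used to recognise a cached graph
};

// Minimal LRU cache: the most recently used entry sits at the front.
class graph_lru_cache_sketch {
  public:
    explicit graph_lru_cache_sketch(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Insert a new graph at the front; when the capacity is exceeded,
    // evict the least recently used entry (the back of the list).
    void push(std::unique_ptr<graph_sketch> g) {
        graphs_.push_front(std::move(g));
        if (graphs_.size() > capacity_) {
            graphs_.pop_back();
        }
    }

    // Look up a graph by key; on a hit, move it to the front so it
    // becomes the most recently used entry. Returns nullptr on a miss.
    graph_sketch * find(const std::string & key) {
        for (auto it = graphs_.begin(); it != graphs_.end(); ++it) {
            if ((*it)->key == key) {
                graphs_.splice(graphs_.begin(), graphs_, it);
                return graphs_.front().get();
            }
        }
        return nullptr;
    }

    void clear() { graphs_.clear(); }

    std::size_t size() const { return graphs_.size(); }

  private:
    std::size_t capacity_;
    std::list<std::unique_ptr<graph_sketch>> graphs_;
};
```

A linear scan is used here because such a cache would typically hold only a handful of graphs; a larger cache would usually pair the list with a hash map for constant-time lookup.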


90 Commits
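
Commit e21d2d4ae2 in the log states that GGML_CANN_ASYNC_MODE accepts yes, enable, 1, or on, case-insensitively. A minimal sketch of that kind of check follows, assuming the value is read with `std::getenv`. The helper names `is_truthy_sketch` and `env_flag_enabled_sketch` are hypothetical and do not come from the backend.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: true when `value` is one of the accepted spellings
// (yes, enable, 1, on), compared case-insensitively.
bool is_truthy_sketch(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return value == "yes" || value == "enable" || value == "1" || value == "on";
}

// Hypothetical helper: reads an environment variable and treats an unset
// variable as disabled.
bool env_flag_enabled_sketch(const char * name) {
    const char * raw = std::getenv(name);
    return raw != nullptr && is_truthy_sketch(raw);
}
```

With this sketch, `env_flag_enabled_sketch("GGML_CANN_ASYNC_MODE")` would return true for values such as `ON` or `Yes` and false when the variable is unset or set to any other string.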