llama.cpp

Commit Graph

Author	SHA1	Message	Date
Ilia Ilmer	9ad4f1931e	metal : add `CONV_TRANSPOSE_2D` (#16542 ) * initial: headers and metal-device.cpp updates * adding conv_transpose_2d * fix type * fix type: int32->int64 * Update ggml/src/ggml-metal/ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-metal/ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-metal/ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add checks for src[0] and src[1]; add type checks * Update ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add more tests, add optimization to threading * add dynamic memory allocation in metal --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-17 09:33:58 +03:00
Sam/Samuel	f4ce81c45e	metal: optimise `GGML_OP_SUM` (#16559 ) * optimise GGML_OP_SUM * add non-contiguous tests by permuting the input * change tests to require full contiguity of OP_SUM * cuda : add check GGML_OP_SUM --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-15 17:05:56 +03:00
Georgi Gerganov	fa882fd2b1	metal : avoid using Metal's gpuAddress property (#16576 ) * metal : avoid using Metal's gpuAddress property * metal : fix rope kernels buffer check	2025-10-14 20:33:05 +03:00
Georgi Gerganov	e60f241eac	metal : FA support F32 K and V and head size = 32 (#16531 ) * metal : FA support F32 K and V and head size = 32 * graph : remove obsolete comment [no ci]	2025-10-13 23:07:57 +03:00
Sam/Samuel	3f750f8d76	metal: add support for opt_step_sgd (#16539 ) * metal: add support for opt_step_sgd * add newline to pass EditorConfig check	2025-10-13 11:25:02 +03:00
Sam/Samuel	a31cf36ad9	metal : add opt_step_adamw and op_sum (#16529 ) * scaffold to support opt step adamw on metal (not written so far) * add opt-step-adamw kernel for metal * pass op->src[4] as a separate buffer to the pipeline * add bounds check to opt-step-adamw kernel * complete scaffold for GGML_OP_SUM * naive GGML_OP_SUM kernel * remove unwanted comment * change OP_SUM capability gate * Add has_simdgroup_reduction to both ops to pass CI	2025-10-12 21:43:14 +03:00
Georgi Gerganov	a3cb04744f	metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494 )	2025-10-11 16:54:10 +03:00
Georgi Gerganov	b2c08c9ec4	metal : mark FA blocks (#16372 ) * metal : better unroll in the FA kernels * metal : index FA blocks * tests : restore [no ci] * metal : prevent division by zero in FA kernels * metal : fix -INF detection logic	2025-10-08 10:57:53 +03:00
Georgi Gerganov	0a319bb75e	metal : add support for non-padded FA KV (#16148 ) * metal : pad K, V and Mask when needed * cont : simplify * cuda : add TODO about KV padding requirement * metal : add comments * metal : remove mask padding requirement	2025-10-07 08:23:30 +03:00
Georgi Gerganov	8ae32dc9ec	metal : various optimizations + refactoring (#16446 ) * metal : ssm_scan minor opts * metal : get_rows optimize * metal : cpy optimize * metal : ssm_conv opt * metal : ssm_scan simplify * metal : ssm_Scan opt	2025-10-07 08:21:40 +03:00
Georgi Gerganov	606a73f531	metal : fix loop bound in ggml_mem_ranges (#16412 )	2025-10-03 19:18:56 +03:00
Georgi Gerganov	35fb82497e	metal : dynamic simdgroups for MV kernels (#16340 ) * metal : dynamic simdgroups for MV kernels * cont : minor	2025-09-30 11:03:23 +03:00
Sigbjørn Skjæret	adc76347d7	ggml : check cuda and metal argsort limits and add test (#16323 ) * check cuda argsort limits and add test * add metal check	2025-09-29 11:09:00 +02:00
Georgi Gerganov	6a2c6145a0	metal : extend mat-mat multiplication support (#16225 ) * metal : support mul_mm with src1->type == GGML_TYPE_F16 * metal : support mul_mm_id with src1->type == GGML_TYPE_F16 [no ci] * metal : mul_mm support ne00 % 32 != 0 * metal : support mul_mm_id with ne00 % 32 != 0 * cont : remove unnecessary unrolls * cont : simplify data loading * metal : optimize mul_mm when output bounds checks are not needed	2025-09-28 09:34:44 +03:00
Georgi Gerganov	3b53634fe3	metal : fuse non-sequential nodes (#16102 ) * metal : fuse non-sequential nodes * cont : add comment * cont : simplify bounds checks	2025-09-28 09:34:05 +03:00
Georgi Gerganov	54dbc37053	metal : report OOM errors (#16274 )	2025-09-26 14:14:28 +03:00
Georgi Gerganov	dfcd53f7ec	metal : fuse NORM + MUL + ADD, support non-multiples of 4 (#16220 ) * metal : fuse NORM + MUL + ADD * metal : support norms of non-multiple of 4 * cont : fix comment [no ci]	2025-09-25 11:30:16 +03:00
Georgi Gerganov	4ea00794b8	metal : relax reorder conditions (#16216 )	2025-09-25 11:29:42 +03:00
Georgi Gerganov	02a6a82ae7	metal : restore im2col perf (#16219 )	2025-09-25 11:29:08 +03:00
Sigbjørn Skjæret	3ecb2f671a	ggml : implement set_rows with i32 index (#16159 ) * implement set_rows with i32 index * template fix * test quantized path warnings-- * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * forgotten name change * deduplicate cuda/sycl and test-fix * indent++ * vulkan: support set_rows with i32 index type (#16162) * disable i32 index for webgpu for now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-22 19:13:00 +02:00
Georgi Gerganov	a71ae3ba7a	ggml : add ggml_op_is_empty (#16122 ) * ggml : add ggml_op_is_empty * ggml : move to ggml-impl.h	2025-09-22 11:12:09 +03:00
Jeff Bolz	c0b45097c3	rename optimize_graph to graph_optimize (#16082 )	2025-09-18 13:46:17 -05:00
Georgi Gerganov	703f9e32c4	metal : use function constants for mul_mv_ext kernels (#16074 ) * metal : use function constants for mul_mv_ext kernels ggml-ci * metal : remove NW template argument ggml-ci * metal : adjust constants ggml-ci	2025-09-18 16:28:41 +03:00
Georgi Gerganov	b213fce89b	metal : improve F32, F16 and BF16 mat-vec multiplication (#16057 ) * metal : improve F32, F16 and BF16 mat-vec multiplication ggml-ci * metal : make the NSG a function constant in mul_mv kernels ggml-ci	2025-09-18 12:33:45 +03:00
Jhen-Jie Hong	e00f3fd8ff	metal : avoid call free for non-owned buffer (#16067 )	2025-09-18 10:06:48 +03:00
Georgi Gerganov	f2f28380ea	metal : handle nil cv during pipeline creation (#16065 ) ggml-ci	2025-09-18 10:03:24 +03:00
Georgi Gerganov	0320ac5264	metal : refactor + optimize v2 (#15995 ) * metal : improve naming * metal : refactor device ggml-ci * cont : props ggml-ci * metal : apply ggml_mem_ranges_t ggml-ci * metal : remove GGML_METAL_USE_BF16 ggml-ci * metal : refactor device buffer ggml-ci * cont : fix naming * metal : sync before destroying the backend ggml-ci * metal : refactor context ggml-ci * metal : migrate ggml-metal.m to ggml-metal.cpp ggml-ci * metal : adjust ops API ggml-ci * metal : use C++ to store piplienes ggml-ci * metal : migrate ops to separate functions ggml-ci * metal : add ggml_metal_library_t ggml-ci * metal : improve naming ggml-ci * metal : cleanp ggml-ci * metal : add support for GGML_OP_LOG ggml-ci * metal : fix error handling ggml-ci	2025-09-17 20:38:12 +03:00
Daniel Bevenius	3913f8730e	ggml : fix padding in timestep embedding kernels (#15932 ) * ggml : remove adding extra dim timestep embedding This commit updates the ggml_timestep_embedding function to no longer add an extra dimension when the specified dimension is odd. The motivation for this change is that this introduces an unnecessary dimension when the dimension is odd, which caused an issue in the kernels which were not expecting this extra dimension and it resulted in uninitialized memory for the second to last dimension. * ggml-cuda : fix padding in timestep embedding kernel This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension. * ggml-metal : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel * ggml-opencl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-sycl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-vulkan : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-cpu : fix padding in timestep embedding function This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension.	2025-09-16 15:25:57 +02:00
Georgi Gerganov	9dcd200d57	metal : remove memory pools (#15966 ) * metal : remove mem pool usage ggml-ci * metal : remove mem pool implementation ggml-ci * metal : take into account the actual allocated memory of the tensor ggml-ci * cont : use ggml_backend_buft_get_alloc_size ggml-ci * cont : improve, comments ggml-ci * cont : add functions for the extra tensor sizes * metal : add comments ggml-ci * metal : implement .get_alloc_size for the rest of the buffer types ggml-ci * metal : remove ggml_metal_heap ggml-ci	2025-09-14 22:02:32 +03:00
Georgi Gerganov	a14bd35014	metal : fix kernel requirements (#15983 ) * metal : fix kernel requirements ggml-ci * cont : fix supports_op * cont : fix supports_op for ARGMAX	2025-09-14 15:33:22 +03:00
Georgi Gerganov	55758b00ca	metal : refactor kernel loading (#15964 ) * metal : refactor bin kernels loading ggml-ci * metal : refactor rms kernel loading ggml-ci * ci : try to add memory leaks check ggml-ci * ci : try to enable memory leak detection for Mac * cont : seems to be working	2025-09-13 16:24:22 +03:00
Georgi Gerganov	f161463a54	metal : allow ops to run concurrently (#15929 ) * metal : run graphs ops concurrently ggml-ci * cont : add flags for debugging and disabling concurrency ggml-ci * cont : refactor and handle fusing ggml-ci * cont : simplify - no need to use GPU address ggml-ci * cont : prepare mem ranges for reuse + add ggml-metal-common.cpp ggml-ci * cont : avoid redundant keywords in cpp [no ci] * metal : reorder graph for better concurrency ggml-ci * metal : fix race on mem pool buffers ggml-ci * cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE ggml-ci * cont : refactor, optimize, add comments ggml-ci * cont : refactor ggml-metal.m ggml-ci * minor : update logs [no ci]	2025-09-13 13:54:28 +03:00
Georgi Gerganov	84d7b2fca1	metal : fix memory leaks (#15962 ) ggml-ci	2025-09-13 12:45:04 +03:00
Georgi Gerganov	0f0a3c2851	metal : make the backend async (#15906 ) * metal : make the backend async ggml-ci * cont : add comments, extend op offload, clean up ggml-ci * metal : fix batch size for MUL_MAT_ID * metal : remove deprecated ggml_backend_metal_buffer_from_ptr * metal : create only metal buffers, no wrapping of host memory ggml-ci * metal : restore .alloc_buffer for buffer_from_ptr_type ggml-ci * metal : remove broken implementation of GGML_OP_SET ggml-ci * metal : clean-up loose ends, ready for tests ggml-ci * metal : support both private and shared buffers ggml-ci * metal : enable private buffers + add global device queue * metal : disable host buffer to prevent races ggml-ci * metal : avoid extra copy during set_tensor ggml-ci * metal : use separate buffer types for shread and private Metal buffers ggml-ci * metal : simplify synchronization logic ggml-ci * metal : fix build ggml-ci * metal : do not implement cpy_tensor ggml-ci * metal : separate implementations for shared and private buffers ggml-ci	2025-09-10 17:52:35 +03:00
Jeff Bolz	e68aa10d8f	vulkan: sort graph to allow more parallel execution (#15850 ) * vulkan: sort graph to allow more parallel execution Add a backend proc to allow the backend to modify the graph. The vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes, doesn't change the contents of any of them. With #15489, this reduces the number of synchronizations needed. * call optimize_graph per-split	2025-09-09 02:10:07 +08:00
Georgi Gerganov	f28d4f4ac9	metal : refactor + optimize (#15857 ) * metal : refactor ggml-ci * cont : refactor FA-vec kernel * cont : print metal library load time * minor : warn to debug + bettern kernel names ggml-ci * metal : optimize mul_mv q8_0 ggml-ci * metal : simplify FA pipeline creation functions ggml-ci * metal : improve naming consistency * metal : safer function constants offsets ggml-ci * metal : comments ggml-ci	2025-09-08 13:34:56 +03:00
Xuan-Son Nguyen	9fcb29f22f	ggml: allow casting between f32 and i32 (#15783 ) * ggml: allow casting between f32 and i32 * fix cuda * add vulkan * fix CPU non-cont * add non-cont test case * add note * extend test number range * correct note * add cont version for vulkan	2025-09-08 12:33:01 +02:00
Gabe Goodhart	856ed0947f	metal : Add template specialization for mul_mm_id w/ ne20 == 10 (#15799 ) Branch: GGMLMetalNE20 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-09-04 18:53:22 +03:00
leejet	0a1b3982cd	ggml: add ops for WAN video model (cuda && cpu) (#15669 ) * add conv3d support * add ggml_pad_ext for cpu & cuda backend * cuda/cpu: add im2col_3d support * cuda: make im2col a little faster * fix cuda pad/scale/im2col3d * make im2col_3d faster * gguf: support loading tensors which n_dims > GGML_MAX_DIMS * fix cuda get_rows * avoid ggml_conv_3d conflict * correct GGML_OP_COUNT assertion * avoid build failure * avoid build failure on MacOS * cuda: remove unnecessary MIN define * fix cpu im2col_3d * adjust the code style * cuda: use simpler loop in get_rows * add test_im2col_3d to test-backend-ops * test-backend-ops.cpp: remove trailing whitespace * cpu: im2col_3d support non continuous src Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * fix test_im2col_3d * remove unused variables * cuda: get_rows: dfloat2 -> float2 * add test_pad_ext to test-backend-ops.cpp * add gguf_init_from_file_ext impl * Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS" This reverts commit `d8377a0a37`. * Revert "add gguf_init_from_file_ext impl" This reverts commit `d9f1d13208`. * update ggml_backend_vk_device_supports_op * fix ggml_backend_vk_device_supports_op * update other backend supports op for ggml_pad_ext * metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-04 10:38:49 +02:00
Georgi Gerganov	4efd5a8316	metal : fix checks for available FA kernels (#15700 ) * metal : fix checks for available FA kernels ggml-ci * cont : fix comment [no ci]	2025-08-31 19:43:30 +03:00
compilade	73804145ab	ggml : fix SSM_SCAN for n_groups > 1 (#15625 )	2025-08-28 10:11:36 -04:00
Georgi Gerganov	b3964c1e89	metal : optimize FA vec for large sequences and BS <= 8 (#15566 ) * metal : optmize FA vec for large heads and sequences * metal : adjust small-batch mul mv kernels ggml-ci * batched-bench : fix total speed computation ggml-ci * cont : add comments ggml-ci	2025-08-26 14:22:14 +03:00
Georgi Gerganov	1d8d83deaa	metal : improve `MUL_MAT_ID` (#15541 ) * metal : mul_mm_id remove hdst * metal : remove mul_mm_id hsrc1 * metal : mul_mm_id simplify + add test * metal : opt mul_mm_id map0 * metal : optimize mul_mm_id id gathering * metal : mul/div opt * metal : optimize mul_mm_id_map0 ggml-ci	2025-08-26 12:46:15 +03:00
Sigbjørn Skjæret	0fd90db585	metal : remove contiguous assertion for src0 in IM2COL (#15577 ) * remove contiguous assertion for src0 in IM2COL * add contiguous check in supports_op	2025-08-26 09:51:43 +03:00
Ihar Hrachyshka	111f8d06f0	metal: fix regression when no metal devices are present (#15531 )	2025-08-25 18:27:34 +03:00
Georgi Gerganov	b0ba31f525	metal : add FA kernels for HS=40 (#15559 ) ggml-ci	2025-08-25 10:14:48 +03:00
Xuan-Son Nguyen	945e1f12a6	ggml : fix condition of im2col on Metal backend (#15460 )	2025-08-21 08:32:26 +03:00
Georgi Gerganov	fd1234cb46	llama : add gpt-oss (#15091 ) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-05 22:10:36 +03:00
Gabe Goodhart	793c0d7f46	metal: SSM_SCAN performance (#14743 ) * feat: Add s_off as a parameter in the args struct This may not be necessary, but it more closely mirrors the CUDA kernel Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state This is a first attempt at optimizing the metal kernel. The changes here are: - Launch the kernel with a thread group of size d_state - Use simd groups and shared memory to do the summation for the y computation When tested with G4 tiny preview, this shows roughly a 3x speedup on prefill and 15% speedup on decode. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update logic to correctly do the multi-layer parallel sum Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Correctly size the shared memory bufer and assert expected size relationships Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Compute block offsets once rather than once per token Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Use local variable for state recursion Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Use a secondary simd_sum instead of a for loop Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add assertion and comment about relationship between simd size and num simd groups Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parallelize of d_state for mamba-1 Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parallel sum in SSM_CONV Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * Revert "feat: Parallel sum in SSM_CONV" After discussion with @compilade, the size of the parallelism here is not worth the cost in complexity or overhead of the parallel for. https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357 This reverts commit `16bc059660`. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Simplify shared memory sizing Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-25 10:47:39 -06:00
Georgi Gerganov	065908cb09	metal : fix fusion across different encoders (#14849 ) * metal : fix fusion across different encoders ggml-ci * cont : add assertion ggml-ci	2025-07-24 10:24:05 +03:00

123 Commits