Commit Graph

2056 Commits

Author SHA1 Message Date
Ruben Ortlam 32d504cd94 fix editorconfig 2026-02-14 13:02:32 +01:00
Ruben Ortlam 02ccf81496 tmpsh size fix 2026-02-14 11:43:31 +01:00
Ruben Ortlam 0b4b0d2e57 device tuning 2026-02-14 11:16:20 +01:00
Ruben Ortlam dd92b1f8d5 fix regressions 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9a8743c4 add Intel shader core count lookup-table 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ae5466aaf Use wave32 on AMD RDNA for scalar FA 2026-02-14 11:16:20 +01:00
Ruben Ortlam 16cb912442 Bc 4 for scalar FA is not a valid configuration 2026-02-14 11:16:20 +01:00
Ruben Ortlam cd54ba2b86 fixes 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3946eb657f fix rebase issues 2026-02-14 11:16:20 +01:00
Ruben Ortlam 28a3c0b859 fix shmem support function 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ed9183ac9 use minimal subgroup size on Intel 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9b701ff5 relax flash attention split_k condition to allow non-gqa use 2026-02-14 11:16:17 +01:00
Ruben Ortlam d6a004547f use smaller scalar row size for smaller row counts 2026-02-14 07:05:36 +01:00
Ruben Ortlam de6db3fed6 use float_type for dequantize4 functions 2026-02-14 07:05:36 +01:00
Ruben Ortlam 356f18c444 use vectorized stores 2026-02-14 07:05:36 +01:00
Ruben Ortlam 4819fd3014 dynamic subgroups for intel 2026-02-14 07:05:16 +01:00
Ruben Ortlam b626e3296d also stage V through shmem when this is done for K 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8fbd3575e0 default to Bc 32 2026-02-14 07:05:16 +01:00
Ruben Ortlam d8d536cf98 only stage through shmem on Nvidia 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8236c453a5 stage V loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam b7b67f8742 stage K loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam 50a420e044 fuse lf accumulation, pf and v accumulation into a loop 2026-02-14 07:05:16 +01:00
Ruben Ortlam ca5ec63cfb cache q values into registers for KQ 2026-02-14 07:05:16 +01:00
Ruben Ortlam 3c2088121c add padding to mask shmem buffer 2026-02-14 07:05:15 +01:00
Ruben Ortlam 07afb5128f fixes 2026-02-14 07:04:32 +01:00
Ruben Ortlam e3bba64e82 add medium rows FA shader Br size 2026-02-14 07:03:07 +01:00
Ruben Ortlam c0f419351c optimize masksh use 2026-02-14 07:03:06 +01:00
Ruben Ortlam 9b309bbc51 fix amd workgroup size issue 2026-02-14 06:57:22 +01:00
Ruben Ortlam f92d7eddab use f32 scalar FA if f16 is not supported by device 2026-02-14 06:57:22 +01:00
Ruben Ortlam 828b7e9bb1 use row_split when Br >= 4, change reductions to use shared memory if row_split == 1 2026-02-14 06:57:22 +01:00
Ruben Ortlam e7a758fb66 split rows inside of subgroups for faster synchronization 2026-02-14 06:57:22 +01:00
Ruben Ortlam 015d7bcd66 vulkan: allow using fp16 in coopmat1 flash attention shader 2026-02-14 06:57:21 +01:00
Jeff Bolz dbb023336b
vulkan: support L2_NORM with contiguous rows (#19604) 2026-02-14 06:42:04 +01:00
Jeff Bolz 53aef25a88
vulkan: support GGML_OP_SET (#19584) 2026-02-14 06:36:38 +01:00
Sophon 2dec548094
vulkan: Add vendor id for Qualcomm drivers (#19569)
This commit allows the Qualcomm native Vulkan driver to be used on Windows
instead of Mesa Dozen.
2026-02-14 06:29:17 +01:00
Max Krasnyansky 0ccbfdef3e
hexagon: further optimizations and refactoring for flash attention (#19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-13 16:27:30 -08:00
Jeff Bolz 05a6f0e894
vulkan: restore -inf check in FA shaders (#19582) 2026-02-13 13:35:29 -06:00
Alberto Cabrera Pérez cc2aa81513
Fix wrong memcpy length for block_interleave == 4 (#19575) 2026-02-13 20:32:14 +08:00
ymcki 0e21991472
fix vulkan ggml_acc only works in 3d but not 4d (#19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-13 13:31:37 +01:00
Aman Gupta 5065da554e
CUDA: loop over ne2*ne3 in case it overflows (#19538)
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
2026-02-13 17:01:40 +05:30
Oliver Simons 43919b7f4f
CUDA: Do not mutate cgraph for fused ADDs (#19566)
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
   step as we store the properties before modifying the graph in-place
   in the cuda-backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-02-13 15:07:55 +05:30
Georgi Gerganov 0644baefde
metal : improve concurrency (#19555) 2026-02-13 07:35:57 +02:00
Georgi Gerganov 490eb96b88
metal : support GGML_OP_SET (#19548) 2026-02-13 07:34:52 +02:00
Shupei Fan 3bb78133ab
hexagon: fix typo in vtcm_needs_release (#19545) 2026-02-12 15:07:49 -08:00
lhez 79cc0f2daf
opencl: add basic support for q4_1 (#19534)
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
2026-02-12 14:52:37 -08:00
Georgi Gerganov 3b3a948134
metal : update sum_rows kernel to support float4 (#19524) 2026-02-12 11:35:28 +02:00
Mario Limonciello 6845f7f87f
Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).

The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
 2549 |             using ARegsT = typename Impl::ARegsT;
```

Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2].  When this is actually
fixed upstream some guards can be used to detect and wrap the
version that has the fix to only apply when necessary.

Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2026-02-12 09:38:35 +01:00
Max Krasnyansky b1ff83bbb0
hexagon: further optimization and tuning of matmul and dot kernels (#19407)
* ggml-hexagon: implement 2x2 matmul kernel

* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4

* hexagon: fix editor config failures

* hexagon: refactor matmul ops to use context struct and remove wrappers

Also implement vec_dot_f16 2x2

* hexagon: refactor dyn quantizers to use mmctx

* hexagon: remove mm fastdiv from op_ctx

* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-02-11 23:04:27 -08:00
lhez 4d3daf80f8
opencl: add general Q6_K mm and Q4_K mv (#19347)
* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace
2026-02-11 10:33:13 -08:00
Georgi Gerganov 914dde72ba
ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511)
* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU
2026-02-11 18:58:43 +02:00