Commit Graph

2070 Commits

Author SHA1 Message Date
Ruben Ortlam 856d20988b
Merge 32d504cd94 into 2ba9adc093 2026-02-16 14:50:19 +01:00
Mario Limonciello 2ba9adc093
Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (#19591)
Avoids issues with ROCm 6.4.4.

Closes: https://github.com/ggml-org/llama.cpp/issues/19580
Fixes: 6845f7f87 ("Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)")

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2026-02-16 14:46:08 +01:00
abhijain1204fujitsu 267ba5a1d9
ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (#19132)
* Updated repack.cpp

* Updated repack.cpp

* Updated repack.cpp

* Added if condition to support only vector length 256.

* Changed the format, removed comments and a duplicate variable

* If SVE 256 is not present, the generic function was used for the computation, which slowed performance.

So added code to fall back to the NEON path when SVE 256 is not available (see the dispatch sketch after this entry).

* Code format change suggestion

---------

Co-authored-by: Vithule, Prashant <Prashant.Vithule@fujitsu.com>
2026-02-16 14:38:43 +08:00
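
A minimal sketch of the dispatch this commit describes, assuming hypothetical kernel names (the actual symbols live in repack.cpp); svcntb() is the SVE intrinsic that reports the runtime vector length in bytes:

```cpp
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Use the SVE kernel only when the vector length is 256 bits; otherwise fall
// back to the NEON kernel instead of the slow generic path.
void gemm_q4_k_8x8_q8_k_dispatch(/* tensor args */) {
#if defined(__ARM_FEATURE_SVE)
    if (svcntb() * 8 == 256) {           // svcntb(): vector length in bytes
        // gemm_q4_k_8x8_q8_k_sve256(...);  // valid only for VL = 256
        return;
    }
#endif
    // gemm_q4_k_8x8_q8_k_neon(...);        // NEON fallback
}
```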
Georgi Gerganov 55d58599c8 ggml : bump version to 0.9.7 (ggml/1425) 2026-02-15 22:24:29 +02:00
Georgi Gerganov 1a8c700bfd ggml : bump version to 0.9.6 (ggml/1423) 2026-02-15 22:24:29 +02:00
David Friehs 27b93cbd15
cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to Nsight
AFAICT no overflow can occur here as iq2xxs values are far too small (both rewrites are checked in the sketch after this entry)

* uint -> uint32_t

error: identifier "uint" is undefined
2026-02-15 22:38:42 +05:30
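
Both rewrites above are exact under C integer arithmetic. A standalone brute-force check (hypothetical harness, not kernel code); the sum range is a generous guess, since the commit notes iq2xxs values are far too small to overflow:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
    // identity 1: (sum * scale + sum / 2) / 4 == (sum * (scale * 2 + 1)) / 8
    for (int sum = -4096; sum <= 4096; ++sum) {
        for (int scale = 0; scale < 16; ++scale) { // scale = aux32 >> 28 is 4 bits
            assert((sum * scale + sum / 2) / 4 == (sum * (scale * 2 + 1)) / 8);
        }
    }
    // identity 2: only the top 5 bits of aux32 matter after the shifts
    for (uint32_t hi = 0; hi < 32; ++hi) {
        const uint32_t aux32 = hi << 27;
        assert(((aux32 >> 28) * 2 + 1) == (aux32 >> 27 | 1));
    }
    puts("both identities hold");
}
```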
Daniel Bevenius 57088276d4
cmake : check if KleidiAI API has been fetched (#19640)
This commit addresses a build issue with the KleidiAI backend when
building multiple CPU backends. Commit
3a00c98584 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable;
the latter handles this case because it is idempotent, while
FetchContent_Populate is not.

I missed this during my review and should not have committed without
verifying the CI failure, sorry about that.
2026-02-15 13:59:38 +01:00
Georgi Gerganov 08e6d914b8
ggml : avoid UB in gemm ukernel (#19642) 2026-02-15 14:56:35 +02:00
Aaron Teo 184c694f45
ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399) 2026-02-15 18:20:35 +08:00
Aman Gupta 684b36101c
ggml-cpu: FA add GEMM microkernel (#19422)
* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR != 0 (see the tail-handling sketch after this entry)

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations
2026-02-15 11:09:24 +05:30
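
The "DV % GGML_F32_EPR != 0" bullet is the classic vector-width remainder problem; a generic sketch of the pattern (all names here are illustrative, with the inner j-loop standing in for one SIMD fused multiply-add):

```cpp
#include <cstddef>

// Accumulate s * v into acc over a row of length DV: full-width vector
// iterations first, then a scalar tail for the DV % EPR != 0 leftovers.
void acc_scaled_row(float * acc, const float * v, float s, size_t DV) {
    const size_t EPR = 8;                    // elements per register, e.g. GGML_F32_EPR
    size_t i = 0;
    for (; i + EPR <= DV; i += EPR) {        // vectorizable main loop
        for (size_t j = 0; j < EPR; ++j) {
            acc[i + j] += s * v[i + j];
        }
    }
    for (; i < DV; ++i) {                    // scalar tail
        acc[i] += s * v[i];
    }
}
```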
SamareshSingh 3a00c98584
cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both the build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments
2026-02-15 06:22:53 +01:00
Ruben Ortlam 32d504cd94 fix editorconfig 2026-02-14 13:02:32 +01:00
Georgi Gerganov 1725e316c1
models : optimize qwen3next graph (#19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
2026-02-14 12:57:36 +02:00
Ruben Ortlam 02ccf81496 tmpsh size fix 2026-02-14 11:43:31 +01:00
Adrien Gallouët b7742cf321
ggml : fix GGML_DEBUG with OpenMP (#19599)
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-14 11:22:57 +01:00
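
A minimal sketch of the guard pattern the fix implies, with illustrative names only: debug code touching last_graph has to sit behind the same conditional that declares the field.

```cpp
// Hypothetical mirror of the situation described above, not ggml's real types.
struct compute_state {
#ifndef GGML_USE_OPENMP
    const void * last_graph;    // exists only on the non-OpenMP threading path
#endif
    int n_threads;
};

void graph_compute_thread_debug(const compute_state * st) {
#ifndef GGML_USE_OPENMP
    (void) st->last_graph;      // compiled out when OpenMP supplies the threads
#else
    (void) st;
#endif
}
```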
Ruben Ortlam 0b4b0d2e57 device tuning 2026-02-14 11:16:20 +01:00
Ruben Ortlam dd92b1f8d5 fix regressions 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9a8743c4 add Intel shader core count lookup-table 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ae5466aaf Use wave32 on AMD RDNA for scalar FA 2026-02-14 11:16:20 +01:00
Ruben Ortlam 16cb912442 Bc 4 for scalar FA is not a valid configuration 2026-02-14 11:16:20 +01:00
Ruben Ortlam cd54ba2b86 fixes 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3946eb657f fix rebase issues 2026-02-14 11:16:20 +01:00
Ruben Ortlam 28a3c0b859 fix shmem support function 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ed9183ac9 use minimal subgroup size on Intel 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9b701ff5 relax flash attention split_k condition to allow non-gqa use 2026-02-14 11:16:17 +01:00
Georgi Gerganov 6e473fb384
metal : fix ACC op (#19427) 2026-02-14 09:54:03 +02:00
Ruben Ortlam d6a004547f use smaller scalar rows size for smaller rows count 2026-02-14 07:05:36 +01:00
Ruben Ortlam de6db3fed6 use float_type for dequantize4 functions 2026-02-14 07:05:36 +01:00
Ruben Ortlam 356f18c444 use vectorized stores 2026-02-14 07:05:36 +01:00
Ruben Ortlam 4819fd3014 dynamic subgroups for intel 2026-02-14 07:05:16 +01:00
Ruben Ortlam b626e3296d also stage V through shmem when this is done for K 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8fbd3575e0 default to Bc 32 2026-02-14 07:05:16 +01:00
Ruben Ortlam d8d536cf98 only stage through shmem on Nvidia 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8236c453a5 stage V loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam b7b67f8742 stage K loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam 50a420e044 fuse lf accumulation, pf and v accumulation into a loop 2026-02-14 07:05:16 +01:00
Ruben Ortlam ca5ec63cfb cache q values into registers for KQ 2026-02-14 07:05:16 +01:00
Ruben Ortlam 3c2088121c add padding to mask shmem buffer 2026-02-14 07:05:15 +01:00
Ruben Ortlam 07afb5128f fixes 2026-02-14 07:04:32 +01:00
Ruben Ortlam e3bba64e82 add medium rows FA shader Br size 2026-02-14 07:03:07 +01:00
Ruben Ortlam c0f419351c optimize masksh use 2026-02-14 07:03:06 +01:00
Ruben Ortlam 9b309bbc51 fix amd workgroup size issue 2026-02-14 06:57:22 +01:00
Ruben Ortlam f92d7eddab use f32 scalar FA if f16 is not supported by device 2026-02-14 06:57:22 +01:00
Ruben Ortlam 828b7e9bb1 use row_split when Br >= 4, change reductions to use shared memory if row_split == 1 2026-02-14 06:57:22 +01:00
Ruben Ortlam e7a758fb66 split rows inside of subgroups for faster synchronization 2026-02-14 06:57:22 +01:00
Ruben Ortlam 015d7bcd66 vulkan: allow using fp16 in coopmat1 flash attention shader 2026-02-14 06:57:21 +01:00
Jeff Bolz dbb023336b
vulkan: support L2_NORM with contiguous rows (#19604) 2026-02-14 06:42:04 +01:00
Jeff Bolz 53aef25a88
vulkan: support GGML_OP_SET (#19584) 2026-02-14 06:36:38 +01:00
Sophon 2dec548094
vulkan: Add vendor id for Qualcomm drivers (#19569)
This commit allows the Qualcomm native Vulkan driver to be used on Windows
instead of Mesa Dozen.
2026-02-14 06:29:17 +01:00
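
A sketch of the kind of check involved, assuming Qualcomm's Vulkan vendor ID of 0x5143 ("QC" in ASCII); the helper below is illustrative, not the actual llama.cpp code:

```cpp
#include <cstdint>

enum class vk_gpu_vendor { qualcomm, other };

// Match VkPhysicalDeviceProperties::vendorID so the native Qualcomm driver
// can be preferred over the Mesa Dozen (Vulkan-on-D3D12) layer on Windows.
vk_gpu_vendor classify_vk_vendor(uint32_t vendor_id) {
    constexpr uint32_t VENDOR_ID_QUALCOMM = 0x5143;
    return vendor_id == VENDOR_ID_QUALCOMM ? vk_gpu_vendor::qualcomm
                                           : vk_gpu_vendor::other;
}
```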
Max Krasnyansky 0ccbfdef3e
hexagon: further optimizations and refactoring for flash attention (#19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax (a scalar reference sketch follows this entry).

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-13 16:27:30 -08:00
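
For reference, a scalar sketch of online softmax, the single-pass recurrence that hvx_vec_expf vectorizes on HVX (plain C++, not Hexagon code): the running maximum m and running sum s are rescaled whenever a larger element appears, so no separate max pass is needed.

```cpp
#include <cmath>
#include <cstddef>

void online_softmax(float * x, size_t n) {
    float m = -INFINITY;  // running maximum
    float s = 0.0f;       // running sum of exp(x[i] - m)
    for (size_t i = 0; i < n; ++i) {
        const float m_new = x[i] > m ? x[i] : m;
        s = s * expf(m - m_new) + expf(x[i] - m_new); // rescale old sum, add term
        m = m_new;
    }
    for (size_t i = 0; i < n; ++i) {
        x[i] = expf(x[i] - m) / s;
    }
}
```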