llama.cpp

Yibo Cai 54a2c7a8cd arm64: optimize q4_k_q8_k kernel with i8mm (#13886)

This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |      S_TG t/s       |
|       |        |      | original | this pr  | original | this pr  |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```

2025-05-29 14:39:20 +03:00
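
The uplift ranges quoted in the commit message can be checked against its own benchmark table. The following is a minimal sketch of that arithmetic, using only the throughput figures from the table; the helper name `uplift_pct` and the dictionary layout are illustrative assumptions and are not part of llama.cpp.

```python
# Throughput figures (tokens/s) copied from the benchmark table in the
# commit message. Each entry maps batch size B -> (original, this PR).
S_PP = {
    1: (110.12, 147.83), 2: (121.16, 172.42), 4: (120.15, 169.75),
    8: (130.97, 196.81), 16: (131.01, 196.88), 32: (130.85, 196.51),
}
S_TG = {
    1: (24.36, 24.28), 2: (46.36, 47.93), 4: (74.68, 84.00),
    8: (91.04, 114.74), 16: (101.43, 135.79), 32: (106.97, 147.29),
}


def uplift_pct(original, patched):
    """Relative throughput change in percent (hypothetical helper)."""
    return (patched / original - 1.0) * 100.0


pp_uplift = {b: uplift_pct(*v) for b, v in S_PP.items()}
tg_uplift = {b: uplift_pct(*v) for b, v in S_TG.items()}

# The commit message quotes the S_TG range only for batch size 4 and above.
tg_uplift_b4 = {b: u for b, u in tg_uplift.items() if b >= 4}

for b in S_PP:
    print(f"B={b:2d}  S_PP {pp_uplift[b]:+6.2f}%  S_TG {tg_uplift[b]:+6.2f}%")
```

With these figures the S_PP uplift spans roughly 34% to 50% across all batch sizes, and the S_TG uplift for batch sizes 4 and above runs from about 12% to just under 38%. At batch size 1 the S_TG throughput is essentially unchanged.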
..
amx	ggml : upgrade init_tensor API to return a ggml_status (#11854)	2025-02-28 14:41:47 +01:00
cmake	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
kleidiai	ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)	2025-05-13 18:02:28 +03:00
llamafile	ggml : Enable MMA for BF16 in llamafile_sgemm (#13148)	2025-05-02 19:53:12 +03:00
CMakeLists.txt	cmake: Factor out CPU architecture detection (#13883)	2025-05-29 12:50:25 +02:00
binary-ops.cpp	cpu: de-duplicate some of the operators and refactor (ggml/1144)	2025-03-30 08:33:31 +03:00
binary-ops.h	cpu: de-duplicate some of the operators and refactor (ggml/1144)	2025-03-30 08:33:31 +03:00
common.h	cpu: de-duplicate some of the operators and refactor (ggml/1144)	2025-03-30 08:33:31 +03:00
cpu-feats-x86.cpp	ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871)	2025-04-21 18:13:51 +02:00
ggml-cpu-aarch64.cpp	ggml : riscv: add xtheadvector support (#13720)	2025-05-27 16:21:36 +03:00
ggml-cpu-aarch64.h	ggml : refactor online repacking (#10446)	2024-12-07 14:37:50 +02:00
ggml-cpu-hbm.cpp	ggml : refactor online repacking (#10446)	2024-12-07 14:37:50 +02:00
ggml-cpu-hbm.h	ggml : refactor online repacking (#10446)	2024-12-07 14:37:50 +02:00
ggml-cpu-impl.h	ggml : riscv: add xtheadvector support (#13720)	2025-05-27 16:21:36 +03:00
ggml-cpu-quants.c	arm64: optimize q4_k_q8_k kernel with i8mm (#13886)	2025-05-29 14:39:20 +03:00
ggml-cpu-quants.h	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-cpu-traits.cpp	ggml : refactor online repacking (#10446)	2024-12-07 14:37:50 +02:00
ggml-cpu-traits.h	ggml : refactor online repacking (#10446)	2024-12-07 14:37:50 +02:00
ggml-cpu.c	arm64: optimize q4_k_q8_k kernel with i8mm (#13886)	2025-05-29 14:39:20 +03:00
ggml-cpu.cpp	rpc : use backend registry, support dl backends (#13304)	2025-05-04 21:25:43 +02:00
ops.cpp	ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)	2025-05-29 12:18:43 +03:00
ops.h	ggml : Depthwise 2D convolution (ggml/1152)	2025-04-24 17:32:47 +03:00
simd-mappings.h	ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)	2025-05-29 09:01:33 +03:00
unary-ops.cpp	cpu: de-duplicate some of the operators and refactor (ggml/1144)	2025-03-30 08:33:31 +03:00
unary-ops.h	cpu: de-duplicate some of the operators and refactor (ggml/1144)	2025-03-30 08:33:31 +03:00
vec.cpp	ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)	2025-05-29 09:01:33 +03:00
vec.h	ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)	2025-05-29 12:18:43 +03:00