llama.cpp/ggml/src/ggml-cpu
Commit d0b79aaa2f by Adrien Gallouët
ggml : add native AVX512-FP16 support for F16 operations (#20529)
The overall benchmark speed remains almost the same because the CPU now
computes faster than the RAM can deliver the data. (See the perf stat
results below, which show roughly 2.7 billion fewer instructions executed.)

Also note that this path is only enabled for native builds or with
custom compiler flags.

After (with this patch):
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        189,073.52 msec task-clock                       #   14.658 CPUs utilized
               404      context-switches                 #    2.137 /sec
                19      cpu-migrations                   #    0.100 /sec
           372,390      page-faults                      #    1.970 K/sec
   310,877,195,595      instructions                     #    0.54  insn per cycle
   581,071,530,602      cycles                           #    3.073 GHz
    19,352,107,994      branches                         #  102.352 M/sec
        48,304,438      branch-misses                    #    0.25% of all branches
    84,998,431,152      L1-dcache-loads                  #  449.552 M/sec
    12,186,410,279      L1-dcache-load-misses            #   14.34% of all L1-dcache accesses

      12.899358742 seconds time elapsed

     187.823044000 seconds user
       1.253416000 seconds sys
```

Before:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        190,594.56 msec task-clock                       #   14.652 CPUs utilized
               436      context-switches                 #    2.288 /sec
                22      cpu-migrations                   #    0.115 /sec
           372,782      page-faults                      #    1.956 K/sec
   313,574,921,966      instructions                     #    0.54  insn per cycle
   586,064,970,425      cycles                           #    3.075 GHz
    19,585,778,563      branches                         #  102.761 M/sec
        48,437,488      branch-misses                    #    0.25% of all branches
    86,219,336,628      L1-dcache-loads                  #  452.370 M/sec
    12,232,085,771      L1-dcache-load-misses            #   14.19% of all L1-dcache accesses

      13.007923164 seconds time elapsed

     189.395316000 seconds user
       1.202612000 seconds sys
```

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-14 10:06:14 +01:00
| File | Last commit | Date |
|---|---|---|
| amx | chore : correct typos [no ci] (#20041) | 2026-03-05 08:50:21 +01:00 |
| arch | ggml-cpu: add RVV vec dot kernels for quantization types (#18859) | 2026-03-13 17:36:04 +02:00 |
| cmake | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00 |
| kleidiai | kleidiai : support for concurrent sme and neon kernel execution (#20070) | 2026-03-10 09:25:25 +02:00 |
| llamafile | ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083) (#20130) | 2026-03-06 23:22:39 +08:00 |
| spacemit | ggml : fix SpaceMit IME array out-of-bounds in task assignment (#16629) | 2025-10-17 13:01:23 +03:00 |
| CMakeLists.txt | kleidiai : add sme fp16 compute path for q4_0 gemm on aarch64 (#20043) | 2026-03-03 11:40:26 +02:00 |
| arch-fallback.h | ggml-cpu: add RVV vec dot kernels for quantization types (#18859) | 2026-03-13 17:36:04 +02:00 |
| binary-ops.cpp | ggml : extend bin bcast for permuted src1 (#19484) | 2026-02-11 07:52:00 +02:00 |
| binary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| common.h | ggml-cpu: FA add GEMM microkernel (#19422) | 2026-02-15 11:09:24 +05:30 |
| ggml-cpu-impl.h | ggml-cpu: FA split across kv for faster TG (#19209) | 2026-02-03 01:19:55 +08:00 |
| ggml-cpu.c | ggml : add NVFP4 quantization type support (#19769) | 2026-03-11 21:02:54 +01:00 |
| ggml-cpu.cpp | ggml-cpu: FA split across kv for faster TG (#19209) | 2026-02-03 01:19:55 +08:00 |
| hbm.cpp | ggml-cpu : split arch-specific implementations (#13892) | 2025-06-09 16:47:13 +02:00 |
| hbm.h | ggml-cpu : split arch-specific implementations (#13892) | 2025-06-09 16:47:13 +02:00 |
| ops.cpp | graph : remove redundant GDN state transposes (#20443) | 2026-03-13 22:12:54 +02:00 |
| ops.h | ggml: add GATED_DELTA_NET op (#19504) | 2026-03-07 15:41:10 +08:00 |
| quants.c | ggml : add NVFP4 quantization type support (#19769) | 2026-03-11 21:02:54 +01:00 |
| quants.h | ggml : add NVFP4 quantization type support (#19769) | 2026-03-11 21:02:54 +01:00 |
| repack.cpp | ggml-cpu: add RVV repack GEMM and GEMV for quantization types (#19121) | 2026-03-10 08:49:52 +02:00 |
| repack.h | ggml-cpu: add RVV repack GEMM and GEMV for quantization types (#19121) | 2026-03-10 08:49:52 +02:00 |
| simd-gemm.h | ggml : avoid UB in gemm ukernel (#19642) | 2026-02-15 14:56:35 +02:00 |
| simd-mappings.h | ggml : add native AVX512-FP16 support for F16 operations (#20529) | 2026-03-14 10:06:14 +01:00 |
| traits.cpp | ggml : fix fallback to CPU for ununsupported ops (#15118) | 2025-08-06 14:37:35 +02:00 |
| traits.h | ggml : fix fallback to CPU for ununsupported ops (#15118) | 2025-08-06 14:37:35 +02:00 |
| unary-ops.cpp | ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511) | 2026-02-11 18:58:43 +02:00 |
| unary-ops.h | ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (#17063) | 2025-11-13 20:54:47 +02:00 |
| vec.cpp | ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399) | 2026-02-15 18:20:35 +08:00 |
| vec.h | ggml-cpu: extend support for RVV floating-point kernels (#17318) | 2025-12-18 16:02:09 +02:00 |