llama.cpp/ggml/src/ggml-cpu
Yibo Cai 54a2c7a8cd
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves the q4_k_q8_k GEMM kernel by using the arm64 i8mm instructions.

Tested on neoverse-n2 with a llama3 8b model using q4_k_m quantization.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity is unchanged by this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
2025-05-29 14:39:20 +03:00
| Name | Last commit | Date |
|------|-------------|------|
| amx | ggml : upgrade init_tensor API to return a ggml_status (#11854) | 2025-02-28 14:41:47 +01:00 |
| cmake | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00 |
| kleidiai | ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509) | 2025-05-13 18:02:28 +03:00 |
| llamafile | ggml : Enable MMA for BF16 in llamafile_sgemm (#13148) | 2025-05-02 19:53:12 +03:00 |
| CMakeLists.txt | cmake: Factor out CPU architecture detection (#13883) | 2025-05-29 12:50:25 +02:00 |
| binary-ops.cpp | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| binary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| common.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| cpu-feats-x86.cpp | ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871) | 2025-04-21 18:13:51 +02:00 |
| ggml-cpu-aarch64.cpp | ggml : riscv: add xtheadvector support (#13720) | 2025-05-27 16:21:36 +03:00 |
| ggml-cpu-aarch64.h | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-hbm.cpp | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-hbm.h | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-impl.h | ggml : riscv: add xtheadvector support (#13720) | 2025-05-27 16:21:36 +03:00 |
| ggml-cpu-quants.c | arm64: optimize q4_k_q8_k kernel with i8mm (#13886) | 2025-05-29 14:39:20 +03:00 |
| ggml-cpu-quants.h | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00 |
| ggml-cpu-traits.cpp | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-traits.h | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu.c | arm64: optimize q4_k_q8_k kernel with i8mm (#13886) | 2025-05-29 14:39:20 +03:00 |
| ggml-cpu.cpp | rpc : use backend registry, support dl backends (#13304) | 2025-05-04 21:25:43 +02:00 |
| ops.cpp | ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882) | 2025-05-29 12:18:43 +03:00 |
| ops.h | ggml : Depthwise 2D convolution (ggml/1152) | 2025-04-24 17:32:47 +03:00 |
| simd-mappings.h | ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843) | 2025-05-29 09:01:33 +03:00 |
| unary-ops.cpp | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| unary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| vec.cpp | ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843) | 2025-05-29 09:01:33 +03:00 |
| vec.h | ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882) | 2025-05-29 12:18:43 +03:00 |