llama.cpp/ggml/src/ggml-cpu
Commit d0b79aaa2f by Adrien Gallouët
ggml : add native AVX512-FP16 support for F16 operations (#20529)
The overall benchmark speed remains almost the same because the CPU now
computes faster than the RAM can deliver the data. (See the perf stat
results below, which show roughly 2.7 billion fewer instructions executed.)

Also note that this path is only enabled for native builds or with
custom compiler flags.

After (with this patch):
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        189,073.52 msec task-clock                       #   14.658 CPUs utilized
               404      context-switches                 #    2.137 /sec
                19      cpu-migrations                   #    0.100 /sec
           372,390      page-faults                      #    1.970 K/sec
   310,877,195,595      instructions                     #    0.54  insn per cycle
   581,071,530,602      cycles                           #    3.073 GHz
    19,352,107,994      branches                         #  102.352 M/sec
        48,304,438      branch-misses                    #    0.25% of all branches
    84,998,431,152      L1-dcache-loads                  #  449.552 M/sec
    12,186,410,279      L1-dcache-load-misses            #   14.34% of all L1-dcache accesses

      12.899358742 seconds time elapsed

     187.823044000 seconds user
       1.253416000 seconds sys
```

Before:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        190,594.56 msec task-clock                       #   14.652 CPUs utilized
               436      context-switches                 #    2.288 /sec
                22      cpu-migrations                   #    0.115 /sec
           372,782      page-faults                      #    1.956 K/sec
   313,574,921,966      instructions                     #    0.54  insn per cycle
   586,064,970,425      cycles                           #    3.075 GHz
    19,585,778,563      branches                         #  102.761 M/sec
        48,437,488      branch-misses                    #    0.25% of all branches
    86,219,336,628      L1-dcache-loads                  #  452.370 M/sec
    12,232,085,771      L1-dcache-load-misses            #   14.19% of all L1-dcache accesses

      13.007923164 seconds time elapsed

     189.395316000 seconds user
       1.202612000 seconds sys
```

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-14 10:06:14 +01:00
| File | Last commit | Date |
|---|---|---|
| amx | chore : correct typos [no ci] (#20041) | 2026-03-05 08:50:21 +01:00 |
| arch | ggml-cpu: add RVV vec dot kernels for quantization types (#18859) | 2026-03-13 17:36:04 +02:00 |
| cmake | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00 |
| kleidiai | kleidiai : support for concurrent sme and neon kernel execution (#20070) | 2026-03-10 09:25:25 +02:00 |
| llamafile | ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083) (#20130) | 2026-03-06 23:22:39 +08:00 |
| spacemit | ggml : fix SpaceMit IME array out-of-bounds in task assignment (#16629) | 2025-10-17 13:01:23 +03:00 |
| CMakeLists.txt | kleidiai : add sme fp16 compute path for q4_0 gemm on aarch64 (#20043) | 2026-03-03 11:40:26 +02:00 |
| arch-fallback.h | ggml-cpu: add RVV vec dot kernels for quantization types (#18859) | 2026-03-13 17:36:04 +02:00 |
| binary-ops.cpp | ggml : extend bin bcast for permuted src1 (#19484) | 2026-02-11 07:52:00 +02:00 |
| binary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| common.h | ggml-cpu: FA add GEMM microkernel (#19422) | 2026-02-15 11:09:24 +05:30 |
| ggml-cpu-impl.h | ggml-cpu: FA split across kv for faster TG (#19209) | 2026-02-03 01:19:55 +08:00 |
| ggml-cpu.c | ggml : add NVFP4 quantization type support (#19769) | 2026-03-11 21:02:54 +01:00 |
| ggml-cpu.cpp | ggml-cpu: FA split across kv for faster TG (#19209) | 2026-02-03 01:19:55 +08:00 |
| hbm.cpp | ggml-cpu : split arch-specific implementations (#13892) | 2025-06-09 16:47:13 +02:00 |
| hbm.h | ggml-cpu : split arch-specific implementations (#13892) | 2025-06-09 16:47:13 +02:00 |
| ops.cpp | graph : remove redundant GDN state transposes (#20443) | 2026-03-13 22:12:54 +02:00 |
| ops.h | ggml: add GATED_DELTA_NET op (#19504) | 2026-03-07 15:41:10 +08:00 |
| quants.c | ggml : add NVFP4 quantization type support (#19769) | 2026-03-11 21:02:54 +01:00 |
| quants.h | ggml : add NVFP4 quantization type support (#19769) | 2026-03-11 21:02:54 +01:00 |
| repack.cpp | ggml-cpu: add RVV repack GEMM and GEMV for quantization types (#19121) | 2026-03-10 08:49:52 +02:00 |
| repack.h | ggml-cpu: add RVV repack GEMM and GEMV for quantization types (#19121) | 2026-03-10 08:49:52 +02:00 |
| simd-gemm.h | ggml : avoid UB in gemm ukernel (#19642) | 2026-02-15 14:56:35 +02:00 |
| simd-mappings.h | ggml : add native AVX512-FP16 support for F16 operations (#20529) | 2026-03-14 10:06:14 +01:00 |
| traits.cpp | ggml : fix fallback to CPU for ununsupported ops (#15118) | 2025-08-06 14:37:35 +02:00 |
| traits.h | ggml : fix fallback to CPU for ununsupported ops (#15118) | 2025-08-06 14:37:35 +02:00 |
| unary-ops.cpp | ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511) | 2026-02-11 18:58:43 +02:00 |
| unary-ops.h | ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (#17063) | 2025-11-13 20:54:47 +02:00 |
| vec.cpp | ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399) | 2026-02-15 18:20:35 +08:00 |
| vec.h | ggml-cpu: extend support for RVV floating-point kernels (#17318) | 2025-12-18 16:02:09 +02:00 |