llama.cpp/ggml
Jianhui Zhou 11b753e786 ggml: optimize repack on NUMA by binding threads
When using the repack buffer type, physical memory allocation is dictated
by the first-touch policy: a page is placed on the NUMA node of the thread
that first writes it. Since the main thread performs the write operations,
memory is often allocated on a single NUMA node, leading to an uneven
weight distribution across nodes.
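For illustration only (not part of this patch), a minimal C sketch of the
first-touch behavior using libnuma (link with -lnuma); the 64 MiB buffer
size and the get_mempolicy query exist only for the demo:

    #include <numa.h>     /* numa_available */
    #include <numaif.h>   /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        size_t size = 64u * 1024 * 1024;          /* arbitrary demo size */
        char  *buf  = aligned_alloc(4096, size);  /* virtual pages only; nothing physical yet */

        memset(buf, 0, size);  /* first touch: pages land on THIS thread's node */

        int node = -1;
        /* MPOL_F_NODE | MPOL_F_ADDR reports the node backing the page at buf */
        get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR);
        printf("first page resides on NUMA node %d\n", node);

        free(buf);
        return 0;
    }

Because the single main thread performs the memset here, every page of buf
ends up on that thread's node; this is the imbalance described above.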

Multi-threaded repacking can alleviate this problem, but the repack
threads are not bound to NUMA nodes, so page placement remains up to the
scheduler.

This patch applies the same thread affinity strategy (--numa distribute)
to the repacking phase. By binding the repack threads to the same nodes
as the compute threads, we ensure that weights are written (and thus
allocated) on the local NUMA node, minimizing cross-node memory access
during inference.
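A hedged sketch of the distribute-style binding (not the actual ggml
implementation): worker threads are pinned round-robin across nodes before
they write their slice of the weights. repack_chunk() is a hypothetical
stand-in for the per-thread repack work:

    #include <numa.h>
    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 8  /* illustrative thread count */

    static void *worker(void *arg) {
        int tid  = (int)(long)arg;
        int node = tid % (numa_max_node() + 1);  /* spread threads over nodes */

        numa_run_on_node(node);    /* pin this thread to the CPUs of `node` */
        numa_set_preferred(node);  /* prefer `node` for its allocations */

        /* repack_chunk(tid): hypothetical; writing its slice of the weights
         * here makes first-touch place those pages on the local node */
        printf("thread %d bound to node %d\n", tid, node);
        return NULL;
    }

    int main(void) {
        if (numa_available() < 0) return 1;
        pthread_t th[N_THREADS];
        for (long i = 0; i < N_THREADS; i++)
            pthread_create(&th[i], NULL, worker, (void *)i);
        for (int i = 0; i < N_THREADS; i++)
            pthread_join(th[i], NULL);
        return 0;
    }

With the repack threads bound the same way as the compute threads, each
weight slice is written, and therefore allocated, on the node that will
later read it.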

Performance on Intel Xeon Silver 4514Y (32 cores):
qwen3 8B  Q4_K: 19.39 -> 26.92 t/s (+39%)
qwen3 32B Q4_K:  4.99 ->  7.38 t/s (+48%)

Signed-off-by: Jianhui Zhou <jonaszhou@zhaoxin.com>
2026-01-08 14:12:59 +00:00
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include ggml-cpu : fix RISC-V Q4_0 repack select and RVV feature reporting (#17951) 2025-12-12 16:26:03 +02:00
src ggml: optimize repack on NUMA by binding threads 2026-01-08 14:12:59 +00:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt cmake : set `CMAKE_RUNTIME_OUTPUT_DIRECTORY` for non standalone build (ggml/1394) 2025-12-14 08:33:51 +02:00