With the repack buffer type, physical memory placement is determined by
the first-touch policy. Because the main thread performs the writes, the
memory is often allocated on a single NUMA node, so the weights end up
unevenly distributed across nodes.
Multi-threaded repack alleviates this, but the repack threads are not
bound to NUMA nodes, so placement still depends on where the scheduler
happens to run them.

This patch applies the thread affinity strategy already used for the
compute threads (--numa distribute) to the repacking phase. Binding the
repack threads to the same nodes as the compute threads ensures that
each portion of the weights is written, and therefore allocated, on the
NUMA node whose threads will later read it, minimizing cross-node memory
access during inference.
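
The idea can be illustrated with a minimal, Linux-only sketch; it is not
the code in this patch. All names (repack_job, node_for_thread,
chunk_range, pin_to_node, run_repack) are hypothetical, the round-robin
thread-to-node mapping is an assumption about what "distribute" means,
each node is assumed to own a contiguous block of cpus_per_node CPU ids,
and a plain memcpy stands in for the real repack kernel. For first-touch
to matter, dst must be freshly mapped memory that nothing has written yet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical job description; none of these names come from the patch. */
struct repack_job {
    char       *dst;
    const char *src;
    size_t      n;
    int         ith, nth;
    int         n_nodes, cpus_per_node;
};

/* Assumed "distribute" mapping: threads are spread round-robin over nodes. */
static int node_for_thread(int ith, int n_nodes) { return ith % n_nodes; }

/* Half-open byte range [begin, end) handled by thread ith out of nth. */
static void chunk_range(size_t n, int ith, int nth, size_t *begin, size_t *end) {
    size_t per = (n + (size_t)nth - 1) / (size_t)nth;
    size_t b   = per * (size_t)ith;
    *begin = b < n ? b : n;
    *end   = *begin + per < n ? *begin + per : n;
}

/* Best-effort pin of the calling thread to the CPUs assumed to belong to
 * `node`. Returns 0 on success or an errno value (e.g. no such CPU). */
static int pin_to_node(int node, int cpus_per_node) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < cpus_per_node; c++) {
        int cpu = node * cpus_per_node + c;
        if (cpu < CPU_SETSIZE) CPU_SET(cpu, &set);
    }
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *repack_worker(void *arg) {
    struct repack_job *job = arg;
    size_t b, e;
    /* Pin first, then write: the first write to a page decides its node. */
    (void)pin_to_node(node_for_thread(job->ith, job->n_nodes), job->cpus_per_node);
    chunk_range(job->n, job->ith, job->nth, &b, &e);
    memcpy(job->dst + b, job->src + b, e - b); /* stand-in for the repack */
    return NULL;
}

/* Spawn nth workers, each pinned to its node before touching its slice. */
static void run_repack(char *dst, const char *src, size_t n,
                       int nth, int n_nodes, int cpus_per_node) {
    assert(nth > 0 && n_nodes > 0 && cpus_per_node > 0);
    pthread_t         *threads = malloc(sizeof(*threads) * (size_t)nth);
    struct repack_job *jobs    = malloc(sizeof(*jobs) * (size_t)nth);
    assert(threads && jobs);
    for (int i = 0; i < nth; i++) {
        jobs[i] = (struct repack_job){dst, src, n, i, nth, n_nodes, cpus_per_node};
        int rc = pthread_create(&threads[i], NULL, repack_worker, &jobs[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nth; i++) pthread_join(threads[i], NULL);
    free(jobs);
    free(threads);
}
```

Calling run_repack(dst, src, n, 32, 2, 16) would, under these
assumptions, have even-numbered threads write their slices from node 0
and odd-numbered threads from node 1. Pinning is best-effort here, so
the copy still completes on machines where the assumed CPU ids do not
exist.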

Performance on Intel Xeon Silver 4514Y (32 cores):

  qwen3 8B  Q4_K: 19.39 -> 26.92 t/s (+39%)
  qwen3 32B Q4_K:  4.99 ->  7.38 t/s (+48%)

Signed-off-by: Jianhui Zhou <jonaszhou@zhaoxin.com>