llama.cpp/ggml/include
Jianhui Zhou 5714d4b86e ggml: Add thread count control during repacking
This change makes the repack stage use the user-specified thread
count, so that both the logical thread IDs and the total number of
threads stay consistent between the repack and inference stages.

In a NUMA architecture where the `--numa distribute` parameter is used,
logical threads are pinned to specific physical NUMA nodes. By aligning
the thread configuration across these two stages, we can fully leverage
the operating system's "first-touch" memory allocation policy:

1. Repack Stage: Logical thread i (bound to NUMA node j) is responsible
   for repacking and writing the weight data. Since the "first touch"
   occurs within this thread, the corresponding physical memory is
   allocated on node j.

2. Inference Stage: The same logical thread i (still bound to node j)
   reads these weights. Since the data already resides on the local
   node, low-latency local memory access is achieved.

Without a consistent thread count, the chunk boundaries differ between
the two stages, so weight data may be allocated on nodes that do not
match the threads that later read it, incurring significant cross-node
access overhead during inference.

Signed-off-by: Jianhui Zhou <jonaszhou@zhaoxin.com>
2026-01-13 07:36:31 +00:00
ggml-alloc.h ggml : upgrade init_tensor API to return a ggml_status (#11854) 2025-02-28 14:41:47 +01:00
ggml-backend.h rpc : add support for multiple devices (#16276) 2025-10-04 12:49:16 +03:00
ggml-blas.h ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-cann.h ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-cpp.h ggml : fix ggml_gallocr_ptr type (ggml/1205) 2025-05-01 09:58:44 +03:00
ggml-cpu.h ggml: Add thread count control during repacking 2026-01-13 07:36:31 +00:00
ggml-cuda.h ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-hexagon.h Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
ggml-metal.h metal : refactor + optimize v2 (#15995) 2025-09-17 20:38:12 +03:00
ggml-opencl.h Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs (#10693) 2024-12-13 12:23:52 -08:00
ggml-opt.h finetune: SGD optimizer, more CLI args (#13873) 2025-08-14 12:03:57 +02:00
ggml-rpc.h rpc : fix alloc size logic (#17116) 2025-12-05 19:39:04 +02:00
ggml-sycl.h ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-vulkan.h vulkan: Make Vulkan optional at runtime (#11493). (#11494) 2025-02-10 07:17:21 +01:00
ggml-webgpu.h ggml: Add initial WebGPU backend (#14521) 2025-07-16 18:18:51 +03:00
ggml-zdnn.h zdnn: refactor codebase + add docs (#16178) 2025-09-23 14:53:05 +08:00
ggml-zendnn.h ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690) 2025-12-07 00:13:33 +08:00
ggml.h ggml : remove GGML_KQ_MASK_PAD constant (#17910) 2025-12-10 20:53:16 +02:00
gguf.h GGUF: C++ refactor, backend support, misc fixes (#11030) 2025-01-07 18:01:58 +01:00