llama.cpp

Jianhui Zhou 5714d4b86e ggml: Add thread count control during repacking	2026-01-13 07:36:31 +00:00

This change enables the repack stage to utilize the user-specified thread count, ensuring that both the logical thread IDs and the total number of threads remain consistent between the repack and inference stages.

In a NUMA architecture where the `--numa distribute` parameter is used, logical threads are pinned to specific physical NUMA nodes. By aligning the thread configuration across these two stages, we can fully leverage the operating system's "first-touch" memory allocation policy:

1. Repack Stage: Logical thread i (bound to NUMA node j) is responsible for repacking and writing the weight data. Since the "first touch" occurs within this thread, the corresponding physical memory is allocated on node j.
2. Inference Stage: The same logical thread i (still bound to node j) reads these weights. Since the data already resides on the local node, low-latency local memory access is achieved.

Without ensuring consistency in the number of threads, data may be randomly allocated to mismatched nodes, resulting in significant cross-node access overhead during inference.

Signed-off-by: Jianhui Zhou <jonaszhou@zhaoxin.com>
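
The first-touch alignment described in the commit message can be illustrated with a minimal sketch. This is not the ggml implementation: `slice_for_thread`, `run_stage`, and `repack_then_read` are hypothetical names, the "repack" is a placeholder multiplication, and thread pinning to NUMA nodes is omitted entirely. It only shows the key idea, namely that logical thread i writes and later reads the same slice because both stages use the same thread count.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: half-open range [begin, end) owned by logical
// thread `ith` out of `nth`. Both stages must call it with the same `nth`.
static std::pair<std::size_t, std::size_t> slice_for_thread(std::size_t n, int nth, int ith) {
    const std::size_t chunk = (n + nth - 1) / nth;          // ceil(n / nth)
    const std::size_t begin = std::min(n, chunk * ith);
    const std::size_t end   = std::min(n, begin + chunk);
    return {begin, end};
}

// Hypothetical helper: run fn(ith) on nth threads and wait for all of them.
// A real NUMA-aware runtime would pin thread ith to a fixed node here.
template <typename Fn>
static void run_stage(int nth, Fn fn) {
    std::vector<std::thread> workers;
    for (int ith = 0; ith < nth; ++ith) {
        workers.emplace_back(fn, ith);
    }
    for (auto & w : workers) {
        w.join();
    }
}

static double repack_then_read(const std::vector<float> & src, int nth) {
    const std::size_t n = src.size();

    // Deliberately left uninitialized: the calling thread does not write
    // to the buffer, so for a large allocation the first write to each
    // page happens in the worker that owns it.
    std::unique_ptr<float[]> dst(new float[n]);

    // Repack stage: thread ith performs the first write to its slice.
    run_stage(nth, [&](int ith) {
        const auto r = slice_for_thread(n, nth, ith);
        for (std::size_t i = r.first; i < r.second; ++i) {
            dst[i] = src[i] * 2.0f;                          // placeholder "repack"
        }
    });

    // Inference stage: same nth, so thread ith reads the slice it wrote.
    std::vector<double> partial(nth, 0.0);
    run_stage(nth, [&](int ith) {
        const auto r = slice_for_thread(n, nth, ith);
        double acc = 0.0;
        for (std::size_t i = r.first; i < r.second; ++i) {
            acc += dst[i];
        }
        partial[ith] = acc;
    });

    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling `repack_then_read(weights, nth)` with the thread count used for inference keeps the slice boundaries identical in both stages. If the repack stage used a different thread count, the boundaries would shift and a thread could end up reading pages that another thread touched first.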
ggml-alloc.h	ggml : upgrade init_tensor API to return a ggml_status (#11854)	2025-02-28 14:41:47 +01:00
ggml-backend.h	rpc : add support for multiple devices (#16276)	2025-10-04 12:49:16 +03:00
ggml-blas.h	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-cann.h	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-cpp.h	ggml : fix ggml_gallocr_ptr type (ggml/1205)	2025-05-01 09:58:44 +03:00
ggml-cpu.h	ggml: Add thread count control during repacking	2026-01-13 07:36:31 +00:00
ggml-cuda.h	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-hexagon.h	Add experimental ggml-hexagon backend for the Hexagon NPU (#16547)	2025-10-22 13:47:09 -07:00
ggml-metal.h	metal : refactor + optimize v2 (#15995)	2025-09-17 20:38:12 +03:00
ggml-opencl.h	Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs (#10693)	2024-12-13 12:23:52 -08:00
ggml-opt.h	finetune: SGD optimizer, more CLI args (#13873)	2025-08-14 12:03:57 +02:00
ggml-rpc.h	rpc : fix alloc size logic (#17116)	2025-12-05 19:39:04 +02:00
ggml-sycl.h	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-vulkan.h	vulkan: Make Vulkan optional at runtime (#11493). (#11494)	2025-02-10 07:17:21 +01:00
ggml-webgpu.h	ggml: Add initial WebGPU backend (#14521)	2025-07-16 18:18:51 +03:00
ggml-zdnn.h	zdnn: refactor codebase + add docs (#16178)	2025-09-23 14:53:05 +08:00
ggml-zendnn.h	ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690)	2025-12-07 00:13:33 +08:00
ggml.h	ggml : remove GGML_KQ_MASK_PAD constant (#17910)	2025-12-10 20:53:16 +02:00
gguf.h	GGUF: C++ refactor, backend support, misc fixes (#11030)	2025-01-07 18:01:58 +01:00