llama.cpp

pestopoppa b1366757cf ggml-cpu: parallelize tensor repacking with OpenMP

Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs.

Measured on AMD EPYC 9655 (96 cores):

| Model Size  | Before | After | Speedup |
|-------------|--------|-------|---------|
| 6.8GB Q4_K  | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K   | 11.9s  | 5.3s  | 2.2x    |
| 271GB Q4_K  | ~150s  | ~60s  | ~2.5x   |

The repack functions convert quantized tensors from storage layout to SIMD-optimized layout for AVX-512. This was previously single-threaded and is now parallelized across row groups.

Key changes:
- Convert pointer-increment loops to explicit indexing
- Add #pragma omp parallel for to outer loops (guarded by #ifdef _OPENMP)
- Each thread processes independent row groups
- Move thread-local dst_tmp arrays inside parallel region

Functions parallelized:
- repack_q4_0_to_q4_0_4_bl (Q4_0 x4 interleave)
- repack_q4_K_to_q4_K_8_bl (Q4_K_M, Q4_K_S models)
- repack_q2_K_to_q2_K_8_bl (Q2_K models)
- repack_q4_0_to_q4_0_8_bl (Q4_0 x8 interleave)
- repack_iq4_nl_to_iq4_nl_4_bl (IQ4_NL x4)
- repack_iq4_nl_to_iq4_nl_8_bl (IQ4_NL x8)

Tested on: AMD EPYC 9655 "Turin" with 192 threads

2026-01-01 12:51:30 +01:00
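The loop transformation listed under "Key changes" can be illustrated with a minimal sketch. Everything here is hypothetical and simplified for illustration: `Block`, `BlockX4`, `make_block_x4`, `repack_serial`, and `repack_parallel` are stand-ins, not the actual ggml types or repack functions, and the byte interleave is not the real Q4_0 layout. The sketch only shows the general pattern: the serial version advances `src` and `dst` pointers between iterations, which makes iterations depend on each other, while the parallel version computes each row group's offsets from the loop index so that iterations are independent, and keeps the scratch array inside the loop body so each thread has its own copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified block types (NOT ggml's real quantized blocks).
constexpr int QK = 8;                       // payload bytes per block (illustrative)
struct Block   { uint8_t qs[QK]; };
struct BlockX4 { uint8_t qs[4 * QK]; };

// Interleave the bytes of 4 source blocks into one wide block.
static BlockX4 make_block_x4(const Block in[4]) {
    BlockX4 out;
    for (int i = 0; i < 4 * QK; ++i) {
        out.qs[i] = in[i % 4].qs[i / 4];
    }
    return out;
}

// Serial form: src and dst are advanced as the loops run, so iteration
// g cannot start without knowing where iteration g-1 left the pointers.
void repack_serial(const Block* src, BlockX4* dst, int nrows, int nblocks) {
    assert(nrows % 4 == 0);
    Block tmp[4];
    for (int b = 0; b < nrows; b += 4) {
        for (int x = 0; x < nblocks; ++x) {
            for (int i = 0; i < 4; ++i) tmp[i] = src[x + i * nblocks];
            *dst++ = make_block_x4(tmp);
        }
        src += 4 * nblocks;
    }
}

// Parallel form: offsets are derived from the group index g, and the
// scratch array lives inside the loop body, so it is private per thread.
void repack_parallel(const Block* src, BlockX4* dst, int nrows, int nblocks) {
    assert(nrows % 4 == 0);
    const int ngroups = nrows / 4;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int g = 0; g < ngroups; ++g) {
        Block tmp[4];                                   // thread-local scratch
        const Block* src_g = src + (size_t) g * 4 * nblocks;
        BlockX4*     dst_g = dst + (size_t) g * nblocks;
        for (int x = 0; x < nblocks; ++x) {
            for (int i = 0; i < 4; ++i) tmp[i] = src_g[x + i * nblocks];
            dst_g[x] = make_block_x4(tmp);
        }
    }
}
```

Built without OpenMP, the pragma is compiled out and `repack_parallel` runs as an ordinary loop that should produce the same output as `repack_serial`. With OpenMP enabled (for example `-fopenmp` on GCC or Clang), row groups are distributed across threads, and no synchronization is needed because each group writes a disjoint range of `dst`.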
..
cmake	ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094)	2025-08-07 13:45:41 +02:00
include	ggml-cpu : fix RISC-V Q4_0 repack select and RVV feature reporting (#17951)	2025-12-12 16:26:03 +02:00
src	ggml-cpu: parallelize tensor repacking with OpenMP	2026-01-01 12:51:30 +01:00
.gitignore	vulkan : cmake integration (#8119)	2024-07-13 18:12:39 +02:00
CMakeLists.txt	cmake : set `CMAKE_RUNTIME_OUTPUT_DIRECTORY` for non standalone build (ggml/1394)	2025-12-14 08:33:51 +02:00