Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs.

Measured on AMD EPYC 9655 (96 cores):

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s | 3.3s | 1.5x |
| 19GB Q4_K | 11.9s | 5.3s | 2.2x |
| 271GB Q4_K | ~150s | ~60s | ~2.5x |

The repack functions convert quantized tensors from storage layout to SIMD-optimized layout for AVX-512. This was previously single-threaded and is now parallelized across row groups.

Key changes:
- Convert pointer-increment loops to explicit indexing
- Add #pragma omp parallel for to outer loops (guarded by #ifdef _OPENMP)
- Each thread processes independent row groups
- Move thread-local dst_tmp arrays inside parallel region

Functions parallelized:
- repack_q4_0_to_q4_0_4_bl (Q4_0 x4 interleave)
- repack_q4_K_to_q4_K_8_bl (Q4_K_M, Q4_K_S models)
- repack_q2_K_to_q2_K_8_bl (Q2_K models)
- repack_q4_0_to_q4_0_8_bl (Q4_0 x8 interleave)
- repack_iq4_nl_to_iq4_nl_4_bl (IQ4_NL x4)
- repack_iq4_nl_to_iq4_nl_8_bl (IQ4_NL x8)

Tested on: AMD EPYC 9655 "Turin" with 192 threads
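The loop transformation described in the key changes can be sketched roughly as follows. This is a simplified illustration, not the actual repack code: the `block` and `block_x4` types, the `QK` and `NROWS` constants, and the `make_block_x4` and `repack_x4` functions are hypothetical stand-ins for the real quantized block layouts and repack routines. It shows explicit indexing by row group, an OpenMP pragma guarded by `_OPENMP`, and a per-iteration `dst_tmp` array so no scratch state is shared between threads.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical simplified block types (NOT the real quantized layouts).
constexpr int QK    = 8;  // payload bytes per block
constexpr int NROWS = 4;  // rows interleaved into one group

struct block    { uint8_t qs[QK]; };
struct block_x4 { uint8_t qs[QK * NROWS]; };

// Interleave one block from each of NROWS rows, `chunk` bytes at a time.
static block_x4 make_block_x4(const block * in, int chunk) {
    assert(chunk > 0 && QK % chunk == 0);
    block_x4 out;
    const int n_chunks = QK * NROWS / chunk;
    for (int i = 0; i < n_chunks; ++i) {
        const int src_row    = i % NROWS;
        const int src_offset = (i / NROWS) * chunk;
        std::memcpy(out.qs + i * chunk, in[src_row].qs + src_offset, chunk);
    }
    return out;
}

// src: nrow rows of nblocks blocks each, row-major.
// dst: (nrow / NROWS) row groups of nblocks interleaved blocks each.
static bool repack_x4(const block * src, block_x4 * dst,
                      int64_t nrow, int64_t nblocks, int chunk) {
    if (nrow % NROWS != 0) {
        return false;
    }
    const int64_t n_groups = nrow / NROWS;

    // Every source and destination address is computed from (g, x), so
    // iterations are independent and can be distributed across threads.
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int64_t g = 0; g < n_groups; ++g) {
        block dst_tmp[NROWS];  // declared inside the loop: private to each thread
        for (int64_t x = 0; x < nblocks; ++x) {
            for (int r = 0; r < NROWS; ++r) {
                dst_tmp[r] = src[(g * NROWS + r) * nblocks + x];
            }
            dst[g * nblocks + x] = make_block_x4(dst_tmp, chunk);
        }
    }
    return true;
}
```

A pointer-increment version (`dst++` and `src += ...` inside the loop) carries state from one iteration to the next, which is why it has to be rewritten with explicit indices before the outer loop can be parallelized safely. Without OpenMP enabled at compile time, the guard makes the loop fall back to the serial path.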