Add a separate rms_norm_f32_vec4 kernel using float4 (128-bit) vectorized
memory loads/stores. Host-side dispatch routes to the vec4 kernel when
ncols is divisible by 4 and strides are 16-byte aligned; otherwise it falls
back to the original rms_norm_f32 kernel, which is left untouched.
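The dispatch condition can be sketched as follows. This is a hypothetical illustration, not the actual code: the function and kernel names, launch configuration, and `block_reduce` details are assumptions.

```cuda
// Hypothetical host-side dispatch sketch. Names and launch parameters
// are illustrative; only the routing condition mirrors the description above.
static void rms_norm_f32_cuda(const float * x, float * dst,
                              const int ncols, const int nrows,
                              const float eps, cudaStream_t stream) {
    const dim3 block_dims(256, 1, 1);
    // float4 (128-bit) accesses require ncols % 4 == 0 and 16-byte-aligned
    // row pointers; anything else falls back to the scalar kernel.
    const bool can_vec4 = ncols % 4 == 0 &&
                          (uintptr_t) x   % 16 == 0 &&
                          (uintptr_t) dst % 16 == 0;
    if (can_vec4) {
        rms_norm_f32_vec4<<<nrows, block_dims, 0, stream>>>(x, dst, ncols, eps);
    } else {
        rms_norm_f32<<<nrows, block_dims, 0, stream>>>(x, dst, ncols, eps);
    }
}
```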
A separate kernel is used instead of a runtime branch inside the existing
kernel: the branch raises register pressure and pollutes the instruction
cache, degrading the scalar path (~22% measured regression with a runtime if).
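A minimal sketch of what the vectorized kernel looks like, assuming one block per row and a block-wide reduction helper (`block_reduce_sum` is a placeholder, not a function from the source):

```cuda
// Illustrative float4 RMS norm kernel; structure and names are assumptions.
__global__ void rms_norm_f32_vec4(const float * x, float * dst,
                                  const int ncols, const float eps) {
    const int row = blockIdx.x;
    const float4 * x4   = (const float4 *) (x   + (size_t) row * ncols);
    float4       * dst4 = (float4       *) (dst + (size_t) row * ncols);
    const int ncols4 = ncols / 4;  // dispatch guarantees ncols % 4 == 0

    float sum = 0.0f;
    for (int i = threadIdx.x; i < ncols4; i += blockDim.x) {
        const float4 v = x4[i];  // one 128-bit load instead of four 32-bit loads
        sum += v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w;
    }
    sum = block_reduce_sum(sum);  // assumed block-wide sum-reduction helper

    const float scale = rsqrtf(sum / ncols + eps);
    for (int i = threadIdx.x; i < ncols4; i += blockDim.x) {
        float4 v = x4[i];
        v.x *= scale; v.y *= scale; v.z *= scale; v.w *= scale;
        dst4[i] = v;  // one 128-bit store
    }
}
```

Keeping this as its own kernel means the scalar kernel's register allocation and code size are unaffected, which is what preserves the fallback numbers below.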
Performance (A100, nrows=512, test-backend-ops perf, 5-run avg):
[512,512]: 427 -> 624 GB/s (+46%)
[768,512]: 626 -> 850 GB/s (+36%)
[1024,512]: 495 -> 645 GB/s (+30%)
[2048,512]: 911 -> 1171 GB/s (+28%)
[3072,512]: 1220 -> 1490 GB/s (+22%)
[5120,512]: 1668 -> 1815 GB/s (+9%)
Scalar fallback [4097,512]: 1476 -> 1471 GB/s (within noise, no regression)
Correctness: RMS_NORM 17/17, RMS_NORM_MUL_ADD 30/30,
ADD_RMS_NORM 25/25, RMS_NORM_MUL_ROPE 72/72 passed.