llama.cpp/ggml
Te-Hsiu Huang 0586379302 CUDA: add float4 vectorized load/store for rms_norm_f32
Add a separate rms_norm_f32_vec4 kernel using float4 (128-bit) vectorized
memory loads/stores. Host-side dispatch routes to the vec4 kernel when
ncols is divisible by 4 and strides are aligned; otherwise falls back to
the original rms_norm_f32 kernel, which is left unchanged.

A separate kernel is used instead of a runtime branch inside the existing
kernel to avoid the register pressure and instruction-cache pollution that
would degrade the scalar path (a runtime `if` showed a measured ~22% regression).

Performance (A100, nrows=512, test-backend-ops perf, 5-run avg):
  [512,512]:  427 -> 624 GB/s (+46%)
  [768,512]:  626 -> 850 GB/s (+36%)
  [1024,512]: 495 -> 645 GB/s (+30%)
  [2048,512]: 911 -> 1171 GB/s (+28%)
  [3072,512]: 1220 -> 1490 GB/s (+22%)
  [5120,512]: 1668 -> 1815 GB/s (+9%)
  Scalar fallback (4097,512): 1476 -> 1471 GB/s (no regression)

Correctness: RMS_NORM 17/17, RMS_NORM_MUL_ADD 30/30,
ADD_RMS_NORM 25/25, RMS_NORM_MUL_ROPE 72/72 passed.
2026-03-13 18:43:35 -07:00
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include llama : enable chunked fused GDN path (#20340) 2026-03-11 22:46:40 +02:00
src CUDA: add float4 vectorized load/store for rms_norm_f32 2026-03-13 18:43:35 -07:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt ggml : bump version to 0.9.7 (ggml/1425) 2026-02-15 22:24:29 +02:00