Add a separate rms_norm_f32_vec4 kernel using float4 (128-bit) vectorized
memory loads/stores. Host-side dispatch routes to the vec4 kernel when
ncols is divisible by 4 and strides are 16-byte aligned; otherwise it falls
back to the original rms_norm_f32 kernel, which is left untouched.
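The dispatch condition can be sketched as follows. This is a hypothetical illustration, not the actual code: the function and kernel names, launch configuration, and `block_reduce` details are assumptions.

```cuda
// Hypothetical host-side dispatch sketch. Names and launch parameters
// are illustrative; only the routing condition mirrors the description above.
static void rms_norm_f32_cuda(const float * x, float * dst,
                              const int ncols, const int nrows,
                              const float eps, cudaStream_t stream) {
    const dim3 block_dims(256, 1, 1);
    // float4 (128-bit) accesses require ncols % 4 == 0 and 16-byte-aligned
    // row pointers; anything else falls back to the scalar kernel.
    const bool can_vec4 = ncols % 4 == 0 &&
                          (uintptr_t) x   % 16 == 0 &&
                          (uintptr_t) dst % 16 == 0;
    if (can_vec4) {
        rms_norm_f32_vec4<<<nrows, block_dims, 0, stream>>>(x, dst, ncols, eps);
    } else {
        rms_norm_f32<<<nrows, block_dims, 0, stream>>>(x, dst, ncols, eps);
    }
}
```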
A separate kernel is used instead of a runtime branch inside the existing
kernel: the branch raises register pressure and pollutes the instruction
cache, degrading the scalar path (~22% measured regression with a runtime if).
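A minimal sketch of what the vectorized kernel looks like, assuming one block per row and a block-wide reduction helper (`block_reduce_sum` is a placeholder, not a function from the source):

```cuda
// Illustrative float4 RMS norm kernel; structure and names are assumptions.
__global__ void rms_norm_f32_vec4(const float * x, float * dst,
                                  const int ncols, const float eps) {
    const int row = blockIdx.x;
    const float4 * x4   = (const float4 *) (x   + (size_t) row * ncols);
    float4       * dst4 = (float4       *) (dst + (size_t) row * ncols);
    const int ncols4 = ncols / 4;  // dispatch guarantees ncols % 4 == 0

    float sum = 0.0f;
    for (int i = threadIdx.x; i < ncols4; i += blockDim.x) {
        const float4 v = x4[i];  // one 128-bit load instead of four 32-bit loads
        sum += v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w;
    }
    sum = block_reduce_sum(sum);  // assumed block-wide sum-reduction helper

    const float scale = rsqrtf(sum / ncols + eps);
    for (int i = threadIdx.x; i < ncols4; i += blockDim.x) {
        float4 v = x4[i];
        v.x *= scale; v.y *= scale; v.z *= scale; v.w *= scale;
        dst4[i] = v;  // one 128-bit store
    }
}
```

Keeping this as its own kernel means the scalar kernel's register allocation and code size are unaffected, which is what preserves the fallback numbers below.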
Performance (A100, nrows=512, test-backend-ops perf, 5-run avg):
[512,512]: 427 -> 624 GB/s (+46%)
[768,512]: 626 -> 850 GB/s (+36%)
[1024,512]: 495 -> 645 GB/s (+30%)
[2048,512]: 911 -> 1171 GB/s (+28%)
[3072,512]: 1220 -> 1490 GB/s (+22%)
[5120,512]: 1668 -> 1815 GB/s (+9%)
Scalar fallback [4097,512]: 1476 -> 1471 GB/s (within noise, no regression)
Correctness: RMS_NORM 17/17, RMS_NORM_MUL_ADD 30/30,
ADD_RMS_NORM 25/25, RMS_NORM_MUL_ROPE 72/72 passed.