Load both s_w and s_kg before the first barrier instead of issuing a
separate barrier for each load. This reduces per-token barriers from
3 to 2, eliminating 64 barriers per chunk (one per token).
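The barrier arithmetic can be sketched on the host as below. This is an
illustration only, not the shader: BarrierCounter and both functions are
hypothetical names, the 64-token chunk size is inferred from the
"64 barriers per chunk" figure, and the placement of the final barrier
after the compute step is an assumption.

```cpp
#include <cstdint>

// Hypothetical stand-in for a workgroup barrier: it only counts calls.
struct BarrierCounter {
    uint32_t count = 0;
    void barrier() { ++count; }
};

// Before: each shared-memory load is followed by its own barrier.
inline uint32_t barriers_separate_loads(uint32_t tokens) {
    BarrierCounter c;
    for (uint32_t t = 0; t < tokens; ++t) {
        /* load s_w  */ c.barrier();
        /* load s_kg */ c.barrier();
        /* compute (assumed to end with a barrier) */ c.barrier();
    }
    return c.count;
}

// After: both loads are issued first and share a single barrier.
inline uint32_t barriers_merged_loads(uint32_t tokens) {
    BarrierCounter c;
    for (uint32_t t = 0; t < tokens; ++t) {
        /* load s_w and s_kg */ c.barrier();
        /* compute (assumed to end with a barrier) */ c.barrier();
    }
    return c.count;
}
```

Under these assumptions a 64-token chunk goes from 192 barriers to 128.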
GDN per-op: 6818 → 5205 µs (-23.7%). 16/16 tests pass.

PR #20443 removed redundant state transposes from the graph and updated
the autoregressive shader to index state as col*S_V+i (coalesced)
instead of i*S_V+col (strided). The chunked inter kernel was not
updated, which caused uncoalesced state reads and a ~8% PP regression.
Fix the state_in load and final_out write to match the new layout.
The h_snapshots buffers (h_out/h_in) are internal scratch and keep
their existing layout, since the inter and output kernels already
agree on it.
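A rough host-side sketch of why the two index expressions behave
differently. It assumes neighbouring invocations differ in i while col
is fixed, which is what the "coalesced" label implies. The function
names are hypothetical, and any concrete S_V value passed in is
illustrative only.

```cpp
#include <cstddef>

// New layout: element (i, col) is stored at col*S_V + i.
constexpr std::size_t idx_new(std::size_t i, std::size_t col, std::size_t S_V) {
    return col * S_V + i;
}

// Old layout: element (i, col) is stored at i*S_V + col.
constexpr std::size_t idx_old(std::size_t i, std::size_t col, std::size_t S_V) {
    return i * S_V + col;
}

// Distance in elements between the addresses touched by invocations
// i and i+1 for the same col (assumption: i is the fast-varying lane index).
constexpr std::size_t lane_stride_new(std::size_t i, std::size_t col, std::size_t S_V) {
    return idx_new(i + 1, col, S_V) - idx_new(i, col, S_V);
}
constexpr std::size_t lane_stride_old(std::size_t i, std::size_t col, std::size_t S_V) {
    return idx_old(i + 1, col, S_V) - idx_old(i, col, S_V);
}
```

With the new layout, adjacent invocations touch adjacent elements
(stride 1). With the old layout they are S_V elements apart.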
PP-512: 202 → 218 t/s. 16/16 tests pass.

Add a three-dispatch chunked pipeline to accelerate prompt processing:
intra-chunk WY decomposition, inter-chunk state propagation, and output
combination. Currently disabled (threshold=UINT32_MAX).
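One way a UINT32_MAX threshold can act as an off switch is sketched
below. The selection rule, the function name and the constant name are
all assumptions for illustration; the actual comparison used by the
backend is not shown in this message.

```cpp
#include <cstdint>

// Hypothetical constant mirroring "threshold=UINT32_MAX".
constexpr uint32_t kChunkedThreshold = UINT32_MAX;

// Assumed rule: take the chunked path only when the token count
// strictly exceeds the threshold. No uint32_t value exceeds
// UINT32_MAX, so the chunked path is never selected.
inline bool use_chunked_path(uint32_t n_tokens, uint32_t threshold) {
    return n_tokens > threshold;
}
```

Under this rule, lowering the threshold to a finite value would enable
the three dispatches for any prompt longer than that many tokens.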
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>