PR #20443 removed redundant state transposes from the graph and updated the autoregressive shader to use col*S_V+i (coalesced) instead of i*S_V+col (strided). The chunked inter kernel was not updated, causing uncoalesced state reads and a ~8% PP regression. Fix state_in load and final_out write to match the new layout. h_snapshots (h_out/h_in) are internal scratch and keep their existing layout since inter and output kernels agree. PP-512: 202 → 218 t/s. 16/16 tests pass. |
||
|---|---|---|
| .. | ||
| cmake | ||
| include | ||
| src | ||
| .gitignore | ||
| CMakeLists.txt | ||