Comprehensive documentation for PR #20377 covering architecture, benchmarks, perplexity (PPL) validation, per-kernel timing, and scaling analysis. Includes a side-by-side autoregressive vs. chunked comparison on 890M.