Commit Graph

4 Commits

Author SHA1 Message Date
Progeny Alpha 530e5bb117 vulkan: fuse w/k_gated broadcasts in chunked inter kernel
Load both s_w and s_kg before the first barrier instead of using
separate barriers for each. Reduces per-token barriers from 3 to 2,
eliminating 64 barriers per chunk.

GDN per-op: 6818 → 5205 µs (-23.6%). 16/16 tests pass.
2026-03-14 22:32:46 -04:00
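
The barrier arithmetic in 530e5bb117 can be checked with a short sketch. It assumes a chunk holds 64 tokens, which is inferred from "64 barriers per chunk" at one barrier saved per token and is not stated in the message. The function and constant names are hypothetical, and only the per-token barriers are counted; any barriers outside the per-token loop are ignored.

```python
def barriers_per_chunk(barriers_per_token: int, tokens_per_chunk: int) -> int:
    """Count the barriers issued by the per-token loop over one chunk."""
    return barriers_per_token * tokens_per_chunk


# Assumption: 64 tokens per chunk, inferred from the commit message.
TOKENS_PER_CHUNK = 64

before = barriers_per_chunk(3, TOKENS_PER_CHUNK)  # separate s_w and s_kg barriers
after = barriers_per_chunk(2, TOKENS_PER_CHUNK)   # both loads ahead of the first barrier
saved = before - after

print(f"before={before} after={after} saved={saved}")
```

Under that assumption the per-token loop drops from 192 to 128 barriers per chunk, matching the 64 quoted in the message.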
Progeny Alpha e22c2b2c85 vulkan: clean up chunked GDN shaders for PR review
Remove verbose algorithm comments, section dividers, stale inline
constant annotations, and unused extensions. Match llama.cpp codebase
style (minimal comments, no section decorators).

No functional changes. 16/16 tests pass.
2026-03-14 03:49:27 -04:00
Progeny Alpha d2fabedf09 vulkan: fix chunked inter kernel state layout for PR #20443
PR #20443 removed redundant state transposes from the graph and updated
the autoregressive shader to use col*S_V+i (coalesced) instead of
i*S_V+col (strided). The chunked inter kernel was not updated, causing
uncoalesced state reads and a ~8% PP regression.

Fix state_in load and final_out write to match the new layout.
h_snapshots (h_out/h_in) are internal scratch and keep their existing
layout since inter and output kernels agree.

PP-512: 202 → 218 t/s. 16/16 tests pass.
2026-03-13 23:34:59 -04:00
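
The two index formulas named in d2fabedf09 are transposes of each other, which a small sketch makes concrete. It assumes a square S_V x S_V state matrix stored in a flat buffer and uses a toy S_V of 4; the helper names are hypothetical and this is Python index arithmetic, not the shader code. Whether unit stride in i actually yields coalesced access depends on how invocations map to i and col, which the message does not spell out.

```python
S_V = 4  # toy size for illustration; the real S_V is not given in the message


def idx_new(col: int, i: int, s_v: int = S_V) -> int:
    """Layout after PR #20443: col*S_V+i."""
    return col * s_v + i


def idx_old(col: int, i: int, s_v: int = S_V) -> int:
    """Layout before PR #20443: i*S_V+col."""
    return i * s_v + col


def stride_in_i(index_fn, col: int = 0, s_v: int = S_V) -> int:
    """Distance in elements between the addresses for i and i+1."""
    return index_fn(col, 1, s_v) - index_fn(col, 0, s_v)


def write_new_layout(s_v: int = S_V) -> list:
    """Fill a flat buffer in the new layout, tagging each slot with (col, i)."""
    buf = [None] * (s_v * s_v)
    for col in range(s_v):
        for i in range(s_v):
            buf[idx_new(col, i, s_v)] = (col, i)
    return buf


buf = write_new_layout()
print("stride in i, new layout:", stride_in_i(idx_new))
print("stride in i, old layout:", stride_in_i(idx_old))
# A reader still using the old formula sees the transposed element.
print("old-formula read of (col=1, i=2):", buf[idx_old(1, 2)])
```

Stepping i moves one element in the new layout and S_V elements in the old one. A kernel that keeps the old formula against a buffer in the new layout addresses element (i, col) where it means (col, i), which is why state_in and final_out had to follow the graph change.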
Progeny Alpha 949a7e86d3 vulkan: add chunked parallel kernel infrastructure for GATED_DELTA_NET
Three-dispatch chunked pipeline for prompt processing acceleration:
intra-chunk WY decomposition, inter-chunk state propagation, output
combination. Currently disabled (threshold=UINT32_MAX).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 21:45:42 -04:00
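
How a threshold of UINT32_MAX keeps the chunked path disabled in 949a7e86d3 can be sketched as follows. The comparison (`n_tokens > threshold`), the function name, and the stage labels are assumptions made for illustration; the message only says that three dispatches exist and that the threshold disables them.

```python
UINT32_MAX = 2**32 - 1


def select_path(n_tokens: int, threshold: int = UINT32_MAX) -> list:
    """Return the dispatch sequence a token count would take (hypothetical logic)."""
    if not 0 <= n_tokens <= UINT32_MAX:
        raise ValueError("n_tokens must fit in an unsigned 32-bit integer")
    # Assumption: the chunked path is taken only above the threshold.
    if n_tokens > threshold:
        return ["intra", "inter", "output"]
    return ["autoregressive"]


print(select_path(512))                # default threshold: chunked path never taken
print(select_path(512, threshold=64))  # a lowered threshold enables three dispatches
```

With this comparison no 32-bit token count can exceed the default threshold, so the three-dispatch pipeline stays dormant until the threshold is lowered.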