Load both s_w and s_kg before the first barrier instead of using separate barriers for each. Reduces per-token barriers from 3 to 2, eliminating 64 barriers per chunk. GDN per-op: 6818 → 5205 µs (-23.6%). 16/16 tests pass.

| Name |
|---|
| cmake |
| include |
| src |
| .gitignore |
| CMakeLists.txt |