* K quant speedup (#20)
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* no gibberish, all k quants added, merged
* vec memory fix
* q6_k matching metal on my machine, tests passing
* Set tile size for q6_k separately
* Separate out fast shaders
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* Move towards writeBuffer for params
* Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups
* Remove extra file
* Formatting
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* ggml-webgpu: fix workgroup dispatch limit for large batch sizes
WebGPU limits workgroup sizes to 65535 per dimension. Large MUL_MAT
operations with batch sizes exceedeing this limi would fail.
* add compute_2d_workgroups() helper to split total workgroup ID across
X/Y dimensions
* update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D
dispatch
* update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID
from 2D dispatch
* update mul_mat.wgsl to compute global index from 2D workgroup
coordinates
* refactor all three mul_mat dispatch paths to use the shared helper
* ggml-webgpu: add bounds checking for over-dispatched workgroups
2D workgroup dispatch can over-dispatch when total workgroups don't
divide evenly into the 65535 per-dimension limit. Extra workgroups
would compute invalid batch indices, causing memory corruption.
* add batch_idx bound check to mul_mat_reg_tile.wgsl and
mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups
from accessing invalid memory
* fixes test failures with large batch sizes (eg., bs=[128, 1024])
* ggml-webgpu: add back TODO for spliting large sizes into batches
* Optimize 2d workgroup provisioning
* Set some parameters that increase speed
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* flashattention and matrix multiplication moved to new format
* clean up preprocessing
* Formatting
* remove duplicate constants
* Split large shaders into multiple static strings
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
2026-02-18 07:51:02 -07:00
Renamed from ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.tmpl.wgsl (Browse further)