* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* Start work on removing parameter buffer pools
* Simplify and optimize further
* simplify profile futures
* Fix stride
* Try using a single command buffer per batch
* formatting
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for
old AOT template expansion
* K quant speedup (#20)
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* no gibberish, all k quants added, merged
* vec memory fix
* q6_k matching metal on my machine, tests passing
* Set tile size for q6_k separately
* Separate out fast shaders
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* Move towards writeBuffer for params
* Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups
* Remove extra file
* Formatting
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* ggml-webgpu: Add binary op support for overlapping and non-contiguous.
* Add newline to binary.wgsl
* Append the test of binary op for src overlapping to test_bin_bcast.
* Remove unnecessary newline.
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* flashattention and matrix multiplication moved to new format
* clean up preprocessing
* Formatting
* remove duplicate constants
* Split large shaders into multiple static strings
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool
* Free pools
* Cleanup
* More cleanup
* Run clang-format
* Fix arg-parser and tokenizer test errors that free an unallocated buffer
* Fix device lost callback to not print on device teardown
* Fix include and run clang-format
* remove unused unused
* Update binary ops
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* ggml webgpu: port binary operators to use pre-wgsl
* Add binary.wgsl: unified shader with conditionals for all 4 ops
* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor
* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)
* Update CMake to generate binary operator shaders at build time
* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling
* port binary operators from AOT to pre-wgsl JIT compilation
* add src1=dst overlap handling for binary ops
* use compile-time workgroup size defines instead of runtime overrides
* ggml-webgpu: complete overlap handling for binary ops
* add support for inplace & overlap case in binding setup
* restructure conditional logic to handle all overlap cases
* ensure all buffer bindings are correctly assigned for edge cases
* ggml-webgpu: remove unused binary overlap cases
Remove src0==src1 binary overlap case that never occurs in practice.
* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT
* remove unused src0==src1 and all-same variant
* refactor wgsl to eliminate duplication
* FlashAttention (#13)
* Add inplace softmax
* Move rms_norm to split row approach
* Update debug for supports_op
* clean up debug statements
* neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though
* neg passes backend test
* unary operators pass ggml tests
* rms_norm double declaration bug atoned
* abides by editor-config
* removed vestigial files
* fixed autoconfig
* All operators (inlcluding xielu) working
* removed unnecesarry checking if node->src[1] exists for unary operators
* responded and dealt with PR comments
* implemented REPL_Template support and removed bug in unary operators kernel
* formatted embed wgsl and ggml-webgpu.cpp
* Faster tensors (#8)
Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Refactored pipelines and workgroup calculations (#10)
* refactored pipelines
* refactored workgroup calculation
* removed commented out block of prior maps
* Clean up ceiling division pattern
---------
Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on flash attention
* Shader structure set up (many bugs still)
* debugging
* Working first test
* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32
* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling
* Start work on integrating pre-wgsl
* Separate structs/initial shader compilation library into separate files
* Work on compilation choices for flashattention
* Work on subgroup matrix/tile size portability
* subgroup size agnostic online softmax
* Cleanups, quantization types
* more cleanup
* fix wasm build
* Refactor flashattention to increase parallelism, use direct loads for KV in somce cases
* Checkpoint
* formatting
* Update to account for default kv cache padding
* formatting shader
* Add workflow for ggml-ci webgpu
* Try passing absolute path to dawn in ggml-ci
* Avoid error on device destruction, add todos for proper cleanup
* Fix unused warning
* Forgot one parameter unused
* Move some flashattn computation to f32 for correctness