* rename
* Refactor vector operations in vec_op_impl and vec_dot_product_impl for improved clarity and performance
* wip
* Enhance vector copy functions for improved performance and clarity in vec_ops.hpp
* wip
* wip
* wip
* Optimize vector dot product implementations for enhanced performance and efficiency
* Enhance flash attention implementation and type traits for improved vector operations and alignment checks
* remove align
* wip
* Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs
* Revert "Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs"
This reverts commit 78cc24ed2285002ca29d6189fa61ba4ce24f8d16.
* Enhance flash attention implementation with type checks for tensor data types and improved constexpr usage
* wip
* opt mask calc
* Revert "opt mask calc"
This reverts commit bb1840876692a11511d5ab7828b8a707402e30b9.
* wip
* opt mul mat caching logic to add dst cache
* Revert "opt mul mat caching logic to add dst cache"
This reverts commit ab442fa9f763b3873c929936e4cb739cb1c83850.
* wip
* Refactor matrix multiplication implementation to include vector conversion and performance tracking
* wip
* wip
* wip
* create vec_ops.inl for more aggressive compiler inline
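
  For context, a minimal sketch of the `.inl` pattern (file contents and names are hypothetical; the real file targets Hexagon HVX intrinsics):

  ```cpp
  // vec_ops.inl -- included directly into each translation unit so the
  // compiler sees full definitions and can inline aggressively.
  #pragma once

  #if defined(__GNUC__) || defined(__clang__)
  #    define VEC_ALWAYS_INLINE inline __attribute__((always_inline))
  #else
  #    define VEC_ALWAYS_INLINE inline
  #endif

  // Scalar stand-in for the vectorized body in the real source.
  VEC_ALWAYS_INLINE float vec_dot_f32(const float * x, const float * y, int n) {
      float sum = 0.0f;
      for (int i = 0; i < n; ++i) {
          sum += x[i] * y[i];
      }
      return sum;
  }
  ```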
* wip
* refactor vector dot product implementations for improved readability and performance
* refactor vector conversion functions to use HVX_Vector_Dual for improved clarity and consistency
* wip
* wip
* wip
* implement row size caching logic and enhance type traits for F32 support
* refactor matrix multiplication functions to improve caching logic and simplify tensor alignment handling
* add vector zeroing functions for F32 and F16 types to optimize memory initialization
* Revert "add vector zeroing functions for F32 and F16 types to optimize memory initialization"
This reverts commit e374326dc74d049e6603e393ade418d9ef2b83f3.
* wip
* refactor alignment checks in dot product function to handle null pointers
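
  A sketch of the null-tolerant alignment check (helper name and the 128-byte width are assumptions mirroring the HVX vector size):

  ```cpp
  #include <cstdint>

  // Treat nullptr as aligned so an absent optional operand doesn't force the
  // dot product onto the slower unaligned path.
  constexpr uintptr_t kVectorAlignment = 128;

  inline bool is_addr_aligned(const void * p) {
      return p == nullptr || (reinterpret_cast<uintptr_t>(p) % kVectorAlignment) == 0;
  }
  ```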
* wip
* refactor load_block_generic and related functions for improved alignment handling
* wip
* refactor flash attention implementation and introduce type-erased dot function for improved type handling
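
  A sketch of the type-erased dot function idea (signature and names are assumptions):

  ```cpp
  #include <cstddef>

  // One function-pointer type for all element types: the caller resolves the
  // dtype once, after which flash attention works with raw bytes.
  using dot_func_t = float (*)(const void * x, const void * y, size_t count);

  template <typename T>
  float dot_impl(const void * x, const void * y, size_t count) {
      const T * a = static_cast<const T *>(x);
      const T * b = static_cast<const T *>(y);
      float sum = 0.0f;
      for (size_t i = 0; i < count; ++i) {
          sum += static_cast<float>(a[i]) * static_cast<float>(b[i]);
      }
      return sum;
  }

  // Selected once per tensor, e.g.: dot_func_t dot = dot_impl<float>;
  ```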
* refactor dot product implementations for improved loop handling and clarity
* refactor thread_pool constructor to pre-allocate VTCM cache for each thread
* Revert "refactor thread_pool constructor to pre-allocate VTCM cache for each thread"
This reverts commit 00cdd3fa88d909feef44ddaa42095274b7627685.
* wip
* opt interfaces for tensor cleanup
* refactor mul_mat_impl to use aligned size for src0 row calculation
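
  The aligned size presumably rounds each row up to the vector width; a sketch (names assumed):

  ```cpp
  #include <cstddef>

  // Round a row size up to the vector width so per-row offsets stay aligned;
  // `a` must be a power of two. 128 matches an HVX vector.
  constexpr size_t align_up(size_t n, size_t a) {
      return (n + a - 1) & ~(a - 1);
  }

  // e.g. const size_t src0_row_bytes = align_up(row_bytes, 128);
  ```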
* refactor: update dequantized_row_size logic and add size alignment checks for tensors
* wip
* wip
* refactor: replace raw pointer initialization with invalid handle constants for better clarity
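
  A sketch of the invalid-handle-constant idea (types and names are hypothetical):

  ```cpp
  #include <cstdint>

  // A named constant makes the "not yet opened" state explicit at every use
  // site, instead of a raw 0/nullptr whose meaning is implicit.
  using remote_handle_t = uint64_t;
  constexpr remote_handle_t kInvalidHandle = ~remote_handle_t(0);

  struct device_session {
      remote_handle_t handle = kInvalidHandle;  // was: handle = 0
      bool is_open() const { return handle != kInvalidHandle; }
  };
  ```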
* wip
* Documentation: Revised and further improved the Vulkan instructions for Linux users in build.md.
* Minor: Revise step 2 of the Vulkan instructions for Linux users in build.md
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan
* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly with GEMM (no need for im2col)
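
  A minimal scalar reference of the direct-GEMM formulation (names are hypothetical; stride 1 and no padding assumed):

  ```cpp
  // Each output element is a dot product over KW*KH*Cin inputs computed as an
  // implicit GEMM, so no im2col buffer is ever materialized.
  void conv2d_direct(const float * src, const float * kernel, float * dst,
                     int W, int H, int Cin, int Cout, int KW, int KH) {
      const int OW = W - KW + 1;
      const int OH = H - KH + 1;
      for (int oc = 0; oc < Cout; ++oc) {
          for (int oy = 0; oy < OH; ++oy) {
              for (int ox = 0; ox < OW; ++ox) {
                  float acc = 0.0f;
                  for (int ic = 0; ic < Cin; ++ic) {
                      for (int ky = 0; ky < KH; ++ky) {
                          for (int kx = 0; kx < KW; ++kx) {
                              acc += src[(ic * H + oy + ky) * W + (ox + kx)] *
                                     kernel[((oc * Cin + ic) * KH + ky) * KW + kx];
                          }
                      }
                  }
                  dst[(oc * OH + oy) * OW + ox] = acc;
              }
          }
      }
  }
  ```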
* test-backend-ops: adds test_case_ref to check the validity/performance of ops against reference implementations with different graphs; adds tests
* Performance fixes: minimized branch divergence, used collectives to eliminate redundant calculation, removed macros.
* Kernel shared memory size check
* Updates test-backend-ops to support graphs for performance measurement.
* Apple/Win32 compile errors fixed
* Subgroup size used to determine tile size -> fixes llvmpipe errors.
* Collectives disabled by default.
* Intel support is disabled as the performance is poor.
* Conv2d enabled for Intel with disabled collectives, disabled for Apple
* test-backend-ops modifications are reverted
* Trailing spaces and missing override fixed.
* Triggering pipeline relaunch.
* Code formatted with .clang-format.
* imatrix : allow processing multiple chunks per batch
* perplexity : simplify filling the batch
* imatrix : fix segfault when using a single chunk per batch
* imatrix : use GGUF to store imatrix data
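
  A hedged sketch of what storing imatrix data via GGUF could look like (the key and tensor names are assumptions, not the actual schema):

  ```cpp
  #include <cstring>

  #include "ggml.h"
  #include "gguf.h"

  // Write one sums tensor plus a metadata key into a GGUF file.
  void write_imatrix_gguf(const char * fname, const float * sums, int64_t ne, const char * dataset) {
      ggml_init_params ip = {
          /*.mem_size   =*/ 2 * ggml_tensor_overhead() + (size_t) ne * sizeof(float) + 128,
          /*.mem_buffer =*/ nullptr,
          /*.no_alloc   =*/ false,
      };
      ggml_context * ctx = ggml_init(ip);
      ggml_tensor  * t   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, ne);
      ggml_set_name(t, "blk.0.attn_q.weight.in_sum2");  // assumed naming scheme
      memcpy(t->data, sums, ne * sizeof(float));

      gguf_context * gctx = gguf_init_empty();
      gguf_set_val_str(gctx, "imatrix.dataset", dataset);  // assumed key
      gguf_add_tensor(gctx, t);
      gguf_write_to_file(gctx, fname, /*only_meta =*/ false);

      gguf_free(gctx);
      ggml_free(ctx);
  }
  ```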
* imatrix : fix conversion problems
* imatrix : use FMA and sort tensor names
* py : add requirements for legacy imatrix convert script
* perplexity : revert changes
* py : include imatrix converter requirements in toplevel requirements
* imatrix : avoid using designated initializers in C++
* imatrix : remove unused n_entries
* imatrix : allow loading mis-ordered tensors
Sums and counts tensors no longer need to be consecutive.
* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* quantize : use unused imatrix chunk_size with LLAMA_TRACE
* common : use GGUF for imatrix output by default
* imatrix : two-way conversion between old format and GGUF
* convert : remove imatrix to gguf python script
* imatrix : use the function name in more error messages
* imatrix : don't use FMA explicitly
This should make comparisons between the formats easier
because this matches the behavior of the previous version.
* imatrix : avoid returning from void function save_imatrix
* imatrix : support 3d tensors with MUL_MAT
* quantize : fix dataset name loading from gguf imatrix
* common : move string_remove_suffix from quantize and imatrix
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* imatrix : add warning when legacy format is written
* imatrix : warn when writing partial data, to help guess dataset coverage
Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read time for the new format,
and so should yield the same quality if the old format is still used.
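
  A sketch of the neutral-values idea (helper and values are assumptions):

  ```cpp
  #include <cstdint>
  #include <vector>

  // A tensor with no recorded activations gets uniform importance and a count
  // of one, matching the defaults applied at read time for the new format.
  void fill_missing_stats(std::vector<float> & sums, std::vector<int64_t> & counts, size_t ne) {
      if (sums.empty()) {
          sums.assign(ne, 1.0f);  // uniform importance: all weights treated equally
          counts.assign(ne, 1);   // one pseudo-observation avoids division by zero
      }
  }
  ```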
* imatrix : avoid loading model to convert or combine imatrix
* imatrix : avoid using imatrix.dat in README
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs
Gemma3n uses matrix-matrix addition as part of its input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when a batch
size of 1 is used.
* Exclude `project_per_layer_input` by matching node names
This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.
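
  A sketch of the name-matching condition (the surrounding heuristic is assumed):

  ```cpp
  #include <cstring>

  #include "ggml.h"

  // Only Gemma3n's per-layer input projection is excluded from the
  // batched-addition check that would otherwise disable CUDA graphs.
  static bool is_gemma3n_per_layer_input(const ggml_tensor * node) {
      return std::strncmp(node->name, "project_per_layer_input", 23) == 0;
  }
  ```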
* Revert unnecessary formatting changes
* Minimal setup of the WebGPU backend with Dawn; just prints out the adapter and segfaults
* Initialize webgpu device
* Making progress on setting up the backend
* Finish more boilerplate/utility functions
* Organize file and work on alloc buffer
* Add webgpu_context to prepare for actually running some shaders
* Work on memset and add shader loading
* Work on memset polyfill
* Implement set_tensor as WebGPU WriteBuffer; remove host_buffer stubs since WebGPU doesn't support them
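
  set_tensor maps naturally onto `Queue::WriteBuffer`, which stages and copies internally, so no host-visible buffer type is needed; a minimal wrapper sketch (names assumed):

  ```cpp
  #include <cstddef>
  #include <cstdint>

  #include <webgpu/webgpu_cpp.h>

  // Upload tensor bytes at a given offset within the backing buffer.
  void buffer_set_data(wgpu::Queue & queue, wgpu::Buffer & buf,
                       uint64_t offset, const void * data, size_t size) {
      queue.WriteBuffer(buf, offset, data, size);
  }
  ```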
* Implement get_tensor and buffer_clear
* Finish rest of setup
* Start work on compute graph
* Basic mat mul working
* Work on emscripten build
* Basic WebGPU backend instructions
* Use EMSCRIPTEN flag
* Work on passing CI, implement 4d tensor multiplication
* Pass thread safety test
* Implement permuting for mul_mat and cpy
* minor cleanups
* Address feedback
* Remove division by type size in cpy op
* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends
* Fix name
* Fix macos dawn prefix path
* Support diffusion models: Add Dream 7B
* Move diffusion to examples
* Move remaining code to examples; add a patch to avoid using the KV cache
* Address review comments
* Make sampling fast
* llama: remove diffusion functions
* Add basic timings + cleanup
* More cleanup
* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length
* fixup!
* Review: move everything to diffusion-cli for now