* musa: apply mublas API changes
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: update musa version to 4.2.0
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: restore MUSA graph settings in CMakeLists.txt
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: disable mudnnMemcpyAsync by default
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: switch back to non-mudnn images
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* minor changes
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: restore rc in docker image tag
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* CMake config: Create target only once
Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.
* CMake config: Add CUDA link libs
* CMake config: Add OpenCL link libs
* CMake config: Use canonical find_dependency
Use set and append to control link lib variables.
Apply $<LINK_ONLY:...> in more places.
* CMake config: Wire OpenMP dependency
This commit removes the inclusion of `<cstdlib>`.
The motivation is that this source file does not appear to use any
functions from that header, and the comment about `qsort` is misleading.
MiniCPM models use the llm_build_granite constructor, which was changed
in the Granite Four PR to use hparams.rope_finetuned instead of a
use_rope parameter. MiniCPM models need RoPE enabled by default.
This fixes inference, which previously produced gibberish instead of correct responses.
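As a minimal sketch of that change (assumed structure; the real code lives in llm_build_granite and llama_hparams, the types below are stand-ins), the builder's RoPE switch now comes from hparams, so MiniCPM has to set rope_finetuned explicitly:

```cpp
// Sketch only, not the actual llama.cpp code: the Granite-style builder now
// derives its RoPE switch from hparams instead of a use_rope argument, so
// MiniCPM must default rope_finetuned to true.
#include <cstdio>

struct hparams_sketch {
    bool rope_finetuned = false; // generic default; MiniCPM must override this
};

struct granite_builder_sketch {
    bool use_rope;
    // previously: granite_builder_sketch(bool use_rope); now read from hparams
    explicit granite_builder_sketch(const hparams_sketch & hp) : use_rope(hp.rope_finetuned) {}
};

int main() {
    hparams_sketch minicpm;
    minicpm.rope_finetuned = true; // restore RoPE for MiniCPM by default
    granite_builder_sketch builder(minicpm);
    std::printf("use_rope = %d\n", builder.use_rope ? 1 : 0);
    return 0;
}
```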
* weight format to NZ for 310P
* do not convert quantized weights to NZ format
* clean code
* fix
* make the conditions for converting weights to NZ format consistent
* clean code
* rename
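One way to keep those conditions consistent is a single predicate shared by every conversion site. The helper below is a hypothetical sketch, not the CANN backend's actual check; the 310P gate, the quantization exclusion, and the shape test are illustrative assumptions:

```cpp
// Hypothetical helper: one place decides whether a weight is converted to NZ,
// so all call sites stay consistent. The conditions here are illustrative.
#include <cstdint>

enum class soc_kind { ascend_310p, other };

struct weight_info {
    int64_t ne0 = 0;          // elements per row
    int64_t ne1 = 0;          // number of rows
    bool    quantized = false;
};

static bool should_convert_to_nz(const weight_info & w, soc_kind soc) {
    if (soc != soc_kind::ascend_310p) return false; // NZ path only targets 310P
    if (w.quantized)                  return false; // quantized weights keep the ND layout
    return w.ne0 > 0 && w.ne1 > 0;                  // placeholder shape check
}
```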
* Refactor vector operations in vec_op_impl and vec_dot_product_impl for improved clarity and performance
* wip
* Enhance vector copy functions for improved performance and clarity in vec_ops.hpp
* wip
* wip
* wip
* Optimize vector dot product implementations for enhanced performance and efficiency
* Enhance flash attention implementation and type traits for improved vector operations and alignment checks
# Conflicts:
# ggml/src/ggml-qnn/npu/device/type_traits.cpp
* remove align
* wip
* Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs
* Revert "Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs"
This reverts commit 78cc24ed2285002ca29d6189fa61ba4ce24f8d16.
* Enhance flash attention implementation with type checks for tensor data types and improved constexpr usage
* wip
* opt mask calc
* Revert "opt mask calc"
This reverts commit bb1840876692a11511d5ab7828b8a707402e30b9.
* wip
* opt mul mat caching logic to add dst cache
* Revert "opt mul mat caching logic to add dst cache"
This reverts commit ab442fa9f763b3873c929936e4cb739cb1c83850.
* wip
* Refactor matrix multiplication implementation to include vector conversion and performance tracking
* wip
* wip
* wip
* create vec_ops.inl for more aggressive compiler inlining
* wip
* refactor vector dot product implementations for improved readability and performance
* refactor vector conversion functions to use HVX_Vector_Dual for improved clarity and consistency
* wip
* wip
* wip
* implement row size caching logic and enhance type traits for F32 support
* refactor matrix multiplication functions to improve caching logic and simplify tensor alignment handling
* add vector zeroing functions for F32 and F16 types to optimize memory initialization
* Revert "add vector zeroing functions for F32 and F16 types to optimize memory initialization"
This reverts commit e374326dc74d049e6603e393ade418d9ef2b83f3.
* wip
* refactor alignment checks in dot product function to handle null pointers
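A sketch of the kind of alignment check meant here, under the assumption of 128-byte HVX vectors; treating a null (optional) operand as aligned is also an assumption of this sketch, the point being that an absent operand should not force the unaligned path:

```cpp
// Sketch only: an "is everything vector-aligned?" check that tolerates optional
// (null) operands such as a missing mask pointer.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorAlign = 128; // HVX vector width in bytes (assumed)

static inline bool is_addr_aligned(const void * p, std::size_t align = kVectorAlign) {
    // a null operand never participates, so it does not break alignment
    return p == nullptr || (reinterpret_cast<std::uintptr_t>(p) % align) == 0;
}

template <typename... Ptrs>
static inline bool all_aligned(Ptrs... ptrs) {
    return (is_addr_aligned(ptrs) && ...); // C++17 fold over every operand
}
```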
* wip
* refactor load_block_generic and related functions for improved alignment handling
* wip
* refactor flash attention implementation and introduce type-erased dot function for improved type handling
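The type-erased dot function can be read as one function-pointer signature over raw bytes, selected once per tensor type. The sketch below assumes that shape and uses hypothetical names; it is not the actual vec_ops/type_traits code:

```cpp
// Sketch of a type-erased dot product: every element type exposes the same
// signature over void*, so flash attention can pick the kernel once per tensor
// type instead of templating the whole pipeline.
#include <cstddef>

using dot_fn_t = float (*)(const void * x, const void * y, std::size_t n);

static float dot_f32(const void * x, const void * y, std::size_t n) {
    const float * a = static_cast<const float *>(x);
    const float * b = static_cast<const float *>(y);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// a dot_f16 with the same signature would decode half-precision lanes first,
// then accumulate in f32, so callers never see the element type
enum elem_type { ELEM_F32 = 0, ELEM_COUNT };

static constexpr dot_fn_t kDotTable[ELEM_COUNT] = { dot_f32 };

static inline float typed_dot(elem_type type, const void * x, const void * y, std::size_t n) {
    return kDotTable[type](x, y, n); // one indirect call, no templates at the call site
}
```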
* refactor dot product implementations for improved loop handling and clarity
* refactor thread_pool constructor to pre-allocate VTCM cache for each thread
* Revert "refactor thread_pool constructor to pre-allocate VTCM cache for each thread"
This reverts commit 00cdd3fa88d909feef44ddaa42095274b7627685.
* wip
* opt interfaces for tensor cleanup
* refactor mul_mat_impl to use aligned size for src0 row calculation
* refactor: update dequantized_row_size logic and add size alignment checks for tensors
* wip
* wip
* refactor: replace raw pointer initialization with invalid handle constants for better clarity
* wip
* Mtmd: add a way to select device for vision encoder
* simplify
* format
* Warn user if manual device selection failed
* initialize backend to nullptr
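The selection logic, sketched with stand-in types (the real code goes through ggml's backend device registry and the mtmd context setup): honor an explicit device name when it matches, otherwise warn and fall back to the default:

```cpp
// Sketch only: pick the user-requested device for the vision encoder,
// warn and fall back to the default device (nullptr here) if the name
// does not match anything.
#include <cstdio>
#include <string>
#include <vector>

struct device_stub { std::string name; };

static const device_stub * pick_vision_device(const std::vector<device_stub> & devices,
                                              const std::string & requested) {
    if (requested.empty()) {
        return nullptr; // no manual selection: let the backend choose
    }
    for (const auto & dev : devices) {
        if (dev.name == requested) {
            return &dev;
        }
    }
    std::fprintf(stderr, "warning: vision encoder device '%s' not found, using default\n",
                 requested.c_str());
    return nullptr; // fall back instead of failing hard
}
```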
* Documentation: Revised and further improved the Vulkan instructions for Linux users in build.md.
* Minor: Revise step 2 of the Vulkan instructions for Linux users in build.md
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan
* ggml-vulkan: adds an f32 scalar shader that computes 2D convolution directly with GEMM (no need for im2col)
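The idea behind the im2col-free shader, as a scalar CPU reference (layouts, stride-1/no-padding, and names are assumptions of this sketch, not the shader's actual layout): treat CONV_2D as a GEMM whose K dimension runs over (Cin, KH, KW) and gather the input operand on the fly:

```cpp
// Reference sketch: direct 2D convolution written as an implicit GEMM.
// A = kernel viewed as [Cout x (Cin*KH*KW)], B = input gathered on the fly as
// [(Cin*KH*KW) x (OH*OW)]; no im2col buffer is ever built.
#include <vector>

struct conv2d_dims {
    int Cin, H, W;      // input channels and spatial size
    int Cout, KH, KW;   // output channels and kernel size
    int OH, OW;         // output spatial size (stride 1, no padding assumed)
};

static void conv2d_implicit_gemm(const std::vector<float> & input,   // [Cin][H][W]
                                 const std::vector<float> & kernel,  // [Cout][Cin][KH][KW]
                                 std::vector<float> & output,        // [Cout][OH][OW]
                                 const conv2d_dims & d) {
    for (int oc = 0; oc < d.Cout; ++oc) {
        for (int oy = 0; oy < d.OH; ++oy) {
            for (int ox = 0; ox < d.OW; ++ox) {
                float acc = 0.0f;
                // the "K" loop of the GEMM: (ic, ky, kx) flattened
                for (int ic = 0; ic < d.Cin; ++ic) {
                    for (int ky = 0; ky < d.KH; ++ky) {
                        for (int kx = 0; kx < d.KW; ++kx) {
                            const float a = kernel[((oc * d.Cin + ic) * d.KH + ky) * d.KW + kx];
                            const float b = input[(ic * d.H + (oy + ky)) * d.W + (ox + kx)];
                            acc += a * b;
                        }
                    }
                }
                output[(oc * d.OH + oy) * d.OW + ox] = acc;
            }
        }
    }
}
```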
* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations that use different graphs; adds tests
* Performance fixes: minimized branch divergence, used collectives to
eliminate redundant calculation, removed macros.
* Kernel shared memory size check
* Updates test-backend-ops to support graphs for performance
measurement.
* Apple/Win32 compile errors fixed
* Subgroup size used to determine tile size -> fixes llvmpipe errors.
* Collectives disabled by default.
* Intel support is disabled as the performance is poor.
* Conv2d enabled for Intel with collectives disabled, disabled for Apple
* test-backend-ops modifications are reverted
* Trailing spaces and missing override fixed.
* Triggering pipeline relaunch.
* Code formatted with .clang-format.
* imatrix : allow processing multiple chunks per batch
* perplexity : simplify filling the batch
* imatrix : fix segfault when using a single chunk per batch
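The multiple-chunks-per-batch change can be pictured as follows; this is a sketch with assumed names and callback shape, not the imatrix code. A batch may now span several chunks, and a batch smaller than chunk_size must still produce exactly one (partial) chunk:

```cpp
// Sketch: split the tokens of one batch into imatrix chunks. A batch may cover
// several chunks; the single-chunk case takes the same path without reading
// past the end of the batch.
#include <algorithm>
#include <cstdio>

template <typename Fn>
static void for_each_chunk(int n_tokens, int chunk_size, Fn && process) {
    for (int start = 0; start < n_tokens; start += chunk_size) {
        const int len = std::min(chunk_size, n_tokens - start);
        process(start, len); // [start, start + len) within the batch
    }
}

int main() {
    // one batch of 1536 tokens with 512-token chunks -> 3 chunks
    for_each_chunk(1536, 512, [](int start, int len) {
        std::printf("chunk at %d, %d tokens\n", start, len);
    });
    return 0;
}
```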
* imatrix : use GGUF to store imatrix data
* imatrix : fix conversion problems
* imatrix : use FMA and sort tensor names
* py : add requirements for legacy imatrix convert script
* perplexity : revert changes
* py : include imatrix converter requirements in toplevel requirements
* imatrix : avoid using designated initializers in C++
* imatrix : remove unused n_entries
* imatrix : allow loading mis-ordered tensors
Sums and counts tensors no longer need to be consecutive.
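Allowing mis-ordered tensors amounts to pairing sums and counts by base name instead of by position. The sketch below assumes the `.in_sum2`/`.counts` suffix convention and a simple map; it is not the loader itself:

```cpp
// Sketch: collect sums/counts pairs keyed by the base tensor name, so the two
// tensors for a weight do not have to be adjacent in the file.
#include <map>
#include <string>

struct tensor_ref { const void * data = nullptr; };

struct imatrix_entry {
    tensor_ref sums;   // summed squared activations
    tensor_ref counts; // how many activations contributed
};

static void collect_entry(std::map<std::string, imatrix_entry> & entries,
                          const std::string & name, tensor_ref t) {
    const std::string sum_sfx = ".in_sum2"; // suffix names are assumptions
    const std::string cnt_sfx = ".counts";
    if (name.size() > sum_sfx.size() &&
        name.compare(name.size() - sum_sfx.size(), sum_sfx.size(), sum_sfx) == 0) {
        entries[name.substr(0, name.size() - sum_sfx.size())].sums = t;
    } else if (name.size() > cnt_sfx.size() &&
               name.compare(name.size() - cnt_sfx.size(), cnt_sfx.size(), cnt_sfx) == 0) {
        entries[name.substr(0, name.size() - cnt_sfx.size())].counts = t;
    }
    // entries with a missing half can be reported after the whole file is read
}
```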
* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* quantize : use unused imatrix chunk_size with LLAMA_TRACE
* common : use GGUF for imatrix output by default
* imatrix : two-way conversion between old format and GGUF
* convert : remove imatrix to gguf python script
* imatrix : use the function name in more error messages
* imatrix : don't use FMA explicitly
This should make comparisons between the formats easier
because this matches the behavior of the previous version.
* imatrix : avoid returning from void function save_imatrix
* imatrix : support 3d tensors with MUL_MAT
* quantize : fix dataset name loading from gguf imatrix
* common : move string_remove_suffix from quantize and imatrix
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* imatrix : add warning when legacy format is written
* imatrix : warn when writing partial data, to help guess dataset coverage
Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.
* imatrix : avoid loading model to convert or combine imatrix
* imatrix : avoid using imatrix.dat in README
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>