* rename
* Refactor vector operations in vec_op_impl and vec_dot_product_impl for improved clarity and performance
* wip
* Enhance vector copy functions for improved performance and clarity in vec_ops.hpp
* wip
* wip
* wip
* Optimize vector dot product implementations for enhanced performance and efficiency
* Enhance flash attention implementation and type traits for improved vector operations and alignment checks
# Conflicts:
# ggml/src/ggml-qnn/npu/device/type_traits.cpp
* remove align
* wip
* Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs
* Revert "Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs"
This reverts commit 78cc24ed2285002ca29d6189fa61ba4ce24f8d16.
* Enhance flash attention implementation with type checks for tensor data types and improved constexpr usage
* wip
* opt mask calc
* Revert "opt mask calc"
This reverts commit bb1840876692a11511d5ab7828b8a707402e30b9.
* wip
* opt mul mat caching logic to add dst cache
* Revert "opt mul mat caching logic to add dst cache"
This reverts commit ab442fa9f763b3873c929936e4cb739cb1c83850.
* wip
* Refactor matrix multiplication implementation to include vector conversion and performance tracking
* wip
* wip
* wip
* create vec_ops.inl for more aggressive compiler inline
* wip
* refactor vector dot product implementations for improved readability and performance
* refactor vector conversion functions to use HVX_Vector_Dual for improved clarity and consistency
* wip
* wip
* wip
* implement row size caching logic and enhance type traits for F32 support
* refactor matrix multiplication functions to improve caching logic and simplify tensor alignment handling
* add vector zeroing functions for F32 and F16 types to optimize memory initialization
* Revert "add vector zeroing functions for F32 and F16 types to optimize memory initialization"
This reverts commit e374326dc74d049e6603e393ade418d9ef2b83f3.
* wip
* refactor alignment checks in dot product function to handle null pointers
* wip
* refactor load_block_generic and related functions for improved alignment handling
* wip
* refactor flash attention implementation and introduce type-erased dot function for improved type handling
* refactor dot product implementations for improved loop handling and clarity
* refactor thread_pool constructor to pre-allocate VTCM cache for each thread
* Revert "refactor thread_pool constructor to pre-allocate VTCM cache for each thread"
This reverts commit 00cdd3fa88d909feef44ddaa42095274b7627685.
* wip
* opt interfaces for tensor cleanup
* refactor mul_mat_impl to use aligned size for src0 row calculation
* refactor: update dequantized_row_size logic and add size alignment checks for tensors
* wip
* wip
* refactor: replace raw pointer initialization with invalid handle constants for better clarity
* wip
* feat: add mixed precision dot product implementation and function declaration
* feat: implement mixed precision vector dot product and conversion functions
* fix: update data type handling in matrix multiplication implementation
* fix: adjust row count handling in matrix multiplication implementation for accurate slicing
* fix: optimize matrix multiplication implementation by unroll loop
* update performance tracking for matrix multiplication implementation
* add fetching
* wip
* fix: support F16 * F32 multiplication in is_mul_mat_supported function
* fix: improve src0 fetching logic in vec_dot_product_mixed_impl for better alignment handling
* fix test failure for row width 67
* try fix failed test
* fix: rename aligned_address to align_down for clarity in vector alignment handling
* wip
* qnn fix: update device capabilities for quantized types in qnn-lib to improve compatibility
* fix test failure at width == 193
* fix: replace zero vector initialization with previous vector in mixed dot product implementation
* wip
* fix: improve handling of last vector in mixed dot product implementation
* wip
* wip
* wip
* wip
* Enhance mul_mat_f32 function to support quantized types and improve static assertions
* rename
* Refactor dequantization functions to use npu_device_fp16_t and improve type handling
* Optimize dequantization in dequantize_row_q8_0 by replacing qf32 multiplication with qf16
* Optimize dequantization in dequantize_row_q4_0 by replacing qf32 multiplication with qf16
* Add hvx_vsf_convert_vhf function for improved vector conversion
* add perf logs
* Refactor dequantize_row_q4_0 for alignment
* Update logging in supports_op_impl and supports_op to use ggml_op_desc for better clarity
* Add support for ROPE operation in NPU capabilities and related functions
* Implement ROPE operation in tensor and op_rope, including cache initialization and correction dimension calculations
* enable ROPE by adding operation validation
* add support to freq is null case
* wip
* Refactor rope_f32 to improve indexing by introducing total_planes calculation
* reformat
* Refactor rope_f32 to optimize data access patterns by introducing row and plane pointers
* Add performance tracking to rope_f32 function for enhanced profiling
* Refactor rope_f32 to use a templated implementation
* Refactor rope_impl to replace loop with memcpy for improved performance
* Refactor mul_mat_impl to support quantization as a template parameter
* wip
* wip
* Refactor rope_impl to optimize plane indexing in the processing loop
* Add aligned vector dot product implementation for mixed precision types
* wip
* Enhance matrix multiplication for F32 and F16 types with alignment checks
* Optimize vec_dot_product_mix_aligned_impl for improved performance with additional vector sums
* Add alignment checks for matrix multiplication and vector dot products
* Refactor matrix multiplication to use function pointers for improved readability and maintainability
* Fix alignment check in is_dot_product_aligned to ensure correct vector size handling
* Remove unused f16_to_f32_table parameter from quantization and dequantization functions
* wip
* Add L2 fetch for src1 plane rows in matrix multiplication implementation
* wip
* Refactor hvx_vsf_convert_vhf to accept an additional parameter for flexibility in vector multiplication
* Refactor vec_dot_product_mix_aligned_impl to improve variable naming for clarity
* Refactor load_dual_block_generic and dequantize_row_q4_0 to improve performance
* Refactor vector operation functions to improve clarity and consistency in variable usage
* wip
* wip
* Refactor dequantize_row_q4_0_impl for improved clarity and performance in vector operations
* wip
* Update load_dual_block_generic to use intrinsics
* Refactor load_dual_block_generic and load_qual_block_generic for improved performance and clarity
* wip
* wip
* Optimize dequantize_row_q8_0 for improved performance by unrolling for loop
* wip
* wip
* fix typo
* add qurt_thread
* add thread pool
* add thread_pool obj at device ctx
* wip
* small refactoring to fit the thread pool structure
* set start/end threads for add
* init thread pool
* fix thread creation
* split complete and pending signals
* opt mulmat
* wip
* 2 threads
* back to 4 threads
* use barrier
* remove some unnecessary package
* add multi thread support for mul mat
* wip
* use qurt_barrier_t instead of qurt_signal_t
* wip
* wip
* add log
* split qnn cmake config
* create function to calculate the start and end func
* wip
* fix comment
* fix comment
* fix comment
* wip
* fix typo