llama.cpp/ggml
nullname a29243e7a4
feat: perf opt quant (#47)
* feat: add mixed precision dot product implementation and function declaration

* feat: implement mixed precision vector dot product and conversion functions
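
For context, a scalar reference of what a mixed-precision (F16 × F32) dot product computes: decode the IEEE fp16 bits to float, then accumulate against the f32 operand. This is an illustrative sketch, not the actual HVX implementation; all names here are hypothetical.

```cpp
#include <cstdint>
#include <cmath>

// Decode IEEE-754 binary16 bits to float (handles subnormals, inf, NaN).
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (h >> 15) & 1;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t mant = h & 0x3FF;
    float v;
    if (exp == 0)       v = std::ldexp((float) mant, -24);            // subnormal
    else if (exp == 31) v = mant ? NAN : INFINITY;                    // inf/NaN
    else                v = std::ldexp((float)(mant | 0x400), (int) exp - 25);
    return sign ? -v : v;
}

// Scalar reference for the mixed-precision dot product.
static float vec_dot_f16_f32_ref(const uint16_t * x, const float * y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += fp16_to_fp32(x[i]) * y[i];
    }
    return sum;
}
```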

* fix: update data type handling in matrix multiplication implementation

* fix: adjust row count handling in matrix multiplication implementation for accurate slicing

* fix: optimize matrix multiplication implementation by unrolling the loop
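
The unrolling idea can be sketched in scalar form: accumulate into independent partial sums so the compiler (or a VLIW scheduler) can overlap the multiply-accumulates. A minimal illustration, not the actual kernel:

```cpp
// 4-way unrolled dot product with independent accumulators plus a scalar tail.
static float dot_unrolled(const float * a, const float * b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
    return sum;
}
```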

* update performance tracking for matrix multiplication implementation

* add fetching

* wip

* fix: support F16 * F32 multiplication in is_mul_mat_supported function

* fix: improve src0 fetching logic in vec_dot_product_mixed_impl for better alignment handling

* fix test failure for row width 67

* try to fix the failing test

* fix: rename aligned_address to align_down for clarity in vector alignment handling
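
The renamed helper presumably rounds a pointer down to the previous vector boundary (128 bytes on HVX). A sketch under those assumptions; the real signature may differ:

```cpp
#include <cstdint>
#include <cstddef>

// Round addr down to the previous multiple of alignment (power of two assumed).
template <typename T>
static const T * align_down(const T * addr, size_t alignment) {
    return reinterpret_cast<const T *>(
        reinterpret_cast<uintptr_t>(addr) & ~(uintptr_t)(alignment - 1));
}
```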

* wip

* qnn fix: update device capabilities for quantized types in qnn-lib to improve compatibility

* fix test failure at width == 193

* fix: replace zero vector initialization with previous vector in mixed dot product implementation

* wip

* fix: improve handling of last vector in mixed dot product implementation

* wip

* wip

* wip

* wip

* Enhance mul_mat_f32 function to support quantized types and improve static assertions

* rename

* Refactor dequantization functions to use npu_device_fp16_t and improve type handling

* Optimize dequantization in dequantize_row_q8_0 by replacing qf32 multiplication with qf16

* Optimize dequantization in dequantize_row_q4_0 by replacing qf32 multiplication with qf16
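
For reference, what Q4_0 dequantization computes in scalar form: each block carries one scale and 32 packed 4-bit quants offset by 8, with quant j and j+16 sharing a byte as low/high nibbles (the usual ggml layout). The scale is shown as a plain float for clarity; the real block stores fp16 bits. A hypothetical sketch, not the optimized HVX code:

```cpp
#include <cstdint>

constexpr int QK4_0 = 32;

// Simplified Q4_0 block: scale widened to float for illustration.
struct block_q4_0_ref {
    float   d;               // scale
    uint8_t qs[QK4_0 / 2];   // qs[j]: low nibble = quant j, high nibble = quant j+16
};

static void dequantize_row_q4_0_ref(const block_q4_0_ref * x, float * y, int k) {
    for (int i = 0; i < k / QK4_0; ++i) {
        const float d = x[i].d;
        for (int j = 0; j < QK4_0 / 2; ++j) {
            y[i * QK4_0 + j]             = ((x[i].qs[j] & 0x0F) - 8) * d;
            y[i * QK4_0 + j + QK4_0 / 2] = ((x[i].qs[j] >> 4)   - 8) * d;
        }
    }
}
```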

* Add hvx_vsf_convert_vhf function for improved vector conversion

* add perf logs

* Refactor dequantize_row_q4_0 for alignment

* Update logging in supports_op_impl and supports_op to use ggml_op_desc for better clarity

* Add support for ROPE operation in NPU capabilities and related functions

* Implement ROPE operation in tensor and op_rope, including cache initialization and correction dimension calculations
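
The rotation ROPE applies per dimension pair can be sketched in scalar form. This assumes the "normal" mode that rotates adjacent pairs with `theta_scale = freq_base^(-2/n_dims)`, matching the common GGML convention; it is a reference illustration, not the NPU implementation:

```cpp
#include <cmath>

// Rotate each (even, odd) pair of row by an angle that decays geometrically
// across dimensions; pos is the token position.
static void rope_ref(float * row, int n_dims, int pos, float freq_base) {
    const float theta_scale = std::pow(freq_base, -2.0f / n_dims);
    float theta = (float) pos;
    for (int i = 0; i < n_dims; i += 2) {
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = row[i], x1 = row[i + 1];
        row[i]     = x0 * c - x1 * s;
        row[i + 1] = x0 * s + x1 * c;
        theta *= theta_scale;
    }
}
```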

* enable ROPE by adding operation validation

* add support for the case where freq is null

* wip

* Refactor rope_f32 to improve indexing by introducing total_planes calculation

* reformat

* Refactor rope_f32 to optimize data access patterns by introducing row and plane pointers

* Add performance tracking to rope_f32 function for enhanced profiling

* Refactor rope_f32 to use a templated implementation

* Refactor rope_impl to replace loop with memcpy for improved performance
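
The memcpy replacement likely concerns the un-rotated tail of each row: when `n_dims < ne0`, the remaining elements are a plain copy, so an element-by-element loop can become one `memcpy`. A hypothetical sketch of that step:

```cpp
#include <cstring>

// Copy the tail of a row that ROPE leaves untouched (elements n_dims..ne0-1).
static void copy_rope_tail(float * dst, const float * src, int n_dims, int ne0) {
    if (ne0 > n_dims) {
        std::memcpy(dst + n_dims, src + n_dims, (ne0 - n_dims) * sizeof(float));
    }
}
```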

* Refactor mul_mat_impl to support quantization as a template parameter

* wip

* wip

* Refactor rope_impl to optimize plane indexing in the processing loop

* Add aligned vector dot product implementation for mixed precision types

* wip

* Enhance matrix multiplication for F32 and F16 types with alignment checks

* Optimize vec_dot_product_mix_aligned_impl for improved performance with additional vector sums

* Add alignment checks for matrix multiplication and vector dot products

* Refactor matrix multiplication to use function pointers for improved readability and maintainability

* Fix alignment check in is_dot_product_aligned to ensure correct vector size handling
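
An alignment gate of this kind typically dispatches to the aligned fast path only when both row pointers sit on a vector boundary and the row width is a whole number of vectors. A sketch assuming 128-byte HVX vectors; the actual check may differ:

```cpp
#include <cstdint>
#include <cstddef>

// True when both operands and the row byte-width are vector-aligned.
static bool is_dot_product_aligned_ref(const void * a, const void * b,
                                       size_t row_bytes, size_t vec_bytes = 128) {
    return (reinterpret_cast<uintptr_t>(a) % vec_bytes) == 0 &&
           (reinterpret_cast<uintptr_t>(b) % vec_bytes) == 0 &&
           (row_bytes % vec_bytes) == 0;
}
```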

* Remove unused f16_to_f32_table parameter from quantization and dequantization functions

* wip

* Add L2 fetch for src1 plane rows in matrix multiplication implementation
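
On Hexagon this would be an `l2fetch` of the upcoming src1 rows so they are resident before the dot products need them; a portable stand-in using the GCC/Clang `__builtin_prefetch` builtin sketches the intent (the 64-byte line size is an assumption):

```cpp
#include <cstddef>

// Prefetch a row of floats, one cache line at a time, for read access
// with low temporal locality.
static void prefetch_row(const float * row, size_t n) {
    const char * p   = reinterpret_cast<const char *>(row);
    const char * end = p + n * sizeof(float);
    for (; p < end; p += 64) {
        __builtin_prefetch(p, /*rw=*/0, /*locality=*/1);
    }
}
```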

* wip

* Refactor hvx_vsf_convert_vhf to accept an additional parameter for flexibility in vector multiplication

* Refactor vec_dot_product_mix_aligned_impl to improve variable naming for clarity

* Refactor load_dual_block_generic and dequantize_row_q4_0 to improve performance

* Refactor vector operation functions to improve clarity and consistency in variable usage

* wip

* wip

* Refactor dequantize_row_q4_0_impl for improved clarity and performance in vector operations

* wip

* Update load_dual_block_generic to use intrinsics

* Refactor load_dual_block_generic and load_qual_block_generic for improved performance and clarity

* wip

* wip

* Optimize dequantize_row_q8_0 for improved performance by unrolling for loop
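
Q8_0 dequantization is a per-block scale times 32 signed 8-bit quants; the unrolling applies to the inner loop. A scalar reference with the scale widened to float for clarity (the real block stores fp16 bits); illustrative only:

```cpp
#include <cstdint>

constexpr int QK8_0 = 32;

// Simplified Q8_0 block: scale widened to float for illustration.
struct block_q8_0_ref {
    float  d;           // scale
    int8_t qs[QK8_0];   // quants
};

static void dequantize_row_q8_0_ref(const block_q8_0_ref * x, float * y, int k) {
    for (int i = 0; i < k / QK8_0; ++i) {
        const float d = x[i].d;
        for (int j = 0; j < QK8_0; j += 4) {   // unrolled by 4
            y[i * QK8_0 + j]     = x[i].qs[j]     * d;
            y[i * QK8_0 + j + 1] = x[i].qs[j + 1] * d;
            y[i * QK8_0 + j + 2] = x[i].qs[j + 2] * d;
            y[i * QK8_0 + j + 3] = x[i].qs[j + 3] * d;
        }
    }
}
```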

* wip

* wip

* fix typo
2025-07-11 16:58:45 +08:00