Commit Graph

8 Commits

Author SHA1 Message Date
nullname 2cd429ca75
feat: perf opt part5 (#52)
* rename

* Refactor vector operations in vec_op_impl and vec_dot_product_impl for improved clarity and performance

* wip

* Enhance vector copy functions for improved performance and clarity in vec_ops.hpp

* wip

* wip

* wip

* Optimize vector dot product implementations for enhanced performance and efficiency

* Enhance flash attention implementation and type traits for improved vector operations and alignment checks

# Conflicts:
#	ggml/src/ggml-qnn/npu/device/type_traits.cpp

* remove align

* wip

* Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs

* Revert "Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs"

This reverts commit 78cc24ed2285002ca29d6189fa61ba4ce24f8d16.

* Enhance flash attention implementation with type checks for tensor data types and improved constexpr usage

* wip

* opt mask calc

* Revert "opt mask calc"

This reverts commit bb1840876692a11511d5ab7828b8a707402e30b9.

* wip

* opt mul mat caching logic to add dst cache

* Revert "opt mul mat caching logic to add dst cache"

This reverts commit ab442fa9f763b3873c929936e4cb739cb1c83850.

* wip

* Refactor matrix multiplication implementation to include vector conversion and performance tracking

* wip

* wip

* wip

* create vec_ops.inl for more aggressive compiler inline

* wip

* refactor vector dot product implementations for improved readability and performance

* refactor vector conversion functions to use HVX_Vector_Dual for improved clarity and consistency

* wip

* wip

* wip

* implement row size caching logic and enhance type traits for F32 support

* refactor matrix multiplication functions to improve caching logic and simplify tensor alignment handling

* add vector zeroing functions for F32 and F16 types to optimize memory initialization

* Revert "add vector zeroing functions for F32 and F16 types to optimize memory initialization"

This reverts commit e374326dc74d049e6603e393ade418d9ef2b83f3.

* wip

* refactor alignment checks in dot product function to handle null pointers

* wip

* refactor load_block_generic and related functions for improved alignment handling

* wip

* refactor flash attention implementation and introduce type-erased dot function for improved type handling

* refactor dot product implementations for improved loop handling and clarity

* refactor thread_pool constructor to pre-allocate VTCM cache for each thread

* Revert "refactor thread_pool constructor to pre-allocate VTCM cache for each thread"

This reverts commit 00cdd3fa88d909feef44ddaa42095274b7627685.

* wip

* opt interfaces for tensor cleanup

* refactor mul_mat_impl to use aligned size for src0 row calculation

* refactor: update dequantized_row_size logic and add size alignment checks for tensors

* wip

* wip

* refactor: replace raw pointer initialization with invalid handle constants for better clarity

* wip
2025-07-23 00:38:09 +08:00
hongruichen 9a43a23e0b fix compiling error at new hexagon sdk 2025-07-16 21:11:11 +08:00
nullname a29243e7a4
feat: perf opt quant (#47)
* feat: add mixed precision dot product implementation and function declaration

* feat: implement mixed precision vector dot product and conversion functions

* fix: update data type handling in matrix multiplication implementation

* fix: adjust row count handling in matrix multiplication implementation for accurate slicing

* fix: optimize matrix multiplication implementation by unrolling the loop

* update performance tracking for matrix multiplication implementation

* add fetching

* wip

* fix: support F16 * F32 multiplication in is_mul_mat_supported function

* fix: improve src0 fetching logic in vec_dot_product_mixed_impl for better alignment handling

* fix test failure for row width 67

* try fix failed test

* fix: rename aligned_address to align_down for clarity in vector alignment handling

* wip

* qnn fix: update device capabilities for quantized types in qnn-lib to improve compatibility

* fix test failure at width == 193

* fix: replace zero vector initialization with previous vector in mixed dot product implementation

* wip

* fix: improve handling of last vector in mixed dot product implementation

* wip

* wip

* wip

* wip

* Enhance mul_mat_f32 function to support quantized types and improve static assertions

* rename

* Refactor dequantization functions to use npu_device_fp16_t and improve type handling

* Optimize dequantization in dequantize_row_q8_0 by replacing qf32 multiplication with qf16

* Optimize dequantization in dequantize_row_q4_0 by replacing qf32 multiplication with qf16

* Add hvx_vsf_convert_vhf function for improved vector conversion

* add perf logs

* Refactor dequantize_row_q4_0 for alignment

* Update logging in supports_op_impl and supports_op to use ggml_op_desc for better clarity

* Add support for ROPE operation in NPU capabilities and related functions

* Implement ROPE operation in tensor and op_rope, including cache initialization and correction dimension calculations

* enable ROPE by adding operation validation

* add support for the case where freq is null

* wip

* Refactor rope_f32 to improve indexing by introducing total_planes calculation

* reformat

* Refactor rope_f32 to optimize data access patterns by introducing row and plane pointers

* Add performance tracking to rope_f32 function for enhanced profiling

* Refactor rope_f32 to use a templated implementation

* Refactor rope_impl to replace loop with memcpy for improved performance

* Refactor mul_mat_impl to support quantization as a template parameter

* wip

* wip

* Refactor rope_impl to optimize plane indexing in the processing loop

* Add aligned vector dot product implementation for mixed precision types

* wip

* Enhance matrix multiplication for F32 and F16 types with alignment checks

* Optimize vec_dot_product_mix_aligned_impl for improved performance with additional vector sums

* Add alignment checks for matrix multiplication and vector dot products

* Refactor matrix multiplication to use function pointers for improved readability and maintainability

* Fix alignment check in is_dot_product_aligned to ensure correct vector size handling

* Remove unused f16_to_f32_table parameter from quantization and dequantization functions

* wip

* Add L2 fetch for src1 plane rows in matrix multiplication implementation

* wip

* Refactor hvx_vsf_convert_vhf to accept an additional parameter for flexibility in vector multiplication

* Refactor vec_dot_product_mix_aligned_impl to improve variable naming for clarity

* Refactor load_dual_block_generic and dequantize_row_q4_0 to improve performance

* Refactor vector operation functions to improve clarity and consistency in variable usage

* wip

* wip

* Refactor dequantize_row_q4_0_impl for improved clarity and performance in vector operations

* wip

* Update load_dual_block_generic to use intrinsics

* Refactor load_dual_block_generic and load_qual_block_generic for improved performance and clarity

* wip

* wip

* Optimize dequantize_row_q8_0 for improved performance by unrolling for loop

* wip

* wip

* fix typo
2025-07-11 16:58:45 +08:00
nullname af620a12f7
feat: flash attention support for hexagon-npu (#45)
* add flash attn op

* expand src tensor size

* add flash attn sources

* add quantize row functions

* make a separated file for vec_dot

* wip

* wip

* refactor: rename quants.hpp includes and add vec_dot to type traits

* add flash_attn impl

* split vec_scale_f32

* move vec_reduction_qf32 to vec_ops

* add vec_scale_f16

* opt

* add vec_mad

* implement vec_mad_f16

* opt

* add op template

* opt

* add align version

* enable flash attn

* wip

* log print improve

* add profiler log

* wip

* wip

* add multi sub proc perf tracker

* increase log buffer

* remove sub proc pcycle

* wip

* wip

* add prefetch for vec_dot

* wip

* wip

* opt f16 vec dot

* opt f16 vecdot

* reuse vec_dot_product_impl in vec dot f32

* small opt to unblock pipeline

* opt on aligned address

wip

* Revert "opt on aligned address"

This reverts commit 27be1eb61a7d29d2f5fa6f90383e1b5d7fdf9b6a.

* add profiler log at thread_pool

* wip

* invalidate all...

* Reapply "opt on aligned address"

This reverts commit f075a4c4586e32b7e5819c1fe7f9b6ed218b1767.

* add is_constant for tensor config

* disable align tensor opt in mul_mat

* wip

* wip

* vec_scale_impl: unrolling the loop

* wip

* wip

* replace reinterpret_cast with direct pointer access for write/read buffers

* add fetch

* wip

* wip

* wip

* add log

* check tensor shape at flash_attn

* wip

* wip

* fix: update tensor type handling in flash_attn_impl

* wip

* fix: align cache size

* fix: qf16->hf

* fix: swap order of elements in vector combine for correct scaling

* fix: opt f16 scale and mad

* fix leftover fetch

* wip

* load into vector pair

* opt cache size calculation in flash_attn_impl

* refactoring: hold vtcm at thread local object

* wip

* add profiler log

* mark tensors as modified

* restrict tensor invalidation to the first thread in compute_impl

* Revert "restrict tensor invalidation to the first thread in compute_impl"

This reverts commit 0a8ff2b1bcf366097c16d7437c091382eacbef8b.

* invalidate last tensor in compute_impl

* invalidate last tensor in compute function

* wip

* refactor dequantize_row_q4_0 to simplify vector alignment

* wip

* refactoring: move VTCM quota calculation to thread pool

* wip

* fix: correct condition check for HEXAGON_SDK_ROOT existence

* wip

* wip

* wip

* wip

* fix: update condition checks to match the naming

* fix: improve tensor handling checks and logging in graph and operation implementations

* wip
2025-06-18 10:32:08 +08:00
nullname c23ab465c0
feat: perf opt part4 (#43)
* wip

* refactor: rewrite dequantize_row_q4_0 by intrinsic

* log for debug

* fix q4 intrinsic

* small opt

* wip

* wip

* add vtcm_quota_size

* add perf log for hexagon-npu backend

* wip

* add log

* sync after a specific op

* increase worker thread priority

* fix unbalanced thread slice

* small slice to fit in vtcm cache

* limit the supported row element size

* opt 4_0 dequant

* fix q4 dequant

* add power_utils

* add rms_norm

* wip

* enable rms_norm f32

* fix rms_norm with param

* fix compiling flags

* use float

* fix small row size

* vectorized rms norm

* wip

* read 2 vectors

* rename

* add perf log on update

* set empty tensors handle also

* merge some rpc functions

* opt param update

* wip

* print more log

* add struct for update param config

* add npu_device_graph_set_tensor_with_param

* merge tensor and params update

* wip

* wip

* make as template to reuse

* vectorize dequantize_row_q8_0

* opt

* avoid using union to store q data

* wip

* wip

* wip
2025-05-28 00:00:42 +08:00
nullname 295f7f5957
feat: perf opt part3 (#42)
* add f16 support to elt wise op

* wip

* Revert "wip"

This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.

* qf32 for mul

* wip

* Revert "wip"

This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.

* disable fp16 add/sub

* template trick

* wip

* add f16 mulmat

* add log

* fix view-like op

* add log

* fix f16 mulmat

* add quant type

* wip

* add l2fetch

* add vtcm_mem

* wip

* fix fetch

* use vtcm cache in mulmat

* revert vtcm cache

* cache plane

* small opt for plane cache

* cache plane for some element wise op

* wip

* enable fetch even on vtcm

* wip

* copy sysMonApp

* small opt

* init ltu

* add compute_params

* add op common header

* move vtcm_mem allocation to compute_param

* fall back to memcache when vtcm allocation fails

* pre-calculate quantize type

* wip

* try fix test failure

* try fix mulmat nan

* fix inf in mulmat

* remove debug logs

* wip

* small refactoring on the dequant row func

* fix typo

* improve logging

* add q4_0 and q8_0

* wip

* wip

* build hexagon libs in cmake

* wip

* fix qnn only build flag

* fix typo

* fix todo

* wip

* wip

* add to_float

* use to_float directly instead of ltu

* wip

* cache f16_to_f32 table into vtcm

* print tensor dims at log

* init device in supports_op_impl

* revert cache ltu

* wip

* wip

* fix graph calc issues by validating cache manually after each op

* add cache invalidate func

* enable cache fallback only in quantize tensors

* add option to disable quantized tensors

* propagate the asan flag to npu build

* fix asan option

* wip

* invalidate tensors after finished

* implement backend_buffer_reset

* wip

* wip

* refactoring plane cache mechanism

* wip

* split row elements across thread

* use table for f16 to f32 conversion

* sync after each op

* small refactoring to invalidate l2 cache

* wip

* opt on float fetching

* unroll for loop manually

* reduce vtcm usage

* add perf tracking for npu

* print dimensions for profiler log

* wip

* wip

* wip

* add sub proc tracker

* fix typo

* print pcycles

* wip

* wip

* prefetch rows

* add l2fetch_row

* small tweak based on perf tracer

* opt l2 fetching

* wip
2025-05-16 19:57:33 +08:00
nullname c2b6fec63f
feat: perf opt part2 (#39)
* add qurt_thread

* add thread pool

* add thread_pool obj at device ctx

* wip

* small refactoring to fit the thread pool structure

* set start/end threads for add

* init thread pool

* fix thread creation

* split complete and pending signals

* opt mulmat

* wip

* 2 threads

* back to 4 threads

* use barrier

* remove some unnecessary packages

* add multi thread support for mul mat

* wip

* use qurt_barrier_t instead of qurt_signal_t

* wip

* wip

* add log

* split qnn cmake config

* create function to calculate the start and end func

* wip

* fix comment

* fix comment

* fix comment

* wip

* fix typo
2025-04-27 17:43:32 +08:00
nullname beff5c4b78
feat: op perf opt (#38)
* add op define xml

* copy qnn libs in cmake

* fix htp skel path

* add windows copy file list

* wip

* add generated package

* remove unused params

* add cmake list

* set qnn sdk and hexagon sdk path

* wip

* wip

* fix tools version

* fix compiling error

* fix dims calc

* wip

* add mulmat 2d

* wip

* reduction

* wip

* wip

* fix compiling error in x64

* wip

* fix device description in emulator

* wip

* add flag

* copy necessary libs

* wip

* load HtpPrepare first for emulator

* enable custom op for 2d matrix

* verify op config before add to node

* Revert "verify op config before add to node"

This reverts commit 206dec826e560625e053c4c78e023994f993526e.

* wip

* wip

* wip

* revert tool version change

* use hexagon sdk version 5.5.0

https://docs.qualcomm.com/bundle/publicresource/topics/80-77512-2/release-notes-wrapper.html?product=1601111740010422#5.5.0

* wip

* move to sub dir

* add hexagon npu device and server lib

* fix npu lib build

* refactoring: rename QNNBackend enum

* fix compiling error

* wip

* remove qnn/backend.hpp

* add hexagon dsp host layer

* extract rpc_mem from qnn submodule

* fix dsp compiling error

* wip

* wip

* open and close npu device

* split objects into separated files

* fix linking error

* add npu_tensor

* add host graph

* map rpc buffer before usage

* fix some todos

* add shared module

* split rpc_interface from rpc_mem

* get get_dsp_arch from device

* wip

* rename host classes

* fix hexagon sdk arch getter

* fix device open

* fix linking error

* fix crash

* use tensor_data_type

* fix npu lib crash

* fix debug log print

* skip empty graph

* wip

* add log

* fix unmap fail

* fix tensor set

* remove some logs

* flush back memory after finished

* fix nb

* wip

* wip

* add helper function

* impl add op

* fix some add in test-backend-ops

* add elt wise sub and mul

* fix crash on some inplace op

* wip

* fix elt wise op calc

* wip

* split mul_mat into file

* add caps array

* wip

* wip

* print supported/unsupported op

* copy lldb-server for newer android sdk

* add tensor_spec

* add assert

* fix crash when loading model

* rename cmake option

* fix name

* fix device memory and description

* fix compiling error on qnn only build

* fix some potential UBs

* fix comments
2025-04-21 12:06:16 +08:00