nullname
|
295f7f5957
|
feat: perf opt part3 (#42)
* add f16 support to element wise op
* wip
* Revert "wip"
This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.
* qf32 for mul
* wip
* Revert "wip"
This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.
* disable fp16 add/sub
* template trick
* wip
* add f16 mulmat
* add log
* fix view-like op
* add log
* fix f16 mulmat
* add quant type
* wip
* add l2fetch
* add vtcm_mem
* wip
* fix fetch
* use vtcm cache in mulmat
* revert vtcm cache
* cache plane
* small opt for plane cache
* cache plane for some element wise op
* wip
* enable fetch even on vtcm
* wip
* copy sysMonApp
* small opt
* init ltu
* add compute_params
* add op common header
* move vtcm_mem allocation to compute_param
* fall back to memcache when vtcm allocation fails
* pre-calculate quantize type
* wip
* try fix test failure
* try fix mulmat nan
* fix inf in mulmat
* remove debug logs
* wip
* small refactoring on the dequant row func
* fix typo
* improve logging
* add q4_0 and q8_0
* wip
* wip
* build hexagon libs in cmake
* wip
* fix qnn only build flag
* fix typo
* fix todo
* wip
* wip
* add to_float
* use to_float directly instead of ltu
* wip
* cache f16_to_f32 table into vtcm
* print tensor dims in log
* init device in supports_op_impl
* revert cache ltu
* wip
* wip
* fix graph calc issues by validating cache manually after each op
* add cache invalidate func
* enable cache fallback only for quantized tensors
* add option to disable quantized tensors
* propagate the asan flag to npu build
* fix asan option
* wip
* invalidate tensors after finished
* implement backend_buffer_reset
* wip
* wip
* refactoring plane cache mechanism
* wip
* split row elements across thread
* use table for f16 to f32 conversion
* sync after each op
* small refactoring to invalidate l2 cache
* wip
* opt on float fetching
* unroll for loop manually
* reduce vtcm usage
* add perf tracking for npu
* print dimensions for profiler log
* wip
* wip
* wip
* add sub proc tracker
* fix typo
* print pcycles
* wip
* wip
* prefetch rows
* add l2fetch_row
* small tweak based on perf tracer
* opt l2 fetching
* wip
|
2025-05-16 19:57:33 +08:00 |