nullname
|
c23ab465c0
|
feat: perf opt part4 (#43)
* wip
* refactor: rewrite dequantize_row_q4_0 by intrinsic
* log for debug
* fix q4 intrinsic
* small opt
* wip
* wip
* add vtcm_quota_size
* add perf log for hexagon-npu backend
* wip
* add log
* sync after a specific op
* increase worker thread priority
* fix unbalanced thread slice
* small slice to fit in vtcm cache
* limit the supported row element size
* opt q4_0 dequant
* fix q4 dequant
* add power_utils
* add rms_norm
* wip
* enable rms_norm f32
* fix rms_norm with param
* fix compiling flags
* use float
* fix small row size
* vectorized rms norm
* wip
* read 2 vectors
* rename
* add perf log on update
* set empty tensors handle also
* merge some rpc functions
* opt param update
* wip
* print more log
* add struct for update param config
* add npu_device_graph_set_tensor_with_param
* merge tensor and params update
* wip
* wip
* make as template to reuse
* vectorize dequantize_row_q8_0
* opt
* avoid using union to store q data
* wip
* wip
* wip
|
2025-05-28 00:00:42 +08:00 |