* add qurt_thread
* add thread pool
* add thread_pool obj at device ctx
* wip
* small refactoring to fit the thread pool structure
* set start/end threads for add
* init thread pool
* fix thread creation
* split complete and pending signals
* optimize mul mat
* wip
* 2 threads
* back to 4 threads
* use barrier
* remove some unnecessary packages
* add multi-thread support for mul mat
* wip
* use qurt_barrier_t instead of qurt_signal_t
* wip
* wip
* add log
* split qnn cmake config
* create function to calculate the start and end values
* wip
* fix comment
* fix comment
* fix comment
* wip
* fix typo