3 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
| | 39fb81f875 | hexagon: refactor all Ops to use local context struct (#19819)<br>* hexagon: refactor set/get/sum-rows ops to use local context<br>* hexagon: refactor ROPE and Softmax Ops to use local context<br>Improves performance a bit by precomputing things and saving in the context.<br>* hexagon: refactor activation ops to use local context struct<br>* hexagon: refactor unary ops to use local context struct and DMA/VTCM<br>* hexagon: use aligned hvx_scale function<br>* hexagon: remove unused fields from op_context<br>* hexagon: rewrite ROPE to use DMA and VTCM scratchpad<br>* hex-rope: keep N rows in scratchpad (instead of just two)<br>* hex-rope: introduce rowidx cache<br>* hex-rope: remove unused fields<br>* hex-rope: rewrite dma prefetch logic to allow for multi-row fetch/compute<br>Also removes the need for fastdiv.<br>* hex-rope: minor formatting<br>* hex-rope: use indices and unroll the loops<br>* hex-rope: more updates to cleanup rope-block handling<br>* hexagon: cleanup supported type/dims checks<br>* hexagon: all reduce funcs replicated across lanes<br>There is no need to explicitly replicate the first value.<br>* snapdragon: update adb and windows scripts to use ubatch-size 256<br>Updated Ops support handles larger ubatches. | |
| | cff777f226 | hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822)<br>* hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars<br>* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32<br>Factored out all hvx_copy functions into hvx-copy.h header and reduced code duplication.<br>Update HTP ops infra to support OP_CPY<br>* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs<br>hex is basically all scalar/core platform stuff (L2, DMA, basic utils)<br>hvx is all hvx related utils, helpers, etc<br>htp is higher level stuff like Ops, etc<br>hvx-utils library got a nice round of cleanup and refactoring to reduce duplication<br>use hvx_vec_store_a where possible<br>* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h<br>Moved sigmoid and tanh vector functions from hvx-utils.h to a new header hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid array processing using a macro pattern similar to hvx-copy.h. Updated act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed unused hvx-sigmoid.c.<br>* hexagon: factor out hvx-sqrt.h<br>* hexagon: minor update to hvx-utils.h<br>* hexagon: remove spurious log<br>* hexagon: factor out and optimize hvx_add/sub/mul<br>* hexagon: remove _opt variants of add/sub/mul as they are simply fully aligned versions<br>* hexagon: refactor reduction functions to hvx-reduce.h<br>Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`. Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`. Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic. Updated `softmax-ops.c` to use the new functions.<br>* hexagon: refactor the rest of arithmetic functions to hvx-arith.h<br>Moved `hvx_sum_of_squares_f32`, `hvx_min_scalar_f32`, and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` to use `dst, src, ..., n` argument order. Updated call sites in `act-ops.c`.<br>Refactor Hexagon HVX arithmetic functions (min, clamp) to hvx-arith.h<br>Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.<br>* hexagon: refactor hvx_sum_of_squares_f32<br>- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.<br>- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.<br>- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.<br>- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.<br>- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.<br>* hexagon: use hvx_splat instead of memset<br>* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML<br>* hexagon: fix hvx_copy_f16_f32 on v75 and older<br>* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL<br>* scripts: update snapdragon/adb scripts to enable host param | |
| | ed75977717 | ggml-hexagon: create generalized functions for cpu side op (#17500)<br>* refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility<br>* refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility<br>* refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity<br>* add comment<br>* refactor: remove redundant buffer checks in hexagon supported operations<br>* wip<br>* add missing include to fix weak symbol warning<br>* add ggml_hexagon_op_generic<br>* refactor: simplify tensor operation initialization and buffer management in hexagon implementation<br>* refactor: streamline hexagon operation initialization and buffer management<br>* refactor: update function signatures and streamline request handling in hexagon operations<br>* wip<br>* ggml-hexagon: clean up code formatting and improve unary operation handling<br>* wip<br>* rename<br>* fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations<br>* refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility<br>refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility<br>refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity<br>refactor: remove redundant buffer checks in hexagon supported operations<br>add missing include to fix weak symbol warning<br>add ggml_hexagon_op_generic<br>refactor: simplify tensor operation initialization and buffer management in hexagon implementation<br>refactor: streamline hexagon operation initialization and buffer management<br>refactor: update function signatures and streamline request handling in hexagon operations<br>ggml-hexagon: clean up code formatting and improve unary operation handling<br>fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations<br># Conflicts:<br># ggml/src/ggml-hexagon/ggml-hexagon.cpp<br>* hexagon: fix merge conflicts<br>* hexagon: minor cleanup for buffer support checks<br>* hexagon: factor out op_desc and the overall op logging<br>* hexagon: further simplify and cleanup op dispatch logic<br>* snapdragon: update adb scripts to use llama-cli and llama-completion<br>* fix pipeline failure<br>---------<br>Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> | |