llama.cpp

Commit Graph

Author

SHA1

Message

Date

nullname

36bc6f3213

feat: perf opt gemv phase2 (#58)

* Add power management utilities to NPU device context and update DCVS settings

* Update DCVS settings in power_utils to use v3 API and enhance power management

* wip

* Enhance dequantization functions by adding load_dequant_table support and updating signatures for improved performance

* use lut

* wip

* fix test failure

* wip

* Refactor load_qual_block_generic to improve block handling and optimize vector operations

* Enhance load_dual_block_generic and load_qual_block_generic to accept a mask parameter for improved block handling

* Refactor flash_attn_impl to optimize mask l2 prefetch

* wip

* wip

* wip

* wip

* add log

* link against shared libraries instead of static ones

* fix swiglu

* wip

* refactor expf_fix to handle overflow for different data types

* enhance is_glu_op_supported to validate shapes for multiple sources

* wip

* refactor logging macros to use hexagon namespace and improve formatting

* fix printf format error

* wip

* refactor: update static_assert messages for block size validation and add HVX_VectorPred_x3 type alias

* rename

* feat: enhance fa with mask

* wip

* wip

* refactor: replace instances of Q6_V_vzero() with kZeroV for consistency

* wip

* wip

* wip

* fix: improve address alignment check in HVX_Vector handling

* refactor: streamline vector dot product implementations for improved readability

* refactor: q4k add hvx intrinsic impl

* refactor: enhance dequantize_row_q4_K for clarity and performance

* refactor: optimize scale mask usage in dequantization functions for improved performance

* refactor: optimize dequantize_row_q4_K for intrinsic usage and performance improvements

* refactor: move GLU operation implementation into separated file

* sync after swiglu

* wip

* wip

* wip

* feat: increase prc main thread stack size

* fix: replace hardcoded stack size with NPU_THREAD_STACK_SIZE constant

* wip

* feat: add optimized vector operations for exponential and division with overflow handling

* wip

* feat: refactor exponential function to handle overflow and underflow with improved logic

* wip

* wip

* feat: add vector loading and scaling functions for improved performance in block processing

* wip

* feat: optimize block loading by refactoring scale index handling for improved performance

* use Q6_Vb_vlut32_VbVbR_nomatch instead

* feat: enhance scale loading by adding static assertion and restructuring block handling

* wip

* feat: refactor vec_dot_product_mixed_impl for improved clarity and performance

* wip

* feat: simplify vector loading functions and improve alignment handling

* wip

* feat: enhance scale loading mask with quantization block size validation

* wip

* feat: implement make_scale_load_mask function and refactor vector handling in vec_ops

* feat: enhance load_dual_block_generic to include scale indices for improved vector loading

* revert q8 dequant

* wip

* feat: optimize dequantization functions by removing unnecessary masking and updating lookup methods

* wip

* wip

* add qurt_mutex

* Add DMA transfer class and integrate into thread pool

* Enhance DMA transfer functionality by adding support for multiple descriptors and initiating transfers in parallel

* fix dma crash

* fix failed unit tests

* wip

* use alignas

* Improve DMA transfer error handling and update descriptor completion check

* Fix VTCM cache size calculation in element-wise operations

* Add cache clean operations before DMA transfers in element-wise operations

* reduce cache clean operations

* Refactor DMA transfer functions to support 1D operations and rename for clarity

* Enhance DMA transfer functionality by adding 2D submission support and improving descriptor initialization

* Update read buffer method to support forced invalidation and remove unnecessary invalidation calls in element-wise operations

* wip

* Improve DMA transfer handling in mul_mat_gemv_impl by replacing memcpy with initiate_dma_row_transfer and adding wait_for_dma logic

* fix 2d dma

* feat: add DMA plane cache

* rename

* wip

* use memcpy for debug

* fix cache plane calc

* refactor: remove debug logging from mul_mat_impl and optimize cache handling

* rename

* fix 2d dma type

* refactor: enhance DMA transfer handling in mul_mat_gemv_impl and wait functions

* refactor: optimize DMA transfer handling in mul_mat_gemv_impl and wait functions

* wip

* wip

* move op impl into sub dir

* add log

* fix: correct pointer usage in mul_mat_gemv_impl for next plane access

* fix: improve DMA transfer error handling in mul_mat_impl and mul_mat_gemv_impl

* fix: fix crash by using the entire row bytes

* wip

* wip

* fix: prevent parallelization for scalar src1 in is_mul_mat_supported

* fix: add dimension checks for 2D DMA transfers and fallback to 1D if necessary

* wip

* fix: enable thread barrier for mul multiplication operations

* feat: add synchronization checks for tensor operations and update related functions

* wip

* fix: remove invalidation flag from get_read_buffer calls in element-wise and matrix multiplication operations

* Revert "fix: remove invalidation flag from get_read_buffer calls in element-wise and matrix multiplication operations"

This reverts commit af3441e67e706b2e5122369dc160353796867dd3.

* wip

* wip

* add comment

* fix: improve DMA transfer handling in mul_mat_gemv_impl for quantized source tensors

* add log

* try fix mulmat gemv

* wip

* fix: enhance DMA transfer handling in mul_mat_gemv_impl for quantized source tensors

* fix: optimize cache offset calculation and remove redundant swap in mul_mat_gemv_impl

* fix: refactor DMA transfer handling in mul_mat_gemv_impl for improved clarity and maintainability

* wip

* wip

* wip

* fix: enhance mul_mat_impl for improved cache handling and clarity

* fix: refactor tensor unflattening and DMA transfer initialization for improved clarity and type safety

* fix: improve cache handling of quant

* wip

* fix: improve cache handling in mul_mat_impl and mul_mat_gemv_impl for better memory efficiency

* rename

* add load_hexa_block_generic

* wip

* extract dequant block into separated function

* refactor: enhance dequantization functions with table parameter

* fix load_dual_block_generic

* refactor: rename dequantization functions for clarity and enhance block handling

* refactor: simplify dequantization logic by consolidating block handling and removing unused parameters

* wip

* wip

* feat: add make_qs_load_mask function and update load_dual_block_generic to use qs_indices

* fix load_dual_block_generic

* refactor: update load functions to use qs_indices for improved block loading

* wip

* fix: update loop indices and boundary checks to use size_t for better efficiency

* wip

* update make_scale_load_mask, to make it available for q8

* feat: add vec_dot_product_quant_impl for quantized dot product computation

* refactoring: move some quant func to dedicated file

* refactor: rename dequantization functions for clarity and consistency

* wip

* feat: enhance vec_dot_product_quant_impl with dual dequantization and improved assertions

* add vec_dot_product_vqf32_q40_f32

* wip

* wip

* wip

* wip

* implement vec_mpy_qf32_qf32_qf32 function and update vec_dot_product_vqf32_q40_f32 to use it

* wip

* add src0_plane_write_cache_offset

* wip

* enhance mul_mat_f32 to handle NPU_DATA_TYPE_Q4_0 for quantized matrix multiplication

* wip

* wip

* update test func

* refactor mul_mat_gemv_quant_impl to use get_nb for row stride and remove unused test function in init_f16_f32_table

* wip

* Add support for 4-block dequantization in vec_quant and update dot product implementation

* Refactor vec_dot_product_quant_impl to improve variable handling and enhance readability

* Refactor vec_dot_product_quant_impl to replace template function with inline vector operations

* use Q6_Vqf32_vmpy_VsfVsf instead of Q6_Vqf32_vmpy_Vqf32Vqf32

* Revert "use Q6_Vqf32_vmpy_VsfVsf instead of Q6_Vqf32_vmpy_Vqf32Vqf32"

This reverts commit 54839166fddbe40a0392adee5863c59070ccdbe4.

* wip

* improve log print in graph

* Refactor batched_row_dot to accept additional arguments and remove batched_row_dot_with_table

* Refactor synchronization functions to include previous operation and NE type parameters

* Refactor synchronization checks in several operations

* Update synchronization checks to include NPU_OP_COUNT in required conditions

* Add performance tracking to buffer management functions

* add memset

* add log

* fix: update backend device type from ACCEL to IGPU

* fix comment

2025-10-16 23:21:51 +08:00

nullname

295f7f5957

feat: perf opt part3 (#42)

* add f16 support to elt wise op

* wip

* Revert "wip"

This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.

* qf32 for mul

* wip

* Revert "wip"

This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.

* disable fp16 add/sub

* template trick

* wip

* add f16 mulmat

* add log

* fix view liked op

* add log

* fix f16 mulmat

* add quant type

* wip

* add l2fetch

* add vtcm_mem

* wip

* fix fetch

* use vtcm cache in mulmat

* revert vtcm cache

* cache plane

* small opt for plane cache

* cache plane for some element wise op

* wip

* enable fetch even on vtcm

* wip

* copy sysMonApp

* small opt

* init ltu

* add compute_params

* add op common header

* move vtcm_mem allocation to compute_param

* fallback to memcache when vtcm allocate failed

* pre-calculate quantize type

* wip

* try fix test failure

* try fix mulmat nan

* fix inf in mulmat

* remove debug logs

* wip

* small refactoring on the dequant row func

* fix typo

* improve logging

* add q4_0 and q8_0

* wip

* wip

* build hexagon libs in cmake

* wip

* fix qnn only build flag

* fix typo

* fix todo

* wip

* wip

* add to_float

* use to_float directly instead of ltu

* wip

* cache f16_to_f32 table into vtcm

* print tensor dims at log

* init device in supports_op_impl

* revert cache ltu

* wip

* wip

* fix graph calc issues by validate cache manually after each op

* add cache invalidate func

* enable cache fallback only in quantize tensors

* add option to disable quantized tensors

* propagate the asan flag to npu build

* fix asan option

* wip

* invalidate tensors after finished

* implement backend_buffer_reset

* wip

* wip

* refactoring plane cache mechanism

* wip

* split row elements across thread

* use table for f16 to f32 conversion

* sync after each op

* small refactoring to invalidate l2 cache

* wip

* opt on float fetching

* unroll for loop manually

* reduce vtcm usage

* add perf tracking for npu

* print dimensions for profiler log

* wip

* wip

* wip

* add sub proc tracker

* fix typo

* print pcycles

* wip

* wip

* prefetch rows

* add l2fetch_row

* small tweak based on perf tracer

* opt l2 fetching

* wip

2025-05-16 19:57:33 +08:00

nullname

beff5c4b78

feat: op perf opt (#38)

* add op define xml

* copy qnn libs in cmake

* fix htp skel path

* add windows copy file list

* wip

* add generated package

* remove unused params

* add cmake list

* set qnn sdk and hexagon sdk path

* wip

* wip

* fix tools version

* fix compiling error

* fix dims calc

* wip

* add mulmat 2d

* wip

* reduction

* wip

* wip

* fix compiling error in x64

* wip

* fix device description in emulator

* wip

* add flag

* copy necessary libs

* wip

* load HtpPrepare first for emulator

* enable custom op for 2d matrix

* verify op config before add to node

* Revert "verify op config before add to node"

This reverts commit 206dec826e560625e053c4c78e023994f993526e.

* wip

* wip

* wip

* revert tool version change

* use hexagon sdk version 5.5.0

https://docs.qualcomm.com/bundle/publicresource/topics/80-77512-2/release-notes-wrapper.html?product=1601111740010422#5.5.0

* wip

* move to sub dir

* add hexagon npu device and server lib

* fix npu lib build

* refactoring: rename QNNBackend enum

* fix compiling error

* wip

* remove qnn/backend.hpp

* add hexagon dsp host layer

* extract rpc_mem from qnn submodule

* fix dsp compiling error

* wip

* wip

* open and close npu device

* split objects into separated files

* fix linking error

* add npu_tensor

* add host graph

* map rpc buffer before usage

* fix some todos

* add shared module

* split rpc_interface from rpc_mem

* get get_dsp_arch from device

* wip

* rename host classes

* fix hexagon sdk arch getter

* fix device open

* fix linking error

* fix crash

* use tensor_data_type

* fix npu lib crash

* fix debug log print

* skip empty graph

* wip

* add log

* fix unmap fail

* fix tensor set

* remove some logs

* flush back memory after finished

* fix nb

* wip

* wip

* add helper function

* impl add op

* fix some add in test-backend-ops

* add elt wise sub and mul

* fix crash on some inplace op

* wip

* fix elt wise op calc

* wip

* split mul_mat into file

* add caps array

* wip

* wip

* print support/unsupport op

* copy lldb-server for newer android sdk

* add tensor_spec

* add assert

* fix crash when loading model

* rename cmake option

* fix name

* fix device memory and description

* fix compiling error on qnn only build

* fix some potential UBs

* fix comments

2025-04-21 12:06:16 +08:00

3 Commits