Commit Graph

109 Commits

Author SHA1 Message Date
hongruichen 332514cd5c qnn fix: update device capabilities for quantized types in qnn-lib to improve compatibility 2025-06-23 16:04:01 +08:00
nullname af620a12f7
feat: flash attention support for hexagon-npu (#45)
* add flash attn op

* expand src tensor size

* add flash attn sources

* add quantize row functions

* make a separated file for vec_dot

* wip

* wip

* refactor: rename quants.hpp includes and add vec_dot to type traits

* add flash_attn impl

* split vec_scale_f32

* move vec_reduction_qf32 to vec_ops

* add vec_scale_f16

* opt

* add vec_mad

* implement vec_mad_f16

* opt

* add op template

* opt

* add align version

* enable flash attn

* wip

* log print improve

* add profiler log

* wip

* wip

* add multi sub proc perf tracker

* increase log buffer

* remove sub proc pcycle

* wip

* wip

* add prefetch for vec_dot

* wip

* wip

* opt f16 vec dot

* opt f16 vecdot

* reuse vec_dot_product_impl in vec dot f32

* small opt to unblock pipeline

* opt on aligned address

* wip

* Revert "opt on aligned address"

This reverts commit 27be1eb61a7d29d2f5fa6f90383e1b5d7fdf9b6a.

* add profiler log at thread_pool

* wip

* invalidate all...

* Reapply "opt on aligned address"

This reverts commit f075a4c4586e32b7e5819c1fe7f9b6ed218b1767.

* add is_constant for tensor config

* disable align tensor opt in mul_mat

* wip

* wip

* vec_scale_impl: unrolling the loop

* wip

* wip

* replace reinterpret_cast with direct pointer access for write/read buffers

* add fetch

* wip

* wip

* wip

* add log

* check tensor shape at flash_attn

* wip

* wip

* fix: update tensor type handling in flash_attn_impl

* wip

* fix: align cache size

* fix: qf16->hf

* fix: swap order of elements in vector combine for correct scaling

* fix: opt f16 scale and mad

* fix leftover fetch

* wip

* load into vector pair

* opt cache size calculation in flash_attn_impl

* refactoring: hold vtcm at thread local object

* wip

* add profiler log

* mark tensors as modified

* restrict tensor invalidation to the first thread in compute_impl

* Revert "restrict tensor invalidation to the first thread in compute_impl"

This reverts commit 0a8ff2b1bcf366097c16d7437c091382eacbef8b.

* invalidate last tensor in compute_impl

* invalidate last tensor in compute function

* wip

* refactor dequantize_row_q4_0 to simplify vector alignment

* wip

* refactoring: move VTCM quota calculation to thread pool

* wip

* fix: correct condition check for HEXAGON_SDK_ROOT existence

* wip

* wip

* wip

* wip

* fix: update condition checks to match the naming

* fix: improve tensor handling checks and logging in graph and operation implementations

* wip
2025-06-18 10:32:08 +08:00
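The flash-attention bullets above (vec_dot, vec_scale_f32/f16, vec_mad, a running scale) follow the standard online-softmax accumulation; here is a scalar sketch of that scheme — the function name and layout are hypothetical, and the real hexagon-npu code vectorizes each step with HVX intrinsics:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar sketch of online-softmax flash attention for a single query row:
// walk the K/V rows once, tracking a running max (m), normalizer (l) and
// un-normalized output (acc), so the full score matrix is never materialized.
std::vector<float> flash_attn_row(const std::vector<float> &q,
                                  const std::vector<std::vector<float>> &K,
                                  const std::vector<std::vector<float>> &V,
                                  float scale) {
    const size_t d = q.size();
    float m = -INFINITY, l = 0.0f;
    std::vector<float> acc(V[0].size(), 0.0f);
    for (size_t i = 0; i < K.size(); ++i) {
        float s = 0.0f;
        for (size_t j = 0; j < d; ++j) s += q[j] * K[i][j];  // vec_dot step
        s *= scale;
        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new);  // rescale old accumulator
        const float p     = std::exp(s - m_new);
        for (size_t j = 0; j < acc.size(); ++j)
            acc[j] = acc[j] * corr + p * V[i][j]; // vec_scale + vec_mad step
        l = l * corr + p;
        m = m_new;
    }
    for (float &x : acc) x /= l;  // final normalization
    return acc;
}
```

With two K/V rows this reproduces the plain softmax-weighted sum, which is a cheap way to sanity-check any vectorized rewrite against the scalar path.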
nullname c23ab465c0
feat: perf opt part4 (#43)
* wip

* refactor: rewrite dequantize_row_q4_0 by intrinsic

* log for debug

* fix q4 intrinsic

* small opt

* wip

* wip

* add vtcm_quota_size

* add perf log for hexagon-npu backend

* wip

* add log

* sync after a specific op

* increase worker thread priority

* fix unbalanced thread slice

* small slice to fit in vtcm cache

* limit the supported row element size

* opt 4_0 dequant

* fix q4 dequant

* add power_utils

* add rms_norm

* wip

* enable rms_norm f32

* fix rms_norm with param

* fix compiling flags

* use float

* fix small row size

* vectorized rms norm

* wip

* read 2 vectors

* rename

* add perf log on update

* set handle for empty tensors as well

* merge some rpc functions

* opt param update

* wip

* print more log

* add struct for update param config

* add npu_device_graph_set_tensor_with_param

* merge tensor and params update

* wip

* wip

* make as template to reuse

* vectorize dequantize_row_q8_0

* opt

* avoid using union to store q data

* wip

* wip

* wip
2025-05-28 00:00:42 +08:00
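The dequantize_row_q4_0 work above targets ggml's Q4_0 block format: 32 values per block, one scale, and 16 bytes of packed nibbles where element j takes the low nibble of byte j and element j+16 the high nibble, each offset by -8. A scalar reference sketch — struct and function names are illustrative, the fp16 scale is simplified to float here, and the actual commits replace this loop with HVX intrinsics:

```cpp
#include <cassert>
#include <cstdint>

// Simplified Q4_0 block: ggml stores `d` as fp16; float is used here
// purely for illustration.
struct block_q4_0_sketch {
    float   d;       // per-block scale
    uint8_t qs[16];  // 32 packed 4-bit quants
};

void dequantize_row_q4_0_sketch(const block_q4_0_sketch *x, float *y, int nblocks) {
    for (int i = 0; i < nblocks; ++i) {
        for (int j = 0; j < 16; ++j) {
            // low nibble -> first half of the block, high nibble -> second half
            y[i * 32 + j]      = ((x[i].qs[j] & 0x0F) - 8) * x[i].d;
            y[i * 32 + j + 16] = ((x[i].qs[j] >> 4)   - 8) * x[i].d;
        }
    }
}
```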
nullname 2306f82a58 fix compiling error 2025-05-27 06:35:41 +00:00
nullname 295f7f5957
feat: perf opt part3 (#42)
* add f16 support to elt wise op

* wip

* Revert "wip"

This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.

* qf32 for mul

* wip

* Revert "wip"

This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.

* disable fp16 add/sub

* template trick

* wip

* add f16 mulmat

* add log

* fix view liked op

* add log

* fix f16 mulmat

* add quant type

* wip

* add l2fetch

* add vtcm_mem

* wip

* fix fetch

* use vtcm cache in mulmat

* revert vtcm cache

* cache plane

* small opt for plane cache

* cache plane for some element wise op

* wip

* enable fetch even on vtcm

* wip

* copy sysMonApp

* small opt

* init ltu

* add compute_params

* add op common header

* move vtcm_mem allocation to compute_param

* fallback to memcache when vtcm allocate failed

* pre-calculate quantize type

* wip

* try fix test failure

* try fix mulmat nan

* fix inf in mulmat

* remove debug logs

* wip

* small refactoring on the dequant row func

* fix typo

* improve logging

* add q4_0 and q8_0

* wip

* wip

* build hexagon libs in cmake

* wip

* fix qnn only build flag

* fix typo

* fix todo

* wip

* wip

* add to_float

* use to_float directly instead of ltu

* wip

* cache f16_to_f32 table into vtcm

* print tensor dims at log

* init device in supports_op_impl

* revert cache ltu

* wip

* wip

* fix graph calc issues by validate cache manually after each op

* add cache invalidate func

* enable cache fallback only in quantize tensors

* add option to disable quantized tensors

* propagate the asan flag to npu build

* fix asan option

* wip

* invalidate tensors after finished

* implement backend_buffer_reset

* wip

* wip

* refactoring plane cache mechanism

* wip

* split row elements across thread

* use table for f16 to f32 conversion

* sync after each op

* small refactoring to invalidate l2 cache

* wip

* opt on float fetching

* unroll for loop manually

* reduce vtcm usage

* add perf tracking for npu

* print dimensions for profiler log

* wip

* wip

* wip

* add sub proc tracker

* fix typo

* print pcycles

* wip

* wip

* prefetch rows

* add l2fetch_row

* small tweak based on perf tracer

* opt l2 fetching

* wip
2025-05-16 19:57:33 +08:00
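Several bullets above ("use table for f16 to f32 conversion", "cache f16_to_f32 table into vtcm") describe a 64K-entry lookup table so per-element half-to-float conversion becomes a single load. A self-contained sketch of that idea under those assumptions — names are hypothetical, not the project's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Plain bit-twiddling half->float, run once per possible bit pattern to
// fill the table (handles zero/subnormal, normal, and inf/nan encodings).
float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) & 1u;
    uint32_t exp  = (uint32_t)(h >> 10) & 0x1Fu;
    uint32_t mant = (uint32_t)(h & 0x3FFu);
    if (exp == 0) {  // zero / subnormal: value = mant * 2^-24
        float v = mant * (1.0f / 16777216.0f);
        return sign ? -v : v;
    }
    uint32_t bits = (sign << 31) |
                    ((exp == 31 ? 255u : exp + 112u) << 23) |  // rebias 15 -> 127
                    (mant << 13);
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Build the 65536-entry table once (the commits cache it in VTCM).
std::vector<float> build_f16_table() {
    std::vector<float> t(65536);
    for (uint32_t i = 0; i < 65536; ++i) t[i] = half_to_float((uint16_t)i);
    return t;
}

// Row conversion is then a pure table lookup.
void f16_row_to_f32(const uint16_t *src, float *dst, size_t n,
                    const std::vector<float> &table) {
    for (size_t i = 0; i < n; ++i) dst[i] = table[src[i]];
}
```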
hongruichen db2a125438 fix GGML_QNN_ENABLE_PERFORMANCE_TRACKING option 2025-05-13 20:18:09 +08:00
hongruichen 02af8ff653 fix qnn only build flag 2025-05-08 21:28:11 +08:00
hongruichen 0ce53ce7cd fix linking error 2025-05-08 12:19:40 +08:00
hongruichen 039f835410 fix compiling error 2025-05-08 10:17:48 +08:00
hongruichen 161c4ee124 fix typo 2025-05-08 01:20:41 +08:00
nullname c2b6fec63f
feat: perf opt part2 (#39)
* add qurt_thread

* add thread pool

* add thread_pool obj at device ctx

* wip

* small refactoring to fit the thread pool structure

* set start/end threads for add

* init thread pool

* fix thread creation

* split complete and pending signals

* opt mulmat

* wip

* 2 threads

* back to 4 threads

* use barrier

* remove some unnecessary package

* add multi thread support for mul mat

* wip

* use qurt_barrier_t instead of qurt_signal_t

* wip

* wip

* add log

* split qnn cmake config

* create function to calculate the start and end func

* wip

* fix comment

* fix comment

* fix comment

* wip

* fix typo
2025-04-27 17:43:32 +08:00
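The "create function to calculate the start and end" and "fix unbalanced thread slice" bullets suggest a balanced work-splitting helper for the thread pool; a minimal sketch, assuming the split is over row indices (the name `thread_slice` is hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Split `total` rows across `nthreads` workers, spreading the remainder
// one row at a time over the first threads so no slice is more than one
// row larger than any other.
std::pair<int, int> thread_slice(int total, int nthreads, int tid) {
    int base  = total / nthreads;
    int rem   = total % nthreads;
    int start = tid * base + std::min(tid, rem);
    int end   = start + base + (tid < rem ? 1 : 0);
    return {start, end};  // half-open range [start, end)
}
```

For 10 rows on 4 threads this yields slices of 3, 3, 2 and 2 rows, instead of the 2, 2, 2, 4 a naive `total / nthreads` split with a fat last slice produces.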
nullname beff5c4b78
feat: op perf opt (#38)
* add op define xml

* copy qnn libs in cmake

* fix htp skel path

* add windows copy file list

* wip

* add generated package

* remove unused params

* add cmake list

* set qnn sdk and hexagon sdk path

* wip

* wip

* fix tools version

* fix compiling error

* fix dims calc

* wip

* add mulmat 2d

* wip

* reduction

* wip

* wip

* fix compiling error in x64

* wip

* fix device description in emulator

* wip

* add flag

* copy necessary libs

* wip

* load HtpPrepare first for emulator

* enable custom op for 2d matrix

* verify op config before add to node

* Revert "verify op config before add to node"

This reverts commit 206dec826e560625e053c4c78e023994f993526e.

* wip

* wip

* wip

* revert tool version change

* use hexagon sdk version 5.5.0

https://docs.qualcomm.com/bundle/publicresource/topics/80-77512-2/release-notes-wrapper.html?product=1601111740010422#5.5.0

* wip

* move to sub dir

* add hexagon npu device and server lib

* fix npu lib build

* refactoring: rename QNNBackend enum

* fix compiling error

* wip

* remove qnn/backend.hpp

* add hexagon dsp host layer

* extract rpc_mem from qnn submodule

* fix dsp compiling error

* wip

* wip

* open and close npu device

* split objects into separated files

* fix linking error

* add npu_tensor

* add host graph

* map rpc buffer before usage

* fix some todos

* add shared module

* split rpc_interface from rpc_mem

* get get_dsp_arch from device

* wip

* rename host classes

* fix hexagon sdk arch getter

* fix device open

* fix linking error

* fix crash

* use tensor_data_type

* fix npu lib crash

* fix debug log print

* skip empty graph

* wip

* add log

* fix unmap fail

* fix tensor set

* remove some logs

* flush back memory after finished

* fix nb

* wip

* wip

* add helper function

* impl add op

* fix some add in test-backend-ops

* add elt wise sub and mul

* fix crash on some inplace op

* wip

* fix elt wise op calc

* wip

* split mul_mat into file

* add caps array

* wip

* wip

* print support/unsupport op

* copy lldb-server for newer android sdk

* add tensor_spec

* add assert

* fix crash when loading model

* rename cmake option

* fix name

* fix device memory and description

* fix compiling error on qnn only build

* fix some potential UBs

* fix comments
2025-04-21 12:06:16 +08:00
hongruichen 9e41f79403 fix compiling error after merge master 2025-04-16 11:16:26 +08:00
hongruichen 1caca627ea fix compiling error after merge 2025-03-22 12:51:09 +08:00
nullname a1ab67478f
[feat] add more op (#35)
* move op key generate function to kOpCaps

* fix op desc print

* try fix rms_norm

* Revert "try fix rms_norm"

This reverts commit 33b296098012909cb482fc29b52b28098dc971cd.

* add quantization type support by converting them to float

* enable quantization tensor for mulmat in gpu/npu

* fix asan error

* add log and assert

* insert output convert operator after mulmat

* add log

* fix some error in running

* disable permute again

* add log

* add error function

* Revert "add error function"

This reverts commit f92ff47798ac8053fb776c55efbb1a98469c7af1.

* add log

* more log

* disable convert op in graph

* wip

* add f16 config for graph

* set f16 precision for f16 graph

* fix override data type

* add comment

* add config flag to enable quantize type

* add log

* more quantized type for cpu and gpu backend

* enable all quant types for cpu and gpu backend

* rename

* wip

* add log

* remove unused functions

* skip permute

* remove get_qnn_op_input_param_count

* fallback to generic_get_op_desc if no op_desc

* revert 'skip permute'

* Revert "revert 'skip permute'"

This reverts commit 5761e31fd23c69c4cabf6fd9fac1a0d3e5a74968.

* wip

* add log

* print qnn tensor type

* add log

* limit the max size of tensor

* add log

* fix tensor size limiter

* small improve on tensor info printer

* disable sqrt and div to pass test-backend-ops for 8 gen 2

* remove debug log in release build

* add log

* skip permute in src

* wip

* disable reshape

* skip mul at decoder start

* wip

* add log

* add qnn_scoped_timer

* add perf tracker in graph

* add cmake options GGML_QNN_ENABLE_PERFORMANCE_TRACKING

* fix flag name

* use milli-second

* wip

* fix comment string

* add file for profiler

* change qnn-cpu to GGML_BACKEND_DEVICE_TYPE_ACCEL, so that we can run tests on cpu

* wip

* profiler: refactoring

* wip

* add implement for print_profile_events

* set-up profiler for graph

* set profiler to graph execute

* pretty print events

* unified log print prefix

* print event count

* enable optrace

* print duration at event end

* wip

* add more detailed soc information

* wip

* move device caps array into qnn-lib.cpp

* remove lib_name in device_context

* move get_graph_key_from_cgraph to graph.cpp

* add override type for tensor key

* use override_type instead of original data type for graph key

* append op type to tensor name to fix error in qwen

* remove todo

* wip
2025-03-22 12:34:31 +08:00
hongruichen 31847c8301 fix compiling error after merge 2025-03-05 22:25:36 +08:00
nullname 8b652dd6ec
bug: fix benchmark debug warning (#31)
* print build type

* wip

* print compiling flags

* wip

* wip
2025-02-28 22:54:57 +08:00
nullname f289752664
[bugfix]make sure single node op will have the same type (#29)
* debug

* disable reshape

* make sure single node op have same type

* fix warning at the logger

* Revert "disable reshape"

This reverts commit 5aeca4ba9bec6db3f047f9da803df20f9f6612b3.
2025-02-28 19:18:16 +08:00
nullname c867641222
feat: fix some TODO item in upstream PR #26 (#27)
* fix warning

* wip

* add todo for graph key generate

* rename some file to meet upstream guideline

* remove local .clang-format

* extend supported/unsupported counter to all ops

* append device name to log

* port to ggml logger

* fix warning after adapt to ggml logger

* append \n to all log

* use cast op instead of convert

* Revert "use cast op instead of convert"

This reverts commit e662fc2dfee41719aaf7bc9d75e03e8d0f7ded0f.

* fix op that needs same shape

* opt kQnnOpsTable

* refresh params name field when getting op config

* opt npu log print

* remove unused functions
2025-02-27 23:16:08 +08:00
nullname ff033e1e23
opt mulmat base on official doc (#25)
https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
2025-02-25 19:46:48 +08:00
nullname a822d00753
feat: run on win (#24)
* move qnn_instance function implementation into cpp

* wip

* wip

* move dl related function into separated file

* use cast op for gpu

* Revert "use cast op for gpu"

This reverts commit 05df7362a15c022d05940d682e84cf480a082c6a.

* Reapply "use cast op for gpu"

This reverts commit 2520e5922a216faceb6d7efcde23dafe6947a4b3.

* fix compiling error in win

* fix align_alloc in win

* fix compiling error

* add get sys free/total mem for win

* wip

* suppress warning in win

* add missing chrono header

* set the correct qnn lib name for windows

* add flag to control cpu backend

* wip

* wip

* Revert "Reapply "use cast op for gpu""

This reverts commit f56519c374a7d46faac706cf214de48ff5fc5139.

* fix compiling error for linux build

* fix cdsprpc dynamic library name

* wip

* skip rpc load fail

* fix page_align_alloc

* suppress some warning in gcc

* wip

* reuse align to function

* more log

* add log and fix warning

* wip

* fix asan errors and memory leaks

* fix the get_io_tensors_from_graph

* improve comment

* print GGML_QNN_DEFAULT_LIB_SEARCH_PATH

* revert some unused changes

* move library search path setter into qnn module

* fix android library loading

* skip qnn_device_get_platform_info for npu emulator
2025-02-24 10:47:47 +08:00
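The align_alloc and page_align_alloc fixes above revolve around rounding an allocation size up to an alignment boundary before calling the platform allocator; a minimal helper sketch (the name `align_up` is hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `align`.
// Assumes `align` is a power of two, so the mask trick is valid.
constexpr size_t align_up(size_t size, size_t align) {
    return (size + align - 1) & ~(align - 1);
}
```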
nullname 10bd671c08
[feat]add more op support (#18)
* disable rpc buffer for npu

* append input/output tensor size into unsupported op log

* log dimensions for unsupported tensor

* wip

* split op config classes into separated file

* fix reshape

* wip

* add op_constructor_with_type_param

* set parameter for op_constructor_with_type_param func
2025-01-18 22:15:27 +08:00
hongruichen 5f93376f67 fix compiling error after merged 2025-01-10 11:30:03 +08:00
nullname f2d8d017da
[feat] Port ggml graph to QNN graph (#16)
* more log

* split graph implementation into cpp file

* rename: ggml_qnn_graph -> qnn_graph

* add input/output tensor to graph

* fix assert

* wip

* add _ggml_tensor field in qnn tensor

* add comments

* add set_data_buffer with raw memory buffer

* use set_data_buffer

* op param buffer use qnn_buffer_ptr

* add qnn_mem_buffer_slice

* use qnn_buffer_ptr as tensor buffer

* use new set_data_buffer to reduce copy

* ggml_qnn_op_config: add function to set input/output tensor before init node

* remove ggml_qnn_connectable_op_config and use ggml_qnn_single_op_config instead

* wip

* add initialize_op_nodes without tensor params

* wip

* add op caps table

* merge kGgmlOpToQnnOp and kOpCaps tables

* wip

* add cache parameter to create_tensors

* add init_from_ggml_graph

* disable gelu for all backend

* wip

* move op index calc to op config module

* use the ggml_tensor as parameter of build_graph

* add log

* use create_operation_from_op_tensor in old build_graph function

* remove unused constructors

* fix parameter count

* remove unused member func/var

* make init_from_ggml_graph as a class member: build_graph_from_ggml_graph

* move graph finalize into member function `finalize()`

* get graph key from ggml op tensor directly

* append output type

* reduce tensor key length

* add function to generate key from ggml_cgraph

* simplify graph cache insert and delete

* remove template param at get_qnn_graph_from_cache

* wip

* merge kQnnUnaryOpsTable and kQnnBinaryOpsTable

* refactor device_supports_op

* add log

* wip

* use framework function to check same shape

* wip

* extract some logic into separated function

* wip

* add execution function that runs graph

* add function to create qnn graph from ggml_cgraph with cache

* execute graph directly

* return null graph key for empty graph

* add more qualcomm chipset enums

* add cap for reshape

* disable some ops

* try to skip GGML_OP_VIEW

* more log for view tensor

* append param tensor into intermedia tensor key

* use 'ordered' set

* fix warning in release

* wip
2025-01-10 11:13:25 +08:00
hongruichen 79f124a699 add missing op 2024-12-14 15:49:44 +08:00
nullname e36ad89528
bugfix: error pre-allocated tensor (k_cache_view-0) (#12)
* fix device binding at ggml_backend_qnn_buffer_type

* merge ggml_backend_qnn_buffer_context and qnn_mem_buffer

* wip

* add log

* wip

* add qnn_buffer_ptr

* remove tailing `\n` at log

* add log

* enable GGML_OP_NONE

* wip

* wip

* disable tensor with view

* wip

* wip

* more log for view tensor

* re-enable view

* wip

* remove link android lib

* set dimension at bind function

* move graph traversal to backend-ops

* wip

* add get_view_internal_dimension to obtain the tensor view source dimension

* use _view_source_dimensions to allocate qnn tensor

* add place holder function ggml_backend_qnn_cpy_tensor_async

* add ggml_qnn_aggregate_op_config

* make matmul based on ggml_qnn_aggregate_op_config

* wip

* manually specify the order of op destruct

* skip register qnn-cpu backend

* disable view op again

* remove _view_source_dimensions

* add nop for reshape and view ops

* add log

* add comment
2024-12-11 10:42:00 +08:00
hongruichen 0d02ee09ed fix int overflow and remove view op to pass unit test 2024-12-03 10:55:11 +08:00
hongruichen c5e6549331 fix: fix assertion 2024-11-29 23:38:06 +08:00
hongruichen 09efaa389e define compile flag as module private 2024-11-29 17:24:05 +08:00
hongruichen 6d4feae579 redo conflict changes 2024-11-29 17:14:01 +08:00
hongruichen 5103b166ba bugfix: block large tensor calc in npu 2024-11-29 14:19:34 +08:00
nullname a2df09b6af
[WIP] feat: perf opt (#10)
* reduce log

* wip

* add function to create concat nodes

* opt

* insert concat node before mulmat

* use resize op

* wip

* add bind_buffer and remove ggml prefix in tensor types

* use gather node instead

* fix tensor type, now succeed in gpu and cpu, failed in npu

* add comment

* wip

* add comment

* wip

* in destructor, clear internal buffer before unbind

* disable gather for npu

* wip

* count swap memory as free memory

* wip

* fix supported_types

ggml_backend_device_i.supports_op will be invoked before ggml_backend_device_i.init_backend

* rename create_tensors -> initialize_op_nodes

* move ggml_qnn_op_config to separated file

* wip

* add create_convert_nodes

* add comment

* enable different type in/out for npu and cpu backend

* fix npu convert op

* enlarge max buffer size

* add more error code

* check tensor type before create convert node

* add log

* add log

* remove transpose0 and use buildin transpose flag

* rename transpose1 -> transpose_out

* disable convert for npu

* add more logs
2024-11-29 00:03:23 +08:00
nullname e6dbdacc32
feat: fix llama-bench (#7)
* remove unused functions

* wip

* init from last devices

* move init into constructor

* wip

* add static assert to device table

* make kDeviceCaps as constexpr

* get free memory and total memory

* add optimize flag for qnn backend
2024-11-13 17:06:46 +08:00
nullname 8ad86dc703
feat: add QNN_OP_TRANSPOSE (#6)
* redo: add convert nodes

This reverts commit 8448acd5ebf8fe86ab9d25313b64a15c811ef96e.

* align clang format with cann

* rename binary_op -> general_op

because there are some ops that will only take 1 param

* Revert "rename binary_op -> general_op"

This reverts commit 5be63b1a0dc4614457785367dade62158fe46214.

* wip

* add GGML_OP_PERMUTE

* add GGML_OP_VIEW and GGML_OP_GET_ROWS

* wip

* Revert "wip"

This reverts commit 772462ca6cfa01ea31bde725c2da60076ad9385f.
2024-11-04 23:12:03 +08:00
nullname fe565cfd9f
fix compiling error in release 2024-10-29 15:47:07 +08:00
hongruichen 5c1e6d4905 disable gelu in NPU 2024-10-29 00:54:08 +08:00
nullname 4abaf7d87e
feat: fix mulmat (#2)
* ggml_qnn_op_config now manages the construction of ggml_qnn_tensor

* wip

* add interface ggml_qnn_op_config

* add ggml_qnn_list_op_config

* add create_tensor and move tensor bind to execute

* wip

* rename: ggml_qnn_list_op_config -> ggml_qnn_matmul_op_config

* add tensortype to allow native tensor

* remove ggml_tensor param at ggml_qnn_tensor::create_tensor

* postpone the tensor id allocation to add_node

* add ggml_qnn_op_config_base

* trivial change to reduce the params of function

* split bind_tensors into bind_input_tensors and bind_output_tensors

* implement ggml_qnn_single_op_config::create_tensors

next will set the parameter of transpose

* tensor: add bind buffer

* add parameter tensor type

* implement add_tensor_param

* set qnn_instance only at constructor

* set transpose tensor param

* move create_op_constructor into op-config module

* create QNN_OP_MAT_MUL from ggml_qnn_matmul_op_config

* try fix crash

* fix compiling error at older ndk (r23c)

* fix crash

* fix parameter tensor name

* update tensor dimension assignment and add TODO

* fix mat_mul graph creating

* fix MUL_MAT_256x16x10x1_256x1x10x1_16x1x10x1

* append type to graph cache key

* wip

* fix supported op

* update comment

* disable op other than add and mat_mul

* add convert op to adapt multi input/output format

* disable f16 for cpu backend according to official doc

https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/cpu_backend.html#supported-operations

* add supported data types flags in each backend

* remove unused functions

* append output type to graph key

* fix gpu backend by disable the different data type op

* fix cpu backend support ops

* fix duplicated tensor name

* append op name

* suppress warning

* remove unused code
2024-10-28 12:48:16 +08:00
hongruichen 181cf52888 adapt new register backend interface and fix missing ops 2024-10-11 10:17:50 +08:00
hongruichen 1da8a3e678 fix compiling error after merge 2024-09-30 10:37:23 +08:00
Hongrui Chen a1ceaae4ad fix compiling error at older ndk (r23c) 2024-09-30 10:18:12 +08:00
hongruichen 481cb3a0c5 fix compiling error 2024-09-07 12:29:26 +08:00
みゃん dedadf2a20
Fixed a bug where debug code was included in the release, resulting i… (#1)
* Fixed a bug where debug code was included in the release, resulting in an undefined function error.

* Change the path of the QNN library when building in termux environment

* Revert "Change the path of the QNN library when building in termux environment"

This reverts commit c6e26a3679da2608940e2163e090adf75d667400.

* Changed so that GGML_QNN_DEFAULT_LIB_SEARCH_PATH can be set from command line arguments
2024-08-20 10:20:23 +08:00
hongruichen 47f6e02eda fix: try fix the tensor rank of mul mat 2024-07-31 23:54:07 +08:00
hongruichen 74eb05a13b feat: add ggml_qnn_op_config for handle different op 2024-07-31 20:22:37 +08:00
hongruichen 9a5f802bb6 refactoring: add convient macro to disable copy and move of class 2024-07-29 22:18:48 +08:00
hongruichen 6da82947df refactoring: set the default qnn lib search path at CMakeLists.txt by GGML_QNN_DEFAULT_LIB_SEARCH_PATH 2024-07-29 15:53:14 +08:00
hongruichen 1f9d2a7e22 refactoring: improve tensor print 2024-07-28 22:05:51 +08:00
hongruichen e33b5c9837 refactoring: print the name of unsupport op 2024-07-27 13:49:49 +08:00
hongruichen 8ab1f15fe3 refactoring: remove internal functions, use op table directly 2024-07-27 13:43:07 +08:00
hongruichen e0c9b34016 feat: check if dims equal for add
looks like qnn add can only be applied to matrices with equal dimensions
2024-07-27 13:38:12 +08:00