llama.cpp

History

nullname e6a5f7baa6 feat: perf opt add set rows (#59 ) * Add power management utilities to NPU device context and update DCVS settings * Update DCVS settings in power_utils to use v3 API and enhance power management * wip * Enhance dequantization functions by adding load_dequant_table support and updating signatures for improved performance * use lut * wip * fix test failure * wip * Refactor load_qual_block_generic to improve block handling and optimize vector operations * Enhance load_dual_block_generic and load_qual_block_generic to accept a mask parameter for improved block handling * Refactor flash_attn_impl to optimize mask l2 prefetch * wip * wip * wip * wip * add log * link against shared libraries instead of static ones * fix swiglu * wip * refactor expf_fix to handle overflow for different data types * enhance is_glu_op_supported to validate shapes for multiple sources * wip * refactor logging macros to use hexagon namespace and improve formatting * fix printf format error * wip * refactor: update static_assert messages for block size validation and add HVX_VectorPred_x3 type alias * rename * feat: enhance fa with mask * wip * wip * refactor: replace instances of Q6_V_vzero() with kZeroV for consistency * wip * wip * wip * fix: improve address alignment check in HVX_Vector handling * refactor: streamline vector dot product implementations for improved readability * refactor: q4k add hvx intrinsic impl * refactor: enhance dequantize_row_q4_K for clarity and performance * refactor: optimize scale mask usage in dequantization functions for improved performance * refactor: optimize dequantize_row_q4_K for intrinsic usage and performance improvements * refactor: move GLU operation implementation into separated file * sync after swiglu * wip * wip * wip * feat: increase prc main thread stack size * fix: replace hardcoded stack size with NPU_THREAD_STACK_SIZE constant * wip * feat: add optimized vector operations for exponential and division with overflow handling * wip * feat: refactor exponential function to handle overflow and underflow with improved logic * wip * wip * feat: add vector loading and scaling functions for improved performance in block processing * wip * feat: optimize block loading by refactoring scale index handling for improved performance * use Q6_Vb_vlut32_VbVbR_nomatch instead * feat: enhance scale loading by adding static assertion and restructuring block handling * wip * feat: refactor vec_dot_product_mixed_impl for improved clarity and performance * wip * feat: simplify vector loading functions and improve alignment handling * wip * feat: enhance scale loading mask with quantization block size validation * wip * feat: implement make_scale_load_mask function and refactor vector handling in vec_ops * feat: enhance load_dual_block_generic to include scale indices for improved vector loading * revert q8 dequant * wip * feat: optimize dequantization functions by removing unnecessary masking and updating lookup methods * wip * wip * add qurt_mutex * Add DMA transfer class and integrate into thread pool * Enhance DMA transfer functionality by adding support for multiple descriptors and initiating transfers in parallel * fix dma crash * fix failed unit tests * wip * use alignas * Improve DMA transfer error handling and update descriptor completion check * Fix VTCM cache size calculation in element-wise operations * Add cache clean operations before DMA transfers in element-wise operations * reduce cache clean operations * Refactor DMA transfer functions to support 1D operations and rename for clarity * Enhance DMA transfer functionality by adding 2D submission support and improving descriptor initialization * Update read buffer method to support forced invalidation and remove unnecessary invalidation calls in element-wise operations * wip * Improve DMA transfer handling in mul_mat_gemv_impl by replacing memcpy with initiate_dma_row_transfer and adding wait_for_dma logic * fix 2d dma * feat: add DMA plane cache * rename * wip * use memcpy for debug * fix cache plane calc * refactor: remove debug logging from mul_mat_impl and optimize cache handling * rename * fix 2d dma type * refactor: enhance DMA transfer handling in mul_mat_gemv_impl and wait functions * refactor: optimize DMA transfer handling in mul_mat_gemv_impl and wait functions * wip * wip * move op impl into sub dir * add log * fix: correct pointer usage in mul_mat_gemv_impl for next plane access * fix: improve DMA transfer error handling in mul_mat_impl and mul_mat_gemv_impl * fix: fix crash by using the entire row bytes * wip * wip * fix: prevent parallelization for scalar src1 in is_mul_mat_supported * fix: add dimension checks for 2D DMA transfers and fallback to 1D if necessary * wip * fix: enable thread barrier for mul multiplication operations * feat: add synchronization checks for tensor operations and update related functions * wip * fix: remove invalidation flag from get_read_buffer calls in element-wise and matrix multiplication operations * Revert "fix: remove invalidation flag from get_read_buffer calls in element-wise and matrix multiplication operations" This reverts commit af3441e67e706b2e5122369dc160353796867dd3. * wip * wip * add comment * fix: improve DMA transfer handling in mul_mat_gemv_impl for quantized source tensors * add log * try fix mulmat gemv * wip * fix: enhance DMA transfer handling in mul_mat_gemv_impl for quantized source tensors * fix: optimize cache offset calculation and remove redundant swap in mul_mat_gemv_impl * fix: refactor DMA transfer handling in mul_mat_gemv_impl for improved clarity and maintainability * wip * wip * wip * fix: enhance mul_mat_impl for improved cache handling and clarity * fix: refactor tensor unflattening and DMA transfer initialization for improved clarity and type safety * fix: improve cache handling of quant * wip * fix: improve cache handling in mul_mat_impl and mul_mat_gemv_impl for better memory efficiency * rename * add load_hexa_block_generic * wip * extract dequant block into separated function * refactor: enhance dequantization functions with table parameter * fix load_dual_block_generic * refactor: rename dequantization functions for clarity and enhance block handling * refactor: simplify dequantization logic by consolidating block handling and removing unused parameters * wip * wip * feat: add make_qs_load_mask function and update load_dual_block_generic to use qs_indices * fix load_dual_block_generic * refactor: update load functions to use qs_indices for improved block loading * wip * fix: update loop indices and boundary checks to use size_t for better efficiency * wip * update make_scale_load_mask, to make it available for q8 * feat: add vec_dot_product_quant_impl for quantized dot product computation * refactoring: move come quant func to dedicated file * refactor: rename dequantization functions for clarity and consistency * wip * feat: enhance vec_dot_product_quant_impl with dual dequantization and improved assertions * add vec_dot_product_vqf32_q40_f32 * wip * wip * wip * wip * implement vec_mpy_qf32_qf32_qf32 function and update vec_dot_product_vqf32_q40_f32 to use it * wip * add src0_plane_write_cache_offset * wip * enhance mul_mat_f32 to handle NPU_DATA_TYPE_Q4_0 for quantized matrix multiplication * wip * wip * update test func * refactor mul_mat_gemv_quant_impl to use get_nb for row stride and remove unused test function in init_f16_f32_table * wip * Add support for 4-block dequantization in vec_quant and update dot product implementation * Refactor vec_dot_product_quant_impl to improve variable handling and enhance readability * Refactor vec_dot_product_quant_impl to replace template function with inline vector operations * use Q6_Vqf32_vmpy_VsfVsf instead of Q6_Vqf32_vmpy_Vqf32Vqf32 * Revert "use Q6_Vqf32_vmpy_VsfVsf instead of Q6_Vqf32_vmpy_Vqf32Vqf32" This reverts commit 54839166fddbe40a0392adee5863c59070ccdbe4. * wip * improve log print in graph * Refactor batched_row_dot to accept additional arguments and remove batched_row_dot_with_table * Refactor synchronization functions to include previous operation and NE type parameters * Refactor synchronization checks in several operations * Update synchronization checks to include NPU_OP_COUNT in required conditions * Add performance tracking to buffer management functions * add memset * add log * fix: update backend device type from ACCEL to IGPU * fix comment * add get/set rows * feat: implement row operation support checks in is_rows_supported * feat: add support for I64 data type in rows operations * feat: implement set_rows functionality for I32 and I64 data types * wip * fix set_rows * feat: extend is_rows_supported to allow F32 data type in destination * wip * feat: rename set_rows function, add generic to its name * disable q4_k * move ops to separated file * rename: op_impl -> op_registry * refactor: update get_data_type struct to include output type for unary operations * refactor: simplify vec_trans_impl by removing parameterized overload and using variadic templates * add vec_trans_with_half_ret_impl * add NPU_OP_CPY * refactor: enhance is_unary_op_supported to handle non-continuous rows and add type support logging * refactor: update vec_trans_with_half_ret_impl to use processed_bytes for clarity and accuracy * wip * refactor: optimize dequantize_vec_q40_qf32_4blocks by improving shuffling logic and reducing redundancy * refactor: improve performance of vec_dot_product and dequantize functions by optimizing shuffling logic * wip * add dequantize_vec_q40_qf32_6blocks * feat: add load_dequant_vec_q40_qf32_6blocks function for 6-block dequantization * feat: enhance vec_dot_product_quant_impl with 6-element processing loop for improved performance * Revert "feat: enhance vec_dot_product_quant_impl with 6-element processing loop for improved performance" This reverts commit a5c8fa3e4d9a2d89c8c0821c936c0466e0af7869. since there's a performance degradation * fix: correct load_hexa_block_generic return type and update dequantization logic * wip * wip * feat: add make_q40_qs_load_mask function and update vec_dot_product_vqf32_q40_f32 * fix dequant load * add debug log * wip * wip * fix shuffle index array * refactor: simplify load mask generation and improve index shuffling for q4 blocks * wip * wip * fix comment * wip * update ops.md * update ops.md by create_ops_docs.py # Conflicts: # docs/ops.md		2025-10-30 21:51:15 +08:00
..
backend	SYCL: Update to oneAPI 2025.2 (#16371 )	2025-10-02 10:16:25 +03:00
development	docs : update HOWTO‑add‑model.md for ModelBase and new model classes (#14874 )	2025-07-25 16:25:05 +02:00
multimodal	model : support MiniCPM-V 4.5 (#15575 )	2025-08-26 10:05:55 +02:00
ops	feat: perf opt add set rows (#59 )	2025-10-30 21:51:15 +08:00
android.md	repo : update links to new url (#11886 )	2025-02-15 16:40:57 +02:00
build-riscv64-spacemit.md	ggml: riscv: add riscv spacemit backend (#15288 )	2025-09-29 17:50:44 +03:00
build-s390x.md	ggml-zdnn: fix #15414 , activate FP16 and BF16 acceleration and incorrect zTensor free (#15839 )	2025-09-13 02:39:52 +08:00
build.md	Update build.md to remove MSVC arm64 notes (#15684 )	2025-08-30 23:51:28 +08:00
docker.md	musa: upgrade musa sdk to 4.3.0 (#16240 )	2025-09-26 02:56:38 +02:00
function-calling.md	server : add documentation for `parallel_tool_calls` param (#15647 )	2025-08-29 20:25:40 +03:00
install.md	docs : add "Quick start" section for new users (#13862 )	2025-06-03 13:09:36 +02:00
llguidance.md	llguidance build fixes for Windows (#11664 )	2025-02-14 12:46:08 -08:00
multimodal.md	mtmd : add support for Voxtral (#14862 )	2025-07-28 15:01:48 +02:00
ops.md	feat: perf opt add set rows (#59 )	2025-10-30 21:51:15 +08:00