* move qnn_instance function implementation into cpp
* wip
* wip
* move dl-related functions into a separate file
* use cast op for gpu
* Revert "use cast op for gpu"
This reverts commit 05df7362a15c022d05940d682e84cf480a082c6a.
* Reapply "use cast op for gpu"
This reverts commit 2520e5922a216faceb6d7efcde23dafe6947a4b3.
* fix compiling error in win
* fix align_alloc in win
* fix compiling error
* add get sys free/total mem for win
* wip
* suppress warning in win
* add missing chrono header
* set the correct qnn lib name for windows
* add flag to control cpu backend
* wip
* wip
* Revert "Reapply "use cast op for gpu""
This reverts commit f56519c374a7d46faac706cf214de48ff5fc5139.
* fix compiling error for linux build
* fix cdsprpc dynamic library name
* wip
* skip rpc load fail
* fix page_align_alloc
* suppress some warning in gcc
* wip
* reuse align to function
* more log
* add log and fix warning
* wip
* fix asan errors and memory leaks
* fix the get_io_tensors_from_graph
* improve comment
* print GGML_QNN_DEFAULT_LIB_SEARCH_PATH
* revert some unused changes
* move library search path setter into qnn module
* fix android library loading
* skip qnn_device_get_platform_info for npu emulator
* ggml-cpu : add chunking support to mul_mat_id
* allocate chunk counter in wdata
parallelize src1 quantization by column to allow parallelization even when there is only one row
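A minimal sketch of the shared chunk-counter idea described above, using hypothetical names (`current_chunk`, `process_chunk`) rather than the actual ggml-cpu code: an atomic counter lives in the shared workspace ("wdata" in ggml terms) and every thread keeps claiming the next unprocessed chunk until none remain.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static void process_chunk(int chunk, int nchunks) {
    // placeholder for the real per-chunk matrix-multiplication work
    std::printf("chunk %d/%d\n", chunk, nchunks);
}

int main() {
    const int nchunks  = 64;
    const int nthreads = 4;

    // in the real code this counter sits in the shared compute workspace
    std::atomic<int> current_chunk{0};

    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                const int chunk = current_chunk.fetch_add(1, std::memory_order_relaxed);
                if (chunk >= nchunks) {
                    break; // no chunks left to claim
                }
                process_chunk(chunk, nchunks);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```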
* disable for arm
* cleanup
* better way to disable for arm
* fix uninitialized counter when using 1 thread only
* revert test-backend-ops changes
* Bug fix for clamp_f32
When using tensors with more than one dimension, the clamp operation does not work due to the restriction of returning when ith is not 0.
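A simplified sketch of that failure mode, assuming rows are interleaved across threads; the `clamp_rows` helper is hypothetical and not the actual ggml-cpu source. If every thread except thread 0 returns early while the row loop is still strided by the thread count, the rows assigned to the other threads are never clamped once the tensor has more than one row.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// clamp every row of a row-major nrows x ncols matrix; thread `ith` of `nth`
// handles rows ith, ith + nth, ith + 2*nth, ...
static void clamp_rows(std::vector<float> & data, int nrows, int ncols,
                       float lo, float hi, int ith, int nth) {
    for (int r = ith; r < nrows; r += nth) {
        float * row = data.data() + (size_t) r * ncols;
        for (int c = 0; c < ncols; ++c) {
            row[c] = std::min(std::max(row[c], lo), hi);
        }
    }
}

int main() {
    std::vector<float> m = { -2, -1, 0, 1, 2, 3 }; // 3 rows x 2 cols
    const int nth = 2;

    // buggy pattern: every thread with ith != 0 returns early, so only the
    // rows belonging to thread 0 (rows 0 and 2) would ever be clamped
    //clamp_rows(m, 3, 2, -1.0f, 1.0f, /*ith=*/0, nth);

    // correct pattern: every thread processes its interleaved share of rows
    for (int ith = 0; ith < nth; ++ith) {
        clamp_rows(m, 3, 2, -1.0f, 1.0f, ith, nth);
    }

    for (float v : m) {
        std::printf("%g ", v); // -1 -1 0 1 1 1
    }
    std::printf("\n");
}
```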
* Bug fix for clamp_f32
* Bug fix for clamp_f32
After the barrier in the last iteration is executed, the loop termination
condition is still evaluated. By that point the main thread may already have
destroyed the cgraph object and its nodes, and another thread would then access
memory that is already gone. Trouble can also happen when n_nodes == 0 or when
abort is called, though I'm not sure whether the former situation is actually possible.
The last synchronization should be done after the loop to ensure the cgraph/cplan
won't be accessed after the main thread exits from the function.
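A stripped-down sketch of the race and the fix, using std::barrier and made-up types rather than the real ggml thread pool: the loop condition reads the graph again after the per-node barrier of the last iteration, so only an extra synchronization after the loop makes it safe for the caller to free the graph.

```cpp
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

struct fake_graph {
    int n_nodes;
};

static void compute_thread(const fake_graph * g, std::barrier<> & sync, int ith) {
    for (int i = 0; i < g->n_nodes; ++i) {
        // ... each thread computes its part of node i ...
        sync.arrive_and_wait(); // per-node barrier; note that the next loop
                                // condition check still reads g->n_nodes
    }
    sync.arrive_and_wait();     // final synchronization: after this point no
                                // thread reads the graph again
    (void) ith;
}

int main() {
    const int nth = 4;
    std::barrier<> sync(nth);

    auto * g = new fake_graph{8};

    std::vector<std::thread> workers;
    for (int ith = 1; ith < nth; ++ith) {
        workers.emplace_back(compute_thread, g, std::ref(sync), ith);
    }
    compute_thread(g, sync, 0); // the main thread acts as thread 0

    // safe only because every thread has passed the final barrier; without it
    // a worker could still be evaluating `i < g->n_nodes` on freed memory
    delete g;

    for (auto & w : workers) {
        w.join();
    }
    std::printf("done\n");
}
```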
* ggml : optimize convert f32<->f16 for loongarch_asx
* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16
* ggml : Fix warnings when run cpu CI locally on LoongArch
Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes
+ Check if `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer’s allocated region.
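A standalone sketch of that check, with plain integer types and hypothetical names instead of the actual rpc_server/ggml buffer API: the copy is accepted only if the destination pointer plus the source size stays inside the destination buffer's [base, base + size) region.

```cpp
#include <cstdint>
#include <cstdio>

// returns true if writing n_src_bytes at dst_data stays within the destination
// buffer that starts at dst_base and spans dst_buf_size bytes
static bool copy_fits_in_dst(uintptr_t dst_data, size_t n_src_bytes,
                             uintptr_t dst_base, size_t dst_buf_size) {
    // reject writes that start before the buffer or run past its end;
    // the comparison is done on the offset side to avoid pointer overflow
    if (dst_data < dst_base) {
        return false;
    }
    const size_t offset = dst_data - dst_base;
    return n_src_bytes <= dst_buf_size && offset <= dst_buf_size - n_src_bytes;
}

int main() {
    unsigned char buf[256];
    const uintptr_t base = (uintptr_t) buf;

    std::printf("%d\n", copy_fits_in_dst(base + 200, 56, base, sizeof(buf))); // 1: fits exactly
    std::printf("%d\n", copy_fits_in_dst(base + 200, 57, base, sizeof(buf))); // 0: one byte past the end
}
```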
This makes git an optional dependency, which is useful when ggml is built not
from a git checkout but from a tarball or a distribution source package.
This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be using it,
though, so there doesn't seem to be much value in factoring it out, or even in
requiring it.
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* vulkan: initial support for IQ3_S
* vulkan: initial support for IQ3_XXS
* vulkan: initial support for IQ2_XXS
* vulkan: initial support for IQ2_XS
* vulkan: optimize Q3_K by removing branches
* vulkan: implement dequantize variants for coopmat2
* vulkan: initial support for IQ2_S
* vulkan: vertically realign code
* port failing dequant callbacks from mul_mm
* Fix array length mismatches
* vulkan: avoid using workgroup size before it is referenced
* tests: increase timeout for Vulkan llvmpipe backend
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* Add option to not print stack on abort
Add an option/envvar to disable stack printing on abort.
Also link some unit tests with Threads to fix link errors on
ubuntu/g++11.
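A minimal sketch of the opt-out, assuming an environment-variable check in front of the backtrace printer; the variable name EXAMPLE_NO_BACKTRACE and the helper functions are placeholders, not the actual ggml option names.

```cpp
#include <cstdio>
#include <cstdlib>

static void print_backtrace(void) {
    // placeholder for the real backtrace printer (backtrace()/addr2line/etc.)
    std::fprintf(stderr, "(stack trace would be printed here)\n");
}

[[noreturn]] static void abort_with_message(const char * msg) {
    std::fprintf(stderr, "fatal: %s\n", msg);
    // honor the opt-out: any non-empty value disables the stack dump
    const char * no_bt = std::getenv("EXAMPLE_NO_BACKTRACE"); // hypothetical name
    if (no_bt == nullptr || no_bt[0] == '\0') {
        print_backtrace();
    }
    std::abort();
}

int main() {
    abort_with_message("demo abort");
}
```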
* Update ggml/src/ggml.c
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Loops with bounds not known at compile time cannot be unrolled.
When ncols_template == 0, the bounds of the loop are not constexpr, so LLVM
cannot unroll the loops here.
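A simplified host-side C++ illustration of the ncols_template pattern (not the actual CUDA kernel): a non-zero template parameter makes the trip count a compile-time constant that an unroll hint can act on, while the 0 specialization falls back to a runtime bound that cannot be unrolled.

```cpp
#include <cstdio>

template <int ncols_template>
float row_sum(const float * row, int ncols_arg) {
    // compile-time bound when available, runtime bound otherwise
    const int ncols = ncols_template == 0 ? ncols_arg : ncols_template;

    float sum = 0.0f;
#pragma unroll // honored by clang/nvcc, and only effective when ncols is a
               // compile-time constant; with ncols_template == 0 the compiler
               // cannot unroll and may warn instead
    for (int i = 0; i < ncols; ++i) {
        sum += row[i];
    }
    return sum;
}

int main() {
    float row[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    std::printf("%g\n", row_sum<8>(row, 8)); // bound known at compile time
    std::printf("%g\n", row_sum<0>(row, 8)); // generic fallback, runtime bound
}
```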