* ggml : FA with different K, V head sizes (CPU)
ggml-ci
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes
ggml-ci
* metal : add FA vector kernels for heads K 192 and V 128
ggml-ci
* ggml : restrict op on other backends to equal head sizes
ggml-ci
* metal : optimize FA-vec kernel
ggml-ci
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition
ggml-ci
* metal : fix comments + remove unnecessary addition
ggml-ci
* metal : avoid too much shared memory usage with mul_mat_id
ggml-ci
* rpc : send hash when tensor data is above some fixed threshold
ref #10095
* rpc : put cache under $HOME/.cache/llama.cpp
* try to fix win32 build
* another try to fix win32 build
* remove llama as dependency
* move op key generate function to kOpCaps
* fix op desc print
* try fix rms_norm
* Revert "try fix rms_norm"
This reverts commit 33b296098012909cb482fc29b52b28098dc971cd.
* add quantization type support by converting them to float
* enable quantization tensor for mulmat in gpu/npu
* fix asan error
* add log and assert
* insert output convert operator after mulmat
* add log
* fix some error in running
* disable permute again
* add log
* add error function
* Revert "add error function"
This reverts commit f92ff47798ac8053fb776c55efbb1a98469c7af1.
* add log
* more log
* disable convert op in graph
* wip
* add f16 config for graph
* set f16 precision for f16 graph
* fix override data type
* add comment
* add config flag to enable quantize type
* add log
* more quantized type for cpu and gpu backend
* enable all quant types for cpu and gpu backend
* rename
* wip
* add log
* remove unused functions
* skip permute
* remove get_qnn_op_input_param_count
* fallback to generic_get_op_desc if no op_desc
* revert 'skip permute'
* Revert "revert 'skip permute'"
This reverts commit 5761e31fd23c69c4cabf6fd9fac1a0d3e5a74968.
* wip
* add log
* print qnn tensor type
* add log
* limit the max size of tensor
* add log
* fix tensor size limiter
* small improve on tensor info printer
* disable sqrt and div to pass test-backend-ops for 8 gen 2
* remove debug log in release build
* add log
* skip permute in src
* wip
* disable reshape
* skip mul at decoder start
* wip
* add log
* add qnn_scoped_timer
* add perf tracker in graph
* add cmake options GGML_QNN_ENABLE_PERFORMANCE_TRACKING
* fix flag name
* use milli-second
* wip
* fix comment string
* add file for profiler
* change qnn-cpu to GGML_BACKEND_DEVICE_TYPE_ACCEL, so that we can run tests on cpu
* wip
* profiler: refactoring
* wip
* add implement for print_profile_events
* set-up profiler for graph
* set profiler to graph execute
* pretty print events
* unified log print prefix
* print event count
* enable optrace
* print duration at event end
* wip
* add more detailed soc information
* wip
* move device caps array into qnn-lib.cpp
* remove lib_name in device_context
* move get_graph_key_from_cgraph to graph.cpp
* add override type for tensor key
* use override_type instead of original data type for graph key
* append op type to tensor name to fix error in qwen
* remove todo
* wip
* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions
* cmake: Add GGML_BMI2 build option
* ggml: enable BMI2 on relevant CPU variants
* ggml-cpu: include BMI2 in backend score
* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features
* ggml-cpu: add __BMI2__ define when using MSVC
* Add include files for std::min/max and std::toupper/tolower
* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined
* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode
* win32: only use __restrict in MSVC if C11/C17 support is not enabled
---------
Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
* Upgrade init_tensor API to return a ggml_status
To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.
* misc fixes
---------
Co-authored-by: slaren <slarengh@gmail.com>
* ggml-cpu: Add CPU backend support for KleidiAI library
* Add environmental variable GGML_KLEIDIAI_SME
* Add support for multithread LHS conversion
* Switch kernel selection order to dotprod and i8mm
* updates for review comments
* More updates for review comments
* Reorganize and rename KleidiAI files
* Move ggml-cpu-traits.h to source file
* Update cmake for SME build and add alignment for SME
* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* GGUF: C++ refactor, backend support, misc fixes
remove ggml_tensor.backend
update CODEOWNERS [no ci]
remove gguf_get_data from API
revise GGUF API data types
* [cl][adreno] Add Adreno GPU support
Add new OpenCL backend to support Adreno GPUs
---------
Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
Co-authored-by: Alexander Angus <quic_aangus@quicinc.com>
Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com>
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
* [cl][ci] Add workflow for CL
* [cl][adreno] Fix memory leak for non SMALL_ALLOC path
* opencl: integrate backend dyn.load interface and fix compiler and format warnings
* opencl: remove small-alloc support and fix build errors for non-opencl platforms
* opencl: fixed merge conflict (MUSA added twice in cmake)
* opencl-ci: use RUNNER_TEMP instead of github.workspace
* opencl: fix embed tool invocation with python3
* opencl: CI workflow fixes
* opencl: Clean up small-alloc in CMake files
* opencl: cleanup ggml-opencl2 header file
* opencl: use ulong for offsets and strides in ADD kernel
* opencl: use cl_ulong for all offsets
* opencl: use cl_ulong for sizes and strides
* opencl: use `GGML_LOG_xxx` instead of `fprintf(stderr, ...)`
* opencl: rename backend `opencl2` -> `opencl`
* opencl: rename kernel files `ggml-opencl2` -> `ggml-opencl`
* opencl: make OpenCL required, remove redundant lib and inc directories
* `ggml-base`, `..` and `.` are added by `ggml_add_backend_library`
* opencl: rename backend - funcs, structs, etc `opencl2` -> `opencl`
* opencl: remove copyright marker since main license already covers
* opencl: replace some more OPENCL2 leftovers
* opencl: remove limits on `tensor_extra`
* opencl: use pools for `tensor_extra`
* opencl: fix compiler warnings with GCC and Clang
Still getting the warning about clCreateCmdQueue being obsolete.
Will fix that separately.
* opencl: fail gracefully if opencl devices are not available
Also for unsupported GPUs.
* opencl: fix MSVC builds (string length error)
* opencl: check for various requirements, allow deprecated API
* opencl: update log message for unsupported GPUs
---------
Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
Co-authored-by: Alexander Angus <quic_aangus@quicinc.com>
Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com>
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
* feat: load all backends from a user-provided search path
* fix: Windows search path
* refactor: rename `ggml_backend_load_all_in_search_path` to `ggml_backend_load_all_from_path`
* refactor: rename `search_path` to `dir_path`
* fix: change `NULL` to `nullptr`
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* fix: change `NULL` to `nullptr`
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
- clean Q4_0_N_M and IQ4_0_N_M
- remove from "file" tensor type
- allow only with dynamic repack
- extract cpu extra bufts and convert to C++
- hbm
- "aarch64"
- more generic use of extra buffer
- generalise extra_supports_op
- new API for "cpu-accel":
- amx
- aarch64
* clang-format
* Clean Q4_0_N_M ref
Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml_pad_reflect_1d defined in header
* implemented on CPU
* called the forward pass
* impl Metal kernel
* added Metal kernel
* added OP_PAD_REFLECT_1D in test-backend-ops.cpp
* add test-pad-reflect-1d test case
* test case support multiple backend
* remove unused functions
* wip
* init from last devices
* move init into constructor
* wip
* add static assert to device table
* make kDeviceCaps as constexpr
* get free memory and total memory
* add optimize flag for qnn backend
* ggml : add ggml_flash_attn_ext_get_prec
* metal : use F16 precision in FA kernels
ggml-ci
* metal : minor clean-up
* metal : compile-guard bf16 FA kernels
ggml-ci
* build : remove obsolete compile flag [no ci]
* metal : prevent int overflows [no ci]
* cuda : disable BF16 FA
ggml-ci
* metal : fix BF16 requirement for FA kernels
ggml-ci
* make : clean-up [no ci]
* rwkv6: rename to wkv6
* rwkv6: support avx2 avx512 armv8 armv9
* rwkv6: update cuda file name
* rwkv6: rename params
* wkv on sycl
* sycl: add some ops
* sycl: Enhance OP support judgment
* wkv6: drop armv9 and tranfer to GGML style
ggml-ci
* sync : ggml
* update the function to use appropriate types
* fix define error
* Update ggml/src/ggml-cpu.c
* add appropriate asserts
* move element-wise functions outside
* put the declaration outside the loop
* rewrite to be more inline with the common pattern for distributing threads
* use recommended way GGML_TENSOR_LOCALS
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Plamen Minev <pacominev@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>