Commit Graph

885 Commits

Author SHA1 Message Date
R0CKSTAR fac63a3d78
musa: refine compute capability (#12493)
* musa: refine compute capability

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-22 10:11:37 +01:00
Jeff Bolz eddfb43850
vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.
2025-03-22 09:40:11 +01:00
hongruichen 1caca627ea fix compiling error after merge 2025-03-22 12:51:09 +08:00
hongruichen c2887f0377 Merge branch 'master' into dev-refactoring 2025-03-22 12:36:55 +08:00
nullname a1ab67478f
[feat] add more op (#35)
* move op key generate function to kOpCaps

* fix op desc print

* try fix rms_norm

* Revert "try fix rms_norm"

This reverts commit 33b296098012909cb482fc29b52b28098dc971cd.

* add quantization type support by converting them to float

* enable quantization tensor for mulmat in gpu/npu

* fix asan error

* add log and assert

* insert output convert operator after mulmat

* add log

* fix some error in running

* disable permute again

* add log

* add error function

* Revert "add error function"

This reverts commit f92ff47798ac8053fb776c55efbb1a98469c7af1.

* add log

* more log

* disable convert op in graph

* wip

* add f16 config for graph

* set f16 precision for f16 graph

* fix override data type

* add comment

* add config flag to enable quantize type

* add log

* more quantized type for cpu and gpu backend

* enable all quant types for cpu and gpu backend

* rename

* wip

* add log

* remove unused functions

* skip permute

* remove get_qnn_op_input_param_count

* fallback to generic_get_op_desc if no op_desc

* revert 'skip permute'

* Revert "revert 'skip permute'"

This reverts commit 5761e31fd23c69c4cabf6fd9fac1a0d3e5a74968.

* wip

* add log

* print qnn tensor type

* add log

* limit the max size of tensor

* add log

* fix tensor size limiter

* small improve on tensor info printer

* disable sqrt and div to pass test-backend-ops for 8 gen 2

* remove debug log in release build

* add log

* skip permute in src

* wip

* disable reshape

* skip mul at decoder start

* wip

* add log

* add qnn_scoped_timer

* add perf tracker in graph

* add cmake options GGML_QNN_ENABLE_PERFORMANCE_TRACKING

* fix flag name

* use milli-second

* wip

* fix comment string

* add file for profiler

* change qnn-cpu to GGML_BACKEND_DEVICE_TYPE_ACCEL, so that we can run tests on cpu

* wip

* profiler: refactoring

* wip

* add implement for print_profile_events

* set-up profiler for graph

* set profiler to graph execute

* pretty print events

* unified log print prefix

* print event count

* enable optrace

* print duration at event end

* wip

* add more detailed soc information

* wip

* move device caps array into qnn-lib.cpp

* remove lib_name in device_context

* move get_graph_key_from_cgraph to graph.cpp

* add override type for tensor key

* use override_type instead of original data type for graph key

* append op type to tensor name to fix error in qwen

* remove todo

* wip
2025-03-22 12:34:31 +08:00
stduhpf 4375415b4a
Vulkan: RTE rounding for cpy to quant (#12480)
* Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz <jbolz@nvidia.com>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-03-21 20:34:50 +01:00
Eve 30c42ef5cb
vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472) 2025-03-21 20:27:47 +01:00
蕭澧邦 1aa87ee53d
[SYCL] Fix build on Windows when ccache enabled (#9954) (#9976)
* [SYCL] Fix build on Windows when ccache enabled (#9954)

* take effect only on windows and force it to icl

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-21 14:58:47 +08:00
Svetlozar Georgiev 9ffcc9e374
sycl: cleanup oneDNN related code (#12097) 2025-03-21 10:15:56 +08:00
Srihari-mcw 3d82dbcbce
ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332)
* Add block interleaving support for Q4_K quantization

* Remove whitespaces and fix CI/CD issues

* Update pointer of bsums from int16_t to const int16_t

* Add vector version of quantize_q8_K_4x8 function

* Update code formatting based on review comments
2025-03-20 13:35:34 +02:00
Gaurav Garg 517b5ddbf0
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-19 20:52:06 +01:00
Jeff Bolz a9b59288e2
vulkan: optimize iq1 coopmat2 dequant functions (#12427) 2025-03-19 19:56:23 +01:00
Guus Waals 0fd8487b14
Fix visionOS build and add CI (#12415)
* ci: add visionOS build workflow

Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.

* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs

* ci: remove define hacks for u_xxx system types

---------

Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>
2025-03-19 11:15:23 +01:00
Jeff Bolz c446b2edd2
vulkan: Submit once enough matmul work has been recorded (#12406)
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
2025-03-19 08:26:26 +01:00
lhez d84635b1b0
opencl: improve profiling (#12442)
* opencl: more profiling timing

* opencl: generate trace for profiling

* opencl: reduce profiling overhead

* Populate profiling timing info at the end rather than after each
  kernel run

* opencl: fix for chrome tracing
2025-03-18 12:54:55 -07:00
R0CKSTAR bb115d2bf7
musa: override warp_size of musa device to 32 (#12445)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-18 19:28:26 +01:00
Łukasz Ślusarczyk 35cae5ba05
SYCL: using graphs is configurable by environment variable and compile option (#12371)
* alberto changes

* enable sycl graphs by env variable

* fixed compilation warnings in ggml-sycl.cpp

* renamed graph variables

* fix markdown in docs/backend/SYCL.md

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>

* fix markdown in docs/backend/SYCL.md again

* compiling graphs by default, renamed graph_enable to graph_disable

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-18 11:16:31 +01:00
Prajwal B Mehendarkar eba92d64c3
cmake : fix PowerPC build (#12241)
Closes #12240
2025-03-18 11:37:33 +02:00
fj-y-saito d9a14523bb
ggml : add SVE support for q6_K_q8_K (#12361) 2025-03-18 10:14:39 +02:00
0cc4m fd123cfead
Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434) 2025-03-18 07:21:40 +01:00
Łukasz Ślusarczyk a53f7f7b88
fixed compilation warnings in ggml-sycl (#12424) 2025-03-18 08:51:25 +08:00
Molly Sophia 7dfad387e3
llama: Add support for RWKV v7 architecture (#12412)
* ggml: Add op l2_norm

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code-format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* fix MUSA build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-18 07:27:50 +08:00
Gaurav Garg b1b132efcb
cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394)
* Enable CUDA Graph on CTK < 12.x

`cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x.

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA
2025-03-17 20:25:13 +02:00
Guus Waals 01e8f2138b
ggml-vulkan: remove unused find_program(glslc) (#12416)
It's already found by FindVulkan.cmake in the parent CMakeLists
2025-03-17 13:35:43 -03:00
Jeff Bolz 484a8ab513
vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312) 2025-03-17 09:26:18 -05:00
Daniele cf2270e4d3
vulkan: subgroup size tuning (#12087)
* vulkan: subgroup size test

* Vulkan: Add device architecture enum and logic to recognize AMD generations

* vulkan: use new architecture logic to specify subgroup size

* Initial vulkan subgroup size tuning for RDNA3

* vulkan: commonize RDNA subgroup tuning

* vulkan: override subgroup size if required_subgroup_size = 0

* vulkan: disable warp 32 for RDNA3

* vulkan: fine tuned RDNA1 subgroup sizes

* vulkan: adjusted subgroup size map

* vulkan: fixed RDNA2 subgroup map

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-03-17 12:42:33 +01:00
Jeff Bolz f07690c930
vulkan: use fp32 in coopmat2 q4_k dequant function (#12309) 2025-03-17 10:43:35 +01:00
Jeff Bolz 891c63956d
vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273)
* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
2025-03-17 10:41:59 +01:00
Jeff Bolz 2f21123c1d
vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258) 2025-03-17 10:35:00 +01:00
Christian Kastner 374101fd74
cmake : enable building llama.cpp using system libggml (#12321)
* cmake: Factor out compiler flag function from ggml

llama.cpps's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).

* cmake: Enable building against system ggml

This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
2025-03-17 11:05:23 +02:00
Akarshan Biswas b3c9a65673
SYCL: set extras only on GGML_TYPE_Q4_0 (#12366)
* SYCL: set extras only on GGML_TYPE_Q4_0

* release tensor_extras in reset buffer interface
2025-03-17 09:45:12 +08:00
aubreyli 3d35d87b41
SYCL: Delete redundant plus sign and space (#12391) 2025-03-15 15:49:03 +01:00
fairydreaming b19bd064c0
SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (#12399)
* sycl : support non-contiguous tensors in binary ops

* sycl : silence unused variable warning

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-15 22:19:30 +08:00
Chenguang Li 92a391327e
[CANN]MUL_MAT optimization (#12382) 2025-03-15 09:31:08 +08:00
hongruichen 525cd2d641 Merge branch 'master' into dev-refactoring 2025-03-14 10:04:11 +08:00
Alberto Cabrera Pérez 363f8c5d67
sycl : variable sg_size support for mmvq kernels (#12336) 2025-03-12 09:57:32 +00:00
uvos 34c961b181
CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (#12315)
When fattn-wmma was ported over to warp64 various bits that also touch fattn-vec where converted to
selectable warp size, however the fattn-vec kernels dont work with 64 wide warps for now, so we need
to avoid launching them with parameters for warp64
2025-03-12 10:14:11 +01:00
Jeff Bolz bf69cfe62f
vulkan: fix bug in coopmat1 mul_mat_id (#12316)
* tests: run mul_mat_id with a larger N

* vulkan: fix bug in coopmat1 mul_mat_id
2025-03-12 06:59:19 +01:00
uvos 10f2e81809
CUDA/HIP: refractor mmqv to unify the calculation of nwarps and rows per block between host and device code. (#12177)
refactor mmqv to unify the calculation of nwarps and rows per block between host and device code.

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-11 20:16:03 +01:00
jklincn ba7654380a
ggml-backend : fix backend search path (#12330)
* Fix backend search path

* replace .native() with '/'

* reverted .native()
2025-03-11 14:25:17 +01:00
BB-fat 6ab2e4765a
metal : Cache the Metal library at the device context level (#12265) 2025-03-11 13:45:02 +02:00
Eve 2c9f833d17
mat vec double buffer (#12188) 2025-03-10 19:28:11 +00:00
R0CKSTAR 251364549f
musa: support new arch mp_31 and update doc (#12296)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-10 18:18:25 +01:00
Henry Linjamäki 8acdacb3ea
opencl: use OpenCL C standard supported by the device (#12221)
This patch nudges the llama.cpp a bit to be supported on PoCL which
doesn't support OpenCL C CL2.0. The issue is solved by querying the
device for the supported OpenCL C versions and using the highest one
available.
2025-03-10 09:57:00 -07:00
Jason C.H 6fefc05a7a
ggml-backend : make path_str compatible with C++20 (#12269) 2025-03-08 17:02:39 +01:00
Daniel Bevenius 7c7f3b7f43
ggml : skip intermediate .air file when compiling .metallib (#12247)
This commit updates the compilation of default.metallib to skip the
intermediate .air (Apple Intermediate Representation) file.

The motivation for this change is to simplify the custom command a
little and avoid generating and then removing the .air file.
2025-03-07 14:15:27 +01:00
vmobilis d6ae2fa061 ggml : ggml_compute_forward_concat() for arbitrary tensor type (ggml/1118)
* ggml_compute_forward_concat() for arbitrary tensor type

* Check that tensors' type match

* ggml-cpu.c: check type of source tensors

* ggml-cpu.c: move tensor type check to ggml_compute_forward_concat()

* ggml.c: check concatenated tensor type

* Remove tensor type check from ggml_compute_forward_concat() in ggml-cpu.c

..., as it was moved to ggml.c.
2025-03-07 14:49:44 +02:00
Rémy O 68d0027f3d
ggml-cpu: faster AVX2 variant for IQ1_M (#12216) 2025-03-07 13:54:22 +02:00
BB-fat 5e2d57b2b2
metal : simplify kernel arguments using a struct (#3229) (#12194)
* metal : refactor im2col parameters into a struct

* metal: Change im2col offset types from int32_t to uint64_t to support larger memory offsets

* metal : refactor sum_rows parameters into a struct

* metal : refactor soft_max parameters into a struct

* metal : refactor diag_mask_inf parameters into a struct

* metal : refactor ssm_conv parameters into a struct

* metal : refactor ssm_scan parameters into a struct

* metal : refactor get_rows parameters into a struct

* metal : refactor group_norm parameters into a struct

* metal : refactor conv_transpose_1d parameters into a struct

* metal : refactor upscale parameters into a struct

* metal : refactor pad parameters into a struct

* metal : refactor pad_reflect_1d parameters into a struct

* metal : refactor arange parameters into a struct

* metal : refactor timestep_embedding parameters into a struct

* metal : refactor argsort parameters into a struct

* metal : refactor leaky_relu parameters into a struct

* metal : refactor pool_2d parameters into a struct

* metal : fix trailing whitespace

---------

Co-authored-by: alexju <alexju@tencent.com>
2025-03-07 08:35:57 +01:00
Daniel Bevenius d6c95b0740
metal : fix default.metallib build (#12224)
This commit updates the custom command to build the default.metallib
file to use the correct path to ../ggml-common.h by using the variable
METALLIB_COMMON.

The motivation for this change is that currently when building and
specifying GGML_METAL_EMBED_LIBRARY=OFF the following error is
generated:
```console
[ 11%] Linking CXX shared library ../../bin/libggml.dylib
[ 11%] Built target ggml
make[2]: *** No rule to make target `ggml/src/ggml-metal/ggml-common.h', needed by `bin/default.metallib'.  Stop.
make[1]: *** [ggml/src/ggml-metal/CMakeFiles/ggml-metal-lib.dir/all] Error 2
```

With the above change the build could progress but there was a follow
on error about not being able to find the ggml-common.h file in
ggml-metal.metal where is was included as a relative path:
```console
[ 11%] Compiling Metal kernels
/Users/danbev/work/llama.cpp/build/bin/ggml-metal.metal:6:10: error: '../ggml-common.h' file not found, did you mean 'ggml-common.h'?
         ^~~~~~~~~~~~~~~~~~
         "ggml-common.h"
1 error generated.
```
Removing the relative path then allowed the build to complete
successfully.
2025-03-07 06:23:16 +01:00
lhez d76a86d967
opencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops (#12217)
* opencl: support noncontiguous `norm`

* opencl: support noncontiguous `rms_norm`

* opencl: disable fp16 for `ADD`, `MUL`, `SCALE`, `RELU`, `GELU`, `SILU`, `CLAMP`
2025-03-07 00:20:35 +00:00
xiaofei 776f9e59cc
cmake : fix undefined reference errors for std::filesystem in ggml (#12092) (#12094)
Signed-off-by: Ray Lee <hburaylee@gmail.com>
Co-authored-by: Ray Lee <hburaylee@gmail.com>
2025-03-06 22:58:25 +00:00
Johannes Gäßler 5220a16d18
CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (#12222) 2025-03-06 18:45:09 +01:00
uvos e721c05c93
HIP/CUDA: set the paramerter value in maintain_cuda_graph instead of replaceing it. (#12209)
This avoids conflict with internal cuda/hip runtimes memory managment behavior.
2025-03-06 08:20:52 +01:00
Henry Linjamäki 94bb63e4f0
opencl : fix buffer alignment (#12197)
Fix the following error:

```
ggml-alloc.c:99: not enough space in the buffer
ggml_tallocr_alloc: not enough space in the buffer to allocate blk.17.ffn_down.weight (needed 27525120, available 27521024)
```

which occurs when `ggml_backend_opencl_context::alignment` is larger
than `cl_ptr_base` (hard-coded to `0x1000`).

Also, fix `ggml_backend_opencl_context::alignment` was set to
`CL_DEVICE_MEM_BASE_ADDR_ALIGN` which was treated as bytes but the
value is reported in bits.
2025-03-06 02:33:40 +01:00
Henry Linjamäki f79243992c
opencl : fix `ulong` kernel args were set from `int` variables (#12174)
... which left garbage bits in the upper half of the kernel args. This
caused segmentation faults when running PoCL.
2025-03-06 02:31:14 +01:00
simon886212 ed4ce0dda2
opencl : fix profile-related errors (#12095)
Co-authored-by: ubuntu <ubuntu@localhost.localdomain>
2025-03-06 02:30:05 +01:00
Rémy O 07d1572347
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (#12154)
* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions

* cmake: Add GGML_BMI2 build option

* ggml: enable BMI2 on relevant CPU variants

* ggml-cpu: include BMI2 in backend score

* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features

* ggml-cpu: add __BMI2__ define when using MSVC
2025-03-06 02:26:10 +01:00
Akarshan Biswas 5e43f104cc
SYCL: Disable f16 Unary OPs as not supported by the kernels (#12201) 2025-03-05 16:58:23 +01:00
Plamen Minev 16e4b22c5e
ggml : fix GGMLMetalClass ODR (#12200)
-- it might happen if ggml is loaded from 2 separate libraries since each one of them will expose the class. This is more of a guard since we want to use only Metal as embedded library and don't care about the other case.
2025-03-05 17:16:01 +02:00
hongruichen 31847c8301 fix compiling error after merge 2025-03-05 22:25:36 +08:00
hongruichen 27cec63802 Merge branch 'master' into dev-refactoring 2025-03-05 22:22:21 +08:00
mgroeber9110 5bbe6a9fe9
ggml : portability fixes for VS 2017 (#12150)
* Add include files for std::min/max and std::toupper/tolower

* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined

* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode

* win32: only use __restrict in MSVC if C11/C17 support is not enabled

---------

Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
2025-03-04 18:53:26 +02:00
David Huang becade5de7
HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (#12032)
Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check
Adds rocWMMA support to fattn-wmma-f16

---

Signed-off-by: Carl Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Ben Jackson <ben@ben.com>
2025-03-03 22:10:54 +01:00
cmdr2 b64d7cc272 cuda: unary ops as float + de-duplicate (ggml/1130) 2025-03-03 18:18:11 +02:00
cmdr2 0cbee131ad cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)
ggml-ci
2025-03-03 18:18:11 +02:00
cmdr2 87abb7e903 cuda/cpu: Increase support for fp16 unary operations (ggml/1125)
* Support fp16 unary operations in the CUDA backend

* cpu: increase fp16 support for unary operators in the CPU backend

* cuda: increase fp16 support for unary operators in the CUDA backend

* Add test cases for fp16 unary operators

* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing

* metal: fix PR comments for unary op support after fp16 unary tests
2025-03-03 18:18:11 +02:00
Diego Devesa 6d4c23b81b whisper : support GGML_BACKEND_DL (whisper/2843)
* whisper : support GGML_BACKEND_DL

* fix DTW crash

* whisper.objc : fix build - add ggml-cpp.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-03 18:18:11 +02:00
midnight 6512a90037 cmake : fix compile assumptions for power9/etc (whisper/2777)
* Add small comment re: VSX to readme

Co-authored-by: midnight <midnight@example.com>
2025-03-03 18:18:11 +02:00
petterreinholdtsen 4512055792 Told cmake to install ggml-cpp.h as a public header file. (ggml/1126)
It is used by Whisper talk-llama example.

Co-authored-by: Petter Reinholdtsen <pere@debian.org>
2025-03-03 18:18:11 +02:00
cmdr2 f54a4ba11e Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121)
* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend

* Add fp16 support for add/sub/mul/div on the CPU backend

* Add test cases for fp16 add/sub/mul/div
2025-03-03 18:18:11 +02:00
ag2s20150909 9660ffef58
ggml : fix kleidiai build (#12159)
The libggml API has changed, but this has not been updated.
2025-03-03 13:54:08 +01:00
Akarshan Biswas ece9745bb8
SYCL: Move CPY kernels to a separate file and add few missing kernels (#12133)
* SYCL: refactor and move cpy kernels to a separate file

* Add few missing cpy kernels

* refactor and add debug logs
2025-03-03 11:07:22 +01:00
Diego Devesa cc473cac7c
ggml-backend : keep paths in native string type when possible (#12144) 2025-03-02 22:11:00 +01:00
Erik Scholz 80c41ddd8f
CUDA: compress mode option and default to size (#12029)
cuda 12.8 added the option to specify stronger compression for binaries, so we now default to "size".
2025-03-01 12:57:22 +01:00
nullname 8b652dd6ec
bug: fix benchmark debug warning (#31)
* print build type

* wip

* print compiling flags

* wip

* wip
2025-02-28 22:54:57 +08:00
William Tambellini 70680c48e5
ggml : upgrade init_tensor API to return a ggml_status (#11854)
* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.

* misc fixes

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-02-28 14:41:47 +01:00
nullname f289752664
[bugfix]make sure single node op will have the same type (#29)
* debug

* disable reshape

* make sure single node op have same type

* fix warning at the logger

* Revert "disable reshape"

This reverts commit 5aeca4ba9bec6db3f047f9da803df20f9f6612b3.
2025-02-28 19:18:16 +08:00
Rémy O 438a83926a
vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (#11595)
* vulkan: implement specialized MMV kernels for IQ2 quantizations

* vulkan: add MMV kernels for IQ3 quants

* vulkan: Increase MMV batch size and unroll IQ LUT setup

* vulkan: fix init_iq_shmem for WG sizes larger than tables

* vulkan: common batch size for all I-quants
2025-02-28 09:42:52 +01:00
Johannes Gäßler 9c42b1718c
CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (#12098) 2025-02-28 09:26:43 +01:00
Prashant Vithule 05e6f5aad0
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (#12064)
* Added SVE Support for Q2_K Quantized Models

* Use 4-space indentation in the switch cases

* removed comments lines

* Remove the loop Retain the curly bracess for better understanding of code

* Remove the comment like added for q3_k_q8_k kernel

---------

Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
2025-02-28 09:36:12 +02:00
hipudding 673cfef9aa
CANN: Fix build error with GCC 13 (#11990)
Remove unused header file that causes compilation failure on ARM
platform with GCC 13.
2025-02-28 15:23:47 +08:00
Eve fbeda9002d
vulkan: matmul dequantization improvements (#12015)
* faster dequant for old quants

* dont use unpack for iq4_nl

* vec2 unpack for q8
2025-02-28 08:20:08 +01:00
Daniele 581650b7ca
vulkan: improve im2col (#11826)
* vulkan: improve im2col performance
2025-02-28 07:52:51 +01:00
nullname c867641222
feat: fix some TODO item in upstream PR #26 (#27)
* fix warning

* wip

* add todo for graph key generate

* rename some file to meet upstream guideline

* remove local .clang-format

* expend supported/unsupported counter to all ops

* append device name to log

* port to ggml logger

* fix warning after adapt to ggml logger

* append \n to all log

* use case op instead of convert

* Revert "use case op instead of convert"

This reverts commit e662fc2dfee41719aaf7bc9d75e03e8d0f7ded0f.

* fix op that needs same shape

* opt kQnnOpsTable

* refresh params name field when getting op config

* opt npu log print

* remove unused functions
2025-02-27 23:16:08 +08:00
Vladimir Vuksanovic b95c8af37c
cmake: Fix ggml backend dependencies and installation (#11818)
* Fix dependencies between ggml and backends

ggml backends link only to ggml-base and ggml links to all backends.

* Fix installation of ggml backends

Set up GNUInstallDirs before setting the installation directory of ggml backends
2025-02-27 09:42:48 +02:00
Jeff Bolz a82c9e7c23
vulkan: fix assertion when qy_needs_dequant (#12068)
Looks like a copy/paste bug from qx_needs_dequant.
2025-02-25 16:30:21 +01:00
nullname ff033e1e23
opt mulmat base on official doc (#25)
https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
2025-02-25 19:46:48 +08:00
Judd c132239bfb
add OP sigmoid (#12056)
Co-authored-by: Judd <foldl@boxvest.com>
2025-02-25 12:32:20 +01:00
Molly Sophia 393fca629e
ggml-cpu: Fix build with sve (#12059)
* ggml-cpu: Fix build with sve

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml-cpu: Remove unused variable in sve q3_k vec dot

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-02-25 19:28:22 +08:00
Rémy O 61d4f39dfe
vulkan: implement more backpropagation operators (#11914)
* vulkan: implement GGML_OP_ROPE_BACK

* vulkan: implement GGML_OP_RMS_NORM_BACK

* vulkan: implement GGML_OP_SILU_BACK

* vulkan: implement GGML_OP_SOFTMAX_BACK
2025-02-25 12:04:45 +01:00
Gian-Carlo Pascutto 58d07a8043
metal : copy kernels for quant to F32/F16 conversions (#12017)
metal: use dequantize_q templates

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-25 11:27:58 +02:00
lhez 34a846b584
opencl: fix for small models (#11950)
* opencl: fix small shape gemv, remove unused extensions

* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size

* opencl: fix for token length < 4

* opencl: use wave size of 64 for all Adreno GPUs

---------

Co-authored-by: Shawn Gu <quic_shawngu@quicinc.com>
Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>
2025-02-24 14:47:07 -07:00
Neo Zhang Jianyu 08d5986290
[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035)
* opt performance by reorder for Intel GPU

* detect hw type and save opt feature, and print opt feature

* correct name

* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed

* add env variable GGML_SYCL_DISABLE_OPT for debug

* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* mv getrows functions to separeted files

* fix global variables

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2025-02-24 22:33:23 +08:00
Akarshan Biswas 8303e8b0fb
SYCL: Fix GGML_SYCL_DEBUG macro (#11995) 2025-02-24 10:18:25 +00:00
hongruichen 84328ffbda Merge branch 'master' into dev-refactoring 2025-02-24 14:39:12 +08:00
nullname a822d00753
feat: run on win (#24)
* move qnn_instance function implementation into cpp

* wip

* wip

* move dl related function into separated file

* use cast op for gpu

* Revert "use cast op for gpu"

This reverts commit 05df7362a15c022d05940d682e84cf480a082c6a.

* Reapply "use cast op for gpu"

This reverts commit 2520e5922a216faceb6d7efcde23dafe6947a4b3.

* fix compiling error in win

* fix align_alloc in win

* fix compiling error

* add get sys free/total mem for win

* wip

* suppress warning in win

* add missing chrono header

* set the correct qnn lib name for windows

* add flag to control cpu backend

* wip

* wip

* Revert "Reapply "use cast op for gpu""

This reverts commit f56519c374a7d46faac706cf214de48ff5fc5139.

* fix compiling error for linux build

* fix cdsprpc dynamic library name

* wip

* skip rpc load fail

* fix page_align_alloc

* suppress some warning in gcc

* wip

* reuse align to function

* more log

* add log and fix warning

* wip

* fix asan errors and memory leaks

* fix the get_io_tensors_from_graph

* improve comment

* print GGML_QNN_DEFAULT_LIB_SEARCH_PATH

* revert some unused changes

* move library search path setter into qnn module

* fix android library loading

* skip qnn_device_get_platform_info for npu emulator
2025-02-24 10:47:47 +08:00
Aaron Teo af7747c95a
ggml-cpu: Support s390x SIMD Instruction Set (#12019)
* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: remove test.py

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: Q5_K y0 wrong signness

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml : fix LoongArch compile error with 128-bit SIMD (#11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
2025-02-22 21:39:24 +00:00
Johannes Gäßler a28e0d5eb1
CUDA: app option to compile without FlashAttention (#12025) 2025-02-22 20:44:34 +01:00
Johannes Gäßler 5fa07c2f93
CUDA: optimize FA for GQA + large batches (#12014) 2025-02-22 12:20:17 +01:00
Gian-Carlo Pascutto d70908421f
cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (#12000) 2025-02-22 09:43:24 +01:00
PureJourney ecc8e3aeff
CUDA: correct the lowest Maxwell supported by CUDA 12 (#11984)
* CUDA: correct the lowest Maxwell supported by CUDA 12

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-02-21 12:21:05 +01:00
Bodhi 0b3863ff95
MUSA: support ARM64 and enable dp4a .etc (#11843)
* MUSA:  support ARM64 and enable __dp4a .etc

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>
2025-02-21 09:46:23 +02:00
Charles Xu c5d91a7400
ggml-cpu: Add CPU backend support for KleidiAI library (#11390)
* ggml-cpu: Add CPU backend support for KleidiAI library

* Add environmental variable GGML_KLEIDIAI_SME

* Add support for multithread LHS conversion

* Switch kernel selection order to dotprod and i8mm

* updates for review comments

* More updates for review comments

* Reorganize and rename KleidiAI files

* Move ggml-cpu-traits.h to source file

* Update cmake for SME build and add alignment for SME

* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
2025-02-20 15:06:51 +02:00
Prashant Vithule 4806498bf1
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (#11917)
* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file

* Improved Formating of code in  ggml-cpu-quants.c file

* style : minor fixes

* style : less whitespaces

* style : ptr spaceing

---------

Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-20 12:08:32 +02:00
Johannes Gäßler 73e2ed3ce3
CUDA: use async data loading for FlashAttention (#11894)
* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-17 14:03:24 +01:00
Rémy O 2eea03d86a
vulkan: implement several ops relevant for ggml_opt (#11769)
* vulkan: support memset_tensor

* vulkan: support GGML_OP_SUM

* vulkan: implement GGML_OP_ARGMAX

* vulkan: implement GGML_OP_SUB

* vulkan: implement GGML_OP_COUNT_EQUAL

* vulkan: implement GGML_OP_OPT_STEP_ADAMW

* vulkan: fix check_results RWKV_WKV6 crash and memory leaks

* vulkan: implement GGML_OP_REPEAT_BACK

* tests: remove invalid test-backend-ops REPEAT_BACK tests

* vulkan: fix COUNT_EQUAL memset using a fillBuffer command
2025-02-17 07:55:57 +01:00
Jeff Bolz bf42a23d0a
vulkan: support multi/vision rope, and noncontiguous rope (#11902) 2025-02-16 08:52:23 +01:00
Hale Chan c2ea16f260
metal : fix the crash caused by the lack of residency set support on Intel Macs. (#11904) 2025-02-16 08:50:26 +02:00
Adrian Kretz 22885105a6
metal : optimize dequant q6_K kernel (#11892) 2025-02-15 20:39:20 +02:00
Georgi Gerganov 68ff663a04
repo : update links to new url (#11886)
* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci
2025-02-15 16:40:57 +02:00
Rémy O fc1b0d0936
vulkan: initial support for IQ1_S and IQ1_M quantizations (#11528)
* vulkan: initial support for IQ1_S and IQ1_M quantizations

* vulkan: define MMV kernels for IQ1 quantizations

* devops: increase timeout of Vulkan tests again

* vulkan: simplify ifdef for init_iq_shmem
2025-02-15 09:01:40 +01:00
lhez 300907b211
opencl: Fix rope and softmax (#11833)
* opencl: fix `ROPE`

* opencl: fix `SOFT_MAX`

* Add fp16 variant

* opencl: enforce subgroup size for `soft_max`
2025-02-14 12:12:23 -07:00
Diego Devesa 94b87f87b5
cuda : add ampere to the list of default architectures (#11870) 2025-02-14 15:33:52 +01:00
Jinyang He 38e32eb6a0
ggml: optimize some vec dot functions for LoongArch ASX (#11842)
* Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX

* Optimize mul_sum_i8_pairs_float for LoongArch ASX

* Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX
2025-02-14 10:54:27 +02:00
Eve a4f011e8d0
vulkan: linux builds + small subgroup size fixes (#11767)
* mm subgroup size

* upload vulkan x86 builds
2025-02-14 02:59:40 +00:00
Jeffrey Morgan 8a8c4ceb60
llamafile: use member variable instead of constant for iq4nlt (#11780) 2025-02-13 18:05:04 +01:00
R0CKSTAR bd6e55bfd3
musa: bump MUSA SDK version to rc3.1.1 (#11822)
* musa: Update MUSA SDK version to rc3.1.1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: Remove workaround in PR #10042

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-02-13 13:28:18 +01:00
hongruichen 12c75f17c4 Merge branch 'master' into dev-refactoring 2025-02-13 10:21:35 +08:00
Diego Devesa a394039db0
ggml-cpu : add chunking support to mul_mat_id (#11666)
* ggml-cpu : add chunking support to mul_mat_id

* allocate chunk counter in wdata
parallelize src1 quantization by column to allows parallelization even when there is only one row

* disable for arm

* cleanup

* better way to disable for arm

* fix uninitialized counter when using 1 thread only

* revert test-backend-ops changes
2025-02-13 01:02:38 +01:00
Xuan-Son Nguyen be3bbd6215
ggml : x2 speed for WASM by optimizing SIMD (#11453)
* ggml : x2 speed for WASM by optimizing SIMD

* fix bad merging

* rm trailing spaces

* rm redundant clamp

* better quantize_row_q8_K

Co-authored-by: camel-cdr <camel-cdr@protonmail.com>

* remove memset that causes buffer overflow
Co-authored-by: camel-cdr <camel-cdr@protonmail.com>

---------

Co-authored-by: camel-cdr <camel-cdr@protonmail.com>
2025-02-13 00:33:45 +01:00
uvos 5c4284d57b
HIP: Remove GCN from list of devices that avoid MMQ (#11831) 2025-02-12 22:25:28 +01:00
uvos e598697d63
HIP: Switch to std::vector in rocblas version check (#11820) 2025-02-12 17:25:03 +01:00
hongruichen ba324b0252 Merge branch 'master' into dev-refactoring 2025-02-12 23:51:03 +08:00
bandoti fef0cbeadf
cleanup: fix compile warnings associated with gnu_printf (#11811) 2025-02-12 10:06:53 -04:00
Richard 748ee9fe93
ggml : fix multi-threaded clamp_f32 (#11824)
* Bug fix for clamp_f32

When using tensors larger than 1d clamp operation does not work due to the restriction of returning if ith is not 0.

* Bug fix for clamp_f32

* Bug fix for clamp_f32
2025-02-12 15:57:33 +02:00
Weizhao Ouyang 198b1ec611
ggml-cpu: Fix duplicate MATMUL_INT8 (#11817)
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
2025-02-12 13:22:58 +01:00
Johannes Gäßler c3d6af7cd2
CUDA: fix CUDART_VERSION checks (#11821) 2025-02-12 13:16:39 +01:00
Sheldon Robinson 90e4dba461
Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx (#11803)
* Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx

* Fix #11802: PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
2025-02-11 16:55:45 +01:00
Johannes Gäßler b9ab0a4d0b
CUDA: use arch list for compatibility check (#11775)
* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-11 00:17:22 +01:00
Maxim Evtush 7b891bdc86
fix: typos in documentation files (#11791)
* Update ggml.c

* Update arg.cpp

* Update speculative.h
2025-02-10 23:21:31 +01:00
Danny Milosavljevic c2a67efe38
vulkan: Make Vulkan optional at runtime (#11493). (#11494)
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-10 07:17:21 +01:00
Wagner Bruna b044a0fe3c
vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (#11592) 2025-02-10 07:08:22 +01:00
Jeff Bolz 98f6b0fd1e
vulkan: account for lookup tables when checking shared memory size (#11502) 2025-02-09 08:43:51 +01:00
Karol Kontny 4d3465c5ae
ggml: Fix data race in ggml threadpool (#11736)
After the barrier in last iteration is executed, still the loop termination
condition will be executed. However main thread can destroy the cgraph object
and its nodes already, then another thread will access it, but the thing is already gone.
Also trouble can happen when n_nodes == 0 or abort is called, but I'm not sure if the
prior situation is possible.

Last syncronization should be done after the loop to ensure the cgraph/cplan won't be
accessed after the main thread exits from the function.
2025-02-08 15:30:53 +01:00
Johannes Gäßler d80be897ac
CUDA: fix min. version for movmatrix (#11751) 2025-02-08 10:46:07 +01:00
Jeff Bolz c026ba3c23
vulkan: print shared memory size (#11719) 2025-02-07 11:26:03 +01:00
Akarshan Biswas ec3bc8270b
SYCL: remove XMX info from print devices (#11712) 2025-02-07 09:27:53 +00:00
Jinyang He 225bbbfa39
ggml : optimize and build warning fix for LoongArch (#11709)
* ggml : optimize convert f32<->f16 for loongarch_asx

* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16

* ggml : Fix warnings when run cpu CI locally on LoongArch
2025-02-07 09:38:31 +02:00
Patrick Peng 1d20e53c40
rpc: fix known RCE in rpc-server (ggml/1103)
Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes
+ Check if  `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer’s allocated region.
2025-02-06 21:22:54 +02:00
Akarshan Biswas 194b2e69f8
SYCL: Adjust support condition for norm operators (#11674)
SYCL does not support non contiguous tensors for norm operations
2025-02-06 11:42:35 +00:00
junchao-zhao 8d4d2be143
ggml : fix LoongArch compile error with 128-bit SIMD (#11701) 2025-02-06 11:20:00 +02:00
Jeff Bolz 2c6c8df56d
vulkan: optimize coopmat2 iq2/iq3 callbacks (#11521)
* vulkan: optimize coopmat2 iq2/iq3 callbacks

* build: trigger CI on GLSL compute shader changes
2025-02-06 07:15:30 +01:00
Rémy O 8a7e3bf17a
vulkan: initial support for IQ4_XS quantization (#11501) 2025-02-06 07:09:59 +01:00
Jeff Bolz 1b598b3058
vulkan: use smaller combined allocations to avoid fragmentation (#11551) 2025-02-06 07:02:18 +01:00
Charles Duffy 902368a06b
metal : avoid breaking build when metal API predates TARGET_OS_VISION (#11690)
Avoids breakage in nix flake build introduced by b0569130c5
2025-02-06 09:52:31 +08:00
Georgi Gerganov d774ab3acc
metal : adjust support conditions for norm operators (#11671)
cont #11659

ggml-ci
2025-02-05 10:57:42 +02:00
Johannes Gäßler fa62da9b2d
CUDA: support for mat. mul. with ne03 != ne13 (#11656) 2025-02-05 08:58:31 +01:00
Johannes Gäßler fd08255d0d
CUDA: non-contiguous (RMS) norm support (#11659)
* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-04 22:21:42 +01:00
fxzjshm 3ec9fd4b77
HIP: force max threads per block to be 1024 (#11621)
Some old/vendor forked version of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm.

Signed-off-by: fxzjshm <fxzjshm@163.com>
2025-02-04 19:18:38 +01:00