lhez
d76a86d967
opencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops ( #12217 )
...
* opencl: support noncontiguous `norm`
* opencl: support noncontiguous `rms_norm`
* opencl: disable fp16 for `ADD`, `MUL`, `SCALE`, `RELU`, `GELU`, `SILU`, `CLAMP`
2025-03-07 00:20:35 +00:00
xiaofei
776f9e59cc
cmake : fix undefined reference errors for std::filesystem in ggml ( #12092 ) ( #12094 )
...
Signed-off-by: Ray Lee <hburaylee@gmail.com>
Co-authored-by: Ray Lee <hburaylee@gmail.com>
2025-03-06 22:58:25 +00:00
Johannes Gäßler
5220a16d18
CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 ( #12222 )
2025-03-06 18:45:09 +01:00
uvos
e721c05c93
HIP/CUDA: set the paramerter value in maintain_cuda_graph instead of replaceing it. ( #12209 )
...
This avoids conflict with internal cuda/hip runtimes memory managment behavior.
2025-03-06 08:20:52 +01:00
Henry Linjamäki
94bb63e4f0
opencl : fix buffer alignment ( #12197 )
...
Fix the following error:
```
ggml-alloc.c:99: not enough space in the buffer
ggml_tallocr_alloc: not enough space in the buffer to allocate blk.17.ffn_down.weight (needed 27525120, available 27521024)
```
which occurs when `ggml_backend_opencl_context::alignment` is larger
than `cl_ptr_base` (hard-coded to `0x1000`).
Also, fix `ggml_backend_opencl_context::alignment` was set to
`CL_DEVICE_MEM_BASE_ADDR_ALIGN` which was treated as bytes but the
value is reported in bits.
2025-03-06 02:33:40 +01:00
Henry Linjamäki
f79243992c
opencl : fix `ulong` kernel args were set from `int` variables ( #12174 )
...
... which left garbage bits in the upper half of the kernel args. This
caused segmentation faults when running PoCL.
2025-03-06 02:31:14 +01:00
simon886212
ed4ce0dda2
opencl : fix profile-related errors ( #12095 )
...
Co-authored-by: ubuntu <ubuntu@localhost.localdomain>
2025-03-06 02:30:05 +01:00
Rémy O
07d1572347
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions ( #12154 )
...
* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions
* cmake: Add GGML_BMI2 build option
* ggml: enable BMI2 on relevant CPU variants
* ggml-cpu: include BMI2 in backend score
* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features
* ggml-cpu: add __BMI2__ define when using MSVC
2025-03-06 02:26:10 +01:00
Akarshan Biswas
5e43f104cc
SYCL: Disable f16 Unary OPs as not supported by the kernels ( #12201 )
2025-03-05 16:58:23 +01:00
Plamen Minev
16e4b22c5e
ggml : fix GGMLMetalClass ODR ( #12200 )
...
-- it might happen if ggml is loaded from 2 separate libraries since each one of them will expose the class. This is more of a guard since we want to use only Metal as embedded library and don't care about the other case.
2025-03-05 17:16:01 +02:00
mgroeber9110
5bbe6a9fe9
ggml : portability fixes for VS 2017 ( #12150 )
...
* Add include files for std::min/max and std::toupper/tolower
* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined
* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode
* win32: only use __restrict in MSVC if C11/C17 support is not enabled
---------
Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
2025-03-04 18:53:26 +02:00
David Huang
becade5de7
HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ ( #12032 )
...
Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check
Adds rocWMMA support to fattn-wmma-f16
---
Signed-off-by: Carl Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Ben Jackson <ben@ben.com>
2025-03-03 22:10:54 +01:00
cmdr2
b64d7cc272
cuda: unary ops as float + de-duplicate (ggml/1130)
2025-03-03 18:18:11 +02:00
cmdr2
0cbee131ad
cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)
...
ggml-ci
2025-03-03 18:18:11 +02:00
cmdr2
87abb7e903
cuda/cpu: Increase support for fp16 unary operations (ggml/1125)
...
* Support fp16 unary operations in the CUDA backend
* cpu: increase fp16 support for unary operators in the CPU backend
* cuda: increase fp16 support for unary operators in the CUDA backend
* Add test cases for fp16 unary operators
* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing
* metal: fix PR comments for unary op support after fp16 unary tests
2025-03-03 18:18:11 +02:00
Diego Devesa
6d4c23b81b
whisper : support GGML_BACKEND_DL (whisper/2843)
...
* whisper : support GGML_BACKEND_DL
* fix DTW crash
* whisper.objc : fix build - add ggml-cpp.h
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-03 18:18:11 +02:00
midnight
6512a90037
cmake : fix compile assumptions for power9/etc (whisper/2777)
...
* Add small comment re: VSX to readme
Co-authored-by: midnight <midnight@example.com>
2025-03-03 18:18:11 +02:00
petterreinholdtsen
4512055792
Told cmake to install ggml-cpp.h as a public header file. (ggml/1126)
...
It is used by Whisper talk-llama example.
Co-authored-by: Petter Reinholdtsen <pere@debian.org>
2025-03-03 18:18:11 +02:00
cmdr2
f54a4ba11e
Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121)
...
* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend
* Add fp16 support for add/sub/mul/div on the CPU backend
* Add test cases for fp16 add/sub/mul/div
2025-03-03 18:18:11 +02:00
ag2s20150909
9660ffef58
ggml : fix kleidiai build ( #12159 )
...
The libggml API has changed, but this has not been updated.
2025-03-03 13:54:08 +01:00
Akarshan Biswas
ece9745bb8
SYCL: Move CPY kernels to a separate file and add few missing kernels ( #12133 )
...
* SYCL: refactor and move cpy kernels to a separate file
* Add few missing cpy kernels
* refactor and add debug logs
2025-03-03 11:07:22 +01:00
Diego Devesa
cc473cac7c
ggml-backend : keep paths in native string type when possible ( #12144 )
2025-03-02 22:11:00 +01:00
Erik Scholz
80c41ddd8f
CUDA: compress mode option and default to size ( #12029 )
...
cuda 12.8 added the option to specify stronger compression for binaries, so we now default to "size".
2025-03-01 12:57:22 +01:00
William Tambellini
70680c48e5
ggml : upgrade init_tensor API to return a ggml_status ( #11854 )
...
* Upgrade init_tensor API to return a ggml_status
To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.
* misc fixes
---------
Co-authored-by: slaren <slarengh@gmail.com>
2025-02-28 14:41:47 +01:00
Rémy O
438a83926a
vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations ( #11595 )
...
* vulkan: implement specialized MMV kernels for IQ2 quantizations
* vulkan: add MMV kernels for IQ3 quants
* vulkan: Increase MMV batch size and unroll IQ LUT setup
* vulkan: fix init_iq_shmem for WG sizes larger than tables
* vulkan: common batch size for all I-quants
2025-02-28 09:42:52 +01:00
Johannes Gäßler
9c42b1718c
CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ ( #12098 )
2025-02-28 09:26:43 +01:00
Prashant Vithule
05e6f5aad0
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot ( #12064 )
...
* Added SVE Support for Q2_K Quantized Models
* Use 4-space indentation in the switch cases
* removed comments lines
* Remove the loop Retain the curly bracess for better understanding of code
* Remove the comment like added for q3_k_q8_k kernel
---------
Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
2025-02-28 09:36:12 +02:00
hipudding
673cfef9aa
CANN: Fix build error with GCC 13 ( #11990 )
...
Remove unused header file that causes compilation failure on ARM
platform with GCC 13.
2025-02-28 15:23:47 +08:00
Eve
fbeda9002d
vulkan: matmul dequantization improvements ( #12015 )
...
* faster dequant for old quants
* dont use unpack for iq4_nl
* vec2 unpack for q8
2025-02-28 08:20:08 +01:00
Daniele
581650b7ca
vulkan: improve im2col ( #11826 )
...
* vulkan: improve im2col performance
2025-02-28 07:52:51 +01:00
Vladimir Vuksanovic
b95c8af37c
cmake: Fix ggml backend dependencies and installation ( #11818 )
...
* Fix dependencies between ggml and backends
ggml backends link only to ggml-base and ggml links to all backends.
* Fix installation of ggml backends
Set up GNUInstallDirs before setting the installation directory of ggml backends
2025-02-27 09:42:48 +02:00
Jeff Bolz
a82c9e7c23
vulkan: fix assertion when qy_needs_dequant ( #12068 )
...
Looks like a copy/paste bug from qx_needs_dequant.
2025-02-25 16:30:21 +01:00
Judd
c132239bfb
add OP sigmoid ( #12056 )
...
Co-authored-by: Judd <foldl@boxvest.com>
2025-02-25 12:32:20 +01:00
Molly Sophia
393fca629e
ggml-cpu: Fix build with sve ( #12059 )
...
* ggml-cpu: Fix build with sve
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml-cpu: Remove unused variable in sve q3_k vec dot
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-02-25 19:28:22 +08:00
Rémy O
61d4f39dfe
vulkan: implement more backpropagation operators ( #11914 )
...
* vulkan: implement GGML_OP_ROPE_BACK
* vulkan: implement GGML_OP_RMS_NORM_BACK
* vulkan: implement GGML_OP_SILU_BACK
* vulkan: implement GGML_OP_SOFTMAX_BACK
2025-02-25 12:04:45 +01:00
Gian-Carlo Pascutto
58d07a8043
metal : copy kernels for quant to F32/F16 conversions ( #12017 )
...
metal: use dequantize_q templates
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-25 11:27:58 +02:00
lhez
34a846b584
opencl: fix for small models ( #11950 )
...
* opencl: fix small shape gemv, remove unused extensions
* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size
* opencl: fix for token length < 4
* opencl: use wave size of 64 for all Adreno GPUs
---------
Co-authored-by: Shawn Gu <quic_shawngu@quicinc.com>
Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>
2025-02-24 14:47:07 -07:00
Neo Zhang Jianyu
08d5986290
[SYCL] Optimize mul_mat for Q4_0 on Intel GPU ( #12035 )
...
* opt performance by reorder for Intel GPU
* detect hw type and save opt feature, and print opt feature
* correct name
* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed
* add env variable GGML_SYCL_DISABLE_OPT for debug
* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT
* add performance data
* mv getrows functions to separeted files
* fix global variables
---------
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2025-02-24 22:33:23 +08:00
Akarshan Biswas
8303e8b0fb
SYCL: Fix GGML_SYCL_DEBUG macro ( #11995 )
2025-02-24 10:18:25 +00:00
Aaron Teo
af7747c95a
ggml-cpu: Support s390x SIMD Instruction Set ( #12019 )
...
* ggml: add s390x ARCH_FLAGS for compilation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add SIMD for s390x using vector intrinsics
SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16
SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix missing escape character in GGML_F32x4_REDUCE
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix s390x GGML_F32x4_REDUCE
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: full SIMD activation for F32,F16 s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add option to disable s390x VXE/VXE2
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: change vecintrin.h include to ggml-cpu-impl
* add __VXE__ and __VXE2__ macros
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* cmake: add s390x target detection for VX/VXE/VXE2
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x Q8_0 SIMD
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: correct documentation for Q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x reduce code complexity Q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x bugfix typo Q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activated for Q4_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x inline vec_reve
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for Q4_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add VXE backend feature
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: remove test.py
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for quantize_row_q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for quantize_row_q8_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for iq4_xs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: bugfix iq4_xs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for iq4_nl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add float, double, and long vector data type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: clean up iq4_xs SIMD
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix improper use of restrict keyword
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: update warning message for ggml_vec_tbl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: switch to restrict for iq4_nl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: slight dot product speed improvement for q4_1_q8_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for q6_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add missing `_t` to ggml_int8x16x4_t
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix missing `_t` for ggml_vec_xl_s8x4
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix more missing `_t`
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add unroll and prefetch to Q8_0
increase of 3.86% for prompt processing and 32.22% for token generation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: patch Q8_0 to use proper vector sizes
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: optimise Q8_0 dot prod compute kernel further
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add unroll and prefetch to Q4_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: refactor Q6_K variable naming for readability
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix Q6_K typos
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for Q5_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix wrong char*x16_t naming
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: Q5_K y0 wrong signness
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix Q5_K invalid uchar type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix Q5_K invalid uchar type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for Q4_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix Q4_K invalid vector intrinsics
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: simplify ggml_padd_s16 compute kernel
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: correct ggml-cpu vxe wording
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: change ggml_aligned_malloc alignment to 256
256 is the cache line size for s390x platforms
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: resolve pr merge via cherry-pick 225bbbf
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml : fix LoongArch compile error with 128-bit SIMD (#11701 )
* ggml: resolve pr merge via cherry-pick 4571953
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: cmake remove fork when determining s390x machine type
thank you @ericcurtin
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
2025-02-22 21:39:24 +00:00
Johannes Gäßler
a28e0d5eb1
CUDA: app option to compile without FlashAttention ( #12025 )
2025-02-22 20:44:34 +01:00
Johannes Gäßler
5fa07c2f93
CUDA: optimize FA for GQA + large batches ( #12014 )
2025-02-22 12:20:17 +01:00
Gian-Carlo Pascutto
d70908421f
cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. ( #12000 )
2025-02-22 09:43:24 +01:00
PureJourney
ecc8e3aeff
CUDA: correct the lowest Maxwell supported by CUDA 12 ( #11984 )
...
* CUDA: correct the lowest Maxwell supported by CUDA 12
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-02-21 12:21:05 +01:00
Bodhi
0b3863ff95
MUSA: support ARM64 and enable dp4a .etc ( #11843 )
...
* MUSA: support ARM64 and enable __dp4a .etc
* fix cross entropy loss op for musa
* update
* add cc info log for musa
* add comment for the MUSA .cc calculation block
---------
Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>
2025-02-21 09:46:23 +02:00
Charles Xu
c5d91a7400
ggml-cpu: Add CPU backend support for KleidiAI library ( #11390 )
...
* ggml-cpu: Add CPU backend support for KleidiAI library
* Add environmental variable GGML_KLEIDIAI_SME
* Add support for multithread LHS conversion
* Switch kernel selection order to dotprod and i8mm
* updates for review comments
* More updates for review comments
* Reorganize and rename KleidiAI files
* Move ggml-cpu-traits.h to source file
* Update cmake for SME build and add alignment for SME
* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
2025-02-20 15:06:51 +02:00
Prashant Vithule
4806498bf1
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot ( #11917 )
...
* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file
* Improved Formating of code in ggml-cpu-quants.c file
* style : minor fixes
* style : less whitespaces
* style : ptr spaceing
---------
Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-20 12:08:32 +02:00
Johannes Gäßler
73e2ed3ce3
CUDA: use async data loading for FlashAttention ( #11894 )
...
* CUDA: use async data loading for FlashAttention
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-17 14:03:24 +01:00
Rémy O
2eea03d86a
vulkan: implement several ops relevant for ggml_opt ( #11769 )
...
* vulkan: support memset_tensor
* vulkan: support GGML_OP_SUM
* vulkan: implement GGML_OP_ARGMAX
* vulkan: implement GGML_OP_SUB
* vulkan: implement GGML_OP_COUNT_EQUAL
* vulkan: implement GGML_OP_OPT_STEP_ADAMW
* vulkan: fix check_results RWKV_WKV6 crash and memory leaks
* vulkan: implement GGML_OP_REPEAT_BACK
* tests: remove invalid test-backend-ops REPEAT_BACK tests
* vulkan: fix COUNT_EQUAL memset using a fillBuffer command
2025-02-17 07:55:57 +01:00
Jeff Bolz
bf42a23d0a
vulkan: support multi/vision rope, and noncontiguous rope ( #11902 )
2025-02-16 08:52:23 +01:00
Hale Chan
c2ea16f260
metal : fix the crash caused by the lack of residency set support on Intel Macs. ( #11904 )
2025-02-16 08:50:26 +02:00
Adrian Kretz
22885105a6
metal : optimize dequant q6_K kernel ( #11892 )
2025-02-15 20:39:20 +02:00
Georgi Gerganov
68ff663a04
repo : update links to new url ( #11886 )
...
* repo : update links to new url
ggml-ci
* cont : more urls
ggml-ci
2025-02-15 16:40:57 +02:00
Rémy O
fc1b0d0936
vulkan: initial support for IQ1_S and IQ1_M quantizations ( #11528 )
...
* vulkan: initial support for IQ1_S and IQ1_M quantizations
* vulkan: define MMV kernels for IQ1 quantizations
* devops: increase timeout of Vulkan tests again
* vulkan: simplify ifdef for init_iq_shmem
2025-02-15 09:01:40 +01:00
lhez
300907b211
opencl: Fix rope and softmax ( #11833 )
...
* opencl: fix `ROPE`
* opencl: fix `SOFT_MAX`
* Add fp16 variant
* opencl: enforce subgroup size for `soft_max`
2025-02-14 12:12:23 -07:00
Diego Devesa
94b87f87b5
cuda : add ampere to the list of default architectures ( #11870 )
2025-02-14 15:33:52 +01:00
Jinyang He
38e32eb6a0
ggml: optimize some vec dot functions for LoongArch ASX ( #11842 )
...
* Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX
* Optimize mul_sum_i8_pairs_float for LoongArch ASX
* Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX
2025-02-14 10:54:27 +02:00
Eve
a4f011e8d0
vulkan: linux builds + small subgroup size fixes ( #11767 )
...
* mm subgroup size
* upload vulkan x86 builds
2025-02-14 02:59:40 +00:00
Jeffrey Morgan
8a8c4ceb60
llamafile: use member variable instead of constant for iq4nlt ( #11780 )
2025-02-13 18:05:04 +01:00
R0CKSTAR
bd6e55bfd3
musa: bump MUSA SDK version to rc3.1.1 ( #11822 )
...
* musa: Update MUSA SDK version to rc3.1.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: Remove workaround in PR #10042
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-02-13 13:28:18 +01:00
Diego Devesa
a394039db0
ggml-cpu : add chunking support to mul_mat_id ( #11666 )
...
* ggml-cpu : add chunking support to mul_mat_id
* allocate chunk counter in wdata
parallelize src1 quantization by column to allows parallelization even when there is only one row
* disable for arm
* cleanup
* better way to disable for arm
* fix uninitialized counter when using 1 thread only
* revert test-backend-ops changes
2025-02-13 01:02:38 +01:00
Xuan-Son Nguyen
be3bbd6215
ggml : x2 speed for WASM by optimizing SIMD ( #11453 )
...
* ggml : x2 speed for WASM by optimizing SIMD
* fix bad merging
* rm trailing spaces
* rm redundant clamp
* better quantize_row_q8_K
Co-authored-by: camel-cdr <camel-cdr@protonmail.com>
* remove memset that causes buffer overflow
Co-authored-by: camel-cdr <camel-cdr@protonmail.com>
---------
Co-authored-by: camel-cdr <camel-cdr@protonmail.com>
2025-02-13 00:33:45 +01:00
uvos
5c4284d57b
HIP: Remove GCN from list of devices that avoid MMQ ( #11831 )
2025-02-12 22:25:28 +01:00
uvos
e598697d63
HIP: Switch to std::vector in rocblas version check ( #11820 )
2025-02-12 17:25:03 +01:00
bandoti
fef0cbeadf
cleanup: fix compile warnings associated with gnu_printf ( #11811 )
2025-02-12 10:06:53 -04:00
Richard
748ee9fe93
ggml : fix multi-threaded clamp_f32 ( #11824 )
...
* Bug fix for clamp_f32
When using tensors larger than 1d clamp operation does not work due to the restriction of returning if ith is not 0.
* Bug fix for clamp_f32
* Bug fix for clamp_f32
2025-02-12 15:57:33 +02:00
Weizhao Ouyang
198b1ec611
ggml-cpu: Fix duplicate MATMUL_INT8 ( #11817 )
...
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
2025-02-12 13:22:58 +01:00
Johannes Gäßler
c3d6af7cd2
CUDA: fix CUDART_VERSION checks ( #11821 )
2025-02-12 13:16:39 +01:00
Sheldon Robinson
90e4dba461
Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx ( #11803 )
...
* Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx
* Fix #11802 : PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
2025-02-11 16:55:45 +01:00
Johannes Gäßler
b9ab0a4d0b
CUDA: use arch list for compatibility check ( #11775 )
...
* CUDA: use arch list for feature availability check
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-11 00:17:22 +01:00
Maxim Evtush
7b891bdc86
fix: typos in documentation files ( #11791 )
...
* Update ggml.c
* Update arg.cpp
* Update speculative.h
2025-02-10 23:21:31 +01:00
Danny Milosavljevic
c2a67efe38
vulkan: Make Vulkan optional at runtime ( #11493 ). ( #11494 )
...
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-10 07:17:21 +01:00
Wagner Bruna
b044a0fe3c
vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation ( #11592 )
2025-02-10 07:08:22 +01:00
Jeff Bolz
98f6b0fd1e
vulkan: account for lookup tables when checking shared memory size ( #11502 )
2025-02-09 08:43:51 +01:00
Karol Kontny
4d3465c5ae
ggml: Fix data race in ggml threadpool ( #11736 )
...
After the barrier in last iteration is executed, still the loop termination
condition will be executed. However main thread can destroy the cgraph object
and its nodes already, then another thread will access it, but the thing is already gone.
Also trouble can happen when n_nodes == 0 or abort is called, but I'm not sure if the
prior situation is possible.
Last syncronization should be done after the loop to ensure the cgraph/cplan won't be
accessed after the main thread exits from the function.
2025-02-08 15:30:53 +01:00
Johannes Gäßler
d80be897ac
CUDA: fix min. version for movmatrix ( #11751 )
2025-02-08 10:46:07 +01:00
Jeff Bolz
c026ba3c23
vulkan: print shared memory size ( #11719 )
2025-02-07 11:26:03 +01:00
Akarshan Biswas
ec3bc8270b
SYCL: remove XMX info from print devices ( #11712 )
2025-02-07 09:27:53 +00:00
Jinyang He
225bbbfa39
ggml : optimize and build warning fix for LoongArch ( #11709 )
...
* ggml : optimize convert f32<->f16 for loongarch_asx
* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16
* ggml : Fix warnings when run cpu CI locally on LoongArch
2025-02-07 09:38:31 +02:00
Patrick Peng
1d20e53c40
rpc: fix known RCE in rpc-server (ggml/1103)
...
Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes
+ Check if `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer’s allocated region.
2025-02-06 21:22:54 +02:00
Akarshan Biswas
194b2e69f8
SYCL: Adjust support condition for norm operators ( #11674 )
...
SYCL does not support non contiguous tensors for norm operations
2025-02-06 11:42:35 +00:00
junchao-zhao
8d4d2be143
ggml : fix LoongArch compile error with 128-bit SIMD ( #11701 )
2025-02-06 11:20:00 +02:00
Jeff Bolz
2c6c8df56d
vulkan: optimize coopmat2 iq2/iq3 callbacks ( #11521 )
...
* vulkan: optimize coopmat2 iq2/iq3 callbacks
* build: trigger CI on GLSL compute shader changes
2025-02-06 07:15:30 +01:00
Rémy O
8a7e3bf17a
vulkan: initial support for IQ4_XS quantization ( #11501 )
2025-02-06 07:09:59 +01:00
Jeff Bolz
1b598b3058
vulkan: use smaller combined allocations to avoid fragmentation ( #11551 )
2025-02-06 07:02:18 +01:00
Charles Duffy
902368a06b
metal : avoid breaking build when metal API predates TARGET_OS_VISION ( #11690 )
...
Avoids breakage in nix flake build introduced by b0569130c5
2025-02-06 09:52:31 +08:00
Georgi Gerganov
d774ab3acc
metal : adjust support conditions for norm operators ( #11671 )
...
cont #11659
ggml-ci
2025-02-05 10:57:42 +02:00
Johannes Gäßler
fa62da9b2d
CUDA: support for mat. mul. with ne03 != ne13 ( #11656 )
2025-02-05 08:58:31 +01:00
Johannes Gäßler
fd08255d0d
CUDA: non-contiguous (RMS) norm support ( #11659 )
...
* CUDA: non-contiguous (RMS) norm support
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-04 22:21:42 +01:00
fxzjshm
3ec9fd4b77
HIP: force max threads per block to be 1024 ( #11621 )
...
Some old/vendor forked version of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm.
Signed-off-by: fxzjshm <fxzjshm@163.com>
2025-02-04 19:18:38 +01:00
Jhen-Jie Hong
534c46b53c
metal : use residency set for other platforms ( #11648 )
2025-02-04 13:07:18 +02:00
Christian Kastner
8f8290ada9
cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)
...
This makes git as a dependency optional, and is useful in the case where
ggml is built not from git, but from a tarball, or a distribution source
package.
This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem much value factor it out, or
even require it.
2025-02-04 12:59:15 +02:00
Johannes Gäßler
21c84b5d2d
CUDA: fix Volta FlashAttention logic ( #11615 )
2025-02-03 14:25:56 +02:00
Johannes Gäßler
6eecde3cc8
HIP: fix flash_attn_stream_k_fixup warning ( #11604 )
2025-02-02 23:48:29 +01:00
uvos
396856b400
CUDA/HIP: add support for selectable warp size to mmv ( #11519 )
...
CUDA/HIP: add support for selectable warp size to mmv
2025-02-02 22:40:09 +01:00
uvos
4d0598e144
HIP: add GGML_CUDA_CC_IS_* for amd familys as increasing cc archtectures for amd gpus are not supersets of eatch other ( #11601 )
...
This fixes a bug where RDNA1 gpus other than gfx1010 where not handled correctly
2025-02-02 22:08:05 +01:00
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention ( #11583 )
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-02 19:31:09 +01:00
Olivier Chafik
aa6fb13213
`ci`: use sccache on windows instead of ccache ( #11545 )
...
* Use sccache on ci for windows
* Detect sccache in cmake
2025-01-31 17:12:40 +00:00
uvos
27d135c970
HIP: require at least HIP 5.5
2025-01-30 16:25:44 +01:00
uvos
6af1ca48cb
HIP: Prepare reduction operators for wave 64
2025-01-30 16:25:44 +01:00