Commit Graph

87 Commits

Author SHA1 Message Date
Francis Couture-Harpin 96b3d411e0 ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot
Not yet tested on hardware which supports it,
might not work or might not even compile. But also it might.
It should make the performance better on recent ARM CPUs.

* ggml-quants : remove comment about possible format change of TQ2_0

Making it slightly more convenient for AVX512
but less convenient for everything else is not worth the trouble.
2024-08-07 15:08:41 -04:00
Francis Couture-Harpin f034aa1bb1 ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency 2024-08-03 16:22:04 -04:00
Francis Couture-Harpin 04eec58112 ggml : remove q1_3 and q2_2
* llama : remove the separate scale tensors of BitNet b1.58

They won't be needed, since the remaining ternary quant types have
built-in scales.
2024-08-02 20:16:26 -04:00
Francis Couture-Harpin 45719a2472 ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat
The compiler seems smart enough to use the same instruction
even when using vget_high_s8 instead.
2024-08-01 01:11:30 -04:00
Francis Couture-Harpin 5417089aeb ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0 2024-07-31 23:35:04 -04:00
Francis Couture-Harpin a6dd6994a5 ggml : fix build issues in certain environments 2024-07-31 23:14:36 -04:00
Francis Couture-Harpin e9719576c4 ggml : also faster TQ1_0
Same optimization as for TQ2_0 by offsetting the sum instead of the weights.
This makes TQ1_0 almost as fast as Q8_0 on AVX2.
2024-07-31 00:08:48 -04:00
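The "offsetting the sum instead of the weights" idea can be illustrated in scalar C. This is only a sketch, assuming ternary weights stored as unsigned values `q` in {0, 1, 2} that stand for `q - 1`; the function names are hypothetical and this is not the actual TQ1_0/TQ2_0 kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference form: apply the -1 offset to every weight before multiplying. */
int32_t demo_dot_offset_weights(const uint8_t *q, const int8_t *y, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += ((int32_t) q[i] - 1) * (int32_t) y[i];
    }
    return sum;
}

/* Equivalent form: multiply by the raw stored values, then subtract the
   sum of the activations once, since sum((q-1)*y) == sum(q*y) - sum(y). */
int32_t demo_dot_offset_sum(const uint8_t *q, const int8_t *y, size_t n) {
    int32_t sum_qy = 0;
    int32_t sum_y  = 0;
    for (size_t i = 0; i < n; ++i) {
        sum_qy += (int32_t) q[i] * (int32_t) y[i];
        sum_y  += (int32_t) y[i];
    }
    return sum_qy - sum_y;
}
```

The second form removes a per-element subtraction from the inner loop. If the sum of the activations is already available from elsewhere, the correction presumably becomes a single subtraction per block.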
Francis Couture-Harpin 560873f337 ggml : even faster TQ2_0 2024-07-30 23:36:52 -04:00
Francis Couture-Harpin 77b8f84ae7 ggml : add TQ1_0 and TQ2_0 ternary quantization types 2024-07-30 18:33:15 -04:00
Francis Couture-Harpin 79a278e922 Merge branch 'master' into compilade/bitnet-ternary 2024-07-28 21:27:33 -04:00
Francis Couture-Harpin dd3e62a703 ggml : add some informative comments in q1_3 vec_dot 2024-07-28 21:17:16 -04:00
Austin 4730faca61
chore : Fix vulkan related compiler warnings, add help text, improve CLI options (#8477)
* chore: Fix compiler warnings, add help text, improve CLI options

* Add prototypes for function definitions
* Invert logic of --no-clean option to be more intuitive
* Provide a new help prompt with clear instructions

* chore : Add ignore rule for vulkan shader generator

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* Update ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp

Co-authored-by: 0cc4m <picard12@live.de>

* chore : Remove void and apply C++ style empty parameters

* chore : Remove void and apply C++ style empty parameters

---------

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: 0cc4m <picard12@live.de>
2024-07-28 09:52:42 +02:00
R0CKSTAR e54c35e4fb
feat: Support Moore Threads GPU (#8383)
* Update doc for MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in Makefile

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in CMake

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* CUDA => MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* MUSA adds support for __vsubss4

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix CI build failure

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-07-28 01:41:25 +02:00
Georgi Gerganov 345c8c0c87 ggml : add missing semicolon (#0)
ggml-ci
2024-07-27 17:43:44 +03:00
Mahesh Madhav a05ca93697 ggml : loop tiling optimizations for scalar path (ggml/898)
Apply a loop tiling technique to the generic path, which provides
performance upside for ISAs with enough registers to take advantage
of it. Also helps the compiler optimize this path.
2024-07-27 17:43:44 +03:00
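As a rough illustration of loop tiling on a scalar path, the sketch below computes a matrix-vector product several rows at a time so that each loaded `x[c]` is reused across the tile. The function name, the tile size and the choice of a matrix-vector product are assumptions made for this example, not details of the commit.

```c
#include <stddef.h>

#define DEMO_TILE 4 /* rows handled per pass; an illustrative choice */

/* out[r] = sum over c of a[r*cols + c] * x[c], with a stored row-major. */
void demo_matvec_tiled(const float *a, const float *x, float *out,
                       size_t rows, size_t cols) {
    size_t r = 0;

    /* Tiled part: DEMO_TILE accumulators stay live across the column loop,
       which helps on ISAs with enough registers to hold them. */
    for (; r + DEMO_TILE <= rows; r += DEMO_TILE) {
        float acc[DEMO_TILE] = {0.0f};
        for (size_t c = 0; c < cols; ++c) {
            const float xc = x[c];
            for (size_t t = 0; t < DEMO_TILE; ++t) {
                acc[t] += a[(r + t) * cols + c] * xc;
            }
        }
        for (size_t t = 0; t < DEMO_TILE; ++t) {
            out[r + t] = acc[t];
        }
    }

    /* Remainder rows that do not fill a whole tile. */
    for (; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += a[r * cols + c] * x[c];
        }
        out[r] = acc;
    }
}
```

The fixed-size inner loop over `t` is easy for a compiler to unroll, which is one way such a rewrite can help the optimizer.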
Ivan Filipov 9f77d899b7 ggml: add support for float16 input tensors in pooling operations (ggml/895)
* Add support for float16 tensors in 1d pooling operations

* Add support for float16 input tensors in 2d pooling operations

* code cleanup

remove unnecessary casting during srow ptr initialization

---------

Co-authored-by: vanaka11 <vanaka1189@gmail.com>
2024-07-27 17:43:44 +03:00
Tony Wasserka 203b7f1531 vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893)
This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.

Co-authored-by: Tony Wasserka <neobrain@users.noreply.github.com>
2024-07-27 17:43:44 +03:00
Borislav Stanimirov d2b851bfa1 cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885) 2024-07-27 17:43:44 +03:00
Daniel Bevenius c12b6e8ee7 ggml : remove unnecessary UNUSED macro call (ggml/880)
This commit removes an UNUSED macro call that is not needed as the
variable n0 is used in the code and will not produce a warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-27 17:43:44 +03:00
wangshuai09 bfb4c74981
cann: Fix Multi-NPU execution error (#8710)
* cann: fix multi-npu exec error

* cann: update comment for ggml_backend_cann_supports_buft
2024-07-27 16:36:44 +08:00
slaren 2b1f616b20
ggml : reduce hash table reset cost (#8698)
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
DavidKorczynski 49ce0ab6d4
ggml: handle ggml_init failure to fix NULL pointer deref (#8692)
`ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_no_alloc`.

This fixes it by bailing out if no context is found.
2024-07-25 23:23:05 +02:00
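The bail-out pattern described above might look like the following. This is a minimal sketch with a hypothetical fixed-size context pool (`demo_init`, `demo_load` and `DEMO_MAX_CONTEXTS` are invented for the example); it only mirrors the shape of the fix, not the real ggml code.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_CONTEXTS 2 /* hypothetical pool size */

struct demo_context {
    bool used;
    bool no_alloc;
};

static struct demo_context demo_pool[DEMO_MAX_CONTEXTS];

/* Returns an unused context, or NULL when the pool is exhausted. */
struct demo_context *demo_init(void) {
    for (size_t i = 0; i < DEMO_MAX_CONTEXTS; ++i) {
        if (!demo_pool[i].used) {
            demo_pool[i].used = true;
            return &demo_pool[i];
        }
    }
    return NULL;
}

/* Caller that checks the result before touching the context. */
bool demo_load(void) {
    struct demo_context *ctx = demo_init();
    if (ctx == NULL) {
        return false; /* bail out instead of dereferencing NULL below */
    }
    ctx->no_alloc = true;
    return true;
}
```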
Andreas (Andi) Kunar bf5a81df37
ggml : fix build on Windows with Snapdragon X (#8531)
* Improvements for Windows with Snapdragon X

* Revert "Improvements for Windows with Snapdragon X"

This reverts commit bf21397ae5.

* Improvements for Windows with Snapdragon X

* WOA build clarifications

* Windows on ARM build clarifications

* cmake build for Windows clarifications

* Update docs/build.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: AndreasKunar <andreaskmsn.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-25 19:01:00 +03:00
Chen Xi ed67bcb24f
[SYCL] fix multi-gpu issue on sycl (#8554)
---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
2024-07-25 19:45:18 +08:00
Georgi Gerganov eddcb5238b
ggml : add and use ggml_cpu_has_llamafile() (#8664) 2024-07-25 12:37:42 +03:00
Joe Todd 79167d9e49
Re-add erroneously removed -fsycl from GGML_EXTRA_LIBS (#8667) 2024-07-24 11:55:26 +01:00
Joe Todd 64cf50a0ed
sycl : Add support for non-release DPC++ & oneMKL (#8644)
* Update cmake to support nvidia hardware & open-source compiler
---------
Signed-off-by: Joe Todd <joe.todd@codeplay.com>
2024-07-23 14:58:37 +01:00
0cc4m 751fcfc6c3
Vulkan IQ4_NL Support (#8613)
* Fix Vulkan matmul tests compile errors

* Add Vulkan IQ4_NL support

* Fix Vulkan DeepSeek-Coder-V2-Lite MoE support
2024-07-23 10:56:49 +02:00
Jeroen Mostert 46e47417aa
Allow all RDNA2 archs to use sdot4 intrinsic (#8629)
The check gating the use of `__builtin_amdgcn_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.
2024-07-23 10:50:40 +02:00
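A sketch of gating on a family-wide define rather than one exact target is shown below. `DEMO_RDNA2`, `demo_dot4` and the listed target macros are illustrative assumptions; the real define and its exact target list live in the CUDA/HIP common header and may differ. On any other target the portable fallback is compiled instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical umbrella macro: treat any listed gfx103x target as RDNA2
   (the list here is illustrative, not authoritative). */
#if defined(__gfx1030__) || defined(__gfx1031__) || defined(__gfx1032__) || \
    defined(__gfx1033__) || defined(__gfx1034__) || defined(__gfx1035__) || \
    defined(__gfx1036__)
#define DEMO_RDNA2 1
#endif

/* c + dot product of the four signed bytes packed in a and b. */
int32_t demo_dot4(int32_t a, int32_t b, int32_t c) {
#if defined(DEMO_RDNA2)
    /* Gate on the whole family, not only on gfx1030. */
    return __builtin_amdgcn_sdot4(a, b, c, 0);
#else
    /* Portable fallback: unpack the bytes and accumulate. */
    int8_t a8[4];
    int8_t b8[4];
    memcpy(a8, &a, sizeof(a8));
    memcpy(b8, &b, sizeof(b8));
    for (int i = 0; i < 4; ++i) {
        c += (int32_t) a8[i] * (int32_t) b8[i];
    }
    return c;
#endif
}
```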
luoyu-intel 063d99ad11
[SYCL] fix scratch size of softmax (#8642) 2024-07-23 15:43:28 +08:00
Mark Zhuang 04bab6b7da
ggml: fix compile error for RISC-V (#8623) 2024-07-22 10:56:45 +03:00
Johannes Gäßler 69c487f4ed
CUDA: MMQ code deduplication + iquant support (#8495)
* CUDA: MMQ code deduplication + iquant support

* 1 less parallel job for CI build
2024-07-20 22:25:26 +02:00
Georgi Gerganov 07283b1a90
gguf : handle null name during init (#8587) 2024-07-20 17:15:42 +03:00
slaren 87e397d00b
ggml : fix quant dot product with odd number of blocks (#8549)
* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix odd blocks for ARM_NEON (#8556)

* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix q4_1

* ggml : fix q5_0

* ggml : fix q5_1

* ggml : fix iq4_nl metal

ggml-ci

* ggml : fix q4_0

* ggml : fix q8_0

ggml-ci

* ggml : remove special Q4_0 code for first 2 blocks

* ggml : fix sumf redefinition

---------

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-19 17:17:27 +02:00
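The class of bug fixed here can arise when a kernel consumes two blocks per iteration and silently assumes an even block count. The sketch below shows one way to handle the leftover block; the block size, the float data and the names are assumptions for illustration and do not reproduce the real quantized kernels.

```c
#include <stddef.h>

#define DEMO_BLOCK 4 /* elements per block; illustrative only */

/* Dot product of one block. */
static float demo_block_dot(const float *x, const float *y) {
    float sum = 0.0f;
    for (size_t i = 0; i < DEMO_BLOCK; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Dot product over nb blocks, two blocks per iteration plus a tail loop. */
float demo_dot_blocks(const float *x, const float *y, size_t nb) {
    float sum0 = 0.0f;
    float sum1 = 0.0f;
    size_t ib = 0;

    /* Main loop: runs only while a full pair of blocks remains. */
    for (; ib + 1 < nb; ib += 2) {
        sum0 += demo_block_dot(x + (ib + 0) * DEMO_BLOCK, y + (ib + 0) * DEMO_BLOCK);
        sum1 += demo_block_dot(x + (ib + 1) * DEMO_BLOCK, y + (ib + 1) * DEMO_BLOCK);
    }

    /* Tail: picks up the last block when nb is odd. */
    for (; ib < nb; ++ib) {
        sum0 += demo_block_dot(x + ib * DEMO_BLOCK, y + ib * DEMO_BLOCK);
    }

    return sum0 + sum1;
}
```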
Clint Herron b57eb9ca4f
ggml : add friendlier error message to fopen errors (#8575)
* Add additional error information when model files fail to load.

* Adding additional error information to most instances of fopen.
2024-07-19 14:05:45 +03:00
Johannes Gäßler a15ef8f8a0
CUDA: fix partial offloading for ne0 % 256 != 0 (#8572) 2024-07-18 23:48:47 +02:00
65a 705b7ecf60
cmake : install all ggml public headers (#8480)
Co-authored-by: 65a <65a@65a.invalid>
2024-07-18 17:47:12 +03:00
hipudding 1bdd8ae19f
[CANN] Add Ascend NPU backend (#6035)
* [CANN] Add Ascend NPU backend

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developed by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <391746016@qq.com>

* delete trailing whitespaces

* Modify the code based on review comment

* Rename LLAMA_CANN to GGML_CANN

* Make ggml-common.h private

* add ggml_cann prefix for acl funcs

* Add logging for CANN backend

* Delete Trailing whitespace

---------

Co-authored-by: wangshuai09 <391746016@qq.com>
2024-07-17 14:23:50 +03:00
Johannes Gäßler 5e116e8dd5
make/cmake: add missing force MMQ/cuBLAS for HIP (#8515) 2024-07-16 21:20:59 +02:00
Xuan Son Nguyen 97bdd26eee
Refactor lora adapter support (#8332)
* lora: load to device buft

* add patch tensor function

* correct tensor patch

* llama_lora_adapter_apply

* correct ggml_backend_tensor_copy

* add llm_build_mm

* fix auto merge

* update based on review comments

* add convert script

* no more transpose A

* add f16 convert

* add metadata check

* add sanity check

* fix ftype

* add requirements

* fix requirements

* fix outfile

* conversion: only allow selected models

* fix types

* cuda : do not use dmmv if the tensor does not have enough cols

* llama : lora fixes

* do not disable mmap with lora

Co-authored-by: slaren <slarengh@gmail.com>

* llm_build_lora_mm_id

* convert_lora : MoE LoRA conversion support

* convert_lora : prefer safetensors, similarly to convert_hf

* convert_hf : simplify modify_tensors for InternLM2

* convert_lora : lazy conversion

* llama : load and use alpha from LoRA adapters

* llama : use llm_build_lora_mm in most model graphs

* auto scale

* Revert "auto scale"

This reverts commit 42415a4874.

* remove redundant params

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* change kv metadata

* move add_type to __init__

* convert_hf : move add_type to main()

* convert_lora : use the GGUFWriter from Model instead of overwriting it

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-07-15 20:50:47 +02:00
Daniel Bevenius 8fac431b06
ggml : suppress unknown pragma 'GCC' on windows (#8460)
This commit adds a macro guard to pragma GCC to avoid the following
warning on windows:

```console
C:\llama.cpp\ggml\src\ggml-aarch64.c(17,9): warning C4068:
unknown pragma 'GCC' [C:\llama.cpp\build\ggml\src\ggml.vcxproj]
```
2024-07-15 15:48:17 +03:00
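A macro guard of this kind can be sketched as follows. The specific diagnostic being ignored and the `DEMO_` names are assumptions for illustration; the point is only that the `#pragma GCC` line is hidden from compilers that do not define `__GNUC__`, such as MSVC.

```c
/* Only GCC-compatible compilers understand "#pragma GCC"; wrapping it in a
   guard keeps MSVC from reporting warning C4068 (unknown pragma). */
#if defined(__GNUC__)
#pragma GCC diagnostic ignored "-Woverlength-strings"
#define DEMO_GCC_PRAGMA_ACTIVE 1
#else
#define DEMO_GCC_PRAGMA_ACTIVE 0
#endif

/* Reports whether the guarded pragma was seen by this compiler. */
int demo_gcc_pragma_active(void) {
    return DEMO_GCC_PRAGMA_ACTIVE;
}
```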
Meng, Hengyu 16bdfa42ac
[SYCL] add concat through dim 1/2 (#8483)
* add concat through dim 1/2
2024-07-15 19:32:15 +08:00
0cc4m bda62d7999
Vulkan MMQ Fix (#8479)
* Fix incoherence by adding missing LOAD_VEC_A parameter

* Fix Vulkan op result checker build error
2024-07-15 09:38:52 +02:00
bandoti 17eb6aa8a9
vulkan : cmake integration (#8119)
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
2024-07-13 18:12:39 +02:00
Georgi Gerganov c917b67f06
metal : template-ify some of the kernels (#8447)
ggml-ci
2024-07-13 18:32:33 +03:00
Georgi Gerganov 370b1f7e7a
ggml : minor naming changes (#8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-07-12 10:46:02 +03:00
Chen Xi b549a1bbef
[SYCL] fix the mul_mat_id ut issues (#8427)
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <xi2chen@intel.com>

---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Chen Xi <xi2chen@intel.com>
2024-07-12 08:52:04 +08:00
Nicholai Tukanov 368645698a
ggml : add NVPL BLAS support (#8329) (#8425)
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <ntukanov@nvidia.com>
2024-07-11 18:49:15 +02:00
Daniel Bevenius b078c619aa
cuda : suppress 'noreturn' warn in no_device_code (#8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-11 17:53:42 +02:00
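The second approach, fixing the trap macro so the compiler can see that control never leaves it, can be sketched in plain C as below. `DEMO_TRAP`, `DEMO_NORETURN`, `demo_fatal` and `demo_checked_div` are hypothetical names, and the macro body is an assumption about the general shape of such a fix rather than a copy of the actual change.

```c
#include <stdlib.h>

#if defined(__GNUC__)
#define DEMO_NORETURN __attribute__((noreturn))
/* Mark the point after abort() as unreachable, so a 'noreturn' function
   whose body is just this macro does not appear to fall off its end. */
#define DEMO_TRAP() do { abort(); __builtin_unreachable(); } while (0)
#else
#define DEMO_NORETURN
#define DEMO_TRAP() abort()
#endif

/* Declared 'noreturn'; the body must never return to the caller. */
DEMO_NORETURN static void demo_fatal(void) {
    DEMO_TRAP();
}

/* Example caller: the error path ends in the noreturn function. */
int demo_checked_div(int a, int b) {
    if (b == 0) {
        demo_fatal();
    }
    return a / b;
}
```

Compared with wrapping the call in `while (true)`, changing the macro fixes every use site at once, which may be why the squashed follow-up took that route.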
Johannes Gäßler 808aba3916
CUDA: optimize and refactor MMQ (#8416)
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
2024-07-11 16:47:47 +02:00