Commit Graph

4396 Commits

Author SHA1 Message Date
nullname e36ad89528
bugfix: error pre-allocated tensor (k_cache_view-0) (#12)
* fix device binding at ggml_backend_qnn_buffer_type

* merge ggml_backend_qnn_buffer_context and qnn_mem_buffer

* wip

* add log

* wip

* add qnn_buffer_ptr

* remove tailing `\n` at log

* add log

* enable GGML_OP_NONE

* wip

* wip

* disable tensor with view

* wip

* wip

* more log for view tensor

* re-enable view

* wip

* remove link android lib

* set dimension at bind function

* move graph traversal to backend-ops

* wip

* add get_view_internal_dimension to obtain the tensor view source dimension

* use _view_source_dimensions to allocate qnn tensor

* add place holder function ggml_backend_qnn_cpy_tensor_async

* add ggml_qnn_aggregate_op_config

* make matmul based on ggml_qnn_aggregate_op_config

* wip

* manually specify the order of op destruct

* skip register qnn-cpu backend

* disable view op again

* remove _view_source_dimensions

* add nop for reshape and view ops

* add log

* add comment
2024-12-11 10:42:00 +08:00
hongruichen 0d02ee09ed fix int overflow and remove view op to pass unit test 2024-12-03 10:55:11 +08:00
hongruichen cf91253c41 Merge branch 'master' into dev-refactoring
# Conflicts:
#	ggml/src/ggml-backend-reg.cpp
2024-12-03 10:51:15 +08:00
Xuan Son Nguyen 642330ac7c
llama : add enum for built-in chat templates (#10623)
* llama : add enum for supported chat templates

* use "built-in" instead of "supported"

* arg: print list of built-in templates

* fix test

* update server README
2024-12-02 22:10:19 +01:00
Georgi Gerganov 8648c52101
make : deprecate (#10514)
* make : deprecate

ggml-ci

* ci : disable Makefile builds

ggml-ci

* docs : remove make references [no ci]

* ci : disable swift build

ggml-ci

* docs : remove obsolete make references, scripts, examples

ggml-ci

* basic fix for compare-commits.sh

* update build.md

* more build.md updates

* more build.md updates

* more build.md updates

* Update Makefile

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-12-02 21:22:53 +02:00
haopeng 64ed2091b2
server: Add "tokens per second" information in the backend (#10548)
* add cmake rvv support

* add timings

* remove space

* update readme

* fix

* fix code

* remove empty line

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-02 14:45:54 +01:00
Akarshan Biswas 991f8aabee
SYCL: Fix and switch to GGML_LOG system instead of fprintf (#10579)
* Switched to GGML_LOG

* Fix missing semicolon
2024-12-02 15:04:11 +08:00
Georgi Gerganov 4cb003dd8d
contrib : refresh (#10593)
* contrib : refresh

* contrib : expand [no ci]

* contrib : expand test-backend-ops instructions

* contrib : add CODEOWNERS

* prs : update template to not have checkbox [no ci]
2024-12-02 08:53:27 +02:00
Juk Armstrong 917786f43d
Add `mistral-v1`, `mistral-v3`, `mistral-v3-tekken` and `mistral-v7` chat template types (#10572)
* Templates: `mistral-v1`, `mistral-v2`, `mistral-v3`, `mistral-v3-tekken`

* Changed system message logic and added tests for all 4

* Invalid `system_message` instead of `content` fixed

* Removed tab-indented lines

* Added template code and test for `mistral-v7`

* Added all tests. Fixed bug with `tmpl == "llama2"` test.

* Replaced tabs with spaces.

* Removed `'mistral-v2'` option as no (open) models ever used it

* Removed all references to 'v2' template from comments

* Update llama.cpp

Fixed `trim_assistant_message` bug
2024-12-01 23:09:49 +01:00
Georgi Gerganov 5e1ed95583
grammars : add English-only grammar (#10612) 2024-12-01 21:37:54 +02:00
Wang Qin 5c7a5aa0c3
ci: add error handling for Python venv creation in run.sh (#10608) 2024-12-01 20:11:42 +02:00
Diego Devesa 3420909dff
ggml : automatic selection of best CPU backend (#10606)
* ggml : automatic selection of best CPU backend

* amx : minor opt

* add GGML_AVX_VNNI to enable avx-vnni, fix checks
2024-12-01 16:12:41 +01:00
alek3y 86dc11c5bc
server : bind to any port when specified (#10590) 2024-12-01 13:33:12 +02:00
Georgi Gerganov 6acce39710
readme : update the usage section with examples (#10596)
* readme : update the usage section with examples

* readme : more examples
2024-12-01 11:25:17 +02:00
Wang Qin 43957ef203
build: update Makefile comments for C++ version change (#10598) 2024-12-01 04:19:44 +01:00
Adrien Gallouët 0c39f44d70
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (#10567)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-11-30 09:13:18 -08:00
Georgi Gerganov 3e0ba0e604
readme : remove old badge 2024-11-30 10:09:21 +02:00
Georgi Gerganov abadba05be
readme : refresh (#10587)
* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
2024-11-30 09:47:07 +02:00
Eve 0533e7fb38
vulkan: Dynamic subgroup size support for Q6_K mat_vec (#10536)
* subgroup 64 version with subgroup add. 15% faster

scalable version

tested for subgroup sizes 16-128

* check for subgroup multiple of 16 and greater than 16

* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)

* force 16 sequential threads per block

* make 16 subgroup size a constant
2024-11-30 08:00:02 +01:00
Diego Devesa 7cc2d2c889
ggml : move AMX to the CPU backend (#10570)
* ggml : move AMX to the CPU backend

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Xuan Son Nguyen b782e5c7d4
server : add more test cases (#10569)
* server : add split model test

* add test speculative

* add invalid cases
2024-11-29 21:48:56 +01:00
Robert Collins 3a8e9af402
imatrix : support combine-only (#10492)
* imatrix-combine-only idea

* ensured that behavior consistent with log
2024-11-29 19:21:37 +02:00
Diego Devesa a3a3048e7a
cleanup UI link list (#10577)
* cleanup UI link list

* sort list alphabetically

* add missing licenses
2024-11-29 17:45:08 +01:00
hongruichen c5e6549331 fix: fix assertion 2024-11-29 23:38:06 +08:00
Georgi Gerganov f0678c5ff4
ggml : fix I8MM Q4_1 scaling factor conversion (#10562)
ggml-ci
2024-11-29 16:25:39 +02:00
Shupei Fan 4b3242bbea
ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (#10580) 2024-11-29 14:49:02 +01:00
Alberto Cabrera Pérez 0f77aae560
sycl : offload of get_rows set to 0 (#10432) 2024-11-29 20:38:45 +08:00
Alberto Cabrera Pérez 266b8519ee
sycl : Reroute permuted mul_mats through oneMKL (#10408)
This PR fixes the failing MUL_MAT tests for the sycl backend.
2024-11-29 09:49:43 +00:00
hongruichen 09efaa389e define compile flag as module private 2024-11-29 17:24:05 +08:00
hongruichen 6d4feae579 redo conflict changes 2024-11-29 17:14:01 +08:00
hongruichen 67b183ceb1 Merge branch 'master' into dev-refactoring
# Conflicts:
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-backend.cpp
2024-11-29 16:31:41 +08:00
Chenguang Li 938f608742
CANN: RoPE operator optimization (#10563)
* [cann] RoPE operator optimization

* [CANN]Code Formatting

---------

Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-29 14:46:55 +08:00
hongruichen 5103b166ba bugfix: block large tensor calc in npu 2024-11-29 14:19:34 +08:00
Jeff Bolz f095a649ec
vulkan: get the first command buffer submitted sooner (#10499)
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
2024-11-29 07:18:02 +01:00
Ting Lou 678d7994f4
llava: return false instead of exit (#10546) 2024-11-29 01:09:46 +01:00
Georgi Gerganov dc22344088
ggml : remove redundant copyright notice + update authors 2024-11-28 20:46:40 +02:00
Georgi Gerganov 4c0a95b107
llama : add missing model types 2024-11-28 20:45:07 +02:00
Xuan Son Nguyen 6c59567689
server : (tests) don't use thread for capturing stdout/stderr, bump openai client library (#10568)
* server : (tests) don't use thread for capturing stdout/stderr

* test: bump openai to 1.55.2

* bump openai to 1.55.3
2024-11-28 19:17:49 +01:00
Johannes Gäßler 890719311b
common: fix warning message when no GPU found (#10564) 2024-11-28 18:15:25 +01:00
nullname a2df09b6af
[WIP] feat: perf opt (#10)
* reduce log

* wip

* add function to create concat nodes

* opt

* insert concat node before mulmat

* use resize op

* wip

* add bind_buffer and remov ggml prefix in tensor types

* use gather node instead

* fix tensor type, now succeed in gpu and cpu, failed in npu

* add comment

* wip

* add comment

* wip

* in destructor, clear internal buffer before unbind

* disable gather for npu

* wip

* count swap memory as free memory

* wip

* fix supported_types

ggml_backend_device_i.supports_op will be invoked before ggml_backend_device_i.init_backend

* rename create_tensors -> initialize_op_nodes

* move ggml_qnn_op_config to deparated file

* wip

* add create_convert_nodes

* add comment

* enable different type in/out for npu and cpu backend

* fix npu convert op

* enlarge max buffer size

* add more error code

* check tensor type before create convert node

* add log

* add log

* remove transpose0 and use buildin transpose flag

* rename transpose1 -> transpose_out

* disable convert for npu

* add more logs
2024-11-29 00:03:23 +08:00
Random Fly 7281cf13ad
docs: fix outdated usage of llama-simple (#10565) 2024-11-28 16:03:11 +01:00
Diego Devesa e90688edd0
ci : fix tag name in cuda and hip releases (#10566) 2024-11-28 15:58:54 +01:00
Georgi Gerganov 76b27d29c2
ggml : fix row condition for i8mm kernels (#10561)
ggml-ci
2024-11-28 14:56:37 +02:00
Georgi Gerganov eea986f215
cmake : fix ARM feature detection (#10543)
ggml-ci
2024-11-28 14:56:23 +02:00
Shupei Fan c202cef168
ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
* ggml-cpu: support IQ4_NL_4_4 by runtime repack

* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
2024-11-28 13:52:03 +01:00
Sergio López 2025fa67e9
kompute : improve backend to pass test_backend_ops (#10542)
* kompute: op_unary: reject unsupported parameters

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: softmax: implement ALiBi support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: rope: implement neox and phi3 support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_q4_k permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_[q4_0|q4_1|q8_0] permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_f16 permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_q6_k permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

---------

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-11-28 12:51:38 +01:00
Ruixin Huang c6bc73951e
CANN: Update cann.md to display correctly in CLion (#10538) 2024-11-28 15:27:11 +08:00
leo-pony 605fa66c50
CANN: Fix SOC_TYPE compile bug (#10519)
* CANN: Fix the bug build fail on Ascend310P under two cases:
1) Manual specify SOC_TYPE
2) Under some unusual compile environment

* Update the cann backend News content: Support F16 and F32 data type model for Ascend 310P NPU.

* fix CANN  compile fail bug: the assert in ascend kernel function doesn't supportted on some CANN version
2024-11-28 15:25:24 +08:00
Chenguang Li b7420131bf
CANN: ROPE operator optimization (#10540)
* [cann] ROPE operator optimization

Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-28 14:24:46 +08:00
Xuan Son Nguyen 9f912511bc
common : fix duplicated file name with hf_repo and hf_file (#10550) 2024-11-27 22:30:52 +01:00