Commit Graph

5496 Commits

Author SHA1 Message Date
nullname 295f7f5957
feat: perf opt part3 (#42)
* add f16 support to etl wise op

* wip

* Revert "wip"

This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.

* qf32 for mul

* wip

* Revert "wip"

This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.

* disable fp16 add/sub

* tempate trick

* wip

* add f16 mulmat

* add log

* fix view liked op

* add log

* fix f16 mulmat

* add quant type

* wip

* add l2fetch

* add vtcm_mem

* wip

* fix fetch

* use vtcm cache in mulmat

* revert vtcm cache

* cache plane

* small opt for plane cache

* cache plane for some element wise op

* wip

* enable fetch even on vtcm

* wip

* copy sysMonApp

* small opt

* init ltu

* add compute_params

* add op common header

* move vtcm_mem allocation to compute_param

* fallback to memcache when vtcm allocate failed

* pre-calculate quantize type

* wip

* try fix test failure

* try fix mulmat nan

* fix inf in mulmat

* remove debug logs

* wip

* small refactoring on the dequant row func

* fix typo

* improve logging

* add q4_0 and q8_0

* wip

* wip

* build hexagon libs in cmake

* wip

* fix qnn only build flag

* fix typo

* fix todo

* wip

* wip

* add to_float

* use to)float directly instead of ltu

* wip

* cache f16_to_f32 table into vtcm

* print tensor dims at log

* init device in supports_op_impl

* revert cache ltu

* wip

* wip

* fix graph calc issues by validate cache manually after each op

* add cache invalidate func

* enable cache fallback only in quantize tensors

* add option to disable quantized tensors

* propagate the asan flag to npu build

* fix asan option

* wip

* invalidate tensors after finished

* implement backend_buffer_reset

* wip

* wip

* refactoring plane cache mechanism

* wip

* split row elements across thread

* use table for f16 to f32 conversion

* sync after each op

* small refactoring to invalidate l2 cahce

* wip

* opt on float fetching

* unroll for loop manually

* reduce vtcm usage

* add perf tracking for npu

* print dimensions for profiler log

* wip

* wip

* wip

* add sub proc tracker

* fix typo

* print pcycles

* wip

* wip

* prefetch rows

* add l2fetch_row

* small tweak based on perf tracer

* opt l2 fetching

* wip
2025-05-16 19:57:33 +08:00
hongruichen db2a125438 fix GGML_QNN_ENABLE_PERFORMANCE_TRACKING option 2025-05-13 20:18:09 +08:00
hongruichen 02af8ff653 fix qnn only build flag 2025-05-08 21:28:11 +08:00
hongruichen 0ce53ce7cd fix linking error 2025-05-08 12:19:40 +08:00
hongruichen 039f835410 fix compiling error 2025-05-08 10:17:48 +08:00
hongruichen aca70692d3 Merge branch 'master' into dev-refactoring 2025-05-08 10:11:26 +08:00
hongruichen 161c4ee124 fix typo 2025-05-08 01:20:41 +08:00
Georgi Gerganov d879433824 sync : ggml
ggml-ci
2025-05-07 17:28:36 +03:00
Daniel Bevenius 13b0a04597 whisper: remove MSVC warnings pragmas (whisper/3090)
* ggml : remove MSVC warnings pragmas

This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.

* whisper : remove MSVC warning pragmas

This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
2025-05-07 17:28:36 +03:00
Jared Tweed bba9d945c1 cmake : removed stdc++fs (whisper/3097)
* removed stdc++fs

* kept line, but removed stdc++fs
2025-05-07 17:28:36 +03:00
Sigbjørn Skjæret bc4e1128f7
llama : deci : support ffn-free with attention (#13296) 2025-05-07 12:49:27 +02:00
Ycros 39e73ae0d6
common : Add a warning when we can't match samplers from a string or char. (#13330) 2025-05-07 11:23:28 +03:00
R0CKSTAR 1f73301b63
cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-07 09:48:23 +02:00
Georgi Gerganov 4773d7a02f
examples : remove infill (#13283)
ggml-ci
2025-05-07 10:28:02 +03:00
piDack 6c7fd67b64
llama : support tie embedding for chatglm models (#13328) 2025-05-07 09:23:11 +02:00
Johannes Gäßler 141a908a59
CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135) 2025-05-06 23:35:51 +02:00
Xuan-Son Nguyen 32916a4907
clip : refactor graph builder (#13321)
* mtmd : refactor graph builder

* fix qwen2vl

* clean up siglip cgraph

* pixtral migrated

* move minicpmv to a dedicated build function

* move max_feature_layer to build_llava

* use build_attn for minicpm resampler

* fix windows build

* add comment for batch_size

* also support tinygemma3 test model

* qwen2vl does not use RMS norm

* fix qwen2vl norm (2)
2025-05-06 22:40:24 +02:00
DocShotgun ffc727203a
sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345) 2025-05-06 22:36:24 +02:00
oobabooga 91a86a6f35
sampling : don't consider -infinity values in top_n_sigma (#13344) 2025-05-06 20:24:15 +02:00
Diego Devesa f4ed10b69c
cmake : remove arm64 msvc presets (#13342) 2025-05-06 20:15:31 +02:00
Akarshan Biswas 1e333d5bba
SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254)
* SYCL: Do not set tensor extras when reorder optimize is disabled

* SYCL: Disable reorder optimize by default
2025-05-06 20:27:06 +05:30
Xuan-Son Nguyen 2f54e348ad
llama : fix build_ffn without gate (#13336)
* llama : fix build_ffn without gate

* fix build on windows

* Revert "fix build on windows"

This reverts commit fc420d3c7e.
2025-05-06 14:25:40 +02:00
Johannes Gäßler 2356fb1d53
CUDA: fix bad asserts for partial offload (#13337) 2025-05-06 13:58:51 +02:00
Sigbjørn Skjæret 764b85627b
convert : qwen2/3moe : set yarn metadata if present (#13331)
* set yarn metadata if present

* add comment about enabling YaRN

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-05-06 11:12:06 +02:00
Johannes Gäßler 15a28ec8c7
CUDA: fix --split-mode row for MMQ (#13323) 2025-05-06 08:36:46 +02:00
compilade a7366faa5b
gguf-py : avoid requiring pyside6 for other scripts (#13036)
- gguf-py : remove gguf-py/gguf/scripts/__init__.py because it's not needed

Implicit namespaces are supported since Python 3.3 (https://peps.python.org/pep-0420/),
and the entrypoints in pyproject.toml can directly refer to the main functions.
2025-05-05 22:27:31 -04:00
Johannes Gäßler 9070365020
CUDA: fix logic for clearing padding with -ngl 0 (#13320) 2025-05-05 22:32:13 +02:00
oobabooga 233461f812
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264)
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering

* revert: sampler ordering

* revert: VS' crappy auto-formatting

* revert: VS' crappy auto-formatting pt.2

* revert: my crappy eye sight...

* sampling: add XTC to Top-nσ sampler chain

* sampling: add Dyna. Temp. to Top-nσ sampler chain

* sampling: actually remove Top-nσ from sampler(oops)

* Integrate top_n_sigma into main sampler chain

* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA

* Formatting

* Lint

* Exit early in the sampler if nsigma < 0

---------

Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>
2025-05-05 22:12:19 +02:00
igardev b34c859146
server : Webui - change setText command from parent window to also send the message. (#13309)
* setText command from parent window for llama-vscode now sends the message automatically.

* Upgrade packages versions to fix vulnerabilities with "npm audit fix" command.

* Fix code formatting.

* Add index.html.gz changes.

* Revert "Upgrade packages versions to fix vulnerabilities with "npm audit fix" command."

This reverts commit 67687b7fda.

* easier approach

* add setTimeout

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-05 16:03:31 +02:00
Xuan-Son Nguyen 9b61acf060
mtmd : rename llava directory to mtmd (#13311)
* mv llava to mtmd

* change ref everywhere
2025-05-05 16:02:55 +02:00
Xuan-Son Nguyen 5215b91e93
clip : fix confused naming ffn_up and ffn_down (#13290)
* clip :  fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff
2025-05-05 12:54:44 +02:00
Sigbjørn Skjæret ae803bfc3d
convert : bailingmoe : set yarn metadata if present (#13312) 2025-05-05 12:34:26 +02:00
Akarshan Biswas 66645a5285
SYCL: Disable mul_mat kernels for noncontiguous tensor b (#13308)
ggml-ci
2025-05-05 13:39:10 +05:30
Xuan-Son Nguyen 27aa259532
mtmd : add C public API (#13184)
* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* add const to various places

* add warning about breaking changes

* helper: use mtmd_image_tokens_get_n_pos
2025-05-04 23:43:42 +02:00
Diego Devesa 9fdfcdaedd
rpc : use backend registry, support dl backends (#13304) 2025-05-04 21:25:43 +02:00
Aaron Teo 6eb7d25c70
ggml : activate s390x simd for Q3_K (#13301)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-05-04 19:49:12 +02:00
Diego Devesa 86bd60d3fe
llava/mtmd : fixes to fully support dl backends (#13303) 2025-05-04 17:05:20 +02:00
Diego Devesa 9f2da5871f
llama : build windows releases with dl backends (#13220) 2025-05-04 14:20:49 +02:00
Johannes Gäßler 93c4e23905
CUDA: fix race condition in MMQ stream-k fixup (#13299) 2025-05-04 14:16:39 +02:00
Johannes Gäßler 8afbd96818
CUDA: fix race condition in MMQ ids_dst (#13294) 2025-05-04 13:58:38 +02:00
Jeff Bolz 8ae5ebcf85
vulkan: Additional type support for unary, binary, and copy (#13266)
Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
2025-05-04 07:17:16 +02:00
Johannes Gäßler 3e959f0976
imatrix: fix oob writes if src1 is not contiguous (#13286) 2025-05-04 00:50:37 +02:00
Xuan-Son Nguyen 36667c8edc
clip : revert the change of BOI/EOI token for GLM-edge (⚠️ breaking change) (#13259) 2025-05-03 20:07:54 +02:00
ymcki 3bf785f3ef
llama : Llama-3_1-Nemotron-Ultra-253B-v1 support (#12843) 2025-05-03 17:39:51 +02:00
Diego Devesa 1d36b3670b
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00
Georgi Gerganov b34443923c
sync : ggml (#13268)
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)

* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)

* review: remove src_x/y < 0 checks; add performance tests

* sync : ggml

ggml-ci

* vulkan : fix lint (#0)

---------

Co-authored-by: Acly <aclysia@gmail.com>
2025-05-02 20:54:30 +03:00
Georgi Gerganov a75cb30dc9
context : fix reorder logic (#13267)
ggml-ci
2025-05-02 20:54:13 +03:00
shalinib-ibm 3f3769ba76
ggml : Enable MMA for BF16 in llamafile_sgemm (#13148)
This patch upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for BF16 data type.

This change results in 9x - 40x gains
in total speed S t/s (ie all tokens/total time), across various batch sizes tested using llama-batched-bench benchmark.

The patch is tested with Meta-Lllama-3-8B,
and Mistral-7B models (BF16 models generated by using llama-quantize from corresponding FP32 models) on an IBM POWER10 machine.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-05-02 19:53:12 +03:00
Jared Van Bortel 2f567611c0
llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (#13245) 2025-05-02 11:42:30 -04:00
Jared Van Bortel 7d2123484e
convert : use correct context length for nomic-embed-text-v2 (#13216) 2025-05-02 11:41:54 -04:00