Commit Graph

6014 Commits

Author SHA1 Message Date
Ed Addario 90cb1be99d
Minor cosmetic changes 2025-08-03 16:57:27 +01:00
Ed Addario 2117c4e54b
Update aggregated statistics report layout 2025-08-03 16:38:02 +01:00
Ed Addario a6155a8125
Add compute_layer_statistics() function 2025-08-03 16:35:03 +01:00
Ed Addario be60469f25
Refactor function names 2025-08-03 15:10:17 +01:00
Ed Addario fce05aac9e
Refactor lambda into compute_tensor_averages() function 2025-08-03 13:03:21 +01:00
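For context, a refactor like this lifts an inline lambda into a named helper that can be reused and tested on its own. A minimal C++ sketch of the idea, with a hypothetical signature (the actual compute_tensor_averages() in the commit may differ):

```cpp
#include <vector>

// Hypothetical helper illustrating the lambda-to-function refactor named in
// the commit above; the real compute_tensor_averages() may look different.
static float compute_tensor_averages(const std::vector<float> & values) {
    // Previously: an inline lambda inside the reporting loop did this sum.
    double sum = 0.0;
    for (float v : values) {
        sum += v;
    }
    return values.empty() ? 0.0f : (float)(sum / values.size());
}
```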
Ed Addario 5324558132
Update table layout 2025-08-03 10:28:47 +01:00
Ed Addario 4d1325e1eb
Refactor variables 2025-08-03 10:28:23 +01:00
Ed Addario a32a2ecbed
Reformat report layout 2025-08-03 00:51:33 +01:00
Ed Addario 4c01f51ae1
Remove inactive 2025-08-03 00:51:12 +01:00
Ed Addario fc8f92596f
Update table display 2025-08-02 16:46:27 +01:00
Ed Addario ee2509f563
Adjust threshold 2025-08-02 16:45:56 +01:00
Ed Addario 9b841eb696
Compute L2 norm 2025-08-02 16:45:09 +01:00
Ed Addario b7fb362d8e
Compute cosine similarity based on activations 2025-08-02 16:43:49 +01:00
Ed Addario cce514a392
Compute entropy for activations 2025-08-02 16:40:40 +01:00
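The three commits above ("Compute L2 norm", "Compute cosine similarity based on activations", "Compute entropy for activations") add per-tensor activation statistics. A hedged C++ sketch of the underlying math; function names and the entropy normalization are assumptions, not the actual implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// L2 norm of an activation vector.
static float l2_norm(const std::vector<float> & x) {
    double s = 0.0;
    for (float v : x) {
        s += (double) v * v;
    }
    return (float) std::sqrt(s);
}

// Cosine similarity between two activation vectors of equal length.
static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    return (float) (dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12));
}

// Shannon entropy, assuming squared activations are normalized into a
// probability distribution (one plausible choice; the commits do not spell
// out the normalization used).
static float entropy(const std::vector<float> & x) {
    double total = 0.0;
    for (float v : x) {
        total += (double) v * v;
    }
    if (total <= 0.0) {
        return 0.0f;
    }
    double h = 0.0;
    for (float v : x) {
        const double p = (double) v * v / total;
        if (p > 0.0) {
            h -= p * std::log2(p);
        }
    }
    return (float) h;
}
```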
Ed Addario 9744a4a1c6
Determine calculation mode 2025-08-02 16:36:12 +01:00
Ed Addario 78ddb475de
Fix problem when GGUF does not have in_sum 2025-08-02 16:31:21 +01:00
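A fix like this typically amounts to probing for the tensor before reading it and falling back when it is absent. A minimal sketch against ggml's GGUF API; the fallback policy is an assumption, not the commit's exact logic:

```cpp
#include <string>

#include "gguf.h"

// Older imatrix GGUF files predate the activation statistics and carry no
// ".in_sum" tensors; probe before use instead of assuming they exist.
static bool has_in_sum(const struct gguf_context * ctx, const std::string & base) {
    return gguf_find_tensor(ctx, (base + ".in_sum").c_str()) >= 0; // -1 = absent
}
```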
Ed Addario 2097f038b0
Refactor variable names 2025-07-31 20:46:40 +01:00
Ed Addario 09bc7c24e7
Use activations to calculate the stats 2025-07-26 17:06:41 +01:00
hipudding 11dd5a44eb
CANN: Implement GLU ops (#14884)
Implement REGLU, GEGLU, SWIGLU ops according to #14158
2025-07-26 17:56:18 +08:00
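For reference, these GLU variants follow the usual split-gate definitions: the input is split into halves a and b, and the output is act(a) * b element-wise. A scalar C++ sketch of the per-element math (the CANN commit implements the same ops on Ascend hardware; these reference functions are illustrative):

```cpp
#include <cmath>

// Per-element reference semantics of the three GLU variants.
static float reglu_one (float a, float b) { return (a > 0.0f ? a : 0.0f) * b; }     // ReLU(a) * b
static float swiglu_one(float a, float b) { return a / (1.0f + std::exp(-a)) * b; } // SiLU(a) * b
static float geglu_one (float a, float b) {                                         // GELU(a) * b, tanh approximation
    const float g = 0.5f * a * (1.0f + std::tanh(0.7978845608f * (a + 0.044715f * a * a * a)));
    return g * b;
}
```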
R0CKSTAR 9b8f3c6c77
musa: fix build warnings (unused variable) (#14869)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-26 10:36:02 +08:00
Aaron Teo c7f3169cd5
ggml-cpu : disable GGML_NNPA by default due to instability (#14880)
* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit e086c5e3a7)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8410b085ea)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 412f4c7c88)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit c1eeae1d0c)

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-25 19:09:03 +02:00
Gabe Goodhart 793c0d7f46
metal: SSM_SCAN performance (#14743)
* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on
prefill and 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Correctly size the shared memory buffer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallelize over d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the size of the parallelism here is
not worth the cost in complexity or overhead of the parallel for.

https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357

This reverts commit 16bc059660.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-25 10:47:39 -06:00
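The summation described in the commit messages above is a standard two-stage simdgroup reduction. A minimal Metal Shading Language sketch of the pattern; the kernel name and buffer layout are illustrative, not the actual SSM_SCAN kernel:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void sum_over_d_state(
        device const float * x       [[buffer(0)]],
        device       float * y       [[buffer(1)]],
        threadgroup  float * partial [[threadgroup(0)]],
        uint tid  [[thread_index_in_threadgroup]],
        uint lane [[thread_index_in_simdgroup]],
        uint sgid [[simdgroup_index_in_threadgroup]],
        uint nsg  [[simdgroups_per_threadgroup]]) {
    // stage 1: each simdgroup reduces its lanes in registers
    const float s = simd_sum(x[tid]);
    if (lane == 0) {
        partial[sgid] = s; // one partial per simdgroup in shared memory
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // stage 2: a secondary simd_sum over the partials, replacing a for loop;
    // assumes the number of simdgroups fits within one simd width, which is
    // what the commit's added assertion guards
    if (sgid == 0) {
        const float total = simd_sum(lane < nsg ? partial[lane] : 0.0f);
        if (lane == 0) {
            y[0] = total;
        }
    }
}
```

Launching with a threadgroup of size d_state, as the first optimization commit describes, lets tid index directly into the state dimension.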
lhez ce111d39d6
opencl: add fused `rms_norm_mul` (#14841)
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
2025-07-25 17:12:13 +02:00
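Fusing `rms_norm` with the `mul` that follows it saves a full write and re-read of the intermediate tensor. A scalar C++ sketch of the fused per-row semantics; the real OpenCL kernel parallelizes this across the workgroup, and eps is taken from the op's parameters in ggml:

```cpp
#include <cmath>

// Fused rms_norm + mul over one row: out[i] = (x[i] / rms(x)) * w[i].
static void rms_norm_mul_row(const float * x, const float * w, float * out,
                             int n, float eps) {
    double sumsq = 0.0;
    for (int i = 0; i < n; i++) {
        sumsq += (double) x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt((float)(sumsq / n) + eps);
    for (int i = 0; i < n; i++) {
        out[i] = x[i] * scale * w[i]; // single pass, no temporary buffer
    }
}
```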
wooksong e7fecba934
docs : update HOWTO-add-model.md for ModelBase and new model classes (#14874)
This patch updates the example in docs/development/HOWTO-add-model.md to
reflect recent changes after `TextModel` and `MmprojModel` were introduced.

It replaces the outdated `Model` base class with `TextModel` or `MmprojModel`
and updates the registration example accordingly.

Signed-off-by: Wook Song <wook16.song@samsung.com>
2025-07-25 16:25:05 +02:00
Oliver Simons e2b7621e7c
ggml : remove invalid portPos specifiers from dot files (#14838)
Neither "g" nor "x" are valid portPos specifiers per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I tested locally and confirmed that graphviz falls back to the default
portPos specifier when an invalid one is specified. As a consequence, the
associated code can be removed.
2025-07-25 14:29:57 +03:00
Georgi Gerganov c1dbea752a
context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870)
ggml-ci
2025-07-25 14:28:06 +03:00
kiwi 749e0d27f0
mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (#14503)
* [fix] Fix 32-bit narrowing issue in export-lora and mtmd clip

* Update export-lora.cpp

* Update clip.cpp

* Update export-lora.cpp

* format: replace tabs with spaces
2025-07-25 13:08:04 +02:00
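For reference, a 32-bit narrowing issue is a 64-bit value squeezed into a 32-bit one, which truncates on large tensors. A minimal C++ illustration of the class of bug and the usual fix; this is not the exact code from the PR:

```cpp
#include <cstddef>
#include <cstdint>

void example(int64_t n_elements) {
    // int32_t n{n_elements};  // error: narrowing conversion in brace-init
    // int32_t n = n_elements; // compiles, but silently truncates past 2^31-1

    // fix: keep the wide type, or cast explicitly where 32 bits are known safe
    const size_t n = (size_t) n_elements;
    (void) n;
}
```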
Chris Rohlf 64bf1c3744
rpc : check for null buffers in get/set/copy tensor endpoints (#14868) 2025-07-25 12:17:02 +02:00
Diego Devesa c12bbde372
sched : fix multiple evaluations of the same graph with pipeline parallelism (#14855)
ggml-ci
2025-07-25 11:07:26 +03:00
R0CKSTAR 3f4fc97f1d
musa: upgrade musa sdk to rc4.2.0 (#14498)
* musa: apply mublas API changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update musa version to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore MUSA graph settings in CMakeLists.txt

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable mudnnMemcpyAsync by default

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: switch back to non-mudnn images

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* minor changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-24 20:05:37 +01:00
Georgi Gerganov 2df255da3c
sync : ggml
ggml-ci
2025-07-24 20:27:23 +03:00
Kai Pastor 60f816a79d
cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-24 20:27:23 +03:00
Daniel Bevenius 5592f278b6
ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
2025-07-24 20:27:23 +03:00
Georgi Gerganov e4868d16d2
context : perform output reorder lazily upon access after sync (#14853)
* context : perform output reorder lazily upon access after sync

ggml-ci

* cont : add TODO
2025-07-24 16:31:48 +03:00
Xuan-Son Nguyen 820de57d4f
chat : fix kimi-k2 chat template (#14852) 2025-07-24 13:59:56 +02:00
Alberto Cabrera Pérez cb4a63aad6
sycl: fixed semantics of block offset calculation (#14814) 2025-07-24 11:09:57 +01:00
yummy 86f5623d90
llama : fix MiniCPM inference after Granite Four changes (#14850)
MiniCPM models use the llm_build_granite constructor, which the Granite Four
PR changed to read hparams.rope_finetuned instead of taking a use_rope
parameter. MiniCPM models need rope enabled by default.

Fixes inference: output goes from gibberish to correct responses.
2025-07-24 11:50:51 +02:00
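A hedged sketch of the shape of this fix: since llm_build_granite now reads its RoPE decision from hparams, any architecture routed through it must set the flag. The names below mirror the commit message, but the surrounding structure is illustrative:

```cpp
// Illustrative, not the PR's exact code.
struct hparams_sketch {
    bool rope_finetuned = false; // llm_build_granite now consults this
};

static void init_minicpm(hparams_sketch & hparams) {
    // MiniCPM needs RoPE enabled by default, which the old use_rope
    // parameter provided and the hparams flag must now provide instead.
    hparams.rope_finetuned = true;
}
```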
Pouya 39cffdf188
docs: add libcurl-dev install hint for Linux distros (#14801)
* docs: add libcurl-dev install hint for Linux distros

Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com>

* Update docs/build.md

---------

Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-07-24 11:26:44 +02:00
Georgi Gerganov 065908cb09
metal : fix fusion across different encoders (#14849)
* metal : fix fusion across different encoders

ggml-ci

* cont : add assertion

ggml-ci
2025-07-24 10:24:05 +03:00
Donghyeon Jeong 4ec6291a24
sycl: fix undefined variable in work group size check (#14843) 2025-07-24 12:50:41 +08:00
jacekpoplawski a12363bbf0
convert : text-only support for GLM-4.1V-9B-Thinking (#14823)
* use language_model part only, ignore visual layers

* fix rope_dim calculation
2025-07-23 23:23:57 +02:00
Johannes Gäßler a86f52b285
CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
Johannes Gäßler b284197df4
CUDA: fix compilation with GGML_CUDA_F16 (#14837) 2025-07-23 18:22:30 +02:00
Sigbjørn Skjæret 221c0e0c58
ci : correct label refactor->refactoring (#14832) 2025-07-23 14:27:54 +02:00
Johannes Gäßler 07a19e27a2
CUDA: fix quantized KV cache + multiple sequences (#14822)
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-23 14:08:09 +03:00
Georgi Gerganov 18f3b5ff9e
tests : add non-cont K,V FA tests
ggml-ci
2025-07-23 14:08:09 +03:00
l3utterfly 7233358d29
memory : handle saving/loading null layers in recurrent memory (#14675)
* Update llama-memory-recurrent.cpp

handle saving/loading null layers in recurrent memory

* fixed styling issues and updated comments

* fix styling issue

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-23 11:16:41 +03:00
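Handling null layers in a save/load path generally means writing a per-layer presence flag so absent layers round-trip instead of crashing the serializer. A minimal C++ sketch of that pattern with hypothetical types; the PR's actual format may differ:

```cpp
#include <cstdint>
#include <ostream>
#include <vector>

struct layer_state { std::vector<float> data; }; // stand-in for a layer's state

static void save_layers(std::ostream & out, const std::vector<layer_state *> & layers) {
    for (const layer_state * l : layers) {
        const uint8_t present = l != nullptr; // flag null layers explicitly
        out.write((const char *) &present, 1);
        if (present) {
            const uint64_t n = l->data.size();
            out.write((const char *) &n, sizeof n);
            out.write((const char *) l->data.data(), n * sizeof(float));
        }
    }
}
```

The matching loader would read the flag first and leave the layer null when it is zero.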
lixing-star 6c88b3bb25
ggml: fix loongarch quantize_row_q8_1 error (#14827) 2025-07-23 09:39:51 +03:00
chen fan 14c28dfc50
CANN: weight format to NZ for Ascend310P3 (#14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
2025-07-23 11:58:00 +08:00
Aman Gupta 8c988fa41d
CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00