Commit Graph

279 Commits

Author SHA1 Message Date
Georgi Gerganov cdabeb2c27 sync : ggml 2025-11-05 10:41:51 +02:00
Georgi Gerganov 7fd205a8e8
scripts : add script to bench models (#16894) 2025-11-02 00:15:31 +02:00
Georgi Gerganov 6d39015a74 sync : ggml 2025-10-31 16:26:28 +02:00
Max Krasnyansky 3eb2be1ca5
Hexagon Op queue & dispatch optimizations (#16820)
* hexagon: remove dspqueue callbacks and do all read processing inplace

* hexagon: there is no need to ref/deref the buffers at this point

We're not going to release the buffers without flushing the session queue.
So there is no need to inc/dec the refcounts for every request.
We also don't need to include those bufs in the response.

* hexagon: bump the thread count in the adb wrapper scripts

We can use more CPU cores now that the dedicated dspqueue polling threads are not used (i.e. no contention).
Also enable more aggressive polling for now, since we still map Flash Attention (and a few other kernels) to
the CPU, and those dspqueue threads were keeping the CPU cores at higher clock frequencies.

* hexagon: add lhez as the second code owner
2025-10-29 06:29:12 -07:00
Max Krasnyansky 63d2fc46e1
Add experimental ggml-hexagon backend for the Hexagon NPU (#16547)
* model: add support for extra bufs for all devices

* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU

This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from the llama.cpp/ggml developer and user communities.

Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com>

* hexagon: fix format checker errors

* hexagon: update readme and cmake presets

* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions

* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input

* hexagon: move ADB helper scripts into scripts/snapdragon/adb

* hexagon: replace all f/printfs with GGML_LOG_...

* readme: add hexagon to the list of supported backends

* hexagon: stack matmuls with quantized inputs only

* hexagon: add TODO for fixing issues in hexagon_graph_optimize

* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC

* scripts: fix lint errors

* scripts: update qdc pytest script to make linter happy

* hexagon: add reduce sum in fp32

* hexagon: reduce number of vector stores in matmul output

* hexagon: remove the need for vdelta in reduce-multiply-x8

* hexagon: consistent use of reduce_sum_fp32 for row_sums

* hexagon: some more matmul optimizations and comments

Optimize cases where tensor dims are not a multiple of 1024 (e.g. in Qwen models).
We already handled those cases, but with higher overhead.

* hexagon: update cmake presets

* hexagon: add OPMASK support for run-bench.sh wrapper

* hexagon: update to use GGML_BACKEND_API

* hexagon: remove unused logic for setting tensor flags for the views

* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors

Same asserts as the CPU backend.

* hexagon: use cpy_tensor slow path for non-host buffers

* hexagon: error checks in the buffer allocator

* cmake: move include(extProj) under ggml-hexagon

* hexagon: don't forget to delete the backend on free

* hexagon: set/get_tensor size assert apply only to quantized tensors

* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now

GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way.
Ideally we need finer-grained log levels.

* docs: typos in hexagon developer docs (libggm-...)

* hexagon: overhaul error handling in the session/device allocation

This should handle all failure paths in the session allocation.

* hexagon: update cmake presets to enable fp16 vectors

* hexagon: remove unused time_usec function

* hexagon: don't forget to release buffer contexts

* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)

* hexagon: remove custom can_repeat function and use ggml_can_repeat

---------

Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2025-10-22 13:47:09 -07:00
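
The commit above mentions "a simple graph optimizer for stacking MUL_MAT ops with the same input." As a rough illustration of that idea (not the actual ggml-hexagon C++ code, and with hypothetical node/op names), adjacent matmuls that share the same activation input can be grouped so they could be dispatched together:

```python
# Illustrative sketch (not the ggml-hexagon implementation): group consecutive
# MUL_MAT nodes that share the same activation input so they could be
# dispatched as one stacked matmul. Node and tensor names are hypothetical.
from dataclasses import dataclass

@dataclass
class Node:
    op: str    # e.g. "MUL_MAT", "ADD"
    src0: str  # weight tensor name
    src1: str  # activation tensor name

def stack_matmuls(nodes: list[Node]) -> list[list[Node]]:
    """Return groups of adjacent MUL_MAT nodes sharing the same activation."""
    groups: list[list[Node]] = []
    for node in nodes:
        if (groups
                and node.op == "MUL_MAT"
                and groups[-1][0].op == "MUL_MAT"
                and groups[-1][0].src1 == node.src1):
            groups[-1].append(node)   # same input: stack with the previous matmul
        else:
            groups.append([node])     # otherwise start a new dispatch group
    return groups

graph = [Node("MUL_MAT", "w_q", "x"), Node("MUL_MAT", "w_k", "x"),
         Node("MUL_MAT", "w_v", "x"), Node("ADD", "bias", "y")]
print([[n.src0 for n in g] for g in stack_matmuls(graph)])
# [['w_q', 'w_k', 'w_v'], ['bias']]
```
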
Georgi Gerganov 075c01567b ggml : bump version to 0.9.4 (ggml/1363) 2025-09-30 13:53:55 +03:00
Georgi Gerganov 2ddd3f2356 sync : ggml 2025-09-29 17:43:58 +03:00
Georgi Gerganov 432cf4304c
codeowners : update + cleanup (#16174)
---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-09-22 18:20:21 +03:00
Georgi Gerganov 7f766929ca sync : ggml 2025-09-20 13:02:14 +03:00
Xuan-Son Nguyen 3c3635d2f2
server : speed up tests (#15836)
* server : speed up tests

* clean up

* restore timeout_seconds in some places

* flake8

* explicit offline
2025-09-06 14:45:24 +02:00
Piotr Wilkin (ilintar) 9e2b1e83c6
scripts : add Jinja tester PySide6 simple app (#15756)
* feat: add Jinja tester PySide6 simple app

* Linter fixes

* Pylint fixes

* Whitespace

* Add commandline support; add formatter; add extensions

* Remove testing actions

* Silence flake8 warnings for commandline mode

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix trailing whitespace/newline logic

* Update scripts/jinja/jinja-tester.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/jinja/jinja-tester.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-05 01:05:12 +02:00
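
The core job of a Jinja template tester is rendering a chat template against sample messages and surfacing errors. A minimal sketch of that step, assuming the `jinja2` package and an invented toy template (this is not the PySide6 app from the commit):

```python
# Minimal sketch of the rendering step a Jinja template tester needs: render a
# chat template over sample messages and report template errors.
from jinja2 import Environment, StrictUndefined
from jinja2.exceptions import TemplateError

TEMPLATE = "{% for m in messages %}<|{{ m.role }}|>{{ m.content }}</s>{% endfor %}"
MESSAGES = [{"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there"}]

def render(template_src: str, messages: list[dict]) -> str:
    env = Environment(undefined=StrictUndefined)  # fail loudly on missing variables
    try:
        return env.from_string(template_src).render(messages=messages)
    except TemplateError as exc:
        return f"Template error: {exc}"

print(render(TEMPLATE, MESSAGES))
```
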
Johannes Gäßler e81b8e4b7f
llama: use FA + max. GPU layers by default (#15434)
* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault
2025-08-30 16:32:10 +02:00
Johannes Gäßler 3d16b29c3b
scripts: strip "AMD Instinct" from GPU name (#15668) 2025-08-29 22:04:08 +02:00
Aman Gupta 55042b3692
scripts: add sqlite3 check for compare-commits.sh (#15633) 2025-08-28 19:23:22 +08:00
Johannes Gäßler 9ef536907d
scripts: fix compare-llama-bench.py (#15521) 2025-08-23 13:58:58 +03:00
Georgi Gerganov 9ebebef62f
llama : remove KV cache defragmentation logic (#15473)
ggml-ci
2025-08-22 12:22:13 +03:00
Georgi Gerganov 60212f1ead sync : ggml 2025-08-18 22:06:44 +03:00
Georgi Gerganov f0c541d315 scripts : update sync scripts 2025-08-18 22:06:44 +03:00
Georgi Gerganov 3973163bff sync : ggml
ggml-ci
2025-08-14 14:59:27 +03:00
Johannes Gäßler 4850b52aed
server-bench: external OAI servers, sqlite (#15179)
* server-bench: external OAI servers, sqlite

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* raise_for_status

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-08 23:04:36 +02:00
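
The commit above wires server-bench to external OpenAI-compatible servers, stores results in SQLite, and adds `raise_for_status`. A hedged sketch of that pattern, with the endpoint path, table name, and columns as assumptions rather than the actual script's schema:

```python
# Hedged sketch: send a request to an OpenAI-compatible completions endpoint,
# fail fast via raise_for_status(), and record the timing in SQLite.
import sqlite3
import time

import requests

def bench_one(base_url: str, prompt: str, db_path: str = "bench.sqlite") -> None:
    t0 = time.time()
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={"prompt": prompt, "max_tokens": 64},
        timeout=600,
    )
    resp.raise_for_status()          # surface HTTP errors instead of logging bad rows
    latency = time.time() - t0

    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS runs (prompt TEXT, latency REAL)")
    con.execute("INSERT INTO runs VALUES (?, ?)", (prompt, latency))
    con.commit()
    con.close()
```
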
Johannes Gäßler 20638e4f16
scripts: fix crash when --tool is not set (#15133) 2025-08-07 08:50:30 +02:00
R0CKSTAR 3025b621d1
llama-bench: rename DB table name from test to llama_bench (#15003)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-02 17:20:40 +08:00
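
For databases produced before this rename, a one-line migration along the following lines may apply; the table names come from the commit message, while the database path is a placeholder:

```python
# Hedged migration sketch: rename the old "test" table to "llama_bench".
import sqlite3

con = sqlite3.connect("llama-bench.sqlite")   # hypothetical database path
con.execute("ALTER TABLE test RENAME TO llama_bench")
con.commit()
con.close()
```
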
R0CKSTAR 484b2091ce
compare-commits.sh: support both llama-bench and test-backend-ops (#14392)
* compare-commits.sh: support both llama-bench and test-backend-ops

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Speed up the build by specifying -j 12

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Remove build_number from test-backend-ops db

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Apply suggestion from @JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Refine tool selection logic

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-01 08:47:27 +08:00
Georgi Gerganov e32a4ec60e sync : ggml
ggml-ci
2025-07-30 17:33:11 +03:00
Johannes Gäßler bbd0f91779
server-bench: make seed choice configurable (#14929)
* server-bench: make seed choice configurable

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix error formatting

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-29 10:40:50 +02:00
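
A configurable seed typically means exposing a flag and feeding it to a local RNG so runs are reproducible. A small sketch of that shape, with the flag name and default as assumptions rather than the actual server-bench.py options:

```python
# Hedged sketch of a configurable seed for a benchmark driver.
import argparse
import random

parser = argparse.ArgumentParser(description="toy benchmark driver")
parser.add_argument("--seed", type=int, default=42,
                    help="RNG seed for prompt selection (default: 42)")
args = parser.parse_args()

rng = random.Random(args.seed)     # seeded RNG -> reproducible prompt sampling
print(rng.sample(range(1000), 5))
```
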
Georgi Gerganov 1f45f2890e sync : ggml 2025-07-28 08:15:01 +03:00
Aman Gupta 446595b9b3
Docs: add instructions for adding backends (#14889) 2025-07-27 09:36:43 +08:00
Georgi Gerganov 2df255da3c sync : ggml
ggml-ci
2025-07-24 20:27:23 +03:00
Georgi Gerganov b17230917c sync : ggml 2025-07-19 11:46:50 +03:00
Johannes Gäßler 5cae766541
scripts: synthetic prompt mode for server-bench.py (#14695) 2025-07-16 09:33:28 +02:00
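
A "synthetic prompt" mode builds prompts of a requested length instead of loading a dataset. A conceptual sketch of the idea follows; the real server-bench.py may construct prompts differently:

```python
# Conceptual sketch: generate prompts of an approximate length from random words.
import random

def synthetic_prompt(n_words: int, rng: random.Random) -> str:
    vocab = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]
    return " ".join(rng.choice(vocab) for _ in range(n_words))

rng = random.Random(0)
prompts = [synthetic_prompt(256, rng) for _ in range(4)]
print(len(prompts), len(prompts[0].split()))
```
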
Johannes Gäßler 494c5899cb
scripts: benchmark for HTTP server throughput (#14668)
* scripts: benchmark for HTTP server throughput

* fix server connection reset
2025-07-14 13:14:30 +02:00
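
The basic throughput measurement behind such a benchmark is firing many requests concurrently and dividing completions by wall-clock time. A hedged sketch under assumed endpoint, payload, and concurrency values:

```python
# Hedged sketch: concurrent requests against an HTTP server, reporting
# successful requests per second. Endpoint and payload are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def one_request(url: str) -> bool:
    r = requests.post(url, json={"prompt": "Hello", "max_tokens": 32}, timeout=600)
    return r.ok

def measure_throughput(url: str, n_requests: int = 64, concurrency: int = 8) -> float:
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, [url] * n_requests))
    return sum(results) / (time.time() - t0)

# print(measure_throughput("http://localhost:8080/completion"))
```
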
Georgi Gerganov 8eff95544e sync : ggml 2025-07-12 16:13:27 +03:00
Georgi Gerganov 215535701d sync : ggml
ggml-ci
2025-07-12 14:25:44 +03:00
Aman Gupta 11ee0fea2a
Docs: script to auto-generate ggml operations docs (#14598)
* Docs: script to auto-generate ggml operations docs

* Review: formatting changes + change github action

* Use built-in types instead of typing

* docs : add BLAS and Metal ops

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-10 23:29:01 +08:00
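
The doc-generation idea boils down to turning structured "which backend supports which op" data into a markdown table. A hedged sketch of that final step, with placeholder data (not actual backend support status) and without reproducing how the real script collects it:

```python
# Hedged sketch: render op-support data as a markdown table.
# The values below are placeholders, not actual backend support status.
SUPPORT = {
    "OP_A": {"CPU": True, "CUDA": True, "Metal": True},
    "OP_B": {"CPU": True, "CUDA": True, "Metal": False},
    "OP_C": {"CPU": True, "CUDA": False, "Metal": True},
}

def to_markdown(support: dict) -> str:
    backends = sorted({b for row in support.values() for b in row})
    lines = ["| Operation | " + " | ".join(backends) + " |",
             "|---" * (len(backends) + 1) + "|"]
    for op, row in sorted(support.items()):
        cells = ["✅" if row.get(b) else "❌" for b in backends]
        lines.append(f"| {op} | " + " | ".join(cells) + " |")
    return "\n".join(lines)

print(to_markdown(SUPPORT))
```
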
Georgi Gerganov d4cdd9c1c3
ggml : remove kompute backend (#14501)
ggml-ci
2025-07-03 07:48:32 +03:00
Georgi Gerganov e17991c466 sync : ggml
ggml-ci
2025-07-02 20:08:45 +03:00
Georgi Gerganov f61c05d4b1 sync : ggml
ggml-ci
2025-07-01 11:06:39 +03:00
Vedran Miletić e9b6350e61
scripts : make the shell scripts cross-platform (#14341) 2025-06-30 10:17:18 +02:00
Georgi Gerganov 06cbedfca1 sync : ggml
ggml-ci
2025-06-20 21:02:47 +03:00
Georgi Gerganov d03172cc79 sync : ggml
ggml-ci
2025-06-18 09:59:21 +03:00
Aman Gupta 2e42be42bd
compare-llama-bench: add option to plot (#14169)
* compare llama-bench: add option to plot

* Address review comments: convert case + add type hints

* Add matplotlib to requirements

* fix tests

* Improve comment and fix assert condition for test

* Add back default test_name, add --plot_log_scale

* use log_scale regardless of x_values
2025-06-14 10:34:20 +02:00
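
The plot option produces a comparison of two commits across a swept parameter, optionally on a log-scale axis (the commit mentions `--plot_log_scale`). A hedged matplotlib sketch with invented numbers; the real script reads its data from the llama-bench database:

```python
# Hedged sketch of the kind of plot the option adds: tokens/s for two commits
# across batch sizes. All data points below are placeholders.
import matplotlib.pyplot as plt

batch_sizes = [1, 2, 4, 8, 16, 32]
baseline    = [25, 48, 90, 160, 260, 380]   # placeholder tokens/s
candidate   = [26, 50, 95, 175, 290, 430]   # placeholder tokens/s

fig, ax = plt.subplots()
ax.plot(batch_sizes, baseline, marker="o", label="baseline commit")
ax.plot(batch_sizes, candidate, marker="o", label="candidate commit")
ax.set_xscale("log", base=2)                # analogous to a log-scale plot flag
ax.set_xlabel("batch size")
ax.set_ylabel("tokens/s")
ax.legend()
fig.savefig("compare.png")
```
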
Georgi Gerganov ae92c1855b sync : ggml
ggml-ci
2025-06-10 18:39:33 +03:00
Georgi Gerganov b8e2194efc sync : ggml
ggml-ci
2025-06-10 09:21:56 +03:00
Georgi Gerganov f3a4b1659c sync : ggml
ggml-ci
2025-06-01 13:43:57 +03:00
Georgi Gerganov 53f925074d
sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
2025-05-30 16:25:45 +03:00
Georgi Gerganov 1c49c70d07 sync : ggml 2025-05-27 18:05:33 +03:00
Georgi Gerganov a26c4cc11e
scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug

* cont : reuse existing CMAKE_OPTS
2025-05-26 22:24:01 +03:00
Olivier Chafik f5cd27b71d
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
* add common_json w/ support for truncated json healing

* add common_chat_msg_diff

* partial common_chat_parse

* refactor parser w/ optionals

* server: wire chat diffs in stream mode

* fix trigger of thinking models (must happen after thoughts are closed)

* fix functionary v3.2 raw python!

* rename: common_chat_syntax (now contains format)

* rm common_regex.at_start

* don't return empty <think></think>

* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)

* fix QwQ 32B tool call parsing after thoughts (hermes2)

* better logs for grammar triggers

* consume spaces after parse_json_tool_calls

* fix required tool calls w/ thinking models that have pre-opened thinking tags

* fix thinking model's initial trigger + test qwq's template

* run most test_tool_call tests in stream + non-stream modes

* make functionary v3.2 parsing more strict (differentiate first match from others)

* send final diff from server, to close off raw python arguments

* support partial content streaming in Generic mode

* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

* Update function-calling.md

* Update tool_bench.py

* chat-parser: remove input from exception (llm output may contain PII)

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
2025-05-25 01:48:08 +01:00
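
The commit above adds `common_json` with "truncated json healing" so partial tool calls can be parsed while streaming. A conceptual Python sketch of the core idea (not the C++ implementation, which also has to track where the truncation happened): close any still-open strings, arrays, and objects so the fragment parses.

```python
# Conceptual sketch: "heal" a truncated JSON fragment by appending the closers
# for any open strings, arrays, and objects.
import json

def heal_truncated_json(fragment: str) -> str:
    stack, in_string, escaped = [], False, False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            stack.pop()
    closers = ('"' if in_string else "") + "".join(reversed(stack))
    return fragment + closers

partial = '{"name": "get_weather", "arguments": {"city": "Par'
print(json.loads(heal_truncated_json(partial)))
# {'name': 'get_weather', 'arguments': {'city': 'Par'}}
```
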
Georgi Gerganov d30cb5a7fa sync : ggml
ggml-ci
2025-05-19 13:29:56 +03:00
Sigbjørn Skjæret be1d4a13db
scripts : fix compare-llama-bench.py show parameter (#13514) 2025-05-14 08:41:01 +02:00