Ed Addario
f5d8811ddd
Prioritise important tensors
2025-10-01 19:04:43 +01:00
Ed Addario
b3b8a111a5
Compute rows based on tensor shape and slice count
2025-09-28 18:45:25 +01:00
Ed Addario
e49e241d37
Calculate bpw over all tensors
2025-09-27 17:28:39 +01:00
Ed Addario
3d75b14c0f
Simplify dequantisation
2025-09-27 17:27:58 +01:00
Ed Addario
8a2c71f471
Check for direction reversal
2025-09-27 17:27:29 +01:00
Ed Addario
87cba65908
Tighten worker allocator
2025-09-27 17:26:30 +01:00
Ed Addario
d16945730e
Refactor outlier trimming
2025-09-27 17:25:29 +01:00
Ed Addario
dd4f4bd0b8
Reduce bpw range
2025-09-27 17:23:48 +01:00
Ed Addario
29bb30c4ed
Merge branch 'master' into quantize
2025-09-25 19:55:31 +01:00
Ed Addario
dbdd179a92
Combine quant types
2025-09-25 19:50:20 +01:00
Ed Addario
a74b410f5f
Move is_iq() into a lambda and remove unused variables
2025-09-25 19:49:47 +01:00
Sigbjørn Skjæret
835b2b915c
model : add GroveMoE support ( #15510 )
...
* add GroveMoE support
* remove constexpr that fails on certain compilers
* revert crude scalar div implementation, use cast
* build_attn_inp_kv_unified -> build_attn_inp_kv
* fix build_attn
* re-apply ffn_exps regex changes
2025-09-25 19:50:28 +02:00
Aaron Teo
b05a9d650f
vendors: update miniaudio version ( #16212 )
...
* vendor: update miniaudio.h
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* vendor: update miniaudio.h
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 23:38:10 +08:00
rtaluyev
27052978e4
readme : update bindings ( #16144 )
...
Link to Java JNA bindings to llama.cpp native libraries
2025-09-25 18:20:34 +03:00
Aman Gupta
077c94d0ca
CUDA: add a fused top-K MoE kernel ( #16130 )
...
* CUDA: add a fused top-K MoE kernel
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory
It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
2025-09-25 16:35:05 +02:00
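The three numbered steps in the commit above can be sketched as a CPU reference in C++. This is an illustrative sketch of the fused pipeline's semantics, not the CUDA kernel itself; the function name `topk_moe_ref` and its signature are assumptions for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// CPU reference for the fused pipeline: for one token, softmax over the
// expert logits, then select the top n_experts_used weights and their
// expert ids, as the kernel does before writing to global memory.
static void topk_moe_ref(const std::vector<float> & logits, int n_experts_used,
                         std::vector<float> & weights, std::vector<int> & ids) {
    const int n_experts = (int) logits.size();

    // 1. softmax over the logits (max subtracted for numerical stability)
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n_experts);
    float sum = 0.0f;
    for (int i = 0; i < n_experts; ++i) {
        probs[i] = std::exp(logits[i] - max_l);
        sum += probs[i];
    }
    for (float & p : probs) { p /= sum; }

    // 2. reduce over the top-k (n_experts_used) probabilities; stable sort
    //    breaks ties by lower expert id, mirroring the "fix tie breakers" step
    std::vector<int> order(n_experts);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return probs[a] > probs[b]; });

    // 3. write weights + ids for the selected experts
    weights.assign(n_experts_used, 0.0f);
    ids.assign(n_experts_used, 0);
    for (int k = 0; k < n_experts_used; ++k) {
        ids[k]     = order[k];
        weights[k] = probs[order[k]];
    }
}
```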
Daniel Bevenius
aa3ee0eb0b
model-conversion : add embedding prompt file support ( #15871 )
...
This commit adds support for passing a prompt file to the model
conversion targets/scripts. It also updates the logits.cpp to print out
embedding information in the same format as when running the original
embedding model.
The motivation for this is that it allows us to pass files of different
sizes when running the converted models and validating the logits.
This can be particularly important when testing the sliding window
functionality of models where the sequence length needs to exceed a
certain number of tokens to trigger the sliding window logic.
2025-09-25 12:02:36 +02:00
Daniel Bevenius
d0991da39d
server : add support for external server for tests ( #16243 )
...
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.
The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
2025-09-25 11:36:47 +02:00
junchao-zhao
aa719c2f88
ggml : fix loongarch lsx compilation error ( #15864 )
2025-09-25 12:22:55 +03:00
Johannes Gäßler
4cdd0bb453
docs: fix typo [no ci] ( #16244 )
2025-09-25 12:12:27 +03:00
Douglas Hanley
b5bd037832
llama : add support for qwen3 reranker ( #15824 )
2025-09-25 11:53:09 +03:00
Georgi Gerganov
dfcd53f7ec
metal : fuse NORM + MUL + ADD, support non-multiples of 4 ( #16220 )
...
* metal : fuse NORM + MUL + ADD
* metal : support norms of non-multiple of 4
* cont : fix comment [no ci]
2025-09-25 11:30:16 +03:00
Georgi Gerganov
4ea00794b8
metal : relax reorder conditions ( #16216 )
2025-09-25 11:29:42 +03:00
Georgi Gerganov
02a6a82ae7
metal : restore im2col perf ( #16219 )
2025-09-25 11:29:08 +03:00
Radoslav Gerganov
c498fc82fe
rpc : use ggml logging facilities
...
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
2025-09-25 07:20:02 +00:00
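The macro pattern the commit describes, checking the environment variable early so disabled debug logging costs almost nothing, can be sketched as follows. The `rpc_debug_enabled()` helper and this `LOG_DBG()` body are assumptions based on the commit message; `fprintf` stands in for `GGML_LOG_DEBUG()` here.

```cpp
#include <cstdio>
#include <cstdlib>

// Check RPC_DEBUG once and cache the result; subsequent calls are a
// cheap boolean test rather than a getenv() lookup.
static bool rpc_debug_enabled() {
    static const bool enabled = std::getenv("RPC_DEBUG") != nullptr;
    return enabled;
}

// Early check of the env var before forwarding to the real logging call,
// so the formatting work is skipped entirely when debugging is off.
#define LOG_DBG(...)                           \
    do {                                       \
        if (rpc_debug_enabled()) {             \
            std::fprintf(stderr, __VA_ARGS__); \
        }                                      \
    } while (0)
```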
Aaron Teo
e7a5130a20
codeowners: add ownership of zdnn backend [no ci] ( #16232 )
...
add @Andreas-Krebbel to owners of zDNN backend
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 08:06:30 +03:00
Eve
bee378e098
ci: run the x64 and arm ci on the github machines instead ( #16183 )
...
* run the x64 ci on regular machines
* set up the same thing for arm
fix test-quantize-perf just like #12306
* try to disable sve
* add another sve run
2025-09-25 08:06:06 +03:00
Aaron Teo
5fb557653b
devops: fix s390x docker release failure ( #16231 )
2025-09-25 11:36:30 +08:00
Aaron Teo
4ae88d07d0
codeowners: add ownership of zdnn backend [no ci] ( #16229 )
...
add @AlekseiNikiforovIBM to owners of zDNN backend
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 00:25:04 +08:00
Johannes Gäßler
e789095502
llama: print memory breakdown on exit ( #15860 )
...
* llama: print memory breakdown on exit
2025-09-24 16:53:48 +02:00
Acly
f2a789e334
ggml : split graph allocations according to backend max buffer size ( #15815 )
...
* ggml : make gallocr respect the backend's max buffer size
* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface
* fix missing newline, apple-clang warning
* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.
* track (chunk, offset) pairs instead of "global" offsets through gallocr.
* simpler, don't need loops to map between local/global offsets
* touches more code
* fix dyn_tallocr_max_size and initialization
* fix memory leak when buffers are reused due to same buffer type appearing multiple times
* make vbuffer allocation follow the same logic as backend_buffer did before
* continue to use leftover unallocated space of previous chunks after a new one has been created
* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size
* refactor: move adding new free block and new chunk into separate functions
* allocate chunks individually with a separate free-blocks list for each one
* needs a bit more memory/allocations/indirections, but code is simpler
* fix warnings (missing static) & debug checks
2025-09-24 16:17:49 +02:00
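The splitting policy the bullets above describe can be sketched as a small packing function: allocations go into backend buffers ("chunks") capped at the backend's max buffer size, leftover space in earlier chunks stays usable after a new chunk is opened, and a single request larger than the cap still gets its own oversized chunk ("allow the last chunk to grow beyond max size"). The function and names are illustrative, not the gallocr code.

```cpp
#include <cstddef>
#include <vector>

// Pack allocation sizes into chunks of at most max_size bytes each.
// Returns the total bytes placed in each chunk.
static std::vector<size_t> pack_into_chunks(const std::vector<size_t> & sizes,
                                            size_t max_size) {
    std::vector<size_t> chunks;
    for (size_t sz : sizes) {
        bool placed = false;
        // leftover unallocated space of previous chunks is still considered
        for (size_t & c : chunks) {
            if (c + sz <= max_size) {
                c += sz;
                placed = true;
                break;
            }
        }
        if (!placed) {
            // open a new chunk; a single request larger than max_size
            // yields a chunk that exceeds the cap on its own
            chunks.push_back(sz);
        }
    }
    return chunks;
}
```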
Tarek Dakhran
3a59971967
model : add label for LiquidAI LFM2-2.6B model ( #16204 )
...
* model : add label for LiquidAI LFM2-2.6B model
HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B).
Support for GGUF conversion and inference is added in #14620 .
However, due to similar `n_embd`, it identifies as a 1.2B model.
Fix the label by using `n_ff` to identify the model instead.
Output of `llama-bench`:
```
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2 1.2B F16 | 2.18 GiB | 1.17 B | CPU | 10 | pp512 | 223.97 ± 5.32 |
| lfm2 2.6B F16 | 4.79 GiB | 2.57 B | CPU | 10 | pp512 | 92.53 ± 4.14 |
| lfm2 350M F16 | 676.25 MiB | 354.48 M | CPU | 10 | pp512 | 725.52 ± 11.70 |
| lfm2 700M F16 | 1.38 GiB | 742.49 M | CPU | 10 | pp512 | 336.22 ± 12.93 |
```
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-24 13:42:26 +02:00
Jie Fu (傅杰)
63b54c81a6
model-conversion : fix causal-verify-logits failure with model names containing "." ( #16215 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 10:25:26 +02:00
Uilian Ries
152729f884
common : add missing chrono header for common.cpp ( #16211 )
...
Signed-off-by: Uilian Ries <uilianries@gmail.com>
2025-09-24 09:53:47 +03:00
Sigbjørn Skjæret
c0c59c1157
codeowners : match all requirements files ( #16214 )
2025-09-24 08:53:20 +02:00
Jie Fu (傅杰)
7735706b93
model-conversion : fix run-org-model.py failing to run on Mac M1 ( #16213 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 08:46:52 +02:00
Daniel Bevenius
4d9ea03d17
codeowners : use slash prefix for root files [no ci] ( #16210 )
...
This commit adds a leading slash to the paths of root-level files
in the CODEOWNERS file.
The motivation for this is that these patterns might otherwise match files
in subdirectories that have other/additional owners, overriding those owners.
Refs: https://github.com/ggml-org/llama.cpp/pull/16209#issuecomment-3326434274
2025-09-24 08:10:09 +02:00
Jie Fu (傅杰)
8ba548dae2
model-conversion : fix the make targets in the README.md ( #16209 )
...
Fix two incorrect make targets in the readme.
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 06:19:23 +02:00
Georgi Gerganov
f505bd83ca
ci : disable AMD workflows + update NVIDIA workflows ( #16200 )
...
* ci : disable AMD workflows + update NVIDIA workflows
* cont : fixes
* cont : update nvidia vulkan workflows
2025-09-23 20:41:40 +03:00
Georgi Gerganov
0889589dbe
ci : enable Vulkan workflow on Mac ( #16194 )
2025-09-23 13:44:25 +03:00
Xiangyan Sun
4e29084ba4
ggml-cpu: Respect cpumask settings ( #16164 )
2025-09-23 11:58:12 +03:00
Sigbjørn Skjæret
f6b4af3d04
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl ( #15928 )
...
* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl
* change initialization to true
2025-09-23 10:25:20 +02:00
Aaron Teo
264f1b5187
zdnn: refactor codebase + add docs ( #16178 )
...
* zdnn: initial matmul refactor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: rm static from funcs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: update ggml-zdnn.h
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: change header files to hpp
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: switch to common.hpp
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: move mulmat forward around
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: rm inline from utils
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: code cleanup
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* docs: add zDNN docs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-23 14:53:05 +08:00
Daniel Bevenius
0bc7cc7154
codeowners : add @danbev to model-conversion example [no ci] ( #16190 )
...
This commit adds examples/model-conversion/ to the CODEOWNERS file and
assigns myself (@danbev) as the code owner for this directory.
2025-09-23 09:13:22 +03:00
Aaron Teo
4b9f4cb0f8
devops: add s390x containers ( #15915 )
...
* devops: add s390x dockerfile
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add missing ninja
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move s390x docker into cpu docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: rework s390x docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: copy more tools
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add server build step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove apt clean steps as distroless misses it
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove apt commands from distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix shared libs in distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: use correct libs path
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix shared libs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add collector stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix missing stage ref
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix permission issue
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix unknown model loading failures
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: attempt at fixing model loading failure
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix missing ggml shared object
failure to load model
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove move shared objects
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move libggml-cpu and blas into bin
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: finalise hardened server stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add cli target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix typos
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix missing shared libraries in base
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: update debian target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "devops: formalise llama.cpp loc"
This reverts commit 0a7664af84 .
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0a7664af84 )
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: attempt at fixing missing dir
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: attempt at making it cache the build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix copying process
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: make build dir an argument
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "devops: make build dir an argument"
This reverts commit 438698976b .
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add build stage for gguf-py
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move gguf-py installation into build stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: break system packages?
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add rust compiler installer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix rustc not found
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove cache mount to allow rustc to persist
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move rustc installation to another layer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move gguf-py installation to full stage, fix copying
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove rustc installation in build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable full target for now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: attempting static build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: merge s390x dockerfile into cpu for now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: switch to gcc image for build step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove build essentials
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: install openblas into base target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: go back to s390x dockerfile
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove libggml and libblas
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add full target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add break system packages
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add libjpeg
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add missing cmake dep
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: finalise docker images for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add custom openblas patch
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: use libopenblas-dev instead of libopenblas-openmp-dev
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add s390x docker build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-23 13:59:34 +08:00
Daniel Bevenius
85e72271ba
ggml-cpu : fix typo in gemm comments [no ci] ( #16189 )
2025-09-23 05:59:03 +02:00
Ed Addario
8eedcf74bc
Increase scale multiplier
2025-09-22 20:42:37 +01:00
Ed Addario
d36ee0a0a8
Add comments to explain magic numbers
2025-09-22 20:41:56 +01:00
Ed Addario
7ba6001ec8
Simplify candidates sorting
2025-09-22 20:11:54 +01:00
Ed Addario
d79ade2e8e
Adjust for small vector size
2025-09-22 20:11:26 +01:00
Ed Addario
f184450806
Fix minor logic flaw
2025-09-22 20:10:42 +01:00