Ed Addario
f5d8811ddd
Prioritise important tensors
2025-10-01 19:04:43 +01:00
Ed Addario
b3b8a111a5
Compute rows based on tensor shape and slice count
2025-09-28 18:45:25 +01:00
Ed Addario
e49e241d37
Calculate bpw over all tensors
2025-09-27 17:28:39 +01:00
Ed Addario
3d75b14c0f
Simplify dequantisation
2025-09-27 17:27:58 +01:00
Ed Addario
8a2c71f471
Check for direction reversal
2025-09-27 17:27:29 +01:00
Ed Addario
87cba65908
Tighten worker allocator
2025-09-27 17:26:30 +01:00
Ed Addario
d16945730e
Refactor outlier trimming
2025-09-27 17:25:29 +01:00
Ed Addario
dd4f4bd0b8
Reduce bpw range
2025-09-27 17:23:48 +01:00
Ed Addario
29bb30c4ed
Merge branch 'master' into quantize
2025-09-25 19:55:31 +01:00
Ed Addario
dbdd179a92
Combine quant types
2025-09-25 19:50:20 +01:00
Ed Addario
a74b410f5f
Move is_iq() into a lambda and remove unused variables
2025-09-25 19:49:47 +01:00
Sigbjørn Skjæret
835b2b915c
model : add GroveMoE support ( #15510 )
...
* add GroveMoE support
* remove constexpr that fails on certain compilers
* revert crude scalar div implementation, use cast
* build_attn_inp_kv_unified -> build_attn_inp_kv
* fix build_attn
* re-apply ffn_exps regex changes
2025-09-25 19:50:28 +02:00
Aaron Teo
b05a9d650f
vendors: update miniaudio version ( #16212 )
...
* vendor: update miniaudio.h
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* vendor: update miniaudio.h
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 23:38:10 +08:00
rtaluyev
27052978e4
readme : update bindings ( #16144 )
...
Link to Java JNA bindings to llama.cpp native libraries
2025-09-25 18:20:34 +03:00
Aman Gupta
077c94d0ca
CUDA: add a fused top-K MoE kernel ( #16130 )
...
* CUDA: add a fused top-K MoE kernel
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory
It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
2025-09-25 16:35:05 +02:00
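The three numbered steps in the commit above can be sketched as a CPU reference in C++. This is an illustrative sketch of the fused pipeline's semantics, not the CUDA kernel itself; the function name `topk_moe_ref` and its signature are assumptions for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// CPU reference for the fused pipeline: for one token, softmax over the
// expert logits, then select the top n_experts_used weights and their
// expert ids, as the kernel does before writing to global memory.
static void topk_moe_ref(const std::vector<float> & logits, int n_experts_used,
                         std::vector<float> & weights, std::vector<int> & ids) {
    const int n_experts = (int) logits.size();

    // 1. softmax over the logits (max subtracted for numerical stability)
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n_experts);
    float sum = 0.0f;
    for (int i = 0; i < n_experts; ++i) {
        probs[i] = std::exp(logits[i] - max_l);
        sum += probs[i];
    }
    for (float & p : probs) { p /= sum; }

    // 2. reduce over the top-k (n_experts_used) probabilities; stable sort
    //    breaks ties by lower expert id, mirroring the "fix tie breakers" step
    std::vector<int> order(n_experts);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return probs[a] > probs[b]; });

    // 3. write weights + ids for the selected experts
    weights.assign(n_experts_used, 0.0f);
    ids.assign(n_experts_used, 0);
    for (int k = 0; k < n_experts_used; ++k) {
        ids[k]     = order[k];
        weights[k] = probs[order[k]];
    }
}
```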
Daniel Bevenius
aa3ee0eb0b
model-conversion : add embedding prompt file support ( #15871 )
...
This commit adds support for passing a prompt file to the model
conversion targets/scripts. It also updates the logits.cpp to print out
embedding information in the same format as when running the original
embedding model.
The motivation for this is that it allows us to pass files of different
sizes when running the converted models and validating the logits.
This can be particularly important when testing the sliding window
functionality of models where the sequence length needs to exceed a
certain number of tokens to trigger the sliding window logic.
2025-09-25 12:02:36 +02:00
Daniel Bevenius
d0991da39d
server : add support for external server for tests ( #16243 )
...
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.
The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
2025-09-25 11:36:47 +02:00
junchao-zhao
aa719c2f88
ggml : fix loongarch lsx compilation error ( #15864 )
2025-09-25 12:22:55 +03:00
Johannes Gäßler
4cdd0bb453
docs: fix typo [no ci] ( #16244 )
2025-09-25 12:12:27 +03:00
Douglas Hanley
b5bd037832
llama : add support for qwen3 reranker ( #15824 )
2025-09-25 11:53:09 +03:00
Georgi Gerganov
dfcd53f7ec
metal : fuse NORM + MUL + ADD, support non-multiples of 4 ( #16220 )
...
* metal : fuse NORM + MUL + ADD
* metal : support norms of non-multiple of 4
* cont : fix comment [no ci]
2025-09-25 11:30:16 +03:00
Georgi Gerganov
4ea00794b8
metal : relax reorder conditions ( #16216 )
2025-09-25 11:29:42 +03:00
Georgi Gerganov
02a6a82ae7
metal : restore im2col perf ( #16219 )
2025-09-25 11:29:08 +03:00
Radoslav Gerganov
c498fc82fe
rpc : use ggml logging facilities
...
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
2025-09-25 07:20:02 +00:00
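The macro pattern the commit describes, checking the environment variable early so disabled debug logging costs almost nothing, can be sketched as follows. The `rpc_debug_enabled()` helper and this `LOG_DBG()` body are assumptions based on the commit message; `fprintf` stands in for `GGML_LOG_DEBUG()` here.

```cpp
#include <cstdio>
#include <cstdlib>

// Check RPC_DEBUG once and cache the result; subsequent calls are a
// cheap boolean test rather than a getenv() lookup.
static bool rpc_debug_enabled() {
    static const bool enabled = std::getenv("RPC_DEBUG") != nullptr;
    return enabled;
}

// Early check of the env var before forwarding to the real logging call,
// so the formatting work is skipped entirely when debugging is off.
#define LOG_DBG(...)                           \
    do {                                       \
        if (rpc_debug_enabled()) {             \
            std::fprintf(stderr, __VA_ARGS__); \
        }                                      \
    } while (0)
```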
Aaron Teo
e7a5130a20
codeowners: add ownership of zdnn backend [no ci] ( #16232 )
...
add @Andreas-Krebbel to owners of zDNN backend
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 08:06:30 +03:00
Eve
bee378e098
ci: run the x64 and arm ci on the github machines instead ( #16183 )
...
* run the x64 ci on regular machines
* set up the same thing for arm
fix test-quantize-perf just like #12306
* try to disable sve
* add another sve run
2025-09-25 08:06:06 +03:00
Aaron Teo
5fb557653b
devops: fix s390x docker release failure ( #16231 )
2025-09-25 11:36:30 +08:00
Aaron Teo
4ae88d07d0
codeowners: add ownership of zdnn backend [no ci] ( #16229 )
...
add @AlekseiNikiforovIBM to owners of zDNN backend
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 00:25:04 +08:00
Johannes Gäßler
e789095502
llama: print memory breakdown on exit ( #15860 )
...
* llama: print memory breakdown on exit
2025-09-24 16:53:48 +02:00
Acly
f2a789e334
ggml : split graph allocations according to backend max buffer size ( #15815 )
...
* ggml : make gallocr respect the backend's max buffer size
* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface
* fix missing newline, apple-clang warning
* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.
* track (chunk, offset) pairs instead of "global" offsets through gallocr.
* simpler, don't need loops to map between local/global offsets
* touches more code
* fix dyn_tallocr_max_size and initialization
* fix memory leak when buffers are reused due to same buffer type appearing multiple times
* make vbuffer allocation follow the same logic as backend_buffer did before
* continue to use leftover unallocated space of previous chunks after a new one has been created
* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size
* refactor: move adding new free block and new chunk into separate functions
* allocate chunks individually with a separate free-blocks list for each one
* needs a bit more memory/allocations/indirections, but code is simpler
* fix warnings (missing static) & debug checks
2025-09-24 16:17:49 +02:00
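The splitting policy the bullets above describe can be sketched as a small packing function: allocations go into backend buffers ("chunks") capped at the backend's max buffer size, leftover space in earlier chunks stays usable after a new chunk is opened, and a single request larger than the cap still gets its own oversized chunk ("allow the last chunk to grow beyond max size"). The function and names are illustrative, not the gallocr code.

```cpp
#include <cstddef>
#include <vector>

// Pack allocation sizes into chunks of at most max_size bytes each.
// Returns the total bytes placed in each chunk.
static std::vector<size_t> pack_into_chunks(const std::vector<size_t> & sizes,
                                            size_t max_size) {
    std::vector<size_t> chunks;
    for (size_t sz : sizes) {
        bool placed = false;
        // leftover unallocated space of previous chunks is still considered
        for (size_t & c : chunks) {
            if (c + sz <= max_size) {
                c += sz;
                placed = true;
                break;
            }
        }
        if (!placed) {
            // open a new chunk; a single request larger than max_size
            // yields a chunk that exceeds the cap on its own
            chunks.push_back(sz);
        }
    }
    return chunks;
}
```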
Tarek Dakhran
3a59971967
model : add label for LiquidAI LFM2-2.6B model ( #16204 )
...
* model : add label for LiquidAI LFM2-2.6B model
HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B).
Support for GGUF conversion and inference is added in #14620 .
However, due to similar `n_embd`, it identifies as a 1.2B model.
Fix the label by using `n_ff` to identify the model instead.
Output of `llama-bench`:
```
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2 1.2B F16 | 2.18 GiB | 1.17 B | CPU | 10 | pp512 | 223.97 ± 5.32 |
| lfm2 2.6B F16 | 4.79 GiB | 2.57 B | CPU | 10 | pp512 | 92.53 ± 4.14 |
| lfm2 350M F16 | 676.25 MiB | 354.48 M | CPU | 10 | pp512 | 725.52 ± 11.70 |
| lfm2 700M F16 | 1.38 GiB | 742.49 M | CPU | 10 | pp512 | 336.22 ± 12.93 |
```
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-24 13:42:26 +02:00
Jie Fu (傅杰)
63b54c81a6
model-conversion : fix causal-verify-logits failure with model names containing "." ( #16215 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 10:25:26 +02:00
Uilian Ries
152729f884
common : add missing chrono header for common.cpp ( #16211 )
...
Signed-off-by: Uilian Ries <uilianries@gmail.com>
2025-09-24 09:53:47 +03:00
Sigbjørn Skjæret
c0c59c1157
codeowners : match all requirements files ( #16214 )
2025-09-24 08:53:20 +02:00
Jie Fu (傅杰)
7735706b93
model-conversion : fix run-org-model.py failing to run on Mac M1 ( #16213 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 08:46:52 +02:00
Daniel Bevenius
4d9ea03d17
codeowners : use slash prefix for root files [no ci] ( #16210 )
...
This commit adds a leading slash to the paths of root-level files
in the CODEOWNERS file.
The motivation for this is that these patterns might otherwise match files
in subdirectories that have other/additional owners, overriding those owners.
Refs: https://github.com/ggml-org/llama.cpp/pull/16209#issuecomment-3326434274
2025-09-24 08:10:09 +02:00
Jie Fu (傅杰)
8ba548dae2
model-conversion : fix the make targets in the README.md ( #16209 )
...
Fix two incorrect make targets in the readme.
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 06:19:23 +02:00
Georgi Gerganov
f505bd83ca
ci : disable AMD workflows + update NVIDIA workflows ( #16200 )
...
* ci : disable AMD workflows + update NVIDIA workflows
* cont : fixes
* cont : update nvidia vulkan workflows
2025-09-23 20:41:40 +03:00
Georgi Gerganov
0889589dbe
ci : enable Vulkan workflow on Mac ( #16194 )
2025-09-23 13:44:25 +03:00
Xiangyan Sun
4e29084ba4
ggml-cpu: Respect cpumask settings ( #16164 )
2025-09-23 11:58:12 +03:00
Sigbjørn Skjæret
f6b4af3d04
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl ( #15928 )
...
* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl
* change initialization to true
2025-09-23 10:25:20 +02:00
Aaron Teo
264f1b5187
zdnn: refactor codebase + add docs ( #16178 )
...
* zdnn: initial matmul refactor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: rm static from funcs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: update ggml-zdnn.h
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: change header files to hpp
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: switch to common.hpp
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: move mulmat forward around
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: rm inline from utils
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: code cleanup
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* docs: add zDNN docs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-23 14:53:05 +08:00
Daniel Bevenius
0bc7cc7154
codeowners : add @danbev to model-conversion example [no ci] ( #16190 )
...
This commit adds examples/model-conversion/ to the CODEOWNERS file and
assigns myself (@danbev) as the code owner for this directory.
2025-09-23 09:13:22 +03:00
Aaron Teo
4b9f4cb0f8
devops: add s390x containers ( #15915 )
...
* devops: add s390x dockerfile
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add missing ninja
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move s390x docker into cpu docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: rework s390x docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: copy more tools
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add server build step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove apt clean steps as distroless misses it
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove apt commands from distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix shared libs in distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: use correct libs path
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix shared libs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add collector stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix missing stage ref
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix permission issue
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix unknown model loading failures
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: attempt at fixing model loading failure
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix missing ggml shared object
failure to load model
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove move shared objects
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move libggml-cpu and blas into bin
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: finalise hardened server stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add cli target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix typos
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix missing shared libraries in base
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: update debian target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "devops: formalise llama.cpp loc"
This reverts commit 0a7664af84 .
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0a7664af84 )
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: attempt at fixing missing dir
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: attempt at making it cache the build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix copying process
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: make build dir an argument
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "devops: make build dir an argument"
This reverts commit 438698976b .
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add build stage for gguf-py
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move gguf-py installation into build stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: break system packages?
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add rust compiler installer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix rustc not found
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove cache mount to allow rustc to persist
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move rustc installation to another layer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: move gguf-py installation to full stage, fix copying
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove rustc installation in build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable full target for now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: attempting static build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: merge s390x dockerfile into cpu for now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: switch to gcc image for build step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove build essentials
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: install openblas into base target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: go back to s390x dockerfile
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove libggml and libblas
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add full target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add break system packages
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add libjpeg
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add missing cmake dep
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: finalise docker images for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add custom openblas patch
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: use libopenblas-dev instead of libopenblas-openmp-dev
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add s390x docker build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-23 13:59:34 +08:00
Daniel Bevenius
85e72271ba
ggml-cpu : fix typo in gemm comments [no ci] ( #16189 )
2025-09-23 05:59:03 +02:00
Ed Addario
8eedcf74bc
Increase scale multiplier
2025-09-22 20:42:37 +01:00
Ed Addario
d36ee0a0a8
Add comments to explain magic numbers
2025-09-22 20:41:56 +01:00
Ed Addario
7ba6001ec8
Simplify candidates sorting
2025-09-22 20:11:54 +01:00
Ed Addario
d79ade2e8e
Adjust for small vector size
2025-09-22 20:11:26 +01:00
Ed Addario
f184450806
Fix minor logic flaw
2025-09-22 20:10:42 +01:00