llama.cpp/src
Gabe Goodhart fd621880f3
aLoRA Support (#15327)
* feat: Add python-side constants and conversion for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add c++ side constants for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse invocation string for adapters from GGUF

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
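
As a rough sketch of what this parsing step amounts to, assuming the
key name from the commit title and raw gguf accessors (the real loader
goes through llama_model_loader, so treat both as assumptions):

    #include <string>

    #include "gguf.h"

    // minimal sketch: read the invocation string KV from an adapter GGUF;
    // an absent key means a plain lora adapter that is always active
    static std::string read_invocation_string(const struct gguf_context * ctx) {
        const int64_t kid = gguf_find_key(ctx, "adapter.lora.invocation_string");
        if (kid < 0) {
            return "";
        }
        return gguf_get_val_str(ctx, kid);
    }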

* fix(python): Update conversion to alora_invocation_tokens

This is the preferred method in PEFT, which is the source of ground truth:

https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(cpp): Update to alora_invocation_tokens on c++ side

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
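
On the C++ side the same read becomes an array lookup; a sketch
assuming the tokens are stored as a UINT32 array under a key like
"adapter.alora.invocation_tokens" (the exact key name is an assumption):

    #include <cstdint>
    #include <vector>

    #include "gguf.h"
    #include "llama.h"

    // sketch: read the invocation token array from the adapter GGUF
    static std::vector<llama_token> read_invocation_tokens(const struct gguf_context * ctx) {
        std::vector<llama_token> tokens;
        const int64_t kid = gguf_find_key(ctx, "adapter.alora.invocation_tokens");
        if (kid >= 0 && gguf_get_arr_type(ctx, kid) == GGUF_TYPE_UINT32) {
            const size_t n = gguf_get_arr_n(ctx, kid);
            const uint32_t * data = (const uint32_t *) gguf_get_arr_data(ctx, kid);
            tokens.assign(data, data + n);
        }
        return tokens;
    }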

* feat: Add C APIs to get alora invocation token array from lora

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
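
Callers can then distinguish an alora from a plain lora by the token
count; a usage sketch with the accessor names this commit appears to
introduce:

    // sketch: a non-zero invocation token count marks the adapter as an alora
    static bool adapter_is_alora(const struct llama_adapter_lora * adapter) {
        return llama_adapter_get_alora_n_invocation_tokens(adapter) > 0;
    }

    // ... and the tokens themselves:
    const uint64_t n_inv = llama_adapter_get_alora_n_invocation_tokens(adapter);
    if (n_inv > 0) {
        const llama_token * inv = llama_adapter_get_alora_invocation_tokens(adapter);
        // inv[0 .. n_inv - 1] is the token sequence that activates the adapter
    }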

* feat: Initial implementation of alora cache logic in server

This does not yet identify the invocation tokens and apply the lora adapter
only afterwards, but it does seem to produce correct results when the
invocation tokens sit at the beginning of the uncached input.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
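
A sketch of the server-side bookkeeping, assuming each slot carries a
list of per-request lora configs with a scale and an adapter pointer
(field names here are illustrative, not the literal server code):

    // sketch: find the single enabled alora for this slot, if any
    int32_t alora_idx = -1;
    for (size_t i = 0; i < slot.lora.size(); ++i) {
        const auto & la = slot.lora[i]; // illustrative: { float scale; llama_adapter_lora * ptr; }
        if (la.scale != 0.0f && llama_adapter_get_alora_n_invocation_tokens(la.ptr) > 0) {
            alora_idx = (int32_t) i;
            break;
        }
    }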

* feat: Identify alora invocation sequences

This currently limits each slot to a single enabled alora. Multiple aloras
with different invocation sequences would be possible, but that would require
a more complex integration of the adapter toggling, and it is not a
well-studied case for alora since it is unclear whether one alora can reuse
cache from a previous prefill computed with a different alora.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
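
Identifying the invocation sequence boils down to finding the last
occurrence of the invocation tokens in the prompt; a minimal
self-contained sketch (not the literal server code):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    #include "llama.h"

    // sketch: return the index where the last occurrence of the alora
    // invocation sequence starts in the prompt, or -1 if it never occurs
    static int64_t find_alora_start(const std::vector<llama_token> & prompt,
                                    const llama_token * inv, size_t n_inv) {
        if (n_inv == 0 || prompt.size() < n_inv) {
            return -1;
        }
        for (int64_t i = (int64_t) (prompt.size() - n_inv); i >= 0; --i) {
            if (std::equal(inv, inv + n_inv, prompt.begin() + i)) {
                return i;
            }
        }
        return -1;
    }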

* feat: Only reuse cache for tokens before the alora invocation start

This is a bit of an edge case, but theoretically a user could run the same
query with the alora disabled (just using the base model), then retry with
the alora enabled. The cached tokens from the first pass must be treated as
invalid.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
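
The clamp itself is a one-liner; whether the boundary is the invocation
start itself or the token before it is exactly the off-by-one addressed
further down (a sketch, with slot.n_past standing in for the server's
count of reusable cached tokens):

    // sketch: positions at or past the invocation start must be recomputed
    // with the adapter active, so never reuse cache beyond that point
    if (alora_start >= 0) {
        slot.n_past = std::min<int64_t>(slot.n_past, alora_start);
    }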

* feat: Handle un-cached tokens that come before the alora activation

If there are tokens to prefill between those pulled from cache and the
invocation start, the batch is filled only up to the token before the
invocation start. When this is detected, the alora is temporarily disabled
with a scale of 0.0, then immediately re-enabled once it has been initialized
for the internal graph. Since that batch does not complete the prompt, the
remaining prompt tokens are handled in the next task, which pulls all of the
non-alora tokens from cache and proceeds with prefill for the alora tokens.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
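
A sketch of the temporary toggle around the pre-invocation batch, using
llama_set_adapter_lora (the real code re-enables the adapter as soon as
the graph has been initialized; the batching details are elided):

    // sketch: prefill the tokens before the invocation with the adapter
    // neutralized, then restore its scale for the alora tokens that follow
    llama_set_adapter_lora(ctx, adapter, 0.0f);  // temporarily disable
    llama_decode(ctx, batch_before_invocation);  // pre-alora prefill
    llama_set_adapter_lora(ctx, adapter, scale); // re-enable for alora tokens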

* fix: Use || instead of 'or'

Too much python 🤦

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix off-by-one for limiting cached tokens to before alora start

This was the cause of the inconsistent results from the dummy test script
depending on whether a turn that runs the prompt without the adapter precedes
the turn that runs it with the adapter.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Support backwards-compatibility for "invocation_string" in adapter_config.json

While the PEFT PR replaces this field with alora_invocation_tokens, the
existing adapters in the ibm-granite org on HF still use "invocation_string",
so supporting it preserves backwards compatibility and allows testing now
(before the PEFT PR changes have percolated everywhere).

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove duplicate logging

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
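
With the nlohmann::json the server already uses, the per-adapter entry
in the /lora-adapters response would look roughly like this (variable
and field values are illustrative):

    // sketch: per-adapter entry for the /lora-adapters endpoint
    json entry = {
        {"id",    id},
        {"path",  la.path},
        {"scale", la.scale},
    };
    if (n_inv > 0) {
        entry["alora_invocation_string"] = inv_string; // human-readable form
        entry["alora_invocation_tokens"] = std::vector<llama_token>(inv, inv + n_inv);
    }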

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-05 17:32:39 -06:00
CMakeLists.txt kv-cache : drop the "unified" prefix (#15467) 2025-08-21 17:00:33 +03:00
llama-adapter.cpp aLoRA Support (#15327) 2025-09-05 17:32:39 -06:00
llama-adapter.h aLoRA Support (#15327) 2025-09-05 17:32:39 -06:00
llama-arch.cpp aLoRA Support (#15327) 2025-09-05 17:32:39 -06:00
llama-arch.h aLoRA Support (#15327) 2025-09-05 17:32:39 -06:00
llama-batch.cpp perplexity : provide a helpful hint for has_cpl case in split_equal error. (#15304) 2025-08-14 14:03:30 +03:00
llama-batch.h llama : reuse compute graphs (#14482) 2025-07-17 19:08:33 +03:00
llama-chat.cpp model : add support for Seed-OSS (#15490) 2025-08-23 15:21:52 +02:00
llama-chat.h model : add support for Seed-OSS (#15490) 2025-08-23 15:21:52 +02:00
llama-context.cpp llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (#15791) 2025-09-04 15:40:44 +02:00
llama-context.h llama : separate compute buffer reserve from fattn check (#15696) 2025-08-31 15:49:03 +02:00
llama-cparams.cpp cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188) 2025-06-15 10:08:58 +03:00
llama-cparams.h llama : remove KV cache defragmentation logic (#15473) 2025-08-22 12:22:13 +03:00
llama-grammar.cpp `server`: streaming of tool calls and thoughts when `--jinja` is on (#12379) 2025-05-25 01:48:08 +01:00
llama-grammar.h `tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034) 2025-03-05 13:05:13 +00:00
llama-graph.cpp kv-cache : fix SWA checks + disable cacheless iSWA (#15811) 2025-09-05 10:39:22 +03:00
llama-graph.h llama : add support for EmbeddingGemma 300m (#15798) 2025-09-04 18:10:29 +02:00
llama-hparams.cpp kv-cache : fix SWA checks + disable cacheless iSWA (#15811) 2025-09-05 10:39:22 +03:00
llama-hparams.h kv-cache : fix SWA checks + disable cacheless iSWA (#15811) 2025-09-05 10:39:22 +03:00
llama-impl.cpp GGUF: C++ refactor, backend support, misc fixes (#11030) 2025-01-07 18:01:58 +01:00
llama-impl.h llama: use FA + max. GPU layers by default (#15434) 2025-08-30 16:32:10 +02:00
llama-io.cpp llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) 2025-03-13 12:35:44 +02:00
llama-io.h llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) 2025-03-13 12:35:44 +02:00
llama-kv-cache-iswa.cpp kv-cache : fix SWA checks + disable cacheless iSWA (#15811) 2025-09-05 10:39:22 +03:00
llama-kv-cache-iswa.h kv-cache : support layer reuse (#15504) 2025-08-24 13:07:07 +03:00
llama-kv-cache.cpp kv-cache : fix SWA checks + disable cacheless iSWA (#15811) 2025-09-05 10:39:22 +03:00
llama-kv-cache.h kv-cache : fix SWA checks + disable cacheless iSWA (#15811) 2025-09-05 10:39:22 +03:00
llama-kv-cells.h llama : remove KV cache defragmentation logic (#15473) 2025-08-22 12:22:13 +03:00
llama-memory-hybrid.cpp kv-cache : fix SWA checks + disable cacheless iSWA (#15811) 2025-09-05 10:39:22 +03:00
llama-memory-hybrid.h kv-cache : fix SWA checks + disable cacheless iSWA (#15811) 2025-09-05 10:39:22 +03:00
llama-memory-recurrent.cpp kv-cache : support layer reuse (#15504) 2025-08-24 13:07:07 +03:00
llama-memory-recurrent.h kv-cache : support layer reuse (#15504) 2025-08-24 13:07:07 +03:00
llama-memory.cpp memory : correctly handle failure in apply() (#14438) 2025-06-30 18:03:03 +03:00
llama-memory.h kv-cache : support layer reuse (#15504) 2025-08-24 13:07:07 +03:00
llama-mmap.cpp llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013) 2025-06-05 11:57:42 +02:00
llama-mmap.h llama-mmap: fix missing include (#11796) 2025-02-10 20:58:18 +02:00
llama-model-loader.cpp nvidia nemotron nano v2 (nemotronh) (#15507) 2025-08-28 18:39:31 -06:00
llama-model-loader.h model: support GLM 4.5 family of models (#14939) 2025-08-04 20:29:25 +02:00
llama-model-saver.cpp llama : improve sep token handling (#14272) 2025-06-20 14:04:09 +02:00
llama-model-saver.h llama/ggml: add LLM training support (#10544) 2025-05-12 14:44:49 +02:00
llama-model.cpp kv-cache : fix SWA checks + disable cacheless iSWA (#15811) 2025-09-05 10:39:22 +03:00
llama-model.h llama : fix incorrect model type for Gemma 270M (#15764) 2025-09-03 13:35:49 +02:00
llama-quant.cpp convert : support non-mxfp4 HF model (#15153) 2025-08-07 23:26:03 +02:00
llama-quant.h llama : refactor `src/llama.cpp` (#10902) 2025-01-03 10:18:53 +02:00
llama-sampling.cpp sampling : optimize dist sampler (#15704) 2025-09-03 18:16:26 +03:00
llama-sampling.h llama : add `llama_vocab`, functions -> methods, naming (#11110) 2025-01-12 11:32:42 +02:00
llama-vocab.cpp model : jina-embeddings-v3 support (#13693) 2025-08-28 15:49:50 +02:00
llama-vocab.h model : add hunyuan dense (#14878) 2025-08-01 15:31:12 +02:00
llama.cpp llama: use FA + max. GPU layers by default (#15434) 2025-08-30 16:32:10 +02:00
unicode-data.cpp server : better security control for public deployments (#9776) 2024-10-08 13:27:04 +02:00
unicode-data.h llama : reduce compile time and binary size (#9712) 2024-10-02 15:49:55 +02:00
unicode.cpp model : add Kimi-K2 support (#14654) 2025-07-15 21:54:22 +02:00
unicode.h model : add Kimi-K2 support (#14654) 2025-07-15 21:54:22 +02:00