llama.cpp/src
Latest commit 444f00b0ec by Daniel Bevenius: llama : remove quantization sanity check (#17788)
* llama : remove quantization sanity check

This commit removes the quantization sanity check for attention layers.

The motivation for this is that there are hybrid models that combine
recurrent layers, expert layers, and attention layers. For these models
the current check fails because the expert layers are not taken into
account. After consideration, it was decided that this check is not
strictly necessary and can be removed to allow for more flexible model
architectures. (A sketch of the kind of check involved is shown below.)

* llama : remove unused pruned_attention_w and is_clip_model vars
Committed on 2025-12-06 12:26:20 +01:00
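For context, the removed check lived in llama-quant.cpp (see the table below) and, per the commit message, verified that the attention layers encountered during quantization matched expectations. The following is a minimal, self-contained sketch of that kind of check, assuming attention layers are identified by a non-zero KV-head count; the function name, parameters, and exact counting rule are illustrative assumptions, not the code that was removed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch only: a quantization-time sanity check that counts the
// layers expected to contain attention (here: layers with a non-zero KV-head
// count) and asserts that the number of attention value-projection ("wv")
// tensors actually quantized matches that count.
static void check_attention_tensor_count(const std::vector<uint32_t> & n_head_kv_per_layer,
                                         int n_attention_wv_seen) {
    int n_attn_layer = 0;
    for (const uint32_t n_head_kv : n_head_kv_per_layer) {
        if (n_head_kv > 0) {
            n_attn_layer++; // recurrent layers report zero KV heads and are skipped
        }
    }
    // For a plain attention-only model the two counts agree. For hybrid models
    // that mix recurrent, expert, and attention layers the counts can diverge,
    // which is the failure mode the commit message describes.
    assert(n_attention_wv_seen == n_attn_layer && "unexpected number of attention tensors");
}
```

Dropping an assertion of this shape trades a loose safety net for compatibility with architectures that the simple counting rule cannot describe.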
| Name | Last commit message | Last commit date |
|---|---|---|
| models | model: support Ministral3 (#17644) | 2025-12-01 12:26:52 +01:00 |
| CMakeLists.txt | model: support Ministral3 (#17644) | 2025-12-01 12:26:52 +01:00 |
| llama-adapter.cpp | aLoRA Support (#15327) | 2025-09-05 17:32:39 -06:00 |
| llama-adapter.h | aLoRA Support (#15327) | 2025-09-05 17:32:39 -06:00 |
| llama-arch.cpp | Override SSM_A op for Qwen3 Next to reduce splits (#17587) | 2025-12-02 00:43:13 +01:00 |
| llama-arch.h | Override SSM_A op for Qwen3 Next to reduce splits (#17587) | 2025-12-02 00:43:13 +01:00 |
| llama-batch.cpp | batch : fix consistency checks for the input positions (#16890) | 2025-10-31 13:50:33 +02:00 |
| llama-batch.h | llama: store mrope data in KV cell (#16825) | 2025-10-29 18:09:18 +01:00 |
| llama-chat.cpp | model : add openPangu-Embedded (#16941) | 2025-11-05 10:28:58 +01:00 |
| llama-chat.h | model : add openPangu-Embedded (#16941) | 2025-11-05 10:28:58 +01:00 |
| llama-context.cpp | ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (#17276) | 2025-11-28 17:33:23 +02:00 |
| llama-context.h | server : support unified cache across slots (#16736) | 2025-11-02 18:14:04 +02:00 |
| llama-cparams.cpp | cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188) | 2025-06-15 10:08:58 +03:00 |
| llama-cparams.h | server : support unified cache across slots (#16736) | 2025-11-02 18:14:04 +02:00 |
| llama-grammar.cpp | grammar: fix regression caused by #17381 (#17412) | 2025-11-20 18:35:10 +01:00 |
| llama-grammar.h | `tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034) | 2025-03-05 13:05:13 +00:00 |
| llama-graph.cpp | model: support Ministral3 (#17644) | 2025-12-01 12:26:52 +01:00 |
| llama-graph.h | graph : support cacheless embeddings with FA and iSWA (#16528) | 2025-10-13 22:42:37 +03:00 |
| llama-hparams.cpp | hparams : add n_embd_inp() to support extended embed (#16928) | 2025-11-07 19:27:58 +01:00 |
| llama-hparams.h | model: support Ministral3 (#17644) | 2025-12-01 12:26:52 +01:00 |
| llama-impl.cpp | common : more accurate sampling timing (#17382) | 2025-11-20 13:40:10 +02:00 |
| llama-impl.h | ggml, llama : use defaulted constructors/destructors (#17649) | 2025-12-03 07:12:18 +01:00 |
| llama-io.cpp | llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) | 2025-03-13 12:35:44 +02:00 |
| llama-io.h | llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) | 2025-03-13 12:35:44 +02:00 |
| llama-kv-cache-iswa.cpp | kv-cache : pad the cache size to 256 for performance (#17046) | 2025-11-07 20:03:25 +02:00 |
| llama-kv-cache-iswa.h | llama: print memory breakdown on exit (#15860) | 2025-09-24 16:53:48 +02:00 |
| llama-kv-cache.cpp | model: add support for qwen3vl series (#16780) | 2025-10-30 16:19:14 +01:00 |
| llama-kv-cache.h | memory : remove KV cache size padding (#16812) | 2025-10-28 20:19:44 +02:00 |
| llama-kv-cells.h | llama: store mrope data in KV cell (#16825) | 2025-10-29 18:09:18 +01:00 |
| llama-memory-hybrid.cpp | memory : use sequential equal splits for recurrent modules (#16442) | 2025-10-07 08:24:17 +03:00 |
| llama-memory-hybrid.h | llama: print memory breakdown on exit (#15860) | 2025-09-24 16:53:48 +02:00 |
| llama-memory-recurrent.cpp | memory: Hybrid context shift (#17009) | 2025-11-10 17:14:23 +02:00 |
| llama-memory-recurrent.h | llama: consistent ctx <-> buf order for KV cache (#16746) | 2025-10-28 11:23:54 +01:00 |
| llama-memory.cpp | memory : correctly handle failure in apply() (#14438) | 2025-06-30 18:03:03 +03:00 |
| llama-memory.h | llama: print memory breakdown on exit (#15860) | 2025-09-24 16:53:48 +02:00 |
| llama-mmap.cpp | llama : fix signed comparison warning on FreeBSD (#17497) | 2025-12-02 12:05:38 +01:00 |
| llama-mmap.h | llama-mmap: fix missing include (#11796) | 2025-02-10 20:58:18 +02:00 |
| llama-model-loader.cpp | model : Apertus model implementation (#15852) | 2025-10-02 20:43:22 +03:00 |
| llama-model-loader.h | model: support GLM 4.5 family of models (#14939) | 2025-08-04 20:29:25 +02:00 |
| llama-model-saver.cpp | llama : improve sep token handling (#14272) | 2025-06-20 14:04:09 +02:00 |
| llama-model-saver.h | llama/ggml: add LLM training support (#10544) | 2025-05-12 14:44:49 +02:00 |
| llama-model.cpp | ggml, llama : use defaulted constructors/destructors (#17649) | 2025-12-03 07:12:18 +01:00 |
| llama-model.h | model : Qwen3 Next (#16095) | 2025-11-28 12:02:56 +01:00 |
| llama-quant.cpp | llama : remove quantization sanity check (#17788) | 2025-12-06 12:26:20 +01:00 |
| llama-quant.h | llama : refactor `src/llama.cpp` (#10902) | 2025-01-03 10:18:53 +02:00 |
| llama-sampling.cpp | common : more accurate sampling timing (#17382) | 2025-11-20 13:40:10 +02:00 |
| llama-sampling.h | llama : add `llama_vocab`, functions -> methods, naming (#11110) | 2025-01-12 11:32:42 +02:00 |
| llama-vocab.cpp | ggml, llama : use defaulted constructors/destructors (#17649) | 2025-12-03 07:12:18 +01:00 |
| llama-vocab.h | model : add AfmoeForCausalLM support (#16477) | 2025-11-14 13:54:10 +01:00 |
| llama.cpp | llama-quant: add support for mmproj (#16592) | 2025-10-15 14:48:08 +02:00 |
| unicode-data.cpp | server : better security control for public deployments (#9776) | 2024-10-08 13:27:04 +02:00 |
| unicode-data.h | llama : reduce compile time and binary size (#9712) | 2024-10-02 15:49:55 +02:00 |
| unicode.cpp | fix: prevent segfault in tokenizer on highly repetitive input (#17786) | 2025-12-05 13:52:23 +02:00 |
| unicode.h | devops: add s390x & ppc64le CI (#15925) | 2025-09-27 02:03:33 +08:00 |