llama.cpp/include
samuel a3e29da02a glm-moe: allow skipping MTP tensor loading to save VRAM
Adds a new `mtp` boolean to `llama_model_params`. When set to false (default):
1. The loader skips loading MTP-specific tensors (NextN layers) using `TENSOR_SKIP`.
2. The KV cache size calculation excludes the MTP layer (`n_layer_kv_from_start`).

This reduces VRAM usage and load time for users running GLM-4.5/4.6 in standard generation mode.
2025-12-21 17:29:55 -05:00
llama-cpp.h llama : add `llama_vocab`, functions -> methods, naming (#11110) 2025-01-12 11:32:42 +02:00
llama.h glm-moe: allow skipping MTP tensor loading to save VRAM 2025-12-21 17:29:55 -05:00