llama.cpp/include
samuel a3e29da02a glm-moe: allow skipping MTP tensor loading to save VRAM
Adds a new `mtp` boolean to `llama_model_params`. When set to false (default):
1. The loader skips loading MTP-specific tensors (NextN layers) using `TENSOR_SKIP`.
2. The KV cache size calculation excludes the MTP layer (`n_layer_kv_from_start`).

This reduces VRAM usage and load time for users running GLM-4.5/4.6 in standard generation mode.
2025-12-21 17:29:55 -05:00
llama-cpp.h llama : add `llama_vocab`, functions -> methods, naming (#11110) 2025-01-12 11:32:42 +02:00
llama.h glm-moe: allow skipping MTP tensor loading to save VRAM 2025-12-21 17:29:55 -05:00