- Add a `requested_n_ctx` parameter to `llama_model_params`
- Implement a 3-pass allocation algorithm that accounts for compute buffers
- Exclude devices with insufficient memory (GPUs too small to hold one layer plus the KV cache and a compute buffer)
- Redistribute layers to make equitable use of the included GPUs (the result may not be truly optimal)
Files changed:

- `llama-cpp.h`
- `llama.h`