Introduces a new split mode that selects the minimum subset of GPUs needed to fit a model, based on estimated memory requirements. GPUs are sorted by free memory and selected until the cumulative memory meets the model's estimated needs. This reduces inter-GPU communication, improving performance and overall power consumption on multi-GPU systems.
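A minimal sketch of the greedy selection idea described above. The names (`gpu_info`, `select_min_gpus`) and the memory figures are illustrative assumptions, not the actual llama.cpp API:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct gpu_info {
    int      id;        // device index
    uint64_t free_mem;  // free memory in bytes
};

// Pick the smallest set of GPUs whose combined free memory covers the model's
// estimated requirement, preferring devices with the most free memory first.
static std::vector<int> select_min_gpus(std::vector<gpu_info> gpus, uint64_t required) {
    std::sort(gpus.begin(), gpus.end(),
              [](const gpu_info & a, const gpu_info & b) { return a.free_mem > b.free_mem; });

    std::vector<int> selected;
    uint64_t total = 0;
    for (const auto & g : gpus) {
        selected.push_back(g.id);
        total += g.free_mem;
        if (total >= required) {
            return selected;  // enough memory accumulated, stop adding devices
        }
    }
    return {};  // model does not fit even across all devices
}

int main() {
    // Hypothetical setup: 4 GPUs with varying free memory, ~20 GiB model estimate.
    const uint64_t GiB = 1024ull * 1024 * 1024;
    std::vector<gpu_info> gpus = {
        {0, 12 * GiB}, {1, 24 * GiB}, {2, 8 * GiB}, {3, 16 * GiB},
    };
    for (int id : select_min_gpus(gpus, 20 * GiB)) {
        printf("selected GPU %d\n", id);
    }
    return 0;
}
```

With these example numbers only GPU 1 is selected, since its free memory alone covers the estimate, so the remaining devices stay idle and no tensor traffic crosses GPUs.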