llama.cpp/include
Rob Kim 49f9271148 Implement context-length dependent KV-cache and compute-buffer-aware layer distribution for heterogeneous multi-GPU inference. Solves the problem of attempting to run setups with different VRAM sizes (e.g. 24 GB cards alongside 6 GB cards); previously, layers were assigned without accounting for the compute buffer, causing failures when one or more of the smaller GPUs could not hold it.
- Add requested_n_ctx parameter to llama_model_params
- Implement 3-pass allocation algorithm accounting for compute buffers
- Add device exclusion for insufficient memory (GPUs too small to hold 1 layer + KV cache + compute buffer are excluded)
- Add layer redistribution to make equitable use of included GPUs (may not be truly optimal)
2025-07-01 12:15:45 -04:00
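
A minimal usage sketch of the new model parameter follows. It is not taken from the patch itself: the field name requested_n_ctx is assumed from the commit message above, and the surrounding calls (llama_model_default_params, llama_model_load_from_file, llama_init_from_model) are the standard llama.h API; without this patch applied, the requested_n_ctx assignment will not compile.

#include "llama.h"

int main() {
    const int n_ctx = 8192;

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers    = 99;                      // offload as many layers as fit
    mparams.split_mode      = LLAMA_SPLIT_MODE_LAYER;  // split whole layers across GPUs
    mparams.requested_n_ctx = n_ctx;                   // new field per the commit message (assumed name)

    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = n_ctx; // should match the value used for layer planning

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == NULL) {
        llama_model_free(model);
        return 1;
    }

    // ... run inference as usual ...

    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
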
..
llama-cpp.h llama : add `llama_vocab`, functions -> methods, naming (#11110) 2025-01-12 11:32:42 +02:00
llama.h Implement context-length dependent KV-cache and compute-buffer-aware layer distribution for heterogeneous multi-GPU inference. Solves the problem of attempting to run setups with different VRAM sizes (e.g. 24 GB cards alongside 6 GB cards); previously, layers were assigned without accounting for the compute buffer, causing failures when one or more of the smaller GPUs could not hold it. 2025-07-01 12:15:45 -04:00
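
For illustration, here is a rough, self-contained sketch of the kind of planning the commit describes: exclude GPUs that cannot hold at least one layer plus its KV-cache slice plus the compute buffer, then distribute the remaining layers across the surviving devices in proportion to their usable memory. All names, sizes, and the proportional heuristic below are assumptions made for the example; the actual implementation lives in the llama.cpp sources, not in this header directory.

#include <cstdint>
#include <cstdio>
#include <vector>

struct gpu_info {
    int64_t free_bytes;       // free VRAM on this device
    int     n_layers = 0;     // layers assigned by the planner
    bool    excluded = false; // true if the device cannot hold even one layer
};

// layer_bytes    : weight size of one transformer layer
// kv_layer_bytes : KV-cache size per layer at the requested context length
// compute_bytes  : per-device compute buffer (grows with context/batch size)
static void plan_layers(std::vector<gpu_info> & gpus, int n_layers,
                        int64_t layer_bytes, int64_t kv_layer_bytes, int64_t compute_bytes) {
    // pass 1: exclude devices that cannot fit one layer + KV slice + compute buffer
    int64_t total_usable = 0;
    for (auto & g : gpus) {
        const int64_t usable = g.free_bytes - compute_bytes;
        if (usable < layer_bytes + kv_layer_bytes) {
            g.excluded = true;
        } else {
            total_usable += usable;
        }
    }
    if (total_usable == 0) {
        return; // no device can take any layer; caller would fall back to CPU
    }
    // pass 2: assign layers proportionally to usable memory on included devices
    int assigned = 0;
    for (auto & g : gpus) {
        if (g.excluded) continue;
        const int64_t usable = g.free_bytes - compute_bytes;
        g.n_layers = (int)(n_layers * (double)usable / (double)total_usable);
        assigned  += g.n_layers;
    }
    // pass 3: hand the rounding remainder to included devices round-robin
    for (size_t i = 0; assigned < n_layers; i = (i + 1) % gpus.size()) {
        if (!gpus[i].excluded) { gpus[i].n_layers++; assigned++; }
    }
}

int main() {
    // a 24 GB card next to a 6 GB card, as in the commit message
    std::vector<gpu_info> gpus = { { 24ll << 30 }, { 6ll << 30 } };
    // hypothetical sizes: 48 layers of 400 MiB each, 64 MiB KV per layer, 1 GiB compute buffer
    plan_layers(gpus, 48, 400ll << 20, 64ll << 20, 1ll << 30);
    for (size_t i = 0; i < gpus.size(); ++i) {
        std::printf("GPU %zu: %s, %d layers\n",
                    i, gpus[i].excluded ? "excluded" : "included", gpus[i].n_layers);
    }
    return 0;
}
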