llama.cpp/include
Rob Kim 49f9271148 Implement context-length dependent KV-cache and compute-buffer-aware layer distribution for heterogeneous multi-GPU inference. Solves the problem of attempting to run setups with different VRAM sizes (e.g. 24 GB cards alongside 6 GB cards); previously, layers were assigned without accounting for the compute buffer, causing failures when one or more of the smaller GPUs could not hold it.
- Add requested_n_ctx parameter to llama_model_params
- Implement 3-pass allocation algorithm accounting for compute buffers
- Add device exclusion for insufficient memory (GPUs too small to hold 1 layer + KV cache + compute buffer are excluded)
- Add layer redistribution to make equitable use of included GPUs (may not be truly optimal)
2025-07-01 12:15:45 -04:00
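
A minimal usage sketch of the new model parameter follows. It is not taken from the patch itself: the field name requested_n_ctx is assumed from the commit message above, and the surrounding calls (llama_model_default_params, llama_model_load_from_file, llama_init_from_model) are the standard llama.h API; without this patch applied, the requested_n_ctx assignment will not compile.

#include "llama.h"

int main() {
    const int n_ctx = 8192;

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers    = 99;                      // offload as many layers as fit
    mparams.split_mode      = LLAMA_SPLIT_MODE_LAYER;  // split whole layers across GPUs
    mparams.requested_n_ctx = n_ctx;                   // new field per the commit message (assumed name)

    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = n_ctx; // should match the value used for layer planning

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == NULL) {
        llama_model_free(model);
        return 1;
    }

    // ... run inference as usual ...

    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
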
..
llama-cpp.h llama : add `llama_vocab`, functions -> methods, naming (#11110) 2025-01-12 11:32:42 +02:00
llama.h Implement context-length dependent KV-cache and compute-buffer-aware layer distribution for heterogeneous multi-GPU inference. Solves the problem of attempting to run setups with different VRAM sizes (e.g. 24 GB cards alongside 6 GB cards); previously, layers were assigned without accounting for the compute buffer, causing failures when one or more of the smaller GPUs could not hold it. 2025-07-01 12:15:45 -04:00
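
For illustration, here is a rough, self-contained sketch of the kind of planning the commit describes: exclude GPUs that cannot hold at least one layer plus its KV-cache slice plus the compute buffer, then distribute the remaining layers across the surviving devices in proportion to their usable memory. All names, sizes, and the proportional heuristic below are assumptions made for the example; the actual implementation lives in the llama.cpp sources, not in this header directory.

#include <cstdint>
#include <cstdio>
#include <vector>

struct gpu_info {
    int64_t free_bytes;       // free VRAM on this device
    int     n_layers = 0;     // layers assigned by the planner
    bool    excluded = false; // true if the device cannot hold even one layer
};

// layer_bytes    : weight size of one transformer layer
// kv_layer_bytes : KV-cache size per layer at the requested context length
// compute_bytes  : per-device compute buffer (grows with context/batch size)
static void plan_layers(std::vector<gpu_info> & gpus, int n_layers,
                        int64_t layer_bytes, int64_t kv_layer_bytes, int64_t compute_bytes) {
    // pass 1: exclude devices that cannot fit one layer + KV slice + compute buffer
    int64_t total_usable = 0;
    for (auto & g : gpus) {
        const int64_t usable = g.free_bytes - compute_bytes;
        if (usable < layer_bytes + kv_layer_bytes) {
            g.excluded = true;
        } else {
            total_usable += usable;
        }
    }
    if (total_usable == 0) {
        return; // no device can take any layer; caller would fall back to CPU
    }
    // pass 2: assign layers proportionally to usable memory on included devices
    int assigned = 0;
    for (auto & g : gpus) {
        if (g.excluded) continue;
        const int64_t usable = g.free_bytes - compute_bytes;
        g.n_layers = (int)(n_layers * (double)usable / (double)total_usable);
        assigned  += g.n_layers;
    }
    // pass 3: hand the rounding remainder to included devices round-robin
    for (size_t i = 0; assigned < n_layers; i = (i + 1) % gpus.size()) {
        if (!gpus[i].excluded) { gpus[i].n_layers++; assigned++; }
    }
}

int main() {
    // a 24 GB card next to a 6 GB card, as in the commit message
    std::vector<gpu_info> gpus = { { 24ll << 30 }, { 6ll << 30 } };
    // hypothetical sizes: 48 layers of 400 MiB each, 64 MiB KV per layer, 1 GiB compute buffer
    plan_layers(gpus, 48, 400ll << 20, 64ll << 20, 1ll << 30);
    for (size_t i = 0; i < gpus.size(); ++i) {
        std::printf("GPU %zu: %s, %d layers\n",
                    i, gpus[i].excluded ? "excluded" : "included", gpus[i].n_layers);
    }
    return 0;
}
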