Dynamic VRAM Allocation for Vulkan Backend
This document describes the dynamic VRAM allocation heuristic for llama.cpp's Vulkan backend, which automatically optimizes GPU layer offloading based on available VRAM.
Overview
The Vulkan backend now includes a dynamic heuristic that automatically calculates the optimal number of GPU layers to offload based on:
- Available VRAM on your GPU
- Model size and layer count (from GGUF metadata)
- Reserved overhead for KV cache and compute buffers
This enables good performance on low-VRAM devices (such as the AMD RX 6500 XT with 4 GB) without manual configuration or out-of-memory (OOM) errors.
How It Works
When you run llama-cli or llama-server without specifying -ngl (or with -ngl -1), the heuristic:
- Queries available VRAM from your Vulkan device
- Parses model metadata to determine model size and layer count
- Reserves a fixed overhead (800 MB) for the KV cache, compute buffers, and system use
- Calculates the number of layers to offload: (available_vram - overhead) / bytes_per_layer, where bytes_per_layer is the model size divided by its layer count
- Offloads that many layers automatically without risking OOM
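The calculation is small enough to show inline. The following is an illustrative, self-contained sketch of the formula, not the backend's actual code: the identifiers (estimate_gpu_layers, overhead, fit) are chosen here for readability, and the clamp to n_layers - 1 is inferred from the example results below (26/27, 28/29), not confirmed from the source.

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Illustrative sketch of the heuristic's formula; identifiers are hypothetical.
static int estimate_gpu_layers(uint64_t available_vram, // queried from the Vulkan device
                               uint64_t model_size,     // total tensor size from GGUF metadata
                               int      n_layers) {     // *.block_count (plus output layer)
    const uint64_t overhead = 800ull * 1024 * 1024;     // reserved for KV cache, compute buffers, system

    if (n_layers <= 1 || available_vram <= overhead) {
        return 0;                                        // no headroom: keep everything on the CPU
    }

    const uint64_t bytes_per_layer = std::max<uint64_t>(model_size / (uint64_t) n_layers, 1);
    const int      fit = (int) ((available_vram - overhead) / bytes_per_layer);

    // Keep at least one layer on the CPU even when the whole model fits
    // (this matches the 26/27 and 28/29 results reported in this document).
    return std::min(fit, n_layers - 1);
}

int main() {
    // Numbers from the debug line in the Usage section: 3434 MB free, 1623 MB model, 27 layers
    const uint64_t MiB = 1024ull * 1024;
    printf("calculated_layers = %d\n", estimate_gpu_layers(3434 * MiB, 1623 * MiB, 27)); // prints 26
}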
Example Results
AMD RX 6500 XT (4GB VRAM):
- Gemma 2B (1.6GB): 26/27 layers offloaded → 2.5-3.1x faster
- Llama 3.2 3B (1.9GB): 28/29 layers offloaded → ~2x faster
- Llama 2 7B (3.9GB): 21/33 layers offloaded → 1.6x faster
- Llama 2 13B (7.5GB): 14/41 layers offloaded → No OOM ✅
Usage
Automatic (Recommended)
Simply run without -ngl to enable the dynamic heuristic:
# Heuristic calculates optimal layers automatically
llama-cli -m models/gemma-2b-q4.gguf -p "Hello"
The heuristic will print debug info showing the calculation:
Vulkan dynamic heuristic: available_vram=3434 MB, model_size=1623 MB,
n_layers=27, overhead=800 MB, calculated_layers=26
Manual Override
You can still manually specify layers to override the heuristic:
# Force specific number of layers
llama-cli -m models/gemma-2b-q4.gguf -p "Hello" -ngl 20
# Force CPU-only
llama-cli -m models/gemma-2b-q4.gguf -p "Hello" -ngl 0
Performance
Compared to CPU-only (-ngl 0), the dynamic heuristic provides:
Gemma 2B Q4_K_M on AMD RX 6500 XT:
- Prompt processing: 2.5x faster (497 → 1231 t/s)
- Token generation: 3.1x faster (19.4 → 60.4 t/s)
Troubleshooting
Still Getting OOM Errors?
If you encounter "Out of Device Memory" errors despite the heuristic:
- Reduce context size: Use -c 2048 or lower
- Force fewer layers: Use -ngl 10 or lower
- Check available VRAM: Close other GPU applications
- Use a smaller model: Try a smaller quantization (Q4_K_M → Q3_K_S)
Heuristic Not Triggering?
The heuristic only activates when:
- ✅ Vulkan backend is enabled (GGML_USE_VULKAN=1 during build)
- ✅ -ngl is not specified (or set to -1)
- ✅ GGUF file can be parsed for metadata
If you explicitly set -ngl, the heuristic is bypassed.
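To make the activation rule concrete, here is a schematic with placeholder identifiers (vk_estimate_gpu_layers, resolve_gpu_layers, n_gpu_layers_cli are not the backend's real names): only a negative layer count hands control to the heuristic.

#include <cstdio>

// Placeholder standing in for the estimator described above.
static int vk_estimate_gpu_layers() { return 26; }      // hypothetical result

// Schematic of the activation rule; n_gpu_layers_cli stands for the parsed -ngl value,
// which defaults to -1 when the flag is not given.
static int resolve_gpu_layers(int n_gpu_layers_cli) {
    if (n_gpu_layers_cli < 0) {
        return vk_estimate_gpu_layers();                 // -ngl absent or -1: run the heuristic
    }
    return n_gpu_layers_cli;                             // explicit -ngl, including 0, bypasses it
}

int main() {
    printf("%d\n", resolve_gpu_layers(-1)); // heuristic decides (placeholder: 26)
    printf("%d\n", resolve_gpu_layers(20)); // user override: 20
    printf("%d\n", resolve_gpu_layers(0));  // CPU-only: 0
}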
Technical Details
Overhead Calculation
The heuristic reserves 800 MB for:
- KV cache (dynamically allocated by llama.cpp)
- Compute buffers (temporary tensors during inference)
- System overhead (driver, fragmentation)
This value is conservative and works well across different model sizes.
Model Compatibility
The heuristic generalizes across model architectures by searching the GGUF metadata for:
- *.block_count (layer count)
- *.embedding_length (model dimensions)
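As a rough illustration of that lookup, the sketch below uses ggml's gguf C reader (bundled with llama.cpp) to resolve the *. prefix via general.architecture and read the two keys. Whether the backend resolves the wildcard exactly this way is an assumption, read_model_shape is a name chosen here, and error handling is pared down to the essentials.

#include <cstdint>
#include <cstdio>
#include <string>
#include "gguf.h"   // GGUF reader from ggml (older trees declare these functions in ggml.h)

// Sketch: read <arch>.block_count and <arch>.embedding_length from a GGUF file.
static bool read_model_shape(const char * path, uint32_t & n_layers, uint32_t & n_embd) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(path, params);
    if (!ctx) {
        return false;                          // not a readable GGUF file
    }

    // "general.architecture" gives the key prefix, e.g. "llama", "gemma2", "qwen2"
    const auto arch_id = gguf_find_key(ctx, "general.architecture");
    if (arch_id < 0) { gguf_free(ctx); return false; }
    const std::string arch = gguf_get_val_str(ctx, arch_id);

    const auto blk_id = gguf_find_key(ctx, (arch + ".block_count").c_str());
    const auto emb_id = gguf_find_key(ctx, (arch + ".embedding_length").c_str());
    if (blk_id < 0 || emb_id < 0) { gguf_free(ctx); return false; }

    n_layers = gguf_get_val_u32(ctx, blk_id);  // layer count
    n_embd   = gguf_get_val_u32(ctx, emb_id);  // embedding width
    gguf_free(ctx);
    return true;
}

int main(int argc, char ** argv) {
    uint32_t n_layers = 0, n_embd = 0;
    if (argc > 1 && read_model_shape(argv[1], n_layers, n_embd)) {
        printf("block_count=%u embedding_length=%u\n", n_layers, n_embd);
    }
}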
Tested architectures:
- ✅ Gemma / Gemma 2
- ✅ Llama / Llama 2 / Llama 3
- ✅ Qwen / Qwen 2.5
Benchmark Script
The tests/6500xt_benchmark.ps1 script automates testing across different configurations:
cd tests
.\6500xt_benchmark.ps1
This compares CPU-only inference against the GPU heuristic and reports the performance improvement.