# Dynamic VRAM Allocation for Vulkan Backend
This document describes the dynamic VRAM allocation heuristic for `llama.cpp`'s Vulkan backend, which automatically optimizes GPU layer offloading based on available VRAM.
## Overview
The Vulkan backend now includes a **dynamic heuristic** that automatically calculates the optimal number of GPU layers to offload based on:
- Available VRAM on your GPU
- Model size and layer count (from GGUF metadata)
- Reserved overhead for KV cache and compute buffers
This enables **optimal performance** on low-VRAM devices (such as the AMD RX 6500 XT with 4 GB) without manual configuration or out-of-memory (OOM) errors.
## How It Works
When you run `llama-cli` or `llama-server` **without** specifying `-ngl` (or with `-ngl -1`), the heuristic:
1. **Queries available VRAM** from your Vulkan device
2. **Parses model metadata** to determine model size and layer count
3. **Reserves overhead** (800 MB) for the KV cache, compute buffers, and general system use
4. **Calculates optimal layers**: `(available_vram - overhead) / bytes_per_layer` (sketched below)
5. **Offloads automatically** without risking OOM
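The calculation in steps 3–5 boils down to one integer division and a clamp. Below is a minimal sketch of that logic; the function name `estimate_gpu_layers`, the exact rounding, and the clamping behaviour are illustrative assumptions, not the backend's actual symbols.
```cpp
#include <algorithm>
#include <cstdint>

// Minimal sketch of the layer heuristic; names and rounding details are
// illustrative assumptions, not the backend's actual implementation.
static uint32_t estimate_gpu_layers(uint64_t available_vram,  // bytes free on the Vulkan device
                                    uint64_t model_size,      // total model bytes from GGUF metadata
                                    uint32_t n_layers) {      // *.block_count from GGUF metadata
    const uint64_t overhead = 800ull * 1024 * 1024;           // reserve for KV cache, compute buffers, system

    if (n_layers == 0 || available_vram <= overhead) {
        return 0;                                             // no headroom: keep everything on the CPU
    }

    const uint64_t bytes_per_layer = std::max<uint64_t>(model_size / n_layers, 1);
    const uint64_t budget          = available_vram - overhead;

    // Offload as many average-sized layers as fit, capped at the model's layer count.
    return (uint32_t) std::min<uint64_t>(budget / bytes_per_layer, n_layers);
}
```
The computed value is then used in place of an explicit `-ngl` setting.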
### Example Results
**AMD RX 6500 XT (4GB VRAM)**:
- Gemma 2B (1.6GB): **26/27 layers** offloaded → **2.5-3.1x faster**
- Llama 3.2 3B (1.9GB): **28/29 layers** offloaded → **~2x faster**
- Llama 2 7B (3.9GB): **21/33 layers** offloaded → **1.6x faster**
- Llama 2 13B (7.5GB): **14/41 layers** offloaded → **No OOM**
## Usage
### Automatic (Recommended)
Simply run without `-ngl` to enable the dynamic heuristic:
```bash
# Heuristic calculates optimal layers automatically
llama-cli -m models/gemma-2b-q4.gguf -p "Hello"
```
The heuristic will print debug info showing the calculation:
```
Vulkan dynamic heuristic: available_vram=3434 MB, model_size=1623 MB,
n_layers=27, overhead=800 MB, calculated_layers=26
```
### Manual Override
You can still manually specify layers to override the heuristic:
```bash
# Force specific number of layers
llama-cli -m models/gemma-2b-q4.gguf -p "Hello" -ngl 20
# Force CPU-only
llama-cli -m models/gemma-2b-q4.gguf -p "Hello" -ngl 0
```
## Performance
Compared to CPU-only (`-ngl 0`), the dynamic heuristic provides:
**Gemma 2B Q4_K_M on AMD RX 6500 XT**:
- Prompt processing: **2.5x faster** (497 → 1231 t/s)
- Token generation: **3.1x faster** (19.4 → 60.4 t/s)
## Troubleshooting
### Still Getting OOM Errors?
If you encounter "Out of Device Memory" errors despite the heuristic:
1. **Reduce context size**: Use `-c 2048` or lower
2. **Force fewer layers**: Use `-ngl 10` or lower
3. **Check available VRAM**: Close other GPU applications
4. **Use a smaller model**: Try a smaller quantization (e.g. Q4_K_M → Q3_K_S)
### Heuristic Not Triggering?
The heuristic only activates when:
- ✅ Vulkan backend is enabled (`GGML_USE_VULKAN=1` during build)
- ✅ `-ngl` is not specified (or is set to `-1`)
- ✅ GGUF file can be parsed for metadata
If you explicitly set `-ngl`, the heuristic is bypassed.
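In code terms, the trigger reduces to checking whether the requested layer count is still at its unset sentinel (`-1`). A minimal sketch, reusing the hypothetical `estimate_gpu_layers` helper from the earlier sketch (names are illustrative):
```cpp
#include <cstdint>

// Sketch only: choose the effective GPU layer count. In llama.cpp,
// n_gpu_layers defaults to -1, meaning "-ngl was not specified".
static uint32_t resolve_gpu_layers(int32_t requested_layers,   // value of -ngl, or -1 if unset
                                   uint64_t available_vram,
                                   uint64_t model_size,
                                   uint32_t n_layers) {
    if (requested_layers >= 0) {
        // An explicit -ngl (including 0 for CPU-only) bypasses the heuristic.
        return (uint32_t) requested_layers;
    }
    // Hypothetical helper from the "How It Works" sketch above.
    return estimate_gpu_layers(available_vram, model_size, n_layers);
}
```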
## Technical Details
### Overhead Calculation
The heuristic reserves **800 MB** for:
- KV cache (dynamically allocated by llama.cpp)
- Compute buffers (temporary tensors during inference)
- System overhead (driver, fragmentation)
This value is conservative and works well across different model sizes.
### Model Compatibility
The heuristic generalizes across model architectures by searching the GGUF metadata for the following keys (see the sketch below):
- `*.block_count` (layer count)
- `*.embedding_length` (model dimensions)
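In GGUF files the `*` prefix is the architecture name stored under the `general.architecture` key (for example `gemma.block_count` or `llama.block_count`). The sketch below shows how such a lookup could be done with ggml's gguf C API; the helper name, error handling, and exact header location are assumptions, not a copy of the backend's code.
```cpp
#include <string>
#include <cstdint>

#include "gguf.h"   // ggml's GGUF reader (may be declared in ggml.h in older trees)

// Sketch: read <arch>.block_count from a GGUF file; names are illustrative.
static bool read_block_count(const char * path, uint32_t & n_layers) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(path, params);
    if (!ctx) {
        return false;                      // metadata could not be parsed
    }

    bool ok = false;
    const int64_t arch_key = gguf_find_key(ctx, "general.architecture");
    if (arch_key >= 0) {
        const std::string arch = gguf_get_val_str(ctx, arch_key);   // e.g. "llama", "gemma"
        const int64_t layers_key = gguf_find_key(ctx, (arch + ".block_count").c_str());
        if (layers_key >= 0) {
            n_layers = gguf_get_val_u32(ctx, layers_key);
            ok = true;
        }
    }

    gguf_free(ctx);
    return ok;
}
```
Any architecture that publishes these standard keys works without architecture-specific code.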
Tested architectures:
- ✅ Gemma / Gemma 2
- ✅ Llama / Llama 2 / Llama 3
- ✅ Qwen / Qwen 2.5
## Benchmark Script
The `tests/6500xt_benchmark.ps1` script automates testing across different configurations:
```powershell
cd tests
.\6500xt_benchmark.ps1
```
The script compares CPU-only inference against the GPU heuristic and reports the performance difference.