# Dynamic VRAM Allocation for Vulkan Backend
This document describes the dynamic VRAM allocation heuristic for `llama.cpp`'s Vulkan backend, which automatically optimizes GPU layer offloading based on available VRAM.
## Overview
The Vulkan backend now includes a **dynamic heuristic** that automatically calculates the optimal number of GPU layers to offload based on:
- Available VRAM on your GPU
- Model size and layer count (from GGUF metadata)
- Reserved overhead for KV cache and compute buffers
This enables **optimal performance** on low-VRAM devices (such as the AMD RX 6500 XT with 4 GB) without manual configuration or out-of-memory (OOM) errors.
## How It Works
When you run `llama-cli` or `llama-server` **without** specifying `-ngl` (or with `-ngl -1`), the heuristic:
1. **Queries available VRAM** from your Vulkan device
2. **Parses model metadata** to determine model size and layer count
3. **Reserves overhead** (800 MB) for the KV cache, compute buffers, and general system use
4. **Calculates optimal layers**: `(available_vram - overhead) / bytes_per_layer` (sketched below)
5. **Offloads automatically** without risking OOM
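The calculation in steps 3–5 boils down to one integer division and a clamp. Below is a minimal sketch of that logic; the function name `estimate_gpu_layers`, the exact rounding, and the clamping behaviour are illustrative assumptions, not the backend's actual symbols.
```cpp
#include <algorithm>
#include <cstdint>

// Minimal sketch of the layer heuristic; names and rounding details are
// illustrative assumptions, not the backend's actual implementation.
static uint32_t estimate_gpu_layers(uint64_t available_vram,  // bytes free on the Vulkan device
                                    uint64_t model_size,      // total model bytes from GGUF metadata
                                    uint32_t n_layers) {      // *.block_count from GGUF metadata
    const uint64_t overhead = 800ull * 1024 * 1024;           // reserve for KV cache, compute buffers, system

    if (n_layers == 0 || available_vram <= overhead) {
        return 0;                                             // no headroom: keep everything on the CPU
    }

    const uint64_t bytes_per_layer = std::max<uint64_t>(model_size / n_layers, 1);
    const uint64_t budget          = available_vram - overhead;

    // Offload as many average-sized layers as fit, capped at the model's layer count.
    return (uint32_t) std::min<uint64_t>(budget / bytes_per_layer, n_layers);
}
```
The computed value is then used in place of an explicit `-ngl` setting.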
### Example Results
**AMD RX 6500 XT (4GB VRAM)**:
- Gemma 2B (1.6GB): **26/27 layers** offloaded → **2.5-3.1x faster**
- Llama 3.2 3B (1.9GB): **28/29 layers** offloaded → **~2x faster**
- Llama 2 7B (3.9GB): **21/33 layers** offloaded → **1.6x faster**
- Llama 2 13B (7.5GB): **14/41 layers** offloaded → **No OOM**
## Usage
### Automatic (Recommended)
Simply run without `-ngl` to enable the dynamic heuristic:
```bash
# Heuristic calculates optimal layers automatically
llama-cli -m models/gemma-2b-q4.gguf -p "Hello"
```
The heuristic will print debug info showing the calculation:
```
Vulkan dynamic heuristic: available_vram=3434 MB, model_size=1623 MB,
n_layers=27, overhead=800 MB, calculated_layers=26
```
### Manual Override
You can still manually specify layers to override the heuristic:
```bash
# Force specific number of layers
llama-cli -m models/gemma-2b-q4.gguf -p "Hello" -ngl 20
# Force CPU-only
llama-cli -m models/gemma-2b-q4.gguf -p "Hello" -ngl 0
```
## Performance
Compared to CPU-only (`-ngl 0`), the dynamic heuristic provides:
**Gemma 2B Q4_K_M on AMD RX 6500 XT**:
- Prompt processing: **2.5x faster** (497 → 1231 t/s)
- Token generation: **3.1x faster** (19.4 → 60.4 t/s)
## Troubleshooting
### Still Getting OOM Errors?
If you encounter "Out of Device Memory" errors despite the heuristic:
1. **Reduce context size**: Use `-c 2048` or lower
2. **Force fewer layers**: Use `-ngl 10` or lower
3. **Check available VRAM**: Close other GPU applications
4. **Use a smaller model**: Try a smaller quantization (e.g. Q4_K_M → Q3_K_S)
### Heuristic Not Triggering?
The heuristic only activates when:
- ✅ Vulkan backend is enabled (`GGML_USE_VULKAN=1` during build)
- ✅ `-ngl` is not specified (or is set to `-1`)
- ✅ GGUF file can be parsed for metadata
If you explicitly set `-ngl`, the heuristic is bypassed.
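In code terms, the trigger reduces to checking whether the requested layer count is still at its unset sentinel (`-1`). A minimal sketch, reusing the hypothetical `estimate_gpu_layers` helper from the earlier sketch (names are illustrative):
```cpp
#include <cstdint>

// Sketch only: choose the effective GPU layer count. In llama.cpp,
// n_gpu_layers defaults to -1, meaning "-ngl was not specified".
static uint32_t resolve_gpu_layers(int32_t requested_layers,   // value of -ngl, or -1 if unset
                                   uint64_t available_vram,
                                   uint64_t model_size,
                                   uint32_t n_layers) {
    if (requested_layers >= 0) {
        // An explicit -ngl (including 0 for CPU-only) bypasses the heuristic.
        return (uint32_t) requested_layers;
    }
    // Hypothetical helper from the "How It Works" sketch above.
    return estimate_gpu_layers(available_vram, model_size, n_layers);
}
```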
## Technical Details
### Overhead Calculation
The heuristic reserves **800 MB** for:
- KV cache (dynamically allocated by llama.cpp)
- Compute buffers (temporary tensors during inference)
- System overhead (driver, fragmentation)
This value is conservative and works well across different model sizes.
### Model Compatibility
The heuristic generalizes across model architectures by searching the GGUF metadata for the following keys (see the sketch below):
- `*.block_count` (layer count)
- `*.embedding_length` (model dimensions)
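In GGUF files the `*` prefix is the architecture name stored under the `general.architecture` key (for example `gemma.block_count` or `llama.block_count`). The sketch below shows how such a lookup could be done with ggml's gguf C API; the helper name, error handling, and exact header location are assumptions, not a copy of the backend's code.
```cpp
#include <string>
#include <cstdint>

#include "gguf.h"   // ggml's GGUF reader (may be declared in ggml.h in older trees)

// Sketch: read <arch>.block_count from a GGUF file; names are illustrative.
static bool read_block_count(const char * path, uint32_t & n_layers) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(path, params);
    if (!ctx) {
        return false;                      // metadata could not be parsed
    }

    bool ok = false;
    const int64_t arch_key = gguf_find_key(ctx, "general.architecture");
    if (arch_key >= 0) {
        const std::string arch = gguf_get_val_str(ctx, arch_key);   // e.g. "llama", "gemma"
        const int64_t layers_key = gguf_find_key(ctx, (arch + ".block_count").c_str());
        if (layers_key >= 0) {
            n_layers = gguf_get_val_u32(ctx, layers_key);
            ok = true;
        }
    }

    gguf_free(ctx);
    return ok;
}
```
Any architecture that publishes these standard keys works without architecture-specific code.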
Tested architectures:
- ✅ Gemma / Gemma 2
- ✅ Llama / Llama 2 / Llama 3
- ✅ Qwen / Qwen 2.5
## Benchmark Script
The `tests/6500xt_benchmark.ps1` script automates testing across different configurations:
```powershell
cd tests
.\6500xt_benchmark.ps1
```
The script compares CPU-only inference against the GPU heuristic and reports the performance difference.