Dynamic VRAM Allocation for Vulkan Backend
This document describes the dynamic VRAM allocation heuristic for llama.cpp's Vulkan backend, which automatically optimizes GPU layer offloading based on available VRAM.
Overview
The Vulkan backend now includes a dynamic heuristic that automatically calculates the optimal number of GPU layers to offload based on:
- Available VRAM on your GPU
- Model size and layer count (from GGUF metadata)
- Reserved overhead for KV cache and compute buffers
This enables good performance on low-VRAM devices (such as the AMD RX 6500 XT with 4 GB) without manual configuration or out-of-memory (OOM) errors.
How It Works
When you run llama-cli or llama-server without specifying -ngl (or with -ngl -1), the heuristic:
- Queries available VRAM from your Vulkan device
- Parses model metadata to determine model size and layer count
- Reserves a fixed overhead (800 MB) for the KV cache, compute buffers, and system use
- Calculates the number of layers to offload: (available_vram - overhead) / bytes_per_layer, where bytes_per_layer is the model size divided by its layer count
- Offloads that many layers automatically without risking OOM
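The calculation is small enough to show inline. The following is an illustrative, self-contained sketch of the formula, not the backend's actual code: the identifiers (estimate_gpu_layers, overhead, fit) are chosen here for readability, and the clamp to n_layers - 1 is inferred from the example results below (26/27, 28/29), not confirmed from the source.

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Illustrative sketch of the heuristic's formula; identifiers are hypothetical.
static int estimate_gpu_layers(uint64_t available_vram, // queried from the Vulkan device
                               uint64_t model_size,     // total tensor size from GGUF metadata
                               int      n_layers) {     // *.block_count (plus output layer)
    const uint64_t overhead = 800ull * 1024 * 1024;     // reserved for KV cache, compute buffers, system

    if (n_layers <= 1 || available_vram <= overhead) {
        return 0;                                        // no headroom: keep everything on the CPU
    }

    const uint64_t bytes_per_layer = std::max<uint64_t>(model_size / (uint64_t) n_layers, 1);
    const int      fit = (int) ((available_vram - overhead) / bytes_per_layer);

    // Keep at least one layer on the CPU even when the whole model fits
    // (this matches the 26/27 and 28/29 results reported in this document).
    return std::min(fit, n_layers - 1);
}

int main() {
    // Numbers from the debug line in the Usage section: 3434 MB free, 1623 MB model, 27 layers
    const uint64_t MiB = 1024ull * 1024;
    printf("calculated_layers = %d\n", estimate_gpu_layers(3434 * MiB, 1623 * MiB, 27)); // prints 26
}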
Example Results
AMD RX 6500 XT (4GB VRAM):
- Gemma 2B (1.6GB): 26/27 layers offloaded → 2.5-3.1x faster
- Llama 3.2 3B (1.9GB): 28/29 layers offloaded → ~2x faster
- Llama 2 7B (3.9GB): 21/33 layers offloaded → 1.6x faster
- Llama 2 13B (7.5GB): 14/41 layers offloaded → No OOM ✅
Usage
Automatic (Recommended)
Simply run without -ngl to enable the dynamic heuristic:
# Heuristic calculates optimal layers automatically
llama-cli -m models/gemma-2b-q4.gguf -p "Hello"
The heuristic will print debug info showing the calculation:
Vulkan dynamic heuristic: available_vram=3434 MB, model_size=1623 MB,
n_layers=27, overhead=800 MB, calculated_layers=26
Manual Override
You can still manually specify layers to override the heuristic:
# Force specific number of layers
llama-cli -m models/gemma-2b-q4.gguf -p "Hello" -ngl 20
# Force CPU-only
llama-cli -m models/gemma-2b-q4.gguf -p "Hello" -ngl 0
Performance
Compared to CPU-only (-ngl 0), the dynamic heuristic provides:
Gemma 2B Q4_K_M on AMD RX 6500 XT:
- Prompt processing: 2.5x faster (497 → 1231 t/s)
- Token generation: 3.1x faster (19.4 → 60.4 t/s)
Troubleshooting
Still Getting OOM Errors?
If you encounter "Out of Device Memory" errors despite the heuristic:
- Reduce context size: Use -c 2048 or lower
- Force fewer layers: Use -ngl 10 or lower
- Check available VRAM: Close other GPU applications
- Use a smaller model: Try a smaller quantization (Q4_K_M → Q3_K_S)
Heuristic Not Triggering?
The heuristic only activates when:
- ✅ Vulkan backend is enabled (GGML_USE_VULKAN=1 during build)
- ✅ -ngl is not specified (or set to -1)
- ✅ GGUF file can be parsed for metadata
If you explicitly set -ngl, the heuristic is bypassed.
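To make the activation rule concrete, here is a schematic with placeholder identifiers (vk_estimate_gpu_layers, resolve_gpu_layers, n_gpu_layers_cli are not the backend's real names): only a negative layer count hands control to the heuristic.

#include <cstdio>

// Placeholder standing in for the estimator described above.
static int vk_estimate_gpu_layers() { return 26; }      // hypothetical result

// Schematic of the activation rule; n_gpu_layers_cli stands for the parsed -ngl value,
// which defaults to -1 when the flag is not given.
static int resolve_gpu_layers(int n_gpu_layers_cli) {
    if (n_gpu_layers_cli < 0) {
        return vk_estimate_gpu_layers();                 // -ngl absent or -1: run the heuristic
    }
    return n_gpu_layers_cli;                             // explicit -ngl, including 0, bypasses it
}

int main() {
    printf("%d\n", resolve_gpu_layers(-1)); // heuristic decides (placeholder: 26)
    printf("%d\n", resolve_gpu_layers(20)); // user override: 20
    printf("%d\n", resolve_gpu_layers(0));  // CPU-only: 0
}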
Technical Details
Overhead Calculation
The heuristic reserves 800 MB for:
- KV cache (dynamically allocated by llama.cpp)
- Compute buffers (temporary tensors during inference)
- System overhead (driver, fragmentation)
This value is conservative and works well across different model sizes.
Model Compatibility
The heuristic generalizes across model architectures by searching the GGUF metadata for:
- *.block_count (layer count)
- *.embedding_length (model dimensions)
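As a rough illustration of that lookup, the sketch below uses ggml's gguf C reader (bundled with llama.cpp) to resolve the *. prefix via general.architecture and read the two keys. Whether the backend resolves the wildcard exactly this way is an assumption, read_model_shape is a name chosen here, and error handling is pared down to the essentials.

#include <cstdint>
#include <cstdio>
#include <string>
#include "gguf.h"   // GGUF reader from ggml (older trees declare these functions in ggml.h)

// Sketch: read <arch>.block_count and <arch>.embedding_length from a GGUF file.
static bool read_model_shape(const char * path, uint32_t & n_layers, uint32_t & n_embd) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(path, params);
    if (!ctx) {
        return false;                          // not a readable GGUF file
    }

    // "general.architecture" gives the key prefix, e.g. "llama", "gemma2", "qwen2"
    const auto arch_id = gguf_find_key(ctx, "general.architecture");
    if (arch_id < 0) { gguf_free(ctx); return false; }
    const std::string arch = gguf_get_val_str(ctx, arch_id);

    const auto blk_id = gguf_find_key(ctx, (arch + ".block_count").c_str());
    const auto emb_id = gguf_find_key(ctx, (arch + ".embedding_length").c_str());
    if (blk_id < 0 || emb_id < 0) { gguf_free(ctx); return false; }

    n_layers = gguf_get_val_u32(ctx, blk_id);  // layer count
    n_embd   = gguf_get_val_u32(ctx, emb_id);  // embedding width
    gguf_free(ctx);
    return true;
}

int main(int argc, char ** argv) {
    uint32_t n_layers = 0, n_embd = 0;
    if (argc > 1 && read_model_shape(argv[1], n_layers, n_embd)) {
        printf("block_count=%u embedding_length=%u\n", n_layers, n_embd);
    }
}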
Tested architectures:
- ✅ Gemma / Gemma 2
- ✅ Llama / Llama 2 / Llama 3
- ✅ Qwen / Qwen 2.5
Benchmark Script
The tests/6500xt_benchmark.ps1 script automates testing across different configurations:
cd tests
.\6500xt_benchmark.ps1
This compares CPU-only inference against the GPU heuristic and reports the performance improvement.