GGML-VirtGPU Backend
The GGML-VirtGPU backend enables GGML applications to run machine learning computations on host hardware while the application itself runs inside a virtual machine. It uses host-guest shared memory to efficiently share data buffers between the two sides.
This backend relies on the virtio-gpu device and the VirglRenderer API Remoting (APIR) component. The backend is split into two libraries:
- a GGML implementation (the "remoting frontend"), running in the guest and interacting with the virtgpu device
- a VirglRenderer APIR-compatible library (the "remoting backend"), running in the host and interacting with VirglRenderer and an actual GGML device backend.
Architecture Overview
The GGML-VirtGPU backend consists of three main components:
```
┌─────────────────────────────────────────┐
│            GGML Application             │
│            (llama.cpp, etc.)            │
└─────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────┐
│           Guest VM (Frontend)           │
│          ggml-virtgpu library           │
│                                         │
│   ┌─────────────────────────────────┐   │
│   │     GGML Backend Interface      │   │
│   └─────────────────────────────────┘   │
│                    ↓                    │
│   ┌─────────────────────────────────┐   │
│   │      VirtGPU Communication      │   │
│   │    (hypercalls + shared mem)    │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
                     ↓
      virtio-gpu / virglrenderer APIR
                     ↓
┌─────────────────────────────────────────┐
│          Host System (Backend)          │
│                                         │
│   ┌─────────────────────────────────┐   │
│   │       Backend Dispatcher        │   │
│   └─────────────────────────────────┘   │
│                    ↓                    │
│   ┌─────────────────────────────────┐   │
│   │      GGML Backend Library       │   │
│   │     (Metal/Vulkan/CPU/...)      │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
```
Key Components
- Guest-side Frontend (`ggml-virtgpu/`): Implements the GGML backend interface and forwards operations to the host
- Host-side Backend (`ggml-virtgpu/backend/`): Receives forwarded operations and executes them on actual hardware backends
- Communication Layer: Uses virtio-gpu hypercalls and shared memory for efficient data transfer
Features
- Dynamic backend loading on the host side (CPU, CUDA, Metal, etc.)
- Zero-copy data transfer via host-guest shared memory pages
Communication Protocol
Hypercalls and Shared Memory
The backend uses two primary communication mechanisms:
- Hypercalls (`DRM_IOCTL_VIRTGPU_EXECBUFFER`): Trigger remote execution from guest to host
- Shared Memory Pages: Zero-copy data transfer for tensors and parameters
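For illustration, the sketch below shows how a guest-side component could trigger such a hypercall through the DRM device. The helper name and the way the command buffer is built are assumptions for this example; only the `DRM_IOCTL_VIRTGPU_EXECBUFFER` ioctl and its `drm_virtgpu_execbuffer` argument are the standard virtio-gpu UAPI.

```cpp
// Illustrative sketch, not the actual frontend code: submit an
// already-serialized command buffer to the host via the virtio-gpu
// EXECBUFFER ioctl (a "hypercall" from the guest's point of view).
#include <cstdint>
#include <sys/ioctl.h>
#include <drm/virtgpu_drm.h>   // struct drm_virtgpu_execbuffer (kernel UAPI)

static int submit_hypercall(int drm_fd, const void *cmds, uint32_t size,
                            uint32_t ring_idx) {
    struct drm_virtgpu_execbuffer exec = {};
    exec.flags    = VIRTGPU_EXECBUF_RING_IDX;  // route to a per-context ring
    exec.command  = (uintptr_t) cmds;          // guest pointer to the commands
    exec.size     = size;                      // size of the command buffer
    exec.ring_idx = ring_idx;
    // Returns once the command is queued; the host's answer is read back
    // from the shared reply buffer (see "Shared Memory Layout" below).
    return ioctl(drm_fd, DRM_IOCTL_VIRTGPU_EXECBUFFER, &exec);
}
```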
Shared Memory Layout
Each connection uses two fixed shared memory buffers, plus dynamically allocated data buffers:
- Data Buffer (24 MiB): For command/response data and tensor transfers
- Reply Buffer (16 KiB): For command replies and status information
- Data Buffers: Dynamically allocated host-guest shared buffers that are exposed to GGML as regular backend buffers
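The layout below is a rough sketch of these regions; the structure and field names are hypothetical, only the sizes come from the list above.

```cpp
// Hypothetical illustration of the per-connection shared memory regions.
#include <cstddef>
#include <cstdint>

constexpr std::size_t DATA_BUFFER_SIZE  = 24 * 1024 * 1024;  // 24 MiB
constexpr std::size_t REPLY_BUFFER_SIZE = 16 * 1024;         // 16 KiB

struct SharedConnection {
    // Both regions are mapped once at connection setup and are visible to
    // the guest frontend and the host backend at the same time.
    uint8_t *data;   // 24 MiB: command/response payloads and tensor transfers
    uint8_t *reply;  // 16 KiB: command replies and status information
};
// Further host-guest shared buffers are allocated on demand and handed to
// GGML as regular backend buffers (zero-copy tensor storage).
```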
APIR Protocol
The VirglRenderer API Remoting (APIR) protocol defines three command types:
- `HANDSHAKE`: Protocol version negotiation and capability discovery
- `LOADLIBRARY`: Dynamic loading of backend libraries on the host
- `FORWARD`: API function call forwarding
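A sketch of how these command types could be represented; the numeric values are placeholders, the real encoding is fixed by the APIR protocol.

```cpp
#include <cstdint>

// Placeholder values for illustration; the actual encoding is defined by
// the VirglRenderer APIR protocol.
enum class ApirCommand : uint32_t {
    HANDSHAKE   = 0,  // negotiate the protocol version, discover capabilities
    LOADLIBRARY = 1,  // ask the host to load the remoting backend library
    FORWARD     = 2,  // forward one GGML API call to the host backend
};
```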
Binary Serialization
Commands and data are serialized using a custom binary protocol with:
- Fixed-size encoding for basic types
- Variable-length arrays with size prefixes
- Buffer bounds checking
- Error recovery mechanisms
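The encoder below sketches the first three points (fixed-size scalars, size-prefixed arrays, bounds checking). It is a simplified stand-in for the generated protocol code, with hypothetical names.

```cpp
// Simplified stand-in for the generated serialization code.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

struct Encoder {
    uint8_t     *buf;      // points into the shared data buffer
    std::size_t  cap;      // capacity of the region
    std::size_t  pos = 0;  // write cursor

    bool write_bytes(const void *src, std::size_t n) {
        if (pos + n > cap) return false;   // bounds check: never overrun
        std::memcpy(buf + pos, src, n);
        pos += n;
        return true;
    }

    template <typename T>
    bool write_scalar(T v) {               // fixed-size encoding of basic types
        static_assert(std::is_trivially_copyable_v<T>);
        return write_bytes(&v, sizeof v);
    }

    bool write_array(const void *data, uint32_t count, std::size_t elem_size) {
        // variable-length array: 32-bit size prefix, then the raw elements
        return write_scalar(count) && write_bytes(data, count * elem_size);
    }
};
```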
Supported Operations
Device Operations
- Device enumeration and capability queries
- Memory information (total/free)
- Backend type detection
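For example, a memory-information query can be answered on the host with the regular GGML device API; the wrappers below are hypothetical handlers, but the `ggml_backend_dev_*` calls are standard.

```cpp
#include <cstddef>
#include <cstdio>
#include "ggml-backend.h"

// Hypothetical host-side handlers for the device queries listed above.
static void handle_device_memory(ggml_backend_dev_t dev,
                                 std::size_t *free_mem, std::size_t *total_mem) {
    ggml_backend_dev_memory(dev, free_mem, total_mem);  // free/total bytes
}

static void handle_device_describe(ggml_backend_dev_t dev) {
    // Name and backend type, serialized into the reply buffer for the guest.
    std::printf("device: %s (type %d)\n",
                ggml_backend_dev_name(dev), (int) ggml_backend_dev_type(dev));
}
```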
Buffer Operations
- Buffer allocation and deallocation
- Tensor data transfer (host ↔ guest)
- Memory copying and clearing
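A minimal sketch of how a tensor write could be applied on the host: the payload already sits in a host-guest shared buffer, so the host only needs the regular `ggml_backend_tensor_set()` call. The handler name is hypothetical.

```cpp
#include <cstddef>
#include "ggml-backend.h"

// Hypothetical handler: apply a guest "set tensor data" request on the host.
// The payload was written by the guest into host-guest shared memory, so no
// additional copy across the virtio transport is needed.
static void handle_tensor_set(struct ggml_tensor *tensor,
                              const void *shared_payload,
                              std::size_t offset, std::size_t size) {
    ggml_backend_tensor_set(tensor, shared_payload, offset, size);
}
```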
Computation Operations
- Graph execution forwarding
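On the host side, graph execution forwarding boils down to rebuilding the `ggml_cgraph` from its serialized form and running it on the real backend. The handler below is a hypothetical simplification; `ggml_backend_graph_compute()` is the standard GGML call.

```cpp
#include <cstdint>
#include "ggml-backend.h"

// Hypothetical simplification of the host-side graph-compute handler: the
// graph has already been rebuilt from the guest's serialized representation.
static uint32_t handle_graph_compute(ggml_backend_t backend,
                                     struct ggml_cgraph *graph) {
    enum ggml_status status = ggml_backend_graph_compute(backend, graph);
    return (uint32_t) status;  // status code is sent back in the reply buffer
}
```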
Build Requirements
Guest-side Dependencies
- `libdrm` for DRM/virtio-gpu communication
- C++20 compatible compiler
- CMake 3.14+
Host-side Dependencies
- virglrenderer with APIR support (pending upstream review)
- Target backend libraries (libggml-metal, libggml-vulkan, etc.)
Configuration
Environment Variables
- `GGML_VIRTGPU_BACKEND_LIBRARY`: Path to the host-side backend library (see the sketch below)
- `GGML_VIRTGPU_DEBUG`: Enable debug logging
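As an illustration of how the host side can use `GGML_VIRTGPU_BACKEND_LIBRARY`, the sketch below resolves the variable and loads the library with `dlopen()`; the helper itself is hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

// Hypothetical helper: load the host-side backend library named by
// GGML_VIRTGPU_BACKEND_LIBRARY. Symbols are then resolved with dlsym().
static void *load_backend_library() {
    const char *path = std::getenv("GGML_VIRTGPU_BACKEND_LIBRARY");
    if (path == nullptr) {
        std::fprintf(stderr, "GGML_VIRTGPU_BACKEND_LIBRARY is not set\n");
        return nullptr;
    }
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        std::fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
    }
    return handle;
}
```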
Build Options
- `GGML_VIRTGPU`: Enable the VirtGPU backend (`ON` or `OFF`, default: `OFF`)
- `GGML_VIRTGPU_BACKEND`: Build the host-side backend component (`ON`, `OFF` or `ONLY`, default: `OFF`)
System Requirements
- VM with virtio-gpu support
- VirglRenderer with APIR patches
- Compatible backend libraries on host
Limitations
- VM-specific: Only works in virtual machines with virtio-gpu support
- Host dependency: Requires properly configured host-side backend
- Latency: Small overhead from a guest-to-host round trip for each forwarded operation
Development
Code Generation
The backend uses code generation from YAML configuration:
```
# Regenerate protocol code
cd ggml-virtgpu/
python regenerate_remoting.py
```
Adding New Operations
- Add the function definition to `ggmlremoting_functions.yaml`
- Regenerate the code with `regenerate_remoting.py`
- Implement the guest-side forwarding in `virtgpu-forward-*.cpp`
- Implement the host-side handling in `backend-dispatched-*.cpp`
Limitations
- This work is pending upstream changes in the VirglRenderer project.
  - The backend can be tested with VirglRenderer compiled from source using this PR: https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590
- This work is pending changes in the VMM/hypervisor running the virtual machine, which needs to know how to route the newly introduced APIR capset.
  - The environment variable `VIRGL_ROUTE_VENUS_TO_APIR=1` allows using the Venus capset until the relevant hypervisors have been patched. However, setting this flag breaks the normal Vulkan/Venus behavior.
  - The environment variable `GGML_REMOTING_USE_APIR_CAPSET` tells the `ggml-virtgpu` backend to use the APIR capset. This will become the default when the relevant hypervisors have been patched.
- This work focused on improving the performance of llama.cpp running on macOS containers, and is mainly tested on this platform. Linux support (via `krun`) is in progress.