
GGML-VirtGPU Backend

The GGML-VirtGPU backend enables GGML applications to run machine learning computations on host hardware while the application itself runs inside a virtual machine. It uses host-guest shared memory to efficiently share data buffers between the two sides.

This backend relies on the virtio-gpu device and the VirglRenderer API Remoting (APIR) component. The backend is split into two libraries:

  • a GGML backend implementation (the "remoting frontend"), running in the guest and interacting with the virtio-gpu device
  • a VirglRenderer APIR-compatible library (the "remoting backend"), running in the host and interacting with VirglRenderer and an actual GGML device backend

Architecture Overview

The GGML-VirtGPU backend consists of three main components:

┌─────────────────────────────────────────┐
│           GGML Application              │
│         (llama.cpp, etc.)               │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│        Guest VM (Frontend)              │
│      ggml-virtgpu library               │
│                                         │
│  ┌─────────────────────────────────┐    │
│  │     GGML Backend Interface      │    │
│  └─────────────────────────────────┘    │
│                    ↓                    │
│  ┌─────────────────────────────────┐    │
│  │      VirtGPU Communication      │    │
│  │    (hypercalls + shared mem)    │    │
│  └─────────────────────────────────┘    │
└─────────────────────────────────────────┘
                    ↓
      virtio-gpu / virglrenderer APIR
                    ↓
┌─────────────────────────────────────────┐
│          Host System (Backend)          │
│                                         │
│  ┌─────────────────────────────────┐    │
│  │      Backend Dispatcher         │    │
│  └─────────────────────────────────┘    │
│                    ↓                    │
│  ┌─────────────────────────────────┐    │
│  │    GGML Backend Library         │    │
│  │   (Metal/Vulkan/CPU/...)        │    │
│  └─────────────────────────────────┘    │
└─────────────────────────────────────────┘

Key Components

  1. Guest-side Frontend (ggml-virtgpu/): Implements the GGML backend interface and forwards operations to the host
  2. Host-side Backend (ggml-virtgpu/backend/): Receives forwarded operations and executes them on actual hardware backends
  3. Communication Layer: Uses virtio-gpu hypercalls and shared memory for efficient data transfer

Features

  • Dynamic backend loading on the host side (CPU, CUDA, Metal, etc.)
  • Zero-copy data transfer via host-guest shared memory pages
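
On the host side, dynamic loading can go through GGML's standard backend registry. A minimal sketch, assuming the library path is supplied by the caller (the path and error handling are illustrative, not the actual dispatcher code):

// Hypothetical host-side snippet: load a GGML backend library at runtime.
// ggml_backend_load() is part of GGML's public registry API (ggml-backend.h).
#include "ggml-backend.h"
#include <cstdio>

static bool load_host_backend(const char * path) {
    ggml_backend_reg_t reg = ggml_backend_load(path); // e.g. "libggml-metal.dylib"
    if (reg == nullptr) {
        fprintf(stderr, "failed to load backend library: %s\n", path);
        return false;
    }
    printf("loaded backend: %s\n", ggml_backend_reg_name(reg));
    return true;
}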

Communication Protocol

Hypercalls and Shared Memory

The backend uses two primary communication mechanisms:

  1. Hypercalls (DRM_IOCTL_VIRTGPU_EXECBUFFER): Trigger remote execution from guest to host (see the sketch after this list)
  2. Shared Memory Pages: Zero-copy data transfer for tensors and parameters
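
The guest triggers remote execution by submitting a serialized command to the virtio-gpu device with the EXECBUFFER ioctl. A minimal sketch, assuming a render node such as /dev/dri/renderD128 is already open and treating the command payload as opaque (the real encoding is generated protocol code):

// Hypothetical guest-side hypercall: hand an opaque command buffer to virtio-gpu.
// Requires the libdrm headers (xf86drm.h, virtgpu_drm.h).
#include <cstdint>
#include <cstring>
#include <xf86drm.h>
#include <libdrm/virtgpu_drm.h>

static int submit_hypercall(int drm_fd, const void * cmd, uint32_t cmd_size) {
    struct drm_virtgpu_execbuffer eb;
    memset(&eb, 0, sizeof(eb));
    eb.command = (uintptr_t) cmd; // pointer to the serialized command
    eb.size    = cmd_size;        // command buffer size in bytes
    return drmIoctl(drm_fd, DRM_IOCTL_VIRTGPU_EXECBUFFER, &eb); // 0 on success
}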

Shared Memory Layout

Each connection uses two fixed shared memory buffers, plus dynamically allocated data buffers:

  • Data Buffer (24 MiB): For command/response data and tensor transfers
  • Reply Buffer (16 KiB): For command replies and status information
  • Data Buffers: Dynamically allocated host-guest shared buffers that are exposed as GGML buffers
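
For concreteness, the two fixed sizes can be written as constants; the identifiers below are illustrative, not the names used in the source:

// Hypothetical constants mirroring the layout described above.
#include <cstddef>

constexpr size_t APIR_DATA_BUFFER_SIZE  = 24u * 1024 * 1024; // command/response data, tensor transfers
constexpr size_t APIR_REPLY_BUFFER_SIZE = 16u * 1024;        // command replies, status information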

APIR Protocol

The VirglRenderer API Remoting (APIR) protocol defines three command types:

  • HANDSHAKE: Protocol version negotiation and capability discovery
  • LOADLIBRARY: Dynamic loading of backend libraries on the host
  • FORWARD: API function call forwarding
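
These commands could be modeled as a small tag enum; the names and values below are hypothetical and only mirror the list above:

// Hypothetical command-type tags for the APIR protocol.
#include <cstdint>

enum apir_command_type : uint32_t {
    APIR_COMMAND_TYPE_HANDSHAKE   = 0, // version negotiation, capability discovery
    APIR_COMMAND_TYPE_LOADLIBRARY = 1, // load a backend library on the host
    APIR_COMMAND_TYPE_FORWARD     = 2, // forward an API function call
};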

Binary Serialization

Commands and data are serialized using a custom binary protocol with:

  • Fixed-size encoding for basic types
  • Variable-length arrays with size prefixes
  • Buffer bounds checking
  • Error recovery mechanisms
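
The shape of such an encoder is straightforward; the sketch below is a simplified stand-in for the generated serializer, showing fixed-size encoding, size-prefixed arrays, and bounds checking:

// Simplified sketch of a bounds-checked binary encoder (illustrative only).
#include <cstddef>
#include <cstdint>
#include <cstring>

struct encoder {
    uint8_t * buf;
    size_t    cap;
    size_t    pos;
    bool      ok;
};

static bool enc_write(encoder & e, const void * p, size_t n) {
    if (e.pos + n > e.cap) { e.ok = false; return false; } // bounds check
    memcpy(e.buf + e.pos, p, n);
    e.pos += n;
    return true;
}

template <typename T>
static bool enc_value(encoder & e, T v) {
    return enc_write(e, &v, sizeof(v)); // fixed-size encoding for basic types
}

static bool enc_array(encoder & e, const void * p, uint32_t n) {
    return enc_value(e, n) && enc_write(e, p, n); // size prefix, then payload
}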

Supported Operations

Device Operations

  • Device enumeration and capability queries
  • Memory information (total/free)
  • Backend type detection
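
From the guest application's point of view, these queries go through the regular GGML device API, and the VirtGPU backend forwards them to the host. A small usage sketch:

// Enumerate devices and query memory through the standard GGML API;
// with this backend, the answers come from the host-side device.
#include "ggml-backend.h"
#include <cstdio>

static void list_devices(void) {
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        size_t free_mem = 0, total_mem = 0;
        ggml_backend_dev_memory(dev, &free_mem, &total_mem);
        printf("%s: %zu/%zu MiB free\n", ggml_backend_dev_name(dev),
               free_mem / (1024 * 1024), total_mem / (1024 * 1024));
    }
}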

Buffer Operations

  • Buffer allocation and deallocation
  • Tensor data transfer (host ↔ guest)
  • Memory copying and clearing
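
Tensor data transfer also uses the standard GGML calls; with this backend, the bytes travel over the host-guest shared memory. A sketch, assuming the tensor already lives in a backend buffer:

// Copy data into and out of a tensor allocated in a backend buffer.
#include "ggml-backend.h"

static void roundtrip(struct ggml_tensor * t, const float * src, float * dst, size_t nbytes) {
    ggml_backend_tensor_set(t, src, 0, nbytes); // guest -> host
    ggml_backend_tensor_get(t, dst, 0, nbytes); // host -> guest
}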

Computation Operations

  • Graph execution forwarding
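
Graphs are forwarded wholesale: the guest builds the compute graph as usual, and a single call ships it to the host for execution. A sketch, assuming backend and graph were set up with the usual GGML calls:

// The entire compute graph is executed on the host in one forwarded call.
#include "ggml-backend.h"

static enum ggml_status run_graph(ggml_backend_t backend, struct ggml_cgraph * graph) {
    return ggml_backend_graph_compute(backend, graph);
}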

Build Requirements

Guest-side Dependencies

  • libdrm for DRM/virtio-gpu communication
  • C++20-compatible compiler
  • CMake 3.14+

Host-side Dependencies

  • virglrenderer with APIR support (pending upstream review)
  • Target backend libraries (libggml-metal, libggml-vulkan, etc.)

Configuration

Environment Variables

  • GGML_VIRTGPU_BACKEND_LIBRARY: Path to the host-side backend library
  • GGML_VIRTGPU_DEBUG: Enable debug logging

Build Options

  • GGML_VIRTGPU: Enable the VirtGPU backend (ON or OFF, default: OFF)
  • GGML_VIRTGPU_BACKEND: Build the host-side backend component (ON, OFF or ONLY, default: OFF)

System Requirements

  • VM with virtio-gpu support
  • VirglRenderer with APIR patches
  • Compatible backend libraries on host

Limitations

  • VM-specific: Only works in virtual machines with virtio-gpu support
  • Host dependency: Requires properly configured host-side backend
  • Latency: Small overhead from a guest-to-host transition (hypercall) for each forwarded operation

Development

Code Generation

The backend's protocol code is generated from a YAML configuration:

# Regenerate protocol code
cd ggml-virtgpu/
python regenerate_remoting.py

Adding New Operations

  1. Add function definition to ggmlremoting_functions.yaml
  2. Regenerate code with regenerate_remoting.py
  3. Implement guest-side forwarding in virtgpu-forward-*.cpp
  4. Implement host-side handling in backend-dispatched-*.cpp
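
As a rough illustration of step 4, a handler decodes its arguments from the command buffer, calls into the real GGML backend, and encodes the reply. Everything below is hypothetical (the real signatures come from the generated code); enc_value() is the illustrative encoder helper sketched in the Binary Serialization section:

// Hypothetical shape of a host-side handler in backend-dispatched-*.cpp.
#include "ggml-backend.h"

static uint32_t handle_get_device_count(encoder & enc) {
    uint32_t count = (uint32_t) ggml_backend_dev_count();
    enc_value(enc, count); // write the reply into the reply buffer
    return 0;              // status: success
}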

Upstream Status

  • This work is pending upstream changes in the VirglRenderer project.

  • This work is pending changes in the VMM/hypervisor running the virtual machine, which needs to know how to route the newly introduced APIR capset.

    • The environment variable VIRGL_ROUTE_VENUS_TO_APIR=1 allows using the Venus capset until the relevant hypervisors have been patched. However, setting this flag breaks normal Vulkan/Venus behavior.
    • The environment variable GGML_REMOTING_USE_APIR_CAPSET tells the ggml-virtgpu backend to use the APIR capset. This will become the default once the relevant hypervisors have been patched.
  • This work focused on improving the performance of llama.cpp running in macOS containers, and is mainly tested on that platform. Linux support (via krun) is in progress.