# GGML-VirtGPU Backend

The GGML-VirtGPU backend enables GGML applications to run machine
learning computations on host hardware while the application itself
runs inside a virtual machine. It uses host-guest shared memory to
share data buffers efficiently between the two sides.

This backend relies on virtio-gpu and the VirglRenderer API Remoting
(APIR) component. The backend is split into two libraries:

- a GGML backend implementation (the "remoting frontend"), running in
  the guest and interacting with the virtio-gpu device
- a VirglRenderer APIR-compatible library (the "remoting backend"),
  running on the host and interacting with VirglRenderer and an actual
  GGML device backend

## Architecture Overview

The GGML-VirtGPU backend consists of three main components:

```
┌─────────────────────────────────────────┐
│             GGML Application            │
│            (llama.cpp, etc.)            │
└─────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────┐
│           Guest VM (Frontend)           │
│          ggml-virtgpu library           │
│                                         │
│  ┌─────────────────────────────────┐    │
│  │   GGML Backend Interface        │    │
│  └─────────────────────────────────┘    │
│                  ↓                      │
│  ┌─────────────────────────────────┐    │
│  │   VirtGPU Communication         │    │
│  │   (hypercalls + shared mem)     │    │
│  └─────────────────────────────────┘    │
└─────────────────────────────────────────┘
                     ↓
      virtio-gpu / virglrenderer APIR
                     ↓
┌─────────────────────────────────────────┐
│          Host System (Backend)          │
│                                         │
│  ┌─────────────────────────────────┐    │
│  │   Backend Dispatcher            │    │
│  └─────────────────────────────────┘    │
│                  ↓                      │
│  ┌─────────────────────────────────┐    │
│  │   GGML Backend Library          │    │
│  │   (Metal/Vulkan/CPU/...)        │    │
│  └─────────────────────────────────┘    │
└─────────────────────────────────────────┘
```

### Key Components

1. **Guest-side Frontend** (`ggml-virtgpu/`): Implements the GGML backend interface and forwards operations to the host
2. **Host-side Backend** (`ggml-virtgpu/backend/`): Receives forwarded operations and executes them on actual hardware backends
3. **Communication Layer**: Uses virtio-gpu hypercalls and shared memory for efficient data transfer

## Features

- **Dynamic backend loading** on the host side (CPU, CUDA, Metal, etc.)
- **Zero-copy data transfer** via host-guest shared memory pages

## Communication Protocol

### Hypercalls and Shared Memory

The backend uses two primary communication mechanisms:

1. **Hypercalls (`DRM_IOCTL_VIRTGPU_EXECBUFFER`)**: Trigger remote execution from guest to host (see the sketch after this list)
2. **Shared Memory Pages**: Zero-copy data transfer for tensors and parameters

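On the guest side, a hypercall boils down to an `EXECBUFFER` ioctl on the virtio-gpu DRM render node. The snippet below is only an illustrative sketch of this mechanism, assuming the libdrm/kernel UAPI headers are on the include path; the render node path and the command payload are placeholders, and the real payload layout is defined by the APIR protocol.

```cpp
// Illustrative sketch of the hypercall mechanism, not the backend's actual code.
#include <cstdint>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <drm/virtgpu_drm.h>   // from the libdrm / kernel UAPI headers

static int submit_hypercall(int drm_fd, const void *cmd, uint32_t cmd_size) {
    struct drm_virtgpu_execbuffer exec = {};
    exec.command = (uint64_t)(uintptr_t)cmd;  // guest pointer to the serialized command
    exec.size    = cmd_size;                  // size of the serialized command, in bytes
    return ioctl(drm_fd, DRM_IOCTL_VIRTGPU_EXECBUFFER, &exec);
}

int main() {
    int fd = open("/dev/dri/renderD128", O_RDWR);  // virtio-gpu render node (path may differ)
    if (fd < 0) return 1;
    uint32_t placeholder[4] = {};                  // placeholder payload, not a real APIR command
    return submit_hypercall(fd, placeholder, sizeof placeholder) < 0 ? 1 : 0;
}
```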
#### Shared Memory Layout

Each connection uses the following host-guest shared memory buffers:

- **Data Buffer** (24 MiB): For command/response data and tensor transfers
- **Reply Buffer** (16 KiB): For command replies and status information
- **Data Buffers**: Dynamically allocated host-guest shared buffers that back the GGML buffers

### APIR Protocol

The VirglRenderer API Remoting protocol defines three command types:

- `HANDSHAKE`: Protocol version negotiation and capability discovery
- `LOADLIBRARY`: Dynamic loading of backend libraries on the host
- `FORWARD`: API function call forwarding

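The numeric identifiers and framing below are hypothetical placeholders used purely to illustrate how a request is tagged with one of these command types; the real definitions live in the VirglRenderer APIR headers and in the generated protocol code.

```cpp
// Hypothetical framing of an APIR command; identifiers and layout are placeholders.
#include <cstdint>
#include <cstdio>

enum ApirCommandType : uint32_t {
    APIR_COMMAND_TYPE_HANDSHAKE   = 0,  // version negotiation and capability discovery
    APIR_COMMAND_TYPE_LOADLIBRARY = 1,  // load the GGML backend library on the host
    APIR_COMMAND_TYPE_FORWARD     = 2,  // forward one GGML API call
};

struct ApirCommandHeader {
    uint32_t type;         // one of ApirCommandType
    uint32_t payload_size; // size of the serialized arguments that follow
};

int main() {
    ApirCommandHeader hdr{APIR_COMMAND_TYPE_FORWARD, 64};
    std::printf("type=%u payload=%u bytes\n", hdr.type, hdr.payload_size);
}
```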
### Binary Serialization

Commands and data are serialized using a custom binary protocol with:

- Fixed-size encoding for basic types
- Variable-length arrays with size prefixes
- Buffer bounds checking
- Error recovery mechanisms

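The exact wire format is produced by the code generator (see Development below). The following is a rough, hypothetical illustration of the size-prefixed, bounds-checked encoding style described above; the `Encoder` type and the command identifier are stand-ins, not the actual protocol.

```cpp
// Hypothetical illustration of size-prefixed, bounds-checked encoding into a
// shared buffer; the real wire format is produced by the code generator.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

struct Encoder {
    uint8_t *buf;
    size_t   cap;
    size_t   off = 0;

    void write_bytes(const void *p, size_t n) {
        if (off + n > cap)                      // bounds check before every write
            throw std::runtime_error("encode overflow");
        std::memcpy(buf + off, p, n);
        off += n;
    }

    // Fixed-size encoding for basic types
    void write_u32(uint32_t v) { write_bytes(&v, sizeof v); }

    // Variable-length array with a size prefix
    void write_array(const void *data, uint32_t n) {
        write_u32(n);
        write_bytes(data, n);
    }
};

int main() {
    std::vector<uint8_t> shmem(24 * 1024 * 1024);  // stand-in for the 24 MiB data buffer
    Encoder enc{shmem.data(), shmem.size()};
    enc.write_u32(2);                              // hypothetical command identifier
    float weights[4] = {0.f, 1.f, 2.f, 3.f};
    enc.write_array(weights, sizeof weights);      // size-prefixed tensor payload
}
```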
## Supported Operations

### Device Operations

- Device enumeration and capability queries
- Memory information (total/free)
- Backend type detection

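Because the frontend implements the standard GGML backend interface, a guest application performs these queries through the regular GGML device API. A minimal sketch, assuming a recent `ggml-backend.h` and that the VirtGPU backend is enabled in the build:

```cpp
// Enumerate devices and query memory info; with the VirtGPU backend, these
// queries are forwarded to the host-side GGML backend.
#include <cstddef>
#include <cstdio>
#include "ggml-backend.h"

int main() {
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        size_t free = 0, total = 0;
        ggml_backend_dev_memory(dev, &free, &total);
        std::printf("device %zu: %s (%s), %zu/%zu bytes free\n",
                    i, ggml_backend_dev_name(dev), ggml_backend_dev_description(dev),
                    free, total);
    }
}
```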

### Buffer Operations

- Buffer allocation and deallocation
- Tensor data transfer (host ↔ guest)
- Memory copying and clearing

### Computation Operations

- Graph execution forwarding

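Graph execution is driven through the ordinary GGML API; the backend forwards the compute call to the host. A minimal end-to-end sketch (assuming a recent GGML tree; error handling omitted) that adds two small tensors on the first enumerated device:

```cpp
// Minimal sketch: build a tiny graph and execute it; with the VirtGPU backend,
// the compute call is forwarded to the host-side GGML backend.
#include <cstdio>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

int main() {
    // Initialize a backend from the first enumerated device.
    ggml_backend_dev_t dev = ggml_backend_dev_get(0);
    ggml_backend_t backend = ggml_backend_dev_init(dev, nullptr);

    // Context that only holds tensor/graph metadata; data lives in backend buffers.
    ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,
    };
    ggml_context * ctx = ggml_init(params);

    ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_tensor * c = ggml_add(ctx, a, b);

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // Allocate the tensors in a backend buffer (host-guest shared memory here).
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    const float a_data[4] = {1, 2, 3, 4};
    const float b_data[4] = {10, 20, 30, 40};
    ggml_backend_tensor_set(a, a_data, 0, sizeof a_data);
    ggml_backend_tensor_set(b, b_data, 0, sizeof b_data);

    ggml_backend_graph_compute(backend, gf);   // forwarded to the host backend

    float out[4];
    ggml_backend_tensor_get(c, out, 0, sizeof out);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);

    ggml_backend_buffer_free(buf);
    ggml_free(ctx);
    ggml_backend_free(backend);
}
```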
## Build Requirements

### Guest-side Dependencies

- `libdrm` for DRM/virtio-gpu communication
- C++20 compatible compiler
- CMake 3.14+

### Host-side Dependencies

- virglrenderer with APIR support (pending upstream review)
- Target backend libraries (libggml-metal, libggml-vulkan, etc.)

## Configuration

### Environment Variables

- `GGML_VIRTGPU_BACKEND_LIBRARY`: Path to the host-side backend library
- `GGML_VIRTGPU_DEBUG`: Enable debug logging

### Build Options

- `GGML_VIRTGPU`: Enable the VirtGPU backend (`ON` or `OFF`, default: `OFF`)
- `GGML_VIRTGPU_BACKEND`: Build the host-side backend component (`ON`, `OFF` or `ONLY`, default: `OFF`)

### System Requirements

- VM with virtio-gpu support
- VirglRenderer with APIR patches
- Compatible backend libraries on the host

## Limitations

- **VM-specific**: Only works in virtual machines with virtio-gpu support
- **Host dependency**: Requires a properly configured host-side backend
- **Latency**: Small overhead from the guest-to-host transition on each forwarded operation

## Development

### Code Generation

The backend uses code generation from a YAML configuration:

```bash
# Regenerate protocol code
cd ggml-virtgpu/
python regenerate_remoting.py
```

### Adding New Operations

1. Add the function definition to `ggmlremoting_functions.yaml`
2. Regenerate the code with `regenerate_remoting.py`
3. Implement guest-side forwarding in `virtgpu-forward-*.cpp`
4. Implement host-side handling in `backend-dispatched-*.cpp`

## Upstream Status

* This work is pending upstream changes in the VirglRenderer project.
* The backend can be tested with VirglRenderer compiled from source
  using this merge request:
  https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590
* This work is also pending changes in the VMM/hypervisor running the
  virtual machine, which needs to know how to route the newly
  introduced APIR capset.
* The environment variable `VIRGL_ROUTE_VENUS_TO_APIR=1` allows using
  the Venus capset until the relevant hypervisors have been patched.
  However, setting this flag breaks normal Vulkan/Venus behavior.
* The environment variable `GGML_REMOTING_USE_APIR_CAPSET` tells the
  `ggml-virtgpu` backend to use the APIR capset. This will become the
  default once the relevant hypervisors have been patched.
* This work focused on improving the performance of llama.cpp running
  in macOS containers, and is mainly tested on that platform. Linux
  support (via `krun`) is in progress.