ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690)
* ggml-zendnn : add ZenDNN backend support
* ggml-zendnn : address ZenDNN backend review fixes and suggestions
* docs : apply blockquote syntax to ZenDNN docs

Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>
commit 017761daf5 (parent c42712b056)
@@ -276,6 +276,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
| [MUSA](docs/build.md#musa) | Moore Threads GPU |
| [CUDA](docs/build.md#cuda) | Nvidia GPU |
| [HIP](docs/build.md#hip) | AMD GPU |
| [ZenDNN](docs/build.md#zendnn) | AMD CPU |
| [Vulkan](docs/build.md#vulkan) | GPU |
| [CANN](docs/build.md#cann) | Ascend NPU |
| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
@@ -0,0 +1,258 @@
# llama.cpp for AMD ZenDNN

> [!WARNING]
> **Note:** ZenDNN is **not** the same as zDNN.
> - **ZenDNN** (this page): AMD's deep learning library for AMD EPYC CPUs
> - **zDNN**: IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes ([see zDNN documentation](zDNN.md))
- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [Supported Operations](#supported-operations)
- [DataType Supports](#datatype-supports)
- [Linux](#linux)
- [Environment Variable](#environment-variable)
- [Performance Optimization](#performance-optimization)
- [Known Issues](#known-issues)
- [Q&A](#qa)
- [TODO](#todo)
## Background

**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures.

**Llama.cpp + ZenDNN**

The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It utilizes ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL BLIS, LibXSMM, OneDNN).

For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html
## OS

| OS    | Status  | Verified                   |
|:-----:|:-------:|:--------------------------:|
| Linux | Support | Ubuntu 20.04, 22.04, 24.04 |

For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/zendnnl/README.md#15-supported-os).
## Hardware

### AMD CPUs

**Recommended Processors**

ZenDNN is optimized for AMD EPYC™ processors and AMD Ryzen™ processors based on the "Zen" microarchitecture and newer.

| CPU Family                     | Status  | Notes                              |
|:------------------------------:|:-------:|:----------------------------------:|
| AMD EPYC™ 9005 Series (Turin)  | Support | 5th Gen - Zen 5 architecture       |
| AMD EPYC™ 9004 Series (Genoa)  | Support | 4th Gen - Zen 4 architecture       |
| AMD EPYC™ 7003 Series (Milan)  | Support | 3rd Gen - Zen 3 architecture       |
| AMD Ryzen™ AI MAX (Strix Halo) | Support | High-performance mobile processors |

*Notes:*

- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., EPYC 9005 series).
- ZenDNN leverages AMD's advanced CPU features, including the AVX2 and AVX-512 instruction sets.
- For optimal performance, ensure your system has sufficient memory bandwidth.
## Supported Operations

The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend.

| Operation | Status | Notes                                |
|:----------|:------:|:------------------------------------:|
| MUL_MAT   | ✓      | Accelerated via ZenDNN LowOHA MatMul |

*Note:* Since only MUL_MAT is accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs).
## DataType Supports

| DataType | Status  | Notes                                      |
|:--------:|:-------:|:------------------------------------------:|
| FP32     | Support | Full precision floating point              |
| BF16     | Support | BFloat16 (best performance on Zen 4/Zen 5) |

*Note:*

- **BF16** provides the best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin); a conversion sketch follows below.
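If your model is not already in BF16, a BF16 GGUF can be produced with llama.cpp's own conversion script. A sketch, where the checkpoint directory and output path are placeholders:

```sh
# Convert a Hugging Face checkpoint to a BF16 GGUF (paths are examples)
python3 convert_hf_to_gguf.py /path/to/hf-model --outtype bf16 --outfile models/model.BF16.gguf
```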
## Linux

### I. Setup Environment

You have two options to set up ZenDNN:

#### Option 1: Automatic Download and Build (Recommended)

CMake will automatically download and build ZenDNN for you:

```sh
# Build llama.cpp - ZenDNN will be automatically downloaded and built
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```

No manual ZenDNN installation is required; CMake handles everything automatically. A quick smoke test follows below.
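To confirm the backend was compiled in and that its MUL_MAT path produces correct results, the backend-ops test that ships with llama.cpp can be used. A sketch; `-o` filters to a single op, and exact flag support may vary by version:

```sh
# Run ggml's backend operation tests, filtered to matrix multiplication
./build/bin/test-backend-ops test -o MUL_MAT
```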
#### Option 2: Use Custom ZenDNN Installation

If you want to build ZenDNN yourself or use a specific version:

**Step 1: Build ZenDNN from source**

```sh
# Clone ZenDNN repository
git clone https://github.com/amd/ZenDNN.git
cd ZenDNN
git checkout zendnnl

# Build and install (requires CMake >= 3.25)
mkdir build && cd build
cmake ..
cmake --build . --target all
```

Default installation path: `ZenDNN/build/install`

**For detailed build instructions**, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/zendnnl/README.md).

**Step 2: Build llama.cpp with custom ZenDNN path**

```sh
# Using environment variable
export ZENDNN_ROOT=/path/to/ZenDNN/build/install
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)

# OR specify path directly in CMake
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```
### II. Run the Server

#### 1. Download Model

Download the LLaMA 3.1 8B Instruct BF16 model:

```sh
# Download from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```

#### 2. Start Server

Run the llama.cpp server with ZenDNN acceleration:

```sh
# Set optimal configuration
export OMP_NUM_THREADS=64     # Adjust to your CPU core count
export ZENDNNL_MATMUL_ALGO=2  # Blocked AOCL BLIS for best performance

# Start server
./build/bin/llama-server \
    -m models/Llama-3.1-8B-Instruct.BF16.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -t 64
```

Access the server at `http://localhost:8080`.

**Performance tips**:

- Set `OMP_NUM_THREADS` to match your physical core count (one way to determine it is shown after this list)
- Use `ZENDNNL_MATMUL_ALGO=2` for optimal performance
- For NUMA systems: `numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server ...`
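Hyper-threading can make `nproc` overcount the useful thread count; one common way to count physical cores on Linux (a generic shell sketch, not ZenDNN-specific):

```sh
# Count unique (core, socket) pairs, i.e. physical cores
lscpu -b -p=Core,Socket | grep -v '^#' | sort -u | wc -l
```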
## Environment Variable

### Build Time

| Name        | Value                       | Function                          |
|-------------|-----------------------------|-----------------------------------|
| GGML_ZENDNN | ON/OFF                      | Enable ZenDNN backend support     |
| ZENDNN_ROOT | Path to ZenDNN installation | Set ZenDNN installation directory |
| GGML_OPENMP | ON/OFF (recommended: ON)    | Enable OpenMP for multi-threading |

### Runtime

| Name                      | Value             | Function                                                         |
|---------------------------|-------------------|------------------------------------------------------------------|
| OMP_NUM_THREADS           | Number (e.g., 64) | Set number of OpenMP threads (recommended: physical core count)  |
| ZENDNNL_MATMUL_ALGO       | 0-5               | Select MatMul backend algorithm (see Performance Optimization)   |
| ZENDNNL_PROFILE_LOG_LEVEL | 0-4               | Profiling log level (0=disabled, 4=verbose)                      |
| ZENDNNL_ENABLE_PROFILER   | 0 or 1            | Enable detailed profiling (1=enabled)                            |
| ZENDNNL_API_LOG_LEVEL     | 0-4               | API log level (0=disabled, 4=verbose)                            |

**Example**:

```sh
export OMP_NUM_THREADS=64
export ZENDNNL_MATMUL_ALGO=2  # Use Blocked AOCL BLIS for best performance
./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Test" -n 100
```
## Performance Optimization

### MatMul Algorithm Selection

ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL BLIS** algorithm (a sketch for comparing algorithms on your own hardware follows the table below):

```sh
export ZENDNNL_MATMUL_ALGO=2  # Blocked AOCL BLIS (recommended)
```

**Available algorithms**:

| Value | Algorithm         | Description                           |
|:-----:|:------------------|:--------------------------------------|
| 0     | Dynamic Dispatch  | Automatic backend selection (default) |
| 1     | AOCL BLIS         | AOCL BLIS backend                     |
| 2     | AOCL BLIS Blocked | **Blocked AOCL BLIS (recommended)**   |
| 3     | OneDNN            | OneDNN backend                        |
| 4     | OneDNN Blocked    | Blocked OneDNN                        |
| 5     | LibXSMM           | LibXSMM backend                       |
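Since the best algorithm can vary with model shape and CPU generation, it may be worth benchmarking each value with `llama-bench`. A sketch; the model path is a placeholder:

```sh
# Compare MatMul algorithms 0-5 on your own model and hardware
for algo in 0 1 2 3 4 5; do
    echo "ZENDNNL_MATMUL_ALGO=$algo"
    ZENDNNL_MATMUL_ALGO=$algo ./build/bin/llama-bench -m models/model.BF16.gguf
done
```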
### Profiling and Debugging

For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/zendnnl/docs/logging.md).
## Known Issues

- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend.
- **BF16 support**: BF16 operations require the AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance (see the snippet after this list for inspecting the topology).
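Before choosing `--cpunodebind`/`--membind` values, the NUMA layout can be inspected with standard Linux tools (a generic sketch):

```sh
# Show NUMA nodes, their CPUs, and memory sizes
numactl --hardware
lscpu | grep -i numa
```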
## Q&A

**Q: How do I verify that the ZenDNN backend is being used?**

A: Check the log output when running llama.cpp. You should see messages indicating the ZenDNN backend is initialized. You can also check the backend name in the output.

**Q: What performance improvement can I expect?**

A: Performance gains vary depending on the model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect a 1.1x-2x speedup over standard CPU inference for matrix multiplication operations.

**Q: Can I use ZenDNN on non-AMD processors?**

A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures.

**Q: Does ZenDNN support quantized models?**

A: Currently, ZenDNN primarily supports the FP32 and BF16 data types. Quantized model support is not available at this time.

**Q: Why is my inference not faster with ZenDNN?**

A: Ensure that:

1. You're using an AMD EPYC or Ryzen processor (Zen 2 or newer)
2. `OMP_NUM_THREADS` is set appropriately (physical core count)
3. `ZENDNNL_MATMUL_ALGO=2` is set for best performance (Blocked AOCL BLIS)
4. You're using a sufficiently large model (small models may not benefit as much)
5. Profiling is enabled, to verify the ZenDNN MatMul is actually being called

### GitHub Contribution

Please add the **[ZenDNN]** prefix/tag to issue/PR titles to help the ZenDNN team triage and address them without delay.
## TODO

- Expand operation support beyond MUL_MAT (attention operations, activations, etc.)
@@ -1,5 +1,10 @@
# llama.cpp for IBM zDNN Accelerator

> [!WARNING]
> **Note:** zDNN is **not** the same as ZenDNN.
> - **zDNN** (this page): IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes
> - **ZenDNN**: AMD's deep learning library for AMD EPYC CPUs ([see ZenDNN documentation](ZenDNN.md))

## Background

IBM zDNN (Z Deep Neural Network) is a hardware acceleration library designed specifically to leverage the IBM NNPA (Neural Network Processor Assist) accelerator located within IBM Telum I and II processors. It provides significant performance improvements for neural network inference operations.
@@ -495,6 +495,38 @@ llama_new_context_with_model: CANN compute buffer size = 1260.81 MiB
For detailed info, such as model/device supports, CANN install, please refer to [llama.cpp for CANN](./backend/CANN.md).

## ZenDNN

ZenDNN provides optimized deep learning primitives for AMD EPYC™ CPUs. It accelerates matrix multiplication operations for inference workloads.

### Compilation

- Using `CMake` on Linux (automatic build):

  ```bash
  cmake -B build -DGGML_ZENDNN=ON
  cmake --build build --config Release
  ```

  The first build will automatically download and build ZenDNN, which may take 5-10 minutes. Subsequent builds will be much faster.

- Using `CMake` with a custom ZenDNN installation:

  ```bash
  cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/zendnn/install
  cmake --build build --config Release
  ```

### Testing

You can test with:

```bash
./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -n 50
```

For detailed information about hardware support, setup instructions, and performance optimization, refer to [llama.cpp for ZenDNN](./backend/ZenDNN.md).

## Arm® KleidiAI™

KleidiAI is a library of optimized microkernels for AI workloads, specifically designed for Arm CPUs. These microkernels enhance performance and can be enabled for use by the CPU backend.
docs/ops.md
@@ -12,111 +12,111 @@ Legend:
- 🟡 Partially supported by this backend
- ❌ Not supported by this backend

| Operation | BLAS | CANN | CPU | CUDA | Metal | OpenCL | SYCL | Vulkan | WebGPU | ZenDNN | zDNN |
|-----------|------|------|------|------|------|------|------|------|------|------|------|
| ABS | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| ACC | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ADD | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | ✅ | ❌ | ❌ |
| ADD1 | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ADD_ID | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| ARANGE | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ARGMAX | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ARGSORT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CEIL | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| CLAMP | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| CONCAT | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| CONT | ❌ | 🟡 | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| CONV_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| CONV_2D_DW | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| CONV_3D | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| COS | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| COUNT_EQUAL | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CPY | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| CROSS_ENTROPY_LOSS | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CROSS_ENTROPY_LOSS_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CUMSUM | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| DIAG_MASK_INF | ❌ | ✅ | ✅ | ✅ | ❌ | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| DIV | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | ✅ | ❌ | ❌ |
| DUP | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| ELU | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ |
| EXP | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| EXPM1 | ❌ | ❌ | ✅ | 🟡 | 🟡 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| FILL | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| FLASH_ATTN_EXT | ❌ | 🟡 | ✅ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ |
| FLOOR | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| GATED_LINEAR_ATTN | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| GEGLU | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GEGLU_ERF | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GEGLU_QUICK | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_ERF | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_QUICK | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GET_ROWS | ❌ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| GET_ROWS_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GROUP_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| GROUP_NORM_MUL_ADD | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| HARDSIGMOID | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| HARDSWISH | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| IM2COL | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| IM2COL_3D | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| L2_NORM | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| LEAKY_RELU | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | ✅ | 🟡 | ❌ | ❌ | ❌ |
| LOG | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| MEAN | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| MUL | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | ✅ | ❌ | ❌ |
| MUL_MAT | 🟡 | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| MUL_MAT_ID | ❌ | 🟡 | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ | ❌ |
| NEG | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | ❌ | ❌ | ❌ |
| NORM_MUL_ADD | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| OPT_STEP_ADAMW | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| OPT_STEP_SGD | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| OUT_PROD | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ | ❌ |
| PAD | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| PAD_REFLECT_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| POOL_2D | ❌ | 🟡 | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| REGLU | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| RELU | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| REPEAT | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | ❌ | ❌ | ❌ |
| REPEAT_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| RMS_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| RMS_NORM_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| RMS_NORM_MUL_ADD | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| ROLL | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ROPE | ❌ | 🟡 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| ROPE_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| ROUND | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| RWKV_WKV6 | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| RWKV_WKV7 | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| SCALE | ❌ | 🟡 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SET | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ | ❌ |
| SET_ROWS | ❌ | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| SGN | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ |
| SIGMOID | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SILU | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SILU_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| SIN | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| SOFTCAP | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| SOFTPLUS | ❌ | ❌ | ✅ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ |
| SOFT_MAX | ❌ | 🟡 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SOFT_MAX_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| SOLVE_TRI | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ |
| SQR | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| SQRT | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| SSM_CONV | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| SSM_SCAN | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ |
| STEP | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SUB | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | ✅ | ❌ | ❌ |
| SUM | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| SUM_ROWS | ❌ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| SWIGLU | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SWIGLU_OAI | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | 🟡 | ✅ | ❌ | ❌ |
| TANH | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| TIMESTEP_EMBEDDING | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| TOP_K | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ |
| TRI | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| TRUNC | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| UPSCALE | ❌ | 🟡 | ✅ | ✅ | 🟡 | ✅ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| XIELU | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
(File diff suppressed because it is too large.)
@@ -253,6 +253,9 @@ option(GGML_HEXAGON "ggml: enable Hexagon backend"
# toolchain for vulkan-shaders-gen
set (GGML_VULKAN_SHADERS_GEN_TOOLCHAIN "" CACHE FILEPATH "ggml: toolchain file for vulkan-shaders-gen")

option(GGML_ZENDNN "ggml: use ZenDNN" OFF)
option(ZENDNN_ROOT "ggml: path to ZenDNN installation" "")

# extra artifacts
option(GGML_BUILD_TESTS    "ggml: build tests"    ${GGML_STANDALONE})
option(GGML_BUILD_EXAMPLES "ggml: build examples" ${GGML_STANDALONE})
@@ -314,6 +317,7 @@ set(GGML_PUBLIC_HEADERS
    include/ggml-sycl.h
    include/ggml-vulkan.h
    include/ggml-webgpu.h
    include/ggml-zendnn.h
    include/gguf.h)

set_target_properties(ggml PROPERTIES PUBLIC_HEADER "${GGML_PUBLIC_HEADERS}")
@@ -0,0 +1,22 @@
#pragma once

#include "ggml-backend.h"
#include "ggml.h"

#ifdef __cplusplus
extern "C" {
#endif

// backend API
GGML_BACKEND_API ggml_backend_t ggml_backend_zendnn_init(void);

GGML_BACKEND_API bool ggml_backend_is_zendnn(ggml_backend_t backend);

// number of threads used for zendnn operations
GGML_BACKEND_API void ggml_backend_zendnn_set_n_threads(ggml_backend_t backend_zendnn, int n_threads);

GGML_BACKEND_API ggml_backend_reg_t ggml_backend_zendnn_reg(void);

#ifdef __cplusplus
}
#endif
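In most cases the backend is picked up automatically through the registry, but it can also be driven directly through this header. A minimal usage sketch (assumes a build with `GGML_ZENDNN=ON`; the thread count is a placeholder and the graph-building step is only indicated):

```c
#include "ggml-zendnn.h"
#include <stdio.h>

int main(void) {
    // create the ZenDNN backend and confirm its identity
    ggml_backend_t backend = ggml_backend_zendnn_init();
    if (backend == NULL || !ggml_backend_is_zendnn(backend)) {
        fprintf(stderr, "ZenDNN backend unavailable\n");
        return 1;
    }
    ggml_backend_zendnn_set_n_threads(backend, 64); // match physical core count

    // ... build a ggml graph and run it with ggml_backend_graph_compute(backend, graph) ...

    ggml_backend_free(backend);
    return 0;
}
```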
@@ -440,6 +440,7 @@ ggml_add_backend(WebGPU)
ggml_add_backend(zDNN)
ggml_add_backend(OpenCL)
ggml_add_backend(Hexagon)
ggml_add_backend(ZenDNN)

foreach (target ggml-base ggml)
    target_include_directories(${target} PUBLIC $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/../include> $<INSTALL_INTERFACE:include>)
@@ -73,6 +73,10 @@
#include "ggml-cann.h"
#endif

#ifdef GGML_USE_ZENDNN
#include "ggml-zendnn.h"
#endif

// disable C++17 deprecation warning for std::codecvt_utf8
#if defined(__clang__)
#    pragma clang diagnostic push
@@ -203,6 +207,9 @@ struct ggml_backend_registry {
#ifdef GGML_USE_OPENCL
        register_backend(ggml_backend_opencl_reg());
#endif
#ifdef GGML_USE_ZENDNN
        register_backend(ggml_backend_zendnn_reg());
#endif
#ifdef GGML_USE_HEXAGON
        register_backend(ggml_backend_hexagon_reg());
#endif
@@ -605,6 +612,7 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
#endif

    ggml_backend_load_best("blas", silent, dir_path);
    ggml_backend_load_best("zendnn", silent, dir_path);
    ggml_backend_load_best("cann", silent, dir_path);
    ggml_backend_load_best("cuda", silent, dir_path);
    ggml_backend_load_best("hip", silent, dir_path);
@@ -0,0 +1,92 @@
ggml_add_backend_library(ggml-zendnn
                         ggml-zendnn.cpp)

# Get ZenDNN path
if (NOT DEFINED ZENDNN_ROOT OR ZENDNN_ROOT STREQUAL "")
    set(ZENDNN_ROOT "$ENV{ZENDNN_ROOT}")
endif()

# Check if path is still empty or OFF
if (NOT ZENDNN_ROOT OR ZENDNN_ROOT STREQUAL "" OR ZENDNN_ROOT STREQUAL "OFF")
    message(STATUS "ZENDNN_ROOT not set. Automatically downloading and building ZenDNN...")
    message(STATUS "This will take several minutes on first build...")

    include(ExternalProject)

    set(ZENDNN_PREFIX      ${CMAKE_BINARY_DIR}/_deps/zendnn-prefix)
    set(ZENDNN_SOURCE_DIR  ${ZENDNN_PREFIX}/src/zendnn)
    set(ZENDNN_BUILD_DIR   ${ZENDNN_PREFIX}/build)
    set(ZENDNN_INSTALL_DIR ${ZENDNN_BUILD_DIR}/install)

    ExternalProject_Add(
        zendnn
        GIT_REPOSITORY https://github.com/amd/ZenDNN.git
        GIT_TAG        zendnnl
        PREFIX         ${ZENDNN_PREFIX}
        SOURCE_DIR     ${ZENDNN_SOURCE_DIR}
        BINARY_DIR     ${ZENDNN_BUILD_DIR}
        CMAKE_ARGS
            -DCMAKE_BUILD_TYPE=Release
            -DCMAKE_INSTALL_PREFIX=${ZENDNN_INSTALL_DIR}
            -DZENDNNL_BUILD_EXAMPLES=OFF
            -DZENDNNL_BUILD_DOXYGEN=OFF
            -DZENDNNL_BUILD_GTEST=OFF
            -DZENDNNL_BUILD_BENCHDNN=OFF
            # Enable ALL matmul algorithm backends
            -DZENDNNL_DEPENDS_AOCLDLP=ON
            -DZENDNNL_DEPENDS_ONEDNN=ON
            -DZENDNNL_DEPENDS_LIBXSMM=ON
        BUILD_COMMAND   ${CMAKE_COMMAND} --build ${ZENDNN_BUILD_DIR} --target zendnnl
        INSTALL_COMMAND ${CMAKE_COMMAND} --build ${ZENDNN_BUILD_DIR} --target install
        BUILD_ALWAYS  OFF
        LOG_DOWNLOAD  ON
        LOG_CONFIGURE ON
        LOG_BUILD     ON
        LOG_INSTALL   ON
    )

    # Add dependency so ZenDNN builds before our library
    add_dependencies(ggml-zendnn zendnn)

    # Set ZENDNN_ROOT to the installation directory
    set(ZENDNN_ROOT ${ZENDNN_INSTALL_DIR})

    message(STATUS "ZenDNN will be built to: ${ZENDNN_ROOT}")
else()
    message(STATUS "Using custom ZenDNN installation at: ${ZENDNN_ROOT}")
endif()

# ZenDNN headers + libs
target_include_directories(ggml-zendnn PRIVATE
    ${ZENDNN_ROOT}/zendnnl/include
    ${ZENDNN_ROOT}/deps/aocldlp/include
    ${ZENDNN_ROOT}/deps/aoclutils/include
    ${ZENDNN_ROOT}/deps/json/include
    ${ZENDNN_ROOT}/deps/libxsmm/include
    ${ZENDNN_ROOT}/deps/onednn/include
)

target_link_directories(ggml-zendnn PRIVATE
    ${ZENDNN_ROOT}/zendnnl/lib
    ${ZENDNN_ROOT}/deps/aocldlp/lib
    ${ZENDNN_ROOT}/deps/aoclutils/lib
    ${ZENDNN_ROOT}/deps/libxsmm/lib
    ${ZENDNN_ROOT}/deps/onednn/lib
)

target_link_libraries(ggml-zendnn PRIVATE
    zendnnl_archive  # ZenDNN main
    aocl-dlp         # AOCL libraries
    aoclutils
    au_cpuid
    dnnl             # OneDNN
    xsmm             # libxsmm small matrix math
    xsmmext
    xsmmnoblas
    m
    pthread
)

if (GGML_OPENMP)
    target_link_libraries(ggml-zendnn PRIVATE OpenMP::OpenMP_CXX)
endif()
@@ -0,0 +1,466 @@
#include "ggml-zendnn.h"

#include "ggml-backend-impl.h"
#include "ggml-impl.h"
#include "ggml-cpu.h"
#include "zendnnl.hpp"

#include <cassert>
#include <cstring>
#include <memory>
#include <type_traits>

struct ggml_backend_zendnn_context {
    int n_threads = GGML_DEFAULT_N_THREADS;
    std::unique_ptr<char[]> work_data;
    size_t work_size = 0;
};

template<typename T>
zendnnl::common::data_type_t ggml_to_zendnn_type() {
    if constexpr (std::is_same_v<T, float>) {
        return zendnnl::common::data_type_t::f32;
    } else if constexpr (std::is_same_v<T, ggml_bf16_t>) {
        return zendnnl::common::data_type_t::bf16;
    } else {
        return zendnnl::common::data_type_t::none;
    }
}

/**
 * ZenDNN matmul: computes C = B * A.
 *
 * - A: weights, shape (k, m), column-major (each column is a weight vector for one output).
 * - B: input, shape (n, k), row-major (each row is an input sample).
 * - C: output, shape (n, m), row-major.
 *
 * Dimensions:
 *   m = output features (columns of C, columns of A)
 *   n = batch size (rows of C, rows of B)
 *   k = inner dimension (columns of B, rows of A)
 */
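// Illustrative shape example (hypothetical numbers, not part of the change):
// applying a 4096x4096 weight matrix to a batch of 8 token embeddings gives
// A = (k=4096, m=4096), B = (n=8, k=4096), and C = (n=8, m=4096).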
template <typename TA, typename TB, typename TC>
static bool ggml_zendnn_matmul(ggml_backend_zendnn_context * ctx, int64_t m, int64_t n, int64_t k,
                               const TA * A, int64_t lda, const TB * B, int64_t ldb, TC * C,
                               int64_t ldc) {
    zendnnl::lowoha::lowoha_params params;
    params.dtypes.src = ggml_to_zendnn_type<TB>();
    params.dtypes.wei = ggml_to_zendnn_type<TA>();
    params.dtypes.dst = ggml_to_zendnn_type<TC>();
    params.num_threads = ctx->n_threads;

    zendnnl::lowoha::status_t status = zendnnl::lowoha::matmul_direct(
        'r', false, true, // row-major, don't transpose B, transpose A (because it's column-major)
        n,                // M: rows of B and C
        m,                // N: cols of A^T and C
        k,                // K: cols of B, rows of A
        1.0f,             // alpha
        B, ldb,           // src: B[n,k]
        A, lda,           // weight: A[k,m] column-major (transposed)
        nullptr,          // bias
        0.0f,             // beta
        C, ldc,           // output C[n,m]
        true,             // is_weights_const
        {},               // batch_params
        params            // params
    );

    if (status != zendnnl::lowoha::status_t::success) {
        GGML_LOG_ERROR("%s, ZenDNN matmul failed: status=%d\n", __func__, static_cast<int>(status));
        return false;
    }
    return true;
}

static bool ggml_zendnn_sgemm(ggml_backend_zendnn_context * ctx, int64_t m, int64_t n, int64_t k,
                              const void * A, int64_t lda, const void * B, int64_t ldb, void * C,
                              int64_t ldc, int Atype, int Btype, int Ctype) {
    assert(m >= 0);
    assert(n >= 0);
    assert(k >= 0);
    assert(lda >= k);
    assert(ldb >= k);
    assert(ldc >= m);

    // categorize types
    switch (Atype) {
        case GGML_TYPE_F32:
            if (Btype != GGML_TYPE_F32 || Ctype != GGML_TYPE_F32)
                return false;
            return ggml_zendnn_matmul<float, float, float>(
                    ctx, m, n, k,
                    (const float *)A, lda,
                    (const float *)B, ldb,
                    (float *)C, ldc);
        case GGML_TYPE_BF16:
            if (Btype != GGML_TYPE_BF16)
                return false;
            if (Ctype == GGML_TYPE_BF16)
                return ggml_zendnn_matmul<ggml_bf16_t, ggml_bf16_t, ggml_bf16_t>(
                        ctx, m, n, k,
                        (const ggml_bf16_t *)A, lda,
                        (const ggml_bf16_t *)B, ldb,
                        (ggml_bf16_t *)C, ldc);
            if (Ctype == GGML_TYPE_F32)
                return ggml_zendnn_matmul<ggml_bf16_t, ggml_bf16_t, float>(
                        ctx, m, n, k,
                        (const ggml_bf16_t *)A, lda,
                        (const ggml_bf16_t *)B, ldb,
                        (float *)C, ldc);
            return false;
        default:
            return false; // unsupported type
    }
}

static void ggml_zendnn_compute_forward_mul_mat(
        ggml_backend_zendnn_context * ctx,
        ggml_tensor * dst) {
    const ggml_tensor * src0 = dst->src[0]; // weights
    const ggml_tensor * src1 = dst->src[1]; // inputs

    GGML_TENSOR_BINARY_OP_LOCALS

    ggml_type const vec_dot_type = ggml_get_type_traits_cpu(src0->type)->vec_dot_type;
    ggml_from_float_t const from_float = ggml_get_type_traits_cpu(vec_dot_type)->from_float;

    GGML_ASSERT(ne0 == ne01);
    GGML_ASSERT(ne1 == ne11);
    GGML_ASSERT(ne2 == ne12);
    GGML_ASSERT(ne3 == ne13);

    // we don't support permuted src0 or src1
    GGML_ASSERT(nb00 == ggml_type_size(src0->type));
    GGML_ASSERT(nb10 == ggml_type_size(src1->type));

    // dst cannot be transposed or permuted
    GGML_ASSERT(nb0 == sizeof(float));
    GGML_ASSERT(nb0 <= nb1);
    GGML_ASSERT(nb1 <= nb2);
    GGML_ASSERT(nb2 <= nb3);

    // broadcast factors
    const int64_t r2 = ne12/ne02;
    const int64_t r3 = ne13/ne03;

    void * work_data = ctx->work_data.get();
    if (src1->type != vec_dot_type) {
        const size_t nbw1 = ggml_row_size(vec_dot_type, ne10);
        const size_t nbw2 = nbw1 * ne11;
        const size_t nbw3 = nbw2 * ne12;
        const size_t desired_wsize = ne13 * nbw3;
        if (ctx->work_size < desired_wsize) {
            ctx->work_data.reset(new char[desired_wsize]);
            ctx->work_size = desired_wsize;
        }
        work_data = ctx->work_data.get();

        // #pragma omp parallel for num_threads(ctx->n_threads)
        #pragma omp parallel for collapse(3) num_threads(ctx->n_threads) schedule(static)
        for (int64_t i13 = 0; i13 < ne13; ++i13) {
            for (int64_t i12 = 0; i12 < ne12; ++i12) {
                for (int64_t i11 = 0; i11 < ne11; ++i11) {
                    const float * src1_f32 = (float *)((char *)src1->data + i11*nb11 + i12*nb12 + i13*nb13);
                    void * src1_conv = (char *)work_data + i11*nbw1 + i12*nbw2 + i13*nbw3;
                    from_float(src1_f32, src1_conv, ne10);
                }
            }
        }
    }

    for (int64_t i13 = 0; i13 < ne13; i13++) {
        for (int64_t i12 = 0; i12 < ne12; i12++) {
            const void * wdata = src1->type == vec_dot_type ? src1->data : work_data;
            const size_t row_size = ggml_row_size(vec_dot_type, ne10);
            if (!ggml_zendnn_sgemm(ctx,
                    ne01, // m
                    ne11, // n
                    ne10, // k
                    static_cast<const char *>(src0->data) + (i12/r2)*nb02 + (i13/r3)*nb03,
                    ne00, // lda
                    static_cast<const char *>(wdata) + (i12*ne11 + i13*ne12*ne11)*row_size,
                    ne10, // ldb
                    static_cast<char *>(dst->data) + i12*nb2 + i13*nb3,
                    ne01, // ldc
                    src0->type,
                    vec_dot_type,
                    dst->type))
                GGML_ABORT("%s: ZenDNN sgemm failed\n", __func__);
        }
    }
}
// backend interface

static const char * ggml_backend_zendnn_get_name(ggml_backend_t backend) {
    return "ZenDNN";

    GGML_UNUSED(backend);
}

static void ggml_backend_zendnn_free(ggml_backend_t backend) {
    ggml_backend_zendnn_context * ctx = (ggml_backend_zendnn_context *)backend->context;
    delete ctx;
    delete backend;
}

static ggml_status ggml_backend_zendnn_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
    ggml_backend_zendnn_context * ctx = (ggml_backend_zendnn_context *)backend->context;

    for (int i = 0; i < cgraph->n_nodes; i++) {
        struct ggml_tensor * node = cgraph->nodes[i];

        switch (node->op) {
            case GGML_OP_MUL_MAT:
                ggml_zendnn_compute_forward_mul_mat(ctx, node);
                break;

            case GGML_OP_NONE:
            case GGML_OP_RESHAPE:
            case GGML_OP_VIEW:
            case GGML_OP_PERMUTE:
            case GGML_OP_TRANSPOSE:
                break;

            default:
                GGML_ABORT("%s: unsupported op %s\n", __func__, ggml_op_desc(node));
        }
    }

    return GGML_STATUS_SUCCESS;

    GGML_UNUSED(backend);
}

static struct ggml_backend_i ggml_backend_zendnn_i = {
    /* .get_name           = */ ggml_backend_zendnn_get_name,
    /* .free               = */ ggml_backend_zendnn_free,
    /* .set_tensor_async   = */ NULL,
    /* .get_tensor_async   = */ NULL,
    /* .cpy_tensor_async   = */ NULL,
    /* .synchronize        = */ NULL,
    /* .graph_plan_create  = */ NULL,
    /* .graph_plan_free    = */ NULL,
    /* .graph_plan_update  = */ NULL,
    /* .graph_plan_compute = */ NULL,
    /* .graph_compute      = */ ggml_backend_zendnn_graph_compute,
    /* .event_record       = */ NULL,
    /* .event_wait         = */ NULL,
    /* .graph_optimize     = */ NULL,
};

static ggml_guid_t ggml_backend_zendnn_guid(void) {
    static const char * guid_str = "AMD-ZENDNN-ACCEL";
    return reinterpret_cast<ggml_guid_t>(const_cast<char *>(guid_str));
}

ggml_backend_t ggml_backend_zendnn_init(void) {
    ggml_backend_zendnn_context * ctx = new ggml_backend_zendnn_context;

    ggml_backend_t backend = new ggml_backend {
        /* .guid    = */ ggml_backend_zendnn_guid(),
        /* .iface   = */ ggml_backend_zendnn_i,
        /* .device  = */ ggml_backend_reg_dev_get(ggml_backend_zendnn_reg(), 0),
        /* .context = */ ctx,
    };

    return backend;
}

bool ggml_backend_is_zendnn(ggml_backend_t backend) {
    return backend != NULL && ggml_guid_matches(backend->guid, ggml_backend_zendnn_guid());
}

void ggml_backend_zendnn_set_n_threads(ggml_backend_t backend_zendnn, int n_threads) {
    GGML_ASSERT(ggml_backend_is_zendnn(backend_zendnn));

    ggml_backend_zendnn_context * ctx = (ggml_backend_zendnn_context *)backend_zendnn->context;
    ctx->n_threads = n_threads;
}

// device interface

static const char * ggml_backend_zendnn_device_get_name(ggml_backend_dev_t dev) {
    return "ZenDNN";

    GGML_UNUSED(dev);
}

/**
 * ZenDNN is AMD's performance library providing optimized primitives and implementations
 * for deep learning workloads on AMD CPUs. It targets improved performance for common
 * neural network operations on AMD architectures. For more information, see:
 * https://www.amd.com/en/developer/zendnn.html
 */
static const char * ggml_backend_zendnn_device_get_description(ggml_backend_dev_t dev) {
    return "ZenDNN: AMD optimized primitives backend for GGML (optimized for AMD CPUs)";

    GGML_UNUSED(dev);
}

static void ggml_backend_zendnn_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
    *free  = 0;
    *total = 0;

    GGML_UNUSED(dev);
}

static enum ggml_backend_dev_type ggml_backend_zendnn_device_get_type(ggml_backend_dev_t dev) {
    return GGML_BACKEND_DEVICE_TYPE_ACCEL;

    GGML_UNUSED(dev);
}

static void ggml_backend_zendnn_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
    props->name        = ggml_backend_zendnn_device_get_name(dev);
    props->description = ggml_backend_zendnn_device_get_description(dev);
    props->type        = ggml_backend_zendnn_device_get_type(dev);
    ggml_backend_zendnn_device_get_memory(dev, &props->memory_free, &props->memory_total);
    props->caps = {
        /* .async                = */ false,
        /* .host_buffer          = */ false,
        /* .buffer_from_host_ptr = */ true,
        /* .events               = */ false
    };
}

static ggml_backend_t ggml_backend_zendnn_device_init_backend(ggml_backend_dev_t dev, const char * params) {
    ggml_backend_t backend = ggml_backend_zendnn_init();
    if (backend == NULL) {
        GGML_LOG_ERROR("%s: error: failed to initialize ZenDNN backend\n", __func__);
        return NULL;
    }

    return backend;

    GGML_UNUSED(dev);
    GGML_UNUSED(params);
}

static ggml_backend_buffer_type_t ggml_backend_zendnn_device_get_buffer_type(ggml_backend_dev_t dev) {
    return ggml_backend_cpu_buffer_type();

    GGML_UNUSED(dev);
}

static ggml_backend_buffer_t ggml_backend_zendnn_device_buffer_from_host_ptr(ggml_backend_dev_t dev, void * ptr, size_t size, size_t max_tensor_size) {
    return ggml_backend_cpu_buffer_from_ptr(ptr, size);

    GGML_UNUSED(dev);
    GGML_UNUSED(max_tensor_size);
}

static bool ggml_backend_zendnn_device_supports_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_NONE:
        case GGML_OP_RESHAPE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;

        case GGML_OP_MUL_MAT:
            {
                const ggml_tensor * weights = op->src[0];
                const ggml_tensor * inputs  = op->src[1];

                const int64_t ne10 = inputs->ne[0];
                const int64_t ne0  = op->ne[0];
                const int64_t ne1  = op->ne[1];

                const int64_t min_batch = 1;
                if (!ggml_is_contiguous(weights) || !ggml_is_contiguous(inputs) ||
                    ne0 < min_batch || ne1 < min_batch || ne10 < min_batch) {
                    return false;
                }
                switch (weights->type) {
                    case GGML_TYPE_F32:
                    case GGML_TYPE_BF16:
                        return true;
                    default:
                        return false;
                }
            } break;

        default:
            return false;
    }

    GGML_UNUSED(dev);
}

static bool ggml_backend_zendnn_device_supports_buft(ggml_backend_dev_t dev, ggml_backend_buffer_type_t buft) {
    return ggml_backend_buft_is_host(buft);

    GGML_UNUSED(dev);
}

static const struct ggml_backend_device_i ggml_backend_zendnn_device_i = {
    /* .get_name             = */ ggml_backend_zendnn_device_get_name,
    /* .get_description      = */ ggml_backend_zendnn_device_get_description,
    /* .get_memory           = */ ggml_backend_zendnn_device_get_memory,
    /* .get_type             = */ ggml_backend_zendnn_device_get_type,
    /* .get_props            = */ ggml_backend_zendnn_device_get_props,
    /* .init_backend         = */ ggml_backend_zendnn_device_init_backend,
    /* .get_buffer_type      = */ ggml_backend_zendnn_device_get_buffer_type,
    /* .get_host_buffer_type = */ NULL,
    /* .buffer_from_host_ptr = */ ggml_backend_zendnn_device_buffer_from_host_ptr,
    /* .supports_op          = */ ggml_backend_zendnn_device_supports_op,
    /* .supports_buft        = */ ggml_backend_zendnn_device_supports_buft,
    /* .offload_op           = */ NULL,
    /* .event_new            = */ NULL,
    /* .event_free           = */ NULL,
    /* .event_synchronize    = */ NULL,
};

// backend reg interface
static const char * ggml_backend_zendnn_reg_get_name(ggml_backend_reg_t reg) {
    return "ZenDNN";

    GGML_UNUSED(reg);
}

static size_t ggml_backend_zendnn_reg_get_device_count(ggml_backend_reg_t reg) {
    return 1;

    GGML_UNUSED(reg);
}

static ggml_backend_dev_t ggml_backend_zendnn_reg_get_device(ggml_backend_reg_t reg, size_t index) {
    GGML_ASSERT(index == 0);

    static ggml_backend_device ggml_backend_zendnn_device = {
        /* .iface   = */ ggml_backend_zendnn_device_i,
        /* .reg     = */ reg,
        /* .context = */ nullptr,
    };

    return &ggml_backend_zendnn_device;
}

static void * ggml_backend_zendnn_get_proc_address(ggml_backend_reg_t reg, const char * name) {
    if (std::strcmp(name, "ggml_backend_set_n_threads") == 0) {
        return (void *) ggml_backend_zendnn_set_n_threads;
    }
    return NULL;

    GGML_UNUSED(reg);
    GGML_UNUSED(name);
}

static const struct ggml_backend_reg_i ggml_backend_zendnn_reg_i = {
    /* .get_name         = */ ggml_backend_zendnn_reg_get_name,
    /* .get_device_count = */ ggml_backend_zendnn_reg_get_device_count,
    /* .get_device       = */ ggml_backend_zendnn_reg_get_device,
    /* .get_proc_address = */ ggml_backend_zendnn_get_proc_address,
};

ggml_backend_reg_t ggml_backend_zendnn_reg(void) {
    static struct ggml_backend_reg ggml_backend_zendnn_reg = {
        /* .api_version = */ GGML_BACKEND_API_VERSION,
        /* .iface       = */ ggml_backend_zendnn_reg_i,
        /* .context     = */ NULL,
    };

    return &ggml_backend_zendnn_reg;
}

GGML_BACKEND_DL_IMPL(ggml_backend_zendnn_reg)
@@ -1737,7 +1737,8 @@ struct markdown_printer : public printer {
        fields.emplace_back("params");
        fields.emplace_back("backend");
        bool is_cpu_backend = test::get_backend().find("CPU") != std::string::npos ||
                              test::get_backend().find("BLAS") != std::string::npos ||
                              test::get_backend().find("ZenDNN") != std::string::npos;
        if (!is_cpu_backend) {
            fields.emplace_back("n_gpu_layers");
        }