Merge branch 'master' into imatrix

commit 5aca2561a1
@@ -60,6 +60,7 @@ RUN apt-get update \
     git \
     python3 \
     python3-pip \
+    && pip install --upgrade pip setuptools wheel \
     && pip install --break-system-packages -r requirements.txt \
     && apt autoremove -y \
     && apt clean -y \
@@ -0,0 +1,262 @@
# Copilot Instructions for llama.cpp

## Repository Overview

llama.cpp is a large-scale C/C++ project for efficient LLM (Large Language Model) inference with minimal setup and dependencies. The project enables running language models on diverse hardware with state-of-the-art performance.

**Key Facts:**
- **Primary language**: C/C++ with Python utility scripts
- **Size**: 200k+ lines of code across 1000+ files
- **Architecture**: Modular design with main library (`libllama`) and 40+ executable tools/examples
- **Core dependency**: ggml tensor library (vendored in `ggml/` directory)
- **Backends supported**: CPU (AVX/NEON optimized), CUDA, Metal, Vulkan, SYCL, ROCm, MUSA
- **License**: MIT

## Build Instructions

### Prerequisites
- CMake 3.14+ (primary build system)
- C++17 compatible compiler (GCC 13.3+, Clang, MSVC)
- Optional: ccache for faster compilation

### Basic Build (CPU-only)
**ALWAYS run these commands in sequence:**
```bash
cmake -B build
cmake --build build --config Release -j $(nproc)
```

**Build time**: ~10 minutes on a 4-core system with ccache enabled, ~25 minutes without ccache.

**Important Notes:**
- The Makefile is deprecated; always use CMake
- ccache is automatically detected and used if available
- Built binaries are placed in `build/bin/`
- Parallel builds (`-j`) significantly reduce build time

### Backend-Specific Builds
For CUDA support:
```bash
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
```

For Metal (macOS):
```bash
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j $(nproc)
```

**Important Note**: While all backends can be built as long as the correct requirements for that backend are installed, you will not be able to run them without the correct hardware. The only backend that can be run for testing and validation is the CPU backend.

### Debug Builds
Single-config generators:
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
```

Multi-config generators:
```bash
cmake -B build -G "Xcode"
cmake --build build --config Debug
```

### Common Build Issues
- **Issue**: Network tests fail in isolated environments

  **Solution**: Expected behavior; the core functionality tests will still pass

## Testing

### Running Tests
```bash
ctest --test-dir build --output-on-failure -j $(nproc)
```

**Test suite**: 38 tests covering tokenizers, grammar parsing, sampling, backends, and integration

**Expected failures**: 2-3 tests may fail if network access is unavailable (they download models)

**Test time**: ~30 seconds for passing tests
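To iterate on a single failing test, `ctest` can filter by name with a regex; a minimal sketch, assuming a test whose name matches `test-tokenizer` (the filter is illustrative):

```bash
# Run only the matching tests, with verbose output
ctest --test-dir build -R test-tokenizer -V
```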
### Server Unit Tests
Run server-specific unit tests after building the server:
```bash
# Build the server first
cmake --build build --target llama-server

# Navigate to server tests and run
cd tools/server/tests
source ../../../.venv/bin/activate
./tests.sh
```

**Server test dependencies**: The `.venv` environment includes the required dependencies for server unit tests (pytest, aiohttp, etc.). Tests can be run individually or with various options as documented in `tools/server/tests/README.md`.
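For a tighter loop, arguments can be forwarded to pytest through `tests.sh`; a hedged sketch, assuming a pytest-style test file under `unit/` (the file name is illustrative):

```bash
cd tools/server/tests
source ../../../.venv/bin/activate
# Run a single test file verbosely and stop at the first failure
./tests.sh unit/test_chat_completion.py -v -x
```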
### Test Categories
- Tokenizer tests: Various model tokenizers (BERT, GPT-2, LLaMA, etc.)
- Grammar tests: GBNF parsing and validation
- Backend tests: Core ggml operations across different backends
- Integration tests: End-to-end workflows

### Manual Testing Commands
```bash
# Test basic inference
./build/bin/llama-cli --version

# Test model loading (requires model file)
./build/bin/llama-cli -m path/to/model.gguf -p "Hello" -n 10
```

## Code Quality and Linting

### C++ Code Formatting
**ALWAYS format C++ code before committing:**
```bash
git clang-format
```

Configuration is in `.clang-format` with these key rules:
- 4-space indentation
- 120 column limit
- Braces on same line for functions
- Pointer alignment: `void * ptr` (middle)
- Reference alignment: `int & ref` (middle)
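To preview the formatting changes without touching the working tree, `git clang-format` can be run in diff mode; a minimal sketch:

```bash
# Show what would be reformatted in the currently changed files
git clang-format --diff
```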
### Python Code
**ALWAYS activate the Python environment in `.venv` and use tools from that environment:**
```bash
# Activate virtual environment
source .venv/bin/activate
```

Configuration files:
- `.flake8`: flake8 settings (max-line-length=125, excludes examples/tools)
- `pyrightconfig.json`: pyright type checking configuration
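With the environment active, the linters pick up those configuration files from the repository root; a hedged sketch (the target file is illustrative):

```bash
source .venv/bin/activate
# flake8 reads .flake8, pyright reads pyrightconfig.json
flake8 convert_hf_to_gguf.py
pyright
```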
### Pre-commit Hooks
Run before committing:
```bash
pre-commit run --all-files
```

## Continuous Integration

### GitHub Actions Workflows
Key workflows that run on every PR:
- `.github/workflows/build.yml`: Multi-platform builds
- `.github/workflows/server.yml`: Server functionality tests
- `.github/workflows/python-lint.yml`: Python code quality
- `.github/workflows/python-type-check.yml`: Python type checking

### Local CI Validation
**Run full CI locally before submitting PRs:**
```bash
mkdir tmp

# CPU-only build
bash ./ci/run.sh ./tmp/results ./tmp/mnt
```

**CI Runtime**: 30-60 minutes depending on backend configuration
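Backend-specific CI runs are selected through environment variables read by `ci/run.sh`; a hedged sketch, assuming a CUDA-capable machine and the `GG_BUILD_CUDA` switch used by the CI script:

```bash
# Build and test with the CUDA backend enabled
mkdir -p tmp
GG_BUILD_CUDA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
```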
### Triggering CI
Add `ggml-ci` to the commit message to trigger heavy CI workloads on the custom CI infrastructure.

## Project Layout and Architecture

### Core Directories
- **`src/`**: Main llama library implementation (`llama.cpp`, `llama-*.cpp`)
- **`include/`**: Public API headers, primarily `include/llama.h`
- **`ggml/`**: Core tensor library (vendored copy of the GGML framework)
- **`examples/`**: 30+ example applications and tools
- **`tools/`**: Additional development and utility tools (server benchmarks, tests)
- **`tests/`**: Comprehensive test suite with CTest integration
- **`docs/`**: Detailed documentation (build guides, API docs, etc.)
- **`scripts/`**: Utility scripts for CI, data processing, and automation
- **`common/`**: Shared utility code used across examples

### Key Files
- **`CMakeLists.txt`**: Primary build configuration
- **`include/llama.h`**: Main C API header (~2000 lines)
- **`src/llama.cpp`**: Core library implementation (~8000 lines)
- **`CONTRIBUTING.md`**: Coding guidelines and PR requirements
- **`.clang-format`**: C++ formatting rules
- **`.pre-commit-config.yaml`**: Git hook configuration
### Built Executables (in `build/bin/`)
Primary tools:
- **`llama-cli`**: Main inference tool
- **`llama-server`**: OpenAI-compatible HTTP server
- **`llama-quantize`**: Model quantization utility
- **`llama-perplexity`**: Model evaluation tool
- **`llama-bench`**: Performance benchmarking
- **`llama-convert-llama2c-to-ggml`**: Model conversion utilities
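As a quick smoke test, the server can be started locally and queried through its OpenAI-compatible endpoint; a minimal sketch (model path and port are illustrative):

```bash
# Start the server (requires a model file)
./build/bin/llama-server -m path/to/model.gguf --port 8080 &

# Query the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
```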
### Configuration Files
- **CMake**: `CMakeLists.txt`, `cmake/` directory
- **Linting**: `.clang-format`, `.clang-tidy`, `.flake8`
- **CI**: `.github/workflows/`, `ci/run.sh`
- **Git**: `.gitignore` (includes build artifacts, models, cache)

### Dependencies
- **System**: OpenMP, libcurl (for model downloading)
- **Optional**: CUDA SDK, Metal framework, Vulkan SDK, Intel oneAPI
- **Bundled**: httplib, json (header-only libraries in vendored form)

## Common Validation Steps

### After Making Changes
1. **Format code**: `git clang-format`
2. **Build**: `cmake --build build --config Release`
3. **Test**: `ctest --test-dir build --output-on-failure`
4. **Server tests** (if modifying server): `cd tools/server/tests && source ../../../.venv/bin/activate && ./tests.sh`
5. **Manual validation**: Test relevant tools in `build/bin/`

### Performance Validation
```bash
# Benchmark inference performance
./build/bin/llama-bench -m model.gguf

# Evaluate model perplexity
./build/bin/llama-perplexity -m model.gguf -f dataset.txt
```

### Backend Validation
```bash
# Test backend operations
./build/bin/test-backend-ops
```
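When debugging a specific kernel, the backend test can be narrowed to a single operation; a hedged sketch, assuming the `test`/`perf` modes and the `-o` filter of `test-backend-ops`:

```bash
# Correctness checks for matrix multiplication ops only
./build/bin/test-backend-ops test -o MUL_MAT

# Performance measurement for the same ops
./build/bin/test-backend-ops perf -o MUL_MAT
```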
## Environment Setup

### Required Tools
- CMake 3.14+ (install via system package manager)
- Modern C++ compiler with C++17 support
- Git (for submodule management)
- Python 3.9+ with virtual environment (`.venv` is provided)

### Optional but Recommended
- ccache: `apt install ccache` or `brew install ccache`
- clang-format 15+: Usually included with LLVM/Clang installation
- pre-commit: `pip install pre-commit`

### Backend-Specific Requirements
- **CUDA**: NVIDIA CUDA Toolkit 11.2+
- **Metal**: Xcode command line tools (macOS only)
- **Vulkan**: Vulkan SDK
- **SYCL**: Intel oneAPI toolkit

## Important Guidelines

### Code Changes
- **Minimal dependencies**: Avoid adding new external dependencies
- **Cross-platform compatibility**: Test on Linux, macOS, Windows when possible
- **Performance focus**: This is a performance-critical inference library
- **API stability**: Changes to `include/llama.h` require careful consideration

### Git Workflow
- Always create feature branches from `master`
- **Never** commit build artifacts (`build/`, `.ccache/`, `*.o`, `*.gguf`)
- Use descriptive commit messages following project conventions

### Trust These Instructions
Only search for additional information if these instructions are incomplete or found to be incorrect. This document contains validated build and test procedures that work reliably across different environments.
@@ -1,10 +1,11 @@
 name: Build on RISCV Linux Machine by Cloud-V
 on:
+  pull_request:
   workflow_dispatch:
   workflow_call:

 jobs:
-  bianbu-riscv64-native: # Bianbu 2.2
+  debian-13-riscv64-native: # Bianbu 2.2
     runs-on: self-hosted

     steps:
@@ -20,11 +21,25 @@ jobs:
             build-essential \
             gcc-14-riscv64-linux-gnu \
             g++-14-riscv64-linux-gnu \
+            ccache \
             cmake

+      - name: Setup ccache
+        run: |
+          mkdir -p $HOME/.ccache
+          ccache -M 5G -d $HOME/.ccache
+          export CCACHE_LOGFILE=/home/runneruser/ccache_debug/ccache.log
+          export CCACHE_DEBUGDIR="/home/runneruser/ccache_debug"
+          echo "$GITHUB_WORKSPACE"
+          echo "CCACHE_LOGFILE=$CCACHE_LOGFILE" >> $GITHUB_ENV
+          echo "CCACHE_DEBUGDIR=$CCACHE_DEBUGDIR" >> $GITHUB_ENV
+          echo "CCACHE_BASEDIR=$GITHUB_WORKSPACE" >> $GITHUB_ENV
+          echo "CCACHE_DIR=$HOME/.ccache" >> $GITHUB_ENV

       - name: Build
         run: |
-          cmake -B build -DLLAMA_CURL=OFF \
+          cmake -B build \
+            -DLLAMA_CURL=OFF \
             -DCMAKE_BUILD_TYPE=Release \
             -DGGML_OPENMP=OFF \
             -DLLAMA_BUILD_EXAMPLES=ON \
@@ -34,6 +49,8 @@ jobs:
             -DCMAKE_SYSTEM_PROCESSOR=riscv64 \
             -DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc-14 \
             -DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++-14 \
+            -DCMAKE_C_COMPILER_LAUNCHER=ccache \
+            -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
             -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
             -DCMAKE_FIND_ROOT_PATH=/usr/lib/riscv64-linux-gnu \
             -DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
@@ -1070,7 +1070,8 @@ jobs:
           write-host "Downloading AMD HIP SDK Installer"
           Invoke-WebRequest -Uri "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q3-WinSvr2022-For-HIP.exe" -OutFile "${env:RUNNER_TEMP}\rocm-install.exe"
           write-host "Installing AMD HIP SDK"
-          Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -Wait
+          $proc = Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -PassThru
+          $proc.WaitForExit(600000)
           write-host "Completed AMD HIP SDK installation"

       - name: Verify ROCm
@@ -39,6 +39,10 @@ jobs:
         run: |
           sudo apt-get update
           sudo apt-get install build-essential libcurl4-openssl-dev
+          # Install git-clang-format script for formatting only changed code
+          wget -O /tmp/git-clang-format https://raw.githubusercontent.com/llvm/llvm-project/release/18.x/clang/tools/clang-format/git-clang-format
+          sudo cp /tmp/git-clang-format /usr/local/bin/git-clang-format
+          sudo chmod +x /usr/local/bin/git-clang-format

       - name: Set up Python
         uses: actions/setup-python@v5
@@ -50,4 +54,4 @@ jobs:
           python3 -m venv .venv
           .venv/bin/activate
           pip install -r requirements/requirements-all.txt -r tools/server/tests/requirements.txt
-          pip install flake8 pyright
+          pip install flake8 pyright pre-commit
@@ -557,7 +557,8 @@ jobs:
           write-host "Downloading AMD HIP SDK Installer"
           Invoke-WebRequest -Uri "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q3-WinSvr2022-For-HIP.exe" -OutFile "${env:RUNNER_TEMP}\rocm-install.exe"
           write-host "Installing AMD HIP SDK"
-          Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -Wait
+          $proc = Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -PassThru
+          $proc.WaitForExit(600000)
           write-host "Completed AMD HIP SDK installation"

       - name: Verify ROCm
@@ -147,3 +147,4 @@ poetry.toml
 # Local scripts
 /run-vim.sh
 /run-chat.sh
+.ccache/
@@ -5,7 +5,6 @@
 /tools/server/ @ngxson
 /ggml/src/ggml-cuda/fattn* @JohannesGaessler
 /ggml/src/ggml-cuda/mmq.* @JohannesGaessler
-/ggml/src/ggml-cuda/mmv.* @JohannesGaessler
 /ggml/src/ggml-cuda/mmvq.* @JohannesGaessler
 /ggml/src/ggml-opt.cpp @JohannesGaessler
 /ggml/src/gguf.cpp @JohannesGaessler
@@ -17,6 +17,7 @@ LLM inference in C/C++

 ## Hot topics

+- **[guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)**
 - **[[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)**
 - Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
 - Hot PRs: [All](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Apr+label%3Ahot+) | [Open](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Apr+label%3Ahot+is%3Aopen)
@@ -106,7 +106,7 @@ function gg_wget {
     cd $out

     # should not re-download if file is the same
-    wget -nv -N $url
+    wget -nv -c -N $url

     cd $cwd
 }
@@ -1530,6 +1530,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.ctx_shift = false;
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_IMATRIX, LLAMA_EXAMPLE_PERPLEXITY}).set_env("LLAMA_ARG_NO_CONTEXT_SHIFT"));
+    add_opt(common_arg(
+        {"--context-shift"},
+        string_format("enables context shift on infinite text generation (default: %s)", params.ctx_shift ? "enabled" : "disabled"),
+        [](common_params & params) {
+            params.ctx_shift = true;
+        }
+    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_IMATRIX, LLAMA_EXAMPLE_PERPLEXITY}).set_env("LLAMA_ARG_CONTEXT_SHIFT"));
     add_opt(common_arg(
         {"--chunks"}, "N",
         string_format("max number of chunks to process (default: %d, -1 = all)", params.n_chunks),
@@ -1823,7 +1830,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         [](common_params & params, const std::string & value) {
             params.sampling.top_n_sigma = std::stof(value);
         }
-    ).set_examples({LLAMA_EXAMPLE_MAIN}).set_sparam());
+    ).set_sparam());
     add_opt(common_arg(
         {"--xtc-probability"}, "N",
         string_format("xtc probability (default: %.1f, 0.0 = disabled)", (double)params.sampling.xtc_probability),
@@ -147,6 +147,7 @@ struct templates_params {
     json extra_context;
     bool add_bos;
     bool add_eos;
+    bool is_inference = true;
 };

 common_chat_tool_choice common_chat_tool_choice_parse_oaicompat(const std::string & tool_choice) {
@@ -632,7 +633,6 @@ const char * common_reasoning_format_name(common_reasoning_format format) {
         case COMMON_REASONING_FORMAT_AUTO: return "auto";
         case COMMON_REASONING_FORMAT_DEEPSEEK: return "deepseek";
         case COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY: return "deepseek-legacy";
-        case COMMON_REASONING_FORMAT_GRANITE: return "granite";
         default:
             throw std::runtime_error("Unknown reasoning format");
     }
@@ -1337,6 +1337,17 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
     common_chat_params data;
     auto prompt = apply(tmpl, inputs);

+    // Check if we need to replace the return token with end token during
+    // inference and without generation prompt. For more details see:
+    // https://github.com/ggml-org/llama.cpp/issues/15417
+    if (inputs.is_inference && !inputs.add_generation_prompt) {
+        static constexpr std::string_view return_token = "<|return|>";
+        static constexpr std::string_view end_token = "<|end|>";
+        if (size_t pos = prompt.rfind(return_token); pos != std::string::npos) {
+            prompt.replace(pos, return_token.length(), end_token);
+        }
+    }
+
     data.prompt = prompt;
     data.format = COMMON_CHAT_FORMAT_GPT_OSS;
@@ -558,13 +558,6 @@ std::string string_from(const struct llama_context * ctx, const std::vector<llam

         auto detokenized = common_token_to_piece(ctx, token);

-        detokenized.erase(
-            std::remove_if(
-                detokenized.begin(),
-                detokenized.end(),
-                [](const unsigned char c) { return !std::isprint(c); }),
-            detokenized.end());
-
         buf << "'" << detokenized << "'"
             << ":" << std::to_string(token);
     }
@@ -589,13 +582,6 @@ std::string string_from(const struct llama_context * ctx, const struct llama_bat

         auto detokenized = common_token_to_piece(ctx, batch.token[i]);

-        detokenized.erase(
-            std::remove_if(
-                detokenized.begin(),
-                detokenized.end(),
-                [](const unsigned char c) { return !std::isprint(c); }),
-            detokenized.end());
-
         buf << "\n" << std::to_string(i)
             << ", token '" << detokenized << "'"
             << ", pos " << std::to_string(batch.pos[i])
@@ -239,12 +239,15 @@ struct common_params_diffusion {
     bool add_gumbel_noise = false; // add gumbel noise to the logits if temp > 0.0
 };

+// reasoning API response format (not to be confused as chat template's reasoning format)
 enum common_reasoning_format {
     COMMON_REASONING_FORMAT_NONE,
-    COMMON_REASONING_FORMAT_AUTO,
+    COMMON_REASONING_FORMAT_AUTO, // Same as deepseek, using `message.reasoning_content`
     COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY, // Extract thinking tag contents and return as `message.reasoning_content`, or leave inline in <think> tags in stream mode
     COMMON_REASONING_FORMAT_DEEPSEEK, // Extract thinking tag contents and return as `message.reasoning_content`, including in streaming deltas.
-    COMMON_REASONING_FORMAT_GRANITE, // Extract thinking tag contents and return as `message.reasoning_content`, including in streaming deltas.
+    // do not extend this enum unless you absolutely have to
+    // in most cases, use COMMON_REASONING_FORMAT_AUTO
+    // see: https://github.com/ggml-org/llama.cpp/pull/15408
 };
@@ -372,7 +375,7 @@ struct common_params {
     bool cont_batching = true; // insert new sequences for decoding on-the-fly
     bool flash_attn = false; // flash attention
     bool no_perf = false; // disable performance metrics
-    bool ctx_shift = true; // context shift on inifinite text generation
+    bool ctx_shift = false; // context shift on infinite text generation
     bool swa_full = false; // use full-size SWA cache (https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
     bool kv_unified = false; // enable unified KV cache
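Since the default for `ctx_shift` flips to `false`, callers that relied on the old behavior now have to opt back in through the `--context-shift` option (or its `LLAMA_ARG_CONTEXT_SHIFT` environment variable) added above; a hedged usage sketch (the model path is illustrative):

```bash
# Re-enable context shift for infinite text generation
./build/bin/llama-cli -m path/to/model.gguf -p "Hello" --context-shift

# Equivalent opt-in via the environment variable
LLAMA_ARG_CONTEXT_SHIFT=1 ./build/bin/llama-server -m path/to/model.gguf
```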
@@ -89,13 +89,16 @@ class ModelBase:
     block_count: int
     tensor_map: gguf.TensorNameMap

+    # Mistral format specifics
     is_mistral_format: bool = False
+    disable_mistral_community_chat_template: bool = False

     def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, *, is_big_endian: bool = False,
                  use_temp_file: bool = False, eager: bool = False,
                  metadata_override: Path | None = None, model_name: str | None = None,
                  split_max_tensors: int = 0, split_max_size: int = 0, dry_run: bool = False,
-                 small_first_shard: bool = False, hparams: dict[str, Any] | None = None, remote_hf_model_id: str | None = None):
+                 small_first_shard: bool = False, hparams: dict[str, Any] | None = None, remote_hf_model_id: str | None = None,
+                 disable_mistral_community_chat_template: bool = False):
         if type(self) is ModelBase or \
                 type(self) is TextModel or \
                 type(self) is MmprojModel:
@@ -147,6 +150,9 @@ class ModelBase:
         self.gguf_writer = gguf.GGUFWriter(path=None, arch=gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=self.use_temp_file,
                                            split_max_tensors=split_max_tensors, split_max_size=split_max_size, dry_run=dry_run, small_first_shard=small_first_shard)

+        # Mistral specific
+        self.disable_mistral_community_chat_template = disable_mistral_community_chat_template
+
     @classmethod
     def add_prefix_to_filename(cls, path: Path, prefix: str) -> Path:
         stem, suffix = path.stem, path.suffix
@@ -1334,6 +1340,12 @@ class MmprojModel(ModelBase):
                 return None
         raise KeyError(f"could not find any of: {keys}")

+    def tensor_force_quant(self, name, new_name, bid, n_dims):
+        del bid, name, n_dims  # unused
+        if ".patch_embd.weight" in new_name:
+            return gguf.GGMLQuantizationType.F16 if self.ftype == gguf.LlamaFileType.MOSTLY_F16 else gguf.GGMLQuantizationType.F32
+        return False
+

 @ModelBase.register("GPTNeoXForCausalLM")
 class GPTNeoXModel(TextModel):
@@ -2005,8 +2017,17 @@ class LlamaModel(TextModel):

             template_dir = Path(__file__).parent / "models/templates/"

-            template = MistralModel.get_community_chat_template(vocab, template_dir)
-            self.gguf_writer.add_chat_template(template)
+            if not self.is_mistral_format or not self.disable_mistral_community_chat_template:
+                # Log only for Mistral format that the official tokenization and detokenization is via `mistral-common`.
+                if self.is_mistral_format:
+                    logger.info(
+                        "Using a Mistral community chat template. These templates can be subject to errors in early days or weeks after a release. "
+                        "Mistral recommends to use `mistral-common` to perform tokenization and detokenization."
+                    )
+                template = MistralModel.get_community_chat_template(vocab, template_dir, self.is_mistral_format)
+                self.gguf_writer.add_chat_template(template)
+            else:
+                logger.info("Not using a Mistral community chat template. Ensure to perform the tokenization and detokenization via `mistral-common`.")

     def set_vocab(self):
         if self.is_mistral_format:
@@ -2305,10 +2326,9 @@ class SmolVLMModel(MmprojModel):
         self.gguf_writer.add_vision_use_gelu(True)

     def tensor_force_quant(self, name, new_name, bid, n_dims):
-        del bid, new_name, n_dims  # unused
         if ".embeddings." in name:
             return gguf.GGMLQuantizationType.F32
-        return False
+        return super().tensor_force_quant(name, new_name, bid, n_dims)

     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         del bid  # unused
@@ -3296,12 +3316,9 @@ class Qwen2VLVisionModel(MmprojModel):
         self.gguf_writer.add_vision_attention_layernorm_eps(self.global_config.get("rms_norm_eps", 1e-6))

     def tensor_force_quant(self, name, new_name, bid, n_dims):
-        del bid, name, n_dims  # unused
-        if ".patch_embd." in new_name:
-            return gguf.GGMLQuantizationType.F16
         if ".position_embd." in new_name:
             return gguf.GGMLQuantizationType.F32
-        return False
+        return super().tensor_force_quant(name, new_name, bid, n_dims)

     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         del bid  # unused
@@ -3374,10 +3391,9 @@ class Qwen25OmniModel(Qwen2VLVisionModel):
             yield ("audio_tower.embed_positions.weight", pos_embd)

     def tensor_force_quant(self, name, new_name, bid, n_dims):
-        del bid, new_name, n_dims  # unused
         if ".conv" in name and ".weight" in name:
             return gguf.GGMLQuantizationType.F16
-        return False
+        return super().tensor_force_quant(name, new_name, bid, n_dims)

     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         if name.startswith("thinker."):
@@ -3423,12 +3439,9 @@ class InternVisionModel(MmprojModel):
         self.gguf_writer.add_vision_projector_scale_factor(int(1.0 / downsample_ratio))

     def tensor_force_quant(self, name, new_name, bid, n_dims):
-        del bid, name, n_dims  # unused
-        if ".patch_embd." in new_name:
-            return gguf.GGMLQuantizationType.F16
         if ".position_embd." in new_name:
             return gguf.GGMLQuantizationType.F32
-        return False
+        return super().tensor_force_quant(name, new_name, bid, n_dims)

     def _mapping_interns1_name(self, name):
         names_map = {
@@ -5062,13 +5075,12 @@ class Gemma3VisionModel(MmprojModel):
         self.gguf_writer.add_vision_projector_scale_factor(proj_scale_factor)

     def tensor_force_quant(self, name, new_name, bid, n_dims):
-        del bid, new_name, n_dims  # unused
         # related to https://github.com/ggml-org/llama.cpp/issues/13025
         if "input_projection" in name:
             return gguf.GGMLQuantizationType.F16
         if ".embeddings." in name:
             return gguf.GGMLQuantizationType.F32
-        return False
+        return super().tensor_force_quant(name, new_name, bid, n_dims)

     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         del bid  # unused
@@ -7727,10 +7739,9 @@ class WhisperEncoderModel(MmprojModel):
         self.gguf_writer.add_audio_attention_layernorm_eps(self.hparams.get("layer_norm_eps", 1e-5))

     def tensor_force_quant(self, name, new_name, bid, n_dims):
-        del bid, new_name, n_dims  # unused
         if ".conv" in name and ".weight" in name:
             return gguf.GGMLQuantizationType.F16
-        return False
+        return super().tensor_force_quant(name, new_name, bid, n_dims)

     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         del bid  # unused
@@ -8251,8 +8262,7 @@ class GptOssModel(TextModel):
         self.gguf_writer.add_rope_scaling_orig_ctx_len(rope_scaling.get("original_max_position_embeddings", 4096))


-@ModelBase.register("Lfm2ForCausalLM")
-@ModelBase.register("LFM2ForCausalLM")
+@ModelBase.register("Lfm2ForCausalLM", "LFM2ForCausalLM")
 class LFM2Model(TextModel):
     model_arch = gguf.MODEL_ARCH.LFM2
@@ -8287,6 +8297,13 @@ class LFM2Model(TextModel):
         self._add_feed_forward_length()

     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+        is_vision_tensor = "vision_tower" in name or "multi_modal_projector" in name
+        if is_vision_tensor:
+            # skip vision tensors
+            return []
+
+        name = name.replace("language_model.", "")
+
         # conv op requires 2d tensor
         if 'conv.conv' in name:
             data_torch = data_torch.squeeze(1)
@@ -8294,6 +8311,41 @@ class LFM2Model(TextModel):
         return [(self.map_tensor_name(name), data_torch)]


+@ModelBase.register("Lfm2VlForConditionalGeneration")
+class LFM2VLModel(MmprojModel):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        assert self.hparams_vision is not None
+        # TODO(tarek): for dynamic resolution image_size is not specified, setting here for compatibility
+        self.hparams_vision["image_size"] = 256
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+        self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.LFM2)
+        self.gguf_writer.add_vision_attention_layernorm_eps(self.find_vparam(["layer_norm_eps"]))
+        self.gguf_writer.add_vision_projector_scale_factor(self.global_config.get("downsample_factor", 2))
+        self.gguf_writer.add_vision_use_gelu(True)
+        # python notation, e.g. for vision_feature_layer == -1, we pick last layer -> vision_feature_layers_to_drop = 0
+        vision_feature_layers_to_drop = -(self.global_config.get("vision_feature_layer", -1) + 1)
+        self.gguf_writer.add_vision_block_count(self.find_vparam(self.n_block_keys) - vision_feature_layers_to_drop)
+
+    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+        del bid  # unused
+        is_vision_tensor = "vision_tower" in name or "multi_modal_projector" in name
+
+        if is_vision_tensor:
+            # remove "model." prefix
+            name = name.replace("model.vision_tower.", "vision_tower.")
+            name = name.replace("model.multi_modal_projector.", "multi_modal_projector.")
+
+            if "patch_embedding.weight" in name:
+                data_torch = data_torch.view(data_torch.shape[0], 16, 16, 3).permute(0, 3, 1, 2)
+
+            return [(self.map_tensor_name(name), data_torch)]
+
+        return []  # skip other tensors
+
+
 @ModelBase.register("SmallThinkerForCausalLM")
 class SmallThinkerModel(TextModel):
     model_arch = gguf.MODEL_ARCH.SMALLTHINKER
@@ -8385,7 +8437,7 @@ class MistralModel(LlamaModel):
     undo_permute = False

     @staticmethod
-    def get_community_chat_template(vocab: MistralVocab, templates_dir: Path):
+    def get_community_chat_template(vocab: MistralVocab, templates_dir: Path, is_mistral_format: bool):
         assert TokenizerVersion is not None, "mistral_common is not installed"
         assert isinstance(vocab.tokenizer, (Tekkenizer, SentencePieceTokenizer)), (
             f"Expected Tekkenizer or SentencePieceTokenizer, got {type(vocab.tokenizer)}"
@@ -8406,7 +8458,13 @@ class MistralModel(LlamaModel):
         elif vocab.tokenizer.version == TokenizerVersion.v13:
             template_file = "unsloth-mistral-Devstral-Small-2507.jinja"
         else:
-            raise ValueError(f"Unknown tokenizer type: {vocab.tokenizer_type} and version {vocab.tokenizer.version}")
+            err_message = f"Unknown tokenizer type: {vocab.tokenizer_type} and version {vocab.tokenizer.version}"
+            if is_mistral_format:
+                err_message += (
+                    " . Please pass --disable-mistral-community-chat-template argument to the CLI "
+                    "if you want to skip this error and use the Mistral official `mistral-common` pre-processing library."
+                )
+            raise ValueError(err_message)

         template_path = templates_dir / template_file
         if not template_path.exists():
@@ -8601,6 +8659,13 @@ def parse_args() -> argparse.Namespace:
         "--mistral-format", action="store_true",
         help="Whether the model is stored following the Mistral format.",
     )
+    parser.add_argument(
+        "--disable-mistral-community-chat-template", action="store_true",
+        help=(
+            "Whether to disable usage of Mistral community chat templates. If set, use the Mistral official `mistral-common` library for tokenization and detokenization of Mistral models. "
+            "Using `mistral-common` ensure correctness and zero-day support of tokenization for models converted from the Mistral format but requires to manually setup the tokenization server."
+        )
+    )

     args = parser.parse_args()
     if not args.print_supported_models and args.model is None:
@@ -8707,6 +8772,7 @@ def main() -> None:
         fname_out = ModelBase.add_prefix_to_filename(fname_out, "mmproj-")

     is_mistral_format = args.mistral_format
+    disable_mistral_community_chat_template = args.disable_mistral_community_chat_template

     with torch.inference_mode():
         output_type = ftype_map[args.outtype]
@@ -8733,7 +8799,7 @@ def main() -> None:
             split_max_tensors=args.split_max_tensors,
             split_max_size=split_str_to_n_bytes(args.split_max_size), dry_run=args.dry_run,
             small_first_shard=args.no_tensor_first_split,
-            remote_hf_model_id=hf_repo_id,
+            remote_hf_model_id=hf_repo_id, disable_mistral_community_chat_template=disable_mistral_community_chat_template
         )

         if args.vocab_only:
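The new option is passed alongside `--mistral-format` when converting; a hedged usage sketch, assuming the converter entry point is `convert_hf_to_gguf.py` at the repository root (paths and the output name are illustrative):

```bash
# Convert a Mistral-format checkpoint without embedding a community chat template;
# tokenization is then expected to go through mistral-common.
python convert_hf_to_gguf.py /path/to/mistral-model \
    --mistral-format \
    --disable-mistral-community-chat-template \
    --outfile model.gguf
```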
@@ -198,10 +198,9 @@ The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enab
 The following compilation options are also available to tweak performance:

 | Option | Legal values | Default | Description |
 |-------------------------------|------------------------|---------|-------------|
 | GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, CDNA and RDNA3+). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
-| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models |
+| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for CDNA and RDNA4) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000). |
-| GGML_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
 | GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
 | GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
@@ -194,7 +194,7 @@ llama_print_timings: total time = 44411.01 ms / 377 tokens
 ## Orin compile and run
 ### compile
 ```sh
-make GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_87 GGML_CUDA_F16=1 -j 32
+make GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_87 -j 32
 ```
 ### run on Orin
 ### case 1
@@ -34,6 +34,7 @@ else()
     add_subdirectory(gen-docs)
     add_subdirectory(training)
     add_subdirectory(diffusion)
+    add_subdirectory(model-conversion)
     if (NOT GGML_BACKEND_DL)
         add_subdirectory(convert-llama2c-to-ggml)
         # these examples use the backends directly and cannot be built with dynamic loading
@@ -1,4 +1,5 @@
 This is a swift clone of `examples/batched`.

-$ `make`
-$ `./llama-batched-swift MODEL_PATH [PROMPT] [PARALLEL]`
+```bash
+$ ./llama-batched-swift MODEL_PATH [PROMPT] [PARALLEL]
+```
@@ -5,3 +5,9 @@ Demonstration of lookahead decoding technique:
 https://lmsys.org/blog/2023-11-21-lookahead-decoding/

 More info: https://github.com/ggml-org/llama.cpp/pull/4207
+
+Sample command:
+
+```bash
+llama-lookahead -hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF -p "// network server implemented in C\n// author: Peter Hacker\n\n#include" -e -ngl 99 -t 4 -n 512 -c 4096 -kvu
+```
@@ -0,0 +1,3 @@
.model_name
data
ppl
@@ -0,0 +1,5 @@
set(TARGET llama-logits)
add_executable(${TARGET} logits.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
@ -0,0 +1,163 @@
# Validation functions
define validate_model_path
	@if [ -z "$(MODEL_PATH)" ]; then \
		echo "Error: MODEL_PATH must be provided either as:"; \
		echo "  1. Environment variable: export MODEL_PATH=/path/to/model"; \
		echo "  2. Command line argument: make $(1) MODEL_PATH=/path/to/model"; \
		exit 1; \
	fi
endef

define validate_embedding_model_path
	@if [ -z "$(EMBEDDING_MODEL_PATH)" ]; then \
		echo "Error: EMBEDDING_MODEL_PATH must be provided either as:"; \
		echo "  1. Environment variable: export EMBEDDING_MODEL_PATH=/path/to/model"; \
		echo "  2. Command line argument: make $(1) EMBEDDING_MODEL_PATH=/path/to/model"; \
		exit 1; \
	fi
endef

###
### Causal Model targets/recipes
###
causal-convert-model-bf16: OUTTYPE=bf16
causal-convert-model-bf16: causal-convert-model

causal-convert-model:
	$(call validate_model_path,causal-convert-model)
	@MODEL_NAME="$(MODEL_NAME)" OUTTYPE="$(OUTTYPE)" MODEL_PATH="$(MODEL_PATH)" \
	METADATA_OVERRIDE="$(METADATA_OVERRIDE)" \
	./scripts/causal/convert-model.sh

causal-run-original-model:
	$(call validate_model_path,causal-run-original-model)
	@MODEL_PATH="$(MODEL_PATH)" ./scripts/causal/run-org-model.py

causal-run-converted-model:
	@CONVERTED_MODEL="$(CONVERTED_MODEL)" ./scripts/causal/run-converted-model.sh

causal-verify-logits: causal-run-original-model causal-run-converted-model
	@./scripts/causal/compare-logits.py
	@MODEL_PATH="$(MODEL_PATH)" ./scripts/utils/check-nmse.py -m ${MODEL_PATH}

causal-run-original-embeddings:
	@./scripts/causal/run-casual-gen-embeddings-org.sh

causal-run-converted-embeddings:
	@./scripts/causal/run-converted-model-embeddings-logits.sh

causal-verify-embeddings: causal-run-original-embeddings causal-run-converted-embeddings
	@./scripts/causal/compare-embeddings-logits.sh

causal-inspect-original-model:
	@./scripts/utils/inspect-org-model.py

causal-inspect-converted-model:
	@./scripts/utils/inspect-converted-model.sh

causal-start-embedding-server:
	@./scripts/utils/run-embedding-server.sh ${CONVERTED_MODEL}

causal-curl-embedding-endpoint: causal-run-original-embeddings
	@./scripts/utils/curl-embedding-server.sh | ./scripts/causal/compare-embeddings-logits.sh

causal-quantize-Q8_0: QUANTIZED_TYPE = Q8_0
causal-quantize-Q8_0: causal-quantize-model

causal-quantize-Q4_0: QUANTIZED_TYPE = Q4_0
causal-quantize-Q4_0: causal-quantize-model

causal-quantize-model:
	@CONVERTED_MODEL="$(CONVERTED_MODEL)" QUANTIZED_TYPE="$(QUANTIZED_TYPE)" ./scripts/utils/quantize.sh ${CONVERTED_MODEL} ${QUANTIZED_TYPE}
	@echo "Export the quantized model path to QUANTIZED_MODEL variable in your environment"

causal-run-quantized-model:
	@QUANTIZED_MODEL="$(QUANTIZED_MODEL)" ./scripts/causal/run-converted-model.sh ${QUANTIZED_MODEL}


###
### Embedding Model targets/recipes
###

embedding-convert-model-bf16: OUTTYPE=bf16
embedding-convert-model-bf16: embedding-convert-model

embedding-convert-model:
	$(call validate_embedding_model_path,embedding-convert-model)
	@MODEL_NAME="$(MODEL_NAME)" OUTTYPE="$(OUTTYPE)" MODEL_PATH="$(EMBEDDING_MODEL_PATH)" \
	METADATA_OVERRIDE="$(METADATA_OVERRIDE)" \
	./scripts/embedding/convert-model.sh

embedding-run-original-model:
	$(call validate_embedding_model_path,embedding-run-original-model)
	@EMBEDDING_MODEL_PATH="$(EMBEDDING_MODEL_PATH)" ./scripts/embedding/run-original-model.py

embedding-run-converted-model:
	@CONVERTED_EMBEDDING_MODEL="$(CONVERTED_EMBEDDING_MODEL)" ./scripts/embedding/run-converted-model.sh ${CONVERTED_EMBEDDING_MODEL}

embedding-verify-logits: embedding-run-original-model embedding-run-converted-model
	@./scripts/embedding/compare-embeddings-logits.sh

embedding-inspect-original-model:
	$(call validate_embedding_model_path,embedding-inspect-original-model)
	@EMBEDDING_MODEL_PATH="$(EMBEDDING_MODEL_PATH)" ./scripts/utils/inspect-org-model.py -m ${EMBEDDING_MODEL_PATH}

embedding-inspect-converted-model:
	@CONVERTED_EMBEDDING_MODEL="$(CONVERTED_EMBEDDING_MODEL)" ./scripts/utils/inspect-converted-model.sh ${CONVERTED_EMBEDDING_MODEL}

embedding-start-embedding-server:
	@./scripts/utils/run-embedding-server.sh ${CONVERTED_EMBEDDING_MODEL}

embedding-curl-embedding-endpoint:
	@./scripts/utils/curl-embedding-server.sh | ./scripts/embedding/compare-embeddings-logits.sh

embedding-quantize-Q8_0: QUANTIZED_TYPE = Q8_0
embedding-quantize-Q8_0: embedding-quantize-model

embedding-quantize-Q4_0: QUANTIZED_TYPE = Q4_0
embedding-quantize-Q4_0: embedding-quantize-model

embedding-quantize-model:
	@./scripts/utils/quantize.sh ${CONVERTED_EMBEDDING_MODEL} ${QUANTIZED_TYPE}
	@echo "Export the quantized model path to QUANTIZED_EMBEDDING_MODEL variable in your environment"

embedding-run-quantized-model:
	@./scripts/embedding/run-converted-model.sh ${QUANTIZED_EMBEDDING_MODEL}

###
### Perplexity targets/recipes
###
perplexity-data-gen:
	CONVERTED_MODEL="$(CONVERTED_MODEL)" ./scripts/utils/perplexity-gen.sh

perplexity-run-full:
	QUANTIZED_MODEL="$(QUANTIZED_MODEL)" LOGITS_FILE="$(LOGITS_FILE)" \
	./scripts/utils/perplexity-run.sh

perplexity-run:
	QUANTIZED_MODEL="$(QUANTIZED_MODEL)" ./scripts/utils/perplexity-run-simple.sh

###
### HuggingFace targets/recipes
###

hf-create-model:
	@./scripts/utils/hf-create-model.py -m "${MODEL_NAME}" -ns "${NAMESPACE}" -b "${ORIGINAL_BASE_MODEL}"

hf-create-model-private:
	@./scripts/utils/hf-create-model.py -m "${MODEL_NAME}" -ns "${NAMESPACE}" -b "${ORIGINAL_BASE_MODEL}" -p

hf-upload-gguf-to-model:
	@./scripts/utils/hf-upload-gguf-model.py -m "${MODEL_PATH}" -r "${REPO_ID}" -o "${NAME_IN_REPO}"

hf-create-collection:
	@./scripts/utils/hf-create-collection.py -n "${NAME}" -d "${DESCRIPTION}" -ns "${NAMESPACE}"

hf-add-model-to-collection:
	@./scripts/utils/hf-add-model-to-collection.py -c "${COLLECTION}" -m "${MODEL}"


.PHONY: clean
clean:
	@${RM} -rf data .converted_embedding_model.txt .converted_model.txt .embedding_model_name.txt .model_name.txt
@ -0,0 +1,335 @@
# Model Conversion Example

This directory contains scripts and code to help in the process of converting
HuggingFace PyTorch models to GGUF format.

The motivation for this is that the conversion process can often be iterative:
the original model is inspected, converted, updates are made to llama.cpp, the
model is converted again, and so on. Once the model has been converted it needs
to be verified against the original model, optionally quantized, and in some
cases the perplexity of the quantized model checked. Finally, the model/models
need to be uploaded to the ggml-org on Hugging Face. This tool/example tries to
help with this process.

### Overview
The idea is that the makefile targets and scripts here can be used in the
development/conversion process, assisting with things like:

* inspect/run the original model to figure out how it works
* convert the original model to GGUF format
* inspect/run the converted model
* verify the logits produced by the original model and the converted model
* quantize the model to GGUF format
* run perplexity evaluation to verify that the quantized model is performing
  as expected
* upload the model to HuggingFace to make it available for others

## Setup
Create a virtual Python environment:
```console
$ python3.11 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
```

## Causal Language Model Conversion
This section describes the steps to convert a causal language model to GGUF and
to verify that the conversion was successful.

### Download the original model
First, clone the original model to some local directory:
```console
$ mkdir models && cd models
$ git clone https://huggingface.co/user/model_name
$ cd model_name
$ git lfs install
$ git lfs pull
```

### Set the MODEL_PATH
The path to the downloaded model can be provided in two ways:

**Option 1: Environment variable (recommended for iterative development)**
```console
export MODEL_PATH=~/work/ai/models/some_model
```

**Option 2: Command line argument (for one-off tasks)**
```console
make causal-convert-model MODEL_PATH=~/work/ai/models/some_model
```

Command line arguments take precedence over environment variables when both are provided.

In cases where the transformer implementation for the model has not been released
yet, it is possible to set the environment variable `UNRELEASED_MODEL_NAME`, which
will cause the transformer implementation to be loaded explicitly instead of via
`AutoModelForCausalLM`:
```
export UNRELEASED_MODEL_NAME=SomeNewModel
```
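When this variable is set, the helper scripts import the model class directly from
the `transformers` source tree instead of going through `AutoModelForCausalLM`. A
minimal sketch of that lookup, mirroring what `scripts/causal/run-org-model.py` does
(the `modular_<name>` module/class naming convention is assumed to match the in-tree
implementation):
```python
import importlib
import os

from transformers import AutoModelForCausalLM

model_path = os.environ["MODEL_PATH"]
unreleased_model_name = os.getenv("UNRELEASED_MODEL_NAME")

if unreleased_model_name:
    name_lower = unreleased_model_name.lower()
    module_path = f"transformers.models.{name_lower}.modular_{name_lower}"
    class_name = f"{unreleased_model_name}ForCausalLM"
    # Load the concrete class explicitly since AutoModelForCausalLM does not know about it yet.
    model_class = getattr(importlib.import_module(module_path), class_name)
    model = model_class.from_pretrained(model_path)
else:
    model = AutoModelForCausalLM.from_pretrained(model_path)
```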

### Inspecting the original tensors
```console
# Using environment variable
(venv) $ make causal-inspect-original-model

# Or using command line argument
(venv) $ make causal-inspect-original-model MODEL_PATH=~/work/ai/models/some_model
```

### Running the original model
This is mainly to verify that the original model works, and to produce output that
can be compared with the converted model.
```console
# Using environment variable
(venv) $ make causal-run-original-model

# Or using command line argument
(venv) $ make causal-run-original-model MODEL_PATH=~/work/ai/models/some_model
```
This command will save two files to the `data` directory: a binary file containing
logits, which will be used for comparison with the converted model later, and a
text file which allows for manual visual inspection.
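The binary file is simply the raw float32 logits for the last token (the original
model script writes them with `numpy.tofile`), so it can be loaded for quick ad-hoc
inspection with a few lines of Python. The file name below is an example and depends
on the model name:
```python
import numpy as np

# Raw float32 next-token logits, one value per vocabulary entry.
logits = np.fromfile("data/pytorch-some_model.bin", dtype=np.float32)
print("vocab size:", logits.shape[0])
print("first 10 logits:", logits[:10])
print("argmax token id:", int(np.argmax(logits)))
```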

### Model conversion
After updates have been made to [gguf-py](../../gguf-py) to add support for the
new model, the model can be converted to GGUF format using the following command:
```console
# Using environment variable
(venv) $ make causal-convert-model

# Or using command line argument
(venv) $ make causal-convert-model MODEL_PATH=~/work/ai/models/some_model
```

### Inspecting the converted model
The converted model can be inspected using the following command:
```console
(venv) $ make causal-inspect-converted-model
```

### Running the converted model
```console
(venv) $ make causal-run-converted-model
```

### Model logits verification
The following target will run the original model and the converted model and
compare the logits:
```console
(venv) $ make causal-verify-logits
```
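Internally this compares the two raw float32 logit files. The core of the check
(see `scripts/utils/check-nmse.py`) is a normalized mean squared error; a rough
sketch, with example file names, looks like this:
```python
import numpy as np

reference = np.fromfile("data/pytorch-some_model.bin", dtype=np.float32)   # original model
test      = np.fromfile("data/llamacpp-some_model.bin", dtype=np.float32)  # converted model

# NMSE = MSE normalized by the variance of the reference logits.
nmse = np.mean((test - reference) ** 2) / np.var(reference)
print(f"NMSE: {nmse:.2e}")  # values below ~1e-2 are treated as a pass
```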

### Quantizing the model
The causal model can be quantized to GGUF format using the following command:
```console
(venv) $ make causal-quantize-Q8_0
Quantized model saved to: /path/to/quantized/model-Q8_0.gguf
Export the quantized model path to QUANTIZED_MODEL variable in your environment
```
This will show the path to the quantized model in the terminal, which can then
be used to set the `QUANTIZED_MODEL` environment variable:
```console
export QUANTIZED_MODEL=/path/to/quantized/model-Q8_0.gguf
```
Then the quantized model can be run using the following command:
```console
(venv) $ make causal-run-quantized-model
```


## Embedding Language Model Conversion

### Download the original model
```console
$ mkdir models && cd models
$ git clone https://huggingface.co/user/model_name
$ cd model_name
$ git lfs install
$ git lfs pull
```

The path to the embedding model can be provided in two ways:

**Option 1: Environment variable (recommended for iterative development)**
```console
export EMBEDDING_MODEL_PATH=~/path/to/embedding_model
```

**Option 2: Command line argument (for one-off tasks)**
```console
make embedding-convert-model EMBEDDING_MODEL_PATH=~/path/to/embedding_model
```

Command line arguments take precedence over environment variables when both are provided.

### Running the original model
This is mainly to verify that the original model works and to compare the output
with the output from the converted model.
```console
# Using environment variable
(venv) $ make embedding-run-original-model

# Or using command line argument
(venv) $ make embedding-run-original-model EMBEDDING_MODEL_PATH=~/path/to/embedding_model
```
This command will save two files to the `data` directory: a binary file containing
logits, which will be used for comparison with the converted model, and a text file
which allows for manual visual inspection.

### Model conversion
After updates have been made to [gguf-py](../../gguf-py) to add support for the
new model, the model can be converted to GGUF format using the following command:
```console
(venv) $ make embedding-convert-model
```

### Run the converted model
```console
(venv) $ make embedding-run-converted-model
```

### Model logits verification
The following target will run the original model and the converted model (which
was done manually in the previous steps) and compare the logits:
```console
(venv) $ make embedding-verify-logits
```

### llama-server verification
To verify that the converted model works with llama-server, the following
command can be used:
```console
(venv) $ make embedding-start-embedding-server
```
Then open another terminal and set the `EMBEDDING_MODEL_PATH` environment
variable, as it will not be inherited by the new terminal:
```console
(venv) $ make embedding-curl-embedding-endpoint
```
This will call the `embedding` endpoint and the output will be piped into
the same verification script as used by the target `embedding-verify-logits`.
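A rough Python equivalent of the curl helper is shown below. The endpoint path and
request field are assumptions based on the default llama-server API, and the prompt
matches the one used by the comparison scripts; consult
`scripts/utils/curl-embedding-server.sh` for the exact invocation:
```python
import json
import urllib.request

# Query the embedding endpoint of the llama-server started above (default port 8080).
req = urllib.request.Request(
    "http://localhost:8080/embedding",
    data=json.dumps({"content": "Hello world today"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# Without pooling each returned item carries one embedding vector per token;
# flatten them the same way the comparison script does.
values = [v for item in data for token_emb in item["embedding"] for v in token_emb]
print(f"Total embedding values: {len(values)}")
```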

The causal model can also be used to produce embeddings, and this can be verified
using the following commands:
```console
(venv) $ make causal-start-embedding-server
```
Then open another terminal and set the `MODEL_PATH` environment
variable, as it will not be inherited by the new terminal:
```console
(venv) $ make causal-curl-embedding-endpoint
```

### Quantizing the model
The embedding model can be quantized to GGUF format using the following command:
```console
(venv) $ make embedding-quantize-Q8_0
Quantized model saved to: /path/to/quantized/model-Q8_0.gguf
Export the quantized model path to QUANTIZED_EMBEDDING_MODEL variable in your environment
```
This will show the path to the quantized model in the terminal, which can then
be used to set the `QUANTIZED_EMBEDDING_MODEL` environment variable:
```console
export QUANTIZED_EMBEDDING_MODEL=/path/to/quantized/model-Q8_0.gguf
```
Then the quantized model can be run using the following command:
```console
(venv) $ make embedding-run-quantized-model
```

## Perplexity Evaluation

### Simple perplexity evaluation
This allows running the perplexity evaluation without having to generate a
token/logits file:
```console
(venv) $ make perplexity-run QUANTIZED_MODEL=~/path/to/quantized/model.gguf
```
This will use the wikitext dataset to run the perplexity evaluation and
output the perplexity score to the terminal. This value can then be compared
with the perplexity score of the unquantized model.

### Full perplexity evaluation
First use the converted, non-quantized, model to generate the perplexity evaluation
dataset using the following command:
```console
$ make perplexity-data-gen CONVERTED_MODEL=~/path/to/converted/model.gguf
```
This will generate a file in the `data` directory named after the model and with
a `.kld` suffix which contains the tokens and the logits for the wikitext dataset.

After the dataset has been generated, the perplexity evaluation can be run using
the quantized model:
```console
$ make perplexity-run-full QUANTIZED_MODEL=~/path/to/quantized/model-Qxx.gguf LOGITS_FILE=data/model.gguf.ppl
```

> 📝 **Note:** The `LOGITS_FILE` generated by the previous command
> can be very large, so make sure you have enough disk space available.

## HuggingFace utilities
The following targets are useful for creating collections and model repositories
on Hugging Face in the ggml-org. These can be used when preparing a release
to script the process for new model releases.

For the following targets a `HF_TOKEN` environment variable is required.

> 📝 **Note:** Don't forget to log out from Hugging Face after running these
> commands, otherwise you might have issues pulling/cloning repositories as
> the token will still be in use:
> $ huggingface-cli logout
> $ unset HF_TOKEN

### Create a new Hugging Face Model (model repository)
This will create a new model repository on Hugging Face with the specified
model name.
```console
(venv) $ make hf-create-model MODEL_NAME='TestModel' NAMESPACE="danbev"
Repository ID: danbev/TestModel-GGUF
Repository created: https://huggingface.co/danbev/TestModel-GGUF
```
Note that we append a `-GGUF` suffix to the model name to ensure a consistent
naming convention for GGUF models.

### Upload a GGUF model to model repository
The following target uploads a model to an existing Hugging Face model repository.
```console
(venv) $ make hf-upload-gguf-to-model MODEL_PATH=dummy-model1.gguf REPO_ID=danbev/TestModel-GGUF
📤 Uploading dummy-model1.gguf to danbev/TestModel-GGUF/dummy-model1.gguf
✅ Upload successful!
🔗 File available at: https://huggingface.co/danbev/TestModel-GGUF/blob/main/dummy-model1.gguf
```
This command can also be used to update an existing model file in a repository.

### Create a new Collection
```console
(venv) $ make hf-create-collection NAME=TestCollection DESCRIPTION="Collection for testing scripts" NAMESPACE=danbev
🚀 Creating Hugging Face Collection
Title: TestCollection
Description: Collection for testing scripts
Namespace: danbev
Private: False
✅ Authenticated as: danbev
📚 Creating collection: 'TestCollection'...
✅ Collection created successfully!
📋 Collection slug: danbev/testcollection-68930fcf73eb3fc200b9956d
🔗 Collection URL: https://huggingface.co/collections/danbev/testcollection-68930fcf73eb3fc200b9956d

🎉 Collection created successfully!
Use this slug to add models: danbev/testcollection-68930fcf73eb3fc200b9956d
```

### Add model to a Collection
```console
(venv) $ make hf-add-model-to-collection COLLECTION=danbev/testcollection-68930fcf73eb3fc200b9956d MODEL=danbev/TestModel-GGUF
✅ Authenticated as: danbev
🔍 Checking if model exists: danbev/TestModel-GGUF
✅ Model found: danbev/TestModel-GGUF
📚 Adding model to collection...
✅ Model added to collection successfully!
🔗 Collection URL: https://huggingface.co/collections/danbev/testcollection-68930fcf73eb3fc200b9956d

🎉 Model added successfully!
```
@ -0,0 +1,209 @@
|
||||||
|
#include "llama.h"
|
||||||
|
#include <cstdio>
|
||||||
|
#include <cstring>
|
||||||
|
#include <string>
|
||||||
|
#include <vector>
|
||||||
|
#include <ctype.h>
|
||||||
|
#include <filesystem>
|
||||||
|
|
||||||
|
static void print_usage(int, char ** argv) {
|
||||||
|
printf("\nexample usage:\n");
|
||||||
|
printf("\n %s -m model.gguf [-ngl n_gpu_layers] -embd-mode [prompt]\n", argv[0]);
|
||||||
|
printf("\n");
|
||||||
|
}
|
||||||
|
|
||||||
|
int main(int argc, char ** argv) {
|
||||||
|
std::string model_path;
|
||||||
|
std::string prompt = "Hello, my name is";
|
||||||
|
int ngl = 0;
|
||||||
|
bool embedding_mode = false;
|
||||||
|
|
||||||
|
{
|
||||||
|
int i = 1;
|
||||||
|
for (; i < argc; i++) {
|
||||||
|
if (strcmp(argv[i], "-m") == 0) {
|
||||||
|
if (i + 1 < argc) {
|
||||||
|
model_path = argv[++i];
|
||||||
|
} else {
|
||||||
|
print_usage(argc, argv);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
} else if (strcmp(argv[i], "-ngl") == 0) {
|
||||||
|
if (i + 1 < argc) {
|
||||||
|
try {
|
||||||
|
ngl = std::stoi(argv[++i]);
|
||||||
|
} catch (...) {
|
||||||
|
print_usage(argc, argv);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
print_usage(argc, argv);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
} else if (strcmp(argv[i], "-embd-mode") == 0) {
|
||||||
|
if (i + 1 < argc) {
|
||||||
|
try {
|
||||||
|
embedding_mode = true;
|
||||||
|
} catch (...) {
|
||||||
|
print_usage(argc, argv);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
print_usage(argc, argv);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
// prompt starts here
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (model_path.empty()) {
|
||||||
|
print_usage(argc, argv);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (i < argc) {
|
||||||
|
prompt = argv[i++];
|
||||||
|
for (; i < argc; i++) {
|
||||||
|
prompt += " ";
|
||||||
|
prompt += argv[i];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
ggml_backend_load_all();
|
||||||
|
llama_model_params model_params = llama_model_default_params();
|
||||||
|
model_params.n_gpu_layers = ngl;
|
||||||
|
|
||||||
|
llama_model * model = llama_model_load_from_file(model_path.c_str(), model_params);
|
||||||
|
|
||||||
|
if (model == NULL) {
|
||||||
|
fprintf(stderr , "%s: error: unable to load model\n" , __func__);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Extract basename from model_path
|
||||||
|
const char * basename = strrchr(model_path.c_str(), '/');
|
||||||
|
basename = (basename == NULL) ? model_path.c_str() : basename + 1;
|
||||||
|
|
||||||
|
char model_name[256];
|
||||||
|
strncpy(model_name, basename, 255);
|
||||||
|
model_name[255] = '\0';
|
||||||
|
|
||||||
|
char * dot = strrchr(model_name, '.');
|
||||||
|
if (dot != NULL && strcmp(dot, ".gguf") == 0) {
|
||||||
|
*dot = '\0';
|
||||||
|
}
|
||||||
|
printf("Model name: %s\n", model_name);
|
||||||
|
|
||||||
|
const llama_vocab * vocab = llama_model_get_vocab(model);
|
||||||
|
const int n_prompt = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true);
|
||||||
|
|
||||||
|
std::vector<llama_token> prompt_tokens(n_prompt);
|
||||||
|
if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), true, true) < 0) {
|
||||||
|
fprintf(stderr, "%s: error: failed to tokenize the prompt\n", __func__);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
llama_context_params ctx_params = llama_context_default_params();
|
||||||
|
ctx_params.n_ctx = n_prompt;
|
||||||
|
ctx_params.n_batch = n_prompt;
|
||||||
|
ctx_params.no_perf = false;
|
||||||
|
if (embedding_mode) {
|
||||||
|
ctx_params.embeddings = true;
|
||||||
|
ctx_params.n_ubatch = ctx_params.n_batch;
|
||||||
|
}
|
||||||
|
|
||||||
|
llama_context * ctx = llama_init_from_model(model, ctx_params);
|
||||||
|
if (ctx == NULL) {
|
||||||
|
fprintf(stderr , "%s: error: failed to create the llama_context\n" , __func__);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
printf("Input prompt: \"%s\"\n", prompt.c_str());
|
||||||
|
printf("Tokenized prompt (%d tokens): ", n_prompt);
|
||||||
|
for (auto id : prompt_tokens) {
|
||||||
|
char buf[128];
|
||||||
|
int n = llama_token_to_piece(vocab, id, buf, sizeof(buf), 0, true);
|
||||||
|
if (n < 0) {
|
||||||
|
fprintf(stderr, "%s: error: failed to convert token to piece\n", __func__);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
std::string s(buf, n);
|
||||||
|
printf("%s", s.c_str());
|
||||||
|
}
|
||||||
|
printf("\n");
|
||||||
|
|
||||||
|
llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());
|
||||||
|
|
||||||
|
if (llama_decode(ctx, batch)) {
|
||||||
|
fprintf(stderr, "%s : failed to eval\n", __func__);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
float * logits;
|
||||||
|
int n_logits;
|
||||||
|
const char * type;
|
||||||
|
|
||||||
|
if (embedding_mode) {
|
||||||
|
logits = llama_get_embeddings(ctx);
|
||||||
|
n_logits = llama_model_n_embd(model) * batch.n_tokens;
|
||||||
|
type = "-embeddings";
|
||||||
|
printf("Embeddings size: %d\n", n_logits);
|
||||||
|
} else {
|
||||||
|
logits = llama_get_logits_ith(ctx, batch.n_tokens - 1);
|
||||||
|
n_logits = llama_vocab_n_tokens(vocab);
|
||||||
|
type = "";
|
||||||
|
printf("Vocab size: %d\n", n_logits);
|
||||||
|
}
|
||||||
|
|
||||||
|
std::filesystem::create_directory("data");
|
||||||
|
|
||||||
|
// Save logits to binary file
|
||||||
|
char bin_filename[512];
|
||||||
|
snprintf(bin_filename, sizeof(bin_filename), "data/llamacpp-%s%s.bin", model_name, type);
|
||||||
|
printf("Saving logits to %s\n", bin_filename);
|
||||||
|
|
||||||
|
FILE * f = fopen(bin_filename, "wb");
|
||||||
|
if (f == NULL) {
|
||||||
|
fprintf(stderr, "%s: error: failed to open binary output file\n", __func__);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
fwrite(logits, sizeof(float), n_logits, f);
|
||||||
|
fclose(f);
|
||||||
|
|
||||||
|
// Also save as text for debugging
|
||||||
|
char txt_filename[512];
|
||||||
|
snprintf(txt_filename, sizeof(txt_filename), "data/llamacpp-%s%s.txt", model_name, type);
|
||||||
|
f = fopen(txt_filename, "w");
|
||||||
|
if (f == NULL) {
|
||||||
|
fprintf(stderr, "%s: error: failed to open text output file\n", __func__);
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
for (int i = 0; i < n_logits; i++) {
|
||||||
|
fprintf(f, "%d: %.6f\n", i, logits[i]); // Added index and changed format
|
||||||
|
}
|
||||||
|
fclose(f);
|
||||||
|
|
||||||
|
// Print first and last 10 logits for quick verification
|
||||||
|
printf("First 10 logits: ");
|
||||||
|
for (int i = 0; i < 10 && i < n_logits; i++) {
|
||||||
|
printf("%.6f ", logits[i]);
|
||||||
|
}
|
||||||
|
printf("\n");
|
||||||
|
|
||||||
|
printf("Last 10 logits: ");
|
||||||
|
for (int i = n_logits - 10; i < n_logits; i++) {
|
||||||
|
if (i >= 0) printf("%.6f ", logits[i]);
|
||||||
|
}
|
||||||
|
printf("\n\n");
|
||||||
|
|
||||||
|
printf("Logits saved to %s\n", bin_filename);
|
||||||
|
printf("Logits saved to %s\n", txt_filename);
|
||||||
|
|
||||||
|
llama_free(ctx);
|
||||||
|
llama_model_free(model);
|
||||||
|
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
@ -0,0 +1,5 @@
--extra-index-url https://download.pytorch.org/whl/cpu
torch~=2.6.0
torchvision~=0.21.0
transformers~=4.55.0
huggingface-hub~=0.34.0
@ -0,0 +1,43 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
MODEL_PATH="${1:-"$MODEL_PATH"}"
|
||||||
|
MODEL_NAME="${2:-$(basename "$MODEL_PATH")}"
|
||||||
|
|
||||||
|
if [ -t 0 ]; then
|
||||||
|
CPP_EMBEDDINGS="data/llamacpp-${MODEL_NAME}-embeddings.bin"
|
||||||
|
else
|
||||||
|
# Process piped JSON data and convert to binary (matching logits.cpp format)
|
||||||
|
TEMP_FILE=$(mktemp /tmp/tmp.XXXXXX.binn)
|
||||||
|
python3 -c "
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
import struct
|
||||||
|
|
||||||
|
data = json.load(sys.stdin)
|
||||||
|
|
||||||
|
# Flatten all embeddings completely
|
||||||
|
flattened = []
|
||||||
|
for item in data:
|
||||||
|
embedding = item['embedding']
|
||||||
|
for token_embedding in embedding:
|
||||||
|
flattened.extend(token_embedding)
|
||||||
|
|
||||||
|
print(f'Total embedding values: {len(flattened)}', file=sys.stderr)
|
||||||
|
|
||||||
|
# Write as binary floats - matches logits.cpp fwrite format
|
||||||
|
with open('$TEMP_FILE', 'wb') as f:
|
||||||
|
for value in flattened:
|
||||||
|
f.write(struct.pack('f', value))
|
||||||
|
"
|
||||||
|
CPP_EMBEDDINGS="$TEMP_FILE"
|
||||||
|
trap "rm -f $TEMP_FILE" EXIT
|
||||||
|
fi
|
||||||
|
|
||||||
|
python scripts/utils/semantic_check.py --model-path $MODEL_PATH \
|
||||||
|
--python-embeddings data/pytorch-${MODEL_NAME}-embeddings.bin \
|
||||||
|
--cpp-embeddings $CPP_EMBEDDINGS \
|
||||||
|
--prompt "Hello world today" \
|
||||||
|
--causal
|
||||||
|
|
||||||
|
|
@ -0,0 +1,88 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
def quick_logits_check(pytorch_file, llamacpp_file):
|
||||||
|
"""Lightweight sanity check before NMSE"""
|
||||||
|
|
||||||
|
try:
|
||||||
|
pytorch_logits = np.fromfile(pytorch_file, dtype=np.float32)
|
||||||
|
llamacpp_logits = np.fromfile(llamacpp_file, dtype=np.float32)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"❌ NOK: Failed to load files - {e}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Check shapes match
|
||||||
|
if pytorch_logits.shape != llamacpp_logits.shape:
|
||||||
|
print(f"❌ NOK: Shape mismatch - PyTorch: {pytorch_logits.shape}, llama.cpp: {llamacpp_logits.shape}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Calculate key metrics
|
||||||
|
diff = pytorch_logits - llamacpp_logits
|
||||||
|
abs_diff = np.abs(diff)
|
||||||
|
max_diff = np.max(abs_diff)
|
||||||
|
|
||||||
|
# Get top 10 predictions from both models
|
||||||
|
pytorch_top10 = np.argsort(pytorch_logits)[-10:][::-1]
|
||||||
|
llamacpp_top10 = np.argsort(llamacpp_logits)[-10:][::-1]
|
||||||
|
print(f"Top 10 PyTorch logits: {pytorch_logits[pytorch_top10]}")
|
||||||
|
print(f"Top 10 llama.cpp logits: {llamacpp_logits[llamacpp_top10]}")
|
||||||
|
print(f"Max absolute difference: {max_diff:.4f}")
|
||||||
|
|
||||||
|
if max_diff > 1.0:
|
||||||
|
print(f"❌ NOK: Large differences detected - max diff: {max_diff:.4f}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
def main():
|
||||||
|
model_path = os.getenv('MODEL_PATH')
|
||||||
|
if not model_path:
|
||||||
|
print("Error: MODEL_PATH environment variable not set")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
if not os.path.exists(model_path):
|
||||||
|
print(f"Error: Model file not found: {model_path}")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
model_name = os.path.splitext(os.path.basename(model_path))[0]
|
||||||
|
data_dir = Path("data")
|
||||||
|
|
||||||
|
pytorch_file = data_dir / f"pytorch-{model_name}.bin"
|
||||||
|
llamacpp_file = data_dir / f"llamacpp-{model_name}.bin"
|
||||||
|
|
||||||
|
if not pytorch_file.exists():
|
||||||
|
print(f"Error: PyTorch logits file not found: {pytorch_file}")
|
||||||
|
print("Please run scripts/run-org-model.sh first to generate this file.")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
if not llamacpp_file.exists():
|
||||||
|
print(f"Error: llama.cpp logits file not found: {llamacpp_file}")
|
||||||
|
print("Please run scripts/run-converted-model.sh first to generate this file.")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
print("Checked all required files were found. Proceeding...\n")
|
||||||
|
|
||||||
|
|
||||||
|
print("🔍 GGML Model Validation for model ", model_name)
|
||||||
|
print("=" * 40)
|
||||||
|
print(f"PyTorch logits : {pytorch_file}")
|
||||||
|
print(f"llama.cpp logits: {llamacpp_file}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
success = quick_logits_check(pytorch_file, llamacpp_file)
|
||||||
|
|
||||||
|
# Exit with appropriate code
|
||||||
|
if success:
|
||||||
|
print("✅ OK: Lightweight model check successful!")
|
||||||
|
print(" Ok to proceed with NMSE check...")
|
||||||
|
sys.exit(0)
|
||||||
|
else:
|
||||||
|
print(f"❌ NOK: Top 10 predictions don't match - generation will differ")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
|
|
@ -0,0 +1,22 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
MODEL_NAME="${MODEL_NAME:-$(basename "$MODEL_PATH")}"
|
||||||
|
OUTPUT_DIR="${OUTPUT_DIR:-../../models}"
|
||||||
|
TYPE="${OUTTYPE:-f16}"
|
||||||
|
METADATA_OVERRIDE="${METADATA_OVERRIDE:-}"
|
||||||
|
CONVERTED_MODEL="${OUTPUT_DIR}/${MODEL_NAME}.gguf"
|
||||||
|
|
||||||
|
echo "Model path: ${MODEL_PATH}"
|
||||||
|
echo "Model name: ${MODEL_NAME}"
|
||||||
|
echo "Data type: ${TYPE}"
|
||||||
|
echo "Converted model path:: ${CONVERTED_MODEL}"
|
||||||
|
echo "Metadata override: ${METADATA_OVERRIDE}"
|
||||||
|
python ../../convert_hf_to_gguf.py --verbose \
|
||||||
|
${MODEL_PATH} \
|
||||||
|
--outfile ${CONVERTED_MODEL} \
|
||||||
|
--outtype ${TYPE} \
|
||||||
|
--metadata "${METADATA_OVERRIDE}"
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "The environment variable CONVERTED_MODEL can be set to this path using:"
|
||||||
|
echo "export CONVERTED_MODEL=$(realpath ${CONVERTED_MODEL})"
|
||||||
|
|
@ -0,0 +1,113 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
import importlib
|
||||||
|
import sys
|
||||||
|
import torch
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForCausalLM
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')
|
||||||
|
|
||||||
|
parser = argparse.ArgumentParser(description='Process model with specified path')
|
||||||
|
parser.add_argument('--model-path', '-m', help='Path to the model')
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
model_path = os.environ.get('MODEL_PATH', args.model_path)
|
||||||
|
if model_path is None:
|
||||||
|
parser.error("Model path must be specified either via --model-path argument or MODEL_PATH environment variable")
|
||||||
|
|
||||||
|
config = AutoConfig.from_pretrained(model_path)
|
||||||
|
|
||||||
|
print("Model type: ", config.model_type)
|
||||||
|
print("Vocab size: ", config.vocab_size)
|
||||||
|
print("Hidden size: ", config.hidden_size)
|
||||||
|
print("Number of layers: ", config.num_hidden_layers)
|
||||||
|
print("BOS token id: ", config.bos_token_id)
|
||||||
|
print("EOS token id: ", config.eos_token_id)
|
||||||
|
|
||||||
|
print("Loading model and tokenizer using AutoTokenizer:", model_path)
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(model_path)
|
||||||
|
|
||||||
|
if unreleased_model_name:
|
||||||
|
model_name_lower = unreleased_model_name.lower()
|
||||||
|
unreleased_module_path = f"transformers.models.{model_name_lower}.modular_{model_name_lower}"
|
||||||
|
class_name = f"{unreleased_model_name}ForCausalLM"
|
||||||
|
print(f"Importing unreleased model module: {unreleased_module_path}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
model_class = getattr(importlib.import_module(unreleased_module_path), class_name)
|
||||||
|
model = model_class.from_pretrained(model_path)
|
||||||
|
except (ImportError, AttributeError) as e:
|
||||||
|
print(f"Failed to import or load model: {e}")
|
||||||
|
else:
|
||||||
|
model = AutoModelForCausalLM.from_pretrained(model_path)
|
||||||
|
print(f"Model class: {type(model)}")
|
||||||
|
#print(f"Model file: {type(model).__module__}")
|
||||||
|
|
||||||
|
model_name = os.path.basename(model_path)
|
||||||
|
print(f"Model name: {model_name}")
|
||||||
|
|
||||||
|
prompt = "Hello world today"
|
||||||
|
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
|
||||||
|
print(f"Input tokens: {input_ids}")
|
||||||
|
print(f"Input text: {repr(prompt)}")
|
||||||
|
print(f"Tokenized: {tokenizer.convert_ids_to_tokens(input_ids[0])}")
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(input_ids, output_hidden_states=True)
|
||||||
|
|
||||||
|
# Extract hidden states from the last layer
|
||||||
|
# outputs.hidden_states is a tuple of (num_layers + 1) tensors
|
||||||
|
# Index -1 gets the last layer, shape: [batch_size, seq_len, hidden_size]
|
||||||
|
last_hidden_states = outputs.hidden_states[-1]
|
||||||
|
|
||||||
|
# Get embeddings for all tokens
|
||||||
|
token_embeddings = last_hidden_states[0].cpu().numpy() # Remove batch dimension
|
||||||
|
|
||||||
|
print(f"Hidden states shape: {last_hidden_states.shape}")
|
||||||
|
print(f"Token embeddings shape: {token_embeddings.shape}")
|
||||||
|
print(f"Hidden dimension: {token_embeddings.shape[-1]}")
|
||||||
|
print(f"Number of tokens: {token_embeddings.shape[0]}")
|
||||||
|
|
||||||
|
# Save raw token embeddings
|
||||||
|
data_dir = Path("data")
|
||||||
|
data_dir.mkdir(exist_ok=True)
|
||||||
|
bin_filename = data_dir / f"pytorch-{model_name}-embeddings.bin"
|
||||||
|
txt_filename = data_dir / f"pytorch-{model_name}-embeddings.txt"
|
||||||
|
|
||||||
|
# Save all token embeddings as binary
|
||||||
|
print(token_embeddings)
|
||||||
|
token_embeddings.astype(np.float32).tofile(bin_filename)
|
||||||
|
|
||||||
|
# Save as text for inspection
|
||||||
|
with open(txt_filename, "w") as f:
|
||||||
|
for i, embedding in enumerate(token_embeddings):
|
||||||
|
for j, val in enumerate(embedding):
|
||||||
|
f.write(f"{i} {j} {val:.6f}\n")
|
||||||
|
|
||||||
|
# Print embeddings per token in the requested format
|
||||||
|
print("\nToken embeddings:")
|
||||||
|
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
|
||||||
|
for i, embedding in enumerate(token_embeddings):
|
||||||
|
# Format: show first few values, ..., then last few values
|
||||||
|
if len(embedding) > 10:
|
||||||
|
# Show first 3 and last 3 values with ... in between
|
||||||
|
first_vals = " ".join(f"{val:8.6f}" for val in embedding[:3])
|
||||||
|
last_vals = " ".join(f"{val:8.6f}" for val in embedding[-3:])
|
||||||
|
print(f"embedding {i}: {first_vals} ... {last_vals}")
|
||||||
|
else:
|
||||||
|
# If embedding is short, show all values
|
||||||
|
vals = " ".join(f"{val:8.6f}" for val in embedding)
|
||||||
|
print(f"embedding {i}: {vals}")
|
||||||
|
|
||||||
|
# Also show token info for reference
|
||||||
|
print(f"\nToken reference:")
|
||||||
|
for i, token in enumerate(tokens):
|
||||||
|
print(f" Token {i}: {repr(token)}")
|
||||||
|
|
||||||
|
print(f"Saved bin logits to: {bin_filename}")
|
||||||
|
print(f"Saved txt logist to: {txt_filename}")
|
||||||
|
|
@ -0,0 +1,18 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
# First try command line argument, then environment variable, then file
|
||||||
|
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
|
||||||
|
|
||||||
|
# Final check if we have a model path
|
||||||
|
if [ -z "$CONVERTED_MODEL" ]; then
|
||||||
|
echo "Error: Model path must be provided either as:" >&2
|
||||||
|
echo " 1. Command line argument" >&2
|
||||||
|
echo " 2. CONVERTED_MODEL environment variable" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
cmake --build ../../build --target llama-logits -j8
|
||||||
|
|
||||||
|
../../build/bin/llama-logits -m $CONVERTED_MODEL -embd-mode "Hello world today"
|
||||||
|
|
@ -0,0 +1,20 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
# First try command line argument, then environment variable, then file
|
||||||
|
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
|
||||||
|
|
||||||
|
# Final check if we have a model path
|
||||||
|
if [ -z "$CONVERTED_MODEL" ]; then
|
||||||
|
echo "Error: Model path must be provided either as:" >&2
|
||||||
|
echo " 1. Command line argument" >&2
|
||||||
|
echo " 2. CONVERTED_MODEL environment variable" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo $CONVERTED_MODEL
|
||||||
|
|
||||||
|
cmake --build ../../build --target llama-logits -j8
|
||||||
|
|
||||||
|
../../build/bin/llama-logits -m "$CONVERTED_MODEL" "Hello, my name is"
|
||||||
|
|
@ -0,0 +1,100 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
import importlib
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
|
||||||
|
import torch
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')
|
||||||
|
|
||||||
|
parser = argparse.ArgumentParser(description='Process model with specified path')
|
||||||
|
parser.add_argument('--model-path', '-m', help='Path to the model')
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
model_path = os.environ.get('MODEL_PATH', args.model_path)
|
||||||
|
if model_path is None:
|
||||||
|
parser.error("Model path must be specified either via --model-path argument or MODEL_PATH environment variable")
|
||||||
|
|
||||||
|
config = AutoConfig.from_pretrained(model_path)
|
||||||
|
|
||||||
|
print("Model type: ", config.model_type)
|
||||||
|
print("Vocab size: ", config.vocab_size)
|
||||||
|
print("Hidden size: ", config.hidden_size)
|
||||||
|
print("Number of layers: ", config.num_hidden_layers)
|
||||||
|
print("BOS token id: ", config.bos_token_id)
|
||||||
|
print("EOS token id: ", config.eos_token_id)
|
||||||
|
|
||||||
|
print("Loading model and tokenizer using AutoTokenizer:", model_path)
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(model_path)
|
||||||
|
config = AutoConfig.from_pretrained(model_path)
|
||||||
|
|
||||||
|
if unreleased_model_name:
|
||||||
|
model_name_lower = unreleased_model_name.lower()
|
||||||
|
unreleased_module_path = f"transformers.models.{model_name_lower}.modular_{model_name_lower}"
|
||||||
|
class_name = f"{unreleased_model_name}ForCausalLM"
|
||||||
|
print(f"Importing unreleased model module: {unreleased_module_path}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
model_class = getattr(importlib.import_module(unreleased_module_path), class_name)
|
||||||
|
model = model_class.from_pretrained(model_path) # Note: from_pretrained, not fromPretrained
|
||||||
|
except (ImportError, AttributeError) as e:
|
||||||
|
print(f"Failed to import or load model: {e}")
|
||||||
|
exit(1)
|
||||||
|
else:
|
||||||
|
model = AutoModelForCausalLM.from_pretrained(model_path)
|
||||||
|
|
||||||
|
model_name = os.path.basename(model_path)
|
||||||
|
# Printing the Model class to allow for easier debugging. This can be useful
|
||||||
|
# when working with models that have not been publicly released yet and this
|
||||||
|
# migth require that the concrete class is imported and used directly instead
|
||||||
|
# of using AutoModelForCausalLM.
|
||||||
|
print(f"Model class: {model.__class__.__name__}")
|
||||||
|
|
||||||
|
prompt = "Hello, my name is"
|
||||||
|
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
|
||||||
|
|
||||||
|
print(f"Input tokens: {input_ids}")
|
||||||
|
print(f"Input text: {repr(prompt)}")
|
||||||
|
print(f"Tokenized: {tokenizer.convert_ids_to_tokens(input_ids[0])}")
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(input_ids)
|
||||||
|
logits = outputs.logits
|
||||||
|
|
||||||
|
# Extract logits for the last token (next token prediction)
|
||||||
|
last_logits = logits[0, -1, :].cpu().numpy()
|
||||||
|
|
||||||
|
print(f"Logits shape: {logits.shape}")
|
||||||
|
print(f"Last token logits shape: {last_logits.shape}")
|
||||||
|
print(f"Vocab size: {len(last_logits)}")
|
||||||
|
|
||||||
|
data_dir = Path("data")
|
||||||
|
data_dir.mkdir(exist_ok=True)
|
||||||
|
bin_filename = data_dir / f"pytorch-{model_name}.bin"
|
||||||
|
txt_filename = data_dir / f"pytorch-{model_name}.txt"
|
||||||
|
|
||||||
|
# Save to file for comparison
|
||||||
|
last_logits.astype(np.float32).tofile(bin_filename)
|
||||||
|
|
||||||
|
# Also save as text file for easy inspection
|
||||||
|
with open(txt_filename, "w") as f:
|
||||||
|
for i, logit in enumerate(last_logits):
|
||||||
|
f.write(f"{i}: {logit:.6f}\n")
|
||||||
|
|
||||||
|
# Print some sample logits for quick verification
|
||||||
|
print(f"First 10 logits: {last_logits[:10]}")
|
||||||
|
print(f"Last 10 logits: {last_logits[-10:]}")
|
||||||
|
|
||||||
|
# Show top 5 predicted tokens
|
||||||
|
top_indices = np.argsort(last_logits)[-5:][::-1]
|
||||||
|
print("Top 5 predictions:")
|
||||||
|
for idx in top_indices:
|
||||||
|
token = tokenizer.decode([idx])
|
||||||
|
print(f" Token {idx} ({repr(token)}): {last_logits[idx]:.6f}")
|
||||||
|
|
||||||
|
print(f"Saved bin logits to: {bin_filename}")
|
||||||
|
print(f"Saved txt logist to: {txt_filename}")
|
||||||
|
|
@ -0,0 +1,42 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
MODEL_PATH="${1:-"$EMBEDDING_MODEL_PATH"}"
|
||||||
|
MODEL_NAME="${2:-$(basename "$MODEL_PATH")}"
|
||||||
|
|
||||||
|
if [ -t 0 ]; then
|
||||||
|
CPP_EMBEDDINGS="data/llamacpp-${MODEL_NAME}-embeddings.bin"
|
||||||
|
else
|
||||||
|
# Process piped JSON data and convert to binary (matching logits.cpp format)
|
||||||
|
TEMP_FILE=$(mktemp /tmp/tmp.XXXXXX.binn)
|
||||||
|
python3 -c "
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
import struct
|
||||||
|
|
||||||
|
data = json.load(sys.stdin)
|
||||||
|
|
||||||
|
# Flatten all embeddings completely
|
||||||
|
flattened = []
|
||||||
|
for item in data:
|
||||||
|
embedding = item['embedding']
|
||||||
|
for token_embedding in embedding:
|
||||||
|
flattened.extend(token_embedding)
|
||||||
|
|
||||||
|
print(f'Total embedding values: {len(flattened)}', file=sys.stderr)
|
||||||
|
|
||||||
|
# Write as binary floats - matches logits.cpp fwrite format
|
||||||
|
with open('$TEMP_FILE', 'wb') as f:
|
||||||
|
for value in flattened:
|
||||||
|
f.write(struct.pack('f', value))
|
||||||
|
"
|
||||||
|
CPP_EMBEDDINGS="$TEMP_FILE"
|
||||||
|
trap "rm -f $TEMP_FILE" EXIT
|
||||||
|
fi
|
||||||
|
|
||||||
|
python scripts/utils/semantic_check.py --model-path $MODEL_PATH \
|
||||||
|
--python-embeddings data/pytorch-${MODEL_NAME}-embeddings.bin \
|
||||||
|
--cpp-embeddings $CPP_EMBEDDINGS \
|
||||||
|
--prompt "Hello world today"
|
||||||
|
|
||||||
|
|
@ -0,0 +1,22 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
MODEL_NAME="${MODEL_NAME:-$(basename "$EMBEDDING_MODEL_PATH")}"
|
||||||
|
OUTPUT_DIR="${OUTPUT_DIR:-../../models}"
|
||||||
|
TYPE="${OUTTYPE:-f16}"
|
||||||
|
METADATA_OVERRIDE="${METADATA_OVERRIDE:-}"
|
||||||
|
CONVERTED_MODEL="${OUTPUT_DIR}/${MODEL_NAME}.gguf"
|
||||||
|
|
||||||
|
echo "Model path: ${EMBEDDING_MODEL_PATH}"
|
||||||
|
echo "Model name: ${MODEL_NAME}"
|
||||||
|
echo "Data type: ${TYPE}"
|
||||||
|
echo "Converted model path:: ${CONVERTED_MODEL}"
|
||||||
|
python ../../convert_hf_to_gguf.py --verbose \
|
||||||
|
${EMBEDDING_MODEL_PATH} \
|
||||||
|
--outfile ${CONVERTED_MODEL} \
|
||||||
|
--outtype ${TYPE}
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "The environment variable CONVERTED_EMBEDDING MODEL can be set to this path using:"
|
||||||
|
echo "export CONVERTED_EMBEDDING_MODEL=$(realpath ${CONVERTED_MODEL})"
|
||||||
|
|
@ -0,0 +1,20 @@
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
# First try command line argument, then environment variable, then file
|
||||||
|
CONVERTED_MODEL="${1:-"$CONVERTED_EMBEDDING_MODEL"}"
|
||||||
|
|
||||||
|
# Final check if we have a model path
|
||||||
|
if [ -z "$CONVERTED_MODEL" ]; then
|
||||||
|
echo "Error: Model path must be provided either as:" >&2
|
||||||
|
echo " 1. Command line argument" >&2
|
||||||
|
echo " 2. CONVERTED_EMBEDDING_MODEL environment variable" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo $CONVERTED_MODEL
|
||||||
|
|
||||||
|
cmake --build ../../build --target llama-logits -j8
|
||||||
|
|
||||||
|
../../build/bin/llama-logits -m "$CONVERTED_MODEL" -embd-mode "Hello world today"
|
||||||
|
|
@ -0,0 +1,116 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
import numpy as np
|
||||||
|
import importlib
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from transformers import AutoTokenizer, AutoConfig, AutoModel
|
||||||
|
import torch
|
||||||
|
|
||||||
|
unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')
|
||||||
|
|
||||||
|
parser = argparse.ArgumentParser(description='Process model with specified path')
|
||||||
|
parser.add_argument('--model-path', '-m', help='Path to the model')
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
model_path = os.environ.get('EMBEDDING_MODEL_PATH', args.model_path)
|
||||||
|
if model_path is None:
|
||||||
|
parser.error("Model path must be specified either via --model-path argument or EMBEDDING_MODEL_PATH environment variable")
|
||||||
|
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(model_path)
|
||||||
|
|
||||||
|
if unreleased_model_name:
|
||||||
|
model_name_lower = unreleased_model_name.lower()
|
||||||
|
unreleased_module_path = f"transformers.models.{model_name_lower}.modular_{model_name_lower}"
|
||||||
|
class_name = f"{unreleased_model_name}Model"
|
||||||
|
print(f"Importing unreleased model module: {unreleased_module_path}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
model_class = getattr(importlib.import_module(unreleased_module_path), class_name)
|
||||||
|
model = model_class.from_pretrained(model_path) # Note: from_pretrained, not fromPretrained
|
||||||
|
except (ImportError, AttributeError) as e:
|
||||||
|
print(f"Failed to import or load model: {e}")
|
||||||
|
exit(1)
|
||||||
|
else:
|
||||||
|
model = AutoModel.from_pretrained(model_path)
|
||||||
|
print(f"Model class: {type(model)}")
|
||||||
|
#print(f"Model file: {type(model).__module__}")
|
||||||
|
config = AutoConfig.from_pretrained(model_path)
|
||||||
|
|
||||||
|
model_name = os.path.basename(model_path)
|
||||||
|
|
||||||
|
texts = [ "Hello world today" ]
|
||||||
|
|
||||||
|
encoded = tokenizer(
|
||||||
|
texts,
|
||||||
|
padding=True,
|
||||||
|
truncation=True,
|
||||||
|
return_tensors="pt"
|
||||||
|
)
|
||||||
|
|
||||||
|
tokens = encoded['input_ids'][0]
|
||||||
|
token_strings = tokenizer.convert_ids_to_tokens(tokens)
|
||||||
|
for i, (token_id, token_str) in enumerate(zip(tokens, token_strings)):
|
||||||
|
print(f"{token_id:6d} -> '{token_str}'")
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(**encoded)
|
||||||
|
hidden_states = outputs.last_hidden_state # Shape: [batch_size, seq_len, hidden_size]
|
||||||
|
|
||||||
|
# Extract embeddings for each token (matching LLAMA_POOLING_TYPE_NONE behavior)
|
||||||
|
all_embeddings = hidden_states[0].cpu().numpy() # Shape: [seq_len, hidden_size]
|
||||||
|
|
||||||
|
print(f"Hidden states shape: {hidden_states.shape}")
|
||||||
|
print(f"All embeddings shape: {all_embeddings.shape}")
|
||||||
|
print(f"Embedding dimension: {all_embeddings.shape[1]}")
|
||||||
|
|
||||||
|
# Print embeddings exactly like embedding.cpp does for LLAMA_POOLING_TYPE_NONE
|
||||||
|
n_embd = all_embeddings.shape[1]
|
||||||
|
n_embd_count = all_embeddings.shape[0]
|
||||||
|
|
||||||
|
print() # Empty line to match C++ output
|
||||||
|
|
||||||
|
for j in range(n_embd_count):
|
||||||
|
embedding = all_embeddings[j]
|
||||||
|
print(f"embedding {j}: ", end="")
|
||||||
|
|
||||||
|
# Print first 3 values
|
||||||
|
for i in range(min(3, n_embd)):
|
||||||
|
print(f"{embedding[i]:9.6f} ", end="")
|
||||||
|
|
||||||
|
print(" ... ", end="")
|
||||||
|
|
||||||
|
# Print last 3 values
|
||||||
|
for i in range(n_embd - 3, n_embd):
|
||||||
|
print(f"{embedding[i]:9.6f} ", end="")
|
||||||
|
|
||||||
|
print() # New line
|
||||||
|
|
||||||
|
print() # Final empty line to match C++ output
|
||||||
|
|
||||||
|
data_dir = Path("data")
|
||||||
|
data_dir.mkdir(exist_ok=True)
|
||||||
|
bin_filename = data_dir / f"pytorch-{model_name}-embeddings.bin"
|
||||||
|
txt_filename = data_dir / f"pytorch-{model_name}-embeddings.txt"
|
||||||
|
|
||||||
|
# Save all embeddings flattened (matching what embedding.cpp would save if it did)
|
||||||
|
flattened_embeddings = all_embeddings.flatten()
|
||||||
|
flattened_embeddings.astype(np.float32).tofile(bin_filename)
|
||||||
|
|
||||||
|
with open(txt_filename, "w") as f:
|
||||||
|
f.write(f"# Model class: {model_name}\n")
|
||||||
|
f.write(f"# Tokens: {token_strings}\n")
|
||||||
|
f.write(f"# Shape: {all_embeddings.shape}\n")
|
||||||
|
f.write(f"# n_embd_count: {n_embd_count}, n_embd: {n_embd}\n\n")
|
||||||
|
|
||||||
|
for j in range(n_embd_count):
|
||||||
|
f.write(f"# Token {j} ({token_strings[j]}):\n")
|
||||||
|
for i, value in enumerate(all_embeddings[j]):
|
||||||
|
f.write(f"{j}_{i}: {value:.6f}\n")
|
||||||
|
f.write("\n")
|
||||||
|
print(f"Total values: {len(flattened_embeddings)} ({n_embd_count} tokens × {n_embd} dimensions)")
|
||||||
|
print("")
|
||||||
|
print(f"Saved bin embeddings to: {bin_filename}")
|
||||||
|
print(f"Saved txt embeddings to: {txt_filename}")
|
||||||
|
|
@ -0,0 +1,13 @@
---
base_model:
- {base_model}
---
# {model_name} GGUF

Recommended way to run this model:

```sh
llama-server -hf {namespace}/{model_name}-GGUF -c 0 -fa
```

Then, access http://localhost:8080
@ -0,0 +1,174 @@
#!/usr/bin/env python3

import numpy as np
import sys
import os
import argparse
from pathlib import Path

def calculate_nmse(reference, test):
    mse = np.mean((test - reference) ** 2)
    ref_var = np.var(reference)
    if ref_var == 0:
        nmse = float('inf') if mse > 0 else 0.0
        return nmse, mse, ref_var

    nmse = mse / ref_var

    return nmse, mse, ref_var

def load_logits(file_path):
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    if file_path.suffix == '.npy':
        return np.load(file_path)
    elif file_path.suffix == '.bin':
        return np.fromfile(file_path, dtype=np.float32)
    else:
        # Try to load as text file
        try:
            # If it has index format "0: value", extract just values
            data = []
            with open(file_path, 'r') as f:
                for line in f:
                    if ':' in line:
                        # Format: "index: value"
                        value = float(line.split(':')[1].strip())
                    else:
                        # Just the value
                        value = float(line.strip())
                    data.append(value)
            return np.array(data, dtype=np.float32)
        except:
            return np.loadtxt(file_path, dtype=np.float32)

def interpret_nmse(nmse):
    """Provide interpretation of NMSE value"""
    if nmse == 0:
        return "Perfect match", "🎉"
    elif nmse < 1e-6:
        return "Essentially identical", "✅"
    elif nmse < 1e-4:
        return "Excellent match", "✅"
    elif nmse < 1e-3:
        return "Very good match", "👍"
    elif nmse < 1e-2:
        return "Good match", "👍"
    elif nmse < 0.1:
        return "Acceptable match", "⚠️"
    elif nmse < 1.0:
        return "Poor match", "❌"
    else:
        return "Very poor match (worse than noise)", "❌"

def main():
    parser = argparse.ArgumentParser(description='Validate model logits')
    parser.add_argument('-m', '--model-path', required=True, help='Path to the model directory')
    args = parser.parse_args()

    model_name = os.path.splitext(os.path.basename(args.model_path))[0]
    data_dir = Path("data")

    pytorch_file = data_dir / f"pytorch-{model_name}.bin"
    llamacpp_file = data_dir / f"llamacpp-{model_name}.bin"

    print(f"Model name: {model_name}")
    print(f"PyTorch logits file: {pytorch_file}")
    print(f"llama.cpp logits file: {llamacpp_file}")

    reference_file = pytorch_file
    test_file = llamacpp_file

    print("📊 NMSE Check for Model Comparison")
    print("=" * 50)
    print(f"Reference (ground truth): {reference_file}")
    print(f"Test (to evaluate): {test_file}")
    print()

    try:
        print("Loading reference logits...")
        reference = load_logits(reference_file)
        print(f"  Shape: {reference.shape}, Type: {reference.dtype}")

        print("Loading test logits...")
        test = load_logits(test_file)
        print(f"  Shape: {test.shape}, Type: {test.dtype}")

        # Check shapes match
        if reference.shape != test.shape:
            print(f"\n❌ Error: Shape mismatch!")
            print(f"  Reference: {reference.shape}")
            print(f"  Test: {test.shape}")
            sys.exit(1)

        print(f"\n✅ Shapes match: {reference.shape}")

        nmse, mse, ref_var = calculate_nmse(reference, test)

        # Additional metrics
        max_abs_error = np.max(np.abs(test - reference))
        mean_abs_error = np.mean(np.abs(test - reference))

        # Results
        print(f"\n📈 METRICS")
        print("=" * 30)
        print(f"MSE (Mean Squared Error): {mse:.6e}")
        print(f"Reference Variance: {ref_var:.6e}")
        print(f"NMSE: {nmse:.6e}")
        print(f"Max Absolute Error: {max_abs_error:.6f}")
        print(f"Mean Absolute Error: {mean_abs_error:.6f}")

        # NMSE in dB (common in signal processing)
        if nmse > 0:
            nmse_db = 10 * np.log10(nmse)
            print(f"NMSE (dB): {nmse_db:.2f} dB")

        # Interpretation
        interpretation, emoji = interpret_nmse(nmse)
        print(f"\n🎯 INTERPRETATION")
        print("=" * 30)
        print(f"{emoji} {interpretation}")

        # Detailed guidance
        print(f"\n📋 GUIDANCE")
        print("=" * 30)
        if nmse < 1e-3:
            print("✅ EXCELLENT: Your GGML conversion is working very well!")
            print("   The differences are negligible for practical use.")
        elif nmse < 1e-2:
            print("👍 GOOD: Your GGML conversion is working well.")
            print("   Small differences are likely due to precision/quantization.")
        elif nmse < 0.1:
            print("⚠️ ACCEPTABLE: Conversion is working but with some differences.")
            print("   Check if you're using quantization (Q4, Q8, etc.)")
            print("   Test generation quality to see if it's acceptable.")
        else:
            print("❌ PROBLEMATIC: Large differences detected.")
            print("   Check your conversion process for potential issues.")
            print("   Verify you're using the same model weights.")

        # NMSE benchmarks
        print(f"\n📚 NMSE BENCHMARKS")
        print("=" * 30)
        print("< 1e-6: Essentially identical")
        print("< 1e-4: Excellent (typical for good conversions)")
        print("< 1e-3: Very good")
        print("< 1e-2: Good (acceptable for most use cases)")
        print("< 0.1:  Acceptable (may need verification)")
        print("> 1.0:  Poor (worse than random)")

        # Exit code based on NMSE
        if nmse < 1e-2:
            print(f"\n✅ RESULT: PASS (NMSE = {nmse:.2e})")
            sys.exit(0)
        else:
            print(f"\n❌ RESULT: NEEDS REVIEW (NMSE = {nmse:.2e})")
            sys.exit(1)

    except Exception as e:
        print(f"❌ Error: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()

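As a rough usage sketch (the script's filename is not visible in this diff, so `check-nmse.py` and the model path below are placeholders), the NMSE check above expects the two logits dumps under `data/` and is invoked with only the model path:

```sh
# placeholder names: adjust the script name and model path to your checkout
python check-nmse.py -m /path/to/my-model
```
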
@ -0,0 +1,6 @@

COLLECTION_SLUG=$(python ./create_collection.py --return-slug)
echo "Created collection: $COLLECTION_SLUG"

# Use it in the next command
python add_model_to_collection.py "$COLLECTION_SLUG" "username/my-model"

@ -0,0 +1,80 @@
#!/usr/bin/env python3

from huggingface_hub import HfApi
import argparse
import sys

def add_model_to_collection(collection_slug, model_id, note=""):
    """
    Add a model to an existing collection

    Args:
        collection_slug: The slug of the collection (e.g., "username/collection-name-12345")
        model_id: The model repository ID (e.g., "username/model-name")
        note: Optional note about the model

    Returns:
        True if successful, False if failed
    """

    # Initialize API
    api = HfApi()

    try:
        user_info = api.whoami()
        print(f"✅ Authenticated as: {user_info['name']}")

        # Verify the model exists
        print(f"🔍 Checking if model exists: {model_id}")
        try:
            model_info = api.model_info(model_id)
        except Exception as e:
            print(f"❌ Model not found or not accessible: {model_id}")
            print(f"Error: {e}")
            return False

        print(f"📚 Adding model to collection...")
        api.add_collection_item(
            collection_slug=collection_slug,
            item_id=model_id,
            item_type="model",
            note=note
        )

        print(f"✅ Model added to collection successfully!")
        print(f"🔗 Collection URL: https://huggingface.co/collections/{collection_slug}")

        return True

    except Exception as e:
        print(f"❌ Error adding model to collection: {e}")
        return False

def main():
    # This script requires that the environment variable HF_TOKEN is set with your
    # Hugging Face API token.
    api = HfApi()

    parser = argparse.ArgumentParser(description='Add model to a Huggingface Collection')
    parser.add_argument('--collection', '-c', help='The collection slug username/collection-hash', required=True)
    parser.add_argument('--model', '-m', help='The model to add to the Collection', required=True)
    parser.add_argument('--note', '-n', help='An optional note/description', required=False)
    args = parser.parse_args()

    collection = args.collection
    model = args.model
    note = args.note

    success = add_model_to_collection(
        collection_slug=collection,
        model_id=model,
        note=note
    )

    if success:
        print("\n🎉 Model added successfully!")
    else:
        print("\n❌ Failed to add model to collection")
        sys.exit(1)

if __name__ == "__main__":
    main()

@ -0,0 +1,106 @@
#!/usr/bin/env python3

from huggingface_hub import HfApi
import argparse
import os
import sys


def create_collection(title, description, private=False, namespace=None, return_slug=False):
    """
    Create a new collection on Hugging Face

    Args:
        title: Collection title
        description: Collection description
        private: Whether the collection should be private (default: False)
        namespace: Optional namespace (defaults to your username)

    Returns:
        Collection object if successful, None if failed
    """

    # Check if HF_TOKEN is available
    token = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_HUB_TOKEN")
    if not token:
        print("❌ No HF_TOKEN or HUGGINGFACE_HUB_TOKEN found in environment variables")
        print("Please set your Hugging Face token as an environment variable")
        return None

    # Initialize API
    api = HfApi()

    try:
        # Test authentication first
        user_info = api.whoami()
        if not return_slug:
            print(f"✅ Authenticated as: {user_info['name']}")

        # Create the collection
        if not return_slug:
            print(f"📚 Creating collection: '{title}'...")
        collection = api.create_collection(
            title=title,
            description=description,
            private=private,
            namespace=namespace
        )

        if not return_slug:
            print(f"✅ Collection created successfully!")
            print(f"📋 Collection slug: {collection.slug}")
            print(f"🔗 Collection URL: https://huggingface.co/collections/{collection.slug}")

        return collection

    except Exception as e:
        print(f"❌ Error creating collection: {e}")
        return None

def main():
    # This script requires that the environment variable HF_TOKEN is set with your
    # Hugging Face API token.
    api = HfApi()

    parser = argparse.ArgumentParser(description='Create a Huggingface Collection')
    parser.add_argument('--name', '-n', help='The name/title of the Collection', required=True)
    parser.add_argument('--description', '-d', help='The description for the Collection', required=True)
    parser.add_argument('--namespace', '-ns', help='The namespace to add the Collection to', required=True)
    parser.add_argument('--private', '-p', help='Create a private Collection', action='store_true') # Fixed
    parser.add_argument('--return-slug', '-s', help='Only output the collection slug', action='store_true') # Fixed

    args = parser.parse_args()

    name = args.name
    description = args.description
    private = args.private
    namespace = args.namespace
    return_slug = args.return_slug

    if not return_slug:
        print("🚀 Creating Hugging Face Collection")
        print(f"Title: {name}")
        print(f"Description: {description}")
        print(f"Namespace: {namespace}")
        print(f"Private: {private}")

    collection = create_collection(
        title=name,
        description=description,
        private=private,
        namespace=namespace,
        return_slug=return_slug
    )

    if collection:
        if return_slug:
            print(collection.slug)
        else:
            print("\n🎉 Collection created successfully!")
            print(f"Use this slug to add models: {collection.slug}")
    else:
        print("\n❌ Failed to create collection")
        sys.exit(1)

if __name__ == "__main__":
    main()

@ -0,0 +1,63 @@
#!/usr/bin/env python3

from huggingface_hub import HfApi
import argparse

# This script requires that the environment variable HF_TOKEN is set with your
# Hugging Face API token.
api = HfApi()

def load_template_and_substitute(template_path, **kwargs):
    try:
        with open(template_path, 'r', encoding='utf-8') as f:
            template_content = f.read()

        return template_content.format(**kwargs)
    except FileNotFoundError:
        print(f"Template file '{template_path}' not found!")
        return None
    except KeyError as e:
        print(f"Missing template variable: {e}")
        return None

parser = argparse.ArgumentParser(description='Create a new Hugging Face model repository')
parser.add_argument('--model-name', '-m', help='Name for the model', required=True)
parser.add_argument('--namespace', '-ns', help='Namespace to add the model to', required=True)
parser.add_argument('--org-base-model', '-b', help='Original Base model name', default="")
parser.add_argument('--no-card', action='store_true', help='Skip creating model card')
parser.add_argument('--private', '-p', action='store_true', help='Create private model')

args = parser.parse_args()

repo_id = f"{args.namespace}/{args.model_name}-GGUF"
print("Repository ID: ", repo_id)

repo_url = api.create_repo(
    repo_id=repo_id,
    repo_type="model",
    private=args.private,
    exist_ok=False
)

if not args.no_card:
    template_path = "scripts/readme.md.template"
    model_card_content = load_template_and_substitute(
        template_path,
        model_name=args.model_name,
        namespace=args.namespace,
        base_model=args.org_base_model,
    )

    if model_card_content:
        api.upload_file(
            path_or_fileobj=model_card_content.encode('utf-8'),
            path_in_repo="README.md",
            repo_id=repo_id
        )
        print("Model card created successfully.")
    else:
        print("Failed to create model card.")

print(f"Repository created: {repo_url}")

@ -0,0 +1,58 @@
#!/usr/bin/env python3

from huggingface_hub import HfApi
import argparse
import os

def upload_gguf_file(local_file_path, repo_id, filename_in_repo=None):
    """
    Upload a GGUF file to a Hugging Face model repository

    Args:
        local_file_path: Path to your local GGUF file
        repo_id: Your repository ID (e.g., "username/model-name")
        filename_in_repo: Optional custom name for the file in the repo
    """

    if not os.path.exists(local_file_path):
        print(f"❌ File not found: {local_file_path}")
        return False

    if filename_in_repo is None:
        filename_in_repo = os.path.basename(local_file_path)

    if filename_in_repo is None or filename_in_repo == "":
        filename_in_repo = os.path.basename(local_file_path)

    print(f"📤 Uploading {local_file_path} to {repo_id}/{filename_in_repo}")

    api = HfApi()

    try:
        api.upload_file(
            path_or_fileobj=local_file_path,
            path_in_repo=filename_in_repo,
            repo_id=repo_id,
            repo_type="model",
            commit_message=f"Upload {filename_in_repo}"
        )

        print("✅ Upload successful!")
        print(f"🔗 File available at: https://huggingface.co/{repo_id}/blob/main/{filename_in_repo}")
        return True

    except Exception as e:
        print(f"❌ Upload failed: {e}")
        return False

# This script requires that the environment variable HF_TOKEN is set with your
# Hugging Face API token.
api = HfApi()

parser = argparse.ArgumentParser(description='Upload a GGUF model to a Huggingface model repository')
parser.add_argument('--gguf-model-path', '-m', help='The GGUF model file to upload', required=True)
parser.add_argument('--repo-id', '-r', help='The repository to upload to', required=True)
parser.add_argument('--name', '-o', help='The name in the model repository', required=False)
args = parser.parse_args()

upload_gguf_file(args.gguf_model_path, args.repo_id, args.name)

@ -0,0 +1,14 @@
#!/bin/bash

# First try command line argument, then environment variable, then file
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"

# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
    echo "Error: Model path must be provided either as:" >&2
    echo "  1. Command line argument" >&2
    echo "  2. CONVERTED_MODEL environment variable" >&2
    exit 1
fi

../../gguf-py/gguf/scripts/gguf_dump.py $CONVERTED_MODEL

@ -0,0 +1,67 @@
#!/usr/bin/env python3

import argparse
import os
import json
from safetensors import safe_open
from collections import defaultdict

parser = argparse.ArgumentParser(description='Process model with specified path')
parser.add_argument('--model-path', '-m', help='Path to the model')
args = parser.parse_args()

model_path = os.environ.get('MODEL_PATH', args.model_path)
if model_path is None:
    parser.error("Model path must be specified either via --model-path argument or MODEL_PATH environment variable")

# Check if there's an index file (multi-file model)
index_path = os.path.join(model_path, "model.safetensors.index.json")
single_file_path = os.path.join(model_path, "model.safetensors")

if os.path.exists(index_path):
    # Multi-file model
    print("Multi-file model detected")

    with open(index_path, 'r') as f:
        index_data = json.load(f)

    # Get the weight map (tensor_name -> file_name)
    weight_map = index_data.get("weight_map", {})

    # Group tensors by file for efficient processing
    file_tensors = defaultdict(list)
    for tensor_name, file_name in weight_map.items():
        file_tensors[file_name].append(tensor_name)

    print("Tensors in model:")

    # Process each shard file
    for file_name, tensor_names in file_tensors.items():
        file_path = os.path.join(model_path, file_name)
        print(f"\n--- From {file_name} ---")

        with safe_open(file_path, framework="pt") as f:
            for tensor_name in sorted(tensor_names):
                tensor = f.get_tensor(tensor_name)
                print(f"- {tensor_name} : shape = {tensor.shape}, dtype = {tensor.dtype}")

elif os.path.exists(single_file_path):
    # Single file model (original behavior)
    print("Single-file model detected")

    with safe_open(single_file_path, framework="pt") as f:
        keys = f.keys()
        print("Tensors in model:")
        for key in sorted(keys):
            tensor = f.get_tensor(key)
            print(f"- {key} : shape = {tensor.shape}, dtype = {tensor.dtype}")

else:
    print(f"Error: Neither 'model.safetensors.index.json' nor 'model.safetensors' found in {model_path}")
    print("Available files:")
    if os.path.exists(model_path):
        for item in sorted(os.listdir(model_path)):
            print(f"  {item}")
    else:
        print(f"  Directory {model_path} does not exist")
    exit(1)

@ -0,0 +1,35 @@
#!/bin/bash

set -e

CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"

# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
    echo "Error: Model path must be provided either as:" >&2
    echo "  1. Command line argument" >&2
    echo "  2. CONVERTED_MODEL environment variable" >&2
    exit 1
fi

# Check if ppl/wikitext-2-raw directory exists
if [ ! -d "ppl/wikitext-2-raw" ]; then
    echo "ppl/wikitext-2-raw directory does not exist. Downloading..." >&2
    mkdir -p ppl
    pushd ppl
    ./../../../scripts/get-wikitext-2.sh
    popd
fi

mkdir -p ppl
OUTPUTFILE="ppl/$(basename $CONVERTED_MODEL).kld"
echo "Model: $CONVERTED_MODEL"

cmake --build ../../build --target llama-perplexity -j8

../.././build/bin/llama-perplexity -m $CONVERTED_MODEL \
    -f ppl/wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base $OUTPUTFILE

echo "Generated logits in $OUTPUTFILE"

@ -0,0 +1,27 @@
#!/bin/bash

set -e

QUANTIZED_MODEL="${1:-"$QUANTIZED_MODEL"}"

if [ -z "$QUANTIZED_MODEL" ]; then
    echo "Error: Model path must be provided either as:" >&2
    echo "  1. Command line argument" >&2
    echo "  2. QUANTIZED_MODEL environment variable" >&2
    exit 1
fi

# Check if ppl/wikitext-2-raw directory exists
if [ ! -d "ppl/wikitext-2-raw" ]; then
    echo "ppl/wikitext-2-raw directory does not exist. Downloading..." >&2
    mkdir -p ppl
    pushd ppl
    ./../../../scripts/get-wikitext-2.sh
    popd
fi

cmake --build ../../build --target llama-perplexity -j8

../.././build/bin/llama-perplexity -m $QUANTIZED_MODEL -f ppl/wikitext-2-raw/wiki.test.raw

@ -0,0 +1,28 @@
#!/bin/bash

set -e

QUANTIZED_MODEL="${1:-"$QUANTIZED_MODEL"}"
LOGITS_FILE="${2:-"$LOGITS_FILE"}"

if [ -z "$QUANTIZED_MODEL" ]; then
    echo "Error: Model path must be provided either as:" >&2
    echo "  1. Command line argument" >&2
    echo "  2. QUANTIZED_MODEL environment variable" >&2
    exit 1
fi

if [ ! -f ${LOGITS_FILE} ]; then
    echo "Error: logits file '${LOGITS_FILE}' was not found"
    echo "Did you run the perplexity-gen.sh script?"
    exit 1
fi

echo "Model: $QUANTIZED_MODEL"
echo "Data file: $LOGITS_FILE"

cmake --build ../../build --target llama-perplexity -j8

../.././build/bin/llama-perplexity -m $QUANTIZED_MODEL \
    --kl-divergence-base $LOGITS_FILE \
    --kl-divergence

@ -0,0 +1,34 @@
#!/bin/bash

set -e

CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
QUANTIZED_TYPE="${2:-"$QUANTIZED_TYPE"}"
QUANTIZED_MODEL=$CONVERTED_MODEL

# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
    echo "Error: Model path must be provided either as:" >&2
    echo "  1. Command line argument" >&2
    echo "  2. CONVERTED_MODEL environment variable" >&2
    exit 1
fi

echo $CONVERTED_MODEL

# Process the quantized model filename
if [[ "$QUANTIZED_MODEL" == *.gguf ]]; then
    # Remove .gguf suffix, add quantized type, then add .gguf back
    BASE_NAME="${QUANTIZED_MODEL%.gguf}"
    QUANTIZED_MODEL="${BASE_NAME}-${QUANTIZED_TYPE}.gguf"
else
    echo "Error: QUANTIZED_MODEL must end with .gguf extension" >&2
    exit 1
fi

cmake --build ../../build --target llama-quantize -j8

../../build/bin/llama-quantize $CONVERTED_MODEL $QUANTIZED_MODEL $QUANTIZED_TYPE

echo "Quantized model saved to: $QUANTIZED_MODEL"

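A brief usage sketch for the quantization wrapper above (the script name and paths are placeholders, not part of the diff): it takes the converted GGUF model and a quantization type, and writes `<model>-<type>.gguf` next to the input.

```sh
# placeholder invocation: produces models/my-model-Q8_0.gguf
./quantize-model.sh models/my-model.gguf Q8_0
```
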
@ -0,0 +1,22 @@
#!/bin/bash

set -e
#
# First try command line argument, then environment variable, then file
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"

# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
    echo "Error: Model path must be provided either as:" >&2
    echo "  1. Command line argument" >&2
    echo "  2. CONVERTED_MODEL environment variable" >&2
    exit 1
fi

echo $CONVERTED_MODEL

cmake --build ../../build --target llama-server

../../build/bin/llama-server -m $CONVERTED_MODEL \
    --embedding \
    --pooling none

@ -0,0 +1,179 @@
#!/usr/bin/env python3

import numpy as np
import argparse
import os
import importlib

from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoModel

unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')

def cosine_similarity(a, b=None):
    a = np.asarray(a)
    if b is None:
        b = a
    else:
        b = np.asarray(b)

    if a.ndim == 1:
        a = a.reshape(1, -1)
    if b.ndim == 1:
        b = b.reshape(1, -1)

    a_norms = np.linalg.norm(a, axis=1, keepdims=True)
    b_norms = np.linalg.norm(b, axis=1, keepdims=True)

    a_norms = np.where(a_norms == 0, 1e-8, a_norms)
    b_norms = np.where(b_norms == 0, 1e-8, b_norms)

    a_normalized = a / a_norms
    b_normalized = b / b_norms

    # Compute cosine similarity
    return np.dot(a_normalized, b_normalized.T)

def load_embeddings_from_file(filename, n_tokens, n_embd):
    embeddings = np.fromfile(filename, dtype=np.float32)
    return embeddings.reshape(n_tokens, n_embd)

def test_single_prompt_similarity(python_emb, cpp_emb, tokens, prompt):
    np.set_printoptions(suppress=True, precision=6)
    print("pytorch embeddings:")
    print(python_emb)
    print("llama.cpp embeddings:")
    print(cpp_emb)
    print(f"\n=== Prompt: '{prompt}' ===")
    print(f"Tokens: {tokens}")
    print(f"Embeddings shape: Python {python_emb.shape}, llama.cpp {cpp_emb.shape}")

    n_tokens = len(tokens)

    # 1. Direct embedding comparison
    print(f"\n1. Raw Embedding Magnitude Comparison:")
    # Check the distance of each token embedding from the origin and compare
    # whether the vectors are on the same "sphere". This does not tell us about
    # direction (meaning of the token embedding), just magnitude.
    for i in range(n_tokens):
        py_mag = np.linalg.norm(python_emb[i])  # calculate standard euclidean norm for Python embeddings
        cpp_mag = np.linalg.norm(cpp_emb[i])    # calculate standard euclidean norm for llama.cpp embeddings
        ratio = py_mag / cpp_mag if cpp_mag > 0 else float('inf')
        print(f"  Token {i} ({tokens[i]}): Python={py_mag:.3f}, llama.cpp={cpp_mag:.3f}, ratio={ratio:.3f}")

    # 2. Cosine similarity between tokens within each model
    # Here we check the direction of token embeddings to see if they have the
    # same meaning (similarity). This is done by calculating cosine similarity
    # of a pair of token embeddings within each model.
    print(f"\n2. Within-Model Token Similarities:")
    print("  Python model:")
    for i in range(n_tokens):
        for j in range(i+1, n_tokens):
            sim = cosine_similarity([python_emb[i]], [python_emb[j]])[0][0]
            print(f"    {tokens[i]} ↔ {tokens[j]}: {sim:.4f}")

    print("  llama.cpp model:")
    for i in range(n_tokens):
        for j in range(i+1, n_tokens):
            sim = cosine_similarity([cpp_emb[i]], [cpp_emb[j]])[0][0]
            print(f"    {tokens[i]} ↔ {tokens[j]}: {sim:.4f}")

    # 3. Cross-model similarity (same token position)
    print(f"\n3. Cross-Model Same-Token Similarities:")
    for i in range(n_tokens):
        sim = cosine_similarity([python_emb[i]], [cpp_emb[i]])[0][0]
        print(f"  Token {i} ({tokens[i]}): {sim:.4f}")

    # 4. Similarity matrix comparison
    print(f"\n4. Similarity Matrix Differences:")
    py_sim_matrix = cosine_similarity(python_emb)
    cpp_sim_matrix = cosine_similarity(cpp_emb)
    diff_matrix = np.abs(py_sim_matrix - cpp_sim_matrix)

    print(f"  Max difference: {np.max(diff_matrix):.4f}")
    print(f"  Mean difference: {np.mean(diff_matrix):.4f}")
    print(f"  RMS difference: {np.sqrt(np.mean(diff_matrix**2)):.4f}")

    return {
        'cross_model_similarities': [cosine_similarity([python_emb[i]], [cpp_emb[i]])[0][0] for i in range(n_tokens)],
        'similarity_matrix_diff': diff_matrix,
        'max_diff': np.max(diff_matrix),
        'mean_diff': np.mean(diff_matrix),
        'rms_diff': np.sqrt(np.mean(diff_matrix**2))
    }

def main():
    parser = argparse.ArgumentParser(description='Test semantic similarity between Python and llama.cpp embeddings')
    parser.add_argument('--model-path', '-m', required=True, help='Path to the original Python model')
    parser.add_argument('--python-embeddings', '-pe', help='Path to pytorch embeddings "logits" binary file')
    parser.add_argument('--cpp-embeddings', '-ce', help='Path to llama.cpp embeddings "logits" binary file')
    parser.add_argument('--causal', '-c', default=False, help='if the model is causal (default: false)', action='store_true')
    parser.add_argument('--prompt', '-p', default='Hello world today', help='Test prompt')

    args = parser.parse_args()

    print("Semantic Similarity Test Between Python and llama.cpp Embedding Models")
    print("=" * 70)

    # Single prompt detailed comparison
    print(f"\nTesting with prompt: '{args.prompt}'")

    # Load the python model to get configuration information and also to load the tokenizer.
    print("Loading model and tokenizer using AutoTokenizer:", args.model_path)
    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
    config = AutoConfig.from_pretrained(args.model_path)

    if unreleased_model_name:
        model_name_lower = unreleased_model_name.lower()
        unreleased_module_path = f"transformers.models.{model_name_lower}.modular_{model_name_lower}"
        if args.causal:
            class_name = f"{unreleased_model_name}ForCausalLM"
        else:
            class_name = f"{unreleased_model_name}Model"
        print(f"Model class: {class_name}")
        print(f"Importing unreleased model module: {unreleased_module_path}")

        try:
            model_class = getattr(importlib.import_module(unreleased_module_path), class_name)
            model = model_class.from_pretrained(args.model_path)
        except (ImportError, AttributeError) as e:
            print(f"Failed to import or load model: {e}")
            exit(1)
    else:
        if args.causal:
            model = AutoModelForCausalLM.from_pretrained(args.model_path)
        else:
            model = AutoModel.from_pretrained(args.model_path)

    encoded = tokenizer(args.prompt, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
    n_tokens = len(tokens)
    print(f"n_tokens: {n_tokens}")
    print(f"hidden_size: {model.config.hidden_size}")

    # Load binary embeddings from data directory.
    llamacpp_embeddings = load_embeddings_from_file(args.cpp_embeddings, n_tokens, model.config.hidden_size)
    python_embeddings = load_embeddings_from_file(args.python_embeddings, n_tokens, model.config.hidden_size)

    # Run comparison
    results = test_single_prompt_similarity(python_embeddings, llamacpp_embeddings, tokens, args.prompt)

    # Summary
    print(f"\n=== SUMMARY ===")
    avg_cross_sim = np.mean(results['cross_model_similarities'])
    print(f"Average cross-model similarity: {avg_cross_sim:.4f}")
    print(f"Similarity matrix RMS difference: {results['rms_diff']:.4f}")

    # Quality assessment
    if avg_cross_sim > 0.95:
        print("✅ EXCELLENT: Models are highly similar")
    elif avg_cross_sim > 0.90:
        print("✅ VERY GOOD: Models are very similar")
    elif avg_cross_sim > 0.80:
        print("⚠️ GOOD: Models are reasonably similar")
    elif avg_cross_sim > 0.70:
        print("⚠️ FAIR: Models have some differences")
    else:
        print("❌ POOR: Models are significantly different")

if __name__ == "__main__":
    main()

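A hedged example of driving the semantic-similarity check above (script name and file names are placeholders): `-pe`/`-ce` point at the raw float32 embedding dumps produced by the PyTorch and llama.cpp runs.

```sh
# placeholder paths: both .bin files are flat float32 dumps of n_tokens x hidden_size
python semantic-check.py -m /path/to/model \
    -pe data/pytorch-embeddings.bin \
    -ce data/llamacpp-embeddings.bin \
    -p "Hello world today"
```
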
@ -11,5 +11,5 @@ See the following PRs for more info:
### Usage

```bash
-make -j && ./llama-passkey -m ./models/llama-7b-v2/ggml-model-f16.gguf --junk 250
+llama-passkey -m ./models/llama-7b-v2/ggml-model-f16.gguf --junk 250
```

@ -15,7 +15,7 @@ https://github.com/ggml-org/llama.cpp/pull/6193
`retrieval` example can be tested as follows:

```bash
-make -j && ./llama-retrieval --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator .
+llama-retrieval --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator .
```

This chunks and embeds all given files and starts a loop requesting query inputs:

@ -18,8 +18,6 @@ if %errorlevel% neq 0 goto ERROR
:: for FP32
cmake -G "Ninja" .. -DLLAMA_CURL=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DBUILD_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=Release
if %errorlevel% neq 0 goto ERROR
-:: build example/main only
-:: make main

:: build all binary
cmake --build . -j

@ -158,7 +158,6 @@ option(GGML_CUDA "ggml: use CUDA"
option(GGML_MUSA "ggml: use MUSA" OFF)
option(GGML_CUDA_FORCE_MMQ "ggml: use mmq kernels instead of cuBLAS" OFF)
option(GGML_CUDA_FORCE_CUBLAS "ggml: always use cuBLAS instead of mmq kernels" OFF)
-option(GGML_CUDA_F16 "ggml: use 16 bit floats for some calculations" OFF)
set (GGML_CUDA_PEER_MAX_BATCH_SIZE "128" CACHE STRING
     "ggml: max. batch size for using peer access")
option(GGML_CUDA_NO_PEER_COPY "ggml: do not use peer to peer copies" OFF)

@ -244,6 +244,13 @@
#define GGML_MROPE_SECTIONS 4

#define GGML_UNUSED(x) (void)(x)
+#ifdef __CUDACC__
+template<typename... Args>
+__host__ __device__ constexpr inline void ggml_unused_vars_impl(Args&&...) noexcept {}
+#define GGML_UNUSED_VARS(...) ggml_unused_vars_impl(__VA_ARGS__)
+#else
+#define GGML_UNUSED_VARS(...) do { (void)sizeof((__VA_ARGS__, 0)); } while(0)
+#endif // __CUDACC__
+
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

@ -19,9 +19,8 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
-#include <string>
-#include <vector>
#include <algorithm>
+#include <vector>

#ifdef __APPLE__
#include <sys/types.h>

@ -1352,6 +1351,10 @@ static bool ggml_backend_sched_alloc_splits(ggml_backend_sched_t sched) {
static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t sched) {
    struct ggml_backend_sched_split * splits = sched->splits;

+    ggml_tensor * prev_ids_tensor = nullptr;
+    std::vector<int32_t> ids;
+    std::vector<ggml_bitset_t> used_ids;
+
    for (int i = 0; i < sched->n_splits; i++) {
        struct ggml_backend_sched_split * split = &splits[i];
        int split_backend_id = split->backend_id;

@ -1378,6 +1381,80 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
                } else {
                    ggml_backend_synchronize(split_backend);
                }
+
+                // when offloading MoE weights, we can reduce the amount of data copied by copying only the experts that are used
+                ggml_tensor * node = split->graph.nodes[0];
+                if (split->graph.n_nodes > 0 &&
+                    ggml_backend_buffer_get_usage(input->buffer) == GGML_BACKEND_BUFFER_USAGE_WEIGHTS &&
+                    ggml_backend_buffer_is_host(input->buffer) && (
+                        (node->src[0] == input_cpy && node->op == GGML_OP_MUL_MAT_ID)
+                        //|| (node->src[1] == input_cpy && node->op == GGML_OP_ADD_ID) /* GGML_OP_ADD_ID weights are small and not worth splitting */
+                    )) {
+
+                    const int64_t n_expert    = node->op == GGML_OP_MUL_MAT_ID ? input->ne[2] : input->ne[1];
+                    const size_t  expert_size = node->op == GGML_OP_MUL_MAT_ID ? input->nb[2] : input->nb[1];
+
+                    ggml_backend_synchronize(input_backend);
+
+                    // get the ids
+                    ggml_tensor * ids_tensor = node->src[2];
+                    if (ids_tensor != prev_ids_tensor) {
+                        ids.resize(ggml_nbytes(ids_tensor) / sizeof(int32_t));
+                        ggml_backend_tensor_get_async(split_backend, ids_tensor, ids.data(), 0, ggml_nbytes(ids_tensor));
+                        ggml_backend_synchronize(split_backend);
+
+                        // find the used experts
+                        used_ids.clear();
+                        used_ids.resize(ggml_bitset_size(n_expert));
+                        for (int64_t i1 = 0; i1 < ids_tensor->ne[1]; i1++) {
+                            for (int64_t i0 = 0; i0 < ids_tensor->ne[0]; i0++) {
+                                int32_t id = ids[i1 * ids_tensor->nb[1]/sizeof(int32_t) + i0 * ids_tensor->nb[0]/sizeof(int32_t)];
+                                ggml_bitset_set(used_ids.data(), id);
+                            }
+                        }
+
+                        prev_ids_tensor = ids_tensor;
+                    }
+
+                    // group consecutive experts and copy them together
+                    auto copy_experts = [&](int32_t first_id, int32_t last_id) {
+                        const size_t expert_offset = first_id * expert_size;
+                        const size_t expert_size_copy = (last_id - first_id + 1) * expert_size;
+                        const size_t padding = std::min<size_t>(expert_size, 512);
+                        const size_t padding_end = last_id < n_expert - 1 ? padding : 0;
+
+                        ggml_backend_tensor_set_async(split_backend,
+                            input_cpy,
+                            (const uint8_t *)input->data + expert_offset, expert_offset,
+                            // copy a bit extra at the end to ensure there are no NaNs in the padding of the last expert
+                            // this is necessary for MMQ in the CUDA backend
+                            expert_size_copy + padding_end);
+                    };
+
+                    int id = 0;
+                    while (!ggml_bitset_get(used_ids.data(), id)) {
+                        id++;
+                    }
+                    int32_t first_id = id;
+                    int32_t last_id  = first_id;
+
+                    for (++id; id < n_expert; ++id) {
+                        if (!ggml_bitset_get(used_ids.data(), id)) {
+                            continue;
+                        }
+
+                        if (id == last_id + 1) {
+                            last_id = id;
+                            continue;
+                        }
+
+                        copy_experts(first_id, last_id);
+
+                        first_id = id;
+                        last_id  = id;
+                    }
+                    copy_experts(first_id, last_id);
+                } else {
                    // try async copy, but if not possible, we can still use a sync copy without synchronizing the dst backend, since we handle the synchronization here with multiple copies and events
                    // TODO: add public function to facilitate this, since applications do not have direct access to the backend interface
                    if (!split_backend->iface.cpy_tensor_async || !split_backend->iface.cpy_tensor_async(input_backend, split_backend, input, input_cpy)) {

@ -1391,6 +1468,7 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
                    }
                }
            }
+        }

        if (!sched->callback_eval) {
            enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);

@ -2154,17 +2154,39 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,

    GGML_TENSOR_BINARY_OP_LOCALS

-    // theta_scale arange, [0,1,...,ne00/2 - 1]
    int64_t theta_scale_length = ne00 / 2;
-    ggml_cann_pool_alloc theta_scale_allocator(ctx.pool(),
-                                               theta_scale_length * sizeof(float_t));
-    void* theta_scale_buffer = theta_scale_allocator.get();
    int64_t theta_scale_ne[] = {theta_scale_length, 1, 1, 1};
    size_t theta_scale_nb[] = {sizeof(float_t), sizeof(float_t), sizeof(float_t),
                               theta_scale_length * sizeof(float_t)};

+    GGML_ASSERT(src1->type == GGML_TYPE_I32);
+    int64_t position_length = src1->ne[0];
+    int64_t position_ne[] = {1, 1, position_length, 1};
+    size_t position_nb[] = {sizeof(int32_t), sizeof(int32_t), sizeof(int32_t),
+                            sizeof(int32_t) * position_length};
+
+    int64_t theta_ne[] = {theta_scale_length, 1, position_length, 1};
+    size_t theta_nb[GGML_MAX_DIMS];
+    theta_nb[0] = sizeof(float_t);
+    for (int i = 1; i < GGML_MAX_DIMS; i++) {
+        theta_nb[i] = theta_nb[i - 1] * theta_ne[i - 1];
+    }
+
+    bool is_q = (std::strncmp(dst->name, "Qcur-", 5) == 0);
+    bool is_k = (std::strncmp(dst->name, "Kcur-", 5) == 0);
+
+    // used for accuracy testing
+    bool is_attention = is_q || is_k;
+
+    if(ctx.init_ptr == nullptr || !is_attention) {
+        // theta_scale arange, [0,1,...,ne00/2 - 1]
+        if(ctx.init_ptr != nullptr){
+            ACL_CHECK(aclrtFree(ctx.init_ptr));
+        }
+        ACL_CHECK(aclrtMalloc(&ctx.init_ptr, theta_scale_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
+
        aclTensor* acl_theta_scale_tensor =
-        ggml_cann_create_tensor(theta_scale_buffer, ACL_FLOAT, sizeof(float_t),
+            ggml_cann_create_tensor(ctx.init_ptr, ACL_FLOAT, sizeof(float_t),
                                    theta_scale_ne, theta_scale_nb, GGML_MAX_DIMS);
        float start = 0;
        float step = 1;

@ -2190,13 +2212,33 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
            aclnn_div(ctx, acl_theta_scale_tensor, acl_freq_factors_tensor);
            ggml_cann_release_resources(ctx, acl_freq_factors_tensor);
        }
+        // release
+        ggml_cann_release_resources(ctx, acl_theta_scale_tensor,acl_theta_scale);
+    }
+
+    if(ctx.sin_ptr == nullptr) {
+        int64_t theta_length = theta_scale_length * ctx.max_prompt_length;
+        ACL_CHECK(aclrtMalloc(&ctx.sin_ptr, theta_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
+        ACL_CHECK(aclrtMalloc(&ctx.cos_ptr, theta_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
+    }
+    if(position_length > ctx.max_prompt_length) {
+        ctx.max_prompt_length = position_length;
+        int64_t theta_length = theta_scale_length * ctx.max_prompt_length;
+        ACL_CHECK(aclrtFree(ctx.sin_ptr));
+        ACL_CHECK(aclrtFree(ctx.cos_ptr));
+        ACL_CHECK(aclrtMalloc(&ctx.sin_ptr, theta_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
+        ACL_CHECK(aclrtMalloc(&ctx.cos_ptr, theta_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
+    }
+
+    bool is_fisrt_layer = (std::strncmp(dst->name, "Qcur-0", GGML_MAX_NAME) == 0);
+
+    if(is_fisrt_layer || !is_attention) {
+
+        aclTensor* acl_theta_scale_tensor =
+            ggml_cann_create_tensor(ctx.init_ptr, ACL_FLOAT, sizeof(float_t),
+                                    theta_scale_ne, theta_scale_nb, GGML_MAX_DIMS);
+
        // position
-    GGML_ASSERT(src1->type == GGML_TYPE_I32);
-    int64_t position_length = src1->ne[0];
-    int64_t position_ne[] = {1, 1, position_length, 1};
-    size_t position_nb[] = {sizeof(int32_t), sizeof(int32_t), sizeof(int32_t),
-                            sizeof(int32_t) * position_length};
        aclTensor* acl_position_tensor = ggml_cann_create_tensor(
            src1->data, ggml_cann_type_mapping(src1->type),
            ggml_type_size(src1->type), position_ne, position_nb, GGML_MAX_DIMS);

@ -2206,12 +2248,7 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
        ggml_cann_pool_alloc theta_allocator(ctx.pool(),
                                             theta_length * sizeof(float_t));
        void* theta_buffer = theta_allocator.get();
-    int64_t theta_ne[] = {theta_scale_length, 1, position_length, 1};
-    size_t theta_nb[GGML_MAX_DIMS];
-    theta_nb[0] = sizeof(float_t);
-    for (int i = 1; i < GGML_MAX_DIMS; i++) {
-        theta_nb[i] = theta_nb[i - 1] * theta_ne[i - 1];
-    }
        aclTensor* acl_theta_tensor =
            ggml_cann_create_tensor(theta_buffer, ACL_FLOAT, sizeof(float_t),
                                    theta_ne, theta_nb, GGML_MAX_DIMS);

@ -2219,22 +2256,28 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
                                     acl_theta_tensor);

        // sin/cos
-    ggml_cann_pool_alloc sin_allocator(ctx.pool(),
-                                       theta_length * sizeof(float_t));
-    void* sin_buffer = sin_allocator.get();
        aclTensor* acl_sin_tensor = ggml_cann_create_tensor(
-        sin_buffer, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
+            ctx.sin_ptr, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
            GGML_MAX_DIMS, ACL_FORMAT_ND);
        aclnn_sin(ctx, acl_theta_tensor, acl_sin_tensor);

-    ggml_cann_pool_alloc cos_allocator(ctx.pool(),
-                                       theta_length * sizeof(float_t));
-    void* cos_buffer = cos_allocator.get();
        aclTensor* acl_cos_tensor = ggml_cann_create_tensor(
-        cos_buffer, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
+            ctx.cos_ptr, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
            GGML_MAX_DIMS, ACL_FORMAT_ND);
        aclnn_cos(ctx, acl_theta_tensor, acl_cos_tensor);

+        // release
+        ggml_cann_release_resources(ctx, acl_theta_scale_tensor, acl_position_tensor,
+            acl_theta_tensor, acl_sin_tensor, acl_cos_tensor);
+    }
+
+    aclTensor* acl_sin_tensor = ggml_cann_create_tensor(
+        ctx.sin_ptr, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
+        GGML_MAX_DIMS, ACL_FORMAT_ND);
+    aclTensor* acl_cos_tensor = ggml_cann_create_tensor(
+        ctx.cos_ptr, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
+        GGML_MAX_DIMS, ACL_FORMAT_ND);
+
    // attn_factor
    if (attn_factor != 1) {
        aclnn_muls(ctx, acl_sin_tensor, attn_factor, nullptr, true);

@ -2257,8 +2300,7 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
    }

    // release
-    ggml_cann_release_resources(ctx, acl_theta_scale_tensor, acl_position_tensor,
-        acl_theta_tensor, acl_sin_tensor, acl_cos_tensor, acl_theta_scale);
+    ggml_cann_release_resources(ctx, acl_sin_tensor, acl_cos_tensor);
}

#ifdef __cplusplus

@ -368,6 +368,10 @@ struct ggml_backend_cann_context {
    std::string name;                /**< Name of the device. */
    std::string description;         /**< Description of the device. */
    aclrtEvent copy_event = nullptr; /**< Event for managing copy operations. */
+    void* init_ptr = nullptr;
+    void* sin_ptr = nullptr;
+    void* cos_ptr = nullptr;
+    int64_t max_prompt_length = 65536;
#ifdef USE_ACL_GRAPH
    /// Cached CANN ACL graph used for executing the current ggml computation graph.
    std::unique_ptr<ggml_cann_graph> cann_graph;

@ -414,6 +418,15 @@ struct ggml_backend_cann_context {
                ACL_CHECK(aclrtDestroyStream(streams[i]));
            }
        }
+        if(init_ptr != nullptr) {
+            ACL_CHECK(aclrtFree(init_ptr));
+        }
+        if(sin_ptr != nullptr) {
+            ACL_CHECK(aclrtFree(sin_ptr));
+        }
+        if(cos_ptr != nullptr) {
+            ACL_CHECK(aclrtFree(cos_ptr));
+        }
    }

    /**

@ -73,7 +73,6 @@
#define ggml_vec_dot_tq1_0_q8_K_generic ggml_vec_dot_tq1_0_q8_K
#define ggml_vec_dot_tq2_0_q8_K_generic ggml_vec_dot_tq2_0_q8_K
#define ggml_vec_dot_iq1_m_q8_K_generic ggml_vec_dot_iq1_m_q8_K
-#define ggml_vec_dot_mxfp4_q8_0_generic ggml_vec_dot_mxfp4_q8_0
// repack.cpp
#define ggml_quantize_mat_q8_0_4x4_generic ggml_quantize_mat_q8_0_4x4
#define ggml_quantize_mat_q8_0_4x8_generic ggml_quantize_mat_q8_0_4x8

@@ -278,6 +278,72 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 #endif
 }
 
+void ggml_vec_dot_mxfp4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    assert(nrc == 1);
+    UNUSED(nrc);
+    UNUSED(bx);
+    UNUSED(by);
+    UNUSED(bs);
+    assert(n % QK_MXFP4 == 0);
+    static_assert(QK_MXFP4 == QK8_0, "QK_MXFP4 and QK8_0 must be the same");
+
+    const block_mxfp4 * GGML_RESTRICT x = vx;
+    const block_q8_0  * GGML_RESTRICT y = vy;
+
+    const int nb = n / QK_MXFP4;
+
+    int ib = 0;
+    float sumf = 0;
+
+#if defined(__POWER9_VECTOR__)
+    const vector signed char lowMask = vec_splats((signed char)0xF);
+    const vector unsigned char vshift4 = vec_splats((unsigned char)4);
+    vector float vsumf0 = vec_splats(0.0f);
+
+    vector signed char kv = vec_xl(0, (const signed char *)kvalues_mxfp4);
+
+#pragma GCC unroll 8
+    for (; ib < nb; ++ib) {
+        __builtin_prefetch(x[ib].qs, 0, 1);
+        __builtin_prefetch(y[ib].qs, 0, 1);
+
+        vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d) *
+                                      GGML_E8M0_TO_FP32_HALF(x[ib].e));
+
+        vector signed char q8y0 = vec_xl( 0, y[ib].qs);
+        vector signed char q8y1 = vec_xl(16, y[ib].qs);
+
+        vector signed char qxs = (vector signed char)vec_xl(0, x[ib].qs);
+
+        vector unsigned char lo_nibbles = (vector unsigned char)vec_and(qxs, lowMask);
+        vector unsigned char hi_nibbles = (vector unsigned char)vec_sr(qxs, vshift4);
+
+        vector signed char q4x0 = vec_perm(kv, kv, lo_nibbles);
+        vector signed char q4x1 = vec_perm(kv, kv, hi_nibbles);
+
+        vector signed short qv0 = vec_add(vec_mule(q4x0, q8y0), vec_mulo(q4x0, q8y0));
+        vector signed short qv1 = vec_add(vec_mule(q4x1, q8y1), vec_mulo(q4x1, q8y1));
+
+        vector signed int vsumi0 = vec_splats((int32_t)0);
+        vsumi0 = vec_sum4s(qv0, vsumi0);
+        vsumi0 = vec_sum4s(qv1, vsumi0);
+
+        vsumf0 = vec_madd(vec_ctf(vsumi0, 0), vyd, vsumf0);
+    }
+
+    vsumf0 = vec_add(vsumf0, vec_sld(vsumf0, vsumf0, 4));
+    vsumf0 = vec_add(vsumf0, vec_sld(vsumf0, vsumf0, 8));
+    sumf = vec_extract(vsumf0, 0);
+    *s = sumf;
+#else
+    UNUSED(x);
+    UNUSED(y);
+    UNUSED(ib);
+    UNUSED(sumf);
+    ggml_vec_dot_mxfp4_q8_0_generic(n, s, bs, vx, bx, vy, by, nrc);
+#endif
+}
+
 void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
     const int qk = QK8_0;
     const int nb = n / qk;
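Note: the hunk above adds a native POWER9/VSX path for the MXFP4 x q8_0 dot product (which is why the generic fallback define is dropped above it). A minimal scalar sketch of the same computation is given below; the block layouts, the `kvalues_mxfp4` table contents, and the "half" E8M0 scale are assumptions inferred from the identifiers in the diff, and the q8_0 scale is simplified to a plain float.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed lookup table for the 16 possible 4-bit MXFP4 codes, stored as the
// e2m1 values scaled by 2 so they stay integral (hence the *_HALF scale above).
static const int8_t kvalues_mxfp4_ref[16] = {0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12};

// Simplified block layouts (the real blocks store an E8M0 byte and an fp16 scale).
struct block_mxfp4_ref { uint8_t e; uint8_t qs[16]; };  // 32 packed 4-bit values per block
struct block_q8_0_ref  { float   d; int8_t  qs[32]; };  // 32 int8 values per block

static float e8m0_to_fp32_half(uint8_t e) { return std::ldexp(0.5f, (int) e - 127); } // 2^(e-127) / 2

// Scalar reference of the dot product the VSX loop processes 32 values at a time.
float vec_dot_mxfp4_q8_0_ref(int n, const block_mxfp4_ref * x, const block_q8_0_ref * y) {
    assert(n % 32 == 0);
    float sumf = 0.0f;
    for (int ib = 0; ib < n/32; ++ib) {
        int sumi = 0;
        for (int j = 0; j < 16; ++j) {
            sumi += kvalues_mxfp4_ref[x[ib].qs[j] & 0x0F] * y[ib].qs[j];      // low nibbles vs first 16 q8 values
            sumi += kvalues_mxfp4_ref[x[ib].qs[j] >>   4] * y[ib].qs[j + 16]; // high nibbles vs last 16 q8 values
        }
        sumf += sumi * y[ib].d * e8m0_to_fp32_half(x[ib].e);  // per-block scale applied once
    }
    return sumf;
}
```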
@@ -24,12 +24,6 @@ if (CUDAToolkit_FOUND)
         # for best performance and to also build real architectures for the most commonly used GPUs.
         if (GGML_NATIVE AND CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.6" AND CMAKE_VERSION VERSION_GREATER_EQUAL "3.24")
             set(CMAKE_CUDA_ARCHITECTURES "native")
-        elseif(GGML_CUDA_F16 OR GGML_CUDA_DMMV_F16)
-            if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.8")
-                set(CMAKE_CUDA_ARCHITECTURES "60-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real;89-real")
-            else()
-                set(CMAKE_CUDA_ARCHITECTURES "60-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real")
-            endif()
         else()
             if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.8")
                 set(CMAKE_CUDA_ARCHITECTURES "50-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real;89-real")

@@ -91,10 +85,6 @@ if (CUDAToolkit_FOUND)
         add_compile_definitions(GGML_CUDA_NO_FA)
     endif()
 
-    if (GGML_CUDA_F16 OR GGML_CUDA_DMMV_F16)
-        add_compile_definitions(GGML_CUDA_F16)
-    endif()
-
     if (GGML_CUDA_NO_PEER_COPY)
         add_compile_definitions(GGML_CUDA_NO_PEER_COPY)
    endif()
@@ -11,14 +11,14 @@ static __global__ void add_id_kernel(
     const int64_t i1 = blockIdx.x;
     const int64_t i2 = blockIdx.y;
 
-    const int i11 = *(int32_t *) ((char *) src2 + i1*sizeof(int32_t) + i2*nb21);
+    const int i11 = *(const int32_t *) ((const char *) src2 + i1*sizeof(int32_t) + i2*nb21);
 
     const size_t nb1 = ne0 * sizeof(float);
     const size_t nb2 = ne1 * nb1;
 
     float * dst_row = (float *)((char *)dst + i1*nb1 + i2*nb2);
-    const float * src0_row = (const float *)((char *)src0 + i1*nb01 + i2*nb02);
-    const float * src1_row = (const float *)((char *)src1 + i11*nb11);
+    const float * src0_row = (const float *)((const char *)src0 + i1*nb01 + i2*nb02);
+    const float * src1_row = (const float *)((const char *)src1 + i11*nb11);
 
     for (int64_t i0 = threadIdx.x; i0 < ne0; i0 += blockDim.x) {
         dst_row[i0] = src0_row[i0] + src1_row[i0];
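Note: the change above only adds `const` to the intermediate casts. A small host-side sketch of the same pattern is shown below; the helper name is hypothetical, but it illustrates why keeping the whole cast chain const-correct avoids discarding qualifiers from a const source pointer (the kind of thing `-Wcast-qual` flags).

```cpp
#include <cstdint>

// Byte-offset arithmetic through 'const char *' keeps the const qualifier intact
// all the way from the opaque source pointer to the typed row pointer.
static const int32_t * row_index_ptr(const void * src2, int64_t i1, int64_t i2, int64_t nb21) {
    return (const int32_t *) ((const char *) src2 + i1*sizeof(int32_t) + i2*nb21);
}
```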
@@ -78,6 +78,8 @@
 #define GGML_CUDA_CC_IS_CDNA3(cc) (cc >= GGML_CUDA_CC_CDNA3 && cc < GGML_CUDA_CC_RDNA1)
 
 // Moore Threads
+#define MUSART_HMASK 40300 // MUSA rc4.3, min. ver. for half2 -> uint mask comparisons
+
 #define GGML_CUDA_CC_QY1 (GGML_CUDA_CC_OFFSET_MTHREADS + 0x210) // MTT S80, MTT S3000
 #define GGML_CUDA_CC_QY2 (GGML_CUDA_CC_OFFSET_MTHREADS + 0x220) // MTT S4000
 #define GGML_CUDA_CC_NG  (GGML_CUDA_CC_OFFSET_MTHREADS + 0x310) // TBD

@@ -204,14 +206,6 @@ static const char * cu_get_error_str(CUresult err) {
 #define GGML_CUDA_ASSUME(x)
 #endif // CUDART_VERSION >= 11010
 
-#ifdef GGML_CUDA_F16
-typedef half  dfloat; // dequantize float
-typedef half2 dfloat2;
-#else
-typedef float  dfloat; // dequantize float
-typedef float2 dfloat2;
-#endif // GGML_CUDA_F16
-
 #if (!defined(GGML_USE_HIP) && !defined(GGML_CUDA_NO_VMM)) || (defined(GGML_USE_HIP) && !defined(GGML_HIP_NO_VMM))
 #define GGML_USE_VMM
 #endif // (!defined(GGML_USE_HIP) && !defined(GGML_CUDA_NO_VMM)) || (defined(GGML_USE_HIP) && !defined(GGML_HIP_NO_VMM))
@@ -490,13 +484,14 @@ static __device__ __forceinline__ half2 warp_reduce_max(half2 x) {
 #endif // !defined(GGML_USE_HIP) && __CUDA_ARCH__ >= GGML_CUDA_CC_PASCAL || defined(GGML_USE_HIP)
 }
 
-#if CUDART_VERSION < CUDART_HMASK
+#if (defined(CUDART_VERSION) && CUDART_VERSION < CUDART_HMASK) || defined(GGML_USE_HIP) || \
+    (defined(MUSART_VERSION) && MUSART_VERSION < MUSART_HMASK)
 static __device__ __forceinline__ uint32_t __hgt2_mask(const half2 a, const half2 b) {
     const uint32_t mask_low  = 0x0000FFFF * (float( __low2half(a)) > float( __low2half(b)));
     const uint32_t mask_high = 0xFFFF0000 * (float(__high2half(a)) > float(__high2half(b)));
     return mask_low | mask_high;
 }
-#endif // CUDART_VERSION < CUDART_HMASK
+#endif // (defined(CUDART_VERSION) && CUDART_VERSION < CUDART_HMASK) || defined(GGML_USE_HIP) || (defined(MUSART_VERSION) && MUSART_VERSION < MUSART_HMASK)
 
 static __device__ __forceinline__ int ggml_cuda_dp4a(const int a, const int b, int c) {
 #if defined(GGML_USE_HIP)

@@ -556,7 +551,7 @@ static __device__ __forceinline__ float ggml_cuda_e8m0_to_fp32(uint8_t x) {
 #endif // CUDART_VERSION >= 12050
 }
 
-typedef void (*dequantize_kernel_t)(const void * vx, const int64_t ib, const int iqs, dfloat2 & v);
+typedef void (*dequantize_kernel_t)(const void * vx, const int64_t ib, const int iqs, float2 & v);
 
 static __device__ __forceinline__ float get_alibi_slope(
     const float max_bias, const uint32_t h, const uint32_t n_head_log2, const float m0, const float m1
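Note: the widened preprocessor condition above only changes when the `__hgt2_mask` fallback is compiled (now also for HIP and older MUSA); the computation itself is unchanged. A portable C++ illustration of what it produces, using plain floats in place of the two halves of a `half2`:

```cpp
#include <cstdint>

// Builds a 32-bit mask whose low/high 16 bits are all-ones exactly where the
// corresponding half of 'a' is greater than the corresponding half of 'b'.
uint32_t hgt2_mask_ref(float a_lo, float a_hi, float b_lo, float b_hi) {
    const uint32_t mask_low  = 0x0000FFFFu * (a_lo > b_lo);
    const uint32_t mask_high = 0xFFFF0000u * (a_hi > b_hi);
    return mask_low | mask_high;
}
```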
@@ -34,10 +34,7 @@ static __global__ void conv_transpose_1d_kernel(
         }
     }
     dst[global_index] = accumulator;
-    GGML_UNUSED(p0); GGML_UNUSED(d0); GGML_UNUSED(src0_ne3);
-    GGML_UNUSED(src1_ne3); GGML_UNUSED(dst_ne3);
-    GGML_UNUSED(src1_ne1); GGML_UNUSED(dst_ne1);
-    GGML_UNUSED(src1_ne2); GGML_UNUSED(dst_ne2);
+    GGML_UNUSED_VARS(p0, d0, src0_ne3, src1_ne3, dst_ne3, src1_ne1, dst_ne1, src1_ne2, dst_ne2);
 }
 
 static void conv_transpose_1d_f32_f32_cuda(
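Note: this hunk is the first of many in the commit that collapse chains of `GGML_UNUSED(x); GGML_UNUSED(y); ...` into a single variadic `GGML_UNUSED_VARS(...)`. The sketch below shows one way such a helper can be written in C++17; it is an illustration only, and the actual macro in ggml may be implemented differently (for example purely in the preprocessor).

```cpp
// Every argument is forwarded into an empty function, so the compiler sees a
// use for each name and the whole thing stays a single statement.
template <typename... Args>
inline void unused_vars_impl(Args && ...) {}

#define UNUSED_VARS(...) unused_vars_impl(__VA_ARGS__)

int demo(int a, float b, const char * c) {
    UNUSED_VARS(a, b, c);  // replaces: (void) a; (void) b; (void) c;
    return 0;
}
```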
@@ -27,7 +27,7 @@ static __global__ void dequantize_block(const void * __restrict__ vx, dst_t * __
     const int64_t y_offset = qr == 1 ? 1 : qk/2;
 
     // dequantize
-    dfloat2 v;
+    float2 v;
     dequantize_kernel(vx, ib, iqs, v);
 
     const int64_t iy0 = ((i03*ne02 + i02)*ne01 + i01)*ne00 + iybs + iqs;

@@ -71,9 +71,7 @@ static __global__ void dequantize_block_q8_0_f16(const void * __restrict__ vx, h
         y2[iy/2 + threadIdx.x] = __hmul2(make_half2(qs.x, qs.y), __half2half2(d));
     }
 #else
-    GGML_UNUSED(vx);
-    GGML_UNUSED(y);
-    GGML_UNUSED(k);
+    GGML_UNUSED_VARS(vx, y, k);
     NO_DEVICE_CODE;
 #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_PASCAL
 }
@@ -42,7 +42,7 @@ static __device__ void cpy_blck_q8_0_f32(const char * cxi, char * cdsti) {
 
 #pragma unroll
     for (int j = 0; j < QK8_0; j += 2) {
-        dfloat2 dq;
+        float2 dq;
         dequantize_q8_0(cxi, 0, j, dq);
         *(cdstf + j) = dq.x;
         *(cdstf + j + 1) = dq.y;

@@ -55,7 +55,7 @@ static __device__ void cpy_blck_q_f32(const char * cxi, char * cdsti) {
 
 #pragma unroll
     for (int j = 0; j < qk/2; j++) {
-        dfloat2 dq;
+        float2 dq;
         dequant(cxi, 0, j, dq);
         *(cdstf + j) = dq.x;
         *(cdstf + j + qk/2) = dq.y;
@@ -134,8 +134,7 @@ void ggml_cuda_cpy_dest_ptrs_copy(ggml_cuda_graph * cuda_graph, char ** host_des
         CUDA_CHECK(cudaMemcpyAsync(cuda_graph->dest_ptrs_d, host_dest_ptrs, host_dest_ptrs_size*sizeof(char *), cudaMemcpyHostToDevice, stream));
         cuda_graph->graph_cpynode_index = 0; // reset index
 #else
-    GGML_UNUSED(cuda_graph); GGML_UNUSED(host_dest_ptrs);
-    GGML_UNUSED(host_dest_ptrs_size); GGML_UNUSED(stream);
+    GGML_UNUSED_VARS(cuda_graph, host_dest_ptrs, host_dest_ptrs_size, stream);
 #endif
 }
@@ -1,48 +1,37 @@
 #include "common.cuh"
 
-static __device__ __forceinline__ void dequantize_q4_0(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
+static __device__ __forceinline__ void dequantize_q4_0(const void * vx, const int64_t ib, const int iqs, float2 & v){
     const block_q4_0 * x = (const block_q4_0 *) vx;
 
-    const dfloat d = x[ib].d;
+    const float d = x[ib].d;
 
     const int vui = x[ib].qs[iqs];
 
     v.x = vui & 0xF;
     v.y = vui >> 4;
 
-#ifdef GGML_CUDA_F16
-    v = __hsub2(v, {8.0f, 8.0f});
-    v = __hmul2(v, {d, d});
-#else
     v.x = (v.x - 8.0f) * d;
     v.y = (v.y - 8.0f) * d;
-#endif // GGML_CUDA_F16
 }
 
-static __device__ __forceinline__ void dequantize_q4_1(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
+static __device__ __forceinline__ void dequantize_q4_1(const void * vx, const int64_t ib, const int iqs, float2 & v){
     const block_q4_1 * x = (const block_q4_1 *) vx;
 
-    const dfloat d = __low2half(x[ib].dm);
-    const dfloat m = __high2half(x[ib].dm);
+    const float2 dm = __half22float2(x[ib].dm);
 
     const int vui = x[ib].qs[iqs];
 
     v.x = vui & 0xF;
     v.y = vui >> 4;
 
-#ifdef GGML_CUDA_F16
-    v = __hmul2(v, {d, d});
-    v = __hadd2(v, {m, m});
-#else
-    v.x = (v.x * d) + m;
-    v.y = (v.y * d) + m;
-#endif // GGML_CUDA_F16
+    v.x = (v.x * dm.x) + dm.y;
+    v.y = (v.y * dm.x) + dm.y;
 }
 
-static __device__ __forceinline__ void dequantize_q5_0(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
+static __device__ __forceinline__ void dequantize_q5_0(const void * vx, const int64_t ib, const int iqs, float2 & v){
     const block_q5_0 * x = (const block_q5_0 *) vx;
 
-    const dfloat d = x[ib].d;
+    const float d = x[ib].d;
 
     uint32_t qh;
     memcpy(&qh, x[ib].qh, sizeof(qh));

@@ -53,20 +42,14 @@ static __device__ __forceinline__ void dequantize_q5_0(const void * vx, const in
     v.x = ((x[ib].qs[iqs] & 0xf) | xh_0);
     v.y = ((x[ib].qs[iqs] >>  4) | xh_1);
 
-#ifdef GGML_CUDA_F16
-    v = __hsub2(v, {16.0f, 16.0f});
-    v = __hmul2(v, {d, d});
-#else
     v.x = (v.x - 16.0f) * d;
     v.y = (v.y - 16.0f) * d;
-#endif // GGML_CUDA_F16
 }
 
-static __device__ __forceinline__ void dequantize_q5_1(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
+static __device__ __forceinline__ void dequantize_q5_1(const void * vx, const int64_t ib, const int iqs, float2 & v){
     const block_q5_1 * x = (const block_q5_1 *) vx;
 
-    const dfloat d = __low2half(x[ib].dm);
-    const dfloat m = __high2half(x[ib].dm);
+    const float2 dm = __half22float2(x[ib].dm);
 
     uint32_t qh;
     memcpy(&qh, x[ib].qh, sizeof(qh));

@@ -77,27 +60,18 @@ static __device__ __forceinline__ void dequantize_q5_1(const void * vx, const in
     v.x = ((x[ib].qs[iqs] & 0xf) | xh_0);
     v.y = ((x[ib].qs[iqs] >>  4) | xh_1);
 
-#ifdef GGML_CUDA_F16
-    v = __hmul2(v, {d, d});
-    v = __hadd2(v, {m, m});
-#else
-    v.x = (v.x * d) + m;
-    v.y = (v.y * d) + m;
-#endif // GGML_CUDA_F16
+    v.x = (v.x * dm.x) + dm.y;
+    v.y = (v.y * dm.x) + dm.y;
 }
 
-static __device__ __forceinline__ void dequantize_q8_0(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
+static __device__ __forceinline__ void dequantize_q8_0(const void * vx, const int64_t ib, const int iqs, float2 & v){
     const block_q8_0 * x = (const block_q8_0 *) vx;
 
-    const dfloat d = x[ib].d;
+    const float d = x[ib].d;
 
     v.x = x[ib].qs[iqs + 0];
     v.y = x[ib].qs[iqs + 1];
 
-#ifdef GGML_CUDA_F16
-    v = __hmul2(v, {d, d});
-#else
     v.x *= d;
     v.y *= d;
-#endif // GGML_CUDA_F16
 }
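Note: with the `GGML_CUDA_F16` dequantize path removed, these helpers always produce `float2` values. A host-side sketch of the q4_1 case after the change, with a simplified block layout (the real `block_q4_1` stores `dm` as a pair of halfs, converted once via `__half22float2`):

```cpp
#include <cstdint>

struct q4_1_block_ref { float d, m; uint8_t qs[16]; };  // scale, offset, 32 packed 4-bit values

// Both nibbles of one quant byte are expanded as y = q*d + m.
void dequantize_q4_1_ref(const q4_1_block_ref & b, int iqs, float & lo, float & hi) {
    const int vui = b.qs[iqs];
    lo = (float) (vui & 0xF) * b.d + b.m;  // low nibble
    hi = (float) (vui >>  4) * b.d + b.m;  // high nibble
}
```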
@@ -704,28 +704,6 @@ static __global__ void flash_attn_combine_results(
     dst[tid] = VKQ_numerator / VKQ_denominator;
 }
 
-[[noreturn]]
-static void on_no_fattn_vec_case(const int D) {
-    if (D == 64) {
-        fprintf(stderr, "Unsupported KV type combination for head_size 64.\n");
-        fprintf(stderr, "By default only f16 KV cache is supported.\n");
-        fprintf(stderr, "Compile with GGML_CUDA_FA_ALL_QUANTS for V cache quantization support.\n");
-        GGML_ABORT("fatal error");
-    } else if (D == 128) {
-        fprintf(stderr, "Unsupported KV type combination for head_size 128.\n");
-        fprintf(stderr, "Supported combinations:\n");
-        fprintf(stderr, "  - K == q4_0, V == q4_0,  4.50 BPV\n");
-        fprintf(stderr, "  - K == q8_0, V == q8_0,  8.50 BPV\n");
-        fprintf(stderr, "  - K == f16,  V == f16,  16.00 BPV\n");
-        fprintf(stderr, "Compile with GGML_CUDA_FA_ALL_QUANTS for all combinations of q4_0, q4_1, q5_0, q5_1, q8_0, and f16.\n");
-        GGML_ABORT("fatal error");
-    } else {
-        fprintf(stderr, "Unsupported KV type combination for head_size %d.\n", D);
-        fprintf(stderr, "Only f16 is supported.\n");
-        GGML_ABORT("fatal error");
-    }
-}
-
 template <int DV, int ncols1, int ncols2>
 void launch_fattn(
     ggml_backend_cuda_context & ctx, ggml_tensor * dst, fattn_kernel_t fattn_kernel, const int nwarps, const size_t nbytes_shared,
@@ -767,14 +767,11 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
         }
     }
 #else
-    GGML_UNUSED(Q_f2); GGML_UNUSED(K_h2); GGML_UNUSED(V_h2);
-    GGML_UNUSED(mask_h2); GGML_UNUSED(dstk); GGML_UNUSED(dstk_fixup);
-    GGML_UNUSED(scale); GGML_UNUSED(slope); GGML_UNUSED(logit_softcap);
-    GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(stride_K); GGML_UNUSED(stride_V);
-    GGML_UNUSED(stride_mask); GGML_UNUSED(tile_K);
-    GGML_UNUSED(tile_V); GGML_UNUSED(tile_mask); GGML_UNUSED(Q_B);
-    GGML_UNUSED(VKQ_C); GGML_UNUSED(KQ_max); GGML_UNUSED(KQ_rowsum);
-    GGML_UNUSED(kb0); GGML_UNUSED(tile_Q);
+    GGML_UNUSED_VARS(Q_f2, K_h2, V_h2, mask_h2, dstk, dstk_fixup,
+        scale, slope, logit_softcap, ne01, ne02,
+        stride_K, stride_V, stride_mask,
+        tile_Q, tile_K, tile_V, tile_mask,
+        Q_B, VKQ_C, KQ_max, KQ_rowsum, kb0);
     NO_DEVICE_CODE;
 #endif // TURING_MMA_AVAILABLE
 }

@@ -1236,12 +1233,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
         }
     }
 #else
-    GGML_UNUSED(Q_f2); GGML_UNUSED(K_h2); GGML_UNUSED(V_h2);
-    GGML_UNUSED(mask_h2); GGML_UNUSED(dstk); GGML_UNUSED(dstk_fixup);
-    GGML_UNUSED(scale); GGML_UNUSED(slope); GGML_UNUSED(logit_softcap);
-    GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(stride_Q1);
-    GGML_UNUSED(stride_Q2); GGML_UNUSED(stride_K); GGML_UNUSED(stride_V); GGML_UNUSED(stride_mask);
-    GGML_UNUSED(jt); GGML_UNUSED(kb0_start); GGML_UNUSED(kb0_stop);
+    GGML_UNUSED_VARS(Q_f2, K_h2, V_h2, mask_h2, sinks_f, dstk, dstk_fixup,
+        scale, slope, logit_softcap, ne01, ne02,
+        stride_Q1, stride_Q2, stride_K, stride_V, stride_mask,
+        jt, kb0_start, kb0_stop);
     NO_DEVICE_CODE;
 #endif // TURING_MMA_AVAILABLE
 }

@@ -1395,17 +1390,15 @@ static __global__ void flash_attn_ext_f16(
         (Q_f2, K_h2, V_h2, mask_h2, sinks_f, dstk, dst_meta, scale, slope, logit_softcap,
          ne01, ne02, stride_Q1, stride_Q2, stride_K, stride_V, stride_mask, jt, kb0_start_kernel, kb0_stop_kernel);
 #else
-    GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
-    GGML_UNUSED(dst); GGML_UNUSED(dst_meta);
-    GGML_UNUSED(scale); GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
-    GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
-    GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
-    GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
-    GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
-    GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
-    GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
-    GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
-    GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
+    GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
+        max_bias, m0, m1, n_head_log2, logit_softcap,
+        ne00, ne01, ne02, ne03,
+        nb01, nb02, nb03,
+        ne10, ne11, ne12, ne13,
+        nb11, nb12, nb13,
+        nb21, nb22, nb23,
+        ne31, ne32, ne33,
+        nb31, nb32, nb33);
     NO_DEVICE_CODE;
 #endif // defined(FLASH_ATTN_AVAILABLE) && defined(TURING_MMA_AVAILABLE)
 }
@@ -299,17 +299,15 @@ static __global__ void flash_attn_tile_ext_f16(
         }
     }
 #else
-    GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
-    GGML_UNUSED(dst); GGML_UNUSED(dst_meta); GGML_UNUSED(scale);
-    GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
-    GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
-    GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02);
-    GGML_UNUSED(ne03); GGML_UNUSED(ne10); GGML_UNUSED(ne11);
-    GGML_UNUSED(ne12); GGML_UNUSED(ne13); GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
-    GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33); GGML_UNUSED(nb01); GGML_UNUSED(nb02);
-    GGML_UNUSED(nb03); GGML_UNUSED(nb11); GGML_UNUSED(nb12);
-    GGML_UNUSED(nb13); GGML_UNUSED(nb21); GGML_UNUSED(nb22);
-    GGML_UNUSED(nb23);
+    GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
+        max_bias, m0, m1, n_head_log2, logit_softcap,
+        ne00, ne01, ne02, ne03,
+        nb01, nb02, nb03,
+        ne10, ne11, ne12, ne13,
+        nb11, nb12, nb13,
+        nb21, nb22, nb23,
+        ne31, ne32, ne33,
+        nb31, nb32, nb33);
     NO_DEVICE_CODE;
 #endif // defined(FLASH_ATTN_AVAILABLE) && defined(FP16_AVAILABLE)
 }
@@ -38,17 +38,15 @@ static __global__ void flash_attn_tile_ext_f32(
         return;
 #endif // FP16_MMA_AVAILABLE
     if (use_logit_softcap && !(D == 128 || D == 256)) {
-        GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
-        GGML_UNUSED(dst); GGML_UNUSED(dst_meta);
-        GGML_UNUSED(scale); GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
-        GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
-        GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
-        GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
-        GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
-        GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
-        GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
-        GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
-        GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
+        GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
+            max_bias, m0, m1, n_head_log2, logit_softcap,
+            ne00, ne01, ne02, ne03,
+            nb01, nb02, nb03,
+            ne10, ne11, ne12, ne13,
+            nb11, nb12, nb13,
+            nb21, nb22, nb23,
+            ne31, ne32, ne33,
+            nb31, nb32, nb33);
         NO_DEVICE_CODE;
         return;
     }

@@ -312,17 +310,15 @@ static __global__ void flash_attn_tile_ext_f32(
         }
     }
 #else
-    GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask);
-    GGML_UNUSED(dst); GGML_UNUSED(dst_meta);
-    GGML_UNUSED(scale); GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
-    GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
-    GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
-    GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
-    GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
-    GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
-    GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
-    GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
-    GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
+    GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
+        max_bias, m0, m1, n_head_log2, logit_softcap,
+        ne00, ne01, ne02, ne03,
+        nb01, nb02, nb03,
+        ne10, ne11, ne12, ne13,
+        nb11, nb12, nb13,
+        nb21, nb22, nb23,
+        ne31, ne32, ne33,
+        nb31, nb32, nb33);
     NO_DEVICE_CODE;
 #endif // FLASH_ATTN_AVAILABLE
 }
@@ -349,17 +349,15 @@ static __global__ void flash_attn_vec_ext_f16(
         dst_meta[((sequence*ne01 + ic0 + tid)*ne02 + head)*gridDim.y + blockIdx.y] = make_float2(kqmax[tid], kqsum[tid]);
     }
 #else
-    GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
-    GGML_UNUSED(dst); GGML_UNUSED(dst_meta);
-    GGML_UNUSED(scale); GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
-    GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
-    GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
-    GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
-    GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
-    GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
-    GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
-    GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
-    GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
+    GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
+        max_bias, m0, m1, n_head_log2, logit_softcap,
+        ne00, ne01, ne02, ne03,
+        nb01, nb02, nb03,
+        ne10, ne11, ne12, ne13,
+        nb11, nb12, nb13,
+        nb21, nb22, nb23,
+        ne31, ne32, ne33,
+        nb31, nb32, nb33);
     NO_DEVICE_CODE;
 #endif // defined(FLASH_ATTN_AVAILABLE) && defined(FP16_AVAILABLE)
 }
@@ -37,17 +37,15 @@ static __global__ void flash_attn_vec_ext_f32(
 
     // Skip unused kernel variants for faster compilation:
     if (use_logit_softcap && !(D == 128 || D == 256)) {
-        GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask);
-        GGML_UNUSED(dst); GGML_UNUSED(dst_meta); GGML_UNUSED(scale);
-        GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
-        GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
-        GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02);
-        GGML_UNUSED(ne03); GGML_UNUSED(ne10); GGML_UNUSED(ne11);
-        GGML_UNUSED(ne12); GGML_UNUSED(ne13); GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
-        GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33); GGML_UNUSED(nb01); GGML_UNUSED(nb02);
-        GGML_UNUSED(nb03); GGML_UNUSED(nb11); GGML_UNUSED(nb12);
-        GGML_UNUSED(nb13); GGML_UNUSED(nb21); GGML_UNUSED(nb22);
-        GGML_UNUSED(nb23);
+        GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
+            max_bias, m0, m1, n_head_log2, logit_softcap,
+            ne00, ne01, ne02, ne03,
+            nb01, nb02, nb03,
+            ne10, ne11, ne12, ne13,
+            nb11, nb12, nb13,
+            nb21, nb22, nb23,
+            ne31, ne32, ne33,
+            nb31, nb32, nb33);
         NO_DEVICE_CODE;
         return;
     }

@@ -345,17 +343,15 @@ static __global__ void flash_attn_vec_ext_f32(
         dst_meta[((sequence*ne01 + ic0 + tid)*ne02 + head)*gridDim.y + blockIdx.y] = make_float2(kqmax[tid], kqsum[tid]);
     }
 #else
-    GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask);
-    GGML_UNUSED(dst); GGML_UNUSED(dst_meta); GGML_UNUSED(scale);
-    GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
-    GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
-    GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
-    GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
-    GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
-    GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
-    GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
-    GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
-    GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
+    GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
+        max_bias, m0, m1, n_head_log2, logit_softcap,
+        ne00, ne01, ne02, ne03,
+        nb01, nb02, nb03,
+        ne10, ne11, ne12, ne13,
+        nb11, nb12, nb13,
+        nb21, nb22, nb23,
+        ne31, ne32, ne33,
+        nb31, nb32, nb33);
     NO_DEVICE_CODE;
 #endif // FLASH_ATTN_AVAILABLE
 }
@@ -471,16 +471,15 @@ static __global__ void flash_attn_ext_f16(
         dst_meta[j_dst_unrolled] = dst_meta_val;
     }
 #else
-    GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
-    GGML_UNUSED(dst); GGML_UNUSED(dst_meta); GGML_UNUSED(scale);
-    GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
-    GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
-    GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
-    GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
-    GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33); GGML_UNUSED(nb31);
-    GGML_UNUSED(nb32); GGML_UNUSED(nb33); GGML_UNUSED(nb01); GGML_UNUSED(nb02);
-    GGML_UNUSED(nb03); GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
-    GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
+    GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
+        max_bias, m0, m1, n_head_log2, logit_softcap,
+        ne00, ne01, ne02, ne03,
+        nb01, nb02, nb03,
+        ne10, ne11, ne12, ne13,
+        nb11, nb12, nb13,
+        nb21, nb22, nb23,
+        ne31, ne32, ne33,
+        nb31, nb32, nb33);
     NO_DEVICE_CODE;
 #endif // defined(FLASH_ATTN_AVAILABLE) && (__CUDA_ARCH__ == GGML_CUDA_CC_VOLTA || (defined(GGML_HIP_ROCWMMA_FATTN) && defined(FP16_MMA_AVAILABLE)))
 }
@@ -190,7 +190,7 @@ static void ggml_cuda_flash_attn_ext_vec_f16(ggml_backend_cuda_context & ctx, gg
     FATTN_VEC_F16_CASE(256, GGML_TYPE_F16, GGML_TYPE_F16)
 #endif // GGML_CUDA_FA_ALL_QUANTS
 
-    on_no_fattn_vec_case(Q->ne[0]);
+    GGML_ABORT("fatal error");
 }
 
 #define FATTN_VEC_F32_CASE(D, type_K, type_V) \
@@ -265,74 +265,184 @@ static void ggml_cuda_flash_attn_ext_vec_f32(ggml_backend_cuda_context & ctx, gg
     FATTN_VEC_F32_CASE(256, GGML_TYPE_F16, GGML_TYPE_F16)
 #endif // GGML_CUDA_FA_ALL_QUANTS
 
-    on_no_fattn_vec_case(Q->ne[0]);
+    GGML_ABORT("fatal error");
 }
 
-void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+// Best FlashAttention kernel for a specific GPU:
+enum best_fattn_kernel {
+    BEST_FATTN_KERNEL_NONE     =   0,
+    BEST_FATTN_KERNEL_TILE_F32 = 200,
+    BEST_FATTN_KERNEL_TILE_F16 = 210,
+    BEST_FATTN_KERNEL_VEC_F32  = 100,
+    BEST_FATTN_KERNEL_VEC_F16  = 110,
+    BEST_FATTN_KERNEL_WMMA_F16 = 300,
+    BEST_FATTN_KERNEL_MMA_F16  = 400,
+};
+
+static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const ggml_tensor * dst) {
+#ifndef FLASH_ATTN_AVAILABLE
+    GGML_UNUSED(device); GGML_UNUSED(dst);
+    return BEST_FATTN_KERNEL_NONE;
+#endif// FLASH_ATTN_AVAILABLE
+
     const ggml_tensor * KQV  = dst;
     const ggml_tensor * Q    = dst->src[0];
     const ggml_tensor * K    = dst->src[1];
     const ggml_tensor * V    = dst->src[2];
     const ggml_tensor * mask = dst->src[3];
 
-    ggml_cuda_set_device(ctx.device);
-    const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
-    const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
+    const int gqa_ratio = Q->ne[2] / K->ne[2];
+    GGML_ASSERT(Q->ne[2] % K->ne[2] == 0);
+
+    const int cc        = ggml_cuda_info().devices[device].cc;
+    const int warp_size = ggml_cuda_info().devices[device].warp_size;
     const enum ggml_prec prec = ggml_flash_attn_ext_get_prec(KQV);
 
-#if defined(GGML_HIP_ROCWMMA_FATTN)
-    if (GGML_CUDA_CC_IS_AMD(cc) && fp16_mma_available(cc)) {
-        ggml_cuda_flash_attn_ext_wmma_f16(ctx, dst);
-        return;
-    }
-#endif // defined(GGML_HIP_ROCWMMA_FATTN)
-
-    if (!fast_fp16_available(cc)) {
-        if (Q->ne[1] <= 8 || Q->ne[0] == 256) {
-            ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
-        } else {
-            ggml_cuda_flash_attn_ext_tile_f32(ctx, dst);
-        }
-        return;
-    }
-
-    if (!fp16_mma_available(cc)) {
-        if (prec == GGML_PREC_DEFAULT) {
-            if (Q->ne[1] <= 8 || Q->ne[0] == 256) {
-                ggml_cuda_flash_attn_ext_vec_f16(ctx, dst);
-            } else {
-                ggml_cuda_flash_attn_ext_tile_f16(ctx, dst);
-            }
-        } else {
-            if (Q->ne[1] <= 8 || Q->ne[0] == 256) {
-                ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
-            } else {
-                ggml_cuda_flash_attn_ext_tile_f32(ctx, dst);
-            }
-        }
-        return;
-    }
+    switch (K->ne[0]) {
+        case  64:
+        case 128:
+        case 256:
+            if (V->ne[0] != K->ne[0]) {
+                return BEST_FATTN_KERNEL_NONE;
+            }
+            break;
+        case  80:
+        case  96:
+        case 112:
+            if (V->ne[0] != K->ne[0]) {
+                return BEST_FATTN_KERNEL_NONE;
+            }
+            if (!fp16_mma_available(cc) && !turing_mma_available(cc)) {
+                return BEST_FATTN_KERNEL_NONE;
+            }
+            break;
+        case 576:
+            if (V->ne[0] != 512) {
+                return BEST_FATTN_KERNEL_NONE;
+            }
+            if (!turing_mma_available(cc) || gqa_ratio % 16 != 0) {
+                return BEST_FATTN_KERNEL_NONE;
+            }
+            break;
+        default:
+            return BEST_FATTN_KERNEL_NONE;
+    }
+
+#ifndef GGML_CUDA_FA_ALL_QUANTS
+    if (K->type != V->type) {
+        return BEST_FATTN_KERNEL_NONE;
+    }
+#endif // GGML_CUDA_FA_ALL_QUANTS
+
+    switch (K->type) {
+        case GGML_TYPE_F16:
+            break;
+        case GGML_TYPE_Q4_1:
+        case GGML_TYPE_Q5_0:
+        case GGML_TYPE_Q5_1:
+#ifndef GGML_CUDA_FA_ALL_QUANTS
+            return BEST_FATTN_KERNEL_NONE;
+#endif // GGML_CUDA_FA_ALL_QUANTS
+        case GGML_TYPE_Q4_0:
+        case GGML_TYPE_Q8_0:
+#ifdef GGML_CUDA_FA_ALL_QUANTS
+            if (K->ne[0] != 128 && K->ne[0] != 64) {
+                return BEST_FATTN_KERNEL_NONE;
+            }
+#else
+            if (K->ne[0] != 128) {
+                return BEST_FATTN_KERNEL_NONE;
+            }
+#endif // GGML_CUDA_FA_ALL_QUANTS
+            break;
+        default:
+            return BEST_FATTN_KERNEL_NONE;
+    }
+
+    switch (V->type) {
+        case GGML_TYPE_F16:
+            break;
+        case GGML_TYPE_Q4_1:
+        case GGML_TYPE_Q5_0:
+        case GGML_TYPE_Q5_1:
+        case GGML_TYPE_Q4_0:
+        case GGML_TYPE_Q8_0:
+            if (K->ne[0] != 128) {
+                return BEST_FATTN_KERNEL_NONE;
+            }
+            break;
+        default:
+            return BEST_FATTN_KERNEL_NONE;
+    }
+
+    if (mask && mask->ne[2] != 1) {
+        return BEST_FATTN_KERNEL_NONE;
+    }
 
-    const bool gqa_opt_applies = ((Q->ne[2] / K->ne[2]) % 2 == 0) && mask; // The mma-based kernels have GQA-specific optimizations
-    const bool mma_needs_data_conversion = K->type != GGML_TYPE_F16 || V->type != GGML_TYPE_F16;
-    const bool mma_faster_for_rtx4000 = Q->ne[3] > 1 || (Q->ne[2] > 4*K->ne[2] && K->ne[1] >= 8192);
-    const bool mma_faster_for_bs1 = turing_mma_available(cc) && gqa_opt_applies && !mma_needs_data_conversion &&
-        (cc < GGML_CUDA_CC_ADA_LOVELACE || mma_faster_for_rtx4000);
     const bool can_use_vector_kernel = Q->ne[0] <= 256 && Q->ne[0] % (2*warp_size) == 0;
-    if (Q->ne[1] == 1 && can_use_vector_kernel && !mma_faster_for_bs1) {
-        if (prec == GGML_PREC_DEFAULT) {
-            ggml_cuda_flash_attn_ext_vec_f16(ctx, dst);
-        } else {
-            ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
-        }
-        return;
-    }
-
-    // The MMA implementation needs Turing or newer, use the old WMMA code for Volta:
-    if (fp16_mma_available(cc) && !turing_mma_available(cc)) {
-        ggml_cuda_flash_attn_ext_wmma_f16(ctx, dst);
-        return;
-    }
-
-    ggml_cuda_flash_attn_ext_mma_f16(ctx, dst);
+
+    // If Turing tensor cores available, use them except for some cases with batch size 1:
+    if (turing_mma_available(cc)) {
+        const bool gqa_opt_applies = gqa_ratio % 2 == 0 && mask; // The mma-based kernels have GQA-specific optimizations
+        const bool mma_needs_data_conversion = K->type != GGML_TYPE_F16 || V->type != GGML_TYPE_F16;
+        const bool mma_faster_for_rtx4000 = Q->ne[3] > 1 || (gqa_ratio > 4 && K->ne[1] >= 8192);
+        const bool mma_faster_for_bs1 = gqa_opt_applies && !mma_needs_data_conversion &&
+            (cc < GGML_CUDA_CC_ADA_LOVELACE || mma_faster_for_rtx4000);
+        if (Q->ne[1] == 1 && can_use_vector_kernel && !mma_faster_for_bs1) {
+            if (prec == GGML_PREC_DEFAULT && fast_fp16_available(cc)) {
+                return BEST_FATTN_KERNEL_VEC_F16;
+            }
+            return BEST_FATTN_KERNEL_VEC_F32;
+        }
+        return BEST_FATTN_KERNEL_MMA_F16;
+    }
+
+    // Use kernels specializes for small batch sizes if possible:
+    if (Q->ne[1] <= 8 && can_use_vector_kernel) {
+        if (prec == GGML_PREC_DEFAULT && fast_fp16_available(cc)) {
+            return BEST_FATTN_KERNEL_VEC_F16;
+        }
+        return BEST_FATTN_KERNEL_VEC_F32;
+    }
+
+    // For large batch sizes, use the WMMA kernel if possible:
+    if (fp16_mma_available(cc)) {
+        return BEST_FATTN_KERNEL_WMMA_F16;
+    }
+
+    // If there is no suitable kernel for tensor cores or small batch sizes, use the generic kernel for large batch sizes:
+    if (prec == GGML_PREC_DEFAULT && fast_fp16_available(cc)) {
+        return BEST_FATTN_KERNEL_TILE_F16;
+    }
+    return BEST_FATTN_KERNEL_TILE_F32;
+}
+
+void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+    ggml_cuda_set_device(ctx.device);
+    switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) {
+        case BEST_FATTN_KERNEL_NONE:
+            GGML_ABORT("fatal error");
+        case BEST_FATTN_KERNEL_TILE_F32:
+            ggml_cuda_flash_attn_ext_tile_f32(ctx, dst);
+            break;
+        case BEST_FATTN_KERNEL_TILE_F16:
+            ggml_cuda_flash_attn_ext_tile_f16(ctx, dst);
+            break;
+        case BEST_FATTN_KERNEL_VEC_F32:
+            ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
+            break;
+        case BEST_FATTN_KERNEL_VEC_F16:
+            ggml_cuda_flash_attn_ext_vec_f16(ctx, dst);
+            break;
+        case BEST_FATTN_KERNEL_WMMA_F16:
+            ggml_cuda_flash_attn_ext_wmma_f16(ctx, dst);
+            break;
+        case BEST_FATTN_KERNEL_MMA_F16:
+            ggml_cuda_flash_attn_ext_mma_f16(ctx, dst);
+            break;
+    }
+}
+
+bool ggml_cuda_flash_attn_ext_supported(int device, const ggml_tensor * dst) {
+    return ggml_cuda_get_best_fattn_kernel(device, dst) != BEST_FATTN_KERNEL_NONE;
 }
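Note: the refactor above splits kernel dispatch into a side-effect-free "which kernel is best" query plus a switch that launches it, and exposes the query via `ggml_cuda_flash_attn_ext_supported`. A hedged usage sketch follows; the include paths are assumptions (the declaration comes from the header hunk below), and the helper name is hypothetical.

```cpp
#include "ggml.h"    // assumed: ggml_tensor, GGML_OP_FLASH_ATTN_EXT
#include "fattn.cuh" // assumed: ggml_cuda_flash_attn_ext_supported()

// A caller can reuse the central query instead of re-implementing the
// per-head-size/type rules that used to live in supports_op.
static bool can_offload_fattn(int device, const ggml_tensor * node) {
    return node->op == GGML_OP_FLASH_ATTN_EXT &&
           ggml_cuda_flash_attn_ext_supported(device, node);
}
```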
@@ -1,3 +1,5 @@
 #include "common.cuh"
 
 void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
+
+bool ggml_cuda_flash_attn_ext_supported(int device, const ggml_tensor * dst);
@@ -32,7 +32,7 @@ static __global__ void k_get_rows(
     const int y_offset = qr == 1 ? 1 : qk/2;
 
     // dequantize
-    dfloat2 v;
+    float2 v;
     dequantize_kernel(src0_row, ib, iqs, v);
 
     dst_row[iybs + iqs + 0] = ggml_cuda_cast<dst_t>(v.x);
@@ -1328,9 +1328,7 @@ static void ggml_cuda_op_mul_mat_cublas(
                 &beta,  dst_dd_i,  ldc));
     }
 
-    GGML_UNUSED(dst);
-    GGML_UNUSED(src1_ddq_i);
-    GGML_UNUSED(src1_padded_row_size);
+    GGML_UNUSED_VARS(dst, src1_ddq_i, src1_padded_row_size);
 }
 
 static void ggml_cuda_set_peer_access(const int n_tokens, int main_device) {
@@ -3499,44 +3497,8 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
         case GGML_OP_GATED_LINEAR_ATTN:
         case GGML_OP_RWKV_WKV7:
             return true;
-        case GGML_OP_FLASH_ATTN_EXT: {
-#ifndef FLASH_ATTN_AVAILABLE
-            return false;
-#endif // FLASH_ATTN_AVAILABLE
-            if (op->src[1]->ne[0] != op->src[2]->ne[0]) {
-                const int cc = ggml_cuda_info().devices[dev_ctx->device].cc;
-                if (!turing_mma_available(cc)) {
-                    return false;
-                }
-                const int gqa_ratio = op->src[0]->ne[2] / op->src[1]->ne[2];
-                return op->src[1]->ne[0] == 576 && op->src[2]->ne[0] == 512 && op->src[3] && gqa_ratio % 16 == 0;
-            }
-            // TODO: more general-purpose attention sink support [TAG_ATTN_SINKS]
-            if (op->src[4] && !fp16_mma_available(ggml_cuda_info().devices[dev_ctx->device].cc)
-                && op->src[0]->ne[0] != 64 && op->src[0]->ne[0] != 128) {
-                return false;
-            }
-            if (op->src[0]->ne[0] == 192) {
-                return false;
-            }
-            if (op->src[1]->type == GGML_TYPE_BF16 || op->src[2]->type == GGML_TYPE_BF16) {
-                return false;
-            }
-            if (op->src[0]->ne[0] ==  64 && op->src[1]->type == GGML_TYPE_F16) {
-                return true;
-            }
-            if (op->src[0]->ne[0] == 128) {
-                return true;
-            }
-            if (op->src[0]->ne[0] == 256 && op->src[1]->type == GGML_TYPE_F16 && op->src[2]->type == GGML_TYPE_F16) {
-                return true;
-            }
-            if (op->src[3] && op->src[3]->ne[2] != 1) {
-                return false;
-            }
-            return fp16_mma_available(ggml_cuda_info().devices[dev_ctx->device].cc) &&
-                op->src[1]->type == GGML_TYPE_F16 && op->src[2]->type == GGML_TYPE_F16;
-        }
+        case GGML_OP_FLASH_ATTN_EXT:
+            return ggml_cuda_flash_attn_ext_supported(dev_ctx->device, op);
         case GGML_OP_CROSS_ENTROPY_LOSS:
         case GGML_OP_CROSS_ENTROPY_LOSS_BACK:
         case GGML_OP_OPT_STEP_ADAMW:
@@ -3672,10 +3634,6 @@ static ggml_backend_feature * ggml_backend_cuda_get_features(ggml_backend_reg_t
     features.push_back({ "NO_PEER_COPY", "1" });
     #endif
 
-    #ifdef GGML_CUDA_F16
-    features.push_back({ "F16", "1" });
-    #endif
-
     #ifdef GGML_CUDA_USE_GRAPHS
     features.push_back({ "USE_GRAPHS", "1" });
     #endif
@@ -291,9 +291,7 @@ namespace ggml_cuda_mma {
             : "=r"(xi[0]), "=r"(xi[2]), "=r"(xi[1]), "=r"(xi[3])
             : "l"(xs));
 #else
-        GGML_UNUSED(t);
-        GGML_UNUSED(xs0);
-        GGML_UNUSED(stride);
+        GGML_UNUSED_VARS(t, xs0, stride);
         NO_DEVICE_CODE;
 #endif // TURING_MMA_AVAILABLE
     }

@@ -315,9 +313,7 @@ namespace ggml_cuda_mma {
             : "r"(A.x[1]), "r"(B.x[0]));
 #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
         NO_DEVICE_CODE;
 #endif // TURING_MMA_AVAILABLE
     }

@@ -345,9 +341,7 @@ namespace ggml_cuda_mma {
             : "r"(A.x[3]), "r"(B.x[1]));
 #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
        NO_DEVICE_CODE;
 #endif // TURING_MMA_AVAILABLE
     }

@@ -372,9 +366,7 @@ namespace ggml_cuda_mma {
             : "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[1]));
 #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
         NO_DEVICE_CODE;
 #endif // TURING_MMA_AVAILABLE
     }

@@ -408,9 +400,7 @@ namespace ggml_cuda_mma {
             : "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[3]));
 #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
         NO_DEVICE_CODE;
 #endif // TURING_MMA_AVAILABLE
     }

@@ -425,9 +415,7 @@ namespace ggml_cuda_mma {
             : "+r"(Dxi[0]), "+r"(Dxi[1]), "+r"(Dxi[2]), "+r"(Dxi[3])
             : "r"(Axi[0]), "r"(Axi[1]), "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[0]), "r"(Bxi[1]));
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
         NO_DEVICE_CODE;
 #endif // AMPERE_MMA_AVAILABLE
     }

@@ -452,9 +440,7 @@ namespace ggml_cuda_mma {
             : "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[1]));
 #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
         NO_DEVICE_CODE;
 #endif // TURING_MMA_AVAILABLE
     }

@@ -469,9 +455,7 @@ namespace ggml_cuda_mma {
             : "+r"(Dxi[0]), "+r"(Dxi[1]), "+r"(Dxi[2]), "+r"(Dxi[3])
             : "r"(Axi[0]), "r"(Axi[1]), "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[0]), "r"(Bxi[1]));
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
         NO_DEVICE_CODE;
 #endif // AMPERE_MMA_AVAILABLE
     }

@@ -505,9 +489,7 @@ namespace ggml_cuda_mma {
             : "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[3]));
 #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
         NO_DEVICE_CODE;
 #endif // TURING_MMA_AVAILABLE
     }

@@ -533,9 +515,7 @@ namespace ggml_cuda_mma {
                 0, 0, 0);
 #endif // defined(CDNA3)
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
         NO_DEVICE_CODE;
 #endif // AMD_MFMA_AVAILABLE
     }

@@ -561,9 +541,7 @@ namespace ggml_cuda_mma {
                 0, 0, 0);
 #endif // defined(CDNA3)
 #else
-        GGML_UNUSED(D);
-        GGML_UNUSED(A);
-        GGML_UNUSED(B);
+        GGML_UNUSED_VARS(D, A, B);
         NO_DEVICE_CODE;
 #endif // AMD_MFMA_AVAILABLE
     }
@@ -132,11 +132,11 @@ static __global__ void mul_mat_f(
         dst[j*stride_col_dst + row0 + threadIdx.x] = sum;
     }
 #else
+    GGML_UNUSED_VARS(x, y, ids, dst,
+        ncols, nchannels_y, stride_row, stride_col_y, stride_col_dst,
+        channel_ratio, stride_channel_x, stride_channel_y, stride_channel_dst,
+        sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
     NO_DEVICE_CODE;
-    GGML_UNUSED(x); GGML_UNUSED(y); GGML_UNUSED(ids); GGML_UNUSED(dst);
-    GGML_UNUSED(ncols); GGML_UNUSED(nchannels_y); GGML_UNUSED(stride_row); GGML_UNUSED(stride_col_y); GGML_UNUSED(stride_col_dst);
-    GGML_UNUSED(channel_ratio); GGML_UNUSED(stride_channel_x); GGML_UNUSED(stride_channel_y); GGML_UNUSED(stride_channel_dst);
-    GGML_UNUSED(sample_ratio); GGML_UNUSED(stride_sample_x); GGML_UNUSED(stride_sample_y); GGML_UNUSED(stride_sample_dst);
 #endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
 }

@@ -151,7 +151,6 @@ static void mul_mat_f_cuda(
         cudaStream_t stream) {
     typedef tile<16, 8, T>     tile_A;
     typedef tile< 8, 8, T>     tile_B;
-    typedef tile<16, 8, float> tile_C;
 
     GGML_ASSERT(!ids && "mul_mat_id not implemented");
 

@@ -352,9 +351,6 @@ void ggml_cuda_mul_mat_f(ggml_backend_cuda_context & ctx, const ggml_tensor * sr
     GGML_ASSERT(!ids || ids->nb[0] == ggml_type_size(ids->type));
     GGML_ASSERT( nb0 == ts_dst);
 
-    const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
-    const enum ggml_prec prec = fast_fp16_available(cc) ? ggml_prec(dst->op_params[0]) : GGML_PREC_F32;
-
     const float * src1_d = (const float *) src1->data;
     const int32_t * ids_d = ids ? (const int32_t *) ids->data : nullptr;
     float * dst_d = (float *) dst->data;
@@ -266,10 +266,7 @@ void ggml_cuda_op_mul_mat_q(
 
     ggml_cuda_mul_mat_q_switch_type(ctx, args, stream);
 
-    GGML_UNUSED(src1);
-    GGML_UNUSED(dst);
-    GGML_UNUSED(src1_ddf_i);
-    GGML_UNUSED(src1_padded_row_size);
+    GGML_UNUSED_VARS(src1, dst, src1_ddf_i, src1_padded_row_size);
 }
 
 bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11) {
@ -1255,7 +1255,7 @@ static __device__ __forceinline__ void vec_dot_q8_0_16_q8_1_mma(
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
#else
|
#else
|
||||||
GGML_UNUSED(x); GGML_UNUSED(y); GGML_UNUSED(sum); GGML_UNUSED(k00);
|
GGML_UNUSED_VARS(x, y, sum, k00);
|
||||||
NO_DEVICE_CODE;
|
NO_DEVICE_CODE;
|
||||||
#endif // AMD_MFMA_AVAILABLE
|
#endif // AMD_MFMA_AVAILABLE
|
||||||
}
|
}
|
||||||
|
|
@ -1572,7 +1572,7 @@ static __device__ __forceinline__ void vec_dot_q2_K_q8_1_mma(
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
#else
|
#else
|
||||||
GGML_UNUSED(x); GGML_UNUSED(y); GGML_UNUSED(sum); GGML_UNUSED(k00);
|
GGML_UNUSED_VARS(x, y, sum, k00);
|
||||||
NO_DEVICE_CODE;
|
NO_DEVICE_CODE;
|
||||||
#endif // AMD_MFMA_AVAILABLE
|
#endif // AMD_MFMA_AVAILABLE
|
||||||
}
|
}
|
||||||
|
|
@@ -2301,7 +2301,7 @@ static __device__ __forceinline__ void vec_dot_q6_K_q8_1_mma(
         }
     }
 #else
-    GGML_UNUSED(x); GGML_UNUSED(y); GGML_UNUSED(sum); GGML_UNUSED(k00);
+    GGML_UNUSED_VARS(x, y, sum, k00);
     NO_DEVICE_CODE;
 #endif // AMD_MFMA_AVAILABLE
 }
@@ -2855,12 +2855,14 @@ static __device__ __forceinline__ void mmq_write_back_mma(
 #else
     typedef tile<16, 8, int> tile_C;
     constexpr int rows_per_warp = 2 * granularity;
-#endif
+#endif // defined(AMD_MFMA_AVAILABLE)
     constexpr int ntx = rows_per_warp/tile_C::I; // Number of x minitiles per warp.
 
     const int i0 = (threadIdx.y / ntx) * (ntx*tile_C::I);
 #if defined(TURING_MMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
     static_assert(nwarps*tile_C::I == mmq_y, "nwarps*tile_C::I != mmq_y");
+#else
+    GGML_UNUSED(nwarps);
 #endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE)
 
 #pragma unroll
@@ -433,12 +433,7 @@ void ggml_cuda_op_mul_mat_vec_f(
             GGML_ABORT("unsupported type: %s", ggml_type_name(src0->type));
     }
 
-    GGML_UNUSED(ctx);
-    GGML_UNUSED(src1);
-    GGML_UNUSED(dst);
-    GGML_UNUSED(src1_ddq_i);
-    GGML_UNUSED(src1_ncols);
-    GGML_UNUSED(src1_padded_row_size);
+    GGML_UNUSED_VARS(ctx, src1, dst, src1_ddq_i, src1_ncols, src1_padded_row_size);
 }
 
 bool ggml_cuda_should_use_mmvf(enum ggml_type type, int cc, const int64_t * src0_ne, int64_t ne11) {
@@ -596,9 +596,5 @@ void ggml_cuda_op_mul_mat_vec_q(
         src0_dd_i, src0->type, src1_ddq_i, nullptr, dst_dd_i, ne00, row_diff, src1_ncols, stride_row_x, stride_col_y, nrows_dst,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, stream);
 
-    GGML_UNUSED(src1);
-    GGML_UNUSED(dst);
-    GGML_UNUSED(src1_ddf_i);
-    GGML_UNUSED(src1_ncols);
-    GGML_UNUSED(src1_padded_row_size);
+    GGML_UNUSED_VARS(src1, dst, src1_ddf_i, src1_ncols, src1_padded_row_size);
 }
@@ -39,7 +39,7 @@ static __global__ void reduce_rows_f32(const float * __restrict__ x, float * __r
     }
     __syncthreads();
     sum = 0.0f;
-    if (lane_id < (blockDim.x / WARP_SIZE)) {
+    if (lane_id < (static_cast<int>(blockDim.x) / WARP_SIZE)) {
         sum = s_sum[lane_id];
     }
     sum = warp_reduce_sum(sum);
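A note on the `static_cast<int>` above: `blockDim.x` is unsigned, so comparing it against the signed `lane_id` promotes the signed operand to unsigned and typically trips `-Wsign-compare`; the cast keeps the comparison signed. A small host-side sketch of the same pitfall (hypothetical values, not repository code):

```cpp
// Sketch: an int-vs-unsigned comparison converts the int to unsigned,
// which can silently change the result for negative values.
#include <cstdio>

int main() {
    const int      lane_id   = -1;  // hypothetical out-of-range index
    const unsigned num_warps = 4u;  // stands in for blockDim.x / WARP_SIZE

    if (lane_id < static_cast<int>(num_warps)) {
        std::printf("signed comparison taken, as intended\n");
    }
    if ((unsigned) lane_id < num_warps) {
        std::printf("unsigned comparison taken\n");
    } else {
        std::printf("unsigned comparison skipped: -1 wrapped to a huge value\n");  // this branch runs
    }
    return 0;
}
```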
@@ -87,7 +87,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q4_1_q8_1_imp
         sumi = ggml_cuda_dp4a(vi1, u[2*i+1], sumi);
     }
 
-#ifdef GGML_CUDA_F16
+#ifdef FAST_FP16_AVAILABLE
     const float2 tmp = __half22float2(__hmul2(dm4, ds8));
     const float d4d8 = tmp.x;
     const float m4s8 = tmp.y;
@@ -96,7 +96,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q4_1_q8_1_imp
     const float2 ds8f = __half22float2(ds8);
     const float d4d8 = dm4f.x * ds8f.x;
     const float m4s8 = dm4f.y * ds8f.y;
-#endif // GGML_CUDA_F16
+#endif // FAST_FP16_AVAILABLE
 
     // scale second part of sum by QI8_1/(vdr * QR4_1) to compensate for multiple threads adding it
     return sumi * d4d8 + m4s8 / (QI8_1 / (vdr * QR4_1));
@@ -158,7 +158,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q5_1_q8_1_imp
         sumi = ggml_cuda_dp4a(vi1, u[2*i+1], sumi); // SIMD dot product of quantized values
     }
 
-#ifdef GGML_CUDA_F16
+#ifdef FAST_FP16_AVAILABLE
     const float2 tmp = __half22float2(__hmul2(dm5, ds8));
     const float d5d8 = tmp.x;
     const float m5s8 = tmp.y;
@@ -167,7 +167,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q5_1_q8_1_imp
     const float2 ds8f = __half22float2(ds8);
     const float d5d8 = dm5f.x * ds8f.x;
     const float m5s8 = dm5f.y * ds8f.y;
-#endif // GGML_CUDA_F16
+#endif // FAST_FP16_AVAILABLE
 
     // scale second part of sum by QI5_1 / vdr to compensate for multiple threads adding it
     return sumi*d5d8 + m5s8 / (QI5_1 / vdr);
@@ -201,7 +201,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q8_1_q8_1_imp
         sumi = ggml_cuda_dp4a(v[i], u[i], sumi);
     }
 
-#ifdef GGML_CUDA_F16
+#ifdef FAST_FP16_AVAILABLE
     const float2 tmp = __half22float2(__hmul2(dm8, ds8));
     const float d8d8 = tmp.x;
     const float m8s8 = tmp.y;
@@ -210,7 +210,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q8_1_q8_1_imp
     const float2 ds8f = __half22float2(ds8);
     const float d8d8 = dm8f.x * ds8f.x;
     const float m8s8 = dm8f.y * ds8f.y;
-#endif // GGML_CUDA_F16
+#endif // FAST_FP16_AVAILABLE
 
     // scale second part of sum by QI8_1/ vdr to compensate for multiple threads adding it
     return sumi*d8d8 + m8s8 / (QI8_1 / vdr);
@@ -1846,7 +1846,7 @@ static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_contex
         case GGML_OP_ROPE:
             return true;
         case GGML_OP_IM2COL:
-            return op->src[0]->type == GGML_TYPE_F16;
+            return op->src[1]->type == GGML_TYPE_F32 && (op->type == GGML_TYPE_F16 || op->type == GGML_TYPE_F32);
         case GGML_OP_POOL_1D:
             return false;
         case GGML_OP_UPSCALE:
@@ -4703,7 +4703,6 @@ static int ggml_metal_encode_node(
             {
                 GGML_ASSERT(ggml_is_contiguous(src0));
                 GGML_ASSERT(ggml_is_contiguous(src1));
-                GGML_ASSERT(src0->type == GGML_TYPE_F16);
                 GGML_ASSERT(src1->type == GGML_TYPE_F32);
                 GGML_ASSERT( dst->type == GGML_TYPE_F16 || dst->type == GGML_TYPE_F32);
 
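For the Metal `GGML_OP_IM2COL` case above, the support check moves from requiring an F16 `src[0]` to accepting F32 image data with either F16 or F32 output. A hedged stand-alone restatement of that predicate (the helper name is illustrative; in the repository the check lives inline in `ggml_metal_supports_op`):

```cpp
// Illustrative restatement of the relaxed IM2COL predicate from the hunk above.
#include "ggml.h"

static bool metal_supports_im2col_sketch(const struct ggml_tensor * op) {
    // src[1] carries the image data and must be F32; the result may be
    // produced as either F16 or F32.
    return op->src[1]->type == GGML_TYPE_F32 &&
           (op->type == GGML_TYPE_F16 || op->type == GGML_TYPE_F32);
}
```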
@@ -96,10 +96,6 @@ if (MUSAToolkit_FOUND)
         add_compile_definitions(GGML_CUDA_NO_FA)
     endif()
 
-    if (GGML_CUDA_F16 OR GGML_CUDA_DMMV_F16)
-        add_compile_definitions(GGML_CUDA_F16)
-    endif()
-
     if (GGML_CUDA_NO_PEER_COPY)
         add_compile_definitions(GGML_CUDA_NO_PEER_COPY)
     endif()
@@ -112,6 +112,9 @@ set(GGML_OPENCL_KERNELS
     mul_mat_f16_f32
     conv2d
     conv2d_f16_f32
+    flash_attn_f32_f16
+    flash_attn_f16
+    flash_attn_f32
 )
 
 foreach (K ${GGML_OPENCL_KERNELS})
@@ -25,6 +25,7 @@
 #include <vector>
 #include <string>
 #include <cmath>
+#include <map>
 #include <memory>
 #include <charconv>
 #include <mutex>
@@ -332,6 +333,7 @@ struct ggml_backend_opencl_context {
 
     cl_int alignment;
     size_t max_alloc_size;
+    size_t max_workgroup_size;
     bool fp16_support;
     bool has_vector_subgroup_broadcast;
     bool disable_fusion;
@@ -424,6 +426,14 @@ struct ggml_backend_opencl_context {
     cl_kernel kernel_diag_mask_inf, kernel_diag_mask_inf_8;
     cl_kernel kernel_soft_max, kernel_soft_max_4;
     cl_kernel kernel_soft_max_f16, kernel_soft_max_4_f16;
+    std::map<std::pair<int, int>, cl_kernel> kernels_flash_attn_f16;
+    std::map<std::pair<int, int>, cl_kernel> kernels_flash_attn_f16_q1;
+    std::map<std::pair<int, int>, cl_kernel> kernels_flash_attn_f32;
+    std::map<std::pair<int, int>, cl_kernel> kernels_flash_attn_f32_q1;
+    std::map<std::pair<int, int>, cl_kernel> kernels_flash_attn_f32_f16;
+    std::map<std::pair<int, int>, cl_kernel> kernels_flash_attn_f32_f16_q1;
+    std::map<std::pair<int, int>, int> kernels_flash_attn_bm;
+    std::map<std::pair<int, int>, int> kernels_flash_attn_bn;
     cl_kernel kernel_get_rows_f32, kernel_get_rows_f16, kernel_get_rows_q4_0;
     cl_kernel kernel_set_rows_f32, kernel_set_rows_f16;
     cl_kernel kernel_rope_norm_f32, kernel_rope_norm_f16, kernel_rope_neox_f32, kernel_rope_neox_f16;
@@ -1308,6 +1318,73 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx, ggml_cl_ve
         GGML_LOG_CONT(".");
     }
 
+    // flash_attn
+    {
+#ifdef GGML_OPENCL_EMBED_KERNELS
+        const std::string kernel_src_f16 {
+            #include "flash_attn_f16.cl.h"
+        };
+        const std::string kernel_src_f32 {
+            #include "flash_attn_f32.cl.h"
+        };
+        const std::string kernel_src_f32_f16 {
+            #include "flash_attn_f32_f16.cl.h"
+        };
+#else
+        const std::string kernel_src_f16 = read_file("flash_attn_f16.cl");
+        const std::string kernel_src_f32 = read_file("flash_attn_f32.cl");
+        const std::string kernel_src_f32_f16 = read_file("flash_attn_f32_f16.cl");
+#endif
+
+        if (!kernel_src_f16.empty() && !kernel_src_f32.empty() && !kernel_src_f32_f16.empty()) {
+            const struct { int dk; int dv; int bm; int bn; } fa_dims[] = {
+                { 64,  64, 64, 64}, { 80,  80, 64, 32}, { 96,  96, 64, 32},
+                {112, 112, 32, 32}, {128, 128, 32, 32}, {192, 128, 16, 16},
+                {192, 192, 16, 16}, {256, 256, 16, 16},
+            };
+
+            for (size_t i = 0; i < sizeof(fa_dims)/sizeof(fa_dims[0]); ++i) {
+                const int dk = fa_dims[i].dk;
+                const int dv = fa_dims[i].dv;
+                const int bm = fa_dims[i].bm;
+                const int bn = fa_dims[i].bn;
+                std::string OPTS = compile_opts +
+                    " -D DK=" + std::to_string(dk) +
+                    " -D DV=" + std::to_string(dv) +
+                    " -D BLOCK_M=" + std::to_string(bm) +
+                    " -D BLOCK_N=" + std::to_string(bn);
+
+                cl_program prog_f16 = build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src_f16.c_str(), OPTS);
+                cl_kernel k_f16, k_f16_q1;
+                CL_CHECK((k_f16 = clCreateKernel(prog_f16, "flash_attn_f16", &err), err));
+                CL_CHECK((k_f16_q1 = clCreateKernel(prog_f16, "flash_attn_f16_q1", &err), err));
+                backend_ctx->kernels_flash_attn_f16[{dk, dv}] = k_f16;
+                backend_ctx->kernels_flash_attn_f16_q1[{dk, dv}] = k_f16_q1;
+                CL_CHECK(clReleaseProgram(prog_f16));
+
+                cl_program prog_f32 = build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src_f32.c_str(), OPTS);
+                cl_kernel k_f32, k_f32_q1;
+                CL_CHECK((k_f32 = clCreateKernel(prog_f32, "flash_attn_f32", &err), err));
+                CL_CHECK((k_f32_q1 = clCreateKernel(prog_f32, "flash_attn_f32_q1", &err), err));
+                backend_ctx->kernels_flash_attn_f32[{dk, dv}] = k_f32;
+                backend_ctx->kernels_flash_attn_f32_q1[{dk, dv}] = k_f32_q1;
+                CL_CHECK(clReleaseProgram(prog_f32));
+
+                cl_program prog_f32_f16 = build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src_f32_f16.c_str(), OPTS);
+                cl_kernel k_f32_f16, k_f32_f16_q1;
+                CL_CHECK((k_f32_f16 = clCreateKernel(prog_f32_f16, "flash_attn_f32_f16", &err), err));
+                CL_CHECK((k_f32_f16_q1 = clCreateKernel(prog_f32_f16, "flash_attn_f32_f16_q1", &err), err));
+                backend_ctx->kernels_flash_attn_f32_f16[{dk, dv}] = k_f32_f16;
+                backend_ctx->kernels_flash_attn_f32_f16_q1[{dk, dv}] = k_f32_f16_q1;
+                CL_CHECK(clReleaseProgram(prog_f32_f16));
+
+                backend_ctx->kernels_flash_attn_bm[{dk, dv}] = bm;
+                backend_ctx->kernels_flash_attn_bn[{dk, dv}] = bn;
+            }
+            GGML_LOG_CONT(".");
+        }
+    }
+
     // argsort
     {
 #ifdef GGML_OPENCL_EMBED_KERNELS
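The block above compiles one program per supported (DK, DV) head-size pair and caches the resulting kernels in maps keyed by `{dk, dv}`. At dispatch time the backend looks the kernel up by the same key; a hedged sketch of that pattern (types from the OpenCL headers, map layout as in the struct above, error handling illustrative only):

```cpp
// Sketch of a {dk, dv}-keyed kernel cache lookup, mirroring the maps added to
// ggml_backend_opencl_context.
#include <map>
#include <stdexcept>
#include <utility>
#include <CL/cl.h>

using fa_key = std::pair<int, int>;

cl_kernel pick_flash_attn_kernel(const std::map<fa_key, cl_kernel> & cache, int dk, int dv) {
    const auto it = cache.find({dk, dv});
    if (it == cache.end()) {
        // supports_op is expected to reject head sizes that were never compiled.
        throw std::runtime_error("no flash_attn kernel compiled for this head size");
    }
    return it->second;
}
```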
@@ -2142,6 +2219,9 @@ static ggml_backend_opencl_context * ggml_cl2_init(ggml_backend_dev_t dev) {
     clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(size_t), &backend_ctx->max_alloc_size, NULL);
     GGML_LOG_INFO("ggml_opencl: max mem alloc size: %zu MB\n", backend_ctx->max_alloc_size/1024/1024);
 
+    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &backend_ctx->max_workgroup_size, NULL);
+    GGML_LOG_INFO("ggml_opencl: device max workgroup size: %lu\n", backend_ctx->max_workgroup_size);
+
     // Check SVM.
     cl_device_svm_capabilities svm_caps;
     CL_CHECK(clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES, sizeof(cl_device_svm_capabilities), &svm_caps, 0));
@@ -2457,7 +2537,8 @@ static ggml_status ggml_backend_opencl_graph_compute(ggml_backend_t backend, ggm
 }
 
 static bool ggml_opencl_supports_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
-    GGML_UNUSED(dev);
+    ggml_backend_opencl_device_context * dev_ctx = (ggml_backend_opencl_device_context *)dev->context;
+    ggml_backend_opencl_context * backend_ctx = dev_ctx->backend_ctx;
 
     switch (op->op) {
         case GGML_OP_NONE:
@@ -2632,10 +2713,58 @@ static bool ggml_opencl_supports_op(ggml_backend_dev_t dev, const struct ggml_te
             }
         case GGML_OP_IM2COL:
            return true;
-        case GGML_OP_ARGSORT:
-            return op->src[0]->type == GGML_TYPE_F32;
+        case GGML_OP_ARGSORT: {
+            cl_kernel kernel = backend_ctx->kernel_argsort_f32_i32;
+            int max_workgroup_size = backend_ctx->get_kernel_workgroup_size(kernel);
+
+            int cols = 1;
+            while (cols < op->ne[0]) {
+                cols *= 2;
+            }
+
+            return cols <= max_workgroup_size && op->src[0]->type == GGML_TYPE_F32;
+        }
         case GGML_OP_SUM_ROWS:
             return op->src[0]->type == GGML_TYPE_F32 && ggml_is_contiguous(op->src[0]);
+        case GGML_OP_FLASH_ATTN_EXT:
+            {
+                if (op->src[4]) {
+                    return false;
+                }
+
+                const ggml_tensor * q = op->src[0];
+                const ggml_tensor * k = op->src[1];
+                const ggml_tensor * v = op->src[2];
+
+                const int dk = q->ne[0];
+                const int dv = v->ne[0];
+
+                const struct { int dk; int dv; } supported_dims[] = {
+                    { 64,  64}, { 80,  80}, { 96,  96},
+                    {112, 112}, {128, 128}, {192, 128},
+                    {192, 192}, {256, 256},
+                };
+
+                bool dims_supported = false;
+                for (size_t i = 0; i < sizeof(supported_dims)/sizeof(supported_dims[0]); ++i) {
+                    if (supported_dims[i].dk == dk && supported_dims[i].dv == dv) {
+                        dims_supported = true;
+                        break;
+                    }
+                }
+                if (!dims_supported) {
+                    return false;
+                }
+
+                const bool is_f32_f32 = q->type == GGML_TYPE_F32 && k->type == GGML_TYPE_F32 &&
+                                        v->type == GGML_TYPE_F32 && op->type == GGML_TYPE_F32;
+                const bool is_f16_f16 = q->type == GGML_TYPE_F16 && k->type == GGML_TYPE_F16 &&
+                                        v->type == GGML_TYPE_F16 && op->type == GGML_TYPE_F16;
+                const bool is_f32_f16 = q->type == GGML_TYPE_F32 && k->type == GGML_TYPE_F16 &&
+                                        v->type == GGML_TYPE_F16 && op->type == GGML_TYPE_F32;
+
+                return is_f32_f32 || is_f16_f16 || is_f32_f16;
+            }
         default:
             return false;
     }
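The new `GGML_OP_ARGSORT` branch above rounds the row length up to the next power of two before comparing it against the kernel's maximum work-group size, presumably because the argsort kernel processes a padded power-of-two row per work-group. A hedged host-only sketch of that gate (helper names are illustrative):

```cpp
// Sketch of the argsort gating logic from the hunk above.
#include <cstdint>

static int round_up_pow2_sketch(int64_t n) {
    int cols = 1;
    while (cols < n) {
        cols *= 2;  // smallest power of two >= n
    }
    return cols;
}

static bool argsort_fits_sketch(int64_t ne0, int max_workgroup_size) {
    return round_up_pow2_sketch(ne0) <= max_workgroup_size;
}
```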
@@ -5451,6 +5580,133 @@ static void ggml_cl_timestep_embedding(ggml_backend_t backend, const ggml_tensor
     backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, NULL, dst);
 }
 
+static void ggml_cl_flash_attn(ggml_backend_t backend, const ggml_tensor * q, const ggml_tensor * k, ggml_tensor * dst) {
+    const ggml_tensor * v = dst->src[2];
+    const ggml_tensor * mask = dst->src[3];
+    GGML_ASSERT(q->extra);
+    GGML_ASSERT(k->extra);
+    GGML_ASSERT(v->extra);
+    GGML_ASSERT(dst->extra);
+    if (mask) {
+        GGML_ASSERT(mask->extra);
+    }
+
+    ggml_backend_opencl_context *backend_ctx = (ggml_backend_opencl_context *)backend->context;
+
+    const int n_q = q->ne[1];
+    const int n_kv = k->ne[1];
+    const int d_head_q = q->ne[0];
+    const int d_head_v = v->ne[0];
+    const int n_head = q->ne[2];
+    const int n_head_kv = k->ne[2];
+    const int n_batch = q->ne[3];
+
+    cl_kernel kernel = NULL;
+
+    const bool is_f16 = q->type == GGML_TYPE_F16;
+    const bool is_mixed = q->type == GGML_TYPE_F32 && k->type == GGML_TYPE_F16;
+    const std::pair<int, int> dk_dv = {d_head_q, d_head_v};
+
+    if (n_q == 1) {
+        if (is_mixed) {
+            kernel = backend_ctx->kernels_flash_attn_f32_f16_q1.at(dk_dv);
+        } else if (is_f16) {
+            kernel = backend_ctx->kernels_flash_attn_f16_q1.at(dk_dv);
+        } else {
+            kernel = backend_ctx->kernels_flash_attn_f32_q1.at(dk_dv);
+        }
+    } else {
+        if (is_mixed) {
+            kernel = backend_ctx->kernels_flash_attn_f32_f16.at(dk_dv);
+        } else if (is_f16) {
+            kernel = backend_ctx->kernels_flash_attn_f16.at(dk_dv);
+        } else {
+            kernel = backend_ctx->kernels_flash_attn_f32.at(dk_dv);
+        }
+    }
+    GGML_ASSERT(kernel != NULL);
+
+    ggml_tensor_extra_cl * extra_q = (ggml_tensor_extra_cl *)q->extra;
+    ggml_tensor_extra_cl * extra_k = (ggml_tensor_extra_cl *)k->extra;
+    ggml_tensor_extra_cl * extra_v = (ggml_tensor_extra_cl *)v->extra;
+    ggml_tensor_extra_cl * extra_o = (ggml_tensor_extra_cl *)dst->extra;
+    ggml_tensor_extra_cl * extra_mask = mask ? (ggml_tensor_extra_cl *)mask->extra : NULL;
+
+    cl_ulong offset_q = extra_q->offset + q->view_offs;
+    cl_ulong offset_k = extra_k->offset + k->view_offs;
+    cl_ulong offset_v = extra_v->offset + v->view_offs;
+    cl_ulong offset_o = extra_o->offset + dst->view_offs;
+    cl_mem mask_buffer = extra_mask ? extra_mask->data_device : NULL;
+    cl_ulong offset_mask = extra_mask ? extra_mask->offset + mask->view_offs : 0;
+
+    const cl_ulong q_nb1 = q->nb[1], q_nb2 = q->nb[2], q_nb3 = q->nb[3];
+    const cl_ulong k_nb1 = k->nb[1], k_nb2 = k->nb[2], k_nb3 = k->nb[3];
+    const cl_ulong v_nb1 = v->nb[1], v_nb2 = v->nb[2], v_nb3 = v->nb[3];
+    const cl_ulong o_nb1 = dst->nb[1], o_nb2 = dst->nb[2], o_nb3 = dst->nb[3];
+    const cl_ulong mask_nb1 = mask ? mask->nb[1] : 0;
+    const cl_ulong mask_nb2 = mask ? mask->nb[2] : 0;
+    const cl_ulong mask_nb3 = mask ? mask->nb[3] : 0;
+    const int mask_ne2 = mask ? mask->ne[2] : 0;
+    const int mask_ne3 = mask ? mask->ne[3] : 0;
+
+    float scale, max_bias, logit_softcap;
+    const float * params = (const float *)dst->op_params;
+    scale = params[0];
+    max_bias = params[1];
+    logit_softcap = params[2];
+
+    const int is_causal = (mask == NULL && n_q > 1 && n_q == n_kv);
+
+    const int n_head_log2_val = n_head > 0 ? 1u << (int)floorf(log2f((float)n_head)) : 0;
+    const float n_head_log2_f = n_head_log2_val > 0 ? (float)n_head_log2_val : 1.0f;
+    const float m0 = powf(2.0f, -(max_bias) / n_head_log2_f);
+    const float m1 = powf(2.0f, -(max_bias / 2.0f) / n_head_log2_f);
+
+    CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra_q->data_device));
+    CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_ulong), &offset_q));
+    CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra_k->data_device));
+    CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_ulong), &offset_k));
+    CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extra_v->data_device));
+    CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_ulong), &offset_v));
+    CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_mem), &extra_o->data_device));
+    CL_CHECK(clSetKernelArg(kernel, 7, sizeof(cl_ulong), &offset_o));
+    CL_CHECK(clSetKernelArg(kernel, 8, sizeof(float), &scale));
+    CL_CHECK(clSetKernelArg(kernel, 9, sizeof(int), &n_q));
+    CL_CHECK(clSetKernelArg(kernel, 10, sizeof(int), &n_kv));
+    CL_CHECK(clSetKernelArg(kernel, 11, sizeof(int), &is_causal));
+    CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &n_head));
+    CL_CHECK(clSetKernelArg(kernel, 13, sizeof(cl_ulong), &q_nb1)); CL_CHECK(clSetKernelArg(kernel, 14, sizeof(cl_ulong), &q_nb2)); CL_CHECK(clSetKernelArg(kernel, 15, sizeof(cl_ulong), &q_nb3));
+    CL_CHECK(clSetKernelArg(kernel, 16, sizeof(cl_ulong), &k_nb1)); CL_CHECK(clSetKernelArg(kernel, 17, sizeof(cl_ulong), &k_nb2)); CL_CHECK(clSetKernelArg(kernel, 18, sizeof(cl_ulong), &k_nb3));
+    CL_CHECK(clSetKernelArg(kernel, 19, sizeof(cl_ulong), &v_nb1)); CL_CHECK(clSetKernelArg(kernel, 20, sizeof(cl_ulong), &v_nb2)); CL_CHECK(clSetKernelArg(kernel, 21, sizeof(cl_ulong), &v_nb3));
+    CL_CHECK(clSetKernelArg(kernel, 22, sizeof(cl_ulong), &o_nb1)); CL_CHECK(clSetKernelArg(kernel, 23, sizeof(cl_ulong), &o_nb2)); CL_CHECK(clSetKernelArg(kernel, 24, sizeof(cl_ulong), &o_nb3));
+    CL_CHECK(clSetKernelArg(kernel, 25, sizeof(float), &max_bias));
+    CL_CHECK(clSetKernelArg(kernel, 26, sizeof(float), &m0));
+    CL_CHECK(clSetKernelArg(kernel, 27, sizeof(float), &m1));
+    CL_CHECK(clSetKernelArg(kernel, 28, sizeof(int), &n_head_log2_val));
+    CL_CHECK(clSetKernelArg(kernel, 29, sizeof(float), &logit_softcap));
+    CL_CHECK(clSetKernelArg(kernel, 30, sizeof(int), &n_head_kv));
+    CL_CHECK(clSetKernelArg(kernel, 31, sizeof(cl_mem), &mask_buffer));
+    CL_CHECK(clSetKernelArg(kernel, 32, sizeof(cl_ulong), &offset_mask));
+    CL_CHECK(clSetKernelArg(kernel, 33, sizeof(cl_ulong), &mask_nb1));
+    CL_CHECK(clSetKernelArg(kernel, 34, sizeof(cl_ulong), &mask_nb2));
+    CL_CHECK(clSetKernelArg(kernel, 35, sizeof(cl_ulong), &mask_nb3));
+    CL_CHECK(clSetKernelArg(kernel, 36, sizeof(int), &mask_ne2));
+    CL_CHECK(clSetKernelArg(kernel, 37, sizeof(int), &mask_ne3));
+
+    if (n_q == 1) {
+        const size_t wg_size = 64;
+        size_t local_work_size[] = { wg_size, 1 };
+        size_t global_work_size[] = { wg_size, (size_t)(n_head * n_batch) };
+        backend_ctx->enqueue_ndrange_kernel(kernel, 2, global_work_size, local_work_size, dst);
+    } else {
+        const int block_m = backend_ctx->kernels_flash_attn_bm.at(dk_dv);
+        const size_t wg_size = block_m;
+        size_t local_work_size[] = { wg_size, 1 };
+        size_t global_work_size[] = { (size_t)((n_q + block_m - 1) / block_m) * wg_size, (size_t)(n_head * n_batch) };
+        backend_ctx->enqueue_ndrange_kernel(kernel, 2, global_work_size, local_work_size, dst);
+    }
+}
+
 static void ggml_cl_mul_mat_f16_f32_tiled(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
     ggml_backend_opencl_context *backend_ctx = (ggml_backend_opencl_context *)backend->context;
 
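`ggml_cl_flash_attn` above only selects a cached kernel and sets its arguments; the attention math itself, in the new `.cl` kernels further below, is the streaming (online) softmax: a running score maximum `m`, a running normalizer `l`, and an unnormalized output accumulator that is rescaled whenever the maximum grows. A hedged scalar C++ sketch of that recurrence for a single query row, ignoring masking, ALiBi and logit softcap:

```cpp
// Minimal single-query online-softmax attention sketch (one head, fp32).
// Mirrors the m_i / l_i / o_acc updates used by the flash_attn kernels below.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

std::vector<float> attend_one_query(const std::vector<float> & q,
                                    const std::vector<std::vector<float>> & K,
                                    const std::vector<std::vector<float>> & V,
                                    float scale) {
    const size_t dv = V.empty() ? 0 : V[0].size();
    float m = -std::numeric_limits<float>::infinity();  // running max of scores
    float l = 0.0f;                                      // running softmax denominator
    std::vector<float> o(dv, 0.0f);                      // unnormalized output accumulator

    for (size_t j = 0; j < K.size(); ++j) {
        float s = 0.0f;
        for (size_t d = 0; d < q.size(); ++d) {
            s += q[d] * K[j][d];                         // q . k_j
        }
        s *= scale;

        const float m_new      = std::max(m, s);
        const float scale_prev = std::exp(m - m_new);    // rescale what was accumulated so far
        const float p          = std::exp(s - m_new);

        for (size_t d = 0; d < dv; ++d) {
            o[d] = o[d] * scale_prev + p * V[j][d];
        }
        l = l * scale_prev + p;
        m = m_new;
    }

    if (l > 0.0f) {
        for (float & x : o) {
            x /= l;                                      // final normalization
        }
    }
    return o;
}
```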
@@ -7607,6 +7863,12 @@ bool ggml_cl_compute_forward(ggml_backend_t backend, struct ggml_tensor * tensor
             }
             func = ggml_cl_sum_rows;
             break;
+        case GGML_OP_FLASH_ATTN_EXT:
+            if (!any_on_device) {
+                return false;
+            }
+            ggml_cl_flash_attn(backend, tensor->src[0], tensor->src[1], tensor);
+            return true;
         default:
             return false;
     }
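The dispatch code passes `max_bias`, `m0`, `m1` and `n_head_log2` so each work-item can derive its per-head ALiBi slope, matching the `get_alibi_slope` helper in the kernels below. A host-side C++ equivalent of that computation, shown here as a hedged sketch:

```cpp
// Host-side sketch of the per-head ALiBi slope used by the OpenCL kernels:
// heads below n_head_log2 use base m0, the rest use m1 with odd exponents.
#include <cmath>

float alibi_slope_sketch(float max_bias, unsigned h, unsigned n_head_log2, float m0, float m1) {
    if (max_bias <= 0.0f) {
        return 1.0f;                 // ALiBi disabled
    }
    const float base = h < n_head_log2 ? m0 : m1;
    const int   exph = h < n_head_log2 ? h + 1 : 2*(h - n_head_log2) + 1;
    return std::pow(base, (float) exph);
}

// m0/m1 as computed in ggml_cl_flash_attn above:
//   m0 = powf(2.0f, -max_bias          / n_head_log2);
//   m1 = powf(2.0f, -(max_bias / 2.0f) / n_head_log2);
```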
@ -0,0 +1,343 @@
|
||||||
|
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
|
||||||
|
|
||||||
|
#define ACC_TYPE float
|
||||||
|
#define ACC_TYPE4 float4
|
||||||
|
#define DATA_TYPE half
|
||||||
|
#define DATA_TYPE4 half4
|
||||||
|
#define CONVERT_ACC4(x) convert_float4(x)
|
||||||
|
#define CONVERT_DATA4(x) convert_half4(x)
|
||||||
|
|
||||||
|
#define DK_VEC (DK/4)
|
||||||
|
#define DV_VEC (DV/4)
|
||||||
|
#define WG_SIZE (BLOCK_M)
|
||||||
|
#define Q1_WG_SIZE 64
|
||||||
|
|
||||||
|
inline float get_alibi_slope(
|
||||||
|
const float max_bias, const uint h, const uint n_head_log2, const float m0, const float m1
|
||||||
|
) {
|
||||||
|
if (max_bias <= 0.0f) {
|
||||||
|
return 1.0f;
|
||||||
|
}
|
||||||
|
const float base = h < n_head_log2 ? m0 : m1;
|
||||||
|
const int exph = h < n_head_log2 ? h + 1 : 2*(h - n_head_log2) + 1;
|
||||||
|
|
||||||
|
return pow(base, exph);
|
||||||
|
}
|
||||||
|
__kernel void flash_attn_f16(
|
||||||
|
const global void * q_void, ulong q_offset,
|
||||||
|
const global void * k_void, ulong k_offset,
|
||||||
|
const global void * v_void, ulong v_offset,
|
||||||
|
global void * o_void, ulong o_offset,
|
||||||
|
const float scale,
|
||||||
|
const int n_q,
|
||||||
|
const int n_kv,
|
||||||
|
const int is_causal,
|
||||||
|
const int n_head,
|
||||||
|
const ulong q_nb1, const ulong q_nb2, const ulong q_nb3,
|
||||||
|
const ulong k_nb1, const ulong k_nb2, const ulong k_nb3,
|
||||||
|
const ulong v_nb1, const ulong v_nb2, const ulong v_nb3,
|
||||||
|
const ulong o_nb1, const ulong o_nb2, const ulong o_nb3,
|
||||||
|
const float max_bias,
|
||||||
|
const float m0,
|
||||||
|
const float m1,
|
||||||
|
const int n_head_log2,
|
||||||
|
const float logit_softcap,
|
||||||
|
const int n_head_kv,
|
||||||
|
const global void* mask_void,
|
||||||
|
const ulong mask_offset,
|
||||||
|
const ulong mask_nb1,
|
||||||
|
const ulong mask_nb2,
|
||||||
|
const ulong mask_nb3,
|
||||||
|
const int mask_ne2,
|
||||||
|
const int mask_ne3
|
||||||
|
) {
|
||||||
|
const int tid = get_local_id(0);
|
||||||
|
const int block_q_idx = get_group_id(0);
|
||||||
|
const int head_batch_idx = get_global_id(1);
|
||||||
|
|
||||||
|
const int my_query_row = block_q_idx * BLOCK_M + tid;
|
||||||
|
|
||||||
|
const int batch_idx = head_batch_idx / n_head;
|
||||||
|
const int head_idx = head_batch_idx % n_head;
|
||||||
|
|
||||||
|
const int gqa_ratio = n_head / n_head_kv;
|
||||||
|
const int head_kv_idx = head_idx / gqa_ratio;
|
||||||
|
|
||||||
|
const global char* q_base = (const global char*)q_void + q_offset;
|
||||||
|
const global char* k_base = (const global char*)k_void + k_offset;
|
||||||
|
const global char* v_base = (const global char*)v_void + v_offset;
|
||||||
|
global char* o_base = (global char*)o_void + o_offset;
|
||||||
|
|
||||||
|
const global char* mask_base = NULL;
|
||||||
|
if (mask_void != NULL) {
|
||||||
|
const int mask_head_idx = head_idx % mask_ne2;
|
||||||
|
const int mask_batch_idx = batch_idx % mask_ne3;
|
||||||
|
mask_base = (const global char*)mask_void + mask_offset + mask_batch_idx * mask_nb3 + mask_head_idx * mask_nb2;
|
||||||
|
}
|
||||||
|
|
||||||
|
ACC_TYPE4 q_priv[DK_VEC];
|
||||||
|
if (my_query_row < n_q) {
|
||||||
|
const ulong q_row_offset = batch_idx * q_nb3 + head_idx * q_nb2 + my_query_row * q_nb1;
|
||||||
|
const global DATA_TYPE4* q_ptr = (const global DATA_TYPE4*)(q_base + q_row_offset);
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DK_VEC; ++i) {
|
||||||
|
q_priv[i] = CONVERT_ACC4(q_ptr[i]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
ACC_TYPE4 o_acc[DV_VEC];
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) {
|
||||||
|
o_acc[i] = (ACC_TYPE4)(0.0f);
|
||||||
|
}
|
||||||
|
ACC_TYPE m_i = -INFINITY;
|
||||||
|
ACC_TYPE l_i = 0.0f;
|
||||||
|
|
||||||
|
float slope = get_alibi_slope(max_bias, head_idx, n_head_log2, m0, m1);
|
||||||
|
|
||||||
|
__local DATA_TYPE4 l_k[BLOCK_N][DK_VEC];
|
||||||
|
__local DATA_TYPE4 l_v[BLOCK_N][DV_VEC];
|
||||||
|
|
||||||
|
for (int k_start = 0; k_start < n_kv; k_start += BLOCK_N) {
|
||||||
|
for (int i = tid; i < BLOCK_N * DK_VEC; i += WG_SIZE) {
|
||||||
|
const int row = i / DK_VEC;
|
||||||
|
const int col = i % DK_VEC;
|
||||||
|
const int k_row_idx = k_start + row;
|
||||||
|
if (k_row_idx < n_kv) {
|
||||||
|
const ulong k_row_offset = batch_idx * k_nb3 + head_kv_idx * k_nb2 + k_row_idx * k_nb1;
|
||||||
|
l_k[row][col] = ((__global DATA_TYPE4*)(k_base + k_row_offset))[col];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
for (int i = tid; i < BLOCK_N * DV_VEC; i += WG_SIZE) {
|
||||||
|
const int row = i / DV_VEC;
|
||||||
|
const int col = i % DV_VEC;
|
||||||
|
const int v_row_idx = k_start + row;
|
||||||
|
if (v_row_idx < n_kv) {
|
||||||
|
const ulong v_row_offset = batch_idx * v_nb3 + head_kv_idx * v_nb2 + v_row_idx * v_nb1;
|
||||||
|
l_v[row][col] = ((__global DATA_TYPE4*)(v_base + v_row_offset))[col];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
barrier(CLK_LOCAL_MEM_FENCE);
|
||||||
|
|
||||||
|
if (my_query_row >= n_q) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
for (int j = 0; j < BLOCK_N; j += 2) {
|
||||||
|
const int k_row0 = k_start + j;
|
||||||
|
const int k_row1 = k_start + j + 1;
|
||||||
|
|
||||||
|
ACC_TYPE4 dot_acc0 = (ACC_TYPE4)(0.0f);
|
||||||
|
ACC_TYPE4 dot_acc1 = (ACC_TYPE4)(0.0f);
|
||||||
|
#pragma unroll
|
||||||
|
for (int k = 0; k < DK_VEC; k++) {
|
||||||
|
dot_acc0 = mad(q_priv[k], CONVERT_ACC4(l_k[j][k]), dot_acc0);
|
||||||
|
dot_acc1 = mad(q_priv[k], CONVERT_ACC4(l_k[j+1][k]), dot_acc1);
|
||||||
|
}
|
||||||
|
ACC_TYPE score0 = (dot_acc0.s0 + dot_acc0.s1 + dot_acc0.s2 + dot_acc0.s3) * scale;
|
||||||
|
ACC_TYPE score1 = (dot_acc1.s0 + dot_acc1.s1 + dot_acc1.s2 + dot_acc1.s3) * scale;
|
||||||
|
|
||||||
|
if (is_causal) {
|
||||||
|
if (k_row0 > (n_kv - n_q + my_query_row)) score0 = -INFINITY;
|
||||||
|
if (k_row1 > (n_kv - n_q + my_query_row)) score1 = -INFINITY;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (k_row0 >= n_kv) score0 = -INFINITY;
|
||||||
|
if (k_row1 >= n_kv) score1 = -INFINITY;
|
||||||
|
|
||||||
|
if (mask_base != NULL) {
|
||||||
|
const global DATA_TYPE* mask_ptr = (const global DATA_TYPE*)(mask_base + my_query_row * mask_nb1);
|
||||||
|
if (k_row0 < n_kv) score0 += slope * (ACC_TYPE)mask_ptr[k_row0];
|
||||||
|
if (k_row1 < n_kv) score1 += slope * (ACC_TYPE)mask_ptr[k_row1];
|
||||||
|
}
|
||||||
|
|
||||||
|
if (logit_softcap > 0.0f) {
|
||||||
|
score0 = logit_softcap * tanh(score0 / logit_softcap);
|
||||||
|
score1 = logit_softcap * tanh(score1 / logit_softcap);
|
||||||
|
}
|
||||||
|
|
||||||
|
const ACC_TYPE m_new = max(m_i, max(score0, score1));
|
||||||
|
const ACC_TYPE p0 = exp(score0 - m_new);
|
||||||
|
const ACC_TYPE p1 = exp(score1 - m_new);
|
||||||
|
const ACC_TYPE scale_prev = exp(m_i - m_new);
|
||||||
|
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) {
|
||||||
|
o_acc[i] = o_acc[i] * scale_prev + p0 * CONVERT_ACC4(l_v[j][i]) + p1 * CONVERT_ACC4(l_v[j+1][i]);
|
||||||
|
}
|
||||||
|
l_i = l_i * scale_prev + p0 + p1;
|
||||||
|
m_i = m_new;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (my_query_row < n_q) {
|
||||||
|
const ulong o_row_offset = batch_idx * o_nb3 + my_query_row * o_nb2 + head_idx * o_nb1;
|
||||||
|
global DATA_TYPE4 *o_row = (global DATA_TYPE4 *)(o_base + o_row_offset);
|
||||||
|
if (l_i > 0.0f) {
|
||||||
|
const ACC_TYPE l_inv = 1.0f / l_i;
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) {
|
||||||
|
o_row[i] = CONVERT_DATA4(o_acc[i] * l_inv);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) {
|
||||||
|
o_row[i] = (DATA_TYPE4)(0.0f);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
__kernel void flash_attn_f16_q1(
|
||||||
|
const global void * q_void, ulong q_offset,
|
||||||
|
const global void * k_void, ulong k_offset,
|
||||||
|
const global void * v_void, ulong v_offset,
|
||||||
|
global void * o_void, ulong o_offset,
|
||||||
|
const float scale,
|
||||||
|
const int n_q,
|
||||||
|
const int n_kv,
|
||||||
|
const int is_causal,
|
||||||
|
const int n_head,
|
||||||
|
const ulong q_nb1, const ulong q_nb2, const ulong q_nb3,
|
||||||
|
const ulong k_nb1, const ulong k_nb2, const ulong k_nb3,
|
||||||
|
const ulong v_nb1, const ulong v_nb2, const ulong v_nb3,
|
||||||
|
const ulong o_nb1, const ulong o_nb2, const ulong o_nb3,
|
||||||
|
const float max_bias,
|
||||||
|
const float m0,
|
||||||
|
const float m1,
|
||||||
|
const int n_head_log2,
|
||||||
|
const float logit_softcap,
|
||||||
|
const int n_head_kv,
|
||||||
|
const global void* mask_void,
|
||||||
|
const ulong mask_offset,
|
||||||
|
const ulong mask_nb1,
|
||||||
|
const ulong mask_nb2,
|
||||||
|
const ulong mask_nb3,
|
||||||
|
const int mask_ne2,
|
||||||
|
const int mask_ne3
|
||||||
|
) {
|
||||||
|
const int tid = get_local_id(0);
|
||||||
|
const int head_batch_idx = get_global_id(1);
|
||||||
|
|
||||||
|
const int batch_idx = head_batch_idx / n_head;
|
||||||
|
const int head_idx = head_batch_idx % n_head;
|
||||||
|
|
||||||
|
const int gqa_ratio = n_head / n_head_kv;
|
||||||
|
const int head_kv_idx = head_idx / gqa_ratio;
|
||||||
|
|
||||||
|
const global char* q_base = (const global char*)q_void + q_offset;
|
||||||
|
const global char* k_base = (const global char*)k_void + k_offset;
|
||||||
|
const global char* v_base = (const global char*)v_void + v_offset;
|
||||||
|
global char* o_base = (global char*)o_void + o_offset;
|
||||||
|
|
||||||
|
const global char* mask_base = NULL;
|
||||||
|
if (mask_void != NULL) {
|
||||||
|
const int mask_head_idx = head_idx % mask_ne2;
|
||||||
|
const int mask_batch_idx = batch_idx % mask_ne3;
|
||||||
|
mask_base = (const global char*)mask_void + mask_offset + mask_batch_idx * mask_nb3 + mask_head_idx * mask_nb2;
|
||||||
|
}
|
||||||
|
|
||||||
|
ACC_TYPE4 q_priv[DK_VEC];
|
||||||
|
const ulong q_row_offset = batch_idx * q_nb3 + head_idx * q_nb2;
|
||||||
|
const global DATA_TYPE4* q_ptr = (const global DATA_TYPE4*)(q_base + q_row_offset);
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DK_VEC; ++i) {
|
||||||
|
q_priv[i] = CONVERT_ACC4(q_ptr[i]);
|
||||||
|
}
|
||||||
|
|
||||||
|
float slope = get_alibi_slope(max_bias, head_idx, n_head_log2, m0, m1);
|
||||||
|
|
||||||
|
ACC_TYPE m_i = -INFINITY;
|
||||||
|
for (int k_idx = tid; k_idx < n_kv; k_idx += Q1_WG_SIZE) {
|
||||||
|
const ulong k_row_offset = batch_idx * k_nb3 + head_kv_idx * k_nb2 + k_idx * k_nb1;
|
||||||
|
const global DATA_TYPE4* k_ptr = (const global DATA_TYPE4*)(k_base + k_row_offset);
|
||||||
|
ACC_TYPE4 dot_acc = (ACC_TYPE4)(0.0f);
|
||||||
|
#pragma unroll
|
||||||
|
for (int k = 0; k < DK_VEC; k++) {
|
||||||
|
dot_acc = mad(q_priv[k], CONVERT_ACC4(k_ptr[k]), dot_acc);
|
||||||
|
}
|
||||||
|
ACC_TYPE score = (dot_acc.s0 + dot_acc.s1 + dot_acc.s2 + dot_acc.s3) * scale;
|
||||||
|
if (mask_base != NULL) {
|
||||||
|
const global DATA_TYPE* mask_ptr = (const global DATA_TYPE*)(mask_base);
|
||||||
|
score += slope * (ACC_TYPE)mask_ptr[k_idx];
|
||||||
|
}
|
||||||
|
if (logit_softcap > 0.0f) {
|
||||||
|
score = logit_softcap * tanh(score / logit_softcap);
|
||||||
|
}
|
||||||
|
m_i = max(m_i, score);
|
||||||
|
}
|
||||||
|
|
||||||
|
__local ACC_TYPE local_m[Q1_WG_SIZE];
|
||||||
|
local_m[tid] = m_i;
|
||||||
|
barrier(CLK_LOCAL_MEM_FENCE);
|
||||||
|
#pragma unroll
|
||||||
|
for (int s = Q1_WG_SIZE / 2; s > 0; s >>= 1) {
|
||||||
|
if (tid < s) local_m[tid] = max(local_m[tid], local_m[tid + s]);
|
||||||
|
barrier(CLK_LOCAL_MEM_FENCE);
|
||||||
|
}
|
||||||
|
const ACC_TYPE m_final = local_m[0];
|
||||||
|
|
||||||
|
ACC_TYPE4 o_acc[DV_VEC];
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) o_acc[i] = (ACC_TYPE4)(0.0f);
|
||||||
|
ACC_TYPE l_i = 0.0f;
|
||||||
|
|
||||||
|
for (int k_idx = tid; k_idx < n_kv; k_idx += Q1_WG_SIZE) {
|
||||||
|
const ulong k_row_offset = batch_idx * k_nb3 + head_kv_idx * k_nb2 + k_idx * k_nb1;
|
||||||
|
const ulong v_row_offset = batch_idx * v_nb3 + head_kv_idx * v_nb2 + k_idx * v_nb1;
|
||||||
|
const global DATA_TYPE4* k_ptr = (const global DATA_TYPE4*)(k_base + k_row_offset);
|
||||||
|
const global DATA_TYPE4* v_ptr = (const global DATA_TYPE4*)(v_base + v_row_offset);
|
||||||
|
ACC_TYPE4 dot_acc = (ACC_TYPE4)(0.0f);
|
||||||
|
#pragma unroll
|
||||||
|
for (int k = 0; k < DK_VEC; k++) {
|
||||||
|
dot_acc = mad(q_priv[k], CONVERT_ACC4(k_ptr[k]), dot_acc);
|
||||||
|
}
|
||||||
|
ACC_TYPE score = (dot_acc.s0 + dot_acc.s1 + dot_acc.s2 + dot_acc.s3) * scale;
|
||||||
|
if (mask_base != NULL) {
|
||||||
|
const global DATA_TYPE* mask_ptr = (const global DATA_TYPE*)(mask_base);
|
||||||
|
score += slope * (ACC_TYPE)mask_ptr[k_idx];
|
||||||
|
}
|
||||||
|
if (logit_softcap > 0.0f) {
|
||||||
|
score = logit_softcap * tanh(score / logit_softcap);
|
||||||
|
}
|
||||||
|
const ACC_TYPE p = exp(score - m_final);
|
||||||
|
l_i += p;
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; i++) {
|
||||||
|
o_acc[i] = mad(p, CONVERT_ACC4(v_ptr[i]), o_acc[i]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
__local ACC_TYPE local_l[Q1_WG_SIZE];
|
||||||
|
__local ACC_TYPE4 local_o_comp[Q1_WG_SIZE];
|
||||||
|
local_l[tid] = l_i;
|
||||||
|
barrier(CLK_LOCAL_MEM_FENCE);
|
||||||
|
#pragma unroll
|
||||||
|
for (int s = Q1_WG_SIZE / 2; s > 0; s >>= 1) {
|
||||||
|
if (tid < s) local_l[tid] += local_l[tid + s];
|
||||||
|
barrier(CLK_LOCAL_MEM_FENCE);
|
||||||
|
}
|
||||||
|
|
||||||
|
const ulong o_row_offset = batch_idx * o_nb3 + head_idx * o_nb1;
|
||||||
|
global DATA_TYPE4 *o_row = (global DATA_TYPE4 *)(o_base + o_row_offset);
|
||||||
|
const ACC_TYPE l_final = local_l[0];
|
||||||
|
|
||||||
|
if (l_final > 0.0f) {
|
||||||
|
const ACC_TYPE l_inv = 1.0f / l_final;
|
||||||
|
for (int i = 0; i < DV_VEC; i++) {
|
||||||
|
local_o_comp[tid] = o_acc[i];
|
||||||
|
barrier(CLK_LOCAL_MEM_FENCE);
|
||||||
|
#pragma unroll
|
||||||
|
for (int s = Q1_WG_SIZE / 2; s > 0; s >>= 1) {
|
||||||
|
if (tid < s) local_o_comp[tid] += local_o_comp[tid + s];
|
||||||
|
barrier(CLK_LOCAL_MEM_FENCE);
|
||||||
|
}
|
||||||
|
if (tid == 0) {
|
||||||
|
o_row[i] = CONVERT_DATA4(local_o_comp[0] * l_inv);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else if (tid == 0) {
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) o_row[i] = (DATA_TYPE4)(0.0f);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
@ -0,0 +1,343 @@
|
||||||
|
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
|
||||||
|
|
||||||
|
#define ACC_TYPE float
|
||||||
|
#define ACC_TYPE4 float4
|
||||||
|
#define DATA_TYPE float
|
||||||
|
#define DATA_TYPE4 float4
|
||||||
|
#define CONVERT_ACC4(x) (x)
|
||||||
|
#define CONVERT_DATA4(x) (x)
|
||||||
|
|
||||||
|
#define DK_VEC (DK/4)
|
||||||
|
#define DV_VEC (DV/4)
|
||||||
|
#define WG_SIZE (BLOCK_M)
|
||||||
|
#define Q1_WG_SIZE 64
|
||||||
|
|
||||||
|
inline float get_alibi_slope(
|
||||||
|
const float max_bias, const uint h, const uint n_head_log2, const float m0, const float m1
|
||||||
|
) {
|
||||||
|
if (max_bias <= 0.0f) {
|
||||||
|
return 1.0f;
|
||||||
|
}
|
||||||
|
const float base = h < n_head_log2 ? m0 : m1;
|
||||||
|
const int exph = h < n_head_log2 ? h + 1 : 2*(h - n_head_log2) + 1;
|
||||||
|
|
||||||
|
return pow(base, exph);
|
||||||
|
}
|
||||||
|
__kernel void flash_attn_f32(
|
||||||
|
const global void * q_void, ulong q_offset,
|
||||||
|
const global void * k_void, ulong k_offset,
|
||||||
|
const global void * v_void, ulong v_offset,
|
||||||
|
global void * o_void, ulong o_offset,
|
||||||
|
const float scale,
|
||||||
|
const int n_q,
|
||||||
|
const int n_kv,
|
||||||
|
const int is_causal,
|
||||||
|
const int n_head,
|
||||||
|
const ulong q_nb1, const ulong q_nb2, const ulong q_nb3,
|
||||||
|
const ulong k_nb1, const ulong k_nb2, const ulong k_nb3,
|
||||||
|
const ulong v_nb1, const ulong v_nb2, const ulong v_nb3,
|
||||||
|
const ulong o_nb1, const ulong o_nb2, const ulong o_nb3,
|
||||||
|
const float max_bias,
|
||||||
|
const float m0,
|
||||||
|
const float m1,
|
||||||
|
const int n_head_log2,
|
||||||
|
const float logit_softcap,
|
||||||
|
const int n_head_kv,
|
||||||
|
const global void* mask_void,
|
||||||
|
const ulong mask_offset,
|
||||||
|
const ulong mask_nb1,
|
||||||
|
const ulong mask_nb2,
|
||||||
|
const ulong mask_nb3,
|
||||||
|
const int mask_ne2,
|
||||||
|
const int mask_ne3
|
||||||
|
) {
|
||||||
|
const int tid = get_local_id(0);
|
||||||
|
const int block_q_idx = get_group_id(0);
|
||||||
|
const int head_batch_idx = get_global_id(1);
|
||||||
|
|
||||||
|
const int my_query_row = block_q_idx * BLOCK_M + tid;
|
||||||
|
|
||||||
|
const int batch_idx = head_batch_idx / n_head;
|
||||||
|
const int head_idx = head_batch_idx % n_head;
|
||||||
|
|
||||||
|
const int gqa_ratio = n_head / n_head_kv;
|
||||||
|
const int head_kv_idx = head_idx / gqa_ratio;
|
||||||
|
|
||||||
|
const global char* q_base = (const global char*)q_void + q_offset;
|
||||||
|
const global char* k_base = (const global char*)k_void + k_offset;
|
||||||
|
const global char* v_base = (const global char*)v_void + v_offset;
|
||||||
|
global char* o_base = (global char*)o_void + o_offset;
|
||||||
|
|
||||||
|
const global char* mask_base = NULL;
|
||||||
|
if (mask_void != NULL) {
|
||||||
|
const int mask_head_idx = head_idx % mask_ne2;
|
||||||
|
const int mask_batch_idx = batch_idx % mask_ne3;
|
||||||
|
mask_base = (const global char*)mask_void + mask_offset + mask_batch_idx * mask_nb3 + mask_head_idx * mask_nb2;
|
||||||
|
}
|
||||||
|
|
||||||
|
ACC_TYPE4 q_priv[DK_VEC];
|
||||||
|
if (my_query_row < n_q) {
|
||||||
|
const ulong q_row_offset = batch_idx * q_nb3 + head_idx * q_nb2 + my_query_row * q_nb1;
|
||||||
|
const global DATA_TYPE4* q_ptr = (const global DATA_TYPE4*)(q_base + q_row_offset);
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DK_VEC; ++i) {
|
||||||
|
q_priv[i] = CONVERT_ACC4(q_ptr[i]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
ACC_TYPE4 o_acc[DV_VEC];
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) {
|
||||||
|
o_acc[i] = (ACC_TYPE4)(0.0f);
|
||||||
|
}
|
||||||
|
ACC_TYPE m_i = -INFINITY;
|
||||||
|
ACC_TYPE l_i = 0.0f;
|
||||||
|
|
||||||
|
float slope = get_alibi_slope(max_bias, head_idx, n_head_log2, m0, m1);
|
||||||
|
|
||||||
|
__local DATA_TYPE4 l_k[BLOCK_N][DK_VEC];
|
||||||
|
__local DATA_TYPE4 l_v[BLOCK_N][DV_VEC];
|
||||||
|
|
||||||
|
for (int k_start = 0; k_start < n_kv; k_start += BLOCK_N) {
|
||||||
|
for (int i = tid; i < BLOCK_N * DK_VEC; i += WG_SIZE) {
|
||||||
|
const int row = i / DK_VEC;
|
||||||
|
const int col = i % DK_VEC;
|
||||||
|
const int k_row_idx = k_start + row;
|
||||||
|
if (k_row_idx < n_kv) {
|
||||||
|
const ulong k_row_offset = batch_idx * k_nb3 + head_kv_idx * k_nb2 + k_row_idx * k_nb1;
|
||||||
|
l_k[row][col] = ((__global DATA_TYPE4*)(k_base + k_row_offset))[col];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
for (int i = tid; i < BLOCK_N * DV_VEC; i += WG_SIZE) {
|
||||||
|
const int row = i / DV_VEC;
|
||||||
|
const int col = i % DV_VEC;
|
||||||
|
const int v_row_idx = k_start + row;
|
||||||
|
if (v_row_idx < n_kv) {
|
||||||
|
const ulong v_row_offset = batch_idx * v_nb3 + head_kv_idx * v_nb2 + v_row_idx * v_nb1;
|
||||||
|
l_v[row][col] = ((__global DATA_TYPE4*)(v_base + v_row_offset))[col];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
barrier(CLK_LOCAL_MEM_FENCE);
|
||||||
|
|
||||||
|
if (my_query_row >= n_q) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
for (int j = 0; j < BLOCK_N; j += 2) {
|
||||||
|
const int k_row0 = k_start + j;
|
||||||
|
const int k_row1 = k_start + j + 1;
|
||||||
|
|
||||||
|
ACC_TYPE4 dot_acc0 = (ACC_TYPE4)(0.0f);
|
||||||
|
ACC_TYPE4 dot_acc1 = (ACC_TYPE4)(0.0f);
|
||||||
|
#pragma unroll
|
||||||
|
for (int k = 0; k < DK_VEC; k++) {
|
||||||
|
dot_acc0 = mad(q_priv[k], CONVERT_ACC4(l_k[j][k]), dot_acc0);
|
||||||
|
dot_acc1 = mad(q_priv[k], CONVERT_ACC4(l_k[j+1][k]), dot_acc1);
|
||||||
|
}
|
||||||
|
ACC_TYPE score0 = (dot_acc0.s0 + dot_acc0.s1 + dot_acc0.s2 + dot_acc0.s3) * scale;
|
||||||
|
ACC_TYPE score1 = (dot_acc1.s0 + dot_acc1.s1 + dot_acc1.s2 + dot_acc1.s3) * scale;
|
||||||
|
|
||||||
|
if (is_causal) {
|
||||||
|
if (k_row0 > (n_kv - n_q + my_query_row)) score0 = -INFINITY;
|
||||||
|
if (k_row1 > (n_kv - n_q + my_query_row)) score1 = -INFINITY;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (k_row0 >= n_kv) score0 = -INFINITY;
|
||||||
|
if (k_row1 >= n_kv) score1 = -INFINITY;
|
||||||
|
|
||||||
|
if (mask_base != NULL) {
|
||||||
|
const global DATA_TYPE* mask_ptr = (const global DATA_TYPE*)(mask_base + my_query_row * mask_nb1);
|
||||||
|
if (k_row0 < n_kv) score0 += slope * (ACC_TYPE)mask_ptr[k_row0];
|
||||||
|
if (k_row1 < n_kv) score1 += slope * (ACC_TYPE)mask_ptr[k_row1];
|
||||||
|
}
|
||||||
|
|
||||||
|
if (logit_softcap > 0.0f) {
|
||||||
|
score0 = logit_softcap * tanh(score0 / logit_softcap);
|
||||||
|
score1 = logit_softcap * tanh(score1 / logit_softcap);
|
||||||
|
}
|
||||||
|
|
||||||
|
const ACC_TYPE m_new = max(m_i, max(score0, score1));
|
||||||
|
const ACC_TYPE p0 = exp(score0 - m_new);
|
||||||
|
const ACC_TYPE p1 = exp(score1 - m_new);
|
||||||
|
const ACC_TYPE scale_prev = exp(m_i - m_new);
|
||||||
|
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) {
|
||||||
|
o_acc[i] = o_acc[i] * scale_prev + p0 * CONVERT_ACC4(l_v[j][i]) + p1 * CONVERT_ACC4(l_v[j+1][i]);
|
||||||
|
}
|
||||||
|
l_i = l_i * scale_prev + p0 + p1;
|
||||||
|
m_i = m_new;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (my_query_row < n_q) {
|
||||||
|
const ulong o_row_offset = batch_idx * o_nb3 + my_query_row * o_nb2 + head_idx * o_nb1;
|
||||||
|
global DATA_TYPE4 *o_row = (global DATA_TYPE4 *)(o_base + o_row_offset);
|
||||||
|
if (l_i > 0.0f) {
|
||||||
|
const ACC_TYPE l_inv = 1.0f / l_i;
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) {
|
||||||
|
o_row[i] = CONVERT_DATA4(o_acc[i] * l_inv);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DV_VEC; ++i) {
|
||||||
|
o_row[i] = (DATA_TYPE4)(0.0f);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
__kernel void flash_attn_f32_q1(
|
||||||
|
const global void * q_void, ulong q_offset,
|
||||||
|
const global void * k_void, ulong k_offset,
|
||||||
|
const global void * v_void, ulong v_offset,
|
||||||
|
global void * o_void, ulong o_offset,
|
||||||
|
const float scale,
|
||||||
|
const int n_q,
|
||||||
|
const int n_kv,
|
||||||
|
const int is_causal,
|
||||||
|
const int n_head,
|
||||||
|
const ulong q_nb1, const ulong q_nb2, const ulong q_nb3,
|
||||||
|
const ulong k_nb1, const ulong k_nb2, const ulong k_nb3,
|
||||||
|
const ulong v_nb1, const ulong v_nb2, const ulong v_nb3,
|
||||||
|
const ulong o_nb1, const ulong o_nb2, const ulong o_nb3,
|
||||||
|
const float max_bias,
|
||||||
|
const float m0,
|
||||||
|
const float m1,
|
||||||
|
const int n_head_log2,
|
||||||
|
const float logit_softcap,
|
||||||
|
const int n_head_kv,
|
||||||
|
const global void* mask_void,
|
||||||
|
const ulong mask_offset,
|
||||||
|
const ulong mask_nb1,
|
||||||
|
const ulong mask_nb2,
|
||||||
|
const ulong mask_nb3,
|
||||||
|
const int mask_ne2,
|
||||||
|
const int mask_ne3
|
||||||
|
) {
|
||||||
|
const int tid = get_local_id(0);
|
||||||
|
const int head_batch_idx = get_global_id(1);
|
||||||
|
|
||||||
|
const int batch_idx = head_batch_idx / n_head;
|
||||||
|
const int head_idx = head_batch_idx % n_head;
|
||||||
|
|
||||||
|
const int gqa_ratio = n_head / n_head_kv;
|
||||||
|
const int head_kv_idx = head_idx / gqa_ratio;
|
||||||
|
|
||||||
|
const global char* q_base = (const global char*)q_void + q_offset;
|
||||||
|
const global char* k_base = (const global char*)k_void + k_offset;
|
||||||
|
const global char* v_base = (const global char*)v_void + v_offset;
|
||||||
|
global char* o_base = (global char*)o_void + o_offset;
|
||||||
|
|
||||||
|
const global char* mask_base = NULL;
|
||||||
|
if (mask_void != NULL) {
|
||||||
|
const int mask_head_idx = head_idx % mask_ne2;
|
||||||
|
const int mask_batch_idx = batch_idx % mask_ne3;
|
||||||
|
mask_base = (const global char*)mask_void + mask_offset + mask_batch_idx * mask_nb3 + mask_head_idx * mask_nb2;
|
||||||
|
}
|
||||||
|
|
||||||
|
ACC_TYPE4 q_priv[DK_VEC];
|
||||||
|
const ulong q_row_offset = batch_idx * q_nb3 + head_idx * q_nb2;
|
||||||
|
const global DATA_TYPE4* q_ptr = (const global DATA_TYPE4*)(q_base + q_row_offset);
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < DK_VEC; ++i) {
|
||||||
|
q_priv[i] = CONVERT_ACC4(q_ptr[i]);
|
||||||
|
}
|
||||||
|
|
||||||
|
float slope = get_alibi_slope(max_bias, head_idx, n_head_log2, m0, m1);
|
||||||
|
|
    ACC_TYPE m_i = -INFINITY;
    for (int k_idx = tid; k_idx < n_kv; k_idx += Q1_WG_SIZE) {
        const ulong k_row_offset = batch_idx * k_nb3 + head_kv_idx * k_nb2 + k_idx * k_nb1;
        const global DATA_TYPE4* k_ptr = (const global DATA_TYPE4*)(k_base + k_row_offset);
        ACC_TYPE4 dot_acc = (ACC_TYPE4)(0.0f);
        #pragma unroll
        for (int k = 0; k < DK_VEC; k++) {
            dot_acc = mad(q_priv[k], CONVERT_ACC4(k_ptr[k]), dot_acc);
        }
        ACC_TYPE score = (dot_acc.s0 + dot_acc.s1 + dot_acc.s2 + dot_acc.s3) * scale;
        if (mask_base != NULL) {
            const global DATA_TYPE* mask_ptr = (const global DATA_TYPE*)(mask_base);
            score += slope * (ACC_TYPE)mask_ptr[k_idx];
        }
        if (logit_softcap > 0.0f) {
            score = logit_softcap * tanh(score / logit_softcap);
        }
        m_i = max(m_i, score);
    }

    __local ACC_TYPE local_m[Q1_WG_SIZE];
    local_m[tid] = m_i;
    barrier(CLK_LOCAL_MEM_FENCE);
    #pragma unroll
    for (int s = Q1_WG_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) local_m[tid] = max(local_m[tid], local_m[tid + s]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    const ACC_TYPE m_final = local_m[0];

    ACC_TYPE4 o_acc[DV_VEC];
    #pragma unroll
    for (int i = 0; i < DV_VEC; ++i) o_acc[i] = (ACC_TYPE4)(0.0f);
    ACC_TYPE l_i = 0.0f;

    for (int k_idx = tid; k_idx < n_kv; k_idx += Q1_WG_SIZE) {
        const ulong k_row_offset = batch_idx * k_nb3 + head_kv_idx * k_nb2 + k_idx * k_nb1;
        const ulong v_row_offset = batch_idx * v_nb3 + head_kv_idx * v_nb2 + k_idx * v_nb1;
        const global DATA_TYPE4* k_ptr = (const global DATA_TYPE4*)(k_base + k_row_offset);
        const global DATA_TYPE4* v_ptr = (const global DATA_TYPE4*)(v_base + v_row_offset);
        ACC_TYPE4 dot_acc = (ACC_TYPE4)(0.0f);
        #pragma unroll
        for (int k = 0; k < DK_VEC; k++) {
            dot_acc = mad(q_priv[k], CONVERT_ACC4(k_ptr[k]), dot_acc);
        }
        ACC_TYPE score = (dot_acc.s0 + dot_acc.s1 + dot_acc.s2 + dot_acc.s3) * scale;
        if (mask_base != NULL) {
            const global DATA_TYPE* mask_ptr = (const global DATA_TYPE*)(mask_base);
            score += slope * (ACC_TYPE)mask_ptr[k_idx];
        }
        if (logit_softcap > 0.0f) {
            score = logit_softcap * tanh(score / logit_softcap);
        }
        const ACC_TYPE p = exp(score - m_final);
        l_i += p;
        #pragma unroll
        for (int i = 0; i < DV_VEC; i++) {
            o_acc[i] = mad(p, CONVERT_ACC4(v_ptr[i]), o_acc[i]);
        }
    }

    __local ACC_TYPE local_l[Q1_WG_SIZE];
    __local ACC_TYPE4 local_o_comp[Q1_WG_SIZE];
    local_l[tid] = l_i;
    barrier(CLK_LOCAL_MEM_FENCE);
    #pragma unroll
    for (int s = Q1_WG_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) local_l[tid] += local_l[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    const ulong o_row_offset = batch_idx * o_nb3 + head_idx * o_nb1;
    global DATA_TYPE4 *o_row = (global DATA_TYPE4 *)(o_base + o_row_offset);
    const ACC_TYPE l_final = local_l[0];

    if (l_final > 0.0f) {
        const ACC_TYPE l_inv = 1.0f / l_final;
        for (int i = 0; i < DV_VEC; i++) {
            local_o_comp[tid] = o_acc[i];
            barrier(CLK_LOCAL_MEM_FENCE);
            #pragma unroll
            for (int s = Q1_WG_SIZE / 2; s > 0; s >>= 1) {
                if (tid < s) local_o_comp[tid] += local_o_comp[tid + s];
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            if (tid == 0) {
                o_row[i] = CONVERT_DATA4(local_o_comp[0] * l_inv);
            }
        }
    } else if (tid == 0) {
        #pragma unroll
        for (int i = 0; i < DV_VEC; ++i) o_row[i] = (DATA_TYPE4)(0.0f);
    }
}
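The single-query (`Q1_WG_SIZE`) path above computes softmax attention for one query row in two passes over the KV cache (a max pass, then a sum pass), followed by workgroup tree reductions. As an orientation aid only, here is a plain-C, single-threaded sketch of the same two-pass computation; the function name and the flat row-major `k[n_kv][DK]` / `v[n_kv][DV]` layouts are illustrative assumptions, and the mask, ALiBi and logit-softcap terms are omitted.

```c
// Hedged scalar sketch, not part of the commit: a reference for the two-pass
// softmax that the single-query kernel parallelizes across a workgroup.
#include <math.h>
#include <stdio.h>

static void attn_single_query_ref(
        const float *q,        // [DK]        query row
        const float *k,        // [n_kv][DK]  keys
        const float *v,        // [n_kv][DV]  values
        float       *o,        // [DV]        output row
        int DK, int DV, int n_kv, float scale) {
    // pass 1: maximum of the scaled scores
    float m = -INFINITY;
    for (int t = 0; t < n_kv; ++t) {
        float s = 0.0f;
        for (int d = 0; d < DK; ++d) s += q[d] * k[t*DK + d];
        s *= scale;
        if (s > m) m = s;
    }
    // pass 2: accumulate exp(s - m) and the weighted sum of V rows
    float l = 0.0f;
    for (int d = 0; d < DV; ++d) o[d] = 0.0f;
    for (int t = 0; t < n_kv; ++t) {
        float s = 0.0f;
        for (int d = 0; d < DK; ++d) s += q[d] * k[t*DK + d];
        s *= scale;
        const float p = expf(s - m);
        l += p;
        for (int d = 0; d < DV; ++d) o[d] += p * v[t*DV + d];
    }
    // normalize; an all-zero denominator produces a zero row, as in the kernel
    const float l_inv = l > 0.0f ? 1.0f/l : 0.0f;
    for (int d = 0; d < DV; ++d) o[d] *= l_inv;
}

int main(void) {
    const float q[4]   = {1, 0, 0, 0};
    const float k[2*4] = {1, 0, 0, 0,   0, 1, 0, 0};
    const float v[2*4] = {1, 2, 3, 4,   5, 6, 7, 8};
    float o[4];
    attn_single_query_ref(q, k, v, o, 4, 4, 2, 1.0f);
    printf("%f %f %f %f\n", o[0], o[1], o[2], o[3]); // softmax-weighted mix of the V rows
    return 0;
}
```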
@@ -0,0 +1,346 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

#define ACC_TYPE float
#define ACC_TYPE4 float4
#define Q_DATA_TYPE4 float4
#define KV_DATA_TYPE4 half4
#define O_DATA_TYPE4 float4
#define MASK_DATA_TYPE half
#define CONVERT_Q_ACC4(x) (x)
#define CONVERT_KV_ACC4(x) convert_float4(x)
#define CONVERT_O_DATA4(x) (x)

#define DK_VEC (DK/4)
#define DV_VEC (DV/4)
#define WG_SIZE (BLOCK_M)
#define Q1_WG_SIZE 64

inline float get_alibi_slope(
    const float max_bias, const uint h, const uint n_head_log2, const float m0, const float m1
) {
    if (max_bias <= 0.0f) {
        return 1.0f;
    }
    const float base = h < n_head_log2 ? m0 : m1;
    const int   exph = h < n_head_log2 ? h + 1 : 2*(h - n_head_log2) + 1;

    return pow(base, exph);
}
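For reference, `get_alibi_slope` above evaluates the per-head ALiBi slope

$$
\text{slope}(h)=
\begin{cases}
1 & \texttt{max\_bias}\le 0,\\
m_0^{\,h+1} & h < \texttt{n\_head\_log2},\\
m_1^{\,2(h-\texttt{n\_head\_log2})+1} & \text{otherwise,}
\end{cases}
$$

so the first `n_head_log2` heads use successive powers of `m0` and the remaining heads use odd powers of `m1`.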
__kernel void flash_attn_f32_f16(
    const global void * q_void, ulong q_offset,
    const global void * k_void, ulong k_offset,
    const global void * v_void, ulong v_offset,
    global void * o_void, ulong o_offset,
    const float scale,
    const int n_q,
    const int n_kv,
    const int is_causal,
    const int n_head,
    const ulong q_nb1, const ulong q_nb2, const ulong q_nb3,
    const ulong k_nb1, const ulong k_nb2, const ulong k_nb3,
    const ulong v_nb1, const ulong v_nb2, const ulong v_nb3,
    const ulong o_nb1, const ulong o_nb2, const ulong o_nb3,
    const float max_bias,
    const float m0,
    const float m1,
    const int n_head_log2,
    const float logit_softcap,
    const int n_head_kv,
    const global void* mask_void,
    const ulong mask_offset,
    const ulong mask_nb1,
    const ulong mask_nb2,
    const ulong mask_nb3,
    const int mask_ne2,
    const int mask_ne3
) {
    const int tid = get_local_id(0);
    const int block_q_idx = get_group_id(0);
    const int head_batch_idx = get_global_id(1);

    const int my_query_row = block_q_idx * BLOCK_M + tid;

    const int batch_idx = head_batch_idx / n_head;
    const int head_idx = head_batch_idx % n_head;

    const int gqa_ratio = n_head / n_head_kv;
    const int head_kv_idx = head_idx / gqa_ratio;

    const global char* q_base = (const global char*)q_void + q_offset;
    const global char* k_base = (const global char*)k_void + k_offset;
    const global char* v_base = (const global char*)v_void + v_offset;
    global char* o_base = (global char*)o_void + o_offset;

    const global char* mask_base = NULL;
    if (mask_void != NULL) {
        const int mask_head_idx = head_idx % mask_ne2;
        const int mask_batch_idx = batch_idx % mask_ne3;
        mask_base = (const global char*)mask_void + mask_offset + mask_batch_idx * mask_nb3 + mask_head_idx * mask_nb2;
    }

    ACC_TYPE4 q_priv[DK_VEC];
    if (my_query_row < n_q) {
        const ulong q_row_offset = batch_idx * q_nb3 + head_idx * q_nb2 + my_query_row * q_nb1;
        const global Q_DATA_TYPE4* q_ptr = (const global Q_DATA_TYPE4*)(q_base + q_row_offset);
        #pragma unroll
        for (int i = 0; i < DK_VEC; ++i) {
            q_priv[i] = CONVERT_Q_ACC4(q_ptr[i]);
        }
    }

    ACC_TYPE4 o_acc[DV_VEC];
    #pragma unroll
    for (int i = 0; i < DV_VEC; ++i) {
        o_acc[i] = (ACC_TYPE4)(0.0f);
    }
    ACC_TYPE m_i = -INFINITY;
    ACC_TYPE l_i = 0.0f;

    float slope = get_alibi_slope(max_bias, head_idx, n_head_log2, m0, m1);

    __local KV_DATA_TYPE4 l_k[BLOCK_N][DK_VEC];
    __local KV_DATA_TYPE4 l_v[BLOCK_N][DV_VEC];

    for (int k_start = 0; k_start < n_kv; k_start += BLOCK_N) {
        for (int i = tid; i < BLOCK_N * DK_VEC; i += WG_SIZE) {
            const int row = i / DK_VEC;
            const int col = i % DK_VEC;
            const int k_row_idx = k_start + row;
            if (k_row_idx < n_kv) {
                const ulong k_row_offset = batch_idx * k_nb3 + head_kv_idx * k_nb2 + k_row_idx * k_nb1;
                l_k[row][col] = ((__global KV_DATA_TYPE4*)(k_base + k_row_offset))[col];
            }
        }
        for (int i = tid; i < BLOCK_N * DV_VEC; i += WG_SIZE) {
            const int row = i / DV_VEC;
            const int col = i % DV_VEC;
            const int v_row_idx = k_start + row;
            if (v_row_idx < n_kv) {
                const ulong v_row_offset = batch_idx * v_nb3 + head_kv_idx * v_nb2 + v_row_idx * v_nb1;
                l_v[row][col] = ((__global KV_DATA_TYPE4*)(v_base + v_row_offset))[col];
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        if (my_query_row >= n_q) {
            continue;
        }

        for (int j = 0; j < BLOCK_N; j += 2) {
            const int k_row0 = k_start + j;
            const int k_row1 = k_start + j + 1;

            ACC_TYPE4 dot_acc0 = (ACC_TYPE4)(0.0f);
            ACC_TYPE4 dot_acc1 = (ACC_TYPE4)(0.0f);
            #pragma unroll
            for (int k = 0; k < DK_VEC; k++) {
                dot_acc0 = mad(q_priv[k], CONVERT_KV_ACC4(l_k[j][k]),   dot_acc0);
                dot_acc1 = mad(q_priv[k], CONVERT_KV_ACC4(l_k[j+1][k]), dot_acc1);
            }
            ACC_TYPE score0 = (dot_acc0.s0 + dot_acc0.s1 + dot_acc0.s2 + dot_acc0.s3) * scale;
            ACC_TYPE score1 = (dot_acc1.s0 + dot_acc1.s1 + dot_acc1.s2 + dot_acc1.s3) * scale;

            if (is_causal) {
                if (k_row0 > (n_kv - n_q + my_query_row)) score0 = -INFINITY;
                if (k_row1 > (n_kv - n_q + my_query_row)) score1 = -INFINITY;
            }

            if (k_row0 >= n_kv) score0 = -INFINITY;
            if (k_row1 >= n_kv) score1 = -INFINITY;

            if (mask_base != NULL) {
                const global MASK_DATA_TYPE* mask_ptr = (const global MASK_DATA_TYPE*)(mask_base + my_query_row * mask_nb1);
                if (k_row0 < n_kv) score0 += slope * (ACC_TYPE)mask_ptr[k_row0];
                if (k_row1 < n_kv) score1 += slope * (ACC_TYPE)mask_ptr[k_row1];
            }

            if (logit_softcap > 0.0f) {
                score0 = logit_softcap * tanh(score0 / logit_softcap);
                score1 = logit_softcap * tanh(score1 / logit_softcap);
            }

            const ACC_TYPE m_new = max(m_i, max(score0, score1));
            const ACC_TYPE p0 = exp(score0 - m_new);
            const ACC_TYPE p1 = exp(score1 - m_new);
            const ACC_TYPE scale_prev = exp(m_i - m_new);

            #pragma unroll
            for (int i = 0; i < DV_VEC; ++i) {
                o_acc[i] = o_acc[i] * scale_prev + p0 * CONVERT_KV_ACC4(l_v[j][i]) + p1 * CONVERT_KV_ACC4(l_v[j+1][i]);
            }
            l_i = l_i * scale_prev + p0 + p1;
            m_i = m_new;
        }
    }

    if (my_query_row < n_q) {
        const ulong o_row_offset = batch_idx * o_nb3 + my_query_row * o_nb2 + head_idx * o_nb1;
        global O_DATA_TYPE4 *o_row = (global O_DATA_TYPE4 *)(o_base + o_row_offset);
        if (l_i > 0.0f) {
            const ACC_TYPE l_inv = 1.0f / l_i;
            #pragma unroll
            for (int i = 0; i < DV_VEC; ++i) {
                o_row[i] = CONVERT_O_DATA4(o_acc[i] * l_inv);
            }
        } else {
            #pragma unroll
            for (int i = 0; i < DV_VEC; ++i) {
                o_row[i] = (O_DATA_TYPE4)(0.0f);
            }
        }
    }
}
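The tiled kernel above never materializes a full score row: for each pair of K/V rows it folds the new scores into running statistics with the usual online-softmax rescaling, which in the kernel's variable names is

$$
m' = \max(m_i, s_0, s_1),\quad \alpha = e^{\,m_i - m'},\quad
o \leftarrow \alpha\,o + e^{\,s_0 - m'} v_0 + e^{\,s_1 - m'} v_1,\quad
l \leftarrow \alpha\,l + e^{\,s_0 - m'} + e^{\,s_1 - m'},\quad
m_i \leftarrow m'.
$$

The final result is `o / l`, or a zero row when `l` stays 0 (for example a fully masked query).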
__kernel void flash_attn_f32_f16_q1(
    const global void * q_void, ulong q_offset,
    const global void * k_void, ulong k_offset,
    const global void * v_void, ulong v_offset,
    global void * o_void, ulong o_offset,
    const float scale,
    const int n_q,
    const int n_kv,
    const int is_causal,
    const int n_head,
    const ulong q_nb1, const ulong q_nb2, const ulong q_nb3,
    const ulong k_nb1, const ulong k_nb2, const ulong k_nb3,
    const ulong v_nb1, const ulong v_nb2, const ulong v_nb3,
    const ulong o_nb1, const ulong o_nb2, const ulong o_nb3,
    const float max_bias,
    const float m0,
    const float m1,
    const int n_head_log2,
    const float logit_softcap,
    const int n_head_kv,
    const global void* mask_void,
    const ulong mask_offset,
    const ulong mask_nb1,
    const ulong mask_nb2,
    const ulong mask_nb3,
    const int mask_ne2,
    const int mask_ne3
) {
    const int tid = get_local_id(0);
    const int head_batch_idx = get_global_id(1);

    const int batch_idx = head_batch_idx / n_head;
    const int head_idx = head_batch_idx % n_head;

    const int gqa_ratio = n_head / n_head_kv;
    const int head_kv_idx = head_idx / gqa_ratio;

    const global char* q_base = (const global char*)q_void + q_offset;
    const global char* k_base = (const global char*)k_void + k_offset;
    const global char* v_base = (const global char*)v_void + v_offset;
    global char* o_base = (global char*)o_void + o_offset;

    const global char* mask_base = NULL;
    if (mask_void != NULL) {
        const int mask_head_idx = head_idx % mask_ne2;
        const int mask_batch_idx = batch_idx % mask_ne3;
        mask_base = (const global char*)mask_void + mask_offset + mask_batch_idx * mask_nb3 + mask_head_idx * mask_nb2;
    }

    ACC_TYPE4 q_priv[DK_VEC];
    const ulong q_row_offset = batch_idx * q_nb3 + head_idx * q_nb2;
    const global Q_DATA_TYPE4* q_ptr = (const global Q_DATA_TYPE4*)(q_base + q_row_offset);
    #pragma unroll
    for (int i = 0; i < DK_VEC; ++i) {
        q_priv[i] = CONVERT_Q_ACC4(q_ptr[i]);
    }

    float slope = get_alibi_slope(max_bias, head_idx, n_head_log2, m0, m1);

    ACC_TYPE m_i = -INFINITY;
    for (int k_idx = tid; k_idx < n_kv; k_idx += Q1_WG_SIZE) {
        const ulong k_row_offset = batch_idx * k_nb3 + head_kv_idx * k_nb2 + k_idx * k_nb1;
        const global KV_DATA_TYPE4* k_ptr = (const global KV_DATA_TYPE4*)(k_base + k_row_offset);
        ACC_TYPE4 dot_acc = (ACC_TYPE4)(0.0f);
        #pragma unroll
        for (int k = 0; k < DK_VEC; k++) {
            dot_acc = mad(q_priv[k], CONVERT_KV_ACC4(k_ptr[k]), dot_acc);
        }
        ACC_TYPE score = (dot_acc.s0 + dot_acc.s1 + dot_acc.s2 + dot_acc.s3) * scale;
        if (mask_base != NULL) {
            const global MASK_DATA_TYPE* mask_ptr = (const global MASK_DATA_TYPE*)(mask_base);
            score += slope * (ACC_TYPE)mask_ptr[k_idx];
        }
        if (logit_softcap > 0.0f) {
            score = logit_softcap * tanh(score / logit_softcap);
        }
        m_i = max(m_i, score);
    }

    __local ACC_TYPE local_m[Q1_WG_SIZE];
    local_m[tid] = m_i;
    barrier(CLK_LOCAL_MEM_FENCE);
    #pragma unroll
    for (int s = Q1_WG_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) local_m[tid] = max(local_m[tid], local_m[tid + s]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    const ACC_TYPE m_final = local_m[0];

    ACC_TYPE4 o_acc[DV_VEC];
    #pragma unroll
    for (int i = 0; i < DV_VEC; ++i) o_acc[i] = (ACC_TYPE4)(0.0f);
    ACC_TYPE l_i = 0.0f;

    for (int k_idx = tid; k_idx < n_kv; k_idx += Q1_WG_SIZE) {
        const ulong k_row_offset = batch_idx * k_nb3 + head_kv_idx * k_nb2 + k_idx * k_nb1;
        const ulong v_row_offset = batch_idx * v_nb3 + head_kv_idx * v_nb2 + k_idx * v_nb1;
        const global KV_DATA_TYPE4* k_ptr = (const global KV_DATA_TYPE4*)(k_base + k_row_offset);
        const global KV_DATA_TYPE4* v_ptr = (const global KV_DATA_TYPE4*)(v_base + v_row_offset);
        ACC_TYPE4 dot_acc = (ACC_TYPE4)(0.0f);
        #pragma unroll
        for (int k = 0; k < DK_VEC; k++) {
            dot_acc = mad(q_priv[k], CONVERT_KV_ACC4(k_ptr[k]), dot_acc);
        }
        ACC_TYPE score = (dot_acc.s0 + dot_acc.s1 + dot_acc.s2 + dot_acc.s3) * scale;
        if (mask_base != NULL) {
            const global MASK_DATA_TYPE* mask_ptr = (const global MASK_DATA_TYPE*)(mask_base);
            score += slope * (ACC_TYPE)mask_ptr[k_idx];
        }
        if (logit_softcap > 0.0f) {
            score = logit_softcap * tanh(score / logit_softcap);
        }
        const ACC_TYPE p = exp(score - m_final);
        l_i += p;
        #pragma unroll
        for (int i = 0; i < DV_VEC; i++) {
            o_acc[i] = mad(p, CONVERT_KV_ACC4(v_ptr[i]), o_acc[i]);
        }
    }

    __local ACC_TYPE local_l[Q1_WG_SIZE];
    __local ACC_TYPE4 local_o_comp[Q1_WG_SIZE];
    local_l[tid] = l_i;
    barrier(CLK_LOCAL_MEM_FENCE);
    #pragma unroll
    for (int s = Q1_WG_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) local_l[tid] += local_l[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    const ulong o_row_offset = batch_idx * o_nb3 + head_idx * o_nb1;
    global O_DATA_TYPE4 *o_row = (global O_DATA_TYPE4 *)(o_base + o_row_offset);
    const ACC_TYPE l_final = local_l[0];

    if (l_final > 0.0f) {
        const ACC_TYPE l_inv = 1.0f / l_final;
        for (int i = 0; i < DV_VEC; i++) {
            local_o_comp[tid] = o_acc[i];
            barrier(CLK_LOCAL_MEM_FENCE);
            #pragma unroll
            for (int s = Q1_WG_SIZE / 2; s > 0; s >>= 1) {
                if (tid < s) local_o_comp[tid] += local_o_comp[tid + s];
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            if (tid == 0) {
                o_row[i] = CONVERT_O_DATA4(local_o_comp[0] * l_inv);
            }
        }
    } else if (tid == 0) {
        #pragma unroll
        for (int i = 0; i < DV_VEC; ++i) o_row[i] = (O_DATA_TYPE4)(0.0f);
    }
}
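Both kernels map a query head onto its KV head by integer division, `head_kv_idx = head_idx / gqa_ratio` with `gqa_ratio = n_head / n_head_kv`, which is the standard grouped-query-attention layout. With purely illustrative numbers:

$$
\text{head\_kv}=\left\lfloor \frac{\text{head}}{n_{\text{head}}/n_{\text{head\_kv}}}\right\rfloor,
\qquad n_{\text{head}}=32,\ n_{\text{head\_kv}}=8:\ \text{head } 13 \mapsto \lfloor 13/4\rfloor = 3 .
$$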
@@ -566,7 +566,7 @@ static float make_q3_quants(int n, int nmax, const float * GGML_RESTRICT x, int8
         for (int i = 0; i < n; ++i) {
             L[i] += nmax;
         }
-        return sumlx / suml2;
+        return suml2 > 0.0f ? sumlx / suml2 : 0.0f;
     }
     for (int i = 0; i < n; ++i) {
         int l = nearest_int(iscale * x[i]);
@@ -901,7 +901,7 @@ static float make_qp_quants(int n, int nmax, const float * GGML_RESTRICT x, uint
     for (int i = 0; i < n; ++i) {
         max = MAX(max, x[i]);
     }
-    if (!max) { // all zero
+    if (max < GROUP_MAX_EPS) { // all zero
         for (int i = 0; i < n; ++i) { L[i] = 0; }
         return 0.f;
     }
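In both `make_q3_quants` and `make_qp_quants`, the returned `sumlx / suml2` has the form of a (possibly weighted) least-squares scale for approximating the row `x` by `scale * L`:

$$
\min_s \sum_i w_i\,(x_i - s\,L_i)^2 \;\Rightarrow\; s^{*}=\frac{\sum_i w_i\,x_i L_i}{\sum_i w_i\,L_i^2},
$$

so returning 0 when the denominator is not positive avoids a division by zero when every `L_i` is zero, and the `GROUP_MAX_EPS` check treats numerically tiny rows the same way as exactly-zero ones.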
@@ -966,7 +966,7 @@ static float make_qp_quants(int n, int nmax, const float * GGML_RESTRICT x, uint
             break;
         }
     }
-    return sumlx/suml2;
+    return suml2 > 0.0f ? sumlx / suml2 : 0.0f;
 }

 static void quantize_row_q2_K_impl(const float * GGML_RESTRICT x, block_q2_K * GGML_RESTRICT y, int k, const float * GGML_RESTRICT quant_weights) {
@@ -4266,7 +4266,7 @@ static void quantize_row_iq1_s_impl(const float * GGML_RESTRICT x, void * GGML_R
             sumw[j+1] = sumw[j] + weight[i];
         }
     }
-    float best_score = -FLT_MIN, scale = max;
+    float best_score = -FLT_MAX, scale = max;
     int besti1 = -1, besti2 = -1, best_shift = 0;
     for (int i1 = 0; i1 <= block_size; ++i1) {
         for (int i2 = i1; i2 <= block_size; ++i2) {
@@ -4442,7 +4442,7 @@ static void quantize_row_iq1_m_impl(const float * GGML_RESTRICT x, void * GGML_R
             idx[2*j] = j;
         }
         qsort(pairs, block_size, 2*sizeof(float), iq1_sort_helper);
-        float best_score = -FLT_MIN, scale = max;
+        float best_score = -FLT_MAX, scale = max;
         int besti1 = -1, besti2 = -1, best_k = -1;
         // 0: +, +
         // 1: +, -
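The `-FLT_MIN` to `-FLT_MAX` changes fix the sentinel used to initialize a best-score search: `-FLT_MIN` is only the negative of the smallest positive normal float (about -1.2e-38), so any legitimately negative score compares below it and would never be accepted, whereas `-FLT_MAX` behaves as the intended minus-infinity starting point. A minimal standalone C illustration (not part of the commit):

```c
// Demonstrates why -FLT_MIN is a broken "worst possible score" sentinel.
#include <float.h>
#include <stdio.h>

int main(void) {
    const float candidate = -0.5f;  // a plausible negative best_score candidate
    printf("-FLT_MIN = %g, -FLT_MAX = %g\n", (double)-FLT_MIN, (double)-FLT_MAX);
    printf("candidate > -FLT_MIN ? %d\n", candidate > -FLT_MIN); // 0: never accepted
    printf("candidate > -FLT_MAX ? %d\n", candidate > -FLT_MAX); // 1: accepted
    return 0;
}
```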
File diff suppressed because it is too large
@@ -1,22 +1,24 @@
 #version 450
+#extension GL_EXT_control_flow_attributes : enable
 
 #include "types.comp"
 
-#define BLOCK_SIZE 1024
+layout(constant_id = 0) const int BLOCK_SIZE = 1024;
+layout(constant_id = 1) const int BLOCK_SIZE_LOG2 = 10;
 #define ASC 0
 
-layout(local_size_x = BLOCK_SIZE, local_size_y = 1, local_size_z = 1) in;
+layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
 
 layout (binding = 0) readonly buffer A {A_TYPE data_a[];};
 layout (binding = 1) buffer D {int data_d[];};
 
 layout (push_constant) uniform parameter {
     uint ncols;
-    uint ncols_pad;
     uint order;
 } p;
 
 shared int dst_row[BLOCK_SIZE];
+shared A_TYPE a_sh[BLOCK_SIZE];
 
 void swap(uint idx0, uint idx1) {
     int tmp = dst_row[idx0];
@@ -24,7 +26,7 @@ void swap(uint idx0, uint idx1) {
     dst_row[idx1] = tmp;
 }
 
-void main() {
+void argsort(bool needs_bounds_check) {
     // bitonic sort
     const int col = int(gl_LocalInvocationID.x);
     const uint row = gl_WorkGroupID.y;
@@ -32,38 +34,46 @@ void main() {
     const uint row_offset = row * p.ncols;
 
     // initialize indices
-    if (col < p.ncols_pad) {
     dst_row[col] = col;
-    }
+    a_sh[col] = data_a[row_offset + col];
     barrier();
 
-    for (uint k = 2; k <= p.ncols_pad; k *= 2) {
-        for (uint j = k / 2; j > 0; j /= 2) {
-            const uint ixj = col ^ j;
-            if (col < p.ncols_pad && ixj > col) {
-                if ((col & k) == 0) {
-                    if (dst_row[col] >= p.ncols ||
-                        (dst_row[ixj] < p.ncols && (p.order == ASC ?
-                        data_a[row_offset + dst_row[col]] > data_a[row_offset + dst_row[ixj]] :
-                        data_a[row_offset + dst_row[col]] < data_a[row_offset + dst_row[ixj]]))
-                    ) {
-                        swap(col, ixj);
-                    }
-                } else {
-                    if (dst_row[ixj] >= p.ncols ||
-                        (dst_row[col] < p.ncols && (p.order == ASC ?
-                        data_a[row_offset + dst_row[col]] < data_a[row_offset + dst_row[ixj]] :
-                        data_a[row_offset + dst_row[col]] > data_a[row_offset + dst_row[ixj]]))
-                    ) {
-                        swap(col, ixj);
-                    }
-                }
-            }
+    uint num_outer_loop_iters = BLOCK_SIZE_LOG2;
+    [[unroll]] for (uint k = 2, outer_idx = 0; outer_idx < num_outer_loop_iters; k *= 2, outer_idx++) {
+        uint num_inner_loop_iters = outer_idx + 1;
+        [[unroll]] for (uint j = k / 2, inner_idx = 0; inner_idx < num_inner_loop_iters; j /= 2, inner_idx++) {
+            const int ixj = int(col ^ j);
+
+            int idx_0 = (col & k) == 0 ? col : ixj;
+            int idx_1 = (col & k) == 0 ? ixj : col;
+
+            int sh_idx_0 = dst_row[idx_0];
+            int sh_idx_1 = dst_row[idx_1];
+            bool idx_0_oob = needs_bounds_check ? sh_idx_0 >= p.ncols : false;
+            bool idx_1_oob = needs_bounds_check ? sh_idx_1 >= p.ncols : false;
+
+            if ((idx_0_oob ||
+                (!idx_1_oob && a_sh[sh_idx_0] > a_sh[sh_idx_1])) && (ixj > col)) {
+                swap(idx_0, idx_1);
             }
 
             barrier();
         }
     }
 
     if (col < p.ncols) {
+        if (p.order == ASC) {
             data_d[row_offset + col] = dst_row[col];
+        } else {
+            data_d[row_offset + p.ncols - col - 1] = dst_row[col];
+        }
+    }
+}
+
+void main() {
+    if (p.ncols == BLOCK_SIZE) {
+        argsort(false);
+    } else {
+        argsort(true);
     }
 }
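The argsort shader now takes its workgroup size from Vulkan specialization constants (`constant_id` 0 and 1) instead of a compile-time `#define`, and derives the bitonic-sort trip counts from `BLOCK_SIZE_LOG2`, so both loops have compile-time bounds and can be `[[unroll]]`ed; the old `ncols_pad` push constant is replaced by the `needs_bounds_check` flag that `main()` selects when `ncols` does not fill the whole block. For a power-of-two block of $N = 2^{\texttt{BLOCK\_SIZE\_LOG2}}$ elements the bitonic network performs

$$
\sum_{i=1}^{\log_2 N} i \;=\; \frac{\log_2 N \,(\log_2 N + 1)}{2}
$$

compare-exchange stages, e.g. $10 \cdot 11 / 2 = 55$ for the default $N = 1024$.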
@@ -0,0 +1,20 @@
#version 450

#include "generic_head.comp"
#include "types.comp"

#extension GL_EXT_control_flow_attributes : enable

layout(local_size_x = 512, local_size_y = 1, local_size_z = 1) in;

layout (binding = 0) readonly buffer X {A_TYPE data_a[];};
layout (binding = 1) writeonly buffer D {D_TYPE data_d[];};

void main() {
    const uint i = gl_GlobalInvocationID.z * 262144 + gl_GlobalInvocationID.y * 512 + gl_GlobalInvocationID.x;

    if (i >= p.KX) {
        return;
    }
    data_d[i] = D_TYPE(exp(float(data_a[i])));
}
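This new elementwise shader flattens the 3-D dispatch into a single index with

$$
i \;=\; 262144\,z + 512\,y + x \;=\; 512^2\,z + 512\,y + x,
$$

matching `local_size_x = 512`, and writes `exp()` of each input element, guarded by the `i >= p.KX` bounds check.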
Some files were not shown because too many files have changed in this diff