Compare commits
7 Commits
b5e7285baf
...
04fbc5f23e
| Author | SHA1 | Date |
|---|---|---|
|
|
04fbc5f23e | |
|
|
f184dd9208 | |
|
|
422c2aff1c | |
|
|
8800226d65 | |
|
|
e689fc4e91 | |
|
|
a4ec34e1cd | |
|
|
de17e3f745 |
|
|
@ -91,6 +91,12 @@ jobs:
|
|||
echo "name=${SAFE_NAME}-b${BUILD_NUMBER}-${SHORT_HASH}" >> $GITHUB_OUTPUT
|
||||
fi
|
||||
|
||||
- name: Downcase github.repository_owner
|
||||
run: |
|
||||
echo "repository_owner_lowercase=${GITHUB_REPOSITORY_OWNER@L}" >> $GITHUB_ENV
|
||||
env:
|
||||
GITHUB_REPOSITORY_OWNER: '${{ github.repository_owner }}'
|
||||
|
||||
- name: Build and push Docker image (versioned)
|
||||
if: github.event_name == 'push'
|
||||
uses: docker/build-push-action@v4
|
||||
|
|
@ -98,7 +104,7 @@ jobs:
|
|||
context: .
|
||||
push: true
|
||||
platforms: ${{ matrix.config.platforms }}
|
||||
tags: "ghcr.io/${{ github.repository_owner }}/llama.cpp:${{ matrix.config.tag }}-${{ env.COMMIT_SHA }}"
|
||||
tags: "ghcr.io/${{ env.repository_owner_lowercase }}/llama.cpp:${{ matrix.config.tag }}-${{ env.COMMIT_SHA }}"
|
||||
file: ${{ matrix.config.dockerfile }}
|
||||
|
||||
- name: Build and push Docker image (tagged)
|
||||
|
|
@ -107,5 +113,5 @@ jobs:
|
|||
context: .
|
||||
push: ${{ github.event_name == 'push' }}
|
||||
platforms: ${{ matrix.config.platforms }}
|
||||
tags: "ghcr.io/${{ github.repository_owner }}/llama.cpp:${{ matrix.config.tag }},ghcr.io/${{ github.repository_owner }}/llama.cpp:${{ matrix.config.tag }}-${{ steps.tag.outputs.name }}"
|
||||
tags: "ghcr.io/${{ env.repository_owner_lowercase }}/llama.cpp:${{ matrix.config.tag }},ghcr.io/${{ env.repository_owner_lowercase }}/llama.cpp:${{ matrix.config.tag }}-${{ steps.tag.outputs.name }}"
|
||||
file: ${{ matrix.config.dockerfile }}
|
||||
|
|
|
|||
|
|
@ -68,7 +68,7 @@ It has the similar design of other llama.cpp BLAS-based paths such as *OpenBLAS,
|
|||
|
||||
| Intel GPU | Status | Verified Model |
|
||||
|-------------------------------|---------|---------------------------------------|
|
||||
| Intel Data Center Max Series | Support | Max 1550 |
|
||||
| Intel Data Center Max Series | Support | Max 1550, 1100 |
|
||||
| Intel Data Center Flex Series | Support | Flex 170 |
|
||||
| Intel Arc Series | Support | Arc 770, 730M |
|
||||
| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
|
||||
|
|
@ -84,8 +84,7 @@ It has the similar design of other llama.cpp BLAS-based paths such as *OpenBLAS,
|
|||
- **Execution Unit (EU)**
|
||||
- If the iGPU has less than 80 EUs, the inference speed will likely be too slow for practical use.
|
||||
|
||||
### Nvidia GPU
|
||||
The BLAS acceleration on Nvidia GPU through oneAPI can be obtained using the Nvidia plugins for oneAPI and the cuBLAS backend of the upstream oneMKL library. Details and instructions on how to setup the runtime and library can be found in [this section](#i-setup-environment)
|
||||
### Other Vendor GPU
|
||||
|
||||
**Verified devices**
|
||||
|
||||
|
|
@ -94,14 +93,9 @@ The BLAS acceleration on Nvidia GPU through oneAPI can be obtained using the Nvi
|
|||
| Ampere Series | Support | A100, A4000 |
|
||||
| Ampere Series *(Mobile)* | Support | RTX 40 Series |
|
||||
|
||||
*Notes:*
|
||||
- Support for Nvidia targets through oneAPI is currently limited to Linux platforms.
|
||||
|
||||
- Please make sure the native oneAPI MKL *(dedicated to intel CPUs and GPUs)* is not "visible" at this stage to properly setup and use the built-from-source oneMKL with cuBLAS backend in llama.cpp for Nvidia GPUs.
|
||||
|
||||
|
||||
## Docker
|
||||
The docker build option is currently limited to *intel GPU* targets.
|
||||
|
||||
### Build image
|
||||
```sh
|
||||
# Using FP16
|
||||
|
|
@ -168,29 +162,10 @@ Platform #0: Intel(R) OpenCL HD Graphics
|
|||
- **Nvidia GPU**
|
||||
|
||||
In order to target Nvidia GPUs through SYCL, please make sure the CUDA/CUBLAS native requirements *-found [here](README.md#cuda)-* are installed.
|
||||
Installation can be verified by running the following:
|
||||
```sh
|
||||
nvidia-smi
|
||||
```
|
||||
Please make sure at least one CUDA device is available, which can be displayed like this *(here an A100-40GB Nvidia GPU)*:
|
||||
```
|
||||
+---------------------------------------------------------------------------------------+
|
||||
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|
||||
|-----------------------------------------+----------------------+----------------------+
|
||||
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
|
||||
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|
||||
| | | MIG M. |
|
||||
|=========================================+======================+======================|
|
||||
| 0 NVIDIA A100-PCIE-40GB On | 00000000:8D:00.0 Off | 0 |
|
||||
| N/A 36C P0 57W / 250W | 4MiB / 40960MiB | 0% Default |
|
||||
| | | Disabled |
|
||||
+-----------------------------------------+----------------------+----------------------+
|
||||
```
|
||||
|
||||
|
||||
2. **Install Intel® oneAPI Base toolkit**
|
||||
|
||||
- **Base installation**
|
||||
- **For Intel GPU**
|
||||
|
||||
The base toolkit can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
|
||||
|
||||
|
|
@ -202,10 +177,10 @@ Upon a successful installation, SYCL is enabled for the available intel devices,
|
|||
|
||||
- **Adding support to Nvidia GPUs**
|
||||
|
||||
**oneAPI**: In order to enable SYCL support on Nvidia GPUs, please install the [Codeplay oneAPI Plugin for Nvidia GPUs](https://developer.codeplay.com/products/oneapi/nvidia/download). User should also make sure the plugin version matches the installed base toolkit one *(previous step)* for a seamless "oneAPI on Nvidia GPU" setup.
|
||||
**oneAPI Plugin**: In order to enable SYCL support on Nvidia GPUs, please install the [Codeplay oneAPI Plugin for Nvidia GPUs](https://developer.codeplay.com/products/oneapi/nvidia/download). User should also make sure the plugin version matches the installed base toolkit one *(previous step)* for a seamless "oneAPI on Nvidia GPU" setup.
|
||||
|
||||
|
||||
**oneMKL**: The current oneMKL releases *(shipped with the oneAPI base-toolkit)* do not contain the cuBLAS backend. A build from source of the upstream [oneMKL](https://github.com/oneapi-src/oneMKL) with the *cuBLAS* backend enabled is thus required to run it on Nvidia GPUs.
|
||||
**oneMKL for cuBlas**: The current oneMKL releases *(shipped with the oneAPI base-toolkit)* do not contain the cuBLAS backend. A build from source of the upstream [oneMKL](https://github.com/oneapi-src/oneMKL) with the *cuBLAS* backend enabled is thus required to run it on Nvidia GPUs.
|
||||
|
||||
```sh
|
||||
git clone https://github.com/oneapi-src/oneMKL
|
||||
|
|
@ -237,7 +212,7 @@ When targeting an intel GPU, the user should expect one or more level-zero devic
|
|||
|
||||
- **Nvidia GPU**
|
||||
|
||||
Similarly, user targetting Nvidia GPUs should expect at least one SYCL-CUDA device [`ext_oneapi_cuda:gpu`] as bellow:
|
||||
Similarly, user targeting Nvidia GPUs should expect at least one SYCL-CUDA device [`ext_oneapi_cuda:gpu`] as bellow:
|
||||
```
|
||||
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
|
||||
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
|
||||
|
|
@ -260,6 +235,9 @@ cmake --build .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icp
|
|||
|
||||
# Option 2: Use FP32 by default
|
||||
cmake --build .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
|
||||
|
||||
#build all binary
|
||||
cmake --build . --config Release -j -v
|
||||
```
|
||||
|
||||
#### Nvidia GPU
|
||||
|
|
@ -278,6 +256,10 @@ cmake --build .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=i
|
|||
|
||||
# Option 2: Use FP32 by default
|
||||
cmake --build .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
|
||||
|
||||
#build all binary
|
||||
cmake --build . --config Release -j -v
|
||||
|
||||
```
|
||||
|
||||
### III. Run the inference
|
||||
|
|
@ -357,7 +339,6 @@ Otherwise, you can run the script:
|
|||
|
||||
*Notes:*
|
||||
|
||||
- By default, `mmap` is used to read the model file. In some cases, it causes runtime hang issues. Please disable it by passing `--no-mmap` to the `/bin/main` if faced with the issue.
|
||||
- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
|
||||
|
||||
```sh
|
||||
|
|
@ -438,7 +419,7 @@ cd build
|
|||
|
||||
cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON
|
||||
|
||||
make
|
||||
make -j
|
||||
```
|
||||
|
||||
Otherwise, run the `win-build-sycl.bat` wrapper which encapsulates the former instructions:
|
||||
|
|
@ -525,7 +506,6 @@ Otherwise, run the following wrapper script:
|
|||
|
||||
Note:
|
||||
|
||||
- By default, `mmap` is used to read the model file. In some cases, it causes runtime hang issues. Please disable it by passing `--no-mmap` to the `main.exe` if faced with the issue.
|
||||
- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
|
||||
|
||||
```sh
|
||||
|
|
@ -557,12 +537,6 @@ use 1 SYCL GPUs: [0] with Max compute units:512
|
|||
|
||||
## Known Issues
|
||||
|
||||
- Hanging during startup
|
||||
|
||||
llama.cpp uses *mmap* as the default mode for reading the model file and copying it to the GPU. In some systems, `memcpy` might behave abnormally and therefore hang.
|
||||
|
||||
- **Solution**: add `--no-mmap` or `--mmap 0` flag to the `main` executable.
|
||||
|
||||
- `Split-mode:[row]` is not supported.
|
||||
|
||||
## Q&A
|
||||
|
|
@ -574,7 +548,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
|
|||
|
||||
- General compiler error:
|
||||
|
||||
- Remove build folder or try a clean-build.
|
||||
- Remove **build** folder or try a clean-build.
|
||||
|
||||
- I can **not** see `[ext_oneapi_level_zero:gpu]` afer installing the GPU driver on Linux.
|
||||
|
||||
|
|
|
|||
49
ci/run.sh
49
ci/run.sh
|
|
@ -153,6 +153,52 @@ function gg_sum_ctest_release {
|
|||
gg_printf '```\n'
|
||||
}
|
||||
|
||||
# test_scripts_debug
|
||||
|
||||
function gg_run_test_scripts_debug {
|
||||
cd ${SRC}
|
||||
|
||||
set -e
|
||||
|
||||
(cd ./examples/gguf-split && time bash tests.sh "$SRC/build-ci-debug/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
|
||||
|
||||
set +e
|
||||
}
|
||||
|
||||
function gg_sum_test_scripts_debug {
|
||||
gg_printf '### %s\n\n' "${ci}"
|
||||
|
||||
gg_printf 'Runs test scripts in debug mode\n'
|
||||
gg_printf '- status: %s\n' "$(cat $OUT/${ci}.exit)"
|
||||
gg_printf '```\n'
|
||||
gg_printf '%s\n' "$(cat $OUT/${ci}-scripts.log)"
|
||||
gg_printf '```\n'
|
||||
gg_printf '\n'
|
||||
}
|
||||
|
||||
# test_scripts_release
|
||||
|
||||
function gg_run_test_scripts_release {
|
||||
cd ${SRC}
|
||||
|
||||
set -e
|
||||
|
||||
(cd ./examples/gguf-split && time bash tests.sh "$SRC/build-ci-release/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
|
||||
|
||||
set +e
|
||||
}
|
||||
|
||||
function gg_sum_test_scripts_release {
|
||||
gg_printf '### %s\n\n' "${ci}"
|
||||
|
||||
gg_printf 'Runs test scripts in release mode\n'
|
||||
gg_printf '- status: %s\n' "$(cat $OUT/${ci}.exit)"
|
||||
gg_printf '```\n'
|
||||
gg_printf '%s\n' "$(cat $OUT/${ci}-scripts.log)"
|
||||
gg_printf '```\n'
|
||||
gg_printf '\n'
|
||||
}
|
||||
|
||||
function gg_get_model {
|
||||
local gguf_3b="$MNT/models/open-llama/3B-v2/ggml-model-f16.gguf"
|
||||
local gguf_7b="$MNT/models/open-llama/7B-v2/ggml-model-f16.gguf"
|
||||
|
|
@ -642,6 +688,9 @@ test $ret -eq 0 && gg_run ctest_release
|
|||
if [ -z ${GG_BUILD_LOW_PERF} ]; then
|
||||
test $ret -eq 0 && gg_run embd_bge_small
|
||||
|
||||
test $ret -eq 0 && gg_run test_scripts_debug
|
||||
test $ret -eq 0 && gg_run test_scripts_release
|
||||
|
||||
if [ -z ${GG_BUILD_VRAM_GB} ] || [ ${GG_BUILD_VRAM_GB} -ge 8 ]; then
|
||||
if [ -z ${GG_BUILD_CUDA} ]; then
|
||||
test $ret -eq 0 && gg_run open_llama_3b_v2
|
||||
|
|
|
|||
|
|
@ -43,17 +43,18 @@ AnyModel = TypeVar("AnyModel", bound="type[Model]")
|
|||
class Model(ABC):
|
||||
_model_classes: dict[str, type[Model]] = {}
|
||||
|
||||
def __init__(self, dir_model: Path, ftype: int, fname_out: Path, is_big_endian: bool):
|
||||
def __init__(self, dir_model: Path, ftype: int, fname_out: Path, is_big_endian: bool, use_temp_file: bool):
|
||||
self.dir_model = dir_model
|
||||
self.ftype = ftype
|
||||
self.fname_out = fname_out
|
||||
self.is_big_endian = is_big_endian
|
||||
self.endianess = gguf.GGUFEndian.BIG if is_big_endian else gguf.GGUFEndian.LITTLE
|
||||
self.use_temp_file = use_temp_file
|
||||
self.is_safetensors = self._is_model_safetensors()
|
||||
self.num_parts = Model.count_model_parts(self.dir_model, ".safetensors" if self.is_safetensors else ".bin")
|
||||
self.part_names = self._get_part_names()
|
||||
self.hparams = Model.load_hparams(self.dir_model)
|
||||
self.gguf_writer = gguf.GGUFWriter(fname_out, gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=False)
|
||||
self.gguf_writer = gguf.GGUFWriter(fname_out, gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=self.use_temp_file)
|
||||
self.block_count = self.find_hparam(["n_layers", "num_hidden_layers", "n_layer"])
|
||||
|
||||
@property
|
||||
|
|
@ -2459,6 +2460,7 @@ def parse_args() -> argparse.Namespace:
|
|||
"model", type=Path,
|
||||
help="directory containing model file",
|
||||
)
|
||||
parser.add_argument("--use-temp-file", action="store_true", help="use the tempfile library while processing (helpful when running out of memory, process killed)")
|
||||
|
||||
return parser.parse_args()
|
||||
|
||||
|
|
@ -2502,7 +2504,7 @@ def main() -> None:
|
|||
|
||||
with torch.inference_mode():
|
||||
model_class = Model.from_model_architecture(hparams["architectures"][0])
|
||||
model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian)
|
||||
model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file)
|
||||
|
||||
print("Set model parameters")
|
||||
model_instance.set_gguf_parameters()
|
||||
|
|
|
|||
|
|
@ -5,5 +5,6 @@ CLI to split / merge GGUF files.
|
|||
**Command line options:**
|
||||
|
||||
- `--split`: split GGUF to multiple GGUF, default operation.
|
||||
- `--split-max-size`: max size per split in `M` or `G`, f.ex. `500M` or `2G`.
|
||||
- `--split-max-tensors`: maximum tensors in each split: default(128)
|
||||
- `--merge`: merge multiple GGUF to a single GGUF.
|
||||
|
|
|
|||
|
|
@ -59,10 +59,10 @@ static size_t split_str_to_n_bytes(std::string str) {
|
|||
int n;
|
||||
if (str.back() == 'M') {
|
||||
sscanf(str.c_str(), "%d", &n);
|
||||
n_bytes = n * 1024 * 1024; // megabytes
|
||||
n_bytes = (size_t)n * 1024 * 1024; // megabytes
|
||||
} else if (str.back() == 'G') {
|
||||
sscanf(str.c_str(), "%d", &n);
|
||||
n_bytes = n * 1024 * 1024 * 1024; // gigabytes
|
||||
n_bytes = (size_t)n * 1024 * 1024 * 1024; // gigabytes
|
||||
} else {
|
||||
throw std::invalid_argument("error: supported units are M (megabytes) or G (gigabytes), but got: " + std::string(1, str.back()));
|
||||
}
|
||||
|
|
|
|||
|
|
@ -0,0 +1,89 @@
|
|||
#!/bin/bash
|
||||
|
||||
set -eu
|
||||
|
||||
if [ $# -lt 1 ]
|
||||
then
|
||||
echo "usage: $0 path_to_build_binary [path_to_temp_folder]"
|
||||
echo "example: $0 ../../build/bin ../../tmp"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [ $# -gt 1 ]
|
||||
then
|
||||
TMP_DIR=$2
|
||||
else
|
||||
TMP_DIR=/tmp
|
||||
fi
|
||||
|
||||
set -x
|
||||
|
||||
SPLIT=$1/gguf-split
|
||||
MAIN=$1/main
|
||||
WORK_PATH=$TMP_DIR/gguf-split
|
||||
CUR_DIR=$(pwd)
|
||||
|
||||
mkdir -p "$WORK_PATH"
|
||||
|
||||
# Clean up in case of previously failed test
|
||||
rm -f $WORK_PATH/ggml-model-split*.gguf $WORK_PATH/ggml-model-merge*.gguf
|
||||
|
||||
# 1. Get a model
|
||||
(
|
||||
cd $WORK_PATH
|
||||
"$CUR_DIR"/../../scripts/hf.sh --repo ggml-org/gemma-1.1-2b-it-Q8_0-GGUF --file gemma-1.1-2b-it.Q8_0.gguf
|
||||
)
|
||||
echo PASS
|
||||
|
||||
# 2. Split with max tensors strategy
|
||||
$SPLIT --split-max-tensors 28 $WORK_PATH/gemma-1.1-2b-it.Q8_0.gguf $WORK_PATH/ggml-model-split
|
||||
echo PASS
|
||||
echo
|
||||
|
||||
# 2b. Test the sharded model is loading properly
|
||||
$MAIN --model $WORK_PATH/ggml-model-split-00001-of-00006.gguf --random-prompt --n-predict 32
|
||||
echo PASS
|
||||
echo
|
||||
|
||||
# 3. Merge
|
||||
$SPLIT --merge $WORK_PATH/ggml-model-split-00001-of-00006.gguf $WORK_PATH/ggml-model-merge.gguf
|
||||
echo PASS
|
||||
echo
|
||||
|
||||
# 3b. Test the merged model is loading properly
|
||||
$MAIN --model $WORK_PATH/ggml-model-merge.gguf --random-prompt --n-predict 32
|
||||
echo PASS
|
||||
echo
|
||||
|
||||
# 4. Split with no tensor in metadata
|
||||
#$SPLIT --split-max-tensors 32 --no-tensor-in-metadata $WORK_PATH/ggml-model-merge.gguf $WORK_PATH/ggml-model-split-32-tensors
|
||||
#echo PASS
|
||||
#echo
|
||||
|
||||
# 4b. Test the sharded model is loading properly
|
||||
#$MAIN --model $WORK_PATH/ggml-model-split-32-tensors-00001-of-00006.gguf --random-prompt --n-predict 32
|
||||
#echo PASS
|
||||
#echo
|
||||
|
||||
# 5. Merge
|
||||
#$SPLIT --merge $WORK_PATH/ggml-model-split-32-tensors-00001-of-00006.gguf $WORK_PATH/ggml-model-merge-2.gguf
|
||||
#echo PASS
|
||||
#echo
|
||||
|
||||
# 5b. Test the merged model is loading properly
|
||||
#$MAIN --model $WORK_PATH/ggml-model-merge-2.gguf --random-prompt --n-predict 32
|
||||
#echo PASS
|
||||
#echo
|
||||
|
||||
# 6. Split with size strategy
|
||||
$SPLIT --split-max-size 2G $WORK_PATH/ggml-model-merge.gguf $WORK_PATH/ggml-model-split-2G
|
||||
echo PASS
|
||||
echo
|
||||
|
||||
# 6b. Test the sharded model is loading properly
|
||||
$MAIN --model $WORK_PATH/ggml-model-split-2G-00001-of-00002.gguf --random-prompt --n-predict 32
|
||||
echo PASS
|
||||
echo
|
||||
|
||||
# Clean up
|
||||
rm -f $WORK_PATH/ggml-model-split*.gguf $WORK_PATH/ggml-model-merge*.gguf
|
||||
|
|
@ -20,4 +20,4 @@ cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
|
|||
#cmake --build . --config Release --target llama-bench
|
||||
|
||||
#build all binary
|
||||
cmake --build . --config Release -v
|
||||
cmake --build . --config Release -j -v
|
||||
|
|
|
|||
|
|
@ -12,6 +12,7 @@ if [ $# -gt 0 ]; then
|
|||
GGML_SYCL_SINGLE_GPU=1
|
||||
else
|
||||
GGML_SYCL_DEVICE=0
|
||||
GGML_SYCL_SINGLE_GPU=0
|
||||
fi
|
||||
|
||||
#export GGML_SYCL_DEBUG=1
|
||||
|
|
|
|||
|
|
@ -20,11 +20,11 @@
|
|||
},
|
||||
"nixpkgs": {
|
||||
"locked": {
|
||||
"lastModified": 1712163089,
|
||||
"narHash": "sha256-Um+8kTIrC19vD4/lUCN9/cU9kcOsD1O1m+axJqQPyMM=",
|
||||
"lastModified": 1712791164,
|
||||
"narHash": "sha256-3sbWO1mbpWsLepZGbWaMovSO7ndZeFqDSdX0hZ9nVyw=",
|
||||
"owner": "NixOS",
|
||||
"repo": "nixpkgs",
|
||||
"rev": "fd281bd6b7d3e32ddfa399853946f782553163b5",
|
||||
"rev": "1042fd8b148a9105f3c0aca3a6177fd1d9360ba5",
|
||||
"type": "github"
|
||||
},
|
||||
"original": {
|
||||
|
|
|
|||
22
ggml-metal.m
22
ggml-metal.m
|
|
@ -37,6 +37,7 @@ enum ggml_metal_kernel_type {
|
|||
GGML_METAL_KERNEL_TYPE_DIV_ROW,
|
||||
GGML_METAL_KERNEL_TYPE_SCALE,
|
||||
GGML_METAL_KERNEL_TYPE_SCALE_4,
|
||||
GGML_METAL_KERNEL_TYPE_CLAMP,
|
||||
GGML_METAL_KERNEL_TYPE_TANH,
|
||||
GGML_METAL_KERNEL_TYPE_RELU,
|
||||
GGML_METAL_KERNEL_TYPE_GELU,
|
||||
|
|
@ -468,6 +469,7 @@ static struct ggml_metal_context * ggml_metal_init(int n_cb) {
|
|||
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_DIV_ROW, div_row, true);
|
||||
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_SCALE, scale, true);
|
||||
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_SCALE_4, scale_4, true);
|
||||
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_CLAMP, clamp, true);
|
||||
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_TANH, tanh, true);
|
||||
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_RELU, relu, true);
|
||||
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_GELU, gelu, true);
|
||||
|
|
@ -713,6 +715,7 @@ static bool ggml_metal_supports_op(const struct ggml_metal_context * ctx, const
|
|||
case GGML_OP_MUL:
|
||||
case GGML_OP_DIV:
|
||||
case GGML_OP_SCALE:
|
||||
case GGML_OP_CLAMP:
|
||||
case GGML_OP_SQR:
|
||||
case GGML_OP_SUM_ROWS:
|
||||
return true;
|
||||
|
|
@ -1154,6 +1157,25 @@ static enum ggml_status ggml_metal_graph_compute(
|
|||
|
||||
[encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
|
||||
} break;
|
||||
case GGML_OP_CLAMP:
|
||||
{
|
||||
id<MTLComputePipelineState> pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_CLAMP].pipeline;
|
||||
|
||||
float min;
|
||||
float max;
|
||||
memcpy(&min, ((int32_t *) dst->op_params) + 0, sizeof(float));
|
||||
memcpy(&max, ((int32_t *) dst->op_params) + 1, sizeof(float));
|
||||
|
||||
[encoder setComputePipelineState:pipeline];
|
||||
[encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
|
||||
[encoder setBuffer:id_dst offset:offs_dst atIndex:1];
|
||||
[encoder setBytes:&min length:sizeof(min) atIndex:2];
|
||||
[encoder setBytes:&max length:sizeof(max) atIndex:3];
|
||||
|
||||
const int64_t n = ggml_nelements(dst);
|
||||
|
||||
[encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
|
||||
} break;
|
||||
case GGML_OP_UNARY:
|
||||
switch (ggml_get_unary_op(gf->nodes[i])) {
|
||||
case GGML_UNARY_OP_TANH:
|
||||
|
|
|
|||
|
|
@ -213,6 +213,15 @@ kernel void kernel_scale_4(
|
|||
dst[tpig] = src0[tpig] * scale;
|
||||
}
|
||||
|
||||
kernel void kernel_clamp(
|
||||
device const float * src0,
|
||||
device float * dst,
|
||||
constant float & min,
|
||||
constant float & max,
|
||||
uint tpig[[thread_position_in_grid]]) {
|
||||
dst[tpig] = src0[tpig] < min ? min : (src0[tpig] > max ? max : src0[tpig]);
|
||||
}
|
||||
|
||||
kernel void kernel_relu(
|
||||
device const float * src0,
|
||||
device float * dst,
|
||||
|
|
|
|||
|
|
@ -3154,7 +3154,6 @@ typedef float (*vec_dot_q_mul_mat_sycl_t)(
|
|||
#define SYCL_SCALE_BLOCK_SIZE 256
|
||||
#define SYCL_CLAMP_BLOCK_SIZE 256
|
||||
#define SYCL_ROPE_BLOCK_SIZE 256
|
||||
#define SYCL_SOFT_MAX_BLOCK_SIZE 1024
|
||||
#define SYCL_ALIBI_BLOCK_SIZE 32
|
||||
#define SYCL_DIAG_MASK_INF_BLOCK_SIZE 32
|
||||
#define SYCL_QUANTIZE_BLOCK_SIZE 256
|
||||
|
|
@ -13080,11 +13079,13 @@ static void soft_max_f32_sycl(const float * x, const float * mask, const float *
|
|||
const int nrows_y, const float scale, const float max_bias,
|
||||
dpct::queue_ptr stream) {
|
||||
int nth = WARP_SIZE;
|
||||
while (nth < ncols_x && nth < SYCL_SOFT_MAX_BLOCK_SIZE) nth *= 2;
|
||||
int max_block_size = g_work_group_size;
|
||||
while (nth < ncols_x && nth < max_block_size) nth *= 2;
|
||||
if (nth>max_block_size) nth = max_block_size;
|
||||
|
||||
const sycl::range<3> block_dims(1, 1, nth);
|
||||
const sycl::range<3> block_nums(1, 1, nrows_x);
|
||||
const size_t n_local_scratch = (GGML_PAD(ncols_x, WARP_SIZE) + WARP_SIZE);
|
||||
static_assert(SYCL_SOFT_MAX_BLOCK_SIZE == 1024, "These values need to be adjusted.");
|
||||
|
||||
const uint32_t n_head_kv = nrows_x/nrows_y;
|
||||
const uint32_t n_head_log2 = 1u << (uint32_t) floorf(log2f((float) n_head_kv));
|
||||
|
|
@ -13094,6 +13095,12 @@ static void soft_max_f32_sycl(const float * x, const float * mask, const float *
|
|||
|
||||
const size_t local_mem_size = stream->get_device().get_info<sycl::info::device::local_mem_size>();
|
||||
if (n_local_scratch*sizeof(float) < local_mem_size) {
|
||||
if (ncols_x > max_block_size) {
|
||||
soft_max_f32_submitter<true, 0, 0>(x, mask, pos, dst, ncols_x, nrows_y, scale,
|
||||
max_bias, m0, m1, n_head_log2, block_nums,
|
||||
block_dims, n_local_scratch, stream);
|
||||
return;
|
||||
}
|
||||
switch (ncols_x) {
|
||||
case 32:
|
||||
soft_max_f32_submitter<true, 32, 32>(x, mask, pos, dst, ncols_x, nrows_y, scale,
|
||||
|
|
@ -16814,11 +16821,13 @@ static void ggml_backend_sycl_buffer_set_tensor(ggml_backend_buffer_t buffer,
|
|||
const dpct::queue_ptr stream = g_syclStreams[ctx->device][0];
|
||||
SYCL_CHECK(
|
||||
CHECK_TRY_ERROR(dpct::dev_mgr::instance().get_device(ctx->device).queues_wait_and_throw()));
|
||||
|
||||
char* host_buf = (char*)malloc(size);
|
||||
memcpy(host_buf, data, size);
|
||||
SYCL_CHECK(
|
||||
CHECK_TRY_ERROR((*stream)
|
||||
.memcpy((char *)tensor->data + offset, data, size)
|
||||
.memcpy((char *)tensor->data + offset, host_buf, size)
|
||||
.wait()));
|
||||
free(host_buf);
|
||||
}
|
||||
catch (sycl::exception const &exc) {
|
||||
std::cerr << exc.what() << "Exception caught at file:" << __FILE__
|
||||
|
|
|
|||
15
llama.cpp
15
llama.cpp
|
|
@ -16625,6 +16625,21 @@ static int32_t llama_chat_apply_template_internal(
|
|||
if (add_ass) {
|
||||
ss << "### Response:\n";
|
||||
}
|
||||
} else if (tmpl == "command-r" || (tmpl.find("<|START_OF_TURN_TOKEN|>") != std::string::npos && tmpl.find("<|USER_TOKEN|>") != std::string::npos)) {
|
||||
// CohereForAI/c4ai-command-r-plus
|
||||
for (auto message : chat) {
|
||||
std::string role(message->role);
|
||||
if (role == "system") {
|
||||
ss << "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>" << trim(message->content) << "<|END_OF_TURN_TOKEN|>";
|
||||
} else if (role == "user") {
|
||||
ss << "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>" << trim(message->content) << "<|END_OF_TURN_TOKEN|>";
|
||||
} else if (role == "assistant") {
|
||||
ss << "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>" << trim(message->content) << "<|END_OF_TURN_TOKEN|>";
|
||||
}
|
||||
}
|
||||
if (add_ass) {
|
||||
ss << "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>";
|
||||
}
|
||||
} else {
|
||||
// template not supported
|
||||
return -1;
|
||||
|
|
|
|||
|
|
@ -45,6 +45,8 @@ int main(void) {
|
|||
// Orca-Vicuna
|
||||
// No template included in tokenizer_config.json, so this template likely needs to be manually set.
|
||||
"{%- for message in messages %}{%- if message['role'] == 'system' -%}{{-'SYSTEM: ' + message['content'] + '\n' -}}{%- else -%}{%- if message['role'] == 'user' -%}{{-'USER: ' + message['content'] + '\n'-}}{%- else -%}{{-'ASSISTANT: ' + message['content'] + '</s>\n' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{-'ASSISTANT:'-}}{%- endif -%}",
|
||||
// CohereForAI/c4ai-command-r-plus
|
||||
"{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}"
|
||||
};
|
||||
std::vector<std::string> expected_output = {
|
||||
// teknium/OpenHermes-2.5-Mistral-7B
|
||||
|
|
@ -69,6 +71,8 @@ int main(void) {
|
|||
"You are a helpful assistant\n\nUSER: Hello\nASSISTANT: Hi there</s>\nUSER: Who are you\nASSISTANT: I am an assistant </s>\nUSER: Another question\nASSISTANT:",
|
||||
// Orca-Vicuna
|
||||
"SYSTEM: You are a helpful assistant\nUSER: Hello\nASSISTANT: Hi there</s>\nUSER: Who are you\nASSISTANT: I am an assistant </s>\nUSER: Another question\nASSISTANT:",
|
||||
// CohereForAI/c4ai-command-r-plus
|
||||
"<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Hi there<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Who are you<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>I am an assistant<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Another question<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>",
|
||||
};
|
||||
std::vector<char> formatted_chat(1024);
|
||||
int32_t res;
|
||||
|
|
|
|||
Loading…
Reference in New Issue