From 87a1c76578d8e59c33351c5a7299c3b9b730694c Mon Sep 17 00:00:00 2001
From: prajwalc22
Date: Tue, 15 Apr 2025 08:16:02 +0530
Subject: [PATCH] Update CMake configuration and documentation for --prompt
 flag

---
 CMakeLists.txt   |   2 +-
 README.md        | 596 ++---------------------------------------------
 build/.gitignore |   3 -
 3 files changed, 21 insertions(+), 580 deletions(-)
 delete mode 100644 build/.gitignore

diff --git a/CMakeLists.txt b/CMakeLists.txt
index b572835..b9558ad 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-cmake_minimum_required(VERSION 3.11)
+cmake_minimum_required(VERSION 3.11...4.0)

 include(FetchContent)

diff --git a/README.md b/README.md
index e9a6745..2c2020e 100644
--- a/README.md
+++ b/README.md
@@ -1,583 +1,27 @@
-# gemma.cpp
+---
+library_name: gemma.cpp
+license: gemma
+pipeline_tag: text-generation
+tags: []
+extra_gated_heading: Access Gemma on Hugging Face
+extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
+  agree to Google’s usage license. To do this, please ensure you’re logged-in to Hugging
+  Face and click below. Requests are processed immediately.
+extra_gated_button_content: Acknowledge license
+---

-gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma
-foundation models from Google.
+# Gemma Model Card

-For additional information about Gemma, see
-[ai.google.dev/gemma](https://ai.google.dev/gemma). Model weights, including
-gemma.cpp specific artifacts, are
-[available on kaggle](https://www.kaggle.com/models/google/gemma).
+**Model Page**: [Gemma](https://ai.google.dev/gemma/docs)

-## Who is this project for?
+This model card corresponds to the 2B base version of the Gemma model for usage with C++ (https://github.com/google/gemma.cpp). This is a compressed version of the weights, which will load, run, and download more quickly. For more information about the model, visit https://huggingface.co/google/gemma-2b.

-Modern LLM inference engines are sophisticated systems, often with bespoke
-capabilities extending beyond traditional neural network runtimes. With this
-comes opportunities for research and innovation through co-design of high level
-algorithms and low-level computation. However, there is a gap between
-deployment-oriented C++ inference runtimes, which are not designed for
-experimentation, and Python-centric ML research frameworks, which abstract away
-low-level computation through compilation.
+**Resources and Technical Documentation**:

-gemma.cpp provides a minimalist implementation of Gemma-1, Gemma-2, Gemma-3, and
-PaliGemma models, focusing on simplicity and directness rather than full
-generality. This is inspired by vertically-integrated model implementations such
-as [ggml](https://github.com/ggerganov/ggml),
-[llama.c](https://github.com/karpathy/llama2.c), and
-[llama.rs](https://github.com/srush/llama2.rs).
+* [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
+* [Gemma on Kaggle](https://www.kaggle.com/models/google/gemma)
+* [Gemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/335?version=gemma-2b-gg-hf)

-gemma.cpp targets experimentation and research use cases. It is intended to be
-straightforward to embed in other projects with minimal dependencies and also
-easily modifiable with a small ~2K LoC core implementation (along with ~4K LoC
-of supporting utilities). We use the [Google
-Highway](https://github.com/google/highway) Library to take advantage of
-portable SIMD for CPU inference.
+**Terms of Use**: [Terms](https://www.kaggle.com/models/google/gemma/license/consent/verify/huggingface?returnModelRepoId=google/gemma-2b-sfp-cpp)

-For production-oriented edge deployments we recommend standard deployment
-pathways using Python frameworks like JAX, Keras, PyTorch, and Transformers
-([all model variations here](https://www.kaggle.com/models/google/gemma)).
-
-## Contributing
-
-Community contributions large and small are welcome. See
-[DEVELOPERS.md](https://github.com/google/gemma.cpp/blob/main/DEVELOPERS.md)
-for additional notes for contributing developers and [join the discord by following
-this invite link](https://discord.gg/H5jCBAWxAe). This project follows
-[Google's Open Source Community
-Guidelines](https://opensource.google.com/conduct/).
-
-*Active development is currently done on the `dev` branch. Please open pull
-requests targeting the `dev` branch instead of `main`, which is intended to be more
-stable.*
-
-## Quick Start
-
-### System requirements
-
-Before starting, you should have installed:
-
-- [CMake](https://cmake.org/)
-- [Clang C++ compiler](https://clang.llvm.org/get_started.html), supporting at
-  least C++17.
-- `tar` for extracting archives from Kaggle.
-
-Building natively on Windows requires the Visual Studio 2022 Build Tools with the
-optional Clang/LLVM C++ frontend (`clang-cl`). This can be installed from the
-command line with
-[`winget`](https://learn.microsoft.com/en-us/windows/package-manager/winget/):
-
-```sh
-winget install --id Kitware.CMake
-winget install --id Microsoft.VisualStudio.2022.BuildTools --force --override "--passive --wait --add Microsoft.VisualStudio.Workload.VCTools;installRecommended --add Microsoft.VisualStudio.Component.VC.Llvm.Clang --add Microsoft.VisualStudio.Component.VC.Llvm.ClangToolset"
-```
-
-### Step 1: Obtain model weights and tokenizer from Kaggle or Hugging Face Hub
-
-Visit the
-[Kaggle page for Gemma-2](https://www.kaggle.com/models/google/gemma-2/gemmaCpp)
-[or Gemma-1](https://www.kaggle.com/models/google/gemma/frameworks/gemmaCpp),
-and select `Model Variations |> Gemma C++`.
-
-On this tab, the `Variation` dropdown includes the options below. Note bfloat16
-weights are higher fidelity, while 8-bit switched floating point weights enable
-faster inference. In general, we recommend starting with the `-sfp` checkpoints.
-
-If you are unsure which model to start with, we recommend starting with the
-smallest Gemma-2 model, i.e. `2.0-2b-it-sfp`.
-
-Alternatively, visit the
-[gemma.cpp](https://huggingface.co/models?other=gemma.cpp) models on the Hugging
-Face Hub. First go to the model repository of the model of interest (see
-recommendations below). Then, click the `Files and versions` tab and download
-the model and tokenizer files. For programmatic downloading, if you have
-`huggingface_hub` installed, you can also download by running:
-
-```
-huggingface-cli login # Just the first time
-huggingface-cli download google/gemma-2b-sfp-cpp --local-dir build/
-```
-
-Gemma-1 2B instruction-tuned (`it`) and pre-trained (`pt`) models:
-
-| Model name | Description |
-| ----------- | ----------- |
-| `2b-it` | 2 billion parameter instruction-tuned model, bfloat16 |
-| `2b-it-sfp` | 2 billion parameter instruction-tuned model, 8-bit switched floating point |
-| `2b-pt` | 2 billion parameter pre-trained model, bfloat16 |
-| `2b-pt-sfp` | 2 billion parameter pre-trained model, 8-bit switched floating point |
-
-Gemma-1 7B instruction-tuned (`it`) and pre-trained (`pt`) models:
-
-| Model name | Description |
-| ----------- | ----------- |
-| `7b-it` | 7 billion parameter instruction-tuned model, bfloat16 |
-| `7b-it-sfp` | 7 billion parameter instruction-tuned model, 8-bit switched floating point |
-| `7b-pt` | 7 billion parameter pre-trained model, bfloat16 |
-| `7b-pt-sfp` | 7 billion parameter pre-trained model, 8-bit switched floating point |
-
-> [!NOTE]
-> **Important**: We strongly recommend starting off with the `2b-it-sfp` model to
-> get up and running.
-
-Gemma 2 models are named `gemma2-2b-it` for 2B and `9b-it` or `27b-it`. See the
-`kModelFlags` definition in `common.cc`.
-
-### Step 2: Extract Files
-
-If you downloaded the models from Hugging Face, skip to step 3.
-
-After filling out the consent form, the download should proceed to retrieve a
-tar archive file `archive.tar.gz`. Extract files from `archive.tar.gz` (this can
-take a few minutes):
-
-```
-tar -xf archive.tar.gz
-```
-
-This should produce a file containing model weights such as `2b-it-sfp.sbs` and
-a tokenizer file (`tokenizer.spm`). You may want to move these files to a
-convenient directory location (e.g. the `build/` directory in this repo).
-
-### Step 3: Build
-
-The build system uses [CMake](https://cmake.org/). To build the gemma inference
-runtime, create a build directory and generate the build files using `cmake`
-from the top-level project directory. Note that if you previously ran `cmake` and are
-re-running with a different setting, be sure to delete all files in the `build/`
-directory with `rm -rf build/*`.
-
-#### Unix-like Platforms
-```sh
-cmake -B build
-```
-
-After running `cmake`, you can either enter the `build/` directory and run `make`,
-or use the CMake presets to build the `./gemma` executable:
-
-```sh
-# Configure `build` directory
-cmake --preset make
-
-# Build project using make
-cmake --build --preset make -j [number of parallel threads to use]
-```
-
-Replace `[number of parallel threads to use]` with a number - the number of
-cores available on your system is a reasonable heuristic. For example,
-`make -j4 gemma` will build using 4 threads. If the `nproc` command is
-available, you can use `make -j$(nproc) gemma` as a reasonable default
-for the number of threads.
-
-If you aren't sure of the right value for the `-j` flag, you can simply run
-`make gemma` instead and it should still build the `./gemma` executable.
-
-> [!NOTE]
-> On Windows Subsystem for Linux (WSL), users should set the number of
-> parallel threads to 1. Using a larger number may result in errors.
-
-If the build is successful, you should now have a `gemma` executable in the `build/` directory.
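As an editorial aside, here is one minimal copy-pasteable sketch of the Unix-like build above; it assumes you are in the repository root and that `nproc` is available (both are assumptions, not part of the original instructions):

```sh
# Sketch: configure once, then build only the gemma target with all cores.
cmake -B build
cmake --build build --target gemma -j"$(nproc)"

# The executable should now be present for Step 4:
ls -lh build/gemma
```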
-
-#### Windows
-
-```sh
-# Configure `build` directory
-cmake --preset windows
-
-# Build project using Visual Studio Build Tools
-cmake --build --preset windows -j [number of parallel threads to use]
-```
-
-If the build is successful, you should now have a `gemma.exe` executable in the `build/` directory.
-
-#### Bazel
-
-```sh
-bazel build -c opt --cxxopt=-std=c++20 :gemma
-```
-
-If the build is successful, you should now have a `gemma` executable in the `bazel-bin/` directory.
-
-#### Make
-
-If you prefer Makefiles, @jart has made one available here:
-
-https://github.com/jart/gemma3/blob/main/Makefile
-
-### Step 4: Run
-
-You can now run `gemma` from inside the `build/` directory.
-
-`gemma` has the following required arguments:
-
-Argument        | Description                  | Example value
---------------- | ---------------------------- | -----------------------
-`--model`       | The model type.              | `2b-it` ... (see below)
-`--weights`     | The compressed weights file. | `2b-it-sfp.sbs`
-`--weight_type` | The compressed weight type.  | `sfp`
-`--tokenizer`   | The tokenizer file.          | `tokenizer.spm`
-
-`gemma` is invoked as:
-
-```sh
-./gemma \
---tokenizer [tokenizer file] \
---weights [compressed weights file] \
---weight_type [f32 or bf16 or sfp (default:sfp)] \
---model [2b-it or 2b-pt or 7b-it or 7b-pt or ...]
-```
-
-Example invocation for the following configuration:
-
-- Compressed weights file `2b-it-sfp.sbs` (2B instruction-tuned model, 8-bit
-  switched floating point).
-- Tokenizer file `tokenizer.spm`.
-
-```sh
-./gemma \
---tokenizer tokenizer.spm \
---weights 2b-it-sfp.sbs --model 2b-it
-```
-
-### RecurrentGemma
-
-This repository includes a version of Gemma based on Griffin
-([paper](https://arxiv.org/abs/2402.19427),
-[code](https://github.com/google-deepmind/recurrentgemma)). Its architecture
-includes both recurrent layers and local attention, thus it is more efficient
-for longer sequences and has a smaller memory footprint than standard Gemma. We
-here provide a C++ implementation of this model based on the paper.
-
-To use the recurrent version of Gemma included in this repository, build the
-gemma binary as noted above in Step 3. Download the compressed weights and
-tokenizer from the RecurrentGemma
-[Kaggle](https://www.kaggle.com/models/google/recurrentgemma/gemmaCpp) as in
-Step 1, and run the binary as follows:
-
-`./gemma --tokenizer tokenizer.spm --model gr2b-it --weights 2b-it-sfp.sbs`
-
-### PaliGemma Vision-Language Model
-
-This repository includes a version of the PaliGemma VLM
-([paper](https://arxiv.org/abs/2407.07726),
-[code](https://github.com/google-research/big_vision/tree/main/big_vision/configs/proj/paligemma))
-and its successor PaliGemma 2 ([paper](https://arxiv.org/abs/2412.03555)). We
-provide a C++ implementation of the PaliGemma model family here.
-
-To use the version of PaliGemma included in this repository, build the gemma
-binary as noted above in Step 3. Download the compressed weights and tokenizer
-from
-[Kaggle](https://www.kaggle.com/models/google/paligemma/gemmaCpp/paligemma-3b-mix-224)
-and run the binary as follows:
-
-```sh
-./gemma \
---tokenizer paligemma_tokenizer.model \
---model paligemma-224 \
---weights paligemma-3b-mix-224-sfp.sbs \
---image_file paligemma/testdata/image.ppm
-```
-
-Note that the image reading code is very basic to avoid depending on an image
-processing library for now. We currently only support reading binary PPMs (P6).
-So use a tool like `convert` to first convert your images into that format, e.g.
-
-`convert image.jpeg -resize 224x224^ image.ppm`
-
-(As the image will be resized for processing anyway, we can already resize at
-this stage for slightly faster loading.)
-
-The interaction with the image (using the mix-224 checkpoint) may then look
-something like this:
-
-```
-> Describe the image briefly
-A large building with two towers in the middle of a city.
-> What type of building is it?
-church
-> What color is the church?
-gray
-> caption image
-A large building with two towers stands tall on the water's edge. The building
-has a brown roof and a window on the side. A tree stands in front of the
-building, and a flag waves proudly from its top. The water is calm and blue,
-reflecting the sky above. A bridge crosses the water, and a red and white boat
-rests on its surface. The building has a window on the side, and a flag on top.
-A tall tree stands in front of the building, and a window on the building is
-visible from the water. The water is green, and the sky is blue.
-```
-
-### Migrating to single-file format
-
-There is now a new format for the weights file: a single file that can also
-contain the tokenizer (and the model type) directly. A tool to migrate
-from the multi-file format to the single-file format is available.
-
-```sh
-compression/migrate_weights \
-  --tokenizer .../tokenizer.spm --weights .../gemma2-2b-it-sfp.sbs \
-  --model gemma2-2b-it --output_weights .../gemma2-2b-it-sfp-single.sbs
-```
-
-After migration, you can use the new weights file with gemma.cpp like this:
-
-```sh
-./gemma --weights .../gemma2-2b-it-sfp-single.sbs
-```
-
-### Troubleshooting and FAQs
-
-**Running `./gemma` fails with "Failed to read cache gating_ein_0 (error 294) ..."**
-
-The most common problem is that the `--weight_type` argument does not match that
-of the model file. Revisit step #1 and check which weights you downloaded.
-
-Note that we have already moved weight type from a compile-time decision to a
-runtime argument. In a subsequent step, we plan to bake this information into
-the weights.
-
-**Problems building in Windows / Visual Studio**
-
-Currently if you're using Windows, we recommend building in WSL (Windows
-Subsystem for Linux). We are exploring options to enable other build
-configurations, see issues for active discussion.
-
-**Model does not respond to instructions and produces strange output**
-
-A common issue is that you are using a pre-trained model, which is not
-instruction-tuned and thus does not respond to instructions. Make sure you are
-using an instruction-tuned model (`2b-it-sfp`, `2b-it`, `7b-it-sfp`, `7b-it`)
-and not a pre-trained model (any model with a `-pt` suffix).
-
-**What sequence lengths are supported?**
-
-See `seq_len` in `configs.cc`. For the Gemma 3 models larger than 1B, this is
-typically 32K but 128K would also work given enough RAM. Note that long
-sequences will be slow due to the quadratic cost of attention.
-
-**How do I convert my fine-tune to a `.sbs` compressed model file?**
-
-For PaliGemma (1 and 2) checkpoints, you can use
-python/convert_from_safetensors.py to convert from safetensors format (tested
-with building via bazel). For an adapter model, you will likely need to call
-merge_and_unload() to convert the adapter model to a single-file format before
-converting it.
-
-Here is how to use it with a bazel build of the compression library assuming
-locally installed (venv) torch, numpy, safetensors, absl-py, etc.:
-
-```sh
-bazel build //compression/python:compression
-BAZEL_OUTPUT_DIR="${PWD}/bazel-bin/compression"
-python3 -c "import site; print(site.getsitepackages())"
-# Use your site-packages directory here:
-ln -s $BAZEL_OUTPUT_DIR [...]/site-packages/compression
-python3 python/convert_from_safetensors.py --load_path [...].safetensors.index.json
-```
-
-See also compression/convert_weights.py for a slightly older option to convert a
-PyTorch checkpoint. (The code may need updates to work with Gemma-2 models.)
-
-**What are some easy ways to make the model run faster?**
-
-1. Make sure you are using the 8-bit switched floating point `-sfp` models.
-   These are half the size of bf16 and thus use less memory bandwidth and cache
-   space.
-2. If you're on a laptop, make sure power mode is set to maximize performance
-   and power saving mode is **off**. For most laptops, the power saving modes get
-   activated automatically if the computer is not plugged in.
-3. Close other unused CPU-intensive applications.
-4. On Macs, anecdotally we observe a "warm-up" ramp-up in speed as performance
-   cores get engaged.
-5. Experiment with the `--num_threads` argument value. Depending on the device,
-   larger numbers don't always mean better performance.
-
-We're also working on algorithmic and optimization approaches for faster
-inference; stay tuned.
-
-## Usage
-
-`gemma` has different usage modes, controlled by the verbosity flag.
-
-All usage modes are currently interactive, triggering text generation upon
-newline input.
-
-| Verbosity       | Usage mode | Details                                       |
-| --------------- | ---------- | --------------------------------------------- |
-| `--verbosity 0` | Minimal    | Only prints generation output. Suitable as a CLI tool. |
-| `--verbosity 1` | Default    | Standard user-facing terminal UI.             |
-| `--verbosity 2` | Detailed   | Shows additional developer and debug info.    |
-
-### Interactive Terminal App
-
-By default, verbosity is set to 1, bringing up a terminal-based interactive
-interface when `gemma` is invoked:
-
-```console
-$ ./gemma [...]
-  __ _  ___ _ __ ___  _ __ ___   __ _   ___ _ __  _ __
- / _` |/ _ \ '_ ` _ \| '_ ` _ \ / _` | / __| '_ \| '_ \
-| (_| |  __/ | | | | | | | | | | (_| || (__| |_) | |_) |
- \__, |\___|_| |_| |_|_| |_| |_|\__,_(_)___| .__/| .__/
-  __/ |                                    | |   | |
- |___/                                     |_|   |_|
-
-tokenizer            : tokenizer.spm
-compressed_weights   : 2b-it-sfp.sbs
-model                : 2b-it
-weights              : [no path specified]
-max_generated_tokens : 2048
-
-*Usage*
-  Enter an instruction and press enter (%C reset conversation, %Q quits).
-
-*Examples*
-  - Write an email to grandma thanking her for the cookies.
-  - What are some historical attractions to visit around Massachusetts?
-  - Compute the nth fibonacci number in javascript.
-  - Write a standup comedy bit about WebGPU programming.
-
-> What are some outdoorsy places to visit around Boston?
-
-[ Reading prompt ] .....................
-
-
-**Boston Harbor and Islands:**
-
-* **Boston Harbor Islands National and State Park:** Explore pristine beaches, wildlife, and maritime history.
-* **Charles River Esplanade:** Enjoy scenic views of the harbor and city skyline.
-* **Boston Harbor Cruise Company:** Take a relaxing harbor cruise and admire the city from a different perspective.
-* **Seaport Village:** Visit a charming waterfront area with shops, restaurants, and a seaport museum.
-
-**Forest and Nature:**
-
-* **Forest Park:** Hike through a scenic forest with diverse wildlife.
-* **Quabbin Reservoir:** Enjoy boating, fishing, and hiking in a scenic setting.
-* **Mount Forest:** Explore a mountain with breathtaking views of the city and surrounding landscape.
-
-...
-```
-
-### Usage as a Command Line Tool
-
-To use the `gemma` executable as a command line tool, it may be useful to
-create an alias for gemma.cpp with arguments fully specified:
-
-```sh
-alias gemma2b="~/gemma.cpp/build/gemma -- --tokenizer ~/gemma.cpp/build/tokenizer.spm --weights ~/gemma.cpp/build/gemma2-2b-it-sfp.sbs --model gemma2-2b-it --verbosity 0"
-```
-
-Replace the above paths with your own paths to the model and tokenizer files
-from the download.
-
-Here is an example of prompting `gemma` with a truncated input
-file (using a `gemma2b` alias as defined above):
-
-```sh
-cat configs.h | tail -n 35 | tr '\n' ' ' | xargs -0 echo "What does this C++ code do: " | gemma2b
-```
-
-> [!NOTE]
-> CLI usage of gemma.cpp is experimental and should take context length
-> limitations into account.
-
-The output of the above command should look like:
-
-```console
-[ Reading prompt ] [...]
-This C++ code snippet defines a set of **constants** used in a large language model (LLM) implementation, likely related to the **attention mechanism**.
-
-Let's break down the code:
-[...]
-```
-
-### Incorporating gemma.cpp as a Library in your Project
-
-The easiest way to incorporate gemma.cpp in your own project is to pull in
-gemma.cpp and dependencies using `FetchContent`. You can add the following to your
-CMakeLists.txt:
-
-```
-include(FetchContent)
-
-FetchContent_Declare(sentencepiece GIT_REPOSITORY https://github.com/google/sentencepiece GIT_TAG 53de76561cfc149d3c01037f0595669ad32a5e7c)
-FetchContent_MakeAvailable(sentencepiece)
-
-FetchContent_Declare(gemma GIT_REPOSITORY https://github.com/google/gemma.cpp GIT_TAG origin/main)
-FetchContent_MakeAvailable(gemma)
-
-FetchContent_Declare(highway GIT_REPOSITORY https://github.com/google/highway.git GIT_TAG da250571a45826b21eebbddc1e50d0c1137dee5f)
-FetchContent_MakeAvailable(highway)
-```
-
-Note for the gemma.cpp `GIT_TAG`, you may replace `origin/main` with a specific
-commit hash if you would like to pin the library version.
-
-After your executable is defined (substitute your executable name for
-`[Executable Name]` below):
-
-```
-target_link_libraries([Executable Name] libgemma hwy hwy_contrib sentencepiece)
-FetchContent_GetProperties(gemma)
-FetchContent_GetProperties(sentencepiece)
-target_include_directories([Executable Name] PRIVATE ${gemma_SOURCE_DIR})
-target_include_directories([Executable Name] PRIVATE ${sentencepiece_SOURCE_DIR})
-```
-
-### Building gemma.cpp as a Library
-
-gemma.cpp can also be used as a library dependency in your own project. The
-library artifact can be built by modifying the make invocation to build
-the `libgemma` target instead of `gemma`.
-
-> [!NOTE]
-> If you are using gemma.cpp in your own project with the `FetchContent` steps
-> in the previous section, building the library is done automatically by `cmake`
-> and this section can be skipped.
-
-First, run `cmake`:
-
-```sh
-cmake -B build
-```
-
-Then, run `make` with the `libgemma` target:
-
-```sh
-cd build
-make -j [number of parallel threads to use] libgemma
-```
-
-If this is successful, you should now have a `libgemma` library file in the
-`build/` directory. On Unix platforms, the filename is `libgemma.a`.
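As another editorial aside, the same library build can also be driven through `cmake --build` instead of invoking `make` directly; this is a small sketch under assumptions not stated above (a Unix-like system with `nproc` available), and it only verifies that the archive was produced:

```sh
# Sketch: configure, build only the libgemma target, then check the archive.
cmake -B build
cmake --build build --target libgemma -j"$(nproc)"
ls -lh build/libgemma.a
```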
-
-## Independent Projects Using gemma.cpp
-
-Some independent projects using gemma.cpp:
-
-- [gemma-cpp-python - Python bindings](https://github.com/namtranase/gemma-cpp-python)
-- [lua-cgemma - Lua bindings](https://github.com/ufownl/lua-cgemma)
-- [Godot engine demo project](https://github.com/Rliop913/Gemma-godot-demo-project)
-
-If you would like to have your project included, feel free to get in touch or
-submit a PR with a `README.md` edit.
-
-## Acknowledgements and Contacts
-
-gemma.cpp was started in fall 2023 by [Austin Huang](mailto:austinvhuang@google.com)
-and [Jan Wassenberg](mailto:janwas@google.com), and subsequently released February 2024
-thanks to contributions from Phil Culliton, Paul Chang, and Dan Zheng.
-
-Griffin support was implemented in April 2024 thanks to contributions by Andrey
-Mikhaylov, Eugene Kliuchnikov, Jan Wassenberg, Jyrki Alakuijala, Lode
-Vandevenne, Luca Versari, Martin Bruse, Phil Culliton, Sami Boukortt, Thomas
-Fischbacher and Zoltan Szabadka.
-
-Gemma-2 support was implemented in June/July 2024 with the help of several
-people.
-
-PaliGemma support was implemented in September 2024 with contributions from
-Daniel Keysers.
-
-[Jan Wassenberg](mailto:janwas@google.com) has continued to contribute many
-improvements, including major gains in efficiency, since the initial release.
-
-This is not an officially supported Google product.
+**Authors**: Google
\ No newline at end of file
diff --git a/build/.gitignore b/build/.gitignore
deleted file mode 100644
index 3822a0b..0000000
--- a/build/.gitignore
+++ /dev/null
@@ -1,3 +0,0 @@
-*
-!.gitignore
-!.hgignore
\ No newline at end of file