mirror of https://github.com/google/gemma.cpp.git
Add summary of optimizations/infra present in the repository
PiperOrigin-RevId: 754838402
parent fe80f10ed7
commit a3caf6e5d2

README.md | 42
@@ -45,9 +45,41 @@ this invite link](https://discord.gg/H5jCBAWxAe). This project follows
 [Google's Open Source Community
 Guidelines](https://opensource.google.com/conduct/).
 
-*Active development is currently done on the `dev` branch. Please open pull
-requests targeting `dev` branch instead of `main`, which is intended to be more
-stable.*
+> [!NOTE] Active development is currently done on the `dev` branch. Please open
+> pull requests targeting `dev` branch instead of `main`, which is intended to
+> be more stable.
+
+## What's inside?
+
+- LLM
+  - CPU-only inference for: Gemma 1-3, Griffin (SSM), PaliGemma 1-2.
+  - Sampling with TopK and temperature.
+  - Backward pass (VJP) and Adam optimizer for Gemma research.
+
+- Optimizations
+  - Mixed-precision (fp8, bf16, fp32, fp64) GEMM:
+    - Designed for BF16 instructions; can efficiently emulate them.
+    - Automatic runtime autotuning of 7 parameters per matrix shape.
+  - Weight compression integrated directly into GEMM:
+    - Custom fp8 format with 2..3 mantissa bits; tensor scaling.
+    - Also bf16, f32, and non-uniform 4-bit (NUQ); easy to add new formats.
+
+- Infrastructure
+  - SIMD: single implementation via Highway. Chooses ISA at runtime.
+  - Tensor parallelism: CCX-aware, multi-socket thread pool.
+  - Disk I/O: memory map or parallel read (heuristic with user override).
+  - Custom format with forward/backward-compatible metadata serialization.
+  - Model conversion from Safetensors (not yet open sourced).
+  - Portability: Linux, Windows, OS X supported. CMake/Bazel. 'Any' CPU.
+
+- Frontends
+  - C++ APIs with streaming for single query and batched inference.
+  - Basic interactive command-line app.
+  - Basic Python bindings (pybind11).
+
 ## Quick Start
 
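The "Sampling with TopK and temperature" bullet added above refers to the usual decoding technique: keep only the k highest-logit tokens, rescale logits by 1/temperature, and sample from the renormalized distribution. Below is a minimal standalone C++ sketch of that technique, not the implementation in this repository; the function name and signature are illustrative only.

```cpp
// Generic top-k + temperature sampling (illustrative; not gemma.cpp code).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns the index of one sampled token given raw logits.
size_t SampleTopK(const std::vector<float>& logits, size_t k,
                  float temperature, std::mt19937& gen) {
  k = std::min(k, logits.size());
  // Rank token ids by logit and keep the k largest.
  std::vector<size_t> idx(logits.size());
  std::iota(idx.begin(), idx.end(), size_t{0});
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](size_t a, size_t b) { return logits[a] > logits[b]; });
  // Softmax over the top-k logits, scaled by 1/temperature.
  std::vector<double> weights(k);
  const double max_logit = logits[idx[0]];
  for (size_t i = 0; i < k; ++i) {
    weights[i] = std::exp((logits[idx[i]] - max_logit) / temperature);
  }
  // discrete_distribution normalizes the weights internally.
  std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
  return idx[dist(gen)];
}
```

Lower temperatures sharpen the distribution toward the argmax; higher temperatures and larger k make the output more diverse.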
@@ -411,7 +443,7 @@ newline input.
 By default, verbosity is set to 1, bringing up a terminal-based interactive
 interface when `gemma` is invoked:
 
-```console
+```sh
 $ ./gemma [...]
   __ _  ___ _ __ ___  _ __ ___   __ _   ___ _ __  _ __
  / _` |/ _ \ '_ ` _ \| '_ ` _ \ / _` | / __| '_ \| '_ \
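Regarding the "Designed for BF16 instructions; can efficiently emulate them" bullet in the section added above: on CPUs without native BF16 arithmetic, bf16 values can be widened to fp32 (bf16 occupies the upper 16 bits of an IEEE-754 binary32) and accumulated with ordinary fp32 operations. The following is a scalar sketch of that idea, not the Highway-based SIMD code in the repository; the type alias and function names are illustrative.

```cpp
// Scalar BF16 emulation (illustrative; not the Highway SIMD code in this repo).
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using BF16 = uint16_t;  // raw bf16 bit pattern

// bf16 is the upper 16 bits of an IEEE-754 float; round to nearest, ties to even.
BF16 ToBF16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  bits += 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<BF16>(bits >> 16);
}

// Widening back to fp32 is just a 16-bit shift.
float ToFloat(BF16 b) {
  const uint32_t bits = static_cast<uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Emulated bf16 dot product: widen to fp32 and accumulate in fp32.
float DotBF16(const std::vector<BF16>& a, const std::vector<BF16>& b) {
  float acc = 0.0f;
  for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
    acc += ToFloat(a[i]) * ToFloat(b[i]);
  }
  return acc;
}
```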
@@ -481,7 +513,7 @@ cat configs.h | tail -n 35 | tr '\n' ' ' | xargs -0 echo "What does this C++ cod
 
 The output of the above command should look like:
 
-```console
+```sh
 [ Reading prompt ] [...]
 This C++ code snippet defines a set of **constants** used in a large language model (LLM) implementation, likely related to the **attention mechanism**.
 
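As background for the "Custom fp8 format with 2..3 mantissa bits; tensor scaling" bullet in the README section added above: an 8-bit float stores a sign, a few exponent bits, and 2-3 mantissa bits, and a per-tensor scale maps weights into the format's usable range before encoding. The sketch below assumes a 1-4-3 (sign/exponent/mantissa) layout with bias 7 purely for illustration; it is not the fp8 format used by gemma.cpp, and the names (`EncodeFp8`, `ScaledTensor`, etc.) are hypothetical.

```cpp
// Illustrative 8-bit float with per-tensor scaling (assumed 1-4-3 layout,
// exponent bias 7); not the actual fp8 format used by gemma.cpp.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode |x| as a 4-bit biased exponent and 3-bit mantissa, plus a sign bit.
uint8_t EncodeFp8(float x) {
  const uint8_t sign = x < 0.0f ? 0x80 : 0x00;
  x = std::fabs(x);
  if (x == 0.0f) return sign;
  int exp;
  const float frac = std::frexp(x, &exp);  // x = frac * 2^exp, frac in [0.5, 1)
  const int biased = std::clamp(exp + 7, 0, 15);
  int mantissa = static_cast<int>(std::lround((frac * 2.0f - 1.0f) * 8.0f));
  mantissa = std::clamp(mantissa, 0, 7);  // saturate instead of carrying
  return sign | static_cast<uint8_t>(biased << 3) | static_cast<uint8_t>(mantissa);
}

float DecodeFp8(uint8_t v) {
  if ((v & 0x7F) == 0) return 0.0f;
  const float sign = (v & 0x80) ? -1.0f : 1.0f;
  const int biased = (v >> 3) & 0x0F;
  const int mantissa = v & 0x07;
  const float frac = 0.5f * (1.0f + mantissa / 8.0f);
  return sign * std::ldexp(frac, biased - 7);
}

// Per-tensor scaling: normalize by the max magnitude, store one float scale.
struct ScaledTensor {
  float scale = 1.0f;
  std::vector<uint8_t> data;
};

ScaledTensor Compress(const std::vector<float>& weights) {
  float max_abs = 0.0f;
  for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
  ScaledTensor t;
  t.scale = max_abs > 0.0f ? max_abs : 1.0f;
  t.data.reserve(weights.size());
  for (float w : weights) t.data.push_back(EncodeFp8(w / t.scale));
  return t;
}

float Decompress(const ScaledTensor& t, size_t i) {
  return t.scale * DecodeFp8(t.data[i]);
}
```

Decoding multiplies the stored scale back in, so the storage cost is one byte per weight plus a single float per tensor.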