Add summary of optimizations/infra present in the repository

PiperOrigin-RevId: 754838402
Jan Wassenberg 2025-05-05 01:45:25 -07:00 committed by Copybara-Service
parent fe80f10ed7
commit a3caf6e5d2
1 changed file with 37 additions and 5 deletions

@@ -45,9 +45,41 @@ this invite link](https://discord.gg/H5jCBAWxAe). This project follows
 [Google's Open Source Community
 Guidelines](https://opensource.google.com/conduct/).
-*Active development is currently done on the `dev` branch. Please open pull
-requests targeting `dev` branch instead of `main`, which is intended to be more
-stable.*
+> [!NOTE] Active development is currently done on the `dev` branch. Please open
+> pull requests targeting `dev` branch instead of `main`, which is intended to
+> be more stable.
+## What's inside?
+- LLM
+  - CPU-only inference for: Gemma 1-3, Griffin (SSM), PaliGemma 1-2.
+  - Sampling with TopK and temperature.
+  - Backward pass (VJP) and Adam optimizer for Gemma research.
+- Optimizations
+  - Mixed-precision (fp8, bf16, fp32, fp64) GEMM:
+    - Designed for BF16 instructions, can efficiently emulate them.
+    - Automatic runtime autotuning of 7 parameters per matrix shape.
+  - Weight compression integrated directly into GEMM:
+    - Custom fp8 format with 2..3 mantissa bits; tensor scaling.
+    - Also bf16, f32 and non-uniform 4-bit (NUQ); easy to add new formats.
+- Infrastructure
+  - SIMD: single implementation via Highway. Chooses ISA at runtime.
+  - Tensor parallelism: CCX-aware, multi-socket thread pool.
+  - Disk I/O: memory map or parallel read (heuristic with user override).
+  - Custom format with forward/backward-compatible metadata serialization.
+  - Model conversion from Safetensors, not yet open sourced.
+  - Portability: Linux, Windows/OS X supported. CMake/Bazel. 'Any' CPU.
+- Frontends
+  - C++ APIs with streaming for single query and batched inference.
+  - Basic interactive command-line app.
+  - Basic Python bindings (pybind11).
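Editorial illustration (not part of the commit): the "Sampling with TopK and temperature" item in the list above can be sketched in a few lines of standard C++. This is a minimal sketch under stated assumptions, not the repository's implementation. The names `Candidate`, `TopKProbabilities` and `SampleTopK` are hypothetical, and the code uses only the standard library rather than Highway or any gemma.cpp API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical type for illustration: a token id and its probability.
struct Candidate {
  int token;
  float prob;
};

// Keeps the k largest logits, divides them by `temperature`, and applies a
// softmax over the kept entries only. Result is sorted by descending prob.
inline std::vector<Candidate> TopKProbabilities(
    const std::vector<float>& logits, std::size_t k, float temperature) {
  assert(k >= 1 && k <= logits.size());
  assert(temperature > 0.0f);

  // Indices of the k largest logits, in descending order of logit.
  std::vector<std::size_t> idx(logits.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::partial_sort(idx.begin(),
                    idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                    [&logits](std::size_t a, std::size_t b) {
                      return logits[a] > logits[b];
                    });

  // Subtract the maximum logit before exponentiating, for numerical stability.
  const float max_logit = logits[idx[0]];
  std::vector<Candidate> out(k);
  float sum = 0.0f;
  for (std::size_t i = 0; i < k; ++i) {
    const float p = std::exp((logits[idx[i]] - max_logit) / temperature);
    out[i] = Candidate{static_cast<int>(idx[i]), p};
    sum += p;
  }
  for (Candidate& c : out) c.prob /= sum;
  return out;
}

// Draws one token id from the temperature-scaled top-k distribution.
template <class Rng>
int SampleTopK(const std::vector<float>& logits, std::size_t k,
               float temperature, Rng& rng) {
  const std::vector<Candidate> cands =
      TopKProbabilities(logits, k, temperature);
  std::vector<float> weights;
  weights.reserve(cands.size());
  for (const Candidate& c : cands) weights.push_back(c.prob);
  std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
  return cands[dist(rng)].token;
}
```

With `k = 1` the sketch reduces to greedy decoding. A lower temperature concentrates probability on the highest-scoring tokens, and a higher one flattens the distribution over the k kept candidates.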
 ## Quick Start
@@ -411,7 +443,7 @@ newline input.
 By default, verbosity is set to 1, bringing up a terminal-based interactive
 interface when `gemma` is invoked:
-```console
+```sh
 $ ./gemma [...]
 __ _ ___ _ __ ___ _ __ ___ __ _ ___ _ __ _ __
 / _` |/ _ \ '_ ` _ \| '_ ` _ \ / _` | / __| '_ \| '_ \
@@ -481,7 +513,7 @@ cat configs.h | tail -n 35 | tr '\n' ' ' | xargs -0 echo "What does this C++ cod
 The output of the above command should look like:
-```console
+```sh
 [ Reading prompt ] [...]
 This C++ code snippet defines a set of **constants** used in a large language model (LLM) implementation, likely related to the **attention mechanism**.