mirror of https://github.com/google/gemma.cpp.git
Add note on attention length and SFP
PiperOrigin-RevId: 738698399
parent 3d419ec173
commit 83219e3c68
README.md | 26 +++++++++++++++++---------
```diff
@@ -347,6 +347,12 @@ instruction-tuned and thus does not respond to instructions. Make sure you are
 using an instruction-tuned model (`2b-it-sfp`, `2b-it`, `7b-it-sfp`, `7b-it`)
 and not a pre-trained model (any model with a `-pt` suffix).
 
+**What sequence lengths are supported?**
+
+See `seq_len` in `configs.cc`. For the Gemma 3 models larger than 1B, this is
+typically 32K but 128K would also work given enough RAM. Note that long
+sequences will be slow due to the quadratic cost of attention.
+
 **How do I convert my fine-tune to a `.sbs` compressed model file?**
 
 For PaliGemma (1 and 2) checkpoints, you can use
```
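Since the new FAQ answer turns on the quadratic cost of attention, a small standalone estimate may make it concrete: quadrupling the sequence length from 32K to 128K multiplies the per-layer attention work by 16. This is an illustrative sketch only; `kHeads` and `kQKVDim` are hypothetical placeholders, not values from `configs.cc`.

```c++
// Rough cost model for attention: the score matrix has seq_len * seq_len
// entries per head, so 32K -> 128K (4x the length) scales this term by 16.
// kHeads and kQKVDim are hypothetical placeholders, not gemma.cpp values.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t kHeads = 16;    // placeholder head count
  const uint64_t kQKVDim = 256;  // placeholder per-head dimension
  for (const uint64_t seq_len : {uint64_t{32768}, uint64_t{131072}}) {
    // Multiply-adds for Q*K^T plus scores*V over all query positions.
    const uint64_t mads = 2 * kHeads * seq_len * seq_len * kQKVDim;
    std::printf("seq_len=%8llu -> ~%6.1f TFLOPs of attention per layer\n",
                static_cast<unsigned long long>(seq_len), 2.0 * mads / 1e12);
  }
  return 0;
}
```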
```diff
@@ -372,15 +378,17 @@ pytorch checkpoint. (The code may need updates to work with Gemma-2 models.)
 
 **What are some easy ways to make the model run faster?**
 
-1. Make sure you are using the 8-bit switched floating point `-sfp` models.
-2. If you're on a laptop, make sure power mode is set to maximize performance
-   and saving mode is **off**. For most laptops, the power saving modes get
-   activated automatically if the computer is not plugged in.
-3. Close other unused cpu-intensive applications.
-4. On macs, anecdotally we observe a "warm-up" ramp-up in speed as performance
-   cores get engaged.
-5. Experiment with the `--num_threads` argument value. Depending on the device,
-   larger numbers don't always mean better performance.
+1. Make sure you are using the 8-bit switched floating point `-sfp` models.
+   These are half the size of bf16 and thus use less memory bandwidth and cache
+   space.
+2. If you're on a laptop, make sure power mode is set to maximize performance
+   and saving mode is **off**. For most laptops, the power saving modes get
+   activated automatically if the computer is not plugged in.
+3. Close other unused cpu-intensive applications.
+4. On macs, anecdotally we observe a "warm-up" ramp-up in speed as performance
+   cores get engaged.
+5. Experiment with the `--num_threads` argument value. Depending on the device,
+   larger numbers don't always mean better performance.
 
 We're also working on algorithmic and optimization approaches for faster
 inference, stay tuned.
```
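On the first tip: the diff's new wording explains that `-sfp` weights are half the size of bf16. As a generic illustration of how any 8-bit floating-point encoding halves storage at the cost of precision, here is a toy 1-4-3 minifloat round-trip. It is emphatically not the actual SFP codec in gemma.cpp, just a sketch of the size/precision trade.

```c++
// Toy 8-bit minifloat (1 sign, 4 exponent, 3 mantissa bits) round-trip.
// A generic illustration of 8-bit weight storage being half the size of
// bf16 -- NOT the actual SFP format used by gemma.cpp.
// (Zero, denormals, inf and NaN are deliberately not handled.)
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

uint8_t Encode(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  const uint32_t sign = bits >> 31;
  int exp = static_cast<int>((bits >> 23) & 0xFF) - 127;  // unbias fp32 exp
  const uint32_t mant = (bits >> 20) & 0x7;   // keep top 3 mantissa bits
  exp = exp < -7 ? -7 : (exp > 8 ? 8 : exp);  // clamp to what 4 bits hold
  return static_cast<uint8_t>((sign << 7) | ((exp + 7) << 3) | mant);
}

float Decode(uint8_t b) {
  const float sign = (b & 0x80) ? -1.0f : 1.0f;
  const int exp = static_cast<int>((b >> 3) & 0xF) - 7;
  const float mant = 1.0f + static_cast<float>(b & 0x7) / 8.0f;  // implicit 1
  return sign * std::ldexp(mant, exp);
}

int main() {
  for (const float w : {0.73f, -0.032f, 1.5f}) {
    const uint8_t enc = Encode(w);
    std::printf("%+.4f -> 0x%02x -> %+.4f\n", w,
                static_cast<unsigned>(enc), Decode(enc));
  }
  return 0;
}
```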
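On the `--num_threads` tip: single-stream decoding is largely limited by memory bandwidth, so once bandwidth is saturated, extra threads add coordination overhead rather than speed. A self-contained way to observe that ceiling on your own machine (this benchmarks a toy kernel, not gemma.cpp's internal thread pool):

```c++
// Toy memory-bound kernel timed at several thread counts. Throughput
// typically plateaus well before the largest thread count, which is why
// larger --num_threads values are not always faster.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const size_t n = size_t{1} << 25;  // 32M floats: two 128 MiB buffers
  std::vector<float> x(n, 1.0f), y(n, 2.0f);
  for (const unsigned threads : {1u, 2u, 4u, 8u, 16u}) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
      pool.emplace_back([&x, &y, n, t, threads] {
        const size_t begin = n * t / threads;
        const size_t end = n * (t + 1) / threads;
        for (size_t i = begin; i < end; ++i) y[i] += 0.5f * x[i];
      });
    }
    for (std::thread& th : pool) th.join();
    const double sec =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - start)
            .count();
    // 3 streamed accesses per element: read x, read y, write y.
    std::printf("%2u threads: %6.2f GB/s effective\n", threads,
                3.0 * n * sizeof(float) / sec / 1e9);
  }
  return 0;
}
```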
```diff
@@ -80,7 +80,7 @@ constexpr PromptWrapping kPromptWrapping[] = {
     PromptWrapping::PALIGEMMA, PromptWrapping::PALIGEMMA,  // PG2 3B 224/448
     PromptWrapping::PALIGEMMA, PromptWrapping::PALIGEMMA,  // PG2 10B 224/448
     PromptWrapping::GEMMA_VLM,                             // Gemma3 4B
-    PromptWrapping::GEMMA_IT,                              // Gemma3 1B
+    PromptWrapping::GEMMA_PT,                              // Gemma3 1B
     PromptWrapping::GEMMA_VLM,                             // Gemma3 12B
     PromptWrapping::GEMMA_VLM,                             // Gemma3 27B
 };
```
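For context on this one-line change: the wrapping enum controls how the prompt is formatted before tokenization, and a model tagged `GEMMA_PT` receives raw text rather than instruction-tuned turn markers, matching the README's advice about `-pt` models. A minimal sketch of that distinction, assuming the standard Gemma IT control tokens and a simplified two-value enum (the repository's actual wrapping logic is more involved and also handles the vision variants):

```c++
// Minimal sketch of prompt wrapping, assuming the standard Gemma IT turn
// markers. The real logic in gemma.cpp also tracks multi-turn state and
// the GEMMA_VLM / PALIGEMMA vision prompt formats.
#include <iostream>
#include <string>

enum class PromptWrapping { GEMMA_IT, GEMMA_PT };

std::string WrapPrompt(PromptWrapping wrapping, const std::string& user_text) {
  if (wrapping == PromptWrapping::GEMMA_IT) {
    // Instruction-tuned models expect the chat turn structure.
    return "<start_of_turn>user\n" + user_text +
           "<end_of_turn>\n<start_of_turn>model\n";
  }
  // Pre-trained (-pt) models are plain LMs: pass the raw text through and
  // let the model continue it.
  return user_text;
}

int main() {
  std::cout << WrapPrompt(PromptWrapping::GEMMA_PT, "Once upon a time");
  return 0;
}
```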