mirror of https://github.com/google/gemma.cpp.git
Add note on attention length and SFP
PiperOrigin-RevId: 738698399
This commit is contained in:
parent 3d419ec173
commit 83219e3c68

16 README.md
@@ -347,6 +347,12 @@ instruction-tuned and thus does not respond to instructions. Make sure you are
 using an instruction-tuned model (`2b-it-sfp`, `2b-it`, `7b-it-sfp`, `7b-it`)
 and not a pre-trained model (any model with a `-pt` suffix).
 
+**What sequence lengths are supported?**
+
+See `seq_len` in `configs.cc`. For the Gemma 3 models larger than 1B, this is
+typically 32K but 128K would also work given enough RAM. Note that long
+sequences will be slow due to the quadratic cost of attention.
+
 **How do I convert my fine-tune to a `.sbs` compressed model file?**
 
 For PaliGemma (1 and 2) checkpoints, you can use
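The quadratic-attention point in the note above can be illustrated with a quick back-of-the-envelope calculation. The sketch below is only an illustration: the layer count, KV-head count, and head dimension are assumed values, not the actual Gemma 3 configuration (the real values live in `configs.cc`).

```cpp
// Rough scaling illustration: attention score FLOPs grow with seq_len^2,
// while the KV cache only grows linearly with seq_len.
#include <cstdio>
#include <initializer_list>

int main() {
  // Assumed values for illustration only; see configs.cc for real configs.
  const double kLayers = 48;
  const double kKVHeads = 8;
  const double kHeadDim = 256;

  for (double seq_len : {32.0 * 1024, 128.0 * 1024}) {
    // QK^T scores: every token attends to every other token, and each score
    // is a dot product of length kHeadDim (~2 * kHeadDim FLOPs).
    const double score_flops =
        kLayers * kKVHeads * seq_len * seq_len * 2.0 * kHeadDim;
    // KV cache: K and V, 2 bytes each (bf16), per head, per layer, per token.
    const double kv_bytes =
        2.0 * 2.0 * kHeadDim * kKVHeads * kLayers * seq_len;
    std::printf("seq_len %4.0fK: ~%6.0f TFLOP of scores, ~%5.1f GiB KV cache\n",
                seq_len / 1024, score_flops / 1e12,
                kv_bytes / (1024.0 * 1024.0 * 1024.0));
  }
  return 0;
}
```

With these assumed dimensions, going from 32K to 128K grows the KV cache 4x (roughly 12 GiB to 48 GiB) but the score work 16x, which is why long sequences stay slow even when they fit in RAM.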
@@ -373,14 +379,16 @@ pytorch checkpoint. (The code may need updates to work with Gemma-2 models.)
 **What are some easy ways to make the model run faster?**
 
 1. Make sure you are using the 8-bit switched floating point `-sfp` models.
+   These are half the size of bf16 and thus use less memory bandwidth and cache
+   space.
 2. If you're on a laptop, make sure power mode is set to maximize performance
    and saving mode is **off**. For most laptops, the power saving modes get
    activated automatically if the computer is not plugged in.
 3. Close other unused cpu-intensive applications.
 4. On macs, anecdotally we observe a "warm-up" ramp-up in speed as performance
    cores get engaged.
 5. Experiment with the `--num_threads` argument value. Depending on the device,
    larger numbers don't always mean better performance.
 
 We're also working on algorithmic and optimization approaches for faster
 inference, stay tuned.
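Item 1 in the list above is mostly a memory-bandwidth argument: during generation, each token has to stream essentially all of the weights from RAM, so halving the bytes per weight roughly doubles the achievable token rate when decoding is bandwidth-bound. Here is a minimal sketch of that upper bound; the parameter count and DRAM bandwidth are assumed numbers for illustration, not measurements.

```cpp
// Why 1-byte SFP weights roughly double decode speed when memory-bound:
// every generated token reads (almost) all weights from RAM once.
#include <cstdio>

int main() {
  // Assumed values for illustration only.
  const double kParams = 7e9;          // e.g. a 7B-parameter model
  const double kBandwidthGBs = 50.0;   // assumed usable DRAM bandwidth, GB/s

  const double bytes_bf16 = kParams * 2.0;  // bf16: 2 bytes per weight
  const double bytes_sfp = kParams * 1.0;   // SFP: 1 byte per weight

  std::printf("bf16 upper bound: %.1f tokens/s\n",
              kBandwidthGBs * 1e9 / bytes_bf16);
  std::printf("sfp  upper bound: %.1f tokens/s\n",
              kBandwidthGBs * 1e9 / bytes_sfp);
  return 0;
}
```

The same reasoning is one reason larger `--num_threads` values do not always help: once the memory bus is saturated, extra threads mostly add contention rather than throughput.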
@@ -80,7 +80,7 @@ constexpr PromptWrapping kPromptWrapping[] = {
     PromptWrapping::PALIGEMMA, PromptWrapping::PALIGEMMA,  // PG2 3B 224/448
     PromptWrapping::PALIGEMMA, PromptWrapping::PALIGEMMA,  // PG2 10B 224/448
     PromptWrapping::GEMMA_VLM,                             // Gemma3 4B
-    PromptWrapping::GEMMA_IT,                               // Gemma3 1B
+    PromptWrapping::GEMMA_PT,                               // Gemma3 1B
     PromptWrapping::GEMMA_VLM,                             // Gemma3 12B
     PromptWrapping::GEMMA_VLM,                             // Gemma3 27B
 };
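For context on what the hunk above toggles for Gemma3 1B: `GEMMA_IT` wrapping applies the Gemma instruction-tuned turn markup to prompts, while `GEMMA_PT` passes plain text through. The sketch below only illustrates that distinction; `WrapPrompt` is a hypothetical helper, not the actual wrapping code in gemma.cpp.

```cpp
// Sketch of the IT-vs-PT prompt wrapping distinction (illustrative only).
#include <iostream>
#include <string>

enum class PromptWrapping { GEMMA_IT, GEMMA_PT };

std::string WrapPrompt(PromptWrapping wrapping, const std::string& user_text) {
  if (wrapping == PromptWrapping::GEMMA_IT) {
    // Turn markup expected by instruction-tuned Gemma checkpoints.
    return "<start_of_turn>user\n" + user_text +
           "<end_of_turn>\n<start_of_turn>model\n";
  }
  // Pre-trained checkpoints are plain language models: no markup.
  return user_text;
}

int main() {
  std::cout << WrapPrompt(PromptWrapping::GEMMA_PT, "Write a haiku about RAM.")
            << "\n---\n"
            << WrapPrompt(PromptWrapping::GEMMA_IT, "Write a haiku about RAM.");
  return 0;
}
```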