This makes Gemma and Gemma-2 tokenize pretty much everything correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion.

The HF tokenizer for Gemma has an odd behavior: it prefers the 16-space token over longer space tokens, while the SentencePiece tokenizer does not. (The implementation in llama.cpp behaves the same as SentencePiece.)

* llama : fix wrong pre-tokenization of byte tokens