llama.cpp/src
Francis Couture-Harpin f9d42c598b convert_hf : identify more added control tokens for SPM tokenizers
This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma:
it prefers the 16-space token over longer space tokens,
while the SentencePiece tokenizer does not do this.
(The implementation in llama.cpp matches the SentencePiece behavior.)

* llama : fix wrong pre-tokenization of byte tokens
2024-07-07 23:28:38 -04:00
CMakeLists.txt tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231) 2024-07-04 13:53:42 +03:00
llama.cpp convert_hf : identify more added control tokens for SPM tokenizers 2024-07-07 23:28:38 -04:00
unicode-data.cpp Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (#8258) 2024-07-02 12:18:10 -04:00
unicode-data.h llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
unicode.cpp Detokenizer fixes (#8039) 2024-07-05 19:01:35 +02:00
unicode.h llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00