llama.cpp

Francis Couture-Harpin f9d42c598b 2024-07-07 23:28:38 -04:00

convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (The implementation in llama.cpp has the same behavior as SentencePiece.)

* llama : fix wrong pre-tokenization of byte tokens
CMakeLists.txt	tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231)	2024-07-04 13:53:42 +03:00
llama.cpp	convert_hf : identify more added control tokens for SPM tokenizers	2024-07-07 23:28:38 -04:00
unicode-data.cpp	Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (#8258)	2024-07-02 12:18:10 -04:00
unicode-data.h	llama : reorganize source code + improve CMake (#8006)	2024-06-26 18:33:02 +03:00
unicode.cpp	Detokenizer fixes (#8039)	2024-07-05 19:01:35 +02:00
unicode.h	llama : reorganize source code + improve CMake (#8006)	2024-06-26 18:33:02 +03:00