llama.cpp

Commit Graph

Author	SHA1	Message	Date
jaime-m-p	edf375d26f	Restore BOM	2024-05-05 01:58:34 +02:00
jaime-m-p	67832e5554	llama3 custom regex split: fix \s	2024-05-05 01:20:23 +02:00
jaime-m-p	8fd849eb90	Unicode tables: separator, lowercase, uppercase and whitespace	2024-05-05 01:19:20 +02:00
jaime-m-p	798b576c06	Merge remote-tracking branch 'upstream/master' into gg/bpe-preprocess	2024-05-04 16:59:24 +02:00
Georgi Gerganov	92139b90af	tests : add test-tokenizer-0.sh + fix some tokenizers (#7036) * tests : add test-tokenizer-0.sh * unicode : add all unicode number ranges * starcoder : fix pre-tokenizer * tests : add test that fails with DeepSeek tokenizers * falcon : fix regex * unicode : regenerate unicode tables * refact : add tokenizer model * lint : fix * tests : disable failing tests ggml-ci * refact : add tests files ggml-ci * convert : print -> logging ggml-ci * lint : fix * unicode : digit -> number * phi-3 : update	2024-05-04 08:32:32 +03:00
jaime-m-p	0c6d820b89	Style	2024-04-30 13:18:25 +02:00
jaime-m-p	2cd1eb0daa	Add alternative regex for custom aplit llama3 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-30 13:02:46 +02:00
jaime-m-p	1d8fcc06ba	GPT2 custom regex split	2024-04-29 19:13:18 +02:00
jaime-m-p	5c38f6ed7a	Move unused variable value	2024-04-29 19:11:37 +02:00
Georgi Gerganov	f4ab2a4147	llama : fix BPE pre-tokenization (#6920) * merged the changes from deepseeker models to main branch * Moved regex patterns to unicode.cpp and updated unicode.h * Moved header files * Resolved issues * added and refactored unicode_regex_split and related functions * Updated/merged the deepseek coder pr * Refactored code * Adding unicode regex mappings * Adding unicode regex function * Added needed functionality, testing remains * Fixed issues * Fixed issue with gpt2 regex custom preprocessor * unicode : fix? unicode_wstring_to_utf8 * lint : fix whitespaces * tests : add tokenizer tests for numbers * unicode : remove redundant headers * tests : remove and rename tokenizer test scripts * tests : add sample usage * gguf-py : reader prints warnings on duplicate keys * llama : towards llama3 tokenization support (wip) * unicode : shot in the dark to fix tests on Windows * unicode : first try custom implementations * convert : add "tokenizer.ggml.pre" GGUF KV (wip) * llama : use new pre-tokenizer type * convert : fix pre-tokenizer type writing * lint : fix * make : add test-tokenizer-0-llama-v3 * wip * models : add llama v3 vocab file * llama : adapt punctuation regex + add llama 3 regex * minor * unicode : set bomb * unicode : set bomb * unicode : always use std::wregex * unicode : support \p{N}, \p{L} and \p{P} natively * unicode : try fix windows * unicode : category support via std::regex * unicode : clean-up * unicode : simplify * convert : add convert-hf-to-gguf-update.py ggml-ci * lint : update * convert : add falcon ggml-ci * unicode : normalize signatures * lint : fix * lint : fix * convert : remove unused functions * convert : add comments * convert : exercise contractions ggml-ci * lint : fix * cmake : refactor test targets * tests : refactor vocab tests ggml-ci * tests : add more vocabs and tests ggml-ci * unicode : cleanup * scripts : ignore new update script in check-requirements.sh * models : add phi-3, mpt, gpt-2, starcoder * tests : disable obsolete ggml-ci * tests : use faster bpe test ggml-ci * llama : more prominent warning for old BPE models * tests : disable test-tokenizer-1-bpe due to slowness ggml-ci --------- Co-authored-by: Jaggzh <jaggz.h@gmail.com> Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>	2024-04-29 16:58:41 +03:00
jaime-m-p	a0c870db85	Fix merge	2024-04-29 11:09:52 +02:00
jaime-m-p	866e3941f7	Merge branch 'ggerganov:gg/bpe-preprocess' into gg/bpe-preprocess	2024-04-29 10:55:15 +02:00
Georgi Gerganov	af05268cdd	unicode : cleanup	2024-04-29 11:20:42 +03:00
Georgi Gerganov	c68d2596ea	tests : add more vocabs and tests ggml-ci	2024-04-29 11:09:17 +03:00
jaime-m-p	0cf9ed3457	Restore BOM	2024-04-29 01:35:08 +02:00
jaime-m-p	2a48873914	Typing	2024-04-29 00:12:56 +02:00
jaime-m-p	6e4d2af6c3	already exists unicode_tolower()	2024-04-28 21:57:22 +02:00
jaime-m-p	5cc4b2cf01	Using char32_t for codepoints	2024-04-28 21:51:12 +02:00
Georgi Gerganov	1545550ec2	unicode : normalize signatures	2024-04-28 21:40:36 +03:00
jaime-m-p	e11fe2fb6a	llama3 custom regex split	2024-04-28 19:27:06 +02:00
Georgi Gerganov	ee6d1b3fb4	unicode : simplify	2024-04-28 18:36:57 +03:00
Georgi Gerganov	e972e6cbf8	unicode : clean-up	2024-04-28 18:30:37 +03:00
Georgi Gerganov	b97add52a4	unicode : category support via std::regex	2024-04-28 15:15:57 +03:00
Georgi Gerganov	581c4a0239	unicode : try fix windows	2024-04-27 18:36:00 +03:00
Georgi Gerganov	91eaa414bf	unicode : support \p{N}, \p{L} and \p{P} natively	2024-04-27 17:48:38 +03:00
Georgi Gerganov	ce5485aee0	unicode : always use std::wregex	2024-04-27 17:11:34 +03:00
Georgi Gerganov	a22645c2a7	unicode : set bomb	2024-04-27 11:48:24 +03:00
Georgi Gerganov	c160818ec0	wip	2024-04-27 00:28:36 +03:00
Georgi Gerganov	e9891769ff	unicode : first try custom implementations	2024-04-26 15:09:07 +03:00
Georgi Gerganov	e8c206be61	unicode : shot in the dark to fix tests on Windows	2024-04-26 14:57:12 +03:00
Georgi Gerganov	06d3e693db	unicode : fix? unicode_wstring_to_utf8	2024-04-26 12:55:11 +03:00
Kazim Abrar Mahi	36d983262e	Fixed issue with gpt2 regex custom preprocessor	2024-04-26 11:43:29 +03:00
Kazim Abrar Mahi	feeaf4f39c	Added needed functionality, testing remains	2024-04-26 11:43:29 +03:00
Kazim Abrar Mahi	7e308ed212	Adding unicode regex function	2024-04-26 11:43:29 +03:00
Kazim Abrar Mahi	4056dc5b1e	added and refactored unicode_regex_split and related functions	2024-04-26 11:43:28 +03:00
Kazim Abrar Mahi	1c924e4b35	Resolved issues	2024-04-26 11:43:28 +03:00
Kazim Abrar Mahi	54f93eb50b	Moved header files	2024-04-26 11:43:28 +03:00
Kazim Abrar Mahi	d2cfc2225f	Moved regex patterns to unicode.cpp and updated unicode.h	2024-04-26 11:43:28 +03:00
Jared Van Bortel	32c8486e1f	wpm : portable unicode tolower (#6305) Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.	2024-03-26 17:46:21 -04:00
Georgi Gerganov	83796e62bc	llama : refactor unicode stuff (#5992) * llama : refactor unicode stuff ggml-ci * unicode : names * make : fix c++ compiler * unicode : names * unicode : straighten tables * zig : fix build * unicode : put nfd normalization behind API ggml-ci * swift : fix build * unicode : add BOM * unicode : add <cstdint> ggml-ci * unicode : pass as cpts as const ref	2024-03-11 17:47:47 +02:00

40 Commits