The main change in this PR is to consolidate GGUFWriter.add_key and GGUFWriter.add_val into GGUFWriter.add_key_value.
In addition, use_temp_file is now opt-in instead of opt-out, defaulting to False.
GGUFWriter also no longer requires an output file name until it actually writes to the file.
Finally, GGUFWriter no longer needs to eagerly prepare the data layout of the metadata.
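A minimal sketch of the consolidated API, assuming the shipped gguf-py matches this description (exact signatures may differ):

```python
from gguf import GGUFWriter, GGUFValueType

# the output path can now be deferred until write time
writer = GGUFWriter(path=None, arch="llama", use_temp_file=False)

# before: writer.add_key("general.name"); writer.add_val("example", GGUFValueType.STRING)
writer.add_key_value("general.name", "example", GGUFValueType.STRING)

# the file name is only supplied when actually writing
writer.write_header_to_file(path="example.gguf")
```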
* feat: add changes to handle jina v2 base code
* fix: do not complicate things
* fix: fix the usage of the code model
* fix: fix comments
* fix: fix linting issues
* fix: remove ollama patches
* style : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : increase max number of experts to 160
* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture
* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier
* convert-hf : add model conversion support for DeepseekV2ForCausalLM
* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models
* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor)
* llama : add inference support for LLM_ARCH_DEEPSEEK2
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
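For illustration, a hedged numpy sketch of what the two new llm_build_moe_ffn() arguments mean (the real implementation is C++ in llama.cpp; the helper name here is made up):

```python
import numpy as np

def select_experts(router_logits, n_used, scale_w=False, w_scale=1.0):
    # softmax over the router logits
    e = np.exp(router_logits - router_logits.max())
    probs = e / e.sum()
    top = np.argsort(probs)[::-1][:n_used]  # indices of the selected experts
    weights = probs[top]
    if scale_w:
        weights = weights * w_scale  # scale the selected experts' weights
    return top, weights
```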
* common : increase max number of experts to 128
* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn
* gguf-py : add architecture-specific block mappings that override selected general block mappings
* convert-hf : add model conversion support for ArcticForCausalLM
* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)
* llama : add inference support for LLM_ARCH_ARCTIC
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
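The override mechanism can be pictured with this hedged Python sketch (the mapping entries are made up; only the arch-overrides-general idea reflects the change above):

```python
GENERAL_MAPPINGS = {
    "ffn_norm": ("model.layers.{bid}.post_attention_layernorm",),
}
ARCH_MAPPINGS = {
    "arctic": {
        "ffn_norm": ("model.layers.{bid}.residual_layernorm",),
    },
}

def mappings_for(arch: str) -> dict:
    merged = dict(GENERAL_MAPPINGS)             # start from the general table
    merged.update(ARCH_MAPPINGS.get(arch, {}))  # arch-specific entries win
    return merged
```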
Removed a memory leak caused by unexpected reference retention of eager tensors.
Also removed the GGUFManager functionality from convert.py in favor of specializing it for convert-hf-to-gguf.py.
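A hypothetical illustration of the leak pattern (not the actual convert-hf-to-gguf.py code): a lazy wrapper whose closure captures an eager tensor keeps that tensor alive for as long as the wrapper exists.

```python
import numpy as np

def lazy_cast_leaky(t: np.ndarray):
    return lambda: t.astype(np.float32)  # the closure retains the eager tensor

def lazy_cast_fixed(load):
    # capture a loader instead, so the tensor is materialized only on demand
    return lambda: load().astype(np.float32)
```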
* add phi3 128k support in convert-hf-to-gguf
* add phi3 128k support in cuda
* address build warnings on llama.cpp
* adjust index value in cuda long rope freq factors
* add long rope support in ggml cpu backend
* make freq factors only depend on ctx size
* remove unused rope scaling type 'su' from gguf converter
* fix lint warnings on convert-hf-to-gguf.py
* set the short freq factors when the context size is smaller than the trained context size
* add a one-line comment
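The freq-factor selection rule from the list above, as an illustrative one-liner (not llama.cpp's exact code):

```python
def select_freq_factors(n_ctx, n_ctx_train, short_factors, long_factors):
    # use the short factors when the context fits in the trained context size
    return short_factors if n_ctx <= n_ctx_train else long_factors
```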
* metal : support rope freq_factors
* ggml : update ggml_rope_ext API to support freq. factors
* backends : add dev messages to support rope freq. factors
* minor : style
* tests : update to use new rope API
* backends : fix pragma semicolons
* minor : cleanup
* llama : move rope factors from KV header to tensors
* llama : remove tmp assert
* cuda : fix compile warning
* convert : read/write n_head_kv
* llama : fix uninitialized tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
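A hedged numpy sketch of how a freq_factors tensor plugs into RoPE, in the spirit of the ggml_rope_ext() change above (the concept, not ggml's code): each rotary frequency is divided by its per-dimension factor.

```python
import numpy as np

def rope_inv_freq(n_dims, freq_base=10000.0, freq_factors=None):
    inv_freq = freq_base ** (-np.arange(0, n_dims, 2, dtype=np.float64) / n_dims)
    if freq_factors is not None:
        inv_freq = inv_freq / freq_factors  # apply the per-dimension factors
    return inv_freq
```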
* convert-hf : support q8_0 conversion
* convert-hf : add missing ftype
This was messing with the checksums otherwise.
* convert-hf : add missing ftype to Baichuan and Xverse
I didn't notice these on my first pass.
* convert-hf : support bfloat16 conversion
* gguf-py : flake8 fixes
* convert-hf : add missing space after comma
* convert-hf : get bit-exact same output as ./quantize
The quantization version was missing.
* convert-hf : don't round bf16 NANs
* convert-hf : save some memory with np.int16 intermediate bf16 weights
* convert-hf : more closely match llama.cpp with which weights to keep in f32
* convert-hf : add --outtype auto-f16
A reason for this to exist is for model quantizers who want an initial
GGUF with the most fidelity to the original model while still using
a 16-bit float type instead of 32-bit floats.
* convert-hf : remove a semicolon because flake8 doesn't like it
It's a reflex from when programming in C/C++, I guess.
* convert-hf : support outtype templating in outfile name
* convert-hf : rename --outtype auto-f16 to --outtype auto
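A hedged numpy sketch of an f32 -> bf16 conversion in the spirit of the commits above: round-to-nearest-even on normal values, but leave NaNs un-rounded so their bits survive (cf. "don't round bf16 NANs"). Details may differ from gguf-py.

```python
import numpy as np

def f32_to_bf16(x: np.ndarray) -> np.ndarray:
    u = x.astype(np.float32).view(np.uint32)
    is_nan = (u & 0x7FFFFFFF) > 0x7F800000     # NaN: exponent all ones, mantissa != 0
    rounded = u + (0x7FFF + ((u >> 16) & 1))   # round mantissa to nearest even
    u = np.where(is_nan, u, rounded)           # don't round NaNs
    return (u >> 16).astype(np.uint16)         # bf16 keeps the high 16 bits
```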
* feat: first things to do
* feat: create tensors for Jina architecture
* fix: use other tensors
* feat: embedding gets results
* fix: fix usage of ALIBI
* fix: clean prints
* fix: clean up some unused vars
* fix: revert changes to Makefile and CMakeLists
* fix: revert some changes
* fix: fix small detail
* fix: fix convert formatting
* fix: fix linting and editor
* feat: set proper vocab settings
* fix: JinaBertForMaskedLM registration
* feat: support q_normalization and k_normalization in Jina arch
* feat: handle gpt2 tokenizer with Jina architecture
* feat: example comments in embedding
* feat: rename Jina Bert to Jina Bert V2
* fix: add some changes as per review
* feat: proper KQ_pos for Jina embeddings
* feat: add capacity to load the ES and DE models (Spanish and German)
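A hedged numpy sketch of the q/k normalization item above ("support q_normalization and k_normalization in Jina arch"): a LayerNorm applied to the query and key projections before the attention scores are computed. Function names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def qk_normed_scores(q, k):
    q, k = layer_norm(q), layer_norm(k)  # normalize q and k first
    return q @ k.T / np.sqrt(q.shape[-1])
```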
* llama : fix pre-tokenizers
* ggml : full ALiBi support
* ggml : update ggml_soft_max_ext() CUDA, SYCL
* ggml : ggml_flash_attn_ext() support ALiBi (CPU)
* ggml : ggml_flash_attn_ext() support ALiBi (Metal)
* ggml : fix warning
* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)
ggml-ci
* minor : clean-up
* embedding : add warning about missing SEP
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
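A hedged numpy sketch of ALiBi as folded into a softmax, in the spirit of the ggml_soft_max_ext() change above: a per-head linear distance penalty is added to the scores before normalizing. The slope schedule follows the ALiBi paper; ggml's exact handling of max_bias and non-power-of-two head counts may differ.

```python
import numpy as np

def alibi_softmax(scores, head, n_head, max_bias=8.0):
    slope = 2.0 ** (-max_bias * (head + 1) / n_head)  # per-head slope
    n = scores.shape[-1]
    bias = -slope * np.arange(n - 1, -1, -1)          # zero at the last position
    s = scores + bias
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```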