Commit Graph

137 Commits

Author SHA1 Message Date
Christian Zhou-Zheng 9d7f694438 fix typing and clean up 2024-06-09 16:02:23 -04:00
Christian Zhou-Zheng 5a96b8f27f remove SplitStrategy, SplitArguments 2024-06-09 13:08:06 -04:00
Christian Zhou-Zheng 49b9fbe942 actually make the linter happy 2024-06-09 11:37:56 -04:00
Christian Zhou-Zheng a234bf821b fix linting 2024-06-09 11:23:55 -04:00
Christian Zhou-Zheng 0779f2f74f tidy up 2024-06-09 11:20:14 -04:00
Christian Zhou-Zheng 69d6e7a8e9 Merge branch 'master' into convert-split 2024-06-09 11:14:02 -04:00
sasha0552 2decf57bc6
convert-hf : set the model name based on cli arg, if present (#7693)
The `--model-name` argument was added a while ago but did not do anything.
This commit fixes that and enables the feature.
2024-06-09 16:39:25 +10:00
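A minimal sketch of the pattern, with hypothetical names (the real script's argument handling may differ):

```python
import argparse

# Hypothetical sketch: an optional CLI flag overrides the name otherwise
# derived from the model directory, mirroring what the commit enables.
parser = argparse.ArgumentParser()
parser.add_argument("--model-name", type=str, default=None,
                    help="override the model name stored in the GGUF metadata")
args = parser.parse_args()

dir_derived_name = "my-model"  # placeholder for the directory-derived name
model_name = args.model_name if args.model_name is not None else dir_derived_name
print(f"model name: {model_name}")
```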
Christian Zhou-Zheng 97dd416903 kv/ti data are still wrong 2024-06-09 00:34:36 -04:00
Christian Zhou-Zheng 666bb097a2 Merge branch 'master' into convert-split 2024-06-08 23:06:18 -04:00
Christian Zhou-Zheng 282e71fb39 edit cmd line args 2024-06-08 23:00:42 -04:00
compilade 5795b94182
convert-hf : match model part name prefix and suffix (#7687)
In #7075, to fix the conversion of (some) models that use model-00001-of-00001.safetensors instead of model.safetensors for a single model part, we simply used the same logic as the part count to get the part names.

But this doesn't always work correctly, such as when unusual additional model files like consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present.

This commit, which matches both the prefix and the suffix of the model part names, should fix this problem without breaking any previously-supported upstream models. According to a report by @teleprint-me there is still some persistent problem, but this shall do in the meantime.
2024-06-09 12:47:25 +10:00
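A rough sketch of the matching idea, assuming the usual Hugging Face naming schemes (the actual logic in convert-hf-to-gguf.py may differ in detail):

```python
import re

def match_part_names(filenames: list[str],
                     prefix: str = "model",
                     suffix: str = ".safetensors") -> list[str]:
    # Accept both "model.safetensors" and "model-00001-of-00003.safetensors",
    # while rejecting unrelated files such as "consolidated.safetensors".
    pattern = re.compile(re.escape(prefix) + r"(?:-\d{5}-of-\d{5})?" + re.escape(suffix))
    return sorted(name for name in filenames if pattern.fullmatch(name))

files = ["model-00001-of-00002.safetensors",
         "model-00002-of-00002.safetensors",
         "consolidated.safetensors"]
print(match_part_names(files))  # consolidated.safetensors is filtered out
```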
compilade ed9f252118
gguf-py : decouple adding metadata from writing in GGUFWriter (#7827)
The main change of this PR is to consolidate GGUFWriter.add_key and GGUFWriter.add_val into GGUFWriter.add_key_value.

In addition, use_temp_file is now opt-in instead of opt-out, defaulting to False.

GGUFWriter also no longer requires the output file name until it actually writes to the file.

Finally, GGUFWriter no longer needs to eagerly prepare the data layout of the metadata.
2024-06-09 12:34:29 +10:00
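The decoupling pattern can be illustrated with a simplified stand-in class (not the real GGUFWriter API, just the shape of the change):

```python
class MetadataWriter:
    """Simplified stand-in: key-value metadata is accumulated first and
    only serialized once a path is finally supplied, as in the PR."""

    def __init__(self) -> None:
        self.kv: dict[str, object] = {}  # no output file required yet

    def add_key_value(self, key: str, value: object) -> None:
        self.kv[key] = value  # no eager preparation of the data layout

    def write_to(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            for key, value in self.kv.items():
                f.write(f"{key}={value}\n")

w = MetadataWriter()
w.add_key_value("general.name", "example")
w.write_to("metadata.txt")  # file name provided only at write time
```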
Christian Zhou-Zheng 079dfe3a8c
Update convert-hf-to-gguf.py
Co-authored-by: compilade <git@compilade.net>
2024-06-08 15:42:17 -04:00
Christian Zhou-Zheng f658e91f4a comma consistency 2024-06-08 08:10:12 -04:00
Christian Zhou-Zheng 2e70fa1055 attempt to appease the linter 2024-06-07 21:18:30 -04:00
Christian Zhou-Zheng c6ae1d6799 reinstate original gguf package import and fix type annotation 2024-06-07 21:09:03 -04:00
Francis Couture-Harpin e093dfba9f convert-hf : restore executable file permission 2024-06-07 17:31:35 -04:00
Christian Zhou-Zheng 0283fc1771 fix line endings 2024-06-07 17:24:27 -04:00
Christian Zhou-Zheng 5f29d4a617 fix convert-hf-to-gguf.py permissions 2024-06-07 17:19:01 -04:00
Christian Zhou-Zheng 6d3a256d1d rename GGUFManager to GGUFWriterSplit 2024-06-07 09:12:44 -04:00
Christian Zhou-Zheng 706bd69023
re-add type hint
Co-authored-by: compilade <git@compilade.net>
2024-06-06 08:27:25 -04:00
Joan Fontanals f5d7b268ec
llama : add jina v2 base code (#7596)
* feat: add changes to handle jina v2 base code

* fix: do not complicate things

* fix: fix the usage of the code model

* fix: fix comments

* fix: fix linting issues

* fix: remove ollama patches

* style : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-06 10:22:41 +03:00
Christian Zhou-Zheng ce7e6985d2 form shards while adding tensors, SHA256 sums agree with master 2024-06-05 18:29:39 -04:00
Christian Zhou-Zheng 5ad397d610 reduce diffs with master 2024-06-05 13:49:20 -04:00
Galunid 7672adeec7
Fix encoding in python scripts (#7733) 2024-06-06 03:07:24 +10:00
Christian Zhou-Zheng bb5ee02096 simplify even further and standardize with GGUFWriter 2024-06-05 12:49:08 -04:00
Christian Zhou-Zheng f6fd3ea4e9 further simplify GGUFManager 2024-06-05 12:28:40 -04:00
Christian Zhou-Zheng c8ecbc67e2 oops, actually fix gguf_writer placement 2024-06-03 19:34:37 -04:00
Christian Zhou-Zheng efead0408c fix gguf_writer placement and remove comments 2024-06-03 19:34:01 -04:00
Christian Zhou-Zheng a9c7703c12 fix final? merge issue 2024-06-03 09:18:19 -04:00
Christian Zhou-Zheng 140eb52f3f Merge branch 'master' into convert-split 2024-06-03 09:07:23 -04:00
Galunid 0515ad93f4
convert-hf : Handle NotImplementedError in convert-hf-to-gguf (#7660) 2024-05-31 17:42:33 +02:00
Galunid 9c4c9cc83f
Move convert.py to examples/convert-legacy-llama.py (#7430)
* Move convert.py to examples/convert-no-torch.py

* Fix CI, scripts, readme files

* convert-no-torch -> convert-legacy-llama

* Move vocab thing to vocab.py

* Fix convert-no-torch -> convert-legacy-llama

* Fix lost convert.py in ci/run.sh

* Fix imports

* Fix gguf not imported correctly

* Fix flake8 complaints

* Fix check-requirements.sh

* Get rid of ADDED_TOKENS_FILE, FAST_TOKENIZER_FILE

* Review fixes
2024-05-30 21:40:00 +10:00
Giuseppe Scrivano 5442939fcc
llama : support small Granite models (#7481)
* Add optional MLP bias for Granite models

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/llama.cpp/issues/7116; still needs some more changes to properly support Granite.

* llama: honor add_space_prefix from the model configuration

Propagate the add_space_prefix setting from the HF model configuration to the GGUF file and honor it with the gpt2 tokenizer.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* llama: add support for small granite models

It works only for the small models (3B and 8B).

The convert-hf-to-gguf.py script uses the vocabulary size of the Granite
models to detect Granite and set the correct configuration (a sketch of this heuristic follows the commit).

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Steffen Roecker <sroecker@redhat.com>
2024-05-28 21:49:49 +03:00
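A sketch of that heuristic, with an assumed vocabulary size for illustration only (the real value and configuration live in convert-hf-to-gguf.py):

```python
# Hypothetical value: the converter keys off the vocabulary size because
# Granite checkpoints report a generic Llama-style architecture.
GRANITE_VOCAB_SIZES = {49152}

def is_granite(hparams: dict) -> bool:
    return hparams.get("vocab_size") in GRANITE_VOCAB_SIZES

hparams = {"vocab_size": 49152, "hidden_size": 4096}
if is_granite(hparams):
    hparams["mlp_bias"] = True  # e.g. enable the optional MLP bias from the commit
```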
fairydreaming ee3dff6b8e
Add support for DeepseekV2ForCausalLM (#7519)
* common : increase max number of experts to 160

* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture

* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier

* convert-hf : add model conversion support for DeepseekV2ForCausalLM

* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models

* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor)

* llama : add inference support for LLM_ARCH_DEEPSEEK2

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-28 17:07:05 +02:00
Galunid 32a28217f4
Fix aya-23 conversion scripts (#7539) 2024-05-26 16:02:34 +02:00
Bartowski c429b33beb
llama : add Smaug 70B support (#7402) 2024-05-26 15:28:35 +03:00
compilade b83bab15a5
gguf-py : fix and simplify quantized shape round-trip (#7483)
* gguf-py : fix and simplify quantized shape round-trip

* gguf-py : remove unused import
2024-05-25 11:11:48 +10:00
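For context, quantized GGUF tensors store a byte-packed last dimension, so converting between the logical shape and the stored shape takes a little arithmetic. A sketch assuming the Q4_0 format (blocks of 32 elements packed into 18 bytes):

```python
# Q4_0: 32 elements per block -> a 2-byte f16 scale plus 16 bytes of 4-bit values.
Q4_0_BLOCK_ELEMS = 32
Q4_0_BLOCK_BYTES = 18

def logical_to_stored(shape: tuple[int, ...]) -> tuple[int, ...]:
    *rest, last = shape
    assert last % Q4_0_BLOCK_ELEMS == 0
    return (*rest, last // Q4_0_BLOCK_ELEMS * Q4_0_BLOCK_BYTES)

def stored_to_logical(shape: tuple[int, ...]) -> tuple[int, ...]:
    *rest, last = shape
    assert last % Q4_0_BLOCK_BYTES == 0
    return (*rest, last // Q4_0_BLOCK_BYTES * Q4_0_BLOCK_ELEMS)

shape = (4096, 4096)
assert stored_to_logical(logical_to_stored(shape)) == shape  # round-trip holds
```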
fairydreaming fbca2f27fc
Add support for ArcticForCausalLM (#7020)
* common : increase max number of experts to 128

* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn

* gguf-py : add architecture-specific block mappings that override selected general block mappings

* convert-hf : add model conversion support for ArcticForCausalLM

* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-24 14:31:13 +02:00
Christian Zhou-Zheng 3ff27efa89 Fix eager tensor memory leak and remove convert.py changes
Fixed a memory leak caused by unexpected reference retention of eager tensors.

Also removed the GGUFManager functionality in convert.py in favor of specializing it for convert-hf-to-gguf.py.
2024-05-23 18:50:21 -04:00
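The kind of retention bug described here is typically fixed by yielding tensors lazily instead of keeping them all referenced at once. A generic sketch, not the actual code:

```python
import numpy as np

def iter_parts():
    # Yield tensors one at a time instead of building a list; with eager
    # tensors, an accidental list (or closure) keeps every array alive.
    for _ in range(4):
        yield np.zeros((1024, 1024), dtype=np.float32)

total = 0.0
for t in iter_parts():
    total += float(t.sum())  # consume each tensor, then let it be freed
print(total)
```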
fairydreaming 9b82476ee9
Add missing inference support for GPTNeoXForCausalLM (Pythia and GPT-NeoX base models) (#7461)
* convert-hf : add conversion of bloom-style qkv tensor to gpt-style qkv (code borrowed from BloomModel)

* llama : add inference support for LLM_ARCH_GPTNEOX

* llama : add model types for every Pythia variant and GPT-NeoX

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-23 11:49:53 +02:00
Christian Zhou-Zheng 2dd784108b Merge remote-tracking branch 'origin' into convert-split 2024-05-22 20:23:13 -04:00
liuwei-git 201cc11afa
llama : add phi3 128K model support (#7225)
* add phi3 128k support in convert-hf-to-gguf

* add phi3 128k support in cuda

* address build warnings on llama.cpp

* adjust index value in cuda long rope freq factors

* add long rope support in ggml cpu backend

* make freq factors only depend on ctx size

* remove unused rope scaling type 'su' from gguf converter

* fix lint warnings on convert-hf-to-gguf.py

* set to the short freq factor when the context size is smaller than the trained context size (selection sketched after this commit)

* add one line of comments

* metal : support rope freq_factors

* ggml : update ggml_rope_ext API to support freq. factors

* backends : add dev messages to support rope freq. factors

* minor : style

* tests : update to use new rope API

* backends : fix pragma semicolons

* minor : cleanup

* llama : move rope factors from KV header to tensors

* llama : remove tmp assert

* cuda : fix compile warning

* convert : read/write n_head_kv

* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-21 23:28:32 +03:00
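The freq-factor selection mentioned in the bullets above boils down to a comparison against the trained context size. A sketch with placeholder values (the actual factors ship with the Phi-3 128K model):

```python
def select_rope_freq_factors(n_ctx: int, n_ctx_train: int,
                             short_factors: list[float],
                             long_factors: list[float]) -> list[float]:
    # Use the short factors when the requested context fits within the
    # trained context size, and the long factors otherwise.
    return short_factors if n_ctx <= n_ctx_train else long_factors

short_vals = [1.0] * 48  # placeholder values, not the real factors
long_vals = [4.0] * 48
print(select_rope_freq_factors(4096, 4096, short_vals, long_vals)[0])    # short
print(select_rope_freq_factors(131072, 4096, short_vals, long_vals)[0])  # long
```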
Georgi Gerganov c3f8d58356
tests : test-tokenizer-0.sh print more info (#7402) 2024-05-21 19:53:48 +03:00
jaime-m-p d7e852c1bc
Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
jaime-m-p 917dc8cfa6
Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
* Update brute force test: special tokens
* Fix added tokens (fallback read order sketched below)
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.
* Fix special tokens rtrim

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
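A sketch of that fallback read order, assuming the files sit in a local model directory (how each file's token entries are extracted differs per format):

```python
import json
from pathlib import Path

def read_added_tokens(model_dir: str) -> dict:
    # Try each tokenizer file in turn, in the order the commit describes.
    for name in ("added_tokens.json", "tokenizer_config.json", "tokenizer.json"):
        path = Path(model_dir) / name
        if path.is_file():
            with open(path, encoding="utf-8") as f:
                return json.load(f)  # placeholder: return the raw JSON
    return {}

tokens = read_added_tokens(".")  # empty dict if no tokenizer files are found
```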
Georgi Gerganov fabf30b4c4
llama : remove Persimmon (#7408)
* llama : remove Persimmon

* requirements : remove
2024-05-21 02:35:28 +10:00
Anas Ahouzi 6aade19ee7
Add StableLM2 pre-tokenizer (#7349)
* Add StableLM pre-tokenizer

* Fix space

* Fix trailing whitespace
2024-05-19 22:46:46 +10:00
Georgi Gerganov b49a13dd2f
convert : fix set_vocab_sentencepiece (#6866)
* convert : fix set_vocab_sentencepiece

* Update convert-hf-to-gguf.py
2024-05-18 08:46:20 +03:00
Aarni Koskela d273c1402b
py : convert-hf-to-gguf-update improvements (#7340)
* convert-hf-to-gguf-update: automate updating

* convert-hf-to-gguf-update: improve download

* share the requests session for performance (sketched below)
* create directories only when needed; don't skip downloads when an empty directory is encountered
* be more graceful about errors
2024-05-17 15:11:45 +03:00
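Sharing a requests.Session reuses pooled HTTP connections across downloads, which is where the performance win comes from. A minimal sketch with placeholder URLs:

```python
import requests

session = requests.Session()  # one session -> pooled, reused connections

urls = [  # placeholder URLs, not the script's actual download list
    "https://huggingface.co/some/model/resolve/main/tokenizer.json",
    "https://huggingface.co/some/model/resolve/main/tokenizer_config.json",
]
for url in urls:
    resp = session.get(url, timeout=30)
    resp.raise_for_status()  # raise a clear HTTPError on bad responses
    # write resp.content to disk here
```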