Commit Graph

142 Commits

Author SHA1 Message Date
Christian Zhou-Zheng 9d7f694438 fix typing and clean up 2024-06-09 16:02:23 -04:00
Christian Zhou-Zheng f7ecd99691 appease linter 2024-06-09 13:09:05 -04:00
Christian Zhou-Zheng 5a96b8f27f remove SplitStrategy, SplitArguments 2024-06-09 13:08:06 -04:00
Christian Zhou-Zheng 0471f67f4f cleanup round 1 2024-06-09 12:40:02 -04:00
Christian Zhou-Zheng a234bf821b fix linting 2024-06-09 11:23:55 -04:00
Christian Zhou-Zheng 0779f2f74f tidy up 2024-06-09 11:20:14 -04:00
Christian Zhou-Zheng ba1be979eb fix ti data messiness 2024-06-09 11:10:33 -04:00
Christian Zhou-Zheng ff2dd7d30d try to refactor kv data (still fails) 2024-06-09 10:29:47 -04:00
Christian Zhou-Zheng 97dd416903 kv/ti data are still wrong 2024-06-09 00:34:36 -04:00
Christian Zhou-Zheng 03cc9bcbe8 use simplification from #7827 2024-06-08 23:14:26 -04:00
Christian Zhou-Zheng 666bb097a2 Merge branch 'master' into convert-split 2024-06-08 23:06:18 -04:00
Christian Zhou-Zheng 282e71fb39 edit cmd line args 2024-06-08 23:00:42 -04:00
compilade ed9f252118
gguf-py : decouple adding metadata from writing in GGUFWriter (#7827)
The main change of this PR is to consolidate GGUFWriter.add_key and GGUFWriter.add_val into GGUFWriter.add_key_value.

In addition, use_temp_file is now opt-in instead of opt-out, defaulting to False.

Also, GGUFWriter no longer requires the output file name until it actually writes to the file.

Finally, GGUFWriter does not need to eagerly prepare the data layout of the metadata.
2024-06-09 12:34:29 +10:00
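The PR description above can be illustrated with a minimal, hypothetical sketch (not the actual gguf-py code): metadata is staged in memory through a single add_key_value entry point, and the output path and data layout are only needed at write time. The class and method names below mirror the description but are simplified for illustration.

```python
# Illustrative sketch of the decoupling described in #7827, not the real GGUFWriter.
class KeyValueWriter:
    def __init__(self):
        self.kv_data = {}   # metadata staged in memory; no layout prepared yet
        self.path = None    # output file name can be supplied late

    def add_key_value(self, key, val):
        # single entry point replacing separate add_key/add_val calls
        if key in self.kv_data:
            raise ValueError(f"duplicate key {key}")
        self.kv_data[key] = val

    def write_to(self, path):
        # the layout is only prepared here, at actual write time
        self.path = path
        with open(path, "w") as f:
            for key, val in self.kv_data.items():
                f.write(f"{key}={val}\n")
```

Deferring the file name like this is what lets a splitting writer decide shard names after all tensors are known.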
Christian Zhou-Zheng 02be0dd654 attempt 3 to appease the linter 2024-06-07 21:26:40 -04:00
Christian Zhou-Zheng 891b19cb81 attempt 2 to appease the linter 2024-06-07 21:20:46 -04:00
Christian Zhou-Zheng 2e70fa1055 attempt to appease the linter 2024-06-07 21:18:30 -04:00
Christian Zhou-Zheng dc5cf5fd82
Update gguf-py/gguf/gguf_writer_split.py
Co-authored-by: compilade <git@compilade.net>
2024-06-07 17:26:30 -04:00
Christian Zhou-Zheng 1312e287ec
Update gguf-py/gguf/constants.py
Co-authored-by: compilade <git@compilade.net>
2024-06-07 17:10:51 -04:00
Christian Zhou-Zheng 6d3a256d1d rename GGUFManager to GGUFWriterSplit 2024-06-07 09:12:44 -04:00
Christian Zhou-Zheng 13ffe22ca7 base-1024 bytes to base-1000 2024-06-06 10:24:11 -04:00
Christian Zhou-Zheng 83e4a3f5cc make pathlib explicit 2024-06-06 09:00:59 -04:00
Christian Zhou-Zheng 2037eabb64 move kv keys to constants.py 2024-06-06 08:49:46 -04:00
Christian Zhou-Zheng 1cbab22225 type consistency in format_n_bytes_to_str 2024-06-06 08:43:26 -04:00
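The two commits above (base-1024 to base-1000, and type consistency in format_n_bytes_to_str) can be sketched as follows; this is a hypothetical implementation of what such a formatter might look like, not the actual code, using SI (base-1000) units and a consistent float conversion regardless of input type.

```python
def format_n_bytes_to_str(num: int) -> str:
    # Hypothetical sketch: format a byte count with base-1000 (SI) prefixes,
    # converting to float up front for type consistency.
    num_f = float(num)
    for unit in ("", "K", "M", "G"):
        if abs(num_f) < 1000.0:
            return f"{num_f:3.1f}{unit}"
        num_f /= 1000.0
    return f"{num_f:.1f}T"
```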
Christian Zhou-Zheng 3328b0a991 Shard dataclass and un-negative dont_add_architecture 2024-06-06 08:37:35 -04:00
Christian Zhou-Zheng 6a05183b97
GGUFWriter compatibility fix
Co-authored-by: compilade <git@compilade.net>
2024-06-06 08:28:10 -04:00
Joan Fontanals f5d7b268ec
llama : add jina v2 base code (#7596)
* feat: add changes to handle jina v2 base code

* fix: do not complicate things

* fix: fix the usage of the code model

* fix: fix comments

* fix: fix linting issues

* fix: remove ollama patches

* style : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-06 10:22:41 +03:00
Christian Zhou-Zheng ce7e6985d2 form shards while adding tensors, SHA256 sums agree with master 2024-06-05 18:29:39 -04:00
Christian Zhou-Zheng 5ad397d610 reduce diffs with master 2024-06-05 13:49:20 -04:00
Christian Zhou-Zheng bb5ee02096 simplify even further and standardize with GGUFWriter 2024-06-05 12:49:08 -04:00
Christian Zhou-Zheng f6fd3ea4e9 further simplify GGUFManager 2024-06-05 12:28:40 -04:00
Christian Zhou-Zheng 3e9430df33 reduce duplicated code from gguf_writer 2024-06-05 09:29:33 -04:00
Christian Zhou-Zheng efead0408c fix gguf_writer placement and remove comments 2024-06-03 19:34:01 -04:00
Christian Zhou-Zheng 140eb52f3f Merge branch 'master' into convert-split 2024-06-03 09:07:23 -04:00
Christian Zhou-Zheng 240243e63f remove unnecessary imports in gguf_manager 2024-06-03 09:01:42 -04:00
Christian Zhou-Zheng 09baf2f3b5 fix Q8 quantization 2024-06-03 08:58:29 -04:00
zhangkaihuo 6f28a333c1
llama : MiniCPM support tied embeddings (#7664)
* support lm_head

* remove the code block

---------

Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>
2024-06-03 10:49:30 +03:00
Galunid 9c4c9cc83f
Move convert.py to examples/convert-legacy-llama.py (#7430)
* Move convert.py to examples/convert-no-torch.py

* Fix CI, scripts, readme files

* convert-no-torch -> convert-legacy-llama

* Move vocab thing to vocab.py

* Fix convert-no-torch -> convert-legacy-llama

* Fix lost convert.py in ci/run.sh

* Fix imports

* Fix gguf not imported correctly

* Fix flake8 complaints

* Fix check-requirements.sh

* Get rid of ADDED_TOKENS_FILE, FAST_TOKENIZER_FILE

* Review fixes
2024-05-30 21:40:00 +10:00
Galunid eb57fee51f
gguf-py : Add tokenizer.ggml.pre to gguf-new-metadata.py (#7627) 2024-05-30 02:10:40 +02:00
fairydreaming ee3dff6b8e
Add support for DeepseekV2ForCausalLM (#7519)
* common : increase max number of experts to 160

* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture

* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier

* convert-hf : add model conversion support for DeepseekV2ForCausalLM

* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models

* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor)

* llama : add inference support for LLM_ARCH_DEEPSEEK2

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-28 17:07:05 +02:00
compilade b83bab15a5
gguf-py : fix and simplify quantized shape round-trip (#7483)
* gguf-py : fix and simplify quantized shape round-trip

* gguf-py : remove unused import
2024-05-25 11:11:48 +10:00
fairydreaming fbca2f27fc
Add support for ArcticForCausalLM (#7020)
* common : increase max number of experts to 128

* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn

* gguf-py : add architecture-specific block mappings that override selected general block mappings

* convert-hf : add model conversion support for ArcticForCausalLM

* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-24 14:31:13 +02:00
Christian Zhou-Zheng 6b5c3753c8 refactor SplitStrategy to be a deque
Instead of having SplitStrategy have a `data` field that is a deque, just have SplitStrategy be a subclass of deque itself.
2024-05-24 00:28:48 -04:00
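The refactor described above can be sketched in a few lines; this is an illustrative shape under the stated assumption (a queue of shards), not the real class, which was later removed in the "remove SplitStrategy, SplitArguments" commit.

```python
from collections import deque

class SplitStrategy(deque):
    """Sketch of the refactor: instead of wrapping a deque in a `data`
    attribute, subclass deque directly so append/popleft/len/iteration
    come for free and callers drop the `.data` indirection."""

    def append_shard(self, name, tensors):
        # illustrative helper; shard naming/contents are hypothetical
        self.append((name, tensors))
```

Subclassing keeps call sites simple (`for shard in strategy:` instead of `for shard in strategy.data:`) at the cost of exposing every deque method on the class.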
Christian Zhou-Zheng 3ff27efa89 Fix eager tensor memory leak and remove convert.py changes
Fixed a memory leak caused by unexpected reference retention of eager tensors.

Also removed GGUFManager functionality in convert.py in favor of specializing for convert-hf-to-gguf.py.
2024-05-23 18:50:21 -04:00
Georgi Gerganov e84b71c2c6
ggml : drop support for QK_K=64 (#7473)
* ggml : drop support for QK_K=64

ggml-ci

* opencl : restore QK_K=256 define
2024-05-23 10:00:21 +03:00
Christian Zhou-Zheng 2dd784108b Merge remote-tracking branch 'origin' into convert-split 2024-05-22 20:23:13 -04:00
liuwei-git 201cc11afa
llama : add phi3 128K model support (#7225)
* add phi3 128k support in convert-hf-to-gguf

* add phi3 128k support in cuda

* address build warnings on llama.cpp

* adjust index value in cuda long rope freq factors

* add long rope support in ggml cpu backend

* make freq factors only depend on ctx size

* remove unused rope scaling type 'su' from gguf converter

* fix lint warnings on convert-hf-to-gguf.py

* set to the short freq factor when context size is smaller than trained context size

* add one line of comments

* metal : support rope freq_factors

* ggml : update ggml_rope_ext API to support freq. factors

* backends : add dev messages to support rope freq. factors

* minor : style

* tests : update to use new rope API

* backends : fix pragma semicolons

* minor : cleanup

* llama : move rope factors from KV header to tensors

* llama : remove tmp assert

* cuda : fix compile warning

* convert : read/write n_head_kv

* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-21 23:28:32 +03:00
Georgi Gerganov fabf30b4c4
llama : remove Persimmon (#7408)
* llama : remove Persimmon

* requirements : remove
2024-05-21 02:35:28 +10:00
compilade ee52225067
convert-hf : support direct Q8_0 conversion (#7234)
* convert-hf : support q8_0 conversion

* convert-hf : add missing ftype

This was messing with the checksums otherwise.

* convert-hf : add missing ftype to Baichuan and Xverse

I didn't notice these on my first pass.
2024-05-13 14:10:51 -04:00
compilade 5a419926b0
convert-hf : support bfloat16 conversion (#7158)
* convert-hf : support bfloat16 conversion

* gguf-py : flake8 fixes

* convert-hf : add missing space after comma

* convert-hf : get bit-exact same output as ./quantize

The quantization version was missing.

* convert-hf : don't round bf16 NANs

* convert-hf : save some memory with np.int16 intermediate bf16 weights

* convert-hf : more closely match llama.cpp with which weights to keep in f32

* convert-hf : add --outtype auto-f16

A reason for this to exist is for model quantizers who want an initial
GGUF with the most fidelity to the original model while still using
a 16-bit float type instead of 32-bit floats.

* convert-hf : remove a semicolon because flake8 doesn't like it

It's a reflex from when programming in C/C++, I guess.

* convert-hf : support outtype templating in outfile name

* convert-hf : rename --outtype auto-f16 to --outtype auto
2024-05-11 11:06:26 -04:00
Joan Fontanals b83cc3f5b3
llama : add Jina Embeddings architecture (#6826)
* feat: first things to do

* feat: create tensors for Jina architecture

* fix: use other tensors

* feat: embedding gets results

* fix: fix usage of ALIBI

* fix: clean prints

* fix: do some cleanup unused vars

* fix: revert changes to Makefile and CMakeLists

* fix: revert some changes

* fix: fix small detail

* fix: fix convert formatting

* fix: fix linting and editor

* feat: set proper vocab settings

* fix: JinaBertForMaskedLM registration

* feat: support q_normalization and k_normalization in Jina arch

* feat: handle gpt2 tokenizer with Jina architecture

* feat: example comments in embedding

* feat: rename Jina Bert to Jina Bert V2

* fix: add some changes as per review

* feat: proper KQ_pos for Jina embeddings

* feat: add capacity to load models ES and DE for Spanish

* llama : fix pre-tokenizers

* ggml : full ALiBi support

* ggml : update ggml_soft_max_ext() CUDA, SYCL

* ggml : ggml_flash_attn_ext() support ALiBi (CPU)

* ggml : ggml_flash_attn_ext() support ALiBi (Metal)

* ggml : fix warning

* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)

ggml-ci

* minor : clean-up

* embedding : add warning about missing SEP

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-11 10:46:09 +03:00