Commit Graph

842 Commits

Author SHA1 Message Date
HanishKVC c4e829d492 ChatON:Mistral: Decouple \n from suffix, use wrt sys msg 2024-05-06 11:27:56 +05:30
HanishKVC 55e3d63f13 ChatON:Mistral: Update to match jinja file 2024-05-06 11:27:56 +05:30
HanishKVC ad5e5216ce ChatON:Mistral: Add detailed meta json entries 2024-05-06 11:27:56 +05:30
HanishKVC 368fbf17a1 ChatON:ChatML: Update wrt detailed meta json 2024-05-06 11:27:56 +05:30
HanishKVC a64dcd7796 ChatON:Zephyr: Update wrt detailed meta json, also update eos
Pick the EOS from Zephyr's tokenizer_config, which differs from what
was hardcoded in the existing llama_chat_apply_template.
2024-05-06 11:27:56 +05:30
HanishKVC 18cd12524f ChatON:Monarch:Update wrt detailed meta json 2024-05-06 11:27:56 +05:30
HanishKVC 006a398ebf ChatON:DeepSeekCoder: Update tmplid and wrt detailed meta json 2024-05-06 11:27:56 +05:30
HanishKVC 1b2e921186 ChatON:DeepSeek: Update support wrt detailed meta json 2024-05-06 11:27:56 +05:30
HanishKVC 403a6c4323 ChatON:Gemma: update for detailed meta json
Also, as part of the same change, add a user role entry for the
system role as well.
2024-05-06 11:27:56 +05:30
HanishKVC 01c8db70f7 ChatON+Main: Add C_API wrapper for single
Add a C API wrapper for the single-message tagging scenario.

To match the convention followed by the existing chat_apply_template
code, make it return the expected size of the tagged message string
buffer. Update the internal single-message logic to support this.

Explicitly check whether the specified tmpl is available in the loaded
JSON, and return an error if it is not found. (A hypothetical sketch of
such a wrapper follows below.)
2024-05-06 11:27:56 +05:30
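To make the size-returning convention concrete, here is a minimal, hypothetical sketch of what such a single-message tagging wrapper could look like. The function name, signature, and the ChatML-style tags are assumptions for illustration, not the actual chaton C API.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical sketch: returns the number of bytes needed for the tagged
// message (excluding the NUL), or -1 if the template id is not present in
// the loaded meta JSON. Per the llama_chat_apply_template convention, the
// caller may pass buf == nullptr (or a too-small buf_len) to query the size.
extern "C" int32_t chaton_tmpl_apply_single(
        const char * tmpl_id,
        const char * role,
        const char * content,
        char       * buf,
        int32_t      buf_len) {
    static const std::string known_tmpl = "chatml"; // stand-in for the meta-JSON lookup
    if (tmpl_id == nullptr || known_tmpl != tmpl_id) {
        return -1; // specified template not found in the loaded JSON
    }
    const std::string tagged = std::string("<|im_start|>") + role + "\n" + content + "<|im_end|>\n";
    if (buf != nullptr && buf_len >= (int32_t) tagged.size()) {
        std::memcpy(buf, tagged.data(), tagged.size());
    }
    return (int32_t) tagged.size(); // expected size of the tagged message string
}
```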
HanishKVC 13857f29d6 ChatON+Main: Updates wrt detailed meta json
Fix an oversight wrt the key name.

Add an alert in case the passed meta JSON file contains a begin (BoS)
entry for the assistant role, similar to the existing check for an
end (EoS) entry for the user role, because normally both (i.e. EoS for
user and BoS for assistant) shouldn't be needed.

Update main wrt the begin & prefix and suffix & end additions.
2024-05-06 11:27:56 +05:30
HanishKVC b9e31304a5 ChatON: Update to new detailed format wrt llama2 and llama3
Wrt llama2
* add bos wrt llama2 system and user begins, but not assistant
* split the system suffix into suffix and end, and add systemuser-system
  flags so that end can be skipped for a system+user message combo
* add eos wrt assistant end
* With these changes, this should potentially work with the main and
  server flows (an illustrative sketch of the llama2 tagging shape
  follows below)

Wrt llama3
* add empty begin and end fields and systemuser-system flags
* This should potentially work with the main and server flows
2024-05-06 11:27:56 +05:30
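For reference, a sketch of the llama2 tagging shape the fields above aim to produce; `<s>`, `[INST]`, and `<<SYS>>` are the conventional llama2 chat markers, but the composition below is an illustration of the begin/end split, not actual chaton output.

```cpp
#include <string>

// Illustrative only: system and user share a single [INST] block, so the
// system "end" is skipped (the systemuser-system flags above); begin/BoS is
// added on the system and user side, while EoS belongs to the assistant end.
static std::string tag_llama2_sysuser(const std::string & sys, const std::string & user) {
    const std::string begin = "<s>"; // bos
    return begin + "[INST] <<SYS>>\n" + sys + "\n<</SYS>>\n\n" + user + " [/INST]";
}
```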
HanishKVC bf1167bfdb ChatON: Backup the current simple meta json file 2024-05-06 11:27:56 +05:30
HanishKVC 6b23f15ffe ChatON:ChatOnMetaJSon: Add suffix wrt assistant messages 2024-05-06 11:27:56 +05:30
HanishKVC 3064a36e74 ChatON+:Update tmpl_role_kv to retrieve wrt multiple keys
Use it for the user role's begin and prefix entries. (A hypothetical
sketch of such a multi-key lookup follows below.)
2024-05-06 11:27:56 +05:30
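A hypothetical sketch of such a multi-key lookup in the nlohmann::json style of the vendored json.hpp; the helper name comes from the commit, but the body and the meta-JSON layout are assumptions.

```cpp
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Return the value of the first key found for the given template and role,
// so old and new meta-JSON field names can coexist; "" when none is present.
static std::string tmpl_role_kv(const json & meta, const std::string & tmpl,
                                const std::string & role,
                                const std::vector<std::string> & keys) {
    for (const auto & key : keys) {
        if (meta.contains(tmpl) && meta[tmpl].contains(role) && meta[tmpl][role].contains(key)) {
            return meta[tmpl][role][key].get<std::string>();
        }
    }
    return "";
}

// usage: prefer a detailed "begin" entry, else fall back to "prefix"
// std::string begin = tmpl_role_kv(meta, "llama2", "user", { "begin", "prefix" });
```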
HanishKVC f1f39c5256 ChatON:Add Monarch model template, which uses Begin + Prefix
In turn, Begin/BoS is added only for non-first user messages in a
system+user prompt chain.
2024-05-06 11:27:56 +05:30
HanishKVC 0f713d4c4f ChatOn: meta json update wrt the new begin related fields 2024-05-06 11:27:56 +05:30
HanishKVC 84367b9fd1 ChatON: Add template for DeepSeek
Looking at the tokenized vector, I noticed that the EOS used by
llama.cpp's existing chat_apply_template differs from what appears in
tokenizer_config.json of the DeepSeek LLM, so I have added two entries:

* "deepseek-alt", which matches llama.cpp's chat_apply_template, and
* "deepseek", which matches tokenizer_config.json.

This impacts the assistant suffix and reverse-prompt entries.

Because of this, I need to revisit the other entries I added
previously at a later time. However, since the default logic should
pick the EOS from the model file, an out-of-sync reverse prompt may
not matter beyond a limit.
2024-05-06 11:27:56 +05:30
HanishKVC f4b54069f6 ChatON: Add template for Gemma 2024-05-06 11:27:56 +05:30
HanishKVC 2a8028fba8 ChatON: Add Zephyr template to meta-json file 2024-05-06 11:27:56 +05:30
HanishKVC 42f6b45547 ChatON: Use the constants defined for the keys 2024-05-06 11:27:56 +05:30
HanishKVC efb758ba7d ChatON: Rename helpers to kv suffix, updated wrt metaok
Renamed because they return the value of the specified key.

[main] Update metaok to take a template-id, so that one can cross-check
that all the entries needed for that template-id are present in the
chaton-meta JSON file.
2024-05-06 11:27:56 +05:30
HanishKVC 11b47fbcfc ChatON:MetaJson: Add key constants, check metaJson loaded ifNeeded 2024-05-06 11:27:56 +05:30
HanishKVC 221ccd6462 ChatOn: Add SystemUser-1st-User-Has-Prefix flag support
Llama2 seems to need it, so the sample chaton-meta JSON file has been
updated to use it.
2024-05-06 11:27:56 +05:30
HanishKVC c4cf0e9075 ChatON:Cleanup: BeginEnd, Debug log
Update the note.

Rename global-prefix|suffix to global-begin|end.

Rename chat-apply-template to chat-apply-template-single, because it
handles only a single message.

Add some debug log messages to the helper functions.
2024-05-06 11:27:56 +05:30
HanishKVC d87d27512e ChatOn: update sample meta json a bit
Move [inst] [/inst] for llama2 from the global to the individual
role-specific parts.

Avoid an extra \n in the llama3 prefixes.
2024-05-06 11:27:55 +05:30
HanishKVC cdbe4f06ce Chaton:Sample Meta JSON cleanup 2024-05-06 11:27:55 +05:30
HanishKVC 050d329e7e ChatOn+Main: Initial go at chaton in main interactive flow 2024-05-06 11:27:55 +05:30
HanishKVC 1374a64200 Chaton:Meta: Add chatml meta data to sample meta json file 2024-05-06 11:27:55 +05:30
HanishKVC 093abc29a2 ChatOn: Update sample meta json to be a valid json 2024-05-06 11:27:55 +05:30
HanishKVC dc56be951d ChatOn:Main: Load and dump any specified chaton meta file 2024-05-06 11:27:55 +05:30
kunnis 628b299106
Adding support for the --numa argument for llama-bench. (#7080) 2024-05-05 14:17:47 +02:00
Xuan Son Nguyen 842500144e
gguf-split: add --no-tensor-first-split (#7072) 2024-05-04 18:56:22 +02:00
maor-ps 03fb8a002d
If the first token generated by the server is the stop word, the server will crash (#7038)
This request reproduces the issue with llama13b:
{
  "prompt": "Q: hello world \nA: ",
  "stop": ["\n"],
  "temperature": 0.0,
  "n_predict": 10,
  "cache_prompt": true,
  "n_probs": 10
}
2024-05-04 11:06:40 +02:00
l3utterfly 8d608a81b7
main : fix off by one error for context shift (#6921) 2024-05-01 22:27:41 +03:00
Johannes Gäßler 3ea0d36000
Server: add tests for batch size, different seeds (#6950) 2024-05-01 17:52:55 +02:00
Johannes Gäßler a8f9b07631
perplexity: more statistics, added documentation (#6936)
* perplexity: more statistics, added documentation

* add LLaMA 3 8b scoreboard
2024-04-30 23:36:27 +02:00
Georgi Gerganov 9c67c2773d
ggml : add Flash Attention (#5021)
* ggml : add ggml_flash_attn_ext API

* ggml : fix GQA support in ggml_flash_attn_ext

* ggml : online attention (CPU)

* metal : initial implementation

* metal : f16 precision

* metal : reduce branches

* metal : specialize for head size

* wip : 8 rows per simd group

* wip : 4 rows per simd group

* wip : template for rows per warp

* metal : parallelize across KV size

* metal : parallel reduce across heads

* metal : efficient flash_attn_f16 implementation

* metal : avoid redundant loads of the attention

* metal : scale and mask in matrix form

* metal : fix comment

* llama : avoid ggml_cast, use F32 query

* metal : add parallel reduce version (disabled)

* metal : move output into local memory + optimize

- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments

* metal : add tests, fix scaling, support C > 32

* metal : improve precision

* ggml : fix f16 mad

* metal : minor

* metal : support Q > 8

* tests : add ATTN tests

* metal : disable buffer allocation logs

* tests : more

* metal : faster inner loop for C == 32

* metal : fix array initialization

* tests : ifdef

* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext

* ggml : fix ggml_soft_max mask requirement

* cuda : fix soft_max to use correct mask size

* cuda : add flash_attn kernel (wip)

* metal : optimize softmax for C > 32

* metal : optimize softmax

* tests : minor fix

* cuda : avoid zeroing fragments

* tests : update dims

* cuda : fix __hisinf() result check

* cuda : avoid warp_reduce for smax

* cuda : use int instead of int64_t

Noticeably improves performance (thanks to Johannes)

* cuda : make loops use the same loop values

Thanks Johannes again for the tip

* cuda : unroll some of the loops

* cuda : avoid __hisinf branches

* cuda : use half2 in softmax

* cuda : switch to 1 warp for bs > 16

* cuda : speed-up reduce part of the kernel

* cuda : unroll Q*K^T loop

* cuda : fix -INF block check

* cuda : simplify softmax

* cuda : fix matrix names

* cuda : minor

* llama : adapt to F16 KQ_pos

* llama : adapt new models to F16 KQ_mask

* ggml : fix F16 store (ARM NEON)

* llama : fix type of KQ_mask and KQ_pos

* ggml : fix CPU soft_max

* tests : add hs=256

* cuda : fix build

* metal : improve perf via smaller int registers

* cuda : adapt soft_max to F16 mask and pos

* CUDA: faster FlashAttention, kernel for bs == 1

* 16 cols for Phi-2

* no vec for hs, no hs==256 ncols==32 for Volta

* adjust kernel selection logic

* 4 warps, 256 stride for all D

* no ncols == 64

* Multiple parallel blocks for batch size 1

* fix compile warnings

* fix excessive KQ_b loads

* fix cmake build

* fix KV cache padding, NaN from INFINITY (#6438)

* llama : flash_attn cparam + fix defrag

* server: support flash_attn param

* server: bench: enable flash_attn param

* CUDA: refactor host code, dyn. par. blocks

* fix flash_attn_vec_f16 race condition

* flush softmax exp below threshold to 0

* store temp KQ in registers

* Calculate KQ as FP32 if KQV has GGML_PREC_F32

* Add __hgt2_mask implementation for CUDA 11

* fix KQ FP32 precision for parallel_blocks > 1

* llama-bench : add -fa,--flash-attn arg

* metal : add BS=1 kernel for flash attention (#6508)

* metal : add BS=1 kernel for flash attention (wip)

* metal : support more than 1 warps

* metal : opts

* metal : opt

* metal : switch to parallel reduce

* metal : reduce registers

* metal : simplify

* metal : initial FA vec kernel

* metal : use F32 attention accumulators

* batched-bench : add fattn arg

* llama : simplify llama_build_kv_store

ggml-ci

* llama : adapt build_olmo to changes

* ggml : fix arm fp16 store on windows

* metal : clean-up

* metal : clean-up kernel code

* metal : minor

* tests : remove benchmarks

ggml-ci

* ggml : fix avx512 const correctness

ggml-ci

* ggml : fix soft_max with bias on CPU

ggml-ci

* common : print --flash-attn in help

* ggml : fix num dimensions in ggml_flash_attn_ext

* llama : force disable flash attention for incompatible models

* ggml : ggml_soft_max support F16/F32 mask/pos

ggml-ci

* cuda : uint -> uint32_t

* cuda : "constexpr dim3" -> "const dim3"

ggml-ci

* cuda : try to fix __hgt2_mask

ggml-ci

* ggml : add TODO's for F16/F32 mask/pos support in other backends

* llama : replace bool need_kq_pos with use_alibi

* llama : prep ALiBi support for BERT models

ggml-ci

* llama : fix n_batch requirements

ggml-ci

* cont

* server : add help for --flash-attn arg

* llama : disable FA for AMD

* tests : remove TMP_ATTN_BENCH

ggml-ci

* llama : support save/load state with FA enabled

ggml-ci

* ci : add CUDA save-load-state tests

ggml-ci

* llama : llama_kv_cache_clear zeroes data + fix save-load seq

ggml-ci

* llama : fix copy-paste errors, add TODO

* llama : disallow incompatible states

* llama : update llama_state_get_size after v_trans field

* metal : remove tmp log

* llama : add static reminder for llama_state_get_size

* metal : fix max nsg

ggml-ci

* ci : fix arg order

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-04-30 12:16:08 +03:00
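For orientation, a minimal sketch of how the fused op slots in on the graph-building side, assuming the ggml_flash_attn_ext signature introduced here (ctx, q, k, v, mask, scale); argument lists and tensor layout requirements may differ between revisions, so treat this as illustrative rather than canonical. On the user-facing side, the feature is gated behind the new -fa/--flash-attn argument mentioned above.

```cpp
#include "ggml.h"
#include <cmath>

// One fused call replaces the unfused KQ = K*Q, soft_max(KQ*scale + mask),
// KQV = V*softmax chain; the query stays F32 and the mask is padded F16,
// per the commits above.
static struct ggml_tensor * build_attn_fused(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,         // query (F32)
        struct ggml_tensor  * k,         // keys from the KV cache
        struct ggml_tensor  * v,         // values from the KV cache
        struct ggml_tensor  * kq_mask) { // padded F16 attention mask
    const float scale = 1.0f / sqrtf((float) q->ne[0]); // 1/sqrt(head_dim)
    return ggml_flash_attn_ext(ctx, q, k, v, kq_mask, scale);
}
```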
Olivier Chafik 8843a98c2b
Improve usability of --model-url & related flags (#6930)
* args: default --model to models/ + filename from --model-url or --hf-file (or else legacy models/7B/ggml-model-f16.gguf)

* args: main & server now call gpt_params_handle_model_default

* args: define DEFAULT_MODEL_PATH + update cli docs

* curl: check url of previous download (.json metadata w/ url, etag & lastModified)

* args: fix update to quantize-stats.cpp

* curl: support legacy .etag / .lastModified companion files

* curl: rm legacy .etag file support

* curl: reuse regex across headers callback calls

* curl: unique_ptr to manage lifecycle of curl & outfile

* curl: nit: no need for multiline regex flag

* curl: update failed test (model file collision) + gitignore *.gguf.json
2024-04-30 00:52:50 +01:00
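A hedged sketch of the defaulting rule from the first bullet: DEFAULT_MODEL_PATH and the legacy fallback value come from the commit messages above, while the helper names and exact precedence here are assumptions rather than the real gpt_params_handle_model_default.

```cpp
#include <string>

static const std::string DEFAULT_MODEL_PATH = "models/7B/ggml-model-f16.gguf"; // legacy default

// Derive models/<filename> from a URL or HF file path.
static std::string model_path_from_url(const std::string & url) {
    const size_t pos = url.find_last_of('/');
    return "models/" + (pos == std::string::npos ? url : url.substr(pos + 1));
}

static void handle_model_default(std::string & model,
                                 const std::string & model_url,
                                 const std::string & hf_file) {
    if (!model.empty())     return;                                            // --model set explicitly
    if (!hf_file.empty())   { model = model_path_from_url(hf_file);   return; }
    if (!model_url.empty()) { model = model_path_from_url(model_url); return; }
    model = DEFAULT_MODEL_PATH;
}
```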
Daniel Bevenius 5539e6fdd1
main : fix typo in comment in main.cpp (#6985)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-29 13:56:59 -04:00
Olivier Chafik b8a7a5a90f
build(cmake): simplify instructions (`cmake -B build && cmake --build build ...`) (#6964)
* readme: cmake . -B build && cmake --build build

* build: fix typo

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* build: drop implicit . from cmake config command

* build: remove another superfluous .

* build: update MinGW cmake commands

* Update README-sycl.md

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

* build: reinstate --config Release as not the default w/ some generators + document how to build Debug

* build: revert more --config Release

* build: nit / remove -H from cmake example

* build: reword debug instructions around single/multi config split

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2024-04-29 17:02:45 +01:00
cpumaxx ffe666572f
llava-cli : multiple images (#6969)
Co-authored-by: root <root@nenya.lothlorien.ca>
2024-04-29 17:34:24 +03:00
Pierrick Hymbert b7368332e2
ci: server: tests python env on github container ubuntu latest / fix n_predict (#6935)
* ci: server: fix python env

* ci: server: fix server tests after #6638

* ci: server: fix windows is not building PR branch
2024-04-27 17:50:48 +02:00
Pierrick Hymbert 0c4d489e29
quantize: add imatrix and dataset metadata in GGUF (#6658)
* imatrix: save the dataset file used in the output file

* llama: support kv overrides type string string

* common: factorize KV Overrides parsing between common and server

* quantize: add imatrix n entries and dataset KV metadata
quantize: factorize KV Overrides parsing between common
#6656

* llama: remove kv override str_value initialization as it does not compile on some toolchain

* quantize: add imatrix m_last_call as `quantize.imatrix.chunks_count`

* quantize: add imatrix filename in KV

* llama: add llama_model_kv_override_free

* common: add llama_model_kv_override_free
common: free kv override if used after model loading

* llama: finally move the string KV override value to the stack

* llama : minor

* no need to add a NUL to the std::vector, std::string can be initialized from a pair of iterators.

Co-authored-by: slaren <slarengh@gmail.com>

* kv override: ensure string termination

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-04-26 20:06:33 +02:00
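A rough sketch of a string-valued KV override of the kind this PR adds, with the value stored inline so nothing has to be freed after model load (the "on the stack" commits above); the struct and enum approximate llama.h's llama_model_kv_override of that era and should not be read as the exact definition.

```cpp
#include <cstdint>
#include <cstdio>

enum kv_override_type { KV_TYPE_INT, KV_TYPE_FLOAT, KV_TYPE_BOOL, KV_TYPE_STR };

struct kv_override {
    enum kv_override_type tag;
    char key[128];
    union {
        int64_t val_i64;
        double  val_f64;
        bool    val_bool;
        char    val_str[128]; // inline storage: no separate free step needed
    };
};

// e.g. recording the imatrix chunk count under the key named in the commits
static kv_override make_chunks_override(int64_t chunks) {
    kv_override o = {};
    std::snprintf(o.key, sizeof(o.key), "quantize.imatrix.chunks_count");
    o.tag     = KV_TYPE_INT;
    o.val_i64 = chunks;
    return o;
}
```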
Pierrick Hymbert 7f5ff558ee
server: stop generation at `n_ctx_train` if `n_predict` is not set (#6638)
* server: cap n_predict if not set to n_ctx_train

* server: fix infinite loop

* server: infinite loop, move in process_token
server: infinite loop: set stop limit to true

* minor: spaces

* minor: spaces

* server: include prompt tokens in the EOS limit
2024-04-26 12:15:30 +02:00
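A minimal sketch of the stopping rule this PR describes, with made-up names: an unset n_predict falls back to a cap derived from n_ctx_train, and the limit counts prompt tokens too (per the last bullet).

```cpp
#include <cstdint>

struct slot_limits {
    int32_t n_predict;   // -1 when the client did not set it
    int32_t n_ctx_train; // the model's training context size
};

static bool should_stop(const slot_limits & lim, int32_t n_prompt, int32_t n_generated) {
    if (lim.n_predict >= 0) {
        return n_generated >= lim.n_predict;          // explicit client limit wins
    }
    // no client limit: stop once prompt + generated tokens reach n_ctx_train,
    // which is what breaks the infinite-generation loop described above
    return n_prompt + n_generated >= lim.n_ctx_train;
}
```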
Pierrick Hymbert 5790c8dac1
bench: server add stop word for PHI-2 (#6916) 2024-04-26 09:26:16 +02:00
vik 46e12c4692
llava : add support for moondream vision language model (#6899)
* add support for moondream vision language model

This required making the following changes to the CLIP model:

1. Support for patch embedding bias.
2. Make class embedding and pre-layernorm optional.
3. Add support for post-layernorm.

* Update examples/llava/clip.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-25 22:38:31 +03:00
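A hedged sketch of handling the "optional" tensors listed above at load time: ggml_get_tensor is a real ggml call that returns NULL for a missing name, but the tensor names in the usage comments are guesses, not clip.cpp's actual identifiers.

```cpp
#include "ggml.h"

// Look a tensor up by name and tolerate its absence; the caller simply skips
// the corresponding layer (e.g. pre-layernorm) when this returns nullptr.
static struct ggml_tensor * get_tensor_opt(struct ggml_context * ctx, const char * name) {
    return ggml_get_tensor(ctx, name); // nullptr when absent from the GGUF
}

// usage sketch (names are guesses):
// pre_ln_w  = get_tensor_opt(ctx, "v.pre_ln.weight");   // now optional
// post_ln_w = get_tensor_opt(ctx, "v.post_ln.weight");  // new post-layernorm
// patch_b   = get_tensor_opt(ctx, "v.patch_embd.bias"); // new patch-embedding bias
```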
Daniel Bevenius 4ab99d8d47
clip : rename lerp function to avoid conflict (#6894)
This commit renames the lerp (linear interpolation) function in clip.cpp
to avoid a conflict with the lerp function in the <cmath> standard C++
library when using C++20.

The motivation for this change is to enable projects that use C++20 to
compile clip.cpp without having to resort to patching it. The lerp
function was added to <cmath> in C++20 (202002L), which is why it is not
causing any issue at the moment, as llama.cpp currently uses C++11 or
C++17.

I realize that llama.cpp uses either C++11 (or C++17 in the case of
SYCL), but wanted to ask if this would be an acceptable change just the
same.

Refs: https://en.cppreference.com/w/cpp/numeric/lerp

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-25 15:38:14 +03:00
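A small self-contained illustration of the clash and the rename; the replacement name clip_lerp is used here for illustration and may not match the identifier chosen in the commit.

```cpp
#include <cmath>

// Renamed from plain `lerp`: under -std=c++20, <cmath> declares std::lerp,
// and an unqualified file-local lerp can become ambiguous in code that pulls
// in namespace std. Renaming sidesteps the clash on all standard levels.
static float clip_lerp(float a, float b, float t) {
    return a + t * (b - a);
}

int main() {
    const float v1 = clip_lerp(0.0f, 10.0f, 0.5f); // file-local version
#if __cplusplus >= 202002L
    const float v2 = std::lerp(0.0f, 10.0f, 0.5f); // standard version, C++20+
    return (int) (v1 + v2);
#else
    return (int) v1;
#endif
}
```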
Georgi Gerganov aa750c1ede
tests : minor bash stuff (#6902)
* tests : minor bash stuff

ggml-ci

* llama : fix build

ggml-ci

* tests : fix CUR_DIR -> ROOT_DIR

ggml-ci

* tests : fix fname

ggml-ci
2024-04-25 14:27:20 +03:00
jiez 1966eb2615
quantize : add '--keep-split' to quantize model into shards (#6688)
* Implement '--keep-split' to quantize model into several shards

* Add test script

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Split model correctly even if tensor id is out-of-order

* Update llama_model_quantize_params

* Fix preci failures

---------

Co-authored-by: z5269887 <z5269887@unsw.edu.au>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-25 13:29:35 +03:00