Commit Graph

878 Commits

Author SHA1 Message Date
brian khuu 12fcea5d04 llama: rename llama_token_is_control_token() to llama_token_is_control() 2024-05-22 01:45:07 +10:00
Justine Tunney d8b373c146
Merge branch 'master' into grammar-token 2024-05-21 08:11:32 -07:00
brian khuu 8f76ba54ba main: refactor ctrl_token_no_out --> no_special 2024-05-21 16:05:58 +10:00
brian khuu 7d52482bac main: renamed --no-special from --ctrl-token-no-out and other refactoring 2024-05-21 16:00:59 +10:00
brian khuu c1e8a6d1c0 main: must check pipe status on very top of program 2024-05-21 14:58:34 +10:00
brian khuu 50048f5b45 main: rejig control token descriptor handling 2024-05-21 11:20:00 +10:00
brian khuu 90456a5717 main: only merge stdout and control token if not in conversation or grammar mode 2024-05-21 04:57:26 +10:00
brian khuu c9ea9df7fb main: remove --ctrl-token-fd-out in favor for fcntl() based detection 2024-05-21 04:44:50 +10:00
jaime-m-p 917dc8cfa6
Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
* Update brute force test: special tokens
* Fix added tokens
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.
* Fix special tokens rtrim

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
Johannes Gäßler 20385cebcc
perplexity: update README FP16 results [no ci] (#7413) 2024-05-20 18:15:38 +02:00
brian khuu ad4b6097c0 main: dprintf isn't part of the IEEE POSIX standard. Just use write(). 2024-05-21 00:37:12 +10:00
brian khuu 9f445a793d main: use dprintf and add --ctrl-token-no-out and --ctrl-token-fd-out 2024-05-20 22:35:08 +10:00
Georgi Gerganov 3bc10cb485
server : fix temperature + disable some tests (#7409)
* server : fix temperature

* server : disable tests relying on parallel determinism

* ci : change server Debug -> RelWithDebInfo
2024-05-20 22:10:03 +10:00
Georgi Gerganov 1cc0155d04
server : tuning tests (#7388)
* server : don't pass temperature as string

* server : increase timeout

* tests : fix the fix 0.8f -> 0.8

ggml-ci

* tests : set explicit temperature
2024-05-20 10:16:41 +03:00
Georgi Gerganov e932094d58
server : return error on too large embedding input (#7389) 2024-05-20 08:56:05 +03:00
Georgi Gerganov 2789baf480
tests : fix --keep_split -> --keep-split (#7374) 2024-05-20 08:55:09 +03:00
Fred Douglas 1ea2a0036e
quantize : fix --keep-split check (#7374) 2024-05-19 19:37:04 +03:00
brian khuu bcd24f8974 main: use seperate stream for control characters 2024-05-20 01:57:43 +10:00
Johannes Gäßler 1b01f06db0
server: add test for token probs (#7347) 2024-05-19 16:26:02 +02:00
Johannes Gäßler 41858392e1
server: fix seed being reported back (#7382) 2024-05-19 17:06:33 +03:00
Georgi Gerganov 854d365aba
cmake : update android comments (#7341) 2024-05-19 11:01:01 +03:00
Georgi Gerganov 511182eabb
android : use "ci-android" branch for CI (#7341)
* android : use "ci-android" branch for CI

* ggml : disable SIMD exp and silu for 32-bit ARM

ggml-ci

* android : do not fetch, use add_subdirectory instead

* cmake : provide binary dir
2024-05-18 20:40:39 +10:00
Johannes Gäßler cb42c29427
server: correct --threads documentation [no ci] (#7362) 2024-05-18 11:10:47 +02:00
strawberrymelonpanda ca57e0f35e
perplexity : ndot progress and show stats with < 100 tasks (#7348)
Fix floating point error with ndot printing, allow end stats on lower task numbers if multiple-choice tasks.
2024-05-18 10:57:08 +03:00
Brian ea70e2800b
Merge branch 'master' into grammar-token 2024-05-18 15:28:48 +10:00
Radoslav Gerganov f4bd8b3d26
rpc : set SO_REUSEADDR for the server socket (#7320)
ref: #7293
2024-05-17 17:25:44 +03:00
Radoslav Gerganov ee94172d33
server : add support for the RPC backend (#7305)
ref: #7292
2024-05-17 10:00:17 +03:00
Leon Knauer 9c4fdcbec8
[Server] Added --verbose option to README [no ci] (#7335) 2024-05-17 10:11:03 +10:00
Pierrick Hymbert 24ecb58168
Revert "server bench: fix bench not waiting for model load (#7284)" (#7334)
This reverts commit 583fd6b000.
2024-05-16 20:43:45 +02:00
Radoslav Gerganov 9afdffe70e rpc : get available mem for the CPU backend
This can be overridden with the -m command line option

ref: #7293
2024-05-16 12:04:08 +03:00
Radoslav Gerganov 3b3963c55c rpc : add command line arg for specifying backend memory
ref: #7293
2024-05-16 09:58:29 +03:00
Vaibhav Srivastav ad52d5c259
doc: add references to hugging face GGUF-my-repo quantisation web tool. (#7288)
* chore: add references to the quantisation space.

* fix grammer lol.

* Update README.md

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Update README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-16 15:38:43 +10:00
slaren 344f9126cc
ggml : tag ggml_tensor::backend as deprecated (#7290) 2024-05-15 15:08:48 +02:00
dm4 ea3b0590ee
embedding : free the batch after execution (#7297) 2024-05-15 15:01:12 +03:00
Johannes Gäßler 583fd6b000
server bench: fix bench not waiting for model load (#7284) 2024-05-15 08:44:16 +02:00
Steve Grubb 4f0263633b
server: free sampling contexts on exit (#7264)
* server: free sampling contexts on exit

This cleans up last leak found by the address sanitizer.

* fix whitespace

* fix whitespace
2024-05-14 16:11:24 +02:00
Brian 1265c670fd
Revert "move ndk code to a new library (#6951)" (#7282)
This reverts commit efc8f767c8.
2024-05-14 16:10:39 +03:00
Radoslav Gerganov 5e31828d3e
ggml : add RPC backend (#6829)
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
2024-05-14 14:27:19 +03:00
Elton Kola efc8f767c8
move ndk code to a new library (#6951) 2024-05-14 17:30:30 +10:00
Ryuei 27f65d6267
docs: Fix typo and update description for --embeddings flag (#7026)
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
2024-05-14 15:20:47 +10:00
k.h.lai 30e70334f7
llava-cli: fix base64 prompt (#7248) 2024-05-14 00:02:36 +10:00
Johannes Gäßler 1c570d8bee
perplexity: add BF16 vs. FP16 results (#7150) 2024-05-13 13:03:27 +02:00
Benjamin Findley e586ee4259
change default temperature of OAI compat API from 0 to 1 (#7226)
* change default temperature of OAI compat API from 0 to 1

* make tests explicitly send temperature to OAI API
2024-05-13 12:40:08 +10:00
Xuan Son Nguyen 72c177c1f6
fix system prompt handling (#7153) 2024-05-11 17:28:10 +02:00
Steve Grubb 988631335a
server : free llama_batch on exit (#7212)
* [server] Cleanup a memory leak on exit

There are a couple memory leaks on exit of the server. This hides others.
After cleaning this up, you can see leaks on slots. But that is another
patch to be sent after this.

* make tab into spaces
2024-05-11 11:13:02 +03:00
Johannes Gäßler 5ae3426b0b
server: fix reported top tokens for temperature 0 (#7203) 2024-05-11 10:11:28 +02:00
Joan Fontanals b83cc3f5b3
llama : add Jina Embeddings architecture (#6826)
* feat: first things to do

* feat: create tensors for Jina architecture

* fix: use other tensors

* feat: embedding gets results

* fix: fix usage of ALIBI

* fix: clean prints

* fix: do some cleanup unused vars

* fix: revert changes to Makefile and CMakeLists

* fix: revert some changes

* fix: fix small detail

* fix: fix convert formatting

* fix: fix linting and editor

* feat: set proper vocab settings

* fix: JinaBertForMaskedLM registration

* feat: support q_normalization and k_normalization in Jina arch

* feat: handle gpt2 tokenizer with Jina architecture

* feat: example comments in embedding

* feat: rename Jina Bert to Jina Bert V2

* fix: add some changes as per review

* feat: proper KQ_pos for Jina embeddings

* feat: add capacity to load models ES and DE for Spanish

* llama : fix pre-tokenizers

* ggml : full ALiBi support

* ggml : update ggml_soft_max_ext() CUDA, SYCL

* ggml : ggml_flash_attn_ext() support ALiBi (CPU)

* ggml : ggml_flash_attn_ext() support ALiBi (Metal)

* ggml : fix warning

* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)

ggml-ci

* minor : clean-up

* embedding : add warning about missing SEP

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-11 10:46:09 +03:00
slaren e849648888
llama-bench : add pp+tg test type (#7199) 2024-05-10 18:03:54 +02:00
Justine Tunney 4e3880978f
Fix memory bug in grammar parser (#7194)
The llama.cpp grammar parser had a bug where forgetting to add a closing
quotation mark to strings would cause parsing to crash. Anyone running a
server on a public endpoint is advised to upgrade. To reproduce this bug

    ./llamafile -m foo.gguf -p bar --grammar 'root::="'

Credit for discovering and reporting this issue goes to Eclypsium
Security Researcher Richard Johnson <Richard.johnson@eclypsium.com>.
2024-05-10 21:01:08 +10:00
HanishKVC f89fe2732c
Main+: optionally allow special tokens from user in interactive mode (#7097)
@hanishkvc added a new `--interactive-specials` flag which would allow for inserting special tokens from user side into the embedding stream.
2024-05-10 20:21:58 +10:00