Commit Graph

883 Commits

Author SHA1 Message Date
Oleksandr Kuvshynov 1d6d9497a8 readme 2024-05-27 12:36:57 -04:00
Oleksandr Kuvshynov de26d49fbe duo: v5 2024-05-25 22:19:23 -04:00
Oleksandr Kuvshynov 7c8699add6 pass user data 2024-05-25 22:10:19 -04:00
Oleksandr Kuvshynov 534093878b duo: v3 2024-05-25 14:41:30 -04:00
Oleksandr Kuvshynov 96811fdf63 duo: v2 2024-05-25 14:23:57 -04:00
Oleksandr Kuvshynov 78938bc0c9 duo: v0 2024-05-25 13:59:28 -04:00
Oleksandr Kuvshynov 83aabb3fb7 readme 2024-05-24 23:56:48 -04:00
Oleksandr Kuvshynov 10d5aefed5 logging 2024-05-24 22:21:41 -04:00
Oleksandr Kuvshynov 66982abcb1 fixes 2024-05-24 12:22:59 -04:00
Oleksandr Kuvshynov 02e2c91d01 correct split id 2024-05-24 09:52:28 -04:00
Oleksandr Kuvshynov 60fe62e6eb some renaming 2024-05-22 23:52:36 -04:00
Oleksandr Kuvshynov 479c80a0db duo: cleanup v2 2024-05-22 23:31:23 -04:00
Oleksandr Kuvshynov eecdd3b0ce duo: first ~working option 2024-05-22 23:02:31 -04:00
Oleksandr Kuvshynov 2849247c4f duo: more cleanup 2024-05-21 22:45:59 -04:00
Oleksandr Kuvshynov f3965704fd duo: simplify a little 2024-05-21 22:31:52 -04:00
Oleksandr Kuvshynov d52d193e58 duo v0
setting up RPC + callback on each split completion

1. start rpc server on local instance on two different ports with 5GB
   allocated each.
2. set up another callback on completion of a split. This seems cleaner
   than trying to second-guess which tensor is the boundary of a split.
3. run it with 8B model @ 4bit, observe split_done captured at a reasonable place.

Next step - bring back linear speculation and start speculating on another remote
   instances.
2024-05-21 16:11:30 -04:00
Amir 11474e756d
examples: cache hf model when --model not provided (#7353)
* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided
2024-05-21 17:13:12 +03:00
jaime-m-p d7e852c1bc
Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
jaime-m-p 917dc8cfa6
Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
* Update brute force test: special tokens
* Fix added tokens
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.
* Fix special tokens rtrim

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
Johannes Gäßler 20385cebcc
perplexity: update README FP16 results [no ci] (#7413) 2024-05-20 18:15:38 +02:00
Georgi Gerganov 3bc10cb485
server : fix temperature + disable some tests (#7409)
* server : fix temperature

* server : disable tests relying on parallel determinism

* ci : change server Debug -> RelWithDebInfo
2024-05-20 22:10:03 +10:00
Georgi Gerganov 1cc0155d04
server : tuning tests (#7388)
* server : don't pass temperature as string

* server : increase timeout

* tests : fix the fix 0.8f -> 0.8

ggml-ci

* tests : set explicit temperature
2024-05-20 10:16:41 +03:00
Georgi Gerganov e932094d58
server : return error on too large embedding input (#7389) 2024-05-20 08:56:05 +03:00
Georgi Gerganov 2789baf480
tests : fix --keep_split -> --keep-split (#7374) 2024-05-20 08:55:09 +03:00
Fred Douglas 1ea2a0036e
quantize : fix --keep-split check (#7374) 2024-05-19 19:37:04 +03:00
Johannes Gäßler 1b01f06db0
server: add test for token probs (#7347) 2024-05-19 16:26:02 +02:00
Johannes Gäßler 41858392e1
server: fix seed being reported back (#7382) 2024-05-19 17:06:33 +03:00
Georgi Gerganov 854d365aba
cmake : update android comments (#7341) 2024-05-19 11:01:01 +03:00
Georgi Gerganov 511182eabb
android : use "ci-android" branch for CI (#7341)
* android : use "ci-android" branch for CI

* ggml : disable SIMD exp and silu for 32-bit ARM

ggml-ci

* android : do not fetch, use add_subdirectory instead

* cmake : provide binary dir
2024-05-18 20:40:39 +10:00
Johannes Gäßler cb42c29427
server: correct --threads documentation [no ci] (#7362) 2024-05-18 11:10:47 +02:00
strawberrymelonpanda ca57e0f35e
perplexity : ndot progress and show stats with < 100 tasks (#7348)
Fix floating point error with ndot printing, allow end stats on lower task numbers if multiple-choice tasks.
2024-05-18 10:57:08 +03:00
Radoslav Gerganov f4bd8b3d26
rpc : set SO_REUSEADDR for the server socket (#7320)
ref: #7293
2024-05-17 17:25:44 +03:00
Radoslav Gerganov ee94172d33
server : add support for the RPC backend (#7305)
ref: #7292
2024-05-17 10:00:17 +03:00
Leon Knauer 9c4fdcbec8
[Server] Added --verbose option to README [no ci] (#7335) 2024-05-17 10:11:03 +10:00
Pierrick Hymbert 24ecb58168
Revert "server bench: fix bench not waiting for model load (#7284)" (#7334)
This reverts commit 583fd6b000.
2024-05-16 20:43:45 +02:00
Radoslav Gerganov 9afdffe70e rpc : get available mem for the CPU backend
This can be overridden with the -m command line option

ref: #7293
2024-05-16 12:04:08 +03:00
Radoslav Gerganov 3b3963c55c rpc : add command line arg for specifying backend memory
ref: #7293
2024-05-16 09:58:29 +03:00
Vaibhav Srivastav ad52d5c259
doc: add references to hugging face GGUF-my-repo quantisation web tool. (#7288)
* chore: add references to the quantisation space.

* fix grammer lol.

* Update README.md

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Update README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-16 15:38:43 +10:00
slaren 344f9126cc
ggml : tag ggml_tensor::backend as deprecated (#7290) 2024-05-15 15:08:48 +02:00
dm4 ea3b0590ee
embedding : free the batch after execution (#7297) 2024-05-15 15:01:12 +03:00
Johannes Gäßler 583fd6b000
server bench: fix bench not waiting for model load (#7284) 2024-05-15 08:44:16 +02:00
Steve Grubb 4f0263633b
server: free sampling contexts on exit (#7264)
* server: free sampling contexts on exit

This cleans up last leak found by the address sanitizer.

* fix whitespace

* fix whitespace
2024-05-14 16:11:24 +02:00
Brian 1265c670fd
Revert "move ndk code to a new library (#6951)" (#7282)
This reverts commit efc8f767c8.
2024-05-14 16:10:39 +03:00
Radoslav Gerganov 5e31828d3e
ggml : add RPC backend (#6829)
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
2024-05-14 14:27:19 +03:00
Elton Kola efc8f767c8
move ndk code to a new library (#6951) 2024-05-14 17:30:30 +10:00
Ryuei 27f65d6267
docs: Fix typo and update description for --embeddings flag (#7026)
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
2024-05-14 15:20:47 +10:00
k.h.lai 30e70334f7
llava-cli: fix base64 prompt (#7248) 2024-05-14 00:02:36 +10:00
Johannes Gäßler 1c570d8bee
perplexity: add BF16 vs. FP16 results (#7150) 2024-05-13 13:03:27 +02:00
Benjamin Findley e586ee4259
change default temperature of OAI compat API from 0 to 1 (#7226)
* change default temperature of OAI compat API from 0 to 1

* make tests explicitly send temperature to OAI API
2024-05-13 12:40:08 +10:00
Xuan Son Nguyen 72c177c1f6
fix system prompt handling (#7153) 2024-05-11 17:28:10 +02:00