Oleksandr Kuvshynov
228b1bd487
Update README.md
2024-05-27 09:49:55 -07:00
Oleksandr Kuvshynov
1d6d9497a8
readme
2024-05-27 12:36:57 -04:00
Oleksandr Kuvshynov
de26d49fbe
duo: v5
2024-05-25 22:19:23 -04:00
Oleksandr Kuvshynov
7c8699add6
pass user data
2024-05-25 22:10:19 -04:00
Oleksandr Kuvshynov
534093878b
duo: v3
2024-05-25 14:41:30 -04:00
Oleksandr Kuvshynov
96811fdf63
duo: v2
2024-05-25 14:23:57 -04:00
Oleksandr Kuvshynov
78938bc0c9
duo: v0
2024-05-25 13:59:28 -04:00
Oleksandr Kuvshynov
83aabb3fb7
readme
2024-05-24 23:56:48 -04:00
Oleksandr Kuvshynov
10d5aefed5
logging
2024-05-24 22:21:41 -04:00
Oleksandr Kuvshynov
66982abcb1
fixes
2024-05-24 12:22:59 -04:00
Oleksandr Kuvshynov
02e2c91d01
correct split id
2024-05-24 09:52:28 -04:00
Oleksandr Kuvshynov
60fe62e6eb
some renaming
2024-05-22 23:52:36 -04:00
Oleksandr Kuvshynov
479c80a0db
duo: cleanup v2
2024-05-22 23:31:23 -04:00
Oleksandr Kuvshynov
eecdd3b0ce
duo: first ~working option
2024-05-22 23:02:31 -04:00
Oleksandr Kuvshynov
2849247c4f
duo: more cleanup
2024-05-21 22:45:59 -04:00
Oleksandr Kuvshynov
f3965704fd
duo: simplify a little
2024-05-21 22:31:52 -04:00
Oleksandr Kuvshynov
d52d193e58
duo v0
...
setting up RPC + callback on each split completion
1. start RPC servers on the local instance on two different ports, with 5GB
allocated to each.
2. set up another callback on completion of a split. This seems cleaner
than trying to second-guess which tensor is the boundary of a split.
3. run it with the 8B model @ 4bit; observe split_done captured at a reasonable place.
Next step - bring back linear speculation and start speculating on other remote
instances.
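Step 1 above could be sketched roughly as follows, assuming the llama.cpp `rpc-server` binary and its `-p` (port) and `-m` (memory, MiB) flags; the port numbers are illustrative:

```shell
# Start two RPC server instances on the local machine, each on its own
# port and each limited to roughly 5GB of backend memory.
./rpc-server -p 50052 -m 5120 &
./rpc-server -p 50053 -m 5120 &
```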
2024-05-21 16:11:30 -04:00
Amir
11474e756d
examples: cache hf model when --model not provided ( #7353 )
...
* examples: cache hf model when --model not provided
2024-05-21 17:13:12 +03:00
jaime-m-p
d7e852c1bc
Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) ( #7425 )
...
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
jaime-m-p
917dc8cfa6
Tokenizer SPM fixes for phi-3 and llama-spm ( #7375 )
...
* Update brute force test: special tokens
* Fix added tokens
- Try to read 'added_tokens.json'.
- Try to read 'tokenizer_config.json'.
- Try to read 'tokenizer.json'.
* Fix special tokens rtrim
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
Johannes Gäßler
20385cebcc
perplexity: update README FP16 results [no ci] ( #7413 )
2024-05-20 18:15:38 +02:00
Georgi Gerganov
3bc10cb485
server : fix temperature + disable some tests ( #7409 )
...
* server : fix temperature
* server : disable tests relying on parallel determinism
* ci : change server Debug -> RelWithDebInfo
2024-05-20 22:10:03 +10:00
Georgi Gerganov
1cc0155d04
server : tuning tests ( #7388 )
...
* server : don't pass temperature as string
* server : increase timeout
* tests : fix the fix 0.8f -> 0.8
ggml-ci
* tests : set explicit temperature
2024-05-20 10:16:41 +03:00
Georgi Gerganov
e932094d58
server : return error on too large embedding input ( #7389 )
2024-05-20 08:56:05 +03:00
Georgi Gerganov
2789baf480
tests : fix --keep_split -> --keep-split ( #7374 )
2024-05-20 08:55:09 +03:00
Fred Douglas
1ea2a0036e
quantize : fix --keep-split check ( #7374 )
2024-05-19 19:37:04 +03:00
Johannes Gäßler
1b01f06db0
server: add test for token probs ( #7347 )
2024-05-19 16:26:02 +02:00
Johannes Gäßler
41858392e1
server: fix seed being reported back ( #7382 )
2024-05-19 17:06:33 +03:00
Georgi Gerganov
854d365aba
cmake : update android comments ( #7341 )
2024-05-19 11:01:01 +03:00
Georgi Gerganov
511182eabb
android : use "ci-android" branch for CI ( #7341 )
...
* android : use "ci-android" branch for CI
* ggml : disable SIMD exp and silu for 32-bit ARM
ggml-ci
* android : do not fetch, use add_subdirectory instead
* cmake : provide binary dir
2024-05-18 20:40:39 +10:00
Johannes Gäßler
cb42c29427
server: correct --threads documentation [no ci] ( #7362 )
2024-05-18 11:10:47 +02:00
strawberrymelonpanda
ca57e0f35e
perplexity : ndot progress and show stats with < 100 tasks ( #7348 )
...
Fix floating point error with ndot printing, allow end stats on lower task numbers if multiple-choice tasks.
2024-05-18 10:57:08 +03:00
Radoslav Gerganov
f4bd8b3d26
rpc : set SO_REUSEADDR for the server socket ( #7320 )
...
ref: #7293
2024-05-17 17:25:44 +03:00
Radoslav Gerganov
ee94172d33
server : add support for the RPC backend ( #7305 )
...
ref: #7292
2024-05-17 10:00:17 +03:00
Leon Knauer
9c4fdcbec8
[Server] Added --verbose option to README [no ci] ( #7335 )
2024-05-17 10:11:03 +10:00
Pierrick Hymbert
24ecb58168
Revert "server bench: fix bench not waiting for model load ( #7284 )" ( #7334 )
...
This reverts commit 583fd6b000.
2024-05-16 20:43:45 +02:00
Radoslav Gerganov
9afdffe70e
rpc : get available mem for the CPU backend
...
This can be overridden with the -m command line option
ref: #7293
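Per the message above, the reported memory can be overridden with `-m`; a hedged example (port and size are illustrative):

```shell
# Advertise 8192 MiB to clients instead of the autodetected free memory.
./rpc-server -p 50052 -m 8192
```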
2024-05-16 12:04:08 +03:00
Radoslav Gerganov
3b3963c55c
rpc : add command line arg for specifying backend memory
...
ref: #7293
2024-05-16 09:58:29 +03:00
Vaibhav Srivastav
ad52d5c259
doc: add references to hugging face GGUF-my-repo quantisation web tool. ( #7288 )
...
* chore: add references to the quantisation space.
* fix grammar lol.
* Update README.md
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Update README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-16 15:38:43 +10:00
slaren
344f9126cc
ggml : tag ggml_tensor::backend as deprecated ( #7290 )
2024-05-15 15:08:48 +02:00
dm4
ea3b0590ee
embedding : free the batch after execution ( #7297 )
2024-05-15 15:01:12 +03:00
Johannes Gäßler
583fd6b000
server bench: fix bench not waiting for model load ( #7284 )
2024-05-15 08:44:16 +02:00
Steve Grubb
4f0263633b
server: free sampling contexts on exit ( #7264 )
...
* server: free sampling contexts on exit
This cleans up last leak found by the address sanitizer.
* fix whitespace
* fix whitespace
2024-05-14 16:11:24 +02:00
Brian
1265c670fd
Revert "move ndk code to a new library ( #6951 )" ( #7282 )
...
This reverts commit efc8f767c8.
2024-05-14 16:10:39 +03:00
Radoslav Gerganov
5e31828d3e
ggml : add RPC backend ( #6829 )
...
* ggml : add RPC backend
The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).
* set TCP_NODELAY
* add CI workflows
* Address review comments
* fix warning
* implement llama_max_devices() for RPC
* Address review comments
* Address review comments
* wrap sockfd into a struct
* implement get_alignment and get_max_size
* add get_device_memory
* fix warning
* win32 support
* add README
* readme : trim trailing whitespace
* Address review comments
* win32 fix
* Address review comments
* fix compile warnings on macos
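The proxying described above can be sketched as a two-machine setup, assuming the `rpc-server` binary (with `-H`/`-p` flags) and the `--rpc` client option; the host, port, and model path are illustrative:

```shell
# On the remote machine: expose a regular backend (CPU, CUDA, Metal, ...)
# through the RPC server.
./rpc-server -H 0.0.0.0 -p 50052

# On the local machine: offload computation to the remote backend.
./main -m model.gguf -p "Hello" --rpc 192.168.1.10:50052
```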
2024-05-14 14:27:19 +03:00
Elton Kola
efc8f767c8
move ndk code to a new library ( #6951 )
2024-05-14 17:30:30 +10:00
Ryuei
27f65d6267
docs: Fix typo and update description for --embeddings flag ( #7026 )
...
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
2024-05-14 15:20:47 +10:00
k.h.lai
30e70334f7
llava-cli: fix base64 prompt ( #7248 )
2024-05-14 00:02:36 +10:00
Johannes Gäßler
1c570d8bee
perplexity: add BF16 vs. FP16 results ( #7150 )
2024-05-13 13:03:27 +02:00
Benjamin Findley
e586ee4259
change default temperature of OAI compat API from 0 to 1 ( #7226 )
...
* change default temperature of OAI compat API from 0 to 1
* make tests explicitly send temperature to OAI API
2024-05-13 12:40:08 +10:00