Oleksandr Kuvshynov
7c8699add6
pass user data
2024-05-25 22:10:19 -04:00
Oleksandr Kuvshynov
534093878b
duo: v3
2024-05-25 14:41:30 -04:00
Oleksandr Kuvshynov
96811fdf63
duo: v2
2024-05-25 14:23:57 -04:00
Oleksandr Kuvshynov
78938bc0c9
duo: v0
2024-05-25 13:59:28 -04:00
Oleksandr Kuvshynov
83aabb3fb7
readme
2024-05-24 23:56:48 -04:00
Oleksandr Kuvshynov
10d5aefed5
logging
2024-05-24 22:21:41 -04:00
Oleksandr Kuvshynov
66982abcb1
fixes
2024-05-24 12:22:59 -04:00
Oleksandr Kuvshynov
02e2c91d01
correct split id
2024-05-24 09:52:28 -04:00
Oleksandr Kuvshynov
60fe62e6eb
some renaming
2024-05-22 23:52:36 -04:00
Oleksandr Kuvshynov
479c80a0db
duo: cleanup v2
2024-05-22 23:31:23 -04:00
Oleksandr Kuvshynov
eecdd3b0ce
duo: first ~working option
2024-05-22 23:02:31 -04:00
Oleksandr Kuvshynov
2849247c4f
duo: more cleanup
2024-05-21 22:45:59 -04:00
Oleksandr Kuvshynov
f3965704fd
duo: simplify a little
2024-05-21 22:31:52 -04:00
Oleksandr Kuvshynov
d52d193e58
duo v0
...
setting up RPC + callback on each split completion
1. Start an RPC server on the local instance on two different ports, with 5GB allocated to each.
2. Set up another callback on completion of a split. This seems cleaner than trying to second-guess which tensor is the boundary of a split.
3. Run it with the 8B model @ 4bit; observe split_done captured at a reasonable place.
Next step: bring back linear speculation and start speculating on another remote instance.
2024-05-21 16:11:30 -04:00
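Step 1 above (two local rpc-server instances with ~5GB each) can be sketched as shell commands. This is a hypothetical sketch, assuming llama.cpp's rpc-server binary; the port numbers are illustrative, and -m is assumed to take a size in MB:

```shell
# Start two RPC servers on the same host, on different ports,
# each limited to roughly 5 GB of backend memory (assumed to be in MB).
./rpc-server -p 50052 -m 5000 &
./rpc-server -p 50053 -m 5000 &
```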
Amir
11474e756d
examples: cache hf model when --model not provided (#7353)
...
* examples: cache hf model when --model not provided
2024-05-21 17:13:12 +03:00
jaime-m-p
d7e852c1bc
Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)
...
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
jaime-m-p
917dc8cfa6
Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
...
* Update brute force test: special tokens
* Fix added tokens
- Try to read 'added_tokens.json'.
- Try to read 'tokenizer_config.json'.
- Try to read 'tokenizer.json'.
* Fix special tokens rtrim
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
Johannes Gäßler
20385cebcc
perplexity: update README FP16 results [no ci] (#7413)
2024-05-20 18:15:38 +02:00
Georgi Gerganov
3bc10cb485
server : fix temperature + disable some tests (#7409)
...
* server : fix temperature
* server : disable tests relying on parallel determinism
* ci : change server Debug -> RelWithDebInfo
2024-05-20 22:10:03 +10:00
Georgi Gerganov
1cc0155d04
server : tuning tests (#7388)
...
* server : don't pass temperature as string
* server : increase timeout
* tests : fix the fix 0.8f -> 0.8
ggml-ci
* tests : set explicit temperature
2024-05-20 10:16:41 +03:00
Georgi Gerganov
e932094d58
server : return error on too large embedding input (#7389)
2024-05-20 08:56:05 +03:00
Georgi Gerganov
2789baf480
tests : fix --keep_split -> --keep-split (#7374)
2024-05-20 08:55:09 +03:00
Fred Douglas
1ea2a0036e
quantize : fix --keep-split check (#7374)
2024-05-19 19:37:04 +03:00
Johannes Gäßler
1b01f06db0
server: add test for token probs (#7347)
2024-05-19 16:26:02 +02:00
Johannes Gäßler
41858392e1
server: fix seed being reported back (#7382)
2024-05-19 17:06:33 +03:00
Georgi Gerganov
854d365aba
cmake : update android comments (#7341)
2024-05-19 11:01:01 +03:00
Georgi Gerganov
511182eabb
android : use "ci-android" branch for CI (#7341)
...
* android : use "ci-android" branch for CI
* ggml : disable SIMD exp and silu for 32-bit ARM
ggml-ci
* android : do not fetch, use add_subdirectory instead
* cmake : provide binary dir
2024-05-18 20:40:39 +10:00
Johannes Gäßler
cb42c29427
server: correct --threads documentation [no ci] (#7362)
2024-05-18 11:10:47 +02:00
strawberrymelonpanda
ca57e0f35e
perplexity : ndot progress and show stats with < 100 tasks (#7348)
...
Fix floating-point error with ndot printing; allow end stats on fewer than 100 tasks for multiple-choice tasks.
2024-05-18 10:57:08 +03:00
Radoslav Gerganov
f4bd8b3d26
rpc : set SO_REUSEADDR for the server socket (#7320)
...
ref: #7293
2024-05-17 17:25:44 +03:00
Radoslav Gerganov
ee94172d33
server : add support for the RPC backend (#7305)
...
ref: #7292
2024-05-17 10:00:17 +03:00
Leon Knauer
9c4fdcbec8
[Server] Added --verbose option to README [no ci] (#7335)
2024-05-17 10:11:03 +10:00
Pierrick Hymbert
24ecb58168
Revert "server bench: fix bench not waiting for model load (#7284)" (#7334)
...
This reverts commit 583fd6b000.
2024-05-16 20:43:45 +02:00
Radoslav Gerganov
9afdffe70e
rpc : get available mem for the CPU backend
...
This can be overridden with the -m command line option.
ref: #7293
2024-05-16 12:04:08 +03:00
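The override described above might look like the following, assuming rpc-server accepts -m with a size in MB (the port and value are illustrative):

```shell
# Report 4 GB to clients instead of the auto-detected available memory.
./rpc-server -p 50052 -m 4096
```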
Radoslav Gerganov
3b3963c55c
rpc : add command line arg for specifying backend memory
...
ref: #7293
2024-05-16 09:58:29 +03:00
Vaibhav Srivastav
ad52d5c259
doc: add references to Hugging Face GGUF-my-repo quantisation web tool. (#7288)
...
* chore: add references to the quantisation space.
* fix grammar lol.
* Update README.md
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Update README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-16 15:38:43 +10:00
slaren
344f9126cc
ggml : tag ggml_tensor::backend as deprecated (#7290)
2024-05-15 15:08:48 +02:00
dm4
ea3b0590ee
embedding : free the batch after execution (#7297)
2024-05-15 15:01:12 +03:00
Johannes Gäßler
583fd6b000
server bench: fix bench not waiting for model load (#7284)
2024-05-15 08:44:16 +02:00
Steve Grubb
4f0263633b
server: free sampling contexts on exit (#7264)
...
* server: free sampling contexts on exit
This cleans up last leak found by the address sanitizer.
* fix whitespace
* fix whitespace
2024-05-14 16:11:24 +02:00
Brian
1265c670fd
Revert "move ndk code to a new library (#6951)" (#7282)
...
This reverts commit efc8f767c8.
2024-05-14 16:10:39 +03:00
Radoslav Gerganov
5e31828d3e
ggml : add RPC backend (#6829)
...
* ggml : add RPC backend
The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc.).
* set TCP_NODELAY
* add CI workflows
* Address review comments
* fix warning
* implement llama_max_devices() for RPC
* Address review comments
* Address review comments
* wrap sockfd into a struct
* implement get_alignment and get_max_size
* add get_device_memory
* fix warning
* win32 support
* add README
* readme : trim trailing whitespace
* Address review comments
* win32 fix
* Address review comments
* fix compile warnings on macos
2024-05-14 14:27:19 +03:00
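A minimal end-to-end sketch of the RPC backend described above, assuming the rpc-server binary and a --rpc client option as in the PR's README; the host, port, and model path are illustrative:

```shell
# On the remote host: serve the local (CPU/CUDA/Metal) backend over TCP.
./rpc-server -H 0.0.0.0 -p 50052

# On the client: proxy computation to the remote backend.
./main -m model.gguf -p "Hello" --rpc 192.168.1.42:50052 -ngl 99
```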
Elton Kola
efc8f767c8
move ndk code to a new library (#6951)
2024-05-14 17:30:30 +10:00
Ryuei
27f65d6267
docs: Fix typo and update description for --embeddings flag (#7026)
...
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
2024-05-14 15:20:47 +10:00
k.h.lai
30e70334f7
llava-cli: fix base64 prompt (#7248)
2024-05-14 00:02:36 +10:00
Johannes Gäßler
1c570d8bee
perplexity: add BF16 vs. FP16 results (#7150)
2024-05-13 13:03:27 +02:00
Benjamin Findley
e586ee4259
change default temperature of OAI compat API from 0 to 1 (#7226)
...
* change default temperature of OAI compat API from 0 to 1
* make tests explicitly send temperature to OAI API
2024-05-13 12:40:08 +10:00
Xuan Son Nguyen
72c177c1f6
fix system prompt handling (#7153)
2024-05-11 17:28:10 +02:00
Steve Grubb
988631335a
server : free llama_batch on exit (#7212)
...
* [server] Cleanup a memory leak on exit
There are a couple of memory leaks on exit of the server; this one hides the others.
After cleaning this up, you can see leaks on slots, but that is another
patch to be sent after this.
* make tab into spaces
2024-05-11 11:13:02 +03:00
Johannes Gäßler
5ae3426b0b
server: fix reported top tokens for temperature 0 (#7203)
2024-05-11 10:11:28 +02:00