* init version
* fix auto scroll
* bring back copy btn
* bring back thought process
* add lint and format check on CI
* remove lang from html tag
* allow multiple generations at the same time
* lint and format combined
* fix unused var
* improve MarkdownDisplay
* fix more latex
* fix code block cannot be selected while generating
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.
The motivation for this is that when using a debug build the server
would crash when an exception was throws and terminate the server
process, as it was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set
cpp_httplib will not call the exception handler, which would normally
return a 500 error to the client. This caused tests to fail when using
a debug build.
Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
* Fix Shift+Enter handling
`exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway
* build index.html.gz
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* An empty tool_call_id is better than none!
* sync: minja (tool call name optional https://github.com/google/minja/pull/36)
* Force-disable parallel_tool_calls if template doesn't support it
* More debug logs
* Llama 3.x tools: accept / trigger on more varied spaced outputs
* Fix empty content for functionary v3.2 tool call
* Add proper tool call docs to server README
* readme: function calling *is* supported now
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.
Currently the returned metrics look like this:
```console
\# HELP llamacpp:requests_processing Number of request processing.
\# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
\# HELP llamacpp:requests_deferred Number of request deferred.
\# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
With this commit, the metrics will look like this:
```console
\# HELP llamacpp:requests_processing Number of requests processing.
\# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
\# HELP llamacpp:requests_deferred Number of requests deferred.
\# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
This commit replaces the two usages of `std::bind` in favor of lambdas for
the callback functions for `callback_new_task` and
`callback_update_slots`.
The motivation for this changes is consistency with the rest of the code
in server.cpp (lambdas are used for all other callbacks/handlers). Also
lambdas are more readable (perhaps this is subjective) but also they are
recommended over `std::bind` in modern C++.
Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md
This commit updates some of JSON snippets in README.md file and
removes the `json` language tag from the code blocks.
The motivation for this changes is that if there is invalid json in a
code snippet these are highlighted in red which can make it somewhat
difficult to read and can be a little distracting.
* add /apply-template endpoint to server
* remove unnecessary line
* add /apply-template documentation
* return only "prompt" field in /apply-template
* use suggested idea instead of my overly verbose way
* server : update auto gen files comments
This commit updates the 'auto generated files' comments in server.cpp
and removes `deps.sh` from the comment.
The motivation for this change is that `deps.sh` was removed in
Commit 91c36c269b ("server : (web ui)
Various improvements, now use vite as bundler (#10599)").
* squash! server : update auto gen files comments [no ci]
Move comments about file generation to README.md.
* squash! server : update auto gen files comments [no ci]
Remove the comments in server.cpp that mention that information
can be found in the README.md file.
The test_completion_stream_with_openai_library() function is actually with stream=False by default, and test_completion_with_openai_library() with stream=True
* webui : put DeepSeek R1 CoT in a collapsible <details> element
* webui: refactor split
* webui: don't use regex to split cot and response
* webui: format+qol
* webui: no loading icon if the model isn't generating
* ui fix, add configs
* add jsdoc types
* only filter </think> for assistant msg
* build
* update build
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : add tooltips to settings and themes btn
This commit adds tooltips to the settings and themes buttons in the
webui. The tooltip will be displayed below the actual buttons when
hovered over.
The motivation for this change is to clarify the purpose of the themes
button.
* squash! server : add tooltips to settings and themes btn
This commit adds a tooltip to the '...' button when a chat has been
started. The tooltip is "Chat options" which think could be a good
description as the dropdown contains options to delete or download the
current chat.
* rm tooltip for 3 dots button
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server/bench:
- support openAI streaming standard output with [DONE]\n\n
- export k6 raw results in csv
- fix too many tcp idle connection in tcp_wait
- add metric time to emit first token
* server/bench:
- fix when prometheus not started
- wait for server to be ready before starting bench
* more perfo with llamafile tinyblas on x86_64.
- add bf16 suport
- change dispache strategie (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth
simple tinyblas dispache and more cache freindly
* tinyblas dynamic dispaching
* sgemm: add M blocs.
* - git 2.47 use short id of len 9.
- show-progress is not part of GNU Wget2
* remove not stable test
* server: avoid overwriting Authorization header
If no API key is set, leave the Authorization header as is. It may be
used by another part of the Web stack, such as an authenticating proxy.
Fixes https://github.com/ggerganov/llama.cpp/issues/10854
* rebuild
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : update readme [no ci]
* server : fix spacing [no ci]
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : be explicit about the pooling type in the tests
ggml-ci
* server : update /embeddings and /v1/embeddings endpoints
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* server : update readme
ggml-ci
* server : fixes
* tests : update server tests
ggml-ci
* server : update readme [no ci]
* server : remove rebase artifact
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : add "tokens" output
ggml-ci
* server : update readme
ggml-ci
* server : return tokens ids only if requested
ggml-ci
* tests : improve "tokens" type check
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : remove "tokens" from the OAI endpoint
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* sampling : refactor + optimize penalties sampler
ggml-ci
* common : apply ignore_eos as logit bias
ggml-ci
* batched : remove penalties sampler
* params : allow penalty_last_n == -1 to be equal to context size
ggml-ci
* common : by default, move the penalties at the end of the sampling chain
ggml-ci
* common : ignore all EOG tokens
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* common : move back the penalties at the front of the sampling chain
ggml-ci
* readme : restore hint about --ignore-eos flag [no ci]
* llama : minor
ggml-ci
* webui : update
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update server JSON response.
* Add unit test to check `has_new_line` JSON response
* Remove `has_new_line` unit test changes.
* Address code review comment: type check for `has_new_line` in unit test
* get rid of completion.js
* extract chat bubble to a component
* add tok/s info
* sync
* fix BASE_URL
* only extract timings when it's enabled
* fix auto scroll
* bug-fix: snprintf prints NULL in place of the last character
We need to give snprintf enough space to print the last character and the null character, thus we allocate one extra byte and then ignore it when converting to std::string.
* add comment about extra null-term byte requirement
* server : (refactor) no more json in server_task input
* add test for slots endpoint
* add tests for /props and /slots
* remove task inf_type
* fix CI by adding safe_json_to_str
* add "model_path" to /props
* update readme
* server : various fixes
ggml-ci
* server : show curent seed in slot_params
ggml-ci
* fix /slots endpoint
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : reflect endpoint response changes in the readme
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* hide buttons in dropdown menu
* use npm as deps manager and vite as bundler
* fix build
* fix build (2)
* fix responsive on mobile
* fix more problems on mobile
* sync build
* (test) add CI step for verifying build
* fix ci
* force rebuild .hpp files
* cmake: clean up generated files pre build
* readme : document --no-display-prompt
* readme : update default prompt context size
* readme : remove unnecessary indentation
Indenting a line with four spaces makes Markdown treat that section as
plain text.
* readme : indent commands under bullets
* readme : indent commands in lettered list
* llama : add enum for supported chat templates
* use "built-in" instead of "supported"
* arg: print list of built-in templates
* fix test
* update server README
* server : replace behave with pytest
* fix test on windows
* misc
* add more tests
* more tests
* styling
* log less, fix embd test
* added all sequential tests
* fix coding style
* fix save slot test
* add parallel completion test
* fix parallel test
* remove feature files
* update test docs
* no cache_prompt for some tests
* add test_cache_vs_nocache_prompt
* llama : accept a list of devices to use to offload a model
* accept `--dev none` to completely disable offloading
* fix dev list with dl backends
* rename env parameter to LLAMA_ARG_DEVICE for consistency
* Add download chat feature to server chat
Add a download feature next to the delete chat feature in the server vue chat interface.
* code style
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Samplers sequence: simplified and input field.
* Removed unused function
* Modify and use `settings-modal-short-input`
* rename "name" --> "label"
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : (web ui) add copy btn for code blocks
* fix problem with api key
* use settings-modal-short-input component
* always show copy btn for code snippet