* server: split HTTP into its own interface
* move server-http and httplib to their own file
* add the remaining endpoints
* fix exception/error handling
* renaming
* missing header
* fix missing windows header
* fix error responses from http layer
* fix slot save/restore handler
* fix case where only one stream chunk is returned
* add NOMINMAX
* do not call sink.write on empty data
* use safe_json_to_str for SSE
* clean up
* add some comments
* improve usage of next()
* bring back the "server is listening on" message
* more generic handler
* add req.headers
* move the chat template print to init()
* add req.path
* cont : minor
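
A rough sketch of what a transport-agnostic handler interface with a `next()`-style stream could look like. All names here (`http_request`, `http_response`, `http_handler`, `drain_stream`) are hypothetical stand-ins and are not taken from the server code. The sketch assumes `next()` fills a chunk and returns whether more chunks may follow, so a stream that yields a single chunk is still written, and empty chunks never reach the sink.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical request type: plain data only, no HTTP library types leak out.
struct http_request {
    std::string path;
    std::map<std::string, std::string> headers;
    std::string body;
};

// Hypothetical response type. For streaming, next() fills `chunk` and returns
// true while more chunks may follow (assumed contract for this sketch).
struct http_response {
    int status = 200;
    std::string data;                              // non-streaming payload
    std::function<bool(std::string & chunk)> next; // set only when streaming
};

// A handler maps a request to a response without knowing about the transport.
using http_handler = std::function<http_response(const http_request &)>;

// Stand-in for the HTTP layer: forwards the response to `write` and returns
// the number of writes performed. Empty chunks are skipped.
inline std::size_t drain_stream(http_response & res,
                                const std::function<void(const std::string &)> & write) {
    assert(static_cast<bool>(write)); // a write callback is required
    std::size_t n_written = 0;
    if (!res.next) {
        // non-streaming response: write the payload once, if there is any
        if (!res.data.empty()) {
            write(res.data);
            n_written++;
        }
        return n_written;
    }
    std::string chunk;
    while (true) {
        chunk.clear();
        const bool has_more = res.next(chunk);
        if (!chunk.empty()) {
            write(chunk);
            n_written++;
        }
        if (!has_more) {
            break;
        }
    }
    return n_written;
}
```

Keeping the request and response as plain structs is one way to let the same handler be driven by a different transport or by a test harness.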
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI
- Purely visual and diagnostic change, no effect on model context, prompt
construction, or inference behavior
- Captured assistant tool call payloads during streaming and non-streaming
completions, and persisted them in chat state and storage for downstream use
- Exposed parsed tool call labels beneath the assistant's model info line
with graceful fallback when parsing fails
- Added tool call badges beneath assistant responses that expose JSON tooltips
and copy their payloads when clicked, matching the existing model badge styling
- Added a user-facing setting to toggle tool call visibility to the Developer
settings section directly under the model selector option
* webui: remove scroll listener causing unnecessary layout updates (model selector)
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: npm run format & update webui build output
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
When llama.cpp is compiled in Yocto, the build fails QA checks because the generated .so files are not versioned. This change applies a version to all generated .so files, allowing the package to build without errors.
* feat(memory): Only fail partial erasure of recurrent tail
The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.
One potential case that this doesn't address is pruning the cache to remove
sensitive data from the context. That wouldn't work for partial removal from
the middle of an attention cache either: the KV state at later sequence
positions would still be based on the state from the sensitive data, even if
that data is no longer cached. I don't think this is relevant, but it is
worth noting that the semantics of this change for a partial erasure in the
middle of the cache are essentially "my context is already compressed" and
not "all trace of the removed tokens has been removed."
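
The erasure rule described above can be sketched as a small predicate. This is an illustration only, not the code from this change: `pos_t` and `recurrent_erase_ok` are hypothetical names, and the sketch assumes a half-open range `[p0, p1)` where a negative `p0` means "from the start" and a negative `p1` means "to the end".

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

using pos_t = std::int32_t; // hypothetical position type for this sketch

// Returns true when erasing positions [p0, p1) can be reported as a success
// for a recurrent sequence whose only stored state belongs to `tail_pos`
// (the position of the final token).
inline bool recurrent_erase_ok(pos_t tail_pos, pos_t p0, pos_t p1) {
    assert(tail_pos >= 0);
    if (p0 < 0) {
        p0 = 0;
    }
    if (p1 < 0) {
        p1 = std::numeric_limits<pos_t>::max();
    }
    const bool covers_tail = p0 <= tail_pos && tail_pos < p1;
    const bool from_start  = p0 == 0;
    // Only a range that includes the final token without covering the whole
    // sequence is rejected: the state cannot be rolled back to an earlier
    // position. Ranges that miss the final token hold no memory to remove.
    return !(covers_tail && !from_start);
}
```

Under these assumptions, with `tail_pos = 9`, erasing `[2, 5)` or `[10, end)` succeeds, erasing the whole sequence succeeds, and erasing `[5, end)` fails.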
https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(main): Check the output of seq_rm for prefix matching
This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.
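
A minimal sketch of that fallback, using a hypothetical `cache_ops` interface in place of the real memory API. `remove_from`, `clear`, and `reuse_prefix` are illustrative names only. The sketch also assumes that no tokens can be reused once the whole cache has been cleared.

```cpp
#include <cassert>
#include <functional>

// Hypothetical minimal cache interface for this sketch.
struct cache_ops {
    std::function<bool(int p0)> remove_from; // remove [p0, end); false if not possible
    std::function<void()>       clear;       // drop the whole cache
};

// Tries to keep the first `n_matching` tokens of a cached sequence.
// Returns how many cached tokens can actually be reused.
inline int reuse_prefix(const cache_ops & cache, int n_matching) {
    assert(n_matching >= 0);
    if (cache.remove_from(n_matching)) {
        return n_matching; // the non-matching tail was removed, prefix is intact
    }
    // The tail could not be removed (e.g. a recurrent state updated in place),
    // so the only safe option is to start over with an empty cache.
    cache.clear();
    return 0;
}
```

The caller would then re-evaluate the prompt from the returned position rather than from the original match count.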
https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(memory): Fix condition for partial erasure failure if p0 > pos
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
* style: Fix extra parens
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear
https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>