When the number of parallel requests to llama-server exceeds the number of HTTP threads, llama-server stops responding to `/health`. This is very disruptive in Kubernetes deployments, where failed liveness probes trigger restarts of otherwise healthy inference endpoints. Unfortunately, this cannot be fixed outside of httplib, so this patch adds a rather ugly hack: `GET /health` requests are handled directly, before being dispatched to the thread pool. The HTTPS implementation is unchanged.

closes: #20684