When the number of parallel requests to llama-server exceeds the number of HTTP threads, llama-server stops responding to /health. This is very disruptive in k8s deployments, because it causes restarts of properly working inference endpoints. Unfortunately, there is no way to fix this outside of httplib, so this patch adds a rather ugly hack that handles GET /health requests before dispatching them to the thread pool. No changes are made in the HTTPS implementation.

closes: #20684
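The idea of answering GET /health before a request reaches the thread pool can be illustrated with a minimal, self-contained sketch. This is not the code from the patch and does not use httplib's API: `is_health_request`, `Route`, and `route_for` are hypothetical names, and the sketch assumes the server has already read the request line (for example `GET /health HTTP/1.1`) from the socket.

```cpp
#include <string_view>

// Hypothetical helper (not part of httplib or llama-server): reports whether
// an HTTP request line such as "GET /health HTTP/1.1" targets GET /health.
constexpr bool is_health_request(std::string_view request_line) {
    constexpr std::string_view prefix = "GET /health";
    if (request_line.substr(0, prefix.size()) != prefix) {
        return false;
    }
    // The path must end right after "/health", so that targets such as
    // /healthz or /health/extra are not matched.
    if (request_line.size() == prefix.size()) {
        return true;
    }
    const char next = request_line[prefix.size()];
    return next == ' ' || next == '?';
}

// Hypothetical routing decision: health checks are answered directly by the
// accepting thread, everything else is queued for the worker thread pool.
enum class Route { AnswerInline, ThreadPool };

constexpr Route route_for(std::string_view request_line) {
    return is_health_request(request_line) ? Route::AnswerInline
                                           : Route::ThreadPool;
}
```

In this sketch, a request line of `GET /health HTTP/1.1` is routed to `Route::AnswerInline`, so it can be answered even when every worker thread is busy, while a line such as `POST /completion HTTP/1.1` still goes to `Route::ThreadPool`.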
| Name |
|---|
| CMakeLists.txt |
| LICENSE |
| httplib.cpp |
| httplib.h |