llama.cpp/tools/server/tests/unit
Radoslav Gerganov 68ee98ae18
server : return HTTP 400 if prompt exceeds context length (#16486)
In streaming mode, when the prompt exceeds the context length, the server returns
an HTTP 200 status code with a JSON error in the body. This is confusing and
inconsistent with other inference engines, which return an HTTP 4xx error in
this case.

This patch fixes the problem and makes the server return HTTP 400 in such cases.
2025-10-10 16:11:07 +02:00
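
As a rough illustration of the behaviour change (not the project's actual test code), here is a minimal sketch of how a client could observe it against a locally running llama-server, assuming the OpenAI-compatible /v1/chat/completions endpoint on port 8080 and a server started with a deliberately small context (e.g. `-c 256`):

```python
# Minimal sketch (assumption: llama-server running locally, e.g. started with
# `llama-server -m model.gguf -c 256 --port 8080`).
# Illustrates the change: an oversized prompt in streaming mode should now be
# rejected with HTTP 400 up front, instead of HTTP 200 with an error in the body.
import requests

BASE_URL = "http://127.0.0.1:8080"  # assumed local server address


def check_oversized_prompt():
    payload = {
        # Prompt far beyond a 256-token context window.
        "messages": [{"role": "user", "content": "word " * 10000}],
        "stream": True,
    }
    resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, stream=True)
    # After this patch the error is reported via the status code itself.
    assert resp.status_code == 400, f"expected HTTP 400, got {resp.status_code}"
    print("server correctly rejected the oversized prompt with HTTP 400")


if __name__ == "__main__":
    check_oversized_prompt()
```

This mirrors, in spirit, what the updated test_chat_completion.py below checks through the test suite's own server harness.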
| File | Last commit message | Last commit date |
| --- | --- | --- |
| test_basic.py | server : host-memory prompt caching (#16391) | 2025-10-09 18:54:51 +03:00 |
| test_chat_completion.py | server : return HTTP 400 if prompt exceeds context length (#16486) | 2025-10-10 16:11:07 +02:00 |
| test_completion.py | server : host-memory prompt caching (#16391) | 2025-10-09 18:54:51 +03:00 |
| test_ctx_shift.py | server : host-memory prompt caching (#16391) | 2025-10-09 18:54:51 +03:00 |
| test_embedding.py | server : disable context shift by default (#15416) | 2025-08-19 16:46:37 +03:00 |
| test_infill.py | server : disable context shift by default (#15416) | 2025-08-19 16:46:37 +03:00 |
| test_lora.py | server : disable context shift by default (#15416) | 2025-08-19 16:46:37 +03:00 |
| test_rerank.py | server : disable context shift by default (#15416) | 2025-08-19 16:46:37 +03:00 |
| test_security.py | server : disable context shift by default (#15416) | 2025-08-19 16:46:37 +03:00 |
| test_slot_save.py | server : disable context shift by default (#15416) | 2025-08-19 16:46:37 +03:00 |
| test_speculative.py | llama: use FA + max. GPU layers by default (#15434) | 2025-08-30 16:32:10 +02:00 |
| test_template.py | server : speed up tests (#15836) | 2025-09-06 14:45:24 +02:00 |
| test_tokenize.py | server : disable context shift by default (#15416) | 2025-08-19 16:46:37 +03:00 |
| test_tool_call.py | server : speed up tests (#15836) | 2025-09-06 14:45:24 +02:00 |
| test_vision_api.py | server : speed up tests (#15836) | 2025-09-06 14:45:24 +02:00 |