Experimenting with AI, my environment gets messy fast and it's not
always easy to know what model my software is trying to load. This helps
with troubleshooting.
before:
Error: {
code = 400,
message = "model not found",
type = "invalid_request_error"
}
After:
Error: {
code = 400,
message = "model 'toto' not found",
type = "invalid_request_error"
}
* add option --tensor-type-file to llama-quantize, but it raises an error.
* add error message when file not found
* quantize: update help menu, fix CI
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
* common : use two decimal places for float arg help messages
This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.
The motivation for this changes is that currently only having one decimal
place means that values generated using --help or llama-gen-docs will not
display the correct values.
For example, currently the value of top-p in tools/server/README.md is
`0.9`, but the default value is actually '0.95'. And running
llama-gen-docs does not update this value as it uses the output from the
help message, which shows only one decimal place, so the values look
like they are unchanged.
* docs : run llama-gen-docs to update docs
* Move `task_result_state::update_chat_msg` to match with header
* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header
---------
Co-authored-by: openingnow <>
* from previous PR
* Make instruction(system) as first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-straming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>