* Move `task_result_state::update_chat_msg` to match with header
* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header
---------
Co-authored-by: openingnow <>
* Add Ministral3ForCausalLM architeture
This adds support for newer architectres like Devstral-2
* removed blank line found after function decorator
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* vulkan: Remove transfer_ctx, do everything in compute_ctx.
We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.
Remove transfer_cmd_pool, which was already unused.
* fix crash with perf logger
* from previous PR
* Make instruction(system) as first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-straming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.
Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'
and I was sure it could access the file. Seems like --models-dir and
--models-presets dont interact like I thought they would but I salvaged
this snippet that helps troubleshooting
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
* CUDA: Replace `init_offsets` with iterators in argsort
This is a QOL improvement, saving us the cost of materializing the
iterator
* Remove unnecessary include from top-k.cu
* convert : use n_groups instead of hardcoded values in reshape
This commit modifies the conversion script for NemotronHModel to use
the 'n_groups' hyperparameter, and allow Python to calculate the the
last dimension, using -1, when reshaping the 'mixer.norm.weight' tensor.
* use self.n_group instead of self.hparams["n_groups"]
Refactors MCP server enabling logic to remove the dependency on global settings.
This simplifies the logic by directly checking the per-chat override status, and removes the need to pass the global enabled state as a parameter.
Additionally:
- Only shows MCP servers that are enabled in settings in the selector.
- Sorts the servers by whether they are enabled for the current chat.
This commit adds a BUILD_DIR variable to the scripts used for running
converted models.
The motivation for this is that currently the `build` directory is
hardcoded and it can be useful to specify a different build directory,
with builds for different configurations.
Remove all frontend validation logic that prevented users from selecting
models based on multimodal capabilities. This refactoring removes
restrictive UI code while maintaining full functionality
- Vision models can describe images as text
- That text remains useful for non-vision models
- Chaining vision -> non-vision is a valid workflow
- Users know their use case better than the UI
- Users can return to vision models when needed
* support negative array index and default value
* attribute support (int and str) for join, map and sort
* add tests
* update CODEOWNERS
* improve fixme sorting comment