* add option --tensor-type-file to llama-quantize, but it raises an error.
* add error message when file not found
* quantize: update help menu, fix CI
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
* common : use two decimal places for float arg help messages
This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.
The motivation for this changes is that currently only having one decimal
place means that values generated using --help or llama-gen-docs will not
display the correct values.
For example, currently the value of top-p in tools/server/README.md is
`0.9`, but the default value is actually '0.95'. And running
llama-gen-docs does not update this value as it uses the output from the
help message, which shows only one decimal place, so the values look
like they are unchanged.
* docs : run llama-gen-docs to update docs
* Move `task_result_state::update_chat_msg` to match with header
* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header
---------
Co-authored-by: openingnow <>
* from previous PR
* Make instruction(system) as first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-straming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>
* initial commit for branch
* simplify constants
* add params to `struct common_params_sampling`, add reference to PR
* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`
* add args, rename `queue_size` -> `window_size`
* improved comments
* minor
* remove old unused code from algorithm
* minor
* add power law case to `common_sampler_init`, add sampler name mappings
* clarify behaviour when `window_size = 0`
* add missing enums
* remove `target_range` param, make `target == 1` no-op, cleanup code
* oops, straggler
* add missing parameters in `server-task.cpp`
* copy from author
ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069
* remove old debug log, style nit
* fix compiler warning, add commented-out logging per token
* re-write + change parameters + simplify
* oops forgot args.cpp
* fix leftover `window_size`
* add missing values to `common_params_sampling::print()`
* with logging
* does this fix it?
* no, but does this?
* update default decay
* optimize
* fix bad merge
my git skills are lacking
* silence `missing initializer for member`
* update default decay to 0.9
* fix logging
* format (double)
* add power law to the new `samplers` vector
* log sampler init values
* improve logging messages in llama_sampler_power_law
* remove extraneous logging
* simplify target computation
last commit with debug logging!
* remove debug logging, explicitly clamp params at init
* add `use_power_law` flag + logic, minor cleanup
* update `power-law` -> `adaptive-p`
* fix cold start EMA
- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`
this fixes a "cold start" problem with the moving average
* update `SHARPNESS` constant to `10.0f`
* minor style fixes
no functional changes
* minor style fixes cont.
* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)
* separate into `apply` + `accept` functions
* `pending_token_idx`: switch from `llama_token` to `int32`
functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but its more correct now
* don't transform logits <= -1e9f
* fix masking in backend top-p, min-p
* address review comments
* typo in comments `RND` -> `RNG`
* add docs
* add recommended values in completion docs
* address PR feedback
* remove trailing whitespace (for CI `editorconfig`)
* add to adaptive-p to `common_sampler_types_from_chars`
* server : make sure children tasks are scheduled to launch with parent
* fix
* add comment pointing to this PR
* fix
* clean up
* more debug messages
* add pop_deferred_task with specific ID version
* improve the logic
* simple approach
* no double move
* correct return type of launch_slots_with_parent_task
This change enables the repack stage to utilize the user-specified
thread count, ensuring that both the logical thread IDs and the total
number of threads remain consistent between the repack and inference
stages.
In a NUMA architecture where the `--numa distribute` parameter is used,
logical threads are pinned to specific physical NUMA nodes. By aligning
the thread configuration across these two stages, we can fully leverage
the operating system's "first-touch" memory allocation policy:
1. Repack Stage: Logical thread i (bound to NUMA node j) is responsible
for repacking and writing the weight data. Since the "first touch"
occurs within this thread, the corresponding physical memory is
allocated on node j.
2. Inference Stage: The same logical thread i (still bound to node j)
reads these weights. Since the data already resides on the local
node, low-latency local memory access is achieved.
Without ensuring consistency in the number of threads, data may be
randomly allocated to mismatched nodes, resulting in significant
cross-node access overhead during inference.
Signed-off-by: Jianhui Zhou <jonaszhou@zhaoxin.com>
* server : add arg for disabling prompt caching
Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.
* address review comments
* address review comments
* Add Gemma3nVisionModel - MobileNetV5 vision encoder convertor to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
* Add mobilenetv5 impl
* Fix comments, remove unused vars
* Fix permute and remove transpose of projection weights
* Fix comments, remove debugging prints from hf_to_gguf
* 1. Hard-code image_mean = 0 and image_std = 1
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder
* 1. Move mobilenetv5 helpers declarations to `clip_graph_mobilenetv5` struct and definitions to mobilenetv5.cpp
2.Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size
* Remove obsolete comments
* - convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest
- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading
* - Rename tensors to v.conv..., v.blk..., v.msfa... to better align with already existing terminology
* Fix stem conv bias name
* Remove explicit handling of bias term for stem conv
* - Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer
- Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable
* clean up conversion script
* fix code style
* also preserve audio tensors
* trailing space
* split arch A and V
* rm unused gemma3 func
* fix alignment
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Change is decoupled from https://github.com/ggml-org/llama.cpp/pull/18641.
[LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B)
needs streaming istft for generating output audio.
* add streaming ISTFT class (`mtmd_audio_streaming_istft`) with overlap-add for audio reconstruction
* replace global audio cache with per-instance cache, the model requires
two independent caches, for preprocessing (audio input) and for istft
(audio output).
* unified templated FFT/IFFT implementation supporting both forward and inverse transforms
* server : add thinking content blocks to Anthropic Messages API
Add support for returning reasoning/thinking content in Anthropic API
responses when using models with --reasoning-format deepseek and the
thinking parameter enabled.
- Non-streaming: adds thinking block before text in content array
- Streaming: emits thinking_delta events with correct block indices
- Partial streaming: tracks reasoning state across chunks via
anthropic_has_reasoning member variable
Tested with bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF model.
* server : fix Anthropic API streaming for thinking content blocks
Add signature field and fix duplicate content_block_start events in
Anthropic Messages API streaming responses for reasoning models.
* server: refactor Anthropic streaming state to avoid raw pointer
Replace raw pointer to task_result_state with direct field copies:
- Copy state fields in update() before processing chunk
- Use local copies in to_json_anthropic() instead of dereferencing
- Pre-compute state updates for next chunk in update()
This makes the data flow clearer and avoids unsafe pointer patterns.