llama.cpp

Commit Graph

Author	SHA1	Message	Date
Daniel Bevenius	62cef26ac5	model-conversion : add qat-q4 quantization targets (#15588 ) This commit adds two targets to the Makefile for quantizing of Quantization Aware Trained (QAT) models to Q4_0 format. The motivation for this is that this sets the token embedding and the output tensors data types to Q8_0 instead of the default Q6_K. This is someting that we wish to enforce for QAT Q4_0 models that are to be uploaded to ggml-org on Huggingface to guarantee the best quality.	2025-08-26 16:12:29 +02:00
Daniel Bevenius	dfd9b5f6c7	model-conversion : set pooling type to none in logits.cpp (#15564 ) This commit explicitly sets the pooling type to 'none' in the logits.cpp to support models that have a pooling type specified. The motivation for this is that some models may have a pooling type set in the model file (.gguf file) and for this specific case where we only want to extract logits, we need to ensure that no pooling is used to so that we are comparing raw logits and not pooled embeddings.	2025-08-25 15:00:43 +02:00
Daniel Bevenius	5a6bc6b1a6	model-conversion : add model card template for embeddings [no ci] (#15557 ) * model-conversion: add model card template for embeddings [no ci] This commit adds a separate model card template (model repository README.md template) for embedding models. The motivation for this is that there server command for the embedding model is a little different and some addition information can be useful in the model card for embedding models which might not be directly relevant for causal models. * squash! model-conversion: add model card template for embeddings [no ci] Fix pyright lint error. * remove --pooling override and clarify embd_normalize usage	2025-08-25 14:25:25 +02:00
Georgi Gerganov	9ebebef62f	llama : remove KV cache defragmentation logic (#15473 ) ggml-ci	2025-08-22 12:22:13 +03:00
Jie Fu (傅杰)	9ad5e60dba	examples : fix some typos in examples/model-conversion/README.md (#15477 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-08-21 16:53:13 +02:00
Jie Fu (傅杰)	ad294df03f	examples : install torch-cpu for model conversion tool/example (#15475 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-08-21 15:42:34 +02:00
Daniel Bevenius	2758fa10da	examples : add model conversion tool/example (#15455 ) * examples : add model conversion tool/example This commit adds an "example/tool" that is intended to help in the process of converting models to GGUF. Currently it supports normal causal models and embedding models. The readme contains instructions and command to guide through the process. The motivation for this to have a structured and repeatable process for model conversions and hopefully with time improve upon it to make the process easier and more reliable. We have started to use this for new model conversions internally and will continue doing so and improve it as we go along. Perhaps with time this should be placed in a different directory than the examples directory, but for now it seems like a good place to keep it while we are still developing it. * squash! examples : add model conversion tool/example Remove dependency on scikit-learn in model conversion example. * squash! examples : add model conversion tool/example Update transformer dep to use non-dev version. And also import `AutoModelForCausalLM` instead of `AutoModel` to ensure compatibility with the latest version. * squash! examples : add model conversion tool/example Remove the logits requirements file from the all requirements file.	2025-08-21 12:16:54 +02:00
Daniel Bevenius	29f538ac63	examples : remove references to `make` in examples [no ci] (#15457 ) This commit removes references to `make` in the examples, as the build system has been updated to use CMake directly and using `make` will now generate an error since Commit `37f10f955f` ("make : remove make in favor of CMake (#15449)").	2025-08-21 06:12:28 +02:00
Georgi Gerganov	2f37014073	lookahead : add sample command to readme (#15447 ) * lookahead : add sample command to readme * cont : build-agnostic command	2025-08-20 13:30:46 +03:00
Georgi Gerganov	df36bce667	eval-callback : stop on first NaN (#15320 ) * eval-callback : stop on first NaN * cont : log error	2025-08-14 22:10:51 +03:00
Jonathan Graehl	5cdb27e091	finetune: SGD optimizer, more CLI args (#13873 ) * examples/finetune -opt SGD (stochastic gradient descent) memory opt add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating m, v tensors. support finetune.cpp arg -opt SGD (or sgd). (default adamw as before) llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw. (wikipedia 100 lines finetune) ( using the same GPU memory, adamw can only do before OOM 512 batch/context, reaching: train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00 val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00 SGD is superior, though it converges slower, with max before OOM 1728 batch/context (esp see the better validation perf): train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00 val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00 ) note: when finetuning long enough (or w/ enough -lr), validation accuracy eventually drops ('catastrophic forgetting') -lr-half (halflife) option useful for SGD to avoid oscillation or super slow underdamped learning (makes setting -lr more forgiving). terminal -lr for now is set by lr-halvings i.e. if you want at most 1/8 the inital -lr you set -lr-halvings 3. note: objective loss not directly comparable between adamw, sgd? - check perplexity or accuracy or consider relative improvements for convergence new finetune args -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before) cache (1 - wdalpha) in 'adamw' opt struct - no noticeable perf benefit, disabled (still done for new SGD though) since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet) test-opt checks adamw as before and now sgd (except for a few disabled tests for sgd only; probably just needs logging values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs) Vulkan: Implement GGML_OP_OPT_STEP_SGD * tests: Fix OPT_STEP_SGD test-backend-ops * SGD op param store weight-decay and not 1-alphawd minor + cosmetic changes * fix vulkan sgd * try CI fix --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-14 12:03:57 +02:00
Copilot	d8914fc47e	common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191 ) * Checkpoint from VS Code for coding agent session * Initial plan * Fix typo in --override-tensor-draft flag implementation * Add null termination for speculative tensor buffer overrides * Apply suggestions from code review * Apply suggestions from code review * Extract tensor override parsing logic to common function (addresses @slaren's feedback) * Apply suggestions from code review * Apply suggestions --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-08-13 12:44:40 +02:00
g2mt	94933c8c2e	server : implement universal assisted decoding (#12635 ) * llama-server : implement universal assisted decoding * Erase prompt tail for kv-cache * set vocab_dft_compatible in common_speculative * rename ctx_main to ctx_tgt * move vocab_dft_compatible to spec struct * clear mem_dft, remove mem * detokenize id_last for incompatible models * update comment * add --spec-replace flag * accept special tokens when translating between draft/main models * Escape spec-replace * clamp draft result to size to params.n_draft * fix comment * clean up code * restore old example * log common_speculative_are_compatible in speculative example * fix * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-31 14:25:23 +02:00
Aman Gupta	8a4a856277	Add LLaDA 8b Diffusion model (#14771 ) * Add support for Llada-8b: diffusion model * Add README * Fix README and convert_hf_to_gguf * convert_hf_to_gguf.py: address review comments * Make everything in a single example * Remove model-specific sampling * Remove unused argmax * Remove braced initializers, improve README.md a bit * Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps * Remove adding the mask token * Move add_add_bos_token to set_vocab * use add_bool in gguf_writer.py	2025-07-31 19:49:09 +08:00
Georgi Gerganov	00131d6eaf	tests : update for LLAMA_SET_ROWS=1 (#14961 ) * test-thread-safety : each context uses a single sequence * embedding : handle --parallel argument ggml-ci * save-load : handle -np 1 ggml-ci * thread-safety : avoid overriding threads, reduce test case arg ggml-ci	2025-07-30 15:12:02 +03:00
Georgi Gerganov	2adf8d83ac	parallel : add option for different RNG seeds (#14757 ) ggml-ci	2025-07-18 17:33:41 +03:00
Georgi Gerganov	225e7a1438	llama : add high-throughput mode (#14363 ) * kv-cache : prepare K/V buffers for separation ggml-ci * batched-bench : fix oob write ggml-ci * llama : add "virtual sequences" ggml-ci * llama : use "stream" vs "virtual sequence" ggml-ci * graph : fix stream splitting when KV cache is not used ggml-ci * kv-cache : add multi-stream save/load support ggml-ci * llama : add "--attn-streams" flag ggml-ci * kv-cache : fix handling when find_slot fails ggml-ci * kv-cache : restore find_slot impl ggml-ci * kv-cache : add comments * kv-cache : add bounds checks for sequence id ggml-ci * cont : add n_seq_max to batch allocr ggml-ci * kv-cache : perform stream copies lazily after llama_synchronize ggml-ci * kv-cache : avoid throwing exceptions across the C boundary ggml-ci * CUDA: 4D FlashAttention support (#14628) * CUDA: 4D FlashAttention support * CUDA: fix WMMA FA kernel * llama : rename attn_streams -> kv_unified ggml-ci * common : rename kv_split -> kv_unified ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-07-16 16:35:42 +03:00
Aman Gupta	ab14019821	Support diffusion models: Add Dream 7B (#14644 ) * Support diffusion models: Add Dream 7B * Move diffusion to examples * Move stuff to examples. Add patch to not use kv-cache * Address review comments * Make sampling fast * llama: remove diffusion functions * Add basic timings + cleanup * More cleanup * Review comments: better formating, use LOG instead std::cerr, re-use batch, use ubatch instead of max_length * fixup! * Review: move everything to diffusion-cli for now	2025-07-16 20:03:51 +08:00
Georgi Gerganov	bac8bed248	eval-callback : check for empty input (#14539 )	2025-07-05 07:18:09 +03:00
Georgi Gerganov	d7f5f4e578	simple-chat : fix context-exceeded condition (#14494 ) * simple-chat : fix context-exceeded condition ggml-ci * cont : fix n_ctx_used computation ggml-ci	2025-07-02 14:12:07 +03:00
Vedran Miletić	e9b6350e61	scripts : make the shell scripts cross-platform (#14341 )	2025-06-30 10:17:18 +02:00
Radoslav Gerganov	8d94219a4a	ggml : add ggml_set_rows (#14274 ) * ggml : add ggml_set_rows Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'. ref: #8366 * use I64 for indices * ggml : add repeat impl for i64 * ggml : add ggml_is_contiguous_rows * ggml : ggml_set_rows support broadcast * ggml : ggml_set_rows support quantized dst ggml-ci * ggml : support GGML_TYPE_F32 ".from_float" trait * ggml : ggml_set_rows update comment + better index name * tests : add ggml_set_rows * metal : add ggml_set_rows implementation ggml-ci * ggml : simplify forward_dup_f32 * ggml : fix supports_op * tests : add comment to set_rows * ggml : leave the repeat_i64 for a separate PR ggml-ci * ggml : set_rows use std::min instead of MIN * ggml : better error message for set_rows unsupported type * metal : perform op->type check only once * tests : more consistent implementation + more tests ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-27 16:41:40 +03:00
Georgi Gerganov	f1f5e82df6	examples : fix is_first logic for tokenization (#14329 ) ggml-ci	2025-06-22 20:10:07 +03:00
Sigbjørn Skjæret	88fc854b4b	llama : improve sep token handling (#14272 )	2025-06-20 14:04:09 +02:00
Georgi Gerganov	d3e64b9f49	llama : rework embeddings logic (#14208 ) * llama : rework embeddings logic ggml-ci * cont : fix rerank ggml-ci * cont : engrish [no ci] * cont : fix rerank ggml-ci * server : support both embeddings and completions with single model ggml-ci * cont : avoid embeddings_org ggml-ci	2025-06-16 14:14:00 +03:00
Georgi Gerganov	745aa5319b	llama : deprecate llama_kv_self_ API (#14030 ) * llama : deprecate llama_kv_self_ API ggml-ci * llama : allow llama_memory_(nullptr) ggml-ci * memory : add flag for optional data clear in llama_memory_clear ggml-ci	2025-06-06 14:11:15 +03:00
Sigbjørn Skjæret	d17a809ef0	llama : support multiple classifier outputs and labels (#13940 )	2025-06-06 09:03:25 +02:00
Georgi Gerganov	c04621711a	parallel : fix n_junk == 0 (#13952 )	2025-06-01 11:42:16 +03:00
Georgi Gerganov	803f8baf4f	llama : deprecate explicit kv_self defrag/update calls (#13921 ) ggml-ci	2025-05-31 15:58:33 +03:00
Georgi Gerganov	3f55f781f1	llama : auto-batch preparation (#13845 ) * llama : auto-batch ggml-ci * context : simplify if branching	2025-05-31 12:55:57 +03:00
Georgi Gerganov	12d0188c0d	kv-cache : refactor + add llama_memory_state_i (#13746 ) * kv-cache : simplify the "struct llama_kv_cache" interface ggml-ci * kv-cache : revert the (n_swa + n_ubatch) change (for next PR) ggml-ci * kv-cache : some comments ggml-ci * context : fix graph reserve for multiple sequences ggml-ci * kv-cache : fix typo [no ci] * kv-cache : fix find_slot() logic for free slots ggml-ci * llama : add TODO for deprecating the defrag API in the future * kv-cache : improve find_slot() using min/max seq pos info ggml-ci * llama : handle aborts and compute errors ggml-ci * memory : extract state into llama_memory_state ggml-ci * kv-cache : add comments ggml-ci * server : update batching logic to reset n_batch on successful decode * server : upon full re-processing, remove the sequence from the cache * kv-cache : add TODO for doing split_equal when split_simple fails ggml-ci	2025-05-31 10:24:04 +03:00
Georgi Gerganov	dd665cc9d4	parallel : increase the variability of the prompt lengths (#13927 ) ggml-ci	2025-05-30 19:38:07 +03:00
standby24x7	88c125f2ac	examples/training: Fix file name in README (#13803 ) This patch fixes binary file names in README.md. Signed-off-by: Masanari Iida <standby24x7@gmail.com>	2025-05-26 16:55:24 +02:00
Georgi Gerganov	79c137f776	examples : allow extracting embeddings from decoder contexts (#13797 ) ggml-ci	2025-05-26 14:03:54 +03:00
Sigbjørn Skjæret	2aa777d86d	examples : switch retrieval to llama_encode (#13685 ) * switch retrieval to llama_encode * enable --no-warmup for retrieval	2025-05-21 16:57:38 +02:00
Georgi Gerganov	797f2ac062	kv-cache : simplify the interface (#13660 ) * kv-cache : simplify the interface ggml-ci * context : revert llama_batch_allocr position change ggml-ci	2025-05-21 15:11:13 +03:00
Georgi Gerganov	a4090d1174	llama : remove llama_kv_cache_view API + remove deprecated (#13653 ) ggml-ci	2025-05-20 16:13:16 +03:00
Alberto Cabrera Pérez	725f23f1f3	sycl : backend documentation review (#13544 ) * sycl: reviewing and updating docs * Updates Runtime error codes * Improves OOM troubleshooting entry * Added a llama 3 sample * Updated supported models * Updated releases table	2025-05-19 14:38:20 +01:00
Nick	9c55e5c5c2	fix: check model pointer validity before use (#13631 )	2025-05-19 13:25:41 +03:00
Georgi Gerganov	518329b2d4	parallel : add option for non-shared and larger prompts (#13598 ) * parallel : add option for non-shared and larger prompts * parallel : update readme [no ci] * cont : add note about base models [no ci] * parallel : better var name ggml-ci	2025-05-17 12:58:55 +03:00
Johannes Gäßler	10d2af0eaa	llama/ggml: add LLM training support (#10544 ) * llama/ggml: add LLM training support more compact progress bar llama_save_model_to_file llama_opt_param_filter ggml_graph_dup force_grads refactor ggml_opt, fix test-opt * remove logits_all * refactor CUDA implementation for ACC * reset graph at beginning of opt period	2025-05-12 14:44:49 +02:00
Georgi Gerganov	6562e5a4d6	context : allow cache-less context for embeddings (#13108 ) * context : allow cache-less context for embeddings ggml-ci * context : enable reranking with encode() ggml-ci * context : encode() clears embd_seq ggml-ci * examples : use llama_encode() when appropriate ggml-ci * models : nomic bert moe does not require KV cache * llama : update comments for llama_decode/llama_encode ggml-ci * context : update warning log [no ci]	2025-05-08 14:28:33 +03:00
Georgi Gerganov	4773d7a02f	examples : remove infill (#13283 ) ggml-ci	2025-05-07 10:28:02 +03:00
Diego Devesa	1d36b3670b	llama : move end-user examples to tools directory (#13249 ) * llama : move end-user examples to tools directory --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-05-02 20:27:13 +02:00
Xuan-Son Nguyen	074e42ab31	convert : converting mmproj for Qwen2/2.5VL from convert_hf_to_gguf (#13209 ) * wip * qwen2.5vl ok * vision: fix models missing "text_config" * add test * fix test repo name * fix 32B model * Revert "fix 32B model" This reverts commit `651752f1ae`. * clarify about 32B * rm qwen surgery script * update llava/readme * move V_ENC_EMBD_PATCH handling to Qwen2VLVisionModel	2025-05-02 17:17:15 +02:00
Shakil Ahmed	e84773ab60	mtmd-cli : fix out_of_range when input image path is empty (#13244 ) * fix out_of_range error to keep the chat loop running * Update examples/llava/mtmd-cli.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * mtmd-cli : load image right away * add a new line for readability * rm printf * Update examples/llava/mtmd-cli.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update examples/llava/mtmd-cli.cpp --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-05-02 10:20:27 +02:00
Georgi Gerganov	fab647e884	server : add cache reuse card link to help (#13230 ) * server : add cache reuse card link to help * args : use short url	2025-05-02 09:48:31 +03:00
Loïc Carrère	b6e4ff69b8	clip : (minicpmv) Re-enable upscaling of images smaller than the CLIP image size (#13237 )	2025-05-01 21:32:21 +02:00
Xuan-Son Nguyen	8936784f7a	mtmd : add vision support for Mistral Small 3.1 (#13231 ) * convert ok * load ok, missing patch merger * ah sheet it works * update llava/readme * add test * fix test	2025-05-01 17:05:42 +02:00
Tatsuya Tanaka	ceda28ef8e	llava : remove duplicate include (#13207 )	2025-04-30 15:25:20 +02:00
xiaofei	a0f7016d17	rpc : fix cache directory initialization (#13188 ) Signed-off-by: xiaofei <hbuxiaofei@gmail.com>	2025-04-30 09:29:22 +03:00
matteo	e2e1ddb93a	server : Prefilling assistant message in openai compatible API (#13174 ) * Prefilling assistant message in openai compatible API * fixed indentation * fixed code convention * simplify method usage * no more than one assistant message at end of messages * merge checks into prefill code * Update examples/server/utils.hpp --------- Co-authored-by: matteo <matteo@naspc.lan> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-04-29 20:33:10 +02:00
Alberto Cabrera Pérez	5a63980117	llama-bench: fixed size of fields to correctly map to values (#13183 )	2025-04-29 17:24:36 +02:00
Xuan-Son Nguyen	00e3e5a194	mtmd : add qwen2vl and qwen2.5vl (#13141 ) * llava : add clip_n_output_tokens, deprecate clip_n_patches * mtmd : add qwen2vl and qwen2.5vl * decode_embd_batch::set_position_... * working version * deprecate llama-qwen2vl-cli * correct order W, H of clip_embd_nbytes_by_img * edit existing line in hot topics	2025-04-29 11:47:04 +02:00
Xuan-Son Nguyen	eaea325324	clip : fix model size display (#13153 )	2025-04-28 21:23:19 +02:00
Vishal Agarwal	1831f538f7	llama-bench: add `-d` depth arg (#13096 ) * add depth param * update llama-bench README and add depth param * llama-bench: default params for depth arg for faster execution * Update examples/llama-bench/README.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * fix buffer print ub * use user provided args * remove extra whitespaces --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-04-28 16:50:39 +02:00
Xuan-Son Nguyen	4e87962e34	mtmd : fix glm-edge redundant token count (#13139 ) * mtmd : fix glm-edge redundant token count * fix chat template * temporary disable GLMEdge test chat tmpl	2025-04-28 16:12:56 +02:00
Xuan-Son Nguyen	d2b2031e5f	llama : (mrope) allow using normal 1D position for text token (#13138 ) * llama : (mrope) use normal position for text token * rm n_pos_per_embd from llm_graph_input_attn_temp	2025-04-28 14:20:56 +02:00
Xuan-Son Nguyen	5fa9e63be8	clip : refactor set input for cgraph + fix qwen2.5vl input (#13136 ) * clip : refactor set input for cgraph * more strict assert * minicpmv : use clip_n_mmproj_embd instead of copying the same code everywhere * split qwen2 and qwen2.5 code blocks * minor style fix	2025-04-28 12:18:59 +02:00
4onen	c0a97b762e	llama-bench : Add `--override-tensors` arg (#12922 ) * Add --override-tensors option to llama-bench * Correct llama-bench --override-tensors to --override-tensor * llama-bench: Update --override-tensors parsing to match --tensor-split, appear in test matrix. * Make new llama-bench util functions static to fix Ubuntu CI * llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)	2025-04-27 23:48:26 +02:00
LostRuins Concedo	59e991c23c	Fixes Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402 as has_qwen2vl_merger migration was incomplete (#13133 )	2025-04-27 12:43:37 +02:00
HimariO	ca2bb89eac	clip : Add Qwen2.5VL support (#12402 ) * implment vision model architecture, gguf convertor * handle window attention inputs * add debug utils * fix few incorrect tensor memory layout * move position id remap out of ggml to avoid int32 cuda operations * cleaning up * ignore transformers Qwen2_5_xxx type check * remove not so often use `qwen2vl-cli` debug functions * remove commented-out code blocks * fix attn weight scaling after rebase * add `PROJECTOR_TYPE_QWEN2_5_VL` * remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM` * replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN` * remove `attn_window_size` from gguf * fix model conversion * clean up * fix merging problem * add test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-04-27 10:10:34 +02:00
Xuan-Son Nguyen	4753791e70	clip : improve projector naming (#13118 ) * clip : improve projector naming * no more kv has_llava_projector * rm unused kv * rm more unused	2025-04-26 22:39:47 +02:00
frob	d5fe4e81bd	grammar : handle maxItems == 0 in JSON schema (#13117 ) Co-authored-by: Richard Lyons <frob@cloudstaff.com>	2025-04-26 10:10:20 +02:00
Xuan-Son Nguyen	edb18b6e8f	clip : fix pixtral on some GPU backends (#13097 ) * clip : fix pixtral on some GPU backends * refactor inp_raw set * rm outdated comment * fix dynamic size * add TODO	2025-04-25 14:31:42 +02:00
Xuan-Son Nguyen	13be08daf9	clip : remove boi/eoi embeddings for GLM-edge model (#13081 )	2025-04-24 22:17:04 +02:00
Georgi Gerganov	226251ed56	embeddings : fix batch sizes (#13076 ) ggml-ci	2025-04-24 22:29:22 +03:00
Georgi Gerganov	13b4548877	cmake : do not include ./src as public for libllama (#13062 ) * cmake : do not include ./src as public for libllama ggml-ci * cmake : rework tests ggml-ci * llguidance : remove unicode include ggml-ci * cmake : make c++17 private ggml-ci	2025-04-24 16:00:10 +03:00
Xuan-Son Nguyen	7c727fbe39	arg : add --no-mmproj-offload (#13093 ) * arg : add --no-mmproj-offload * Update common/arg.cpp	2025-04-24 14:04:14 +02:00
Xuan-Son Nguyen	80982e815e	arg : clean up handling --mmproj with -hf (#13082 ) * arg : clean up handling --mmproj with -hf * rm change about no_mmproj * Revert "rm change about no_mmproj" This reverts commit `2cac8e0efb`. * handle no_mmproj explicitly * skip download mmproj on examples not using it	2025-04-24 12:14:13 +02:00
pl752	5630406959	llama-mtmd-cli: Sigint rework in mtmd vision example (#13080 ) * Sigint rework in mtmd vision example * Applied suggestions on mtmd-cli PR * Forgot to invert one of the conditions * Update examples/llava/mtmd-cli.cpp * Removed redundant exit check --------- Co-authored-by: pl752 <maximpl752@gmail.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-04-23 23:32:35 +02:00
Xuan-Son Nguyen	ecda2ec4b3	mtmd : Support Pixtral 12B (#13065 ) * add pixtral text model (vision is wip) * cgraph ok, just missing 2D RoPE * fix bad rebase * first working version * fix problem with img_break token * support dynamic image size * update docs * update test script	2025-04-23 20:21:59 +02:00
Radoslav Gerganov	2cca6c01e4	rpc : add command line option for number of threads for the CPU backend (#13060 ) closes #13051	2025-04-23 10:32:49 +03:00
Xuan-Son Nguyen	dc39a5e7a8	mtmd : support SmolVLM (version 1 and 2) (#13050 ) * mtmd : support SmolVLM (version 1 and 2) * correct chat template * fix n_patches * scale_factor is an int * add more models to test	2025-04-22 16:24:54 +02:00
Xuan-Son Nguyen	243453533e	llava : update documentations (#13055 ) * llava : update documentations * fix typo	2025-04-22 10:37:00 +02:00
Xuan-Son Nguyen	84a9bf2fc2	mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012 ) * mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli` * support for minicpmv * remove cpp files of llava and minicpmv * update hot topics * mtmd : add not supported msg for qwen2vl * Update examples/llava/mtmd.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-04-21 15:32:58 +02:00
Xuan-Son Nguyen	2016f07bd1	convert : experimental support for `--mmproj` flag (#13023 ) * convert : experimental support for `--mmproj` flag * fix bad ctrl+f replace * fix style * split into subclasses TextModel and VisionModel * rename Mode --> ModelBase * small fix * correct CLIP_VISION arch name (because existing GGUF already use it) * Apply suggestions from code review Co-authored-by: compilade <git@compilade.net> * fix Mistral3Model * fix typo Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: compilade <git@compilade.net>	2025-04-20 23:29:36 +02:00
Jeffrey Morgan	6602304814	llava: fix errors in clip.h on certain compilers (#13030 )	2025-04-20 12:15:41 +02:00
Xuan-Son Nguyen	37b9f0d29d	clip : refactor, add `image_manipulation` and `llava_uhd` classes (#13011 ) * clip : refactor, add `image_manipulation` and `llava_uhd` * refactor llava-1.6 preprocessing * simplify logic for llava-1.5 * missing include	2025-04-19 09:15:45 +02:00
Daniel Tang	6408210082	main : Fix Ctrl+D/newline handling (#12951 ) This restores the behavior from #491. This does not affect Ctrl+D's ability to terminate --multiline-input lines (#1040). This also actually implements #587: "If the user wants the text to end in a newline, this should be accomplished by explicitly adding a newline by using \ followed by return, then returning control by pressing return again." Fixes #12949	2025-04-18 22:02:55 +02:00
Xuan-Son Nguyen	35370ba945	server : use std::move whenever possible (#12936 ) * server : use std::move whenever possible * use r-value ref * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * make task creation scoped * restore std::move * fix task_id not set correctly * apply changes from suggestion Co-authored-by: ggerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-04-18 19:58:12 +02:00
Xuan-Son Nguyen	b9154ecff9	mtmd : add methods to access `mtmd_image_tokens` (#12906 ) * mtmd : add more api around mtmd_image_tokens * mtmd : ability to calc image hash * shared_ptr for mtmd_image_tokens * move hash to user-define ID (fixed) * fix prompt_modified * rm redundant data member	2025-04-18 10:04:51 +02:00
Radoslav Gerganov	2db9ba1464	rpc : add RPC_CMD_HELLO (#12955 ) Add RPC_CMD_HELLO for getting the version of the protocol implemend by the server. Follow the semantic versioning rules at https://semver.org Hopefully this bring better user experience when we make breaking changes at the protocol level and avoid issues like #12465	2025-04-18 10:13:42 +03:00
Russyyds	d6d2c2ab8c	Add performance print for gemma3 in example (#12929 )	2025-04-14 19:18:20 +02:00
Neo Zhang Jianyu	81c7e64fc2	dsiable curl lib check, this action is missed by commit `bd3f59f812` (#12761 ) (#12937 )	2025-04-14 18:19:07 +08:00
Ed Addario	71e90e8813	quantize: Handle user-defined quantization levels for additional tensors (#12511 ) * Add llama_model_quantize_params parameters * Add new quantize parameters parsing and validation * Update usage * Add new parameters defaults * Add new quantization parameters logic * Add llama_model_quantize_params parameters * Add new quantize parameters parsing and validation * Update usage * Add new parameters defaults * Add new quantization parameters logic * Minor refactoring as per the contributors' coding guidelines * Update descriptions to match existing style * Add llama_model_quantize_params parameters * Add new quantize parameters parsing and validation * Update usage * Add new parameters defaults * Add new quantization parameters logic * Minor refactoring as per the contributors' guidelines * Implement general --tensor-type instead of tensor-specific command option * Fix implied type bug * Restore missing #includes * Add regex capability for tensor selection * Refactor function name and update ALLOWED_TENSOR_TYPE * Add missing #include * Handle edge case when tensor name is cls.output * Minor logging improvement	2025-04-13 21:29:28 +03:00
Prajwal B Mehendarkar	bc091a4dc5	common : Define cache directory on AIX (#12915 )	2025-04-12 17:33:39 +02:00
Matt Clayton	e59ea539b8	llava: Fix cpu-only clip image encoding sefault (#12907 ) * llava: Fix cpu-only clip image encoding * clip : no smart ptr for ggml_backend_t * Fix for backend_ptr push_back --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-04-12 07:29:03 +02:00
Georgi Gerganov	c94085df28	server : add VSCode's Github Copilot Chat support (#12896 ) * server : add VSCode's Github Copilot Chat support * cont : update handler name	2025-04-11 23:37:41 +03:00
yuri@FreeBSD	e8a62631b3	rpc : Set cache directory in rpc-server.cpp on FreeBSD (#12903 )	2025-04-11 22:04:14 +02:00
tastelikefeet	b2034c2b55	contrib: support modelscope community (#12664 ) * support download from modelscope * support login * remove comments * add arguments * fix code * fix win32 * test passed * fix readme * revert readme * change to MODEL_ENDPOINT * revert tail line * fix readme * refactor model endpoint * remove blank line * fix header * fix as comments * update comment * update readme --------- Co-authored-by: tastelikefeet <yuze.zyz@alibaba-inc/com>	2025-04-11 14:01:56 +02:00
Xuan-Son Nguyen	0c50923944	clip : use smart pointer (⚠️ breaking change) (#12869 ) * clip : use smart pointers * fix warmup * add forward declaration * misisng include * fix include (2) * composite * simplify batch ptr * fix conflict	2025-04-11 12:09:39 +02:00
Xuan-Son Nguyen	8b9cc7cdd8	llava : introduce libmtmd (#12849 ) * wip llava2 * migrated gemma3 to llava2 * add timings * correct pre/postfix * fix missing include * fix compilation unused var warn * update llava2_tokenize * change name llava2 --> mtmd * improve api * refine helpers * Update examples/llava/mtmd.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-04-10 22:57:16 +02:00
Plamen Minev	381603a775	ci: detach common from the library (#12827 ) * fix: detach common from the library * fix: building chat test template	2025-04-09 10:11:11 +02:00
Xuan-Son Nguyen	65a69e6e1b	clip : do not print ftype (#12832 )	2025-04-09 10:09:53 +02:00
Matt Clayton	b32efad2bc	llava: improve clip_ctx destructor to not memleak load_image_size (#12834 )	2025-04-08 22:01:58 +02:00
Georgi Gerganov	a19b5cef16	llama : fix FA when KV cache is not used (i.e. embeddings) (#12825 ) * ggml : FA supports F32 V * graph : cast KV to F16 when the KV cache is not used ggml-ci * server : add test that exercises embeddings with FA enabled ggml-ci	2025-04-08 19:54:51 +03:00
Xuan-Son Nguyen	78a1ba0a4f	server : fix thread.join() on exit (#12831 )	2025-04-08 18:37:06 +02:00
dm4	2dabf759e7	llava: add more helper functions to check projector types in clip context (#12824 ) Signed-off-by: dm4 <sunrisedm4@gmail.com>	2025-04-08 15:49:13 +02:00
characharm	8ca6e1c3a4	server : webui : Improve Chat Input with Auto-Sizing Textarea (#12785 ) * Update ChatScreen.tsx * useAutosizeTextarea.ts useAutosizeTextarea to encapsulate the logic. * Implement responsive auto-sizing chat textarea Replaces the manual textarea resizing with an automatic height adjustment based on content. - `useChatTextarea` hook to manage textarea state and auto-sizing logic via refs, preserving the optimization - Textarea now grows vertically up to a maximum height (`lg:max-h-48`) on large screens (lg breakpoint and up). - Disables auto-sizing and enables manual vertical resizing (`resize-vertical`) on smaller screens for better mobile usability. - Aligns the "Send" button to the bottom of the textarea (`items-end`) for consistent positioning during resize. * -update compressed index.html.gz after npm run build -refactor: replace OptimizedTextareaValue with AutosizeTextareaApi in VSCode context hook * chore: normalize line endings to LF refactor: AutosizeTextareaApi -> chatTextareaApi * refactor: Rename interface to PascalCase --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-04-08 11:14:59 +02:00
stduhpf	4ccea213bc	hellaswag: display estimated score confidence interval (#12797 )	2025-04-07 18:47:08 +03:00
Xuan-Son Nguyen	bd3f59f812	cmake : enable curl by default (#12761 ) * cmake : enable curl by default * no curl if no examples * fix build * fix build-linux-cross * add windows-setup-curl * fix * shell * fix path * fix windows-latest-cmake* * run: include_directories * LLAMA_RUN_EXTRA_LIBS * sycl: no llama_curl * no test-arg-parser on windows * clarification * try riscv64 / arm64 * windows: include libcurl inside release binary * add msg * fix mac / ios / android build * will this fix xcode? * try clearing the cache * add bunch of licenses * revert clear cache * fix xcode * fix xcode (2) * fix typo	2025-04-07 13:35:19 +02:00
Sergey Fedorov	f1e3eb4249	common : fix includes in arg.cpp and gemma3-cli.cpp (#12766 ) * arg.cpp: add a missing include * gemma3-cli.cpp: fix cinttypes include	2025-04-05 17:46:00 +02:00
Xuan-Son Nguyen	0364178ca2	clip : refactor clip_init, add tests (#12757 ) * refactor clip_init * fix loading file * fix style * test ok * better test with report * add missing headers * clarify * add KEY_MM_PATCH_MERGE_TYPE * remove bool has_* pattern * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/llava/clip.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * use ggml_soft_max_ext * refactor logging system * add minicpm-v-o 2.6 for testing * use nullptr everywhere * fix Yi-VL model --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-04-05 17:17:40 +02:00
Nauful Shaikh	b772394297	server : webui : Upgrade daisyui, tailwindcss. (#12735 ) * Upgrade daisyui, tailwindcss. * Switch to all themes. * Revert a change. * Update formatting. * Install packages before npm build. * Revert "Install packages before npm build." This reverts commit `336c5147e6`. * Add index.html.gz * run build --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-04-04 16:09:52 +02:00
nick huang	23106f94ea	gguf-split : --merge now respects --dry-run option (#12681 ) * gguf-split now respects dry-run option * removing trailing space	2025-04-04 16:09:12 +02:00
Georgi Gerganov	a10b36c91a	llama : refactor kv cache guard (#12695 ) * llama : refactor kv cache guard ggml-ci * cont : fix comment [no ci] * llama : fix kv_cache restore logic ggml-ci * context : simplify kv cache updates ggml-ci * cont : better name [no ci] * llama : fix llama_decode return code when could not find KV slot ggml-ci * context : change log err -> warn [no ci] * kv-cache : add comment + warning	2025-04-02 14:32:59 +03:00
Xuan-Son Nguyen	42eb248f46	common : remove json.hpp from common.cpp (#12697 ) * common : remove json.hpp from common.cpp * fix comment	2025-04-02 09:58:34 +02:00
Xuan-Son Nguyen	267c1399f1	common : refactor downloading system, handle mmproj with -hf option (#12694 ) * (wip) refactor downloading system [no ci] * fix all examples * fix mmproj with -hf * gemma3: update readme * only handle mmproj in llava example * fix multi-shard download * windows: fix problem with std::min and std::max * fix 2	2025-04-01 23:44:05 +02:00
Sigbjørn Skjæret	1a85949067	llava : proper description fix (#12668 )	2025-03-31 11:28:30 +02:00
Sigbjørn Skjæret	f52d59d771	llava : fix clip loading GGUFs with missing description (#12660 )	2025-03-31 11:07:07 +02:00
marcoStocchi	52de2e5949	tts : remove printfs (#12640 ) * tts.cpp : llama tokens console output is done using LOG_INF instead of printf(). Therefore the options '--log-disable' and '--log-file' have now uniform impact on all output.	2025-03-31 11:20:30 +03:00
Benson Wong	5d01670266	server : include speculative decoding stats when timings_per_token is enabled (#12603 ) * Include speculative decoding stats when timings_per_token is true New fields added to the `timings` object: - draft_n : number of draft tokens generated - draft_accepted_n : number of draft tokens accepted - draft_accept_ratio: ratio of accepted/generated * Remove redundant draft_accept_ratio var * add draft acceptance rate to server console output	2025-03-28 10:05:44 +02:00
Radoslav Gerganov	ef03229ff4	rpc : update README for cache usage (#12620 )	2025-03-28 09:44:13 +02:00
Radoslav Gerganov	ab6ab8f809	rpc : send hash when tensor data is above some fixed threshold (#12496 ) * rpc : send hash when tensor data is above some fixed threshold ref #10095 * rpc : put cache under $HOME/.cache/llama.cpp * try to fix win32 build * another try to fix win32 build * remove llama as dependency	2025-03-28 08:18:04 +02:00
Piotr	2099a9d5db	server : Support listening on a unix socket (#12613 ) * server : Bump cpp-httplib to include AF_UNIX windows support Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com> * server : Allow running the server example on a unix socket Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com> --------- Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>	2025-03-27 23:41:04 +01:00
Ivy233	02082f1519	clip: Fix llama-llava-clip-quantize-cli quantization error under CUDA backend (#12566 ) * [Fix] Compiling clip-quantize-cli and running it in a CUDA environment will cause ggml_fp16_to_fp32 to report an error when trying to access video memory. You need to switch to the CPU backend to run quantize. After the fix, it will automatically run in the CPU backend and will no longer be bound to CUDA. * [Fix]Roll back the signature and implementation of clip_model_load, and change the call in clip_model_quantize to clip_init.	2025-03-26 15:06:04 +01:00
Eric Curtin	ef19c71769	run: de-duplicate fmt and format functions and optimize (#11596 )	2025-03-25 18:46:11 +01:00
Marius Gerdes	77f9c6bbe5	server : Add verbose output to OAI compatible chat endpoint. (#12246 ) Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.	2025-03-23 19:30:26 +01:00
marcoStocchi	ea1518e839	llama-tts : avoid crashes related to bad model file paths (#12482 )	2025-03-21 11:12:45 +02:00
Woof Dog	e04643063b	webui : Prevent rerendering on textarea input (#12299 ) * webui: Make textarea uncontrolled to eliminate devastating lag * Update index.html.gz * use signal-style implementation * rm console log * no duplicated savedInitValue set --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-03-20 15:57:43 +01:00
Georgi Gerganov	c6af2161b2	speculative : fix seg fault in certain cases (#12454 )	2025-03-18 19:35:11 +02:00
Georgi Gerganov	810e0af3f5	server : fix warmup draft cache type (#12446 ) ggml-ci	2025-03-18 12:05:42 +02:00
Sigbjørn Skjæret	60c902926c	docs : bring llama-cli conversation/template docs up-to-date (#12426 )	2025-03-17 21:14:32 +01:00
marcoStocchi	f4c3dd5daa	llama-tts : add '-o' option (#12398 ) * added -o option to specify an output file name * llama-tts returns ENOENT in case of file write error note : PR #12042 is closed as superseded with this one.	2025-03-15 17:23:11 +01:00
Eric Curtin	9f2250ba72	Add CLI arg to llama-run to adjust the number of threads used (#12370 ) We default to 4, sometimes we want to manually adjust this Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-03-14 16:41:20 +00:00
Victor	add2a3aa5a	server: fix "--grammar-file" parameter (#12285 )	2025-03-14 11:21:17 +01:00
Georgi Gerganov	e0dbec0bc6	llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181 ) * llama : refactor llama_context, llama_kv_cache, llm_build_context ggml-ci * graph : don't mutate the KV cache during defrag ggml-ci * context : reduce virtuals + remove test function ggml-ci * context : move interface implementation to source file + factory ggml-ci * graph : move KV cache build functions to llama_context impl ggml-ci * graph : remove model reference from build_pooling ggml-ci * graph : remove llama_model reference ggml-ci * kv_cache : provide rope factors ggml-ci * graph : rework inputs to use only unique_ptr, remove attn input abstraction ggml-ci * context : remove llama_context_i abstraction ggml-ci * context : clean-up ggml-ci * graph : clean-up ggml-ci * llama : remove redundant keywords (struct, enum) ggml-ci * model : adapt gemma3 ggml-ci * graph : restore same attention ops as on master ggml-ci * llama : remove TODO + fix indent ggml-ci	2025-03-13 12:35:44 +02:00
Ishaan Gandhi	2048b5913d	server : fix crash when using verbose output with input tokens that are not in printable range (#12178 ) (#12338 ) * Fix DOS index bug * Remove new APIs * remove extra line * Remove from API * Add extra newline * Update examples/server/server.cpp --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-03-13 11:10:05 +01:00
Daniel Bevenius	80a02aa858	llama.swiftui : fix xcframework dir in README [no ci] (#12353 ) This commit fixes the path to the xcframework in the README file which I had forgotten to change after renaming the build directory.	2025-03-12 13:45:32 +01:00
Xuan-Son Nguyen	7841fc723e	llama : Add Gemma 3 support (+ experimental vision capability) (#12343 ) * llama : Add Gemma 3 text-only support * fix python coding style * fix compile on ubuntu * python: fix style * fix ubuntu compile * fix build on ubuntu (again) * fix ubuntu build, finally * clip : Experimental support for Gemma 3 vision (#12344) * clip : Experimental support for Gemma 3 vision * fix build * PRId64	2025-03-12 09:30:24 +01:00
Xuan-Son Nguyen	96e1280839	clip : bring back GPU support (#12322 ) * clip : bring back GPU support * use n_gpu_layers param * fix double free * ggml_backend_init_by_type * clean up	2025-03-11 09:20:16 +01:00
marcoStocchi	6ef79a67ca	common : refactor '-o' option (#12278 ) As discussed in PR 'llama-tts : add -o option' (#12042): * common_params : 'out_file' string is the only output file name parameter left in common_params. It's intended to be used in all example programs implementing an '-o' option. * cvector-generator, export-lora, imatrix : default output filenames moved from 'common_params' to the 'main()' of each example program.	2025-03-10 13:34:13 +02:00
Olivier Chafik	be421fc429	`tool-call`: ensure there's always a non-empty tool call id (#12292 )	2025-03-10 09:45:29 +00:00
Olivier Chafik	2b3a25c212	`sampler`: fixes trigger tokens + lazy grammars (fix typo cast from token to string) (#12291 ) * Fix typo in lazy grammar handling (fixes trigger tokens) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-03-10 09:44:42 +00:00
tc-mb	8352cdc87b	llava : fix bug in minicpm-v code (#11513 ) * fix bug in minicpm-v code * update readme of minicpm-v	2025-03-10 10:33:24 +02:00
Georgi Gerganov	7ab364390f	server : infill gen ends on new line (#12254 )	2025-03-07 20:54:30 +02:00
Sigbjørn Skjæret	8fad3c7a7c	server : Log original chat template parsing error (#12233 )	2025-03-07 11:15:33 +01:00
Aaron Teo	e9b2f84f14	llava: add big-endian conversion for image encoder (#12218 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-03-06 09:33:21 +01:00
Han Yin	57b6abf85a	android : fix KV cache log message condition (#12212 )	2025-03-06 08:22:49 +02:00
Olivier Chafik	669912d9a5	`tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034 ) * sampler: turn lazy grammar trigger words to regexes * add scripts/tool_bench.sh & .py * constrain llama json output regardless of function name if matches at beginning * update relaxed newline space rule in grammar tests * support add_generation_prompt query parameter (useful for /apply_template) * Update src/llama-grammar.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-03-05 13:05:13 +00:00
Clauszy	06a92a193a	server : fix cache reuse logic (#12161 ) The first kv shift offsets the positions of all tokens after head_c. When using llama_kv_cache_seq_rm next, using head_c will remove the valid tokens because their positions have already been offset.	2025-03-05 09:25:45 +02:00
Daniel Bevenius	a057897ad4	llama : add xcframework build script (#11996 ) * llama : add xcframework build script This commit adds a script to build an XCFramework for Apple ios, macos, visionos, and tvos platforms. The generated XCFramework can then be added to a project and used in the same way as a regular framework. The llama.swiftui example project has been updated to use the XCFramework and can be started using the following command: ```console $ open examples/llama.swiftui/llama.swiftui.xcodeproj/ ``` Refs: https://github.com/ggml-org/llama.cpp/issues/10747 * examples : remove llama.cpp (source dir ref) from project.pbxproj This commit removes the reference to llama.cpp from the project.pbxproj file since Package.swift has been removed. * ci : updated build.yml to use build-xcframework.sh * ci : add xcframework build to github releases This commit adds the ability to create a GitHub release with the xcframework build artifact. * scripts : add apple app validation scripts This commit adds scripts that can validate the iOS, macOS, tvOS, and VisionOS applications. The scripts create a simple test app project, copy the llama.xcframework to the test project, build and archive the app, create an IPA from the archive, and validate the IPA using altool. The motivation for this is to provide some basic validation and hopefully avoid having to manually validate apps in Xcode. * llama : remove Package.swift This commit removes the Package.swift file, as we are now building an XCFramework for the project. * llama : remove Sources and spm-headers directories * llama : use TargetConditionals.h for visionOS/tvOS	2025-03-05 06:30:31 +01:00
mgroeber9110	5bbe6a9fe9	ggml : portability fixes for VS 2017 (#12150 ) * Add include files for std::min/max and std::toupper/tolower * win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined * Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode * win32: only use __restrict in MSVC if C11/C17 support is not enabled --------- Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>	2025-03-04 18:53:26 +02:00
Sigbjørn Skjæret	56d7a9f812	main: allow preloading conversation with -p and add -st / --single-turn (#12145 ) * Add chat template formatting to -no-cnv * only enable prompt formatting if explicitly enabled * add -st / --single-turn * add --single-turn and -p in conversation mode * fix -sys + -p * reword warning * small readability change and fix (long) outdated example usage * only activate single turn in conversation mode	2025-03-04 12:19:39 -04:00
Olivier Chafik	1a24c4621f	`server`: fix deadly typo in response_format.json_schema.schema handling (#12168 )	2025-03-04 08:24:07 +02:00
dm4	c43af9276b	tts: add speaker file support (#12048 ) * tts: add speaker file support Signed-off-by: dm4 <sunrisedm4@gmail.com> * tts: handle outetts-0.3 * tts : add new line in error message --------- Signed-off-by: dm4 <sunrisedm4@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-03-03 15:09:29 +02:00
Eric Curtin	c950a1f692	Adding UTF-8 support to llama.cpp (#12111 ) For emojis, non-alpha characters, etc. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-03-03 12:44:56 +00:00
Xuan-Son Nguyen	7b69003af7	webui : add ?m=... and ?q=... params (#12148 ) * webui : add ?m=... and ?q=... params * also clear prefilledMessage variable * better approach * fix comment * test: bump timeout on GITHUB_ACTION	2025-03-03 11:42:45 +01:00
Sigbjørn Skjæret	14dec0c2f2	main: use jinja chat template system prompt by default (#12118 ) * Use jinja chat template system prompt by default * faster conditional order * remove nested ternary --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-03-02 14:53:48 +01:00
Sigbjørn Skjæret	1782cdfed6	main: update outdated system prompt message (followup to #12131 ) (#12132 ) * Update outdated message * wording Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-03-01 15:22:27 +01:00
Sigbjørn Skjæret	45a8e76745	common : add --system-prompt parameter, replace behavior of -p in conversation mode (#12131 ) * Add --system-prompt parameter * use user defined system prompt * clarify Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * add warning * clarify Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-03-01 13:56:45 +01:00
Vivian	2cc4a5e44a	webui : minor typo fixes (#12116 ) * fix typos and improve menu text clarity * rename variable trimedValue to trimmedValue * add updated index.html.gz * rebuild --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-03-01 11:15:09 +01:00
Alex Brooks	84d5f4bc19	Update granite vision docs for 3.2 model (#12105 ) Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>	2025-02-28 11:31:47 +00:00
Ting Lou	a800ae46da	llava : add struct for FFI bindgen (#12079 ) * add struct for FFI bindgen * Apply suggestions from code review --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-02-26 15:26:52 +01:00
Olivier Chafik	d7cfe1ffe0	docs: add docs/function-calling.md to lighten server/README.md's plight (#12069 )	2025-02-25 18:52:56 +00:00
rhjdvsgsgks	401af80b54	server: handle echo=false on /v1/completions (#12060 )	2025-02-25 12:52:52 +01:00
Olivier Chafik	0b52745649	server: support add_generation_prompt query param (#12062 )	2025-02-25 10:40:22 +00:00
Alex Brooks	4d1051a40f	Add Doc for Converting Granite Vision -> GGUF (#12006 ) * Add example docs for granite vision Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>	2025-02-25 10:46:05 +01:00
Alex Brooks	7a2c913e66	llava : Add Granite Vision Support (#11794 ) * Add super wip scripts for multimodal granite gguf Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Add example for converting mmgranite to gguf Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * remove hardcoded path Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Add vision feature layer to gguf params Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Clean up llava surgery and remove name substitution hacks Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Add transformers llava next tensor name mapping Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Make siglip / openclip mutuall exclusive Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Fix projector linear substitution Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Fix linear 2 substitution index Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Increase max flattened gridpoints to 64 Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Fix hardcoded concat for multiple feature layers Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Pull vision feature layers out of gguf keys Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * fix num gridpoints and use all layers Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Avoid dropping last image encoder layer in llava models Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Use 10 for max number of patches Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Standardize vision feature layers Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Cleanup logs Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Update comment for vision feature layer init Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Update notes for alternative to legacy llm conversion script Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Fix notes rendering Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Add v prefix to vision feature layer log Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Use current defaults for feature layer Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Use constant for max gridpoints / feat layers, style fixes Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * clarify non-negative feature layers Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Remove CLIP_API from func signature Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * USE MAX_IMAGE_FEATURE_LAYERS const in layer calc Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Clarify feature layers are non negative ints and not uint Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Fix condition for reading feature layers Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * pop last llava layer when feature layers are unset Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Fix unset vision layer 0 Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Update examples/llava/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Reenable assertion for out of bounds get_rows Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Use std vector for gridpoints and feature layers Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Caculate max feature layer at load time Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Include base patch for granite vision allocation Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Fix trailing whitespace Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Add max num patches = 10 back for minicpmv Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Use unordered set to store feature layers Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Use max feature layer for postnorm Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> * Apply suggestions from code review --------- Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-02-24 17:09:51 +01:00
Neo Zhang Jianyu	08d5986290	[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035 ) * opt performance by reorder for Intel GPU * detect hw type and save opt feature, and print opt feature * correct name * support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed * add env variable GGML_SYCL_DISABLE_OPT for debug * use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT * add performance data * mv getrows functions to separeted files * fix global variables --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2025-02-24 22:33:23 +08:00
Florent BENOIT	7ad0779f5d	run: allow to customize prompt by env var LLAMA_PROMPT_PREFIX (#12041 ) Signed-off-by: Florent Benoit <fbenoit@redhat.com>	2025-02-23 17:15:51 +00:00
Eric Curtin	f777a73e18	Some llama-run cleanups (#11973 ) Use consolidated open function call from File class. Change read_all to to_string(). Remove exclusive locking, the intent for that lock is to avoid multiple processes writing to the same file, it's not an issue for readers, although we may want to consider adding a shared lock. Remove passing nullptr as reference, references are never supposed to be null. clang-format the code for consistent styling. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-02-23 13:14:32 +00:00
Ting Lou	36c258ee92	llava: build clip image from pixels (#11999 ) * llava: export function `clip_build_img_from_pixels` to build image from pixels decoded by other libraries instead of stb_image.h for better performance * Apply suggestions from code review --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-02-22 15:28:28 +01:00
Georgi Gerganov	cf756d6e0a	server : disable Nagle's algorithm (#12020 )	2025-02-22 11:46:31 +01:00
Daniel Bevenius	de8b5a3624	llama.swiftui : add "Done" dismiss button to help view (#11998 ) The commit updates the help view in the llama.swiftui example to use a NavigationView and a Done button to dismiss the help view. The motivation for this is that without this change there is now way to dimiss the help view.	2025-02-22 06:33:29 +01:00
Alex Brooks	ee02ad02c5	clip : fix visual encoders with no CLS (#11982 ) Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>	2025-02-21 08:11:03 +02:00
momonga	c392e5094d	server (webui): Fix Premature Submission During IME Conversion (#11971 ) * fix skip ime composing * fix npm rebuild * fix warn --------- Co-authored-by: momonga <115213907+mmnga@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-02-20 19:43:22 +01:00
Michael Engel	0d559580a0	run : add --chat-template-file (#11961 ) Relates to: https://github.com/ggml-org/llama.cpp/issues/11178 Added --chat-template-file CLI option to llama-run. If specified, the file will be read and the content passed for overwriting the chat template of the model to common_chat_templates_from_model. Signed-off-by: Michael Engel <mengel@redhat.com>	2025-02-20 10:35:11 +02:00
Georgi Gerganov	abd4d0bc4f	speculative : update default params (#11954 ) * speculative : update default params * speculative : do not discard the last drafted token	2025-02-19 13:29:42 +02:00
igardev	b58934c183	server : (webui) Enable communication with parent html (if webui is in iframe) (#11940 ) * Webui: Enable communication with parent html (if webui is in iframe): - Listens for "setText" command from parent with "text" and "context" fields. "text" is set in inputMsg, "context" is used as hidden context on the following requests to the llama.cpp server - On pressing na Escape button sends command "escapePressed" to the parent Example handling from the parent html side: - Send command "setText" from parent html to webui in iframe: const iframe = document.getElementById('askAiIframe'); if (iframe) { iframe.contentWindow.postMessage({ command: 'setText', text: text, context: context }, ''); } - Listen for Escape key from webui on parent html: // Listen for escape key event in the iframe window.addEventListener('keydown', (event) => { if (event.key === 'Escape') { // Process case when Escape is pressed inside webui } }); Move the extraContext from storage to app.context. * Fix formatting. * add Message.extra * format + build * MessageExtraContext * build * fix display * rm console.log --------- Co-authored-by: igardev <ivailo.gardev@akros.ch> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-02-18 23:01:44 +01:00
Olivier Chafik	63e489c025	tool-call: refactor common chat / tool-call api (+ tests / fixes) (#11900 ) * tool-call refactoring: moved common_chat_* to chat.h, common_chat_templates_init return a unique_ptr to opaque type * addressed clang-tidy lints in [test-]chat.* * rm minja deps from util & common & move it to common/minja/ * add name & tool_call_id to common_chat_msg * add common_chat_tool * added json <-> tools, msgs conversions to chat.h * fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens) * fix deepseek r1 slow test (no longer <think> opening w/ new template) * allow empty tools w/ auto + grammar * fix & test server grammar & json_schema params w/ & w/o --jinja	2025-02-18 18:03:23 +00:00
Xuan-Son Nguyen	63ac128563	server : add TEI API format for /rerank endpoint (#11942 ) * server : add TEI API format for /rerank endpoint * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix * also gitignore examples/server/*.gz.hpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-18 14:21:41 +01:00
xiaobing318	09aaf4f1f5	docs : Fix duplicated file extension in test command (#11935 ) This commit fixes an issue in the llama.cpp project where the command for testing the llama-server object contained a duplicated file extension. The original command was: ./tests.sh unit/test_chat_completion.py.py -v -x It has been corrected to: ./tests.sh unit/test_chat_completion.py -v -x This change ensures that the test script correctly locates and executes the intended test file, preventing test failures due to an incorrect file name.	2025-02-18 10:12:49 +01:00
Antoine Viallon	c4d29baf32	server : fix divide-by-zero in metrics reporting (#11915 )	2025-02-17 11:25:12 +01:00
Xuan-Son Nguyen	0f2bbe6564	server : bump httplib to 0.19.0 (#11908 )	2025-02-16 17:11:22 +00:00
708-145	fc10c38ded	examples: fix typo in imatrix/README.md (#11884 ) * simple typo fixed * Update examples/imatrix/README.md --------- Co-authored-by: Tobias Bergmann <tobias.bergmann@gmx.de> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-15 21:03:30 +02:00
Georgi Gerganov	68ff663a04	repo : update links to new url (#11886 ) * repo : update links to new url ggml-ci * cont : more urls ggml-ci	2025-02-15 16:40:57 +02:00
theraininsky	a7b8ce2260	llama-bench : fix unexpected global variable initialize sequence issue (#11832 ) * llama-bench : fix unexpected global variable initialize sequence issue * Update examples/llama-bench/llama-bench.cpp --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-14 02:13:43 +01:00
Reza Rahemtola	c1f958c038	server : (docs) Update wrong tool calling example (#11809 ) Call updated to match the tool used in the output just below, following the example in https://github.com/ggerganov/llama.cpp/pull/9639	2025-02-13 17:22:44 +01:00
Olivier Chafik	c7f460ab88	`server`: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless `--reasoning-format none` (#11607 ) * extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B * tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template * tool-calls: accommodate variety of wrong tool call opening tags both R1 Qwen 32B and 7B distills like to spit out * server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability * tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-13 10:05:16 +00:00
Vinesh Janarthanan	27e8a23300	sampling: add Top-nσ sampler (#11223 ) * initial sampling changes: * completed top nsigma sampler implementation * apply parameter to only llama-cli * updated readme * added tests and fixed nsigma impl * cleaned up pr * format * format * format * removed commented tests * cleanup pr and remove explicit floats * added top-k sampler to improve performance * changed sigma to float * fixed string format to float * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * added llama_sampler_init --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-13 08:45:57 +02:00
Oleksandr Kuvshynov	e4376270d9	llama.cpp: fix warning message (#11839 ) There was a typo-like error, which would print the same number twice if request is received with n_predict > server-side config. Before the fix: ``` slot launch_slot_: id 0 \| task 0 \| n_predict = 4096 exceeds server configuration, setting to 4096 ``` After the fix: ``` slot launch_slot_: id 0 \| task 0 \| n_predict = 8192 exceeds server configuration, setting to 4096 ```	2025-02-13 08:25:34 +02:00
Woof Dog	31afcbee0e	server : (webui) Give copy button back to all message bubbles (#11814 ) * All messages get the copy button * Update index.html.gz	2025-02-12 23:47:11 +01:00
JC	bfd11a2344	Fix: Compile failure due to Microsoft STL breaking change (#11836 )	2025-02-12 21:36:11 +01:00
Daniel Bevenius	a18f481f99	server : use common_token_to_piece instead of common_detokenize (#11740 ) * server : use common_token_to_piece instead of common_detokenize This commit replaces the call to common_detokenize with common_token_to_piece in the populate_token_probs. The motivation for this change is to avoid an issue where common_detokenize would remove the word boundary character for tokens, which caused a regression in the server generated token probabilities. Resolves: https://github.com/ggerganov/llama.cpp/issues/11728 * squash! server : use common_token_to_piece instead of common_detokenize Use common_token_to_piece for post_sampling_probs as well.	2025-02-11 14:06:45 +01:00
Xuan-Son Nguyen	507f9174fe	server : (webui) introduce conversation branching + idb storage (#11792 ) * server : (webui) introduce conversation branching + idb storage * mark old conv as "migrated" instead deleting them * improve migration * add more comments * more clarification	2025-02-10 21:23:17 +01:00
Xuan-Son Nguyen	0893e0114e	server : correct signal handler (#11795 )	2025-02-10 18:03:28 +01:00
pascal-lc	9ac3457b39	Update README.md [no ci] (#11781 ) typo: `\` -> `/` Change the UNIX path separator to` \`.	2025-02-10 09:05:57 +01:00
Eric Curtin	19d3c8293b	There's a better way of clearing lines (#11756 ) Use the ANSI escape code for clearing a line. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-02-09 10:34:49 +00:00
Xuan-Son Nguyen	55ac8c7791	server : (webui) revamp Settings dialog, add Pyodide interpreter (#11759 ) * redo Settings modal UI * add python code interpreter * fix auto scroll * build * fix overflow for long output lines * bring back sticky copy button * adapt layout on mobile view * fix multiple lines output and color scheme * handle python exception * better state management * add webworker * add headers * format code * speed up by loading pyodide on page load * (small tweak) add small animation to make it feels like claude	2025-02-08 21:54:50 +01:00
Woof Dog	e6e6583199	server : (webui) increase edit textarea size (#11763 )	2025-02-08 20:09:55 +01:00
Georgi Gerganov	aaa5505307	server : minor log updates (#11760 ) ggml-ci	2025-02-08 18:08:43 +02:00
Nikolaos Pothitos	3ab410f55f	readme : update front-end framework (#11753 ) After the migration to React with #11688	2025-02-08 10:43:04 +01:00
Xuan-Son Nguyen	0cf867160c	server : (webui) fix numeric settings being saved as string (#11739 ) * server : (webui) fix numeric settings being saved as string * add some more comments	2025-02-08 10:42:34 +01:00
Eric Curtin	d2fe216fb2	Make logging more verbose (#11714 ) Debugged an issue with a user who was on a read-only filesystem. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-02-07 14:42:46 +00:00
Xuan-Son Nguyen	2fb3c32a16	server : (webui) migrate project to ReactJS with typescript (#11688 ) * init version * fix auto scroll * bring back copy btn * bring back thought process * add lint and format check on CI * remove lang from html tag * allow multiple generations at the same time * lint and format combined * fix unused var * improve MarkdownDisplay * fix more latex * fix code block cannot be selected while generating	2025-02-06 17:32:29 +01:00
SAMI	1ec208083c	llava: add quantization for the visual projector LLAVA, Qwen2VL (#11644 ) * Added quantization for visual projector * Added README * Fixed the clip quantize implementation in the file * Fixed the gcc warning regarding minor linting * Removed trailing whitespace	2025-02-05 10:45:40 +03:00
Olivier Chafik	9f4cc8f8d3	`sync`: minja (#11641 ) * `sync`: minja `182de30cda` https://github.com/google/minja/pull/46 https://github.com/google/minja/pull/45	2025-02-05 01:00:12 +00:00
Xuan-Son Nguyen	3962fc1a79	server : add try..catch to places not covered by set_exception_handler (#11620 ) * server : add try..catch to places not covered by set_exception_handler * log_server_request: rm try catch, add reminder	2025-02-04 18:25:42 +01:00
Olivier Chafik	db288b60cb	`tool-call`: command r7b fix for normal responses (#11608 ) * fix command r7b normal response regex + add to server test * test multiline non-tool-call responses in test-chat	2025-02-04 15:48:53 +00:00
Jhen-Jie Hong	f117d84b48	swift : fix llama-vocab api usage (#11645 ) * swiftui : fix vocab api usage * batched.swift : fix vocab api usage	2025-02-04 13:15:24 +02:00
Olivier Chafik	cde3833239	`tool-call`: allow `--chat-template chatml` w/ `--jinja`, default to chatml upon parsing issue, avoid double bos (#11616 ) * tool-call: allow `--jinja --chat-template chatml` * fix double bos issue (drop bos/eos tokens from jinja template) * add missing try catch around jinja parsing to default to chatml * Simplify default chatml logic	2025-02-03 23:49:27 +00:00
Xuan-Son Nguyen	b3451785ac	server : (webui) revert hacky solution from #11626 (#11634 )	2025-02-04 00:10:52 +01:00
Woof Dog	1d1e6a90bc	server : (webui) allow typing and submitting during llm response (#11626 )	2025-02-03 23:16:27 +01:00
Daniel Bevenius	5598f475be	server : remove CPPHTTPLIB_NO_EXCEPTIONS define (#11622 ) This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server code. The motivation for this is that when using a debug build the server would crash when an exception was throws and terminate the server process, as it was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set cpp_httplib will not call the exception handler, which would normally return a 500 error to the client. This caused tests to fail when using a debug build. Fixes: https://github.com/ggerganov/llama.cpp/issues/11613	2025-02-03 16:45:38 +01:00
mashdragon	d92cb67e37	server : (webui) Fix Shift+Enter handling (#11609 ) * Fix Shift+Enter handling `exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway * build index.html.gz --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-02-03 10:42:55 +01:00
Eric Curtin	84ec8a58f7	Name colors (#11573 ) It's more descriptive, use #define's so we can use compile-time concatenations. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-02-02 15:14:48 +00:00
Olivier Chafik	bfcce4d693	`tool-call`: support Command R7B (+ return tool_plan "thoughts" in API) (#11585 ) * `tool-call`: support Command R7B (w/ tool_plan return) * `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override * `tool-call`: test cleanup / handle lazy grammar triggers	2025-02-02 09:25:38 +00:00
piDack	0cec062a63	llama : add support for GLM-Edge and GLM-Edge-V series models (#10573 ) * add glm edge chat model * use config partial_rotary_factor as rope ratio * support for glm edge model * vision model support * remove debug info * fix format * llava.cpp trailing whitespace * remove unused AutoTokenizer * Update src/llama.cpp for not contain <\|end\|> or </s> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * add edge template * fix chat template * fix confict * fix confict * fix ci err * fix format err * fix template err * 9b hf chat support * format * format clip.cpp * fix format * Apply suggestions from code review * Apply suggestions from code review * Update examples/llava/clip.cpp * fix format * minor : style --------- Co-authored-by: liyuhang <yuhang.li@zhipuai.cn> Co-authored-by: piDack <pcdack@hotmail.co> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: liyuhang <yuhang.li@aminer.cn> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-02 09:48:46 +02:00
Eric Curtin	ecef206ccb	Implement s3:// protocol (#11511 ) For those that want to pull from s3 Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-02-01 10:30:54 +00:00
Olivier Chafik	a83f528688	`tool-call`: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme (#11539 ) * An empty tool_call_id is better than none! * sync: minja (tool call name optional https://github.com/google/minja/pull/36) * Force-disable parallel_tool_calls if template doesn't support it * More debug logs * Llama 3.x tools: accept / trigger on more varied spaced outputs * Fix empty content for functionary v3.2 tool call * Add proper tool call docs to server README * readme: function calling is supported now * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-31 14:15:25 +00:00
Olivier Chafik	b1bcd309fc	fix stop regression (#11543 )	2025-01-31 13:48:31 +00:00
Olivier Chafik	5783575c9d	Fix chatml fallback for unsupported builtin templates (when --jinja not enabled) (#11533 )	2025-01-31 08:24:29 +00:00
Olivier Chafik	4a2b196d03	server : fix --jinja when there's no tools or schema (typo was forcing JSON) (#11531 )	2025-01-31 10:12:40 +02:00
Daniel Bevenius	a2df2787b3	server : update help metrics processing/deferred (#11512 ) This commit updates the help text for the metrics `requests_processing` and `requests_deferred` to be more grammatically correct. Currently the returned metrics look like this: ```console \# HELP llamacpp:requests_processing Number of request processing. \# TYPE llamacpp:requests_processing gauge llamacpp:requests_processing 0 \# HELP llamacpp:requests_deferred Number of request deferred. \# TYPE llamacpp:requests_deferred gauge llamacpp:requests_deferred 0 ``` With this commit, the metrics will look like this: ```console \# HELP llamacpp:requests_processing Number of requests processing. \# TYPE llamacpp:requests_processing gauge llamacpp:requests_processing 0 \# HELP llamacpp:requests_deferred Number of requests deferred. \# TYPE llamacpp:requests_deferred gauge llamacpp:requests_deferred 0 ``` This is also consistent with the description of the metrics in the server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).	2025-01-31 06:04:53 +01:00
Olivier Chafik	8b576b6c55	Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars (#9639 ) --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-01-30 19:13:58 +00:00
Daniel Bevenius	4314e56c4f	server : use lambda instead of std::bind (#11507 ) This commit replaces the two usages of `std::bind` in favor of lambdas for the callback functions for `callback_new_task` and `callback_update_slots`. The motivation for this changes is consistency with the rest of the code in server.cpp (lambdas are used for all other callbacks/handlers). Also lambdas are more readable (perhaps this is subjective) but also they are recommended over `std::bind` in modern C++. Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md	2025-01-30 11:05:00 +01:00
Isaac McFadyen	496e5bf46b	server : (docs) added response format for /apply-template [no ci] (#11503 )	2025-01-30 10:11:53 +01:00
Daniel Bevenius	e0449763a4	server : update json snippets in README.md [no ci] (#11492 ) This commit updates some of JSON snippets in README.md file and removes the `json` language tag from the code blocks. The motivation for this changes is that if there is invalid json in a code snippet these are highlighted in red which can make it somewhat difficult to read and can be a little distracting.	2025-01-30 05:48:14 +01:00
Nigel Bosch	eb7cf15a80	server : add /apply-template endpoint for additional use cases of Minja functionality (#11489 ) * add /apply-template endpoint to server * remove unnecessary line * add /apply-template documentation * return only "prompt" field in /apply-template * use suggested idea instead of my overly verbose way	2025-01-29 19:45:44 +01:00
Daniel Bevenius	e51c47b401	server : update auto gen files comments [no ci] (#11484 ) * server : update auto gen files comments This commit updates the 'auto generated files' comments in server.cpp and removes `deps.sh` from the comment. The motivation for this change is that `deps.sh` was removed in Commit `91c36c269b` ("server : (web ui) Various improvements, now use vite as bundler (#10599)"). * squash! server : update auto gen files comments [no ci] Move comments about file generation to README.md. * squash! server : update auto gen files comments [no ci] Remove the comments in server.cpp that mention that information can be found in the README.md file.	2025-01-29 16:34:18 +01:00
Eric Curtin	f0d4b29edf	Parse https://ollama.com/library/ syntax (#11480 ) People search for ollama models using the web ui, this change allows one to copy the url from the browser and for it to be compatible with llama-run. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-29 11:23:10 +00:00
peidaqi	cf8cc856d7	server : Fixed wrong function name in llamacpp server unit test (#11473 ) The test_completion_stream_with_openai_library() function is actually with stream=False by default, and test_completion_with_openai_library() with stream=True	2025-01-29 00:03:42 +01:00
Eric Curtin	7fee2889e6	Add github protocol pulling and http:// (#11465 ) As pulling protocols to llama-run Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-28 14:45:41 +00:00
Michael Engel	2b8525d5c8	Handle missing model in CLI parameters for llama-run (#11399 ) The HTTP client in llama-run only prints an error in case the download of a resource failed. If the model name in the CLI parameter list is missing, this causes the application to crash. In order to prevent this, a check for the required model parameter has been added and errors for resource downloads get propagated to the caller. Signed-off-by: Michael Engel <mengel@redhat.com>	2025-01-28 08:32:40 +00:00
Eric Curtin	a4417ddda9	Add new hf protocol for ollama (#11449 ) https://huggingface.co/docs/hub/en/ollama Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-27 19:36:10 +01:00
bandoti	19f65187cb	cmake: add ggml find package (#11369 ) * Add initial ggml cmake package * Add build numbers to ggml find-package * Expand variables with GGML_ prefix * Guard against adding to cache variable twice * Add git to msys2 workflow * Handle ggml-cpu-* variants * Link ggml/ggml-base libraries to their targets * Replace main-cmake-pkg with simple-cmake-pkg * Interface features require c_std_90 * Fix typo * Removed unnecessary bracket from status message * Update examples/simple-cmake-pkg/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/simple-cmake-pkg/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-26 12:07:48 -04:00
Xuan Son Nguyen	49b0e3cec4	server : fix cleaning up stream task (#11418 ) * server : fix cleaning up stream task * one more spot	2025-01-25 16:36:44 +01:00
Eric Curtin	01f37edf1a	Update llama-run README.md (#11386 ) For consistency Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-24 09:39:24 +00:00
stduhpf	c07e87f38b	server : (webui) put DeepSeek R1 CoT in a collapsible <details> element (#11364 ) * webui : put DeepSeek R1 CoT in a collapsible <details> element * webui: refactor split * webui: don't use regex to split cot and response * webui: format+qol * webui: no loading icon if the model isn't generating * ui fix, add configs * add jsdoc types * only filter </think> for assistant msg * build * update build --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-01-24 09:02:38 +01:00
Eric Curtin	05f63cc9ee	Update documentation (#11373 ) To show -n, -ngl, --ngl is acceptable. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-23 20:04:31 +00:00
Eric Curtin	f7fb43cd0b	Add -ngl (#11372 ) Most other llama.cpp cli tools accept -ngl with a single dash. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-23 16:16:18 +00:00
Xuan Son Nguyen	5845661640	server : add more clean up when cancel_tasks is called (#11340 ) * server : add more clean up when cancel_tasks is called * fix recv_with_timeout * std::remove_if * fix std::remove_if	2025-01-23 13:56:05 +01:00
Eric Curtin	f211d1dc10	Treat hf.co/ prefix the same as hf:// (#11350 ) ollama uses hf.co/ to specify huggingface prefix, like RamaLama uses hf:// Treat them similarly. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-23 10:38:20 +00:00
Diego Devesa	6152129d05	main : update README documentation for batch size (#11353 ) * main : update README documentation for batch size * fix formatting * minor	2025-01-22 19:22:20 +01:00
Diego Devesa	12c2bdf2de	server : fix draft context not being released (#11354 )	2025-01-22 17:44:40 +01:00
Jiří Podivín	96f4053934	Adding logprobs to /v1/completions (#11344 ) Signed-off-by: Jiri Podivin <jpodivin@redhat.com>	2025-01-22 12:51:32 +01:00
tc-mb	3e3357fd77	llava : support Minicpm-omni (#11289 ) * init * add readme * update readme * no use make * update readme * update fix code * fix editorconfig-checker * no change convert py * use clip_image_u8_free	2025-01-22 09:35:48 +02:00
Olivier Chafik	6171c9d258	Add Jinja template support (#11016 ) * Copy minja from `58f0ca6dd7` * Add --jinja and --chat-template-file flags * Add missing <optional> include * Avoid print in get_hf_chat_template.py * No designated initializers yet * Try and work around msvc++ non-macro max resolution quirk * Update test_chat_completion.py * Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template * Refactor test-chat-template * Test templates w/ minja * Fix deprecation * Add --jinja to llama-run * Update common_chat_format_example to use minja template wrapper * Test chat_template in e2e test * Update utils.py * Update test_chat_completion.py * Update run.cpp * Update arg.cpp * Refactor common_chat_* functions to accept minja template + use_jinja option * Attempt to fix linkage of LLAMA_CHATML_TEMPLATE * Revert LLAMA_CHATML_TEMPLATE refactor * Normalize newlines in test-chat-templates for windows tests * Forward decl minja::chat_template to avoid eager json dep * Flush stdout in chat template before potential crash * Fix copy elision warning * Rm unused optional include * Add missing optional include to server.cpp * Disable jinja test that has a cryptic windows failure * minja: fix vigogne (https://github.com/google/minja/pull/22) * Apply suggestions from code review Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Finish suggested renamings * Move chat_templates inside server_context + remove mutex * Update --chat-template-file w/ recent change to --chat-template * Refactor chat template validation * Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr) * Warn against missing eos / bos tokens when jinja template references them * rename: common_chat_template[s] * reinstate assert on chat_templates.template_default * Update minja to `b8437df626` * Update minja to https://github.com/google/minja/pull/25 * Update minja from https://github.com/google/minja/pull/27 * rm unused optional header --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-21 13:18:51 +00:00
Xuan Son Nguyen	e28245f35f	export-lora : fix tok_embd tensor (#11330 )	2025-01-21 14:07:12 +01:00
Eric Curtin	2e2f8f093c	linenoise.cpp refactoring (#11301 ) More RAII mainly Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-21 09:32:35 +00:00
Georgi Gerganov	80d0d6b4b7	common : add -hfd option for the draft model (#11318 ) * common : add -hfd option for the draft model * cont : fix env var * cont : more fixes	2025-01-20 22:29:43 +02:00
Georgi Gerganov	9f7add1cde	examples : fix add_special conditions (#11311 )	2025-01-20 16:36:08 +02:00
Georgi Gerganov	92bc493917	tests : increase timeout when sanitizers are enabled (#11300 ) * tests : increase timeout when sanitizers are enabled * tests : add DEFAULT_HTTP_TIMEOUT	2025-01-19 20:22:30 +02:00
Georgi Gerganov	b9daaffe02	simple-chat : fix BOS being added to each message (#11278 )	2025-01-19 18:12:09 +02:00
Eric Curtin	a1649cc13f	Adding linenoise.cpp to llama-run (#11252 ) This is a fork of linenoise that is C++17 compatible. I intend on adding it to llama-run so we can do things like traverse prompt history via the up and down arrows: https://github.com/ericcurtin/linenoise.cpp Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-18 14:42:31 +00:00
Xuan Son Nguyen	f30f099228	server : implement cancellable request (#11285 ) * server : implement cancellable request * fix typo * httplib 0.18.5 * fix i underflow	2025-01-18 14:12:05 +01:00
LostRuins Concedo	6390a998bf	tts : add guide tokens support (#11186 ) * Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences. * applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start	2025-01-18 12:20:57 +02:00
codezjx	3edfa7d375	llama.android: add field formatChat to control whether to parse special tokens when send message (#11270 )	2025-01-17 14:57:56 +02:00

... 3 4 5 6 7 ...

1718 Commits