llama.cpp

Commit Graph

Author	SHA1	Message	Date
Jonathan Graehl	5cdb27e091	finetune: SGD optimizer, more CLI args (#13873 ) * examples/finetune -opt SGD (stochastic gradient descent) memory opt add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating m, v tensors. support finetune.cpp arg -opt SGD (or sgd). (default adamw as before) llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw. (wikipedia 100 lines finetune) ( using the same GPU memory, adamw can only do before OOM 512 batch/context, reaching: train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00 val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00 SGD is superior, though it converges slower, with max before OOM 1728 batch/context (esp see the better validation perf): train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00 val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00 ) note: when finetuning long enough (or w/ enough -lr), validation accuracy eventually drops ('catastrophic forgetting') -lr-half (halflife) option useful for SGD to avoid oscillation or super slow underdamped learning (makes setting -lr more forgiving). terminal -lr for now is set by lr-halvings i.e. if you want at most 1/8 the initial -lr you set -lr-halvings 3. note: objective loss not directly comparable between adamw, sgd? - check perplexity or accuracy or consider relative improvements for convergence new finetune args -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before) cache (1 - wdalpha) in 'adamw' opt struct - no noticeable perf benefit, disabled (still done for new SGD though) since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet) test-opt checks adamw as before and now sgd (except for a few disabled tests for sgd only; probably just needs logging values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs) Vulkan: Implement GGML_OP_OPT_STEP_SGD * tests: Fix OPT_STEP_SGD test-backend-ops * SGD op param store weight-decay and not 1-alphawd minor + cosmetic changes * fix vulkan sgd * try CI fix --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-14 12:03:57 +02:00
Sigbjørn Skjæret	b3e16665e1	server : enable -td and -tbd parameters (#15172 )	2025-08-13 15:43:00 +02:00
Copilot	d8914fc47e	common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191 ) * Checkpoint from VS Code for coding agent session * Initial plan * Fix typo in --override-tensor-draft flag implementation * Add null termination for speculative tensor buffer overrides * Apply suggestions from code review * Apply suggestions from code review * Extract tensor override parsing logic to common function (addresses @slaren's feedback) * Apply suggestions from code review * Apply suggestions --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-08-13 12:44:40 +02:00
Xuan-Son Nguyen	fba5c0d680	chat : hotfix gpt-oss jinja raising an exception (#15243 ) * chat : hotfix gpt-oss jinja raising an exception * fix	2025-08-11 15:31:35 +02:00
Xuan-Son Nguyen	53d0a12658	server : allow specifying reasoning_format in HTTP request (#15238 )	2025-08-11 14:48:41 +02:00
Sachin Desai	3db4da56a5	chat : support Granite model reasoning and tool call (#14864 )	2025-08-06 20:27:30 +02:00
Georgi Gerganov	fd1234cb46	llama : add gpt-oss (#15091 ) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-05 22:10:36 +03:00
Sigbjørn Skjæret	f324a3b715	chat : only remove double bos/eos if added (#15086 ) * only remove double bos/eos if added * fix tests	2025-08-05 20:43:36 +02:00
Diego Devesa	ec428b02c3	llama : add --n-cpu-moe option (#15077 ) * llama : add --n-cpu-moe option Keeps the MoE weights of the first N layers in the CPU	2025-08-05 01:05:36 +02:00
compilade	19f68fa5a4	imatrix : warn when GGUF imatrix is saved without .gguf suffix (#15076 ) * imatrix : add warning when suffix is not .gguf for GGUF imatrix * imatrix : only warn about suffix when output format is unspecified	2025-08-04 23:26:52 +02:00
compilade	d31192b4ee	imatrix : use GGUF by default (#14842 ) * imatrix : use GGUF by default * imatrix : use GGUF regardless of the output filename The legacy format can only be produced with --output-format dat	2025-08-03 22:00:05 +02:00
Jhen-Jie Hong	f738989dcb	chat : fix multiple tool_calls on hermes-2-pro (#14962 )	2025-08-02 18:04:48 +08:00
Diego Devesa	a06ed5feae	llama : add simple option to enable CPU for MoE weights (--cpu-moe) (#14992 )	2025-07-31 20:15:41 +02:00
Aman Gupta	784524053d	Fix params bug in diffusion example (#14993 )	2025-08-01 01:22:58 +08:00
Diego Devesa	d6818d06a6	llama : allow other bufts when overriding to CPU, add --no-repack option (#14990 )	2025-07-31 18:11:34 +02:00
g2mt	94933c8c2e	server : implement universal assisted decoding (#12635 ) * llama-server : implement universal assisted decoding * Erase prompt tail for kv-cache * set vocab_dft_compatible in common_speculative * rename ctx_main to ctx_tgt * move vocab_dft_compatible to spec struct * clear mem_dft, remove mem * detokenize id_last for incompatible models * update comment * add --spec-replace flag * accept special tokens when translating between draft/main models * Escape spec-replace * clamp draft result to size to params.n_draft * fix comment * clean up code * restore old example * log common_speculative_are_compatible in speculative example * fix * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-31 14:25:23 +02:00
Aman Gupta	8a4a856277	Add LLaDA 8b Diffusion model (#14771 ) * Add support for Llada-8b: diffusion model * Add README * Fix README and convert_hf_to_gguf * convert_hf_to_gguf.py: address review comments * Make everything in a single example * Remove model-specific sampling * Remove unused argmax * Remove braced initializers, improve README.md a bit * Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps * Remove adding the mask token * Move add_add_bos_token to set_vocab * use add_bool in gguf_writer.py	2025-07-31 19:49:09 +08:00
kallewoof	1a67fcc306	common : avoid logging partial messages (which can contain broken UTF-8 sequences) (#14937 ) * bug-fix: don't attempt to log partial parsed messages to avoid crash due to unfinished UTF-8 sequences	2025-07-29 17:05:38 +02:00
Ed Addario	d1aa0cc5d1	imatrix: add option to display importance score statistics for a given imatrix file (#12718 ) * Add --show-statistics option * Add --show-statistics logic * Add tensor name parsing * Tidy output format * Fix typo in title * Improve tensor influence ranking * Add better statistics * Change statistics' sort order * Add Cosine Similarity * Add header search path * Change header search path to private * Add weighted statistics per layer * Update report title * Refactor compute_statistics out of main * Refactor compute_cossim out of load_imatrix * Refactor compute_statistics out of load_imatrix * Move imatrix statistics calculation into its own functions * Add checks and validations * Remove unnecessary include directory * Rename labels * Add m_stats getter and refactor compute_statistics out of load_imatrix * Refactor variable names * Minor cosmetic change * Retrigger checks (empty commit) * Rerun checks (empty commit) * Fix unnecessary type promotion Co-authored-by: compilade <git@compilade.net> * Reverting change to improve code readability * Rerun checks (empty commit) * Rerun checks (empty commit) * Rerun checks - third time's the Charm 🤞 (empty commit) * Minor cosmetic change * Update README * Fix typo * Update README * Rerun checks (empty commit) * Re-implement changes on top of #9400 * Update README.md * Update README * Update README.md Co-authored-by: compilade <git@compilade.net> * Update README.md Co-authored-by: compilade <git@compilade.net> * Update README.md * Remove duplicate option in print_usage() * Update README.md * Update README.md Co-authored-by: compilade <git@compilade.net> * Update README.md Co-authored-by: compilade <git@compilade.net> * Remove input check * Remove commented out code --------- Co-authored-by: compilade <git@compilade.net>	2025-07-22 14:33:37 +02:00
Molly Sophia	adef81781a	server : allow setting `--reverse-prompt` arg (#14799 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-07-22 09:24:22 +08:00
compilade	90083283ec	imatrix : use GGUF to store importance matrices (#9400 ) * imatrix : allow processing multiple chunks per batch * perplexity : simplify filling the batch * imatrix : fix segfault when using a single chunk per batch * imatrix : use GGUF to store imatrix data * imatrix : fix conversion problems * imatrix : use FMA and sort tensor names * py : add requirements for legacy imatrix convert script * perplexity : revert changes * py : include imatrix converter requirements in toplevel requirements * imatrix : avoid using designated initializers in C++ * imatrix : remove unused n_entries * imatrix : allow loading mis-ordered tensors Sums and counts tensors no longer need to be consecutive. * imatrix : more sanity checks when loading multiple imatrix files * imatrix : use ggml_format_name instead of std::string concatenation Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * quantize : use unused imatrix chunk_size with LLAMA_TRACE * common : use GGUF for imatrix output by default * imatrix : two-way conversion between old format and GGUF * convert : remove imatrix to gguf python script * imatrix : use the function name in more error messages * imatrix : don't use FMA explicitly This should make comparisons between the formats easier because this matches the behavior of the previous version. * imatrix : avoid returning from void function save_imatrix * imatrix : support 3d tensors with MUL_MAT * quantize : fix dataset name loading from gguf imatrix * common : move string_remove_suffix from quantize and imatrix Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * imatrix : add warning when legacy format is written * imatrix : warn when writing partial data, to help guess dataset coverage Also make the legacy format store partial data by using neutral values for missing data. This matches what is done at read-time for the new format, and so should get the same quality in case the old format is still used. * imatrix : avoid loading model to convert or combine imatrix * imatrix : avoid using imatrix.dat in README --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-19 12:51:22 -04:00
Georgi Gerganov	225e7a1438	llama : add high-throughput mode (#14363 ) * kv-cache : prepare K/V buffers for separation ggml-ci * batched-bench : fix oob write ggml-ci * llama : add "virtual sequences" ggml-ci * llama : use "stream" vs "virtual sequence" ggml-ci * graph : fix stream splitting when KV cache is not used ggml-ci * kv-cache : add multi-stream save/load support ggml-ci * llama : add "--attn-streams" flag ggml-ci * kv-cache : fix handling when find_slot fails ggml-ci * kv-cache : restore find_slot impl ggml-ci * kv-cache : add comments * kv-cache : add bounds checks for sequence id ggml-ci * cont : add n_seq_max to batch allocr ggml-ci * kv-cache : perform stream copies lazily after llama_synchronize ggml-ci * kv-cache : avoid throwing exceptions across the C boundary ggml-ci * CUDA: 4D FlashAttention support (#14628) * CUDA: 4D FlashAttention support * CUDA: fix WMMA FA kernel * llama : rename attn_streams -> kv_unified ggml-ci * common : rename kv_split -> kv_unified ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-07-16 16:35:42 +03:00
Aman Gupta	ab14019821	Support diffusion models: Add Dream 7B (#14644 ) * Support diffusion models: Add Dream 7B * Move diffusion to examples * Move stuff to examples. Add patch to not use kv-cache * Address review comments * Make sampling fast * llama: remove diffusion functions * Add basic timings + cleanup * More cleanup * Review comments: better formatting, use LOG instead std::cerr, re-use batch, use ubatch instead of max_length * fixup! * Review: move everything to diffusion-cli for now	2025-07-16 20:03:51 +08:00
Georgi Gerganov	6ffd4e9c44	server : pre-calculate EOG logit biases (#14721 ) ggml-ci	2025-07-16 14:04:12 +03:00
Eric Zhang	a457551332	cmake : do not search for curl libraries by ourselves (#14613 ) * cmake : do not search for curl libraries by ourselves * run : do not search for curl libraries by ourselves	2025-07-10 15:29:05 +03:00
Eric Zhang	f9a867f592	cmake : bump llguidance version to v1.0.1 (#14609 )	2025-07-10 08:19:37 +03:00
Eric Zhang	ac44eb6c80	cmake : llguidance build parser library only (#14608 )	2025-07-10 08:19:13 +03:00
Alawode Oluwandabira	17a1f0d2d4	server: Add ability to mount server at prefix (#14544 ) * Add server_prefix * Correct server path env * Rename cli flag to --api-prefix * Change all to api_prefix	2025-07-08 11:47:33 +03:00
matteo	caf5681fcb	server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196 ) * initial commit for handling extra template kwargs * enable_thinking and assistant prefill cannot be enabled at the same time * can set chat_template_kwargs in command line * added doc * fixed formatting * add support for extra context in generic template init * coding standard: common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * coding standard: common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestions from code review coding standard: cosmetic changes Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix merge conflict * chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context) * normalize environment variable name * simplify code * prefill cannot be used with thinking models * compatibility with the new reasoning-budget parameter * fix prefill for non thinking models --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>	2025-06-29 20:02:53 +02:00
Sigbjørn Skjæret	40bfa04c95	common : use std::string_view now that we target c++17 (#14319 )	2025-06-22 08:37:43 +03:00
Ruikai Peng	dd6e6d0b6a	vocab : prevent tokenizer overflow (#14301 ) * vocab : prevent stack overflow in tokenize * vocab : return error instead of aborting on oversized token count * vocab : INT32_MIN from llama_tokenize on overflow	2025-06-20 07:13:06 -07:00
Sigbjørn Skjæret	88fc854b4b	llama : improve sep token handling (#14272 )	2025-06-20 14:04:09 +02:00
aa956	d67341dc18	server : add server parameters for draft model cache type (#13782 ) Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>	2025-06-19 16:01:03 +03:00
fanyang	456af35eb7	build : suppress gcc15 compile warnings (#14261 ) * Change _contains_any() substrs to std::string_view and fix the find comparison logic.	2025-06-19 14:49:48 +02:00
Sigbjørn Skjæret	e434e69183	common : suggest --jinja when autodetection fails (#14222 )	2025-06-16 21:58:42 +02:00
Diego Devesa	6adc3c3ebc	llama : add thread safety test (#14035 ) * llama : add thread safety test * llamafile : remove global state * llama : better LLAMA_SPLIT_MODE_NONE logic when main_gpu < 0 GPU devices are not used --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-16 08:11:43 -07:00
Georgi Gerganov	d3e64b9f49	llama : rework embeddings logic (#14208 ) * llama : rework embeddings logic ggml-ci * cont : fix rerank ggml-ci * cont : engrish [no ci] * cont : fix rerank ggml-ci * server : support both embeddings and completions with single model ggml-ci * cont : avoid embeddings_org ggml-ci	2025-06-16 14:14:00 +03:00
Piotr	3cb203c89f	llama-chat : Do not throw when tool parsing fails (#14012 ) Currently when a model generates output which looks like a tool call, but is invalid an exception is thrown and not handled, causing the cli or llama-server to bail. Instead, handle the chat parser exception and simply return the generated text in such cases. Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>	2025-06-14 17:25:15 +01:00
Christian Kastner	cc8d081879	cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167 ) * cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT * cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*	2025-06-13 10:38:52 +02:00
Christian Kastner	09cf2c7c65	cmake : Improve build-info.cpp generation (#14156 ) * cmake: Simplify build-info.cpp generation The rebuild of build-info.cpp still gets triggered when .git/index gets changes. * cmake: generate build-info.cpp in build dir	2025-06-13 09:51:34 +03:00
bandoti	2e89f76b7a	common: fix issue with regex_escape routine on windows (#14133 )	2025-06-11 17:19:44 -03:00
Sigbjørn Skjæret	d4e0d95cf5	chore : clean up relative source dir paths (#14128 )	2025-06-11 19:04:23 +02:00
Georgi Gerganov	745aa5319b	llama : deprecate llama_kv_self_ API (#14030 ) * llama : deprecate llama_kv_self_ API ggml-ci * llama : allow llama_memory_(nullptr) ggml-ci * memory : add flag for optional data clear in llama_memory_clear ggml-ci	2025-06-06 14:11:15 +03:00
Olivier Chafik	c9bbc77931	`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933 ) * server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat * update unit/test_tool_call.py::test_thoughts	2025-06-02 10:15:44 -07:00
Max Krasnyansky	053b1539c0	threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995 ) * threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling We talked about adding LOW priority for GGML threads in the original threadpool PR. It might be useful for some cases to avoid contention. Latest Windows ARM64 releases started parking (offlining) the CPU cores more aggressively which results in suboptimal performance with n_threads > 4. To deal with that we now disable Power Throttling for our threads for the NORMAL and higher priorities. Co-authored-by: Diego Devesa <slarengh@gmail.com> * threading: disable SetThreadInfo() calls for older Windows versions * Update tools/llama-bench/llama-bench.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-05-31 15:39:19 -07:00
Olivier Chafik	e15898d1c7	server: allow unclosed thinking tags (#13931 )	2025-05-31 08:26:10 -07:00
Georgi Gerganov	53f925074d	sync : vendor (#13901 ) * sync : vendor ggml-ci * cont : fix httplib version ggml-ci * cont : fix lint * cont : fix lint * vendor : move to common folder /vendor ggml-ci * cont : fix lint * cont : move httplib to /vendor + use json_fwd.hpp ggml-ci * cont : fix server build ggml-ci * cont : add missing headers ggml-ci * cont : header clean-up ggml-ci	2025-05-30 16:25:45 +03:00
Xuan-Son Nguyen	10961339b2	mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866 ) * mtmd : move helpers to dedicated library * fix server build * rm leftover cmakelist code	2025-05-28 22:35:22 +02:00
Đinh Trọng Huy	e0e3aa231d	llama : add support for BertForSequenceClassification reranker (#13858 ) * convert: add support for BertForSequenceClassification * add support for reranking using BertForSequenceClassification * merge checks of eos and sep * fix lint --------- Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>	2025-05-28 19:01:58 +02:00
Olivier Chafik	cdf94a1802	server: --offline mode (#13804 ) * server: --offline mode (env: LLAMA_OFFLINE) --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-05-26 22:34:27 +01:00
Olivier Chafik	03f582ae8f	server: fix streaming crashes (#13786 ) * add preludes to content on partial regex match * allow all parsers to parse non-tool-call content. * tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash	2025-05-26 16:03:57 +01:00
Olivier Chafik	d74e94c1b3	`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800 ) * fix deltas of tool_call.function.name * fix tool_call.id (was in tool_call.function.id!) + add function type * add tool_call.type * populate empty tool_call.function.arguments on first delta	2025-05-26 14:56:49 +01:00
Olivier Chafik	f13847cfb5	server: fix regression on streamed non-chat completion w/ stops (#13785 ) * more forgiving message diffs: partial stop words aren't erased, full stops are * Add (slow) server test for completion + stream + stop	2025-05-26 14:16:37 +01:00
Olivier Chafik	e121edc432	`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771 ) --------- Co-authored-by: ochafik <ochafik@google.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-05-26 00:30:51 +01:00
Percy Piper	c508256db2	rpc : Fix build on OpenBSD (#13541 )	2025-05-25 15:35:53 +03:00
Olivier Chafik	f5cd27b71d	`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379 ) * add common_json w/ support for truncated json healing * add common_chat_msg_diff * partial common_chat_parse * refactor parser w/ optionals * server: wire chat diffs in stream mode * fix trigger of thinking models (must happen after thoughts are closed) * fix functionary v3.2 raw python! * rename: common_chat_syntax (now contains format) * rm common_regex.at_start * don't return empty <think></think> * accommodate yet another deepseek r1 distill fantasy syntax (`<｜tool▁calls｜>`) * fix QwQ 32B tool call parsing after thoughts (hermes2) * better logs for grammar triggers * consume spaces after parse_json_tool_calls * fix required tool calls w/ thinking models that have pre-opened thinking tags * fix thinking model's initial trigger + test qwq's template * run most test_tool_call tests in stream + non-stream modes * make functionary v3.2 parsing more strict (differentiate first match from others) * send final diff from server, to close off raw python arguments * support partial content streaming in Generic mode * tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5) * Update function-calling.md * Update tool_bench.py * chat-parser: remove input from exception (llm output may contain PII) --------- Co-authored-by: ochafik <ochafik@google.com> Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>	2025-05-25 01:48:08 +01:00
Xuan-Son Nguyen	797990c4bc	mtmd : add ultravox audio input (#13623 ) * convert ok, load ok * warmup ok * test * still does not work? * fix padding * temporary give up * fix merge conflict * build_ultravox() * rm test * fix merge conflict * add necessary mtmd APIs * first working version (only 4s of audio) * will this monster compile? * fix compile * please compile * fPIC * fix windows * various fixes * clean up audio_helpers * fix conversion * add some debug stuff * long audio input ok * adapt the api * add --audio arg * final touch UX * add miniaudio to readme * fix typo * refactor kv metadata * mtmd_default_marker()	2025-05-22 20:42:48 +02:00
Sigbjørn Skjæret	2aa777d86d	examples : switch retrieval to llama_encode (#13685 ) * switch retrieval to llama_encode * enable --no-warmup for retrieval	2025-05-21 16:57:38 +02:00
Georgi Gerganov	a4090d1174	llama : remove llama_kv_cache_view API + remove deprecated (#13653 ) ggml-ci	2025-05-20 16:13:16 +03:00
Georgi Gerganov	e298d2fbd0	kv-cache : add SWA support (#13194 ) * kv-cache : prepare for SWA ggml-ci * kv-cache : initial iSWA implementation ggml-ci * kv-cache : rework error recovery logic ggml-ci * models : fix Phi-3 SWA parameters ggml-ci * model : adjust Granite to rope factor changes ggml-ci * server : check if context can do shifts ggml-ci * iswa : for now, always enable shifts (experiment) ggml-ci * kv-cache : simplify SWA logic ggml-ci * kv-cache : apply defrag when we fail to find slots for the batch ggml-ci * llama : update docs about llama_decode ggml-ci * kv-cache : update warning logs when no space for the batch is available ggml-ci * llama : add llama_kv_self_seq_pos_min() * kv-cache : keep track of partial SWA computes and print warnings * server : disallow use cases involving partial SWA context ggml-ci * llama : add param to control SWA cache size ggml-ci * minor : clean-up ggml-ci	2025-05-20 08:05:46 +03:00
psocolovsky	1dfbf2cf3a	common : add load_progress_callback (#13617 )	2025-05-19 21:17:36 +02:00
Isaac McFadyen	6a2bc8bfb7	server : added --no-prefill-assistant flag (#13608 ) * added no-prefill-assistant flag * reworded documentation comment * updated server README.md	2025-05-17 23:59:48 +02:00
Georgi Gerganov	518329b2d4	parallel : add option for non-shared and larger prompts (#13598 ) * parallel : add option for non-shared and larger prompts * parallel : update readme [no ci] * cont : add note about base models [no ci] * parallel : better var name ggml-ci	2025-05-17 12:58:55 +03:00
Z	3e0be1cace	llguidance : official v0.7.20 release (no actual changes) [noci] (#13594 )	2025-05-16 22:56:28 +02:00
Olivier Chafik	bc098c3cf0	minja: sync (qwen3) (#13573 ) * minja: sync `f06140fa52` - https://github.com/google/minja/pull/67 (@grf53) - https://github.com/google/minja/pull/66 (@taha-yassine) - https://github.com/google/minja/pull/63 (@grf53) - https://github.com/google/minja/pull/58 --------- Co-authored-by: ochafik <ochafik@google.com>	2025-05-15 23:29:10 +01:00
Olivier Chafik	aa48e373f2	`server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802 ) * Inject date_string in llama 3.x + fix for functionary v2 https://github.com/ggml-org/llama.cpp/issues/12729 * move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation --------- Co-authored-by: ochafik <ochafik@google.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-05-15 02:39:51 +01:00
Olivier Chafik	3198405e98	`common`: add partial regex support (#12808 ) * move string_find_partial_stop & string_ends_with to common * add common_regex (supports partial matches) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/regex-partial.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/regex-partial.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/regex-partial.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * partial regex: add missing iterator end checks * string utils: use string_views * direct throw to avoid ggml.h include * regex-partial: replace missed ggml_asserts --------- Co-authored-by: ochafik <ochafik@google.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-05-14 19:50:57 +01:00
Johannes Gäßler	10d2af0eaa	llama/ggml: add LLM training support (#10544 ) * llama/ggml: add LLM training support more compact progress bar llama_save_model_to_file llama_opt_param_filter ggml_graph_dup force_grads refactor ggml_opt, fix test-opt * remove logits_all * refactor CUDA implementation for ACC * reset graph at beginning of opt period	2025-05-12 14:44:49 +02:00
David Huang	7f323a589f	Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386 )	2025-05-11 14:18:39 +02:00
Sigbjørn Skjæret	43dfd741a5	llguidance : set tokenizer slices to default (#13424 )	2025-05-10 17:19:52 +02:00
Xuan-Son Nguyen	7fef11766c	arg : add env var to control mmproj (#13416 ) * arg : add env var to control mmproj * small note about -hf --mmproj	2025-05-10 08:16:29 +02:00
Helton Reis	7c28a74e07	chore(llguidance): use tagged version that does not break the build (#13413 )	2025-05-09 23:15:39 +03:00
Xuan-Son Nguyen	33eff40240	server : vision support via libmtmd (#12898 ) * server : (experimental) vision support via libmtmd * mtmd : add more api around mtmd_image_tokens * mtmd : add more api around mtmd_image_tokens * mtmd : ability to calc image hash * shared_ptr for mtmd_image_tokens * move hash to user-define ID (fixed) * abstract out the batch management * small fix * refactor logic adding tokens to batch * implement hashing image * use FNV hash, now hash bitmap instead of file data * allow decoding image embedding to be split into batches * rm whitespace * disable some features when mtmd is on * fix --no-mmproj-offload * mtmd_context_params no timings * refactor server_inp to server_tokens * fix the failing test case * init * wip * working version * add mtmd::bitmaps * add test target * rm redundant define * test: mtmd_input_chunks_free * rm outdated comment * fix merging issue * explicitly create mtmd::input_chunks * mtmd_input_chunk_copy * add clone() * improve server_input struct * clip : fix confused naming ffn_up and ffn_down * rm ffn_i/o/g naming * rename n_embd, n_ff * small fix * no check n_ff * fix detokenize * add const to various places * add warning about breaking changes * add c api * helper: use mtmd_image_tokens_get_n_pos * fix ctx_shift * fix name shadowing * more strict condition * support remote image_url * remote image_url log * add CI test * do not log base64 * add "has_multimodal" to /props * remove dangling image * speculative: use slot.cache_tokens.insert * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * rm can_be_detokenized * on prmpt processing done, assert cache_tokens.size * handle_completions_impl returns void * adapt the new web ui * update docs and hot topics * rm assert * small fix (2) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-05-09 19:29:37 +02:00
Bartowski	efb8b47eda	imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389 ) * Add --parse-special for enabling parsing of special tokens in imatrix calculation * whitespace	2025-05-09 11:53:58 +02:00
Diego Devesa	15e03282bb	ci : limit write permission to only the release step + fixes (#13392 ) * ci : limit write permission to only the release step * fix win cuda file name * fix license file copy on multi-config generators	2025-05-08 23:45:22 +02:00
Xuan-Son Nguyen	8c83449cb7	server : (webui) revamp the input area, plus many small UI improvements (#13365 ) * rework the input area * process selected file * change all icons to heroicons * fix thought process collapse * move conversation more menu to sidebar * sun icon --> moon icon * rm default system message * stricter upload file check, only allow image if server has mtmd * build it * add renaming * better autoscroll * build * add conversation group * fix scroll * extra context first, then user input in the end * fix <hr> tag * clean up a bit * build * add mb-3 for <pre> * throttle adjustTextareaHeight to make it less laggy * (nits) missing padding in sidebar * rm stray console log	2025-05-08 15:37:29 +02:00
Georgi Gerganov	51fb96b1ff	context : remove logits_all flag (#13284 ) * context : remove logits_all flag ggml-ci * llama : remove logits_all flag + reorder llama_context_params ggml-ci	2025-05-08 14:26:50 +03:00
Ycros	39e73ae0d6	common : Add a warning when we can't match samplers from a string or char. (#13330 )	2025-05-07 11:23:28 +03:00
Georgi Gerganov	4773d7a02f	examples : remove infill (#13283 ) ggml-ci	2025-05-07 10:28:02 +03:00
oobabooga	233461f812	sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264 ) * sampling: add Top-nσ sampler to `llama-server` and sampler ordering * revert: sampler ordering * revert: VS' crappy auto-formatting * revert: VS' crappy auto-formatting pt.2 * revert: my crappy eye sight... * sampling: add XTC to Top-nσ sampler chain * sampling: add Dyna. Temp. to Top-nσ sampler chain * sampling: actually remove Top-nσ from sampler(oops) * Integrate top_n_sigma into main sampler chain * Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA * Formatting * Lint * Exit early in the sampler if nsigma < 0 --------- Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>	2025-05-05 22:12:19 +02:00
Xuan-Son Nguyen	9b61acf060	mtmd : rename llava directory to mtmd (#13311 ) * mv llava to mtmd * change ref everywhere	2025-05-05 16:02:55 +02:00
Diego Devesa	1d36b3670b	llama : move end-user examples to tools directory (#13249 ) * llama : move end-user examples to tools directory --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-05-02 20:27:13 +02:00
Georgi Gerganov	fab647e884	server : add cache reuse card link to help (#13230 ) * server : add cache reuse card link to help * args : use short url	2025-05-02 09:48:31 +03:00
Diego Devesa	d7a14c42a1	build : fix build info on windows (#13239 ) * build : fix build info on windows * fix cuda host compiler msg	2025-05-01 21:48:08 +02:00
Xuan-Son Nguyen	13c9a3319b	arg : remove CURLINFO_EFFECTIVE_METHOD (#13228 )	2025-05-01 10:23:25 +02:00
Xuan-Son Nguyen	6f67cf1f48	arg : -hf do not fail if url mismatch (#13219 ) * arg : -hf do not fail if url mismatch * do not return if cannot parse metadata json	2025-04-30 21:29:15 +01:00
Olivier Chafik	3b127c7385	common : add -jf / --json-schema-file flag (#12011 )	2025-04-30 14:52:35 +02:00
Xuan-Son Nguyen	5933e6fdc9	arg : allow using -hf offline (#13202 ) * arg : allow using -hf offline * add more comments in code [no ci]	2025-04-30 10:46:32 +02:00
Georgi Gerganov	43f2b07193	common : fix noreturn compile warning (#13151 ) ggml-ci	2025-04-28 11:57:19 +03:00
Xuan-Son Nguyen	85f36e5e71	arg : fix unused variable (#13142 )	2025-04-28 08:16:59 +03:00
Xuan-Son Nguyen	2d451c8059	common : add common_remote_get_content (#13123 ) * common : add common_remote_get_content * support max size and timeout * add tests	2025-04-26 22:58:12 +02:00
frob	d5fe4e81bd	grammar : handle maxItems == 0 in JSON schema (#13117 ) Co-authored-by: Richard Lyons <frob@cloudstaff.com>	2025-04-26 10:10:20 +02:00
Georgi Gerganov	13b4548877	cmake : do not include ./src as public for libllama (#13062 ) * cmake : do not include ./src as public for libllama ggml-ci * cmake : rework tests ggml-ci * llguidance : remove unicode include ggml-ci * cmake : make c++17 private ggml-ci	2025-04-24 16:00:10 +03:00
Xuan-Son Nguyen	7c727fbe39	arg : add --no-mmproj-offload (#13093 ) * arg : add --no-mmproj-offload * Update common/arg.cpp	2025-04-24 14:04:14 +02:00
Xuan-Son Nguyen	80982e815e	arg : clean up handling --mmproj with -hf (#13082 ) * arg : clean up handling --mmproj with -hf * rm change about no_mmproj * Revert "rm change about no_mmproj" This reverts commit `2cac8e0efb`. * handle no_mmproj explicitly * skip download mmproj on examples not using it	2025-04-24 12:14:13 +02:00
Xuan-Son Nguyen	243453533e	llava : update documentations (#13055 ) * llava : update documentations * fix typo	2025-04-22 10:37:00 +02:00
Xuan-Son Nguyen	84a9bf2fc2	mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012 ) * mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli` * support for minicpmv * remove cpp files of llava and minicpmv * update hot topics * mtmd : add not supported msg for qwen2vl * Update examples/llava/mtmd.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-04-21 15:32:58 +02:00
Prajwal B Mehendarkar	bc091a4dc5	common : Define cache directory on AIX (#12915 )	2025-04-12 17:33:39 +02:00
Olivier Chafik	b6930ebc42	`tool-call`: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates (#12900 ) * `tool-call`: don't call common_chat_params_init_hermes_2_pro when there aren't tools (or when there's a schema) * test all chat formats w/o tools	2025-04-11 21:47:52 +02:00
yuri@FreeBSD	68b08f36d0	common : Define cache directory on FreeBSD (#12892 )	2025-04-11 21:45:44 +02:00
tastelikefeet	b2034c2b55	contrib: support modelscope community (#12664 ) * support download from modelscope * support login * remove comments * add arguments * fix code * fix win32 * test passed * fix readme * revert readme * change to MODEL_ENDPOINT * revert tail line * fix readme * refactor model endpoint * remove blank line * fix header * fix as comments * update comment * update readme --------- Co-authored-by: tastelikefeet <yuze.zyz@alibaba-inc/com>	2025-04-11 14:01:56 +02:00
Prajwal B Mehendarkar	1d343b4069	arg : Including limits file on AIX (#12822 )	2025-04-08 14:30:59 +02:00
Xuan-Son Nguyen	bd3f59f812	cmake : enable curl by default (#12761 ) * cmake : enable curl by default * no curl if no examples * fix build * fix build-linux-cross * add windows-setup-curl * fix * shell * fix path * fix windows-latest-cmake* * run: include_directories * LLAMA_RUN_EXTRA_LIBS * sycl: no llama_curl * no test-arg-parser on windows * clarification * try riscv64 / arm64 * windows: include libcurl inside release binary * add msg * fix mac / ios / android build * will this fix xcode? * try clearing the cache * add bunch of licenses * revert clear cache * fix xcode * fix xcode (2) * fix typo	2025-04-07 13:35:19 +02:00
Sergey Fedorov	f1e3eb4249	common : fix includes in arg.cpp and gemma3-cli.cpp (#12766 ) * arg.cpp: add a missing include * gemma3-cli.cpp: fix cinttypes include	2025-04-05 17:46:00 +02:00
エシュナヴァリシア	c6ff5d2a8d	common: custom hf endpoint support (#12769 ) * common: custom hf endpoint support Add support for custom huggingface endpoints via HF_ENDPOINT environment variable You can now specify a custom huggingface endpoint using the HF_ENDPOINT environment variable when using the --hf-repo flag, which works similarly to huggingface-cli's endpoint configuration. Example usage: HF_ENDPOINT=https://hf-mirror.com/ ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is" The trailing slash in the URL is optional: HF_ENDPOINT=https://hf-mirror.com ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is" * Update common/arg.cpp readability Improvement Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Apply suggestions from code review --------- Co-authored-by: ベアトリーチェ <148695646+MakiSonomura@users.noreply.github.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-04-05 15:31:42 +02:00
Olivier Chafik	7a84777f42	sync: minja (#12739 ) * sync: minja https://github.com/google/minja/pull/57 * fix json include	2025-04-04 21:16:39 +01:00
R0CKSTAR	5f696e88e0	sync : minja (inclusionAI/Ling) and update tests (#12699 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-04-03 13:51:35 +02:00
Diego Devesa	e0e912f49b	llama : add option to override model tensor buffers (#11397 ) * llama : add option to override tensor buffers * ggml : fix possible underflow in ggml_nbytes	2025-04-02 14:52:01 +02:00
Xuan-Son Nguyen	42eb248f46	common : remove json.hpp from common.cpp (#12697 ) * common : remove json.hpp from common.cpp * fix comment	2025-04-02 09:58:34 +02:00
Xuan-Son Nguyen	267c1399f1	common : refactor downloading system, handle mmproj with -hf option (#12694 ) * (wip) refactor downloading system [no ci] * fix all examples * fix mmproj with -hf * gemma3: update readme * only handle mmproj in llava example * fix multi-shard download * windows: fix problem with std::min and std::max * fix 2	2025-04-01 23:44:05 +02:00
R0CKSTAR	a6f32f0b34	Fix clang warning in gguf_check_reserved_keys (#12686 ) * Fix clang warning in gguf_check_reserved_keys Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix typo Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-04-01 13:12:53 +02:00
Johannes Gäßler	dd373dd3bf	llama: fix error on bad grammar (#12628 )	2025-03-28 18:08:52 +01:00
Piotr	2099a9d5db	server : Support listening on a unix socket (#12613 ) * server : Bump cpp-httplib to include AF_UNIX windows support Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com> * server : Allow running the server example on a unix socket Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com> --------- Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>	2025-03-27 23:41:04 +01:00
Michał Moskal	2447ad8a98	upgrade to llguidance 0.7.10 (#12576 )	2025-03-26 11:06:09 -07:00
marcoStocchi	f4c3dd5daa	llama-tts : add '-o' option (#12398 ) * added -o option to specify an output file name * llama-tts returns ENOENT in case of file write error note : PR #12042 is closed as superseded with this one.	2025-03-15 17:23:11 +01:00
Sigbjørn Skjæret	774973b8f3	main : add -sysf / --system-prompt-file (#12249 ) (#12250 ) * add system_prompt_file * add -sysf / --system-prompt-file * remove system_prompt_file	2025-03-14 16:57:05 +01:00
fairydreaming	8fcb563613	Load all MoE experts during warmup (#11571 ) * llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup * common : use new API to enable warmup mode during model warmup --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2025-03-14 13:47:05 +01:00
Xuan-Son Nguyen	be7c303410	arg : no n_predict = -2 for examples except for main and infill (#12364 )	2025-03-13 12:34:54 +01:00
Georgi Gerganov	e0dbec0bc6	llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181 ) * llama : refactor llama_context, llama_kv_cache, llm_build_context ggml-ci * graph : don't mutate the KV cache during defrag ggml-ci * context : reduce virtuals + remove test function ggml-ci * context : move interface implementation to source file + factory ggml-ci * graph : move KV cache build functions to llama_context impl ggml-ci * graph : remove model reference from build_pooling ggml-ci * graph : remove llama_model reference ggml-ci * kv_cache : provide rope factors ggml-ci * graph : rework inputs to use only unique_ptr, remove attn input abstraction ggml-ci * context : remove llama_context_i abstraction ggml-ci * context : clean-up ggml-ci * graph : clean-up ggml-ci * llama : remove redundant keywords (struct, enum) ggml-ci * model : adapt gemma3 ggml-ci * graph : restore same attention ops as on master ggml-ci * llama : remove TODO + fix indent ggml-ci	2025-03-13 12:35:44 +02:00
marcoStocchi	6ef79a67ca	common : refactor '-o' option (#12278 ) As discussed in PR 'llama-tts : add -o option' (#12042): * common_params : 'out_file' string is the only output file name parameter left in common_params. It's intended to be used in all example programs implementing an '-o' option. * cvector-generator, export-lora, imatrix : default output filenames moved from 'common_params' to the 'main()' of each example program.	2025-03-10 13:34:13 +02:00
Olivier Chafik	4e39a3c332	`server`: extract <think> tags from qwq outputs (#12297 ) * extract <think> tags from qwq outputs * const for all static regexes in chat.cpp	2025-03-10 10:59:03 +00:00
Olivier Chafik	87c2630546	allow missing content in message if tool_calls provided (#12293 )	2025-03-10 09:45:07 +00:00
Georgi Gerganov	1e2f78a004	server : add speculative decoding presets for FIM (#12287 )	2025-03-09 19:08:20 +02:00
Olivier Chafik	7cf64f6bee	sync: minja - support QwQ-32B (#12235 ) `8a76f7815e`	2025-03-07 09:33:37 +00:00
Olivier Chafik	669912d9a5	`tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034 ) * sampler: turn lazy grammar trigger words to regexes * add scripts/tool_bench.sh & .py * constrain llama json output regardless of function name if matches at beginning * update relaxed newline space rule in grammar tests * support add_generation_prompt query parameter (useful for /apply_template) * Update src/llama-grammar.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-03-05 13:05:13 +00:00
mgroeber9110	5bbe6a9fe9	ggml : portability fixes for VS 2017 (#12150 ) * Add include files for std::min/max and std::toupper/tolower * win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined * Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode * win32: only use __restrict in MSVC if C11/C17 support is not enabled --------- Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>	2025-03-04 18:53:26 +02:00
Sigbjørn Skjæret	56d7a9f812	main: allow preloading conversation with -p and add -st / --single-turn (#12145 ) * Add chat template formatting to -no-cnv * only enable prompt formatting if explicitly enabled * add -st / --single-turn * add --single-turn and -p in conversation mode * fix -sys + -p * reword warning * small readability change and fix (long) outdated example usage * only activate single turn in conversation mode	2025-03-04 12:19:39 -04:00
dm4	c43af9276b	tts: add speaker file support (#12048 ) * tts: add speaker file support Signed-off-by: dm4 <sunrisedm4@gmail.com> * tts: handle outetts-0.3 * tts : add new line in error message --------- Signed-off-by: dm4 <sunrisedm4@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-03-03 15:09:29 +02:00
Sigbjørn Skjæret	45a8e76745	common : add --system-prompt parameter, replace behavior of -p in conversation mode (#12131 ) * Add --system-prompt parameter * use user defined system prompt * clarify Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * add warning * clarify Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-03-01 13:56:45 +01:00
Daniel Bevenius	d07c621393	common : add llama.vim preset for Qwen2.5 Coder (#11945 ) This commit adds a preset for llama.vim to use the default Qwen 2.5 Coder models. The motivation for this change is to make it easier to start a server suitable to be used with the llama.vim plugin. For example, the server can be started with a command like the following: ```console $ llama-server --fim-qwen-1.5b-default ``` Refs: https://github.com/ggml-org/llama.cpp/issues/10932	2025-02-19 12:29:52 +01:00
Georgi Gerganov	abd4d0bc4f	speculative : update default params (#11954 ) * speculative : update default params * speculative : do not discard the last drafted token	2025-02-19 13:29:42 +02:00
Olivier Chafik	63e489c025	tool-call: refactor common chat / tool-call api (+ tests / fixes) (#11900 ) * tool-call refactoring: moved common_chat_* to chat.h, common_chat_templates_init return a unique_ptr to opaque type * addressed clang-tidy lints in [test-]chat.* * rm minja deps from util & common & move it to common/minja/ * add name & tool_call_id to common_chat_msg * add common_chat_tool * added json <-> tools, msgs conversions to chat.h * fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens) * fix deepseek r1 slow test (no longer <think> opening w/ new template) * allow empty tools w/ auto + grammar * fix & test server grammar & json_schema params w/ & w/o --jinja	2025-02-18 18:03:23 +00:00
standby24x7	fe163d5bf3	common : Fix a typo in help (#11899 ) This patch fixes a typo in command help. prefx -> prefix Signed-off-by: Masanari Iida <standby24x7@gmail.com>	2025-02-16 10:51:13 +01:00
Georgi Gerganov	68ff663a04	repo : update links to new url (#11886 ) * repo : update links to new url ggml-ci * cont : more urls ggml-ci	2025-02-15 16:40:57 +02:00
Olivier Chafik	f355229692	server: fix type promotion typo causing crashes w/ --jinja w/o tools (#11880 )	2025-02-15 10:11:36 +00:00
Michał Moskal	89daa2564f	llguidance build fixes for Windows (#11664 ) * setup windows linking for llguidance; thanks @phil-scott-78 * add build instructions for windows and update script link * change VS Community link from DE to EN * whitespace fix	2025-02-14 12:46:08 -08:00
Daniel Bevenius	3d68f034da	llama : add completion for --chat-template-file (#11860 ) This commit adds completion for `--chat-template-file`, enabling only `.jinja` files to be displayed as completions. Example usage: ```console $ ./build/bin/llama-cli --chat-template-file models/templates/<TAB> models/templates/CohereForAI-c4ai-command-r7b-12-2024-tool_use.jinja models/templates/CohereForAI-c4ai-command-r-plus-tool_use.jinja models/templates/deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja models/templates/fireworks-ai-llama-3-firefunction-v2.jinja models/templates/google-gemma-2-2b-it.jinja models/templates/llama-cpp-deepseek-r1.jinja models/templates/meetkai-functionary-medium-v3.1.jinja models/templates/meetkai-functionary-medium-v3.2.jinja models/templates/meta-llama-Llama-3.1-8B-Instruct.jinja models/templates/meta-llama-Llama-3.2-3B-Instruct.jinja models/templates/meta-llama-Llama-3.3-70B-Instruct.jinja models/templates/microsoft-Phi-3.5-mini-instruct.jinja models/templates/mistralai-Mistral-Nemo-Instruct-2407.jinja models/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja models/templates/Qwen-Qwen2.5-7B-Instruct.jinja ``` This is not limited to the models/templates directory, it can be used anywhere in the filesystem, the above is just an example.	2025-02-14 11:16:56 +01:00
Daniel Bevenius	c48f630d1c	llama : add --completion-bash option (#11846 ) This commit adds a new option `--completion-bash` to the llama.cpp which outputs a source-able bash completion script. The motivation for this change is to provide a more user-friendly experience for users who use the command-line interface of llama.cpp. This is currently only basic and all options are displayed for all llama executables but this can be improved in the future if needed. Example usage: ```console $ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash $ source ~/.llama-completion.bash $ ./build/bin/llama-server --m<TAB> --main-gpu --mirostat --mirostat-lr --model --multiline-input --min-p --mirostat-ent --mlock --model-url ```	2025-02-13 14:46:59 +01:00
Olivier Chafik	c7f460ab88	`server`: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command R7B & DeepSeek R1) unless `--reasoning-format none` (#11607 ) * extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B * tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template * tool-calls: accommodate variety of wrong tool call opening tags both R1 Qwen 32B and 7B distills like to spit out * server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability * tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-13 10:05:16 +00:00
Vinesh Janarthanan	27e8a23300	sampling: add Top-nσ sampler (#11223 ) * initial sampling changes: * completed top nsigma sampler implementation * apply parameter to only llama-cli * updated readme * added tests and fixed nsigma impl * cleaned up pr * format * format * format * removed commented tests * cleanup pr and remove explicit floats * added top-k sampler to improve performance * changed sigma to float * fixed string format to float * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update src/llama-sampling.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * added llama_sampler_init --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-13 08:45:57 +02:00
JC	bfd11a2344	Fix: Compile failure due to Microsoft STL breaking change (#11836 )	2025-02-12 21:36:11 +01:00
bandoti	fef0cbeadf	cleanup: fix compile warnings associated with gnu_printf (#11811 )	2025-02-12 10:06:53 -04:00
Maxim Evtush	7b891bdc86	fix: typos in documentation files (#11791 ) * Update ggml.c * Update arg.cpp * Update speculative.h	2025-02-10 23:21:31 +01:00
Olivier Chafik	d7b31a9d84	sync: minja (`a72057e519`) (#11774 )	2025-02-10 09:34:09 +00:00
Eric Curtin	19d3c8293b	There's a better way of clearing lines (#11756 ) Use the ANSI escape code for clearing a line. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-02-09 10:34:49 +00:00
Christian Fillion	7ee953a64a	llama : add llama_sampler_init for safe usage of llama_sampler_free (#11727 ) The C API in llama.h claims users can implement `llama_sampler_i` to create custom `llama_sampler`. The sampler chain takes ownership and calls `llama_sampler_free` on them. However, `llama_sampler_free` is hard-coded to use `delete`. This is undefined behavior if the object wasn't also allocated via `new` from libllama's C++ runtime. Callers in C and C-compatible languages do not use C++'s `new` operator. C++ callers may not be sharing the same heap as libllama.	2025-02-07 11:33:27 +02:00
Daniel Bevenius	b7552cfcbc	common : add default embeddings presets (#11677 ) * common : add default embeddings presets This commit adds default embeddings presets for the following models: - bge-small-en-v1.5 - e5-small-v2 - gte-small These can be used with llama-embedding and llama-server. For example, with llama-embedding: ```console ./build/bin/llama-embedding --embd-gte-small-default -p "Hello, how are you?" ``` And with llama-server: ```console ./build/bin/llama-server --embd-gte-small-default ``` And the embeddings endpoint can then be called with a POST request: ```console curl --request POST \ --url http://localhost:8080/embeddings \ --header "Content-Type: application/json" \ --data '{"input": "Hello, how are you?"}' ``` I'm not sure if these are the most common embedding models but hopefully this can be a good starting point for discussion and further improvements. Refs: https://github.com/ggerganov/llama.cpp/issues/10932	2025-02-07 09:15:22 +01:00
Olivier Chafik	9f4cc8f8d3	`sync`: minja (#11641 ) * `sync`: minja `182de30cda` https://github.com/google/minja/pull/46 https://github.com/google/minja/pull/45	2025-02-05 01:00:12 +00:00
Radoslav Gerganov	1bef571f6a	arg : list RPC devices first when using --list-devices (#11655 ) List devices in the same order as they appear when evaluating the model and splitting tensors across devices, i.e. RPC devices come first in the list. ref #11435	2025-02-04 18:16:20 +02:00
Olivier Chafik	db288b60cb	`tool-call`: command r7b fix for normal responses (#11608 ) * fix command r7b normal response regex + add to server test * test multiline non-tool-call responses in test-chat	2025-02-04 15:48:53 +00:00

628 Commits