Commit Graph

571 Commits

Author SHA1 Message Date
Sigbjørn Skjæret f324a3b715
chat : only remove double bos/eos if added (#15086)
* only remove double bos/eos if added

* fix tests
2025-08-05 20:43:36 +02:00
Diego Devesa ec428b02c3
llama : add --n-cpu-moe option (#15077)
* llama : add --n-cpu-moe option

Keeps the MoE weights of the first N layers in the CPU
2025-08-05 01:05:36 +02:00
compilade 19f68fa5a4
imatrix : warn when GGUF imatrix is saved without .gguf suffix (#15076)
* imatrix : add warning when suffix is not .gguf for GGUF imatrix

* imatrix : only warn about suffix when output format is unspecified
2025-08-04 23:26:52 +02:00
compilade d31192b4ee
imatrix : use GGUF by default (#14842)
* imatrix : use GGUF by default

* imatrix : use GGUF regardless of the output filename

The legacy format can only be produced with --output-format dat
2025-08-03 22:00:05 +02:00
Jhen-Jie Hong f738989dcb
chat : fix multiple tool_calls on hermes-2-pro (#14962) 2025-08-02 18:04:48 +08:00
Diego Devesa a06ed5feae
llama : add simple option to enable CPU for MoE weights (--cpu-moe) (#14992) 2025-07-31 20:15:41 +02:00
Aman Gupta 784524053d
Fix params bug in diffusion example (#14993) 2025-08-01 01:22:58 +08:00
Diego Devesa d6818d06a6
llama : allow other bufts when overriding to CPU, add --no-repack option (#14990) 2025-07-31 18:11:34 +02:00
g2mt 94933c8c2e
server : implement universal assisted decoding (#12635)
* llama-server : implement universal assisted decoding

* Erase prompt tail for kv-cache

* set vocab_dft_compatible in common_speculative

* rename ctx_main to ctx_tgt

* move vocab_dft_compatible to spec struct

* clear mem_dft, remove mem

* detokenize id_last for incompatible models

* update comment

* add --spec-replace flag

* accept special tokens when translating between draft/main models

* Escape spec-replace

* clamp draft result to size to params.n_draft

* fix comment

* clean up code

* restore old example

* log common_speculative_are_compatible in speculative example

* fix

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-31 14:25:23 +02:00
Aman Gupta 8a4a856277
Add LLaDA 8b Diffusion model (#14771)
* Add support for Llada-8b: diffusion model

* Add README

* Fix README and convert_hf_to_gguf

* convert_hf_to_gguf.py: address review comments

* Make everything in a single example

* Remove model-specific sampling

* Remove unused argmax

* Remove braced initializers, improve README.md a bit

* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps

* Remove adding the mask token

* Move add_add_bos_token to set_vocab

* use add_bool in gguf_writer.py
2025-07-31 19:49:09 +08:00
kallewoof 1a67fcc306
common : avoid logging partial messages (which can contain broken UTF-8 sequences) (#14937)
* bug-fix: don't attempt to log partial parsed messages to avoid crash due to unfinished UTF-8 sequences
2025-07-29 17:05:38 +02:00
Ed Addario d1aa0cc5d1
imatrix: add option to display importance score statistics for a given imatrix file (#12718)
* Add --show-statistics option

* Add --show-statistics logic

* Add tensor name parsing

* Tidy output format

* Fix typo in title

* Improve tensor influence ranking

* Add better statistics

* Change statistics' sort order

* Add Cosine Similarity

* Add header search path

* Change header search path to private

* Add weighted statistics per layer

* Update report title

* Refactor compute_statistics out of main

* Refactor compute_cossim out of load_imatrix

* Refactor compute_statistics out of load_imatrix

* Move imatrix statistics calculation into its own functions

* Add checks and validations

* Remove unnecessary include directory

* Rename labels

* Add m_stats getter and refactor compute_statistics out of load_imatrix

* Refactor variable names

* Minor cosmetic change

* Retrigger checks (empty commit)

* Rerun checks (empty commit)

* Fix unnecessary type promotion

Co-authored-by: compilade <git@compilade.net>

* Reverting change to improve code readability

* Rerun checks (empty commit)

* Rerun checks (empty commit)

* Rerun checks - third time's the Charm 🤞 (empty commit)

* Minor cosmetic change

* Update README

* Fix typo

* Update README

* Rerun checks (empty commit)

* Re-implement changes on top of #9400

* Update README.md

* Update README

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Update README.md

* Remove duplicate option in print_usage()

* Update README.md

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Remove input check

* Remove commented out code

---------

Co-authored-by: compilade <git@compilade.net>
2025-07-22 14:33:37 +02:00
Molly Sophia adef81781a
server : allow setting `--reverse-prompt` arg (#14799)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-22 09:24:22 +08:00
compilade 90083283ec
imatrix : use GGUF to store importance matrices (#9400)
* imatrix : allow processing multiple chunks per batch

* perplexity : simplify filling the batch

* imatrix : fix segfault when using a single chunk per batch

* imatrix : use GGUF to store imatrix data

* imatrix : fix conversion problems

* imatrix : use FMA and sort tensor names

* py : add requirements for legacy imatrix convert script

* perplexity : revert changes

* py : include imatrix converter requirements in toplevel requirements

* imatrix : avoid using designated initializers in C++

* imatrix : remove unused n_entries

* imatrix : allow loading mis-ordered tensors

Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* quantize : use unused imatrix chunk_size with LLAMA_TRACE

* common : use GGUF for imatrix output by default

* imatrix : two-way conversion between old format and GGUF

* convert : remove imatrix to gguf python script

* imatrix : use the function name in more error messages

* imatrix : don't use FMA explicitly

This should make comparisons between the formats easier
because this matches the behavior of the previous version.

* imatrix : avoid returning from void function save_imatrix

* imatrix : support 3d tensors with MUL_MAT

* quantize : fix dataset name loading from gguf imatrix

* common : move string_remove_suffix from quantize and imatrix

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* imatrix : add warning when legacy format is written

* imatrix : warn when writing partial data, to help guess dataset coverage

Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.

* imatrix : avoid loading model to convert or combine imatrix

* imatrix : avoid using imatrix.dat in README

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-19 12:51:22 -04:00
Georgi Gerganov 225e7a1438
llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Aman Gupta ab14019821
Support diffusion models: Add Dream 7B (#14644)
* Support diffusion models: Add Dream 7B

* Move diffusion to examples

* Move stuff to examples. Add patch to not use kv-cache

* Address review comments

* Make sampling fast

* llama: remove diffusion functions

* Add basic timings + cleanup

* More cleanup

* Review comments: better formating, use LOG instead std::cerr, re-use batch, use ubatch instead of max_length

* fixup!

* Review: move everything to diffusion-cli for now
2025-07-16 20:03:51 +08:00
Georgi Gerganov 6ffd4e9c44
server : pre-calculate EOG logit biases (#14721)
ggml-ci
2025-07-16 14:04:12 +03:00
Eric Zhang a457551332
cmake : do not search for curl libraries by ourselves (#14613)
* cmake : do not search for curl libraries by ourselves

* run : do not search for curl libraries by ourselves
2025-07-10 15:29:05 +03:00
Eric Zhang f9a867f592
cmake : bump llguidance version to v1.0.1 (#14609) 2025-07-10 08:19:37 +03:00
Eric Zhang ac44eb6c80
cmake : llguidance build parser library only (#14608) 2025-07-10 08:19:13 +03:00
Alawode Oluwandabira 17a1f0d2d4
server: Add ability to mount server at prefix (#14544)
* Add server_prefix

* Correct server path env

* Rename cli flag to --api-prefix

* Change all to api_prefix
2025-07-08 11:47:33 +03:00
matteo caf5681fcb
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
* initial commit for handling extra template kwargs

* enable_thinking and assistant prefill cannot be enabled at the same time

* can set chat_template_kwargs in command line

* added doc

* fixed formatting

* add support for extra context in generic template init

* coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* coding standard:  common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestions from code review

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix merge conflict

* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)

* normalize environment variable name

* simplify code

* prefill cannot be used with thinking models

* compatibility with the new reasoning-budget parameter

* fix prefill for non thinking models

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>
2025-06-29 20:02:53 +02:00
Sigbjørn Skjæret 40bfa04c95
common : use std::string_view now that we target c++17 (#14319) 2025-06-22 08:37:43 +03:00
Ruikai Peng dd6e6d0b6a
vocab : prevent tokenizer overflow (#14301)
* vocab : prevent stack overflow in tokenize

* vocab : return error instead of aborting on oversized token count

* vocab : INT32_MIN from llama_tokenize on overflow
2025-06-20 07:13:06 -07:00
Sigbjørn Skjæret 88fc854b4b
llama : improve sep token handling (#14272) 2025-06-20 14:04:09 +02:00
aa956 d67341dc18
server : add server parameters for draft model cache type (#13782)
Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>
2025-06-19 16:01:03 +03:00
fanyang 456af35eb7
build : suppress gcc15 compile warnings (#14261)
* Change _contains_any() substrs to std::string_view and fix the find comparison logic.
2025-06-19 14:49:48 +02:00
Sigbjørn Skjæret e434e69183
common : suggest --jinja when autodetection fails (#14222) 2025-06-16 21:58:42 +02:00
Diego Devesa 6adc3c3ebc
llama : add thread safety test (#14035)
* llama : add thread safety test

* llamafile : remove global state

* llama : better LLAMA_SPLIT_MODE_NONE logic

when main_gpu < 0 GPU devices are not used

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-16 08:11:43 -07:00
Georgi Gerganov d3e64b9f49
llama : rework embeddings logic (#14208)
* llama : rework embeddings logic

ggml-ci

* cont : fix rerank

ggml-ci

* cont : engrish [no ci]

* cont : fix rerank

ggml-ci

* server : support both embeddings and completions with single model

ggml-ci

* cont : avoid embeddings_org

ggml-ci
2025-06-16 14:14:00 +03:00
Piotr 3cb203c89f
llama-chat : Do not throw when tool parsing fails (#14012)
Currently when a model generates output which looks like a tool call,
but is invalid an exception is thrown and not handled, causing the cli
or llama-server to bail. Instead, handle the chat parser exception and
simply return the generated text in such cases.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-14 17:25:15 +01:00
Christian Kastner cc8d081879
cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)
* cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT

* cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*
2025-06-13 10:38:52 +02:00
Christian Kastner 09cf2c7c65
cmake : Improve build-info.cpp generation (#14156)
* cmake: Simplify build-info.cpp generation

The rebuild of build-info.cpp still gets triggered when .git/index gets
changes.

* cmake: generate build-info.cpp in build dir
2025-06-13 09:51:34 +03:00
bandoti 2e89f76b7a
common: fix issue with regex_escape routine on windows (#14133) 2025-06-11 17:19:44 -03:00
Sigbjørn Skjæret d4e0d95cf5
chore : clean up relative source dir paths (#14128) 2025-06-11 19:04:23 +02:00
Georgi Gerganov 745aa5319b
llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
2025-06-06 14:11:15 +03:00
Olivier Chafik c9bbc77931
`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
2025-06-02 10:15:44 -07:00
Max Krasnyansky 053b1539c0
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.

Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggresively which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* threading: disable SetThreadInfo() calls for older Windows versions

* Update tools/llama-bench/llama-bench.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-31 15:39:19 -07:00
Olivier Chafik e15898d1c7
server: allow unclosed thinking tags (#13931) 2025-05-31 08:26:10 -07:00
Georgi Gerganov 53f925074d
sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
2025-05-30 16:25:45 +03:00
Xuan-Son Nguyen 10961339b2
mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code
2025-05-28 22:35:22 +02:00
Đinh Trọng Huy e0e3aa231d
llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 19:01:58 +02:00
Olivier Chafik cdf94a1802
server: --offline mode (#13804)
* server: --offline mode (env: LLAMA_OFFLINE)

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 22:34:27 +01:00
Olivier Chafik 03f582ae8f
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match

* allow all parsers to parse non-tool-call content.

* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
Olivier Chafik d74e94c1b3
`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name

* fix tool_call.id (was in tool_call.function.id!) + add function type

* add tool_call.type

* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
Olivier Chafik f13847cfb5
server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are

* Add (slow) server test for completion + stream + stop
2025-05-26 14:16:37 +01:00
Olivier Chafik e121edc432
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 00:30:51 +01:00
Percy Piper c508256db2
rpc : Fix build on OpenBSD (#13541) 2025-05-25 15:35:53 +03:00
Olivier Chafik f5cd27b71d
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
* add common_json w/ support for truncated json healing

* add common_chat_msg_diff

* partial common_chat_parse

* refactor parser w/ optionals

* server: wire chat diffs in stream mode

* fix trigger of thinking models (must happen after thoughts are closed)

* fix functionary v3.2 raw python!

* rename: common_chat_syntax (now contains format)

* rm common_regex.at_start

* don't return empty <think></think>

* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)

* fix QwQ 32B tool call parsing after thoughts (hermes2)

* better logs for grammar triggers

* consume spaces after parse_json_tool_calls

* fix required tool calls w/ thinking models that have pre-opened thinking tags

* fix thinking model's initial trigger + test qwq's template

* run most test_tool_call tests in stream + non-stream modes

* make functionary v3.2 parsing more strict (differentiate first match from others)

* send final diff from server, to close off raw python arguments

* support partial content streaming in Generic mode

* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

* Update function-calling.md

* Update tool_bench.py

* chat-parser: remove input from exception (llm output may contain PII)

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
2025-05-25 01:48:08 +01:00
Xuan-Son Nguyen 797990c4bc
mtmd : add ultravox audio input (#13623)
* convert ok, load ok

* warmup ok

* test

* still does not work?

* fix padding

* temporary give up

* fix merge conflict

* build_ultravox()

* rm test

* fix merge conflict

* add necessary mtmd APIs

* first working version (only 4s of audio)

* will this monster compile?

* fix compile

* please compile

* fPIC

* fix windows

* various fixes

* clean up audio_helpers

* fix conversion

* add some debug stuff

* long audio input ok

* adapt the api

* add --audio arg

* final touch UX

* add miniaudio to readme

* fix typo

* refactor kv metadata

* mtmd_default_marker()
2025-05-22 20:42:48 +02:00
Sigbjørn Skjæret 2aa777d86d
examples : switch retrieval to llama_encode (#13685)
* switch retrieval to llama_encode

* enable --no-warmup for retrieval
2025-05-21 16:57:38 +02:00
Georgi Gerganov a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated (#13653)
ggml-ci
2025-05-20 16:13:16 +03:00
Georgi Gerganov e298d2fbd0
kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci
2025-05-20 08:05:46 +03:00
psocolovsky 1dfbf2cf3a
common : add load_progress_callback (#13617) 2025-05-19 21:17:36 +02:00
Isaac McFadyen 6a2bc8bfb7
server : added --no-prefill-assistant flag (#13608)
* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md
2025-05-17 23:59:48 +02:00
Georgi Gerganov 518329b2d4
parallel : add option for non-shared and larger prompts (#13598)
* parallel : add option for non-shared and larger prompts

* parallel : update readme [no ci]

* cont : add note about base models [no ci]

* parallel : better var name

ggml-ci
2025-05-17 12:58:55 +03:00
Z 3e0be1cace
llguidance : official v0.7.20 release (no actual changes) [noci] (#13594) 2025-05-16 22:56:28 +02:00
Olivier Chafik bc098c3cf0
minja: sync (qwen3) (#13573)
* minja: sync f06140fa52

- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58

---------

Co-authored-by: ochafik <ochafik@google.com>
2025-05-15 23:29:10 +01:00
Olivier Chafik aa48e373f2
`server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
* Inject date_string in llama 3.x + fix for functionary v2

https://github.com/ggml-org/llama.cpp/issues/12729

* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-05-15 02:39:51 +01:00
Olivier Chafik 3198405e98
`common`: add partial regex support (#12808)
* move string_find_partial_stop & string_ends_with to common

* add common_regex (supports partial matches)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* partial regex: add missing iterator end checks

* string utils: use string_views

* direct throw to avoid ggml.h include

* regex-partial: replace missed ggml_asserts

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-14 19:50:57 +01:00
Johannes Gäßler 10d2af0eaa
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
David Huang 7f323a589f
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386) 2025-05-11 14:18:39 +02:00
Sigbjørn Skjæret 43dfd741a5
llguidance : set tokenizer slices to default (#13424) 2025-05-10 17:19:52 +02:00
Xuan-Son Nguyen 7fef11766c
arg : add env var to control mmproj (#13416)
* arg : add env var to control mmproj

* small note about -hf --mmproj
2025-05-10 08:16:29 +02:00
Helton Reis 7c28a74e07
chore(llguidance): use tagged version that does not break the build (#13413) 2025-05-09 23:15:39 +03:00
Xuan-Son Nguyen 33eff40240
server : vision support via libmtmd (#12898)
* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip :  fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* rm can_be_detokenized

* on prmpt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-09 19:29:37 +02:00
Bartowski efb8b47eda
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)
* Add --parse-special for enabling parsing of special tokens in imatrix calculation

* whitespace
2025-05-09 11:53:58 +02:00
Diego Devesa 15e03282bb
ci : limit write permission to only the release step + fixes (#13392)
* ci : limit write permission to only the release step

* fix win cuda file name

* fix license file copy on multi-config generators
2025-05-08 23:45:22 +02:00
Xuan-Son Nguyen 8c83449cb7
server : (webui) revamp the input area, plus many small UI improvements (#13365)
* rework the input area

* process selected file

* change all icons to heroicons

* fix thought process collapse

* move conversation more menu to sidebar

* sun icon --> moon icon

* rm default system message

* stricter upload file check, only allow image if server has mtmd

* build it

* add renaming

* better autoscroll

* build

* add conversation group

* fix scroll

* extra context first, then user input in the end

* fix <hr> tag

* clean up a bit

* build

* add mb-3 for <pre>

* throttle adjustTextareaHeight to make it less laggy

* (nits) missing padding in sidebar

* rm stray console log
2025-05-08 15:37:29 +02:00
Georgi Gerganov 51fb96b1ff
context : remove logits_all flag (#13284)
* context : remove logits_all flag

ggml-ci

* llama : remove logits_all flag + reorder llama_context_params

ggml-ci
2025-05-08 14:26:50 +03:00
Ycros 39e73ae0d6
common : Add a warning when we can't match samplers from a string or char. (#13330) 2025-05-07 11:23:28 +03:00
Georgi Gerganov 4773d7a02f
examples : remove infill (#13283)
ggml-ci
2025-05-07 10:28:02 +03:00
oobabooga 233461f812
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264)
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering

* revert: sampler ordering

* revert: VS' crappy auto-formatting

* revert: VS' crappy auto-formatting pt.2

* revert: my crappy eye sight...

* sampling: add XTC to Top-nσ sampler chain

* sampling: add Dyna. Temp. to Top-nσ sampler chain

* sampling: actually remove Top-nσ from sampler(oops)

* Integrate top_n_sigma into main sampler chain

* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA

* Formatting

* Lint

* Exit early in the sampler if nsigma < 0

---------

Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>
2025-05-05 22:12:19 +02:00
Xuan-Son Nguyen 9b61acf060
mtmd : rename llava directory to mtmd (#13311)
* mv llava to mtmd

* change ref everywhere
2025-05-05 16:02:55 +02:00
Diego Devesa 1d36b3670b
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00
Georgi Gerganov fab647e884
server : add cache reuse card link to help (#13230)
* server : add cache reuse card link to help

* args : use short url
2025-05-02 09:48:31 +03:00
Diego Devesa d7a14c42a1
build : fix build info on windows (#13239)
* build : fix build info on windows

* fix cuda host compiler msg
2025-05-01 21:48:08 +02:00
Xuan-Son Nguyen 13c9a3319b
arg : remove CURLINFO_EFFECTIVE_METHOD (#13228) 2025-05-01 10:23:25 +02:00
Xuan-Son Nguyen 6f67cf1f48
arg : -hf do not fail if url mismatch (#13219)
* arg : -hf do not fail if url mismatch

* do not return if cannot parse metadata json
2025-04-30 21:29:15 +01:00
Olivier Chafik 3b127c7385
common : add -jf / --json-schema-file flag (#12011) 2025-04-30 14:52:35 +02:00
Xuan-Son Nguyen 5933e6fdc9
arg : allow using -hf offline (#13202)
* arg : allow using -hf offline

* add more comments in code [no ci]
2025-04-30 10:46:32 +02:00
Georgi Gerganov 43f2b07193
common : fix noreturn compile warning (#13151)
ggml-ci
2025-04-28 11:57:19 +03:00
Xuan-Son Nguyen 85f36e5e71
arg : fix unused variable (#13142) 2025-04-28 08:16:59 +03:00
Xuan-Son Nguyen 2d451c8059
common : add common_remote_get_content (#13123)
* common : add common_remote_get_content

* support max size and timeout

* add tests
2025-04-26 22:58:12 +02:00
frob d5fe4e81bd
grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-04-26 10:10:20 +02:00
Georgi Gerganov 13b4548877
cmake : do not include ./src as public for libllama (#13062)
* cmake : do not include ./src as public for libllama

ggml-ci

* cmake : rework tests

ggml-ci

* llguidance : remove unicode include

ggml-ci

* cmake : make c++17 private

ggml-ci
2025-04-24 16:00:10 +03:00
Xuan-Son Nguyen 7c727fbe39
arg : add --no-mmproj-offload (#13093)
* arg : add --no-mmproj-offload

* Update common/arg.cpp
2025-04-24 14:04:14 +02:00
Xuan-Son Nguyen 80982e815e
arg : clean up handling --mmproj with -hf (#13082)
* arg : clean up handling --mmproj with -hf

* rm change about no_mmproj

* Revert "rm change about no_mmproj"

This reverts commit 2cac8e0efb.

* handle no_mmproj explicitly

* skip download mmproj on examples not using it
2025-04-24 12:14:13 +02:00
Xuan-Son Nguyen 243453533e
llava : update documentations (#13055)
* llava : update documentations

* fix typo
2025-04-22 10:37:00 +02:00
Xuan-Son Nguyen 84a9bf2fc2
mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012)
* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-21 15:32:58 +02:00
Prajwal B Mehendarkar bc091a4dc5
common : Define cache directory on AIX (#12915) 2025-04-12 17:33:39 +02:00
Olivier Chafik b6930ebc42
`tool-call`: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates (#12900)
* `tool-call`: don't call common_chat_params_init_hermes_2_pro when there aren't tools (or when there's a schema)

* test all chat formats w/o tools
2025-04-11 21:47:52 +02:00
yuri@FreeBSD 68b08f36d0
common : Define cache directory on FreeBSD (#12892) 2025-04-11 21:45:44 +02:00
tastelikefeet b2034c2b55
contrib: support modelscope community (#12664)
* support download from modelscope

* support login

* remove comments

* add arguments

* fix code

* fix win32

* test passed

* fix readme

* revert readme

* change to MODEL_ENDPOINT

* revert tail line

* fix readme

* refactor model endpoint

* remove blank line

* fix header

* fix as comments

* update comment

* update readme

---------

Co-authored-by: tastelikefeet <yuze.zyz@alibaba-inc/com>
2025-04-11 14:01:56 +02:00
Prajwal B Mehendarkar 1d343b4069
arg : Including limits file on AIX (#12822) 2025-04-08 14:30:59 +02:00
Xuan-Son Nguyen bd3f59f812
cmake : enable curl by default (#12761)
* cmake : enable curl by default

* no curl if no examples

* fix build

* fix build-linux-cross

* add windows-setup-curl

* fix

* shell

* fix path

* fix windows-latest-cmake*

* run: include_directories

* LLAMA_RUN_EXTRA_LIBS

* sycl: no llama_curl

* no test-arg-parser on windows

* clarification

* try riscv64 / arm64

* windows: include libcurl inside release binary

* add msg

* fix mac / ios / android build

* will this fix xcode?

* try clearing the cache

* add bunch of licenses

* revert clear cache

* fix xcode

* fix xcode (2)

* fix typo
2025-04-07 13:35:19 +02:00
Sergey Fedorov f1e3eb4249
common : fix includes in arg.cpp and gemma3-cli.cpp (#12766)
* arg.cpp: add a missing include

* gemma3-cli.cpp: fix cinttypes include
2025-04-05 17:46:00 +02:00
エシュナヴァリシア c6ff5d2a8d
common: custom hf endpoint support (#12769)
* common: custom hf endpoint support

Add support for custom huggingface endpoints via HF_ENDPOINT environment variable

You can now specify a custom huggingface endpoint using the HF_ENDPOINT environment variable when using the --hf-repo flag, which works similarly to huggingface-cli's endpoint configuration.

Example usage:
HF_ENDPOINT=https://hf-mirror.com/ ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

The trailing slash in the URL is optional:
HF_ENDPOINT=https://hf-mirror.com ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

* Update common/arg.cpp

readability Improvement

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Apply suggestions from code review

---------

Co-authored-by: ベアトリーチェ <148695646+MakiSonomura@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-04-05 15:31:42 +02:00
Olivier Chafik 7a84777f42
sync: minja (#12739)
* sync: minja

https://github.com/google/minja/pull/57

* fix json include
2025-04-04 21:16:39 +01:00
R0CKSTAR 5f696e88e0
sync : minja (inclusionAI/Ling) and update tests (#12699)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-03 13:51:35 +02:00