llama.cpp

Commit Graph

Author	SHA1	Message	Date
Eric Curtin	a1649cc13f	Adding linenoise.cpp to llama-run (#11252 ) This is a fork of linenoise that is C++17 compatible. I intend on adding it to llama-run so we can do things like traverse prompt history via the up and down arrows: https://github.com/ericcurtin/linenoise.cpp Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-18 14:42:31 +00:00
Xuan Son Nguyen	f30f099228	server : implement cancellable request (#11285 ) * server : implement cancellable request * fix typo * httplib 0.18.5 * fix i underflow	2025-01-18 14:12:05 +01:00
LostRuins Concedo	6390a998bf	tts : add guide tokens support (#11186 ) * Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences. * applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start	2025-01-18 12:20:57 +02:00
codezjx	3edfa7d375	llama.android: add field formatChat to control whether to parse special tokens when send message (#11270 )	2025-01-17 14:57:56 +02:00
Radoslav Gerganov	667d72846c	rpc : early register backend devices (#11262 ) Early register RPC devices and do not propagate RPC specifics in the llama model structures. ref: #10609	2025-01-17 10:57:09 +02:00
Georgi Gerganov	f11cfdfd7f	ci : use -no-cnv in gguf-split tests (#11254 ) * ci : use -no-cnv in gguf-split tests ggml-ci * ci : use -no-cnv in requantize tests ggml-ci * scripts : fix [no ci]	2025-01-15 18:28:35 +02:00
Daniel Bevenius	0ccd7f3eb2	examples : add embd_to_audio to tts-outetts.py [no ci] (#11235 ) This commit contains a suggestion for adding the missing embd_to_audio function from tts.cpp to tts-outetts.py. This introduces a dependency on numpy, which I was not sure is acceptable (only PyTorch was mentioned in the referenced PR). Also, the README has been updated with instructions to run the example with llama-server and the python script. Refs: https://github.com/ggerganov/llama.cpp/pull/10784#issuecomment-2548377734	2025-01-15 05:44:38 +01:00
ebraminio	c5bf0d1bd7	server : Improve code snippets direction between RTL text (#11221 )	2025-01-14 11:39:33 +01:00
ebraminio	504af20ee4	server : (UI) Improve messages bubble shape in RTL (#11220 ) I simply have overlooked message bubble's tail placement for RTL text as I use the dark mode and that isn't visible there and this fixes it.	2025-01-13 20:23:31 +01:00
Xuan Son Nguyen	84a44815f7	cli : auto activate conversation mode if chat template is available (#11214 ) * cli : auto activate conversation mode if chat template is detected * add warn on bad template * update readme (writing with the help of chatgpt) * update readme (2) * do not activate -cnv for non-instruct models	2025-01-13 20:18:12 +01:00
ebraminio	437e05f714	server : (UI) Support for RTL text as models input or output (#11208 )	2025-01-13 14:46:39 +01:00
Eric Curtin	924518e2e5	Reset color before we exit (#11205 ) We don't want colors to leak post termination of llama-run. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-12 18:23:10 +00:00
Georgi Gerganov	afa8a9ec9b	llama : add `llama_vocab`, functions -> methods, naming (#11110 ) * llama : functions -> methods (#11110) * llama : add struct llama_vocab to the API (#11156) ggml-ci * hparams : move vocab params to llama_vocab (#11159) ggml-ci * vocab : more pimpl (#11165) ggml-ci * vocab : minor tokenization optimizations (#11160) ggml-ci Co-authored-by: Diego Devesa <slarengh@gmail.com> * lora : update API names (#11167) ggml-ci * llama : update API names to use correct prefix (#11174) * llama : update API names to use correct prefix ggml-ci * cont ggml-ci * cont ggml-ci * minor [no ci] * vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174) ggml-ci * vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174) ggml-ci --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-01-12 11:32:42 +02:00
Daniel Bevenius	ba8a1f9c5b	examples : add README.md to tts example [no ci] (#11155 ) * examples : add README.md to tts example [no ci] * squash! examples : add README.md to tts example [no ci] Fix heading to be consistent with other examples, and add a quickstart section to README.md. * squash! examples : add README.md to tts example [no ci] Fix spelling mistake.	2025-01-10 13:16:16 +01:00
Daniel Bevenius	8eceb888d7	server : add tooltips to settings and themes btn (#11154 ) * server : add tooltips to settings and themes btn This commit adds tooltips to the settings and themes buttons in the webui. The tooltip will be displayed below the actual buttons when hovered over. The motivation for this change is to clarify the purpose of the themes button. * squash! server : add tooltips to settings and themes btn This commit adds a tooltip to the '...' button when a chat has been started. The tooltip is "Chat options", which I think could be a good description as the dropdown contains options to delete or download the current chat. * rm tooltip for 3 dots button --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-01-09 11:28:29 +01:00
Eric Curtin	1bf839b1e8	Enhance user input handling for llama-run (#11138 ) The main motivation for this change is it was not handling ctrl-c/ctrl-d correctly. Modify `read_user_input` to handle EOF, "/bye" command, and empty input cases. Introduce `get_user_input` function to manage user input loop and handle different return cases. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-08 18:47:05 +00:00
Georgi Gerganov	a3c1232c3f	arg : option to exclude arguments from specific examples (#11136 ) * arg : option to exclude arguments from specific examples ggml-ci * readme : remove old args [no ci]	2025-01-08 12:55:36 +02:00
Johannes Gäßler	53ff6b9b9f	GGUF: C++ refactor, backend support, misc fixes (#11030 ) * GGUF: C++ refactor, backend support, misc fixes remove ggml_tensor.backend update CODEOWNERS [no ci] remove gguf_get_data from API revise GGUF API data types	2025-01-07 18:01:58 +01:00
Eric Curtin	dc7cef9f37	llama-run : fix context size (#11094 ) Set `n_ctx` equal to `n_batch` in `Opt` class. Now context size is a more reasonable 2048. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-06 23:45:28 +01:00
Georgi Gerganov	e6e7c75d94	server : fix extra BOS in infill endpoint (#11106 ) * server : fix extra BOS in infill endpoint ggml-ci * server : update infill tests	2025-01-06 15:36:08 +02:00
Georgi Gerganov	47182dd03f	llama : update llama_model API names (#11063 ) * llama : deprecate llama_free_model, add llama_model_free ggml-ci * llama : change `llama_load_model_from_file` -> `llama_model_load_from_file` ggml-ci	2025-01-06 10:55:18 +02:00
Georgi Gerganov	3e6e7a6bc2	tokenize : escape the prompt (#11058 ) * tokenize : escape the prompt * tokenize : update help	2025-01-06 10:54:25 +02:00
Georgi Gerganov	727368c60f	llama : use LLAMA_TOKEN_NULL (#11062 ) ggml-ci	2025-01-06 10:52:15 +02:00
Georgi Gerganov	f66f582927	llama : refactor `src/llama.cpp` (#10902 ) * llama : scatter llama.cpp into multiple modules (wip) * llama : control-vector -> adapter * llama : arch * llama : mmap ggml-ci * ci : remove BUILD_SHARED_LIBS=OFF ggml-ci * llama : arch (cont) ggml-ci * llama : chat ggml-ci * llama : model ggml-ci * llama : hparams ggml-ci * llama : adapter ggml-ci * examples : fix ggml-ci * rebase ggml-ci * minor * llama : kv cache ggml-ci * llama : impl ggml-ci * llama : batch ggml-ci * cont ggml-ci * llama : context ggml-ci * minor * llama : context (cont) ggml-ci * llama : model loader ggml-ci * common : update lora ggml-ci * llama : quant ggml-ci * llama : quant (cont) ggml-ci * minor [no ci]	2025-01-03 10:18:53 +02:00
Pierrick Hymbert	2f0ee84b9b	server: bench: minor fixes (#10765 ) * server/bench: - support openAI streaming standard output with [DONE]\n\n - export k6 raw results in csv - fix too many tcp idle connection in tcp_wait - add metric time to emit first token * server/bench: - fix when prometheus not started - wait for server to be ready before starting bench	2025-01-02 18:06:12 +01:00
Xuan Son Nguyen	0da5d86026	server : allow using LoRA adapters per-request (#10994 ) * slot.can_batch_with * lora per request * test: force disable cache prompt * move can_batch_with check * fix condition * add slow test with llama 8b * update docs * move lora change task to queue * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * lora_base * remove redundant check --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-02 15:05:18 +01:00
Xuan Son Nguyen	45095a61bf	server : clean up built-in template detection (#11026 ) * server : clean up built-in template detection * fix compilation * add chat template test * fix condition	2024-12-31 15:22:01 +01:00
Xuan Son Nguyen	5896c65232	server : add OAI compat for /v1/completions (#10974 ) * server : add OAI compat for /v1/completions * add test * add docs * better docs	2024-12-31 12:34:13 +01:00
Peter	6e1531aca5	common, examples, ggml : fix MSYS2 GCC compiler errors and warnings when building with LLAMA_CURL=ON and GGML_OPENCL=ON (#11013 ) In common/common.cpp: * Convert usage of stat() function call to check if file exists to standard library function std::filesystem::exists (error unable to match to correct function signature) * Additional conditions to check if PATH_MAX is already defined in WIN32 environment (warning it is already defined in MSYS2) In examples/run/run.cpp: * Add io.h header inclusion (error cannot find function _get_osfhandle) * Change initialisers for OVERLAPPED to empty struct (warning about uninitialised members) * Add initialiser for hFile (warning it may be uninitialised) * Add cast for curl_off_t percentage value to long int in generate_progress_prefix function (warning that curl_off_t is long long int) In ggml/src/ggml-opencl/ggml-opencl.cpp: * Initialise certain declared cl_mem variables to nullptr for greater safety (warning about B_d variable possibly used unassigned)	2024-12-31 01:46:06 +01:00
ag2s20150909	c250ecb315	android : fix llama_batch free (#11014 )	2024-12-30 14:35:13 +02:00
Isaac McFadyen	f865ea149d	server: added more docs for response_fields field (#10995 )	2024-12-28 16:09:19 +01:00
Alexey Parfenov	16cdce7b68	server : fix token duplication when streaming with stop strings (#10997 )	2024-12-28 16:08:54 +01:00
Peter	d283d02bf2	examples, ggml : fix GCC compiler warnings (#10983 ) Warning types fixed (observed under MSYS2 GCC 14.2.0): * format '%ld' expects argument of type 'long int', but argument has type 'size_t' * llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)	2024-12-26 14:59:11 +01:00
Reza Kakhki	9ba399dfa7	server : add support for "encoding_format": "base64" to the /embeddings endpoints (#10967 ) add support for base64 * fix base64 test * improve test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-12-24 21:33:04 +01:00
Djip007	2cd43f4900	ggml : more perfo with llamafile tinyblas on x86_64 (#10714 ) * more perfo with llamafile tinyblas on x86_64. - add bf16 support - change dispatch strategy (thanks: https://github.com/ikawrakow/ik_llama.cpp/pull/71 ) - reduce memory bandwidth simple tinyblas dispatch and more cache friendly * tinyblas dynamic dispatching * sgemm: add M blocks. * - git 2.47 use short id of len 9. - show-progress is not part of GNU Wget2 * remove not stable test	2024-12-24 18:54:49 +01:00
NeverLucky	09fe2e7613	server: allow filtering llama server response fields (#10940 ) * llama_server_response_fields * llama_server_response_fields_fix_issues * params fixes * fix * clarify docs * change to "response_fields" --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-12-24 17:39:49 +01:00
Xuan Son Nguyen	14b699ecde	server : fix missing model id in /model endpoint (#10957 ) * server : fix missing model id in /model endpoint * fix ci	2024-12-23 12:52:25 +01:00
Xuan Son Nguyen	485dc01214	server : add system_fingerprint to chat/completion (#10917 ) * server : add system_fingerprint to chat/completion * update README	2024-12-23 12:02:44 +01:00
Radoslav Gerganov	86bf31cfe6	rpc-server : add support for the SYCL backend (#10934 )	2024-12-23 10:39:30 +02:00
Eric Curtin	dab76c92cc	llama-run : include temperature option (#10899 ) This commit updates the `examples/run/README.md` file to include a new option for setting the temperature and updates the `run.cpp` file to parse this option. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-12-23 01:21:40 +01:00
Xuan Son Nguyen	0ca416c91a	server : (UI) fix copy to clipboard function (#10916 )	2024-12-20 14:12:06 +01:00
Georgi Gerganov	d408bb9268	clip : disable GPU support (#10896 ) ggml-ci	2024-12-19 18:47:15 +02:00
Georgi Gerganov	5cab3e4aaa	llama : minor grammar refactor (#10897 ) ggml-ci	2024-12-19 17:42:13 +02:00
Xuan Son Nguyen	57bb2c40cd	server : fix logprobs, make it OAI-compatible (#10783 ) * server : fix logprobs, make it openai-compatible * update docs * add std::log * return pre-sampling p * sort before apply softmax * add comment * fix test * set p for sampled token * update docs * add --multi-token-probs * update docs * add `post_sampling_probs` option * update docs [no ci] * remove --multi-token-probs * "top_probs" with "post_sampling_probs" * resolve review comments * rename struct token_prob to prob_info * correct comment placement * fix setting prob for sampled token	2024-12-19 15:40:08 +01:00
Eric Curtin	7909e8588d	llama-run : improve progress bar (#10821 ) Set default width to whatever the terminal is. Also fixed a small bug around default n_gpu_layers value. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-12-19 03:58:00 +01:00
Diego Devesa	9177484f58	ggml : fix arm build (#10890 ) * ggml: GGML_NATIVE uses -mcpu=native on ARM Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml: Show detected features with GGML_NATIVE Signed-off-by: Adrien Gallouët <angt@huggingface.co> * remove msvc support, add GGML_CPU_ARM_ARCH option * disable llamafile in android example * march -> mcpu, skip adding feature macros ggml-ci --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> Co-authored-by: Adrien Gallouët <angt@huggingface.co>	2024-12-18 23:21:42 +01:00
Georgi Gerganov	0bf2d10c55	tts : add OuteTTS support (#10784 ) * server : add "tokens" output ggml-ci * server : output embeddings for all tokens when pooling = none ggml-ci * server : be explicit about the pooling type in the tests ggml-ci * server : do not normalize embeddings when there is no pooling ggml-ci * llama : add OuteTTS support (wip) * wip * extract features * first conv * group norm * resnet conv * resnet * attn * pos net * layer norm * convnext * head * hann window * fix n_embd + remove llama.cpp hacks * compute hann window * fft * spectrum processing * clean-up * tts : receive input text and generate codes * clip : fix new conv name * tts : minor fix * tts : add header + minor fixes ggml-ci * tts : add mathematical constant ggml-ci * tts : fix sampling + cut initial noise * tts : fixes * tts : update default samplers ggml-ci * tts : text pre-processing * tts : outetts-voc -> wavtokenizer-dec * tts : remove hardcoded constants ggml-ci * tts : fix tensor shapes * llama : refactor wavtokenizer tensors ggml-ci * cont ggml-ci * cont [no ci] * llama : update WavTokenizer to non-causal attn * llama : handle no-vocab detokenization * tts : add Python example for OuteTTS (wip) * tts : extend python example to generate spectrogram ggml-ci * server : fix rebase artifacts * tts : enable "return_tokens" in Python example ggml-ci * tts : minor fixes * common : support HF download for vocoder	2024-12-18 19:27:21 +02:00
Gaetan Bisson	7bbb5acf12	server: avoid overwriting Authorization header (#10878 ) * server: avoid overwriting Authorization header If no API key is set, leave the Authorization header as is. It may be used by another part of the Web stack, such as an authenticating proxy. Fixes https://github.com/ggerganov/llama.cpp/issues/10854 * rebuild --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-12-18 15:00:07 +01:00
Georgi Gerganov	152610eda9	server : output embeddings for all tokens when pooling = none (#10861 ) * server : add "tokens" output ggml-ci * server : output embeddings for all tokens when pooling = none ggml-ci * server : update readme [no ci] * server : fix spacing [no ci] Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * server : be explicit about the pooling type in the tests ggml-ci * server : update /embeddings and /v1/embeddings endpoints ggml-ci * server : do not normalize embeddings when there is no pooling ggml-ci * server : update readme ggml-ci * server : fixes * tests : update server tests ggml-ci * server : update readme [no ci] * server : remove rebase artifact --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-12-18 13:01:41 +02:00
Georgi Gerganov	0e70ba686e	server : add "tokens" output (#10853 ) * server : add "tokens" output ggml-ci * server : update readme ggml-ci * server : return tokens ids only if requested ggml-ci * tests : improve "tokens" type check Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * server : remove "tokens" from the OAI endpoint ggml-ci --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-12-18 11:05:29 +02:00
Xuan Son Nguyen	46828872c3	server : (embeddings) using same format for "input" and "content" (#10872 ) * server : (embeddings) using same format for "input" and "content" * fix test case * handle empty input case * fix test	2024-12-18 10:55:09 +02:00
krystiancha	05c3a444b8	server : fill usage info in embeddings and rerank responses (#10852 ) * server : fill usage info in embeddings response * server : fill usage info in reranking response	2024-12-17 18:00:24 +02:00
Xuan Son Nguyen	227d7c5a7f	server : (UI) fix missing async generator on safari (#10857 ) * server : (UI) fix missing async generator on safari * fix	2024-12-17 09:52:09 +01:00
Georgi Gerganov	644fd71b44	sampling : refactor + optimize penalties sampler (#10803 ) * sampling : refactor + optimize penalties sampler ggml-ci * common : apply ignore_eos as logit bias ggml-ci * batched : remove penalties sampler * params : allow penalty_last_n == -1 to be equal to context size ggml-ci * common : by default, move the penalties at the end of the sampling chain ggml-ci * common : ignore all EOG tokens Co-authored-by: Diego Devesa <slarengh@gmail.com> * common : move back the penalties at the front of the sampling chain ggml-ci * readme : restore hint about --ignore-eos flag [no ci] * llama : minor ggml-ci * webui : update --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-16 12:31:14 +02:00
Bartowski	4ddd199f6f	llava : Allow locally downloaded models for QwenVL (#10833 ) * Allow locally downloaded models for QwenVL * Define model_path * rm trailing space --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-12-15 21:43:25 +01:00
Vinesh Janarthanan	5478bbcd17	server: (UI) add syntax highlighting and latex math rendering (#10808 ) * add code highlighting and math formatting * code cleanup * build public/index.html * rebuild public/index.html * fixed coding style * fixed coding style * style fixes * highlight: smaller bundle size, fix light & dark theme * remove katex * add bundle size check * add more languages * add php * reuse some langs * use gzip * Revert "remove katex" This reverts commit `c0e5046acc`. * use better maintained @vscode/markdown-it-katex * fix gzip non deterministic * ability to add a demo conversation for dev * fix latex rendering * add comment * latex codeblock as code --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-12-15 12:55:54 +01:00
Michelle Tan	89d604f2c8	server: Fix `has_next_line` in JSON response (#10818 ) * Update server JSON response. * Add unit test to check `has_new_line` JSON response * Remove `has_new_line` unit test changes. * Address code review comment: type check for `has_new_line` in unit test	2024-12-14 23:29:45 +01:00
HimariO	ba1cb19cdd	llama : add Qwen2VL support + multimodal RoPE (#10361 ) * Barebone Qwen2VL LLM converter * Add Qwen2VL cli entrypoint * [WIP] add qwen2vl arch * Verify m-rope output * Add vl-rope/2d-rope support for qwen2vl ViT * update qwen2vl cli tool * update 5D tensor op workaround * [WIP] qwen2vl vision model * make batch and clip utils compatible with qwen2vl * [WIP] create inference workflow, gguf convert script but fix * correcting vision-rope behavior, add the missing last layer back to ViT * add arg parser to qwen2vl_surgery * replace variable size array with vector * cuda-gdb cmake preset * add fp32 mrope, vision rope kernel * add fp16 support for qwen2vl and m-rope * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION` * fix rope op mode switching, outdated func args * update `llama_hparams` * update to keep up stream changes * resolve linter, test errors * add makefile entry, update special image padding token * add mrope unit test, fix few compiler warnings * rename `mrope` related function, params * minor updates on debug util, bug fixes * add `m-rope` testcase to `test-backend-ops` * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix trailing whitespace * store `llama_hparams.rope_sections` with fixed size array * update position id tensor size check in GGML_OP_ROPE * minor updates * update `ggml_backend__supports_op` of unsupported backends remove old `rope_section` compare operator --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-14 14:43:46 +02:00
cduk	56eea0781c	Removes spurious \r in output that causes logging in journalctl to treat lines as binary and therefore hidden by default (#10771 ) Signed-off-by: Charles Darke <s.cduk@toodevious.com> Co-authored-by: Charles Darke <s.cduk@toodevious.com>	2024-12-13 23:21:49 +01:00
Eric Curtin	c27ac678dd	Opt class for positional argument handling (#10508 ) Added support for positional arguments `model` and `prompt`. Added functionality to download via strings like: llama-run llama3 llama-run ollama://granite-code llama-run ollama://granite-code:8b llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf llama-run https://example.com/some-file1.gguf llama-run some-file2.gguf llama-run file://some-file3.gguf Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-12-13 19:34:25 +01:00
Xuan Son Nguyen	adffa6ffd5	common : improve -ctv -ctk CLI arguments (#10806 ) * common : improve ctv ctk cli argument * regenerate docs * even better approach * use std::vector	2024-12-12 22:53:05 +01:00
Diego Devesa	cb13ef85a4	remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797 ) other windows build fixes	2024-12-12 19:02:49 +01:00
CentricStorm	5555c0c1f6	docs: update server streaming mode documentation (#9519 ) Provide more documentation for streaming mode.	2024-12-11 23:40:40 +01:00
Xuan Son Nguyen	235f6e14bf	server : (UI) add tok/s, get rid of completion.js (#10786 ) * get rid of completion.js * extract chat bubble to a component * add tok/s info * sync * fix BASE_URL * only extract timings when it's enabled * fix auto scroll	2024-12-11 20:52:14 +01:00
qingy1337	1a31d0dc00	Update README.md (#10772 )	2024-12-11 16:16:32 +01:00
kallewoof	484d2f31ae	bug-fix: snprintf prints NULL in place of the last character (#10419 ) * bug-fix: snprintf prints NULL in place of the last character We need to give snprintf enough space to print the last character and the null character, thus we allocate one extra byte and then ignore it when converting to std::string. * add comment about extra null-term byte requirement	2024-12-11 14:48:04 +01:00
CentricStorm	4b4d92b098	docs: fix server documentation formatting (#10776 )	2024-12-11 11:47:43 +01:00
Yüg	a86ad841f1	server : add flag to disable the web-ui (#10762 ) (#10751 ) Co-authored-by: eugenio.segala <esegala@deloitte.co.uk>	2024-12-10 18:22:34 +01:00
Xuan Son Nguyen	ce8784bdb1	server : fix format_infill (#10724 ) * server : fix format_infill * fix * rename * update test * use another model * update test * update test * test_invalid_input_extra_req	2024-12-08 23:04:29 +01:00
Xuan Son Nguyen	e52522b869	server : bring back info of final chunk in stream mode (#10722 ) * server : bring back info to final chunk in stream mode * clarify a bit * trailing space	2024-12-08 20:38:51 +01:00
Diego Devesa	43ed389a3f	llama : use cmake for swift build (#10525 ) * llama : use cmake for swift build * swift : <> -> "" * ci : remove make * ci : disable ios build * Revert "swift : <> -> """ This reverts commit `d39ffd9556`. * ci : try fix ios build * ci : cont * ci : cont --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-08 13:14:54 +02:00
Xuan Son Nguyen	3573fa8e7b	server : (refactor) no more json in server_task input (#10691 ) * server : (refactor) no more json in server_task input * add test for slots endpoint * add tests for /props and /slots * remove task inf_type * fix CI by adding safe_json_to_str * add "model_path" to /props * update readme	2024-12-07 20:21:09 +01:00
Georgi Gerganov	ce4a7b8493	server : various fixes (#10704 ) * server : various fixes ggml-ci * server : show current seed in slot_params ggml-ci * fix /slots endpoint * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server : reflect endpoint response changes in the readme ggml-ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-12-07 18:02:05 +02:00
Djip007	19d8762ab6	ggml : refactor online repacking (#10446 ) * rename ggml-cpu-aarch64.c to .cpp * reformat extra cpu backend. - clean Q4_0_N_M and IQ4_0_N_M - remove from "file" tensor type - allow only with dynamic repack - extract cpu extra bufts and convert to C++ - hbm - "aarch64" - more generic use of extra buffer - generalise extra_supports_op - new API for "cpu-accel": - amx - aarch64 * clang-format * Clean Q4_0_N_M ref Enable restrict on C++ * add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack * added/corrected control on tensor size for Q4 repacking. * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add debug logs on repacks. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-07 14:37:50 +02:00
Georgi Gerganov	c2a16c0bdb	server : fix free of spec context and batch (#10651 ) ggml-ci	2024-12-07 11:52:44 +02:00
Xuan Son Nguyen	6c5bc0625f	server : (refactoring) do not rely on JSON internally (#10643 ) * server : (refactoring) reduce usage of json internally * move all response types to struct * wip [no ci] * many fixes * add virtual function * fix index * minor style fix * add std::move * refactor handle_completions_generic * add virtual functions * remove server.hpp * clarify server_sent_event RFC specs * apply review comments * fix model_alias and completion_probabilities * small clean up * remove virtual for to_json_oai_compat() * naming oai_compat --> oaicompat * fix unwanted recursive call * update docs	2024-12-06 11:14:32 +01:00
Plamen Minev	7736837d62	fix(server) : not show alert when DONE is received (#10674 )	2024-12-05 22:36:41 +01:00
aryantandon01	f112d198cd	Update deprecation-warning.cpp (#10619 ) Fixed Path Separator Handling for Cross-Platform Support (Windows File Systems)	2024-12-04 23:19:20 +01:00
Georgi Gerganov	1da7b76569	server : fix speculative decoding with context shift (#10641 ) * server : fix speculative decoding with context shift ggml-ci * server : take into account speculative limits ggml-ci * server : add tests	2024-12-04 22:38:20 +02:00
piDack	01e6d9bb71	clip : add sycl support (#10574 ) Co-authored-by: piDack <pcdack@hotmail.co>	2024-12-04 01:26:37 +01:00
Xuan Son Nguyen	91c36c269b	server : (web ui) Various improvements, now use vite as bundler (#10599 ) * hide buttons in dropdown menu * use npm as deps manager and vite as bundler * fix build * fix build (2) * fix responsive on mobile * fix more problems on mobile * sync build * (test) add CI step for verifying build * fix ci * force rebuild .hpp files * cmake: clean up generated files pre build	2024-12-03 19:38:44 +01:00
Nikolaos Pothitos	82bca2257b	readme : add option, update default value, fix formatting (#10271 ) * readme : document --no-display-prompt * readme : update default prompt context size * readme : remove unnecessary indentation Indenting a line with four spaces makes Markdown treat that section as plain text. * readme : indent commands under bullets * readme : indent commands in lettered list	2024-12-03 12:50:08 +02:00
Georgi Gerganov	70b98fadbc	server : fix default draft model parameters (#10586 ) * server : force F16 KV cache for the draft model ggml-ci * server : fix draft params ggml-ci * server : various params fixes ggml-ci	2024-12-03 11:20:00 +02:00
Xuan Son Nguyen	642330ac7c	llama : add enum for built-in chat templates (#10623 ) * llama : add enum for supported chat templates * use "built-in" instead of "supported" * arg: print list of built-in templates * fix test * update server README	2024-12-02 22:10:19 +01:00
Georgi Gerganov	8648c52101	make : deprecate (#10514 ) * make : deprecate ggml-ci * ci : disable Makefile builds ggml-ci * docs : remove make references [no ci] * ci : disable swift build ggml-ci * docs : remove obsolete make references, scripts, examples ggml-ci * basic fix for compare-commits.sh * update build.md * more build.md updates * more build.md updates * more build.md updates * Update Makefile Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-12-02 21:22:53 +02:00
haopeng	64ed2091b2	server: Add "tokens per second" information in the backend (#10548 ) * add cmake rvv support * add timings * remove space * update readme * fix * fix code * remove empty line * add test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-12-02 14:45:54 +01:00
alek3y	86dc11c5bc	server : bind to any port when specified (#10590 )	2024-12-01 13:33:12 +02:00
Diego Devesa	7cc2d2c889	ggml : move AMX to the CPU backend (#10570 ) * ggml : move AMX to the CPU backend --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-29 21:54:58 +01:00
Xuan Son Nguyen	b782e5c7d4	server : add more test cases (#10569 ) * server : add split model test * add test speculative * add invalid cases	2024-11-29 21:48:56 +01:00
Robert Collins	3a8e9af402	imatrix : support combine-only (#10492 ) * imatrix-combine-only idea * ensured that behavior consistent with log	2024-11-29 19:21:37 +02:00
Ting Lou	678d7994f4	llava: return false instead of exit (#10546 )	2024-11-29 01:09:46 +01:00
Xuan Son Nguyen	6c59567689	server : (tests) don't use thread for capturing stdout/stderr, bump openai client library (#10568 ) * server : (tests) don't use thread for capturing stdout/stderr * test: bump openai to 1.55.2 * bump openai to 1.55.3	2024-11-28 19:17:49 +01:00
Random Fly	7281cf13ad	docs: fix outdated usage of llama-simple (#10565 )	2024-11-28 16:03:11 +01:00
Xuan Son Nguyen	9f912511bc	common : fix duplicated file name with hf_repo and hf_file (#10550 )	2024-11-27 22:30:52 +01:00
Diego Devesa	30ec398321	llama : disable warnings for 3rd party sha1 dependency (#10527 )	2024-11-26 21:01:47 +01:00
Xuan Son Nguyen	45abe0f74e	server : replace behave with pytest (#10416 ) * server : replace behave with pytest * fix test on windows * misc * add more tests * more tests * styling * log less, fix embd test * added all sequential tests * fix coding style * fix save slot test * add parallel completion test * fix parallel test * remove feature files * update test docs * no cache_prompt for some tests * add test_cache_vs_nocache_prompt	2024-11-26 16:20:18 +01:00
Georgi Gerganov	ab96610b1e	cmake : enable warnings in llama (#10474 ) * cmake : enable warnings in llama ggml-ci * cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS * cmake : get_flags -> ggml_get_flags * speculative-simple : fix warnings * cmake : reuse ggml_get_flags ggml-ci * speculative-simple : fix compile warning ggml-ci	2024-11-26 14:18:08 +02:00
Georgi Gerganov	84e1c33cde	server : fix parallel speculative decoding (#10513 ) ggml-ci	2024-11-26 13:36:40 +02:00
Georgi Gerganov	811872a59d	speculative : simplify the implementation (#10504 ) ggml-ci	2024-11-26 12:29:38 +02:00
Eric Curtin	0cc63754b8	Introduce llama-run (#10291 ) It's like simple-chat but it uses smart pointers to avoid manual memory cleanups. Less memory leaks in the code now. Avoid printing multiple dots. Split code into smaller functions. Uses no exception handling. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-11-25 22:56:24 +01:00
Georgi Gerganov	47f931c8f9	server : enable cache_prompt by default (#10501 ) ggml-ci	2024-11-25 21:50:07 +02:00
Diego Devesa	10bce0450f	llama : accept a list of devices to use to offload a model (#10497 ) * llama : accept a list of devices to use to offload a model * accept `--dev none` to completely disable offloading * fix dev list with dl backends * rename env parameter to LLAMA_ARG_DEVICE for consistency	2024-11-25 19:30:06 +01:00
brucepro	a9a678a6b2	Add download chat feature to server chat (#10481 ) * Add download chat feature to server chat Add a download feature next to the delete chat feature in the server vue chat interface. * code style --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-11-25 17:11:55 +01:00
Georgi Gerganov	9ca2e67762	server : add speculative decoding support (#10455 ) * server : add speculative decoding support ggml-ci * server : add helper function slot.can_speculate() ggml-ci	2024-11-25 16:31:38 +02:00
Diego Devesa	5931c1f233	ggml : add support for dynamic loading of backends (#10469 ) * ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-25 15:13:39 +01:00
Georgi Gerganov	d9d54e498d	speculative : refactor and add a simpler example (#10362 ) * speculative : refactor and add a simpler example ggml-ci * speculative : clean-up and add comments and TODOs [no ci] * speculative : manage context in common_speculative ggml-ci * speculative : simplify ggml-ci * speculative : simplify (cont) ggml-ci * speculative : add --draft-min CLI arg * speculative : minor fixup * make : build fixes * speculative : do not redraft previous drafts ggml-ci * speculative : fix the draft sampling ggml-ci * speculative : fix compile warning * common : refactor args ggml-ci * common : change defaults [no ci] * common : final touches ggml-ci	2024-11-25 09:58:41 +02:00
Diego Devesa	fab5d30ff6	llama : add .clang-format file (#10415 )	2024-11-20 12:57:53 +01:00
Johannes Gäßler	4e54be0ec6	llama/ex: remove --logdir argument (#10339 )	2024-11-16 23:00:41 +01:00
MaggotHATE	bcdb7a2386	server: (web UI) Add samplers sequence customization (#10255 ) * Samplers sequence: simplified and input field. * Removed unused function * Modify and use `settings-modal-short-input` * rename "name" --> "label" --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-11-16 14:26:54 +01:00
Xuan Son Nguyen	9901068ac7	server : (web UI) add copy button for code block, fix api key (#10242 ) * server : (web ui) add copy btn for code blocks * fix problem with api key * use settings-modal-short-input component * always show copy btn for code snippet	2024-11-15 10:48:49 +01:00
Diego Devesa	ae8de6d50a	ggml : build backends as libraries (#10256 ) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-14 18:04:35 +01:00
Georgi Gerganov	2a82891a85	speculative : fix out-of-bounds access (#10289 )	2024-11-14 11:44:15 +02:00
Alexey Parfenov	ff7fb670d0	server : add missing docs (#10269 )	2024-11-13 13:16:30 +02:00
Jhen-Jie Hong	0e712a5acb	server : fix incorrect res in validate_model_chat_template (#10272 ) * server : fix validate_model_chat_template * server : fix chat res	2024-11-13 13:15:23 +02:00
Brian	a0ec17b32e	metadata: Detailed Dataset Authorship Metadata (#8875 ) Converter script can now read these two fields as a detailed base model and dataset source. This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed. - base_model_sources (List[dict], optional) - dataset_sources (List[dict], optional) Dataset now represented as: - general.dataset.count - general.dataset.{id}.name - general.dataset.{id}.author - general.dataset.{id}.version - general.dataset.{id}.organization - general.dataset.{id}.description - general.dataset.{id}.url - general.dataset.{id}.doi - general.dataset.{id}.uuid - general.dataset.{id}.repo_url This also adds to base model these metadata: - general.base_model.{id}.description	2024-11-13 21:10:38 +11:00
Georgi Gerganov	b141e5f6ef	server : enable KV cache defrag by default (#10233 ) ggml-ci	2024-11-11 08:38:43 +02:00
MaggotHATE	505f33274d	server : (web UI) Add back sampler settings (#10239 ) * Add back samplers to server * Added tooltips with basic information * Fixed stretching of input fields. * use component for settings input, move help msg to tooltips --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-11-10 15:42:25 -04:00
haopeng	8fc393f246	scripts : fix pattern and get n_tokens in one go (#10221 )	2024-11-09 09:06:54 +02:00
Georgi Gerganov	841f27abdb	metal : optimize FA kernels (#10171 ) * ggml : add ggml_flash_attn_ext_get_prec * metal : use F16 precision in FA kernels ggml-ci * metal : minor clean-up * metal : compile-guard bf16 FA kernels ggml-ci * build : remove obsolete compile flag [no ci] * metal : prevent int overflows [no ci] * cuda : disable BF16 FA ggml-ci * metal : fix BF16 requirement for FA kernels ggml-ci * make : clean-up [no ci]	2024-11-08 13:47:22 +02:00
Xuan Son Nguyen	76c6e7f105	server : minor UI fix (#10207 )	2024-11-07 18:44:38 -04:00
Xuan Son Nguyen	a71d81cf8c	server : revamp chat UI with vuejs and daisyui (#10175 ) * server : simple chat UI with vuejs and daisyui * move old files to legacy folder * embed deps into binary * basic markdown support * add conversation history, save to localStorage * fix bg-base classes * save theme preferences * fix tests * regenerate, edit, copy buttons * small fixes * docs: how to use legacy ui * better error handling * make CORS preflight more explicit * add GET method for CORS * fix tests * clean up a bit * better auto scroll * small fixes * use collapse-arrow * fix closeAndSaveConfigDialog * small fix * remove console.log * fix style for <pre> element * lighter bubble color (less distract when reading)	2024-11-07 17:31:10 -04:00
Georgi Gerganov	b11f9ba9b8	server : remove hack for extra parallel slot (#10187 ) ggml-ci	2024-11-06 13:29:01 +02:00
Xuan Son Nguyen	9e0ecfb697	server : clarify /slots endpoint, add is_processing (#10162 ) * server : clarify /slots endpoint, add is_processing * fix tests	2024-11-04 16:33:29 +01:00
Diego Devesa	9f40989351	ggml : move CPU backend to a separate file (#10144 )	2024-11-03 19:34:08 +01:00
sasha0552	42cadc74bd	server : fix slot selection by lru (#10126 ) * server : fix slot selection by lru, migrate lcs to `size_t` * minor debug log fix	2024-11-02 18:34:56 +02:00
Georgi Gerganov	45950415ed	server : fix endpoint checks (#10135 ) ggml-ci	2024-11-02 18:34:00 +02:00
Diego Devesa	b634f8a26f	simple-chat : only add bos on first prompt (#10129 )	2024-11-02 13:08:53 +01:00
Diego Devesa	a6744e43e8	llama : add simple-chat example (#10124 ) * llama : add simple-chat example --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-11-01 23:50:59 +01:00
sasha0552	d865d1478c	server : fix smart selection of available slot (#10120 ) * Fix smart selection of available slot * minor fix * replace vectors of tokens with shorthands	2024-11-01 14:33:14 +01:00
Kevin Gibbons	0a683e8088	server : include scheme when printing URL (#10106 )	2024-10-31 14:02:35 +01:00
Rich Dougherty	6763f713bb	readme : more lora detail in main example readme (#10064 )	2024-10-30 13:22:39 +01:00
Diego Devesa	c5b0f4b5d9	llama : refactor model loader with backend registry (#10026 )	2024-10-30 02:01:23 +01:00
Georgi Gerganov	8d8ff71536	llama : remove Tail-Free sampling (#10071 ) ggml-ci	2024-10-29 10:42:05 +02:00
Georgi Gerganov	8125e6cbfc	server : don't overfill the batch during infill (#10018 ) ggml-ci	2024-10-28 08:49:32 +02:00
wwoodsTM	ff252ea48e	llama : add DRY sampler (#9702 ) * sampling : add DRY sampler (post-refactor) * DRY: Trying to fix coauthors, removed unneeded line * DRY: Fixed redundant code * DRY: Fixed crash issue due to DRY being in chain but uninitialized --------- Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com> Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>	2024-10-25 19:07:34 +03:00
Michael Podvitskiy	d80fb71f8b	llama: string_split fix (#10022 ) * llama: Refactor string_split to use template specialization, fixes parsing strings with spaces * llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string	2024-10-25 17:57:54 +02:00
Georgi Gerganov	bc5ba007b2	server : check that the prompt fits in the slot's context (#10030 ) ggml-ci	2024-10-25 10:13:46 +03:00
Xuan Son Nguyen	958367bf53	server : refactor slot input data, move tokenizer to HTTP thread (#10023 ) * server : refactor slot input data, move tokenizer to HTTP thread * move prompt_tokens.empty() check * fix incorrect if branch * fix infinite generation loop * bring back infill validation * add infill test * try fixing format_infill * fix test * remove redundant code * rename completion to inference * update docs * use llama_tokens everywhere	2024-10-24 21:51:22 +02:00
wwoodsTM	0a1c750c80	server : samplers accept the prompt correctly (#10019 )	2024-10-23 22:27:51 +03:00
Georgi Gerganov	2d3aba9ee8	llama.vim : bump generation time limit to 3s [no ci]	2024-10-23 17:16:56 +03:00
Michael Coppola	ac113a0fee	llama.vim : add classic vim support (#9995 ) * added classic vim support * fixed ring update, removed blank line * minor * minor * minor doc update * removed uneeded var * minor * minor * fixed job_start creating new scratch buffers * fixed job_start creating new scratch buffers * fixed ghost text indenting when expandtab is on * removed unused code * minor * unified fim_on_exit * minor * vim ghost text rendering now uses pos_x and pos_y parameters * renamed _hlgroup to hlgroup_ * renamed _ghost_text to ghost_text_, moved nvim/vim detection to llama#init() * minor --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2024-10-23 14:09:26 +03:00
Georgi Gerganov	e94a138d64	llama.vim : fix info text display [no ci] (#9787 )	2024-10-22 00:37:55 +03:00
Georgi Gerganov	e01c67affe	llama.vim : move info to the right of screen [no ci] (#9787 ) 'eol' messes up the rendering with nvim v0.10.2 for some reason	2024-10-21 22:53:18 +03:00
Georgi Gerganov	dbd5f2f573	llama.vim : plugin for Neovim (#9787 )	2024-10-21 20:25:02 +03:00
Georgi Gerganov	55e47786e3	llama : default sampling changes + greedy update (#9897 ) * llama : deprecate softmax sampler + fix dist sampler ggml-ci * tests : replace macros with functions ggml-ci * sampling : change temperature sampler logic For t <= 0.0f, keep the max logit intact and set the rest to -inf * cont : no need for special "greedy" logic top-k == 1 is the same * tests : init prob correctly * llama : handle temp <= 0.0 in the temp_ext sampler too ggml-ci * cont : avoid extra loop in temperature sampler for sub-zero temp ggml-ci	2024-10-21 09:46:40 +03:00
Georgi Gerganov	bc21975084	speculative : fix handling of some input params (#9963 ) * speculative : fix batch sizes at initialization ggml-ci * speculative : handle params.n_predict == -1 * speculative : limit batch size to llama_n_batch	2024-10-21 09:37:12 +03:00
Xuan Son Nguyen	cda0e4b648	llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745 ) * refactor llama_batch_get_one * adapt all examples * fix simple.cpp * fix llama_bench * fix * fix context shifting * free batch before return * use common_batch_add, reuse llama_batch in loop * null terminated seq_id list * fix save-load-state example * fix perplexity * correct token pos in llama_batch_allocr	2024-10-18 23:18:01 +02:00
Ouadie EL FAROUKI	87421a23e8	[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705 ) * implemented missing SYCL event APIs * sycl : Added device and backend reg interfaces * Restructured ggml-sycl.cpp	2024-10-18 06:46:16 +01:00
Georgi Gerganov	8901755ba3	server : add n_indent parameter for line indentation requirement (#9929 ) ggml-ci	2024-10-18 07:32:19 +03:00
Georgi Gerganov	17bb928080	readme : remove --memory-f32 references (#9925 )	2024-10-17 23:43:05 +03:00

1372 Commits