llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	c2cd24fbfd	readme : add notice about new package registry (#11890 ) * readme : add notice about new package registry * cont : fix whitespace	2025-02-15 20:29:56 +02:00
Georgi Gerganov	68ff663a04	repo : update links to new url (#11886 ) * repo : update links to new url ggml-ci * cont : more urls ggml-ci	2025-02-15 16:40:57 +02:00
Georgi Gerganov	04045bb842	readme : minor	2025-02-14 00:16:56 +02:00
Daniel Bevenius	c48f630d1c	llama : add --completion-bash option (#11846 ) This commit adds a new option `--completion-bash` to the llama.cpp which outputs a source-able bash completion script. The motivation for this change is to provide a more user-friendly experience for users who use the command-line interface of llama.cpp. This is currently only basic and all options are displayed for all llama executables but this can be improved in the future if needed. Example usage: ```console $ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash $ source ~/.llama-completion.bash $ ./build/bin/llama-server --m<TAB> --main-gpu --mirostat --mirostat-lr --model --multiline-input --min-p --mirostat-ent --mlock --model-url ```	2025-02-13 14:46:59 +01:00
lhez	4078c77f98	docs: add OpenCL (#11697 )	2025-02-11 15:04:13 -07:00
Matvey Soloviev	c3db0480bb	readme : add link to Autopen under UIs (#11684 ) Autopen (https://github.com/blackhole89/autopen) is a graphical text editor that uses llama.cpp to tokenize the buffer on the fly, score the buffer, visualise token logits and allow you to switch back and forth between different possible completions at any point. It hopefully meets the criteria for inclusion, as the dependency on llama.cpp is stated prominently.	2025-02-06 01:55:25 +01:00
Shelby Jenkins	106045e7bb	readme : add llm_client Rust crate to readme bindings (#11628 ) [This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite awhile, so I figured now is fair to add it. It installs from crates.io, and automatically downloads the llama.cpp repo and builds it for the target platform - with the goal being the easiest user experience possible. It also integrates model presets and choosing the largest quant given the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download from hugging face. So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision making and control flow tasks.	2025-02-04 13:20:55 +02:00
piDack	0cec062a63	llama : add support for GLM-Edge and GLM-Edge-V series models (#10573 ) * add glm edge chat model * use config partial_rotary_factor as rope ratio * support for glm edge model * vision model support * remove debug info * fix format * llava.cpp trailing whitespace * remove unused AutoTokenizer * Update src/llama.cpp for not contain <\|end\|> or </s> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * add edge template * fix chat template * fix confict * fix confict * fix ci err * fix format err * fix template err * 9b hf chat support * format * format clip.cpp * fix format * Apply suggestions from code review * Apply suggestions from code review * Update examples/llava/clip.cpp * fix format * minor : style --------- Co-authored-by: liyuhang <yuhang.li@zhipuai.cn> Co-authored-by: piDack <pcdack@hotmail.co> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: liyuhang <yuhang.li@aminer.cn> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-02 09:48:46 +02:00
Olivier Chafik	8b576b6c55	Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars (#9639 ) --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-01-30 19:13:58 +00:00
Guspan Tanadi	7919256c57	readme : reference examples relative links (#11505 )	2025-01-30 06:58:02 +01:00
Georgi Gerganov	2cc9b8c32c	readme : update hot topics	2025-01-26 14:30:15 +02:00
Georgi Gerganov	16d3df7ab0	readme : add plugin links (#11355 )	2025-01-22 19:44:26 +02:00
musoles	7a689c415e	README : added kalavai to infrastructure list (#11216 )	2025-01-17 01:10:49 +01:00
Xuan Son Nguyen	84a44815f7	cli : auto activate conversation mode if chat template is available (#11214 ) * cli : auto activate conversation mode if chat template is detected * add warn on bad template * update readme (writing with the help of chatgpt) * update readme (2) * do not activate -cnv for non-instruct models	2025-01-13 20:18:12 +01:00
Molly Sophia	ee7136c6d1	llama: add support for QRWKV6 model architecture (#11001 ) llama: add support for QRWKV6 model architecture (#11001) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-10 09:58:08 +08:00
Pierrick Hymbert	f8feb4b01a	model: Add support for PhiMoE arch (#11003 ) * model: support phimoe * python linter * doc: minor Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com> * doc: minor Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com> * doc: add phimoe as supported model ggml-ci --------- Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>	2025-01-09 11:21:41 +01:00
Benson Wong	a45433ba20	readme : add llama-swap to infrastructure section (#11032 ) * list llama-swap under tools in README * readme: add llama-swap to Infrastructure	2025-01-02 09:14:54 +02:00
Eric Curtin	7909e8588d	llama-run : improve progress bar (#10821 ) Set default width to whatever the terminal is. Also fixed a small bug around default n_gpu_layers value. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-12-19 03:58:00 +01:00
redbeard	6b064c92b4	docs: Fix HIP (née hipBLAS) in README (#10880 ) Related to #10524 / `be0e350c` references to hipBLAS have been removed across the repository. This fixes the link from the repositories `README.md`. Signed-off-by: Brian 'redbeard' Harrington <redbeard@dead-city.org>	2024-12-18 10:35:00 +02:00
Ruan	4f51968aca	readme : update typos (#10863 )	2024-12-17 11:47:20 +02:00
Valentin Mamedov	a0974156f3	llama : add Deepseek MoE v1 & GigaChat models (#10827 ) * Add deepseek v1 arch & gigachat template * improve template code * add readme * delete comments * remove comment * fix format * lint llama.cpp * fix order of deepseek and deepseek2, move gigachat temlate to the end of func * fix order of deepseek and deepseek2 in constants; mark shared exp as deepseek arch need * remove comments * move deepseek above deepseek2 * change placement of gigachat chat template	2024-12-15 19:02:46 +02:00
HimariO	ba1cb19cdd	llama : add Qwen2VL support + multimodal RoPE (#10361 ) * Barebone Qwen2VL LLM convertor * Add Qwen2VL cli entrypoint * [WIP] add qwen2vl arch * Verify m-rope output * Add vl-rope/2d-rope support for qwen2vl ViT * update qwen2vl cli tool * update 5D tensor op workaround * [WIP] qwen2vl vision model * make batch and clip utils compatible with qwen2vl * [WIP] create inference workflow, gguf convert script but fix * correcting vision-rope behavior, add the missing last layer back to ViT * add arg parser to qwen2vl_surgery * replace variable size array with vector * cuda-gdb cmake preset * add fp32 mrope, vision rope kernel * add fp16 support for qwen2vl and m-rope * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION` * fix rope op mode switching, out dated func args * update `llama_hparams` * update to keep up stream changes * resolve linter, test errors * add makefile entry, update speical image padding token * add mrope unit test, fix few compiler warnings * rename `mrope` related function, params * minor updates on debug util, bug fixs * add `m-rope` testcase to `test-backend-ops` * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix traililng whitespce * store `llama_hparams.rope_sections` with fixed size array * update position id tensor size check in GGML_OP_ROPE * minor updates * update `ggml_backend__supports_op` of unsupported backends remote old `rope_section` compare operator --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-14 14:43:46 +02:00
Eric Curtin	c27ac678dd	Opt class for positional argument handling (#10508 ) Added support for positional arguments `model` and `prompt`. Added functionality to download via strings like: llama-run llama3 llama-run ollama://granite-code llama-run ollama://granite-code:8b llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf llama-run https://example.com/some-file1.gguf llama-run some-file2.gguf llama-run file://some-file3.gguf Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-12-13 19:34:25 +01:00
Georgi Gerganov	6acce39710	readme : update the usage section with examples (#10596 ) * readme : update the usage section with examples * readme : more examples	2024-12-01 11:25:17 +02:00
Georgi Gerganov	3e0ba0e604	readme : remove old badge	2024-11-30 10:09:21 +02:00
Georgi Gerganov	abadba05be	readme : refresh (#10587 ) * readme : refresh * readme : move section [no ci] * readme : clarify [no ci] * readme : fixes [no ci] * readme : more fixes [no ci] * readme : simplify [no ci] * readme : clarify GGUF	2024-11-30 09:47:07 +02:00
Diego Devesa	a3a3048e7a	cleanup UI link list (#10577 ) * cleanup UI link list * sort list alphabetically * add missing licenses	2024-11-29 17:45:08 +01:00
Shane A	de5097351c	Add OLMo 2 model in docs (#10530 ) * Add link to OLMo 2 model in docs * Change link to landing page	2024-11-26 21:55:29 +01:00
Johannes Gäßler	467576b6cc	CMake: default to -arch=native for CUDA build (#10320 )	2024-11-17 09:06:34 +01:00
Small Grass Forest	1ee9eea094	docs : update bindings list (#10261 ) Signed-off-by: tianzixuan <tianzixuan335@hellobike.com>	2024-11-13 13:17:10 +02:00
Georgi Gerganov	ba6f62eb79	readme : update hot topics	2024-11-01 17:31:51 +02:00
Molly Sophia	4ff7fe1fb3	llama : add chat template for RWKV-World + fix EOT (#9968 ) * Add chat template for RWKV-World Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Fix the chat template not being used Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV v6: Set EOT token to ``\n\n`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * readme: add rwkv into supported model list Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-10-22 13:33:37 +03:00
Asghar Ghorbani	994cfb1acb	readme : update UI list (#9972 ) add PocketPal AI app	2024-10-21 21:20:59 +03:00
Loïc Carrère	45f097645e	readme : update bindings list (#9951 ) Update the binding list by adding LM-Kit.NET (C# & VB.NET)	2024-10-20 19:25:41 +03:00
icppWorld	7cab2083c7	readme : update infra list (#9942 ) llama_cpp_canister allows you to run llama.cpp as a Smart Contract on the Internet Computer. The smart contract runs as WebAssembly in a so-called 'canister'.	2024-10-20 19:01:34 +03:00
Ma Mingfei	60ce97c9d8	add amx kernel for gemm (#8998 ) add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend	2024-10-18 13:34:36 +08:00
Tim Wang	3752217ed5	readme : update bindings list (#9918 ) Co-authored-by: Tim Wang <tim.wang@ing.com>	2024-10-17 09:57:14 +03:00
Michał Tuszyński	4c42f93b22	readme : update bindings list (#9889 )	2024-10-15 11:20:34 +03:00
R0CKSTAR	943d20b411	musa : update doc (#9856 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-10-12 08:09:53 +03:00
Viet-Anh NGUYEN (Andrew)	71967c2a6d	Add Llama Assistant (#9744 )	2024-10-04 20:29:35 +02:00
Paweł Wodnicki	3f1ae2e32c	Update README.md (#9591 ) Add Bielik model.	2024-10-01 19:18:46 +02:00
Georgi Gerganov	589b48d41e	contrib : add Resources section (#9675 )	2024-09-29 14:38:18 +03:00
Aarni Koskela	43bcdd9703	readme : add tool (#9655 )	2024-09-28 15:07:14 +03:00
Georgi Gerganov	b5de3b74a5	readme : update hot topics	2024-09-27 20:57:51 +03:00
Riceball LEE	1d48e98e4f	readme : add programmable prompt engine language CLI (#9599 )	2024-09-23 18:58:17 +03:00
Shane A	0aadac10c7	llama : support OLMoE (#9462 )	2024-09-16 09:47:37 +03:00
OSecret	d6b37c881f	readme : update tools list (#9475 ) * Added link to proprietary wrapper for Unity3d into README.md Wrapper has prebuild library and was tested on iOS, Android, WebGL, PC, Mac platforms, has online demos like [this](https://d23myu0xfn2ttc.cloudfront.net/rich/index.html) and [that](https://d23myu0xfn2ttc.cloudfront.net/). * Update README.md Fixes upon review	2024-09-15 10:36:53 +03:00
Faisal Zaghloul	449ccfb6f5	Add Jais to list of supported models (#9439 ) Co-authored-by: fmz <quic_fzaghlou@quic.com>	2024-09-12 02:29:53 +02:00
Georgi Gerganov	38ca6f644b	readme : update hot topics	2024-09-09 15:51:37 +03:00
Antonis Makropoulos	5ed087573e	readme : add LLMUnity to UI projects (#9381 ) * add LLMUnity to UI projects * add newline to examples/rpc/README.md to fix editorconfig-checker unit test	2024-09-09 14:21:38 +03:00
Georgi Gerganov	b69a480af4	readme : refactor API section + remove old hot topics	2024-09-03 10:00:36 +03:00
Younes Belkada	b40eb84895	llama : support for `falcon-mamba` architecture (#9074 ) * feat: initial support for llama.cpp * fix: lint * refactor: better refactor * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * fix: address comments * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * fix: add more cleanup and harmonization * fix: lint * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * fix: change name * Apply suggestions from code review Co-authored-by: compilade <git@compilade.net> * add in operator * fix: add `dt_b_c_rms` in `llm_load_print_meta` * fix: correct printf format for bool * fix: correct print format * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * llama : quantize more Mamba tensors * llama : use f16 as the fallback of fallback quant types --------- Co-authored-by: compilade <git@compilade.net>	2024-08-21 11:06:36 +03:00
wangshuai09	cfac111e2b	cann: add doc for cann backend (#8867 ) Co-authored-by: xuedinge233 <damow890@gmail.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2024-08-19 16:46:38 +08:00
Minsoo Cheong	c679e0cb5c	llama : add EXAONE model support (#9025 ) * add exaone model support * add chat template * fix whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add ftype * add exaone pre-tokenizer in `llama-vocab.cpp` Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com> * fix lint Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com> * add `EXAONE` to supported models in `README.md` * fix space Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <113953597+compilade@users.noreply.github.com> Co-authored-by: compilade <git@compilade.net>	2024-08-16 09:35:18 +03:00
Frank Mai	84eb2f4fad	docs: introduce gpustack and gguf-parser (#8873 ) * readme: introduce gpustack GPUStack is an open-source GPU cluster manager for running large language models, which uses llama.cpp as the backend. Signed-off-by: thxCode <thxcode0824@gmail.com> * readme: introduce gguf-parser GGUF Parser is a tool to review/check the GGUF file and estimate the memory usage without downloading the whole model. Signed-off-by: thxCode <thxcode0824@gmail.com> --------- Signed-off-by: thxCode <thxcode0824@gmail.com>	2024-08-12 14:45:50 +02:00
Eric Curtin	b42978e7e4	readme : add ramalama to the availables UI (#8811 ) ramalama is a repo agnostic boring CLI tool that supports pulling from ollama, huggingface and oci registries. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-08-05 15:45:01 +03:00
BarfingLemurs	400ae6f65f	readme : update model list (#8851 )	2024-08-05 08:54:10 +03:00
R0CKSTAR	e54c35e4fb	feat: Support Moore Threads GPU (#8383 ) * Update doc for MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in Makefile Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in CMake Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * CUDA => MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * MUSA adds support for __vsubss4 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix CI build failure Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-07-28 01:41:25 +02:00
MorganRO8	68504f0970	readme : update games list (#8673 ) Added link to game I made that depends on llama	2024-07-24 19:48:00 +03:00
Thorsten Sommer	3a7ac5300a	readme : update UI list [no ci] (#8505 )	2024-07-24 15:52:30 +03:00
Georgi Gerganov	be0cfb4175	readme : fix server badge	2024-07-19 14:34:55 +03:00
Andy Salerno	fd560fe680	Update README.md to fix broken link to docs (#8399 ) Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'	2024-07-09 14:58:44 -04:00
b4b4o	c4dd11d1d3	readme : fix web link error [no ci] (#8347 )	2024-07-08 17:19:24 +03:00
toyer	04ce3a8b19	readme : add supported glm models (#8360 )	2024-07-08 08:57:19 +03:00
Andy Tai	f1948f1e10	readme : update bindings list (#8222 ) * adding guile_llama_cpp to binding list * fix formatting * fix formatting	2024-07-07 16:21:37 +03:00
Xuan Son Nguyen	60d83a0149	update main readme (#8333 )	2024-07-06 19:01:23 +02:00
Xuan Son Nguyen	be20e7f49d	Reorganize documentation pages (#8325 ) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections	2024-07-05 18:08:32 +02:00
Georgi Gerganov	6c05752c50	contributing : update guidelines (#8316 )	2024-07-05 09:09:47 +03:00
Georgi Gerganov	e235b267a2	py : switch to snake_case (#8305 ) * py : switch to snake_case ggml-ci * cont ggml-ci * cont ggml-ci * cont : fix link * gguf-py : use snake_case in scripts entrypoint export * py : rename requirements for convert_legacy_llama.py Needed for scripts/check-requirements.sh --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-07-05 07:53:33 +03:00
Mateusz Charytoniuk	dae57a1ebc	readme: add Paddler to the list of projects (#8239 )	2024-07-01 20:13:22 +03:00
Xuan Son Nguyen	49122a873f	gemma2: add sliding window mask (#8227 ) * gemma2: add sliding window mask * fix data_swa uninitialized * better naming * add co-author Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com> * replace list with single tensor * update * llama : minor styling * convert : add sanity check for query_pre_attn_scalar * fix small typo in README --------- Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-01 18:48:34 +02:00
Roni	0ddeff1023	readme : update tool list (#8209 ) * Added gppm to Tool list in README * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-01 15:48:16 +03:00
iacore	694c59cb42	Document BERT support. (#8205 ) * Update README.md document BERT support * Update README.md	2024-07-01 13:40:58 +02:00
Georgi Gerganov	a95631ee97	readme : update API notes	2024-06-26 19:26:13 +03:00
Georgi Gerganov	f3f65429c4	llama : reorganize source code + improve CMake (#8006 ) * scripts : update sync [no ci] * files : relocate [no ci] * ci : disable kompute build [no ci] * cmake : fixes [no ci] * server : fix mingw build ggml-ci * cmake : minor [no ci] * cmake : link math library [no ci] * cmake : build normal ggml library (not object library) [no ci] * cmake : fix kompute build ggml-ci * make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE ggml-ci * move public backend headers to the public include directory (#8122) * move public backend headers to the public include directory * nix test * spm : fix metal header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * scripts : fix sync paths [no ci] * scripts : sync ggml-blas.h [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-26 18:33:02 +03:00
Isaac McFadyen	8854044561	Clarify default MMQ for CUDA and LLAMA_CUDA_FORCE_MMQ flag (#8115 ) * Add message about int8 support * Add suggestions from review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-06-26 08:29:28 +02:00
Johannes Gäßler	a818f3028d	CUDA: use MMQ instead of cuBLAS by default (#8075 )	2024-06-24 17:43:42 +02:00
Abheek Gulati	1193778105	readme : update UI list (#7943 )	2024-06-18 09:57:41 +03:00
Bryan Honof	b473e95084	Add Nix and Flox install instructions (#7899 )	2024-06-17 09:37:55 -06:00
hopkins385	6fe1c62741	readme : update UI list [no ci] (#7958 )	2024-06-16 14:51:18 +03:00
Galunid	a55eb1bf0f	readme : Remove outdated instructions from README.md (#7914 ) [no ci]	2024-06-13 09:42:41 +02:00
Olivier Chafik	1c641e6aac	`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809 ) * `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew * server: update refs -> llama-server gitignore llama-server * server: simplify nix package * main: update refs -> llama fix examples/main ref * main/server: fix targets * update more names * Update build.yml * rm accidentally checked in bins * update straggling refs * Update .gitignore * Update server-llm.sh * main: target name -> llama-cli * Prefix all example bins w/ llama- * fix main refs * rename {main->llama}-cmake-pkg binary * prefix more cmake targets w/ llama- * add/fix gbnf-validator subfolder to cmake * sort cmake example subdirs * rm bin files * fix llama-lookup-* Makefile rules * gitignore /llama-* * rename Dockerfiles * rename llama\|main -> llama-cli; consistent RPM bin prefixes * fix some missing -cli suffixes * rename dockerfile w/ llama-cli * rename(make): llama-baby-llama * update dockerfile refs * more llama-cli(.exe) * fix test-eval-callback * rename: llama-cli-cmake-pkg(.exe) * address gbnf-validator unused fread warning (switched to C++ / ifstream) * add two missing llama- prefixes * Updating docs for eval-callback binary to use new `llama-` prefix. * Updating a few lingering doc references for rename of main to llama-cli * Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename. * Updating documentation references for lookup-merge and export-lora * Updating two small `main` references missed earlier in the finetune docs. * Update apps.nix * update grammar/README.md w/ new llama-* names * update llama-rpc-server bin name + doc * Revert "update llama-rpc-server bin name + doc" This reverts commit `e474ef1df4`. * add hot topic notice to README.md * Update README.md * Update README.md * rename gguf-split & quantize bins refs in **/tests.sh --------- Co-authored-by: HanClinto <hanclinto@gmail.com>	2024-06-13 00:41:52 +01:00
Patrice Ferlet	f2b5764beb	Fix a typo and add Fedora 40 pacakge to install for Vulkan (#7794 ) [no ci] Fix "appropiate" to "appropriate" and add Fedora 40 packages to install to compile with Vulkan support	2024-06-12 11:18:16 +10:00
Georgi Gerganov	c28a83902c	examples : remove --instruct remnants (#7846 )	2024-06-10 15:00:15 +03:00
Mattheus Chediak	a143c04375	README minor fixes (#7798 ) [no ci] derievatives --> derivatives	2024-06-06 22:17:54 +10:00
Georgi Gerganov	554c247caf	ggml : remove OpenCL (#7735 ) ggml-ci	2024-06-04 21:23:20 +03:00
Georgi Gerganov	5ca0944a15	readme : remove obsolete Zig instructions (#7471 )	2024-06-04 19:43:01 +03:00
HanishKVC	2ac95c9d56	SimpleChat: Simple histogram/repeatMatching driven garbageTrimming, Settings UI, Streaming mode, OpenAi Compat (Model, Authorization Bearer), Save/Restore session, Auto Settings UI (#7548 ) * SimpleChat:DU:BringIn local helper js modules using importmap Use it to bring in a simple trim garbage at end logic, which is used to trim received response. Also given that importmap assumes esm / standard js modules, so also global variables arent implicitly available outside the modules. So add it has a member of document for now * SimpleChat:DU: Add trim garbage at end in loop helper * SimpleChat:DU:TrimGarbage if unable try skip char and retry * SimpleChat:DU: Try trim using histogram based info TODO: May have to add max number of uniq chars in histogram at end of learning phase. * SimpleChat:DU: Switch trim garbage hist based to maxUniq simple Instead of blindly building histogram for specified substring length, and then checking if any new char within specified min garbage length limit, NOW exit learn state when specified maxUniq chars are found. Inturn there should be no new chars with in the specified min garbage length required limit. TODO: Need to track char classes like alphabets, numerals and special/other chars. * SimpleChat:DU: Bring in maxType to the mix along with maxUniq Allow for more uniq chars, but then ensure that a given type of char ie numerals or alphabets or other types dont cross the specified maxType limit. This allows intermixed text garbage to be identified and trimmed. * SimpleChat:DU: Cleanup debug log messages * SimpleChat:UI: Move html ui base helpers into its own module * SimpleChat:DU:Avoid setting frequence/Presence penalty Some models like llama3 found to try to be over intelligent by repeating garbage still, but by tweaking the garbage a bit so that it is not exactly same. So avoid setting these penalties and let the model's default behaviour work out, as is. Also the simple minded histogram based garbage trimming from end, works to an extent, when the garbage is more predictable and repeatative. * SimpleChat:UI: Add and use a para-create-append helper Also update the config params dump to indicate that now one needs to use document to get hold of gMe global object, this is bcas of moving to module type js. Also add ui.mjs to importmap * SimpleChat:UI: Helper to create bool button and use it wrt settings * SimpleChat:UI: Add Select helper and use it wrt ChatHistoryInCtxt * SimpleChat:UI:Select: dict-name-value, value wrt default, change Take a dict/object of name-value pairs instead of just names. Inturn specify the actual value wrt default, rather than the string representing that value. Trap the needed change event rather than click wrt select. * SimpleChat:UI: Add Div wrapped label+element helpers Move settings related elements to use the new div wrapped ones. * SimpleChat:UI:Add settings button and bring in settings ui * SimpleChat:UI:Settings make boolean button text show meaning * SimpleChat: Update a bit wrt readme and notes in du * SimpleChat: GarbageTrim enable/disable, show trimmed part ifany * SimpleChat: highlight trim, garbage trimming bitmore aggressive Make it easy for end user to identified the trimmed text. Make garbage trimming logic, consider a longer repeat garbage substring. * SimpleChat: Cleanup a bit wrt Api end point related flow Consolidate many of the Api end point related basic meta data into ApiEP class. Remove the hardcoded ApiEP/Mode settings from html+js, instead use the generic select helper logic, inturn in the settings block. Move helper to generate the appropriate request json string based on ApiEP into SimpleChat class itself. * SimpleChat:Move extracting assistant response to SimpleChat class so also the trimming of garbage. * SimpleChat:DU: Bring in both trim garbage logics to try trim * SimpleChat: Cleanup readme a bit, add one more chathistory length * SimpleChat:Stream:Initial handshake skeleton Parse the got stream responses and try extract the data from it. It allows for a part read to get a single data line or multiple data line. Inturn extract the json body and inturn the delta content/message in it. * SimpleChat: Move handling oneshot mode server response Move handling of the oneshot mode server response into SimpleChat. Also add plumbing for moving multipart server response into same. * SimpleChat: Move multi part server response handling in * SimpleChat: Add MultiPart Response handling, common trimming Add logic to call into multipart/stream server response handling. Move trimming of garbage at the end into the common handle_response helper. Add new global flag to control between oneshot and multipart/stream mode of fetching response. Allow same to be controlled by user. If in multipart/stream mode, send the stream flag to the server. * SimpleChat: show streamed generative text as it becomes available Now that the extracting of streamed generated text is implemented, add logic to show the same on the screen. * SimpleChat:DU: Add NewLines helper class To work with an array of new lines. Allow adding, appending, shifting, ... * SimpleChat:DU: Make NewLines shift more robust and flexible * SimpleChat:HandleResponseMultiPart using NewLines helper Make handle_response_multipart logic better and cleaner. Now it allows for working with the situation, where the delta data line got from server in stream mode, could be split up when recving, but still the logic will handle it appropriately. ALERT: Rather except (for now) for last data line wrt a request's response. * SimpleChat: Disable console debug by default by making it dummy Parallely save a reference to the original func. * SimpleChat:MultiPart/Stream flow cleanup Dont try utf8-decode and newlines-add_append if no data to work on. If there is no more data to get (ie done is set), then let NewLines instance return line without newline at end, So that we dont miss out on any last-data-line without newline kind of scenario. Pass stream flag wrt utf-8 decode, so that if any multi-byte char is only partly present in the passed buffer, it can be accounted for along with subsequent buffer. At sametime, bcas of utf-8's characteristics there shouldnt be any unaccounted bytes at end, for valid block of utf8 data split across chunks, so not bothering calling with stream set to false at end. LATER: Look at TextDecoder's implementation, for any over intelligence, it may be doing.. If needed, one can use done flag to account wrt both cases. * SimpleChat: Move baseUrl to Me and inturn gMe This should allow easy updating of the base url at runtime by the end user. * SimpleChat:UI: Add input element helper * SimpleChat: Add support for changing the base url This ensures that if the user is running the server with a different port or wants to try connect to server on a different machine, then this can be used. * SimpleChat: Move request headers into Me and gMe Inturn allow Authorization to be sent, if not empty. * SimpleChat: Rather need to use append to insert headers * SimpleChat: Allow Authorization header to be set by end user * SimpleChat:UI+: Return div and element wrt creatediv helpers use it to set placeholder wrt Authorization header. Also fix copy-paste oversight. * SimpleChat: readme wrt authorization, maybe minimal openai testing * SimpleChat: model request field for openai/equivalent compat May help testing with openai/equivalent web services, if they require this field. * SimpleChat: readme stream-utf-8 trim-english deps, exception2error * Readme: Add a entry for simplechat in the http server section * SimpleChat:WIP:Collate internally, Stream mode Trap exceptions This can help ensure that data fetched till that point, can be made use of, rather than losing it. On some platforms, the time taken wrt generating a long response, may lead to the network connection being broken when it enters some user-no-interaction related power saving mode. * SimpleChat:theResp-origMsg: Undo a prev change to fix non trim When the response handling was moved into SimpleChat, I had changed a flow bit unnecessarily and carelessly, which resulted in the non trim flow, missing out on retaining the ai assistant response. This has been fixed now. * SimpleChat: Save message internally in handle_response itself This ensures that throwing the caught exception again for higher up logic, doesnt lose the response collated till that time. Go through theResp.assistant in catch block, just to keep simple consistency wrt backtracing just in case. Update the readme file. * SimpleChat:Cleanup: Add spacing wrt shown req-options * SimpleChat:UI: CreateDiv Divs map to GridX2 class This allows the settings ui to be cleaner structured. * SimpleChat: Show Non SettingsUI config field by default * SimpleChat: Allow for multiline system prompt Convert SystemPrompt into a textarea with 2 rows. Reduce user-input-textarea to 2 rows from 3, so that overall vertical space usage remains same. Shorten usage messages a bit, cleanup to sync with settings ui. * SimpleChat: Add basic skeleton for saving and loading chat Inturn when ever a chat message (system/user/model) is added, the chat will be saved into browser's localStorage. * SimpleChat:ODS: Add a prefix to chatid wrt ondiskstorage key * SimpleChat:ODS:WIP:TMP: Add UI to load previously saved chat This is a temporary flow * SimpleChat:ODS:Move restore/load saved chat btn setup to Me This also allows being able to set the common system prompt ui element to loaded chat's system prompt. * SimpleChat:Readme updated wrt save and restore chat session info * SimpleChat:Show chat session restore button, only if saved session * SimpleChat: AutoCreate ChatRequestOptions settings to an extent * SimpleChat: Update main README wrt usage with server	2024-06-02 02:20:18 +10:00
Johannes Gäßler	9b596417af	CUDA: quantized KV support for FA vec (#7527 ) * CUDA: quantized KV support for FA vec * try CI fix * fix commented-out kernel variants * add q8_0 q4_0 tests * fix nwarps > batch size * split fattn compile via extern templates * fix flake8 * fix metal tests * fix cmake * make generate_cu_files.py executable * add autogenerated .cu files * fix AMD * error if type_v != FP16 and not flash_attn * remove obsolete code	2024-06-01 08:44:14 +02:00
Georgi Gerganov	16926dff92	readme : link homebrew discussion	2024-05-31 15:04:58 +03:00
Galunid	2e32f874e6	Somehow '**' got lost (#7663 )	2024-05-31 18:24:41 +10:00
Galunid	1af511fc22	Add convert.py removal to hot topics (#7662 )	2024-05-31 10:09:20 +02:00
Sertaç Özercan	0541f06296	[no ci] docs: add aikit to readme (#7650 ) Signed-off-by: Sertac Ozercan <sozercan@gmail.com>	2024-05-31 09:57:16 +10:00
Martin Delille	5dcdf94676	Fix conan badge display [no ci] (#7645 )	2024-05-31 01:07:39 +10:00
Manuel	2e2340de17	Add brew installation instruction to README [no ci] (#7616 )	2024-05-31 00:58:15 +10:00
Martin Delille	7846540bd2	readme : add Conan badge (#7638 )	2024-05-30 15:52:50 +03:00
Galunid	9c4c9cc83f	Move convert.py to examples/convert-legacy-llama.py (#7430 ) * Move convert.py to examples/convert-no-torch.py * Fix CI, scripts, readme files * convert-no-torch -> convert-legacy-llama * Move vocab thing to vocab.py * Fix convert-no-torch -> convert-legacy-llama * Fix lost convert.py in ci/run.sh * Fix imports * Fix gguf not imported correctly * Fix flake8 complaints * Fix check-requirements.sh * Get rid of ADDED_TOKENS_FILE, FAST_TOKENIZER_FILE * Review fixes	2024-05-30 21:40:00 +10:00
Johannes Gäßler	972b555ab9	README: explain parallel build [no ci] (#7618 )	2024-05-30 09:52:39 +02:00
Meng, Hengyu	b864b50ce5	[SYCL] Align GEMM dispatch (#7566 ) * align GEMM dispatch	2024-05-29 07:00:24 +08:00
Aarni Koskela	9146d36fe7	Readme: add akx/ggify to tools (#1484 )	2024-05-26 22:09:42 +10:00

1 2 3 4 5 ...

480 Commits