llama.cpp

Commit Graph

Author	SHA1	Message	Date
HanishKVC	cad50c527e	ChatON: Update the note to match current logic	2024-05-06 11:27:56 +05:30
HanishKVC	a4b3285034	ChatON:Show Log on screen when template is applied	2024-05-06 11:27:56 +05:30
HanishKVC	d61b071b8d	Chaton:Common:Add missing newline wrt cmdline arg usage	2024-05-06 11:27:56 +05:30
HanishKVC	fee887fe31	ChatON:Common:Update the cmdline argument name used Had forgotten to update it before	2024-05-06 11:27:56 +05:30
HanishKVC	58e1ff16bc	ChatON: switch to ordered_json from json library to be in sync with the json namespace in server.	2024-05-06 11:27:56 +05:30
HanishKVC	a630564c48	ChatON:ChatTemplateApplyCAPI remaining base logic As c doesnt have the concept of pass by reference, and inturn the existing c api uses pointers wrt llama chat message structure, so switching to same wrt chat_tmpl_apply logics. Also fix a oversight in previous commit and add the remaining logic.	2024-05-06 11:27:56 +05:30
HanishKVC	308d3bf3ff	ChatON:WIP:Add c api wrapper for chat_template_apply Initial skeletons Update existing logics to help with same. Also the inbetween helper was having a bad signature wrt returning status and data, thats also fixed.	2024-05-06 11:27:56 +05:30
HanishKVC	e62699f923	ChatON: Add alertAssistantAtEnd flag & logic wrt MultiMsgs Apply While sending the current chat session along with new user query to the model, many models expect that a tag be added at the end to indicate that user is expecting the model to respond, this flags allows for the same.	2024-05-06 11:27:56 +05:30
HanishKVC	ea3a0f19cc	ChatON: Rather check for tmpl existence in single_ex	2024-05-06 11:27:56 +05:30
HanishKVC	01c8db70f7	ChatON+Main: Add C_API wrapper for single Add a c api wrapper for a single message tagging scenario. Inturn to match convention followed by existing chat_apply_template code, make it return the size expected of the tagged message string buffer. Update internal single logic to help with same. Explicitly check if tmpl specified is available in the loaded json or not and then return a error if not found.	2024-05-06 11:27:56 +05:30
HanishKVC	13857f29d6	ChatON+Main: Updates wrt detailed meta json Fix a oversight wrt key name. Add a alert in case if passed meta json file contains begin(BoS) wrt assistant role, similar to check for end (EoS) wrt user role. Bcas normally both (ie EoS wrt User and BoS wrt Assistant) shouldnt be needed. Update main wrt begin & prefix and suffix & end addition.	2024-05-06 11:27:56 +05:30
HanishKVC	0cd7c62706	ChatON: Keep compiler happy Move helpers to the beginning, so can avoid adding prototype declarations/function signatures to the beginning Get the char * wrt string data in the c++ string.	2024-05-06 11:27:56 +05:30
HanishKVC	6a0214c067	ChatON:MetaOK->MetaDump: Alert if user->end is needed or not Because user messages dont normally need a EoS token.	2024-05-06 11:27:56 +05:30
HanishKVC	344857b6cb	ChatOn:ChatOnTemplateApply: suffix,end flag based control Also fix a oversight wrt begin, when flag based begin adding control was introduced. NOTE: Currently system role suffix/end conditional adding always triggered, if 1st system prompt seen or additional system prompt is seen.	2024-05-06 11:27:56 +05:30
HanishKVC	f8ae21cec7	ChatON:ChatTemplateApplySingle: update begin+prefix, suffix+end	2024-05-06 11:27:56 +05:30
HanishKVC	5d76f08d37	ChatON: Need to explicitly specify string to use c_str	2024-05-06 11:27:56 +05:30
HanishKVC	7ba0144e42	ChatOn:chaton_tmpl_role_kv: try except to ignore missing ifany Cas of above reason, switch to directly accessing the keys in dump helper, which is inturn used by meta_ok check	2024-05-06 11:27:56 +05:30
HanishKVC	adab5775bf	ChatON: more detailed/spreadout json fields	2024-05-06 11:27:56 +05:30
HanishKVC	3f09eb5dea	ChatOn: ChatTemplateApply[Ex] return tagged msgs parts detail Now there is a simple and extended version of returning tagged messages. The extended version returns the tagged string, as well as the details of the parts that make up that tagged message interms of the type of parts and the lengths of the parts.	2024-05-06 11:27:56 +05:30
HanishKVC	825a78abaa	ChatOn: ChatTemplateApplySingle[Ex] return parts detail Now there is a simple and extended version of returning tagged message wrt a single role and its content. The extended version returns the tagged string, as well as the details of the parts that make up that tagged message interms of the type of parts and the lengths of the parts.	2024-05-06 11:27:56 +05:30
HanishKVC	92e780fb1a	ChatON:ChatParts: Allow flexibility for more refined tokenization	2024-05-06 11:27:56 +05:30
HanishKVC	d1899728aa	ChatON: Test ChatParts in chat-template-apply	2024-05-06 11:27:56 +05:30
HanishKVC	9de1d6017f	ChatON:ChatParts class initial go Helps keep user prompt and chat-hs-template tag parts separate, but in sequence	2024-05-06 11:27:56 +05:30
HanishKVC	3064a36e74	ChatON+:Update tmpl_role_kv to retrieve wrt multiple keys Use the same for user role's begin and prefix entries.	2024-05-06 11:27:56 +05:30
HanishKVC	f1f39c5256	ChatON:Add Monarch model template, which uses Begin + Prefix Inturn Begin/BoS is added only for non 1st user messages in a system+user prompts chain.	2024-05-06 11:27:56 +05:30
HanishKVC	724ff38345	ChatOn: Wrap getting begin in try-catch, so that even if a role doesnt contain begin, the logic will work fine.	2024-05-06 11:27:56 +05:30
HanishKVC	d70fca7a45	ChatOn: Add begin to the mix along with prefix Dump shows user->begin. chat-template-apply[-single] updated to work with begin and prefix TODO: need to wrap begin in a try-catch, so that irrespective of role, begin+prefix will work, irrespective of whether that role has a begin entry or not.	2024-05-06 11:27:56 +05:30
HanishKVC	bdd279c0c9	ChatOn:User Begin+Prefix note update, keep things simple consistent	2024-05-06 11:27:56 +05:30
HanishKVC	84367b9fd1	ChatON: Add template for DeepSeek Was looking at the tokenized vector, and noticed that the EOS mentioned by existing chat_apply_template of llama.cpp, is different from what I noticed in tokenizer_config.json of deepseek llm, so I have added two entries * "deepseek-alt" which matches llama.cpp's chat_apply_template and * "deepseek" which matches that in tokenizer_config.json. This impacts the assistant suffix and reverse prompt entries. CasOfThis: Need to look into other entries which I added previously at a later time. However as the default logic should be picking the EOS from model file, so I assume reverse-prompt being outofsync, may not matter beyond a limit, potentially.	2024-05-06 11:27:56 +05:30
HanishKVC	57bd772bfd	ChatON: Cleanup logging Avoid showing on screen the debug messages. meta-dump can either show on screen or not, based on how LOGXLN is defined.	2024-05-06 11:27:56 +05:30
HanishKVC	217544e5ff	ChatON: Keep compiler happy Order the functions so that no need for separate prototypes Also use kv_bool wrt boolean entries. Convert string to c char *	2024-05-06 11:27:56 +05:30
HanishKVC	3f9dfc240c	ChatON: Check for the boolean entries in meta-json	2024-05-06 11:27:56 +05:30
HanishKVC	42f6b45547	ChatON: Use the constants defined for the keys	2024-05-06 11:27:56 +05:30
HanishKVC	efb758ba7d	ChatON: Rename helpers to kv suffix, updated wrt metaok rename because they return value of specified key. [main] update metaok to take template-id, so that one can cross check that all needed entries are there wrt that template-id in the chaton-meta-json file	2024-05-06 11:27:56 +05:30
HanishKVC	e8c24c0767	ChatOn:MetaOk: Allows template-id based cross check For a given template-id, cross check, all needed entries are there in the json.	2024-05-06 11:27:56 +05:30
HanishKVC	b1055641e9	ChatON: Update the notes a bit	2024-05-06 11:27:56 +05:30
HanishKVC	11b47fbcfc	ChatON:MetaJson: Add key constants, check metaJson loaded ifNeeded	2024-05-06 11:27:56 +05:30
HanishKVC	221ccd6462	ChatOn: Add SystemUser-1st-User-Has-Prefix flag support Llama2 seems to need it, so chaton-meta-json sample file updated to use same.	2024-05-06 11:27:56 +05:30
HanishKVC	f03dd2439f	ChatOn:No global-begin/end in ChatApplyTmplSingle, ChatApplyTmpl Avoid adding global begin/end markers wrt ChatApplyTmplSingle. Add ChatApplyTmpl which goes through a vector of messages.	2024-05-06 11:27:56 +05:30
HanishKVC	c4cf0e9075	ChatON:Cleanup: BeginEnd, Debug log Update the note Rename global-prefix\|suffix to global-begin\|end. Rename chat-apply-template to chat-apply-template-single, cas it handles only a single message. Add some debug log messages to the helper functions	2024-05-06 11:27:56 +05:30
HanishKVC	050d329e7e	ChatOn+Main: Initial go at chaton in main interactive flow	2024-05-06 11:27:55 +05:30
HanishKVC	dc56be951d	ChatOn:Main: Load and dump any specified chaton meta file	2024-05-06 11:27:55 +05:30
HanishKVC	35f25196a0	ChatOn:Common: Add the needed cmdline arg params and its parsing	2024-05-06 11:27:55 +05:30
HanishKVC	2146a253e8	ChatOn: Capture the idea	2024-05-06 11:27:55 +05:30
viric	fcd84a0f5a	Fix Linux /sys cpu path to guess number of cores (#7064)	2024-05-04 15:26:53 +02:00
Andrew Downing	b0d943de17	Update LOG_IMPL and LOG_TEE_IMPL (#7029) ROCm clang defines _MSC_VER which results in the wrong implementation of LOG_IMPL and LOG_TEE_IMPL being compiled. This fixes https://github.com/ggerganov/llama.cpp/issues/6972	2024-05-01 23:31:30 +02:00
Johannes Gäßler	a8f9b07631	perplexity: more statistics, added documentation (#6936) * perplexity: more statistics, added documentation * add LLaMA 3 8b scoreboard	2024-04-30 23:36:27 +02:00
Georgi Gerganov	9c67c2773d	ggml : add Flash Attention (#5021) * ggml : add ggml_flash_attn_ext API * ggml : fix GQA support in ggml_flash_attn_ext * ggml : online attention (CPU) * metal : initial implementation * metal : f16 precision * metal : reduce branches * metal : specialize for head size * wip : 8 rows per simd group * wip : 4 rows per simd group * wip : template for rows per warp * metal : parallelize across KV size * metal : parallel reduce across heads * metal : efficient flash_attn_f16 implementation * metal : avoid redundant loads of the attention * metal : scale and mask in matrix form * metal : fix comment * llama : avoid ggml_cast, use F32 query * metal : add parallel reduce version (disabled) * metal : move output into local memory + optimize - the result from each simdgroup now stays in the registers - significantly reduced SRAM usage - more efficient skipping of -INF blocks - avoid simdgroup barrier in hot loop - add comments * metal : add tests, fix scaling, support C > 32 * metal : improve precision * ggml : fix f16 mad * metal : minor * metal : support Q > 8 * tests : add ATTN tests * metal : disable buffer allocation logs * tests : more * metal : faster inner loop for C == 32 * metal : fix array initialization * tests : ifdef * ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext * ggml : fix ggml_soft_max mask requirement * cuda : fix soft_max to use correct mask size * cuda : add flash_attn kernel (wip) * metal : optimize softmax for C > 32 * metal : optimize softmax * tests : minor fix * cuda : avoid zeroing fragments * tests : update dims * cuda : fix __hisinf() result check * cuda : avoid warp_reduce for smax * cuda : use int instead of int64_t Noticeably improves performance (thanks to Johannes) * cuda : make loops use the same loop values Thanks Johannes again for the tip * cuda : unroll some of the loops * cuda : avoid __hisinf branches * cuda : use half2 in softmax * cuda : switch to 1 warp for bs > 16 * cuda : speed-up reduce part of the kernel * cuda : unroll QK^T loop cuda : fix -INF block check * cuda : simplify softmax * cuda : fix matrix names * cuda : minor * llama : adapt to F16 KQ_pos * llama : adapt new models to F16 KQ_mask * ggml : fix F16 store (ARM NEON) * llama : fix type of KQ_mask and KQ_pos * ggml : fix CPU soft_max * tests : add hs=256 * cuda : fix build * metal : improve perf via smaller int registers * cuda : adapt soft_max to F16 mask and pos * CUDA: faster FlashAttention, kernel for bs == 1 * 16 cols for Phi-2 * no vec for hs, no hs==256 ncols==32 for Volta * adjust kernel selection logic * 4 warps, 256 stride for all D * no ncols == 64 * Multiple parallel blocks for batch size 1 * fix compile warnings * fix excessive KQ_b loads * fix cmake build * fix KV cache padding, NaN from INFINITY (#6438) * llama : flash_attn cparam + fix defrag * server: support flash_attn param * server: bench: enable flash_attn param * CUDA: refactor host code, dyn. par. blocks * fix flash_attn_vec_f16 race condition * flush softmax exp below threshold to 0 * store temp KQ in registers * Calculate KQ as FP32 if KQV has GGML_PREC_F32 * Add __hgt2_mask implementation for CUDA 11 * fix KQ FP32 precision fpr parallel_blocks > 1 * llama-bench : add -fa,--flash-attn arg * metal : add BS=1 kernel for flash attention (#6508) * metal : add BS=1 kernel for flash attention (wip) * metal : support more than 1 warps * metal : opts * metal : opt * metal : switch to parallel reduce * metal : reduce registers * metal : simplify * metal : initial FA vec kernel * metal : use F32 attention accumulators * batched-bench : add fattn arg * llama : simplify llama_build_kv_store ggml-ci * llama : adapt build_olmo to changes * ggml : fix arm fp16 store on windows * metal : clean-up * metal : clean-up kernel code * metal : minor * tests : remove benchmarks ggml-ci * ggml : fix avx512 const correctness ggml-ci * ggml : fix soft_max with bias on CPU ggml-ci * common : print --flash-attn in help * ggml : fix num dimensions in ggml_flash_attn_ext * llama : force disable flash attention for incompatible models * ggml : ggml_soft_max support F16/F32 mask/pos ggml-ci * cuda : uint -> uint32_t * cuda : "constexpr dim3" -> "const dim3" ggml-ci * cuda : try to fix __hgt2_mask ggml-ci * ggml : add TODO's for F16/F32 mask/pos support in other backends * llama : replace bool need_kq_pos with use_alibi * llama : prep ALiBi support for BERT models ggml-ci * llama : fix n_batch requirements ggml-ci * cont * server : add help for --flash-attn arg * llama : disable FA for AMD * tests : remove TMP_ATTN_BENCH ggml-ci * llama : support save/load state with FA enabled ggml-ci * ci : add CUDA save-load-state tests ggml-ci * llama : llama_kv_cache_clear zeroes data + fix save-load seq ggml-ci * llama : fix copy-paste errors, add TODO * llama : disallow incompatible states * llama : update llama_state_get_size after v_trans field * metal : remove tmp log * llama : add static reminder for llama_state_get_size * metal : fix max nsg ggml-ci * ci : fix arg order ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>	2024-04-30 12:16:08 +03:00
Olivier Chafik	8843a98c2b	Improve usability of --model-url & related flags (#6930) * args: default --model to models/ + filename from --model-url or --hf-file (or else legacy models/7B/ggml-model-f16.gguf) * args: main & server now call gpt_params_handle_model_default * args: define DEFAULT_MODEL_PATH + update cli docs * curl: check url of previous download (.json metadata w/ url, etag & lastModified) * args: fix update to quantize-stats.cpp * curl: support legacy .etag / .lastModified companion files * curl: rm legacy .etag file support * curl: reuse regex across headers callback calls * curl: unique_ptr to manage lifecycle of curl & outfile * curl: nit: no need for multiline regex flag * curl: update failed test (model file collision) + gitignore *.gguf.json	2024-04-30 00:52:50 +01:00
cpumaxx	ffe666572f	llava-cli : multiple images (#6969) Co-authored-by: root <root@nenya.lothlorien.ca>	2024-04-29 17:34:24 +03:00

236 Commits