llama.cpp

Commit Graph

Author	SHA1	Message	Date
Johannes Gäßler	7254cdf7e8	ggml: fix gradient allocation logic (ggml/966) * ggml: fix gradient allocation logic * gradient allocation in ggml_build_backward_expand * fixup * fix test-backend-ops grad * suggestions by slaren * fix test1.c * fix legacy opt API * fix test-grad0 * remove keep arg	2024-10-01 16:07:38 +03:00
slaren	1b2f992cd2	test-backend-ops : use flops for some performance tests (#9657 ) * test-backend-ops : use flops for some performance tests - parallelize tensor quantization - use a different set of cases for performance and correctness tests - run each test for at least one second	2024-09-28 14:32:46 +02:00
Zhenwei Jin	6102037bbb	vocab : refactor tokenizer to reduce init overhead (#9449 ) * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * refactor tokenizer * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * remove unused files * remove unused fileds to avoid unused filed build error * avoid symbol link error * Update src/llama.cpp * Update src/llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-28 15:10:58 +03:00
Georgi Gerganov	b0f27361f3	sampling : avoid expensive softmax during greedy sampling (#9605 ) * sampling : avoid expensive softmax during greedy sampling ggml-ci * speculative : fix default RNG seed + set sparams.n_probs * Update tests/test-sampling.cpp Co-authored-by: slaren <slarengh@gmail.com> * sampling : add clarifying comment [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-09-24 09:03:17 +03:00
Johannes Gäßler	a5b57b08ce	CUDA: enable Gemma FA for HIP/Pascal (#9581 )	2024-09-22 09:34:52 +02:00
Molly Sophia	2a63caaa69	RWKV v6: RWKV_WKV op CUDA implementation (#9454 ) * ggml: CUDA unary op EXP Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: rwkv_wkv op CUDA impl Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-22 04:29:12 +02:00
Johannes Gäßler	424c5d00a9	ggml/examples: add backend support for numerical optimization (ggml/949) * CUDA eval works * stochastic gradient descent op * Adam except decay * CUDA CROSS_ENTROPY_LOSS_BACK * CUDA mnist-fc training works * backend CLI arg * refactor gguf load * remove sched from opt_step_adam * implement l1 regularization (weight decay) * extra call to add optimizer * initialize gradients with ggml_graph_reset * gradient accumulation * increment iter per eval instead of epoch * adjust backend interfaces * fix ggml_graph_reset without backend * fix ggml graph export/import * fixup * rename * revert ggml_opt changes * more general CUDA repeat_back * update documentation, fix CNN * validation split * add clarifying comment * optimize PyTorch training * adjust buffer size, thread count * fix 0.0f validation split * Update examples/mnist/mnist-common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix gradient accumulation * tensor flag for accumulators -> tensor hash set * Update include/ggml.h Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * fix test prints * Update src/ggml-backend.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * better CUDA support for noncontiguous out_prod * add comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-09-20 21:15:05 +03:00
Max Krasnyansky	0226613853	threadpool : skip polling for unused threads (#9461 ) * threadpool: skip polling for unused threads Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1). This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur). n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written from one thread and read from other threads (not a race conditions). * threadpool: further simplify and improve ggml_barrier Avoid using strict memory order while polling, yet make sure that all threads go through full memory barrier (memory fence) on ggml_barrier entrace and exit. * threads: add simple barrier test This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead. * threadpool: improve thread sync for new-graphs Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order to keep it efficient, once the new graph is detected we do full fence using read-modify-write with strict memory order. * threadpool: improve abort handling Do not use threadpool->ec (exit code) to decide whether to exit the compute loop. threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it. Instead introduce atomic threadpool->abort flag used for this. This is consistent with how we handle threadpool->stop or pause. While at it add an explicit atomic_load for n_threads_cur for consistency. * test-barrier: release threadpool before releasing the context fixes use-after-free detected by gcc thread-sanitizer on x86-64 for some reason llvm sanitizer is not detecting this issue.	2024-09-17 11:19:46 +03:00
Georgi Gerganov	6262d13e0b	common : reimplement logging (#9418 ) https://github.com/ggerganov/llama.cpp/pull/9418	2024-09-15 20:46:12 +03:00
Georgi Gerganov	d6a04f872d	ggml : hide ggml_object, ggml_cgraph, ggml_hash_set (#9408 ) * ggml : hide ggml_object, ggml_cgraph, ggml_hash_set ggml-ci * ggml : add ggml-impl.h to backends * ggml : fix compiler warnings ggml-ci * ggml : add assert upon adding nodes	2024-09-12 14:23:49 +03:00
Xuan Son Nguyen	bfe76d4a17	common : move arg parser code to `arg.cpp` (#9388 ) * common : move arg parser to arg.cpp * better categorize args * add cmake * missing climits * missing cstdarg * common : more explicit includes * fix build * refactor gpt_params_parse * update server readme * fix test --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-09 23:36:09 +02:00
slaren	5fb5e24811	llama : minor sampling refactor (2) (#9386 )	2024-09-09 17:10:46 +02:00
Xuan Son Nguyen	3f7ccfd649	common : bring back missing args, add env var duplication check (#9375 ) * common : bring back missing args * move duplication check to test-arg-parser * add check for duplicated env var * correct default values	2024-09-08 18:08:55 +02:00
slaren	19f4a7b296	llama : refactor samplers internal implementation (#9370 )	2024-09-08 15:52:07 +02:00
Georgi Gerganov	a876861455	metal : update support condition for im2col + fix warning (#0 )	2024-09-08 11:05:55 +03:00
Johannes Gäßler	202084d31d	tests: add gradient tests for all backends (ggml/932) * tests: add gradient checking to test-backend-ops * remove old comment * reorder includes * adjust SIN/COS parameters * add documentation, use supports_op if possible	2024-09-08 11:05:55 +03:00
Salvatore Mesoraca	efe6a83e30	ggml : fix cont with transposed tensors when one dimension is 1 (ggml/934) * ggml_cont: fix issue with transposed tensors when one dimension is 1 when using multiple threads, it is not enough to check for the tensors to be contiguous for ggml_compute_forward_dup_same_cont to work correctly. The tensors strides also need to match. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> * Add ggml_cont tests Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> * Remove dead code it isn't possible to reach this code because all these functions are invoked by ggml_compute_forward_dup if and only if src0->type != dst->type Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> * Make ggml_compute_forward_dup_same_cont work with contiguous tensors Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> --------- Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-08 11:05:55 +03:00
Xuan Son Nguyen	1b9ae5189c	common : refactor arg parser (#9308 ) * (wip) argparser v3 * migrated * add test * handle env * fix linux build * add export-docs example * fix build (2) * skip build test-arg-parser on windows * update server docs * bring back missing --alias * bring back --n-predict * clarify test-arg-parser * small correction * add comments * fix args with 2 values * refine example-specific args * no more lamba capture Co-authored-by: slaren@users.noreply.github.com * params.sparams * optimize more * export-docs --> gen-docs	2024-09-07 20:43:51 +02:00
Georgi Gerganov	df270ef745	llama : refactor sampling v2 (#9294 ) - Add `struct llama_sampler` and `struct llama_sampler_i` - Add `llama_sampler_` API - Add `llama_sampler_chain_` API for chaining multiple samplers - Remove `LLAMA_API_INTERNAL` - Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`	2024-09-07 15:16:19 +03:00
compilade	9bc6db28d0	ggml-quants : ternary packing for TriLMs and BitNet b1.58 (#8151 ) * ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b * ggml-quants : faster 1.625 bpw AVX2 vec_dot Not using a lookup table anymore makes it match q4_0 speed. * gguf-py : fix formatting * llama : remove spaces on empty line * ggml-quants : subtract 1 when back in epi8 This makes the 1.625 bpw type go faster than q4_0. Still not the fastest. * ggml-quants : Q2_2 now faster than Q4_K on with AVX2 * ggml-quants : cleanup Q1_3 code formatting * ggml-quants : ARM NEON vec_dot for q2_2 and q1_3 * ggml-quants : use ceiling division when quantizing q1_3 * convert-hf : simplify BitNet pre-quantization This still results in the exact same tensor weights and scales, but it reveals some weirdness in the current algorithm. * convert-hf : allow converting the weird BitNet 1.3B Its FFN size is 5460 which is not convenient. The offending tensors are kept in F16, which makes the final model 5.01 bpw. * bitnet : replace 1.58b with b1.58, as in the paper * ggml-quants : fix build failure on Windows * ggml-quants : attempt to fix Arm 32-bit support * ggml : add some informative comments in q1_3 vec_dot * ggml : add TQ1_0 and TQ2_0 ternary quantization types * ggml : even faster TQ2_0 * ggml : also faster TQ1_0 Same optimization as for TQ2_0 by offsetting the sum instead of the weights. This makes TQ1_0 almost as fast as Q8_0 on AVX2. * ggml : fix build issues in certain environments * ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0 * ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat The compiler seems smart enough to use the same instruction even when using vget_high_s8 instead. * ggml : remove q1_3 and q2_2 No more 1.625 bpw and 2.000 bpw, now instead using 1.6875 bpw and 2.0625 bpw with TQ1_0 and TQ2_0, respectively. * llama : remove the separate scale tensors of BitNet b1.58 They won't be needed, since the remaining ternary quant types have built-in scales. * ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency * ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot Not yet tested on hardware which supports it, might not work or might not even compile. But also it might. It should make the performance better on recent ARM CPUs. * ggml-quants : remove comment about possible format change of TQ2_0 Making it slightly more convenient for AVX512 but less convenient for everything else is not worth the trouble. * gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0 * ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0 This does not change anything for ternary models, since their values should never end up being in halfway cases anyway. * convert : allow direct conversion to TQ1_0 and TQ2_0 The token embeddings and output tensors are kept in F16 to allow quantizing them to Q4_K and Q6_K with llama-quantize. * llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0 Q4_0 is not completely symmetric (so not lossless for ternary models), but it should be good enough. * ggml-quants : allow using ARM dot product instructions for TQ1_0 * ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support * ggml : remove unused ggml_mul special case It would otherwise conflict with the more general optimization coming with Mamba-2. * ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators * test-backend-ops : add TQ1_0 and TQ2_0 comments for later Not yet adding uncommented, because some backends like SYCL and Metal do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT. (and Metal also doesn't handle it with GGML_OP_GET_ROWS) Support for TQ1_0 and TQ2_0 for other backends than CPU will be added in follow-up pull requests.	2024-09-05 21:48:47 -04:00
Faisal Zaghloul	42c76d1358	Threadpool: take 2 (#8672 ) * Introduce ggml_compute_threadpool - OpenMP functional: check - Vanilla ggml functional: Check - ggml w/threadpool functional: Check - OpenMP no regression: No glaring problems - Vanilla ggml no regression: No glaring problems - ggml w/threadpool no regression: No glaring problems * Minor fixes * fixed use after release bug * fixed a harmless race condition * Fix Android bulid issue * fix more race conditions * fix deadlock for cases where cgraph.n_nodes == 1 and fix --poll case * threadpool: use cpu_get_num_math to set the default number of threadpool threads This way we avoid using E-Cores and Hyperthreaded siblings. * bench: create fresh threadpool for each test For benchmarking it's better to start a fresh pool for each test with the exact number of threads needed for that test. Having larger pools is suboptimal (causes more load, etc). * atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior. * threadpool: make polling the default to match openmp behavior All command line args now allow for setting poll to 0 (false). * threadpool: do not wakeup threads in already paused threadpool * fix potential race condition in check_for_work * threadpool: do not create two threadpools if their params are identical * threadpool: reduce pause/resume/wakeup overhead in common cases We now start threadpool in paused state only if we have two. The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead. * threadpool: add support for hybrid polling poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var. poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ... The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms. We can tune this further as things evolve. * threadpool: reduce the number of barrier required New work is now indicated with an atomic counter that is incremented for each new graph that needs to be computed. This removes the need for extra barrier for clearing the "new_work" and removes the special case for trivial graphs. * threadpool: remove special-casing for disposable threadpools With the efficient hybrid polling there is no need to make disposable pools any different. This simplifies the overall logic and reduces branching. Include n_threads in debug print for disposable threadpool. Declare pause and stop flags as atomic_bool This doesn't actually generate any memory barriers and simply informs the thread sanitizer that these flags can be written & read by different threads without locking. * threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs) This fixes the race condition with very small graphs where the main thread happens to start a new graph while the workers are just about to exit from barriers. * threadpool: use relaxed order for chunk sync Full memory barrier is an overkill for this since each thread works on different chunk * threadpool: remove abort_callback from threadpool state * threadpool: better naming for thread/cpumask releated functions * threadpool: consistent use of int type for n_threads params * threadpool: add support for ggml_threadpool_params_default/init Also removes the need for explicit mask_specified param. all-zero cpumask means use default (usually inherited) cpu affinity mask. * threadpool: move typedef into ggml.h * threadpool: fix apply_priority() function name * threadpool: fix swift wrapper errors due to n_threads int type cleanup * threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled * threadpool: replace checks for compute_thread ret code with proper status check * threadpool: simplify threadpool init logic and fix main thread affinity application Most of the init code is now exactly the same between threadpool and openmp. * threadpool: update threadpool resume/pause function names * threadpool: enable openmp by default for now * threadpool: don't forget to free workers state when omp is enabled * threadpool: avoid updating process priority on the platforms that do not require it On Windows we need to change overall process priority class in order to set thread priorities, but on Linux, Mac, etc we do not need to touch the overall process settings. * threadpool: update calling thread prio and affinity only at start/resume This avoids extra syscalls for each graph_compute() * llama-bench: turn threadpool params into vectors, add output headers, etc * llama-bench: add support for cool off between tests --delay This helps for long running tests on platforms that are thermally limited (phones, laptops, etc). --delay (disabled by default) introduces the sleep for N seconds before starting each test. * threadpool: move process priority setting into the apps (bench and cli) This avoids changing the overall process priority on Windows for the apps that use ggml/llama.cpp directy. * threadpool: move all pause/resume logic into ggml * threadpool: futher api cleanup and prep for future refactoring All threadpool related functions and structs use ggml_threadpool prefix. * threadpool: minor indent fixes * threadpool: improve setprioty error message * Update examples/llama-bench/llama-bench.cpp Co-authored-by: slaren <slarengh@gmail.com> * threadpool: fix indent in set_threadpool call * use int32_t for n_thread type in public llama.cpp API * threadpool: use _new and _free instead of _create and _release * fix two more public APIs to use int32_t for n_threads * build: set _GNU_SOURCE for Adroid --------- Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com> Co-authored-by: fmz <quic_fzaghlou@quic.com> Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-08-30 01:20:53 +02:00
Georgi Gerganov	231cff5f6f	sync : ggml	2024-08-27 22:41:27 +03:00
Georgi Gerganov	fc18425b6a	ggml : add SSM Metal kernels (#8546 ) * ggml : add ggml_ssm_conv metal impl * ggml : add ssm_scan metal impl ggml-ci	2024-08-26 17:55:36 +03:00
Georgi Gerganov	879275ac98	tests : fix compile warnings for unreachable code (#9185 ) ggml-ci	2024-08-26 16:30:25 +03:00
slaren	0c41e03ceb	metal : gemma2 flash attention support (#9159 )	2024-08-26 11:08:59 +02:00
Johannes Gäßler	e11bd856d5	CPU/CUDA: Gemma 2 FlashAttention support (#8542 ) * CPU/CUDA: Gemma 2 FlashAttention support * apply logit_softcap to scale in kernel * disable logit softcapping tests on Metal * remove metal check	2024-08-24 21:34:59 +02:00
Xuan Son Nguyen	3ba780e2a8	lora : fix llama conversion script with ROPE_FREQS (#9117 )	2024-08-23 12:58:53 +02:00
zhentaoyu	4f8d19ff17	[SYCL] Fix SYCL `im2col` and `convert` Overflow with Large Dims (#9052 ) * sycl: fix im2col overflow and sync with cuda Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix convert overflow Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix convert and dequantize Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix ib in dmmv Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl:refine convert Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: move downsample global_range into common Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: add im2col and convert test cases Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: make new cases only in sycl Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: comment new test_cases for only local testing Signed-off-by: zhentaoyu <zhentao.yu@intel.com> --------- Signed-off-by: zhentaoyu <zhentao.yu@intel.com>	2024-08-20 23:06:51 +08:00
fairydreaming	90db8146d5	tests : add missing comma in grammar integration tests (#9099 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-20 12:09:55 +03:00
ltoniazzi	2339a0be1c	tests : add integration test for lora adapters (#8957 ) * Add printing to check weights match torch version * minor code style changes --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-08-18 11:58:04 +02:00
Molly Sophia	2d5dd7bb3f	ggml : add epsilon as a parameter for group_norm (#8818 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-08-06 10:26:46 +03:00
0cc4m	064cdc265f	vulkan : fix Qantized Mat-Vec Mul on AMD GPUs for ncols < 64 (#8855 ) * Fix Vulkan mul mat vec invalid results when ncols < warp size * Only run backend ops mul mat vec block size test if block size not already covered	2024-08-05 08:52:55 +03:00
Mengqing Cao	e09a800f9a	cann: Fix ggml_cann_im2col for 1D im2col (#8819 ) * fix ggml_cann_im2col for 1D im2col * fix build warning	2024-08-02 16:50:53 +08:00
slaren	7a11eb3a26	cuda : fix dmmv cols requirement to 2GGML_CUDA_DMMV_X (#8800 ) cuda : fix dmmv cols requirement to 2GGML_CUDA_DMMV_X update asserts * only use dmmv for supported types * add test	2024-08-01 15:26:22 +02:00
slaren	2b1f616b20	ggml : reduce hash table reset cost (#8698 ) * ggml : reduce hash table reset cost * fix unreachable code warnings after GGML_ASSERT(false) * GGML_ASSERT(false) -> GGML_ABORT("fatal error") * GGML_ABORT use format string	2024-07-27 04:41:55 +02:00
Georgi Gerganov	88954f7fbd	tests : fix printfs (#8068 )	2024-07-25 18:58:04 +03:00
Xuan Son Nguyen	96952e7181	llama : fix `llama_chat_format_single` for mistral (#8657 ) * fix `llama_chat_format_single` for mistral * fix typo * use printf	2024-07-24 13:48:46 +02:00
Georgi Gerganov	938943cdbf	llama : move vocab, grammar and sampling into separate files (#8508 ) * llama : move sampling code into llama-sampling ggml-ci * llama : move grammar code into llama-grammar ggml-ci * cont ggml-ci * cont : pre-fetch rules * cont ggml-ci * llama : deprecate llama_sample_grammar * llama : move tokenizers into llama-vocab ggml-ci * make : update llama.cpp deps [no ci] * llama : redirect external API to internal APIs ggml-ci * llama : suffix the internal APIs with "_impl" ggml-ci * llama : clean-up	2024-07-23 13:10:17 +03:00
Georgi Gerganov	e093dd2382	tests : re-enable tokenizer tests (#8611 ) * models : remove duplicated gpt-2 vocab * models : remove old stablelm vocab * tests : re-enable MPT tokenizer tests * tests : re-enable DeepSeek tokenizer tests * cmake : sort ggml-ci	2024-07-22 13:32:49 +03:00
slaren	87e397d00b	ggml : fix quant dot product with odd number of blocks (#8549 ) * ggml : fix iq4_nl dot product with odd number of blocks * ggml : fix odd blocks for ARM_NEON (#8556) * ggml : fix iq4_nl dot product with odd number of blocks * ggml : fix q4_1 * ggml : fix q5_0 * ggml : fix q5_1 * ggml : fix iq4_nl metal ggml-ci * ggml : fix q4_0 * ggml : fix q8_0 ggml-ci * ggml : remove special Q4_0 code for first 2 blocks * ggml : fix sumf redefinition --------- Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-19 17:17:27 +02:00
hipudding	1bdd8ae19f	[CANN] Add Ascend NPU backend (#6035 ) * [CANN] Add Ascend NPU backend Ascend is a full-stack AI computing infrastructure for industry applications and services based on Huawei Ascend processors and software. CANN (Compute Architecture of Neural Networks), developped by Huawei, is a heterogeneous computing architecture for AI. Co-authored-by: wangshuai09 <391746016@qq.com> * delete trailing whitespaces * Modify the code based on review comment * Rename LLAMA_CANN to GGML_CANN * Make ggml-common.h private * add ggml_cann prefix for acl funcs * Add logging for CANN backend * Delete Trailing whitespace --------- Co-authored-by: wangshuai09 <391746016@qq.com>	2024-07-17 14:23:50 +03:00
compilade	fa79495bb4	llama : fix pre-tokenization of non-special added tokens (#8228 ) * llama : fix mpt and olmo pre-tokenizer * llama : pre-tokenize non-special user-defined tokens first * llama : fix detection of control-like user-defined tokens * convert_hf : identify which user-defined tokens are control tokens Only used in _set_vocab_gpt2() for now. * convert_hf : identify more added control tokens for SPM tokenziers This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion. There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (the implementation in llama.cpp has the same behavior as SentencePiece) * llama : fix wrong pre-tokenization of byte tokens * llama : fix Viking pre-tokenizer regex The order was previously wrong, which caused errors in some tests. * llama : fix command-r detokenization * convert_hf : reduce usages of the UNKNOWN token type * llama : add UNKNOWN tokens in the special tokens cache * convert_hf : reduce usages of UNKNOWN for InternLM2 This makes the changes from #8321 more consistent with the other changes made here. * test-tokenizer-random : reduce potential confilcts with #8379 * test-tokenizer-random : add a failing edge case for falcon	2024-07-13 23:35:10 -04:00
Georgi Gerganov	370b1f7e7a	ggml : minor naming changes (#8433 ) * ggml : minor naming changes ggml-ci * ggml : use PRId64 [no ci] * ggml : revert FA K/Q names	2024-07-12 10:46:02 +03:00
Georgi Gerganov	6847d54c4f	tests : fix whitespace (#0 )	2024-07-08 12:23:00 +03:00
John Balis	fde13b3bb9	feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854) * conv transpose 1d passing test for 1d input and kernel * working for different input and output channel counts, added test for variable stride * initial draft appears to work with stride other than 1 * working with all old and new conv1d tests * added a test for large tensors * removed use cuda hardcoding * restored test-conv-transpose.c * removed unused arugments, and fixed bug where test failure would cause subsequent tests to fail * fixed accumulator bug * added test to test-backend-ops * fixed mistake * addressed review * fixed includes * removed blank lines * style and warning fixes * return failure when test fails * fix supports_op --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-07-08 12:23:00 +03:00
compilade	3fd62a6b1c	py : type-check all Python scripts with Pyright (#8341 ) * py : type-check all Python scripts with Pyright * server-tests : use trailing slash in openai base_url * server-tests : add more type annotations * server-tests : strip "chat" from base_url in oai_chat_completions * server-tests : model metadata is a dict * ci : disable pip cache in type-check workflow The cache is not shared between branches, and it's 250MB in size, so it would become quite a big part of the 10GB cache limit of the repo. * py : fix new type errors from master branch * tests : fix test-tokenizer-random.py Apparently, gcc applies optimisations even when pre-processing, which confuses pycparser. * ci : only show warnings and errors in python type-check The "information" level otherwise has entries from 'examples/pydantic_models_to_grammar.py', which could be confusing for someone trying to figure out what failed, considering that these messages can safely be ignored even though they look like errors.	2024-07-07 15:04:39 -04:00
toyer	905942abdb	llama : support glm3 and glm4 (#8031 ) * add chatglm3-6b model support huggingface model: https://hf-mirror.com/THUDM/chatglm3-6b Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * remove .rotary_pos_emb.inv_freq and unuse code for chatglm3 model Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * fix lint error Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * optimize convert-hf-to-gguf.py for chatglm model Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * support glm-4-9b-chat Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * fix eos tokens to glm4 * remove unused log * add preprocess to chatglm3 and chatglm4 * add eos_id_list to llama.cpp * fix code style * fix code style * fix conflicts * fix conflicts * Revert "add eos_id_list to llama.cpp" This reverts commit `3a4d5790bf`. * set <\|endoftext\|> as eos and <\|user\|> as eot * fix chat template bug * add comment to glm prefix and suffix * fix conflicts and add rope_ratio & ChatGLMForConditionalGeneration * fix chat template bug * fix codestyle * fix conflicts * modified the general name of glm model * fix conflicts * remove prefix and suffix * use normal glm4 chattempalte & use LLM_FFN_SWIGLU in phi3 * fix: resolve Flake8 errors in `convert-hf-to-gguf.py` - Fix E302 by adding two blank lines before top-level function definitions - Replace print statements to fix NP100 - Fix E303 by ensuring only one blank line between lines of code * fix rope ratio to solve incorrect answers * fix by comments --------- Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> Co-authored-by: XingXing Qiao <qiaoxx@dingdao.com> Co-authored-by: Umpire2018 <138990495+Umpire2018@users.noreply.github.com>	2024-07-07 15:52:10 +03:00
jaime-m-p	213701b51a	Detokenizer fixes (#8039 ) * Add llama_detokenize(): - Update header files location - UNKNOWN and CONTROL are 'special pieces' - Remove space after UNKNOWN and CONTROL - Refactor llama_token_to_piece() - Add flag: clean_up_tokenization_spaces - Symmetric params for llama_tokenize() and llama_detokenize() * Update and fix tokenizer tests: - Using llama_detokenize() - Unexpected vocab type as test fail instead of error - Useful when automating tests: - If you don't know in advance the vocab type - Differenciate other loading errors - Skip unicode surrogaes and undefined - Gracefully exit threads - Using exit() is throwing random exceptions - Clean old known problematic codepoints - Minor: confusing hexadecimal codepoint * Update bruteforce random tests - Add detokenizer checks - New generator: ascii_lr_strip - New generator: apostrophe - Add more vocabs files - Detokenize special tokens. - Replace errors with '\uFFFD' when detokenizing to 'utf-8' - More edge cases - Better detokenization results check * Fix add_space_prefix, set false by default * Better leading space removal * Do not remove space when decoding special tokens * Bugfix: custom regexs splits undefined unicode codepoints * 'viking' detokenizer clean spaces	2024-07-05 19:01:35 +02:00
Clint Herron	07a3fc0608	Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (#8258 )	2024-07-02 12:18:10 -04:00
slaren	0e0590adab	cuda : update supports_op for matrix multiplication (#8245 )	2024-07-02 09:39:38 +03:00
Xuan Son Nguyen	9ef0780062	Fix new line issue with chat template, disable template when in-prefix/suffix is set (#8203 ) * preserve new line llama_chat_format_single * disable chat template if in-prefix/suffix is set * remove redundant change	2024-06-30 20:27:13 +02:00
Olivier Chafik	8748d8ac6f	json: attempt to skip slow tests when running under emulator (#8189 )	2024-06-28 18:02:05 +01:00
Xuan Son Nguyen	26a39bbd6b	Add MiniCPM, Deepseek V2 chat template + clean up `llama_chat_apply_template_internal` (#8172 ) * tmp_contains * minicpm chat template * add DeepSeek Lite template * change deepseek-lite to deepseek2 * correct code comment * correct code from master branch	2024-06-28 15:11:44 +02:00
Olivier Chafik	139cc621e9	`json`: restore default additionalProperties to false, fix some pattern escapes (#8180 ) * json: expand ESCAPED_IN_REGEXPS_BUT_NOT_IN_LITERALS charset * json: revert default of additionalProperties to false * Update README.md	2024-06-28 09:26:45 +01:00
Georgi Gerganov	f3f65429c4	llama : reorganize source code + improve CMake (#8006 ) * scripts : update sync [no ci] * files : relocate [no ci] * ci : disable kompute build [no ci] * cmake : fixes [no ci] * server : fix mingw build ggml-ci * cmake : minor [no ci] * cmake : link math library [no ci] * cmake : build normal ggml library (not object library) [no ci] * cmake : fix kompute build ggml-ci * make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE ggml-ci * move public backend headers to the public include directory (#8122) * move public backend headers to the public include directory * nix test * spm : fix metal header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * scripts : fix sync paths [no ci] * scripts : sync ggml-blas.h [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-26 18:33:02 +03:00
Olivier Chafik	9b2f16f805	`json`: better support for "type" unions (e.g. nullable arrays w/ typed items) (#7863 ) * json: better suport for "type" arrays (e.g. `{"type": ["array", "null"], "items": {"type": "string"}}`) * json: add test for type: [array, null] fix * update tests	2024-06-26 01:46:35 +01:00
Olivier Chafik	6777c544bd	`json`: fix additionalProperties, allow space after enum/const (#7840 ) * json: default additionalProperty to true * json: don't force additional props after normal properties! * json: allow space after enum/const * json: update pydantic example to set additionalProperties: false * json: prevent additional props to redefine a typed prop * port not_strings to python, add trailing space * fix not_strings & port to js+py * Update json-schema-to-grammar.cpp * fix _not_strings for substring overlaps * json: fix additionalProperties default, uncomment tests * json: add integ. test case for additionalProperties * json: nit: simplify condition * reformat grammar integ tests w/ R"""()""" strings where there's escapes * update # tokens in server test: consts can now have trailing space	2024-06-26 01:45:58 +01:00
Daniel Bevenius	e6bf007744	llama : return nullptr from llama_grammar_init (#8093 ) * llama : return nullptr from llama_grammar_init This commit updates llama_grammar_init to return nullptr instead of throwing an exception. The motivation for this is that this function is declared inside an extern "C" block and is intended/may be used from C code which will not be able to handle exceptions thrown, and results in undefined behavior. On Windows and using MSVC the following warning is currently generated: ```console C:\llama.cpp\llama.cpp(13998,1): warning C4297: 'llama_grammar_init': function assumed not to throw an exception but does C:\llama.cpp\llama.cpp(13998,1): message : __declspec(nothrow), throw(), noexcept(true), or noexcept was specified on the function ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * squash! llama : return nullptr from llama_grammar_init Add checks for nullptr when calling llama_grammar_init. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: Clint Herron <hanclinto@gmail.com>	2024-06-25 15:07:28 -04:00
Olivier Chafik	84631fe150	`json`: support integer minimum, maximum, exclusiveMinimum, exclusiveMaximum (#7797 ) * json: support minimum for positive integer values * json: fix min 0 * json: min + max integer constraints * json: handle negative min / max integer bounds * json: fix missing paren min/max bug * json: proper paren fix * json: integration test for schemas * json: fix bounds tests * Update json-schema-to-grammar.cpp * json: fix negative max * json: fix negative min (w/ more than 1 digit) * Update test-grammar-integration.cpp * json: nit: move string rules together * json: port min/max integer support to Python & JS * nit: move + rename _build_min_max_int * fix min in [1, 9] * Update test-grammar-integration.cpp * add C++11-compatible replacement for std::string_view * add min/max constrained int field to pydantic json schema example * fix merge * json: add integration tests for min/max bounds * reshuffle/merge min/max integ test cases * nits / cleanups * defensive code against string out of bounds (apparently different behaviour of libstdc++ vs. clang's libc++, can't read final NULL char w/ former)	2024-06-25 20:06:20 +01:00
Xuan Son Nguyen	48e6b92cc3	Add chat template support for llama-cli (#8068 ) * add chat template support for llama-cli * add help message * server: simplify format_chat * more consistent naming * improve * add llama_chat_format_example * fix server * code style * code style * Update examples/main/main.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-25 21:56:49 +10:00
slaren	b6b9a8e606	fix CI failures (#8066 ) * test-backend-ops : increase cpy max nmse * server ci : disable thread sanitizer	2024-06-23 13:14:45 +02:00
Clint Herron	c5a8d4b749	JSON Schema to GBNF integration tests (#7790 ) * Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars. * Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program. * Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs. * Merging improved schema test methods added by @ochafik in #7797 * Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework. * Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings. * Fixing grammar indentation to be consistent throughout file.	2024-06-21 23:18:36 -04:00
jaime-m-p	37bef89433	tokenizer : BPE fixes (#7530 ) * Random test: add_bos_token, add_eos_token * Random test: add BPE models for testing * Custom regex split fails with codepoint 0 * Fix falcon punctuation regex * Refactor llm_tokenizer_bpe: move code to constructor * Move 'add_special_bos/eos' logic to llm_tokenizer_bpe * Move tokenizer flags to vocab structure. * Default values for special_add_bos/eos * Build vocab.special_tokens_cache using vocab token types * Generalize 'jina-v2' per token attributes * Fix unicode whitespaces (deepseek-coder, deepseek-llm) * Skip missing byte tokens (falcon) * Better unicode data generation * Replace char32_t with uint32_t	2024-06-18 18:40:52 +02:00
Calvin Laurenson	43b35e38ba	Add support for sqrt on CUDA (#7953 ) * cuda sqrt support * enable cuda in pca * fix comments in pca * add test * add sqrt to ggml_backend_cuda_supports_op * fix test * new line * Use F32 sqrtf instead of F64 sqrt Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-06-17 00:23:04 +02:00
Georgi Gerganov	a9cae48003	tests : add non-cont unary tests (#7857 ) * tests : add non-cont unary tests * ggml : update unary asserts and "supports_op" ggml-ci	2024-06-12 16:00:22 +03:00
Georgi Gerganov	4bfe50f741	tests : check the Python version (#7872 ) ggml-ci	2024-06-11 10:10:20 +03:00
Olivier Chafik	b61eb9644d	json: refine constraint for whitespace to avoid runaways yet allow pretty print (#7866 )	2024-06-11 02:22:57 +01:00
Olivier Chafik	396b18dfec	`json`: document schema conversion in GBNF readme, align manual grammar examples & converters (#7841 ) * json: fix char pattern in grammar converters * json: prevent number precision & whitespace runaways in example grammars * json: add doc to grammar readme	2024-06-11 01:00:30 +01:00
Clint Herron	ad675e1c67	Added support for . (any character) token in grammar engine. (#6467 ) * Added support for . (any characer) token in grammar engine. * Add integration tests for any-character symbol.	2024-06-06 06:08:52 -07:00
Olivier Chafik	55b2d0849d	grammars: x{min,max} repetition operator (#6640 ) * grammars: x{min,max} repetition operator + tweak +//? to avoid duplication of original over alternates grammars: handle `x{n}` and fix `x{n,n}` * grammars: document new repetition operators * grammars: uniform use of int for min & max * grammars: refactor parser test * grammar: parsing tests w/ natural pretty print of updated expectations * grammars: much prettier print of expectations (+ TEST_GRAMMAR_PARSER_PRINT_ALL=1 to force all) * grammars: improve test pretty print again * grammars: pretty print rules and chars * grammars: fix copy rule skipping * grammars: disallow `a{,}` (not allowed in regexps) * Update common/grammar-parser.cpp Co-authored-by: Clint Herron <hanclinto@gmail.com> * grammars: fix copy rule skipping (again) & display of expectations * grammars: more test cases * grammars: update reps parsing to bring ? / * / + closer to before * json: use new GBNF repetitions{m,n} syntax * grammars: update performance gotchas w/ repetition advice * Update examples/json_schema_to_grammar.py Co-authored-by: Clint Herron <hanclinto@gmail.com> * Update examples/server/public/json-schema-to-grammar.mjs Co-authored-by: Clint Herron <hanclinto@gmail.com> * grammars: comment on rule repetitions * grammars: ensure unambiguous number alternatives * grammar: nit typo switched error msgs * grammar: nit numbering in comment * json: update numeric rule to be unambiguous * Apply suggestions from code review Co-authored-by: Clint Herron <hanclinto@gmail.com> * Update examples/server/public/json-schema-to-grammar.mjs Co-authored-by: Clint Herron <hanclinto@gmail.com> * json: fix integral-part * grammar: add repetition tests --------- Co-authored-by: Clint Herron <hanclinto@gmail.com>	2024-06-06 10:07:06 +01:00
Georgi Gerganov	2b3389677a	ggml : refactor rope norm/neox (#7634 ) * ggml : unify rope norm/neox (CPU) * ggml : fix compile warning * ggml : remove GLM rope mode ggml-ci * metal : better rope implementation ggml-ci * cuda : better rope implementation ggml-ci * naming : n_orig_ctx -> n_ctx_orig ggml-ci * dev : add reminders to update backends ggml-ci * vulkan : fix ggml_rope_ext() usage * cuda : fix array size + indents ggml-ci	2024-06-05 11:29:20 +03:00
jaime-m-p	3b38d48609	Per token attributes (#7685 ) * Add per token attributes enum * Using phi-3 for testing 'rstrip' * Using jina-v2 for testing 'lstrip' * Brute force test for 'lstrip' and 'rstrip' * Implement 'rstrip' and 'lstrip' * Update phi-3 GGUF file (obsolete since `917dc8c`) * Replace llama_token_type with llama_token_attribs	2024-06-04 09:17:17 +02:00
Johannes Gäßler	e141ce624a	Fix FlashAttention debug test, FP32 assert (#7684 )	2024-06-01 23:26:10 +02:00
Johannes Gäßler	9b596417af	CUDA: quantized KV support for FA vec (#7527 ) * CUDA: quantized KV support for FA vec * try CI fix * fix commented-out kernel variants * add q8_0 q4_0 tests * fix nwarps > batch size * split fattn compile via extern templates * fix flake8 * fix metal tests * fix cmake * make generate_cu_files.py executable * add autogenerated .cu files * fix AMD * error if type_v != FP16 and not flash_attn * remove obsolete code	2024-06-01 08:44:14 +02:00
Georgi Gerganov	0c27e6f62e	ggml : fix loongson compile warnings (#7537 ) * ggml : fix loongson compile warnings ggml-ci * Fix loongarch quantize test fail. Fix unexpected error introduced during rebase code. * tests : disable json test due to lack of python on the CI node ggml-ci --------- Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>	2024-05-31 14:17:10 +03:00
Georgi Gerganov	fb76ec31a9	ggml : fix YARN + add tests + add asserts (#7617 ) * tests : add rope tests ggml-ci * ggml : fixes (hopefully) ggml-ci * tests : add non-cont tests ggml-ci * cuda : add asserts for rope/norm + fix DS2 ggml-ci * ggml : assert contiguousness * tests : reduce RoPE tests ggml-ci	2024-05-29 20:17:31 +03:00
Georgi Gerganov	cce3dcffc5	cuda : non-cont concat support (#7610 ) * tests : add non-cont concat tests * cuda : non-cont concat support ggml-ci	2024-05-29 15:38:26 +03:00
jaime-m-p	02c1ecad07	Tokenizer WPM fixes (#7500 ) * Update random test: add_bos_token. * Update random test: add WPM models for testing. * Build vocab.special_tokens_cache using vocab token types. * Fix and improve WPM preprocessing. - Fix unicode edge case combinations. - Split by whitspace in the same pass. * Discard all tokens when no matching found.	2024-05-28 21:46:34 +02:00
Georgi Gerganov	edc29433fa	tests : fix test-tokenizer-0.sh	2024-05-28 15:04:09 +03:00
Georgi Gerganov	0548a4187f	ggml : generalize GGML_OP_CONCAT (#7563 ) * ggml : generalize GGML_OP_CONCAT (WIP) ggml-ci * tests : add dim != 2 tests * metal : generalize concat kernel * tests : naming * cuda : generalize concat kernel ggml-ci * sycl : add warning and assert * ggml : fix op params handling * metal : bugfix kernel ggml-ci * ggml : reimplement CPU and Metal * cuda : add asserts ggml-ci * ggml : fix ptrs ggml-ci	2024-05-28 11:04:19 +03:00
Tristan Druyen	007489e895	Fix phi3 chat template confusion with zephyr (#7449 ) * Fix phi3 template matching vs zephyr * Add regression test for new phi3 chat template * Implement review suggestions * Fix phi3 jinja test templates & match by <\|end\|> * Apply suggestion Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * Add all phi3 template variants in tests * Remove unneeded message trimming Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * Fix tests to not expect trimmed messages --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-05-23 16:15:15 +02:00
Georgi Gerganov	d48c88cbd5	ggml : remove ggml_flash_attn and ggml_flash_ff (#7463 ) ggml-ci	2024-05-23 10:00:44 +03:00
Georgi Gerganov	3e5faa8503	cuda : fix rope + add tests (#7452 ) * cuda : fix rope pos data ggml-ci * ggml : drop mode & 1 == 1 support for ggml_rope ggml-ci * ggml : support freq_factors for f16 rope (CPU) ggml-ci * tests : add rope tests using frequency factors ggml-ci	2024-05-22 11:01:35 +03:00
liuwei-git	201cc11afa	llama : add phi3 128K model support (#7225 ) * add phi3 128k support in convert-hf-to-gguf * add phi3 128k support in cuda * address build warnings on llama.cpp * adjust index value in cuda long rope freq factors * add long rope support in ggml cpu backend * make freq factors only depend on ctx size * remove unused rope scaling type 'su' frin gguf converter * fix flint warnings on convert-hf-to-gguf.py * set to the short freq factor when context size is small than trained context size * add one line of comments * metal : support rope freq_factors * ggml : update ggml_rope_ext API to support freq. factors * backends : add dev messages to support rope freq. factors * minor : style * tests : update to use new rope API * backends : fix pragma semicolons * minor : cleanup * llama : move rope factors from KV header to tensors * llama : remove tmp assert * cuda : fix compile warning * convert : read/write n_head_kv * llama : fix uninitialized tensors --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-21 23:28:32 +03:00
Georgi Gerganov	c3f8d58356	tests : test-tokenizer-0.sh print more info (#7402 )	2024-05-21 19:53:48 +03:00
jaime-m-p	d7e852c1bc	Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425 ) * Update brute force test: add_special * Update brute force test: default values for add_bos_token and add_eos_token * Enable rtrim when pre-inserting BOS Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Revert "server : fix test regexes"	2024-05-21 14:39:48 +02:00
jaime-m-p	917dc8cfa6	Tokenizer SPM fixes for phi-3 and llama-spm (#7375 ) * Update brute force test: special tokens * Fix added tokens - Try to read 'added_tokens.json'. - Try to read 'tokenizer_config.json'. - Try to read 'tokenizer.json'. * Fix special tokens rtrim Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server : fix test regexes	2024-05-20 20:15:57 +02:00
slaren	05834841dc	ggml : fix quants nans when all the group weights are very close to zero (#7313 )	2024-05-18 02:39:54 +02:00
jaime-m-p	b43272afa2	Unicode codepoint flags for custom regexs (#7245 ) * Replace CODEPOINT_TYPE_* with codepoint_flags * Update and bugfix brute force random test * Deterministic brute force random test * Unicode normalization NFD * Get rid of BOM	2024-05-18 01:09:13 +02:00
John Balis	48aa8fd1f2	ggml : add `ggml_upscale_ext` (ggml/814) * initial commit with CPU implementation of upscale to shape and test, cuda implementation next * experimental commit to see if dst shape is correct * test version * test * removed unnecessary params * refactor * fixed tests * ggml : metal impl + cleanup + sycl dev warnings * patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior * metal : fix upsacle op to support nb00 + style --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-15 13:23:33 +03:00
Georgi Gerganov	e8a7fd4fb0	metal : support FA without mask + add asserts (#7278 ) * ggml : fa without mask + add asserts ggml-ci * metal : support non-contiguous KV ggml-ci	2024-05-14 19:09:30 +03:00
Haggai Nuchi	e0f556186b	Add left recursion check: quit early instead of going into an infinite loop (#7083 ) * Add left recursion check: quit early instead of going into an infinite loop * Remove custom enum, rename left recursion check and move to "grammar internal" section, add handling for edge case where a leftmost nonterminal may be empty * Remove unnecessary declaration	2024-05-14 15:25:56 +10:00
Johannes Gäßler	dc685be466	CUDA: add FP32 FlashAttention vector kernel (#7188 ) * CUDA: add FP32 FlashAttention vector kernel * fixup! CUDA: add FP32 FlashAttention vector kernel * fixup! fixup! CUDA: add FP32 FlashAttention vector kernel * fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel	2024-05-12 19:40:45 +02:00
Haoxiang Fei	f99e1e456e	llama : lookup word in vocab before doing BPE merges (#7193 ) * fix: llama-3 ignore_merges * test: add test for llama-3 bpe ignore_merges * fix: set ignore_merges only for llama-3 * fix: test-tokenizer-1-bpe --ingore-merges detection * fix: copy to fix fallthrough * fix: change ignore_merges to bool * fix: add ignore merges tests to cmake * llama : alternative merge ignore logic --------- Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-11 11:12:06 +03:00
Georgi Gerganov	9cb317f77e	ggml : full ALiBi support (#7192 ) * ggml : full ALiBi support * ggml : update ggml_soft_max_ext() CUDA, SYCL * ggml : ggml_flash_attn_ext() support ALiBi (CPU) * ggml : ggml_flash_attn_ext() support ALiBi (Metal) * ggml : fix warning * ggml : ggml_flash_attn_ext() support ALiBi (CUDA) ggml-ci * ggml : fix assert message * vulkan : add dev notes * ggml : require mask when using ALiBi ggml-ci * convert : fix convert for refact models	2024-05-11 10:32:41 +03:00
jaime-m-p	43248e5594	llama3 custom regex split (#6965 ) * merged the changes from deepseeker models to main branch * Moved regex patterns to unicode.cpp and updated unicode.h * Moved header files * Resolved issues * added and refactored unicode_regex_split and related functions * Updated/merged the deepseek coder pr * Refactored code * Adding unicode regex mappings * Adding unicode regex function * Added needed functionality, testing remains * Fixed issues * Fixed issue with gpt2 regex custom preprocessor * unicode : fix? unicode_wstring_to_utf8 * lint : fix whitespaces * tests : add tokenizer tests for numbers * unicode : remove redundant headers * tests : remove and rename tokenizer test scripts * tests : add sample usage * gguf-py : reader prints warnings on duplicate keys * llama : towards llama3 tokenization support (wip) * unicode : shot in the dark to fix tests on Windows * unicode : first try custom implementations * convert : add "tokenizer.ggml.pre" GGUF KV (wip) * llama : use new pre-tokenizer type * convert : fix pre-tokenizer type writing * lint : fix * make : add test-tokenizer-0-llama-v3 * wip * models : add llama v3 vocab file * llama : adapt punctuation regex + add llama 3 regex * minor * unicode : set bomb * unicode : set bomb * unicode : always use std::wregex * unicode : support \p{N}, \p{L} and \p{P} natively * unicode : try fix windows * unicode : category support via std::regex * unicode : clean-up * unicode : simplify * llama3 custom regex split * convert : add convert-hf-to-gguf-update.py ggml-ci * lint : update * convert : add falcon ggml-ci * unicode : normalize signatures * lint : fix * lint : fix * convert : remove unused functions * convert : add comments * convert : exercise contractions ggml-ci * Using char32_t for codepoints * lint : fix * already exists unicode_tolower() * Typing * Restore BOM * cmake : refactor test targets * tests : refactor vocab tests ggml-ci * tests : add more vocabs and tests ggml-ci * unicode : cleanup * scripts : ignore new update script in check-requirements.sh * Fix merge * models : add phi-3, mpt, gpt-2, starcoder * tests : disable obsolete ggml-ci * tests : use faster bpe test ggml-ci * llama : more prominent warning for old BPE models * tests : disable test-tokenizer-1-bpe due to slowness ggml-ci * Move unused variable value * GPT2 custom regex split * Add alternative regex for custom aplit llama3 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Style * Add bruteforce random tests for token encoding * wip: fixing unicode codepoint ranges * Fix merge * Unicode tables: separator, lowercase, uppercase and whitespace * llama3 custom regex split: fix \s * Restore BOM * Style * wip: generate NDF table * Ignore special tokens for testing * Clean gen-unicode-data.py * Refactor random tokenizer test * lint : fix * tests : add fail test for llama-bpe --------- Co-authored-by: Jaggzh <jaggz.h@gmail.com> Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: jaime-m-p <>	2024-05-09 23:30:44 +10:00
Johannes Gäßler	a743d76a01	CUDA: generalize FP16 fattn vec kernel (#7061 ) * CUDA: generalize FP16 fattn vec kernel * disable unsupported head sizes for AMD in test * try AMD fix * fix batch size 2-8 * partially revert changes	2024-05-09 14:32:02 +02:00
Johannes Gäßler	c12452c7ae	JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143 )	2024-05-08 21:53:08 +02:00
Ren Xuancheng	229ffff872	llama : add BPE pre-tokenization for Qwen2 (#7114 ) * Add BPE pre-tokenization for Qwen2. * minor : fixes --------- Co-authored-by: Ren Xuancheng <17811943+jklj077@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-08 15:06:43 +03:00
Justine Tunney	3855416027	ggml : introduce bfloat16 support (#6412 ) * Introduce bfloat16 support Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as their canonical floating point format. ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌──┴───┐┌─┴───┐ 0b0000000000000000 brain16 This encoding has the same number of exponent bits as float32. That makes conversion relatively straightforward, even in the absence of hardware support. For example, converting brain16 to binary32 means simply shifting 16 bits to the left. ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌──┴───┐┌─┴───────────────────┐ 0b00000000000000000000000000000000 IEEE binary32 The issue is that converting bf16 to fp16 can result in information loss. Only 13% of bf16 numbers can be precisely represented in fp16 which in practice ends up being 99.71% of Mistral 7b v0.2's weights however there is currently no way other than fp32 to get the others ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌─┴─┐┌─┴──────┐ 0b0000000000000000 IEEE binary16 This change fixes that, by adding a bf16 data type to GGML. Support for CPU inference has been implemented along with optimizations for the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2 improves somewhere around -0.0024 to -0.0046 compared to using fp16 * Remove GGML code that's not needed * Minimize the GGML API surface area for BF16 * Remove bf16 luts * Make the GGML header look nicer * Fix documentation * Apply ggerganov's fixes for test-backend-ops * Add BF16 code for new ggml_validate_row_data() function	2024-05-08 09:30:09 +03:00
DAN™	889bdd7686	command-r : add BPE pre-tokenization (#7063 ) * Add BPE pre-tokenization for Command-R/R+. * Bump transformers convert requirement. * command-r : add individual digits regex --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-05 08:19:30 +03:00
Brian	6fbd432211	py : logging and flake8 suppression refactoring (#7081 ) Set one as executable and add basicConfig() to another. Also added noqa tag to test scripts.	2024-05-05 08:07:48 +03:00
Georgi Gerganov	92139b90af	tests : add test-tokenizer-0.sh + fix some tokenizers (#7036 ) * tests : add test-tokenizer-0.sh * unicode : add all unicode number ranges * starcoder : fix pre-tokenizer * tests : add test that fails with DeepSeek tokenizers * falcon : fix regex * unicode : regenerate unicode tables * refact : add tokenizer model * lint : fix * tests : disable failing tests ggml-ci * refact : add tests files ggml-ci * convert : print -> logging ggml-ci * lint : fix * unicode : digit -> number * phi-3 : update	2024-05-04 08:32:32 +03:00
Brian	a2ac89d6ef	convert.py : add python logging instead of print() (#6511 ) * convert.py: add python logging instead of print() * convert.py: verbose flag takes priority over dump flag log suppression * convert.py: named instance logging * convert.py: use explicit logger id string * convert.py: convert extra print() to named logger * convert.py: sys.stderr.write --> logger.error * .py: Convert all python scripts to use logging module requirements.txt: remove extra line * flake8: update flake8 ignore and exclude to match ci settings * gh-actions: add flake8-no-print to flake8 lint step * pre-commit: add flake8-no-print to flake8 and also update pre-commit version * convert-hf-to-gguf.py: print() to logger conversion * .py: logging basiconfig refactor to use conditional expression .py: removed commented out logging fixup! .py: logging basiconfig refactor to use conditional expression constant.py: logger.error then exit should be a raise exception instead * .py: Convert logger error and sys.exit() into a raise exception (for atypical error) gguf-convert-endian.py: refactor convert_byteorder() to use tqdm progressbar * verify-checksum-model.py: This is the result of the program, it should be printed to stdout. * compare-llama-bench.py: add blank line for readability during missing repo response * reader.py: read_gguf_file() use print() over logging * convert.py: warning goes to stderr and won't hurt the dump output * gguf-dump.py: dump_metadata() should print to stdout * convert-hf-to-gguf.py: print --> logger.debug or ValueError() * verify-checksum-models.py: use print() for printing table * .py: refactor logging.basicConfig() gguf-py/gguf/.py: use __name__ as logger name Since they will be imported and not run directly. python-lint.yml: use .flake8 file instead * constants.py: logger no longer required * convert-hf-to-gguf.py: add additional logging * convert-hf-to-gguf.py: print() --> logger * .py: fix flake8 warnings revert changes to convert-hf-to-gguf.py for get_name() * convert-hf-to-gguf-update.py: use triple quoted f-string instead * .py: accidentally corrected the wrong line *.py: add compilade warning suggestions and style fixes	2024-05-03 22:36:41 +03:00
Georgi Gerganov	9c67c2773d	ggml : add Flash Attention (#5021 ) * ggml : add ggml_flash_attn_ext API * ggml : fix GQA support in ggml_flash_attn_ext * ggml : online attention (CPU) * metal : initial implementation * metal : f16 precision * metal : reduce branches * metal : specialize for head size * wip : 8 rows per simd group * wip : 4 rows per simd group * wip : template for rows per warp * metal : parallelize across KV size * metal : parallel reduce across heads * metal : efficient flash_attn_f16 implementation * metal : avoid redundant loads of the attention * metal : scale and mask in matrix form * metal : fix comment * llama : avoid ggml_cast, use F32 query * metal : add parallel reduce version (disabled) * metal : move output into local memory + optimize - the result from each simdgroup now stays in the registers - significantly reduced SRAM usage - more efficient skipping of -INF blocks - avoid simdgroup barrier in hot loop - add comments * metal : add tests, fix scaling, support C > 32 * metal : improve precision * ggml : fix f16 mad * metal : minor * metal : support Q > 8 * tests : add ATTN tests * metal : disable buffer allocation logs * tests : more * metal : faster inner loop for C == 32 * metal : fix array initialization * tests : ifdef * ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext * ggml : fix ggml_soft_max mask requirement * cuda : fix soft_max to use correct mask size * cuda : add flash_attn kernel (wip) * metal : optimize softmax for C > 32 * metal : optimize softmax * tests : minor fix * cuda : avoid zeroing fragments * tests : update dims * cuda : fix __hisinf() result check * cuda : avoid warp_reduce for smax * cuda : use int instead of int64_t Noticeably improves performance (thanks to Johannes) * cuda : make loops use the same loop values Thanks Johannes again for the tip * cuda : unroll some of the loops * cuda : avoid __hisinf branches * cuda : use half2 in softmax * cuda : switch to 1 warp for bs > 16 * cuda : speed-up reduce part of the kernel * cuda : unroll QK^T loop cuda : fix -INF block check * cuda : simplify softmax * cuda : fix matrix names * cuda : minor * llama : adapt to F16 KQ_pos * llama : adapt new models to F16 KQ_mask * ggml : fix F16 store (ARM NEON) * llama : fix type of KQ_mask and KQ_pos * ggml : fix CPU soft_max * tests : add hs=256 * cuda : fix build * metal : improve perf via smaller int registers * cuda : adapt soft_max to F16 mask and pos * CUDA: faster FlashAttention, kernel for bs == 1 * 16 cols for Phi-2 * no vec for hs, no hs==256 ncols==32 for Volta * adjust kernel selection logic * 4 warps, 256 stride for all D * no ncols == 64 * Multiple parallel blocks for batch size 1 * fix compile warnings * fix excessive KQ_b loads * fix cmake build * fix KV cache padding, NaN from INFINITY (#6438) * llama : flash_attn cparam + fix defrag * server: support flash_attn param * server: bench: enable flash_attn param * CUDA: refactor host code, dyn. par. blocks * fix flash_attn_vec_f16 race condition * flush softmax exp below threshold to 0 * store temp KQ in registers * Calculate KQ as FP32 if KQV has GGML_PREC_F32 * Add __hgt2_mask implementation for CUDA 11 * fix KQ FP32 precision fpr parallel_blocks > 1 * llama-bench : add -fa,--flash-attn arg * metal : add BS=1 kernel for flash attention (#6508) * metal : add BS=1 kernel for flash attention (wip) * metal : support more than 1 warps * metal : opts * metal : opt * metal : switch to parallel reduce * metal : reduce registers * metal : simplify * metal : initial FA vec kernel * metal : use F32 attention accumulators * batched-bench : add fattn arg * llama : simplify llama_build_kv_store ggml-ci * llama : adapt build_olmo to changes * ggml : fix arm fp16 store on windows * metal : clean-up * metal : clean-up kernel code * metal : minor * tests : remove benchmarks ggml-ci * ggml : fix avx512 const correctness ggml-ci * ggml : fix soft_max with bias on CPU ggml-ci * common : print --flash-attn in help * ggml : fix num dimensions in ggml_flash_attn_ext * llama : force disable flash attention for incompatible models * ggml : ggml_soft_max support F16/F32 mask/pos ggml-ci * cuda : uint -> uint32_t * cuda : "constexpr dim3" -> "const dim3" ggml-ci * cuda : try to fix __hgt2_mask ggml-ci * ggml : add TODO's for F16/F32 mask/pos support in other backends * llama : replace bool need_kq_pos with use_alibi * llama : prep ALiBi support for BERT models ggml-ci * llama : fix n_batch requirements ggml-ci * cont * server : add help for --flash-attn arg * llama : disable FA for AMD * tests : remove TMP_ATTN_BENCH ggml-ci * llama : support save/load state with FA enabled ggml-ci * ci : add CUDA save-load-state tests ggml-ci * llama : llama_kv_cache_clear zeroes data + fix save-load seq ggml-ci * llama : fix copy-paste errors, add TODO * llama : disallow incompatible states * llama : update llama_state_get_size after v_trans field * metal : remove tmp log * llama : add static reminder for llama_state_get_size * metal : fix max nsg ggml-ci * ci : fix arg order ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>	2024-04-30 12:16:08 +03:00
Clint Herron	b8c1476e44	Extending grammar integration tests (#6644 ) * Cleaning up integration tests to share code between tests and make it simpler to add new tests. * Add tests around quantifiers to ensure both matching and non-matching compliance. * Add slightly more complex grammar with quantifiers to test references with quantifiers. * Fixing build when C++17 is not present. * Separating test calls to give more helpful stack traces on failure. Adding verbose messages to give visibility for what is being tested. * Adding quotes around strings to explicitly show whitespace * Removing trailing whitespace. * Implementing suggestions from @ochafik -- grammars and test strings now print and flush before tests to aid in debugging segfaults and whatnot. * Cleaning up forgotten symbols. Modifying simple test to use test harness. Added comments for more verbose descriptions of what each test is accomplishing. * Unicode symbol modifications to hopefully make log easier to parse visually.	2024-04-29 14:40:14 -04:00
Georgi Gerganov	f4ab2a4147	llama : fix BPE pre-tokenization (#6920 ) * merged the changes from deepseeker models to main branch * Moved regex patterns to unicode.cpp and updated unicode.h * Moved header files * Resolved issues * added and refactored unicode_regex_split and related functions * Updated/merged the deepseek coder pr * Refactored code * Adding unicode regex mappings * Adding unicode regex function * Added needed functionality, testing remains * Fixed issues * Fixed issue with gpt2 regex custom preprocessor * unicode : fix? unicode_wstring_to_utf8 * lint : fix whitespaces * tests : add tokenizer tests for numbers * unicode : remove redundant headers * tests : remove and rename tokenizer test scripts * tests : add sample usage * gguf-py : reader prints warnings on duplicate keys * llama : towards llama3 tokenization support (wip) * unicode : shot in the dark to fix tests on Windows * unicode : first try custom implementations * convert : add "tokenizer.ggml.pre" GGUF KV (wip) * llama : use new pre-tokenizer type * convert : fix pre-tokenizer type writing * lint : fix * make : add test-tokenizer-0-llama-v3 * wip * models : add llama v3 vocab file * llama : adapt punctuation regex + add llama 3 regex * minor * unicode : set bomb * unicode : set bomb * unicode : always use std::wregex * unicode : support \p{N}, \p{L} and \p{P} natively * unicode : try fix windows * unicode : category support via std::regex * unicode : clean-up * unicode : simplify * convert : add convert-hf-to-gguf-update.py ggml-ci * lint : update * convert : add falcon ggml-ci * unicode : normalize signatures * lint : fix * lint : fix * convert : remove unused functions * convert : add comments * convert : exercise contractions ggml-ci * lint : fix * cmake : refactor test targets * tests : refactor vocab tests ggml-ci * tests : add more vocabs and tests ggml-ci * unicode : cleanup * scripts : ignore new update script in check-requirements.sh * models : add phi-3, mpt, gpt-2, starcoder * tests : disable obsolete ggml-ci * tests : use faster bpe test ggml-ci * llama : more prominent warning for old BPE models * tests : disable test-tokenizer-1-bpe due to slowness ggml-ci --------- Co-authored-by: Jaggzh <jaggz.h@gmail.com> Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>	2024-04-29 16:58:41 +03:00
Tristan Druyen	abd3314064	llama : add phi 3 chat template (#6857 ) * Add phi 3 chat template & tests * test : fix chat template result --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-24 11:52:37 +03:00
Wouter	7dbdba5690	llama : add llama-3 chat template (#6751 ) * Added llama-3 chat template * Update llama.cpp Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com> * Update llama.cpp Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com> * Update tests/test-chat-template.cpp Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com> * Added EOS stop sequence according to https://github.com/ggerganov/llama.cpp/pull/6751#issuecomment-2065602862 * Removed adding of BOS token before first message * Removed bos token from expected output from llama-3 * Update tests/test-chat-template.cpp Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com> * Update tests/test-chat-template.cpp Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com> * Added <\|end_of_text\|> as another stop token * Reverted last change of adding the end_of_text stop word for llama 3 --------- Co-authored-by: Wouter Tichelaar <tichelaarw@spar.net> Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com> Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-21 16:03:39 +03:00
slaren	0d56246f4b	ggml : group all experts in a single ggml_mul_mat_id (#6505 ) * ggml : group all experts in a single ggml_mul_mat_id cuda : improve mmid row copy * cuda : fix bin bcast with non-cont src0 * test-backend-ops : only run all mul mat tests for base types * llama : disable moe offloading with SYCL --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-18 15:18:48 +02:00
Shijie	f4dea7da18	llama : add qwen2moe (#6074 ) * support qwen2moe * fix-review * metal : support unary ops for nelements % 4 != 0 * metal : require contiguousness for float4 unary kernels * metal : require contiguousness for float4 unary kernels (cont) * fix-review * names : for brevity "SHARED_EXP" -> "SHEXP" * llama : reuse build_moe_ffn() * llama : add model type name --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-16 18:40:48 +03:00
Olivier Chafik	7593639ce3	`main`: add --json-schema / -j flag (#6659 ) * main: add --json-schema / -j * json: move json-schema-to-grammar to common lib * json: fix zig build	2024-04-15 18:35:21 +01:00
Chao Jiang	04fbc5f23e	Add Command R chat template (#6650 ) * Add chat template for command-r model series * Fix indentation * Add chat template test for command-r models and update the implementation to trim whitespaces * Remove debug print	2024-04-14 18:16:34 +02:00
Olivier Chafik	ab9a3240a9	JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (#6555 ) * json: rename python schema converter to make import easier * server: skip null json_schema / grammar fields * json: deps management for primitive rules (+ allow null values) * json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?` * grammars: add troubleshooting section to readme * json: cap length of numbers to 15 digits before/after decimal point (avoids infinite gen, e.g. "one third" -> `0.333333333333...`) * json: unify all repetition code (w/ or w/o sep) * json: support string minLength/maxLength * server+json: update server/README w/ result_format * nits * json: fix type error w/ python 3.8 * json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions) * json: simplify DOT `{"type": "string", "pattern": "^.$"}` * json: remove recursion in opt_repetitions (avoids Python stack overflow) * json: rm dead code * json: rm useless assert & ggml.h import	2024-04-12 19:43:38 +01:00
slaren	fbbc030ba9	metal : unify mul_mv_id kernels (#6556 )	2024-04-12 18:13:20 +02:00
Olivier Chafik	cbaadc9294	grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) (#6609 ) * grammars: reserve rejects & next candidates * grammars: reuse new_stacks * grammars: fix missing sig change in llama.h * grammars: fix test (api changed) * grammars: update gbnf-validator.cpp * grammars: simpler syntax (no swap)	2024-04-11 19:47:34 +01:00
Clint Herron	57dd02c44b	Tests: Added integration tests for GBNF parser (#6472 ) * Added integration tests for GBNF parser to validate correctness of parsing, as well as correctness of string matching. Intended for use to pin behavior while working on performance improvements. * Fixing whitespace errors and cleaning error message alert to be clearer. * Removing hacky include to llama.cpp from grammar integration test now that needed functions are available via internal API. * Comment cleanup. * Reorganizing tests for readability. * Cleaning up debug message to make a bit more sense.	2024-04-06 10:31:33 -04:00
kaizau	1ff4d9f3d6	Add OpenChat, Alpaca, Vicuna chat templates (#6397 ) * Add openchat chat template * Add chat template test for openchat * Add chat template for vicuna * Add chat template for orca-vicuna * Add EOS for vicuna templates * Combine vicuna chat templates * Add tests for openchat and vicuna chat templates * Add chat template for alpaca * Add separate template name for vicuna-orca * Remove alpaca, match deepseek with jinja output * Regenerate chat template test with add_generation_prompt * Separate deepseek bos from system message * Match openchat template with jinja output * Remove BOS token from templates, unprefix openchat	2024-04-03 17:24:31 +02:00
slaren	08a0c02060	ggml : mul_mat_id use the same tensor for all the experts (#6387 ) * ggml : update mul_mat_id to use the same tensor for all the experts * update cuda * minor * update metal * update test-backend-ops * fix cuda * Update ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update convert.py * update convert-hf-to-gguf.py * update convert.py for mixtral hf models * Update convert-hf-to-gguf.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * cuda : support non-pow-2 number of experts * allow quantize to work for split and merged experts models in the same way * cleanup + disable mmap automatically with split tensors models * update imatrix * test-backend-ops : test qwen argsort * update grok model loading * llama : add merged experts tensors to the grok tensor map * minor * gguf : bump version * fix quantizing of merged experts * convert-hf-to-gguf.py : update grok (untested) * make linter happy * cuda/argsort : use shared memory instead of pool memory * convert : fix grok tensor names * metal : add support for non-pow-2 argsort * llama : more loader cleanup, better error checking * cuda : fix warning * llama : still use mmap for loading old models, but copy the data to a host buffer * add review note * llama : remove ffn tensor counting + add sanity check ggml-ci * convert : fix handling of n_experts == None ggml-ci * imatrix : fix ncall counters * llama : produce error if imatrix size does not match * quantize : terminate on errors + trace logs ggml-ci * metal : pad shared memory to 16 bytes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-03 16:07:05 +03:00
Kawrakow	55c1b2a3bb	IQ1_M: 1.75 bpw quantization (#6302 ) * iq1_m: basics * iq1_m: basics-2 * iq1_m: CUDA dequantize works Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B. * iq1_m: separate shifts for each group of 8 in a block We get PPL(LLaMA-v2-7B ) = 9.2810 PPL(LLaMA-v2-13B) = 6.8105 Not bad, but slightly higher than sqrt(PPL(IQ1_S) * PPL(IQ2_XXS)) which is the expected outcome given that IQ1_M is halfway between IQ1_S and IQ2_XXS in terms of bpw. From this, we would expect PPL = 9.14 for LLaMA-v2-7B PPL = 6.63 for LLaMA-v2-13B * iq1_m: go to 3-bit scales There is slight increase in PPL, but the 0.0625 bpw reduction in size is totally worth it. We now have PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw * iq1_m: scalar dot product * iq1_m: AVX2 dot product * iq1_m: very slightly faster AVX2 dot product * iq1_m: ARM_NEON dot product Works, but very slow (10.5 t/s) * iq1_m: Metal - dequantize works, dot product does not * iq1_m: Metal now works About the same performance as iq1_s. * iq1_m: minor * iq1_m: checking pure iq1_m quantization It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight with Q4_K. * iiq1_m: slightly faster ARM_NEON dot product 10.5 t/s -> 11.65 t/s * iq1_m: faster ARM_NEON dot product 11.65 t/s -> 14.9 t/s * iq1_m: another minor ARM_NEON dot product improvement 14.9 -> 15.0 t/s * iq1_m: small PPL improvement via super-block scale adjustment After quantizing block scales redo the super-block scale fit. PPL(LLaMA-v2-7B ) = 9.3346 PPL(LLaMA-v2-13B) = 6.8419 PPL(LLaMA-v2-70B) = 4.8294 PPL(Mistral-7B ) = 8.1624 * iq1_m: adapt to CUDA refactoring * iq1_m: remove unused variable We have progressed to warnings being errors. * iq1_m: add to backend-ops tests * iq1_m: fix Windows ARM * iq1_m: use common definition of iq1m_scale_t * cuda: assert -> NO_DEVICE_CODE * iq1_M: PR comments --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-26 15:21:27 +01:00
Kawrakow	1f2fd4e727	tests : include IQ2_XXS and IQ2_XS in test-quantize-fns (#6303 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-25 19:33:15 +02:00
Olivier Chafik	f77a8ffd3b	tests : conditional python & node json schema tests (#6207 ) * json: only attempt python & node schema conversion tests if their bins are present Tests introduced in https://github.com/ggerganov/llama.cpp/pull/5978 disabled in https://github.com/ggerganov/llama.cpp/pull/6198 * json: orange warnings when tests skipped * json: ensure py/js schema conv tested on ubuntu-focal-make * json: print env vars in test	2024-03-22 15:09:07 +02:00
Olivier Chafik	72114edf06	json-schema-to-grammar : fix order of props + non-str const/enum (#6232 ) * json: ordered json in server/schema converter to respect orig order * json: ws nits * json: support non-string const / enums	2024-03-22 15:07:44 +02:00
Georgi Gerganov	95d576b48e	metal : pad n_ctx by 32 (#6177 ) * metal : require ne00 >= 128 for mat-mat kernels ggml-ci * llama : pad n_ctx by 32 ggml-ci	2024-03-22 09:36:03 +02:00
Georgi Gerganov	924ce1dce7	tests : disable system() calls (#6198 ) ggml-ci	2024-03-21 16:20:05 +02:00
Olivier Chafik	5b7b0ac8df	json-schema-to-grammar improvements (+ added to server) (#5978 ) * json: fix arrays (disallow `[,1]`) * json: support tuple types (`[number, string]`) * json: support additionalProperties (`{[k: string]: [string,number][]}`) * json: support required / optional properties * json: add support for pattern * json: resolve $ref (and support https schema urls) * json: fix $ref resolution * join: support union types (mostly for nullable types I think) * json: support allOf + nested anyOf * json: support any (`{}` or `{type: object}`) * json: fix merge * json: temp fix for escapes * json: spaces in output and unrestricted output spaces * json: add typings * json:fix typo * Create ts-type-to-grammar.sh * json: fix _format_literal (json.dumps already escapes quotes) * json: merge lit sequences and handle negatives {"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"} * json: handle pattern repetitions * Update json-schema-to-grammar.mjs * Create regex-to-grammar.py * json: extract repeated regexp patterns to subrule * Update json-schema-to-grammar.py * Update json-schema-to-grammar.py * Update json-schema-to-grammar.py * json: handle schema from pydantic Optional fields * Update json-schema-to-grammar.py * Update json-schema-to-grammar.py * Update ts-type-to-grammar.sh * Update ts-type-to-grammar.sh * json: simplify nullable fields handling * json: accept duplicate identical rules * json: revert space to 1 at most * json: reuse regexp pattern subrules * json: handle uuid string format * json: fix literal escapes * json: add --allow-fetch * json: simplify range escapes * json: support negative ranges in patterns * Delete commit.txt * json: custom regex parser, adds dot support & JS-portable * json: rm trailing spaces * Update json-schema-to-grammar.mjs * json: updated server & chat `( cd examples/server && ./deps.sh )` * json: port fixes from mjs to python * Update ts-type-to-grammar.sh * json: support prefixItems alongside array items * json: add date format + fix uuid * json: add date, time, date-time formats * json: preserve order of props from TS defs * json: port schema converter to C++, wire in ./server * json: nits * Update json-schema-to-grammar.cpp * Update json-schema-to-grammar.cpp * Update json-schema-to-grammar.cpp * json: fix mjs implementation + align outputs * Update json-schema-to-grammar.mjs.hpp * json: test C++, JS & Python versions * json: nits + regen deps * json: cleanup test * json: revert from c++17 to 11 * json: nit fixes * json: dirty include for test * json: fix zig build * json: pass static command to std::system in tests (fixed temp files) * json: fix top-level $refs * json: don't use c++20 designated initializers * nit * json: basic support for reserved names `{number:{number:{root:number}}}` * Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test) * json: re-ran server deps.sh * json: simplify test * json: support mix of additional props & required/optional * json: add tests for some expected failures * json: fix type=const in c++, add failure expectations for non-str const&enum * json: test (& simplify output of) empty schema * json: check parsing in test + fix value & string refs * json: add server tests for OAI JSON response_format * json: test/fix top-level anyOf * json: improve grammar parsing failures * json: test/fix additional props corner cases * json: fix string patterns (was missing quotes) * json: ws nit * json: fix json handling in server when there's no response_format * json: catch schema conversion errors in server * json: don't complain about unknown format type in server if unset * json: cleaner build of test * json: create examples/json-schema-pydantic-example.py * json: fix date pattern * json: move json.hpp & json-schema-to-grammar.{cpp,h} to common * json: indent 4 spaces * json: fix naming of top-level c++ function (+ drop unused one) * json: avoid using namespace std * json: fix zig build * Update server.feature * json: iostream -> fprintf * json: space before & refs for consistency * json: nits	2024-03-21 11:50:43 +00:00
Xuan Son Nguyen	aab606a11f	llama : add Orion chat template (#6066 )	2024-03-15 10:44:57 +02:00
slaren	d8fd0ccf6a	test-backend-ops : skip CPU backend by default (#6028 )	2024-03-13 15:58:30 +02:00
Georgi Gerganov	83796e62bc	llama : refactor unicode stuff (#5992 ) * llama : refactor unicode stuff ggml-ci * unicode : names * make : fix c++ compiler * unicode : names * unicode : straighten tables * zig : fix build * unicode : put nfd normalization behind API ggml-ci * swift : fix build * unicode : add BOM * unicode : add <cstdint> ggml-ci * unicode : pass as cpts as const ref	2024-03-11 17:47:47 +02:00
Georgi Gerganov	5b09797321	ggml : remove old quantization functions (#5942 ) * ggml : remove old quantization functions ggml-ci * ggml : simplify ggml_quantize_chunk ggml-ci * ggml : restrict correctness ggml-ci * ggml : remove hist data from the quantization API ggml-ci * tests : remove hist usage in test-backend-ops ggml-ci * vulkan : remove hist and fix typo	2024-03-09 15:53:59 +02:00
Georgi Gerganov	2c4f566c88	tests : gitignore ggml-common.h	2024-03-09 14:17:11 +02:00
leejet	7d43c585dc	add some new ops, fix some operators and add batch operations to certain operators. (ggml/747) * cuda: fix group_norm * cuda: add batch inference support for ggml_pad/ggml_upscale * add ggml_arrange * add ggml_timestep_embedding * update ggml_arange/ggml_timestep_embedding tests * cuda: fix im2col * add ggml_arange/ggml_timestep_embbeding support for metal backend * fix some bugs * fix some bugs * Update ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-cuda.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * modify according to the review comments * ggml : fix compile warnings + code style * ggml : normalize compute_forward calls + fix seg fault in debug * minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-03-04 10:39:10 +02:00
Kawrakow	0becb22ac0	IQ4_XS: a 4.25 bpw quantization (#5747 ) * Try IQ4_NL with blocks of 64 - does not look good * iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32 * iq4_xs: CUDA works - 133.2 t/s * iq4_xs: AVX2 dot product * iq4_xs: ARM_NEON dot product * iq4_nl: Metal implementation As usual, Metal / Apple Silicon don't like my quants. * iq3_xs: minor fix * iq4_xs: shrink by using IQ3_S for attn_k and attn_q * iq4_xs: revert using IQ3_S for attn_k and attn_v PPL vs size is good, but CPU performance suffers: on M2 Max TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when using IQ3_S vs 133 t/s with pure IQ4_XS. * Fix CI * iq4_xs: Added forgotten check for 256 divisibility --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-27 16:34:24 +02:00
Kawrakow	a33e6a0d2a	Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721 ) * Adding IQ2_S and IQ2_M as a single cumulative commit * Update examples/quantize/quantize.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-26 18:28:38 +02:00
Georgi Gerganov	ab336a9d5e	code : normalize enum names (#5697 ) * coda : normalize enum names ggml-ci * code : cont * code : cont	2024-02-25 12:09:09 +02:00
Kawrakow	4c4cb30736	IQ3_S: a much better alternative to Q3_K (#5676 ) * iq4_nl: squash commits for easier rebase * Basics (quantize, dequantize) * CUDA dequantize and dot product * Slightly faster CUDA dot product (120 t/s) * Switch to 6-bit scales * Scalar dot product * AVX2 dot product * ARM_NEON dot product * Works on metal, but still slow * Slightly better Metal dot product * Another small Metal improvement * Metal dot product is getting there * Faster CUDA dot product * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided * Report the actual bpw * Add _xs mix that is 4.05 bpw for non-MoE models * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl * AVX2 dot product uses Q8_0 instead of Q8_K * Add to test-backend-ops * Minor fix * Also use use Q5_K for attn_output in MoE models * Fixes after merging latest master * Switching to blocks of 32 * AVX2 for blocks of 32 * Scaler dot product for blocks of 32 * ARM_NEON dot product for blocks of 32 * Metal kernels for blocks of 32 * Slightly faster Metal kernels * Resurrecting iq3_xs After all the experimentation, nothing was better than this. * Minor PPL improvement via a block scale fudge factor * Minor improvement via 3 neighbours * iq3_xs: working scalar and AVX2 dot products * iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s) * iq3_xs: working Metal implementation * Adding IQ3_M - IQ3_XS mix with mostly Q4_K * iiq3_xs: a 3.4375 bpw variant * iq3_xs: make CUDA work for new version * iq3_xs: make scalar and AVX2 work for new version * iq3_s: make ARM_NEON work with new version * iq3_xs: make new version work on metal Performance is very similar to Q3_K_S * iq3_xs: tiny Metal speed improvement * iq3_xs: tiny Metal speed improvement * Fix stupid warning * Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS * iq3_xs: rename to iq3_s * iq3_s: make tests pass * Move Q3_K_XS mix to 3.25 bpw * Attempt to fix failing tests * Another attempt to fix the Windows builds * Attempt to fix ROCm * ROCm again * iq3_s: partial fix for QK_K = 64 * iq3_s: make it work on metal for QK_K = 64 Pleasent surprise: the coding was super-block size independent, so all it took was to delete some QK_K == 256 guards. * Will this fix ROCm? --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-24 16:23:52 +02:00
Xuan Son Nguyen	373ee3fbba	Add Gemma chat template (#5665 ) * add gemma chat template * gemma: only apply system_prompt on non-model message	2024-02-22 19:10:21 +01:00
Xuan Son Nguyen	a46f50747b	server : fallback to chatml, add AlphaMonarch chat template (#5628 ) * server: fallback to chatml * add new chat template * server: add AlphaMonarch to test chat template * server: only check model template if there is no custom tmpl * remove TODO	2024-02-22 10:33:24 +02:00
Kawrakow	a14679cc30	IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590 ) * iq4_nl: squash commits for easier rebase * Basics (quantize, dequantize) * CUDA dequantize and dot product * Slightly faster CUDA dot product (120 t/s) * Switch to 6-bit scales * Scalar dot product * AVX2 dot product * ARM_NEON dot product * Works on metal, but still slow * Slightly better Metal dot product * Another small Metal improvement * Metal dot product is getting there * Faster CUDA dot product * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided * Report the actual bpw * Add _xs mix that is 4.05 bpw for non-MoE models * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl * AVX2 dot product uses Q8_0 instead of Q8_K * Add to test-backend-ops * Minor fix * Also use use Q5_K for attn_output in MoE models * Fixes after merging latest master * Switching to blocks of 32 * AVX2 for blocks of 32 * Scaler dot product for blocks of 32 * ARM_NEON dot product for blocks of 32 * Metal kernels for blocks of 32 * Slightly faster Metal kernels * iq4_nl: Fix after merging with master * iq4_nl: another fix after merging with master * Use IQ4_NL instead of Q4_K when using k-quants is not possible * Fix typo that makes several tests fail * It was the ggml_vdotq thing missed inside the brackets --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-21 11:39:52 +02:00
Xuan Son Nguyen	11b12de39b	llama : add llama_chat_apply_template() (#5538 ) * llama: add llama_chat_apply_template * test-chat-template: remove dedundant vector * chat_template: do not use std::string for buffer * add clarification for llama_chat_apply_template * llama_chat_apply_template: add zephyr template * llama_chat_apply_template: correct docs * llama_chat_apply_template: use term "chat" everywhere * llama_chat_apply_template: change variable name to "tmpl"	2024-02-19 10:23:37 +02:00
Herman Semenov	5d3de51f97	ggml, common, examples, tests : fixed type arguments in printf (#5528 )	2024-02-18 18:20:12 +02:00
Kawrakow	bd2d4e393b	1.5 bit quantization (#5453 ) * iq1_s: WIP basics * iq1_s: CUDA is working * iq1_s: scalar CPU dot product * iq1_s: WIP AVX2 dot product - something is not right * Fix tests * Fix shadow warnings * Fix after merge with latest master * iq1_s: AVX2 finally works * iq1_s: ARM_NEON dot product. Works, but not very fast * iq1_s: better grid * iq1_s: use IQ2_XXS for attn_output At a cost of 0.04 extra bpw this gives a big improvement in PPL. * iq1_s: Metal basics Dequantize works, but not dot product * iq1_s: Metal works, but quite slow As usual, Apple Silicon does not like the code I write. * iq1_s: Tests * iq1_s: slightly faster dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-18 18:16:55 +02:00
Georgi Gerganov	8f1be0d42f	ggml : add ALiBi support for ggml_soft_max_ext (#5488 ) * ggml : avoid recomputing alibi slopes (CPU) * llama : reuse hparams.f_max_alibi_bias in all cases ggml-ci * ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal) ggml-ci * ggml : handle all SRCs (do not break on first null) ggml-ci * tests : do not use slope for large soft_max accumulates too much error ggml-ci * ggml : alternative ALiBi without extra tensor We compute the slopes in the kernel ggml-ci * cuda : add ALiBi support in ggml_soft_max_ext ggml-ci * ggml : deprecate ggml_alibi * ggml : support multi-sequence ALiBi (Metal) ggml-ci * cuda : add multi-seq ALiBi + remote F16 soft_max ggml-ci * ggml : update deprecation message * ggml : fix pos ptr when no ALiBi ggml-ci * cuda : fix performance (pow -> powf) * cuda : precompute ALiBi constants * metal : pre-compute ALiBi slopes ggml-ci * llama : init kq_pos only if needed ggml-ci * test-backend-ops : add null pos test to soft_max test-backend-ops : replace soft_max tests ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-17 23:04:16 +02:00
bmwl	f486f6e1e5	ggml : add numa options (#5377 ) * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverted Makefile * Fixed include * Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables * removed trailing whitespace * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverting Makefile * Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet * Removing MIRROR_MODE code for this PR * Removing last bit of MIRROR_MODE code for this PR * Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static * Fixed lingering init_llama_backend() bool calls in tests and examples * Remote enum llama_numa_strategies * Revert bad merge with dynatemp flags * add missing enum ggml_numa_strategies declaration and revert sync problem with master * add missing enum ggml_numa_strategies declaration * fixed ggml_init_numa variable * Update ggml.h Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges * split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples * Fix up some boolean vs enum comparisons * Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype * Update ggml.h Align enum values Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c Remove whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c align paremeters Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/common.cpp Remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * unified ggml_numa_strategy enum and fixed text alignment in server.cpp example * Update ggml.c simplified return for platforms without NUMA support Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * removed redundant else from cli argument processing of --numa * whitespace --------- Co-authored-by: root <root@nenya.lothlorien.ca> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-02-16 11:31:07 +02:00
Georgi Gerganov	cf45252a7c	tests : multi-thread the tokenizer tests (#5474 ) * tests : multi-thread the tokenizer tests ggml-ci * unicode : fix data race for unidentified codepoints ggml-ci * unicode : minor style fixes ggml-ci	2024-02-13 15:14:22 +02:00
Georgi Gerganov	99b8b43d7b	tests : disable moe test (#5473 )	2024-02-13 11:20:24 +02:00
snadampal	a07d0fee1f	ggml : add mmla kernels for quantized GEMM (#4966 ) * ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: update unit tests for the new vec_dot interface * llama.cpp: add MATMUL_INT8 capability to system_info	2024-02-11 15:22:33 +02:00
Johannes Gäßler	26d4efd11e	sampling: fix top_k <= 0 (#5388 ) * sampling: fix top_k <= 0 * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-08 09:46:30 +01:00
Georgi Gerganov	8504d2d0da	tests : .gitignore obj files	2024-02-08 09:46:47 +02:00
Michael Klimenko	52bb63c708	refactor : switch to emplace_back to avoid extra object (#5291 )	2024-02-03 13:23:37 +02:00
JidongZhang-THU	15606309a0	llava : add MobileVLM support (#5132 ) * New Feature: 1. Sum_Rows: fix cuda kernel overflow fix block shape error when nrows too big 2. Im2Col: Support Batch in cuda Support f32 to f32 both in cpu && cuda 3. DepthWiseConv: Support by Im2Col && MulMat 4. Pool_2d: Supoort avg pooling in cuda 5. HardSigmoid: Imp in cuda 6. HardSwish: Imp in cuda * fix tabs instead of spaces * code clean * CUDA POOL2D * ADD POOL2D test case in test-backend-ops.cpp * code clean * fix pool2d_kernel nits * fix bug in pool2d kernel * fix avg pooling, count_include_pad nits * test-backend-ops : add more pool_2d tests * cuda : fix warnings and formatting * ggml : check types in release builds too in pool_2d * test-backend-ops : remove f16 pool_2d tests * cuda : more style fixes * Add assert in ggml_cuda_op_pool2d * pool2d float padding fallback * test-backend-ops : add dst_type to im2col --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-31 15:10:15 +02:00
John Balis	625a699b54	`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686) * added cuda float16->float32 upcasting to ggml_cuda_cpy * added ability to copy 4d tensors with the cuda backend * added tests for float16_>float32 upcast and 4d tensor cuda copys * added 4d copy test for float32->float16 copy * applied patch suggested by @iamlemec * simplify cpy tests --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-30 16:20:25 +02:00
Kawrakow	f4d7e54974	SOTA 3-bit quants (#5196 ) * iq3_xxs: quantize/dequantize RMSE seems a bit high-ish at about half-way between q2_K and q3_K, so need to check more. * iq3_xxs: CUDA dequantize works * iq2_xxs: tuning quantization * iq3_xxs: starting to look better PPL on wiki.test.raw LLaMA-v1-7B: 6.4218 LLaMA-v2-7B: 6.3560 Mistral-7B : 6.0717 This is better than Q3_K_XS, with a 5% reduction in quantized model size. * iq3_xxs: CUDA dot product We have PP-512: 5891 t/s TG-128: 143.9 t/s * iq3_xxs: scalar and AVX2 dot products * iq3_xxs: ARM_NEON and Metal Metal performance is decent, ARM_NEON is pathetic * iq3_xxs: slightly better grid points * Faster iq3_xxs and iq2_xs dot products on CUDA * iq3_xxs: add some quant mix * iq3_xxs: fix failing quantization test Dot product still fails. Is this real? * iq3_xxs: hopefully fix ROCm * iq3_xxs: failing tests This time the dot product accuracy did find an actual bug in the AVX2 implementation. * Add IQ3_XXS to test-backend-ops --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-30 15:14:12 +02:00
Jared Van Bortel	fbf1ddec69	Nomic Vulkan backend (#4456 ) Signed-off-by: Jared Van Bortel <jared@nomic.ai> Co-authored-by: niansa <anton-sa@web.de> Co-authored-by: Adam Treat <treat.adam@gmail.com> Co-authored-by: Aaron Miller <apage43@ninjawhale.com> Co-authored-by: ToKiNoBug <tokinobug@163.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-01-29 15:50:50 -05:00
Abhilash Majumder	0f648573dd	ggml : add unified SYCL backend for Intel GPUs (#2690 ) * first update for migration * update init_cublas * add debug functio, commit all help code * step 1 * step 2 * step3 add fp16, slower 31->28 * add GGML_LIST_DEVICE function * step 5 format device and print * step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue * support main device is non-zero * step7 add debug for code path, rm log * step 8, rename all macro & func from cuda by sycl * fix error of select non-zero device, format device list * ren ggml-sycl.hpp -> ggml-sycl.h * clear CMAKE to rm unused lib and options * correct queue: rm dtct:get_queue * add print tensor function to debug * fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481 * summary dpct definition in one header file to replace folder:dpct * refactor device log * mv dpct definition from folder dpct to ggml-sycl.h * update readme, refactor build script * fix build with sycl * set nthread=1 when sycl, increase performance * add run script, comment debug code * add ls-sycl-device tool * add ls-sycl-device, rm unused files * rm rear space * dos2unix * Update README_sycl.md * fix return type * remove sycl version from include path * restore rm code to fix hang issue * add syc and link for sycl readme * rm original sycl code before refactor * fix code err * add know issue for pvc hang issue * enable SYCL_F16 support * align pr4766 * check for sycl blas, better performance * cleanup 1 * remove extra endif * add build&run script, clean CMakefile, update guide by review comments * rename macro to intel hardware * editor config format * format fixes * format fixes * editor format fix * Remove unused headers * skip build sycl tool for other code path * replace tab by space * fix blas matmul function * fix mac build * restore hip dependency * fix conflict * ren as review comments * mv internal function to .cpp file * export funciton print_sycl_devices(), mv class dpct definition to source file * update CI/action for sycl code, fix CI error of repeat/dup * fix action ID format issue * rm unused strategy * enable llama_f16 in ci * fix conflict * fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml * fix ci cases for unsupported data type * revert unrelated changed in cuda cmake remove useless nommq fix typo of GGML_USE_CLBLAS_SYCL * revert hip cmake changes * fix indent * add prefix in func name * revert no mmq * rm cpu blas duplicate * fix no_new_line * fix src1->type==F16 bug. * pass batch offset for F16 src1 * fix batch error * fix wrong code * revert sycl checking in test-sampling * pass void as arguments of ggml_backend_sycl_print_sycl_devices * remove extra blank line in test-sampling * revert setting n_threads in sycl * implement std::isinf for icpx with fast math. * Update ci/run.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add copyright and MIT license declare * update the cmd example --------- Co-authored-by: jianyuzh <jianyu.zhang@intel.com> Co-authored-by: luoyu-intel <yu.luo@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-28 17:56:23 +02:00
Johannes Gäßler	b2b2bf988c	Tests for min_p, sampling queue (#5147 )	2024-01-28 09:35:14 +01:00
Michael Klimenko	35a2ee9143	Remove unused data and add fixes (#5154 ) * Remove unused data and add fixes * Add missing file * Address review comments * Replace the scope of vq allocation	2024-01-27 15:25:55 +01:00
Georgi Gerganov	3b7c914de2	tests : gitignore test-c.o	2024-01-26 14:48:15 +02:00
crasm	413e7b0559	ci : add model tests + script wrapper (#4586 ) * scripts : add lib.sh and lib_test.sh * scripts : stub out new ci-run.sh script * scripts : switch to PascalCase for functions This looks a little odd at first, but I find it very useful as a convention to know if a command is part of our code vs a builtin. * scripts : add some fancy conversion from snake_case to PascalCase * Add venv to ci/run.sh * Revert scripts work * scripts : add wrapper script for local use of ci/run.sh * Simplify .gitignore for tests, clang-tidy fixes * Label all ctest tests * ci : ctest uses -L main * Attempt at writing ctest_with_model * Update test-model-load-cancel * ci : add ctest_with_model for debug and release ggml-ci * Fix gg_get_model function ggml-ci * got stuck on CMake * Add get_model.cpp to tests/CMakeLists.txt ggml-ci * Fix README.md output for ctest_with_model ggml-ci * workflows : use `-L main` for all ctest ggml-ci * Fixes * GG_RUN_CTEST_MODELFILE => LLAMACPP_TESTMODELFILE * Always show warning rather than failing if model file variable is not set * scripts : update usage text for ci-run.sh	2024-01-26 14:18:00 +02:00
Georgi Gerganov	38566680cd	ggml : add IQ2 to test-backend-ops + refactoring (#4990 ) * ggml : add IQ2 to test-backend-ops + refactoring ggml-ci * cuda : update supports_op for IQ2 ggml-ci * ci : enable LLAMA_CUBLAS=1 for CUDA nodes ggml-ci * cuda : fix out-of-bounds-access in `mul_mat_vec_q` ggml-ci * tests : avoid creating RNGs for each Q tensor ggml-ci * tests : avoid creating RNGs for each tensor ggml-ci	2024-01-17 18:54:56 +02:00
Georgi Gerganov	c918fe8dca	metal : create autorelease pool during library build (#4970 ) * metal : create autorelease pool during library build ggml-ci * test : simplify ggml-ci	2024-01-17 18:38:39 +02:00
Kawrakow	147b17ac94	2-bit quantizations (#4897 ) * imatrix: load * imatrix: WIP * imatrix: Add Q2_K quantization * imatrix: also guard against Q2_K_S quantization without importance matrix * imatrix: guard even more against low-bit quantization misuse --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-14 09:45:56 +02:00
slaren	e7e4df031b	llama : ggml-backend integration (#4766 ) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 20:07:38 +01:00
Kawrakow	49662cbed3	ggml : SOTA 2-bit quants (add IQ2_XS) (#4856 ) * iq2_xs: basics * iq2_xs: this should have been in the basics * iq2_xs: CUDA and scalar CPU works * iq2_xs: WIP Metal * iq2_xs: Metal now works * iq2_xs: working, but dog slow, ARM_NEON dot product * iq2_xs: better ARM_NEON dot product We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU. * iq2_xs: AVX2 dot product - 19.5 t/s * iq2_xs: faster AVX2 dit product 21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version. * iq2_xs: had forgotten to delete iq2-data.h * Add llama enum for IQ2_XS --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:39:39 +02:00
Johannes Gäßler	8f900abfc0	CUDA: faster softmax via shared memory + fp16 math (#4742 )	2024-01-09 08:58:55 +01:00
Kawrakow	dd5ae06405	SOTA 2-bit quants (#4773 ) * iq2_xxs: basics * iq2_xxs: scalar and AVX2 dot products Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change later, for now this is what we have. * iq2_xxs: ARM_NEON dot product Somehow strangely slow (112 ms/token). * iq2_xxs: WIP Metal Dequantize works, something is still wrong with the dot product. * iq2_xxs: Metal dot product now works We have PP-512 = 475 t/s TG-128 = 47.3 t/s Not the greatest performance, but not complete garbage either. * iq2_xxs: slighty faster dot product TG-128 is now 48.4 t/s * iq2_xxs: slighty faster dot product TG-128 is now 50.9 t/s * iq2_xxs: even faster Metal dot product TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory. * iq2_xxs: dequantize CUDA kernel - fix conflict with master * iq2_xxs: quantized CUDA dot product (MMVQ) We get TG-128 = 153.1 t/s * iq2_xxs: slightly faster CUDA dot product TG-128 is now at 155.1 t/s. * iq2_xxs: add to llama ftype enum * iq2_xxs: fix MoE on Metal * Fix missing MMQ ops when on hipBLAS I had put the ggml_supports_mmq call at the wrong place. * Fix bug in qequantize_row_iq2_xxs The 0.25f factor was missing. Great detective work by @ggerganov! * Fixing tests * PR suggestion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-08 16:02:32 +01:00
Johannes Gäßler	a91928014f	Print backend name on test-backend-ops failure (#4751 )	2024-01-04 09:43:23 +01:00
Guillaume Wenzek	5f66ebca9c	ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639) * add more int ops * ggml_compute_forward_dup_bytes * add tests * PR comments * tests : minor indentations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-03 14:38:38 +02:00
Georgi Gerganov	58ba655af0	metal : enable shader debugging (cmake option) (#4705 ) * ggml : disable fast-math for Metal (cmake build only) ggml-ci * metal : fix Metal API debug warnings * cmake : add -fno-inline for Metal build (#4545) * metal : fix API debug warnings * metal : fix compile warnings * metal : use uint64_t for strides * cmake : rename option to LLAMA_METAL_SHADER_DEBUG * metal : fix mat-vec Q8_0 kernel for BS > 1 * metal : normalize mat-vec kernel signatures * cmake : respect LLAMA_QKK_64 option * metal : fix mat-vec Q4_K kernel for QK_K == 64 ggml-ci	2024-01-02 10:57:44 +02:00
Cuong Trinh Manh	97bbca6e85	cmake : fix ld warning duplicate libraries libllama.a (#4671 ) * fix "ld: warning: ignoring duplicate libraries: '../libllama.a'" * fix warning in example.	2023-12-29 16:39:15 +02:00
bssrdf	afc8c19291	ggml : fix some mul mat cases + add tests for src1 F16 (ggml/669) * fixed mul-mat error for old GPUs * style fixes * add mul mat src1 f16 test cases, fix more cases ggml-ci --------- Co-authored-by: bssrdf <bssrdf@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2023-12-29 14:54:19 +02:00
manikbhandari	ea5497df5d	gpt2 : Add gpt2 architecture integration (#4555 )	2023-12-28 15:03:57 +01:00
slaren	5bf3953d7e	cuda : improve cuda pool efficiency using virtual memory (#4606 ) * cuda : improve cuda pool efficiency using virtual memory * fix mixtral * fix cmake build * check for vmm support, disable for hip ggml-ci * fix hip build * clarify granularity * move all caps to g_device_caps * refactor error checking * add cuda_pool_alloc, refactor most pool allocations ggml-ci * fix hip build * CUBLAS_TF32_TENSOR_OP_MATH is not a macro * more hip crap * llama : fix msvc warnings * ggml : fix msvc warnings * minor * minor * cuda : fallback to CPU on host buffer alloc fail * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * ensure allocations are always aligned * act_size -> actual_size --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2023-12-24 14:34:22 +01:00
Georgi Gerganov	afefa319f1	ggml : change ggml_scale to take a float instead of tensor (#4573 ) * ggml : change ggml_scale to take a float instead of tensor * ggml : fix CPU implementation * tests : fix test-grad0 ggml-ci	2023-12-21 23:20:49 +02:00
Ebey Abraham	b9e74f9bca	llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490 ) * phi2 implementation * fix breaking change * phi-2 : various fixes * phi-2 : use layer norm eps * py : whitespaces * llama : fix meta KV override bug * convert : phi don't add BOS token * convert : revert "added_tokens_decoder" change * phi-2 : scale Q instead of KQ for better precision * ggml : fix NeoX rope to rotate just first n_dims * cuda : less diff in the rope_neox kernel * ggml : add ggml_mul_mat_set_prec ggml-ci * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> * cuda : ggml_cuda_op_mul_mat_cublas support F32 precision * cuda : remove oboslete comment --------- Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2023-12-18 19:27:47 +02:00
slaren	6744dbe924	ggml : use ggml_row_size where possible (#4472 ) * ggml : use ggml_row_size where possible ggml-ci * ggml : move ggml_nbytes_split to ggml-cuda.cu	2023-12-14 20:05:21 +01:00
Georgi Gerganov	4d98d9a656	sync : ggml (SD ops, tests, kernels) (#4444 ) * sync : ggml (SD ops, tests, kernels) ggml-ci * cuda : restore im2col ggml-ci * metal : fix accuracy of dequantization kernels ggml-ci * cuda : restore correct im2col ggml-ci * metal : try to fix moe test by reducing expert size ggml-ci * cuda : fix bin bcast when src1 and dst have different types ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-13 21:54:54 +02:00
slaren	799a1cb13b	llama : add Mixtral support (#4406 ) * convert : support Mixtral as LLAMA arch * convert : fix n_ff typo * llama : model loading * ggml : sync latest ggml_mul_mat_id * llama : update graph to support MoE * llama : fix cur -> cur_expert * llama : first working version * llama : fix expert weighting in the FFN * ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) * ggml : add n_as argument to ggml_mul_mat_id * ggml : fix ggml_get_rows to take into account ne02 / ne11 * metal : add more general support for ggml_get_rows + tests * llama : add basic support for offloading moe with CUDA * metal : add/mul/div use general kernel when src1 not cont * metal : reduce the kernel launches for ggml_mul_mat_id * ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D * ggml : update get_rows f16 and q * cuda : support non-contiguous src1 in get_rows * llama : offload missing ffn_moe_silu * metal : fix ggml_get_rows to work with non-cont src1 * metal : add indirect mat-vec kernels for all quantization types * llama : do not quantize expert gating tensors * llama : add n_expert and n_expert_used to hparams + change quants * test-backend-ops : add moe test * cuda : fix get_rows when ncols is odd * convert : determine n_ctx correctly * metal : fix ggml_mul_mat_id for F32 * test-backend-ops : make experts more evenly probable (test_moe) * test-backend-ops : cleanup, add moe test for batches * test-backend-ops : add cpy from f32 -> all types test * test-backend-ops : fix dequantize block offset * llama : fix hard-coded number of experts * test-backend-ops : simplify and disable slow tests to avoid CI timeout * test-backend-ops : disable MOE test with thread sanitizer * cuda : fix mul_mat_id with multi gpu * convert : use 1e6 rope_freq_base for mixtral * convert : fix style * convert : support safetensors format * gguf-py : bump version * metal : add cpy f16 -> f32 kernel * metal : fix binary ops for ne10 % 4 != 0 * test-backend-ops : add one more sum_rows test * ggml : do not use BLAS with ggml_mul_mat_id * convert-hf : support for mixtral-instruct (#4428) * convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct * convert : use sentencepiece tokenizer for Mixtral-instruct * convert : make flake8 happy * metal : fix soft_max kernels ref: `1914017863` * metal : limit kernels to not use more than the allowed threads --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Radek Pilar <github@mrkva.eu>	2023-12-13 14:04:25 +02:00
Richard Kiss	9494d7c477	english : use `typos` to fix comments and logs (#4354 )	2023-12-12 11:53:36 +02:00
Georgi Gerganov	fe680e3d10	sync : ggml (new ops, tests, backend, etc.) (#4359 ) * sync : ggml (part 1) * sync : ggml (part 2, CUDA) * sync : ggml (part 3, Metal) * ggml : build fixes ggml-ci * cuda : restore lost changes * cuda : restore lost changes (StableLM rope) * cmake : enable separable compilation for CUDA ggml-ci * ggml-cuda : remove device side dequantize * Revert "cmake : enable separable compilation for CUDA" This reverts commit `09e35d04b1`. * cuda : remove assert for rope * tests : add test-backend-ops * ggml : fix bug in ggml_concat * ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()` * ci : try to fix macOS * ggml-backend : remove backend self-registration * ci : disable Metal for macOS cmake build ggml-ci * metal : fix "supports family" call * metal : fix assert * metal : print resource path ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-07 22:26:54 +02:00
Galunid	f23c0359a3	ci : add flake8 to github actions (python linting) (#4129 ) Disabled rules: * E203 Whitespace before ':' - disabled because we often use 'C' Style where values are aligned * E211 Whitespace before '(' (E211) - disabled because we often use 'C' Style where values are aligned * E221 Multiple spaces before operator - disabled because we often use 'C' Style where values are aligned * E225 Missing whitespace around operator - disabled because it's broken so often it seems like a standard * E231 Missing whitespace after ',', ';', or ':' - disabled because we often use 'C' Style where values are aligned * E241 Multiple spaces after ',' - disabled because we often use 'C' Style where values are aligned * E251 Unexpected spaces around keyword / parameter equals - disabled because it's broken so often it seems like a standard * E261 At least two spaces before inline comment - disabled because it's broken so often it seems like a standard * E266 Too many leading '#' for block comment - sometimes used as "section" separator * E501 Line too long - disabled because it's broken so often it seems like a standard * E701 Multiple statements on one line (colon) - broken only in convert.py when defining abstract methods (we can use# noqa instead) * E704 Multiple statements on one line - broken only in convert.py when defining abstract methods (we can use# noqa instead)	2023-11-20 11:35:47 +01:00
Jiří Podivín	f7d5e97542	py : remove superfluous import statements (#4076 ) Signed-off-by: Jiri Podivin <jpodivin@gmail.com> Co-authored-by: Jiri Podivin <jpodivin@redhat.com>	2023-11-17 17:20:53 +02:00
Galunid	36eed0c42c	stablelm : StableLM support (#3586 ) * Add support for stablelm-3b-4e1t * Supports GPU offloading of (n-1) layers	2023-11-14 11:17:12 +01:00
Georgi Gerganov	4760e7cc0b	sync : ggml (backend v2) (#3912 ) * sync : ggml (backend v2) (wip) * sync : migrate examples and llama.cpp to dynamic graphs (wip) * sync : update tests + fix max op params to 64 ggml-ci * sync : ggml-cuda ggml-ci * llama : fix save/load state context size ggml-ci * sync : try to fix build on tvOS * sync : pass custom graph sizes in training examples * sync : update graph copies to new ggml API * sync : update sync-ggml.sh with new files * scripts : fix header in sync script * train : fix context size calculations * llama : increase inference graph size up to 4096 nodes * train : allocate grads for backward graphs * train : allocate grads for gb_tmp	2023-11-13 14:16:23 +02:00
Georgi Gerganov	207b51900e	ggml : move FP16 <-> FP32 code to ggml-impl.h (#3861 ) * ggml : move FP16 <-> FP32 stuff to ggml-impl.h ggml-ci * tests : fix ARM build * ggml : explicitly initialize deprecated type traits * ggml : add math.h to ggml-impl.h * ggml : remove duplicate static assert macros * ggml : prefix lookup tables with ggml_ ggml-ci * ggml-impl : move extern "C" to start of file	2023-10-30 19:19:15 +02:00
Galunid	daab3d7f45	Add more tokenizer tests (#3742 ) * Add more tokenizer tests * Add starcoder * Update test vocab files * Restrict bpe tokenizer tests to unicode planes * Update comment * Comment cosmetics * Remove bloom vocab/test	2023-10-24 09:17:17 +02:00
goerch	9e70cc0322	Add test for MPT tokenization (#3728 ) * Add test for MPT tokenization * Revert code motion * Remove unnecessary restriction in test case * Clarify logic in conversion	2023-10-22 21:21:42 +02:00
Georgi Gerganov	d1031cf49c	sampling : refactor init to use llama_sampling_params (#3696 ) * sampling : refactor init to use llama_sampling_params * llama : combine repetition, frequency and presence penalties in 1 call * examples : remove embd-input and gptneox-wip * sampling : rename penalty params + reduce size of "prev" vector * sampling : add llama_sampling_print helper * sampling : hide prev behind API and apply #3661 ggml-ci	2023-10-20 21:07:23 +03:00
Qin Yue Chen	8cf19d60dc	gguf : support big endian platform (#3552 ) * check whether platform is 390x if yes->do not import immintrin.h * support s390x big endian * support --bigendian option for s390x 1. verified with baichuan7b-chat with float 16 on s390x 2. verified with baichuan7b-chat 3. verified with chinese-alpaca-2-13b-f16 * update format based on editor-config checker result * Update convert-baichuan-hf-to-gguf.py * 1. check in ggml.c if endianess is not match 2. update GGUF version 3. change get_pack_prefix to property 4. update information log * always use "GGUF" as beginng of GGUF file * Compare "GGUF" with file header char by char 1. Set GGUF_MAGIC to "GGUF" string instead of int value 2. Compare "GGUF" char by char to ensure its byte order 3. Move bytes swap code from convert.py to gguf.py write_tensor_data --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-20 14:19:40 +03:00
goerch	233fc1c69f	Minor improvements in GPT2 tokenizer (#3567 ) * Fixing minor bugs in bpe_gpt2_preprocess * Don't add bos token in test	2023-10-10 18:59:52 +02:00
Georgi Gerganov	f93af02488	sync : ggml (conv 1d + 2d updates, UB fixes) (#3468 ) * sync : ggml (conv 1d + 2d updates) ggml-ci * ggml : fix UB in q5_0 and q5_1 quantize code ggml.c:1033:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int' SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ggml.c:1081:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int' SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ggml-ci * tests : fix UB in test-quantize-perf	2023-10-04 15:29:58 +03:00
goerch	ff5a3f0c09	Work on the BPE tokenizer (#3252 ) * Work on the BPE tokenizer Tokenizer tests work for Falcon-7B * Try to fix build problem * Fix debug assertion failure * Fix MSVC Unicode BOM problem * Cleanup and an improvement * Fix compiler warning * Cleanup * Test doesn't work over the full range of Unicodes * Update .gitignore and Makefile * Another Makefile rule * Testing Aquila * Moving byte decoding back to `token_to_piece` ... ... because everyone is using it. * Guarding some unusable code pathes * Streamlining code and adding some more assertions Important change: I'm classifying added tokens as control tokens now for BPE. * Adding a comment * Adding another assertion * Fixed vocabulary guarding assertions * Fix PR for recent change * Fix PR for recent change * Fix for compiler warning * Fix PR for recent change * Fix PR for recent change * Fix PR for recent change * Fix for compiler warning * Fixes for more compiler warnings * Remove unused code * Fix initialization of static maps * Add scores and token types back, adapt gptneox * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update unicode.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update unicode.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Ported Starcoder and added some assertions * Fix coding style * Apply @jploski 's fix for missing tokens --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-03 09:16:26 +02:00
Cebtenzzre	bc39553c90	build : enable more non-default compiler warnings (#3200 )	2023-09-28 17:41:44 -04:00
slaren	16bc66d947	llama.cpp : split llama_context_params into model and context params (#3301 ) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used	2023-09-28 22:42:38 +03:00
xaedes	0e76a8992c	train : finetune LORA (#2632 ) * fix track_max_mem in forward_batch_wo_cache_flash_attn_train * remove unnecessary Adam(W) optimizer tensors. reduces optimizer memory overhead from 7modelsize to 2modelsize. additionally allows to optimize models with more than 2^31 parameters by replacing int with int64_t. bumps training checkpoint file version, but old checkpoints can still be read. new version with less tensors is saved. * add gradient clipping to AdamW * Fix reset of unused g->nodes and g->grads to NULL * implement gradient checkpointing for training reduces memory overhead from O(n_layer) to O(sqrt(n_layer)) as explained in readme of https://github.com/cybertronai/gradient-checkpointing * remove unused compute buffer 3 * add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep); * change AdamW decay parameter to work like the torch AdamW decay parameter It is now relative to Adam learning rate `alphasched`. Before that it was relative to `sched` only. `alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1] change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT * change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW btw: the default weight decay parameter for torch.optim.AdamW is 0.01 * bug fixes for cross entropy loss ggml_cross_entropy_loss: sums where not correctly added in workload of each thread ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16 cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup. so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance. * fix test-grad0 for cross_entropy_loss the second argument to cross_entropy_loss must sum up to 1 for each row * fix test-grad0 for soft_max dont use only sum as aggregation, because sum of softmax is always 1 -> finite differences should not work instead use sum(log(soft_max()(1-eps)+eps)); use eps to avoid log(0) improve finite differences of test-grad0 by using double instead of float * change cross_entropy_loss to output average over all rows this helps keeping the loss and gradients in a sane range * improve gradient checkpointing sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal. since layers require more memory than the single-tensor-checkpoint we use, the optimal values are compute different: ``` given: n, u, v objective: minimize(au+bv) where ab=n, a>0, b>0 b=n/a minimize(au+vn/a) diff(au+vn/a, a) = u - (vn/a)/a diff(au+vn/a, a) == 0 u - (vn/a)/a == 0 u == vn/(aa) uaa = vn aa = vn/u a = sqrt(nv/u) ``` this change results in more checkpoints, requiring less layers to store between checkpoints, overall improving memory usage. disable gradient checkpointing debug output * llama : fix rope usage in train-text-from-scratch after ChatGLM change * add more training parameters: --enable-restart N Only for Adam optimizer. Enable restarts of cos-decay --disable-restart N Only for Adam optimizer. Disable restarts of cos-decay --opt-past N Number of optimization iterations to track for delta convergence test. Disabled when zero. --opt-delta N Maximum delta for delta convergence test. Disabled when <= zero. --opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero. --adam-epsf N AdamW epsilon for convergence test. Disabled when <= zero. --adam-min-alpha N Adam minimum learning rate alpha, usually 0.1 * alpha * replace memcpy with reshape operation so that the graph is not cut at the input this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it * remove unused function argument from get_example_targets_batch * measure and print total training time * add optimization callback to ggml_opt_resume_g this callback is called before each iteration with custom data and pointer to learning schedule parameter (only used in Adam(W)). can be used for dynamic learning schedule and setting input data for batches before each iteration * use optimization callback in training allows dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters reduces runtime by avoiding restart of optimization function and improves training convergence by providing a different batch for each iteration * add minimum number of tensor dimensions to apply weight decay (default 2) this allows to not apply weight decay to bias parameters * rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup * fix increase of model.train_samples and model.train_tokens now that each optimizer iteration gets its own batch we need to multiply by number of opt iterations * change sampling parameters for prediction after training to defaults of common.h and clarify what is context for prediction and what are generated tokens * tighten abs error bounds for cross_entropy_loss in test-grad0 * add conditional compilation of using F16 exp in flash attention uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention * tighten abs error bounds for flash_attn in test-grad0 * tighten abs error bounds for sqrt in test-grad0 * remove out-commented vectorized code of opt_adam the vectorized code might be bit faster for low number of parameters, but it had a big memory usage overhead * ggml : update ggml_rms_norm_back with configurable eps * llama training : fix ggml_rms_norm_back calls to pass configurable eps * remove trailing whitespace * add train function using automatic gradient checkpointing backward pass and allocator * in train function replace add_inplace by regular add because using add_inplace seems to result in different gradients * don't use allocate hash_map on context because the context has no_alloc=True when using memory allocator resulting in NULL data pointers * correctly clone reshape and permute operations by also cloning tensor->nb values * fix variable name and add missing type cast * terminate recursive tensor cloning when reaching tensor without src tensors * correctly clone view tensors by setting data pointers without this the checkpointing would only work when being used together with memory allocator * fix variable names * swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn` * add input tensors as checkpoints so that recursive tensor cloning of gradient checkpointing terminates on input tensors * fix variable name and add missing boolean negation * make sure some tensors are not reallocated by inserting new temporary nodes depending on them: output and parameter gradient tensors need to be available at the end of the graph execution parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration checkpoint tensors are allocated all together to reduce memory allocator fragmentation afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs * fix ASSERT to work with zero layers * add training options whether to use allocator and/or unified training function * integrate unified training function which may use memory allocator the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing * format name of cloned tensors with " (clone)" suffix * set names for tensors in unified train function for easier debugging * allocate graph on context using ggml_new_graph * remove handwritten training functions * remove unused training parameters "use_scratch" and "use_unified" * remove trailing whitespace * remove unused train params: mem_compute1_gb & mem_compute2_gb mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented) * remove unused forward_batch function * add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly * only use ggml_allocr_alloc when tensor has NULL data and is no view * fix test when to create temporary backward graph temporary backward graph is only necessary when using checkpointing * fix memory "leak" in optimizers each iteration a new cplan with new memory for work data was allocated. now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data. * reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator with this loop order gradient checkpointing with allocator on 16 layer model saves 13% memory; 2 layer memory it saves 2% memory. the computation results are the same * add API functions to access llama model tensors * add stub example for finetuning, based on train-text-from-scratch * move and remove code * add API functions to access remaining model parameters: mult, head and rot * first draft for LORA finetune training * remove const model and layer arguments in API functions for accessing model tensors * bug fixes to make finetune compile automatic allocator does not work yet * add debug prints for training memory improvements * fix names of lora tensors * avoid stack overflow resulting from big ggml_cgraph replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand * replace llama API functions to get model tensors by one function to get model tensor by name LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name); * remove unused call to not existing llama_get_layer_from_model * implement ggml_compute_forward_out_prod_q_f32 * remove trailing whitespace * add lora finetune support on quantized base model tensors * add ggml_add_cast API function this function works like ggml_add, but accepts a data type for the resulting tensor. only supported for quantized src0 input. * use ggml_add_cast in finetuning lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models * bug fix: actually use result type passed to ggml_add_cast * make sure base model tensors data cannot be used in viewable operations memory allocator would try to make lora application inplace on base model tensors. since those are memory mapped this will result in memory access violations * fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors * avoid keeping in memory ALL of the gradients The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previosly freed memory location might lead to non-zero input gradients. During ggml_compute_backward the gradients are build stepwise by adding or substracting new values, starting from a OP_NONE tensor which needs to contain zero-values. This requires the graph reset. To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There instead of using add (or sub or similar) I test whether the existing gradient to be changed is a zero-valued-tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added, otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and more importantly they won't need a unique memory slot, just to make them zero. * remove trailing whitespace * remove debug prints and function to compute tensor data hash * improve optimization iteration prints * adjust maximal values to support finetuning 3B models * change default finetune params lora_r and lora_alpha to match the n_rank parameters of 4 * bug fix: make sure finetune input gradient is allocated at begin and kept until end * remove unnecessary src tensor from ggml_get_rows_back we don't need data of src[2] for computation, only to setup the correct output shape. remove dependency on src[2], so that allocator can work more freely. the computational graph is still completely determined, because the output shape is naturally included. this is similar to how ggml_reshape does it. * remove unnecessary src tensor from ggml_repeat & ggml_repeat_back we don't need data of src[1] for computation, only to setup the correct output shape. remove dependency on src[1], so that allocator can work more freely. the computational graph is still completely determined, because the output shape is naturally included * resolve todo allocator will only make it inplace when they are of the same type * mixing multiple LORA adapters is now possible pass more than one '--lora FNAME' argument to apply more than one LORA. use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter. * add option to save finetune output every N iterations * also save latest finetune output with ITERATION="LATEST" and print where files are saved saving with LATEST makes it easier to resume training from the latest checkpoint the string "LATEST" can be configured with command line option "--fn-latest STR" * update checkpoint train stats before saving via "--save-every" * add command line option `--rank-wo N` for rank of wo tensor * update finetune README * fix dump_non_result_info_yaml to output multiple lora adapters * bug fix: replace GGML_TYPE_SIZE[t] by ggml_type_size(t) * replace llama_n_mult by llama_n_ff * finetune bug fixes to compile with merged in code from master * remove prediction related code to reduce duplicated code with main use main instead * reduce large memory overhead in train-text-from-scratch all gradients had to be pinned so that graph_reset works correctly. this is no longer necessary with the changes to ggml_compute_backward introduced in this PR. * add comment explaining why finetune checkpoints are allocated in one block * make default value of float member a float literal * handle rms_norm and rope parameters the same as in train-text-from-scratch * remove unused code * remove vocab related code as it is unnecessary * add LLM_KV_TRAINING_TYPE to train-text-from-scratch checkpoints so that they can be differentiated from lora finetune checkpoints * add gguf constants and load/save functions from train-text-from-scratch * add load & save lora finetune checkpoints via gguf * add python script to convert old finetune checkpoint files to gguf * remove old checkpoint save & load code * remove code to print data checksums which was used to verify correctness of new gguf code * omit tokenization when training is disabled, only save llama lora adapter training can be disabled by passing '-n 0' to finetune * remove trailing whitespace * update README.md * implement ggml_compute_forward_repeat_f16 * avoid stack overflow of large cgraphs in test-grad0 * add ggml API functions ggml_unravel_index, ggml_get_i32_nd and its analogs for set and for f32 ggml_get_i32_1d, ggml_set_i32_1d, ggml_get_f32_1d, ggml_set_f32_1d now support non-contiguous tensors. in case of non-contiguous tensor, the 1d index is unraveled into a multi index using ggml_unravel_index to be passed to '_nd' function equivalent. this fixes a bug in test-grad0 which happens due to ggml_build_backward not building purely contiguous tensors anymore * increase test-grad0 context mem size to accommodate for bigger cgraph * add sanity check to ggml_compute_backward, asserting the correct shape of gradients * fix ggml_acc_or_set to return tensor of correct shape * remove unused 'inplace' argument from ggml_compute_backward function inplace operations to add gradients are no longer created by ggml_compute_backward use allocator to automatically make inplace operations * add missing argument 'int i0' to ggml_get_i32_nd & ggml_set_i32_nd header declarations * fix error message in ggml_allocr_alloc to display actual max_avail * fix check_gradient ggml_build_backward_expand was previously replaced by ggml_build_backward, but the assignment of forward graph to backward graph missing * use tensor->view_src instead of ggml_is_view and get_view_source * move gradient checkpointing code into ggml, new API function: // build gradient checkpointing backward graph gb for gf using provided checkpoints // gb_tmp will contain original backward graph with rewritten backward process nodes, // but without the second forward pass nodes. GGML_API void ggml_build_backward_gradient_checkpointing( struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, struct ggml_cgraph * gb_tmp, struct ggml_tensor * * checkpoints, int n_checkpoints); * replace custom data getters and setters by ggml functions * train-text-from-scratch can train (full finetune) gguf models just pass the gguf model via `--checkpoint-in FN`. after this, to continue training, pass the generated checkpoint instead of the original gguf model. tested with smaller models, bigger models may exceed available memory. use (LORA) finetune for those. * remove trailing whitespace * add option to save train-text-from-scratch output every N iterations * update README.md * fix warnings * fix warnings * remove finetune option to disable allocator the allocator should always be used. by making sure that it is always used it gets easier to implement automatic memory requirements computation * add tensor checkpoints only when gradient checkpointing is enabled * initialize opt ggml context if none was provided * add ggml-alloc API function 'ggml_allocr_max_size' to get max size of alloc GGML_API size_t ggml_allocr_max_size(struct ggml_allocr * alloc); * finetune: automatically allocate all memory and changes to command line options remove '--n_examples N' parameter, as it no longer makes sense to call optimization process multiple times in a loop. add '--only_write_lora' command line option: will skip tokenization and training, to only write a llama.cpp comptabile LORA adapter. remove memory buffer related command line options. improve iteration console output. * add finetune to Makefile * update README.md * print time per iteration and estimate remaining time * increase measured alloc size by tensor_alignment ggml_allocr_reset will reduce the given size by up to tensor_alignment-1 * fix README.md * add some more allocator debug prints * bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue * revert last commit "bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue" "alloc was freeing an externally allocated tensor, because it calculated the end of allocator memory as alloc->data + alloc->max_size instead of alloc->data + alloc->size." This is intentional to reduce the risk of freeing external tensors when measuring. Unless max_size is not properly calculated, I don't see why this is an issue. * remove unnecessary "0x" before "%p" output * move measurement memory segment to upper region of the address space * update README.md * fix printf format warnings * add missing gguf_free in load_checkpoint_lora_file * load default rms_norm and rope parameters from base model * add gradient accumulation specify number accumulation steps with '--grad-acc N'. this will simulate a bigger batch size of grad_accbatch. fix tracking of train_samples and train_tokens * build : fix compile warnings * ggml : fix L-BFGS linesearch loop * improve finetune time measurement fix printf warnings on system where int64_t is (long int). change time datatypes to double because values get big with long training times. exclude file saving from time measurement. converge faster to actual time per iteration by removing very small first duration before first iteration was performed. fix bug in output of total training time, the reported value was 1000 times to small. * specify default lora rank with '--lora-r N' '--lora-r N' will specify default rank for all tensors '--rank-wq N', etc. will override this default rank for specific tensor types. * fix gradient accumulation bug where the same batch was used for each microstep * fix gradient accumulation bug where the same batch was used for each microstep * support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back k and v can now be repeated in q along ne[2] in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2. in backard pass this won't work as easy, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3]. so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads. in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irepnek2. since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor. additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned. we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions. this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous. since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous. change test-grad0 to also test for repeated k/v in q. this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable. added a note to explain this. add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'. * fix finetune to support grouped-query-attention (using flash-attention) note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention. * support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b) * test broadcasting mul_mat backward pass * decouple random number generator of each operation test when changing one test the rng of others tests is not influenced anymore * add comment briefly describing what ggml_repeat_back does * simplify broadcasting mul_mat backward using ggml_repeat_back * add cgraph evaluation order member and corresponding enum type this controls in which order ggml_build_forward visits source nodes. by default the nodes are visited left to right, i.e. src[0] first. in some cases it is beneficial for ggml-alloc to visit in a different order. two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last). * measure max compute size for each cgraph eval order and use best order this can bring huge memory savings: e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8mb down to 4627.6 MB * remove unused command line options * add sample start patterns and options to force new or by default resume last shuffling * update shuffle rng state on reshuffle * exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32 * remove probably unnecessary exception type flags from stringstream * pass correct max number of tokens to llama_tokenize * account for possible leading whitespace that will be added by tokenizer e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12] * use unrolled vec_mad in out_prod y is vec_mad result vec. x is vec_mad input vec. v is vec_mad input scalar. ggml_vec_mad_f32_unroll will internally loop over x and v with same y. GGML_VEC_MAD_UNROLL is by default defined to 32. This value is empirical optimized using performance test runs of out-prod in openllama-3b finetune with 256 context length and batch size 1. It gives 23% performance boost for out_prod. Full measurements of out-prod runtime in ms: unroll_xv unroll_yv 1 67014.643 87826.469 2 77117.552 89077.656 4 72091.311 109121.657 8 61077.543 88678.334 16 56914.67 79514.947 24 59024.595 84350.254 28 55952.446 83368.73 32 51476.658 85177.745 36 55973.792 84659.92 40 55139.616 93844.738 48 60736.392 93330.267 64 99856.878 116994.99 Second column is when unrollying yv instead of xv * set lora_alpha to value of lora_r if it is not set via command line otherwise only changing lora_r will change scaling of lora adapter used in prediction * reshuffle original sample order instead of the previous shuffled order otherwise resumed reshuffle will not result in same sample order * block tiling for out-prod inspired by mul-mat block sizes are empirically optimized roughly doubles the flops of out-prod * exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32 * add static keywords * remove outcommented old code * update train-text-from-scratch with tokenization, sample selection and shuffling from finetune * remove lbfgs related train parameters * move common train functions into common/train.[h\|cpp] * move train state into struct train_state * move train data saving code into callback to unify code of opt_callback train_params are still different in finetune and train-text-from-scratch, so it can't yet be moved to train.h\|cpp * move common train params into common/train * move common opt_callback into common/train * fix consume_common_train_arg * save and load head_count_kv in lora checkpoints * increase train_samples by used_samples instead of number of batches on batch can contain more than one sample when option "fill_with_next_samples" is used * fix usage of llama_tokenize * remove static from process_escape since we need it exposed in header * fix code formating of long function declarations * fix condition in load_train_state_gguf * use die("msg") instead of replace GGML_ASSERT(!"msg") or throw std::runtime_error("msg") * fix saving and loading of training type * remove terminating '\0' from tokenization (llama_tokenize is now passed the string length instead of relying on terminating '\0') * fix compile warnings * fix compile warnings * use new/delete for train_state instead of malloc/free using malloc may result in seg faults when trying to assign string fields * assert that sample_count > 0, avoiding division by zero * fix frand to return value in interval [0,1) * add train option "--sample-random-offsets" Use samples beginning at random offsets. The offset is only applied to the first sample in each batch context window. Together with "--fill-with-next-samples" this may help for training endless text generation. For example given a dataset containing samples "abcd", "ABCD", "0123". With context size of 8 and options "--fill-with-next-samples", "--no-separate-with-eos", "--no-separate-with-bos", the context windows of batches could only be filled with "abcdABCD", "ABCDabcd", "0123abcd", etc. With "--sample-random-offsets" it can also be filled with "23abcdAB", "bcd0123A", etc. * deduplicate code into function * remove n_rot hparam, as it must always be hparam.n_embd_head() * align code * assert correct base model tensor shapes * move some params from lora hparams into model hparams and load model params from gguf this equalizes the model definition in finetune and text-from-scratch and removes the need for additional llama api functions to get model parameters * remove now unnecessary llama API functions to get model params that where added by this PR * train-text-from-scratch: automatically allocate model tensors, remove option '--mem-model N' * train-text-from-scratch: automatically allocate opt context * train-text-from-scratch: automatically allocate input tensors * train-text-from-scratch: automatically allocate compute memory * remove unused options and equalize train-text-from-scratch with finetune * initialize opt->loss_after with zero * add export-lora program * remove trailing whitespace * add export-lora build in Makefile * remove unused struct tensor_info from export-lora * add export-lora build dependency to llama because it depends on common, which depends on llama * update finetune README.md * cancel optimization when specified number of epochs is completed * improve handling of export-lora arguments print errors and warnings when files could not be read or created * Fix export-lora.cpp "not enough space in the context's memory pool" (#1) * Fix export-lora.cpp "not enough space in the context's memory pool" Without this patch, export-lora would sometimes error with "not enough space in the context's memory pool (needed 656784, available 656800)". * increase required context size by 5GGML_MEM_ALIGN instead of plain 16 --------- Co-authored-by: xaedes <xaedes@gmail.com> improve handling of not yet supported tensor types --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: meatbag-18a <145869052+meatbag-18a@users.noreply.github.com>	2023-09-28 21:40:11 +03:00
Georgi Gerganov	ec893798b7	llama : custom attention mask + parallel decoding + no context swaps (#3228 ) * tests : verify that RoPE is "additive" * llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask) * ggml : ggml_rope now takes a vector with positions instead of n_past * metal : add rope_f16 kernel + optimize cpy kernels * llama : unified KV cache + batch inference API * llama : add new llama_decode() API that works with llama_batch * llama : add cell_max heuristic for more efficient kv_cache * llama : extend llama_kv_cache API * llama : more robust cell_max heuristic + wip shift * metal : disable concurrency optimization * llama : add llama_kv_cache_shift_seq + no more context swaps * llama : apply K-cache roping for Falcon and Baichuan * speculative : fix KV cache management * parallel : example for serving multiple users in parallel * parallel : disable hot-plug to avoid cache fragmentation * fixes : speculative KV cache + llama worst-case graph * llama : extend batch API to select which logits to output * llama : fix worst case graph build * ggml-cuda : update rope implementation for parallel decoding (#3254) * ggml-cuda : update rope implementation for parallel decoding * better solution for p0 computation * fix rope * simpler rope implementation --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * make : add parallel to build + fix static functions in llama.cpp * simple : fix token counting * parallel : various improvements * llama : fix cell_max logic + rename functions * parallel : try smaller batches when the KV cache is fragmented * parallel : fix sequence termination criteria * llama : silence errors KV cache errors * parallel : remove new line from prompt * parallel : process system prompt once + configurable paramters + llama API * parallel : remove question with short answers * parallel : count cache misses * parallel : print misses on each request * parallel : minor * llama : fix n_kv to never become 0 * parallel : rename hot-plug to continuous-batching * llama : improve llama_batch API + simplify parallel example * simple : add parallel decoding support * simple : improve comments + free batch * ggml-cuda : add rope f16, restore performance with parallel decoding (#3272) * ggml-cuda : add rope f16, restore performance * offload KQ_mask with all models * fix rope shift --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama : disable MPI for now ggml-ci * train : make KQ_pos memory buffer permanent via dummy scale op * ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275) ggml-ci * parallel : fix bug (extra BOS) + smaller token_prev array * parallel : fix cases where the input prompts can overflow the batch * parallel : add disabled experimental batch chunking in powers of two * llama : llama.h formatting + comments * simple : add README.md * llama : fix kv cache heuristic when context is less than 32 * parallel : fix crash when `-n -1` * llama : simplify returns if/else branches * metal : use mm kernels for batch size > 2 * examples : utilize new llama_get_logits_ith() * examples : add example for batched decoding * examples : do not eval prompt 2 times (close #3348) * server : clear the KV cache beyond n_past before llama_decode * server : avoid context swaps by shifting the KV cache --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-09-28 19:04:36 +03:00
goerch	b08e75baea	Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 (#3170 ) * Fix für #2721 * Reenable tokenizer test for LLaMa * Add `console.cpp` dependency * Fix dependency to `common` * Fixing wrong fix. * Make console usage platform specific Work on compiler warnings. * Adapting makefile * Remove trailing whitespace * Adapting the other parts of the makefile * Fix typo. * Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 * Simplify logic * Add missing change... * Fix ugly compiler warning * llama_tokenize should accept strings containing NUL now * Adding huichen's test case	2023-09-16 13:41:33 +02:00
Cebtenzzre	3aefaab9e5	check C++ code with -Wmissing-declarations (#3184 )	2023-09-15 15:38:27 -04:00
goerch	71ca2fad7d	whisper : tokenizer fix + re-enable tokenizer test for LLaMa (#3096 ) * Fix für #2721 * Reenable tokenizer test for LLaMa * Add `console.cpp` dependency * Fix dependency to `common` * Fixing wrong fix. * Make console usage platform specific Work on compiler warnings. * Adapting makefile * Remove trailing whitespace * Adapting the other parts of the makefile * Fix typo.	2023-09-13 16:19:44 +03:00
Cebtenzzre	00d62adb79	fix some warnings from gcc and clang-tidy (#3038 ) Co-authored-by: xaedes <xaedes@gmail.com>	2023-09-07 13:22:29 -04:00

... 2 3 4 5 6 ...

394 Commits