llama.cpp

Commit Graph

Author	SHA1	Message	Date
l-austenfeld	c76b420e4c	vendor : update vendored copy of google/minja (#15011 ) * vendor : update vendored copy of google/minja Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com> * Re-remove trailing whitespace Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com> * Remove another trailing whitespace Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com> --------- Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>	2025-08-01 16:59:06 +02:00
stevenkuang	0f5ccd6fd1	model : add hunyuan dense (#14878 ) * support hunyuan_v1_dense Signed-off-by: stevenkuang <stevenkuang@tencent.com> * update hunyuan_moe to hunyuan_v1_moe Signed-off-by: stevenkuang <stevenkuang@tencent.com> * fix rope alpha assert and bos token Signed-off-by: stevenkuang <stevenkuang@tencent.com> * add blank line Signed-off-by: stevenkuang <stevenkuang@tencent.com> * Revert "update hunyuan_moe to hunyuan_v1_moe" This reverts commit `aa973ca219`. * use hunyuan_dense instead of hunyuan_v1_dense Signed-off-by: stevenkuang <stevenkuang@tencent.com> * fix hunyuan_moe chat template Signed-off-by: stevenkuang <stevenkuang@tencent.com> * remove leftover code Signed-off-by: stevenkuang <stevenkuang@tencent.com> * update hunyuan dense chat template Signed-off-by: stevenkuang <stevenkuang@tencent.com> * fix hunyuan dense vocab and chat template Signed-off-by: stevenkuang <stevenkuang@tencent.com> --------- Signed-off-by: stevenkuang <stevenkuang@tencent.com>	2025-08-01 15:31:12 +02:00
lhez	1c872f71fb	opencl: add f16 for `add`, `sub`, `mul`, `div` (#14984 )	2025-08-01 13:15:44 +02:00
Srihari-mcw	baad94885d	ggml : Q2k interleaving implementation - x86/x64 SIMD (#14373 ) * Initial Q2_K Block Interleaving Implementation * Addressed review comments and clean up of the code * Post rebase fixes * Initial CI/CD fixes * Update declarations in arch-fallback.h * Changes for GEMV Q2_K in arch-fallback.h * Enable repacking only on AVX-512 machines * Update comments in repack.cpp * Address q2k comments --------- Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>	2025-08-01 09:20:33 +03:00
Georgi Gerganov	ba42794c9e	graph : fix equal_seq() check (#14986 ) ggml-ci	2025-08-01 06:38:12 +03:00
diannao	2860d479b4	docker : add cann build pipeline (#14591) * docker: add cann build pipeline * docker: add cann build pipeline * docker: fix cann devops * cann : fix multi card hccl * Update ggml/src/ggml-cann/ggml-cann.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update ggml-cann.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-08-01 10:02:34 +08:00
R0CKSTAR	484b2091ce	compare-commits.sh: support both llama-bench and test-backend-ops (#14392 ) * compare-commits.sh: support both llama-bench and test-backend-ops Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Speed up the build by specifying -j 12 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Remove build_number from test-backend-ops db Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Apply suggestion from @JohannesGaessler Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Refine tool selection logic Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-01 08:47:27 +08:00
Ed Addario	daf2dd7880	quantize : skip tensor override when in fallback mode (#14995 )	2025-07-31 21:32:18 +02:00
Diego Devesa	a06ed5feae	llama : add simple option to enable CPU for MoE weights (--cpu-moe) (#14992 )	2025-07-31 20:15:41 +02:00
Aman Gupta	784524053d	Fix params bug in diffusion example (#14993 )	2025-08-01 01:22:58 +08:00
Diego Devesa	d6818d06a6	llama : allow other bufts when overriding to CPU, add --no-repack option (#14990 )	2025-07-31 18:11:34 +02:00
Ruben Ortlam	e08a98826b	Vulkan: Fix minor debug mode issues (#14899 ) * vulkan: fix debug mode issues * vulkan: remove broken check_results GGML_OP_SET_ROWS support	2025-07-31 17:46:54 +02:00
tc-mb	952a47f455	mtmd : support MiniCPM-V 4.0 (#14983 ) * support minicpm-v 4 * add md * support MiniCPM-o 4.0 * add default location * temp rm MiniCPM-o 4.0 * fix code * fix "minicpmv_projector" default path	2025-07-31 17:22:17 +02:00
Csaba Kecskemeti	36e5fe7bcd	MODEL_TENSOR.SSM_DT_NORM was defined twice (#14991) * MODEL_TENSOR.SSM_DT_NORM was defined twice, and the second definition overwrote the jamba model's layer name * correct order	2025-07-31 10:59:49 -04:00
g2mt	94933c8c2e	server : implement universal assisted decoding (#12635 ) * llama-server : implement universal assisted decoding * Erase prompt tail for kv-cache * set vocab_dft_compatible in common_speculative * rename ctx_main to ctx_tgt * move vocab_dft_compatible to spec struct * clear mem_dft, remove mem * detokenize id_last for incompatible models * update comment * add --spec-replace flag * accept special tokens when translating between draft/main models * Escape spec-replace * clamp draft result to size to params.n_draft * fix comment * clean up code * restore old example * log common_speculative_are_compatible in speculative example * fix * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-31 14:25:23 +02:00
Dongliang Wei	c1dacaa99b	llama : merge build_moe_ffn_from_probs function into build_moe_ffn (#14968 )	2025-07-31 14:12:20 +02:00
Lukas Straub	a9f77a8be3	server : add openai-style logit_bias support (#14946 ) Signed-off-by: Lukas Straub <lukasstraub2@web.de>	2025-07-31 14:08:23 +02:00
Aman Gupta	8a4a856277	Add LLaDA 8b Diffusion model (#14771 ) * Add support for Llada-8b: diffusion model * Add README * Fix README and convert_hf_to_gguf * convert_hf_to_gguf.py: address review comments * Make everything in a single example * Remove model-specific sampling * Remove unused argmax * Remove braced initializers, improve README.md a bit * Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps * Remove adding the mask token * Move add_add_bos_token to set_vocab * use add_bool in gguf_writer.py	2025-07-31 19:49:09 +08:00
hipudding	11490b3672	CANN: Improve loading efficiency after converting weights to NZ format. (#14985 ) * CANN: Improve loading efficiency after converting weights to NZ format. * CANN: fix typo	2025-07-31 19:47:20 +08:00
compilade	66625a59a5	graph : reduce splits for recurrent and hybrid models (#14825 ) * graph : avoid creating redundant s_copy views * graph : comment the s_copy views	2025-07-31 08:02:46 +03:00
lhez	6e6725459a	opencl: add `mul_mat_f32_f32_l4_lm` and `mul_mat_f16_f32_l4_lm` (#14809 )	2025-07-30 14:56:55 -07:00
Ed Addario	e9192bec56	quantize : fix using combined imatrix GGUFs (multiple datasets) (#14973 )	2025-07-30 21:11:56 +02:00
Daniel Bevenius	41e78c567e	server : add support for `embd_normalize` parameter (#14964) This commit adds support for the `embd_normalize` parameter in the server code. The motivation for this is that currently, if the server is started with a pooling type that is not `none`, Euclidean/L2 normalization will be the normalization method used for embeddings. However, this is not always the desired behavior, and users may want to use other normalization (or none); this commit allows that. Example usage: ```console curl --request POST \ --url http://localhost:8080/embedding \ --header "Content-Type: application/json" \ --data '{"input": "Hello world today", "embd_normalize": -1}' ```	2025-07-30 18:07:11 +02:00
uvos	ad4a700117	HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (#14949 )	2025-07-30 17:38:06 +02:00
Georgi Gerganov	e32a4ec60e	sync : ggml ggml-ci	2025-07-30 17:33:11 +03:00
Kai Pastor	e228de9449	cmake : Fix BLAS link interface (ggml/1316)	2025-07-30 17:33:11 +03:00
Kai Pastor	73a8e5ca03	vulkan : fix 32-bit builds (ggml/1313) The pipeline member can be cast to VkPipeline. This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit. Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.	2025-07-30 17:33:11 +03:00
Johannes Gäßler	92b8810ec7	CUDA: skip masked KV slices for all FA kernels (#14924 )	2025-07-30 15:46:13 +02:00
Georgi Gerganov	00131d6eaf	tests : update for LLAMA_SET_ROWS=1 (#14961 ) * test-thread-safety : each context uses a single sequence * embedding : handle --parallel argument ggml-ci * save-load : handle -np 1 ggml-ci * thread-safety : avoid overriding threads, reduce test case arg ggml-ci	2025-07-30 15:12:02 +03:00
Georgi Gerganov	1e15bfd42c	graph : fix stack-use-after-return (#14960 ) ggml-ci	2025-07-30 13:52:11 +03:00
Douglas Hanley	a118d80233	embeddings: fix extraction of CLS pooling results (#14927 ) * embeddings: fix extraction of CLS pooling results * merge RANK pooling into CLS case for inputs	2025-07-30 08:25:05 +03:00
Xinpeng Dou	61550f8231	CANN: update ops docs (#14935 ) * CANN:add ops docs * CANN: update ops docs	2025-07-30 08:39:24 +08:00
uvos	aa79524c51	HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (#14945 )	2025-07-29 20:23:04 +02:00
uvos	b77d11179d	HIP: add GGML_HIP_MMQ_MFMA option to allow disabling the MFMA path. (#14930) This is useful for testing for regressions on GCN with CDNA hardware. With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.	2025-07-29 17:44:30 +02:00
uvos	c7aa1364fd	HIP: Ignore unsupported unroll transformation in fattn-vec (#14931) LLVM with the amdgcn target does not support unrolling loops with conditional break statements when those statements cannot be resolved at compile time. Similar to other places in GGML, let's simply ignore this warning.	2025-07-29 17:43:43 +02:00
kallewoof	1a67fcc306	common : avoid logging partial messages (which can contain broken UTF-8 sequences) (#14937 ) * bug-fix: don't attempt to log partial parsed messages to avoid crash due to unfinished UTF-8 sequences	2025-07-29 17:05:38 +02:00
hipudding	204f2cf168	CANN: Add ggml_set_rows (#14943 )	2025-07-29 22:36:43 +08:00
Sigbjørn Skjæret	138b288b59	cuda : add softcap fusion (#14907 )	2025-07-29 14:22:03 +02:00
Johannes Gäßler	bbd0f91779	server-bench: make seed choice configurable (#14929 ) * server-bench: make seed choice configurable * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix error formatting * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-29 10:40:50 +02:00
Aman Gupta	0a5036bee9	CUDA: add roll (#14919 ) * CUDA: add roll * Make everything const, use __restrict__	2025-07-29 14:45:18 +08:00
lhez	8ad7b3e65b	opencl : add ops docs (#14910 )	2025-07-28 18:50:17 +02:00
Leonard Mosescu	bda62193b2	test-backend-ops : extend test case filtering (#14865) * Extend test case filtering 1. Allow passing multiple (comma-separated?) ops to test-backend-ops. This can be convenient when working on a set of ops, when you'd want to test them together (but without having to run every single op). For example: `test-backend-ops.exe test -o "ADD,RMS_NORM,ROPE,SILU,SOFT_MAX"` 2. Support full test-case variation string in addition to basic op names. This would make it easy to select a single variation, either for testing or for benchmarking. It can be particularly useful for profiling a particular variation (ex. a CUDA kernel), for example: `test-backend-ops.exe perf -b CUDA0 -o "MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=2)"` These two can be combined. As with the current `-o`, this change doesn't try to detect/report an error if a filter doesn't name existing ops (ex. misspelled) * Updating the usage help text * Update tests/test-backend-ops.cpp	2025-07-28 18:04:27 +02:00
Radoslav Gerganov	c556418b60	llama-bench : use local GPUs along with RPC servers (#14917 ) Currently if RPC servers are specified with '--rpc' and there is a local GPU available (e.g. CUDA), the benchmark will be performed only on the RPC device(s) but the backend result column will say "CUDA,RPC" which is incorrect. This patch is adding all local GPU devices and makes llama-bench consistent with llama-cli.	2025-07-28 18:59:04 +03:00
xctan	db16e2831c	ggml-cpu : deduplicate scalar implementations (#14897 ) * remove redundant code in riscv * remove redundant code in arm * remove redundant code in loongarch * remove redundant code in ppc * remove redundant code in s390 * remove redundant code in wasm * remove redundant code in x86 * remove fallback headers * fix x86 ggml_vec_dot_q8_0_q8_0	2025-07-28 17:40:24 +02:00
Akarshan Biswas	cd1fce6d4f	SYCL: Add set_rows support for quantized types (#14883 ) * SYCL: Add set_rows support for quantized types This commit adds support for GGML_OP_SET_ROWS operation for various quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16 type in the SYCL backend. The quantization/dequantization copy kernels were moved from cpy.cpp to cpy.hpp to make them available for set_rows.cpp. This addresses part of the TODOs mentioned in the code. * Use get_global_linear_id() instead ggml-ci * Fix formatting ggml-ci * Use const for ne11 and size_t variables in set_rows_sycl_q ggml-ci * Increase block size for q kernel to 256 ggml-ci * Cleanup imports * Add float.h to cpy.hpp	2025-07-28 20:32:15 +05:30
Xuan-Son Nguyen	00fa15fedc	mtmd : add support for Voxtral (#14862 ) * mtmd : add support for Voxtral * clean up * fix python requirements * add [BEGIN_AUDIO] token * also support Devstral conversion * add docs and tests * fix regression for ultravox * minor coding style improvement * correct project activation fn * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-28 15:01:48 +02:00
Johannes Gäßler	946b1f6859	CUDA: fix pointer incrementation in FA (#14916 )	2025-07-28 14:30:22 +02:00
Dongliang Wei	6c6e397aff	model : add support for SmallThinker series (#14898 ) * support smallthinker * support 20b softmax, 4b no sliding window * new build_moe_ffn_from_probs, and can run 4b * fix 4b rope bug * fix python type check * remove is_moe judge * remove set_dense_start_swa_pattern function and modify set_swa_pattern function * trim trailing whitespace * remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * better whitespace Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use GGML_ASSERT for expert count validation Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Improve null pointer check for probs Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use template parameter for SWA attention logic * better whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * move the creation of inp_out_ids before the layer loop * remove redundant judge for probs --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-28 13:47:00 +02:00
Alberto Cabrera Pérez	afc0e89698	sycl: refactor quantization to q8_1 (#14815 ) * sycl: quantization to q8_1 refactor * Refactored src1 copy logic in op_mul_mat	2025-07-28 11:05:53 +01:00
Georgi Gerganov	a5771c9eea	ops : update BLAS (#14914 )	2025-07-28 10:01:03 +02:00
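
The `embd_normalize` request described in commit 41e78c567e can also be sketched in Python. This is a minimal illustration, not code from the repository: the helper name `build_embedding_request` and its `base_url` default are hypothetical, and the endpoint path, JSON fields, and the value `-1` are copied from the curl example in that commit message. The block only builds the request object and does not contact a server.

```python
import json
import urllib.request


def build_embedding_request(text, embd_normalize=-1, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the server's /embedding endpoint.

    Hypothetical helper: the URL, field names, and the default value -1 are
    taken from the example in the commit message; nothing else is assumed
    about the server's behavior.
    """
    payload = {"input": text, "embd_normalize": embd_normalize}
    return urllib.request.Request(
        url=f"{base_url}/embedding",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_embedding_request("Hello world today")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With a llama-server instance listening on that address, the request could be sent with `urllib.request.urlopen(req)`; the shape of the response is not described in the commit message, so it is left out here.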

6059 Commits
