llama.cpp

Commit Graph

Author	SHA1	Message	Date
David Friehs	27b93cbd15	cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624 ) * cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization - load all 8 int8 for a grid position in one load - calculate signs via popcnt instead of fetching from ksigns table - broadcast signs to drop individual shift/mask * cuda: iq2xxs: simplify sum scaling express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8` express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)` saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight AFAICT no overflow can occur here as iq2xxs values are far too small * uint -> uint32_t error: identifier "uint" is undefined	2026-02-15 22:38:42 +05:30
Aaron Teo	6e67fd2144	docs: update s390x build docs (#19643 )	2026-02-16 00:33:34 +08:00
Adrien Gallouët	9e118b97c4	build : remove LLAMA_HTTPLIB option (#19623 ) This option was introduced as a workaround because cpp-httplib could not build on visionOS. Since it has been fixed and now compiles on all platforms, we can remove it and simplify many things. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-15 15:38:50 +01:00
Daniel Bevenius	57088276d4	cmake : check if KleidiAI API has been fetched (#19640 ) This commit addresses a build issue with the KleidiAI backend when building multiple cpu backends. Commit `3a00c98584` ("cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL") introduced a change where FetchContent_Populate is called instead of FetchContent_MakeAvailable, where the latter does handle this case (it is idempotent but FetchContent_Populate is not). I missed this during my review and I should not have committed without verifying the CI failure, sorry about that.	2026-02-15 13:59:38 +01:00
Georgi Gerganov	341bc7d23c	context : fix output reorder with backend sampling (#19638 )	2026-02-15 14:57:40 +02:00
Georgi Gerganov	08e6d914b8	ggml : avoid UB in gemm ukernel (#19642 )	2026-02-15 14:56:35 +02:00
Aaron Teo	184c694f45	ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399 )	2026-02-15 18:20:35 +08:00
Aman Gupta	684b36101c	ggml-cpu: FA add GEMM microkernel (#19422 ) * ggml-cpu: FA add GEMM microkernel * add guard for sizeless vector types * fix case where DV % GGML_F32_EPR !=0 * move memset out of the loop * move another memset out of the loop * use RM=4 for arm * simd_gemm: convert everything to int * convert everything to size_t to avoid warnings * fixup * add pragma for ignoring aggressive loop optimizations	2026-02-15 11:09:24 +05:30
SamareshSingh	3a00c98584	cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581 ) * cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL Fix for the bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used. The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality. * addressed code review comments	2026-02-15 06:22:53 +01:00
Sigbjørn Skjæret	079feab9e3	convert : ensure all models handle new experts count (#19621 ) * ensure all models handle new experts count * revert removal for PhiMoeModel, does not inherit from base	2026-02-14 22:22:32 +01:00
Anav Prasad	01d8eaa28d	mtmd : Add Nemotron Nano 12B v2 VL support (#19547 ) * nemotron nano v2 vlm support added * simplified code; addressed reviews * pre-downsample position embeddings during GGUF conversion for fixed input size	2026-02-14 14:07:00 +01:00
Georgi Gerganov	1725e316c1	models : optimize qwen3next graph (#19375 ) * models : optimizing qwen3next graph * cont * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * cont : remove redundant q, g chunking * minor * minor * avoid passing masks around * avoid concats during chunking * naming + shapes * update names and use prefix to disable CUDA graphs	2026-02-14 12:57:36 +02:00
Adrien Gallouët	b7742cf321	ggml : fix GGML_DEBUG with OpenMP (#19599 ) last_graph is only available without OpenMP, but ggml_graph_compute_thread() is called in both cases. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-14 11:22:57 +01:00
iMil	badba89320	NetBSD build support (#19589 )	2026-02-14 09:47:01 +01:00
Aleksander Grygier	baa12f3831	webui: Architecture and UI improvements (#19596 )	2026-02-14 09:06:41 +01:00
agent-enemy-2	2d8015e8a4	llama : update LoRA API. + fix excessive graph reserves (#19280 ) * Refactoring to use new llama_put_adapter_loras * cont : alternative lora API --------- Co-authored-by: Jake Chavis <jakechavis6@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-02-14 10:06:27 +02:00
George	eb145c0753	mmap: Fix Windows handle lifetime (#19598 ) * ggml: added cleanups in ggml_quantize_free Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup. * mmap: Fix Windows handle lifetime Move hMapping from local variable to member variable so it stays alive for the entire lifetime of the mapping. The file mapping handle must remain valid until UnmapViewOfFile is called. Fixes cleanup order in destructor. * Update llama-mmap.cpp * Update llama-mmap.cpp Remove trailing whitespace from line 567	2026-02-14 10:05:12 +02:00
Georgi Gerganov	6e473fb384	metal : fix ACC op (#19427 )	2026-02-14 09:54:03 +02:00
Adrien Gallouët	c7db95f106	scripts : use official split.py for cpp-httplib (#19588 ) * scripts : use official split.py for cpp-httplib Using the official script is safer and ensures the generated code aligns with the library's standards. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Catch generic errors Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Allow print() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Ensure robust cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-14 08:41:16 +01:00
Sigbjørn Skjæret	0d00ef65ed	convert : store ffn_gate_inp_shexp as F32 (#19606 )	2026-02-14 08:17:43 +01:00
Adrien Gallouët	91ea5d67f2	build : fix libtool call in build-xcframework.sh (#19605 ) Run libtool via xcrun like strip and dsymutil, to have proper tool resolution. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-14 06:48:37 +01:00
Jeff Bolz	dbb023336b	vulkan: support L2_NORM with contiguous rows (#19604 )	2026-02-14 06:42:04 +01:00
Jeff Bolz	53aef25a88	vulkan: support GGML_OP_SET (#19584 )	2026-02-14 06:36:38 +01:00
Sophon	2dec548094	vulkan: Add vendor id for Qualcomm drivers (#19569 ) This commit allows Qualcomm native vulkan driver to be used on Windows instead of Mesa Dozen.	2026-02-14 06:29:17 +01:00
Max Krasnyansky	0ccbfdef3e	hexagon: further optimizations and refactoring for flash attention (#19583 ) * ggml-hexagon: fa improvements ggml-hexagon: optimize flash attention calculations with improved variable handling ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32 ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements ggml-hexagon: optimize flash attention by changing slope vector type to F16 * hexfa: fixed test-backend-ops failures due to leftover element handling * hexagon: refactor and optimize fa to use local context struct * ggml-hexagon: optimize flash-attention using hvx_vec_expf Use HVX for online softmax. --------- Co-authored-by: chraac <chraac@gmail.com>	2026-02-13 16:27:30 -08:00
Mengsheng Wu	94a602db66	github : add missing backends to issue templates (#19603 )	2026-02-14 00:56:53 +01:00
Jeff Bolz	05a6f0e894	vulkan: restore -inf check in FA shaders (#19582 )	2026-02-13 13:35:29 -06:00
Adrien Gallouët	b48e80f677	common : update download code (#19573 ) * common : remove legacy .json to .etag migration code Signed-off-by: Adrien Gallouët <angt@huggingface.co> * common : simplify common_download_file_single_online This commit also forces a redownload if the file exists but has no .etag file. Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-13 15:10:46 +01:00
Xuan-Son Nguyen	752584d5f5	model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) (#19460 ) * model: support GLM MoE DSA arch * working version * pyright * keep indexer tensors * add indexer gguf params * loaded now * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * update * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * minor fix and cleanup --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-13 14:56:53 +01:00
Alberto Cabrera Pérez	cc2aa81513	Fix wrong memcpy length for block_interleave == 4 (#19575 )	2026-02-13 20:32:14 +08:00
ymcki	0e21991472	fix vulkan ggml_acc only works in 3d but not 4d (#19426 ) * fix vulkan ggml_acc only works in 3d but not 4d * removed clamp in test_acc_block * use the correct stride and its test case * cuda : fix "supports op" condition * change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check * version without boundary check * revert back to boundary check version --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-02-13 13:31:37 +01:00
Sigbjørn Skjæret	b2ecc0cdb4	support --verbose-prompt (#19576 )	2026-02-13 12:49:10 +01:00
Aman Gupta	5065da554e	CUDA: loop over ne2ne3 in case it overflows (#19538 ) CUDA: loop over ne2ne3 in case it overflows use fastdiv	2026-02-13 17:01:40 +05:30
Aleksander Grygier	5174d7206f	webui: UI and routing fixes (#19586 ) * chore: update webui build output * chore: update webui build output * fix: Scroll issues in DropdownMenuSearchable * webui: fix redirect to root ignoring base path * fix: Word wrapping * fix: remove obsolete modality UI tests causing CI failures - Remove VisionModality/AudioModality test stories - Remove mockServerProps usage and imports - Simplify Default test (remove dropdown interaction checks) - Simplify FileAttachments test (remove mocks) * feat: Improve formatting performance time --------- Co-authored-by: Pascal <admin@serveurperso.com>	2026-02-13 12:31:00 +01:00
Oliver Simons	43919b7f4f	CUDA: Do not mutate cgraph for fused ADDs (#19566 ) * Do not mutate cgraph for fused ADDs 1. We should try to minimize in-place changes to the incoming ggml_cgraph where possible (those should happen in graph_optimize) 2. Modifying in-place leads to an additional, unnecessary graph capture step as we store the properties before modifying the graph in-place in the cuda-backend * Assert ggml_tensor is trivially copyable * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2026-02-13 15:07:55 +05:30
Pavan Shinde	423cf0b26f	docs : fix broken link and typo (#19560 )	2026-02-13 09:38:09 +01:00
ymcki	33a56f90a6	model : Kimi Linear fix conv state update (#19531 ) * fix conv state update for llama-server parallel serving --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-02-13 09:10:18 +01:00
Adrien Gallouët	25224c8021	llama : remove deprecated codecvt (#19565 ) Using the same conversion function ensures a consistent matching between the regex pattern and the text. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-13 06:43:53 +01:00
Adrien Gallouët	2f5d8f8edc	vendor : update BoringSSL to 0.20260211.0 (#19562 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-13 06:43:26 +01:00
Georgi Gerganov	bb96bfd361	memory : fix kv cache size for hybrid models (#19559 )	2026-02-13 07:36:24 +02:00
Georgi Gerganov	0644baefde	metal : improve concurrency (#19555 )	2026-02-13 07:35:57 +02:00
Georgi Gerganov	490eb96b88	metal : support GGML_OP_SET (#19548 )	2026-02-13 07:34:52 +02:00
Shupei Fan	3bb78133ab	hexagon: fix typo in vtcm_needs_release (#19545 )	2026-02-12 15:07:49 -08:00
lhez	79cc0f2daf	opencl: add basic support for q4_1 (#19534 ) * opencl: add q4_1 mv * opencl: clean up * opencl: add flattened q4_1 mv * opencl: clean up * opencl: add basic q4_1 mm * opencl: fix whitespace * opencl: add general q4_0 mm	2026-02-12 14:52:37 -08:00
Georgi Gerganov	338085c69e	args : add -kvu to llama-parallel (#19577 )	2026-02-12 21:52:41 +02:00
Aleksander Grygier	4c61875bf8	webui: Add switcher to Chat Message UI to show raw LLM output (#19571 )	2026-02-12 19:55:51 +01:00
Adrien Gallouët	4b385bfcf8	vendor : update cpp-httplib (#19537 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-12 16:11:22 +01:00
Christian Schmitz	f488429380	llama : update outdated comment in llama.h (#19428 ) * Updated documentation Model is no longer a parameter * llama : fix trailing whitespace in comment --------- Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2026-02-12 15:52:57 +01:00
Aleksander Grygier	4d688f9ebb	(webui) FEATURE: Enable adding or injecting System Message into chat (#19556 ) * feat: Enable adding System Prompt per-chat * fix: Save draft message in Chat Form when adding System Prompt from new chat view * fix: Proper system message deletion logic * chore: Formatting * chore: update webui build output	2026-02-12 13:56:08 +01:00
Daniel Bevenius	ff599039a9	scripts : add support for forks in pr2wt.sh (#19540 ) This commit adds support for using the pr2wt.sh (pull request to workspace) script with forks of upstream llama.cpp.	2026-02-12 13:14:28 +01:00
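
Note on commit 27b93cbd15: the message states two integer rewrites for the iq2xxs sum scaling. The following is a minimal Python sketch that checks both identities by brute force. It is not code from the repository: the function names, the tested ranges, and the `trunc_div` helper (which emulates C/CUDA integer division truncating toward zero, since Python's `//` floors) are assumptions made for illustration only.

```python
def trunc_div(a: int, b: int) -> int:
    """Integer division truncating toward zero, like C/CUDA `/` on ints."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def scaled_sum_old(sumi: int, scale: int) -> int:
    # Form quoted in the commit message: (sum * scale + sum / 2) / 4
    return trunc_div(sumi * scale + trunc_div(sumi, 2), 4)


def scaled_sum_new(sumi: int, scale: int) -> int:
    # Rewritten form: (sum * (scale * 2 + 1)) / 8
    return trunc_div(sumi * (scale * 2 + 1), 8)


def scale_factor_old(aux32: int) -> int:
    # ((aux32 >> 28) * 2 + 1): top 4 bits, doubled, plus one
    return (aux32 >> 28) * 2 + 1


def scale_factor_new(aux32: int) -> int:
    # (aux32 >> 27 | 1): top 5 bits with the lowest bit forced to 1
    return (aux32 >> 27) | 1


def check_identities(max_abs_sum: int = 2048) -> bool:
    """Brute-force both rewrites over an assumed, illustrative range."""
    for scale in range(16):  # a 4-bit scale, as implied by `aux32 >> 28`
        for sumi in range(-max_abs_sum, max_abs_sum + 1):
            assert scaled_sum_old(sumi, scale) == scaled_sum_new(sumi, scale)
    for top5 in range(32):  # every pattern of the top 5 bits of a uint32
        for low in (0, 1, 0x07FFFFFF):
            aux32 = (top5 << 27) | low
            assert scale_factor_old(aux32) == scale_factor_new(aux32)
    return True
```

Calling `check_identities()` returns `True` when both rewrites agree over the tested ranges. The first identity relies on truncating division; with Python's flooring `//` the two forms can differ for negative sums, which is why the helper is used.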

8064 Commits
