llama.cpp

Commit Graph

Author	SHA1	Message	Date
Adrien Gallouët	4b385bfcf8	vendor : update cpp-httplib (#19537 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-12 16:11:22 +01:00
Christian Schmitz	f488429380	llama : update outdated comment in llama.h (#19428 ) * Updated documentation Model is no longer a parameter * llama : fix trailing whitespace in comment --------- Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2026-02-12 15:52:57 +01:00
Aleksander Grygier	4d688f9ebb	(webui) FEATURE: Enable adding or injecting System Message into chat (#19556 ) * feat: Enable adding System Prompt per-chat * fix: Save draft message in Chat Form when adding System Prompt from new chat view * fix: Proper system message deletion logic * chore: Formatting * chore: update webui build output	2026-02-12 13:56:08 +01:00
Daniel Bevenius	ff599039a9	scripts : add support for forks in pr2wt.sh (#19540 ) This commit adds support for using the pr2wt.sh (pull request to workspace) script with forks of upstream llama.cpp.	2026-02-12 13:14:28 +01:00
Aleksander Grygier	f486ce9f30	(webui) REFACTOR: UI primitives and polish (#19551 ) * webui: UI primitives and polish (non-MCP) * chore: update webui build output	2026-02-12 12:21:00 +01:00
Aleksander Grygier	38adc7d469	WebUI Architecture Cleanup (#19541 ) * webui: architecture foundation (non-MCP core refactors) * chore: update webui build output	2026-02-12 11:22:27 +01:00
Georgi Gerganov	3b3a948134	metal : update sum_rows kernel to support float4 (#19524 )	2026-02-12 11:35:28 +02:00
Mario Limonciello	6845f7f87f	Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461 ) There is an upstream problem [1] with AMD's LLVM 22 fork and rocWMMA 2.2.0 causing compilation issues on devices without native fp16 support (CDNA devices). The specialized types aren't resolved properly: ``` /opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>' 2549 | using ARegsT = typename Impl::ARegsT; ``` Add a workaround to explicitly declare the types and cast when compiling with HIP and ROCWMMA_FATTN [2]. When this is actually fixed upstream some guards can be used to detect and wrap the version that has the fix to only apply when necessary. Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1] Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2] Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>	2026-02-12 09:38:35 +01:00
RichardScottOZ	fa16e517a3	server : fix typo in README.md for features list (#19510 ) extra l for full	2026-02-12 08:56:25 +01:00
TriDefender	313493de53	docs : update path in snapdragon README.md (#19533 ) paths changed so original example didn't work	2026-02-12 08:13:51 +01:00
Max Krasnyansky	b1ff83bbb0	hexagon: further optimization and tuning of matmul and dot kernels (#19407 ) * ggml-hexagon: implement 2x2 matmul kernel * hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4 * hexagon: fix editor config failures * hexagon: refactor matmul ops to use context struct and remove wrappers Also implement vec_dot_f16 2x2 * hexagon: refactor dyn quantizers to use mmctx * hexagon: remove mm fastdiv from op_ctx * hexagon: refactor matmul entry point to reduce code duplication --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>	2026-02-11 23:04:27 -08:00
Adrien Gallouët	4ae1b7517a	common : replace deprecated codecvt using parse_utf8_codepoint (#19517 ) Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>	2026-02-12 07:27:52 +01:00
lhez	4d3daf80f8	opencl: add general Q6_K mm and Q4_K mv (#19347 ) * opencl: add general q6_k mm * opencl: refine condition for q6_K mm * opencl: add general q4_K mv * opencl: fix whitespace	2026-02-11 10:33:13 -08:00
Georgi Gerganov	914dde72ba	ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511 ) * ggml : unary ops support non-cont src0 * metal : support F16 unary ops + fix ELU	2026-02-11 18:58:43 +02:00
Daniel Bevenius	3136a849db	common : remove unused token util functions (#19506 ) This commit removes two unused functions `common_lcp` and `common_lcs`. The last usage of these functions was removed in Commit `33eff40240` ("server : vision support via libmtmd") and are no longer used anywhere in the codebase.	2026-02-11 17:41:35 +01:00
AesSedai	e463bbdf65	model: Add Kimi-K2.5 support (#19170 ) * Move dequant_model to after the text_config merge Add new kimi-k2.5 keys to mtmd convert Update V_MMPROJ tensor mapping for new mm_projector.proj keys Update V_M_IMP_NORM for new mm_projector.pre_norm key * Fix a couple of oversights * Add image support for Kimi-K2.5 * Revert changes to KimiVLForConditionalGeneration * Fix an assert crash * Fix permute swapping w / h on accident * Kimi-K2.5: Use merged QKV for vision * Kimi-K2.5: pre-convert vision QK to use build_rope_2d * Kimi-K2.5: support non-interleaved rope for vision * Kimi-K2.5: fix min / max pixel * Kimi-K2.5: remove v/o permutes, unnecessary * Kimi-K2.5: update permute name to match * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Kimi-K2.5: replace build_rope_2d ggml_cont with ggml_view_3d pointers --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-11 16:47:30 +01:00
Daniel Bevenius	53de59f67d	build : fix case in dSYMs path for build-macos [no ci] (#19515 ) This commit updates an incorrect dSYMs where the 's' was uppercase by mistake. The motivation for fixing this is that this can cause issues on case sensitive operating systems. Refs: https://github.com/ggml-org/whisper.cpp/pull/3630	2026-02-11 14:02:29 +01:00
Georgi Gerganov	9ab072ebbe	metal : extend l2_norm support for non-cont src0 (#19502 )	2026-02-11 14:53:19 +02:00
Johannes Gäßler	ada90bf2ba	docs: ban AI for issues and discussions [no CI] (#19512 )	2026-02-11 12:49:40 +01:00
Adrien Gallouët	0c1f39a9ae	common : improve download error reporting (#19491 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-11 09:27:55 +01:00
Max Krasnyansky	73cd5e1b97	hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (#19406 ) * hexagon: add ARGSORT op Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com> * hexagon: argsort reject tensors with huge rows for now * Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend * hexagon : Add GEGLU op * hexagon: fix editor config check * hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA --------- Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com> Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>	2026-02-10 23:21:12 -08:00
thecaptain789	8ee538ce73	llama : correct typos 'occured' and 'occurences' (#19414 ) Co-authored-by: thecaptain789 <thecaptain789@users.noreply.github.com>	2026-02-11 07:05:31 +01:00
Georgi Gerganov	6d95707827	model : fix wavtokenizer embedding notions (#19479 )	2026-02-11 07:52:20 +02:00
Georgi Gerganov	89181c0b6d	ggml : extend bin bcast for permuted src1 (#19484 ) * tests : extend bin bcast for permuted src1 * cont : extend bin support * cont : s0 is always 1 * tests : simplify	2026-02-11 07:52:00 +02:00
Georgi Gerganov	ceaa89b786	metal : consolidate unary ops (#19490 )	2026-02-11 07:51:12 +02:00
Daniel Bevenius	2cce9fddb7	llama : refactor sampling_info to use buffer_view template (#19368 ) * llama : refactor sampling_info to use buffer_view template This commit updates the sampling_info struct in llama-context to use a buffer_view template for the logits, probs, sampled tokens, and candidates buffers. The motivation for this is to simplify the code, improve type safety and readability.	2026-02-11 05:38:13 +01:00
Oliver Simons	612db61886	CUDA : Update CCCL-tag for 3.2 to final release from RC (#19486 ) CCCL 3.2 has been released since it was added to llama.cpp as part of the backend-sampling PR, and it makes sense to update from RC to final released version. https://github.com/NVIDIA/cccl/releases/tag/v3.2.0	2026-02-10 22:31:19 +01:00
Nikhil Jain	57487a64c8	[WebGPU] Plug memory leaks and free resources on shutdown (#19315 ) * Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool * Free pools * Cleanup * More cleanup * Run clang-format * Fix arg-parser and tokenizer test errors that free an unallocated buffer * Fix device lost callback to not print on device teardown * Fix include and run clang-format * remove unused unused * Update binary ops --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-02-10 08:04:00 -08:00
JJJYmmm	fc0fe40049	models : support qwen3.5 series (#19468 ) * support qwen3.5 series * remove deepstack for now, and some code clean * code clean * add FULL_ATTENTION_INTERVAL metadata * code clean * reorder v heads for linear attention to avoid expensive interleaved repeat	2026-02-10 18:00:26 +02:00
Xuan-Son Nguyen	9a96352729	test: fix IMROPE perf test case (#19465 )	2026-02-10 14:37:50 +01:00
Alberto Cabrera Pérez	c03a5a46f0	ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (#19360 ) * First working version of GEMM and GEMV * interleave loads and compute * Clang-format * Added missing fallback. Removed tested TODO. * Swap M and N to be consistent with the repack template convention	2026-02-10 10:47:45 +00:00
k4ss4n	6948adc90d	ggml : use noexcept overload for is_regular_file in backend registration (#19452 ) using noexcept std::filesystem::directory_entry::is_regular_file overload prevents abnormal termination upon throwing an error (as caused by symlinks to non-existent folders on linux) Resolves: #18560	2026-02-10 10:57:48 +01:00
Piotr Wilkin (ilintar)	854b09f0d7	convert : move experts permutation from Qwen2MoeModel to Qwen3VLMoeTextModel (#19445 ) * Add special case for Qwen3VLMoe * Fix down path, remove arrows and checkmarks * ws * Moved to Qwen3VL * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-10 09:01:37 +01:00
Daniel Bevenius	66d403c480	tts : fix typos in README.md [no ci] (#19463 )	2026-02-10 07:30:41 +01:00
Raul Torres	f0bfe54f55	CANN: Remove unnecessary wrapper for `gml_backend_buft_is_cann` (#18968 )	2026-02-10 14:19:30 +08:00
hipudding	52e38faf8c	CANN: implement quantized MUL_MAT_ID for MoE models (#19228 ) Implement ggml_cann_mul_mat_id_quant function to support quantized matrix multiplication for Mixture of Experts (MoE) architectures on CANN backend. Key features: - Support Q4_0 and Q8_0 quantized weight formats - Use IndexSelect to dynamically route expert-specific weights based on indices - Leverage WeightQuantBatchMatmulV2 for efficient quantized computation - Handle automatic F16 type conversion for hardware compatibility - Support both per-expert and broadcast input modes Implementation details: - Extract expert weights and scales using CANN IndexSelect operation - Process each batch and expert combination independently - Create proper tensor views with correct stride for matmul operations - Automatic input/output type casting to/from F16 as needed Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).	2026-02-10 14:18:59 +08:00
Georgi Gerganov	a0d585537c	cuda : extend GGML_OP_PAD to work with non-cont src0 (#19429 ) * cuda : extend GGML_OP_PAD to work with non-cont src0 * tests : add permuted pad	2026-02-10 08:07:16 +02:00
Xuan-Son Nguyen	98e57ca422	chat: fix case where template accepts type content only (#19419 ) * chat: fix case where template accepts type content only * rm stray log * reuse render_message_to_json	2026-02-09 22:14:12 +01:00
Tarek Dakhran	262364e31d	mtmd: Implement tiling for LFM2-VL (#19454 )	2026-02-09 17:30:32 +01:00
손희준	820ebfa6f4	Server: log when converting requests to chat completions format (#19457 ) * Log converting requests * Print as debug instead of info [no ci] --------- Co-authored-by: openingnow <>	2026-02-09 16:22:57 +01:00
Sascha Rogmann	292f6908cd	spec : remove check rate (#19377 ) * spec: remove parameter spec-ngram-check-rate * spec : renamed statistics vars * spec : add n_call_begin, n_call_accept * spec : don't enable key-map-stats	2026-02-09 15:30:50 +02:00
Georgi Gerganov	81ddc60cb3	ci : add metal server workflows (#19293 ) * ci : add metal server workflows * cont : try fix python init * cont : move to a separate workflow that runs only on master * cont : fix num jobs Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-09 15:09:30 +02:00
Georgi Gerganov	972f323e73	revert : "[Model] Qwen3.5 dense and MoE support (no vision) (#19435 )" (#19453 ) This reverts commit `39bf692af1`.	2026-02-09 14:57:51 +02:00
Kevin Pouget	f5e7734ff2	ggml-virtgpu: add backend documentation (#19354 ) * ggml-virtgpu: add backend documentation Assisted-by-AI: Claude Code * CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget * README: add the link to docs/backend/GGML-VirtGPU/ggml-virt.md * docs/ggml-virt: add link to testing + configuration * Revert "CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget" This reverts commit `8ece8e72e2`. * drop the ggml- prefix * s/ggerganov/ggml-org * Relocate VirtGPU.md * reorganize the text * turn turn the ascii diagram into a mermaid * README.md: update the link to the main doc	2026-02-09 20:15:42 +08:00
Hugo	1e8924fd65	cmake : add variable to skip installing tests (#19370 ) When packaging downstream, there's usually little point in installing test. The default behaviour remains the same.	2026-02-09 07:12:02 +01:00
Piotr Wilkin (ilintar)	39bf692af1	[Model] Qwen3.5 dense and MoE support (no vision) (#19435 ) * Unified delta net handling * Remove old methods. * Refactor and optimize * Adapt autoregressive version from @ymcki * Change to decay mask approach * Fix bad permute * Qwen 3.5 support * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Further fixes * Use inheritance, remove unneeded conts * Not like this! * Remove ggml.h explicit import * Remove transformers, fix the views * ACTUALLY fix views, make super calls explicit in conversion. * Fix conversion again * Remove extra ggml.h imports --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-09 00:24:08 +01:00
Oliver Simons	e06088da0f	CUDA: Fix non-contig rope (#19338 ) * Rename variables + fix rope_neox Seems memory layout is shared with Vulkan so we can port fix from https://github.com/ggml-org/llama.cpp/pull/19299 * Fix rope_multi * Fix rope_vision * Fix rope_norm * Rename ne* to ne0* for consistent variable naming * cont : consistent stride names --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-02-08 15:12:51 +02:00
Adrien Gallouët	5fa1c190d9	rpc : update from common.cpp (#19400 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-08 09:06:45 +01:00
Georgi Gerganov	eb449cdfa4	server : improve context checkpoint logic (#19408 )	2026-02-08 09:40:04 +02:00
ddh0	5999b50eb0	llama-quantize : cleanup `--help` output (#19317 ) * cleanup `llama-quantize --help` output some much needed TLC * remove future argument oops, spoiler * cleanup of cleanup	2026-02-08 09:22:38 +02:00

8018 Commits

All Branches