llama.cpp

Commit Graph

Author	SHA1	Message	Date
Jianlin Shi	302089bd2d	Merge branch 'ggml-org:master' into master	2026-03-13 14:17:27 -05:00
ZeroV0LT	f17b3be63f	llama : fix pooling assertion crash in chunked GDN detection path (#20468) * llama : fix pooling assertion crash in chunked GDN detection path The chunked fused Gated Delta Net detection in sched_reserve() calls graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs. This creates a dimension mismatch in build_pooling() for embedding models with mean/rank pooling: build_inp_mean() creates a tensor with shape [n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...] via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b). Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations. Regression introduced by #20340 (`d28961d`). Same class of bug as #12517, fixed by #12545. * server : add mean pooling tests to embedding test suite Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple to cover the --pooling mean codepath, which was previously untested. These tests would have caught the regression introduced by #20340 where build_pooling() crashes with a ggml_mul_mat assertion due to mismatched dimensions in the chunked GDN detection path. --------- Co-authored-by: Domenico Crupi <domenico@zerovolt.it>	2026-03-13 20:53:42 +02:00
SoftwareRenderer	d7ba99c485	server: reset counter related to kill-switch on client error (#20513) * server: reset kill-switch on client error This avoids triggering a server kill switch. If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated. However since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates. * moved counter reset as per recommendation * cont : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-13 19:58:09 +02:00
jianlins	fe33dbe2fb	Merge branch 'master' of https://github.com/jianlins/llama.cpp	2026-03-13 11:57:59 -06:00
jianlins	48c561fcc9	workflow: enhance library packaging by preserving symlinks and adding runtime checks	2026-03-13 11:57:51 -06:00
rehan-10xengineer	fbaa95bc29	ggml-cpu: add RVV vec dot kernels for quantization types (#18859) * ggml-cpu: add rvv quantize_row_q8_K kernel Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> * ggml-cpu: add rvv vec_dot for iq4_nl, mxfp4, iq2_xxs Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> * ggml-cpu: add rvv vec_dot for iq4_xs, refactor * ggml-cpu: remove ifunc for rvv vec dot * ggml-cpu: add vec_dot for iq2_xs, iq3_xxs Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> * ggml-cpu: refactor quants.c --------- Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai> Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> Co-authored-by: Rehan Qasim <rehanbhatti0317@gmail.com>	2026-03-13 17:36:04 +02:00
Jianlin Shi	97ec9c04d5	Merge branch 'ggml-org:master' into master	2026-03-13 09:59:02 -05:00
Adrien Gallouët	b5e1212063	ggml : fix typo gmml (#20512) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-13 14:36:13 +01:00
Daniel Bevenius	8f974d2392	mtmd : rename mtmd_get_audio_bitrate to mtmd_get_audio_sample_rate (#20105) This commit renames the function `mtmd_get_audio_bitrate` to `mtmd_get_audio_sample_rate` to better reflect its purpose. The motivation for this is that the function currently returns the audio sample rate, not the bitrate (sample_rate × bit_depth × channels), and that is how it is used in the code as well. This is a breaking change, but I believe mtmd is still in experimental/development phase so it might be alright to simply rename.	2026-03-13 12:30:02 +01:00
Piotr Wilkin (ilintar)	2948e6049a	general: CONTRIBUTING.md - guidelines for quantization schemes (#19762) * Guidelines for quantization schemes * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Change required precision from Q8 to FP16/BF16 * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md [no ci] * Update CONTRIBUTING.md [no ci] --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-13 12:21:33 +01:00
Georgi Gerganov	73c9eb8ced	metal : fix l2 norm scale (#20493)	2026-03-13 11:43:20 +02:00
jianlins	ee86901ed7	Merge branch 'master' of https://github.com/jianlins/llama.cpp	2026-03-12 23:02:48 -06:00
jianlins	cd9d8771c2	workflow: update source checkout step and add tag creation logic in build script	2026-03-12 23:02:38 -06:00
Daniel Bevenius	983df142a9	convert : fix/suppress pyright errors (#20442) * convert : fix/suppress pyright errors This commit fixes the pyright errors that are generated by pyright for convert_hf_to_gguf.py. The motivation for this is that running this locally generates errors that CI does not, and it can be difficult to spot new errors. One use case is when working on new models which cannot be run in CI due to privacy. Having the ability to run pyright locally would be helpful in these cases. In the linked issue there is the mention of switching to `ty` which I don't know anything about but in the meantime I would appreciate if we could suppress these errors for now, and later perhaps revert this commit. With this change there are no errors but there are 4 information messages if the `mistral_common` package is installed. The `--level error` flag can be used to suppress them. Resolves: https://github.com/ggml-org/llama.cpp/issues/20417	2026-03-13 06:00:52 +01:00
Jianlin Shi	804e13febe	Merge branch 'ggml-org:master' into master	2026-03-12 23:06:57 -05:00
jianlins	25581cbb35	workflow: require tag_name input for manual release trigger and update asset upload process	2026-03-12 22:05:23 -06:00
jianlins	a23f3fb7fd	workflow: add safe.directory configuration for Git in build script	2026-03-12 20:12:53 -06:00
Georgi Gerganov	57819b8d4b	llama : disable graph reuse with pipeline parallelism (#20463)	2026-03-12 21:04:13 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	557fe2d913	vendor : update cpp-httplib to 0.37.1 (#20390)	2026-03-12 13:57:06 +01:00
Piotr Wilkin (ilintar)	0e810413bb	tests : use `reasoning` instead of `reasoning_budget` in server tests (#20432)	2026-03-12 13:41:01 +01:00
Ruben Ortlam	128142fe7d	test-backend-ops: allow loading tests from file and parsing model operators into file (#19896) * tests: allow loading test-backend-ops tests from json * add error threshold based on op * add error when file cannot be read * add graph operator json extraction tool * add nb parameter for non-contiguous input tensors * fix view check * only use view if non-contiguous/permuted, use C++ random instead of rand() * replace internal API calls with public llama_graph_reserve call * reduce test description length * fix nb[0] not getting set for view * add name to tests * fix inplace error * use text file instead of json * move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/ * fix missing declaration * use pragma once * fix indent * fix Windows build	2026-03-12 13:26:00 +01:00
Daniel Bevenius	6de1bc631d	common : update completion executables list [no ci] (#19934) This commit updates the bash completion executables list, adding missing executables and removing some that no longer exist.	2026-03-12 12:12:01 +01:00
Asbjørn Olling	0a10c34dc1	grammar: Fix grammar root symbol check (#19761) * grammar: fix bad check for root symbol, correct error logging * add tests to demonstrate root symbol check failure	2026-03-12 12:04:56 +01:00
ProgenyAlpha	deee23863b	vulkan: add GATED_DELTA_NET op support (#20334) * vulkan: add GATED_DELTA_NET op support Implements the fused gated delta net recurrence as a Vulkan compute shader with full support for scalar gate, KDA vector gate, GQA broadcast, multi-token sequences, and permuted (non-contiguous) q/k inputs. Specialization constants select head size (32/64/128) and KDA mode at pipeline creation time. Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: optimize GATED_DELTA_NET shader (Phase 1) - vec4 dot products on all inner loops (dp4 hardware intrinsic) - Cache exp(g) in shared memory for KDA path, eliminating ~32K redundant global reads and ~16K redundant exp() calls per token - vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops) - Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops KDA TG: +5.4% throughput. Non-KDA: no regressions. 13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: address review feedback for GATED_DELTA_NET Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros, scale in push constants, supports_op fix, dispatch restructuring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: add explicit FLOAT_TYPE casts for buffer loads Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts to ensure correct behavior across all Vulkan configurations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: fix Q/K broadcast for interleaved head layout Adapt to the interleaved broadcast convention from #20340: head_id / rq1 → head_id % neq1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 11:32:04 +01:00
Sigbjørn Skjæret	c3e3f9e533	convert : better mtp check and fix return [no ci] (#20419)	2026-03-12 10:04:20 +01:00
ProgenyAlpha	40c550d4f6	vulkan: fix SSM_CONV PP scaling with large ubatch sizes (#20379) * vulkan: optimize SSM_CONV workgroup dispatch for large ubatch Tile tokens into 2D workgroups (32x16) to reduce workgroup launch overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common d_conv size). Fixes PP performance degradation with ubatch > 512. Ref: ggml-org/llama.cpp#18725 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: remove unused shared memory declaration in SSM_CONV Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 10:03:18 +01:00
Pascal	de190154c8	New conversations now auto-select the first loaded model (#20403) * webui: auto-select first loaded model for new conversations in router mode * chore: update webui build output	2026-03-12 09:07:05 +01:00
Masashi Yoshimura	05039967da	ggml-virtgpu: Fix some build commands (#20341)	2026-03-12 15:47:45 +08:00
Georgi Gerganov	e4cff0956b	metal : avoid divisions in bin kernel (#20426) * metal : avoid modulus in bin kernel when not broadcasting * metal : fix capture_started flag	2026-03-12 09:42:40 +02:00
jianlins	eb7a304948	workflow: update build steps to handle gcc-toolset-12 sourcing more safely	2026-03-12 00:08:04 -06:00
jianlins	460a535956	workflow: update build script to use gcc-toolset-12 and add linker flags	2026-03-12 00:01:18 -06:00
Masato Nakasaka	4cc6eb158c	ci: Setup self-hosted CI for Intel Linux Vulkan backend (#20154)	2026-03-12 06:43:22 +01:00
Jeff Bolz	246ffc4b05	vulkan: fix l2_norm epsilon handling (#20350)	2026-03-12 06:39:41 +01:00
Jeff Bolz	aa429cf507	vulkan: fix OOB check in flash_attn_mask_opt (#20296)	2026-03-12 06:35:49 +01:00
Masato Nakasaka	5866e3bbc8	vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (#20059) * Changed to reuse command buffers to fix crashing on Intel GPU * Removed unused parameter * Fixed compile error and minor mistake * Fix logging * Changing to use usage flag per command buffer * fixed style * added buffer reset * Removed cmd_buffer_idx for reuse consistency * Fixed style	2026-03-12 06:30:16 +01:00
lhez	0516e04bf9	opencl: use larger workgroup size for get_rows (#20316)	2026-03-11 22:03:27 -07:00
shaofeiqi	3d9ab225e7	opencl: add cumsum op (#18981) * OpenCL: add CUMSUM op support * remove unused argument * opencl: refactor cumsum * opencl: refactor * opencl: refactor tmp buffer * opencl: adjust max number of subgroups * opencl: fix whitespace * opencl: fix global size when cumsum the tmp buffer --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-03-11 22:03:07 -07:00
jianlins	b1e28fa511	workflow: add gpu_arch input for customizable CUDA architecture in build script	2026-03-11 22:28:21 -06:00
jianlins	200905df28	workflow: remove crb repository enablement from Rocky Linux build script	2026-03-11 22:10:13 -06:00
jianlins	65f59acdde	workflow: update package manager commands from yum to dnf in Rocky Linux build script	2026-03-11 22:05:15 -06:00
jianlins	ddb3fcd924	workflow: comment out yum update in Rocky Linux build script	2026-03-11 22:01:02 -06:00
jianlins	38c82a6db7	workflow: install epel-release as a build dependency for Rocky Linux	2026-03-11 21:55:44 -06:00
jianlins	a3cc7bf992	workflow: add manual trigger for release creation in Rocky Linux build	2026-03-11 21:46:17 -06:00
jianlins	29b3c14619	workflow: update build configurations for Linux and Windows CUDA	2026-03-11 21:44:36 -06:00
Jianlin Shi	ab4c9081dc	Merge branch 'ggml-org:master' into master	2026-03-11 22:39:47 -05:00
uvos	d63aa398de	hip: compile debug builds with -O2 on hip to avoid a compiler bug (#20392)	2026-03-12 10:37:10 +08:00
Jianlin Shi	2a1e77436c	build-linux-cuda	2026-03-11 20:02:20 -05:00
Jianlin Shi	70dde08eba	Merge branch 'ggml-org:master' into master	2026-03-11 19:39:40 -05:00
Mishusha	a8304b4d27	common/parser: add GigaChatV3/3.1 models support (#19931) Co-authored-by: Mishusha <pmv26021975@gmail.com>	2026-03-12 01:22:25 +01:00
DAN™	fdb17643d3	model : add support for Phi4ForCausalLMV (#20168) * Add support for Phi4ForCausalLMV. * Fix Phi-4 vision parity (correcting SigLIP2 patch-kernel export layout) and matching HF NaFlex resize behavior in mtmd. * Rename constants + fix tokenizer label * Clean-ups. * Fix GGUF export. * Set tokenizer.ggml.pre explicitly. * Default vocab name rather than forcing it. * Clean-ups. * Fix indent. * Fix subscriptable error. * remove overcomplicated code path * Clean-ups. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-03-12 00:25:54 +01:00
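
Commit 8f974d2392 above renames a function because sample rate and bitrate are different quantities, with bitrate given as sample_rate × bit_depth × channels. A minimal sketch of that arithmetic follows. The function name `pcm_bitrate` and the example values are hypothetical illustrations, not part of the mtmd API, and the formula applies to uncompressed PCM audio only.

```python
def pcm_bitrate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Return the uncompressed PCM bitrate in bits per second.

    Hypothetical helper for illustration; it is not an mtmd function.
    """
    if sample_rate_hz <= 0 or bit_depth <= 0 or channels <= 0:
        raise ValueError("all arguments must be positive")
    # bitrate = samples per second * bits per sample * number of channels
    return sample_rate_hz * bit_depth * channels


# Example values chosen for illustration: 16 kHz, 16-bit, mono audio.
sample_rate = 16_000
print(sample_rate)                      # the sample rate itself: 16000
print(pcm_bitrate(sample_rate, 16, 1))  # the bitrate: 256000
```

The two printed numbers differ by a factor of bit_depth × channels, which is why a function that returns the first should not be named after the second.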

8368 Commits