llama.cpp

Commit Graph

Author	SHA1	Message	Date
Yee Man Chan	22bc582a82	return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement	2026-01-12 20:32:19 +08:00
Xuan-Son Nguyen	ce3bf9b1a4	server: update docs for sleeping [no ci] (#18777 )	2026-01-12 13:01:24 +01:00
Jeff Bolz	2bbe4c2cf8	vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678 ) This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128. This should work when the number of blocks in the A matrix is less than 2^32 (for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like 2^32*LOAD_VEC_A elements. - Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b. - Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle variants. So far this change just adds a single use case for this, compiling with the e64BitIndexingEXT flag. - Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange. 64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort to avoid enabling it unconditionally.	2026-01-12 12:32:13 +01:00
Ruben Ortlam	1051ecd289	vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763 ) * vulkan: Disable large coopmat matmul configuration on proprietary AMD driver * Also disable the large tile size	2026-01-12 07:29:35 +01:00
Yee Man Chan	4faf26c376	fixed flake8 complaints locally	2026-01-12 08:26:47 +08:00
Yee Man Chan	ac85cb1375	removed at least blank line containing white space	2026-01-12 08:14:51 +08:00
Xuan-Son Nguyen	0c3b7a9efe	model: fix qwen3next broken due to #18683 (#18762 )	2026-01-11 21:00:10 +01:00
Ruben Ortlam	0e76501e1d	Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749 ) * vulkan: Enable and optimize large matmul parameter combination for AMD * limit tuning to AMD GPUs with coopmat support * use tx_m values instead of _l	2026-01-11 17:33:33 +01:00
Xuan-Son Nguyen	4b060bf240	security: make it clear about subtopics in server (#18754 ) * security: make it clear about subtopics in server * exclude DoS	2026-01-11 16:51:03 +01:00
Daniel Bevenius	9789e28459	debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check (#18692 ) * debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check This commit updates the pooling check in the debug example to also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just LLAMA_POOLING_TYPE_NONE. * debug : normalize both pooled and token embeddings This commit updates debug.cpp to normalize embeddings for both pooled and non-pooled outputs. For pooled embeddings, normalization is applied to the single vector, and for non-pooled embeddings, normalization is applied to each token embedding vector individually. The motivation for this is to enable non-pooled embeddings to be normalized which was not possible previously.	2026-01-11 16:34:41 +01:00
Georgi Gerganov	84ae04f163	tests : refactor test-backend-sampler (#18753 ) * tests : use "auto", use std::string * tests : refactor test-backend-sampler.cpp * cmake : remove redundant declarations * ci : use smaller model * tests : add struct test_params * tests : reduce logit bias 100.0f -> 10.0f	2026-01-11 17:31:03 +02:00
Yee Man Chan	719d374bf6	remove blank lines to make lint happy	2026-01-11 22:58:44 +08:00
Yee Man Chan	4f6ef2c085	try to make lint happy	2026-01-11 22:33:58 +08:00
Yee Man Chan	58d1ee5227	removed traling whitespaces in empty line + make sure indentation is multiple of 4	2026-01-11 22:19:29 +08:00
Yee Man Chan	59182f5e06	fix trailing whitespace	2026-01-11 22:06:48 +08:00
Yee Man Chan	93afbedc96	moved const llama_model & model; around to follow qwen3next format and see if it cna pass the -Wunused-private-field error	2026-01-11 21:44:54 +08:00
Yee Man Chan	6ae66fc40d	fix trailing spaces	2026-01-11 21:31:35 +08:00
Xuan-Son Nguyen	506bb6e010	model: try to improve Qwen3 Next (#18683 ) * qwen3next: simplify qkvz projection * use ggml_swiglu_split * revert swiglu_split, but remove redundant repeat() * fix missing reshape * rm 2 redundant transposes * move mul_mat(k,q) to outside of chunking * rm redundant cont * improve g_cs_chunk * add comments about no cont * use std::pair instead of ggml_concat * vectorize key_gdiff calculation * rm unused tensor * avoid ggml_concat inside loop * bring back ggml_concat as it may not work on other backend * nits	2026-01-11 12:53:33 +01:00
thom-dev-fr	79456a690a	readme : update UIs (#18751 )	2026-01-11 13:46:50 +02:00
Xuan-Son Nguyen	28068af789	security: narrow down the scope of what we consider a vulnerability (#18752 ) * security: narrow down the scope of what we consider a vulnerability * fix typo	2026-01-11 12:23:36 +01:00
Yee Man Chan	10be797c12	Merge branch 'Kimi-Linear' of github.com:ymcki/llama.cpp into Kimi-Linear merge with latest llama.cpp	2026-01-11 16:04:25 +08:00
Yee Man Chan	5f2b8dd9a5	Merge branch 'master' of github.com:ymcki/llama.cpp into Kimi-Linear sync with latest llama.cpp	2026-01-11 16:00:32 +08:00
Yee Man Chan	b9360c7fe1	MLA KV cache support	2026-01-11 15:58:46 +08:00
ymcki	426a82de3d	Merge branch 'ggml-org:master' into Kimi-Linear	2026-01-11 15:55:45 +08:00
shaofeiqi	707cbafcaa	opencl: add SOFTPLUS op support (#18726 )	2026-01-10 21:57:44 -08:00
Aman Gupta	b137718878	test-backend-ops: fix mxfp4 tests on blackwell (#18736 )	2026-01-11 01:12:57 +08:00
Johannes Gäßler	d2ff4e23ac	HIP: adjust RDNA3.5 MMQ kernel selction logic (#18666 )	2026-01-10 17:19:01 +01:00
Perry Naseck	657a2e644b	cmake : update blas logic (#18205 )	2026-01-10 18:00:54 +02:00
Georgi Gerganov	f307926482	server : adjust unified KV cache tests (#18716 )	2026-01-10 17:51:56 +02:00
Sigbjørn Skjæret	7fdc8c893d	scripts : follow api redirects in pr2wt.sh (#18739 )	2026-01-10 16:04:05 +01:00
Xuan-Son Nguyen	23f82f2420	preset: allow named remote preset (#18728 ) * preset: allow named remote preset * nits: fix docs * cont docs	2026-01-10 15:12:29 +01:00
Yee Man Chan	dce064c0a3	fixed typo and split wkv_b into wk_b and wv_b	2026-01-10 22:08:38 +08:00
Aaron Teo	2656c0d265	docs(ggml): update backend ops (#18734 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-01-10 18:48:17 +08:00
Michael Wand	600a366478	Corrected: changed s13 = src1->nb[3] instead of nb[2] (#18724 )	2026-01-10 10:16:07 +01:00
Adrien Gallouët	ea23c15990	common : add --license to display embedded licenses (#18696 ) This commit introduces a mechanism to embed all licenses directly into the compiled binaries. This eliminates the need to distribute separate LICENSE files alongside the executable, making the binaries self-contained and simplifying deployment.	2026-01-10 09:46:24 +01:00
Yee Man Chan	d26fe50178	Moved Aqk computation out of the loop	2026-01-10 08:45:57 +08:00
Xuan-Son Nguyen	9ac2693a30	server: fix n_cmpl not skipping processing prompt (#18663 ) * server: fix n_cmpl not skipping processing * fix infinite loop on empty batch * cont : init child samplers + modify child logic * cont : cleanup * cont : improve n_cmpl logic - launch the parent task first so it finds the slot with best cache - parent task waits for child tasks to be launched - when a child task finishes - remove its cache * cont : remove redundant function * cont : reduce parent checks * fix : nullptr task dereference --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-10 00:00:41 +01:00
Simranjeet Singh	a61c8bc3bf	mtmd: Add Gemma3n multimodal support with MobileNetV5 vision encoder (#18256 ) * Add Gemma3nVisionModel - MobileNetV5 vision encoder convertor to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py. * Add mobilenetv5 impl * Fix comments, remove unused vars * Fix permute and remove transpose of projection weights * Fix comments, remove debugging prints from hf_to_gguf * 1. Hard-code image_mean = 0 and image_std = 1 2. Use available tensor mapping logic 3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder * 1. Move mobilenetv5 helpers declarations to `clip_graph_mobilenetv5` struct and definitions to mobilenetv5.cpp 2.Remove unused `clip_is_gemma3n` func declarations and definitions 3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std 4. Calculate n_patches using image_size / patch_size * Remove obsolete comments * - convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest - convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf - mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_* - clip.cpp: Remove unused embedding and hard_emb_norm tensor loading * - Rename tensors to v.conv..., v.blk..., v.msfa... to better align with already existing terminology * Fix stem conv bias name * Remove explicit handling of bias term for stem conv * - Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer - Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable * clean up conversion script * fix code style * also preserve audio tensors * trailing space * split arch A and V * rm unused gemma3 func * fix alignment --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-01-09 23:42:38 +01:00
shaofeiqi	593da7fa49	opencl: add EXPM1 op (#18704 )	2026-01-09 10:13:13 -08:00
Reese Levine	9e41884dce	Updates to webgpu get_memory (#18707 )	2026-01-09 08:17:18 -08:00
Pascal	ec8fd7876b	Webui/file upload (#18694 ) * webui: fix restrictive file type validation * webui: simplify file processing logic * chore: update webui build output * webui: remove file picker extension whitelist (1/2) * webui: remove file picker extension whitelist (2/2) * chore: update webui build output * refactor: Cleanup * chore: update webui build output * fix: update ChatForm storybook test after removing accept attribute * chore: update webui build output * refactor: more cleanup * chore: update webui build output	2026-01-09 16:45:32 +01:00
Asbjørn Olling	a180ba78c7	cmake: only build cli when server is enabled (#18670 )	2026-01-09 16:43:26 +01:00
Yee Man Chan	6150bb7b17	no clamp version	2026-01-09 20:11:45 +08:00
Georgi Gerganov	53eb9435da	server : fix timing of prompt/generation (#18713 )	2026-01-09 12:59:50 +02:00
Georgi Gerganov	d3435efc8a	scripts : pr2wt.sh reset to remote head (#18695 ) * scripts : pr2wt.sh reset to remote head * cont : cleaner * cont : restore --set-upstream-to	2026-01-09 12:16:40 +02:00
Georgi Gerganov	f5f8812f7c	server : use different seeds for child completions (#18700 ) * server : use different seeds for child completions * cont : handle default seed * cont : note	2026-01-09 09:33:50 +02:00
ymcki	6977ddbe85	Merge branch 'ggml-org:master' into Kimi-Linear	2026-01-09 14:09:56 +08:00
Xuan-Son Nguyen	8ece3836b4	common: support remote preset (#18520 ) * arg: support remote preset * proof reading * allow one HF repo to point to multiple HF repos * docs: mention about multiple GGUF use case * correct clean_file_name * download: also return HTTP status code * fix case with cache file used * fix --offline option	2026-01-08 22:35:40 +01:00
Aaron Teo	046d5fd44e	llama: use host memory if device reports 0 memory (#18587 )	2026-01-09 05:34:56 +08:00
Masashi Yoshimura	480160d472	ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (#18628 ) * Fix GGML_MEM_ALIGN to 8 for emscripten. * Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten	2026-01-08 08:36:42 -08:00

7908 Commits
