llama.cpp

Commit Graph

Author	SHA1	Message	Date
Diego Devesa	ec428b02c3	llama : add --n-cpu-moe option (#15077) * llama : add --n-cpu-moe option Keeps the MoE weights of the first N layers in the CPU	2025-08-05 01:05:36 +02:00
compilade	19f68fa5a4	imatrix : warn when GGUF imatrix is saved without .gguf suffix (#15076) * imatrix : add warning when suffix is not .gguf for GGUF imatrix * imatrix : only warn about suffix when output format is unspecified	2025-08-04 23:26:52 +02:00
Ed Addario	adbff66394	Merge branch 'master' into imatrix	2025-08-04 22:16:10 +01:00
Ed Addario	c39c4e2a33	Refactor variable name	2025-08-04 22:15:50 +01:00
Christian Kastner	41613437ff	cmake: Add GGML_BACKEND_DIR option (#15074) * cmake: Add GGML_BACKEND_DIR option This can be used by distributions to specify where to look for backends when ggml is built with GGML_BACKEND_DL=ON. * Fix phrasing	2025-08-04 21:29:14 +02:00
Sigbjørn Skjæret	e5bebe5251	gguf-py : add --chat-template-file to gguf_new_metadata (#15075)	2025-08-04 21:01:48 +02:00
Sam	ef0144c087	model: support GLM 4.5 family of models (#14939) * model: Add GLM 4.5 (#14921) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Merge in PR suggestions Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * model: Add GLM 4.5 family of models (#14921) 1. Updated tensor_mapping.py with NextN tensor mappings - Added proper tensor mappings for all NextN/MTP tensors in /Users/samm/git/llama.cpp/gguf-py/gguf/tensor_mapping.py - Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm 2. Added num_nextn_predict_layers configuration - Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp - Added num_nextn_predict_layers field to llama_hparams struct - Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter - Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers - Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method - Updated conversion script to extract and write this parameter from HuggingFace config 3. Added FIM tokens for GLM4_MOE - Added GLM-4.5's FIM tokens to llama-vocab.cpp: - <|code_prefix|> for FIM_PRE - <|code_suffix|> for FIM_SUF - <|code_middle|> for FIM_MID 4. Removed manual NextN tensor handling - Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors - NextN tensors are now handled automatically through the proper tensor mapping system * glm 4.5 update tensors names * model: glm 4.5 apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * model: glm 4.5 apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * model: glm 4.5 apply suggestions from code review * Apply suggestions from code review * patch broken chat template * typings fix * add TENSOR_SKIP flag Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update src/llama-model-loader.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-08-04 20:29:25 +02:00
Sigbjørn Skjæret	2721257e3e	quantize : fix confusing error message if ftype is invalid (#15071)	2025-08-04 18:11:02 +02:00
Reese Levine	587d0118f5	ggml: WebGPU backend host improvements and style fixing (#14978) * Add parameter buffer pool, batching of submissions, refactor command building/submission * Add header for linux builds * Free staged parameter buffers at once * Format with clang-format * Fix thread-safe implementation * Use device implicit synchronization * Update workflow to use custom release * Remove testing branch workflow	2025-08-04 08:52:43 -07:00
Jeff Bolz	5aa1105da2	vulkan: fix build when using glslang that does not support coopmat2 (#15062)	2025-08-04 07:09:19 +02:00
compilade	d31192b4ee	imatrix : use GGUF by default (#14842) * imatrix : use GGUF by default * imatrix : use GGUF regardless of the output filename The legacy format can only be produced with --output-format dat	2025-08-03 22:00:05 +02:00
compilade	0a2f5496be	imatrix : fix 3d activation handling for hybrid and recurrent models (#14994) * imatrix : use a single count for dense 3d tensors * imatrix : fix 3d activations when model tensor is 2d * imatrix : fix 3d tensor counts	2025-08-03 21:49:13 +02:00
compilade	11a3811164	memory : handle kv_unified for hybrid models (#15050)	2025-08-03 21:43:07 +02:00
Csaba Kecskemeti	97366dc6ab	vocab : JetBrains Mellum pre-tokenizer (#15045)	2025-08-03 21:38:18 +02:00
Ed Addario	f1c2a4ca3f	Fix printing l2 norm when calc_mode = 1	2025-08-03 17:14:46 +01:00
Ed Addario	90cb1be99d	Minor cosmetic changes	2025-08-03 16:57:27 +01:00
Ed Addario	2117c4e54b	Update aggregated statistic report layout	2025-08-03 16:38:02 +01:00
Ed Addario	a6155a8125	Add compute_layer_statistics() function	2025-08-03 16:35:03 +01:00
Gabriel Larson	83bc2f288c	model : add text-only support for Kimi-VL (and find special tokens in text_config) (#15051) * basic kimi-vl textmodel conversion * check config["text_config"] for special tokens	2025-08-03 16:56:25 +02:00
Ed Addario	be60469f25	Refactor function names	2025-08-03 15:10:17 +01:00
Jeff Bolz	6c7a441161	vulkan: Use coopmat2 for conv2d (#14982)	2025-08-03 14:23:57 +02:00
Ed Addario	fce05aac9e	Refactor lambda into compute_tensor_averages() function	2025-08-03 13:03:21 +01:00
Ed Addario	5324558132	Update table layout	2025-08-03 10:28:47 +01:00
Ed Addario	4d1325e1eb	Refactor variables	2025-08-03 10:28:23 +01:00
Ed Addario	a32a2ecbed	Reformat report layout	2025-08-03 00:51:33 +01:00
Ed Addario	4c01f51ae1	Remove inactive	2025-08-03 00:51:12 +01:00
lhez	5c0eb5ef54	opencl: fix adreno compiler detection logic (#15029)	2025-08-02 19:51:18 +02:00
Ed Addario	fc8f92596f	Update table display	2025-08-02 16:46:27 +01:00
Ed Addario	ee2509f563	Adjust threshold	2025-08-02 16:45:56 +01:00
Ed Addario	9b841eb696	Compute l2 norm	2025-08-02 16:45:09 +01:00
Ed Addario	b7fb362d8e	Compute cosine similarity based on activations	2025-08-02 16:43:49 +01:00
Ed Addario	cce514a392	Compute entropy for activations	2025-08-02 16:40:40 +01:00
Ed Addario	9744a4a1c6	Determine calculation mode	2025-08-02 16:36:12 +01:00
Ed Addario	78ddb475de	Fix problem when GGUF does not have in_sum	2025-08-02 16:31:21 +01:00
Johannes Gäßler	03d4698218	CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (#15035)	2025-08-02 16:37:08 +02:00
leejet	3303c19b16	cuda: make im2col a little faster (#15025)	2025-08-02 17:15:36 +03:00
Daniel Bevenius	4fdea540bd	kv-cache : skip alignment of n_stream in kv-cache log msg [no ci] (#15040) This commit removes the right alignment of the `n_stream` value in the log message in the `llama_kv_cache_unified` constructor. The motivation for this change is to enhance the readability of the log message. Currently the output looks like this: ```console llama_kv_cache_unified: size = 2048.00 MiB ( 4096 cells, 32 layers, 1/ 1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB ``` Notice that the `n_stream` value is right aligned, which makes it a little harder to read. With the change in this commit the output will look like ```console llama_kv_cache_unified: size = 2048.00 MiB ( 4096 cells, 32 layers, 1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB ```	2025-08-02 17:14:57 +03:00
Georgi Gerganov	a4569c41fd	llama : enable LLAMA_SET_ROWS=1 by default (#14959) ggml-ci	2025-08-02 17:14:21 +03:00
Georgi Gerganov	15e92fd337	cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (#15038) * cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 ggml-ci * cont : fix cont types ggml-ci * cont : adopt variable names and comment from the other branch	2025-08-02 17:13:05 +03:00
Sigbjørn Skjæret	2bf3fbf0b5	ci : check that pre-tokenizer hashes are up-to-date (#15032) * torch is not required for convert_hf_to_gguf_update * add --check-missing parameter * check that pre-tokenizer hashes are up-to-date	2025-08-02 14:39:01 +02:00
Douglas Hanley	711d5e6fe6	convert : fix Qwen3-Embedding pre-tokenizer hash (#15030)	2025-08-02 12:51:02 +02:00
Jhen-Jie Hong	f738989dcb	chat : fix multiple tool_calls on hermes-2-pro (#14962)	2025-08-02 18:04:48 +08:00
Jeff Bolz	4cb208c93c	vulkan: coopmat2 mul_mat optimizations (#14934) - Increase tile size for k-quants, to match non-k-quants - Choose more carefully between large and medium tiles, considering how it interacts with split_k - Allow larger/non-power of two split_k, and make the splits a multiple of 256 - Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used	2025-08-02 11:21:37 +02:00
R0CKSTAR	3025b621d1	llama-bench: rename DB table name from test to llama_bench (#15003) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-08-02 17:20:40 +08:00
Jeff Bolz	ec0b18802c	vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015)	2025-08-02 10:48:30 +02:00
Douglas Hanley	339bd0268c	model : support Qwen3-Embedding (#15023)	2025-08-02 10:44:50 +02:00
Johannes Gäßler	f906275537	server: enable token array inputs for OAI API (#15001)	2025-08-02 10:12:41 +02:00
Jeff Bolz	a9f7541ec2	vulkan: optimizations for direct convolution (#14933) * vulkan: optimizations for direct convolution - Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too. - Fix shmem bank conflicts. 16B padding should work with coopmat. - Some explicit loop unrolling. - Skip math/stores work for parts of the tile that are OOB. - Apply fastdiv opt. - Disable shuffles for NV. * Three tile sizes for CONV_2D, and a heuristic to choose * reallow collectives for pre-Turing * make SHMEM_PAD a spec constant * fixes for intel perf - no shmem padding, placeholder shader core count * shader variants with/without unrolling * 0cc4m's fixes for AMD perf Co-authored-by: 0cc4m <picard12@live.de> --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-08-02 09:57:04 +02:00
Johannes Gäßler	9c35706b98	CUDA: fix MMQ nwarps for AMD with warp_size==32 (#15014)	2025-08-01 20:47:32 +02:00
l-austenfeld	c76b420e4c	vendor : update vendored copy of google/minja (#15011) * vendor : update vendored copy of google/minja Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com> * Re-remove trailing whitespace Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com> * Remove another trailing whitespace Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com> --------- Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>	2025-08-01 16:59:06 +02:00


7410 Commits
