llama.cpp

Commit Graph

Author	SHA1	Message	Date
Aleksander Grygier	f22e2be4d0	refactor: Use Popover for Chat Form Prompt Picker	2026-01-27 11:22:30 +01:00
Aleksander Grygier	7eff7a31de	feat: UI improvements	2026-01-27 11:07:20 +01:00
Aleksander Grygier	d4a6815ea9	chore: update webui build output	2026-01-27 10:40:34 +01:00
Aleksander Grygier	b834f165a4	Merge remote-tracking branch 'origin/allozaur/mcp-mvp' into allozaur/mcp-mvp	2026-01-27 10:40:11 +01:00
Aleksander Grygier	e35adedb4f	chore: update webui build output	2026-01-27 10:27:40 +01:00
Aleksander Grygier	1b7f576baf	refactor: Components	2026-01-27 10:26:14 +01:00
Alberto Cabrera Pérez	be8890e721	ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (#18888) * Boilerplate for q6_K repack * q6_K repack to q6_Kx8 implementation Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * q6_K generic gemv and gemm * wip, gemm_q6_K 8x8 * Still WIP: loading of q8s, q6h and q6l * first working version of q6_K gemm * Moved q6 loads outside of sb block, Unrolled inner loop * Replaced modulo with mask * First implementation of GEMV * ggml_vdotq_s32 -> vdotq_s32 * Reduce width of accumulators in q6_K gemv * Bsums instead of calc bias. Preload scales to use vget_lane. Unroll. * Reuse scales in GEMM (same GEMV opt) * Added todos for bsum and different qh repack * Arch fallback * VSLIQ for merging qh adn ql * Removed TODO, already tested * Apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Removed unused import --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-27 11:08:10 +02:00
Aleksander Grygier	b8221e8915	refactor: Utils	2026-01-27 09:04:41 +01:00
Gaurav Garg	a83c73a18a	[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042) * [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline. Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size. * Set the env variable in the CUDA backend registry allocation * Add link to PR in code comment * Remove warning logs and update documentation	2026-01-27 08:52:44 +02:00
Daniel Bevenius	fc3cdf32ce	common : clarify HTTPS build options in error message (#19103) * common : clarify HTTPS build options in error message This commit updates the https error message to provide clearer instructions for users who encounter the "HTTPS is not supported" error. The motivation for this is that it might not be clear to users that only one of these options are needed to enable HTTPS support. The LLAMA_OPENSSL option is also added to the message to cover all possible build configurations. * clarify that OpenSSL is the default for HTTPS support	2026-01-27 06:16:00 +01:00
shalinib-ibm	7afdfc9b84	ggml-cpu: Enable FP16 MMA kernels on PPC (#19060)	2026-01-27 11:52:34 +08:00
lhez	94eeb5967c	opencl: add flattened q6_K mv (#19054) * opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat` * opencl: clean up * opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat` * opencl: tweak the workgroup size a bit * opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat` * opencl: proper alignment for q6_K * opencl: boundary handling for flattened q6_K mv * opencl: rename q6_K mv kernel file * opencl: put flattened q6_K mv in its own file * opencl: use lower k in file name * opencl: use K in variable names	2026-01-26 19:36:24 -08:00
Johannes Gäßler	b0311c16d2	CUDA: fix padding of GQA to power of 2 in FA (#19115)	2026-01-26 23:24:58 +01:00
Georgi Gerganov	8f80d1b254	graph : fix nkvo offload with FA (#19105)	2026-01-26 20:18:34 +02:00
Pascal	5e71525cac	webui: remove unused sessionId, SDK handles it automatically	2026-01-26 16:41:44 +01:00
Pascal	19c32a4c96	webui: remove unused sessionId, SDK handles it automatically	2026-01-26 16:13:07 +01:00
Aleksander Grygier	d444c4a7e5	chore: update webui build output	2026-01-26 15:40:02 +01:00
Aleksander Grygier	1d518cac06	fix: Wait for all MCP Servers Health Checks to load	2026-01-26 15:38:10 +01:00
Aleksander Grygier	82f26ad8e4	refactor: Cleanup	2026-01-26 15:33:27 +01:00
Aleksander Grygier	5bf1c86635	refactor: Cleanup refactor: Cleanup refactor: Cleanup refactor: Cleanup	2026-01-26 15:28:50 +01:00
Sigbjørn Skjæret	142cbe2ac6	ci : use new 1vCPU runner for lightweight jobs (#19107) * use new 1vCPU runner for lightweight jobs * pyright is too heavy, look into ty some day use new pip-install input	2026-01-26 15:22:49 +01:00
Aleksander Grygier	7b127db90c	chore: update webui build output	2026-01-26 15:07:47 +01:00
Aleksander Grygier	717a868c23	feat: Mcp Server Selector	2026-01-26 15:03:05 +01:00
Aleksander Grygier	e566d6641e	fix: Scroll issues in DropdownMenuSearchable	2026-01-26 14:41:15 +01:00
Aleksander Grygier	d675f403e3	chore: update webui build output	2026-01-26 14:33:58 +01:00
Aleksander Grygier	ee0f0b277f	feat: Improve Code blocks rendering + add auto scroll + improve global scroll bar behavior	2026-01-26 14:32:40 +01:00
Aleksander Grygier	6586ae71d2	chore: update webui build output	2026-01-26 12:34:21 +01:00
Aleksander Grygier	c631e26a3f	refactor: Components imports/exports structure & documentation	2026-01-26 12:30:53 +01:00
Georgi Gerganov	56f3ebf38e	model : add correct type for GLM 4.7 Flash (#19106)	2026-01-26 11:24:30 +02:00
Aleksander Grygier	b7d1de68c3	refactor: Cleanup	2026-01-26 09:54:44 +01:00
Aleksander Grygier	0a66568fc9	chore: update webui build output	2026-01-26 09:37:27 +01:00
Aleksander Grygier	fa0cad2e6e	refactor: Componentize Chat Form Prompt Picker	2026-01-26 09:36:13 +01:00
Aleksander Grygier	176abf3175	refactor: Utility function	2026-01-26 09:00:41 +01:00
Aleksander Grygier	5ee232d81c	refactor: Use store methods	2026-01-26 08:52:57 +01:00
Johannes Gäßler	0c21677e43	CUDA: faster FA for GQA > 1 but not power of 2 (#19092)	2026-01-25 21:19:47 +01:00
ccbinn	0440bfd160	metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (#19088) Co-authored-by: chenbin11 <chenbin11@kuaishou.com>	2026-01-25 20:07:19 +02:00
Sigbjørn Skjæret	0bf5636938	convert : yield Gemma3N custom_map tensors directly (#19091)	2026-01-25 18:03:34 +01:00
Aman Gupta	bcb43163ae	ggml-cpu: Use tiled FA for prompt-processing (#19012) * ggml-cpu: Use tiled FA for prompt-processing the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on a AMD EPYC single-socket 64-c machine. * fix out of bounds for mask * skip rows where there are all masks * skip tile if mask is inf * store mask in worksize * check inf tile earlier	2026-01-25 23:25:58 +08:00
Georgi Gerganov	d9c6ce46f7	kv-cache : support V-less cache (#19067) * kv-cache : support V-less cache * cuda : better check for V_is_K_view * cuda : improve V_is_K_view check * graph : add comments * hparams : refactor	2026-01-25 15:48:56 +02:00
Aleksander Grygier	ff0e927be2	chore: update webui build output	2026-01-25 13:38:25 +01:00
Aleksander Grygier	ee9efae203	refactor: Enums	2026-01-25 13:37:08 +01:00
Aleksander Grygier	7f5284d597	refactor: Cleanup refactor: Cleanup refactor: Cleanup refactor: Cleanup	2026-01-25 13:13:11 +01:00
Sigbjørn Skjæret	70d860824a	convert : fix Gemma3N, GraniteMoe and Ernie4.5Moe (#19084) * fix Gemma3N and Ernie4.5Moe * fix GraniteMoe	2026-01-25 13:05:05 +01:00
Georgi Gerganov	080b161995	completion : fix prompt cache for recurrent models (#19045)	2026-01-25 09:12:50 +02:00
Molly Sophia	1243f93a2d	readme: update RWKV7 model links (#19061) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2026-01-25 09:11:19 +02:00
Jakkala Mahesh	24bc238303	llama: fix integer type consistency in split helpers (#18894) * llama: fix integer type consistency in split helpers * llama: apply minor style fixes * llama: remove trailing whitespace	2026-01-25 09:10:52 +02:00
Daniel Bevenius	16639ba217	common : use two decimal places for float arg help messages (#19048) * common : use two decimal places for float arg help messages This commit updates the help messages for various command-line arguments in arg.cpp to display floating-point default values with two decimal places instead of one. The motivation for this changes is that currently only having one decimal place means that values generated using --help or llama-gen-docs will not display the correct values. For example, currently the value of top-p in tools/server/README.md is `0.9`, but the default value is actually '0.95'. And running llama-gen-docs does not update this value as it uses the output from the help message, which shows only one decimal place, so the values look like they are unchanged. * docs : run llama-gen-docs to update docs	2026-01-25 07:31:42 +01:00
Bartowski	9981c30130	convert : fix conversion for inheriting models that were bypassing modify_tensors (#19064) * Add undo_permute = False where needed * Replace super().modify_tensors with ModelBase * Add one more ModelBase.modify_tensors * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-01-25 02:36:47 +01:00
Aleksander Grygier	97642211a9	chore: update webui build output	2026-01-25 02:10:25 +01:00
Aleksander Grygier	fc377123b7	refactor: Simplify MCP errors	2026-01-25 02:09:12 +01:00

8130 Commits

All Branches