llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	0d85c5ca22	tests : add more top_k tests	2026-01-01 20:17:11 +02:00
Georgi Gerganov	435c96709b	llama : assert at most one output token per sequence	2025-12-31 17:51:42 +02:00
Georgi Gerganov	4c3d5422ad	minor : add comments + some cleanup	2025-12-31 16:59:42 +02:00
Georgi Gerganov	791ecb94ff	sampling : zero-initialize input buffers	2025-12-30 20:12:49 +02:00
Georgi Gerganov	c5de75989e	Merge branch 'master' into HEAD	2025-12-30 16:36:58 +02:00
Georgi Gerganov	588299c20c	server : remove printfs	2025-12-30 16:33:01 +02:00
Georgi Gerganov	610e50a17d	sampling : fix reshapes	2025-12-30 16:32:32 +02:00
Georgi Gerganov	5d2156e893	ci : add server workflow with backend sampling	2025-12-30 16:32:14 +02:00
Jay Zenith	c32fa21db8	sampling: reuse token data buffer in llama_sampler_sample (#18365) * sampling: reuse token data buffer in llama_sampler_sample * move cur buffer before timing section, after samplers * minor : fix build --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-30 16:27:49 +02:00
Jeff Bolz	f14f4e421b	server: fix files built redundantly (#18474)	2025-12-30 13:11:13 +01:00
Charles Xu	2d6c00a9b8	kleidiai: add and integrate SVE 256-bit vector-length kernel (#18458) * kleidiai: add and integrate SVE 256-bit vector-length kernel * updated for review comments	2025-12-30 14:04:53 +02:00
Georgi Gerganov	23e8bb4077	arg : add shorthand for --backend-sampling	2025-12-30 13:56:22 +02:00
Aman Gupta	d77d7c5c06	CUDA: add log line when mxfp4 acceleration is used (#18483) * CUDA: add log line when mxfp4 acceleration is used * add in backend_get_features	2025-12-30 17:40:46 +08:00
Daniel Bevenius	a864fb1c14	model-conversion : use CONVERTED_MODEL for compare-embeddings (#18461) This commit updates the causal model verification script to use the CONVERTED_MODEL environment variable instead of using the MODEL_PATH (the original model path) as the basis for the converted model file name. The motivation for this is that currently, if the converted model file name differs from the original model directory/name, the verification script will look for the wrong .bin file that was generated when running the converted model. This is similar to the change made for the embeddings models script in Commit `db81d5ec4b` ("model-conversion : use CONVERTED_EMBEDDING_MODEL for embedding_verify_logits (#18079)"), but we also verify the embeddings for causal models as well.	2025-12-30 10:13:12 +01:00
Daniel Bevenius	ebfe545cf9	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-12-30 07:59:02 +01:00
Xuan-Son Nguyen	51a48720b8	webui: fix prompt progress ETA calculation (#18468) * webui: fix prompt progress ETA calculation * handle case done === 0	2025-12-29 21:42:11 +01:00
Pascal	c9a3b40d65	Webui/prompt processing progress (#18300) * webui: display prompt preprocessing progress * webui: add percentage/ETA and exclude cached tokens from progress Address review feedback from ngxson * webui: add minutes and first chunk (0%) case * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: address review feedback from allozaur * chore: update webui build output * webui: address review feedback from allozaur * nit * chore: update webui build output * feat: Enhance chat processing state * feat: Improve chat processing statistics UI * chore: update webui build output * feat: Add live generation statistics to processing state hook * feat: Persist prompt processing stats in hook for better UX * refactor: Enhance ChatMessageStatistics for live stream display * feat: Implement enhanced live chat statistics into assistant message * chore: update webui build output * fix: Proper tab for each stage of prompt processing/generation * chore: update webui build output * fix: Improved ETA calculation & display logic * chore: update webui build output * feat: Simplify logic & remove ETA from prompt progress * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-12-29 19:32:21 +01:00
Johannes Gäßler	0bd1212a43	CUDA: fix replacement of bad archs in CMake (#18457)	2025-12-29 17:58:20 +01:00
wbtek	5b1248c9af	server : Cmdline arg -to changes http read timeout from current 600sec default (#18279) * Prevent crash if TTFT >300sec, boosted to 90 days * server : allow configurable HTTP timeouts for child models * server : pass needed timeouts from params only --------- Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net>	2025-12-29 17:12:48 +01:00
Xuan-Son Nguyen	3595ae5963	contributing: tighten AI usage policy (#18388) * contributing: tighten AI usage policy * refactor AGENTS.md * proofreading * update contributing * add claude.md * add trailing newline * add note about dishonest practices * rm point about dishonest * rm requirement watermarking * add .gemini/settings.json * allow initially AI-generated content * revise * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * improve * trailing space * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * update --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-29 16:01:32 +01:00
Naco Siren	c1366056f6	android: routine maintenance - Dec 2025 (#18338) * Fix `msg` typo * Fix thread safety in destroy() to support generation abortion in lifecycle callbacks. * UI polish: stack new message change from below; fix GGUF margin not in view port * Bug fixes: rare race condition when main thread updating view and default thread updating messages at the same time; user input not disabled during generation. * Bump dependencies' versions; Deprecated outdated dsl usage.	2025-12-29 15:51:13 +02:00
Georgi Gerganov	2a85f720b8	server : handle closed connection for tasks (#18459)	2025-12-29 15:34:41 +02:00
Daniel Bevenius	7cbec34a63	model-conversion : add device option to embd run orig model (#18386) This commit refactors the original model embedding script to include a device selection option. Users can now specify the device (cpu, cuda, mps, auto) via command-line arguments. It also refactors the code to be more structured.	2025-12-29 13:37:02 +01:00
Héctor Estrada Moreno	0c8986403b	retrieval : use at most n_seq_max chunks (#18400)	2025-12-29 13:21:13 +02:00
o7si	daa242dfc8	common: fix return value check for setpriority (#18412) * common: fix return value check for setpriority * tools: add logging for process priority setting	2025-12-29 11:07:49 +02:00
Johannes Gäßler	e70e640db3	CUDA: Blackwell features for non-native builds (#18436)	2025-12-29 09:35:42 +01:00
Aman Gupta	5fa66c6e67	cuda: fix race condition in cumsum (#18448) * ggml-cuda: fix race condition in cumsum * remove unnecessary sync_threads	2025-12-29 14:07:17 +08:00
Tim Neumann	382808c14b	ci : re-enable rocm build on amd64 (#18439) This was disabled in #9340 due to compiler crash, but seems to build now as confirmed by the latest comments in #11913. I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages, `full`, `server` and `light`). A quick attempt at trying to build an arm64 image failed. Since none of the other images are built for arm, I only enabled the amd64 one. The `runs_on` option was added to match the other entries.	2025-12-29 00:29:23 +01:00
uvos	4ffc47cb20	HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202)	2025-12-28 20:12:55 +01:00
momonga	9c675c7140	model : Plamo3 support (#17304) * plamo3 * fix plamo3 * clean code * clean up the code * fix diff * clean up the code * clean up the code * clean up the code * clean up the code * clean up the code * clean up the code * add chat_template if exist * clean up the code * fix cpu-backend * chore: whitespace trim fix + typo fix * Fix: address review feedback * restore `FREQ_BASE_SWA` constant * Fix: address review feedback2 * Fix:typecheck * Fix: address review feedback3 * final cleanup --------- Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-28 17:28:31 +01:00
Aman Gupta	07a0c4ba92	Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (#18426)	2025-12-28 20:53:36 +08:00
o7si	60f17f56da	rpc: fix segfault on invalid endpoint format (#18387) * rpc: fix segfault on invalid endpoint format * rpc: add error log for failed endpoint connection	2025-12-28 12:34:41 +02:00
Johannes Gäßler	f8d561eb87	llama-fit-params: fix step size for last device (#18415)	2025-12-28 10:52:09 +01:00
Johannes Gäßler	e59efe6a78	github: update issue templates [no ci] (#18410) * github: update issue templates [no ci] * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-28 10:50:56 +01:00
Xuan-Son Nguyen	cffa5c46ea	mtmd: clarify that we no longer accept AI-generated PRs (#18406)	2025-12-28 09:57:04 +01:00
Boian Berberov	94de74e7b1	cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (#18186) * minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h` * cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` - `ivybridge` - `piledriver` - `cannonlake` - `cascadelake` - `cooperlake` - `zen4` Resolves: #17966	2025-12-28 09:33:29 +02:00
Daniel Bevenius	060c0a585e	ggml : include cub/cub.cuh instead of block_scan.cuh This commit updates the include directive in cumsum.cu to use cub/cub.cuh instead of cub/block/block_scan.cuh. The motivation of this change is that without it compilation fails with the following error: ```console /llama.cpp/ggml/src/ggml-cuda/cumsum.cu(196): error: name followed by "::" must be a class or namespace name cub::DeviceScan::InclusiveSum(nullptr, ^ /llama.cpp/ggml/src/ggml-cuda/cumsum.cu(207): error: name followed by "::" must be a class or namespace name cub::DeviceScan::InclusiveSum((void ) tmp_alloc.get(), tmp_size, src, dst, ne, stream); ^ 2 errors detected in the compilation of "/llama.cpp/ggml/src/ggml-cuda/cumsum.cu". gmake[2]: ** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:317: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cumsum.cu.o] Error 2 ``` Commit `83b3b1c271` ("cuda: optimize cumsum cub path (#18362)") updated the include directive replacing device_scan.cuh which is causing this issue. This commit uses cub/cub.cuh umbrella header which is consistent with other files in the ggml-cuda directory like mean.cu, sum.cu, etc.	2025-12-28 08:03:04 +01:00
Daniel Bevenius	82c2600585	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-12-28 07:34:17 +01:00
QDelta	4fd59e8427	ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)	2025-12-28 09:33:14 +08:00
lhez	08566977a7	opencl: allow resizing transpose buffers (#18384) * opencl: allow resizing transpose buffers instead of using fixed sizes * opencl: remove commented code	2025-12-27 15:51:14 -08:00
Johannes Gäßler	a4bf35889e	llama-fit-params: fix overflow check (#18354)	2025-12-27 20:20:45 +01:00
Johannes Gäßler	026d2ad472	llama: fix magic number of 999 for GPU layers (#18266) * llama: fix magic number of 999 for GPU layers * use strings for -ngl, -ngld * encapsulate n_gpu_layers, split_mode	2025-12-27 20:18:35 +01:00
Aman Gupta	06705fdcb3	ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407)	2025-12-27 19:56:27 +08:00
Johannes Gäßler	a52dc60ba3	llama_fit_params: return enum for fail vs. error (#18374)	2025-12-27 09:59:19 +01:00
Johannes Gäßler	9045c9afe5	llama-fit-params: fix Gemma 3 calculation (#18372)	2025-12-27 09:56:04 +01:00
Jeff Bolz	c9ced4910b	vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352) Run a preprocess to count how many times each expert is used, and use this to quickly discard workgroups that aren't needed.	2025-12-26 16:12:58 -06:00
Jeff Bolz	7ac8902133	vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349) * vulkan: Use BK=32 for coopmat2 mul_mat_id * vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader Disable robustness, remove the OOB check in decodeFuncB, and initialize the row_ids to zero to avoid OOB access. Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of zero and remove the '& (BN - 1)'. This allows the compiler to common some of the shared memory loads.	2025-12-26 18:15:50 +01:00
Jeff Bolz	9bf20d8ac3	vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332)	2025-12-26 18:15:02 +01:00
Eve	cb999704fb	vulkan: small dequantization improvements (#18380) * iq4_xs * quants	2025-12-26 18:12:11 +01:00
Jeff Bolz	b96b82fc85	vulkan: Support UPSCALE w/antialias (#18327)	2025-12-26 17:00:57 +01:00


7754 Commits (All Branches)