llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	94bfa7803e	fix 4	2025-12-30 14:15:04 +02:00
Georgi Gerganov	3e0a3e865b	fix 3	2025-12-30 14:06:42 +02:00
Georgi Gerganov	bd48a0ac10	fix2	2025-12-30 14:02:58 +02:00
Georgi Gerganov	ab6f1122a4	fix	2025-12-30 14:02:09 +02:00
Georgi Gerganov	faad7d4743	test	2025-12-30 14:00:36 +02:00
Georgi Gerganov	23e8bb4077	arg : add shorthand for --backend-sampling	2025-12-30 13:56:22 +02:00
Daniel Bevenius	ebfe545cf9	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-12-30 07:59:02 +01:00
Xuan-Son Nguyen	51a48720b8	webui: fix prompt progress ETA calculation (#18468) * webui: fix prompt progress ETA calculation * handle case done === 0	2025-12-29 21:42:11 +01:00
Pascal	c9a3b40d65	Webui/prompt processing progress (#18300) * webui: display prompt preprocessing progress * webui: add percentage/ETA and exclude cached tokens from progress Address review feedback from ngxson * webui: add minutes and first chunk (0%) case * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: address review feedback from allozaur * chore: update webui build output * webui: address review feedback from allozaur * nit * chore: update webui build output * feat: Enhance chat processing state * feat: Improve chat processing statistics UI * chore: update webui build output * feat: Add live generation statistics to processing state hook * feat: Persist prompt processing stats in hook for better UX * refactor: Enhance ChatMessageStatistics for live stream display * feat: Implement enhanced live chat statistics into assistant message * chore: update webui build output * fix: Proper tab for each stage of prompt processing/generation * chore: update webui build output * fix: Improved ETA calculation & display logic * chore: update webui build output * feat: Simplify logic & remove ETA from prompt progress * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-12-29 19:32:21 +01:00
Johannes Gäßler	0bd1212a43	CUDA: fix replacment of bad archs in CMake (#18457)	2025-12-29 17:58:20 +01:00
wbtek	5b1248c9af	server : Cmdline arg -to changes http read timeout from current 600sec default (#18279) * Prevent crash if TTFT >300sec, boosted to 90 days * server : allow configurable HTTP timeouts for child models * server : pass needed timeouts from params only --------- Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net>	2025-12-29 17:12:48 +01:00
Xuan-Son Nguyen	3595ae5963	contributing: tighten AI usage policy (#18388) * contributing: tighten AI usage policy * refactor AGENTS.md * proofreading * update contributing * add claude.md * add trailing newline * add note about dishonest practices * rm point about dishonest * rm requirement watermarking * add .gemini/settings.json * allow initially AI-generated content * revise * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * improve * trailing space * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * update --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-29 16:01:32 +01:00
Naco Siren	c1366056f6	android: routine maintenance - Dec 2025 (#18338) * Fix `msg` typo * Fix thread safety in destroy() to support generation abortion in lifecycle callbacks. * UI polish: stack new message change from below; fix GGUF margin not in view port * Bug fixes: rare racing condition when main thread updating view and and default thread updating messages at the same time; user input not disabled during generation. * Bump dependencies' versions; Deprecated outdated dsl usage.	2025-12-29 15:51:13 +02:00
Georgi Gerganov	2a85f720b8	server : handle closed connection for tasks (#18459)	2025-12-29 15:34:41 +02:00
Daniel Bevenius	7cbec34a63	model-conversion : add device option to embd run orig model (#18386) This commit refactors the original model embedding script to include a device selection option. Users can now specify the device (cpu, cuda, mps, auto) via command-line arguments. It also refactors the code to be more structured.	2025-12-29 13:37:02 +01:00
Héctor Estrada Moreno	0c8986403b	retrieval : use at most n_seq_max chunks (#18400)	2025-12-29 13:21:13 +02:00
o7si	daa242dfc8	common: fix return value check for setpriority (#18412) * common: fix return value check for setpriority * tools: add logging for process priority setting	2025-12-29 11:07:49 +02:00
Johannes Gäßler	e70e640db3	CUDA: Blackwell features for non-native builds (#18436)	2025-12-29 09:35:42 +01:00
Aman Gupta	5fa66c6e67	cuda: fix race condition in cumsum (#18448) * ggml-cuda: fix race condition in cumsum * remove unneccesary sync_threads	2025-12-29 14:07:17 +08:00
Tim Neumann	382808c14b	ci : re-enable rocm build on amd64 (#18439) This was disabled in #9340 due to compiler crash, but seems to build now as confirmed by the latest comments in #11913. I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages, `full`, `server` and `light`). A quick attempt at trying to build an arm64 image failed. Since none of the other images are build for arm, I only enabled the amd64 one. The `runs_on` option was added to match the other entries.	2025-12-29 00:29:23 +01:00
uvos	4ffc47cb20	HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202)	2025-12-28 20:12:55 +01:00
momonga	9c675c7140	model : Plamo3 support (#17304) * plamo3 * fix plamo3 * clean code * clean up the code * fix diff * clean up the code * clean up the code * clean up the code * clean up the code * clean up the code * clean up the code * add chat_template if exist * clean up the code * fix cpu-backend * chore: whitespace trim fix + typo fix * Fix: address review feedback * restore `FREQ_BASE_SWA` constant * Fix: address review feedback2 * Fix:typecheck * Fix: address review feedback3 * final cleanup --------- Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-28 17:28:31 +01:00
Aman Gupta	07a0c4ba92	Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (#18426)	2025-12-28 20:53:36 +08:00
o7si	60f17f56da	rpc: fix segfault on invalid endpoint format (#18387) * rpc: fix segfault on invalid endpoint format * rpc: add error log for failed endpoint connection	2025-12-28 12:34:41 +02:00
Johannes Gäßler	f8d561eb87	llama-fit-params: fix step size for last device (#18415)	2025-12-28 10:52:09 +01:00
Johannes Gäßler	e59efe6a78	github: update issue templates [no ci] (#18410) * github: update issue templates [no ci] * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-28 10:50:56 +01:00
Xuan-Son Nguyen	cffa5c46ea	mtmd: clarify that we no longer accept AI-generated PRs (#18406)	2025-12-28 09:57:04 +01:00
Boian Berberov	94de74e7b1	cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (#18186) * minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h` * cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` - `ivybridge` - `piledriver` - `cannonlake` - `cascadelake` - `cooperlake` - `zen4` Resolves: #17966	2025-12-28 09:33:29 +02:00
Daniel Bevenius	060c0a585e	ggml : include cub/cub.cuh instead of block_scan.cuh This commit updates the include directive in cumsum.cu to use cub/cub.cuh instead of cub/block/block_scan.cuh. The motivation of this change is that without it compilation fails with the following error: ```console /llama.cpp/ggml/src/ggml-cuda/cumsum.cu(196): error: name followed by "::" must be a class or namespace name cub::DeviceScan::InclusiveSum(nullptr, ^ /llama.cpp/ggml/src/ggml-cuda/cumsum.cu(207): error: name followed by "::" must be a class or namespace name cub::DeviceScan::InclusiveSum((void *) tmp_alloc.get(), tmp_size, src, dst, ne, stream); ^ 2 errors detected in the compilation of "/llama.cpp/ggml/src/ggml-cuda/cumsum.cu". gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:317: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cumsum.cu.o] Error 2 ``` Commit `83b3b1c271` ("cuda: optimize cumsum cub path (#18362)") updated the include directive replacing device_scan.cuh which is causing this issue. This commit uses cub/cub.cuh umbrella header which is consistent with other files in the ggml-cuda directory like mean.cu, sum.cu, etc.	2025-12-28 08:03:04 +01:00
Daniel Bevenius	82c2600585	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-12-28 07:34:17 +01:00
QDelta	4fd59e8427	ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)	2025-12-28 09:33:14 +08:00
lhez	08566977a7	opencl: allow resizing transpose buffers (#18384) * opencl: allow resizing transpose buffers instead of using fixed sizes * opencl: remove commented code	2025-12-27 15:51:14 -08:00
Johannes Gäßler	a4bf35889e	llama-fit-params: fix overflow check (#18354)	2025-12-27 20:20:45 +01:00
Johannes Gäßler	026d2ad472	llama: fix magic number of 999 for GPU layers (#18266) * llama: fix magic number of 999 for GPU layers * use strings for -ngl, -ngld * enacapsulate n_gpu_layers, split_mode	2025-12-27 20:18:35 +01:00
Aman Gupta	06705fdcb3	ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407)	2025-12-27 19:56:27 +08:00
Johannes Gäßler	a52dc60ba3	llama_fit_params: return enum for fail vs. error (#18374)	2025-12-27 09:59:19 +01:00
Johannes Gäßler	9045c9afe5	llama-fit-params: fix Gemma 3 calculation (#18372)	2025-12-27 09:56:04 +01:00
Jeff Bolz	c9ced4910b	vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352) Run a preprocess to count how many times each expert is used, and use this to quickly discard workgroups that aren't needed.	2025-12-26 16:12:58 -06:00
Jeff Bolz	7ac8902133	vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349) * vulkan: Use BK=32 for coopmat2 mul_mat_id * vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader Disable robustness, remove the OOB check in decodeFuncB, and initialize the row_ids to zero to avoid OOB access. Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of zero and remove the '& (BN - 1)'. This allows the compiler to common some of the shared memory loads.	2025-12-26 18:15:50 +01:00
Jeff Bolz	9bf20d8ac3	vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332)	2025-12-26 18:15:02 +01:00
Eve	cb999704fb	vulkan: small dequantization improvements (#18380) * iq4_xs * quants	2025-12-26 18:12:11 +01:00
Jeff Bolz	b96b82fc85	vulkan: Support UPSCALE w/antialias (#18327)	2025-12-26 17:00:57 +01:00
Jeff Bolz	10dc500bdb	vulkan: handle rope with large number of rows (#18306)	2025-12-26 16:53:46 +01:00
o7si	4893cc07bb	server : fix crash when seq_rm fails for hybrid/recurrent models (#18391) * server : fix crash when seq_rm fails for hybrid/recurrent models * server : add allow_processing param to clear_slot	2025-12-26 16:35:29 +01:00
Francisco Herrera	af3be131c0	docs: added note for pre SYCL Intel hardware (#18016) Specify that it's for pre sycl hardware	2025-12-26 10:34:30 +08:00
0Marble	b07cda687c	CANN: implement the SSM_CONV operator (#17737) * CANN: implement SSM_CONV operator Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com> Co-authored-by: Sujin Kang, <waterjin326@gmail.com> * CANN: remove custom error limit for SSM_CONV * CANN: merge SSM_CONV tensor shape/strides into one line --------- Co-authored-by: Sujin Kang, <waterjin326@gmail.com>	2025-12-26 09:12:04 +08:00
Aman Gupta	85c40c9b02	ggml-cuda: fix regex for arch list (#18371) * ggml-cuda: fix regex for arch list * make regex exact	2025-12-26 01:35:14 +08:00
Aman Gupta	83b3b1c271	cuda: optimize cumsum cub path (#18362) * cuda: optimize cumsum cub path * remove heavy perf test	2025-12-25 23:55:38 +08:00
Aman Gupta	b0fb0f0aee	ggml-cuda: fix blackwell native builds (#18361) * ggml-cuda: fix blackwell native builds Replace 12x in native architectures by 12xa * replace for GGML_NATIVE=OFF too * only replace for native * remove 120f-virtual for default compilation --------- Co-authored-by: Aman Gupta <aman>	2025-12-25 22:12:11 +08:00
Penglin Cai	e68c19b0fd	CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (#17934) * CONV_TRANSPOSE_1D kernel_size>255 * remove condition check * fix the bug of type conversion * removing trailing whitespaces * fix: return true in the switch case	2025-12-25 16:46:09 +08:00


7746 Commits
