Commit Graph

8043 Commits

Author SHA1 Message Date
Asbjørn Olling a180ba78c7
cmake: only build cli when server is enabled (#18670) 2026-01-09 16:43:26 +01:00
Georgi Gerganov 53eb9435da
server : fix timing of prompt/generation (#18713) 2026-01-09 12:59:50 +02:00
Georgi Gerganov d3435efc8a
scripts : pr2wt.sh reset to remote head (#18695)
* scripts : pr2wt.sh reset to remote head

* cont : cleaner

* cont : restore --set-upstream-to
2026-01-09 12:16:40 +02:00
Georgi Gerganov f5f8812f7c
server : use different seeds for child completions (#18700)
* server : use different seeds for child completions

* cont : handle default seed

* cont : note
2026-01-09 09:33:50 +02:00
Pascal 74b119e81e webui: prevent mobile dropdown immediate close on synthetic click 2026-01-08 22:48:56 +01:00
Xuan-Son Nguyen 8ece3836b4
common: support remote preset (#18520)
* arg: support remote preset

* proofreading

* allow one HF repo to point to multiple HF repos

* docs: mention the multiple GGUF use case

* correct clean_file_name

* download: also return HTTP status code

* fix case with cache file used

* fix --offline option
2026-01-08 22:35:40 +01:00
Aaron Teo 046d5fd44e
llama: use host memory if device reports 0 memory (#18587) 2026-01-09 05:34:56 +08:00
Masashi Yoshimura 480160d472
ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (#18628)
* Fix GGML_MEM_ALIGN to 8 for emscripten.

* Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten
2026-01-08 08:36:42 -08:00
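
The shape of such a fix is usually just a conditional define. A minimal sketch, assuming an emscripten-specific branch (the actual ggml header, condition, and default value may differ):

```cpp
// Illustrative sketch only -- not the actual ggml source.
// Hypothetical conditional define narrowing the alignment for emscripten builds:
#if defined(__EMSCRIPTEN__)
#define GGML_MEM_ALIGN 8    // assumption: 8-byte alignment for 64-bit wasm/emscripten
#else
#define GGML_MEM_ALIGN 16   // assumption: 16 elsewhere
#endif
```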
Reese Levine 15bff84bf5
ggml webgpu: initial flashattention implementation (#18610)
* FlashAttention (#13)

* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs, haven't actually run a model that uses the neg kernel yet though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (including xielu) working

* removed unnecessary checking if node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and removed bug in unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (#9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (#4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Refactored pipelines and workgroup calculations (#10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in some cases

* Checkpoint

* formatting

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Forgot one parameter unused

* Move some flashattn computation to f32 for correctness
2026-01-08 08:23:39 -08:00
Jeff Bolz 2524c26164
vulkan: fix push constant size for quantize_q8_1 (#18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-08 15:40:58 +01:00
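
The idea of the added assert, as a hypothetical sketch rather than the actual Vulkan backend code (all names below are invented for illustration): record the push-constant size a pipeline was created with and fail loudly when a dispatch passes a struct of a different size.

```cpp
// Hypothetical illustration only (not the llama.cpp Vulkan backend; all names
// are invented): verify that the push-constant struct passed at dispatch time
// matches the size the pipeline was created with.
#include <cassert>
#include <cstddef>

struct pipeline_info {
    size_t push_constant_size;   // size declared at pipeline creation
};

template <typename PC>
void dispatch_with_push_constants(const pipeline_info & p, const PC & pc) {
    (void) pc;
    // a silent mismatch here can corrupt shader inputs, so fail loudly instead
    assert(sizeof(PC) == p.push_constant_size);
    // ... vkCmdPushConstants(...) and vkCmdDispatch(...) would follow here
}
```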
Pascal d000d84201 webui: fix redirect to root ignoring base path 2026-01-08 15:33:23 +01:00
Jeff Bolz cb14b06995
vulkan: optimize ssm_scan (#18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-08 15:16:54 +01:00
Aleksander Grygier 2c0add6a90 Merge remote-tracking branch 'origin/allozaur/mcp-mvp' into allozaur/mcp-mvp 2026-01-08 15:02:05 +01:00
Aleksander Grygier e3ca595651 chore: update webui build output 2026-01-08 14:54:45 +01:00
Aleksander Grygier 6f7750489e refactor: Types 2026-01-08 14:45:47 +01:00
Aleksander Grygier dfd3031b17 refactor: Componentize McpServerCard 2026-01-08 14:18:30 +01:00
Aleksander Grygier 835c06e0d1 refactor: Cleanup 2026-01-08 14:18:12 +01:00
Aleksander Grygier ddbb7dc2e5 fix: Remove redundant CSS class 2026-01-08 14:11:52 +01:00
Adrien Gallouët 55abc39355
vendor : update cpp-httplib to 0.30.0 (#18660)
* vendor : update cpp-httplib to 0.30.0
* common : allow custom headers when downloading
2026-01-08 13:53:54 +01:00
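
A minimal sketch of per-request headers in the vendored cpp-httplib, for illustration only; the host, path, and header values are placeholders, and the actual llama.cpp download code wires this up differently:

```cpp
// Minimal cpp-httplib sketch showing per-request headers; placeholders only,
// not the actual llama.cpp download path.
#include "httplib.h"

int main() {
    httplib::Client cli("http://example.com");      // hypothetical host
    httplib::Headers headers = {
        { "Authorization",   "Bearer <token>" },    // hypothetical custom header
        { "X-Custom-Header", "value" },
    };
    auto res = cli.Get("/model.gguf", headers);     // hypothetical path
    return (res && res->status == 200) ? 0 : 1;
}
```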
Aleksander Grygier bf2a793f42
refactor: Cleanup 2026-01-08 13:49:55 +01:00
Aleksander Grygier 089f38230c feat: Add TruncatedText component 2026-01-08 13:02:46 +01:00
Aleksander Grygier 06febe08b7 fix: Collapsible box trigger 2026-01-08 12:48:15 +01:00
Aleksander Grygier 223c6333e9 refactor: Cleanup 2026-01-08 12:46:10 +01:00
Georgi Gerganov f2f6c88067
scripts : support chaining commands in pr2wt.sh (#18671) 2026-01-08 13:40:23 +02:00
Aleksander Grygier b0ba550928 refactor: Cleanup 2026-01-08 12:03:36 +01:00
도로로도로또 945bf10627
metal : add MoE kernel specialization for ne20=5 (#18667)
Add template specialization for kernel_mul_mm_id_map0 with ne20=5
to support models using 5 active experts (e.g., VAETKI).
2026-01-08 12:37:45 +02:00
Johannes Gäßler 64848deb18
llama-fit-params: free memory target per device (#18679) 2026-01-08 10:07:58 +01:00
Doctor Shotgun 9a5724dee2
ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (#18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-08 11:03:21 +02:00
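
A minimal sketch of the described behavior, assuming a hypothetical helper name (not the actual ggml code): read the environment variable once and fall back to the previous hardcoded default of 32.

```cpp
// Illustrative sketch only (the helper name is hypothetical, not the ggml code):
// read GGML_OP_OFFLOAD_MIN_BATCH once and fall back to the historical default of 32.
#include <cstdlib>

static int op_offload_min_batch() {
    static const int value = [] {
        const char * env = std::getenv("GGML_OP_OFFLOAD_MIN_BATCH");
        return env ? std::atoi(env) : 32;   // 32 was the previously hardcoded value
    }();
    return value;
}
```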
Daniel Bevenius 9c142e3a2a
model-conversion : add warn about transformers mismatch (#18691)
This commit adds a check comparing the installed transformers library version
with the transformers version that the original model supports. The check is
performed upon a model verification failure and prints a warning/hint
suggesting that the user install the correct version of the transformers
library.

The motivation for this change is that model verification can fail due to
differences in the transformers library used, and it might not be obvious that
this is the cause of the failure. With this warning, the correct version can be
checked, hopefully saving time troubleshooting the verification failure.
2026-01-08 09:29:53 +01:00
Daniel Bevenius df7fb92170
model-conversion : remove -st targets for converted model (#18689)
This commit removes the `-st` make target for running the converted
embedding model.

The motivation for this is that the pooling type is now part of the
.gguf metadata of the model and is used by llama-debug when running
the model, so there is no longer any need to specify the pooling type
separately.

The commit also adds an option to specify the type of normalization
applied to the output embeddings when running the converted model.

The README documentation has also been updated to reflect these changes.
2026-01-08 09:29:15 +01:00
Aleksander Grygier 56b34bf63b refactor: Collapsible Content Block & small fixes 2026-01-08 09:17:24 +01:00
Julius Tischbein 2038101bd9
llama : add `use_direct_io` flag for model loading (#18166)
* Adding --direct-io flag for model loading

* Fixing read_raw() calls

* Fixing Windows read_raw_at

* Changing type off_t to size_t for Windows and renaming functions

* disable direct io when mmap is explicitly enabled

* Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL

* Fallback to std::fread in case O_DIRECT fails due to bad address

* Windows: remove const keywords and unused functions

* Update src/llama-mmap.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-08 08:35:30 +02:00
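
A minimal sketch of the direct-I/O-with-fallback idea, assuming a Linux/POSIX platform and hypothetical names; the real llama-mmap.cpp handles alignment, Windows, and backend upload paths that this omits:

```cpp
// Illustrative sketch only; the real llama-mmap.cpp is more involved
// (alignment handling, Windows support, backend upload integration).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1            // for O_DIRECT (assumption: Linux/glibc)
#endif
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Try a direct (page-cache-bypassing) read first; fall back to buffered stdio
// if O_DIRECT is unavailable or the read fails (e.g. due to buffer alignment).
static bool read_chunk(const char * path, void * buf, size_t len, off_t off) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        const ssize_t n = pread(fd, buf, len, off);
        close(fd);
        if (n == (ssize_t) len) {
            return true;          // direct I/O path succeeded
        }
    }
    FILE * f = std::fopen(path, "rb");
    if (!f) {
        return false;
    }
    const bool ok = fseeko(f, off, SEEK_SET) == 0 &&
                    std::fread(buf, 1, len, f) == len;
    std::fclose(f);
    return ok;
}
```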
shaofeiqi 568371a726
opencl: add FILL op support (#18682) 2026-01-07 22:04:50 -08:00
Sigbjørn Skjæret 5b8844ae53
scripts : fix repos cloned with .git extension (#18669) 2026-01-07 22:35:34 +01:00
Sigbjørn Skjæret 7e16fef085
convert : more variants of rope_theta config entries (#18668) 2026-01-07 22:34:51 +01:00
Oliver Walsh f5245b5e4e
cuda : fix build on cuda 12.8 (#18672)
compute121 requires 12.9

Signed-off-by: Oliver Walsh <owalsh@redhat.com>
2026-01-07 22:32:44 +01:00
R ae9f8df778
fix(docker): add missing libglvnd libraries to Vulkan image (#18664)
Add libglvnd0, libgl1, libglx0, libegl1, libgles2 to the Vulkan
Dockerfile base image. These libraries are required by mesa-vulkan-drivers
to properly initialize the Vulkan ICD and detect GPU devices.

Without these libraries, vkEnumeratePhysicalDevices() returns an empty
list, resulting in a "ggml_vulkan: No devices found." error.

Fixes #17761
2026-01-07 16:57:42 +01:00
Adrien Gallouët 56d2fed2b3
tools : remove llama-run (#18661)
* tools : remove llama-run
* Remove licenses/LICENSE-linenoise

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-07 16:18:26 +01:00
Aleksander Grygier d89ada8cee chore: update webui build output 2026-01-07 15:46:32 +01:00
Aleksander Grygier 98bce85b1f refactor: Cleanup 2026-01-07 15:44:23 +01:00
Aleksander Grygier b9adc00d3f chore: update webui build output 2026-01-07 14:27:48 +01:00
Georgi Gerganov 56426673cb
scripts : add pr2wt.sh (#18644)
* scripts : add pr2wt.sh

* script : shebang

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-07 15:16:20 +02:00
Aleksander Grygier 10e5ad1396 feat: UI improvements 2026-01-07 14:01:27 +01:00
Aleksander Grygier bc07e0723d feat: Always show Mcp Selector 2026-01-07 14:01:27 +01:00
Daniel Bevenius bb77764c2d
convert : clarify sentence-transformers-dense-modules help [no ci] (#18662)
* convert : clarify sentence-transformers-dense-modules help [no ci]

This commit updates this option's help message, which currently looks
like this:
```console
  --sentence-transformers-dense-modules
                        Whether to include sentence-transformers dense modules.It can be used for sentence-transformers models, like
                        google/embeddinggemma-300mDefault these modules are not included.
```
2026-01-07 13:18:53 +01:00
Sigbjørn Skjæret 9dfa8ee950
ci : run cann build unconditionally [no ci] (#18659) 2026-01-07 13:07:08 +01:00
Pascal 4c095df509 fix: remove double scrollbar in model selector by using Bits UI content available height 2026-01-07 12:23:03 +01:00
Jeff Bolz ca4a8370bc
vulkan: reject ops when a tensor is too large to allocate (#18646) 2026-01-07 12:03:32 +01:00
virajwad 03023296cf
vulkan: Warptile tuning for Intel Xe2/Xe3 (#18178)
* modify warptile tuning for xe3

* intel vendor check w/ coopmat support

* fix back formatting

* fix formatting change 2

* move intel check to chip specific tuning part

* Change to support both windows and linux

* modify m_warptile to l_warptile for intel

* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)

* Code style changes

* Code style changes (2)

* Code style changes (3)
2026-01-07 11:59:47 +01:00
Eve 8c77a04cc7
vulkan: more mul mat optimizations (#18533)
* q4_k

* q5_k

* q2_k

* q4_1

* q5_1

* better buf index
2026-01-07 11:13:17 +01:00