llama.cpp

Commit Graph

Author	SHA1	Message	Date
Aaron Teo	dded9feb50	ggml-blas: switch from cpu to blas buffer Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-20 22:20:08 +08:00
Aaron Teo	7729be2aae	ggml-blas: bring back openmp warnings Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-20 15:25:42 +08:00
Aaron Teo	265183d3cf	ggml-blas: bring back out prod Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-20 15:21:50 +08:00
Aaron Teo	adbfbf9086	ggml-blas: refactor backend Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-20 15:11:45 +08:00
Aaron Teo	2ee4d5fe2f	ggml-blas: fix graph realloc Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-20 14:03:24 +08:00
Aaron Teo	623e7135c2	ggml-blas: fix memleak Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-20 13:37:26 +08:00
Aaron Teo	04ed19bbc0	ggml-blas: further cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 23:37:56 +08:00
Aaron Teo	46dea5da74	CODEOWNERS: add @taronaeo to blas backend [no ci] Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 23:23:08 +08:00
Aaron Teo	10ce5e056d	ggml-blas: more code formatting Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 23:20:28 +08:00
Aaron Teo	75e506ff22	ggml-blas: clean up code Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 23:19:22 +08:00
Aaron Teo	7998d08b29	ggml-blas: bring back openmp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 23:07:54 +08:00
Aaron Teo	e481be6da6	ggml-blas: move global blas n threads to set_n_threads Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 22:19:12 +08:00
Aaron Teo	6dff031caa	ggml-blas: force dequant routine to use max logical cores Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 21:57:09 +08:00
Aaron Teo	447057973c	ggml-blas: fix ne Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 19:27:42 +08:00
Aaron Teo	717531b1a7	ggml-blas: add note Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 19:22:14 +08:00
Aaron Teo	aae6d1e9b0	ggml-blas: fix invalid data access Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 19:15:34 +08:00
Aaron Teo	9a14a094ac	ggml: rewrite ggml-blas Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 18:06:31 +08:00
Aaron Teo	61ee32dec3	tests: set tensor usage as weight for weight tensors only for mul_mat and mul_mat_id ops Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-14 18:05:51 +08:00
Aaron Teo	1926e07e1a	ggml-blas: code clean up Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-11 21:27:13 +08:00
Aaron Teo	19c8ec9964	ggml-blas: fully working mmid Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-11 21:10:25 +08:00
Aaron Teo	f682374613	ggml-blas: initial mmid impl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-12-11 20:51:02 +08:00
Yuichiro Utsumi	e4ae383317	docs: use port 8080 in Docker examples (#17903)	2025-12-11 17:12:07 +08:00
nullname	34ce48d97a	ggml-hexagon: fix `rope` failure at `test-backend-ops` (#17565) * fix test failure * fix: correct scaling calculations in rope_cache_init * fix: optimize element copying in rope_hex_f32 using memcpy * fix: optimize loop boundaries in rope_hex_f32 for better performance * feat: add profiling macros for performance measurement in operations	2025-12-10 14:45:43 -08:00
Sigbjørn Skjæret	45e350e3d3	ci: fix riscv64-native build (#17916)	2025-12-10 23:24:31 +01:00
Xuan-Son Nguyen	c6b2c9310c	mtmd: some small clean up (#17909) * clip: add support for fused qkv in build_vit * use build_ffn whenever possible * fix internvl * mtmd-cli: move image to beginning * test script: support custom args	2025-12-10 22:20:06 +01:00
Xuan-Son Nguyen	34a6d86982	cli: enable jinja by default (#17911) * cli: enable jinja by default * Update common/arg.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-10 22:19:42 +01:00
Pascal	f32ca51bfe	server: add presets (config) when using multiple models (#17859) * llama-server: recursive GGUF loading Replace flat directory scan with recursive traversal using std::filesystem::recursive_directory_iterator. Support for nested vendor/model layouts (e.g. vendor/model/.gguf). Model name now reflects the relative path within --models-dir instead of just the filename. Aggregate files by parent directory via std::map before constructing local_model server : router config POC (INI-based per-model settings) * server: address review feedback from @aldehir and @ngxson PEG parser usage improvements: - Simplify parser instantiation (remove arena indirection) - Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping) - Fix last line without newline bug (+ operator instead of <<) - Remove redundant end position check Feature scope: - Remove auto-reload feature (will be separate PR per @ngxson) - Keep config.ini auto-creation and template generation - Preserve per-model customization logic Co-authored-by: aldehir <aldehir@users.noreply.github.com> Co-authored-by: ngxson <ngxson@users.noreply.github.com> * server: adopt aldehir's line-oriented PEG parser Complete rewrite of INI parser grammar and visitor: - Use p.chars(), p.negate(), p.any() instead of p.until() - Support end-of-line comments (key=value # comment) - Handle EOF without trailing newline correctly - Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]) - Simplified visitor (no pending state, no trim needed) - Grammar handles whitespace natively via eol rule Business validation preserved: - Reject section names starting with LLAMA_ARG_ - Accept only keys starting with LLAMA_ARG_* - Require explicit section before key-value pairs Co-authored-by: aldehir <aldehir@users.noreply.github.com> * server: fix CLI/env duplication in child processes Children now receive minimal CLI args (executable, model, port, alias) instead of inheriting all router args. Global settings pass through LLAMA_ARG_* environment variables only, eliminating duplicate config warnings. Fixes: Router args like -ngl, -fa were passed both via CLI and env, causing 'will be overwritten' warnings on every child spawn * add common/preset.cpp * fix compile * cont * allow custom-path models * add falsey check * server: fix router model discovery and child process spawning - Sanitize model names: replace / and \ with _ for display - Recursive directory scan with relative path storage - Convert relative paths to absolute when spawning children - Filter router control args from child processes - Refresh args after port assignment for correct port value - Fallback preset lookup for compatibility - Fix missing argv[0]: store server binary path before base_args parsing * Revert "server: fix router model discovery and child process spawning" This reverts commit e3832b42eeea7fcb108995966c7584479f745857. * clarify about "no-" prefix * correct render_args() to include binary path * also remove arg LLAMA_ARG_MODELS_PRESET for child * add co-author for ini parser code Co-authored-by: aldehir <hello@alde.dev> * also set LLAMA_ARG_HOST * add CHILD_ADDR * Remove dead code --------- Co-authored-by: aldehir <aldehir@users.noreply.github.com> Co-authored-by: ngxson <ngxson@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: aldehir <hello@alde.dev>	2025-12-10 22:18:21 +01:00
Max Krasnyansky	e1f4921980	Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes (#17748) * tests: update barrier test to check for race condition in active threads * cpu: combine n_graph and n_threads into a single atomic update * tests: add multi-graph test for test_barrier	2025-12-10 12:32:23 -08:00
Georgi Gerganov	4dff236a52	ggml : remove GGML_KQ_MASK_PAD constant (#17910) * ggml : remove GGML_KQ_MASK_PAD constant * cont : remove comment	2025-12-10 20:53:16 +02:00
Sigbjørn Skjæret	4df6e859e9	cuda : add missing support check for xielu (#17895)	2025-12-10 16:16:20 +01:00
Xuan-Son Nguyen	6c2131773c	cli: new CLI experience (#17824) * wip * wip * fix logging, add display info * handle commands * add args * wip * move old cli to llama-completion * rm deprecation notice * move server to a shared library * move ci to llama-completion * add loading animation * add --show-timings arg * add /read command, improve LOG_ERR * add args for speculative decoding, enable show timings by default * add arg --image and --audio * fix windows build * support reasoning_content * fix llama2c workflow * color default is auto * fix merge conflicts * properly fix color problem Co-authored-by: bandoti <bandoti@users.noreply.github.com> * better loading spinner * make sure to clean color on force-exit * also clear input files on "/clear" * simplify common_log_flush * add warning in mtmd-cli * implement console writer * fix data race * add attribute * fix llama-completion and mtmd-cli * add some notes about console::log * fix compilation --------- Co-authored-by: bandoti <bandoti@users.noreply.github.com>	2025-12-10 15:28:59 +01:00
Eric Zhang	b677721819	model : Qwen3-Next-80B-A3B has 48 layers (#17898) * model : Qwen3-Next-80B-A3B has 48 layers * model : Add 80B-A3B type name	2025-12-10 15:22:40 +01:00
lhez	2d2e1030e3	docs : update opencl ops (#17904)	2025-12-10 15:20:00 +01:00
Johannes Gäßler	17f7f4baad	CUDA: fix unpadded strides in MMA FA kernel (#17891)	2025-12-10 12:39:56 +01:00
Xuan-Son Nguyen	9e79b0116e	convert: allow using quantized Mistral weight (#17889) * convert: allow using quantized Mistral weight * data_torch.ndim * update dequant fn Co-authored-by: compilade <compilade@users.noreply.github.com> --------- Co-authored-by: compilade <compilade@users.noreply.github.com>	2025-12-10 10:26:22 +01:00
Neo Zhang Jianyu	2e9eab80c2	fix softmax for iGPU (#17838)	2025-12-10 16:59:57 +08:00
Aldehir Rojas	2fbe3b7bb7	common : add parser for ministral/mistral large 3/devstral 2 (#17713)	2025-12-09 17:31:04 -06:00
Sigbjørn Skjæret	63391852b0	docs : update cpu and cuda ops (#17890) * update cuda ops * update CPU as well	2025-12-09 23:31:29 +01:00
Gabe Goodhart	086a63e3a5	metal: SSM kernel improvements (#17876) * feat: Add a batched version of ssm_conv This was done using Claude Code. It found a number of optimizations around how the threads were organized, resulting in a huge performance boost! Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Optimized SSM_SCAN kernel for metal This used Claude Code and resulted in a modest performance improvement while maintaining correctness. Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Add test-backend-ops perf tests for SSM_CONV Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Real representative tests for SSM_CONV Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use function constant for ssm_conv batch size Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: backend op tests for ssm_scan from granite4 1b-h Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: remove commented out templates Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: float4 version of ssm_conv_batched Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing ggml_metal_cv_free Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-09 21:30:02 +02:00
Piotr Wilkin (ilintar)	b63509262a	Add DIAG for CUDA (#17873) * Add DIAG for CUDA * Refactor parameters	2025-12-09 20:28:57 +01:00
Johannes Gäßler	48f47565a7	docs: clarify that CPU support should be first (#17886)	2025-12-09 20:10:36 +01:00
Gabe Goodhart	02e409a5be	ggml : Provide macos-specific backtrace printing to avoid terminal death (#17869) * fix: Provide macos-specific backtrace printing to avoid terminal death Branch: MacOSSafeBacktrace Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add GGML_BACKTRACE_LLDB env var to enable using lldb for backtrace Branch: MacOSSafeBacktrace Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-12-09 18:29:07 +02:00
Georgi Gerganov	6b82eb7883	metal : print node names for debugging (#17882)	2025-12-09 15:25:49 +02:00
Sigbjørn Skjæret	86a3f0fad8	ggml : allow fill node alloc inplace (#17870)	2025-12-09 12:23:47 +01:00
Rhys-T	63908b631a	cmake: fix Mach-O current version number (#17877) PR #17091 set the VERSION of various libraries to 0.0.abcd, where abcd is the LLAMA_BUILD_NUMBER. That build number is too large to fit in the Mach-O 'current version' field's 'micro' part, which only goes up to 255. This just sets the Mach-O current version to 0 to get it building properly again. Fixes #17258.	2025-12-09 13:17:41 +02:00
Sigbjørn Skjæret	42b12b5608	model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652) * nit, DeepSeek V1 MoE is 16B * base type on n_ff_exp instead	2025-12-09 12:15:06 +01:00
Xuan-Son Nguyen	4e842d5120	console: allow using arrow left/right, home/end keys and history mode (#17836) * console: allow using arrow left/right to edit the line (with UTF-8 support) * console: fix arrow keys on Windows using private-use Unicode * console: add Home/End key support for Windows and Linux * console: add basic Up/Down history navigation * fix build * console: allow using arrow left/right to edit the line (with UTF-8 support) * console: fix arrow keys on Windows using private-use Unicode * console: add Home/End key support for Windows and Linux * console: add basic Up/Down history navigation * console: remove unreachable wc == 0 check after VK switch * console: add Ctrl+Left/Right word navigation - Add KEY_CTRL_ARROW_LEFT and KEY_CTRL_ARROW_RIGHT codes - Windows: detect CTRL modifier via dwControlKeyState - Linux: parse ANSI sequences with modifier (1;5D/C) - Implement move_word_left/right with space-skipping logic - Refactor escape sequence parsing to accumulate params * console: add Delete key support - Windows: VK_DELETE detection - Linux: ESC[3~ sequence parsing - Forward character deletion with UTF-8 support * console: implement bash-style history editing - Edit any history line during UP/DOWN navigation, edits persist - Pressing Enter appends edited version as new history entry - Original lines stay untouched in their positions * clean up * better history impl * fix decode_utf8 --------- Co-authored-by: Pascal <admin@serveurperso.com>	2025-12-09 11:53:59 +01:00
Chenguang Li	ca709e427b	CANN: add support for partial RoPE and Vision mode (#17543) * cann: add support for partial RoPE and Vision mode Add support for two important RoPE variants: partial rotation (rope_dims < ne0) and Vision mode rotation. 1. Support for partial RoPE (rope_dims < ne0): - Split tensor into head (first rope_dims dimensions) and tail portions - Apply rotation only to head portion using RotaryPositionEmbedding operator - Copy unrotated tail portion directly from source to destination - Handle both contiguous and non-contiguous tensor layouts 2. Support for Vision mode (GGML_ROPE_TYPE_VISION): - Set rope_dims = ne0 for Vision mode to rotate entire tensor - Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2) - No tail handling needed since entire tensor is rotated Implementation details: - Use has_tail flag to determine execution path: head/tail splitting when rope_dims < ne0, or full tensor rotation when rope_dims == ne0 - Support both F32 and F16 data types with intermediate F32 conversion - Copy non-contiguous tensors to contiguous buffers before calling RotaryPositionEmbedding operator for compatibility - Improve cache invalidation logic to include rope_dims and indep_sects parameters These enhancements enable CANN backend to handle various RoPE configurations used in modern vision-language models and models with partial rotation. * cann: fix review comment	2025-12-09 17:53:23 +08:00
Johannes Gäßler	0cdce38a97	CUDA: fix FP16 overflow in tile FA kernel (#17875)	2025-12-09 09:34:02 +01:00
Aldehir Rojas	e39502e74b	llama : add token matching support to llama-grammar (#17816) * llama : add token support to llama-grammar * fix inverse token comment * refactor trigger_patterns to replay tokens instead of the entire string * add token documentation * fix test-llama-grammar * improve test cases for tokens	2025-12-09 00:32:57 -06:00

7378 Commits, All Branches
