Georgi Gerganov
2fbde785bc
kv-cache : optimize KQ mask construction ( #18842 )
* kv-cache : optimize KQ mask construction
* cont : add explanation + improve
* cont : fix
2026-01-17 15:42:42 +02:00
Reese Levine
a89002f07b
ggml webgpu: support for backend sampling ( #18880 )
* ggml webgpu: add SOFTPLUS unary operator
Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern
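The stability pattern referenced above (f32 intermediates, guarded exponential) can be sketched in scalar Python; the branch-on-sign `log1p` form is one common way to avoid overflow, not the exact WebGPU shader code:

```python
import math

def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + exp(x)).

    Naive exp(x) overflows for large x; splitting on the sign keeps the
    exponent non-positive so exp() never overflows.  (A sketch of the
    stability idea only, not the WebGPU shader itself.)
    """
    if x > 0.0:
        return x + math.log1p(math.exp(-x))   # exp(-x) <= 1, safe
    return math.log1p(math.exp(x))            # exp(x) <= 1, safe
```

For x = 1000 the naive `log(1 + exp(x))` overflows (in f16 it already overflows near x = 11), while the guarded form returns 1000.0.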
* ggml webgpu: add EXPM1 unary operator
Implements EXPM1 (exp(x) - 1) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
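Why a dedicated EXPM1 op rather than composing exp and subtract: the naive form cancels catastrophically near zero. A quick scalar demonstration (`math.expm1` stands in for the shader; the kernel's exact formulation is an assumption):

```python
import math

def expm1_naive(x: float) -> float:
    # literal exp(x) - 1: catastrophic cancellation for tiny |x|
    return math.exp(x) - 1.0

x = 1e-12
# a dedicated expm1 is essentially exact near zero...
exact_err = abs(math.expm1(x) - x)
# ...while the naive form picks up roughly 1e-4 relative error here
naive_err = abs(expm1_naive(x) - x)
```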
* ggml webgpu: add FLOOR unary operator
Implements FLOOR (rounds down to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add CEIL unary operator
Implements CEIL (rounds up to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add ROUND unary operator
Implements ROUND (rounds to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add TRUNC unary operator
Implements TRUNC (truncates towards zero) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
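The four rounding ops above differ only in rounding direction; a scalar illustration in plain Python (the shader's tie-breaking rule for ROUND is not stated in the log):

```python
import math

# floor rounds toward -inf, ceil toward +inf, trunc toward zero:
# they agree on positive inputs but differ on negatives.
assert math.floor(-1.5) == -2
assert math.ceil(-1.5) == -1
assert math.trunc(-1.5) == -1
assert math.floor(1.5) == math.trunc(1.5) == 1

# "round to nearest" additionally needs a tie rule: Python's round()
# ties to even; which rule the WebGPU kernel uses is not stated here.
assert round(0.5) == 0
assert round(1.5) == 2
```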
* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)
* Updates to webgpu get_memory
* Add argmax
* Add argmax,cumsum,sum,sum_rows
* Add necessary CPY/GET_ROWS operators
* Support for argsort using multi-pass strategy
* Update set_rows for i32 indices, move to pre-wgsl
* Port unary operators to pre-wgsl and support FILL
* Implement PAD
* Add support for top-k
* clean up, scope pipeline init mutex
* fix newline
* Add support for log
* Update LOG for better precision, and ops doc
---------
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>
2026-01-16 16:12:43 -08:00
Mustafa Cavus
aa4bc90030
Syntax correction for workflows build file
2026-01-16 13:06:43 -08:00
Thore Koritzius
388ce82241
ggml : extend ggml_pool_1d + metal ( #16429 )
* chore: resolve conflicts
* feat: ggml metal impl
* fix: ggml_metal_kargs_pool_1d struct
* fix: require contiguous input
* chore: test pool_1d
* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts
* chore: add p0 and s0 to testing
* fix: allow padding for cpu and metal
* Update ggml/src/ggml-metal/ggml-metal.metal
* fix: correct single-threaded loop
* ggml : cleanup
* tests : add ne[1] != 1 tests
* fix: ne[1] handling in np
* cont : fixes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 16:59:56 +02:00
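For reference, the extended op follows the usual 1-D pooling shape arithmetic; a sketch under the assumption that ggml uses the standard formula, with the kernel size `k0`, stride `s0`, and padding `p0` named as in the test bullets above:

```python
def pool1d_out_len(n: int, k0: int, s0: int, p0: int) -> int:
    # standard pooling output length: floor((n + 2*p0 - k0) / s0) + 1
    return (n + 2 * p0 - k0) // s0 + 1

# with p0 = 0 and s0 = k0 (the originally asserted case) this is n // k0
assert pool1d_out_len(8, 2, 2, 0) == 4
# with padding: (7 + 2*1 - 3) // 2 + 1
assert pool1d_out_len(7, 3, 2, 1) == 4
```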
hipudding
6ba6a3c76f
docs : update ops.md for CANN backend ( #18654 )
2026-01-16 13:32:17 +01:00
Perry Naseck
0802d4cfb3
ggml-blas: hide warnings from included BLAS headers ( #18818 )
* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set
* ggml-blas: hide warnings from included BLAS headers
2026-01-16 13:38:25 +02:00
Tarek Dakhran
c945aaaef2
mtmd : Fix ASR for LFM2.5-Audio-1.5B ( #18876 )
2026-01-16 11:23:08 +01:00
Xuan-Son Nguyen
c15395f73c
common : implement new jinja template engine ( #18462 )
* jinja vm
* lexer
* add vm types
* demo
* clean up
* parser ok
* binary_expression::execute
* shadow naming
* bin ops works!
* fix map object
* add string builtins
* add more builtins
* wip
* use mk_val
* eval with is_user_input
* render gemma tmpl ok
* track input string even after transformations
* support bound functions
* keyword arguments and slicing array
* use shared_ptr for values
* add mk_stmt
* allow print source on exception
* fix negate test
* testing more templates
* mostly works
* add filter_statement
* allow func to access ctx
* add jinja-value.cpp
* impl global_from_json
* a lot of fixes
* more tests
* more fix, more tests
* more fixes
* rm workarounds
* demo: type inference
* add placeholder for tojson
* improve function args handling
* rm type inference
* no more std::regex
* trailing spaces
* make testing more flexible
* make output a bit cleaner
* (wip) redirect minja calls
* test: add --output
* fix crash on macro kwargs
* add minimal caps system
* add some workarounds
* rm caps_apply_workarounds
* get rid of preprocessing
* more fixes
* fix test-chat-template
* move test-chat-jinja into test-chat-template
* rm test-chat-jinja from cmake
* test-chat-template: use common
* fix build
* fix build (2)
* rename vm --> interpreter
* improve error reporting
* correct lstrip behavior
* add tojson
* more fixes
* disable tests for COMMON_CHAT_FORMAT_GENERIC
* make sure tojson outputs keys in the correct order
* add object.length
* fully functional selectattr / rejectattr
* improve error reporting
* more builtins added, more fixes
* create jinja rendering tests
* fix testing.h path
* adjust whitespace rules
* more fixes
* temporary disable test for ibm-granite
* r/lstrip behavior matched with hf.js
* minimax, glm4.5 ok
* add append and pop
* kimi-k2 ok
* test-chat passed
* fix lstrip_block
* add more jinja tests
* cast to unsigned char
* allow dict key to be numeric
* nemotron: rm windows newline
* tests ok
* fix test
* rename interpreter --> runtime
* fix build
* add more checks
* bring back generic format support
* fix Apertus
* [json.exception.out_of_range.403] key 'content' not found
* rm generic test
* refactor input marking
* add docs
* fix windows build
* clarify error message
* improved tests
* split/rsplit with maxsplit
* non-inverse maxsplit
forgot to change after simplifying
* implement separators for tojson and fix indent
* i like to move it move it
* rename null --> none
* token::eof
* some nits + comments
* add exception classes for lexer and parser
* null -> none
* rename global -> env
* rm minja
* update docs
* docs: add input marking caveats
* implement missing jinja-tests functions
* oops
* support trim filter with args, remove bogus to_json reference
* numerous argument fixes
* updated tests
* implement optional strip chars parameter
* use new chars parameter
* float filter also has default
* always leave at least one decimal in float string
* jinja : static analysis + header cleanup + minor fixes
* add fuzz test
* add string.cpp
* fix chat_template_kwargs
* nits
* fix build
* revert
* unrevert
sorry :)
* add fuzz func_args, refactor to be safer
* fix array.map()
* loosen ensure_vals max count condition, add not impl for map(int)
* hopefully fix windows
* check if empty first
* normalize newlines
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 11:22:06 +01:00
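The split/rsplit work in the template-engine commit above mirrors Python string semantics, where `maxsplit` caps the number of splits performed from the left (split) or from the right (rsplit). An illustrative example in plain Python, not the engine's actual API:

```python
s = "a,b,c,d"

# split consumes separators left-to-right, rsplit right-to-left;
# maxsplit caps how many splits happen, not how many pieces remain.
assert s.split(",", 2) == ["a", "b", "c,d"]
assert s.rsplit(",", 2) == ["a,b", "c", "d"]
assert s.split(",", -1) == ["a", "b", "c", "d"]  # negative = unlimited
```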
Julius Tischbein
aa1dc3770a
Set mmap and direct_io to false by default in llama-bench.cpp ( #18841 )
2026-01-16 09:46:51 +01:00
Raul Torres
4ea2eaac01
CANN: Remove unused `ggml_cann_get_device` function ( #18625 )
2026-01-16 16:34:09 +08:00
Chenguang Li
e20fa27a02
CANN: fix an issue where get_env was not fully renamed ( #18796 )
* CANN: fix an issue where get_env was not fully renamed
* ci: add cann with acl group
* ci: define use_acl_graph using GitHub Action
* ci: update cann dockerfile with acl graph
2026-01-16 16:24:04 +08:00
hipudding
baa4ba0aec
CANN: support gated linear attn ( #18653 )
* CANN: support gated linear attn
This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased
before merging.
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
* CANN: optimize OP gla
Optimize gla for high performance
* Remove unused comments
---------
Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain>
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
2026-01-16 16:18:49 +08:00
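For orientation, the operator added above maintains a gated recurrent state rather than an attention matrix. A readable reference recurrence for one head (a sketch of one common formulation of gated linear attention, not the CANN kernel, whose exact parameterization may differ):

```python
def gated_linear_attn(q, k, v, g):
    """Per step t, with state S of shape (d_k, d_v):
        S[i][j] = g_t[i] * S[i][j] + k_t[i] * v_t[j]
        o_t[j]  = sum_i q_t[i] * S[i][j]
    q, k, g: length-T lists of d_k vectors; v: length-T list of d_v
    vectors; gate entries are decays in [0, 1].
    """
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for qt, kt, vt, gt in zip(q, k, v, g):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gt[i] * S[i][j] + kt[i] * vt[j]
        out.append([sum(qt[i] * S[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out
```

With gates at 1.0 this degenerates to plain (cumulative) linear attention; with gates at 0.0 each step sees only its own key/value.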
Mustafa Cavus
d7dccf887b
kq_mask naming fix
2026-01-15 14:38:53 -08:00
Yamini Nimmagadda
d3649c11cb
Update OPENVINO.md
2026-01-15 11:39:08 -08:00
Yamini Nimmagadda
e9ed5c4cb6
Update OPENVINO.md
2026-01-15 11:39:08 -08:00
Yamini Nimmagadda
f44c60e995
Update OPENVINO.md
2026-01-15 11:39:08 -08:00
Yamini Nimmagadda
63eed0d9f3
Update build.md
2026-01-15 11:39:08 -08:00
Yamini Nimmagadda
61552e4450
Update OPENVINO.md
2026-01-15 11:39:08 -08:00
Yamini Nimmagadda
9ba324726a
Update OPENVINO.md
2026-01-15 11:39:08 -08:00
Yamini Nimmagadda
25e652569b
Update OPENVINO.md
2026-01-15 11:39:08 -08:00
Yamini Nimmagadda
416556a87d
Create OPENVINO.md in llama.cpp backend docs
2026-01-15 11:39:08 -08:00
Mustafa Cavus
599335c633
Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp
2026-01-15 11:39:08 -08:00
Mustafa Cavus
a92eceecd9
Update ggml/src/ggml-openvino/ggml-decoder.cpp
2026-01-15 11:39:08 -08:00
Mustafa Cavus
a81b202f57
requant to f16 for Q6 embed on NPU
2026-01-15 11:39:08 -08:00
Mustafa Cavus
a40a5dfc60
npu perf fix
2026-01-15 11:39:08 -08:00
Mustafa Cavus
981ec6571d
code cleanup
2026-01-15 11:39:08 -08:00
Mustafa Cavus
d2fc15226b
Update ggml/src/ggml-openvino/ggml-decoder.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
2026-01-15 11:39:08 -08:00
Mustafa Cavus
5f30eacdb4
Initial stateful graph support
2026-01-15 11:39:08 -08:00
Yu, Zijun
0d6f253e48
Support -ctk f32
2026-01-15 11:39:08 -08:00
Yu, Zijun
f5c71e3cf4
Update build.md
2026-01-15 11:39:08 -08:00
Yu, Zijun
4e451778d3
Use Q8_0_C in token embd, lm_head, and for 5- and 6-bit quants
2026-01-15 11:39:08 -08:00
Yu, Zijun
67c9720e49
Optimize symmetric quant weight extraction: use single zp
2026-01-15 11:39:08 -08:00
Yu, Zijun
c1142ddb7c
NPU always requant to q4_0_128
2026-01-15 11:39:08 -08:00
Yu, Zijun
52a44012c0
Update build.md to include OpenCL
2026-01-15 11:39:08 -08:00
Yu, Zijun
cfc471353d
FIX: use remote tensor from singleton
2026-01-15 11:39:08 -08:00
Yu, Zijun
a356b44477
only use remote tensor for kvcache for GPU
2026-01-15 11:39:08 -08:00
Yu, Zijun
88d1d17eac
only use remote tensor for kvcache
2026-01-15 11:39:08 -08:00
Yu, Zijun
8273a7c2f4
Use ggml_aligned_malloc
2026-01-15 11:39:08 -08:00
Yu, Zijun
d757849741
Put kvcache on GPU
2026-01-15 11:39:08 -08:00
Yu, Zijun
3fdcb6ab72
Add ov_backend_host_buffer; Use cached remote context
2026-01-15 11:39:08 -08:00
Yu, Zijun
72bba828df
Use shared_buffer for GPU NPU; Refactor
2026-01-15 11:39:08 -08:00
Yu, Zijun
22d9c17a6f
backend buffer: allocate on host
2026-01-15 11:39:08 -08:00
Arshath
ae5336386f
Update build.md for Windows
2026-01-15 11:39:08 -08:00
Yu, Zijun
0ef2e5e4d4
Fix decoder can_reuse for llama-bench
2026-01-15 11:39:08 -08:00
Xuejun Zhai
9e3163e846
Remove unused variable nodes
2026-01-15 11:39:08 -08:00
Yu, Zijun
c9234b44cc
NPU fix q4 perf regression
2026-01-15 11:39:08 -08:00
Yu, Zijun
ae01322dbd
NPU fix wrong model output shape
2026-01-15 11:39:08 -08:00
Yu, Zijun
469325c6da
GPU remove Q6_K requantization
2026-01-15 11:39:08 -08:00
Yu, Zijun
28da9a9adc
Reuse cached decoder
2026-01-15 11:39:08 -08:00
Xuejun Zhai
91a1b20c82
Fix error for decoder cache
2026-01-15 11:39:08 -08:00