Commit Graph

8297 Commits

Author SHA1 Message Date
Georgi Gerganov 2fbde785bc
kv-cache : optimize KQ mask construction (#18842)
* kv-cache : optimize KQ mask construction

* cont : add explanation + improve

* cont : fix
2026-01-17 15:42:42 +02:00
Reese Levine a89002f07b
ggml webgpu: support for backend sampling (#18880)
* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern
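
For illustration only, the overflow concern described above can be sketched in Python. This is not the WGSL shader from the commit. The rearranged formula `max(x, 0) + log1p(exp(-|x|))` is one common way to stabilize softplus and is an assumption here; the commit only states that intermediates are computed in f32. The function names are hypothetical.

```python
import math

F16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def softplus_naive(x: float) -> float:
    # Direct form from the commit message: log(1 + exp(x)).
    # exp(x) is the intermediate that can overflow.
    return math.log(1.0 + math.exp(x))

def softplus_stable(x: float) -> float:
    # Assumed stabilization (not taken from the shader):
    # exp() is only ever called on a non-positive argument, so it stays in (0, 1].
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# exp(x) already exceeds the f16 range for moderate inputs,
# which is why a wider intermediate type helps.
print(math.exp(12.0) > F16_MAX)
print(softplus_stable(0.0), softplus_stable(1000.0))
```

With half-precision intermediates, `exp(x)` leaves the finite range once `x` is slightly above 11, so either a wider intermediate type or a rearranged formula is needed for larger inputs.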

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
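
The four rounding operators added above differ mainly on negative and half-way inputs. The sketch below is Python, not the shader code, and the helper names are hypothetical. The tie rule for ROUND is an assumption: ties go away from zero here, as C's `roundf` does, while the commit message only says "rounds to nearest integer".

```python
import math

def round_half_away(x: float) -> float:
    # Assumption: ties round away from zero (C roundf behavior).
    # Python's built-in round() instead rounds ties to even.
    return math.copysign(math.floor(abs(x) + 0.5), x)

def rounding_modes(x: float) -> dict:
    # Apply each of the four operators to the same input.
    return {
        "floor": float(math.floor(x)),  # toward negative infinity
        "ceil": float(math.ceil(x)),    # toward positive infinity
        "trunc": float(math.trunc(x)),  # toward zero
        "round": round_half_away(x),    # to nearest, ties per the assumption above
    }

for value in (-2.5, 2.5):
    print(value, rounding_modes(value))
```

For negative inputs, TRUNC matches CEIL rather than FLOOR, which is the case most likely to expose a mismatch between backends.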

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Add argmax

* Add argmax,cumsum,sum,sum_rows

* Add necessary CPY/GET_ROWS operators

* Support for argsort using multi-pass strategy

* Update set_rows for i32 indices, move to pre-wgsl

* Port unary operators to pre-wgsl and support FILL

* Implement PAD

* Add support for top-k

* clean up, scope pipeline init mutex

* fix newline

* Add support for log

* Update LOG for better precision, and ops doc

---------

Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>
2026-01-16 16:12:43 -08:00
Mustafa Cavus aa4bc90030 Syntax correction for workflows build file 2026-01-16 13:06:43 -08:00
Thore Koritzius 388ce82241
ggml : extend ggml_pool_1d + metal (#16429)
* chore: resolve conflicts

* feat: ggml metal impl

* fix: ggml_metal_kargs_pool_1d struct

* fix: require contiguous input

* chore: test pool_1d

* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts

* chore: add p0 and s0 to testing

* fix: allow padding for cpu and metal

* Update ggml/src/ggml-metal/ggml-metal.metal

* fix: correct single-threaded loop

* ggml : cleanup

* tests : add ne[1] != 1 tests

* fix: ne[1] handling in np

* cont : fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 16:59:56 +02:00
hipudding 6ba6a3c76f
docs : update ops.md for CANN backend (#18654) 2026-01-16 13:32:17 +01:00
Perry Naseck 0802d4cfb3
ggml-blas: hide warnings from included BLAS headers (#18818)
* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set

* ggml-blas: hide warnings from included BLAS headers
2026-01-16 13:38:25 +02:00
Tarek Dakhran c945aaaef2
mtmd : Fix ASR for LFM2.5-Audio-1.5B (#18876) 2026-01-16 11:23:08 +01:00
Xuan-Son Nguyen c15395f73c
common : implement new jinja template engine (#18462)
* jinja vm

* lexer

* add vm types

* demo

* clean up

* parser ok

* binary_expression::execute

* shadow naming

* bin ops works!

* fix map object

* add string builtins

* add more builtins

* wip

* use mk_val

* eval with is_user_input

* render gemma tmpl ok

* track input string even after transformations

* support bound functions

* keyword arguments and slicing array

* use shared_ptr for values

* add mk_stmt

* allow print source on exception

* fix negate test

* testing more templates

* mostly works

* add filter_statement

* allow func to access ctx

* add jinja-value.cpp

* impl global_from_json

* a lot of fixes

* more tests

* more fix, more tests

* more fixes

* rm workarounds

* demo: type inference

* add placeholder for tojson

* improve function args handling

* rm type inference

* no more std::regex

* trailing spaces

* make testing more flexible

* make output a bit cleaner

* (wip) redirect minja calls

* test: add --output

* fix crash on macro kwargs

* add minimal caps system

* add some workarounds

* rm caps_apply_workarounds

* get rid of preprocessing

* more fixes

* fix test-chat-template

* move test-chat-jinja into test-chat-template

* rm test-chat-jinja from cmake

* test-chat-template: use common

* fix build

* fix build (2)

* rename vm --> interpreter

* improve error reporting

* correct lstrip behavior

* add tojson

* more fixes

* disable tests for COMMON_CHAT_FORMAT_GENERIC

* make sure tojson output correct order

* add object.length

* fully functional selectattr / rejectattr

* improve error reporting

* more builtins added, more fixes

* create jinja rendering tests

* fix testing.h path

* adjust whitespace rules

* more fixes

* temporary disable test for ibm-granite

* r/lstrip behavior matched with hf.js

* minimax, glm4.5 ok

* add append and pop

* kimi-k2 ok

* test-chat passed

* fix lstrip_block

* add more jinja tests

* cast to unsigned char

* allow dict key to be numeric

* nemotron: rm windows newline

* tests ok

* fix test

* rename interpreter --> runtime

* fix build

* add more checks

* bring back generic format support

* fix Apertus

* [json.exception.out_of_range.403] key 'content' not found

* rm generic test

* refactor input marking

* add docs

* fix windows build

* clarify error message

* improved tests

* split/rsplit with maxsplit

* non-inverse maxsplit

forgot to change after simplifying

* implement separators for tojson and fix indent

* i like to move it move it

* rename null --> none

* token::eof

* some nits + comments

* add exception classes for lexer and parser

* null -> none

* rename global -> env

* rm minja

* update docs

* docs: add input marking caveats

* implement missing jinja-tests functions

* oops

* support trim filter with args, remove bogus to_json reference

* numerous argument fixes

* updated tests

* implement optional strip chars parameter

* use new chars parameter

* float filter also has default

* always leave at least one decimal in float string

* jinja : static analysis + header cleanup + minor fixes

* add fuzz test

* add string.cpp

* fix chat_template_kwargs

* nits

* fix build

* revert

* unrevert

sorry :)

* add fuzz func_args, refactor to be safer

* fix array.map()

* loosen ensure_vals max count condition, add not impl for map(int)

* hopefully fix windows

* check if empty first

* normalize newlines

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 11:22:06 +01:00
Julius Tischbein aa1dc3770a
Setting mmap and direct_io to false as default in llama-bench.cpp (#18841) 2026-01-16 09:46:51 +01:00
Raul Torres 4ea2eaac01
CANN: Remove unused `ggml_cann_get_device` function (#18625) 2026-01-16 16:34:09 +08:00
Chenguang Li e20fa27a02
CANN: fix an issue where get_env was not fully renamed (#18796)
* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph
2026-01-16 16:24:04 +08:00
hipudding baa4ba0aec
CANN: support gated linear attn (#18653)
* CANN: support gated linear attn

This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased to
merge.

Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: optimize OP gla

Optimize gla for high performance

* Remove unused comments

---------

Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain>
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
2026-01-16 16:18:49 +08:00
Mustafa Cavus d7dccf887b kq_mask naming fix 2026-01-15 14:38:53 -08:00
Yamini Nimmagadda d3649c11cb Update OPENVINO.md 2026-01-15 11:39:08 -08:00
Yamini Nimmagadda e9ed5c4cb6 Update OPENVINO.md 2026-01-15 11:39:08 -08:00
Yamini Nimmagadda f44c60e995 Update OPENVINO.md 2026-01-15 11:39:08 -08:00
Yamini Nimmagadda 63eed0d9f3 Update build.md 2026-01-15 11:39:08 -08:00
Yamini Nimmagadda 61552e4450 Update OPENVINO.md 2026-01-15 11:39:08 -08:00
Yamini Nimmagadda 9ba324726a Update OPENVINO.md 2026-01-15 11:39:08 -08:00
Yamini Nimmagadda 25e652569b Update OPENVINO.md 2026-01-15 11:39:08 -08:00
Yamini Nimmagadda 416556a87d Create OPENVINO.md in llama.cpp backend docs 2026-01-15 11:39:08 -08:00
Mustafa Cavus 599335c633 Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp 2026-01-15 11:39:08 -08:00
Mustafa Cavus a92eceecd9 Update ggml/src/ggml-openvino/ggml-decoder.cpp 2026-01-15 11:39:08 -08:00
Mustafa Cavus a81b202f57 requant to f16 for Q6 embed on NPU 2026-01-15 11:39:08 -08:00
Mustafa Cavus a40a5dfc60 npu perf fix 2026-01-15 11:39:08 -08:00
Mustafa Cavus 981ec6571d code cleanup 2026-01-15 11:39:08 -08:00
Mustafa Cavus d2fc15226b Update ggml/src/ggml-openvino/ggml-decoder.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
2026-01-15 11:39:08 -08:00
Mustafa Cavus 5f30eacdb4 Initial stateful graph support 2026-01-15 11:39:08 -08:00
Yu, Zijun 0d6f253e48 Support -ctk f32 2026-01-15 11:39:08 -08:00
Yu, Zijun f5c71e3cf4 Update build.md 2026-01-15 11:39:08 -08:00
Yu, Zijun 4e451778d3 Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant 2026-01-15 11:39:08 -08:00
Yu, Zijun 67c9720e49 Optimize symmetric quant weight extraction: use single zp 2026-01-15 11:39:08 -08:00
Yu, Zijun c1142ddb7c NPU always requant to q4_0_128 2026-01-15 11:39:08 -08:00
Yu, Zijun 52a44012c0 Update build.md to include OpenCL 2026-01-15 11:39:08 -08:00
Yu, Zijun cfc471353d FIX: use remote tensor from singleton 2026-01-15 11:39:08 -08:00
Yu, Zijun a356b44477 only use remote tensor for kvcache for GPU 2026-01-15 11:39:08 -08:00
Yu, Zijun 88d1d17eac only use remote tensor for kvcache 2026-01-15 11:39:08 -08:00
Yu, Zijun 8273a7c2f4 Use ggml_aligned_malloc 2026-01-15 11:39:08 -08:00
Yu, Zijun d757849741 Put kvcache on GPU 2026-01-15 11:39:08 -08:00
Yu, Zijun 3fdcb6ab72 Add ov_backend_host_buffer; Use cached remote context 2026-01-15 11:39:08 -08:00
Yu, Zijun 72bba828df Use shared_buffer for GPU NPU; Refactor 2026-01-15 11:39:08 -08:00
Yu, Zijun 22d9c17a6f backend buffer: allocate on host 2026-01-15 11:39:08 -08:00
Arshath ae5336386f Update build.md for Windows 2026-01-15 11:39:08 -08:00
Yu, Zijun 0ef2e5e4d4 Fix decoder can_reuse for llama-bench 2026-01-15 11:39:08 -08:00
Xuejun Zhai 9e3163e846 Remove unused variable nodes 2026-01-15 11:39:08 -08:00
Yu, Zijun c9234b44cc NPU fix q4 perf regression 2026-01-15 11:39:08 -08:00
Yu, Zijun ae01322dbd NPU fix wrong model output shape 2026-01-15 11:39:08 -08:00
Yu, Zijun 469325c6da GPU remove Q6_K requantization 2026-01-15 11:39:08 -08:00
Yu, Zijun 28da9a9adc Reuse cached decoder 2026-01-15 11:39:08 -08:00
Xuejun Zhai 91a1b20c82 Fix error for decoder cache 2026-01-15 11:39:08 -08:00