Commit Graph

1947 Commits

Author SHA1 Message Date
Ruben Ortlam f6b533d898
Vulkan Flash Attention Coopmat1 Refactor (#19075)
* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-28 18:52:45 +01:00
Patryk Kaminski 0cd7032ca4
ggml-sycl: remove unused syclcompat header (#19140)
The syclcompat/math.hpp is not used anymore. The change that intrduced it was successfuly reverted (https://github.com/ggml-org/llama.cpp/pull/17826).
This include path will become obsolete and dropped in oneAPI 2026.0 effectively breaking ggml-sycl builds.
2026-01-28 23:33:54 +08:00
Oleksandr Kuvshynov 88d23ad515
vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058)
Deduplication here relied on the fact that vulkan would return unique
UUID for different physical GPUs. It is at the moment not always the case.
On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total),
MotlenVK would assign same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of 4 GPUs in such configuration.

The deduplication logic here is changed to only filter GPUs if UUID is
same but driver is different.
2026-01-28 12:35:54 +01:00
Kevin Pouget b7feacf7f3
ggml: new backend for Virglrenderer API Remoting acceleration (v2) (#18718) 2026-01-28 17:49:40 +08:00
Alberto Cabrera Pérez 6ad70c5a77
ggml-cpu: arm64: Q4_K scale unroll and vectorization (#19108) 2026-01-28 09:15:56 +02:00
Georgi Gerganov 631cbfcc7a
cuda : fix "V is K view" check for non-unified KV cache (#19145) 2026-01-28 09:15:27 +02:00
Georgi Gerganov 2eee6c866c
CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (#19142) 2026-01-28 09:15:11 +02:00
Nikhil Jain 06961e2876
ggml webgpu: Split shared state (webgpu_context) into global state and per-thread state (#18976)
* Squashed commit of the following:

commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755
Author: Abhijit Ramesh <abhijitramesh2k@gmail.com>
Date:   Mon Dec 1 18:29:00 2025 -0800

    ggml webgpu: fix xielu parameter passing (#11)

    The XIELU operation was incorrectly using static_cast to convert
    float parameters to uint32_t, which converted numeric values instead
    of preserving IEEE 754 bit patterns. This caused incorrect values
    to be interpreted by the GPU shader.

    * Use reinterpret_cast to preserve float bit patterns when passing
      through uint32_t params buffer
    * Update WGSL shader parameter types from u32 to f32
    * Re-enable XIELU support (was disabled due to numerical issues)

    Fixes NMSE test failures for XIELU operation on WebGPU backend.

commit 5ca9b5e49e
Author: neha-ha <137219201+neha-ha@users.noreply.github.com>
Date:   Tue Nov 18 12:17:00 2025 -0800

    Refactored pipelines and workgroup calculations (#10)

    * refactored pipelines

    * refactored workgroup calculation

    * removed commented out block of prior maps

    * Clean up ceiling division pattern

    ---------

    Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
    Co-authored-by: Reese Levine <reeselevine1@gmail.com>

Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:13:06 2025 -0700

    formatted embed wgsl and ggml-webgpu.cpp

commit e1f6baea31
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:08:37 2025 -0700

    implemented REPL_Template support and removed bug in unary operators kernel

commit 8c70b8fece
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 15 16:14:20 2025 -0700

    responded and dealt with PR comments

commit f9282c660c
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:41:41 2025 -0700

    removed unnecesarry checking if node->src[1] exists for unary operators

commit 4cf28d7dec
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:32:45 2025 -0700

    All operators (inlcluding xielu) working

commit 74c6add176
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:16:48 2025 -0700

    fixed autoconfig

commit 362749910b
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:10:46 2025 -0700

    removed vestigial files

commit cb08583337
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:59:32 2025 -0700

    abides by editor-config

commit 5360e2852a
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:45:57 2025 -0700

    rms_norm double declaration bug atoned

commit 7b09baa4aa
Merge: 8a6ec843 74b8fc17
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 11:50:03 2025 -0700

    resolving merge conflicts

commit 8a6ec843a5
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 8 18:06:47 2025 -0700

    unary operators pass ggml tests

commit c3ae38278a
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 1 16:22:40 2025 -0700

    neg passes backend test

commit aa1c9b2f88
Author: James Contini <jamescontini@gmail.com>
Date:   Tue Sep 30 23:55:27 2025 -0700

    neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though

Co-authored-by: James Contini <jamescontini@gmail.com>
Co-authored-by: Neha Abbas <neabbas@ucsc.edu>
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>

* Remove extra code and format

* Add ops documentation (finally)

* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context

* Small cleanup

* Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state.

* Cleanups

* More cleanup

* Move staging_buf mutex to global context

* Resolve merge

* Resolve merge

* Resolve merge

* Clean up merge errors, delete forward declaration, and run clang-format

* Rename device_init to backend_init

* Move webgpu_context to backend_context

* Move buffer context members into global context and refactor function calls

* Run clang-format

* Remove commends

* Move parameter buffers to per-thread, add single memset_tensor param buf

* Fix CI compilation issue

* Fix builds for emscripten not supporting subgroups

* cleanup

* cleanup

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-01-27 20:53:36 -08:00
Vishal Singh f2571df8b7
ggml-zendnn : update ZenDNN git tag to main branch (#19133) 2026-01-28 06:21:36 +08:00
Johannes Gäßler a5bb8ba4c5
CUDA: tune GLM 4.7 Flash FA kernel selection logic (#19097) 2026-01-27 14:28:56 +01:00
Alberto Cabrera Pérez be8890e721
ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (#18888)
* Boilerplate for q6_K repack

* q6_K repack to q6_Kx8 implementation

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q6_K generic gemv and gemm

* wip, gemm_q6_K 8x8

* Still WIP: loading of q8s, q6h and q6l

* first working version of q6_K gemm

* Moved q6 loads outside of sb block, Unrolled inner loop

* Replaced modulo with mask

* First implementation of GEMV

* ggml_vdotq_s32 -> vdotq_s32

* Reduce width of accumulators in q6_K gemv

* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.

* Reuse scales in GEMM (same GEMV opt)

* Added todos for bsum and different qh repack

* Arch fallback

* VSLIQ for merging qh adn ql

* Removed TODO, already tested

* Apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Removed unused import

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-27 11:08:10 +02:00
Gaurav Garg a83c73a18a
[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042)
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full

With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.

* Set the env variable in the CUDA backend registry allocation

* Add link to PR in code comment

* Remove warning logs and update documentation
2026-01-27 08:52:44 +02:00
shalinib-ibm 7afdfc9b84
ggml-cpu: Enable FP16 MMA kernels on PPC (#19060) 2026-01-27 11:52:34 +08:00
lhez 94eeb5967c
opencl: add flattened q6_K mv (#19054)
* opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat`

* opencl: clean up

* opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat`

* opencl: tweak the workgroup size a bit

* opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat`

* opencl: proper alignment for q6_K

* opencl: boundary handling for flattened q6_K mv

* opencl: rename q6_K mv kernel file

* opencl: put flattened q6_K mv in its own file

* opencl: use lower k in file name

* opencl: use K in variable names
2026-01-26 19:36:24 -08:00
Johannes Gäßler b0311c16d2
CUDA: fix padding of GQA to power of 2 in FA (#19115) 2026-01-26 23:24:58 +01:00
Johannes Gäßler 0c21677e43
CUDA: faster FA for GQA > 1 but not power of 2 (#19092) 2026-01-25 21:19:47 +01:00
ccbinn 0440bfd160
metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (#19088)
Co-authored-by: chenbin11 <chenbin11@kuaishou.com>
2026-01-25 20:07:19 +02:00
Aman Gupta bcb43163ae
ggml-cpu: Use tiled FA for prompt-processing (#19012)
* ggml-cpu: Use tiled FA for prompt-processing

the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on a AMD EPYC single-socket 64-c machine.

* fix out of bounds for mask

* skip rows where there are all masks

* skip tile if mask is inf

* store mask in worksize

* check inf tile earlier
2026-01-25 23:25:58 +08:00
Georgi Gerganov d9c6ce46f7
kv-cache : support V-less cache (#19067)
* kv-cache : support V-less cache

* cuda : better check for V_is_K_view

* cuda : improve V_is_K_view check

* graph : add comments

* hparams : refactor
2026-01-25 15:48:56 +02:00
Johannes Gäßler 4e5b83b226
GGUF: check that tensor size is representable (#19072) 2026-01-24 21:57:51 +01:00
Johannes Gäßler 8f91ca54ec
CUDA: re-use MLA K data for V in MMA FA (#19057) 2026-01-24 10:09:36 +01:00
Aman Gupta 81ab64f3c8
ggml-cuda: enable cuda-graphs for `n-cpu-moe` (#18934)
* ggml-cuda: add split-wise cuda graph

* add n-cpu-moe compare_llama_bench.py

* fix hip/musa builds
2026-01-24 14:25:20 +08:00
nullname 8af1f5f430
ggml-hexagon: flash-attn opt (#19025)
* optimize flash attention kernel by improving score computation and online softmax update

* wip

* Refactor online softmax update in flash attention kernel for improved performance

* Optimize flash attention kernel by replacing float array with HVX_Vector for score computation

* wip
2026-01-23 22:02:07 -08:00
Neo Zhang cb6caca191
[SYCL] use malloc to support both iGPU and dGPU in same time (#18992)
* use malloc to support both iGPU and dGPU in same time

* support windows

---------

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-23 20:54:10 +08:00
Alberto Cabrera Pérez 091a46cb8d
ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (#18860)
* Boilerplate for q5_Kx8 REPACK on ARM and fallback

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q5_K repack gemm and gemv generics

* Gemm and Gemv ARM implementations (i8mm)

* Improved qh manipulation looking at non-repack vec_dot implementation

* Full unroll

* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix wrong fallback definitions of Q5_K

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed comments. Reverted unnecessary formatting

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed typo in generic definitions

* Switching AND + Shift with Shift Insert. Better op interleaving.

* Vectorize + unroll the block scales

* Apply gemm optimizations to gemv

* Improve bias calculation

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
2026-01-23 09:55:08 +02:00
Georgi Gerganov a5eaa1d6a3
mla : make the V tensor a view of K (#18986)
* mla : pass V as a view of K to the FA op

* cuda : adjust mla logic to new layout

* kv-cache : fix rope shift

* tests : remove comment

* cuda : fix reusable_cutoff

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-22 22:09:01 +02:00
Johannes Gäßler e2baf02162
CUDA: fix alignment check for FA (#19023) 2026-01-22 20:39:25 +01:00
lhez 9c96465f99
opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970)
* opencl: add `copy_to_contiguous` and utilize mm kernels

* opencl: only copy to cont for f32 and f16 tensors

* opencl: use cont mm for fallback when dst is large

* opencl: use nb local to copy-to-cont

* opencl: use local offset as well
2026-01-22 10:29:25 -08:00
Aman Gupta b70d251076
CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953) 2026-01-22 18:51:53 +08:00
shaofeiqi 5516b9c16a
opencl: add TRI op support (#18979) 2026-01-21 22:05:54 -08:00
Aleksei Nikiforov 94242a62c0
ggml-zdnn : mark zDNN buffers as non-host (#18967)
While buffers reside in host memory,
additional transformation is needed to use buffers with zDNN.

Fixes #18848
2026-01-22 01:16:21 +01:00
Jeff Bolz bd544c94a3
vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger
2026-01-21 18:01:40 +01:00
Jeff Bolz 33f890e579
vulkan: support flash attention GQA/split_k with small batches (#18938) 2026-01-21 17:43:43 +01:00
Masato Nakasaka 067b8d7af3
Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (#18831)
This reverts commit 980b7cd17e.
2026-01-21 17:13:43 +01:00
Jeff Bolz 50b7f076a5
vulkan: Use mul_mat_vec_id for small values of n (#18918)
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
2026-01-21 16:22:02 +01:00
Matthieu Coudron 37c35f0e1c
gguf: display strerrno when cant load a model (#18884)
I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'

and I was sure it could access the file. Seems like --models-dir and
--models-presets dont interact like I thought they would but I salvaged
this snippet that helps troubleshooting
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
2026-01-21 08:52:46 +02:00
Oliver Simons 5bd341c9a1
CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964)
* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator

Strided iterator was added in [CCCL
3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into
[CTK
13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5)

* Unindent as per code review request
2026-01-21 02:34:29 +01:00
Oliver Simons d1e3556481
CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930)
* CUDA: Replace `init_offsets` with iterators in argsort

This is a QOL improvement, saving us the cost of materializing the
iterator

* Remove unnecessary include from top-k.cu
2026-01-20 20:11:01 +08:00
Adrien Gallouët 08f3f4a8a3
ggml : cleanup path_str() (#18928)
- Remove pragmas as `std::codecvt_utf8` is not used.
- Avoid implicit `strlen()`.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-20 11:42:49 +01:00
Georgi Gerganov 271191906c
metal : enable FA for MLA heads (#18950) 2026-01-20 12:21:28 +02:00
Georgi Gerganov 365a3e8c31
ggml : add ggml_build_forward_select (#18550)
* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment
2026-01-19 20:03:19 +02:00
lhez d1b4757ded
opencl: fix q6_K mv for m=1 (#18893) 2026-01-17 13:50:32 -08:00
Reese Levine a89002f07b
ggml webgpu: support for backend sampling (#18880)
* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Add argmax

* Add argmax,cumsum,sum,sum_rows

* Add necessary CPY/GET_ROWS operators

* Support for argsort using multi-pass strategy

* Update set_rows for i32 indices, move to pre-wgsl

* Port unary operators to pre-wgsl and support FILL

* Implement PAD

* Add support for top-k

* clean up, scope pipeline init mutex

* fix newline

* Add support for log

* Update LOG for better precision, and ops doc

---------

Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>
2026-01-16 16:12:43 -08:00
Thore Koritzius 388ce82241
ggml : extend ggml_pool_1d + metal (#16429)
* chore: resolve conflicts

* feat: ggml metal impl

* fix: ggml_metal_kargs_pool_1d struct

* fix: require contiguous input

* chore: test pool_1d

* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts

* chore: add p0 and s0 to testing

* fix: allow padding for cpu and metal

* Update ggml/src/ggml-metal/ggml-metal.metal

* fix: correct single-threaded loop

* ggml : cleanup

* tests : add ne[1] != 1 tests

* fix: ne[1] handling in np

* cont : fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 16:59:56 +02:00
Perry Naseck 0802d4cfb3
ggml-blas: hide warnings from included BLAS headers (#18818)
* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set

* ggml-blas: hide warnings from included BLAS headers
2026-01-16 13:38:25 +02:00
Raul Torres 4ea2eaac01
CANN: Remove unused `ggml_cann_get_device` function (#18625) 2026-01-16 16:34:09 +08:00
Chenguang Li e20fa27a02
CANN: fix an issue where get_env was not fully renamed (#18796)
* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph
2026-01-16 16:24:04 +08:00
hipudding baa4ba0aec
CANN: support gated linear attn (#18653)
* CANN: support gated linear attn

This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased to
merge.

Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: optimize OP gla

Optimize gla for high preformance

* Remove unused comments

---------

Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain>
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
2026-01-16 16:18:49 +08:00
shaofeiqi 785a710085
OpenCL: add SOLVE_TRI op support (#18846) 2026-01-15 11:17:17 -08:00
Georgi Gerganov 6e7fc8a146
cuda : print less debug logs when disabling cuda graphs (#18868) 2026-01-15 20:53:01 +02:00