Compare commits

...

68 Commits

Author SHA1 Message Date
Daniel Bevenius 20ca2e12c4
model-conversion : remove -c 0 from model card template [no ci] (#18807)
This commit removes the `-c, --ctx-size N` option from the llama-server
command in the model card template for causal models.

The motivation for this is that -c 0 is the default and specifying it
is redundant.
2026-01-13 14:13:10 +01:00
yulo ea4a321f2a
HIP: add fattn-mma-f16 for RDNA4 (#18481)
* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update as per the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-13 13:52:16 +01:00
Johannes Gäßler c1e79e610f
doc: ban AI-generated PR descriptions [no ci] (#18765) 2026-01-13 13:43:12 +01:00
Xuan-Son Nguyen e047f9ee9d
mtmd: fix use_non_causal being reported incorrectly (#18793)
* mtmd: fix use_non_causal being reported incorrectly

* move clip_is_mrope to mtmd_decode_use_mrope

* fix sloppy code ggml_cpy
2026-01-13 12:19:38 +01:00
Georgi Gerganov 0a57271ab6
CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (#18800) 2026-01-13 12:25:53 +02:00
Gabe Goodhart 076b0faf7d
graph : clean up t5 input builders (#18795)
* fix: Remove unnecessary `h` loops where `h` was only ever 0

Branch: CleanUpT5InputBuilders

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary padding loop that is never hit anymore

The upper bound used to be GGML_PAD(n_tokens, GGML_KQ_MASK_PAD), but it was
removed in https://github.com/ggml-org/llama.cpp/pull/17910, leaving the
loop dead.

Branch: CleanUpT5InputBuilders

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2026-01-13 09:43:51 +01:00
Ruben Ortlam db79dc06b1
llama-bench: add direct_io parameter (#18778) 2026-01-13 08:49:10 +01:00
Adrien Gallouët 537d4240d4
ci : remove libcurl in releases (#18775)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-12 21:43:02 +01:00
Radoslav Gerganov bcf7546160
server : add arg for disabling prompt caching (#18776)
* server : add arg for disabling prompt caching

Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.

* address review comments

* address review comments
2026-01-12 19:21:34 +02:00
Adrien Gallouët 36c5913c45
ci : use openssl for openEuler-latest-cmake-cann (#18779)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-12 17:29:00 +01:00
Adrien Gallouët 8e649571cd
vendor : update cpp-httplib to 0.30.1 (#18771)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-12 15:58:52 +01:00
Daniel Bevenius 4150da9a95
examples : add --kv-unified to batched example (#18774)
This commit adds the --kv-unified flag to the batched example. The README.md
currently lists this flag as required, but it has not been available as a
command line option for the batched example.

The motivation for this is that specifying the flag as the README instructs
leads to an error about the flag not being recognized, while without this
option the example fails with the following error:
```console
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
2026-01-12 13:47:58 +01:00
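As a quick illustration of the fix above, the batched example can now be invoked the way the README describes; the model path, prompt, and sequence count below are only placeholders:

```console
llama-batched -m ./models/model.gguf -p "Hello my name is" -np 4 --kv-unified
```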
Jeff Bolz 8e2da778da
vulkan: change memory_logger to be controlled by an env var (#18769) 2026-01-12 13:32:55 +01:00
Xuan-Son Nguyen ce3bf9b1a4
server: update docs for sleeping [no ci] (#18777) 2026-01-12 13:01:24 +01:00
Jeff Bolz 2bbe4c2cf8
vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2); for mul_mm the limit is likely around
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-12 12:32:13 +01:00
Ruben Ortlam 1051ecd289
vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-12 07:29:35 +01:00
Xuan-Son Nguyen 0c3b7a9efe
model: fix qwen3next broken due to #18683 (#18762) 2026-01-11 21:00:10 +01:00
Ruben Ortlam 0e76501e1d
Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-11 17:33:33 +01:00
Xuan-Son Nguyen 4b060bf240
security: make it clear about subtopics in server (#18754)
* security: make it clear about subtopics in server

* exclude DoS
2026-01-11 16:51:03 +01:00
Daniel Bevenius 9789e28459
debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check (#18692)
* debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check

This commit updates the pooling check in the debug example to
also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just
LLAMA_POOLING_TYPE_NONE.

* debug : normalize both pooled and token embeddings

This commit updates debug.cpp to normalize embeddings for both pooled
and non-pooled outputs. For pooled embeddings, normalization is applied
to the single vector, and for non-pooled embeddings, normalization is
applied to each token embedding vector individually.

The motivation for this is to enable non-pooled embeddings to be
normalized, which was not possible previously.
2026-01-11 16:34:41 +01:00
Georgi Gerganov 84ae04f163
tests : refactor test-backend-sampler (#18753)
* tests : use "auto", use std::string

* tests : refactor test-backend-sampler.cpp

* cmake : remove redundant declarations

* ci : use smaller model

* tests : add struct test_params

* tests : reduce logit bias 100.0f -> 10.0f
2026-01-11 17:31:03 +02:00
Xuan-Son Nguyen 506bb6e010
model: try to improve Qwen3 Next (#18683)
* qwen3next: simplify qkvz projection

* use ggml_swiglu_split

* revert swiglu_split, but remove redundant repeat()

* fix missing reshape

* rm 2 redundant transposes

* move mul_mat(k,q) to outside of chunking

* rm redundant cont

* improve g_cs_chunk

* add comments about no cont

* use std::pair instead of ggml_concat

* vectorize key_gdiff calculation

* rm unused tensor

* avoid ggml_concat inside loop

* bring back ggml_concat as it may not work on other backend

* nits
2026-01-11 12:53:33 +01:00
thom-dev-fr 79456a690a
readme : update UIs (#18751) 2026-01-11 13:46:50 +02:00
Xuan-Son Nguyen 28068af789
security: narrow down the scope of what we consider a vulnerability (#18752)
* security: narrow down the scope of what we consider a vulnerability

* fix typo
2026-01-11 12:23:36 +01:00
shaofeiqi 707cbafcaa
opencl: add SOFTPLUS op support (#18726) 2026-01-10 21:57:44 -08:00
Aman Gupta b137718878
test-backend-ops: fix mxfp4 tests on blackwell (#18736) 2026-01-11 01:12:57 +08:00
Johannes Gäßler d2ff4e23ac
HIP: adjust RDNA3.5 MMQ kernel selection logic (#18666) 2026-01-10 17:19:01 +01:00
Perry Naseck 657a2e644b
cmake : update blas logic (#18205) 2026-01-10 18:00:54 +02:00
Georgi Gerganov f307926482
server : adjust unified KV cache tests (#18716) 2026-01-10 17:51:56 +02:00
Sigbjørn Skjæret 7fdc8c893d
scripts : follow api redirects in pr2wt.sh (#18739) 2026-01-10 16:04:05 +01:00
Xuan-Son Nguyen 23f82f2420
preset: allow named remote preset (#18728)
* preset: allow named remote preset

* nits: fix docs

* cont docs
2026-01-10 15:12:29 +01:00
Aaron Teo 2656c0d265
docs(ggml): update backend ops (#18734)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-01-10 18:48:17 +08:00
Michael Wand 600a366478
Corrected: changed s13 = src1->nb[3] instead of nb[2] (#18724) 2026-01-10 10:16:07 +01:00
Adrien Gallouët ea23c15990
common : add --license to display embedded licenses (#18696)
This commit introduces a mechanism to embed all licenses directly
into the compiled binaries.

This eliminates the need to distribute separate LICENSE files alongside
the executable, making the binaries self-contained and simplifying
deployment.
2026-01-10 09:46:24 +01:00
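Assuming a build that includes this change, the embedded license text can be printed from any binary that uses the common argument parser, for example:

```console
llama-cli --license
```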
Xuan-Son Nguyen 9ac2693a30
server: fix n_cmpl not skipping processing prompt (#18663)
* server: fix n_cmpl not skipping processing

* fix infinite loop on empty batch

* cont : init child samplers + modify child logic

* cont : cleanup

* cont : improve n_cmpl logic

- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache

* cont : remove redundant function

* cont : reduce parent checks

* fix : nullptr task dereference

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-10 00:00:41 +01:00
Simranjeet Singh a61c8bc3bf
mtmd: Add Gemma3n multimodal support with MobileNetV5 vision encoder (#18256)
* Add Gemma3nVisionModel - MobileNetV5 vision encoder converter to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.

* Add mobilenetv5 impl

* Fix comments, remove unused vars

* Fix permute and remove transpose of projection weights

* Fix comments, remove debugging prints from hf_to_gguf

* 1. Hard-code image_mean = 0 and image_std = 1
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder

* 1. Move mobilenetv5 helpers declarations to `clip_graph_mobilenetv5` struct and definitions to mobilenetv5.cpp
2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size

* Remove obsolete comments

* - convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest
- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading

* - Rename tensors to v.conv..., v.blk..., v.msfa... to better align with already existing terminology

* Fix stem conv bias name

* Remove explicit handling of bias term for stem conv

* - Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer
- Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable

* clean up conversion script

* fix code style

* also preserve audio tensors

* trailing space

* split arch A and V

* rm unused gemma3 func

* fix alignment

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-01-09 23:42:38 +01:00
shaofeiqi 593da7fa49
opencl: add EXPM1 op (#18704) 2026-01-09 10:13:13 -08:00
Reese Levine 9e41884dce
Updates to webgpu get_memory (#18707) 2026-01-09 08:17:18 -08:00
Pascal ec8fd7876b
Webui/file upload (#18694)
* webui: fix restrictive file type validation

* webui: simplify file processing logic

* chore: update webui build output

* webui: remove file picker extension whitelist (1/2)

* webui: remove file picker extension whitelist (2/2)

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

* fix: update ChatForm storybook test after removing accept attribute

* chore: update webui build output

* refactor: more cleanup

* chore: update webui build output
2026-01-09 16:45:32 +01:00
Asbjørn Olling a180ba78c7
cmake: only build cli when server is enabled (#18670) 2026-01-09 16:43:26 +01:00
Georgi Gerganov 53eb9435da
server : fix timing of prompt/generation (#18713) 2026-01-09 12:59:50 +02:00
Georgi Gerganov d3435efc8a
scripts : pr2wt.sh reset to remote head (#18695)
* scripts : pr2wt.sh reset to remote head

* cont : cleaner

* cont : restore --set-upstream-to
2026-01-09 12:16:40 +02:00
Georgi Gerganov f5f8812f7c
server : use different seeds for child completions (#18700)
* server : use different seeds for child completions

* cont : handle default seed

* cont : note
2026-01-09 09:33:50 +02:00
Xuan-Son Nguyen 8ece3836b4
common: support remote preset (#18520)
* arg: support remote preset

* proof reading

* allow one HF repo to point to multiple HF repos

* docs: mention about multiple GGUF use case

* correct clean_file_name

* download: also return HTTP status code

* fix case with cache file used

* fix --offline option
2026-01-08 22:35:40 +01:00
Aaron Teo 046d5fd44e
llama: use host memory if device reports 0 memory (#18587) 2026-01-09 05:34:56 +08:00
Masashi Yoshimura 480160d472
ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (#18628)
* Fix GGML_MEM_ALIGN to 8 for emscripten.

* Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten
2026-01-08 08:36:42 -08:00
Reese Levine 15bff84bf5
ggml webgpu: initial flashattention implementation (#18610)
* FlashAttention (#13)

* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs, haven't actually run a model that uses neg kernel yet though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (including xielu) working

* removed unnecessary checking if node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and removed bug in unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (#9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (#4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Refactored pipelines and workgroup calculations (#10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in some cases

* Checkpoint

* formatting

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Forgot one parameter unused

* Move some flashattn computation to f32 for correctness
2026-01-08 08:23:39 -08:00
Jeff Bolz 2524c26164
vulkan: fix push constant size for quantize_q8_1 (#18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-08 15:40:58 +01:00
Jeff Bolz cb14b06995
vulkan: optimize ssm_scan (#18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-08 15:16:54 +01:00
Adrien Gallouët 55abc39355
vendor : update cpp-httplib to 0.30.0 (#18660)
* vendor : update cpp-httplib to 0.30.0
* common : allow custom headers when downloading
2026-01-08 13:53:54 +01:00
Georgi Gerganov f2f6c88067
scripts : support chaining commands in pr2wt.sh (#18671) 2026-01-08 13:40:23 +02:00
도로로도로또 945bf10627
metal : add MoE kernel specialization for ne20=5 (#18667)
Add template specialization for kernel_mul_mm_id_map0 with ne20=5
to support models using 5 active experts (e.g., VAETKI).
2026-01-08 12:37:45 +02:00
Johannes Gäßler 64848deb18
llama-fit-params: free memory target per device (#18679) 2026-01-08 10:07:58 +01:00
Doctor Shotgun 9a5724dee2
ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (#18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-08 11:03:21 +02:00
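For illustration, the new environment variable can be set when launching any llama.cpp tool; the value and model path below are placeholders:

```console
GGML_OP_OFFLOAD_MIN_BATCH=64 llama-cli -m ./models/model.gguf -p "Hello"
```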
Daniel Bevenius 9c142e3a2a
model-conversion : add warn about transformers mismatch (#18691)
This commit adds a check comparing the installed transformers library
with the transformers version that the original model supports. This
check will be performed upon a model verification failure and prints a
warning/hint to the user suggesting to install the correct version of
the transformers library.

The motivation for this change is that the model verification can fail due
to differences in the transformers library used, and it might not be obvious
that this is the cause of the failure. With this warning the correct version
can be checked, hopefully saving time troubleshooting the cause of the
verification failure.
2026-01-08 09:29:53 +01:00
Daniel Bevenius df7fb92170
model-conversion : remove -st targets for converted model (#18689)
This commit removes the `-st` make target for running the converted
embedding model.

The motivation for this is that the pooling type is now part of the
.gguf metadata of the model and this is used by llama-debug when running
the model. So there is no need to specify the pooling type separately
any more.

The commit also adds an option to specify the type of normalization
applied to the output embeddings when running the converted model.

The readme documentation has also been updated to reflect these changes.
2026-01-08 09:29:15 +01:00
Julius Tischbein 2038101bd9
llama : add `use_direct_io` flag for model loading (#18166)
* Adding --direct-io flag for model loading

* Fixing read_raw() calls

* Fixing Windows read_raw_at

* Changing type off_t to size_t for Windows and renaming functions

* disable direct io when mmap is explicitly enabled

* Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL

* Fallback to std::fread in case O_DIRECT fails due to bad address

* Windows: remove const keywords and unused functions

* Update src/llama-mmap.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-08 08:35:30 +02:00
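Together with the `-dio`/`--direct-io` option added in `common/arg.cpp` (see the diff further below), a model could then be loaded with direct I/O roughly like this; the model path is a placeholder:

```console
llama-cli -m ./models/model.gguf --direct-io
```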
shaofeiqi 568371a726
opencl: add FILL op support (#18682) 2026-01-07 22:04:50 -08:00
Sigbjørn Skjæret 5b8844ae53
scripts : fix repos cloned with .git extension (#18669) 2026-01-07 22:35:34 +01:00
Sigbjørn Skjæret 7e16fef085
convert : more variants of rope_theta config entries (#18668) 2026-01-07 22:34:51 +01:00
Oliver Walsh f5245b5e4e
cuda : fix build on cuda 12.8 (#18672)
compute121 requires 12.9

Signed-off-by: Oliver Walsh <owalsh@redhat.com>
2026-01-07 22:32:44 +01:00
R ae9f8df778
fix(docker): add missing libglvnd libraries to Vulkan image (#18664)
Add libglvnd0, libgl1, libglx0, libegl1, libgles2 to the Vulkan
Dockerfile base image. These libraries are required by mesa-vulkan-drivers
to properly initialize the Vulkan ICD and detect GPU devices.

Without these libraries, vkEnumeratePhysicalDevices() returns an empty
list, resulting in the "ggml_vulkan: No devices found." error.

Fixes #17761
2026-01-07 16:57:42 +01:00
Adrien Gallouët 56d2fed2b3
tools : remove llama-run (#18661)
* tools : remove llama-run
* Remove licenses/LICENSE-linenoise

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-07 16:18:26 +01:00
Georgi Gerganov 56426673cb
scripts : add pr2wt.sh (#18644)
* scripts : add pr2wt.sh

* script : shebang

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-07 15:16:20 +02:00
Daniel Bevenius bb77764c2d
convert : clarify sentence-transformers-dense-modules help [no ci] (#18662)
* convert : clarify sentence-transformers-dense-modules help [no ci]

This commit updates this option's help message, which currently looks
like this:
```console
  --sentence-transformers-dense-modules
                        Whether to include sentence-transformers dense modules.It can be used for sentence-transformers models, like
                        google/embeddinggemma-300mDefault these modules are not included.
```
2026-01-07 13:18:53 +01:00
Sigbjørn Skjæret 9dfa8ee950
ci : run cann build unconditionally [no ci] (#18659) 2026-01-07 13:07:08 +01:00
Jeff Bolz ca4a8370bc
vulkan: reject ops when a tensor is too large to allocate (#18646) 2026-01-07 12:03:32 +01:00
virajwad 03023296cf
vulkan: Warptile tuning for Intel Xe2/Xe3 (#18178)
* modify warptile tuning for xe3

* intel vendor check w/ coopmat support

* fix back formatting

* fix formatting change 2

* move intel check to chip specific tuning part

* Change to support both windows and linux

* modify m_warptile to l_warptile for intel

* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)

* Code style changes

* Code style changes (2)

* Code style changes (3)
2026-01-07 11:59:47 +01:00
147 changed files with 18554 additions and 9627 deletions

View File

@ -33,6 +33,7 @@ FROM ubuntu:$UBUNTU_VERSION AS base
 RUN apt-get update \
 && apt-get install -y libgomp1 curl libvulkan1 mesa-vulkan-drivers \
+libglvnd0 libgl1 libglx0 libegl1 libgles2 \
 && apt autoremove -y \
 && apt clean -y \
 && rm -rf /tmp/* /var/tmp/* \

View File

@ -1,30 +0,0 @@
name: 'Windows - Setup CURL'
description: 'Composite action, to be reused in other workflow'
inputs:
curl_version:
description: 'CURL version'
required: false
default: '8.6.0_6'
architecture:
description: 'Architecture of the libcurl to download'
required: false
default: 'win64'
outputs:
curl_path:
description: "Path to the downloaded libcurl"
value: ${{ steps.get_libcurl.outputs.curl_path }}
runs:
using: "composite"
steps:
- name: libCURL
id: get_libcurl
shell: powershell
env:
CURL_VERSION: ${{ inputs.curl_version }}
ARCHITECTURE: ${{ inputs.architecture }}
run: |
curl.exe -o $env:RUNNER_TEMP/curl.zip -L "https://curl.se/windows/dl-${env:CURL_VERSION}/curl-${env:CURL_VERSION}-${env:ARCHITECTURE}-mingw.zip"
mkdir $env:RUNNER_TEMP/libcurl
tar.exe -xvf $env:RUNNER_TEMP/curl.zip --strip-components=1 -C $env:RUNNER_TEMP/libcurl
echo "curl_path=$env:RUNNER_TEMP/libcurl" >> $env:GITHUB_OUTPUT

View File

@ -152,13 +152,13 @@ jobs:
DAWN_VERSION="v2.0.0" DAWN_VERSION="v2.0.0"
DAWN_OWNER="reeselevine" DAWN_OWNER="reeselevine"
DAWN_REPO="dawn" DAWN_REPO="dawn"
DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release.zip" DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release"
echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}" echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
curl -L -o artifact.zip \ curl -L -o artifact.zip \
"https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}" "https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
mkdir dawn mkdir dawn
unzip artifact.zip unzip artifact.zip
tar -xvf Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release.tar.gz -C dawn --strip-components=1 tar -xvf ${DAWN_ASSET_NAME}.tar.gz -C dawn --strip-components=1
- name: Build - name: Build
id: cmake_build id: cmake_build
@ -532,13 +532,13 @@ jobs:
DAWN_VERSION="v2.0.0" DAWN_VERSION="v2.0.0"
DAWN_OWNER="reeselevine" DAWN_OWNER="reeselevine"
DAWN_REPO="dawn" DAWN_REPO="dawn"
DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-ubuntu-latest-Release.zip" DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-ubuntu-latest-Release"
echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}" echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
curl -L -o artifact.zip \ curl -L -o artifact.zip \
"https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}" "https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
mkdir dawn mkdir dawn
unzip artifact.zip unzip artifact.zip
tar -xvf Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-ubuntu-latest-Release.tar.gz -C dawn --strip-components=1 tar -xvf ${DAWN_ASSET_NAME}.tar.gz -C dawn --strip-components=1
- name: Build - name: Build
id: cmake_build id: cmake_build
@ -1418,7 +1418,6 @@ jobs:
echo "FIXME: test on devices" echo "FIXME: test on devices"
openEuler-latest-cmake-cann: openEuler-latest-cmake-cann:
if: ${{ github.event_name != 'pull_request' || contains(github.event.pull_request.labels.*.name, 'Ascend NPU') }}
defaults: defaults:
run: run:
shell: bash -el {0} shell: bash -el {0}
@ -1464,12 +1463,14 @@ jobs:
"${{ steps.cann-image.outputs.image }}" \ "${{ steps.cann-image.outputs.image }}" \
bash -lc ' bash -lc '
set -e set -e
yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake libcurl-devel yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake openssl-devel
yum clean all && rm -rf /var/cache/yum yum clean all && rm -rf /var/cache/yum
git config --global --add safe.directory "/workspace" git config --global --add safe.directory "/workspace"
export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH} export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
cmake -S . -B build \ cmake -S . -B build \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE} \ -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
-DLLAMA_CURL=OFF \
-DLLAMA_OPENSSL=ON \
-DGGML_CANN=on \ -DGGML_CANN=on \
-DSOC_TYPE=${SOC_TYPE} -DSOC_TYPE=${SOC_TYPE}
cmake --build build -j $(nproc) cmake --build build -j $(nproc)
@ -1705,6 +1706,34 @@ jobs:
 run: |
 GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-webgpu:
runs-on: [self-hosted, macOS, ARM64]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
- name: Dawn Dependency
id: dawn-depends
run: |
DAWN_VERSION="v2.0.0"
DAWN_OWNER="reeselevine"
DAWN_REPO="dawn"
DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release"
echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
curl -L -o artifact.zip \
"https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
mkdir dawn
unzip artifact.zip
tar -xvf ${DAWN_ASSET_NAME}.tar.gz -C dawn --strip-components=1
- name: Test
id: ggml-ci
run: |
GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
 ggml-ci-mac-vulkan:
 runs-on: [self-hosted, macOS, ARM64]

View File

@ -37,13 +37,6 @@ jobs:
 key: macOS-latest-cmake-arm64
 evict-old-files: 1d
-- name: Dependencies
-id: depends
-continue-on-error: true
-run: |
-brew update
-brew install curl
 - name: Build
 id: cmake_build
 run: |
@ -52,6 +45,8 @@ jobs:
 -DCMAKE_INSTALL_RPATH='@loader_path' \
 -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
 -DLLAMA_FATAL_WARNINGS=ON \
+-DLLAMA_CURL=OFF \
+-DLLAMA_BUILD_BORINGSSL=ON \
 -DGGML_METAL_USE_BF16=ON \
 -DGGML_METAL_EMBED_LIBRARY=ON \
 -DGGML_RPC=ON \
@ -90,13 +85,6 @@ jobs:
 key: macOS-latest-cmake-x64
 evict-old-files: 1d
-- name: Dependencies
-id: depends
-continue-on-error: true
-run: |
-brew update
-brew install curl
 - name: Build
 id: cmake_build
 run: |
@ -107,6 +95,8 @@ jobs:
 -DCMAKE_INSTALL_RPATH='@loader_path' \
 -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
 -DLLAMA_FATAL_WARNINGS=ON \
+-DLLAMA_CURL=OFF \
+-DLLAMA_BUILD_BORINGSSL=ON \
 -DGGML_METAL=OFF \
 -DGGML_RPC=ON \
 -DCMAKE_OSX_DEPLOYMENT_TARGET=13.3
@ -159,7 +149,7 @@ jobs:
 id: depends
 run: |
 sudo apt-get update
-sudo apt-get install build-essential libcurl4-openssl-dev
+sudo apt-get install build-essential libssl-dev
 - name: Build
 id: cmake_build
@ -171,6 +161,8 @@ jobs:
 -DGGML_NATIVE=OFF \
 -DGGML_CPU_ALL_VARIANTS=ON \
 -DLLAMA_FATAL_WARNINGS=ON \
+-DLLAMA_CURL=OFF \
+-DLLAMA_OPENSSL=ON \
 ${{ env.CMAKE_ARGS }}
 cmake --build build --config Release -j $(nproc)
@ -212,7 +204,7 @@ jobs:
 wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo apt-key add -
 sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
 sudo apt-get update -y
-sudo apt-get install -y build-essential mesa-vulkan-drivers vulkan-sdk libcurl4-openssl-dev
+sudo apt-get install -y build-essential mesa-vulkan-drivers vulkan-sdk libssl-dev
 - name: Build
 id: cmake_build
@ -220,6 +212,8 @@ jobs:
 cmake -B build \
 -DCMAKE_INSTALL_RPATH='$ORIGIN' \
 -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
+-DLLAMA_CURL=OFF \
+-DLLAMA_OPENSSL=ON \
 -DGGML_BACKEND_DL=ON \
 -DGGML_NATIVE=OFF \
 -DGGML_CPU_ALL_VARIANTS=ON \
@ -269,34 +263,24 @@ jobs:
 run: |
 choco install ninja
-- name: libCURL
-id: get_libcurl
-uses: ./.github/actions/windows-setup-curl
-with:
-architecture: ${{ matrix.arch == 'x64' && 'win64' || 'win64a' }}
 - name: Build
 shell: cmd
-env:
-CURL_PATH: ${{ steps.get_libcurl.outputs.curl_path }}
 run: |
 call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" ${{ matrix.arch == 'x64' && 'x64' || 'amd64_arm64' }}
 cmake -S . -B build -G "Ninja Multi-Config" ^
 -D CMAKE_TOOLCHAIN_FILE=cmake/${{ matrix.arch }}-windows-llvm.cmake ^
+-DLLAMA_CURL=OFF ^
+-DLLAMA_BUILD_BORINGSSL=ON ^
 -DGGML_NATIVE=OFF ^
 -DGGML_BACKEND_DL=ON ^
 -DGGML_CPU_ALL_VARIANTS=${{ matrix.arch == 'x64' && 'ON' || 'OFF' }} ^
 -DGGML_OPENMP=ON ^
--DCURL_LIBRARY="%CURL_PATH%/lib/libcurl.dll.a" -DCURL_INCLUDE_DIR="%CURL_PATH%/include" ^
 ${{ env.CMAKE_ARGS }}
 cmake --build build --config Release
 - name: Pack artifacts
 id: pack_artifacts
-env:
-CURL_PATH: ${{ steps.get_libcurl.outputs.curl_path }}
 run: |
-Copy-Item $env:CURL_PATH\bin\libcurl-${{ matrix.arch }}.dll .\build\bin\Release\
 Copy-Item "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.44.35112\debug_nonredist\${{ matrix.arch }}\Microsoft.VC143.OpenMP.LLVM\libomp140.${{ matrix.arch == 'x64' && 'x86_64' || 'aarch64' }}.dll" .\build\bin\Release\
 7z a -snl llama-bin-win-cpu-${{ matrix.arch }}.zip .\build\bin\Release\*
@ -744,12 +728,14 @@ jobs:
"${{ steps.cann-image.outputs.image }}" \ "${{ steps.cann-image.outputs.image }}" \
bash -lc ' bash -lc '
set -e set -e
yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake libcurl-devel yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake openssl-devel
yum clean all && rm -rf /var/cache/yum yum clean all && rm -rf /var/cache/yum
git config --global --add safe.directory "/workspace" git config --global --add safe.directory "/workspace"
export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH} export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
cmake -S . -B build \ cmake -S . -B build \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE} \ -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
-DLLAMA_CURL=OFF \
-DLLAMA_OPENSSL=ON \
-DGGML_CANN=on \ -DGGML_CANN=on \
-DSOC_TYPE=${SOC_TYPE} -DSOC_TYPE=${SOC_TYPE}
cmake --build build -j $(nproc) cmake --build build -j $(nproc)

1
.gitignore vendored
View File

@ -130,6 +130,7 @@ poetry.toml
 # Local scripts
 /run-vim.sh
 /run-chat.sh
+/run-spec.sh
 /.ccache/
 # IDE

View File

@ -182,6 +182,9 @@ if (NOT MSVC)
 endif()
 endif()
+include("cmake/license.cmake")
+license_add_file("llama.cpp" "LICENSE")
 #
 # 3rd-party
 #
@ -235,6 +238,19 @@ if (LLAMA_BUILD_COMMON AND LLAMA_BUILD_TOOLS)
 add_subdirectory(tools)
 endif()
# Automatically add all files from the 'licenses' directory
file(GLOB EXTRA_LICENSES "${CMAKE_SOURCE_DIR}/licenses/LICENSE-*")
foreach(FILE_PATH ${EXTRA_LICENSES})
get_filename_component(FILE_NAME "${FILE_PATH}" NAME)
string(REGEX REPLACE "^LICENSE-" "" NAME "${FILE_NAME}")
license_add_file("${NAME}" "${FILE_PATH}")
endforeach()
if (LLAMA_BUILD_COMMON)
license_generate(common)
endif()
 #
 # install
 #

View File

@ -20,7 +20,7 @@ If AI is used to generate any portion of the code, contributors must adhere to t
 1. Explicitly disclose the manner in which AI was employed.
 2. Perform a comprehensive manual review prior to submitting the pull request.
 3. Be prepared to explain every line of code they submitted when asked about it by a maintainer.
-4. Using AI to respond to human reviewers is strictly prohibited.
+4. Using AI to write pull request descriptions or to respond to human reviewers is strictly prohibited.
 For more info, please refer to the [AGENTS.md](AGENTS.md) file.

View File

@ -200,6 +200,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 *(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*
 - [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
+- [BonzAI App](https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (proprietary)
 - [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
 - [Dot](https://github.com/alexpinel/Dot) (GPL)
 - [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
@ -482,21 +483,6 @@ To learn more about model quantization, [read this documentation](tools/quantize
 </details>
## [`llama-run`](tools/run)
#### A comprehensive example for running `llama.cpp` models. Useful for inferencing. Used with RamaLama [^3].
- <details>
<summary>Run a model with a specific prompt (by default it's pulled from Ollama registry)</summary>
```bash
llama-run granite-code
```
</details>
[^3]: [RamaLama](https://github.com/containers/ramalama)
 ## [`llama-simple`](examples/simple)
 #### A minimal example for implementing apps with `llama.cpp`. Useful for developers.
@ -600,7 +586,6 @@ $ echo "source ~/.llama-completion.bash" >> ~/.bashrc
 - [stb-image](https://github.com/nothings/stb) - Single-header image format decoder, used by multimodal subsystem - Public domain
 - [nlohmann/json](https://github.com/nlohmann/json) - Single-header JSON library, used by various tools/examples - MIT License
 - [minja](https://github.com/google/minja) - Minimal Jinja parser in C++, used by various tools/examples - MIT License
-- [linenoise.cpp](./tools/run/linenoise.cpp/linenoise.cpp) - C++ library that provides readline-like line editing capabilities, used by `llama-run` - BSD 2-Clause License
 - [curl](https://curl.se/) - Client-side URL transfer library, used by various tools/examples - [CURL License](https://curl.se/docs/copyright.html)
 - [miniaudio.h](https://github.com/mackron/miniaudio) - Single-header audio format decoder, used by multimodal subsystem - Public domain
 - [subprocess.h](https://github.com/sheredom/subprocess.h) - Single-header process launching solution for C and C++ - Public domain

View File

@ -1,12 +1,52 @@
 # Security Policy
+- [**Reporting a vulnerability**](#reporting-a-vulnerability)
+- [**Requirements**](#requirements)
+- [**Covered Topics**](#covered-topics)
 - [**Using llama.cpp securely**](#using-llamacpp-securely)
 - [Untrusted models](#untrusted-models)
 - [Untrusted inputs](#untrusted-inputs)
 - [Data privacy](#data-privacy)
 - [Untrusted environments or networks](#untrusted-environments-or-networks)
 - [Multi-Tenant environments](#multi-tenant-environments)
-- [**Reporting a vulnerability**](#reporting-a-vulnerability)
## Reporting a vulnerability
If you have discovered a security vulnerability in this project that falls inside the [covered topics](#covered-topics), please report it privately. **Do not disclose it as a public issue.** This gives us time to work with you to fix the issue before public exposure, reducing the chance that the exploit will be used before a patch is released.
Please disclose it as a private [security advisory](https://github.com/ggml-org/llama.cpp/security/advisories/new).
A team of volunteers on a reasonable-effort basis maintains this project. As such, please give us at least 90 days to work on a fix before public exposure.
> [!IMPORTANT]
> For collaborators: if you are interested in helping out with reviewing privting security disclosures, please see: https://github.com/ggml-org/llama.cpp/discussions/18080
## Requirements
Before submitting your report, ensure you meet the following requirements:
- You have read this policy and fully understand it.
- AI is only permitted in an assistive capacity as stated in [AGENTS.md](AGENTS.md). We do not accept reports that are written exclusively by AI.
- Your report must include a working Proof-of-Concept in the form of a script and/or attached files.
Maintainers reserve the right to close the report if these requirements are not fulfilled.
## Covered Topics
Only vulnerabilities that fall within these parts of the project are considered valid. For problems falling outside of this list, please report them as issues.
- `src/**/*`
- `ggml/**/*`
- `gguf-py/**/*`
- `tools/server/*`, **excluding** the following topics:
- Web UI
- Features marked as experimental
- Features not recommended for use in untrusted environments (e.g., router, MCP)
- Bugs that can lead to Denial-of-Service attack
Note that none of the topics under [Using llama.cpp securely](#using-llamacpp-securely) are considered vulnerabilities in LLaMA C++.
For vulnerabilities that fall within the `vendor` directory, please report them directly to the third-party project.
 ## Using llama.cpp securely
@ -55,19 +95,3 @@ If you intend to run multiple models in parallel with shared memory, it is your
 3. Model Sharing: In a multitenant model sharing design, tenants and users must understand the security risks of running code provided by others. Since there are no reliable methods to detect malicious models, sandboxing the model execution is the recommended approach to mitigate the risk.
 4. Hardware Attacks: GPUs or TPUs can also be attacked. [Researches](https://scholar.google.com/scholar?q=gpu+side+channel) has shown that side channel attacks on GPUs are possible, which can make data leak from other models or processes running on the same system at the same time.
## Reporting a vulnerability
Beware that none of the topics under [Using llama.cpp securely](#using-llamacpp-securely) are considered vulnerabilities of LLaMA C++.
<!-- normal version -->
However, If you have discovered a security vulnerability in this project, please report it privately. **Do not disclose it as a public issue.** This gives us time to work with you to fix the issue before public exposure, reducing the chance that the exploit will be used before a patch is released.
Please disclose it as a private [security advisory](https://github.com/ggml-org/llama.cpp/security/advisories/new).
Please note that using AI to identify vulnerabilities and generate reports is permitted. However, you must (1) explicitly disclose how AI was used and (2) conduct a thorough manual review before submitting the report.
A team of volunteers on a reasonable-effort basis maintains this project. As such, please give us at least 90 days to work on a fix before public exposure.
> [!IMPORTANT]
> For collaborators: if you are interested in helping out with reviewing privting security disclosures, please see: https://github.com/ggml-org/llama.cpp/discussions/18080

View File

@ -105,7 +105,20 @@ if [ ! -z ${GG_BUILD_VULKAN} ]; then
 fi
 if [ ! -z ${GG_BUILD_WEBGPU} ]; then
-CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_WEBGPU=1"
+CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_WEBGPU=1 -DGGML_METAL=OFF -DGGML_BLAS=OFF"
if [ ! -z "${GG_BUILD_WEBGPU_DAWN_PREFIX}" ]; then
if [ -z "${CMAKE_PREFIX_PATH}" ]; then
export CMAKE_PREFIX_PATH="${GG_BUILD_WEBGPU_DAWN_PREFIX}"
else
export CMAKE_PREFIX_PATH="${GG_BUILD_WEBGPU_DAWN_PREFIX}:${CMAKE_PREFIX_PATH}"
fi
fi
# For some systems, Dawn_DIR needs to be set explicitly, e.g., the lib64 path
if [ ! -z "${GG_BUILD_WEBGPU_DAWN_DIR}" ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DDawn_DIR=${GG_BUILD_WEBGPU_DAWN_DIR}"
fi
 fi
 if [ ! -z ${GG_BUILD_MUSA} ]; then
@ -284,7 +297,8 @@ function gg_sum_test_scripts {
 }
 function gg_get_model {
-local gguf_0="$MNT/models/qwen3/0.6B/ggml-model-f16.gguf"
+#local gguf_0="$MNT/models/qwen3/0.6B/ggml-model-f16.gguf"
+local gguf_0="$MNT/models/qwen3/0.6B/ggml-model-q4_0.gguf"
 if [[ -s $gguf_0 ]]; then
 echo -n "$gguf_0"
 else

40
cmake/license.cmake Normal file
View File

@ -0,0 +1,40 @@
define_property(GLOBAL PROPERTY LICENSE_TEXT
BRIEF_DOCS "Embedded licenses"
FULL_DOCS "Global string containing all aggregated licenses"
)
function(license_add_file NAME FILE)
if(NOT IS_ABSOLUTE "${FILE}")
set(FILE "${CMAKE_CURRENT_SOURCE_DIR}/${FILE}")
endif()
if(EXISTS "${FILE}")
set(TITLE "License for ${NAME}")
string(REGEX REPLACE "." "=" UNDERLINE "${TITLE}")
file(READ "${FILE}" TEXT)
get_property(TMP GLOBAL PROPERTY LICENSE_TEXT)
string(APPEND TMP "R\"=L=(${TITLE}\n${UNDERLINE}\n\n${TEXT})=L=\",\n")
set_property(GLOBAL PROPERTY LICENSE_TEXT "${TMP}")
else()
message(WARNING "License file '${FILE}' not found")
endif()
endfunction()
function(license_generate TARGET_NAME)
message(STATUS "Generating embedded license file for target: ${TARGET_NAME}")
get_property(TEXT GLOBAL PROPERTY LICENSE_TEXT)
set(CPP_CONTENT "// Generated by CMake\n\n")
string(APPEND CPP_CONTENT "const char* LICENSES[] = {\n")
string(APPEND CPP_CONTENT "${TEXT}")
string(APPEND CPP_CONTENT "nullptr\n")
string(APPEND CPP_CONTENT "};\n")
set(CPP_FILE "${CMAKE_BINARY_DIR}/license.cpp")
file(WRITE "${CPP_FILE}" "${CPP_CONTENT}")
if(TARGET ${TARGET_NAME})
target_sources(${TARGET_NAME} PRIVATE "${CPP_FILE}")
else()
message(FATAL_ERROR "Target '${TARGET_NAME}' does not exist")
endif()
endfunction()

View File

@ -155,27 +155,3 @@ if (LLAMA_LLGUIDANCE)
 endif ()
 target_link_libraries(${TARGET} PRIVATE ${LLAMA_COMMON_EXTRA_LIBS} PUBLIC llama Threads::Threads)
#
# copy the license files
#
# Check if running in GitHub Actions
if (DEFINED ENV{GITHUB_ACTIONS} AND "$ENV{GITHUB_ACTIONS}" STREQUAL "true")
message(STATUS "Running inside GitHub Actions - copying license files")
# Copy all files from licenses/ to build/bin/
file(GLOB LICENSE_FILES "${CMAKE_SOURCE_DIR}/licenses/*")
foreach(LICENSE_FILE ${LICENSE_FILES})
get_filename_component(FILENAME ${LICENSE_FILE} NAME)
add_custom_command(
POST_BUILD
TARGET ${TARGET}
COMMAND ${CMAKE_COMMAND} -E copy_if_different
"${LICENSE_FILE}"
"$<TARGET_FILE_DIR:llama>/${FILENAME}"
COMMENT "Copying ${FILENAME} to ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}")
message(STATUS "Copying ${LICENSE_FILE} to ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/${FILENAME}")
endforeach()
endif()

View File

@ -2,10 +2,11 @@
#include "chat.h" #include "chat.h"
#include "common.h" #include "common.h"
#include "download.h"
#include "json-schema-to-grammar.h" #include "json-schema-to-grammar.h"
#include "log.h" #include "log.h"
#include "sampling.h" #include "sampling.h"
#include "download.h" #include "preset.h"
// fix problem with std::min and std::max // fix problem with std::min and std::max
#if defined(_WIN32) #if defined(_WIN32)
@ -47,6 +48,8 @@
 #define LLAMA_MAX_URL_LENGTH 2084 // Maximum URL Length in Chrome: 2083
+extern const char * LICENSES[];
 using json = nlohmann::ordered_json;
 using namespace common_arg_utils;
@ -268,6 +271,55 @@ static void parse_tensor_buffer_overrides(const std::string & value, std::vector
 }
 }
static std::string clean_file_name(const std::string & fname) {
std::string clean_fname = fname;
string_replace_all(clean_fname, "\\", "_");
string_replace_all(clean_fname, "/", "_");
return clean_fname;
}
static bool common_params_handle_remote_preset(common_params & params, llama_example ex) {
GGML_ASSERT(!params.model.hf_repo.empty());
// the returned hf_repo is without tag
auto [hf_repo, hf_tag] = common_download_split_repo_tag(params.model.hf_repo);
// "latest" tag (default if not specified) is translated to "default" preset
if (hf_tag == "latest") {
hf_tag = "default";
}
const bool offline = params.offline;
std::string model_endpoint = get_model_endpoint();
auto preset_url = model_endpoint + hf_repo + "/resolve/main/preset.ini";
// prepare local path for caching
auto preset_fname = clean_file_name(hf_repo + "_preset.ini");
auto preset_path = fs_get_cache_file(preset_fname);
const int status = common_download_file_single(preset_url, preset_path, params.hf_token, offline);
const bool has_preset = status >= 200 && status < 400;
// remote preset is optional, so we don't error out if not found
if (has_preset) {
LOG_INF("applying remote preset from %s\n", preset_url.c_str());
common_preset_context ctx(ex, /* only_remote_allowed */ true);
common_preset global;
auto remote_presets = ctx.load_from_ini(preset_path, global);
remote_presets = ctx.cascade(global, remote_presets);
if (remote_presets.find(hf_tag) != remote_presets.end()) {
common_preset preset = remote_presets.at(hf_tag);
LOG_INF("\n%s", preset.to_ini().c_str()); // to_ini already added trailing newline
preset.apply_to_params(params);
} else {
throw std::runtime_error("Remote preset.ini does not contain [" + std::string(hf_tag) + "] section");
}
} else {
LOG_INF("%s", "no remote preset found, skipping\n");
}
return has_preset;
}
 struct handle_model_result {
 bool found_mmproj = false;
 common_params_model mmproj;
@ -309,9 +361,7 @@ static handle_model_result common_params_handle_model(
 // make sure model path is present (for caching purposes)
 if (model.path.empty()) {
 // this is to avoid different repo having same file name, or same file name in different subdirs
-std::string filename = model.hf_repo + "_" + model.hf_file;
+std::string filename = clean_file_name(model.hf_repo + "_" + model.hf_file);
-// to make sure we don't have any slashes in the filename
-string_replace_all(filename, "/", "_");
 model.path = fs_get_cache_file(filename);
 }
@ -425,6 +475,7 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
 }
 };
+auto parse_cli_args = [&]() {
 std::set<std::string> seen_args;
 for (int i = 1; i < argc; i++) {
@ -482,6 +533,31 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
 arg.c_str(), e.what(), opt.to_string().c_str()));
 }
 }
+};
// parse the first time to get -hf option (used for remote preset)
parse_cli_args();
// maybe handle remote preset
if (!params.model.hf_repo.empty()) {
std::string cli_hf_repo = params.model.hf_repo;
bool has_preset = common_params_handle_remote_preset(params, ctx_arg.ex);
// special case: if hf_repo explicitly set by preset, we need to preserve it (ignore CLI value)
// this is useful when we have one HF repo pointing to other HF repos (one model - multiple GGUFs)
std::string preset_hf_repo = params.model.hf_repo;
bool preset_has_hf_repo = preset_hf_repo != cli_hf_repo;
if (has_preset) {
// re-parse CLI args to override preset values
parse_cli_args();
}
// preserve hf_repo from preset if needed
if (preset_has_hf_repo) {
params.model.hf_repo = preset_hf_repo;
}
}
    postprocess_cpu_params(params.cpuparams, nullptr);
    postprocess_cpu_params(params.cpuparams_batch, &params.cpuparams);
@@ -679,7 +755,6 @@ static void common_params_print_completion(common_params_context & ctx_arg) {
        "llama-quantize",
        "llama-qwen2vl-cli",
        "llama-retrieval",
-       "llama-run",
        "llama-save-load-state",
        "llama-server",
        "llama-simple",
@@ -966,6 +1041,16 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
            exit(0);
        }
    ));
add_opt(common_arg(
{"--license"},
"show source code license and dependencies",
[](common_params &) {
for (int i = 0; LICENSES[i]; ++i) {
printf("%s\n", LICENSES[i]);
}
exit(0);
}
));
    add_opt(common_arg(
        {"-cl", "--cache-list"},
        "show list of models in cache",
@@ -1210,7 +1295,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
        [](common_params & params) {
            params.kv_unified = true;
        }
-   ).set_env("LLAMA_ARG_KV_UNIFIED").set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_PERPLEXITY}));
+   ).set_env("LLAMA_ARG_KV_UNIFIED").set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_PERPLEXITY, LLAMA_EXAMPLE_BATCHED}));
    add_opt(common_arg(
        {"--context-shift"},
        {"--no-context-shift"},
@@ -2089,11 +2174,22 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
    add_opt(common_arg(
        {"--mmap"},
        {"--no-mmap"},
-       string_format("whether to memory-map model (if disabled, slower load but may reduce pageouts if not using mlock) (default: %s)", params.use_mmap ? "enabled" : "disabled"),
+       string_format("whether to memory-map model. Explicitly enabling mmap disables direct-io. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: %s)", params.use_mmap ? "enabled" : "disabled"),
        [](common_params & params, bool value) {
            params.use_mmap = value;
+           if (value) {
+               params.use_direct_io = false; // disable direct io when mmap is explicitly enabled
+           }
        }
    ).set_env("LLAMA_ARG_MMAP"));
add_opt(common_arg(
{"-dio", "--direct-io"},
{"-ndio", "--no-direct-io"},
string_format("use DirectIO if available. Takes precedence over --mmap (default: %s)", params.use_direct_io ? "enabled" : "disabled"),
[](common_params & params, bool value) {
params.use_direct_io = value;
}
).set_env("LLAMA_ARG_DIO"));
    add_opt(common_arg(
        {"--numa"}, "TYPE",
        "attempt optimizations that help on some NUMA systems\n"
@@ -2245,7 +2341,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
            std::vector<std::string> split_arg{ it, {} };
            if (split_arg.size() >= llama_max_devices()) {
                throw std::invalid_argument(
-                   string_format("got %d input configs, but system only has %d devices", (int)split_arg.size(), (int)llama_max_devices())
+                   string_format("got %zu input configs, but system only has %zu devices", split_arg.size(), llama_max_devices())
                );
            }
            for (size_t i = 0; i < llama_max_devices(); ++i) {
@@ -2285,10 +2381,28 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
        }
    ).set_env("LLAMA_ARG_FIT"));
    add_opt(common_arg(
-       { "-fitt", "--fit-target" }, "MiB",
-       string_format("target margin per device for --fit option, default: %zu", params.fit_params_target/(1024*1024)),
-       [](common_params & params, int value) {
-           params.fit_params_target = value * size_t(1024*1024);
+       { "-fitt", "--fit-target" }, "MiB0,MiB1,MiB2,...",
+       string_format("target margin per device for --fit, comma-separated list of values, "
+                     "single value is broadcast across all devices, default: %zu", params.fit_params_target[0]/(1024*1024)),
+       [](common_params & params, const std::string & value) {
std::string arg_next = value;
// split string by , and /
const std::regex regex{ R"([,/]+)" };
std::sregex_token_iterator it{ arg_next.begin(), arg_next.end(), regex, -1 };
std::vector<std::string> split_arg{ it, {} };
if (split_arg.size() >= llama_max_devices()) {
throw std::invalid_argument(
string_format("got %zu input configs, but system only has %zu devices", split_arg.size(), llama_max_devices())
);
}
if (split_arg.size() == 1) {
std::fill(params.fit_params_target.begin(), params.fit_params_target.end(), std::stoul(split_arg[0]) * 1024*1024);
return;
}
for (size_t i = 0; i < split_arg.size(); i++) {
params.fit_params_target[i] = std::stoul(split_arg[i]) * 1024*1024;
}
        }
    ).set_env("LLAMA_ARG_FIT_TARGET"));
    add_opt(common_arg(
@@ -2763,10 +2877,18 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
            params.n_threads_http = value;
        }
    ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_THREADS_HTTP"));
add_opt(common_arg(
{"--cache-prompt"},
{"--no-cache-prompt"},
string_format("whether to enable prompt caching (default: %s)", params.cache_prompt ? "enabled" : "disabled"),
[](common_params & params, bool value) {
params.cache_prompt = value;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_CACHE_PROMPT"));
    add_opt(common_arg(
        {"--cache-reuse"}, "N",
        string_format(
-           "min chunk size to attempt reusing from the cache via KV shifting (default: %d)\n"
+           "min chunk size to attempt reusing from the cache via KV shifting, requires prompt caching to be enabled (default: %d)\n"
            "[(card)](https://ggml.ai/f0.png)", params.n_cache_reuse
        ),
        [](common_params & params, int value) {


@@ -129,11 +129,3 @@ void common_params_add_preset_options(std::vector<common_arg> & args);
// initialize argument parser context - used by test-arg-parser and preset
common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);

-struct common_remote_params {
-    std::vector<std::string> headers;
-    long timeout = 0; // CURLOPT_TIMEOUT, in seconds ; 0 means no timeout
-    long max_size = 0; // max size of the response ; unlimited if 0 ; max is 2GB
-};
-
-// get remote file content, returns <http_code, raw_response_body>
-std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);


@@ -1097,7 +1097,7 @@ common_init_result::common_init_result(common_params & params) :
    if (params.fit_params) {
        LOG_INF("%s: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on\n", __func__);
        llama_params_fit(params.model.path.c_str(), &mparams, &cparams,
-           params.tensor_split, params.tensor_buft_overrides.data(), params.fit_params_target, params.fit_params_min_ctx,
+           params.tensor_split, params.tensor_buft_overrides.data(), params.fit_params_target.data(), params.fit_params_min_ctx,
            params.verbosity >= 4 ? GGML_LOG_LEVEL_DEBUG : GGML_LOG_LEVEL_ERROR);
    }
@@ -1366,6 +1366,7 @@ struct llama_model_params common_model_params_to_llama(common_params & params) {
    mparams.split_mode       = params.split_mode;
    mparams.tensor_split     = params.tensor_split;
    mparams.use_mmap         = params.use_mmap;
+   mparams.use_direct_io    = params.use_direct_io;
    mparams.use_mlock        = params.use_mlock;
    mparams.check_tensors    = params.check_tensors;
    mparams.use_extra_bufts  = !params.no_extra_bufts;


@@ -80,6 +80,7 @@ int32_t cpu_get_num_math();
//
enum llama_example {
+   LLAMA_EXAMPLE_BATCHED,
    LLAMA_EXAMPLE_DEBUG,
    LLAMA_EXAMPLE_COMMON,
    LLAMA_EXAMPLE_SPECULATIVE,
@@ -336,9 +337,11 @@ struct common_params {
    int32_t main_gpu = 0; // the GPU that is used for scratch and small tensors
    float   tensor_split[128] = {0}; // how split tensors should be distributed across GPUs
    bool    fit_params = true; // whether to fit unset model/context parameters to free device memory
-   size_t  fit_params_target = 1024 * 1024*1024; // margin per device in bytes for fitting parameters to free memory
    int32_t fit_params_min_ctx = 4096; // minimum context size to set when trying to reduce memory use
+   // margin per device in bytes for fitting parameters to free memory:
+   std::vector<size_t> fit_params_target = std::vector<size_t>(llama_max_devices(), 1024 * 1024*1024);
    enum llama_split_mode split_mode = LLAMA_SPLIT_MODE_LAYER; // how to split the model across GPUs
    struct cpu_params cpuparams;
@@ -428,7 +431,8 @@ struct common_params {
    bool kv_unified       = false; // enable unified KV cache
    bool input_prefix_bos = false; // prefix BOS to user inputs, preceding input_prefix
-   bool use_mmap         = true;  // use mmap for faster loads
+   bool use_mmap         = true;  // enable mmap to use filesystem cache
+   bool use_direct_io    = true;  // read from disk without buffering for faster model loading
    bool use_mlock        = false; // use mlock to keep model in memory
    bool verbose_prompt   = false; // print prompt tokens before generation
    bool display_prompt   = true;  // print prompt before generation
@@ -472,6 +476,7 @@ struct common_params {
    int32_t timeout_write   = timeout_read; // http write timeout in seconds
    int32_t n_threads_http  = -1;   // number of threads to process HTTP requests (TODO: support threadpool)
    int32_t n_cache_reuse   = 0;    // min chunk size to reuse from the cache via KV shifting
+   bool    cache_prompt    = true; // whether to enable prompt caching
    int32_t n_ctx_checkpoints = 8;  // max number of context checkpoints per slot
    int32_t cache_ram_mib   = 8192; // -1 = no limit, 0 - disable, 1 = 1 MiB, etc.


@@ -157,6 +157,20 @@ static std::string read_etag(const std::string & path) {
    return none;
}
static bool is_http_status_ok(int status) {
return status >= 200 && status < 400;
}
std::pair<std::string, std::string> common_download_split_repo_tag(const std::string & hf_repo_with_tag) {
auto parts = string_split<std::string>(hf_repo_with_tag, ':');
std::string tag = parts.size() > 1 ? parts.back() : "latest";
std::string hf_repo = parts[0];
if (string_split<std::string>(hf_repo, '/').size() != 2) {
throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]\n");
}
return {hf_repo, tag};
}
#ifdef LLAMA_USE_CURL

//
@@ -306,11 +320,14 @@ static bool common_download_head(CURL * curl,
}

// download one single file from remote URL to local path
-static bool common_download_file_single_online(const std::string & url,
+// returns status code or -1 on error
+static int common_download_file_single_online(const std::string & url,
                                               const std::string & path,
-                                              const std::string & bearer_token) {
+                                              const std::string & bearer_token,
+                                              const common_header_list & custom_headers) {
    static const int max_attempts = 3;
    static const int retry_delay_seconds = 2;

    for (int i = 0; i < max_attempts; ++i) {
        std::string etag;
@@ -330,6 +347,11 @@ static bool common_download_file_single_online(const std::string & url,
        common_load_model_from_url_headers headers;
        curl_easy_setopt(curl.get(), CURLOPT_HEADERDATA, &headers);
        curl_slist_ptr http_headers;
for (const auto & h : custom_headers) {
std::string s = h.first + ": " + h.second;
http_headers.ptr = curl_slist_append(http_headers.ptr, s.c_str());
}
        const bool was_perform_successful = common_download_head(curl.get(), http_headers, url, bearer_token);
        if (!was_perform_successful) {
            head_request_ok = false;
@@ -365,7 +387,7 @@ static bool common_download_file_single_online(const std::string & url,
                LOG_WRN("%s: deleting previous downloaded file: %s\n", __func__, path.c_str());
                if (remove(path.c_str()) != 0) {
                    LOG_ERR("%s: unable to delete file: %s\n", __func__, path.c_str());
-                   return false;
+                   return -1;
                }
            }
@@ -374,14 +396,14 @@ static bool common_download_file_single_online(const std::string & url,
            if (std::filesystem::exists(path_temporary)) {
                if (remove(path_temporary.c_str()) != 0) {
                    LOG_ERR("%s: unable to delete file: %s\n", __func__, path_temporary.c_str());
-                   return false;
+                   return -1;
                }
            }
            if (std::filesystem::exists(path)) {
                if (remove(path.c_str()) != 0) {
                    LOG_ERR("%s: unable to delete file: %s\n", __func__, path.c_str());
-                   return false;
+                   return -1;
                }
            }
        }
@@ -408,23 +430,27 @@ static bool common_download_file_single_online(const std::string & url,
            long http_code = 0;
            curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &http_code);
-           if (http_code < 200 || http_code >= 400) {
+           int status = static_cast<int>(http_code);
+           if (!is_http_status_ok(http_code)) {
                LOG_ERR("%s: invalid http status code received: %ld\n", __func__, http_code);
-               return false;
+               return status; // TODO: maybe only return on certain codes
            }
            if (rename(path_temporary.c_str(), path.c_str()) != 0) {
                LOG_ERR("%s: unable to rename file: %s to %s\n", __func__, path_temporary.c_str(), path.c_str());
-               return false;
+               return -1;
            }
+           return static_cast<int>(http_code);
        } else {
            LOG_INF("%s: using cached file: %s\n", __func__, path.c_str());
+           return 304; // Not Modified - fake cached response
+       }
        }
-       break;
-   }
-   return true;
+   return -1; // max attempts reached
}
std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params) {
@@ -454,8 +480,10 @@ std::pair<long, std::vector<char>> common_remote_get_content(const std::string &
        curl_easy_setopt(curl.get(), CURLOPT_MAXFILESIZE, params.max_size);
    }
    http_headers.ptr = curl_slist_append(http_headers.ptr, "User-Agent: llama-cpp");
    for (const auto & header : params.headers) {
-       http_headers.ptr = curl_slist_append(http_headers.ptr, header.c_str());
+       std::string header_ = header.first + ": " + header.second;
+       http_headers.ptr = curl_slist_append(http_headers.ptr, header_.c_str());
    }
    curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers.ptr);
@@ -617,9 +645,11 @@ static bool common_pull_file(httplib::Client & cli,
}

// download one single file from remote URL to local path
-static bool common_download_file_single_online(const std::string & url,
+// returns status code or -1 on error
+static int common_download_file_single_online(const std::string & url,
                                               const std::string & path,
-                                              const std::string & bearer_token) {
+                                              const std::string & bearer_token,
+                                              const common_header_list & custom_headers) {
    static const int max_attempts = 3;
    static const int retry_delay_seconds = 2;
@@ -629,6 +659,9 @@ static bool common_download_file_single_online(const std::string & url,
        if (!bearer_token.empty()) {
            default_headers.insert({"Authorization", "Bearer " + bearer_token});
        }
for (const auto & h : custom_headers) {
default_headers.emplace(h.first, h.second);
}
        cli.set_default_headers(default_headers);

        const bool file_exists = std::filesystem::exists(path);
@@ -647,8 +680,10 @@ static bool common_download_file_single_online(const std::string & url,
            LOG_WRN("%s: HEAD invalid http status code received: %d\n", __func__, head ? head->status : -1);
            if (file_exists) {
                LOG_INF("%s: Using cached file (HEAD failed): %s\n", __func__, path.c_str());
-               return true;
+               return 304; // 304 Not Modified - fake cached response
            }
+           return head->status; // cannot use cached file, return raw status code
+           // TODO: maybe retry only on certain codes
        }

        std::string etag;
@@ -680,12 +715,12 @@ static bool common_download_file_single_online(const std::string & url,
        if (file_exists) {
            if (!should_download_from_scratch) {
                LOG_INF("%s: using cached file: %s\n", __func__, path.c_str());
-               return true;
+               return 304; // 304 Not Modified - fake cached response
            }
            LOG_WRN("%s: deleting previous downloaded file: %s\n", __func__, path.c_str());
            if (remove(path.c_str()) != 0) {
                LOG_ERR("%s: unable to delete file: %s\n", __func__, path.c_str());
-               return false;
+               return -1;
            }
        }
@@ -697,7 +732,7 @@ static bool common_download_file_single_online(const std::string & url,
            existing_size = std::filesystem::file_size(path_temporary);
        } else if (remove(path_temporary.c_str()) != 0) {
            LOG_ERR("%s: unable to delete file: %s\n", __func__, path_temporary.c_str());
-           return false;
+           return -1;
        }
    }
@@ -718,15 +753,16 @@ static bool common_download_file_single_online(const std::string & url,
        if (std::rename(path_temporary.c_str(), path.c_str()) != 0) {
            LOG_ERR("%s: unable to rename file: %s to %s\n", __func__, path_temporary.c_str(), path.c_str());
-           return false;
+           return -1;
        }
        if (!etag.empty()) {
            write_etag(path, etag);
        }
-       break;
+       return head->status; // TODO: use actual GET status?
    }
-   return true;
+   return -1; // max attempts reached
}
std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url,
@@ -734,13 +770,9 @@ std::pair<long, std::vector<char>> common_remote_get_content(const std::string
    auto [cli, parts] = common_http_client(url);
    httplib::Headers headers = {{"User-Agent", "llama-cpp"}};
    for (const auto & header : params.headers) {
-       size_t pos = header.find(':');
-       if (pos != std::string::npos) {
-           headers.emplace(header.substr(0, pos), header.substr(pos + 1));
-       } else {
-           headers.emplace(header, "");
-       }
+       headers.emplace(header.first, header.second);
    }

    if (params.timeout > 0) {
@@ -769,32 +801,45 @@ std::pair<long, std::vector<char>> common_remote_get_content(const std::string

#if defined(LLAMA_USE_CURL) || defined(LLAMA_USE_HTTPLIB)

-static bool common_download_file_single(const std::string & url,
+int common_download_file_single(const std::string & url,
                                        const std::string & path,
                                        const std::string & bearer_token,
-                                       bool offline) {
+                                       bool offline,
+                                       const common_header_list & headers) {
    if (!offline) {
-       return common_download_file_single_online(url, path, bearer_token);
+       return common_download_file_single_online(url, path, bearer_token, headers);
    }
    if (!std::filesystem::exists(path)) {
        LOG_ERR("%s: required file is not available in cache (offline mode): %s\n", __func__, path.c_str());
-       return false;
+       return -1;
    }
    LOG_INF("%s: using cached file (offline mode): %s\n", __func__, path.c_str());
-   return true;
+   return 304; // Not Modified - fake cached response
}
// download multiple files from remote URLs to local paths
// the input is a vector of pairs <url, path>
-static bool common_download_file_multiple(const std::vector<std::pair<std::string, std::string>> & urls, const std::string & bearer_token, bool offline) {
+static bool common_download_file_multiple(const std::vector<std::pair<std::string, std::string>> & urls,
+                                          const std::string & bearer_token,
+                                          bool offline,
+                                          const common_header_list & headers) {
    // Prepare download in parallel
    std::vector<std::future<bool>> futures_download;
+   futures_download.reserve(urls.size());
    for (auto const & item : urls) {
-       futures_download.push_back(std::async(std::launch::async, [bearer_token, offline](const std::pair<std::string, std::string> & it) -> bool {
-           return common_download_file_single(it.first, it.second, bearer_token, offline);
-       }, item));
+       futures_download.push_back(
+           std::async(
+               std::launch::async,
[&bearer_token, offline, &headers](const std::pair<std::string, std::string> & it) -> bool {
const int http_status = common_download_file_single(it.first, it.second, bearer_token, offline, headers);
return is_http_status_ok(http_status);
},
item
)
);
    }

    // Wait for all downloads to complete
@@ -807,17 +852,18 @@ static bool common_download_file_multiple(const std::vector<std::pair<std::strin
    return true;
}
-bool common_download_model(
-   const common_params_model & model,
+bool common_download_model(const common_params_model & model,
    const std::string & bearer_token,
-   bool offline) {
+   bool offline,
+   const common_header_list & headers) {
    // Basic validation of the model.url
    if (model.url.empty()) {
        LOG_ERR("%s: invalid model url\n", __func__);
        return false;
    }

-   if (!common_download_file_single(model.url, model.path, bearer_token, offline)) {
+   const int http_status = common_download_file_single(model.url, model.path, bearer_token, offline, headers);
+   if (!is_http_status_ok(http_status)) {
        return false;
    }
@@ -876,27 +922,26 @@ bool common_download_model(
    }

        // Download in parallel
-       common_download_file_multiple(urls, bearer_token, offline);
+       common_download_file_multiple(urls, bearer_token, offline, headers);
    }

    return true;
}
-common_hf_file_res common_get_hf_file(const std::string & hf_repo_with_tag, const std::string & bearer_token, bool offline) {
-   auto parts = string_split<std::string>(hf_repo_with_tag, ':');
-   std::string tag = parts.size() > 1 ? parts.back() : "latest";
-   std::string hf_repo = parts[0];
-   if (string_split<std::string>(hf_repo, '/').size() != 2) {
-       throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]\n");
-   }
+common_hf_file_res common_get_hf_file(const std::string & hf_repo_with_tag,
+                                      const std::string & bearer_token,
+                                      bool offline,
+                                      const common_header_list & custom_headers) {
+   // the returned hf_repo is without tag
+   auto [hf_repo, tag] = common_download_split_repo_tag(hf_repo_with_tag);

    std::string url = get_model_endpoint() + "v2/" + hf_repo + "/manifests/" + tag;

    // headers
-   std::vector<std::string> headers;
-   headers.push_back("Accept: application/json");
+   common_header_list headers = custom_headers;
+   headers.push_back({"Accept", "application/json"});
    if (!bearer_token.empty()) {
-       headers.push_back("Authorization: Bearer " + bearer_token);
+       headers.push_back({"Authorization", "Bearer " + bearer_token});
    }
    // Important: the User-Agent must be "llama-cpp" to get the "ggufFile" field in the response
    // User-Agent header is already set in common_remote_get_content, no need to set it here
@@ -952,7 +997,7 @@ common_hf_file_res common_get_hf_file(const std::string & hf_repo_with_tag, cons
    } else if (res_code == 401) {
        throw std::runtime_error("error: model is private or does not exist; if you are accessing a gated model, please provide a valid HF token");
    } else {
-       throw std::runtime_error(string_format("error from HF API, response code: %ld, data: %s", res_code, res_str.c_str()));
+       throw std::runtime_error(string_format("error from HF API (%s), response code: %ld, data: %s", url.c_str(), res_code, res_str.c_str()));
    }

    // check response
@@ -1031,9 +1076,10 @@ std::string common_docker_resolve_model(const std::string & docker) {
    const std::string url_prefix = "https://registry-1.docker.io/v2/" + repo;
    std::string manifest_url = url_prefix + "/manifests/" + tag;
    common_remote_params manifest_params;
-   manifest_params.headers.push_back("Authorization: Bearer " + token);
-   manifest_params.headers.push_back(
-       "Accept: application/vnd.docker.distribution.manifest.v2+json,application/vnd.oci.image.manifest.v1+json");
+   manifest_params.headers.push_back({"Authorization", "Bearer " + token});
+   manifest_params.headers.push_back({"Accept",
+       "application/vnd.docker.distribution.manifest.v2+json,application/vnd.oci.image.manifest.v1+json"
+   });
    auto manifest_res = common_remote_get_content(manifest_url, manifest_params);
    if (manifest_res.first != 200) {
        throw std::runtime_error("Failed to get Docker manifest, HTTP code: " + std::to_string(manifest_res.first));
@@ -1070,7 +1116,8 @@ std::string common_docker_resolve_model(const std::string & docker) {
    std::string local_path = fs_get_cache_file(model_filename);
    const std::string blob_url = url_prefix + "/blobs/" + gguf_digest;
-   if (!common_download_file_single(blob_url, local_path, token, false)) {
+   const int http_status = common_download_file_single(blob_url, local_path, token, false, {});
+   if (!is_http_status_ok(http_status)) {
        throw std::runtime_error("Failed to download Docker Model");
    }
@@ -1084,11 +1131,11 @@ std::string common_docker_resolve_model(const std::string & docker) {
#else

-common_hf_file_res common_get_hf_file(const std::string &, const std::string &, bool) {
+common_hf_file_res common_get_hf_file(const std::string &, const std::string &, bool, const common_header_list &) {
    throw std::runtime_error("download functionality is not enabled in this build");
}

-bool common_download_model(const common_params_model &, const std::string &, bool) {
+bool common_download_model(const common_params_model &, const std::string &, bool, const common_header_list &) {
    throw std::runtime_error("download functionality is not enabled in this build");
}
@@ -1096,6 +1143,14 @@ std::string common_docker_resolve_model(const std::string &) {
    throw std::runtime_error("download functionality is not enabled in this build");
}
int common_download_file_single(const std::string &,
const std::string &,
const std::string &,
bool,
const common_header_list &) {
throw std::runtime_error("download functionality is not enabled in this build");
}
#endif // LLAMA_USE_CURL || LLAMA_USE_HTTPLIB

std::vector<common_cached_model_info> common_list_cached_models() {


@@ -1,12 +1,27 @@
#pragma once

#include <string>
+#include <vector>

struct common_params_model;

-//
-// download functionalities
-//
+using common_header = std::pair<std::string, std::string>;
+using common_header_list = std::vector<common_header>;
struct common_remote_params {
common_header_list headers;
long timeout = 0; // in seconds, 0 means no timeout
long max_size = 0; // unlimited if 0
};
// get remote file content, returns <http_code, raw_response_body>
std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);
// split HF repo with tag into <repo, tag>
// for example: "user/model:tag" -> <"user/model", "tag">
// if tag is not present, default to "latest"
// example: "user/model" -> <"user/model", "latest">
std::pair<std::string, std::string> common_download_split_repo_tag(const std::string & hf_repo_with_tag);
struct common_cached_model_info {
    std::string manifest_path;
@@ -41,17 +56,29 @@ struct common_hf_file_res {
common_hf_file_res common_get_hf_file(
    const std::string & hf_repo_with_tag,
    const std::string & bearer_token,
-   bool offline);
+   bool offline,
+   const common_header_list & headers = {}
+);

// returns true if download succeeded
bool common_download_model(
    const common_params_model & model,
    const std::string & bearer_token,
-   bool offline);
+   bool offline,
+   const common_header_list & headers = {}
+);

// returns list of cached models
std::vector<common_cached_model_info> common_list_cached_models();
// download single file from url to local path
// returns status code or -1 on error
int common_download_file_single(const std::string & url,
const std::string & path,
const std::string & bearer_token,
bool offline,
const common_header_list & headers = {});
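As a rough usage sketch (not code from this PR): a caller would treat the new integer return value the way the preset download above does, where anything in the 200-399 range, including the synthetic 304 used for cache hits and offline mode, counts as success. The URL and cache path below are made-up placeholders.

```cpp
// hypothetical call site; url/path are placeholders, the 200-399 check mirrors the callers in this PR
#include <string>

bool try_fetch_remote_preset_example() {
    const std::string url  = "https://example.com/user/model/resolve/main/preset.ini"; // placeholder
    const std::string path = "/tmp/example_preset.ini";                                // placeholder
    const int status = common_download_file_single(url, path, /*bearer_token=*/"", /*offline=*/false);
    return status >= 200 && status < 400; // -1 and HTTP error codes both fail this check
}
```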
// resolve and download model from Docker registry
// return local path to downloaded model file
std::string common_docker_resolve_model(const std::string & docker);


@@ -16,6 +16,48 @@ static std::string rm_leading_dashes(const std::string & str) {
    return str.substr(pos);
}
// only allow a subset of args for remote presets for security reasons
// do not add more args unless absolutely necessary
// args that output to files are strictly prohibited
static std::set<std::string> get_remote_preset_whitelist(const std::map<std::string, common_arg> & key_to_opt) {
static const std::set<std::string> allowed_options = {
"model-url",
"hf-repo",
"hf-repo-draft",
"hf-repo-v", // vocoder
"hf-file-v", // vocoder
"mmproj-url",
"pooling",
"jinja",
"batch-size",
"ubatch-size",
"cache-reuse",
"chat-template-kwargs",
"mmap",
// note: sampling params are automatically allowed by default
// negated args will be added automatically if the positive arg is specified above
};
std::set<std::string> allowed_keys;
for (const auto & it : key_to_opt) {
const std::string & key = it.first;
const common_arg & opt = it.second;
if (allowed_options.find(key) != allowed_options.end() || opt.is_sparam) {
allowed_keys.insert(key);
// also add variant keys (args without leading dashes and env vars)
for (const auto & arg : opt.get_args()) {
allowed_keys.insert(rm_leading_dashes(arg));
}
for (const auto & env : opt.get_env()) {
allowed_keys.insert(env);
}
}
}
return allowed_keys;
}
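For illustration only (not a file from this PR), a remote preset.ini that stays inside this whitelist might look like the sketch below; the repository names and values are hypothetical, and each section name corresponds to the tag in `-hf <user>/<model>[:tag]`, with `latest` mapped to `default`:

```ini
[default]
hf-repo = example-org/example-model-GGUF:Q4_K_M
jinja = true
batch-size = 2048

[q8]
hf-repo = example-org/example-model-GGUF:Q8_0
```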
std::vector<std::string> common_preset::to_args(const std::string & bin_path) const {
    std::vector<std::string> args;
@@ -121,6 +163,29 @@ void common_preset::merge(const common_preset & other) {
    }
}
void common_preset::apply_to_params(common_params & params) const {
for (const auto & [opt, val] : options) {
// apply each option to params
if (opt.handler_string) {
opt.handler_string(params, val);
} else if (opt.handler_int) {
opt.handler_int(params, std::stoi(val));
} else if (opt.handler_bool) {
opt.handler_bool(params, common_arg_utils::is_truthy(val));
} else if (opt.handler_str_str) {
// not supported yet
throw std::runtime_error(string_format(
"%s: option with two values is not supported yet",
__func__
));
} else if (opt.handler_void) {
opt.handler_void(params);
} else {
GGML_ABORT("unknown handler type");
}
}
}
static std::map<std::string, std::map<std::string, std::string>> parse_ini_from_file(const std::string & path) {
    std::map<std::string, std::map<std::string, std::string>> parsed;
@@ -230,10 +295,16 @@ static std::string parse_bool_arg(const common_arg & arg, const std::string & ke
    return value;
}
-common_preset_context::common_preset_context(llama_example ex)
+common_preset_context::common_preset_context(llama_example ex, bool only_remote_allowed)
    : ctx_params(common_params_parser_init(default_params, ex)) {
    common_params_add_preset_options(ctx_params.options);
    key_to_opt = get_map_key_opt(ctx_params);
// setup allowed keys if only_remote_allowed is true
if (only_remote_allowed) {
filter_allowed_keys = true;
allowed_keys = get_remote_preset_whitelist(key_to_opt);
}
}

common_presets common_preset_context::load_from_ini(const std::string & path, common_preset & global) const {
@@ -249,7 +320,18 @@ common_presets common_preset_context::load_from_ini(const std::string & path, co
        }
        LOG_DBG("loading preset: %s\n", preset.name.c_str());
        for (const auto & [key, value] : section.second) {
if (key == "version") {
// skip version key (reserved for future use)
continue;
}
LOG_DBG("option: %s = %s\n", key.c_str(), value.c_str()); LOG_DBG("option: %s = %s\n", key.c_str(), value.c_str());
if (filter_allowed_keys && allowed_keys.find(key) == allowed_keys.end()) {
throw std::runtime_error(string_format(
"option '%s' is not allowed in remote presets",
key.c_str()
));
}
            if (key_to_opt.find(key) != key_to_opt.end()) {
                const auto & opt = key_to_opt.at(key);
                if (is_bool_arg(opt)) {
@@ -259,7 +341,10 @@ common_presets common_preset_context::load_from_ini(const std::string & path, co
                }
                LOG_DBG("accepted option: %s = %s\n", key.c_str(), preset.options[opt].c_str());
            } else {
-               // TODO: maybe warn about unknown key?
+               throw std::runtime_error(string_format(
+                   "option '%s' not recognized in preset '%s'",
+                   key.c_str(), preset.name.c_str()
+               ));
            }
        }


@@ -6,6 +6,7 @@
#include <string>
#include <vector>
#include <map>
+#include <set>

//
// INI preset parser and writer
@@ -40,6 +41,9 @@ struct common_preset {
    // merge another preset into this one, overwriting existing options
    void merge(const common_preset & other);
+
+   // apply preset options to common_params
+   void apply_to_params(common_params & params) const;
};

// interface for multiple presets in one file
@@ -50,7 +54,12 @@ struct common_preset_context {
    common_params default_params; // unused for now
    common_params_context ctx_params;
    std::map<std::string, common_arg> key_to_opt;
-   common_preset_context(llama_example ex);
+
+   bool filter_allowed_keys = false;
+   std::set<std::string> allowed_keys;
+
+   // if only_remote_allowed is true, only accept whitelisted keys
+   common_preset_context(llama_example ex, bool only_remote_allowed = false);

    // load presets from INI file
    common_presets load_from_ini(const std::string & path, common_preset & global) const;


@@ -528,7 +528,11 @@ class ModelBase:
        return ()

    def prepare_tensors(self):
+       # Handle empty tensor_map for models with block_count=0 (like MobileNetV5)
+       if self.tensor_map.mapping:
            max_name_len = max(len(s) for _, s in self.tensor_map.mapping.values()) + len(".weight,")
+       else:
+           max_name_len = len("vision_encoder.weight,")  # Default reasonable length

        for name, data_torch in chain(self.generate_extra_tensors(), self.get_tensors()):
            # we don't need these
@@ -771,8 +775,8 @@ class TextModel(ModelBase):
        self.rope_parameters = self.hparams.get("rope_parameters", self.hparams.get("rope_scaling")) or {}
-       rope_theta = self.find_hparam(["rope_theta", "global_rope_theta", "rotary_emb_base"], optional=True)
-       local_rope_theta = self.find_hparam(["local_rope_theta", "rope_local_theta", "swa_rope_theta", "rope_local_base_freq"], optional=True)
+       rope_theta = self.find_hparam(["global_rope_theta", "rope_global_theta", "rope_theta_global", "rope_theta", "rotary_emb_base"], optional=True)
+       local_rope_theta = self.find_hparam(["local_rope_theta", "rope_local_theta", "rope_theta_local", "swa_rope_theta", "rope_local_base_freq"], optional=True)

        # Ensure "rope_theta" and "rope_type" is mirrored in rope_parameters
        if "full_attention" not in self.rope_parameters and "sliding_attention" not in self.rope_parameters:
@@ -4363,6 +4367,36 @@ class Qwen3NextModel(Qwen2MoeModel):
        elif name.endswith("norm.weight") and not name.endswith("linear_attn.norm.weight"):
            data_torch = data_torch + 1
if "in_proj_qkvz.weight" in name:
# original order: [q, k, v, z] * head_count
# corrected order: [q * head_count, k * head_count, v * head_count, z * head_count]
head_k_dim = self.hparams["linear_key_head_dim"]
head_v_dim = self.hparams["linear_value_head_dim"]
num_v_heads = self.hparams["linear_num_value_heads"]
num_k_heads = self.hparams["linear_num_key_heads"]
hidden_size = self.hparams["hidden_size"]
split_arg_list_qkvz = [
head_k_dim, # q partition
head_k_dim, # k partition
(num_v_heads // num_k_heads * head_v_dim), # v partition
(num_v_heads // num_k_heads * head_v_dim), # z partition
]
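# illustrative (made-up) numbers: with num_k_heads = 2, num_v_heads = 4, head_k_dim = 8, head_v_dim = 8,
# each key-head group stores [q(8), k(8), v(16), z(16)] back to back in the original layout;
# the reshape/split below regroups them into contiguous q, k, v and z blocks across all groups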
# view as (n_embd, head_count, [q+k+v+z])
data_torch = data_torch.permute(1, 0).contiguous()
data_torch = data_torch.view(-1, num_k_heads, sum(split_arg_list_qkvz))
# split into q, k, v, z
q, k, v, z = torch.split(data_torch, split_arg_list_qkvz, dim=-1)
# flatten dim + head_count
q = q.contiguous().view(hidden_size, -1)
k = k.contiguous().view(hidden_size, -1)
v = v.contiguous().view(hidden_size, -1)
z = z.contiguous().view(hidden_size, -1)
# stack back
qkv = torch.cat([q, k, v], dim=-1).permute(1, 0).contiguous()
z = z.permute(1, 0).contiguous()
yield (self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_QKV, bid, ".weight"), qkv)
yield (self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_GATE, bid, ".weight"), z)
else:
    yield from super().modify_tensors(data_torch, name, bid)
@@ -6038,7 +6072,175 @@ class Gemma3VisionModel(MmprojModel):
        return [] # skip other tensors
class ConformerAudioModel(MmprojModel):
_batch_norm_tensors: list[dict[str, Tensor]] | None = None
@staticmethod
def is_audio_tensor(name: str):
return any(p in name for p in ["audio", "codebook", "conformer", "depth_embedding", "depthformer", "depth_linear"])
def tensor_force_quant(self, name, new_name, bid, n_dims):
if ConformerAudioModel.is_audio_tensor(name):
if ".conv" in name or "_conv" in name and ".weight" in name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# fold running_mean, running_var and eps into weight and bias for batch_norm
if "batch_norm" in name:
if self._batch_norm_tensors is None:
self._batch_norm_tensors = [{} for _ in range(self.block_count)]
assert bid is not None
self._batch_norm_tensors[bid][name] = data_torch
if len(self._batch_norm_tensors[bid]) < 5:
return []
weight = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.weight"]
bias = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.bias"]
running_mean = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_mean"]
running_var = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_var"]
eps = 1e-5 # default value
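# standard batch-norm folding (not model-specific):
#   batch_norm(x) = weight * (x - running_mean) / sqrt(running_var + eps) + bias
#                 = a * x + b, with a = weight / sqrt(running_var + eps) and b = bias - running_mean * a
# so the four stored tensors collapse into a single scale/shift pair at conversion time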
a = weight / torch.sqrt(running_var + eps)
b = bias - running_mean * a
return [
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.weight"), a),
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.bias"), b),
]
# reshape conv weights
if name.startswith("conformer.pre_encode.conv.") and name.endswith(".bias"):
data_torch = data_torch[:, None, None]
if "conv.depthwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[1] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[2])
if "conv.pointwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[2] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[1])
return [(self.map_tensor_name(name), data_torch)]
@ModelBase.register("Gemma3nForConditionalGeneration") @ModelBase.register("Gemma3nForConditionalGeneration")
class Gemma3nVisionAudioModel(ConformerAudioModel):
has_audio_encoder = True
has_vision_encoder = True
# Double indexed mapping for MobileNetV5 blocks (not supported by tensor_mapping.py)
# This is the only known model having this, so we prefer implementing it outside of tensor_mapping.py
block_tensor_mapping = {
"model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.bn1.weight": "v.blk.{bid}.{sid}.bn1.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_pwl.weight": "v.blk.{bid}.{sid}.conv_pwl.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.bn2.weight": "v.blk.{bid}.{sid}.bn2.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.conv.weight": "v.blk.{bid}.{sid}.dw_start.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight": "v.blk.{bid}.{sid}.dw_start.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.conv.weight": "v.blk.{bid}.{sid}.dw_mid.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.bn.weight": "v.blk.{bid}.{sid}.dw_mid.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.conv.weight": "v.blk.{bid}.{sid}.pw_exp.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.bn.weight": "v.blk.{bid}.{sid}.pw_exp.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.conv.weight": "v.blk.{bid}.{sid}.pw_proj.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.bn.weight": "v.blk.{bid}.{sid}.pw_proj.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.layer_scale.gamma": "v.blk.{bid}.{sid}.layer_scale.gamma",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.query.proj.weight": "v.blk.{bid}.{sid}.attn.query.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.proj.weight": "v.blk.{bid}.{sid}.attn.key.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.proj.weight": "v.blk.{bid}.{sid}.attn.value.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.output.proj.weight": "v.blk.{bid}.{sid}.attn.output.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.down_conv.weight": "v.blk.{bid}.{sid}.attn.key.down_conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.norm.weight": "v.blk.{bid}.{sid}.attn.key.norm.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.down_conv.weight": "v.blk.{bid}.{sid}.attn.value.down_conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.norm.weight": "v.blk.{bid}.{sid}.attn.value.norm.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.norm.weight": "v.blk.{bid}.{sid}.norm.weight",
}
def __init__(self, *args, **kwargs):
# Parent init will call find_hparam which now returns 0 for empty keys
super().__init__(*args, **kwargs)
assert self.hparams_vision is not None
self.hparams_vision["n_layers"] = 128 # fake value for audio encoder, vision encoder doesn't use it
self.hparams_vision["intermediate_size"] = self.hparams_vision.get("intermediate_size", 2048) * 4
self.hparams_vision["num_attention_heads"] = self.hparams_vision.get("num_attention_heads", 8)
# MobileNetV5 does not use image_mean/std
self.preprocessor_config["image_mean"] = [0.0 ,0.0 , 0.0]
self.preprocessor_config["image_std"] = [1.0 ,1.0 ,1.0]
self.hparams_vision["image_size"] = self.preprocessor_config.get(
"size", {"height": 768, "width": 768}
)["height"]
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length
# remap audio hparams
assert self.hparams_audio is not None
self.hparams_audio["n_layers"] = self.hparams_audio["conf_num_hidden_layers"]
self.hparams_audio["num_attention_heads"] = self.hparams_audio["conf_num_attention_heads"]
self.hparams_audio["feat_in"] = self.hparams_audio["input_feat_size"]
self.hparams_audio["intermediate_size"] = self.hparams_audio.get("intermediate_size", 6144)
def set_gguf_parameters(self):
super().set_gguf_parameters()
# vision params
self.gguf_writer.add_clip_vision_projector_type(gguf.VisionProjectorType.GEMMA3NV)
self.gguf_writer.add_vision_attention_layernorm_eps(self.hparams.get("layer_norm_eps", 1e-6))
# audio params
assert self.hparams_audio is not None
self.gguf_writer.add_clip_audio_projector_type(gguf.VisionProjectorType.GEMMA3NA)
self.gguf_writer.add_audio_num_mel_bins(self.hparams_audio["feat_in"])
self.gguf_writer.add_audio_attention_layernorm_eps(1e-5)
def tensor_force_quant(self, name, new_name, bid, n_dims):
# Force quantization settings for specific tensor types
if "input_projection" in name or "input_proj" in name:
return gguf.GGMLQuantizationType.F16
if ".embeddings." in name or "stem" in name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
def custom_map(self, name: str) -> str:
"""Parses names like model.vision_tower.timm_model.blocks.1.2.suffix and applies template mapping."""
parts = name.split(".")
# MobileNet blocks have at least 7 parts: model, vision_tower, timm_model, blocks, bid, sid, and suffix
if len(parts) >= 7:
bid, sid = parts[4], parts[5]
suffix = ".".join(parts[6:])
template = f"model.vision_tower.timm_model.blocks.{{bid}}.{{sid}}.{suffix}"
if template in self.block_tensor_mapping:
return self.block_tensor_mapping[template].format(bid=bid, sid=sid)
raise ValueError(f"Unknown name: {name}")
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if (ConformerAudioModel.is_audio_tensor(name)):
name = name.replace("model.audio_tower.conformer.", "conformer.layers.")
return super().modify_tensors(data_torch, name, bid)
# Gemma3n uses
# - model.embed_vision.* for projection layers
# - model.vision_tower.* for vision encoder
# Skip non-vision tensors
if not (name.startswith("model.embed_vision.") or name.startswith("model.vision_tower.")):
return []
if name.startswith("model.vision_tower.timm_model.blocks."):
# Double-indexed block tensors through custom logic
new_name = self.custom_map(name)
else:
# Route non-repeating tensors (conv_stem, msfa, embedding, etc.) and anything not caught above through tensor_mapping.py
new_name = self.map_tensor_name(name)
if new_name.endswith("conv_stem.conv.bias") or new_name.endswith("layer_scale.gamma"):
data_torch = data_torch.unsqueeze(0).unsqueeze(-1).unsqueeze(-1) # [1, C, 1, 1]
return [(new_name, data_torch)]
@ModelBase.register("Gemma3nForCausalLM", "Gemma3nForConditionalGeneration")
class Gemma3NModel(Gemma3Model):
    model_arch = gguf.MODEL_ARCH.GEMMA3N
    norm_shift = 0.0 # same value with Gemma3p5RMSNorm scale_shift on python code
@@ -6061,8 +6263,25 @@ class Gemma3NModel(Gemma3Model):
        ]

    def set_vocab(self):
# For Gemma3n multimodal models, we need the FULL vocab_size (262400)
# which includes special tokens from 262144-262399 for vision/audio.
# The vocab_size_per_layer_input (262144) is only the embedding size per layer.
# Temporarily override the hparams lookup order to prioritize vocab_size.
# Store original vocab_size_per_layer_input if it exists
vocab_size_per_layer_input = self.hparams.get("vocab_size_per_layer_input")
# Temporarily remove vocab_size_per_layer_input to force using vocab_size
if vocab_size_per_layer_input is not None:
del self.hparams["vocab_size_per_layer_input"]
# Call parent set_vocab which will now use vocab_size (262400)
        super().set_vocab()
# Restore vocab_size_per_layer_input for later use
if vocab_size_per_layer_input is not None:
self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_altup_active_idx(self.hparams["altup_active_idx"])
@@ -6098,8 +6317,32 @@ class Gemma3NModel(Gemma3Model):
        if "language_model." not in name:
            return [] # skip non-language model tensors
# Pad token embeddings for vision/audio special tokens (262144-262399)
if "embed_tokens.weight" in name or "embed_tokens_per_layer" in name:
# Move to CPU to avoid meta device issues during padding
data_torch = data_torch.to(device="cpu")
vocab_size = self.hparams.get("vocab_size", 262400)
current_size = data_torch.shape[0] # First dimension is vocab_size
if current_size < vocab_size:
# Pad with zeros for vision/audio tokens (they get embeddings from vision tower)
padding_size = vocab_size - current_size
tensor_type = "per-layer embeddings" if "per_layer" in name else "token embeddings"
logger.info(f"Padding {tensor_type} shape {list(data_torch.shape)} from {current_size} to {vocab_size} (adding {padding_size} vision/audio token slots)")
# Create padding with zeros (vision tokens won't use these embeddings)
padding = torch.zeros((padding_size, data_torch.shape[1]), dtype=data_torch.dtype, device=data_torch.device)
data_torch = torch.cat([data_torch, padding], dim=0)
# Continue with normal processing
name = name.replace("language_model.", "")
return [(self.map_tensor_name(name), data_torch)]
if "altup_unembed_projections" in name: if "altup_unembed_projections" in name:
data_torch = data_torch.to(device="cpu") data_torch = data_torch.to(device="cpu")
# altup_unembed matrices are [hidden_size, hidden_size], NOT vocab-based
# They should NOT be padded
if ".0." in name: if ".0." in name:
self._altup_unembd[0] = data_torch self._altup_unembd[0] = data_torch
elif ".1." in name: elif ".1." in name:
@@ -9936,7 +10179,7 @@ class LFM2Model(TextModel):
        self._add_feed_forward_length()

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
-        if self._is_vision_tensor(name) or self._is_audio_tensor(name):
+        if self._is_vision_tensor(name) or ConformerAudioModel.is_audio_tensor(name):
            # skip multimodal tensors
            return []

@@ -9952,9 +10195,6 @@ class LFM2Model(TextModel):
    def _is_vision_tensor(self, name: str) -> bool:
        return "vision_tower" in name or "multi_modal_projector" in name

-    def _is_audio_tensor(self, name: str):
-        return any(p in name for p in ["audio", "codebook", "conformer", "depth_embedding", "depthformer", "depth_linear"])

@ModelBase.register("Lfm2Model")
class LFM2ColBertModel(LFM2Model):

@@ -10082,13 +10322,11 @@ class LFM2VLModel(MmprojModel):
@ModelBase.register("Lfm2AudioForConditionalGeneration") @ModelBase.register("Lfm2AudioForConditionalGeneration")
class LFM2AudioModel(MmprojModel): class LFM2AudioModel(ConformerAudioModel):
has_vision_encoder = False has_vision_encoder = False
has_audio_encoder = True has_audio_encoder = True
model_name = "Lfm2AudioEncoder" model_name = "Lfm2AudioEncoder"
_batch_norm_tensors: list[dict[str, Tensor]] | None = None
def get_audio_config(self) -> dict[str, Any] | None: def get_audio_config(self) -> dict[str, Any] | None:
return self.global_config.get("encoder") return self.global_config.get("encoder")
@ -10102,12 +10340,7 @@ class LFM2AudioModel(MmprojModel):
self.gguf_writer.add_audio_num_mel_bins(self.hparams_audio["feat_in"]) self.gguf_writer.add_audio_num_mel_bins(self.hparams_audio["feat_in"])
self.gguf_writer.add_audio_attention_layernorm_eps(1e-5) self.gguf_writer.add_audio_attention_layernorm_eps(1e-5)
def tensor_force_quant(self, name, new_name, bid, n_dims): def modify_tensors(self, data_torch, name, bid):
if ".conv" in name and ".weight" in name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# skip language model tensors # skip language model tensors
if name.startswith("lfm."): if name.startswith("lfm."):
return [] return []
@ -10120,40 +10353,7 @@ class LFM2AudioModel(MmprojModel):
if any(p in name for p in ["codebook_offsets", "depth_embeddings", "depth_linear", "depthformer"]): if any(p in name for p in ["codebook_offsets", "depth_embeddings", "depth_linear", "depthformer"]):
return [] return []
# fold running_mean, running_var and eps into weight and bias for batch_norm return super().modify_tensors(data_torch, name, bid)
if "batch_norm" in name:
if self._batch_norm_tensors is None:
self._batch_norm_tensors = [{} for _ in range(self.block_count)]
assert bid is not None
self._batch_norm_tensors[bid][name] = data_torch
if len(self._batch_norm_tensors[bid]) < 5:
return []
weight = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.weight"]
bias = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.bias"]
running_mean = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_mean"]
running_var = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_var"]
eps = 1e-5 # default value
a = weight / torch.sqrt(running_var + eps)
b = bias - running_mean * a
return [
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.weight"), a),
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.bias"), b),
]
# reshape conv weights
if name.startswith("conformer.pre_encode.conv.") and name.endswith(".bias"):
data_torch = data_torch[:, None, None]
if "conv.depthwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[1] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[2])
if "conv.pointwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[2] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[1])
return [(self.map_tensor_name(name), data_torch)]
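The removed block above folds the BatchNorm statistics into a single affine pair, a = weight / sqrt(running_var + eps) and b = bias - running_mean * a; per this diff the same folding now appears to live in the shared ConformerAudioModel base class that LFM2AudioModel delegates to. A small self-contained check of that identity against PyTorch's own batch_norm (all sizes made up):

```python
import torch
import torch.nn.functional as F

C = 16
x = torch.randn(4, C)                                  # (N, C) activations, values arbitrary
weight, bias = torch.randn(C), torch.randn(C)
running_mean, running_var = torch.randn(C), torch.rand(C) + 0.5
eps = 1e-5

# Folded affine parameters, as in the removed conversion code
a = weight / torch.sqrt(running_var + eps)
b = bias - running_mean * a

y_folded = x * a + b
y_bn = F.batch_norm(x, running_mean, running_var, weight, bias, training=False, eps=eps)

print(torch.allclose(y_folded, y_bn, atol=1e-6))       # True: the fold is exact up to rounding
```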
@ModelBase.register("SmallThinkerForCausalLM") @ModelBase.register("SmallThinkerForCausalLM")
@ -10975,7 +11175,7 @@ def parse_args() -> argparse.Namespace:
parser.add_argument( parser.add_argument(
"--sentence-transformers-dense-modules", action="store_true", "--sentence-transformers-dense-modules", action="store_true",
help=("Whether to include sentence-transformers dense modules. " help=("Whether to include sentence-transformers dense modules. "
"It can be used for sentence-transformers models, like google/embeddinggemma-300m" "It can be used for sentence-transformers models, like google/embeddinggemma-300m. "
"Default these modules are not included.") "Default these modules are not included.")
) )
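For reference, the flag documented in this hunk is opt-in. Assuming the usual `convert_hf_to_gguf.py` entry point (the script name is not shown in this excerpt), a hypothetical invocation would be `python convert_hf_to_gguf.py <model_dir> --sentence-transformers-dense-modules`.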


@@ -57,7 +57,6 @@ Legend:
| GET_ROWS | ❌ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| GET_ROWS_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GROUP_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
-| GROUP_NORM_MUL_ADD | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| HARDSIGMOID | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| HARDSWISH | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| IM2COL | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
@@ -71,10 +70,9 @@ Legend:
| MUL_MAT_ID | ❌ | 🟡 | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ | ❌ |
| NEG | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | ❌ | ❌ | ❌ |
-| NORM_MUL_ADD | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| OPT_STEP_ADAMW | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| OPT_STEP_SGD | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
-| OUT_PROD | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ | |
+| OUT_PROD | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ | 🟡 |
| PAD | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ✅ | ❌ | ❌ | ❌ |
| PAD_REFLECT_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| POOL_2D | ❌ | 🟡 | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
@@ -99,7 +97,6 @@ Legend:
| SILU | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SILU_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| SIN | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | ✅ | 🟡 | ❌ | ❌ | ❌ |
-| SOFTCAP | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| SOFTPLUS | ❌ | ❌ | ✅ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ |
| SOFT_MAX | ❌ | 🟡 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SOFT_MAX_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |


@@ -965,6 +965,7 @@
"BLAS","IM2COL","type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1","support","0","no","BLAS"
"BLAS","IM2COL","type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1","support","0","no","BLAS"
"BLAS","IM2COL","type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[5,5,1,32],ne_kernel=[3,4,1,32],s0=1,s1=1,p0=0,p1=0,d0=1,d1=1,is_2D=1","support","0","no","BLAS"
"BLAS","IM2COL","type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[2,2,1536,729],ne_kernel=[2,2,1536,4096],s0=1,s1=1,p0=0,p1=0,d0=1,d1=1,is_2D=1","support","0","no","BLAS"
"BLAS","IM2COL_3D","type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[10,10,10,9],ne_kernel=[3,3,3,1],IC=3,s0=1,s1=1,s2=1,p0=1,p1=1,p2=1,d0=1,d1=1,d2=1,v=0","support","0","no","BLAS" "BLAS","IM2COL_3D","type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[10,10,10,9],ne_kernel=[3,3,3,1],IC=3,s0=1,s1=1,s2=1,p0=1,p1=1,p2=1,d0=1,d1=1,d2=1,v=0","support","0","no","BLAS"
"BLAS","IM2COL_3D","type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[10,10,10,9],ne_kernel=[3,3,3,1],IC=3,s0=1,s1=1,s2=1,p0=1,p1=1,p2=1,d0=1,d1=1,d2=1,v=0","support","0","no","BLAS" "BLAS","IM2COL_3D","type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[10,10,10,9],ne_kernel=[3,3,3,1],IC=3,s0=1,s1=1,s2=1,p0=1,p1=1,p2=1,d0=1,d1=1,d2=1,v=0","support","0","no","BLAS"
"BLAS","IM2COL_3D","type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[10,10,10,9],ne_kernel=[3,3,3,1],IC=3,s0=1,s1=1,s2=1,p0=1,p1=1,p2=1,d0=1,d1=1,d2=1,v=0","support","0","no","BLAS" "BLAS","IM2COL_3D","type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[10,10,10,9],ne_kernel=[3,3,3,1],IC=3,s0=1,s1=1,s2=1,p0=1,p1=1,p2=1,d0=1,d1=1,d2=1,v=0","support","0","no","BLAS"
@@ -4964,6 +4965,7 @@
"BLAS","CONV_TRANSPOSE_1D","ne_input=[2,1,1,1],ne_kernel=[3,1,1,1],s0=1,p0=0,d0=1","support","0","no","BLAS"
"BLAS","CONV_TRANSPOSE_2D","ne_input=[3,2,3,1],ne_kernel=[2,2,1,3],stride=1","support","0","no","BLAS"
"BLAS","CONV_TRANSPOSE_2D","ne_input=[10,10,9,1],ne_kernel=[3,3,1,9],stride=2","support","0","no","BLAS"
"BLAS","CONV_TRANSPOSE_2D","ne_input=[129,63,35,1],ne_kernel=[3,3,48,35],stride=1","support","0","no","BLAS"
"BLAS","COUNT_EQUAL","type=f32,ne=[4,500,1,1]","support","0","no","BLAS" "BLAS","COUNT_EQUAL","type=f32,ne=[4,500,1,1]","support","0","no","BLAS"
"BLAS","COUNT_EQUAL","type=f32,ne=[4,5000,1,1]","support","0","no","BLAS" "BLAS","COUNT_EQUAL","type=f32,ne=[4,5000,1,1]","support","0","no","BLAS"
"BLAS","ARGMAX","type=f32,ne=[32,1,1,1]","support","0","no","BLAS" "BLAS","ARGMAX","type=f32,ne=[32,1,1,1]","support","0","no","BLAS"
@@ -5715,15 +5717,15 @@
"BLAS","L2_NORM","type=f32,ne=[64,5,4,3]","support","0","no","BLAS"
"BLAS","RMS_NORM","type=f32,ne=[64,5,4,3],v=0,eps=0.000001,inplace=1","support","0","no","BLAS"
"BLAS","L2_NORM","type=f32,ne=[64,5,4,3]","support","0","no","BLAS"
-"BLAS","SSM_CONV","type=f32,ne_a=[4,1024,1,1],ne_b=[3,1024,1,1]","support","0","no","BLAS"
-"BLAS","SSM_CONV","type=f32,ne_a=[8,1024,1,1],ne_b=[3,1024,1,1]","support","0","no","BLAS"
-"BLAS","SSM_CONV","type=f32,ne_a=[4,1024,4,1],ne_b=[3,1024,1,1]","support","0","no","BLAS"
-"BLAS","SSM_CONV","type=f32,ne_a=[4,1536,1,1],ne_b=[3,1536,1,1]","support","0","no","BLAS"
-"BLAS","SSM_CONV","type=f32,ne_a=[8,1536,1,1],ne_b=[3,1536,1,1]","support","0","no","BLAS"
-"BLAS","SSM_CONV","type=f32,ne_a=[4,1536,4,1],ne_b=[3,1536,1,1]","support","0","no","BLAS"
-"BLAS","SSM_CONV","type=f32,ne_a=[4,2048,1,1],ne_b=[3,2048,1,1]","support","0","no","BLAS"
-"BLAS","SSM_CONV","type=f32,ne_a=[8,2048,1,1],ne_b=[3,2048,1,1]","support","0","no","BLAS"
-"BLAS","SSM_CONV","type=f32,ne_a=[4,2048,4,1],ne_b=[3,2048,1,1]","support","0","no","BLAS"
+"BLAS","SSM_CONV","type=f32,ne_a=[3,1024,1,1],ne_b=[3,1024,1,1]","support","0","no","BLAS"
+"BLAS","SSM_CONV","type=f32,ne_a=[6,1024,1,1],ne_b=[3,1024,1,1]","support","0","no","BLAS"
+"BLAS","SSM_CONV","type=f32,ne_a=[3,1024,4,1],ne_b=[3,1024,1,1]","support","0","no","BLAS"
+"BLAS","SSM_CONV","type=f32,ne_a=[3,1536,1,1],ne_b=[3,1536,1,1]","support","0","no","BLAS"
+"BLAS","SSM_CONV","type=f32,ne_a=[6,1536,1,1],ne_b=[3,1536,1,1]","support","0","no","BLAS"
+"BLAS","SSM_CONV","type=f32,ne_a=[3,1536,4,1],ne_b=[3,1536,1,1]","support","0","no","BLAS"
+"BLAS","SSM_CONV","type=f32,ne_a=[3,2048,1,1],ne_b=[3,2048,1,1]","support","0","no","BLAS"
+"BLAS","SSM_CONV","type=f32,ne_a=[6,2048,1,1],ne_b=[3,2048,1,1]","support","0","no","BLAS"
+"BLAS","SSM_CONV","type=f32,ne_a=[3,2048,4,1],ne_b=[3,2048,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[4,1024,1,1],ne_b=[4,1024,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[8,1024,1,1],ne_b=[4,1024,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[4,1024,4,1],ne_b=[4,1024,1,1]","support","0","no","BLAS"
@@ -5733,6 +5735,15 @@
"BLAS","SSM_CONV","type=f32,ne_a=[4,2048,1,1],ne_b=[4,2048,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[8,2048,1,1],ne_b=[4,2048,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[4,2048,4,1],ne_b=[4,2048,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[9,1024,1,1],ne_b=[9,1024,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[18,1024,1,1],ne_b=[9,1024,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[9,1024,4,1],ne_b=[9,1024,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[9,1536,1,1],ne_b=[9,1536,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[18,1536,1,1],ne_b=[9,1536,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[9,1536,4,1],ne_b=[9,1536,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[9,2048,1,1],ne_b=[9,2048,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[18,2048,1,1],ne_b=[9,2048,1,1]","support","0","no","BLAS"
"BLAS","SSM_CONV","type=f32,ne_a=[9,2048,4,1],ne_b=[9,2048,1,1]","support","0","no","BLAS"
"BLAS","SSM_SCAN","type=f32,d_state=16,head_dim=1,n_head=1024,n_group=1,n_seq_tokens=32,n_seqs=4","support","0","no","BLAS" "BLAS","SSM_SCAN","type=f32,d_state=16,head_dim=1,n_head=1024,n_group=1,n_seq_tokens=32,n_seqs=4","support","0","no","BLAS"
"BLAS","SSM_SCAN","type=f32,d_state=128,head_dim=64,n_head=16,n_group=2,n_seq_tokens=32,n_seqs=4","support","0","no","BLAS" "BLAS","SSM_SCAN","type=f32,d_state=128,head_dim=64,n_head=16,n_group=2,n_seq_tokens=32,n_seqs=4","support","0","no","BLAS"
"BLAS","SSM_SCAN","type=f32,d_state=256,head_dim=64,n_head=8,n_group=2,n_seq_tokens=32,n_seqs=4","support","0","no","BLAS" "BLAS","SSM_SCAN","type=f32,d_state=256,head_dim=64,n_head=8,n_group=2,n_seq_tokens=32,n_seqs=4","support","0","no","BLAS"
@ -6592,6 +6603,30 @@
"BLAS","MUL_MAT","type_a=f16,type_b=f32,m=1056,n=1,k=67,bs=[1,1],nr=[4,1],per=[0,2,1,3],k_v=0,o=1","support","0","no","BLAS" "BLAS","MUL_MAT","type_a=f16,type_b=f32,m=1056,n=1,k=67,bs=[1,1],nr=[4,1],per=[0,2,1,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=f32,type_b=f32,m=64,n=77,k=77,bs=[12,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","1","yes","BLAS" "BLAS","MUL_MAT","type_a=f32,type_b=f32,m=64,n=77,k=77,bs=[12,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","1","yes","BLAS"
"BLAS","MUL_MAT","type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","1","yes","BLAS" "BLAS","MUL_MAT","type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","1","yes","BLAS"
"BLAS","MUL_MAT","type_a=q4_0,type_b=f32,m=1,n=2048,k=8192,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=f32,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=f16,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=bf16,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q4_0,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q4_1,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q5_0,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q5_1,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q8_0,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=mxfp4,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q2_K,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q3_K,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q4_K,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q5_K,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=q6_K,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=iq2_xxs,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=iq2_xs,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=iq2_s,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=iq3_xxs,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=iq1_s,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=iq1_m,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=iq4_nl,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=iq3_s,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=iq4_xs,type_b=f32,m=1,n=64,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=f16,type_b=f32,m=1056,n=1,k=128,bs=[1,1],nr=[1,1],per=[0,2,1,3],k_v=0,o=1","support","0","no","BLAS" "BLAS","MUL_MAT","type_a=f16,type_b=f32,m=1056,n=1,k=128,bs=[1,1],nr=[1,1],per=[0,2,1,3],k_v=0,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=f16,type_b=f32,m=128,n=1,k=1056,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=2112,o=1","support","0","no","BLAS" "BLAS","MUL_MAT","type_a=f16,type_b=f32,m=128,n=1,k=1056,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=2112,o=1","support","0","no","BLAS"
"BLAS","MUL_MAT","type_a=bf16,type_b=f32,m=1056,n=1,k=128,bs=[1,1],nr=[1,1],per=[0,2,1,3],k_v=0,o=1","support","0","no","BLAS" "BLAS","MUL_MAT","type_a=bf16,type_b=f32,m=1056,n=1,k=128,bs=[1,1],nr=[1,1],per=[0,2,1,3],k_v=0,o=1","support","0","no","BLAS"
@ -8916,6 +8951,11 @@
"BLAS","SOFT_MAX","type=f32,ne=[32,2,32,1],mask=1,sinks=0,m_prec=f16,nr23=[1,1],scale=0.100000,max_bias=0.000000,inplace=0","support","0","no","BLAS" "BLAS","SOFT_MAX","type=f32,ne=[32,2,32,1],mask=1,sinks=0,m_prec=f16,nr23=[1,1],scale=0.100000,max_bias=0.000000,inplace=0","support","0","no","BLAS"
"BLAS","SOFT_MAX","type=f32,ne=[32,2,32,1],mask=1,sinks=1,m_prec=f32,nr23=[1,1],scale=0.100000,max_bias=8.000000,inplace=0","support","0","no","BLAS" "BLAS","SOFT_MAX","type=f32,ne=[32,2,32,1],mask=1,sinks=1,m_prec=f32,nr23=[1,1],scale=0.100000,max_bias=8.000000,inplace=0","support","0","no","BLAS"
"BLAS","SOFT_MAX","type=f32,ne=[32,2,32,1],mask=1,sinks=1,m_prec=f16,nr23=[1,1],scale=0.100000,max_bias=8.000000,inplace=0","support","0","no","BLAS" "BLAS","SOFT_MAX","type=f32,ne=[32,2,32,1],mask=1,sinks=1,m_prec=f16,nr23=[1,1],scale=0.100000,max_bias=8.000000,inplace=0","support","0","no","BLAS"
"BLAS","SOFT_MAX","type=f32,ne=[200001,2,3,1],mask=1,sinks=1,m_prec=f32,nr23=[1,1],scale=0.100000,max_bias=8.000000,inplace=0","support","0","no","BLAS"
"BLAS","SOFT_MAX","type=f32,ne=[200001,2,3,1],mask=1,sinks=1,m_prec=f16,nr23=[1,1],scale=0.100000,max_bias=8.000000,inplace=0","support","0","no","BLAS"
"BLAS","SOFT_MAX","type=f32,ne=[200000,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0","support","0","no","BLAS"
"BLAS","SOFT_MAX","type=f32,ne=[200000,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0","support","0","no","BLAS"
"BLAS","SOFT_MAX","type=f32,ne=[643251,3,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0","support","0","no","BLAS"
"BLAS","SOFT_MAX_BACK","type=f32,ne=[16,16,1,1],scale=1.000000,max_bias=0.000000","support","0","no","BLAS" "BLAS","SOFT_MAX_BACK","type=f32,ne=[16,16,1,1],scale=1.000000,max_bias=0.000000","support","0","no","BLAS"
"BLAS","SOFT_MAX_BACK","type=f32,ne=[15,15,1,1],scale=1.000000,max_bias=0.000000","support","0","no","BLAS" "BLAS","SOFT_MAX_BACK","type=f32,ne=[15,15,1,1],scale=1.000000,max_bias=0.000000","support","0","no","BLAS"
"BLAS","SOFT_MAX_BACK","type=f32,ne=[16,16,2,3],scale=1.000000,max_bias=0.000000","support","0","no","BLAS" "BLAS","SOFT_MAX_BACK","type=f32,ne=[16,16,2,3],scale=1.000000,max_bias=0.000000","support","0","no","BLAS"
@ -8968,6 +9008,7 @@
"BLAS","ROPE","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
@ -8977,6 +9018,7 @@
"BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
@ -8987,11 +9029,13 @@
"BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
@ -9001,6 +9045,7 @@
"BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
@ -9011,11 +9056,13 @@
"BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
@ -9025,6 +9072,7 @@
"BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
@ -9035,11 +9083,13 @@
"BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
@ -9049,6 +9099,7 @@
"BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
@ -9059,6 +9110,7 @@
"BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f16,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f16,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE","type=f16,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE","type=f16,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
@ -9184,6 +9236,7 @@
"BLAS","ROPE_BACK","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
@ -9193,6 +9246,7 @@
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
@ -9203,11 +9257,13 @@
"BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
@ -9217,6 +9273,7 @@
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
@ -9227,11 +9284,13 @@
"BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
@@ -9241,6 +9300,7 @@
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
@@ -9251,11 +9311,13 @@
"BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,40,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,52,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,64,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,1,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,71,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,8,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
@@ -9265,6 +9327,7 @@
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=20,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,2,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,32,4,1],n_dims=32,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=128,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,12,2,1],n_dims=20,mode=8,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
@@ -9275,6 +9338,7 @@
"BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,28,2,1],n_dims=32,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[80,16,2,1],n_dims=80,mode=24,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[128,16,2,1],n_dims=128,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[16,16,8192,1],n_dims=16,mode=40,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f32,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=1,v=1,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f16,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f16,ne_a=[128,32,2,1],n_dims=128,mode=0,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
"BLAS","ROPE_BACK","type=f16,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS" "BLAS","ROPE_BACK","type=f16,ne_a=[64,128,2,1],n_dims=64,mode=2,n_ctx=512,fs=1.000000,ef=0.000000,af=1.000000,ff=0,v=0,inplace=0","support","0","no","BLAS"
@@ -9542,333 +9606,333 @@
"BLAS","ARGSORT","type=f32,ne=[2048,2,1,3],order=1","support","0","no","BLAS" "BLAS","ARGSORT","type=f32,ne=[2048,2,1,3],order=1","support","0","no","BLAS"
"BLAS","ARGSORT","type=f32,ne=[2049,2,1,3],order=1","support","0","no","BLAS" "BLAS","ARGSORT","type=f32,ne=[2049,2,1,3],order=1","support","0","no","BLAS"
"BLAS","ARGSORT","type=f32,ne=[2,8,8192,1],order=1","support","0","no","BLAS" "BLAS","ARGSORT","type=f32,ne=[2,8,8192,1],order=1","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[12,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[12,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[13,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[13,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[13,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[13,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[15,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[15,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[15,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[15,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[15,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[15,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[19,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[19,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[19,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[19,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[19,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[19,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[19,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[19,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[27,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[43,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[64,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[75,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[128,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[139,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[256,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[267,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[512,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[523,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1035,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2059,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4096,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[4107,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8192,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[8203,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16395,1,2,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32768,1,1,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[32779,1,2,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65536,1,1,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[65547,1,2,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131072,1,1,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[131083,1,2,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262144,1,1,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[262155,1,2,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=100","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=100,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=500","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=500,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=1023","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=1023,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524288,1,1,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=9999","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[524299,1,2,1],k=9999,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=1","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=1,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=2","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=2,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=3","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=3,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=7","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=7,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16,10,10,10],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[60,10,10,10],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1023,2,1,3],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1024,2,1,3],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[1025,2,1,3],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[16384,1,1,1],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2047,2,1,3],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2048,2,1,3],k=15,ties=0","support","0","no","BLAS"
"BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=15","support","0","no","BLAS" "BLAS","TOP_K","type=f32,ne=[2049,2,1,3],k=15,ties=0","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=nearest,transpose=0","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=nearest,transpose=0","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=nearest,transpose=1","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=nearest,transpose=1","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=nearest,flags=none","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=nearest","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=nearest,flags=none","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=nearest","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bilinear,transpose=0","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bilinear,transpose=0","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bilinear,transpose=1","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bilinear,transpose=1","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear,flags=none","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bilinear,flags=none","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bilinear","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bicubic,transpose=0","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bicubic,transpose=0","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bicubic,transpose=1","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bicubic,transpose=1","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bicubic,flags=none","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bicubic","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bicubic,flags=none","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bicubic","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=513,transpose=0","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bilinear|antialias,transpose=0","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=513,transpose=1","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bilinear|antialias,transpose=1","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear,flags=none","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear|antialias","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bilinear,flags=none","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bilinear|antialias","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear,flags=align_corners","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear|align_corners","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[1,4,3,2],ne_tgt=[2,8,3,2],mode=bilinear,flags=align_corners","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[1,4,3,2],ne_tgt=[2,8,3,2],mode=bilinear|align_corners","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[4,1,3,2],ne_tgt=[1,1,3,2],mode=bilinear,flags=align_corners","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[4,1,3,2],ne_tgt=[1,1,3,2],mode=bilinear|align_corners","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bicubic,flags=align_corners","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bicubic|align_corners","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[1,4,3,2],ne_tgt=[2,8,3,2],mode=bicubic,flags=align_corners","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[1,4,3,2],ne_tgt=[2,8,3,2],mode=bicubic|align_corners","support","0","no","BLAS"
"BLAS","UPSCALE","type=f32,ne=[4,1,3,2],ne_tgt=[1,1,3,2],mode=bicubic,flags=align_corners","support","0","no","BLAS" "BLAS","UPSCALE","type=f32,ne=[4,1,3,2],ne_tgt=[1,1,3,2],mode=bicubic|align_corners","support","0","no","BLAS"
"BLAS","SUM","type=f32,ne=[10,5,4,3]","support","0","no","BLAS" "BLAS","SUM","type=f32,ne=[10,5,4,3]","support","0","no","BLAS"
"BLAS","SUM_ROWS","type=f32,ne=[10,5,4,3],permute=0,slice=0","support","0","no","BLAS" "BLAS","SUM_ROWS","type=f32,ne=[10,5,4,3],permute=0,slice=0","support","0","no","BLAS"
"BLAS","SUM","type=f32,ne=[11,5,6,3],permute=[0,2,1,3]","support","0","no","BLAS" "BLAS","SUM","type=f32,ne=[11,5,6,3],permute=[0,2,1,3]","support","0","no","BLAS"
@ -9891,8 +9955,9 @@
"BLAS","GROUP_NORM","type=f32,ne=[64,64,320,1],num_groups=32,eps=0.000001","support","0","no","BLAS" "BLAS","GROUP_NORM","type=f32,ne=[64,64,320,1],num_groups=32,eps=0.000001","support","0","no","BLAS"
"BLAS","GROUP_NORM","type=f32,ne=[9,9,1280,1],num_groups=32,eps=0.000001","support","0","no","BLAS" "BLAS","GROUP_NORM","type=f32,ne=[9,9,1280,1],num_groups=32,eps=0.000001","support","0","no","BLAS"
"BLAS","ACC","type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1]","support","0","no","BLAS" "BLAS","ACC","type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1]","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[512,512,1,1],pad_0=1,pad_1=1","support","0","no","BLAS" "BLAS","PAD","type=f32,ne_a=[512,512,1,1],pad_0=1,pad_1=1,circular=0","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[512,512,3,1],lp0=1,rp0=1,lp1=1,rp1=1,lp2=1,rp2=1,lp3=1,rp3=1,v=0","support","0","no","BLAS" "BLAS","PAD","type=f32,ne_a=[33,17,2,1],pad_0=4,pad_1=3,circular=1","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[512,512,3,1],lp0=1,rp0=1,lp1=1,rp1=1,lp2=1,rp2=1,lp3=1,rp3=1,v=0,circular=0","support","0","no","BLAS"
"BLAS","PAD_REFLECT_1D","type=f32,ne_a=[512,34,2,1],pad_0=10,pad_1=9","support","0","no","BLAS" "BLAS","PAD_REFLECT_1D","type=f32,ne_a=[512,34,2,1],pad_0=10,pad_1=9","support","0","no","BLAS"
"BLAS","PAD_REFLECT_1D","type=f32,ne_a=[3000,384,4,1],pad_0=10,pad_1=9","support","0","no","BLAS" "BLAS","PAD_REFLECT_1D","type=f32,ne_a=[3000,384,4,1],pad_0=10,pad_1=9","support","0","no","BLAS"
"BLAS","ROLL","shift0=3,shift1=-2,shift3=1,shift4=-1","support","0","no","BLAS" "BLAS","ROLL","shift0=3,shift1=-2,shift3=1,shift4=-1","support","0","no","BLAS"
@ -9914,6 +9979,7 @@
"BLAS","CUMSUM","type=f32,ne=[2048,5,4,3]","support","0","no","BLAS" "BLAS","CUMSUM","type=f32,ne=[2048,5,4,3]","support","0","no","BLAS"
"BLAS","CUMSUM","type=f32,ne=[242004,1,1,1]","support","0","no","BLAS" "BLAS","CUMSUM","type=f32,ne=[242004,1,1,1]","support","0","no","BLAS"
"BLAS","CUMSUM","type=f32,ne=[375960,1,1,1]","support","0","no","BLAS" "BLAS","CUMSUM","type=f32,ne=[375960,1,1,1]","support","0","no","BLAS"
"BLAS","CUMSUM","type=f32,ne=[20481,4,1,1]","support","0","no","BLAS"
"BLAS","XIELU","type=f32,ne=[10,5,4,3]","support","0","no","BLAS" "BLAS","XIELU","type=f32,ne=[10,5,4,3]","support","0","no","BLAS"
"BLAS","TRI","type=f32,ne=[10,10,4,3],tri_type=3","support","0","no","BLAS" "BLAS","TRI","type=f32,ne=[10,10,4,3],tri_type=3","support","0","no","BLAS"
"BLAS","TRI","type=f32,ne=[10,10,4,3],tri_type=2","support","0","no","BLAS" "BLAS","TRI","type=f32,ne=[10,10,4,3],tri_type=2","support","0","no","BLAS"
@ -9923,17 +9989,41 @@
"BLAS","FILL","type=f32,ne=[303,207,11,3],c=2.000000","support","0","no","BLAS" "BLAS","FILL","type=f32,ne=[303,207,11,3],c=2.000000","support","0","no","BLAS"
"BLAS","FILL","type=f32,ne=[800,600,4,4],c=-152.000000","support","0","no","BLAS" "BLAS","FILL","type=f32,ne=[800,600,4,4],c=-152.000000","support","0","no","BLAS"
"BLAS","FILL","type=f32,ne=[2048,512,2,2],c=3.500000","support","0","no","BLAS" "BLAS","FILL","type=f32,ne=[2048,512,2,2],c=3.500000","support","0","no","BLAS"
"BLAS","DIAG","type=f32,ne=[10,1,4,3]","support","0","no","BLAS"
"BLAS","DIAG","type=f32,ne=[79,1,19,13]","support","0","no","BLAS"
"BLAS","DIAG","type=f32,ne=[256,1,8,16]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[10,10,4,3],ne_rhs=[3,10,4,3]","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[10,10,4,3],ne_rhs=[3,10,4,3]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[11,11,1,1],ne_rhs=[5,11,1,1]","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[11,11,1,1],ne_rhs=[5,11,1,1]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[17,17,2,4],ne_rhs=[9,17,2,4]","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[17,17,2,4],ne_rhs=[9,17,2,4]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[30,30,7,1],ne_rhs=[8,30,7,1]","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[30,30,7,1],ne_rhs=[8,30,7,1]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[42,42,5,2],ne_rhs=[10,42,5,2]","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[42,42,5,2],ne_rhs=[10,42,5,2]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[64,64,2,2],ne_rhs=[10,64,2,2]","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[64,64,2,2],ne_rhs=[10,64,2,2]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[64,64,2,2],ne_rhs=[64,64,2,2]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[79,79,5,3],ne_rhs=[417,79,5,3]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[128,128,4,2],ne_rhs=[32,128,4,2]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[80,80,2,8],ne_rhs=[80,80,2,8]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[80,80,2,8],ne_rhs=[79,80,2,8]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[80,80,2,8],ne_rhs=[81,80,2,8]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[80,80,8,8],ne_rhs=[80,80,8,8]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[80,80,8,8],ne_rhs=[79,80,8,8]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[80,80,8,8],ne_rhs=[81,80,8,8]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[84,84,4,4],ne_rhs=[32,84,4,4]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[95,95,8,8],ne_rhs=[40,95,8,8]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[100,100,4,4],ne_rhs=[41,100,4,4]","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[100,100,4,4],ne_rhs=[41,100,4,4]","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=0","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[128,128,4,4],ne_rhs=[31,128,4,4]","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=0","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[128,128,4,4],ne_rhs=[32,128,4,4]","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=1","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[128,128,3,4],ne_rhs=[32,128,3,4]","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=1","support","0","no","BLAS" "BLAS","SOLVE_TRI","type=f32,ne_lhs=[128,128,4,1],ne_rhs=[32,128,4,1]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[64,64,4,4],ne_rhs=[200,64,4,4]","support","0","no","BLAS"
"BLAS","SOLVE_TRI","type=f32,ne_lhs=[64,64,4,4],ne_rhs=[384,64,4,4]","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=0,circular=0","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=0,circular=0","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=0,circular=1","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=0,circular=1","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=1,circular=0","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=1,circular=0","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=1,circular=1","support","0","no","BLAS"
"BLAS","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=1,circular=1","support","0","no","BLAS"
"BLAS","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f32,permute=[0,1,2,3]","support","0","no","BLAS" "BLAS","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f32,permute=[0,1,2,3]","support","0","no","BLAS"
"BLAS","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f16,permute=[0,1,2,3]","support","0","no","BLAS" "BLAS","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f16,permute=[0,1,2,3]","support","0","no","BLAS"
"BLAS","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=bf16,permute=[0,1,2,3]","support","0","no","BLAS" "BLAS","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=bf16,permute=[0,1,2,3]","support","0","no","BLAS"

Can't render this file because it is too large.

File diff suppressed because it is too large

97
docs/preset.md Normal file
View File

@ -0,0 +1,97 @@
# llama.cpp INI Presets
## Introduction
The INI preset feature, introduced in [PR#17859](https://github.com/ggml-org/llama.cpp/pull/17859), allows users to create reusable and shareable parameter configurations for llama.cpp.
### Using Presets with the Server
When running multiple models on the server (router mode), INI preset files can be used to configure model-specific parameters. Please refer to the [server documentation](../tools/server/README.md) for more details.
### Using a Remote Preset
> [!NOTE]
>
> This feature is currently only supported via the `-hf` option.
For GGUF models hosted on Hugging Face, you can include a `preset.ini` file in the root directory of the repository to define specific configurations for that model.
Example:
```ini
hf-repo-draft = username/my-draft-model-GGUF
temp = 0.5
top-k = 20
top-p = 0.95
```
For security reasons, only certain options are allowed. Please refer to [preset.cpp](../common/preset.cpp) for the complete list of permitted options.
Example usage:
Assuming your repository `username/my-model-with-preset` contains a `preset.ini` with the configuration above:
```sh
llama-cli -hf username/my-model-with-preset
# This is equivalent to:
llama-cli -hf username/my-model-with-preset \
--hf-repo-draft username/my-draft-model-GGUF \
--temp 0.5 \
--top-k 20 \
--top-p 0.95
```
You can also override preset arguments by specifying them on the command line:
```sh
# Force temp = 0.1, overriding the preset value
llama-cli -hf username/my-model-with-preset --temp 0.1
```
If you want to define multiple preset configurations for one or more GGUF models, one option is to create a blank HF repo for each preset. Each repo should contain a `preset.ini` file that references the actual model(s):
```ini
hf-repo = user/my-model-main
hf-repo-draft = user/my-model-draft
temp = 0.8
ctx-size = 1024
; (and other configurations)
```
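Using such a per-preset repo is then just a matter of pointing `-hf` at it. A minimal sketch, assuming the blank preset repo above is published as `user/my-model-main-preset` (a hypothetical name):
```sh
# pulls preset.ini from the blank preset repo, which in turn points at user/my-model-main
llama-cli -hf user/my-model-main-preset
```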
### Named presets
Alternatively, you can keep multiple preset configurations in a single blank HF repo by giving each preset a named section in one `preset.ini` file that references the actual model(s):
```ini
[*]
mmap = 1
[gpt-oss-20b-hf]
hf = ggml-org/gpt-oss-20b-GGUF
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
chat-template-kwargs = {"reasoning_effort": "high"}
[gpt-oss-120b-hf]
hf = ggml-org/gpt-oss-120b-GGUF
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
chat-template-kwargs = {"reasoning_effort": "high"}
```
You can then use it via `llama-cli` or `llama-server`, for example:
```sh
llama-server -hf user/repo:gpt-oss-120b-hf
```
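For illustration, selecting the `gpt-oss-120b-hf` section above behaves much like passing the matching flags directly on the command line. A rough sketch of the expansion (the flag set simply mirrors the keys in that preset section):
```sh
llama-server -hf ggml-org/gpt-oss-120b-GGUF \
    --batch-size 2048 \
    --ubatch-size 2048 \
    --top-p 1.0 \
    --top-k 0 \
    --min-p 0.01 \
    --temp 1.0 \
    --chat-template-kwargs '{"reasoning_effort": "high"}'
```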
Please make sure to provide the correct `hf-repo` for each child preset; otherwise, you may get the error: `The specified tag is not a valid quantization scheme.`

View File

@ -21,7 +21,7 @@ int main(int argc, char ** argv) {
params.prompt = "Hello my name is"; params.prompt = "Hello my name is";
params.n_predict = 32; params.n_predict = 32;
if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_COMMON, print_usage)) { if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_BATCHED, print_usage)) {
return 1; return 1;
} }

View File

@ -57,11 +57,21 @@ struct callback_data {
} }
}; };
static bool has_pooling(llama_context * ctx) {
switch (llama_pooling_type(ctx)) {
case LLAMA_POOLING_TYPE_NONE:
case LLAMA_POOLING_TYPE_UNSPECIFIED:
return false;
default:
return true;
}
}
struct output_data { struct output_data {
float * data_ptr = nullptr; float * data_ptr = nullptr;
int data_size = 0; int data_size = 0;
std::string type_suffix; std::string type_suffix;
std::vector<float> storage; std::vector<float> embd_norm;
std::string prompt; std::string prompt;
std::vector<llama_token> tokens; std::vector<llama_token> tokens;
@ -74,23 +84,31 @@ struct output_data {
if (params.embedding) { if (params.embedding) {
const int n_embd = llama_model_n_embd_out(model); const int n_embd = llama_model_n_embd_out(model);
const bool pooling_enabled = llama_pooling_type(ctx) != LLAMA_POOLING_TYPE_NONE; const bool pooling = has_pooling(ctx);
const int n_embd_count = pooling_enabled ? 1 : tokens.size(); const int n_embd_count = pooling ? 1 : tokens.size();
const int n_embeddings = n_embd * n_embd_count; const int n_floats = n_embd * n_embd_count;
float * embeddings; float * embd_raw = pooling ? llama_get_embeddings_seq(ctx, 0) : llama_get_embeddings(ctx);
if (pooling_enabled) { if (embd_raw == nullptr) {
embeddings = llama_get_embeddings_seq(ctx, 0); throw std::runtime_error("failed to get embeddings from the model");
storage.resize(n_embeddings);
common_embd_normalize(embeddings, storage.data(), n_embeddings, params.embd_normalize);
embeddings = storage.data();
} else {
embeddings = llama_get_embeddings(ctx);
} }
data_ptr = embeddings; LOG_DBG("pooling_enabled: %s\n", pooling ? "true" : "false");
data_size = n_embeddings; LOG_DBG("n_embd: %d\n", n_embd);
LOG_DBG("n_floats: %d\n", n_floats);
LOG_DBG("n_embd_count: %d\n", n_embd_count);
data_ptr = embd_raw;
data_size = n_floats;
type_suffix = "-embeddings"; type_suffix = "-embeddings";
if (params.embd_normalize >= 0) {
embd_norm.resize(n_floats);
for (int i = 0; i < n_embd_count; i++) {
common_embd_normalize(embd_raw+i*n_embd, embd_norm.data()+i*n_embd, n_embd, params.embd_normalize);
}
data_ptr = embd_norm.data();
}
} else { } else {
const float * logits = llama_get_logits_ith(ctx, tokens.size() - 1); const float * logits = llama_get_logits_ith(ctx, tokens.size() - 1);
const int n_logits = llama_vocab_n_tokens(vocab); const int n_logits = llama_vocab_n_tokens(vocab);

View File

@ -553,6 +553,7 @@ int main(int argc, char ** argv) {
model_params.n_gpu_layers = params.n_gpu_layers; model_params.n_gpu_layers = params.n_gpu_layers;
model_params.devices = params.devices.data(); model_params.devices = params.devices.data();
model_params.use_mmap = params.use_mmap; model_params.use_mmap = params.use_mmap;
model_params.use_direct_io = params.use_direct_io;
model_params.use_mlock = params.use_mlock; model_params.use_mlock = params.use_mlock;
model_params.check_tensors = params.check_tensors; model_params.check_tensors = params.check_tensors;

View File

@ -61,7 +61,7 @@ causal-run-converted-model:
@CONVERTED_MODEL="$(CONVERTED_MODEL)" ./scripts/causal/run-converted-model.sh @CONVERTED_MODEL="$(CONVERTED_MODEL)" ./scripts/causal/run-converted-model.sh
causal-verify-logits: causal-run-original-model causal-run-converted-model causal-verify-logits: causal-run-original-model causal-run-converted-model
@./scripts/causal/compare-logits.py @MODEL_PATH="$(MODEL_PATH)" ./scripts/causal/compare-logits.py
@MODEL_PATH="$(MODEL_PATH)" ./scripts/utils/check-nmse.py -m ${MODEL_PATH} @MODEL_PATH="$(MODEL_PATH)" ./scripts/utils/check-nmse.py -m ${MODEL_PATH}
causal-run-original-embeddings: causal-run-original-embeddings:
@ -138,16 +138,13 @@ embedding-run-original-model-st: embedding-run-original-model
embedding-run-converted-model: embedding-run-converted-model:
@./scripts/embedding/run-converted-model.sh $(CONVERTED_EMBEDDING_MODEL) \ @./scripts/embedding/run-converted-model.sh $(CONVERTED_EMBEDDING_MODEL) \
$(if $(PROMPTS_FILE),--prompts-file "$(PROMPTS_FILE)") \ $(if $(PROMPTS_FILE),--prompts-file "$(PROMPTS_FILE)") \
$(if $(USE_POOLING),--pooling) $(if $(EMBD_NORMALIZE),--embd-normalize "$(EMBD_NORMALIZE)")
embedding-run-converted-model-st: USE_POOLING=1
embedding-run-converted-model-st: embedding-run-converted-model
embedding-verify-logits: embedding-run-original-model embedding-run-converted-model embedding-verify-logits: embedding-run-original-model embedding-run-converted-model
@./scripts/embedding/compare-embeddings-logits.sh \ @./scripts/embedding/compare-embeddings-logits.sh \
$(if $(PROMPTS_FILE),--prompts-file "$(PROMPTS_FILE)") $(if $(PROMPTS_FILE),--prompts-file "$(PROMPTS_FILE)")
embedding-verify-logits-st: embedding-run-original-model-st embedding-run-converted-model-st embedding-verify-logits-st: embedding-run-original-model-st embedding-run-converted-model
@./scripts/embedding/compare-embeddings-logits.sh \ @./scripts/embedding/compare-embeddings-logits.sh \
$(if $(PROMPTS_FILE),--prompts-file "$(PROMPTS_FILE)") $(if $(PROMPTS_FILE),--prompts-file "$(PROMPTS_FILE)")

View File

@ -198,14 +198,13 @@ model, and the other is a text file which allows for manual visual inspection.
#### Using SentenceTransformer with numbered layers #### Using SentenceTransformer with numbered layers
For models that have numbered SentenceTransformer layers (01_Pooling, 02_Dense, For models that have numbered SentenceTransformer layers (01_Pooling, 02_Dense,
03_Dense, 04_Normalize), use the `-st` targets to apply all these layers: 03_Dense, 04_Normalize), these will be applied automatically when running the
converted model but currently there is a separate target to run the original
version:
```console ```console
# Run original model with SentenceTransformer (applies all numbered layers) # Run original model with SentenceTransformer (applies all numbered layers)
(venv) $ make embedding-run-original-model-st (venv) $ make embedding-run-original-model-st
# Run converted model with pooling enabled
(venv) $ make embedding-run-converted-model-st
``` ```
This will use the SentenceTransformer library to load and run the model, which This will use the SentenceTransformer library to load and run the model, which
@ -213,6 +212,17 @@ automatically applies all the numbered layers in the correct order. This is
particularly useful when comparing with models that should include these particularly useful when comparing with models that should include these
additional transformation layers beyond just the base model output. additional transformation layers beyond just the base model output.
The normalization type can be specified for the converted model, but it is not
strictly necessary: the verification uses cosine similarity, so the magnitude
of the output vectors does not affect the result. The normalization type can
still be passed as an argument to the target, which might be useful for manual
inspection:
```console
(venv) $ make embedding-verify-logits-st EMBD_NORMALIZE=1
```
The original model will apply the normalization according to the normalization
layer specified in the modules.json configuration file.
### Model conversion ### Model conversion
After updates have been made to [gguf-py](../../gguf-py) to add support for the After updates have been made to [gguf-py](../../gguf-py) to add support for the
new model the model can be converted to GGUF format using the following command: new model the model can be converted to GGUF format using the following command:

View File

@ -3,10 +3,11 @@
import sys import sys
import numpy as np import numpy as np
from pathlib import Path from pathlib import Path
import os
# Add utils directory to path for direct script execution # Add utils directory to path for direct script execution
sys.path.insert(0, str(Path(__file__).parent.parent / "utils")) sys.path.insert(0, str(Path(__file__).parent.parent / "utils"))
from common import get_model_name_from_env_path, compare_tokens # type: ignore[import-not-found] from common import get_model_name_from_env_path, compare_tokens, exit_with_warning # type: ignore[import-not-found]
def quick_logits_check(pytorch_file, llamacpp_file): def quick_logits_check(pytorch_file, llamacpp_file):
"""Lightweight sanity check before NMSE""" """Lightweight sanity check before NMSE"""
@ -38,6 +39,7 @@ def quick_logits_check(pytorch_file, llamacpp_file):
return True return True
def main(): def main():
model_path = os.environ.get('MODEL_PATH')
model_name = get_model_name_from_env_path('MODEL_PATH') model_name = get_model_name_from_env_path('MODEL_PATH')
data_dir = Path("data") data_dir = Path("data")
pytorch_file = data_dir / f"pytorch-{model_name}.bin" pytorch_file = data_dir / f"pytorch-{model_name}.bin"
@ -62,8 +64,7 @@ def main():
print("🔍 Token Comparison Check") print("🔍 Token Comparison Check")
print("=" * 40) print("=" * 40)
if not compare_tokens(f"pytorch-{model_name}", f"llamacpp-{llamacpp_model_name}"): if not compare_tokens(f"pytorch-{model_name}", f"llamacpp-{llamacpp_model_name}"):
print("\n❌ Token mismatch detected") exit_with_warning("\n❌ Token mismatch detected", model_path)
sys.exit(1)
print() print()
print("🔍 GGML Model Validation for model ", model_name) print("🔍 GGML Model Validation for model ", model_name)
@ -80,8 +81,7 @@ def main():
print(" Ok to proceed with NMSE check...") print(" Ok to proceed with NMSE check...")
sys.exit(0) sys.exit(0)
else: else:
print(f"❌ NOK: Top 10 predictions don't match - generation will differ") exit_with_warning(f"❌ NOK: Top 10 predictions don't match - generation will differ", model_path)
sys.exit(1)
if __name__ == "__main__": if __name__ == "__main__":
main() main()

View File

@ -7,7 +7,7 @@ base_model:
Recommended way to run this model: Recommended way to run this model:
```sh ```sh
llama-server -hf {namespace}/{model_name}-GGUF -c 0 llama-server -hf {namespace}/{model_name}-GGUF
``` ```
Then, access http://localhost:8080 Then, access http://localhost:8080

View File

@ -5,7 +5,7 @@ set -e
# Parse command line arguments # Parse command line arguments
CONVERTED_MODEL="" CONVERTED_MODEL=""
PROMPTS_FILE="" PROMPTS_FILE=""
USE_POOLING="" EMBD_NORMALIZE="2"
while [[ $# -gt 0 ]]; do while [[ $# -gt 0 ]]; do
case $1 in case $1 in
@ -13,9 +13,9 @@ while [[ $# -gt 0 ]]; do
PROMPTS_FILE="$2" PROMPTS_FILE="$2"
shift 2 shift 2
;; ;;
--pooling) --embd-normalize)
USE_POOLING="1" EMBD_NORMALIZE="$2"
shift shift 2
;; ;;
*) *)
if [ -z "$CONVERTED_MODEL" ]; then if [ -z "$CONVERTED_MODEL" ]; then
@ -51,8 +51,4 @@ fi
echo $CONVERTED_MODEL echo $CONVERTED_MODEL
cmake --build ../../build --target llama-debug -j8 cmake --build ../../build --target llama-debug -j8
if [ -n "$USE_POOLING" ]; then ../../build/bin/llama-debug -m "$CONVERTED_MODEL" --embedding -p "$PROMPT" --save-logits --embd-normalize $EMBD_NORMALIZE
../../build/bin/llama-debug -m "$CONVERTED_MODEL" --embedding --pooling mean -p "$PROMPT" --save-logits
else
../../build/bin/llama-debug -m "$CONVERTED_MODEL" --embedding --pooling none -p "$PROMPT" --save-logits
fi

View File

@ -3,6 +3,9 @@
import os import os
import sys import sys
import torch import torch
import transformers
import json
import textwrap
import numpy as np import numpy as np
from pathlib import Path from pathlib import Path
@ -243,3 +246,54 @@ def compare_tokens(original, converted, type_suffix="", output_dir="data"):
print(f" ... and {len(mismatches) - num_to_show} more mismatches") print(f" ... and {len(mismatches) - num_to_show} more mismatches")
return False return False
def show_version_warning(current_version, model_version):
if not model_version:
return False
try:
from packaging.version import parse, InvalidVersion
try:
return parse(current_version) < parse(model_version)
except InvalidVersion:
return current_version != model_version
except ImportError:
return current_version != model_version
def get_model_transformers_version(model_path):
if not model_path:
return None
config_path = Path(model_path) / "config.json"
if not config_path.is_file():
return None
try:
with open(config_path, "r", encoding="utf-8") as f:
config = json.load(f)
return config.get("transformers_version")
except (IOError, json.JSONDecodeError) as e:
print(f"Warning: Could not read or parse {config_path}: {e}", file=sys.stderr)
return None
def exit_with_warning(message, model_path):
print(message)
if model_path and transformers is not None:
model_transformers_version = get_model_transformers_version(model_path)
transformers_version = transformers.__version__
if show_version_warning(transformers_version, model_transformers_version):
warning_message = f"""
=====================================================================
Verification failure might be due to a transformers version mismatch:
Current transformers version: {transformers_version}
Model's required version : {model_transformers_version}
Consider installing the version specified by the model's config:
pip install transformers=={model_transformers_version}
=====================================================================
"""
print(textwrap.dedent(warning_message))
sys.exit(1)

View File

@ -7,7 +7,7 @@ import importlib
from pathlib import Path from pathlib import Path
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoModel from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoModel
from common import compare_tokens # type: ignore[import-not-found] from common import compare_tokens, exit_with_warning # type: ignore[import-not-found]
unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME') unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')
@ -174,8 +174,7 @@ def main():
print("=" * 70) print("=" * 70)
data_dir = python_emb_path.parent data_dir = python_emb_path.parent
if not compare_tokens(python_model_name, cpp_model_name, type_suffix="-embeddings", output_dir=str(data_dir)): if not compare_tokens(python_model_name, cpp_model_name, type_suffix="-embeddings", output_dir=str(data_dir)):
print("\n❌ Token mismatch detected") exit_with_warning("\n❌ Token mismatch detected", args.model_path)
exit(1)
print() print()
# Single prompt detailed comparison # Single prompt detailed comparison
@ -237,7 +236,7 @@ def main():
elif avg_cross_sim > 0.70: elif avg_cross_sim > 0.70:
print("⚠️ FAIR: Models have some differences") print("⚠️ FAIR: Models have some differences")
else: else:
print("❌ POOR: Models are significantly different") exit_with_warning("❌ POOR: Models are significantly different", args.model_path)
if __name__ == "__main__": if __name__ == "__main__":
main() main()

View File

@ -234,6 +234,11 @@
#if UINTPTR_MAX == 0xFFFFFFFF #if UINTPTR_MAX == 0xFFFFFFFF
#define GGML_MEM_ALIGN 4 #define GGML_MEM_ALIGN 4
#elif defined(__EMSCRIPTEN__)
// emscripten uses max_align_t == 8, so we need GGML_MEM_ALIGN == 8 for 64-bit wasm.
// (for 32-bit wasm, the first conditional is true and GGML_MEM_ALIGN stays 4.)
// ref: https://github.com/ggml-org/llama.cpp/pull/18628
#define GGML_MEM_ALIGN 8
#else #else
#define GGML_MEM_ALIGN 16 #define GGML_MEM_ALIGN 16
#endif #endif

View File

@ -144,7 +144,7 @@ extern "C" {
// device description: short informative description of the device, could be the model name // device description: short informative description of the device, could be the model name
const char * (*get_description)(ggml_backend_dev_t dev); const char * (*get_description)(ggml_backend_dev_t dev);
// device memory in bytes // device memory in bytes: 0 bytes to indicate no memory to report
void (*get_memory)(ggml_backend_dev_t dev, size_t * free, size_t * total); void (*get_memory)(ggml_backend_dev_t dev, size_t * free, size_t * total);
// device type // device type

View File

@ -32,14 +32,12 @@ if (BLAS_FOUND)
pkg_check_modules(DepBLAS openblas) pkg_check_modules(DepBLAS openblas)
endif() endif()
elseif (${GGML_BLAS_VENDOR} MATCHES "FLAME") elseif (${GGML_BLAS_VENDOR} MATCHES "FLAME")
add_compile_definitions(GGML_BLAS_USE_BLIS)
pkg_check_modules(DepBLAS blis) pkg_check_modules(DepBLAS blis)
elseif (${GGML_BLAS_VENDOR} MATCHES "ATLAS") elseif (${GGML_BLAS_VENDOR} MATCHES "ATLAS")
pkg_check_modules(DepBLAS blas-atlas) pkg_check_modules(DepBLAS blas-atlas)
elseif (${GGML_BLAS_VENDOR} MATCHES "FlexiBLAS") elseif (${GGML_BLAS_VENDOR} MATCHES "FlexiBLAS")
pkg_check_modules(DepBLAS flexiblas_api) pkg_check_modules(DepBLAS flexiblas_api)
elseif (${GGML_BLAS_VENDOR} MATCHES "Intel") elseif (${GGML_BLAS_VENDOR} MATCHES "Intel")
add_compile_definitions(GGML_BLAS_USE_MKL)
# all Intel* libraries share the same include path # all Intel* libraries share the same include path
pkg_check_modules(DepBLAS mkl-sdl) pkg_check_modules(DepBLAS mkl-sdl)
elseif (${GGML_BLAS_VENDOR} MATCHES "NVHPC") elseif (${GGML_BLAS_VENDOR} MATCHES "NVHPC")
@ -74,10 +72,26 @@ if (BLAS_FOUND)
target_compile_options(ggml-blas PRIVATE ${BLAS_LINKER_FLAGS}) target_compile_options(ggml-blas PRIVATE ${BLAS_LINKER_FLAGS})
if ("${BLAS_INCLUDE_DIRS}" MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel")) if ("${GGML_BLAS_VENDOR}" STREQUAL "")
message(WARNING "GGML_BLAS_VENDOR is not set; some methods may not link properly.")
endif()
if ("${GGML_BLAS_VENDOR}" MATCHES "Intel" OR ("${BLAS_INCLUDE_DIRS}" MATCHES "mkl" AND "${GGML_BLAS_VENDOR}" MATCHES "Generic"))
add_compile_definitions(GGML_BLAS_USE_MKL) add_compile_definitions(GGML_BLAS_USE_MKL)
endif() endif()
if ("${GGML_BLAS_VENDOR}" MATCHES "OpenBLAS")
add_compile_definitions(GGML_BLAS_USE_OPENBLAS)
endif()
if ("${GGML_BLAS_VENDOR}" MATCHES "FLAME" OR "${GGML_BLAS_VENDOR}" MATCHES "AOCL" OR "${GGML_BLAS_VENDOR}" MATCHES "AOCL_mt")
add_compile_definitions(GGML_BLAS_USE_BLIS)
endif()
if ("${GGML_BLAS_VENDOR}" MATCHES "NVPL")
add_compile_definitions(GGML_BLAS_USE_NVPL)
endif()
target_link_libraries (ggml-blas PRIVATE ${BLAS_LIBRARIES}) target_link_libraries (ggml-blas PRIVATE ${BLAS_LIBRARIES})
target_include_directories(ggml-blas PRIVATE ${BLAS_INCLUDE_DIRS}) target_include_directories(ggml-blas PRIVATE ${BLAS_INCLUDE_DIRS})
else() else()

View File

@ -115,15 +115,11 @@ static void ggml_backend_blas_mul_mat(ggml_backend_blas_context * ctx, struct gg
#endif #endif
} }
#if defined(OPENBLAS_VERSION) #if defined(GGML_BLAS_USE_OPENBLAS)
openblas_set_num_threads(ctx->n_threads); openblas_set_num_threads(ctx->n_threads);
#endif #elif defined(GGML_BLAS_USE_BLIS)
#if defined(GGML_BLAS_USE_BLIS)
bli_thread_set_num_threads(ctx->n_threads); bli_thread_set_num_threads(ctx->n_threads);
#endif #elif defined(GGML_BLAS_USE_NVPL)
#if defined(GGML_BLAS_USE_NVPL)
nvpl_blas_set_num_threads(ctx->n_threads); nvpl_blas_set_num_threads(ctx->n_threads);
#endif #endif
@ -288,7 +284,7 @@ ggml_backend_t ggml_backend_blas_init(void) {
/* .context = */ ctx, /* .context = */ ctx,
}; };
#if defined(OPENBLAS_VERSION) && defined(GGML_USE_OPENMP) #if defined(GGML_BLAS_USE_OPENBLAS) && defined(GGML_USE_OPENMP)
if (openblas_get_parallel() != OPENBLAS_OPENMP) { if (openblas_get_parallel() != OPENBLAS_OPENMP) {
GGML_LOG_DEBUG("%s: warning: ggml is using OpenMP, but OpenBLAS was compiled without OpenMP support\n", __func__); GGML_LOG_DEBUG("%s: warning: ggml is using OpenMP, but OpenBLAS was compiled without OpenMP support\n", __func__);
} }
@ -329,7 +325,7 @@ static const char * ggml_backend_blas_device_get_description(ggml_backend_dev_t
return "BLIS"; return "BLIS";
#elif defined(GGML_BLAS_USE_NVPL) #elif defined(GGML_BLAS_USE_NVPL)
return "NVPL"; return "NVPL";
#elif defined(OPENBLAS_VERSION) #elif defined(GGML_BLAS_USE_OPENBLAS)
return "OpenBLAS"; return "OpenBLAS";
#else #else
return "BLAS"; return "BLAS";

View File

@ -2541,27 +2541,6 @@ static bool ggml_backend_buft_is_cann(ggml_backend_buffer_type_t buft) {
return buft->iface.get_name == ggml_backend_cann_buffer_type_name; return buft->iface.get_name == ggml_backend_cann_buffer_type_name;
} }
/**
* @brief Determines if a tensor operation should be offloaded to the CANN
* backend.
*
* This function checks if a given tensor operation should be offloaded to the
* CANN backend based on the operation type and the size of the tensor. It
* returns true if the second dimension (ne[1]) of the tensor is greater than or
* equal to the minimum batch size and the operation is not GGML_OP_GET_ROWS.
*
* @param backend Pointer to the CANN backend.
* @param op Pointer to the tensor operation to check.
* @return bool Returns true if the operation should be offloaded, otherwise
* false.
*/
static bool ggml_backend_cann_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
const int min_batch_size = 32;
GGML_UNUSED(dev);
return op->ne[1] >= min_batch_size && op->op != GGML_OP_GET_ROWS;
}
/** /**
* @brief Records an event on the CANN backend stream. * @brief Records an event on the CANN backend stream.
* *
@ -2637,6 +2616,7 @@ struct ggml_backend_cann_device_context {
int device; int device;
std::string name; std::string name;
std::string description; std::string description;
int op_offload_min_batch_size;
}; };
static const char * ggml_backend_cann_device_get_name(ggml_backend_dev_t dev) { static const char * ggml_backend_cann_device_get_name(ggml_backend_dev_t dev) {
@ -2713,6 +2693,26 @@ static ggml_backend_buffer_type_t ggml_backend_cann_device_get_host_buffer_type(
return ggml_backend_cann_host_buffer_type(); return ggml_backend_cann_host_buffer_type();
} }
/**
* @brief Determines if a tensor operation should be offloaded to the CANN
* backend.
*
* This function checks if a given tensor operation should be offloaded to the
* CANN backend based on the operation type and the size of the tensor. It
* returns true if the second dimension (ne[1]) of the tensor is greater than or
* equal to the minimum batch size and the operation is not GGML_OP_GET_ROWS.
*
* @param backend Pointer to the CANN backend.
* @param op Pointer to the tensor operation to check.
* @return bool Returns true if the operation should be offloaded, otherwise
* false.
*/
static bool ggml_backend_cann_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
ggml_backend_cann_device_context * dev_ctx = (ggml_backend_cann_device_context *)dev->context;
return op->ne[1] >= dev_ctx->op_offload_min_batch_size && op->op != GGML_OP_GET_ROWS;
}
/** /**
* @brief Creates a new event for the CANN backend device. * @brief Creates a new event for the CANN backend device.
* *
@ -2829,12 +2829,14 @@ ggml_backend_reg_t ggml_backend_cann_reg() {
if (!initialized) { if (!initialized) {
aclInit(nullptr); aclInit(nullptr);
ggml_backend_cann_reg_context * ctx = new ggml_backend_cann_reg_context; ggml_backend_cann_reg_context * ctx = new ggml_backend_cann_reg_context;
const int min_batch_size = getenv("GGML_OP_OFFLOAD_MIN_BATCH") ? atoi(getenv("GGML_OP_OFFLOAD_MIN_BATCH")) : 32;
for (int i = 0; i < ggml_cann_info().device_count; i++) { for (int i = 0; i < ggml_cann_info().device_count; i++) {
ggml_backend_cann_device_context * dev_ctx = new ggml_backend_cann_device_context(); ggml_backend_cann_device_context * dev_ctx = new ggml_backend_cann_device_context();
dev_ctx->description = aclrtGetSocName(); dev_ctx->description = aclrtGetSocName();
dev_ctx->device = i; dev_ctx->device = i;
dev_ctx->name = GGML_CANN_NAME + std::to_string(i); dev_ctx->name = GGML_CANN_NAME + std::to_string(i);
dev_ctx->op_offload_min_batch_size = min_batch_size;
ggml_cann_set_device(i); ggml_cann_set_device(i);
ggml_backend_dev_t dev = new ggml_backend_device{ /* .iface = */ ggml_backend_cann_device_interface, ggml_backend_dev_t dev = new ggml_backend_device{ /* .iface = */ ggml_backend_cann_device_interface,
/* .reg = */ &reg, /* .reg = */ &reg,

View File

@ -47,7 +47,10 @@ if (CUDAToolkit_FOUND)
# check Modules/Internal/CMakeCUDAArchitecturesValidate.cmake in the CMake git repository instead. # check Modules/Internal/CMakeCUDAArchitecturesValidate.cmake in the CMake git repository instead.
# However, the architectures 120a-real and 121a-real should work with basically any CMake version and # However, the architectures 120a-real and 121a-real should work with basically any CMake version and
# until the release of e.g. Rubin there is no benefit to shipping virtual architectures for Blackwell. # until the release of e.g. Rubin there is no benefit to shipping virtual architectures for Blackwell.
list(APPEND CMAKE_CUDA_ARCHITECTURES 120a-real 121a-real) list(APPEND CMAKE_CUDA_ARCHITECTURES 120a-real)
endif()
if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "12.9")
list(APPEND CMAKE_CUDA_ARCHITECTURES 121a-real)
endif() endif()
endif() endif()
endif() endif()

View File

@ -262,6 +262,10 @@ static const char * cu_get_error_str(CUresult err) {
#define FLASH_ATTN_AVAILABLE #define FLASH_ATTN_AVAILABLE
#endif // !defined(GGML_CUDA_NO_FA) && !(defined(GGML_USE_MUSA) && __MUSA_ARCH__ < 220) #endif // !defined(GGML_CUDA_NO_FA) && !(defined(GGML_USE_MUSA) && __MUSA_ARCH__ < 220)
#if defined(TURING_MMA_AVAILABLE)
#define LDMATRIX_TRANS_AVAILABLE
#endif // defined(TURING_MMA_AVAILABLE)
static bool fp16_available(const int cc) { static bool fp16_available(const int cc) {
return ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_PASCAL || return ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_PASCAL ||
(GGML_CUDA_CC_IS_MTHREADS(cc) && cc >= GGML_CUDA_CC_PH1); (GGML_CUDA_CC_IS_MTHREADS(cc) && cc >= GGML_CUDA_CC_PH1);

View File

@ -914,7 +914,7 @@ void launch_fattn(
const int nblocks_stream_k = max_blocks; const int nblocks_stream_k = max_blocks;
const bool use_stream_k = cc >= GGML_CUDA_CC_ADA_LOVELACE || tiles_efficiency_percent < 75; const bool use_stream_k = cc >= GGML_CUDA_CC_ADA_LOVELACE || amd_wmma_available(cc) || tiles_efficiency_percent < 75;
blocks_num.x = use_stream_k ? nblocks_stream_k : ntiles_total; blocks_num.x = use_stream_k ? nblocks_stream_k : ntiles_total;
blocks_num.y = 1; blocks_num.y = 1;

View File

@ -98,6 +98,19 @@ static constexpr __host__ __device__ fattn_mma_config ggml_cuda_fattn_mma_get_co
return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols); return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols);
} }
static constexpr __host__ __device__ fattn_mma_config ggml_cuda_fattn_mma_get_config_rdna(const int DKQ, const int DV, const int ncols) {
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 16, 128, 2, 64, 128, 128, 128, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 32, 128, 2, 64, 128, 128, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 64, 128, 2, 64, 128, 128, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(576, 512, 16, 64, 4, 32, 96, 64, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(576, 512, 32, 128, 2, 32, 160, 128, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(576, 512, 64, 256, 1, 32, 160, 128, 128, 1, false);
// TODO tune specifically for RDNA
return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols);
}
static __host__ fattn_mma_config ggml_cuda_fattn_mma_get_config(const int DKQ, const int DV, const int ncols, const int cc) { static __host__ fattn_mma_config ggml_cuda_fattn_mma_get_config(const int DKQ, const int DV, const int ncols, const int cc) {
if (ampere_mma_available(cc)) { if (ampere_mma_available(cc)) {
return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols); return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols);
@ -105,6 +118,9 @@ static __host__ fattn_mma_config ggml_cuda_fattn_mma_get_config(const int DKQ, c
if (turing_mma_available(cc)) { if (turing_mma_available(cc)) {
return ggml_cuda_fattn_mma_get_config_turing(DKQ, DV, ncols); return ggml_cuda_fattn_mma_get_config_turing(DKQ, DV, ncols);
} }
if (amd_wmma_available(cc)) {
return ggml_cuda_fattn_mma_get_config_rdna(DKQ, DV, ncols);
}
GGML_ASSERT(volta_mma_available(cc)); GGML_ASSERT(volta_mma_available(cc));
return ggml_cuda_fattn_mma_get_config_volta(DKQ, DV, ncols); return ggml_cuda_fattn_mma_get_config_volta(DKQ, DV, ncols);
} }
@ -116,6 +132,8 @@ static constexpr __device__ fattn_mma_config ggml_cuda_fattn_mma_get_config(cons
return ggml_cuda_fattn_mma_get_config_turing(DKQ, DV, ncols); return ggml_cuda_fattn_mma_get_config_turing(DKQ, DV, ncols);
#elif defined(VOLTA_MMA_AVAILABLE) #elif defined(VOLTA_MMA_AVAILABLE)
return ggml_cuda_fattn_mma_get_config_volta(DKQ, DV, ncols); return ggml_cuda_fattn_mma_get_config_volta(DKQ, DV, ncols);
#elif defined(AMD_WMMA_AVAILABLE)
return ggml_cuda_fattn_mma_get_config_rdna(DKQ, DV, ncols);
#else #else
GGML_UNUSED_VARS(DKQ, DV, ncols); GGML_UNUSED_VARS(DKQ, DV, ncols);
return fattn_mma_config(32, 1, 0, 0, 0, 0, 0, false); return fattn_mma_config(32, 1, 0, 0, 0, 0, 0, false);
@ -186,6 +204,23 @@ static constexpr __device__ bool ggml_cuda_fattn_mma_get_Q_in_reg(const int DKQ,
return ggml_cuda_fattn_mma_get_config(DKQ, DV, ncols).Q_in_reg; return ggml_cuda_fattn_mma_get_config(DKQ, DV, ncols).Q_in_reg;
} }
static constexpr __device__ int get_cols_per_thread() {
#if defined(AMD_WMMA_AVAILABLE)
return 1; // RDNA has a single column.
#else
return 2; // This is specifically KQ columns, Volta only has a single VKQ column.
#endif // defined(AMD_WMMA_AVAILABLE)
}
static __host__ int get_cols_per_warp(const int cc) {
if (turing_mma_available(cc) || amd_wmma_available(cc)) {
return 16;
} else {
// Volta
return 32;
}
}
// ------------------------------------------------------------------------------------------------------------------ // ------------------------------------------------------------------------------------------------------------------
static __host__ int ggml_cuda_fattn_mma_get_nstages(const int DKQ, const int DV, const int ncols1, const int ncols2, const int cc) { static __host__ int ggml_cuda_fattn_mma_get_nstages(const int DKQ, const int DV, const int ncols1, const int ncols2, const int cc) {
@ -393,10 +428,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
const int jt, const int jt,
const int kb0, const int kb0,
const int k_VKQ_sup) { const int k_VKQ_sup) {
#if defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) #if defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4))
constexpr int ncols = ncols1 * ncols2; constexpr int ncols = ncols1 * ncols2;
constexpr int cols_per_warp = T_B_KQ::I; constexpr int cols_per_warp = T_B_KQ::I;
constexpr int cols_per_thread = 2; // This is specifically KQ columns, Volta only has a single VKQ column. constexpr int cols_per_thread = get_cols_per_thread();
constexpr int np = nwarps * (cols_per_warp/ncols2) / ncols1; // Number of parallel CUDA warps per Q column. constexpr int np = nwarps * (cols_per_warp/ncols2) / ncols1; // Number of parallel CUDA warps per Q column.
constexpr int nbatch_fa = ggml_cuda_fattn_mma_get_nbatch_fa(DKQ, DV, ncols); constexpr int nbatch_fa = ggml_cuda_fattn_mma_get_nbatch_fa(DKQ, DV, ncols);
constexpr int nbatch_K2 = ggml_cuda_fattn_mma_get_nbatch_K2(DKQ, DV, ncols); constexpr int nbatch_K2 = ggml_cuda_fattn_mma_get_nbatch_K2(DKQ, DV, ncols);
@ -413,6 +448,8 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
const int k_VKQ_0 = kb0 * nbatch_fa; const int k_VKQ_0 = kb0 * nbatch_fa;
#if defined(TURING_MMA_AVAILABLE) #if defined(TURING_MMA_AVAILABLE)
T_C_KQ KQ_C[nbatch_fa/(np*(cols_per_warp == 8 ? T_C_KQ::I : T_C_KQ::J))]; T_C_KQ KQ_C[nbatch_fa/(np*(cols_per_warp == 8 ? T_C_KQ::I : T_C_KQ::J))];
#elif defined(AMD_WMMA_AVAILABLE)
T_C_KQ KQ_C[nbatch_fa/(np*T_C_KQ::J)];
#else // Volta #else // Volta
T_C_KQ KQ_C[nbatch_fa/(np*T_C_KQ::J)]; T_C_KQ KQ_C[nbatch_fa/(np*T_C_KQ::J)];
#endif // defined(TURING_MMA_AVAILABLE) #endif // defined(TURING_MMA_AVAILABLE)
@ -461,8 +498,14 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
if constexpr (cols_per_warp == 8) { if constexpr (cols_per_warp == 8) {
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[k_KQ_0/T_A_KQ::J]); mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[k_KQ_0/T_A_KQ::J]);
} else { } else {
// Wide version of KQ_C is column-major => swap A and B. // Wide version of KQ_C is column-major
#if defined(AMD_WMMA_AVAILABLE)
// RDNA matrix C is column-major.
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[k_KQ_0/T_A_KQ::J]);
#else
// swap A and B for CUDA.
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[k_KQ_0/T_A_KQ::J], K_A); mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[k_KQ_0/T_A_KQ::J], K_A);
#endif // defined(AMD_WMMA_AVAILABLE)
} }
} }
} }
@ -479,8 +522,14 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
T_A_KQ K_A; T_A_KQ K_A;
load_ldmatrix(K_A, tile_K + i_KQ_0*stride_tile_K + (k_KQ_0 - k0_start), stride_tile_K); load_ldmatrix(K_A, tile_K + i_KQ_0*stride_tile_K + (k_KQ_0 - k0_start), stride_tile_K);
// Wide version of KQ_C is column-major => swap A and B. // Wide version of KQ_C is column-major
#if defined(AMD_WMMA_AVAILABLE)
// RDNA matrix C is column-major.
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[0]);
#else
// swap A and B for CUDA.
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[0], K_A); mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[0], K_A);
#endif // defined(AMD_WMMA_AVAILABLE)
} }
} }
} }
@ -532,7 +581,13 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#pragma unroll #pragma unroll
for (int l = 0; l < T_C_KQ::ne; ++l) { for (int l = 0; l < T_C_KQ::ne; ++l) {
if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::I + T_C_KQ::get_i(l) < k_VKQ_sup) { if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::I + T_C_KQ::get_i(l) < k_VKQ_sup) {
KQ_max_new[l % 2] = fmaxf(KQ_max_new[l % 2], KQ_C[k0/(np*T_C_KQ::I)].x[l] + FATTN_KQ_MAX_OFFSET); #if defined(AMD_WMMA_AVAILABLE)
constexpr int KQ_idx = 0;
#else
// Turing + Volta:
const int KQ_idx = l % 2;
#endif // defined(AMD_WMMA_AVAILABLE)
KQ_max_new[KQ_idx] = fmaxf(KQ_max_new[KQ_idx], KQ_C[k0/(np*T_C_KQ::I)].x[l] + FATTN_KQ_MAX_OFFSET);
} }
} }
} }
@ -552,8 +607,14 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#pragma unroll #pragma unroll
for (int l = 0; l < T_C_KQ::ne; ++l) { for (int l = 0; l < T_C_KQ::ne; ++l) {
if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::I + T_C_KQ::get_i(l) < k_VKQ_sup) { if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::I + T_C_KQ::get_i(l) < k_VKQ_sup) {
KQ_C[k0/(np*T_C_KQ::I)].x[l] = expf(KQ_C[k0/(np*T_C_KQ::I)].x[l] - KQ_max_new[l % 2]); #if defined(AMD_WMMA_AVAILABLE)
KQ_rowsum_add[l % 2] += KQ_C[k0/(np*T_C_KQ::I)].x[l]; constexpr int KQ_idx = 0;
#else
// Turing + Volta:
const int KQ_idx = l % 2;
#endif // defined(AMD_WMMA_AVAILABLE)
KQ_C[k0/(np*T_C_KQ::I)].x[l] = expf(KQ_C[k0/(np*T_C_KQ::I)].x[l] - KQ_max_new[KQ_idx]);
KQ_rowsum_add[KQ_idx] += KQ_C[k0/(np*T_C_KQ::I)].x[l];
} else { } else {
KQ_C[k0/(np*T_C_KQ::I)].x[l] = 0.0f; KQ_C[k0/(np*T_C_KQ::I)].x[l] = 0.0f;
} }
@ -584,8 +645,13 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#pragma unroll #pragma unroll
for (int l = 0; l < T_C_KQ::ne; ++l) { for (int l = 0; l < T_C_KQ::ne; ++l) {
if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::J + T_C_KQ::get_j(l) < k_VKQ_sup) { if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::J + T_C_KQ::get_j(l) < k_VKQ_sup) {
#if defined(AMD_WMMA_AVAILABLE)
constexpr int KQ_idx = 0;
#else
// Turing + Volta: // Turing + Volta:
KQ_max_new[(l/2) % 2] = fmaxf(KQ_max_new[(l/2) % 2], KQ_C[(k0/(np*T_C_KQ::J))].x[l] + FATTN_KQ_MAX_OFFSET); const int KQ_idx = (l/2) % 2;
#endif // defined(AMD_WMMA_AVAILABLE)
KQ_max_new[KQ_idx] = fmaxf(KQ_max_new[KQ_idx], KQ_C[(k0/(np*T_C_KQ::J))].x[l] + FATTN_KQ_MAX_OFFSET);
} }
} }
} }
@ -596,7 +662,11 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
// Values per KQ column are spread across 4 threads: // Values per KQ column are spread across 4 threads:
constexpr int offset_first = 2; constexpr int offset_first = 2;
constexpr int offset_last = 1; constexpr int offset_last = 1;
#else #elif defined(AMD_WMMA_AVAILABLE)
// Values per KQ column are spread across 2 threads:
constexpr int offset_first = 16;
constexpr int offset_last = 16;
#else // Volta
// Values per KQ column are spread across 2 threads: // Values per KQ column are spread across 2 threads:
constexpr int offset_first = 2; constexpr int offset_first = 2;
constexpr int offset_last = 2; constexpr int offset_last = 2;
@ -612,10 +682,15 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
for (int k0 = 0; k0 < nbatch_fa; k0 += np*T_C_KQ::J) { for (int k0 = 0; k0 < nbatch_fa; k0 += np*T_C_KQ::J) {
#pragma unroll #pragma unroll
for (int l = 0; l < T_C_KQ::ne; ++l) { for (int l = 0; l < T_C_KQ::ne; ++l) {
// Turing + Volta:
if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::J + T_C_KQ::get_j(l) < k_VKQ_sup) { if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::J + T_C_KQ::get_j(l) < k_VKQ_sup) {
KQ_C[(k0/(np*T_C_KQ::J))].x[l] = expf(KQ_C[(k0/(np*T_C_KQ::J))].x[l] - KQ_max_new[(l/2) % 2]); #if defined(AMD_WMMA_AVAILABLE)
KQ_rowsum_add[(l/2) % 2] += KQ_C[(k0/(np*T_C_KQ::J))].x[l]; constexpr int KQ_idx = 0;
#else
// Turing + Volta:
const int KQ_idx = (l/2) % 2;
#endif // defined(AMD_WMMA_AVAILABLE)
KQ_C[(k0/(np*T_C_KQ::J))].x[l] = expf(KQ_C[(k0/(np*T_C_KQ::J))].x[l] - KQ_max_new[KQ_idx]);
KQ_rowsum_add[KQ_idx] += KQ_C[(k0/(np*T_C_KQ::J))].x[l];
} else { } else {
KQ_C[(k0/(np*T_C_KQ::J))].x[l] = 0.0f; KQ_C[(k0/(np*T_C_KQ::J))].x[l] = 0.0f;
} }
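// Illustrative scalar sketch (hypothetical names, not used by the kernels) of the
// online-softmax update that the KQ_max_new / KQ_rowsum_add / KQ_max_scale
// bookkeeping in these hunks performs per attention row:
struct fattn_online_softmax_sketch {
    float m   = -INFINITY; // running maximum of the KQ scores
    float r   = 0.0f;      // running sum of exp(score - m)         ("KQ_rowsum")
    float acc = 0.0f;      // running sum of exp(score - m) * value ("VKQ" accumulator)

    __device__ void add(const float score, const float value) {
        const float m_new = fmaxf(m, score);
        const float scale = expf(m - m_new);      // "KQ_max_scale": rescale old contributions
        const float p     = expf(score - m_new);
        r   = r*scale + p;
        acc = acc*scale + p*value;
        m   = m_new;
    }

    __device__ float result() const { return acc / r; } // normalization happens later, in the combine step
};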
@ -639,7 +714,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#if defined(TURING_MMA_AVAILABLE) #if defined(TURING_MMA_AVAILABLE)
if constexpr (cols_per_warp == 8) { if constexpr (cols_per_warp == 8) {
const half2 KQ_max_scale_h2 = make_half2(KQ_max_scale[0], KQ_max_scale[1]); const half2 KQ_max_scale_h2 = make_half2(KQ_max_scale[0], KQ_max_scale[cols_per_thread - 1]);
#pragma unroll #pragma unroll
for (int i = 0; i < DV/T_C_VKQ::I; ++i) { for (int i = 0; i < DV/T_C_VKQ::I; ++i) {
#pragma unroll #pragma unroll
@ -660,6 +735,16 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
} }
} }
} }
#elif defined(AMD_WMMA_AVAILABLE)
const half2 KQ_max_scale_h2 = make_half2(
KQ_max_scale[0], KQ_max_scale[0]);
#pragma unroll
for (int i = 0; i < (DV/2)/T_C_VKQ::J; ++i) {
#pragma unroll
for (int l = 0; l < T_C_VKQ::ne; ++l) {
VKQ_C[i].x[l] *= KQ_max_scale_h2;
}
}
#else // Volta #else // Volta
const half2 KQ_max_scale_h2 = make_half2( const half2 KQ_max_scale_h2 = make_half2(
KQ_max_scale[(threadIdx.x / 2) % 2], KQ_max_scale[(threadIdx.x / 2) % 2]); KQ_max_scale[(threadIdx.x / 2) % 2], KQ_max_scale[(threadIdx.x / 2) % 2]);
@ -707,6 +792,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
// Therefore, iterate over V in reverse and re-use the data if possible. // Therefore, iterate over V in reverse and re-use the data if possible.
static_assert(!mla || nstages <= 1, "combination of MLA and multi-stage loading not implemented"); static_assert(!mla || nstages <= 1, "combination of MLA and multi-stage loading not implemented");
constexpr int reusable_cutoff = mla ? (DKQ - 1) - (DKQ - 1) % (2*nbatch_K2) - (DKQ - DV) : DV; constexpr int reusable_cutoff = mla ? (DKQ - 1) - (DKQ - 1) % (2*nbatch_K2) - (DKQ - DV) : DV;
#if defined(AMD_WMMA_AVAILABLE) && !defined(LDMATRIX_TRANS_AVAILABLE)
T_A_VKQ A_identity;
make_identity_mat(A_identity);
#endif // defined(AMD_WMMA_AVAILABLE) && !defined(LDMATRIX_TRANS_AVAILABLE)
// Calculate VKQ tile, need to use logical rather than physical elements for i0 due to transposition of V: // Calculate VKQ tile, need to use logical rather than physical elements for i0 due to transposition of V:
#pragma unroll #pragma unroll
@ -727,7 +816,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
} }
const half2 * tile_V_i = i0_start < reusable_cutoff ? tile_V : tile_V + (i0_start - reusable_cutoff)/2; const half2 * tile_V_i = i0_start < reusable_cutoff ? tile_V : tile_V + (i0_start - reusable_cutoff)/2;
#if defined(TURING_MMA_AVAILABLE) #if defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
constexpr int i0_stride = cols_per_warp == 8 ? T_C_VKQ::I : 2*T_C_VKQ::J; constexpr int i0_stride = cols_per_warp == 8 ? T_C_VKQ::I : 2*T_C_VKQ::J;
#pragma unroll #pragma unroll
for (int i_VKQ_0 = i0_start; i_VKQ_0 < i0_stop; i_VKQ_0 += i0_stride) { for (int i_VKQ_0 = i0_start; i_VKQ_0 < i0_stop; i_VKQ_0 += i0_stride) {
@ -737,12 +826,26 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
const int k0 = k00 + (threadIdx.y % np)*T_A_VKQ::J; const int k0 = k00 + (threadIdx.y % np)*T_A_VKQ::J;
T_A_VKQ A; // Transposed in SRAM but not in registers, gets transposed on load. T_A_VKQ A; // Transposed in SRAM but not in registers, gets transposed on load.
#if defined(LDMATRIX_TRANS_AVAILABLE)
load_ldmatrix_trans(A, tile_V_i + 2*k0*stride_tile_V + (i_VKQ_0 - i0_start)/2, stride_tile_V); load_ldmatrix_trans(A, tile_V_i + 2*k0*stride_tile_V + (i_VKQ_0 - i0_start)/2, stride_tile_V);
#else
// TODO: Try to transpose tile_V when loading gmem to smem.
// Use mma to transpose T_A_VKQ for RDNA.
T_A_VKQ A_trans;
load_ldmatrix(A_trans, tile_V_i + 2*k0*stride_tile_V + (i_VKQ_0 - i0_start)/2, stride_tile_V);
mma(A, A_trans, A_identity);
#endif // defined(LDMATRIX_TRANS_AVAILABLE)
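// Note: A_trans * I is numerically just A_trans, but on RDNA4 the WMMA
// accumulator fragment uses the transposed register layout relative to the
// A/B operands (see tile::get_j), so reading the product back as an A fragment
// yields the data with i and j swapped - the same effect load_ldmatrix_trans
// provides on NVIDIA.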
if constexpr (T_B_KQ::I == 8) { if constexpr (T_B_KQ::I == 8) {
mma(VKQ_C[i_VKQ_0/i0_stride], A, B[k00/(np*T_A_VKQ::J)]); mma(VKQ_C[i_VKQ_0/i0_stride], A, B[k00/(np*T_A_VKQ::J)]);
} else { } else {
// Wide version of VKQ_C is column-major => swap A and B. // Wide version of VKQ_C is column-major.
#if defined(AMD_WMMA_AVAILABLE)
// RDNA matrix C is column-major.
mma(VKQ_C[i_VKQ_0/i0_stride], A, B[k00/(np*T_A_VKQ::J)]);
#else
// swap A and B for CUDA.
mma(VKQ_C[i_VKQ_0/i0_stride], B[k00/(np*T_A_VKQ::J)], A); mma(VKQ_C[i_VKQ_0/i0_stride], B[k00/(np*T_A_VKQ::J)], A);
#endif // defined(AMD_WMMA_AVAILABLE)
} }
} }
} }
@ -761,7 +864,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
mma(VKQ_C[i_VKQ_0/i0_stride], B[k00/(np*T_A_VKQ::I)], A); mma(VKQ_C[i_VKQ_0/i0_stride], B[k00/(np*T_A_VKQ::I)], A);
} }
} }
#endif // defined(TURING_MMA_AVAILABLE) #endif // defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
if constexpr (nstages <= 1) { if constexpr (nstages <= 1) {
__syncthreads(); // Only needed if tile_K == tile_V. __syncthreads(); // Only needed if tile_K == tile_V.
@ -774,7 +877,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
tile_Q, tile_K, tile_V, tile_mask, tile_Q, tile_K, tile_V, tile_mask,
Q_B, VKQ_C, KQ_max, KQ_rowsum, kb0); Q_B, VKQ_C, KQ_max, KQ_rowsum, kb0);
NO_DEVICE_CODE; NO_DEVICE_CODE;
#endif // defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) #endif // defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4))
} }
#if defined(TURING_MMA_AVAILABLE) #if defined(TURING_MMA_AVAILABLE)
@ -794,6 +897,15 @@ template<> struct mma_tile_sizes<8> {
using T_B_VKQ = tile< 8, 8, half2>; // column-major using T_B_VKQ = tile< 8, 8, half2>; // column-major
using T_C_VKQ = tile<16, 4, half2>; // row-major using T_C_VKQ = tile<16, 4, half2>; // row-major
}; };
#elif defined(AMD_WMMA_AVAILABLE)
template<int ncols> struct mma_tile_sizes {
using T_A_KQ = tile<16, 8, half2>; // row-major
using T_B_KQ = tile<16, 8, half2>; // column-major
using T_C_KQ = tile<16, 16, float>; // column-major
using T_A_VKQ = tile<16, 8, half2>; // row-major
using T_B_VKQ = tile<16, 8, half2>; // column-major
using T_C_VKQ = tile<16, 8, half2>; // column-major
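// Each tile above covers 16x16 scalar elements (a 16x8 half2 tile packs two
// halves per entry), matching the 16x16x16 shape of the RDNA4 WMMA instruction.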
};
#else // Volta #else // Volta
template<int ncols> struct mma_tile_sizes { template<int ncols> struct mma_tile_sizes {
using T_A_KQ = tile< 8, 4, half2, DATA_LAYOUT_I_MAJOR_MIRRORED>; // row-major using T_A_KQ = tile< 8, 4, half2, DATA_LAYOUT_I_MAJOR_MIRRORED>; // row-major
@ -828,7 +940,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
const int jt, const int jt,
const int kb0_start, const int kb0_start,
const int kb0_stop) { const int kb0_stop) {
#if defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) #if defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4))
//In this kernel Q, K, V are matrices while i, j, k are matrix indices. //In this kernel Q, K, V are matrices while i, j, k are matrix indices.
constexpr int ncols = ncols1 * ncols2; constexpr int ncols = ncols1 * ncols2;
@ -840,7 +952,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
using T_C_VKQ = typename mma_tile_sizes<ncols>::T_C_VKQ; using T_C_VKQ = typename mma_tile_sizes<ncols>::T_C_VKQ;
constexpr int cols_per_warp = T_B_KQ::I; constexpr int cols_per_warp = T_B_KQ::I;
constexpr int cols_per_thread = 2; // This is specifically KQ columns, Volta only has a single VKQ column. constexpr int cols_per_thread = get_cols_per_thread();
constexpr int np = nwarps * (cols_per_warp/ncols2) / ncols1; // Number of parallel CUDA warps per Q column. constexpr int np = nwarps * (cols_per_warp/ncols2) / ncols1; // Number of parallel CUDA warps per Q column.
constexpr int nbatch_fa = ggml_cuda_fattn_mma_get_nbatch_fa (DKQ, DV, ncols); constexpr int nbatch_fa = ggml_cuda_fattn_mma_get_nbatch_fa (DKQ, DV, ncols);
constexpr int nbatch_K2 = ggml_cuda_fattn_mma_get_nbatch_K2 (DKQ, DV, ncols); constexpr int nbatch_K2 = ggml_cuda_fattn_mma_get_nbatch_K2 (DKQ, DV, ncols);
@ -871,6 +983,8 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
T_B_KQ Q_B[(Q_in_reg ? DKQ/(2*T_B_KQ::J) : 1)]; T_B_KQ Q_B[(Q_in_reg ? DKQ/(2*T_B_KQ::J) : 1)];
#if defined(TURING_MMA_AVAILABLE) #if defined(TURING_MMA_AVAILABLE)
T_C_VKQ VKQ_C[cols_per_warp == 8 ? DV/T_C_VKQ::I : DV/(2*T_C_VKQ::J)]; T_C_VKQ VKQ_C[cols_per_warp == 8 ? DV/T_C_VKQ::I : DV/(2*T_C_VKQ::J)];
#elif defined(AMD_WMMA_AVAILABLE)
T_C_VKQ VKQ_C[ DV/(2*T_C_VKQ::J)];
#else // Volta #else // Volta
T_C_VKQ VKQ_C[ DV/(2*T_C_VKQ::J)]; T_C_VKQ VKQ_C[ DV/(2*T_C_VKQ::J)];
#endif // defined(TURING_MMA_AVAILABLE) #endif // defined(TURING_MMA_AVAILABLE)
@ -1010,6 +1124,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
// The partial sums are spread across 8/4 threads. // The partial sums are spread across 8/4 threads.
constexpr int offset_first = cols_per_warp == 8 ? 16 : 2; constexpr int offset_first = cols_per_warp == 8 ? 16 : 2;
constexpr int offset_last = cols_per_warp == 8 ? 4 : 1; constexpr int offset_last = cols_per_warp == 8 ? 4 : 1;
#elif defined(AMD_WMMA_AVAILABLE)
// The partial sums are spread across 2 threads.
constexpr int offset_first = 16;
constexpr int offset_last = 16;
#else // Volta #else // Volta
// The partial sums are spread across 2 threads. // The partial sums are spread across 2 threads.
constexpr int offset_first = 2; constexpr int offset_first = 2;
@ -1047,7 +1165,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
#if defined(TURING_MMA_AVAILABLE) #if defined(TURING_MMA_AVAILABLE)
if constexpr (cols_per_warp == 8) { if constexpr (cols_per_warp == 8) {
const half2 KQ_max_scale_h2 = make_half2(KQ_max_scale[0], KQ_max_scale[1]); const half2 KQ_max_scale_h2 = make_half2(KQ_max_scale[0], KQ_max_scale[cols_per_thread - 1]);
#pragma unroll #pragma unroll
for (int i = 0; i < DV/T_C_VKQ::I; ++i) { for (int i = 0; i < DV/T_C_VKQ::I; ++i) {
#pragma unroll #pragma unroll
@ -1068,6 +1186,15 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
} }
} }
} }
#elif defined(AMD_WMMA_AVAILABLE)
const half2 KQ_max_scale_h2 = make_half2(KQ_max_scale[0], KQ_max_scale[0]);
#pragma unroll
for (int i = 0; i < (DV/2)/T_C_VKQ::J; ++i) {
#pragma unroll
for (int l = 0; l < T_C_VKQ::ne; ++l) {
VKQ_C[i].x[l] *= KQ_max_scale_h2;
}
}
#else // Volta #else // Volta
const int col = (threadIdx.x / 2) % 2; const int col = (threadIdx.x / 2) % 2;
const half2 KQ_max_scale_h2 = make_half2(KQ_max_scale[col], KQ_max_scale[col]); const half2 KQ_max_scale_h2 = make_half2(KQ_max_scale[col], KQ_max_scale[col]);
@ -1119,6 +1246,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
const int jc_cwm = threadIdx.y*cols_per_warp + T_C_VKQ::get_i(threadIdx.x % 4); const int jc_cwm = threadIdx.y*cols_per_warp + T_C_VKQ::get_i(threadIdx.x % 4);
const float2 KQ_cmr = make_float2(KQ_max[threadIdx.x % cols_per_thread], KQ_rowsum[threadIdx.x % cols_per_thread]); const float2 KQ_cmr = make_float2(KQ_max[threadIdx.x % cols_per_thread], KQ_rowsum[threadIdx.x % cols_per_thread]);
const bool thread_should_write = threadIdx.x % 4 < cols_per_thread; const bool thread_should_write = threadIdx.x % 4 < cols_per_thread;
#elif defined(AMD_WMMA_AVAILABLE)
const int jc_cwm = threadIdx.y*cols_per_warp + T_C_VKQ::get_i(0);
const float2 KQ_cmr = make_float2(KQ_max[0], KQ_rowsum[0]);
const bool thread_should_write = threadIdx.x / 16 < cols_per_thread;
#else // Volta #else // Volta
const int jc_cwm = threadIdx.y*cols_per_warp + T_C_KQ::get_i(threadIdx.x & 2); const int jc_cwm = threadIdx.y*cols_per_warp + T_C_KQ::get_i(threadIdx.x & 2);
const float2 KQ_cmr = make_float2(KQ_max[(threadIdx.x & 2) / 2], KQ_rowsum[(threadIdx.x & 2) / 2]); const float2 KQ_cmr = make_float2(KQ_max[(threadIdx.x & 2) / 2], KQ_rowsum[(threadIdx.x & 2) / 2]);
@ -1319,7 +1450,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
stride_Q1, stride_Q2, stride_K, stride_V, stride_mask, stride_Q1, stride_Q2, stride_K, stride_V, stride_mask,
jt, kb0_start, kb0_stop); jt, kb0_start, kb0_stop);
NO_DEVICE_CODE; NO_DEVICE_CODE;
#endif // defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) #endif // defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4))
} }
template<int DKQ, int DV, int ncols1, int ncols2, bool use_logit_softcap, bool mla> template<int DKQ, int DV, int ncols1, int ncols2, bool use_logit_softcap, bool mla>
@ -1346,7 +1477,7 @@ static __global__ void flash_attn_ext_f16(
const int32_t nb21, const int32_t nb22, const int64_t nb23, const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33, const int32_t ne31, const int32_t ne32, const int32_t ne33,
const int32_t nb31, const int32_t nb32, const int64_t nb33) { const int32_t nb31, const int32_t nb32, const int64_t nb33) {
#if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE)) #if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)))
// Skip unused kernel variants for faster compilation: // Skip unused kernel variants for faster compilation:
if (use_logit_softcap && !(DKQ == 128 || DKQ == 256)) { if (use_logit_softcap && !(DKQ == 128 || DKQ == 256)) {
@ -1360,6 +1491,13 @@ static __global__ void flash_attn_ext_f16(
} }
#endif // __CUDA_ARCH__ == GGML_CUDA_CC_TURING #endif // __CUDA_ARCH__ == GGML_CUDA_CC_TURING
#if defined(AMD_WMMA_AVAILABLE)
if (ncols1*ncols2 > 32 || ncols1*ncols2 < 16 || DKQ > 128 || ncols2 == 1) {
NO_DEVICE_CODE;
return;
}
#endif // defined(AMD_WMMA_AVAILABLE)
static_assert(!mla || DKQ >= DV, "MLA needs DKQ >= DV"); static_assert(!mla || DKQ >= DV, "MLA needs DKQ >= DV");
constexpr int ncols = ncols1 * ncols2; constexpr int ncols = ncols1 * ncols2;
@ -1473,7 +1611,7 @@ static __global__ void flash_attn_ext_f16(
ne31, ne32, ne33, ne31, ne32, ne33,
nb31, nb32, nb33); nb31, nb32, nb33);
NO_DEVICE_CODE; NO_DEVICE_CODE;
#endif // defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE)) #endif // defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)))
} }
template <int DKQ, int DV, int ncols1, int ncols2> template <int DKQ, int DV, int ncols1, int ncols2>
@ -1492,7 +1630,7 @@ void ggml_cuda_flash_attn_ext_mma_f16_case(ggml_backend_cuda_context & ctx, ggml
const bool Q_in_reg = ggml_cuda_fattn_mma_get_Q_in_reg (DKQ, DV, ncols, cc); const bool Q_in_reg = ggml_cuda_fattn_mma_get_Q_in_reg (DKQ, DV, ncols, cc);
const int nstages = ggml_cuda_fattn_mma_get_nstages (DKQ, DV, ncols1, ncols2, cc); const int nstages = ggml_cuda_fattn_mma_get_nstages (DKQ, DV, ncols1, ncols2, cc);
const int cols_per_warp = std::min(ncols, turing_mma_available(cc) ? 16 : 32); const int cols_per_warp = std::min(ncols, get_cols_per_warp(cc));
const int nwarps = nthreads / WARP_SIZE; const int nwarps = nthreads / WARP_SIZE;
constexpr bool mla = DKQ == 576; constexpr bool mla = DKQ == 576;
@ -1512,29 +1650,34 @@ void ggml_cuda_flash_attn_ext_mma_f16_case(ggml_backend_cuda_context & ctx, ggml
float logit_softcap; float logit_softcap;
memcpy(&logit_softcap, (const float *) KQV->op_params + 2, sizeof(float)); memcpy(&logit_softcap, (const float *) KQV->op_params + 2, sizeof(float));
#if defined(GGML_USE_HIP)
using fattn_kernel_ptr_t = const void*;
#else
using fattn_kernel_ptr_t = fattn_kernel_t;
#endif // defined(GGML_USE_HIP)
fattn_kernel_t fattn_kernel; fattn_kernel_t fattn_kernel;
if (logit_softcap == 0.0f) { if (logit_softcap == 0.0f) {
constexpr bool use_logit_softcap = false; constexpr bool use_logit_softcap = false;
fattn_kernel = flash_attn_ext_f16<DKQ, DV, ncols1, ncols2, use_logit_softcap, mla>; fattn_kernel = flash_attn_ext_f16<DKQ, DV, ncols1, ncols2, use_logit_softcap, mla>;
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) #if !defined(GGML_USE_MUSA)
static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false}; static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false};
if (!shared_memory_limit_raised[id]) { if (!shared_memory_limit_raised[id]) {
CUDA_CHECK(cudaFuncSetAttribute(fattn_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes_shared_total)); CUDA_CHECK(cudaFuncSetAttribute(reinterpret_cast<fattn_kernel_ptr_t>(fattn_kernel), cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes_shared_total));
shared_memory_limit_raised[id] = true; shared_memory_limit_raised[id] = true;
} }
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) #endif // !defined(GGML_USE_MUSA)
} else { } else {
constexpr bool use_logit_softcap = true; constexpr bool use_logit_softcap = true;
fattn_kernel = flash_attn_ext_f16<DKQ, DV, ncols1, ncols2, use_logit_softcap, mla>; fattn_kernel = flash_attn_ext_f16<DKQ, DV, ncols1, ncols2, use_logit_softcap, mla>;
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) #if !defined(GGML_USE_MUSA)
static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false}; static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false};
if (!shared_memory_limit_raised[id]) { if (!shared_memory_limit_raised[id]) {
CUDA_CHECK(cudaFuncSetAttribute(fattn_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes_shared_total)); CUDA_CHECK(cudaFuncSetAttribute(reinterpret_cast<fattn_kernel_ptr_t>(fattn_kernel), cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes_shared_total));
shared_memory_limit_raised[id] = true; shared_memory_limit_raised[id] = true;
} }
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) #endif // !defined(GGML_USE_MUSA)
} }
launch_fattn<DV, ncols1, ncols2> launch_fattn<DV, ncols1, ncols2>

View File

@ -18,12 +18,12 @@ static void ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1(ggml_backend_cuda_con
} }
} }
if (turing_mma_available(cc) && Q->ne[1] <= 16/ncols2) { if ((turing_mma_available(cc) || amd_wmma_available(cc)) && Q->ne[1] <= 16/ncols2) {
ggml_cuda_flash_attn_ext_mma_f16_case<DKQ, DV, 16/ncols2, ncols2>(ctx, dst); ggml_cuda_flash_attn_ext_mma_f16_case<DKQ, DV, 16/ncols2, ncols2>(ctx, dst);
return; return;
} }
if (ggml_cuda_highest_compiled_arch(cc) == GGML_CUDA_CC_TURING || Q->ne[1] <= 32/ncols2) { if (ggml_cuda_highest_compiled_arch(cc) == GGML_CUDA_CC_TURING || amd_wmma_available(cc) || Q->ne[1] <= 32/ncols2) {
ggml_cuda_flash_attn_ext_mma_f16_case<DKQ, DV, 32/ncols2, ncols2>(ctx, dst); ggml_cuda_flash_attn_ext_mma_f16_case<DKQ, DV, 32/ncols2, ncols2>(ctx, dst);
return; return;
} }
@ -230,7 +230,18 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const
// The effective batch size for the kernel can be increased by gqa_ratio. // The effective batch size for the kernel can be increased by gqa_ratio.
// The kernel versions without this optimization are also used for ALiBi, if there is no mask, or if the KV cache is not padded, // The kernel versions without this optimization are also used for ALiBi, if there is no mask, or if the KV cache is not padded,
const bool gqa_opt_applies = gqa_ratio % 2 == 0 && mask && max_bias == 0.0f && K->ne[1] % FATTN_KQ_STRIDE == 0; bool gqa_opt_applies = gqa_ratio % 2 == 0 && mask && max_bias == 0.0f && K->ne[1] % FATTN_KQ_STRIDE == 0;
for (const ggml_tensor * t : {Q, K, V, mask}) {
if (t == nullptr) {
continue;
}
for (size_t i = 1; i < GGML_MAX_DIMS; ++i) {
if (t->nb[i] % 16 != 0) {
gqa_opt_applies = false;
break;
}
}
}
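// Byte strides that are not multiples of 16 disable the GQA-optimized path
// (presumably because its wider, vectorized loads assume 16-byte alignment).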
const int cc = ggml_cuda_info().devices[device].cc; const int cc = ggml_cuda_info().devices[device].cc;
@ -337,6 +348,31 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const
return BEST_FATTN_KERNEL_WMMA_F16; return BEST_FATTN_KERNEL_WMMA_F16;
} }
if (amd_wmma_available(cc) && GGML_CUDA_CC_IS_RDNA4(cc) && gqa_opt_applies && Q->ne[0] <= 128 && Q->ne[0] != 40 && Q->ne[0] != 72) {
if (can_use_vector_kernel) {
if (!ggml_is_quantized(K->type) && !ggml_is_quantized(V->type)) {
if (Q->ne[1] == 1) {
if (!gqa_opt_applies) {
return BEST_FATTN_KERNEL_VEC;
}
}
} else {
if (Q->ne[1] <= 2) {
return BEST_FATTN_KERNEL_VEC;
}
}
}
int gqa_ratio_eff = 1;
const int ncols2_max = Q->ne[0] == 576 ? 16 : 8;
while (gqa_ratio % (2*gqa_ratio_eff) == 0 && gqa_ratio_eff < ncols2_max) {
gqa_ratio_eff *= 2;
}
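// e.g. a hypothetical gqa_ratio of 12 with ncols2_max == 8 yields gqa_ratio_eff == 4,
// the largest power of two that divides 12.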
if (Q->ne[1] * gqa_ratio_eff <= 8) {
return BEST_FATTN_KERNEL_TILE; // AMD WMMA is only faster if the full tile width of 16 can be utilized.
}
return BEST_FATTN_KERNEL_MMA_F16;
}
// If there are no tensor cores available, use the generic tile kernel: // If there are no tensor cores available, use the generic tile kernel:
if (can_use_vector_kernel) { if (can_use_vector_kernel) {
if (!ggml_is_quantized(K->type) && !ggml_is_quantized(V->type)) { if (!ggml_is_quantized(K->type) && !ggml_is_quantized(V->type)) {

View File

@ -3737,6 +3737,7 @@ static bool ggml_cuda_graph_set_enabled(ggml_backend_cuda_context * cuda_ctx) {
return cuda_ctx->cuda_graph->is_enabled(); return cuda_ctx->cuda_graph->is_enabled();
#else #else
GGML_UNUSED(cuda_ctx);
return false; return false;
#endif // USE_CUDA_GRAPH #endif // USE_CUDA_GRAPH
} }
@ -4122,6 +4123,7 @@ struct ggml_backend_cuda_device_context {
std::string name; std::string name;
std::string description; std::string description;
std::string pci_bus_id; std::string pci_bus_id;
int op_offload_min_batch_size;
}; };
static const char * ggml_backend_cuda_device_get_name(ggml_backend_dev_t dev) { static const char * ggml_backend_cuda_device_get_name(ggml_backend_dev_t dev) {
@ -4676,11 +4678,9 @@ static int64_t get_op_batch_size(const ggml_tensor * op) {
} }
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) { static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
const int min_batch_size = 32; ggml_backend_cuda_device_context * dev_ctx = (ggml_backend_cuda_device_context *) dev->context;
return get_op_batch_size(op) >= min_batch_size; return get_op_batch_size(op) >= dev_ctx->op_offload_min_batch_size;
GGML_UNUSED(dev);
} }
static ggml_backend_event_t ggml_backend_cuda_device_event_new(ggml_backend_dev_t dev) { static ggml_backend_event_t ggml_backend_cuda_device_event_new(ggml_backend_dev_t dev) {
@ -4848,6 +4848,7 @@ ggml_backend_reg_t ggml_backend_cuda_reg() {
std::lock_guard<std::mutex> lock(mutex); std::lock_guard<std::mutex> lock(mutex);
if (!initialized) { if (!initialized) {
ggml_backend_cuda_reg_context * ctx = new ggml_backend_cuda_reg_context; ggml_backend_cuda_reg_context * ctx = new ggml_backend_cuda_reg_context;
const int min_batch_size = getenv("GGML_OP_OFFLOAD_MIN_BATCH") ? atoi(getenv("GGML_OP_OFFLOAD_MIN_BATCH")) : 32;
for (int i = 0; i < ggml_cuda_info().device_count; i++) { for (int i = 0; i < ggml_cuda_info().device_count; i++) {
ggml_backend_cuda_device_context * dev_ctx = new ggml_backend_cuda_device_context; ggml_backend_cuda_device_context * dev_ctx = new ggml_backend_cuda_device_context;
@ -4861,6 +4862,7 @@ ggml_backend_reg_t ggml_backend_cuda_reg() {
char pci_bus_id[16] = {}; char pci_bus_id[16] = {};
snprintf(pci_bus_id, sizeof(pci_bus_id), "%04x:%02x:%02x.0", prop.pciDomainID, prop.pciBusID, prop.pciDeviceID); snprintf(pci_bus_id, sizeof(pci_bus_id), "%04x:%02x:%02x.0", prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
dev_ctx->pci_bus_id = pci_bus_id; dev_ctx->pci_bus_id = pci_bus_id;
dev_ctx->op_offload_min_batch_size = min_batch_size;
ggml_backend_dev_t dev = new ggml_backend_device { ggml_backend_dev_t dev = new ggml_backend_device {
/* .iface = */ ggml_backend_cuda_device_interface, /* .iface = */ ggml_backend_cuda_device_interface,

View File

@ -206,10 +206,16 @@ namespace ggml_cuda_mma {
static __device__ __forceinline__ int get_j(const int l) { static __device__ __forceinline__ int get_j(const int l) {
if constexpr (I == 16 && J == 16) { if constexpr (I == 16 && J == 16) {
// matrix C
#if defined(RDNA3) #if defined(RDNA3)
if constexpr (std::is_same_v<T, float> || std::is_same_v<T, int>) {
// matrix C
return 2 * l + (threadIdx.x / 16); return 2 * l + (threadIdx.x / 16);
} else {
// matrix A&B
return l;
}
#else #else
// matrix C uses the transposed register layout of matrices A & B on RDNA4
return ne * (threadIdx.x / 16) + l; return ne * (threadIdx.x / 16) + l;
#endif // defined(RDNA3) #endif // defined(RDNA3)
} else if constexpr (I == 16 && J == 8) { } else if constexpr (I == 16 && J == 8) {
@ -621,6 +627,21 @@ namespace ggml_cuda_mma {
return ret; return ret;
} }
#elif defined(AMD_WMMA_AVAILABLE)
template <int I, int J>
static __device__ __forceinline__ tile<I, J/2, half2> get_half2(const tile<I, J, float> & tile_float) {
tile<I, J/2, half2> ret;
#pragma unroll
for (int l0 = 0; l0 < tile_float.ne; l0 += 2) {
ret.x[l0/2] = make_half2(tile_float.x[l0 + 0], tile_float.x[l0 + 1]);
}
return ret;
}
static __device__ __forceinline__ tile<8, 8, half2> get_transposed(const tile<16, 4, half2> & t) {
NO_DEVICE_CODE;
return tile<8, 8, half2>{};
}
#else // Volta #else // Volta
template <int I, int J> template <int I, int J>
static __device__ __forceinline__ tile<I, J/2, half2> get_half2(const tile<I, J, float> & tile_float) { static __device__ __forceinline__ tile<I, J/2, half2> get_half2(const tile<I, J, float> & tile_float) {
@ -639,6 +660,19 @@ namespace ggml_cuda_mma {
} }
#endif // defined(TURING_MMA_AVAILABLE) #endif // defined(TURING_MMA_AVAILABLE)
static __device__ __forceinline__ void make_identity_mat(tile<16, 8, half2> & t) {
#if defined(RDNA4)
const int row = t.get_i(0);
const int left_right = t.get_j(0) / 4;
const int up_down = row / 8;
const int idx = row % 8;
reinterpret_cast<half*>(t.x)[idx] = left_right == up_down ? 1.0f : 0.0f;
#else
GGML_UNUSED_VARS(t);
NO_DEVICE_CODE;
#endif // defined(RDNA4)
}
template <int I, int J, typename T, data_layout dl> template <int I, int J, typename T, data_layout dl>
static __device__ __forceinline__ void load_generic(tile<I, J, T, dl> & t, const T * __restrict__ xs0, const int stride) { static __device__ __forceinline__ void load_generic(tile<I, J, T, dl> & t, const T * __restrict__ xs0, const int stride) {
#if defined(AMD_MFMA_AVAILABLE) #if defined(AMD_MFMA_AVAILABLE)
@ -878,6 +912,17 @@ namespace ggml_cuda_mma {
: "+r"(Dxi[2]), "+r"(Dxi[3]) : "+r"(Dxi[2]), "+r"(Dxi[3])
: "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[3])); : "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[3]));
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
#elif defined(AMD_WMMA_AVAILABLE)
#if defined(RDNA4)
using halfx8_t = __attribute__((ext_vector_type(8))) _Float16;
halfx8_t& acc_frag = reinterpret_cast<halfx8_t&>(D.x[0]);
const halfx8_t& a_frag = reinterpret_cast<const halfx8_t&>(A.x[0]);
const halfx8_t& b_frag = reinterpret_cast<const halfx8_t&>(B.x[0]);
acc_frag = __builtin_amdgcn_wmma_f16_16x16x16_f16_w32_gfx12(a_frag, b_frag, acc_frag);
#else
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // defined(RDNA4)
#else #else
GGML_UNUSED_VARS(D, A, B); GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE; NO_DEVICE_CODE;

View File

@ -190,7 +190,7 @@ void ggml_cuda_mul_mat_q(
{ {
const int64_t s11 = src1->nb[1] / ts_src1; const int64_t s11 = src1->nb[1] / ts_src1;
const int64_t s12 = src1->nb[2] / ts_src1; const int64_t s12 = src1->nb[2] / ts_src1;
const int64_t s13 = src1->nb[2] / ts_src1; const int64_t s13 = src1->nb[3] / ts_src1;
if (use_native_mxfp4) { if (use_native_mxfp4) {
quantize_mmq_mxfp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13, quantize_mmq_mxfp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
@ -333,28 +333,31 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
} }
if (amd_wmma_available(cc)) { if (amd_wmma_available(cc)) {
// RDNA 4 is consistently worse on rocblas
// https://github.com/ggml-org/llama.cpp/pull/18537#issuecomment-3706422301
if (GGML_CUDA_CC_IS_RDNA3(cc)) { if (GGML_CUDA_CC_IS_RDNA3(cc)) {
// High expert counts almost always better on MMQ // High expert counts are almost always better on MMQ due to
// due to a large amount of graph splits // the synchronization overhead in the cuBLAS/hipBLAS path:
// https://github.com/ggml-org/llama.cpp/pull/18202 // https://github.com/ggml-org/llama.cpp/pull/18202
if (n_experts >= 64) { if (n_experts >= 64) {
return true; return true;
} }
// For some quantization types MMQ can have lower peak TOPS than hipBLAS
// so it's only faster for sufficiently small batch sizes:
switch (type) { switch (type) {
// These quants are really bad on MMQ
case GGML_TYPE_Q2_K: case GGML_TYPE_Q2_K:
return ne11 <= 128;
case GGML_TYPE_Q6_K: case GGML_TYPE_Q6_K:
// These quants are usually worse but not always return ne11 <= (GGML_CUDA_CC_IS_RDNA3_0(cc) ? 128 : 256);
case GGML_TYPE_IQ2_XS: case GGML_TYPE_IQ2_XS:
case GGML_TYPE_IQ2_S: case GGML_TYPE_IQ2_S:
return ne11 <= 128; return GGML_CUDA_CC_IS_RDNA3_5(cc) || ne11 <= 128;
default: default:
return true; return true;
} }
} }
// For RDNA4 MMQ is consistently faster than dequantization + hipBLAS:
// https://github.com/ggml-org/llama.cpp/pull/18537#issuecomment-3706422301
return true; return true;
} }

View File

@ -138,6 +138,8 @@
#define cudaStream_t hipStream_t #define cudaStream_t hipStream_t
#define cudaSuccess hipSuccess #define cudaSuccess hipSuccess
#define cudaOccupancyMaxActiveBlocksPerMultiprocessor hipOccupancyMaxActiveBlocksPerMultiprocessor #define cudaOccupancyMaxActiveBlocksPerMultiprocessor hipOccupancyMaxActiveBlocksPerMultiprocessor
#define cudaFuncSetAttribute hipFuncSetAttribute
#define cudaFuncAttributeMaxDynamicSharedMemorySize hipFuncAttributeMaxDynamicSharedMemorySize
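// Mapped so the fattn code path that raises the dynamic shared memory limit via
// cudaFuncSetAttribute also compiles under HIP (the GGML_USE_HIP guard around
// that call was removed in the fattn-mma changes above).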
#define __trap() do { abort(); __builtin_unreachable(); } while(0) #define __trap() do { abort(); __builtin_unreachable(); } while(0)
#define CUBLAS_STATUS_SUCCESS HIPBLAS_STATUS_SUCCESS #define CUBLAS_STATUS_SUCCESS HIPBLAS_STATUS_SUCCESS
#define CUBLAS_STATUS_NOT_INITIALIZED HIPBLAS_STATUS_NOT_INITIALIZED #define CUBLAS_STATUS_NOT_INITIALIZED HIPBLAS_STATUS_NOT_INITIALIZED

View File

@ -219,6 +219,8 @@ struct ggml_metal_device_props {
bool use_shared_buffers; bool use_shared_buffers;
bool supports_gpu_family_apple7; bool supports_gpu_family_apple7;
int op_offload_min_batch_size;
}; };
ggml_metal_device_t ggml_metal_device_init(void); ggml_metal_device_t ggml_metal_device_init(void);

View File

@ -782,6 +782,8 @@ ggml_metal_device_t ggml_metal_device_init(void) {
dev->props.supports_gpu_family_apple7 = [dev->mtl_device supportsFamily:MTLGPUFamilyApple7]; dev->props.supports_gpu_family_apple7 = [dev->mtl_device supportsFamily:MTLGPUFamilyApple7];
dev->props.op_offload_min_batch_size = getenv("GGML_OP_OFFLOAD_MIN_BATCH") ? atoi(getenv("GGML_OP_OFFLOAD_MIN_BATCH")) : 32;
dev->props.max_buffer_size = dev->mtl_device.maxBufferLength; dev->props.max_buffer_size = dev->mtl_device.maxBufferLength;
dev->props.max_working_set_size = dev->mtl_device.recommendedMaxWorkingSetSize; dev->props.max_working_set_size = dev->mtl_device.recommendedMaxWorkingSetSize;
dev->props.max_theadgroup_memory_size = dev->mtl_device.maxThreadgroupMemoryLength; dev->props.max_theadgroup_memory_size = dev->mtl_device.maxThreadgroupMemoryLength;

View File

@ -625,14 +625,11 @@ static int64_t get_op_batch_size(const ggml_tensor * op) {
} }
static bool ggml_backend_metal_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) { static bool ggml_backend_metal_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
const int min_batch_size = 32; ggml_metal_device_t ctx_dev = (ggml_metal_device_t)dev->context;
return (op->op == GGML_OP_MUL_MAT || return (op->op == GGML_OP_MUL_MAT ||
op->op == GGML_OP_MUL_MAT_ID) && op->op == GGML_OP_MUL_MAT_ID) &&
get_op_batch_size(op) >= min_batch_size; get_op_batch_size(op) >= ggml_metal_device_get_props(ctx_dev)->op_offload_min_batch_size;
GGML_UNUSED(dev);
GGML_UNUSED(op);
} }
static ggml_backend_device_i ggml_backend_metal_device_i = { static ggml_backend_device_i ggml_backend_metal_device_i = {

View File

@ -9148,6 +9148,7 @@ typedef decltype(kernel_mul_mm_id_map0<1>) kernel_mul_mm_id_map0_t;
template [[host_name("kernel_mul_mm_id_map0_ne20_1" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<1>; template [[host_name("kernel_mul_mm_id_map0_ne20_1" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<1>;
template [[host_name("kernel_mul_mm_id_map0_ne20_2" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<2>; template [[host_name("kernel_mul_mm_id_map0_ne20_2" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<2>;
template [[host_name("kernel_mul_mm_id_map0_ne20_4" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<4>; template [[host_name("kernel_mul_mm_id_map0_ne20_4" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<4>;
template [[host_name("kernel_mul_mm_id_map0_ne20_5" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<5>;
template [[host_name("kernel_mul_mm_id_map0_ne20_6" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<6>; template [[host_name("kernel_mul_mm_id_map0_ne20_6" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<6>;
template [[host_name("kernel_mul_mm_id_map0_ne20_8" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<8>; template [[host_name("kernel_mul_mm_id_map0_ne20_8" )]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<8>;
template [[host_name("kernel_mul_mm_id_map0_ne20_10")]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<10>; template [[host_name("kernel_mul_mm_id_map0_ne20_10")]] kernel kernel_mul_mm_id_map0_t kernel_mul_mm_id_map0<10>;

View File

@ -57,6 +57,7 @@ set(GGML_OPENCL_KERNELS
add add
add_id add_id
argsort argsort
fill
clamp clamp
cpy cpy
cvt cvt
@ -120,6 +121,8 @@ set(GGML_OPENCL_KERNELS
tsembd tsembd
upscale upscale
tanh tanh
expm1
softplus
pad pad
repeat repeat
mul_mat_f16_f32 mul_mat_f16_f32

View File

@ -489,6 +489,7 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_gelu_quick, kernel_gelu_quick_4; cl_kernel kernel_gelu_quick, kernel_gelu_quick_4;
cl_kernel kernel_relu; cl_kernel kernel_relu;
cl_kernel kernel_sigmoid_f32, kernel_sigmoid_f16; cl_kernel kernel_sigmoid_f32, kernel_sigmoid_f16;
cl_kernel kernel_fill;
cl_kernel kernel_clamp; cl_kernel kernel_clamp;
cl_kernel kernel_geglu, kernel_reglu, kernel_swiglu, kernel_swiglu_oai, kernel_geglu_erf, kernel_geglu_quick, cl_kernel kernel_geglu, kernel_reglu, kernel_swiglu, kernel_swiglu_oai, kernel_geglu_erf, kernel_geglu_quick,
kernel_geglu_f16, kernel_reglu_f16, kernel_swiglu_f16, kernel_geglu_erf_f16, kernel_geglu_quick_f16; kernel_geglu_f16, kernel_reglu_f16, kernel_swiglu_f16, kernel_geglu_erf_f16, kernel_geglu_quick_f16;
@ -537,6 +538,10 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_pad; cl_kernel kernel_pad;
cl_kernel kernel_tanh_f32_nd; cl_kernel kernel_tanh_f32_nd;
cl_kernel kernel_tanh_f16_nd; cl_kernel kernel_tanh_f16_nd;
cl_kernel kernel_expm1_f32_nd;
cl_kernel kernel_expm1_f16_nd;
cl_kernel kernel_softplus_f32_nd;
cl_kernel kernel_softplus_f16_nd;
cl_kernel kernel_upscale; cl_kernel kernel_upscale;
cl_kernel kernel_upscale_bilinear; cl_kernel kernel_upscale_bilinear;
cl_kernel kernel_concat_f32_contiguous; cl_kernel kernel_concat_f32_contiguous;
@ -787,6 +792,24 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx, ggml_cl_ve
GGML_LOG_CONT("."); GGML_LOG_CONT(".");
} }
// fill
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "fill.cl.h"
};
#else
const std::string kernel_src = read_file("fill.cl");
#endif
cl_program prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_fill = clCreateKernel(prog, "kernel_fill_f32", &err), err));
GGML_LOG_CONT(".");
CL_CHECK(clReleaseProgram(prog));
}
// clamp // clamp
{ {
#ifdef GGML_OPENCL_EMBED_KERNELS #ifdef GGML_OPENCL_EMBED_KERNELS
@ -1780,6 +1803,56 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx, ggml_cl_ve
} }
} }
// expm1
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "expm1.cl.h"
};
#else
const std::string kernel_src = read_file("expm1.cl");
#endif
cl_program prog;
if (!kernel_src.empty()) {
prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_expm1_f32_nd = clCreateKernel(prog, "kernel_expm1_f32_nd", &err), err));
CL_CHECK((backend_ctx->kernel_expm1_f16_nd = clCreateKernel(prog, "kernel_expm1_f16_nd", &err), err));
GGML_LOG_CONT(".");
} else {
GGML_LOG_WARN("ggml_opencl: expm1 kernel source not found or empty. Expm1 operation will not be available.\n");
prog = nullptr;
backend_ctx->kernel_expm1_f32_nd = nullptr;
backend_ctx->kernel_expm1_f16_nd = nullptr;
}
CL_CHECK(clReleaseProgram(prog));
}
// softplus
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "softplus.cl.h"
};
#else
const std::string kernel_src = read_file("softplus.cl");
#endif
cl_program prog;
if (!kernel_src.empty()) {
prog =
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_softplus_f32_nd = clCreateKernel(prog, "kernel_softplus_f32_nd", &err), err));
CL_CHECK((backend_ctx->kernel_softplus_f16_nd = clCreateKernel(prog, "kernel_softplus_f16_nd", &err), err));
GGML_LOG_CONT(".");
} else {
GGML_LOG_WARN("ggml_opencl: softplus kernel source not found or empty. Softplus operation will not be available.\n");
prog = nullptr;
backend_ctx->kernel_softplus_f32_nd = nullptr;
backend_ctx->kernel_softplus_f16_nd = nullptr;
}
CL_CHECK(clReleaseProgram(prog));
}
// upscale // upscale
{ {
#ifdef GGML_OPENCL_EMBED_KERNELS #ifdef GGML_OPENCL_EMBED_KERNELS
@ -3089,6 +3162,12 @@ static bool ggml_opencl_supports_op(ggml_backend_dev_t dev, const struct ggml_te
case GGML_UNARY_OP_TANH: case GGML_UNARY_OP_TANH:
return (op->src[0]->type == GGML_TYPE_F32 && op->type == GGML_TYPE_F32) || return (op->src[0]->type == GGML_TYPE_F32 && op->type == GGML_TYPE_F32) ||
(op->src[0]->type == GGML_TYPE_F16 && op->type == GGML_TYPE_F16); (op->src[0]->type == GGML_TYPE_F16 && op->type == GGML_TYPE_F16);
case GGML_UNARY_OP_EXPM1:
return (op->src[0]->type == GGML_TYPE_F32 && op->type == GGML_TYPE_F32) ||
(op->src[0]->type == GGML_TYPE_F16 && op->type == GGML_TYPE_F16);
case GGML_UNARY_OP_SOFTPLUS:
return (op->src[0]->type == GGML_TYPE_F32 && op->type == GGML_TYPE_F32) ||
(op->src[0]->type == GGML_TYPE_F16 && op->type == GGML_TYPE_F16);
default: default:
return false; return false;
} }
@ -3104,6 +3183,8 @@ static bool ggml_opencl_supports_op(ggml_backend_dev_t dev, const struct ggml_te
default: default:
return false; return false;
} }
case GGML_OP_FILL:
return op->type == GGML_TYPE_F32 && ggml_is_contiguous(op);
case GGML_OP_CLAMP: case GGML_OP_CLAMP:
return op->src[0]->type == GGML_TYPE_F32; return op->src[0]->type == GGML_TYPE_F32;
case GGML_OP_SOFT_MAX: case GGML_OP_SOFT_MAX:
@ -4266,8 +4347,8 @@ static const char * ggml_backend_opencl_device_get_description(ggml_backend_dev_
} }
static void ggml_backend_opencl_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) { static void ggml_backend_opencl_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
*free = 1; *free = 0;
*total = 1; *total = 0;
GGML_UNUSED(dev); GGML_UNUSED(dev);
} }
@ -5860,6 +5941,36 @@ static void ggml_cl_sigmoid(ggml_backend_t backend, const ggml_tensor * src0, co
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size_ptr, dst); backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size_ptr, dst);
} }
static void ggml_cl_fill(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(dst);
GGML_ASSERT(dst->extra);
UNUSED(src0);
UNUSED(src1);
ggml_backend_opencl_context *backend_ctx = (ggml_backend_opencl_context *)backend->context;
ggml_tensor_extra_cl * extrad = (ggml_tensor_extra_cl *)dst->extra;
cl_ulong offsetd = extrad->offset + dst->view_offs;
float v = 0.0f;
memcpy(&v, ((int32_t *) dst->op_params), sizeof(float));
const int n = (int) ggml_nelements(dst);
cl_kernel kernel = backend_ctx->kernel_fill;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_ulong), &offsetd));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(float), &v));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(int), &n));
size_t local_work_size[1] = { 256 };
size_t global_work_size[1] = { ((size_t)n + local_work_size[0] - 1) / local_work_size[0] * local_work_size[0] };
backend_ctx->enqueue_ndrange_kernel(kernel, 1, global_work_size, local_work_size, dst);
}
static void ggml_cl_clamp(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { static void ggml_cl_clamp(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(src0); GGML_ASSERT(src0);
GGML_ASSERT(src0->extra); GGML_ASSERT(src0->extra);
@ -6413,6 +6524,210 @@ static void ggml_cl_tanh(ggml_backend_t backend, const ggml_tensor * src0, const
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size_ptr, dst); backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size_ptr, dst);
} }
static void ggml_cl_expm1(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(src0);
GGML_ASSERT(src0->extra);
GGML_ASSERT(dst);
GGML_ASSERT(dst->extra);
UNUSED(src1);
ggml_backend_opencl_context *backend_ctx = (ggml_backend_opencl_context *)backend->context;
ggml_tensor_extra_cl * extra0 = (ggml_tensor_extra_cl *)src0->extra;
ggml_tensor_extra_cl * extrad = (ggml_tensor_extra_cl *)dst->extra;
cl_ulong offset0_abs = extra0->offset + src0->view_offs;
cl_ulong offsetd_abs = extrad->offset + dst->view_offs;
cl_kernel kernel;
if (dst->type == GGML_TYPE_F32) {
kernel = backend_ctx->kernel_expm1_f32_nd;
} else if (dst->type == GGML_TYPE_F16) {
kernel = backend_ctx->kernel_expm1_f16_nd;
} else {
GGML_ASSERT(false && "Unsupported type for ggml_cl_expm1");
}
GGML_ASSERT(kernel != nullptr);
const int ne00 = src0->ne[0];
const int ne01 = src0->ne[1];
const int ne02 = src0->ne[2];
const int ne03 = src0->ne[3];
const cl_ulong nb00 = src0->nb[0];
const cl_ulong nb01 = src0->nb[1];
const cl_ulong nb02 = src0->nb[2];
const cl_ulong nb03 = src0->nb[3];
const int ne10 = dst->ne[0];
const int ne11 = dst->ne[1];
const int ne12 = dst->ne[2];
const int ne13 = dst->ne[3];
const cl_ulong nb10 = dst->nb[0];
const cl_ulong nb11 = dst->nb[1];
const cl_ulong nb12 = dst->nb[2];
const cl_ulong nb13 = dst->nb[3];
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0->data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_ulong), &offset0_abs));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_ulong), &offsetd_abs));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(int), &ne03));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(cl_ulong), &nb00));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(cl_ulong), &nb01));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(cl_ulong),&nb02));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(cl_ulong),&nb03));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &ne10));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(int), &ne11));
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(int), &ne12));
CL_CHECK(clSetKernelArg(kernel, 15, sizeof(int), &ne13));
CL_CHECK(clSetKernelArg(kernel, 16, sizeof(cl_ulong),&nb10));
CL_CHECK(clSetKernelArg(kernel, 17, sizeof(cl_ulong),&nb11));
CL_CHECK(clSetKernelArg(kernel, 18, sizeof(cl_ulong),&nb12));
CL_CHECK(clSetKernelArg(kernel, 19, sizeof(cl_ulong),&nb13));
size_t global_work_size[3];
if (ne10 == 0 || ne11 == 0 || ne12 == 0 || ne13 == 0) { // Handle case of 0 elements
return;
}
global_work_size[0] = (size_t)ne10;
global_work_size[1] = (size_t)ne11;
global_work_size[2] = (size_t)ne12;
size_t lws0 = 16, lws1 = 4, lws2 = 1;
if (ne10 < 16) lws0 = ne10;
if (ne11 < 4) lws1 = ne11;
if (ne12 < 1) lws2 = ne12 > 0 ? ne12 : 1;
while (lws0 * lws1 * lws2 > 256 && lws0 > 1) lws0 /= 2;
while (lws0 * lws1 * lws2 > 256 && lws1 > 1) lws1 /= 2;
while (lws0 * lws1 * lws2 > 256 && lws2 > 1) lws2 /= 2;
size_t local_work_size[] = {lws0, lws1, lws2};
size_t* local_work_size_ptr = local_work_size;
if (!backend_ctx->non_uniform_workgroups) {
if (global_work_size[0] % local_work_size[0] != 0 ||
global_work_size[1] % local_work_size[1] != 0 ||
global_work_size[2] % local_work_size[2] != 0) {
local_work_size_ptr = NULL;
}
}
if (global_work_size[0] == 0 || global_work_size[1] == 0 || global_work_size[2] == 0) return;
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size_ptr, dst);
}
static void ggml_cl_softplus(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(src0);
GGML_ASSERT(src0->extra);
GGML_ASSERT(dst);
GGML_ASSERT(dst->extra);
UNUSED(src1);
ggml_backend_opencl_context *backend_ctx = (ggml_backend_opencl_context *)backend->context;
ggml_tensor_extra_cl * extra0 = (ggml_tensor_extra_cl *)src0->extra;
ggml_tensor_extra_cl * extrad = (ggml_tensor_extra_cl *)dst->extra;
cl_ulong offset0_abs = extra0->offset + src0->view_offs;
cl_ulong offsetd_abs = extrad->offset + dst->view_offs;
cl_kernel kernel;
if (dst->type == GGML_TYPE_F32) {
kernel = backend_ctx->kernel_softplus_f32_nd;
} else if (dst->type == GGML_TYPE_F16) {
kernel = backend_ctx->kernel_softplus_f16_nd;
} else {
GGML_ASSERT(false && "Unsupported type for ggml_cl_softplus");
}
GGML_ASSERT(kernel != nullptr);
const int ne00 = src0->ne[0];
const int ne01 = src0->ne[1];
const int ne02 = src0->ne[2];
const int ne03 = src0->ne[3];
const cl_ulong nb00 = src0->nb[0];
const cl_ulong nb01 = src0->nb[1];
const cl_ulong nb02 = src0->nb[2];
const cl_ulong nb03 = src0->nb[3];
const int ne10 = dst->ne[0];
const int ne11 = dst->ne[1];
const int ne12 = dst->ne[2];
const int ne13 = dst->ne[3];
const cl_ulong nb10 = dst->nb[0];
const cl_ulong nb11 = dst->nb[1];
const cl_ulong nb12 = dst->nb[2];
const cl_ulong nb13 = dst->nb[3];
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0->data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_ulong), &offset0_abs));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_ulong), &offsetd_abs));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(int), &ne03));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(cl_ulong), &nb00));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(cl_ulong), &nb01));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(cl_ulong),&nb02));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(cl_ulong),&nb03));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &ne10));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(int), &ne11));
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(int), &ne12));
CL_CHECK(clSetKernelArg(kernel, 15, sizeof(int), &ne13));
CL_CHECK(clSetKernelArg(kernel, 16, sizeof(cl_ulong),&nb10));
CL_CHECK(clSetKernelArg(kernel, 17, sizeof(cl_ulong),&nb11));
CL_CHECK(clSetKernelArg(kernel, 18, sizeof(cl_ulong),&nb12));
CL_CHECK(clSetKernelArg(kernel, 19, sizeof(cl_ulong),&nb13));
size_t global_work_size[3];
if (ne10 == 0 || ne11 == 0 || ne12 == 0 || ne13 == 0) { // Handle case of 0 elements
return;
}
global_work_size[0] = (size_t)ne10;
global_work_size[1] = (size_t)ne11;
global_work_size[2] = (size_t)ne12;
size_t lws0 = 16, lws1 = 4, lws2 = 1;
if (ne10 < 16) lws0 = ne10;
if (ne11 < 4) lws1 = ne11;
if (ne12 < 1) lws2 = ne12 > 0 ? ne12 : 1;
while (lws0 * lws1 * lws2 > 256 && lws0 > 1) lws0 /= 2;
while (lws0 * lws1 * lws2 > 256 && lws1 > 1) lws1 /= 2;
while (lws0 * lws1 * lws2 > 256 && lws2 > 1) lws2 /= 2;
size_t local_work_size[] = {lws0, lws1, lws2};
size_t* local_work_size_ptr = local_work_size;
if (!backend_ctx->non_uniform_workgroups) {
if (global_work_size[0] % local_work_size[0] != 0 ||
global_work_size[1] % local_work_size[1] != 0 ||
global_work_size[2] % local_work_size[2] != 0) {
local_work_size_ptr = NULL;
}
}
if (global_work_size[0] == 0 || global_work_size[1] == 0 || global_work_size[2] == 0) return;
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size_ptr, dst);
}
static void ggml_cl_repeat(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1_shape_def, ggml_tensor * dst) { static void ggml_cl_repeat(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1_shape_def, ggml_tensor * dst) {
GGML_ASSERT(src0); GGML_ASSERT(src0);
GGML_ASSERT(src0->extra); GGML_ASSERT(src0->extra);
@ -9586,6 +9901,18 @@ bool ggml_cl_compute_forward(ggml_backend_t backend, struct ggml_tensor * tensor
} }
func = ggml_cl_tanh; func = ggml_cl_tanh;
break; break;
case GGML_UNARY_OP_EXPM1:
if (!any_on_device) {
return false;
}
func = ggml_cl_expm1;
break;
case GGML_UNARY_OP_SOFTPLUS:
if (!any_on_device) {
return false;
}
func = ggml_cl_softplus;
break;
default: default:
return false; return false;
} break; } break;
@ -9595,6 +9922,12 @@ bool ggml_cl_compute_forward(ggml_backend_t backend, struct ggml_tensor * tensor
} }
func = ggml_cl_glu; func = ggml_cl_glu;
break; break;
case GGML_OP_FILL:
if (!any_on_device) {
return false;
}
func = ggml_cl_fill;
break;
case GGML_OP_CLAMP: case GGML_OP_CLAMP:
if (!any_on_device) { if (!any_on_device) {
return false; return false;

View File

@ -0,0 +1,82 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
//------------------------------------------------------------------------------
// expm1
//------------------------------------------------------------------------------
kernel void kernel_expm1_f32_nd(
global void * p_src0_base,
ulong off_src0_abs,
global void * p_dst_base,
ulong off_dst_abs,
int ne00,
int ne01,
int ne02,
int ne03,
ulong nb00,
ulong nb01,
ulong nb02,
ulong nb03,
int ne10,
int ne11,
int ne12,
int ne13,
ulong nb10,
ulong nb11,
ulong nb12,
ulong nb13
) {
int i0 = get_global_id(0);
int i1 = get_global_id(1);
int i2 = get_global_id(2);
if (i0 < ne10 && i1 < ne11 && i2 < ne12) {
for (int i3 = 0; i3 < ne13; ++i3) {
ulong src_offset_in_tensor = (ulong)i0*nb00 + (ulong)i1*nb01 + (ulong)i2*nb02 + (ulong)i3*nb03;
global const float *src_val_ptr = (global const float *)((global char *)p_src0_base + off_src0_abs + src_offset_in_tensor);
ulong dst_offset_in_tensor = (ulong)i0*nb10 + (ulong)i1*nb11 + (ulong)i2*nb12 + (ulong)i3*nb13;
global float *dst_val_ptr = (global float *)((global char *)p_dst_base + off_dst_abs + dst_offset_in_tensor);
*dst_val_ptr = expm1(*src_val_ptr); // built-in expm1 keeps precision for small |x|, unlike exp(x) - 1
}
}
}
kernel void kernel_expm1_f16_nd(
global void * p_src0_base,
ulong off_src0_abs,
global void * p_dst_base,
ulong off_dst_abs,
int ne00,
int ne01,
int ne02,
int ne03,
ulong nb00,
ulong nb01,
ulong nb02,
ulong nb03,
int ne10,
int ne11,
int ne12,
int ne13,
ulong nb10,
ulong nb11,
ulong nb12,
ulong nb13
) {
int i0 = get_global_id(0);
int i1 = get_global_id(1);
int i2 = get_global_id(2);
if (i0 < ne10 && i1 < ne11 && i2 < ne12) {
for (int i3 = 0; i3 < ne13; ++i3) {
ulong src_offset_in_tensor = (ulong)i0*nb00 + (ulong)i1*nb01 + (ulong)i2*nb02 + (ulong)i3*nb03;
global const half *src_val_ptr = (global const half *)((global char *)p_src0_base + off_src0_abs + src_offset_in_tensor);
ulong dst_offset_in_tensor = (ulong)i0*nb10 + (ulong)i1*nb11 + (ulong)i2*nb12 + (ulong)i3*nb13;
global half *dst_val_ptr = (global half *)((global char *)p_dst_base + off_dst_abs + dst_offset_in_tensor);
*dst_val_ptr = (half) expm1((float)(*src_val_ptr));
}
}
}

View File

@ -0,0 +1,17 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
//------------------------------------------------------------------------------
// fill
//------------------------------------------------------------------------------
__kernel void kernel_fill_f32(
__global float *dst,
ulong offsetd,
float v,
int n
) {
dst = (global float*)((global char*)dst + offsetd);
if (get_global_id(0) < n) {
dst[get_global_id(0)] = v;
}
}

View File

@ -0,0 +1,88 @@
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
//------------------------------------------------------------------------------
// softplus
//------------------------------------------------------------------------------
inline float softplus_f32(float x){
float ax = fabs(x);
float m = fmax(x, 0.0f);
return log1p(exp(-ax)) + m;
}
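// The expression above is the numerically stable form of softplus:
//   log1p(exp(-|x|)) + max(x, 0) == log(1 + exp(x))
// (for x >= 0: log(1 + e^x) = x + log(1 + e^-x); for x < 0 it is log1p(e^x) directly),
// avoiding overflow of exp(x) for large positive inputs.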
kernel void kernel_softplus_f32_nd(
global void * p_src0_base,
ulong off_src0_abs,
global void * p_dst_base,
ulong off_dst_abs,
int ne00,
int ne01,
int ne02,
int ne03,
ulong nb00,
ulong nb01,
ulong nb02,
ulong nb03,
int ne10,
int ne11,
int ne12,
int ne13,
ulong nb10,
ulong nb11,
ulong nb12,
ulong nb13
) {
int i0 = get_global_id(0);
int i1 = get_global_id(1);
int i2 = get_global_id(2);
if (i0 < ne10 && i1 < ne11 && i2 < ne12) {
for (int i3 = 0; i3 < ne13; ++i3) {
ulong src_offset_in_tensor = (ulong)i0*nb00 + (ulong)i1*nb01 + (ulong)i2*nb02 + (ulong)i3*nb03;
global const float *src_val_ptr = (global const float *)((global char *)p_src0_base + off_src0_abs + src_offset_in_tensor);
ulong dst_offset_in_tensor = (ulong)i0*nb10 + (ulong)i1*nb11 + (ulong)i2*nb12 + (ulong)i3*nb13;
global float *dst_val_ptr = (global float *)((global char *)p_dst_base + off_dst_abs + dst_offset_in_tensor);
*dst_val_ptr = softplus_f32(*src_val_ptr);
}
}
}
kernel void kernel_softplus_f16_nd(
global void * p_src0_base,
ulong off_src0_abs,
global void * p_dst_base,
ulong off_dst_abs,
int ne00,
int ne01,
int ne02,
int ne03,
ulong nb00,
ulong nb01,
ulong nb02,
ulong nb03,
int ne10,
int ne11,
int ne12,
int ne13,
ulong nb10,
ulong nb11,
ulong nb12,
ulong nb13
) {
int i0 = get_global_id(0);
int i1 = get_global_id(1);
int i2 = get_global_id(2);
if (i0 < ne10 && i1 < ne11 && i2 < ne12) {
for (int i3 = 0; i3 < ne13; ++i3) {
ulong src_offset_in_tensor = (ulong)i0*nb00 + (ulong)i1*nb01 + (ulong)i2*nb02 + (ulong)i3*nb03;
global const half *src_val_ptr = (global const half *)((global char *)p_src0_base + off_src0_abs + src_offset_in_tensor);
ulong dst_offset_in_tensor = (ulong)i0*nb10 + (ulong)i1*nb11 + (ulong)i2*nb12 + (ulong)i3*nb13;
global half *dst_val_ptr = (global half *)((global char *)p_dst_base + off_dst_abs + dst_offset_in_tensor);
*dst_val_ptr = (half)(softplus_f32((float)(*src_val_ptr)));
}
}
}

View File

@ -4286,6 +4286,7 @@ struct ggml_backend_sycl_device_context {
     int device;
     std::string name;
     std::string description;
+    int op_offload_min_batch_size;
 };

 static const char * ggml_backend_sycl_device_get_name(ggml_backend_dev_t dev) {
@ -4674,9 +4675,8 @@ static int64_t get_op_batch_size(const ggml_tensor * op) {
 }

 static bool ggml_backend_sycl_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
-    const int min_batch_size = 32;
-    return get_op_batch_size(op) >= min_batch_size;
-    GGML_UNUSED(dev);
+    ggml_backend_sycl_device_context * sycl_ctx = (ggml_backend_sycl_device_context *)dev->context;
+    return get_op_batch_size(op) >= sycl_ctx->op_offload_min_batch_size;
 }

 static ggml_backend_event_t
@ -4799,6 +4799,7 @@ ggml_backend_reg_t ggml_backend_sycl_reg() {
         std::lock_guard<std::mutex> lock(mutex);
         if (!initialized) {
             ggml_backend_sycl_reg_context * ctx = new ggml_backend_sycl_reg_context;
+            const int min_batch_size = getenv("GGML_OP_OFFLOAD_MIN_BATCH") ? atoi(getenv("GGML_OP_OFFLOAD_MIN_BATCH")) : 32;

             for (int i = 0; i < ggml_sycl_info().device_count; i++) {
                 ggml_backend_sycl_device_context * dev_ctx = new ggml_backend_sycl_device_context;
@ -4812,6 +4813,7 @@ ggml_backend_reg_t ggml_backend_sycl_reg() {
                         prop, dpct::dev_mgr::instance().get_device(i))));
                 dev_ctx->description = prop.get_name();
+                dev_ctx->op_offload_min_batch_size = min_batch_size;

                 ggml_backend_dev_t dev = new ggml_backend_device {
                     /* .iface = */ ggml_backend_sycl_device_interface,
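
The change above makes the SYCL op-offload threshold configurable per device: ggml_backend_sycl_reg reads GGML_OP_OFFLOAD_MIN_BATCH once (defaulting to 32) and stores it in each device context, and ggml_backend_sycl_device_offload_op compares against that stored value instead of a hard-coded constant. A stand-alone C++ sketch of the read-once-with-default pattern (the helper name here is illustrative, not part of the patch):

#include <cstdio>
#include <cstdlib>

// Read an integer environment variable once, falling back to a default when it is
// unset; this mirrors how the registration code derives op_offload_min_batch_size.
static int env_int_or(const char * name, int fallback) {
    const char * val = std::getenv(name);
    return val ? std::atoi(val) : fallback;
}

int main() {
    const int min_batch_size = env_int_or("GGML_OP_OFFLOAD_MIN_BATCH", 32);
    std::printf("offload ops with batch size >= %d\n", min_batch_size);
    return 0;
}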


@ -119,6 +119,8 @@ struct ggml_backend_vk_context;
 // Max number of adds that can be fused without exceeding MAX_PARAMETER_COUNT.
 #define MAX_FUSED_ADDS (MAX_PARAMETER_COUNT - 3)

+typedef std::shared_ptr<struct vk_pipeline_struct> vk_pipeline;
+
 struct vk_pipeline_struct {
     std::string name;
     vk::ShaderModule shader_module;
@ -136,9 +138,15 @@ struct vk_pipeline_struct {
     std::atomic<bool> compiled {};
     // number of registers used, extracted from pipeline executable properties
     uint32_t register_count {};
+#if defined(VK_EXT_shader_64bit_indexing)
+    bool is_64b_indexing {};
+#endif
+    // linked list of pipelines for multiple compilation variants.
+    // currently only used to compile a 64-bit indexing variant.
+    vk_pipeline next;
 };

-typedef std::shared_ptr<vk_pipeline_struct> vk_pipeline;
 typedef std::weak_ptr<vk_pipeline_struct> vk_pipeline_ref;

 static void ggml_vk_destroy_pipeline(vk::Device& device, vk_pipeline& pipeline);
@ -230,9 +238,7 @@ static ggml_backend_buffer_type_i ggml_backend_vk_buffer_type_interface = {
     /* .is_host = */ NULL,
 };

-#ifdef GGML_VULKAN_MEMORY_DEBUG
 class vk_memory_logger;
-#endif
 class vk_perf_logger;
 static void ggml_vk_destroy_buffer(vk_buffer& buf);
 static void ggml_vk_synchronize(ggml_backend_vk_context * ctx);
@ -570,6 +576,7 @@ struct vk_device_struct {
     bool uma;
     bool prefer_host_memory;
     bool float_controls_rte_fp16;
+    bool subgroup_basic;
     bool subgroup_arithmetic;
     bool subgroup_shuffle;
     bool subgroup_ballot;
@ -583,6 +590,8 @@ struct vk_device_struct {
     bool add_rms_fusion;
     uint32_t partials_binding_alignment;

+    bool shader_64b_indexing;
+
     bool integer_dot_product;
     // 0: default, 1: force mmvq, -1: disable mmvq
     int32_t mmvq_mode;
@ -814,9 +823,7 @@ struct vk_device_struct {
     bool allow_sysmem_fallback;
     bool disable_graph_optimize;

-#ifdef GGML_VULKAN_MEMORY_DEBUG
     std::unique_ptr<vk_memory_logger> memory_logger;
-#endif

     ~vk_device_struct() {
         VK_LOG_DEBUG("destroy device " << name);
@ -1504,6 +1511,11 @@ template <> void init_pushconst_fastdiv(vk_op_sum_rows_push_constants &p) {
     init_fastdiv_values(p.ne01, p.ne0_1mp, p.ne0_1L);
 }

+struct vk_quantize_q8_1_push_constants {
+    uint32_t ne;
+    uint32_t num_blocks;
+};
+
 // Allow pre-recording command buffers
 struct vk_staging_memcpy {
     vk_staging_memcpy(void * _dst, const void * _src, size_t _n) : dst(_dst), src(_src), n(_n) {}
@ -1547,8 +1559,9 @@ static void ggml_vk_preallocate_buffers(ggml_backend_vk_context * ctx, vk_contex
 static void ggml_vk_load_shaders(vk_device& device);
 static void ggml_pipeline_allocate_descriptor_sets(ggml_backend_vk_context * ctx);

-#if defined(GGML_VULKAN_MEMORY_DEBUG) || defined(GGML_VULKAN_DEBUG)
-#define VK_LOG_MEMORY(msg) std::cerr << "ggml_vulkan memory: " << msg << std::endl
+static bool vk_memory_logger_enabled = false;
+#define VK_LOG_MEMORY(msg) if (vk_memory_logger_enabled) { std::cerr << "ggml_vulkan memory: " << msg << std::endl; }

 static std::string format_size(size_t size) {
     const size_t kib = 1024;
@ -1581,10 +1594,10 @@ private:
     std::map<vk::Buffer, size_t> allocations; // Track allocations
     size_t total_device;
     size_t total_host;
+    static std::mutex log_mutex;
 };

-#else
-#define VK_LOG_MEMORY(msg) ((void) 0)
-#endif // GGML_VULKAN_MEMORY_DEBUG
+std::mutex vk_memory_logger::log_mutex;

 static bool vk_perf_logger_enabled = false;
 static bool vk_perf_logger_concurrent = false;
@ -1891,10 +1904,10 @@ struct ggml_backend_vk_buffer_context {
     }
 };

-#ifdef GGML_VULKAN_MEMORY_DEBUG
-static std::mutex log_mutex;
-
 void vk_memory_logger::log_allocation(vk_buffer_ref buf_ref, size_t size) {
+    if (!vk_memory_logger_enabled) {
+        return;
+    }
     std::lock_guard<std::mutex> guard(log_mutex);
     vk_buffer buf = buf_ref.lock();
     const bool device = bool(buf->memory_property_flags & vk::MemoryPropertyFlagBits::eDeviceLocal);
@ -1906,7 +1919,7 @@ void vk_memory_logger::log_allocation(vk_buffer_ref buf_ref, size_t size) {
 }

 void vk_memory_logger::log_deallocation(vk_buffer_ref buf_ref) {
-    if (buf_ref.expired() || buf_ref.lock()->size == 0) {
+    if (buf_ref.expired() || buf_ref.lock()->size == 0 || !vk_memory_logger_enabled) {
         return;
     }
@ -1924,7 +1937,6 @@ void vk_memory_logger::log_deallocation(vk_buffer_ref buf_ref) {
         VK_LOG_MEMORY("ERROR " << buf->device->name << ": Attempted to deallocate unknown " << type << " memory at " << buf->buffer);
     }
 }
-#endif // GGML_VULKAN_MEMORY_DEBUG

 struct vk_instance_t {
     vk::Instance instance;
@ -2074,6 +2086,19 @@ static void ggml_vk_create_pipeline_func(vk_device& device, vk_pipeline& pipelin
         compute_pipeline_create_info.setPNext(&rci);
     }

+#if defined(VK_EXT_shader_64bit_indexing)
+    vk::PipelineCreateFlags2CreateInfo pipelineFlags2CreateInfo;
+    if (pipeline->is_64b_indexing)
+    {
+        pipelineFlags2CreateInfo.flags = vk::PipelineCreateFlagBits2::e64BitIndexingEXT;
+        if (device->pipeline_executable_properties_support) {
+            pipelineFlags2CreateInfo.flags |= vk::PipelineCreateFlagBits2::eCaptureStatisticsKHR;
+        }
+        pipelineFlags2CreateInfo.setPNext(compute_pipeline_create_info.pNext);
+        compute_pipeline_create_info.setPNext(&pipelineFlags2CreateInfo);
+    }
+#endif
+
     try {
         pipeline->pipeline = device->device.createComputePipeline(VK_NULL_HANDLE, compute_pipeline_create_info).value;
     } catch (const vk::SystemError& e) {
@ -2564,9 +2589,7 @@ static vk_buffer ggml_vk_create_buffer(vk_device& device, size_t size, const std
         buf->bda_addr = device->device.getBufferAddress(addressInfo);
     }

-#ifdef GGML_VULKAN_MEMORY_DEBUG
     device->memory_logger->log_allocation(buf, size);
-#endif

     return buf;
 }
@ -2623,11 +2646,9 @@ static void ggml_vk_destroy_buffer(vk_buffer& buf) {
         return;
     }

-#ifdef GGML_VULKAN_MEMORY_DEBUG
     if (buf->device != nullptr) {
         buf->device->memory_logger->log_deallocation(buf);
     }
-#endif

     buf.reset();
 }
@ -2996,6 +3017,15 @@ static void ggml_vk_load_shaders(vk_device& device) {
     if ((device->architecture == AMD_GCN) && (device->driver_id != vk::DriverId::eAmdProprietary)) {
         m_warptile_mmq = m_warptile_mmq_int = { 256, 64, 64, 32, 16, 16, 2, 2, 2, 1, 16 };
         m_warptile_mmqid = m_warptile_mmqid_int = { 256, 64, 64, 32, 16, 16, 2, 2, 2, 1, 16 };
+    } else if (device->vendor_id == VK_VENDOR_ID_AMD && device->coopmat_support && device->driver_id != vk::DriverId::eAmdProprietary) {
+        // This is intentionally using tx_m values, slight performance increase
+        l_warptile = { 256, 128, 128, 16, subgroup_size_8, 64, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
+        l_warptile_mmq = l_warptile_mmq_int = { 256, 128, 128, 32, subgroup_size_8, 64, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
+        l_warptile_mmq_int_k = { 256, 128, 128, 32, subgroup_size_16, 64, 1, 4, 2, 1, subgroup_size_16 };
+    } else if (device->vendor_id == VK_VENDOR_ID_INTEL && device->coopmat_support && device->architecture == INTEL_XE2) {
+        // Xe2/Xe3 with coopmat enabled - warptile performance tuning
+        l_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
+        l_warptile_mmq = { 512, 128, 128, 32, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
     }

     l_mmq_wg_denoms = l_wg_denoms = {128, 128, 1 };
@ -3051,7 +3081,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
     }

     std::vector<std::future<void>> compiles;
-    auto const &ggml_vk_create_pipeline = [&](vk_device& device, vk_pipeline& pipeline, const char *name, size_t spv_size, const void* spv_data, const char *entrypoint,
+    auto const &ggml_vk_create_pipeline = [&](vk_device& device, vk_pipeline& base_pipeline, const char *name, size_t spv_size, const void* spv_data, const char *entrypoint,
                                               uint32_t parameter_count, uint32_t push_constant_size, std::array<uint32_t, 3> wg_denoms, const std::vector<uint32_t>& specialization_constants,
                                               uint32_t align, bool disable_robustness = false, bool require_full_subgroups = false, uint32_t required_subgroup_size = 0) {
@ -3059,6 +3089,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
            required_subgroup_size = get_subgroup_size(name, device->architecture);
         }

+        vk_pipeline *ptr = &base_pipeline;
+        int num_pipelines = 1;
+#if defined(VK_EXT_shader_64bit_indexing)
+        if (device->shader_64b_indexing) {
+            num_pipelines = 2;
+        }
+#endif
+        for (int i = 0; i < num_pipelines; ++i, ptr = &(*ptr)->next) {
+        vk_pipeline &pipeline = *ptr;
+
         if (!pipeline) {
             pipeline = std::make_shared<vk_pipeline_struct>();
         }
@ -3069,10 +3109,13 @@ static void ggml_vk_load_shaders(vk_device& device) {
             pipeline->wg_denoms = wg_denoms;
             pipeline->align = align;
             pipeline->initialized = true;
+#if defined(VK_EXT_shader_64bit_indexing)
+            pipeline->is_64b_indexing = (i == 1);
+#endif
         }

         if (!pipeline->needed || pipeline->compiled) {
-            return;
+            continue;
         }
         // TODO: We're no longer benefitting from the async compiles (shaders are
         // compiled individually, as needed) and this complexity can be removed.
@ -3088,6 +3131,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
         compiles.push_back(std::async(ggml_vk_create_pipeline_func, std::ref(device), std::ref(pipeline), spv_size, spv_data, entrypoint,
                                       parameter_count, wg_denoms, specialization_constants, disable_robustness, require_full_subgroups, required_subgroup_size));
+        }
     };

     auto const &ggml_vk_create_pipeline2 = [&](vk_device& device, vk_pipeline& pipeline, const std::string &name, size_t spv_size, const void* spv_data, const char *entrypoint,
@ -3336,12 +3380,12 @@ static void ggml_vk_load_shaders(vk_device& device) {
         GGML_ASSERT(device->subgroup_ballot);

-        CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_subgroup_f32_f32, , wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id);
-        CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16, matmul_id_subgroup_f16, wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id);
-        CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16_f32, matmul_id_subgroup_f16_f32, wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id);
+        CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_subgroup_f32_f32, , wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id);
+        CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16, matmul_id_subgroup_f16, wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id);
+        CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16_f32, matmul_id_subgroup_f16_f32, wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id);
 #if defined(GGML_VULKAN_BFLOAT16_GLSLC_SUPPORT)
         if (device->coopmat_bf16_support) {
-            CREATE_MM(GGML_TYPE_BF16, pipeline_matmul_id_bf16, matmul_id_subgroup_bf16, , wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id);
+            CREATE_MM(GGML_TYPE_BF16, pipeline_matmul_id_bf16, matmul_id_subgroup_bf16, , wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id);
         }
 #endif
@ -3449,9 +3493,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
 #endif
         if (device->subgroup_ballot && device->subgroup_require_full_support && subgroup_min_size_16) {
-            CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_subgroup_f32_f32, , wg_denoms, warptile_id, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
-            CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16, matmul_id_subgroup_f16, wg_denoms, warptile_id, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
-            CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16_f32, matmul_id_subgroup_f16_f32, wg_denoms, warptile_id, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
+            CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_subgroup_f32_f32, , wg_denoms, warptile_id, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
+            CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16, matmul_id_subgroup_f16, wg_denoms, warptile_id, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
+            CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16_f32, matmul_id_subgroup_f16_f32, wg_denoms, warptile_id, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
             CREATE_MM(GGML_TYPE_BF16, pipeline_matmul_id_bf16, matmul_id_subgroup_bf16, , wg_denoms, warptile_id, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
             CREATE_MM2(GGML_TYPE_Q4_0, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q4_0], matmul_id_subgroup_q4_0_f32, mmq_wg_denoms, warptile_mmqid, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size);
@ -3493,9 +3537,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
         }
 #endif
     } else {
-        CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_f32_f32, , wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, 0);
-        CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16, matmul_id_f16, wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, 0);
-        CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16_f32, matmul_id_f16_f32, wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, 0);
+        CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_f32_f32, , wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
+        CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16, matmul_id_f16, wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
+        CREATE_MM2(GGML_TYPE_F16, pipeline_matmul_id_f16_f32, matmul_id_f16_f32, wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
         CREATE_MM(GGML_TYPE_BF16, pipeline_matmul_id_bf16, matmul_id_bf16, , wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
         CREATE_MM2(GGML_TYPE_Q4_0, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q4_0], matmul_id_q4_0_f32, mmq_wg_denoms, warptile_mmqid, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
@ -3610,9 +3654,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
 #endif
         if (device->subgroup_ballot && device->subgroup_require_full_support && subgroup_min_size_16) {
-            CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_subgroup_f32_f32, , wg_denoms, warptile_id, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
-            CREATE_MM(GGML_TYPE_F16, pipeline_matmul_id_f16.f32acc, matmul_id_subgroup_f16, , wg_denoms, warptile_id, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
-            CREATE_MM(GGML_TYPE_F16, pipeline_matmul_id_f16_f32.f32acc, matmul_id_subgroup_f16_f32, , wg_denoms, warptile_id, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
+            CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_subgroup_f32_f32, , wg_denoms, warptile_id, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
+            CREATE_MM(GGML_TYPE_F16, pipeline_matmul_id_f16.f32acc, matmul_id_subgroup_f16, , wg_denoms, warptile_id, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
+            CREATE_MM(GGML_TYPE_F16, pipeline_matmul_id_f16_f32.f32acc, matmul_id_subgroup_f16_f32, , wg_denoms, warptile_id, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
             CREATE_MM(GGML_TYPE_BF16, pipeline_matmul_id_bf16, matmul_id_subgroup_bf16, , wg_denoms, warptile_id, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size_16);
             CREATE_MM(GGML_TYPE_Q4_0, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q4_0].f32acc, matmul_id_subgroup_q4_0_f32, , mmq_wg_denoms, warptile_mmqid, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size);
@ -3636,9 +3680,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
             CREATE_MM(GGML_TYPE_IQ4_NL, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ4_NL].f32acc, matmul_id_subgroup_iq4_nl_f32, , mmq_wg_denoms, warptile_mmqid, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size);
             CREATE_MM(GGML_TYPE_MXFP4, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_MXFP4].f32acc, matmul_id_subgroup_mxfp4_f32, , mmq_wg_denoms, warptile_mmqid, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, mul_mat_subgroup_size);
         } else {
-            CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_f32_f32, , wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, 0);
-            CREATE_MM(GGML_TYPE_F16, pipeline_matmul_id_f16.f32acc, matmul_id_f16, , wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, 0);
-            CREATE_MM(GGML_TYPE_F16, pipeline_matmul_id_f16_f32.f32acc, matmul_id_f16_f32, , wg_denoms, warptile, vk_mat_mat_push_constants, mul_mat_id_param_count, _id, 0);
+            CREATE_MM(GGML_TYPE_F32, pipeline_matmul_id_f32, matmul_id_f32_f32, , wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
+            CREATE_MM(GGML_TYPE_F16, pipeline_matmul_id_f16.f32acc, matmul_id_f16, , wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
+            CREATE_MM(GGML_TYPE_F16, pipeline_matmul_id_f16_f32.f32acc, matmul_id_f16_f32, , wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
             CREATE_MM(GGML_TYPE_BF16, pipeline_matmul_id_bf16, matmul_id_bf16, , wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
             CREATE_MM(GGML_TYPE_Q4_0, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q4_0].f32acc, matmul_id_q4_0_f32, , mmq_wg_denoms, warptile_mmqid, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
@ -3678,6 +3722,11 @@ static void ggml_vk_load_shaders(vk_device& device) {
         m_wg_denoms = { 64, 64, 1 };
         s_wg_denoms = { 32, 32, 1 };

+        if (device->vendor_id == VK_VENDOR_ID_INTEL && device->architecture == INTEL_XE2) {
+            // Xe2/Xe3 - bf16 warptile performance tuning
+            l_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, 4, 4, 1, subgroup_size_8 };
+        }
+
         CREATE_MM(GGML_TYPE_BF16, pipeline_matmul_bf16, matmul_bf16, , wg_denoms, warptile, vk_mat_mat_push_constants, 3, , 0);
         CREATE_MM(GGML_TYPE_BF16, pipeline_matmul_id_bf16, matmul_id_bf16, , wg_denoms, warptile, vk_mat_mat_id_push_constants, mul_mat_id_param_count, _id, 0);
     }
@ -3831,22 +3880,22 @@ static void ggml_vk_load_shaders(vk_device& device) {
         const uint32_t subgroup_size_int = (device->vendor_id == VK_VENDOR_ID_INTEL && device->subgroup_size_control) ? device->subgroup_min_size : device->subgroup_size;
         const uint32_t wg_size_subgroup_int = (w == DMMV_WG_SIZE_SUBGROUP) ? subgroup_size_int : (subgroup_size_int * 4);

-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q4_0], "mul_mat_vec_id_q4_0_q8_1_f32", arr_dmmv_id_q4_0_q8_1_f32_len[reduc], arr_dmmv_id_q4_0_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q4_1], "mul_mat_vec_id_q4_1_q8_1_f32", arr_dmmv_id_q4_1_q8_1_f32_len[reduc], arr_dmmv_id_q4_1_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q5_0], "mul_mat_vec_id_q5_0_q8_1_f32", arr_dmmv_id_q5_0_q8_1_f32_len[reduc], arr_dmmv_id_q5_0_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q5_1], "mul_mat_vec_id_q5_1_q8_1_f32", arr_dmmv_id_q5_1_q8_1_f32_len[reduc], arr_dmmv_id_q5_1_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q8_0], "mul_mat_vec_id_q8_0_q8_1_f32", arr_dmmv_id_q8_0_q8_1_f32_len[reduc], arr_dmmv_id_q8_0_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_MXFP4], "mul_mat_vec_id_mxfp4_q8_1_f32", arr_dmmv_id_mxfp4_q8_1_f32_len[reduc], arr_dmmv_id_mxfp4_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {2*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 2*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q2_K], "mul_mat_vec_id_q2_k_q8_1_f32", arr_dmmv_id_q2_k_q8_1_f32_len[reduc], arr_dmmv_id_q2_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {2*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 2*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q3_K], "mul_mat_vec_id_q3_k_q8_1_f32", arr_dmmv_id_q3_k_q8_1_f32_len[reduc], arr_dmmv_id_q3_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q4_K], "mul_mat_vec_id_q4_k_q8_1_f32", arr_dmmv_id_q4_k_q8_1_f32_len[reduc], arr_dmmv_id_q4_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q5_K], "mul_mat_vec_id_q5_k_q8_1_f32", arr_dmmv_id_q5_k_q8_1_f32_len[reduc], arr_dmmv_id_q5_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q6_K], "mul_mat_vec_id_q6_k_q8_1_f32", arr_dmmv_id_q6_k_q8_1_f32_len[reduc], arr_dmmv_id_q6_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_IQ1_S], "mul_mat_vec_id_iq1_s_q8_1_f32", arr_dmmv_id_iq1_s_q8_1_f32_len[reduc], arr_dmmv_id_iq1_s_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_iq_int(0), 1, 1}, {wg_size_subgroup_int, 1*rm_iq_int(0)}, 1, true, use_subgroups, subgroup_size_int);
-        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_IQ1_M], "mul_mat_vec_id_iq1_m_q8_1_f32", arr_dmmv_id_iq1_m_q8_1_f32_len[reduc], arr_dmmv_id_iq1_m_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_push_constants), {1*rm_iq_int(0), 1, 1}, {wg_size_subgroup_int, 1*rm_iq_int(0)}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q4_0], "mul_mat_vec_id_q4_0_q8_1_f32", arr_dmmv_id_q4_0_q8_1_f32_len[reduc], arr_dmmv_id_q4_0_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q4_1], "mul_mat_vec_id_q4_1_q8_1_f32", arr_dmmv_id_q4_1_q8_1_f32_len[reduc], arr_dmmv_id_q4_1_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q5_0], "mul_mat_vec_id_q5_0_q8_1_f32", arr_dmmv_id_q5_0_q8_1_f32_len[reduc], arr_dmmv_id_q5_0_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q5_1], "mul_mat_vec_id_q5_1_q8_1_f32", arr_dmmv_id_q5_1_q8_1_f32_len[reduc], arr_dmmv_id_q5_1_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q8_0], "mul_mat_vec_id_q8_0_q8_1_f32", arr_dmmv_id_q8_0_q8_1_f32_len[reduc], arr_dmmv_id_q8_0_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_MXFP4], "mul_mat_vec_id_mxfp4_q8_1_f32", arr_dmmv_id_mxfp4_q8_1_f32_len[reduc], arr_dmmv_id_mxfp4_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {2*rm_stdq_int, 1, 1}, {wg_size_subgroup_int, 2*rm_stdq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q2_K], "mul_mat_vec_id_q2_k_q8_1_f32", arr_dmmv_id_q2_k_q8_1_f32_len[reduc], arr_dmmv_id_q2_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {2*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 2*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q3_K], "mul_mat_vec_id_q3_k_q8_1_f32", arr_dmmv_id_q3_k_q8_1_f32_len[reduc], arr_dmmv_id_q3_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q4_K], "mul_mat_vec_id_q4_k_q8_1_f32", arr_dmmv_id_q4_k_q8_1_f32_len[reduc], arr_dmmv_id_q4_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q5_K], "mul_mat_vec_id_q5_k_q8_1_f32", arr_dmmv_id_q5_k_q8_1_f32_len[reduc], arr_dmmv_id_q5_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_Q6_K], "mul_mat_vec_id_q6_k_q8_1_f32", arr_dmmv_id_q6_k_q8_1_f32_len[reduc], arr_dmmv_id_q6_k_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_kq_int, 1, 1}, {wg_size_subgroup_int, 1*rm_kq_int}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_IQ1_S], "mul_mat_vec_id_iq1_s_q8_1_f32", arr_dmmv_id_iq1_s_q8_1_f32_len[reduc], arr_dmmv_id_iq1_s_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_iq_int(0), 1, 1}, {wg_size_subgroup_int, 1*rm_iq_int(0)}, 1, true, use_subgroups, subgroup_size_int);
+        ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_id_q8_1_f32[w][GGML_TYPE_IQ1_M], "mul_mat_vec_id_iq1_m_q8_1_f32", arr_dmmv_id_iq1_m_q8_1_f32_len[reduc], arr_dmmv_id_iq1_m_q8_1_f32_data[reduc], "main", mul_mat_vec_id_num_bindings, sizeof(vk_mat_vec_id_push_constants), {1*rm_iq_int(0), 1, 1}, {wg_size_subgroup_int, 1*rm_iq_int(0)}, 1, true, use_subgroups, subgroup_size_int);
     }
 #endif // GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT
 }
@ -3934,9 +3983,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
     ggml_vk_create_pipeline(device, device->pipeline_flash_attn_split_k_reduce, "fa_split_k_reduce", fa_split_k_reduce_len, fa_split_k_reduce_data, "main", 3, 5 * sizeof(uint32_t), {1, device->subgroup_size, 1}, {device->subgroup_size}, 1, true);

     if (device->subgroup_clustered && device->subgroup_require_full_support) {
-        ggml_vk_create_pipeline(device, device->pipeline_quantize_q8_1_x4, "quantize_q8_1_x4", quantize_q8_1_x4_subgroup_len, quantize_q8_1_x4_subgroup_data, "main", 2, 1 * sizeof(uint32_t), {32 * device->subgroup_size / 8, 1, 1}, { device->subgroup_size }, 1, true, true);
+        ggml_vk_create_pipeline(device, device->pipeline_quantize_q8_1_x4, "quantize_q8_1_x4", quantize_q8_1_x4_subgroup_len, quantize_q8_1_x4_subgroup_data, "main", 2, sizeof(vk_quantize_q8_1_push_constants), {32 * device->subgroup_size / 8, 1, 1}, { device->subgroup_size }, 1, true, true);
     } else {
-        ggml_vk_create_pipeline(device, device->pipeline_quantize_q8_1_x4, "quantize_q8_1_x4", quantize_q8_1_x4_len, quantize_q8_1_x4_data, "main", 2, 1 * sizeof(uint32_t), {32 * device->subgroup_size / 8, 1, 1}, { device->subgroup_size }, 1);
+        ggml_vk_create_pipeline(device, device->pipeline_quantize_q8_1_x4, "quantize_q8_1_x4", quantize_q8_1_x4_len, quantize_q8_1_x4_data, "main", 2, sizeof(vk_quantize_q8_1_push_constants), {32 * device->subgroup_size / 8, 1, 1}, { device->subgroup_size }, 1);
     }

     for (uint32_t i = 0; i < p021_max_gqa_ratio; ++i) {
@ -4144,9 +4193,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
     ggml_vk_create_pipeline(device, device->pipeline_add1_f16_f32, "add1_f16_f32", add1_f16_f32_len, add1_f16_f32_data, "main", 3, sizeof(vk_op_binary_push_constants), {512, 1, 1}, {}, 1);
     ggml_vk_create_pipeline(device, device->pipeline_add1_f32_f32, "add1_f32_f32", add1_f32_f32_len, add1_f32_f32_data, "main", 3, sizeof(vk_op_binary_push_constants), {512, 1, 1}, {}, 1);

-    ggml_vk_create_pipeline(device, device->pipeline_arange_f32, "arange_f32", arange_f32_len, arange_f32_data, "main", 1, sizeof(vk_op_unary_push_constants), {512, 1, 1}, {}, 1);
-    ggml_vk_create_pipeline(device, device->pipeline_fill_f32, "fill_f32", fill_f32_len, fill_f32_data, "main", 1, sizeof(vk_op_unary_push_constants), {512, 1, 1}, {}, 1);
+    ggml_vk_create_pipeline(device, device->pipeline_arange_f32, "arange_f32", arange_f32_len, arange_f32_data, "main", 1, sizeof(vk_op_push_constants), {512, 1, 1}, {}, 1);
+    ggml_vk_create_pipeline(device, device->pipeline_fill_f32, "fill_f32", fill_f32_len, fill_f32_data, "main", 1, sizeof(vk_op_push_constants), {512, 1, 1}, {}, 1);

 #define CREATE_GLU(name) \
     if (device->float_controls_rte_fp16) { \
@ -4292,8 +4341,8 @@ static void ggml_vk_load_shaders(vk_device& device) {
     ggml_vk_create_pipeline(device, device->pipeline_rwkv_wkv7_f32, "rwkv_wkv7_f32", rwkv_wkv7_f32_len, rwkv_wkv7_f32_data, "main", 8, sizeof(vk_op_rwkv_wkv7_push_constants), {1, 1, 1}, {device->subgroup_size}, 1);

     if (device->subgroup_arithmetic && device->subgroup_require_full_support) {
-        ggml_vk_create_pipeline(device, device->pipeline_ssm_scan_f32_d128, "ssm_scan_128_f32", ssm_scan_subgroup_f32_len, ssm_scan_subgroup_f32_data, "main", 8, sizeof(vk_op_ssm_scan_push_constants), {1, 1, 1}, {128, device->subgroup_size, 16}, 1, true, true);
-        ggml_vk_create_pipeline(device, device->pipeline_ssm_scan_f32_d256, "ssm_scan_256_f32", ssm_scan_subgroup_f32_len, ssm_scan_subgroup_f32_data, "main", 8, sizeof(vk_op_ssm_scan_push_constants), {1, 1, 1}, {256, device->subgroup_size, 16}, 1, true, true);
+        ggml_vk_create_pipeline(device, device->pipeline_ssm_scan_f32_d128, "ssm_scan_128_f32", ssm_scan_subgroup_f32_len, ssm_scan_subgroup_f32_data, "main", 8, sizeof(vk_op_ssm_scan_push_constants), {1, 1, 1}, {128, device->subgroup_size}, 1, true, true);
+        ggml_vk_create_pipeline(device, device->pipeline_ssm_scan_f32_d256, "ssm_scan_256_f32", ssm_scan_subgroup_f32_len, ssm_scan_subgroup_f32_data, "main", 8, sizeof(vk_op_ssm_scan_push_constants), {1, 1, 1}, {256, device->subgroup_size}, 1, true, true);
     } else {
         ggml_vk_create_pipeline(device, device->pipeline_ssm_scan_f32_d128, "ssm_scan_128_f32", ssm_scan_f32_len, ssm_scan_f32_data, "main", 8, sizeof(vk_op_ssm_scan_push_constants), {1, 1, 1}, {128, device->subgroup_size, 16}, 1, true, true);
         ggml_vk_create_pipeline(device, device->pipeline_ssm_scan_f32_d256, "ssm_scan_256_f32", ssm_scan_f32_len, ssm_scan_f32_data, "main", 8, sizeof(vk_op_ssm_scan_push_constants), {1, 1, 1}, {256, device->subgroup_size, 16}, 1, true, true);
@ -4420,9 +4469,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
         vk_device device = std::make_shared<vk_device_struct>();
         vk_instance.devices[idx] = device;

-#ifdef GGML_VULKAN_MEMORY_DEBUG
         device->memory_logger = std::unique_ptr<vk_memory_logger>(new vk_memory_logger());
-#endif

         size_t dev_num = vk_instance.device_indices[idx];
@ -4460,6 +4507,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
         bool pipeline_executable_properties_support = false;
         device->coopmat_support = false;
         device->integer_dot_product = false;
+        device->shader_64b_indexing = false;
         bool bfloat16_support = false;

         for (const auto& properties : ext_props) {
@ -4507,6 +4555,10 @@ static vk_device ggml_vk_get_device(size_t idx) {
                 device->memory_priority = true;
             } else if (strcmp("VK_EXT_external_memory_host", properties.extensionName) == 0) {
                 device->external_memory_host = true;
+#if defined(VK_EXT_shader_64bit_indexing)
+            } else if (strcmp("VK_EXT_shader_64bit_indexing", properties.extensionName) == 0) {
+                device->shader_64b_indexing = true;
+#endif
             }
         }
@ -4629,6 +4681,8 @@ static vk_device ggml_vk_get_device(size_t idx) {
         }
         device->float_controls_rte_fp16 = vk12_props.shaderRoundingModeRTEFloat16;

+        device->subgroup_basic = (vk11_props.subgroupSupportedStages & vk::ShaderStageFlagBits::eCompute) &&
+                                 (vk11_props.subgroupSupportedOperations & vk::SubgroupFeatureFlagBits::eBasic);
         device->subgroup_arithmetic = (vk11_props.subgroupSupportedStages & vk::ShaderStageFlagBits::eCompute) &&
                                       (vk11_props.subgroupSupportedOperations & vk::SubgroupFeatureFlagBits::eArithmetic);
 #ifdef __APPLE__
@ -4795,6 +4849,16 @@ static vk_device ggml_vk_get_device(size_t idx) {
             device_extensions.push_back("VK_EXT_external_memory_host");
         }

+#if defined(VK_EXT_shader_64bit_indexing)
+        VkPhysicalDeviceShader64BitIndexingFeaturesEXT shader_64bit_indexing_features {};
+        shader_64bit_indexing_features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_64_BIT_INDEXING_FEATURES_EXT;
+        if (device->shader_64b_indexing) {
+            last_struct->pNext = (VkBaseOutStructure *)&shader_64bit_indexing_features;
+            last_struct = (VkBaseOutStructure *)&shader_64bit_indexing_features;
+            device_extensions.push_back("VK_EXT_shader_64bit_indexing");
+        }
+#endif
+
         vkGetPhysicalDeviceFeatures2(device->physical_device, &device_features2);

         device->pipeline_executable_properties_support = pipeline_executable_properties_support;
@ -5061,14 +5125,26 @@ static vk_device ggml_vk_get_device(size_t idx) {
         switch (device->vendor_id) {
 #ifndef GGML_VULKAN_RUN_TESTS
         case VK_VENDOR_ID_AMD:
-        case VK_VENDOR_ID_INTEL:
-            device->mul_mat_l[i] = false;
+            device->mul_mat_l[i] = device->coopmat_support && device->driver_id != vk::DriverId::eAmdProprietary;
             device->mul_mat_m[i] = true;
             device->mul_mat_s[i] = true;
             device->mul_mat_id_l[i] = false;
             device->mul_mat_id_m[i] = true;
             device->mul_mat_id_s[i] = true;
             break;
+        case VK_VENDOR_ID_INTEL:
+            if (!device->coopmat_support || device->architecture != INTEL_XE2) {
+                device->mul_mat_l[i] = false;
+                device->mul_mat_id_l[i] = false;
+            } else {
+                device->mul_mat_l[i] = true; // if coopmat & XE2+, allow large matmul warptile config for Intel
+                device->mul_mat_id_l[i] = true;
+            }
+            device->mul_mat_m[i] = true;
+            device->mul_mat_s[i] = true;
+            device->mul_mat_id_m[i] = true;
+            device->mul_mat_id_s[i] = true;
+            break;
         case VK_VENDOR_ID_APPLE:
             device->mul_mat_l[i] = false;
             device->mul_mat_m[i] = true;
@ -5390,6 +5466,7 @@ static void ggml_vk_instance_init() {
     vk_perf_logger_enabled = getenv("GGML_VK_PERF_LOGGER") != nullptr;
     vk_perf_logger_concurrent = getenv("GGML_VK_PERF_LOGGER_CONCURRENT") != nullptr;
     vk_enable_sync_logger = getenv("GGML_VK_SYNC_LOGGER") != nullptr;
+    vk_memory_logger_enabled = getenv("GGML_VK_MEMORY_LOGGER") != nullptr;

     const char* GGML_VK_PERF_LOGGER_FREQUENCY = getenv("GGML_VK_PERF_LOGGER_FREQUENCY");
     if (GGML_VK_PERF_LOGGER_FREQUENCY != nullptr) {
@ -6076,6 +6153,7 @@ static void ggml_vk_dispatch_pipeline(ggml_backend_vk_context* ctx, vk_context&
     GGML_ASSERT(ctx->descriptor_set_idx < ctx->descriptor_sets.size());
     GGML_ASSERT(descriptor_buffer_infos.size() <= MAX_PARAMETER_COUNT);
     GGML_ASSERT(pipeline->parameter_count == descriptor_buffer_infos.size());
+    GGML_ASSERT(pipeline->push_constant_size == push_constant_size(push_constants));

     vk::DescriptorSet& descriptor_set = ctx->descriptor_sets[ctx->descriptor_set_idx++];
     vk::WriteDescriptorSet write_descriptor_set{ descriptor_set, 0, 0, pipeline->parameter_count, vk::DescriptorType::eStorageBuffer, nullptr, descriptor_buffer_infos.begin() };
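
With push constants now carried in small named structs (such as vk_quantize_q8_1_push_constants) rather than ad-hoc arrays, the new assert in ggml_vk_dispatch_pipeline can check that the bytes being pushed match the size the pipeline was created with. A simplified, self-contained C++ sketch of that size check; the stub types and names below are illustrative only, not the backend's real ones:

#include <cassert>
#include <cstdint>
#include <cstddef>

struct quantize_push_constants {   // stand-in for vk_quantize_q8_1_push_constants
    uint32_t ne;
    uint32_t num_blocks;
};

struct pipeline_stub {             // stand-in for the relevant vk_pipeline fields
    size_t push_constant_size;
};

// At dispatch time, verify the pushed struct matches the size the pipeline layout
// was created with, catching mismatches that would otherwise read garbage on the GPU.
template <typename T>
void dispatch_checked(const pipeline_stub & p, const T & pc) {
    assert(p.push_constant_size == sizeof(pc));
    (void) pc;
}

int main() {
    pipeline_stub p { sizeof(quantize_push_constants) };
    dispatch_checked(p, quantize_push_constants { 4096, 128 });
    return 0;
}
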
@ -6858,10 +6936,29 @@ static void ggml_vk_quantize_q8_1(ggml_backend_vk_context * ctx, vk_context& sub
     const uint64_t max_elements = std::min<uint64_t>(uint64_t{ctx->device->properties.limits.maxComputeWorkGroupCount[0]} * pipeline->wg_denoms[0], std::numeric_limits<uint32_t>::max());
     const uint32_t elements = std::min(ne, static_cast<uint32_t>(max_elements));

-    ggml_vk_dispatch_pipeline(ctx, subctx, pipeline, { in, out }, std::array<uint32_t, 2>{ ne, num_blocks }, { elements, 1, 1 });
+    const vk_quantize_q8_1_push_constants pc = {
+        ne,
+        num_blocks,
+    };
+
+    ggml_vk_dispatch_pipeline(ctx, subctx, pipeline, { in, out }, pc, { elements, 1, 1 });
     ggml_vk_sync_buffers(ctx, subctx);
 }

+static vk_pipeline ggml_vk_get_64b_indexing_pipeline(ggml_backend_vk_context * ctx, vk_pipeline &pipeline) {
+    GGML_UNUSED(ctx);
+#if defined(VK_EXT_shader_64bit_indexing)
+    vk_pipeline *ptr = &pipeline;
+    while (*ptr) {
+        if ((*ptr)->is_64b_indexing) {
+            return *ptr;
+        }
+        ptr = &(*ptr)->next;
+    }
+#endif
+    return pipeline;
+}
+
 static void ggml_vk_mul_mat_q_f16(ggml_backend_vk_context * ctx, vk_context& subctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, bool disable_split_k) {
     VK_LOG_DEBUG("ggml_vk_mul_mat_q_f16((" << src0 << ", name=" << src0->name << ", type=" << ggml_type_name(src0->type) << ", ne0=" << src0->ne[0] << ", ne1=" << src0->ne[1] << ", ne2=" << src0->ne[2] << ", ne3=" << src0->ne[3] << ", nb0=" << src0->nb[0] << ", nb1=" << src0->nb[1] << ", nb2=" << src0->nb[2] << ", nb3=" << src0->nb[3];
     std::cerr << "), (" << src1 << ", name=" << src1->name << ", type=" << ggml_type_name(src1->type) << ", ne0=" << src1->ne[0] << ", ne1=" << src1->ne[1] << ", ne2=" << src1->ne[2] << ", ne3=" << src1->ne[3] << ", nb0=" << src1->nb[0] << ", nb1=" << src1->nb[1] << ", nb2=" << src1->nb[2] << ", nb3=" << src1->nb[3];
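
ggml_vk_get_64b_indexing_pipeline walks the next chain built during shader loading (base variant first, 64-bit indexing variant second) and returns the variant when one exists; the matrix-multiply paths in the hunks below call it whenever src0 is larger than maxStorageBufferRange. A small stand-alone C++ sketch of that variant selection; the node type here is illustrative, not the backend's vk_pipeline:

#include <cstdio>
#include <memory>

struct pipeline_node {
    bool is_64b_indexing = false;
    std::shared_ptr<pipeline_node> next;   // compilation variants, like vk_pipeline::next
};
using pipeline = std::shared_ptr<pipeline_node>;

// Return the 64-bit indexing variant when present, otherwise the base pipeline.
static pipeline pick_64b_variant(const pipeline & base) {
    for (pipeline p = base; p; p = p->next) {
        if (p->is_64b_indexing) {
            return p;
        }
    }
    return base;
}

int main() {
    pipeline base = std::make_shared<pipeline_node>();
    base->next = std::make_shared<pipeline_node>();
    base->next->is_64b_indexing = true;
    std::printf("picked 64-bit variant: %d\n", pick_64b_variant(base)->is_64b_indexing ? 1 : 0);
    return 0;
}
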
@ -6945,6 +7042,10 @@ static void ggml_vk_mul_mat_q_f16(ggml_backend_vk_context * ctx, vk_context& sub
     vk_pipeline pipeline = ggml_vk_guess_matmul_pipeline(ctx, mmp, ne01, ne11, aligned, qx_needs_dequant ? f16_type : src0->type, quantize_y ? GGML_TYPE_Q8_1 : (y_f32_kernel ? GGML_TYPE_F32 : src1->type));

+    if (ggml_nbytes(src0) > ctx->device->properties.limits.maxStorageBufferRange) {
+        pipeline = ggml_vk_get_64b_indexing_pipeline(ctx, pipeline);
+    }
+
     // Reserve extra storage in the N dimension for the Y matrix, so we can avoid bounds-checking
     uint32_t padded_n = qy_needs_dequant ? ROUNDUP_POW2(ne11, pipeline->wg_denoms[1]) : ne11;
     const uint64_t x_ne = ggml_nelements(src0);
@ -7254,6 +7355,10 @@ static void ggml_vk_mul_mat_vec_q_f16(ggml_backend_vk_context * ctx, vk_context&
         to_q8_1 = ggml_vk_get_quantize_pipeline(ctx, GGML_TYPE_Q8_1);
     }

+    if (ggml_nbytes(src0) > ctx->device->properties.limits.maxStorageBufferRange) {
+        dmmv = ggml_vk_get_64b_indexing_pipeline(ctx, dmmv);
+    }
+
     const bool qx_needs_dequant = x_non_contig;
     const bool qy_needs_dequant = !quantize_y && ((src1->type != GGML_TYPE_F16 && !f16_f32_kernel) || y_non_contig);
@ -7449,9 +7554,15 @@ static void ggml_vk_mul_mat_vec_p021_f16_f32(ggml_backend_vk_context * ctx, vk_c
         gqa_ratio = 1;
     }

+    vk_pipeline pipeline = ctx->device->pipeline_mul_mat_vec_p021_f16_f32[gqa_ratio - 1];
+    if (ggml_nbytes(src0) > ctx->device->properties.limits.maxStorageBufferRange) {
+        pipeline = ggml_vk_get_64b_indexing_pipeline(ctx, pipeline);
+    }
+
     {
         // Request descriptor sets
-        ggml_pipeline_request_descriptor_sets(ctx, ctx->device->pipeline_mul_mat_vec_p021_f16_f32[gqa_ratio - 1], 1);
+        ggml_pipeline_request_descriptor_sets(ctx, pipeline, 1);
     }

     vk_subbuffer d_D = ggml_vk_tensor_subbuffer(ctx, cgraph->nodes[node_idx + ctx->num_additional_fused_ops], true);
@ -7493,7 +7604,7 @@ static void ggml_vk_mul_mat_vec_p021_f16_f32(ggml_backend_vk_context * ctx, vk_c
         workgroups_z /= gqa_ratio;
     }

-    ggml_vk_dispatch_pipeline(ctx, subctx, ctx->device->pipeline_mul_mat_vec_p021_f16_f32[gqa_ratio - 1],
+    ggml_vk_dispatch_pipeline(ctx, subctx, pipeline,
         {
             d_Qx,
             d_Qy,
@ -7543,9 +7654,14 @@ static void ggml_vk_mul_mat_vec_nc_f16_f32(ggml_backend_vk_context * ctx, vk_con
     const uint32_t channel_stride_x = nb02 / sizeof(ggml_fp16_t);
     const uint32_t channel_stride_y = nb12 / sizeof(float);

+    vk_pipeline pipeline = ctx->device->pipeline_mul_mat_vec_nc_f16_f32;
+    if (ggml_nbytes(src0) > ctx->device->properties.limits.maxStorageBufferRange) {
+        pipeline = ggml_vk_get_64b_indexing_pipeline(ctx, pipeline);
+    }
+
     {
         // Request descriptor sets
-        ggml_pipeline_request_descriptor_sets(ctx, ctx->device->pipeline_mul_mat_vec_nc_f16_f32, 1);
+        ggml_pipeline_request_descriptor_sets(ctx, pipeline, 1);
     }

     vk_subbuffer d_D = ggml_vk_tensor_subbuffer(ctx, cgraph->nodes[node_idx + ctx->num_additional_fused_ops], true);
@ -7582,7 +7698,7 @@ static void ggml_vk_mul_mat_vec_nc_f16_f32(ggml_backend_vk_context * ctx, vk_con
     init_pushconst_tensor_offsets(ctx, pc, src0, src1, nullptr, nullptr, cgraph->nodes[node_idx + ctx->num_additional_fused_ops]);

-    ggml_vk_dispatch_pipeline(ctx, subctx, ctx->device->pipeline_mul_mat_vec_nc_f16_f32,
+    ggml_vk_dispatch_pipeline(ctx, subctx, pipeline,
         {
             d_Qx,
             d_Qy,
@ -7601,8 +7717,9 @@ static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, c
     // Handle huge A matrix by splitting the M dimensions. This works well for convolution use cases
     // where the M dimension is very large.
     // Split_k doesn't work with M splitting.
+    // This only supports batchsize == 1.
     const size_t nbytes = ggml_nbytes(src0);
-    const bool needs_split = nbytes > ctx->device->properties.limits.maxStorageBufferRange;
+    const bool needs_split = dst->ne[2] == 1 && dst->ne[3] == 1 && nbytes > ctx->device->properties.limits.maxStorageBufferRange;
     if (needs_split) {
         // Choose the number of rows that can fit (and divide by two, to allow for any additional offsets)
         const uint32_t M_split = ctx->device->properties.limits.maxStorageBufferRange / (2 * src0->nb[1]);
@ -7744,6 +7861,9 @@ static void ggml_vk_mul_mat_id_q_f16(ggml_backend_vk_context * ctx, vk_context&
     vk_pipeline pipeline = ggml_vk_guess_matmul_id_pipeline(ctx, mmp, ne01, nei1, aligned, qx_needs_dequant ? f16_type : src0->type);

+    if (ggml_nbytes(src0) > ctx->device->properties.limits.maxStorageBufferRange) {
+        pipeline = ggml_vk_get_64b_indexing_pipeline(ctx, pipeline);
+    }
+
     // Reserve extra storage in the N dimension for the Y matrix, so we can avoid bounds-checking
     uint32_t padded_n = qy_needs_dequant ? ROUNDUP_POW2(ne11, pipeline->wg_denoms[1]) :ne11;
     const uint64_t x_ne = ggml_nelements(src0);
@ -8005,6 +8125,10 @@ static void ggml_vk_mul_mat_vec_id_q_f16(ggml_backend_vk_context * ctx, vk_conte
     const bool qx_needs_dequant = x_non_contig;
     const bool qy_needs_dequant = !quantize_y && ((src1->type != GGML_TYPE_F16 && !f16_f32_kernel) || y_non_contig);

+    if (ggml_nbytes(src0) > ctx->device->properties.limits.maxStorageBufferRange) {
+        dmmv = ggml_vk_get_64b_indexing_pipeline(ctx, dmmv);
+    }
+
     // Not implemented
     GGML_ASSERT(y_non_contig || !qy_needs_dequant); // NOLINT
     GGML_ASSERT(!qx_needs_dequant || to_fp16_vk_0 != nullptr); // NOLINT
@ -9849,8 +9973,9 @@ static void ggml_vk_ssm_scan(ggml_backend_vk_context * ctx, vk_context& subctx,
     std::array<uint32_t, 3> elements;

-    const int splitH = 16;
-    const uint32_t num_workgroups_x = CEIL_DIV(n_head * head_dim, splitH);
+    const uint32_t d_state = src0->ne[0];
+    uint32_t num_subgroups = d_state / ctx->device->subgroup_size;
+    const uint32_t num_workgroups_x = CEIL_DIV(n_head * head_dim, num_subgroups);
     const uint32_t num_workgroups_y = n_seq;
     elements = { num_workgroups_x, num_workgroups_y, 1 };
@ -14228,6 +14353,7 @@ struct ggml_backend_vk_device_context {
std::string description;
bool is_integrated_gpu;
std::string pci_bus_id;
int op_offload_min_batch_size;
};
static const char * ggml_backend_vk_device_get_name(ggml_backend_dev_t dev) {
@ -14284,6 +14410,19 @@ static ggml_backend_t ggml_backend_vk_device_init(ggml_backend_dev_t dev, const
}
static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
const vk_device& device = ggml_vk_get_device(ctx->device);
// reject any tensors larger than the max buffer size
for (int i = 0; i < GGML_MAX_SRC; i++) {
if (op->src[i] && ggml_nbytes(op->src[i]) > device->max_buffer_size) {
return false;
}
}
if (ggml_nbytes(op) > device->max_buffer_size) {
return false;
}
switch (op->op) {
case GGML_OP_UNARY:
switch (ggml_get_unary_op(op)) {
@ -14332,8 +14471,6 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
case GGML_OP_MUL_MAT_ID:
{
ggml_type src0_type = op->src[0]->type;
ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
const vk_device& device = ggml_vk_get_device(ctx->device);
if (op->op == GGML_OP_MUL_MAT_ID) {
if (!device->mul_mat_id_s[src0_type] && !device->mul_mat_id_m[src0_type] && !device->mul_mat_id_l[src0_type]) {
// If there's not enough shared memory for row_ids and the result tile, fallback to CPU
@ -14394,8 +14531,6 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
}
case GGML_OP_FLASH_ATTN_EXT:
{
ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
auto device = ggml_vk_get_device(ctx->device);
bool coopmat2 = device->coopmat2;
uint32_t HSK = op->src[1]->ne[0];
uint32_t HSV = op->src[2]->ne[0];
@ -14617,8 +14752,6 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
if (!ggml_is_contiguous(op) || !ggml_is_contiguous(op->src[0])) {
return false;
}
ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
auto device = ggml_vk_get_device(ctx->device);
// pipeline_argsort_large_f32 requires vulkan memory model.
if (device->vulkan_memory_model) {
return true;
@ -14631,8 +14764,6 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
if (!ggml_is_contiguous(op) || !ggml_is_contiguous(op->src[0])) {
return false;
}
ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
auto device = ggml_vk_get_device(ctx->device);
// We could potentially support larger, using argsort to sort the
// whole thing. Not clear if this is needed.
uint32_t min_pipeline = (uint32_t)log2f(float(op->ne[0])) + 1;
@ -14679,8 +14810,6 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
return op->src[0]->type == GGML_TYPE_F32 && ggml_is_contiguous_rows(op->src[0]);
case GGML_OP_CUMSUM:
{
ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
auto device = ggml_vk_get_device(ctx->device);
if (device->subgroup_arithmetic && device->subgroup_require_full_support) {
return op->src[0]->type == GGML_TYPE_F32 && ggml_is_contiguous_rows(op->src[0]);
}
@ -14688,9 +14817,6 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
}
case GGML_OP_SOLVE_TRI:
{
ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
const vk_device& device = ggml_vk_get_device(ctx->device);
if (op->type != GGML_TYPE_F32 || op->src[0]->type != GGML_TYPE_F32) {
return false;
}
@ -14755,14 +14881,13 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
return false;
}
size_t shmem_size = d_state * sizeof(float);
if (shmem_size > device->properties.limits.maxComputeSharedMemorySize) {
return false;
}
if (!device->subgroup_basic) {
return false;
}
@ -14802,12 +14927,10 @@ static bool ggml_backend_vk_device_supports_buft(ggml_backend_dev_t dev, ggml_ba
}
static bool ggml_backend_vk_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
ggml_backend_vk_device_context * dev_ctx = (ggml_backend_vk_device_context *)dev->context;
return (op->ne[1] >= dev_ctx->op_offload_min_batch_size && op->op != GGML_OP_GET_ROWS) ||
(op->ne[2] >= dev_ctx->op_offload_min_batch_size && op->op == GGML_OP_MUL_MAT_ID);
}
static ggml_backend_event_t ggml_backend_vk_device_event_new(ggml_backend_dev_t dev) {
@ -14933,6 +15056,7 @@ static ggml_backend_dev_t ggml_backend_vk_reg_get_device(ggml_backend_reg_t reg,
static std::mutex mutex;
std::lock_guard<std::mutex> lock(mutex);
if (!initialized) {
const int min_batch_size = getenv("GGML_OP_OFFLOAD_MIN_BATCH") ? atoi(getenv("GGML_OP_OFFLOAD_MIN_BATCH")) : 32;
for (int i = 0; i < ggml_backend_vk_get_device_count(); i++) {
ggml_backend_vk_device_context * ctx = new ggml_backend_vk_device_context;
char desc[256];
@ -14942,6 +15066,7 @@ static ggml_backend_dev_t ggml_backend_vk_reg_get_device(ggml_backend_reg_t reg,
ctx->description = desc;
ctx->is_integrated_gpu = ggml_backend_vk_get_device_type(i) == vk::PhysicalDeviceType::eIntegratedGpu;
ctx->pci_bus_id = ggml_backend_vk_get_device_pci_id(i);
ctx->op_offload_min_batch_size = min_batch_size;
devices.push_back(new ggml_backend_device {
/* .iface = */ ggml_backend_vk_device_i,
/* .reg = */ reg,


@ -87,7 +87,6 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
const uint tid = gl_LocalInvocationID.x;
get_offsets(a_offset, b_offset, d_offset);
a_offset /= QUANT_K;
y_offset = QUANT_R == 1 ? 1 : QUANT_K/2;


@ -65,9 +65,9 @@ void get_offsets(out uint a_offset, out uint b_offset, out uint d_offset) {
a_offset =
#ifdef MUL_MAT_ID
expert_id * (p.batch_stride_a / QUANT_K);
#else
batch_idx_a * (p.batch_stride_a / QUANT_K);
#endif
b_offset =
#ifdef MUL_MAT_ID


@ -11,7 +11,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint ib32,
const uint num_blocks_per_row, const uint first_row, const uint num_rows) {
// Compute starting index in matrix B for this superblock
const uint y_idx = i * QUANT_K + 32 * ib32;
uint ibi = a_offset + first_row * num_blocks_per_row + i;
// Precompute indices for quantization lookup tables
const uint qh_base = 2 * ib32;


@ -17,7 +17,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint ib32,
const vec4 b_val_1 = vec4(data_b_v4[base_b_idx + 2 * l + 1]);
// index for data_a
uint ibi = a_offset + first_row * num_blocks_per_row + i;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const float d = float(data_a[ibi].d);


@ -12,7 +12,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint itid,
const uint nibble_shift = 4 * (itid & 1);
const uint ib32 = itid / 2; // 0..7
uint ibi = a_offset + first_row * num_blocks_per_row + i;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const float d = float(data_a[ibi].d);
const uint scale = (data_a[ibi].scales[ib32] >> nibble_shift) & 0xF;


@ -11,7 +11,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint itid,
const uint y_idx = i * QUANT_K + 16 * itid;
const uint nibble_shift = 4 * (itid & 1);
const uint ib32 = itid / 2; // 0..7
uint ibi = a_offset + first_row * num_blocks_per_row + i;
// Precompute db multiplication factors
float db_vals[NUM_ROWS];
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
@ -22,7 +22,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint itid,
db_vals[n] = d * (0.125f + float(scale) * 0.25f);
ibi += num_blocks_per_row;
}
ibi = a_offset + first_row * num_blocks_per_row + i;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
// Preload grid and sign data for all l values
vec4 grid0_vals[2], grid1_vals[2];


@ -11,7 +11,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint itid,
const uint y_idx = i * QUANT_K + 16 * itid;
const uint ib32 = itid / 2; // 0..7
uint ibi = a_offset + first_row * num_blocks_per_row + i;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const float d = float(data_a[ibi].d);
const uint signscale = pack32(u16vec2(


@ -10,7 +10,7 @@ FLOAT_TYPE temp[NUM_COLS][NUM_ROWS];
void calc_superblock(const uint a_offset, const uint b_offset, const uint ib32, const uint i, const uint num_blocks_per_row, const uint first_row, const uint num_rows) {
const uint y_idx = i * QUANT_K + 32 * ib32;
uint ibi = a_offset + first_row * num_blocks_per_row + i;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const float d = float(data_a[ibi].d);
const uint scale = (data_a[ibi].scales[ib32/2] >> (4 * (ib32 & 1))) & 0xF;


@ -11,7 +11,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint itid,
const uint y_idx = i * QUANT_K + 16 * itid;
const uint ib32 = itid / 2; // 0..7
uint ibi = a_offset + first_row * num_blocks_per_row + i;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const float d = float(data_a[ibi].d);
const uint signscale = pack32(u16vec2(


@ -15,7 +15,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint itid,
const uint y_idx = i * QUANT_K + y_offset;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset + (first_row+n)*num_blocks_per_row;
csel ^= 1;
if (!all_threads) { // when we don't have enough blocks to use all threads


@ -14,7 +14,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint ix, co
const uint y_idx = i * QUANT_K + y_offset;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset + (first_row+n)*num_blocks_per_row;
csel ^= 1;
if (!all_threads) { // when we don't have enough blocks to use all threads


@ -13,7 +13,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint v_im,
const uint y2_idx = y1_idx + 128;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset + (first_row+n)*num_blocks_per_row;
const FLOAT_TYPE_VEC2 dm = FLOAT_TYPE_VEC2(data_a[ib0 + i].dm);
const uint32_t scale0_u32 = data_a_packed16[ib0 + i].scales[v_im ];


@ -13,7 +13,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint v_im,
const uint y2_idx = y1_idx + 128;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset + (first_row+n)*num_blocks_per_row;
const FLOAT_TYPE_VEC2 dm = FLOAT_TYPE_VEC2(data_a[ib0 + i].dm);
const uint32_t scale0_u32 = data_a_packed16[ib0 + i].scales[v_im ];


@ -15,7 +15,7 @@ void calc_superblock(const uint a_offset, const uint b_offset, const uint itid,
const uint y_idx = i * QUANT_K + y_offset;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset + (first_row+n)*num_blocks_per_row;
csel ^= 1;
if (!all_threads) { // when we don't have enough blocks to use all threads


@ -79,7 +79,7 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
const uint tid = gl_LocalInvocationID.x;
get_offsets(a_offset, b_offset, d_offset);
a_offset *= QUANT_K / QUANT_K_Q8_1;
b_offset /= QUANT_K_Q8_1;
FLOAT_TYPE temp[NUM_COLS][NUM_ROWS];


@ -234,13 +234,13 @@ void main() {
const uint end_k = min(p.K, (ik + 1) * p.k_split);
#endif
uint pos_a =
#ifdef MUL_MAT_ID
expert_idx * (p.batch_stride_a / LOAD_VEC_A) +
#else
batch_idx_a * (p.batch_stride_a / LOAD_VEC_A) +
#endif
(ir * BM * p.stride_a + start_k) / LOAD_VEC_A;
#ifdef MUL_MAT_ID
uint pos_b = 0;
#else


@ -250,10 +250,10 @@ void main() {
#endif
#ifdef MUL_MAT_ID
uint pos_a = expert_idx * (p.batch_stride_a / QUANT_K);
uint pos_b = 0;
#else
uint pos_a = batch_idx_a * (p.batch_stride_a / QUANT_K);
uint pos_b = batch_idx * p.batch_stride_b;
uint pos_d = batch_idx * p.batch_stride_d + ik * p.batch_stride_d * gl_NumWorkGroups.z;
#endif


@ -189,13 +189,13 @@ void main() {
const uint end_k = min(p.K, (ik + 1) * p.k_split);
#endif
uint pos_a_ib =
#ifdef MUL_MAT_ID
expert_idx * (p.batch_stride_a / BK) +
#else
batch_idx_a * (p.batch_stride_a / BK) +
#endif
(ir * BM * p.stride_a + start_k) / BK;
#ifdef MUL_MAT_ID
uint pos_b_ib = 0;
#else


@ -1,6 +1,7 @@
#version 450
#extension GL_EXT_control_flow_attributes : require
#extension GL_KHR_shader_subgroup_basic : enable
#if USE_SUBGROUP_ADD
#extension GL_KHR_shader_subgroup_arithmetic : enable
#endif
@ -9,7 +10,8 @@
layout(constant_id = 0) const uint D_STATE = 128;
layout(constant_id = 1) const uint SUBGROUP_SIZE = 32;
layout(constant_id = 2) const uint SPLIT_H = 16;
const uint32_t c_factor = D_STATE / SUBGROUP_SIZE;
layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
@ -41,22 +43,28 @@ float softplus(float x) {
}
}
#if !USE_SUBGROUP_ADD
shared float temp[D_STATE];
#endif
void main() {
const uint subgroup = gl_SubgroupID;
const uint lane = gl_SubgroupInvocationID;
const uint tid = gl_SubgroupID * SUBGROUP_SIZE + lane;
const uint subgroup_idx = gl_WorkGroupID.x * c_factor + subgroup;
const uint head_idx = subgroup_idx / d_head;
const uint head_off = (subgroup_idx % d_head) * 4;
const uint seq_idx = gl_WorkGroupID.y;
const uint group_off = (head_idx / (n_head / n_group)) * D_STATE * 4;
const uint s0_base_idx = (uint(ids[seq_idx]) * nb03 + head_idx * nb02 + head_off * D_STATE) / 4;
const uint x_base_idx = (seq_idx * nb13 + subgroup_idx * 4) / 4;
const uint dt_base_idx = (seq_idx * nb22 + head_idx * 4) / 4;
const uint A_base_idx = (head_idx * nb31) / 4;
const uint B_base_idx = (seq_idx * nb43 + group_off) / 4;
const uint C_base_idx = (seq_idx * nb53 + group_off) / 4;
const uint y_base_idx = seq_idx * n_tok * n_head * d_head + subgroup_idx;
const uint s_base_idx = (s_off + seq_idx * nb03 + head_idx * nb02 + head_off * D_STATE) / 4;
const uint stride_x = nb12 / 4;
@ -65,76 +73,52 @@ void main() {
const uint stride_C = nb52 / 4;
const uint stride_y = n_head * d_head;
float state[c_factor];
[[unroll]] for (uint j = 0; j < c_factor; j++) {
state[j] = s0[s0_base_idx + SUBGROUP_SIZE * j + lane];
}
float a = A[A_base_idx];
for (uint i = 0; i < n_tok; i++) {
float dt_soft_plus = softplus(dt[dt_base_idx + i * stride_dt]);
float state_sum = 0.0f;
const float dA = exp(dt_soft_plus * a);
const float x_dt = x[x_base_idx + i * stride_x] * dt_soft_plus;
[[unroll]] for (uint j = 0; j < c_factor; j++) {
float B_val = B[B_base_idx + i * stride_B + SUBGROUP_SIZE * j + lane];
float C_val = C[C_base_idx + i * stride_C + SUBGROUP_SIZE * j + lane];
state[j] = (state[j] * dA) + (B_val * x_dt);
state_sum += state[j] * C_val;
}
#if USE_SUBGROUP_ADD
state_sum = subgroupAdd(state_sum);
#else
temp[tid] = state_sum;
barrier();
[[unroll]] for (uint s = SUBGROUP_SIZE / 2; s > 0; s >>= 1) {
if (lane < s) {
temp[tid] += temp[tid + s];
}
barrier();
}
// get the value from lane 0
state_sum = temp[subgroup * SUBGROUP_SIZE];
barrier();
#endif
if (lane == 0) {
d[y_base_idx + i * stride_y] = state_sum;
}
}
// write back the state
[[unroll]]
for (int j = 0; j < c_factor; j++) {
d[s_base_idx + SUBGROUP_SIZE * j + lane] = state[j];
}
}
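For reference, the per-element update this shader performs is the selective-scan recurrence state = state * exp(softplus(dt) * A) + B * x * softplus(dt), with the output being the C-weighted sum of the state that the subgroup reduction above computes. Below is a minimal scalar C++ sketch of one token step; the names mirror the shader, but the exact softplus thresholding and layout handling are omitted, so treat it purely as an illustration.

#include <cmath>
#include <vector>

// Scalar reference of the per-token update done by the shader (illustrative only):
// state[j] = state[j] * exp(softplus(dt) * A) + B[j] * (x * softplus(dt))
// y        = sum_j state[j] * C[j]
float ssm_scan_token(std::vector<float> & state,        // d_state entries
                     const std::vector<float> & B,      // d_state entries
                     const std::vector<float> & C,      // d_state entries
                     float A, float x, float dt_raw) {
    const float dt = std::log(1.0f + std::exp(dt_raw)); // softplus (no threshold here)
    const float dA = std::exp(dt * A);
    const float x_dt = x * dt;
    float y = 0.0f;
    for (size_t j = 0; j < state.size(); ++j) {
        state[j] = state[j] * dA + B[j] * x_dt;          // state update
        y += state[j] * C[j];                            // output projection
    }
    return y;                                            // the shader reduces this sum across the subgroup
}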


@ -0,0 +1,169 @@
#ifndef GGML_WEBGPU_SHADER_LIB_HPP
#define GGML_WEBGPU_SHADER_LIB_HPP
#include "ggml.h"
#include "pre_wgsl.hpp"
#include <string>
#include <vector>
#define GGML_WEBGPU_F16_SIZE_BYTES 2
#define GGML_WEBGPU_F32_SIZE_BYTES 4
#define GGML_WEBGPU_FLASH_ATTN_PREFERRED_KV_SG_TILES 8u
#define GGML_WEBGPU_FLASH_ATTN_PREFERRED_WG_SIZE 128u
// Matches GGML_PAD(..., 256) in src/llama-context.cpp for KV cache sizing.
#define GGML_WEBGPU_KV_SEQ_PAD 256u
struct ggml_webgpu_flash_attn_shader_lib_context {
ggml_type kv_type;
uint32_t head_dim_qk;
uint32_t head_dim_v;
bool kv_direct;
bool has_mask;
bool has_sinks;
bool uses_logit_softcap;
uint32_t sg_mat_m;
uint32_t sg_mat_n;
uint32_t sg_mat_k;
size_t wg_mem_limit_bytes;
uint32_t max_subgroup_size;
};
struct ggml_webgpu_flash_attn_shader_decisions {
uint32_t q_tile = 0;
uint32_t kv_tile = 0;
uint32_t wg_size = 0;
};
struct ggml_webgpu_processed_shader {
std::string wgsl;
std::string variant;
ggml_webgpu_flash_attn_shader_decisions decisions;
};
// This is exposed because it's necessary in supports_op
inline size_t ggml_webgpu_flash_attn_wg_mem_bytes(uint32_t q_tile,
uint32_t kv_tile,
uint32_t head_dim_qk,
uint32_t head_dim_v,
bool has_mask,
bool kv_direct) {
const uint32_t max_head_dim = std::max(head_dim_qk, head_dim_v);
size_t f16_elems = 0;
size_t f32_elems = 0;
f16_elems += q_tile * head_dim_qk; // q_shmem
if (!kv_direct) {
f16_elems += kv_tile * max_head_dim; // kv_shmem
}
f16_elems += q_tile * head_dim_v; // o_shmem
if (has_mask) {
f16_elems += q_tile * kv_tile; // mask_shmem
}
f16_elems += q_tile * kv_tile; // inter_shmem
f32_elems += q_tile; // row_max_shmem
f32_elems += q_tile; // exp_sum_shmem
return f16_elems * GGML_WEBGPU_F16_SIZE_BYTES + f32_elems * GGML_WEBGPU_F32_SIZE_BYTES;
}
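As a sanity check on this estimate, here is a small standalone sketch that evaluates the same formula for one hypothetical configuration; the tile sizes, head dimensions and flags below are illustrative assumptions, not values taken from this change.

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    // Assumed configuration for illustration only.
    const uint32_t q_tile = 16, kv_tile = 64;
    const uint32_t head_dim_qk = 128, head_dim_v = 128;
    const bool has_mask = true, kv_direct = false;

    const uint32_t max_head_dim = std::max(head_dim_qk, head_dim_v);
    size_t f16_elems = q_tile * head_dim_qk            // q_shmem
                     + q_tile * head_dim_v             // o_shmem
                     + q_tile * kv_tile;               // inter_shmem
    if (!kv_direct) { f16_elems += kv_tile * max_head_dim; } // kv_shmem
    if (has_mask)   { f16_elems += q_tile * kv_tile; }       // mask_shmem
    const size_t f32_elems = 2 * q_tile;               // row_max_shmem + exp_sum_shmem

    // 14336 f16 elements * 2 bytes + 32 f32 elements * 4 bytes = 28672 + 128 = 28800 bytes,
    // which fits under a typical 32 KiB maxComputeWorkgroupStorageSize.
    printf("workgroup memory: %zu bytes\n", f16_elems * 2 + f32_elems * 4);
    return 0;
}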
static uint32_t ggml_webgpu_flash_attn_max_kv_tile(const ggml_webgpu_flash_attn_shader_lib_context & context) {
const size_t limit_bytes = context.wg_mem_limit_bytes;
const size_t q_tile = context.sg_mat_m;
const size_t base_q_bytes = (context.head_dim_qk + context.head_dim_v) * q_tile * GGML_WEBGPU_F16_SIZE_BYTES +
2 * q_tile * GGML_WEBGPU_F32_SIZE_BYTES;
size_t bytes_per_kv = 0;
if (!context.kv_direct) {
bytes_per_kv += std::max(context.head_dim_qk, context.head_dim_v);
}
if (context.has_mask) {
bytes_per_kv += q_tile;
}
bytes_per_kv += q_tile;
bytes_per_kv *= GGML_WEBGPU_F16_SIZE_BYTES;
const uint32_t max_kv_tile = (limit_bytes - base_q_bytes) / bytes_per_kv;
return (max_kv_tile / context.sg_mat_n) * context.sg_mat_n;
}
inline ggml_webgpu_processed_shader ggml_webgpu_preprocess_flash_attn_shader(
pre_wgsl::Preprocessor & preprocessor,
const char * shader_src,
const ggml_webgpu_flash_attn_shader_lib_context & context) {
std::vector<std::string> defines;
std::string variant = "flash_attn";
switch (context.kv_type) {
case GGML_TYPE_F32:
defines.push_back("KV_F32");
break;
case GGML_TYPE_F16:
defines.push_back("KV_F16");
break;
case GGML_TYPE_Q4_0:
defines.push_back("KV_Q4_0");
break;
case GGML_TYPE_Q8_0:
defines.push_back("KV_Q8_0");
break;
default:
GGML_ABORT("Unsupported KV type for flash attention shader");
}
variant += std::string("_") + ggml_type_name(context.kv_type);
if (context.has_mask) {
defines.push_back("MASK");
variant += "_mask";
}
if (context.has_sinks) {
defines.push_back("SINKS");
variant += "_sinks";
}
if (context.uses_logit_softcap) {
defines.push_back("LOGIT_SOFTCAP");
variant += "_lgsc";
}
if (context.kv_direct) {
defines.push_back("KV_DIRECT");
variant += "_kvdirect";
}
defines.push_back(std::string("HEAD_DIM_QK=") + std::to_string(context.head_dim_qk));
variant += std::string("_hsqk") + std::to_string(context.head_dim_qk);
defines.push_back(std::string("HEAD_DIM_V=") + std::to_string(context.head_dim_v));
variant += std::string("_hsv") + std::to_string(context.head_dim_v);
// For now these are not part of the variant name
defines.push_back(std::string("SG_MAT_M=") + std::to_string(context.sg_mat_m));
defines.push_back(std::string("SG_MAT_N=") + std::to_string(context.sg_mat_n));
defines.push_back(std::string("SG_MAT_K=") + std::to_string(context.sg_mat_k));
// Add chosen Q/KV tile sizes
uint32_t q_tile = context.sg_mat_m;
uint32_t kv_tile = std::min(ggml_webgpu_flash_attn_max_kv_tile(context),
context.sg_mat_n * GGML_WEBGPU_FLASH_ATTN_PREFERRED_KV_SG_TILES);
if (context.kv_direct) {
GGML_ASSERT(kv_tile <= GGML_WEBGPU_KV_SEQ_PAD);
// Avoids having to use bounds-checks and decreasing performance for direct KV loads
while (GGML_WEBGPU_KV_SEQ_PAD % kv_tile != 0) {
kv_tile -= context.sg_mat_n;
}
}
defines.push_back(std::string("Q_TILE=") + std::to_string(q_tile));
defines.push_back(std::string("KV_TILE=") + std::to_string(kv_tile));
// workgroup size
uint32_t wg_size = std::max(context.max_subgroup_size, GGML_WEBGPU_FLASH_ATTN_PREFERRED_WG_SIZE);
defines.push_back(std::string("WG_SIZE=") + std::to_string(wg_size));
ggml_webgpu_processed_shader result;
result.wgsl = preprocessor.preprocess(shader_src, defines);
result.variant = variant;
result.decisions.q_tile = q_tile;
result.decisions.kv_tile = kv_tile;
result.decisions.wg_size = wg_size;
return result;
}
#endif // GGML_WEBGPU_SHADER_LIB_HPP
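A hedged sketch of how this entry point might be driven; the field values below are placeholders chosen for illustration, and it assumes this header plus the generated wgsl_flash_attn source are available (the real call site is in the WebGPU backend, shown further down).

// Hypothetical usage; values are placeholders, not taken from this change.
pre_wgsl::Preprocessor pp;
ggml_webgpu_flash_attn_shader_lib_context fa_ctx = {
    .kv_type            = GGML_TYPE_F16,
    .head_dim_qk        = 128,
    .head_dim_v         = 128,
    .kv_direct          = true,
    .has_mask           = true,
    .has_sinks          = false,
    .uses_logit_softcap = false,
    .sg_mat_m           = 8,
    .sg_mat_n           = 8,
    .sg_mat_k           = 8,
    .wg_mem_limit_bytes = 32768,
    .max_subgroup_size  = 32,
};
ggml_webgpu_processed_shader shader =
    ggml_webgpu_preprocess_flash_attn_shader(pp, wgsl_flash_attn, fa_ctx);
// shader.wgsl holds the specialized WGSL source; shader.variant is a cache key such as
// "flash_attn_f16_mask_kvdirect_hsqk128_hsv128"; shader.decisions records the chosen tiles.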


@ -7,7 +7,9 @@
#include "ggml-backend-impl.h" #include "ggml-backend-impl.h"
#include "ggml-impl.h" #include "ggml-impl.h"
#include "ggml-webgpu-shader-lib.hpp"
#include "ggml-wgsl-shaders.hpp" #include "ggml-wgsl-shaders.hpp"
#include "pre_wgsl.hpp"
#ifdef __EMSCRIPTEN__ #ifdef __EMSCRIPTEN__
# include <emscripten/emscripten.h> # include <emscripten/emscripten.h>
@ -17,6 +19,7 @@
#include <atomic> #include <atomic>
#include <condition_variable> #include <condition_variable>
#include <cstdint>
#include <cstring> #include <cstring>
#include <iostream> #include <iostream>
#include <map> #include <map>
@ -30,7 +33,7 @@
#ifdef GGML_WEBGPU_DEBUG
# define WEBGPU_LOG_DEBUG(msg) std::cout << msg << std::endl
# define WEBGPU_DEBUG_BUF_ELEMS 512
#else
# define WEBGPU_LOG_DEBUG(msg) ((void) 0)
#endif // GGML_WEBGPU_DEBUG
@ -251,6 +254,7 @@ struct webgpu_gpu_profile_buf_pool {
struct webgpu_pipeline {
wgpu::ComputePipeline pipeline;
std::string name;
void * context = nullptr;
};
struct webgpu_command {
@ -263,6 +267,46 @@ struct webgpu_command {
#endif
};
struct flash_attn_pipeline_key {
int q_type;
int kv_type;
int dst_type;
uint32_t head_dim_qk;
uint32_t head_dim_v;
bool kv_direct;
bool has_mask;
bool has_sinks;
bool uses_logit_softcap;
bool operator==(const flash_attn_pipeline_key & other) const {
return q_type == other.q_type && kv_type == other.kv_type && dst_type == other.dst_type &&
head_dim_qk == other.head_dim_qk && head_dim_v == other.head_dim_v && kv_direct == other.kv_direct &&
has_mask == other.has_mask && has_sinks == other.has_sinks &&
uses_logit_softcap == other.uses_logit_softcap;
}
};
// Same hash combine function as in boost
template <typename T> inline void ggml_webgpu_hash_combine(size_t & seed, const T & value) {
seed ^= std::hash<T>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
struct flash_attn_pipeline_key_hash {
size_t operator()(const flash_attn_pipeline_key & key) const {
size_t seed = 0;
ggml_webgpu_hash_combine(seed, key.q_type);
ggml_webgpu_hash_combine(seed, key.kv_type);
ggml_webgpu_hash_combine(seed, key.dst_type);
ggml_webgpu_hash_combine(seed, key.head_dim_qk);
ggml_webgpu_hash_combine(seed, key.head_dim_v);
ggml_webgpu_hash_combine(seed, key.kv_direct);
ggml_webgpu_hash_combine(seed, key.has_mask);
ggml_webgpu_hash_combine(seed, key.has_sinks);
ggml_webgpu_hash_combine(seed, key.uses_logit_softcap);
return seed;
}
};
// All the base objects needed to run operations on a WebGPU device
struct webgpu_context_struct {
wgpu::Instance instance;
@ -271,12 +315,12 @@ struct webgpu_context_struct {
wgpu::Queue queue;
wgpu::Limits limits;
uint32_t max_subgroup_size;
bool supports_subgroup_matrix = false;
uint32_t sg_mat_m;
uint32_t sg_mat_n;
uint32_t sg_mat_k;
std::recursive_mutex mutex;
std::atomic_uint inflight_threads = 0;
@ -284,12 +328,16 @@ struct webgpu_context_struct {
webgpu_buf_pool param_buf_pool;
webgpu_buf_pool set_rows_error_buf_pool;
pre_wgsl::Preprocessor p;
std::map<int, webgpu_pipeline> memset_pipelines; // variant or type index
std::map<int, std::map<int, std::map<int, webgpu_pipeline>>> mul_mat_pipelines; // src0_type, src1_type, vectorized
std::map<int, std::map<int, std::map<int, webgpu_pipeline>>>
mul_mat_vec_pipelines; // src0_type, src1_type, vectorized
std::unordered_map<flash_attn_pipeline_key, webgpu_pipeline, flash_attn_pipeline_key_hash> flash_attn_pipelines;
std::map<int, std::map<int, webgpu_pipeline>> set_rows_pipelines; // dst_type, vectorized
std::map<int, std::map<int, webgpu_pipeline>> get_rows_pipelines; // src_type, vectorized
@ -361,8 +409,6 @@ struct ggml_backend_webgpu_buffer_context {
label(std::move(lbl)) {}
};
/* End struct definitions */
/* WebGPU object initializations */
// Process a WGSL shader string, replacing tokens of the form {{KEY}} with
@ -484,14 +530,9 @@ static void ggml_backend_webgpu_debug(webgpu_context & ctx) {
encoder.CopyBufferToBuffer(ctx->debug_dev_buf, 0, ctx->debug_host_buf, 0, ctx->debug_host_buf.GetSize());
wgpu::CommandBuffer commands = encoder.Finish();
ctx->queue.Submit(1, &commands);
ggml_backend_webgpu_map_buffer(ctx, ctx->debug_host_buf, wgpu::MapMode::Read, 0, ctx->debug_host_buf.GetSize());
const float * debug_data = (const float *) ctx->debug_host_buf.GetConstMappedRange();
std::cout << "debug[0]: " << debug_data[0] << "\n";
ctx->debug_host_buf.Unmap();
}
#endif
@ -673,6 +714,7 @@ static const char * ggml_backend_webgpu_name(ggml_backend_t backend) {
return ctx->name.c_str();
}
// TODO: implement proper cleanup
static void ggml_backend_webgpu_free(ggml_backend_t backend) {
ggml_backend_webgpu_context * ctx = (ggml_backend_webgpu_context *) backend->context;
WEBGPU_LOG_DEBUG("ggml_backend_webgpu_free(" << ctx->name << ")");
@ -730,12 +772,12 @@ static wgpu::Buffer ggml_webgpu_tensor_buf(const ggml_tensor * tensor) {
return ctx->buffer;
}
static size_t ggml_webgpu_tensor_misalignment(webgpu_context & ctx, const ggml_tensor * t) {
size_t offset = ggml_webgpu_tensor_offset(t);
return offset & (ctx->limits.minStorageBufferOffsetAlignment - 1);
}
static size_t ggml_webgpu_tensor_align_offset(webgpu_context & ctx, const ggml_tensor * t) {
size_t offset = ggml_webgpu_tensor_offset(t);
return offset & ~(ctx->limits.minStorageBufferOffsetAlignment - 1);
}
@ -964,11 +1006,9 @@ static webgpu_command ggml_webgpu_mul_mat(webgpu_context & ctx,
#ifndef __EMSCRIPTEN__
if (ctx->supports_subgroup_matrix) {
// The total number of subgroups/workgroups needed per matrix.
uint32_t wg_m_sg_tile = WEBGPU_MUL_MAT_SUBGROUP_M * WEBGPU_MUL_MAT_SUBGROUP_MATRIX_M * ctx->sg_mat_m;
wg_m = CEIL_DIV(dst->ne[0], wg_m_sg_tile);
uint32_t wg_n_sg_tile = WEBGPU_MUL_MAT_SUBGROUP_N * WEBGPU_MUL_MAT_SUBGROUP_MATRIX_N * ctx->sg_mat_n;
wg_n = CEIL_DIV(dst->ne[1], wg_n_sg_tile);
} else {
#endif
@ -986,6 +1026,146 @@ static webgpu_command ggml_webgpu_mul_mat(webgpu_context & ctx,
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
static webgpu_command ggml_webgpu_flash_attn(webgpu_context & ctx,
ggml_tensor * Q,
ggml_tensor * K,
ggml_tensor * V,
ggml_tensor * mask,
ggml_tensor * sinks,
ggml_tensor * dst) {
float scale = *(float *) dst->op_params;
float max_bias;
memcpy(&max_bias, (float *) dst->op_params + 1, sizeof(float));
float logit_softcap;
memcpy(&logit_softcap, (float *) dst->op_params + 2, sizeof(float));
if (logit_softcap != 0.0f) {
scale /= logit_softcap;
}
float n_head_log2 = float(1u << (uint32_t) floor(log2(Q->ne[2])));
float m0 = powf(2.0f, -(max_bias) / n_head_log2);
float m1 = powf(2.0f, -(max_bias / 2.0f) / n_head_log2);
const int has_mask = (mask != nullptr);
const int has_sinks = (sinks != nullptr);
std::vector<uint32_t> params = {
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, Q) / ggml_type_size(Q->type)),
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, K) / ggml_type_size(K->type)),
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, V) / ggml_type_size(V->type)),
has_mask ? (uint32_t) (ggml_webgpu_tensor_misalignment(ctx, mask) / ggml_type_size(mask->type)) : 0,
has_sinks ? (uint32_t) (ggml_webgpu_tensor_misalignment(ctx, sinks) / ggml_type_size(sinks->type)) : 0,
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, dst) / ggml_type_size(dst->type)),
(uint32_t) Q->ne[2], // number of heads
(uint32_t) Q->ne[1], // sequence length (Q)
(uint32_t) K->ne[1], // sequence length (K/V)
(uint32_t) (Q->nb[1] / ggml_type_size(Q->type)), // stride (elements/blocks) of Q in dimension 1
(uint32_t) (Q->nb[2] / ggml_type_size(Q->type)), // stride (elements/blocks) of Q in dimension 2
(uint32_t) (Q->nb[3] / ggml_type_size(Q->type)), // stride (elements/blocks) of Q in dimension 3
(uint32_t) (K->nb[1] / ggml_type_size(K->type)), // stride (elements/blocks) of K in dimension 1
(uint32_t) (K->nb[2] / ggml_type_size(K->type)), // stride (elements/blocks) of K in dimension 2
(uint32_t) (K->nb[3] / ggml_type_size(K->type)), // stride (elements/blocks) of K in dimension 3
(uint32_t) (V->nb[1] / ggml_type_size(V->type)), // stride (elements/blocks) of V in dimension 1
(uint32_t) (V->nb[2] / ggml_type_size(V->type)), // stride (elements/blocks) of V in dimension 2
(uint32_t) (V->nb[3] / ggml_type_size(V->type)), // stride (elements/blocks) of V in dimension 3
has_mask ? (uint32_t) (mask->nb[3] / ggml_type_size(mask->type)) : 0, // stride of mask dim 3
(uint32_t) (Q->ne[2] / K->ne[2]), // repeat factor for K/V in dim 2 (MHA/MQA/GQA)
*(uint32_t *) &scale, // scale (possibly adjusted for logit softcap)
*(uint32_t *) &max_bias,
*(uint32_t *) &logit_softcap,
*(uint32_t *) &n_head_log2,
*(uint32_t *) &m0,
*(uint32_t *) &m1
};
std::vector<wgpu::BindGroupEntry> entries = {
{ .binding = 0,
.buffer = ggml_webgpu_tensor_buf(Q),
.offset = ggml_webgpu_tensor_align_offset(ctx, Q),
.size = ggml_webgpu_tensor_binding_size(ctx, Q) },
{ .binding = 1,
.buffer = ggml_webgpu_tensor_buf(K),
.offset = ggml_webgpu_tensor_align_offset(ctx, K),
.size = ggml_webgpu_tensor_binding_size(ctx, K) },
{ .binding = 2,
.buffer = ggml_webgpu_tensor_buf(V),
.offset = ggml_webgpu_tensor_align_offset(ctx, V),
.size = ggml_webgpu_tensor_binding_size(ctx, V) }
};
uint32_t binding_index = 3;
if (has_mask) {
entries.push_back({ .binding = binding_index++,
.buffer = ggml_webgpu_tensor_buf(mask),
.offset = ggml_webgpu_tensor_align_offset(ctx, mask),
.size = ggml_webgpu_tensor_binding_size(ctx, mask) });
}
if (has_sinks) {
entries.push_back({ .binding = binding_index++,
.buffer = ggml_webgpu_tensor_buf(sinks),
.offset = ggml_webgpu_tensor_align_offset(ctx, sinks),
.size = ggml_webgpu_tensor_binding_size(ctx, sinks) });
}
entries.push_back({ .binding = binding_index++,
.buffer = ggml_webgpu_tensor_buf(dst),
.offset = ggml_webgpu_tensor_align_offset(ctx, dst),
.size = ggml_webgpu_tensor_binding_size(ctx, dst) });
bool kv_direct =
(K->type == GGML_TYPE_F16) && (Q->ne[0] % ctx->sg_mat_k == 0) && (K->ne[1] % GGML_WEBGPU_KV_SEQ_PAD == 0);
flash_attn_pipeline_key key = {
.q_type = Q->type,
.kv_type = K->type,
.dst_type = dst->type,
.head_dim_qk = (uint32_t) Q->ne[0],
.head_dim_v = (uint32_t) V->ne[0],
.kv_direct = kv_direct,
.has_mask = static_cast<bool>(has_mask),
.has_sinks = static_cast<bool>(has_sinks),
.uses_logit_softcap = logit_softcap != 0.0f,
};
webgpu_pipeline pipeline;
ggml_webgpu_flash_attn_shader_decisions decisions = {};
auto it = ctx->flash_attn_pipelines.find(key);
if (it != ctx->flash_attn_pipelines.end()) {
pipeline = it->second;
decisions = *static_cast<ggml_webgpu_flash_attn_shader_decisions *>(pipeline.context);
} else {
std::lock_guard<std::recursive_mutex> lock(ctx->mutex);
it = ctx->flash_attn_pipelines.find(key);
if (it != ctx->flash_attn_pipelines.end()) {
pipeline = it->second;
decisions = *static_cast<ggml_webgpu_flash_attn_shader_decisions *>(pipeline.context);
} else {
ggml_webgpu_flash_attn_shader_lib_context shader_lib_ctx = { .kv_type = K->type,
.head_dim_qk = (uint32_t) Q->ne[0],
.head_dim_v = (uint32_t) V->ne[0],
.kv_direct = kv_direct,
.has_mask = static_cast<bool>(has_mask),
.has_sinks = static_cast<bool>(has_sinks),
.uses_logit_softcap = logit_softcap != 0.0f,
.sg_mat_m = ctx->sg_mat_m,
.sg_mat_n = ctx->sg_mat_n,
.sg_mat_k = ctx->sg_mat_k,
.wg_mem_limit_bytes =
ctx->limits.maxComputeWorkgroupStorageSize,
.max_subgroup_size = ctx->max_subgroup_size };
ggml_webgpu_processed_shader processed =
ggml_webgpu_preprocess_flash_attn_shader(ctx->p, wgsl_flash_attn, shader_lib_ctx);
pipeline = ggml_webgpu_create_pipeline(ctx->device, processed.wgsl.c_str(), processed.variant.c_str());
pipeline.context = new ggml_webgpu_flash_attn_shader_decisions(processed.decisions);
ctx->flash_attn_pipelines.emplace(key, pipeline);
decisions = processed.decisions;
}
}
uint32_t wg_per_head = CEIL_DIV(Q->ne[1], decisions.q_tile);
uint32_t wg_x = wg_per_head * Q->ne[2] * Q->ne[3]; // wg per head * number of heads * number of batches
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x);
}
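To make the dispatch sizing above concrete, here is a tiny standalone sketch with assumed shapes (512 query tokens, a q_tile of 16, 32 heads, batch of 1; none of these numbers come from the change itself):

#include <cstdint>
#include <cstdio>

static uint32_t ceil_div(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

int main() {
    const uint32_t n_q_tokens = 512, q_tile = 16, n_heads = 32, n_batch = 1; // assumed values
    const uint32_t wg_per_head = ceil_div(n_q_tokens, q_tile);               // 32 workgroups per head
    const uint32_t wg_x = wg_per_head * n_heads * n_batch;                   // 1024 workgroups total
    printf("flash-attn workgroups: %u\n", wg_x);
    return 0;
}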
static webgpu_command ggml_webgpu_unary_op(webgpu_context & ctx, ggml_tensor * src, ggml_tensor * dst) {
uint32_t ne = (uint32_t) ggml_nelements(dst);
ggml_unary_op unary_op = ggml_get_unary_op(dst);
@ -1397,6 +1577,8 @@ static std::optional<webgpu_command> ggml_webgpu_encode_node(webgpu_context ctx,
return ggml_webgpu_get_rows(ctx, src0, src1, node);
case GGML_OP_MUL_MAT:
return ggml_webgpu_mul_mat(ctx, src0, src1, node);
case GGML_OP_FLASH_ATTN_EXT:
return ggml_webgpu_flash_attn(ctx, src0, src1, src2, node->src[3], node->src[4], node);
case GGML_OP_ADD:
{
int inplace = ggml_webgpu_tensor_equal(src0, node);
@ -1466,6 +1648,7 @@ static ggml_status ggml_backend_webgpu_graph_compute(ggml_backend_t backend, str
webgpu_submission_futures new_futures = ggml_backend_webgpu_submit(ctx, commands);
futures.push_back(new_futures);
}

ggml_backend_webgpu_wait(ctx, futures);
ctx->inflight_threads--;
WEBGPU_CPU_PROFILE_TOTAL_END(graph_compute, ctx);
@ -1698,9 +1881,18 @@ static const char * ggml_backend_webgpu_device_get_description(ggml_backend_dev_
static void ggml_backend_webgpu_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
ggml_backend_webgpu_device_context * ctx = static_cast<ggml_backend_webgpu_device_context *>(dev->context);
// TODO: for now, return maxBufferSize as both free and total memory
// Track https://github.com/gpuweb/gpuweb/issues/5505 for updates.
uint64_t max_buffer_size = ctx->webgpu_ctx->limits.maxBufferSize;
// If we're on a 32-bit system, clamp to UINTPTR_MAX
#if UINTPTR_MAX < UINT64_MAX
uint64_t max_ptr_size = static_cast<uint64_t>(UINTPTR_MAX);
if (max_buffer_size > max_ptr_size) {
max_buffer_size = max_ptr_size;
}
#endif
*free = static_cast<size_t>(max_buffer_size);
*total = static_cast<size_t>(max_buffer_size);
}
static enum ggml_backend_dev_type ggml_backend_webgpu_device_get_type(ggml_backend_dev_t dev) {
@ -1808,15 +2000,15 @@ static void ggml_webgpu_init_mul_mat_pipeline(webgpu_context & webgpu_ctx) {
#ifndef __EMSCRIPTEN__
if (webgpu_ctx->supports_subgroup_matrix) {
std::map<std::string, std::string> sg_matrix_repls;
sg_matrix_repls["WEBGPU_MAX_SUBGROUP_SIZE"] = std::to_string(webgpu_ctx->max_subgroup_size);
sg_matrix_repls["WEBGPU_TILE_K"] = std::to_string(WEBGPU_MUL_MAT_TILE_K);
sg_matrix_repls["WEBGPU_SUBGROUP_M"] = std::to_string(WEBGPU_MUL_MAT_SUBGROUP_M);
sg_matrix_repls["WEBGPU_SUBGROUP_N"] = std::to_string(WEBGPU_MUL_MAT_SUBGROUP_N);
sg_matrix_repls["WEBGPU_SUBGROUP_MATRIX_M"] = std::to_string(WEBGPU_MUL_MAT_SUBGROUP_MATRIX_M);
sg_matrix_repls["WEBGPU_SUBGROUP_MATRIX_N"] = std::to_string(WEBGPU_MUL_MAT_SUBGROUP_MATRIX_N);
sg_matrix_repls["WEBGPU_SG_MAT_M_SIZE"] = std::to_string(webgpu_ctx->sg_mat_m);
sg_matrix_repls["WEBGPU_SG_MAT_N_SIZE"] = std::to_string(webgpu_ctx->sg_mat_n);
sg_matrix_repls["WEBGPU_SG_MAT_K_SIZE"] = std::to_string(webgpu_ctx->sg_mat_k);
proc_mul_mat_f32_f32 = ggml_webgpu_process_shader_repls(wgsl_mul_mat_subgroup_matrix_f32_f32, sg_matrix_repls);
proc_mul_mat_f32_f32_vec =
@ -2328,6 +2520,7 @@ static void ggml_webgpu_init_soft_max_pipeline(webgpu_context & webgpu_ctx) {
webgpu_ctx->device, wgsl_soft_max_f32_mask_f16_sink_inplace, "soft_max_f32_mask_f16_sink_inplace", constants);
}
// TODO: move most initialization logic here
static ggml_backend_t ggml_backend_webgpu_device_init(ggml_backend_dev_t dev, const char * params) {
GGML_UNUSED(params);
@ -2489,6 +2682,29 @@ static bool ggml_backend_webgpu_device_supports_op(ggml_backend_dev_t dev, const
}
break;
}
case GGML_OP_FLASH_ATTN_EXT:
{
if (!webgpu_ctx->supports_subgroup_matrix) {
break;
}
// Head dimensions must fit in workgroup memory with minimum tile sizes
size_t limit_bytes = webgpu_ctx->limits.maxComputeWorkgroupStorageSize;
const bool has_mask = op->src[3] != nullptr;
const bool kv_direct = src1->type == GGML_TYPE_F16 && (src0->ne[0] % webgpu_ctx->sg_mat_k) == 0 &&
(src1->ne[1] % GGML_WEBGPU_KV_SEQ_PAD) == 0;
const size_t min_bytes = ggml_webgpu_flash_attn_wg_mem_bytes(
webgpu_ctx->sg_mat_m, webgpu_ctx->sg_mat_n, (uint32_t) src0->ne[0], (uint32_t) src2->ne[0],
has_mask, kv_direct);
if (min_bytes > limit_bytes) {
break;
}
supports_op = src0->type == GGML_TYPE_F32 &&
(src1->type == GGML_TYPE_F32 || src1->type == GGML_TYPE_F16 ||
src1->type == GGML_TYPE_Q4_0 || src1->type == GGML_TYPE_Q8_0) &&
src2->type == src1->type && op->type == GGML_TYPE_F32;
break;
}
case GGML_OP_RMS_NORM:
supports_op = op->type == GGML_TYPE_F32 && src0->type == GGML_TYPE_F32;
break;
@ -2606,6 +2822,7 @@ static size_t ggml_backend_webgpu_reg_get_device_count(ggml_backend_reg_t reg) {
}
// TODO: Does this need to be thread safe? Is it only called once?
// TODO: move most logic to device_init function so backend can be freed/initialized properly
// Only one device is supported for now
static ggml_backend_dev_t ggml_backend_webgpu_reg_get_device(ggml_backend_reg_t reg, size_t index) {
GGML_ASSERT(index == 0);
@ -2665,7 +2882,9 @@ static ggml_backend_dev_t ggml_backend_webgpu_reg_get_device(ggml_backend_reg_t
if (config.M == config.N && config.N == config.K && (config.K == 8 || config.K == 16) &&
config.componentType == wgpu::SubgroupMatrixComponentType::F16 &&
config.resultComponentType == wgpu::SubgroupMatrixComponentType::F16) {
ctx->sg_mat_m = config.M;
ctx->sg_mat_n = config.N;
ctx->sg_mat_k = config.K;
valid_subgroup_matrix_config = true;
break;
}
@ -2676,7 +2895,7 @@ static ggml_backend_dev_t ggml_backend_webgpu_reg_get_device(ggml_backend_reg_t
#endif
// For subgroup matrix code to be the most efficient, we would like the subgroup size to be consistent and accurate.
// Unfortunately, that is not possible, so we use the maximum subgroup size reported by the adapter.
ctx->max_subgroup_size = info.subgroupMaxSize;
// Initialize device
std::vector<wgpu::FeatureName> required_features = { wgpu::FeatureName::ShaderF16 };
@ -2701,8 +2920,11 @@ static ggml_backend_dev_t ggml_backend_webgpu_reg_get_device(ggml_backend_reg_t
wgpu::CallbackMode::AllowSpontaneous,
[](const wgpu::Device & device, wgpu::DeviceLostReason reason, wgpu::StringView message) {
GGML_UNUSED(device);
GGML_UNUSED(reason);
GGML_UNUSED(message);
//TODO: uncomment once proper free logic is in place
//GGML_LOG_ERROR("ggml_webgpu: Device lost! Reason: %d, Message: %s\n", static_cast<int>(reason),
//std::string(message).c_str());
});
dev_desc.SetUncapturedErrorCallback(
[](const wgpu::Device & device, wgpu::ErrorType reason, wgpu::StringView message) {

View File

@ -0,0 +1,778 @@
#ifndef PRE_WGSL_HPP
#define PRE_WGSL_HPP
#include <cctype>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_map>
#include <unordered_set>
#include <vector>
namespace pre_wgsl {
//==============================================================
// Options
//==============================================================
struct Options {
std::string include_path = ".";
std::vector<std::string> macros;
};
//==============================================================
// Utility: trim
//==============================================================
static std::string trim(const std::string & s) {
size_t a = 0;
while (a < s.size() && std::isspace((unsigned char) s[a])) {
a++;
}
size_t b = s.size();
while (b > a && std::isspace((unsigned char) s[b - 1])) {
b--;
}
return s.substr(a, b - a);
}
static std::string trim_value(std::istream & is) {
std::string str;
std::getline(is, str);
return trim(str);
}
static bool isIdentChar(char c) {
return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}
static std::string expandMacrosRecursiveInternal(const std::string & line,
const std::unordered_map<std::string, std::string> & macros,
std::unordered_set<std::string> & visiting);
static std::string expandMacroValue(const std::string & name,
const std::unordered_map<std::string, std::string> & macros,
std::unordered_set<std::string> & visiting) {
if (visiting.count(name)) {
throw std::runtime_error("Recursive macro: " + name);
}
visiting.insert(name);
auto it = macros.find(name);
if (it == macros.end()) {
visiting.erase(name);
return name;
}
const std::string & value = it->second;
if (value.empty()) {
visiting.erase(name);
return "";
}
std::string expanded = expandMacrosRecursiveInternal(value, macros, visiting);
visiting.erase(name);
return expanded;
}
static std::string expandMacrosRecursiveInternal(const std::string & line,
const std::unordered_map<std::string, std::string> & macros,
std::unordered_set<std::string> & visiting) {
std::string result;
result.reserve(line.size());
size_t i = 0;
while (i < line.size()) {
if (isIdentChar(line[i])) {
size_t start = i;
while (i < line.size() && isIdentChar(line[i])) {
i++;
}
std::string token = line.substr(start, i - start);
auto it = macros.find(token);
if (it != macros.end()) {
result += expandMacroValue(token, macros, visiting);
} else {
result += token;
}
} else {
result += line[i];
i++;
}
}
return result;
}
static std::string expandMacrosRecursive(const std::string & line,
const std::unordered_map<std::string, std::string> & macros) {
std::unordered_set<std::string> visiting;
return expandMacrosRecursiveInternal(line, macros, visiting);
}
//==============================================================
// Tokenizer for expressions in #if/#elif
//==============================================================
class ExprLexer {
public:
enum Kind { END, IDENT, NUMBER, OP, LPAREN, RPAREN };
struct Tok {
Kind kind;
std::string text;
};
explicit ExprLexer(std::string_view sv) : src(sv), pos(0) {}
Tok next() {
skipWS();
if (pos >= src.size()) {
return { END, "" };
}
char c = src[pos];
// number
if (std::isdigit((unsigned char) c)) {
size_t start = pos;
while (pos < src.size() && std::isdigit((unsigned char) src[pos])) {
pos++;
}
return { NUMBER, std::string(src.substr(start, pos - start)) };
}
// identifier
if (std::isalpha((unsigned char) c) || c == '_') {
size_t start = pos;
while (pos < src.size() && (std::isalnum((unsigned char) src[pos]) || src[pos] == '_')) {
pos++;
}
return { IDENT, std::string(src.substr(start, pos - start)) };
}
if (c == '(') {
pos++;
return { LPAREN, "(" };
}
if (c == ')') {
pos++;
return { RPAREN, ")" };
}
// multi-char operators
static const char * two_ops[] = { "==", "!=", "<=", ">=", "&&", "||", "<<", ">>" };
for (auto op : two_ops) {
if (src.substr(pos, 2) == op) {
pos += 2;
return { OP, std::string(op) };
}
}
// single-char operators
if (std::string("+-*/%<>!").find(c) != std::string::npos) {
pos++;
return { OP, std::string(1, c) };
}
// unexpected
pos++;
return { END, "" };
}
private:
std::string_view src;
size_t pos;
void skipWS() {
while (pos < src.size() && std::isspace((unsigned char) src[pos])) {
pos++;
}
}
};
//==============================================================
// Expression Parser (recursive descent)
//==============================================================
class ExprParser {
public:
ExprParser(std::string_view expr,
const std::unordered_map<std::string, std::string> & macros,
std::unordered_set<std::string> & visiting) :
lex(expr),
macros(macros),
visiting(visiting) {
advance();
}
int parse() { return parseLogicalOr(); }
private:
ExprLexer lex;
ExprLexer::Tok tok;
const std::unordered_map<std::string, std::string> & macros;
std::unordered_set<std::string> & visiting;
void advance() { tok = lex.next(); }
bool acceptOp(const std::string & s) {
if (tok.kind == ExprLexer::OP && tok.text == s) {
advance();
return true;
}
return false;
}
bool acceptKind(ExprLexer::Kind k) {
if (tok.kind == k) {
advance();
return true;
}
return false;
}
int parseLogicalOr() {
int v = parseLogicalAnd();
while (acceptOp("||")) {
int rhs = parseLogicalAnd();
v = (v || rhs);
}
return v;
}
int parseLogicalAnd() {
int v = parseEquality();
while (acceptOp("&&")) {
int rhs = parseEquality();
v = (v && rhs);
}
return v;
}
int parseEquality() {
int v = parseRelational();
for (;;) {
if (acceptOp("==")) {
int rhs = parseRelational();
v = (v == rhs);
} else if (acceptOp("!=")) {
int rhs = parseRelational();
v = (v != rhs);
} else {
break;
}
}
return v;
}
int parseRelational() {
int v = parseShift();
for (;;) {
if (acceptOp("<")) {
int rhs = parseShift();
v = (v < rhs);
} else if (acceptOp(">")) {
int rhs = parseShift();
v = (v > rhs);
} else if (acceptOp("<=")) {
int rhs = parseShift();
v = (v <= rhs);
} else if (acceptOp(">=")) {
int rhs = parseShift();
v = (v >= rhs);
} else {
break;
}
}
return v;
}
int parseShift() {
int v = parseAdd();
for (;;) {
if (acceptOp("<<")) {
int rhs = parseAdd();
v = (v << rhs);
} else if (acceptOp(">>")) {
int rhs = parseAdd();
v = (v >> rhs);
} else {
break;
}
}
return v;
}
int parseAdd() {
int v = parseMult();
for (;;) {
if (acceptOp("+")) {
int rhs = parseMult();
v = (v + rhs);
} else if (acceptOp("-")) {
int rhs = parseMult();
v = (v - rhs);
} else {
break;
}
}
return v;
}
int parseMult() {
int v = parseUnary();
for (;;) {
if (acceptOp("*")) {
int rhs = parseUnary();
v = (v * rhs);
} else if (acceptOp("/")) {
int rhs = parseUnary();
v = (rhs == 0 ? 0 : v / rhs);
} else if (acceptOp("%")) {
int rhs = parseUnary();
v = (rhs == 0 ? 0 : v % rhs);
} else {
break;
}
}
return v;
}
int parseUnary() {
if (acceptOp("!")) {
return !parseUnary();
}
if (acceptOp("-")) {
return -parseUnary();
}
if (acceptOp("+")) {
return +parseUnary();
}
return parsePrimary();
}
int parsePrimary() {
// '(' expr ')'
if (acceptKind(ExprLexer::LPAREN)) {
int v = parse();
if (!acceptKind(ExprLexer::RPAREN)) {
throw std::runtime_error("missing ')'");
}
return v;
}
// number
if (tok.kind == ExprLexer::NUMBER) {
int v = std::stoi(tok.text);
advance();
return v;
}
// defined(identifier)
if (tok.kind == ExprLexer::IDENT && tok.text == "defined") {
advance();
if (acceptKind(ExprLexer::LPAREN)) {
if (tok.kind != ExprLexer::IDENT) {
throw std::runtime_error("expected identifier in defined()");
}
std::string name = tok.text;
advance();
if (!acceptKind(ExprLexer::RPAREN)) {
throw std::runtime_error("missing ) in defined()");
}
return macros.count(name) ? 1 : 0;
} else {
// defined NAME
if (tok.kind != ExprLexer::IDENT) {
throw std::runtime_error("expected identifier in defined NAME");
}
std::string name = tok.text;
advance();
return macros.count(name) ? 1 : 0;
}
}
// identifier -> treat as integer, if defined use its value else 0
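// e.g. with FOO defined as "2", the condition "#if FOO + 1 == 3" resolves FOO to 2 and evaluates to true;
// an undefined identifier evaluates to 0, and a defined-but-empty macro evaluates to 1.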
if (tok.kind == ExprLexer::IDENT) {
std::string name = tok.text;
advance();
auto it = macros.find(name);
if (it == macros.end()) {
return 0;
}
if (it->second.empty()) {
return 1;
}
return evalMacroExpression(name, it->second);
}
// unexpected
return 0;
}
int evalMacroExpression(const std::string & name, const std::string & value) {
if (visiting.count(name)) {
throw std::runtime_error("Recursive macro: " + name);
}
visiting.insert(name);
ExprParser ep(value, macros, visiting);
int v = ep.parse();
visiting.erase(name);
return v;
}
};
//==============================================================
// Preprocessor
//==============================================================
class Preprocessor {
public:
explicit Preprocessor(Options opts = {}) : opts_(std::move(opts)) {
// Treat empty include path as current directory
if (opts_.include_path.empty()) {
opts_.include_path = ".";
}
parseMacroDefinitions(opts_.macros);
}
std::string preprocess_file(const std::string & filename, const std::vector<std::string> & additional_macros = {}) {
std::unordered_map<std::string, std::string> macros;
std::unordered_set<std::string> predefined;
std::unordered_set<std::string> include_stack;
buildMacros(additional_macros, macros, predefined);
std::string result = processFile(filename, macros, predefined, include_stack, DirectiveMode::All);
return result;
}
std::string preprocess(const std::string & contents, const std::vector<std::string> & additional_macros = {}) {
std::unordered_map<std::string, std::string> macros;
std::unordered_set<std::string> predefined;
std::unordered_set<std::string> include_stack;
buildMacros(additional_macros, macros, predefined);
std::string result = processString(contents, macros, predefined, include_stack, DirectiveMode::All);
return result;
}
std::string preprocess_includes_file(const std::string & filename) {
std::unordered_map<std::string, std::string> macros;
std::unordered_set<std::string> predefined;
std::unordered_set<std::string> include_stack;
std::string result = processFile(filename, macros, predefined, include_stack, DirectiveMode::IncludesOnly);
return result;
}
std::string preprocess_includes(const std::string & contents) {
std::unordered_map<std::string, std::string> macros;
std::unordered_set<std::string> predefined;
std::unordered_set<std::string> include_stack;
std::string result = processString(contents, macros, predefined, include_stack, DirectiveMode::IncludesOnly);
return result;
}
private:
Options opts_;
std::unordered_map<std::string, std::string> global_macros;
enum class DirectiveMode { All, IncludesOnly };
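// IncludesOnly expands only #include directives and copies every other line through verbatim;
// All additionally handles #define/#undef/#if/#ifdef/#ifndef/#elif/#else/#endif and macro expansion.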
struct Cond {
bool parent_active;
bool active;
bool taken;
};
//----------------------------------------------------------
// Parse macro definitions into global_macros
//----------------------------------------------------------
void parseMacroDefinitions(const std::vector<std::string> & macro_defs) {
for (const auto & def : macro_defs) {
size_t eq_pos = def.find('=');
if (eq_pos != std::string::npos) {
// Format: NAME=VALUE
std::string name = trim(def.substr(0, eq_pos));
std::string value = trim(def.substr(eq_pos + 1));
global_macros[name] = value;
} else {
// Format: NAME
std::string name = trim(def);
global_macros[name] = "";
}
}
}
//----------------------------------------------------------
// Build combined macro map and predefined set for a preprocessing operation
//----------------------------------------------------------
void buildMacros(const std::vector<std::string> & additional_macros,
std::unordered_map<std::string, std::string> & macros,
std::unordered_set<std::string> & predefined) {
macros = global_macros;
predefined.clear();
for (const auto & [name, value] : global_macros) {
predefined.insert(name);
}
for (const auto & def : additional_macros) {
size_t eq_pos = def.find('=');
std::string name, value;
if (eq_pos != std::string::npos) {
name = trim(def.substr(0, eq_pos));
value = trim(def.substr(eq_pos + 1));
} else {
name = trim(def);
value = "";
}
// Add to macros map (will override global if same name)
macros[name] = value;
predefined.insert(name);
}
}
//----------------------------------------------------------
// Helpers
//----------------------------------------------------------
std::string loadFile(const std::string & fname) {
std::ifstream f(fname);
if (!f.is_open()) {
throw std::runtime_error("Could not open file: " + fname);
}
std::stringstream ss;
ss << f.rdbuf();
return ss.str();
}
bool condActive(const std::vector<Cond> & cond) const {
if (cond.empty()) {
return true;
}
return cond.back().active;
}
//----------------------------------------------------------
// Process a file
//----------------------------------------------------------
std::string processFile(const std::string & name,
std::unordered_map<std::string, std::string> & macros,
const std::unordered_set<std::string> & predefined_macros,
std::unordered_set<std::string> & include_stack,
DirectiveMode mode) {
if (include_stack.count(name)) {
throw std::runtime_error("Recursive include: " + name);
}
include_stack.insert(name);
std::string shader_code = loadFile(name);
std::string out = processString(shader_code, macros, predefined_macros, include_stack, mode);
include_stack.erase(name);
return out;
}
std::string processIncludeFile(const std::string & fname,
std::unordered_map<std::string, std::string> & macros,
const std::unordered_set<std::string> & predefined_macros,
std::unordered_set<std::string> & include_stack,
DirectiveMode mode) {
std::string full_path = opts_.include_path + "/" + fname;
return processFile(full_path, macros, predefined_macros, include_stack, mode);
}
//----------------------------------------------------------
// Process text
//----------------------------------------------------------
std::string processString(const std::string & shader_code,
std::unordered_map<std::string, std::string> & macros,
const std::unordered_set<std::string> & predefined_macros,
std::unordered_set<std::string> & include_stack,
DirectiveMode mode) {
std::vector<Cond> cond; // Conditional stack for this shader
std::stringstream out;
std::istringstream in(shader_code);
std::string line;
while (std::getline(in, line)) {
std::string t = trim(line);
if (!t.empty() && t[0] == '#') {
bool handled = handleDirective(t, out, macros, predefined_macros, cond, include_stack, mode);
if (mode == DirectiveMode::IncludesOnly && !handled) {
out << line << "\n";
}
} else {
if (mode == DirectiveMode::IncludesOnly) {
out << line << "\n";
} else if (condActive(cond)) {
// Expand macros in the line before outputting
std::string expanded = expandMacrosRecursive(line, macros);
out << expanded << "\n";
}
}
}
if (mode == DirectiveMode::All && !cond.empty()) {
throw std::runtime_error("Unclosed #if directive");
}
return out.str();
}
//----------------------------------------------------------
// Directive handler
//----------------------------------------------------------
bool handleDirective(const std::string & t,
std::stringstream & out,
std::unordered_map<std::string, std::string> & macros,
const std::unordered_set<std::string> & predefined_macros,
std::vector<Cond> & cond,
std::unordered_set<std::string> & include_stack,
DirectiveMode mode) {
// split into tokens
std::string body = t.substr(1);
std::istringstream iss(body);
std::string cmd;
iss >> cmd;
if (cmd == "include") {
if (mode == DirectiveMode::All && !condActive(cond)) {
return true;
}
std::string file;
iss >> file;
if (file.size() >= 2 && file.front() == '"' && file.back() == '"') {
file = file.substr(1, file.size() - 2);
}
out << processIncludeFile(file, macros, predefined_macros, include_stack, mode);
return true;
}
if (mode == DirectiveMode::IncludesOnly) {
return false;
}
if (cmd == "define") {
if (!condActive(cond)) {
return true;
}
std::string name;
iss >> name;
// Don't override predefined macros from options
if (predefined_macros.count(name)) {
return true;
}
std::string value = trim_value(iss);
macros[name] = value;
return true;
}
if (cmd == "undef") {
if (!condActive(cond)) {
return true;
}
std::string name;
iss >> name;
// Don't undef predefined macros from options
if (predefined_macros.count(name)) {
return true;
}
macros.erase(name);
return true;
}
if (cmd == "ifdef") {
std::string name;
iss >> name;
bool p = condActive(cond);
bool v = macros.count(name);
cond.push_back({ p, p && v, p && v });
return true;
}
if (cmd == "ifndef") {
std::string name;
iss >> name;
bool p = condActive(cond);
bool v = !macros.count(name);
cond.push_back({ p, p && v, p && v });
return true;
}
if (cmd == "if") {
std::string expr = trim_value(iss);
bool p = condActive(cond);
bool v = false;
if (p) {
std::unordered_set<std::string> visiting;
ExprParser ep(expr, macros, visiting);
v = ep.parse() != 0;
}
cond.push_back({ p, p && v, p && v });
return true;
}
if (cmd == "elif") {
std::string expr = trim_value(iss);
if (cond.empty()) {
throw std::runtime_error("#elif without #if");
}
Cond & c = cond.back();
if (!c.parent_active) {
c.active = false;
return true;
}
if (c.taken) {
c.active = false;
return true;
}
std::unordered_set<std::string> visiting;
ExprParser ep(expr, macros, visiting);
bool v = ep.parse() != 0;
c.active = v;
if (v) {
c.taken = true;
}
return true;
}
if (cmd == "else") {
if (cond.empty()) {
throw std::runtime_error("#else without #if");
}
Cond & c = cond.back();
if (!c.parent_active) {
c.active = false;
return true;
}
if (c.taken) {
c.active = false;
} else {
c.active = true;
c.taken = true;
}
return true;
}
if (cmd == "endif") {
if (cond.empty()) {
throw std::runtime_error("#endif without #if");
}
cond.pop_back();
return true;
}
// Unknown directive
throw std::runtime_error("Unknown directive: #" + cmd);
}
};
} // namespace pre_wgsl
#endif // PRE_WGSL_HPP
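The header above is self-contained, so a host-side tool can drive it directly. A minimal usage sketch, assuming a hypothetical include path and macro set (neither is taken from the actual build scripts):

#include <iostream>
#include <string>
#include "pre-wgsl.hpp"  // hypothetical path to the header above

int main() {
    pre_wgsl::Options opts;
    opts.include_path = "wgsl-shaders";                  // assumed directory for #include'd snippets
    opts.macros       = { "KV_F32", "HEAD_DIM_QK=128" }; // predefined macros; a #define in the shader cannot override these

    pre_wgsl::Preprocessor pp(opts);

    // Conditionals are evaluated and macros are substituted in the returned string.
    std::string src = "#ifdef KV_F32\nconst head_dim = HEAD_DIM_QK;\n#else\nconst head_dim = 0;\n#endif\n";
    std::cout << pp.preprocess(src);                     // prints: const head_dim = 128;
    return 0;
}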

View File

@ -0,0 +1,591 @@
diagnostic(off, chromium.subgroup_matrix_uniformity);
diagnostic(off, subgroup_uniformity);
enable f16;
enable subgroups;
enable chromium_experimental_subgroup_matrix;
#ifdef KV_F32
#define KV_TYPE f32
#else
#define KV_TYPE f16
#endif
// Default values
#define HEAD_DIM_QK 64
#define HEAD_DIM_V 64
// The number of rows/columns/k in a subgroup matrix. MxK * KxN = MxN
// Note that the "K" here does not correspond to the K in attention's Q/K/V, it's just the common dimension.
#define SG_MAT_M 8
#define SG_MAT_N 8
#define SG_MAT_K 8
// Each workgroup processes one subgroup matrix of Q rows
#define Q_TILE SG_MAT_M
#define KV_TILE 16
#define WG_SIZE 64
// Number of subgroup-matrix-width blocks that span the KV tile. SG_MAT_N must divide KV_TILE.
#define KV_BLOCKS (KV_TILE / SG_MAT_N)
// Quantization constants/helpers
#define BLOCK_SIZE 32
#define BLOCKS_K ((HEAD_DIM_QK + BLOCK_SIZE - 1) / BLOCK_SIZE)
#define BLOCKS_V ((HEAD_DIM_V + BLOCK_SIZE - 1) / BLOCK_SIZE)
// number of quantized elements processed per thread
#if defined(KV_Q4_0)
#define NQ 16
// Q4_0 has 32 elements, 1 f16 for scale, 8 f16 for 4-bit weights
#define F16_PER_BLOCK 9
#define WEIGHTS_PER_F16 4
#elif defined(KV_Q8_0)
#define NQ 8
// Q8_0 has 32 elements, 1 f16 for scale, 16 f16 for 8-bit weights
#define F16_PER_BLOCK 17
#define WEIGHTS_PER_F16 2
#endif
#define F16_PER_THREAD (NQ / WEIGHTS_PER_F16)
// These helpers do not need to be guarded by the #if blocks above; the compiler removes them when unused.
fn get_byte(value: u32, index: u32) -> u32 {
return (value >> (index * 8)) & 0xFF;
}
fn get_byte_i32(value: u32, index: u32) -> i32 {
return bitcast<i32>(((value >> (index * 8)) & 0xFF) << 24) >> 24;
}
struct Params {
offset_q: u32,
offset_k: u32,
offset_v: u32,
offset_mask: u32,
offset_sinks: u32,
offset_dst: u32,
// shapes of Q/K/V
n_heads: u32,
seq_len_q: u32,
seq_len_kv: u32,
// strides (in elements)
stride_q1: u32,
stride_q2: u32,
stride_q3: u32,
stride_k1: u32,
stride_k2: u32,
stride_k3: u32,
stride_v1: u32,
stride_v2: u32,
stride_v3: u32,
stride_mask3: u32,
// repeat factors for K/V, e.g., MHA vs. MQA vs. GQA
q_per_kv: u32,
// softmax params
scale: f32,
max_bias: f32,
logit_softcap: f32,
n_head_log2: f32,
m0: f32,
m1: f32,
};
@group(0) @binding(0) var<storage, read_write> Q: array<f32>;
@group(0) @binding(1) var<storage, read_write> K: array<KV_TYPE>;
@group(0) @binding(2) var<storage, read_write> V: array<KV_TYPE>;
#if defined(MASK) && defined(SINKS)
@group(0) @binding(3) var<storage, read_write> mask: array<f16>;
@group(0) @binding(4) var<storage, read_write> sinks: array<f32>;
#define DST_BINDING 5
#define PARAMS_BINDING 6
#elif defined(MASK)
@group(0) @binding(3) var<storage, read_write> mask: array<f16>;
#define DST_BINDING 4
#define PARAMS_BINDING 5
#elif defined(SINKS)
@group(0) @binding(3) var<storage, read_write> sinks: array<f32>;
#define DST_BINDING 4
#define PARAMS_BINDING 5
#else
#define DST_BINDING 3
#define PARAMS_BINDING 4
#endif
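// Bindings 0-2 are always Q/K/V; mask and sinks, when compiled in, occupy the next slot(s),
// and dst/params follow at DST_BINDING/PARAMS_BINDING as computed above.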
@group(0) @binding(DST_BINDING) var<storage, read_write> dst: array<f32>;
@group(0) @binding(PARAMS_BINDING) var<uniform> params: Params;
// A very negative value used in place of -inf for initial row maxima and out-of-range lanes.
const FLOAT_MIN: f32 = -1.0e9;
// The number of Q rows processed per workgroup
var<workgroup> q_shmem: array<f16, Q_TILE * HEAD_DIM_QK>;
#ifndef KV_DIRECT
const kv_shmem_size = KV_TILE * max(HEAD_DIM_QK, HEAD_DIM_V);
// we can reuse the same shmem for K and V since we only need one at a time
var<workgroup> kv_shmem: array<f16, kv_shmem_size>;
#endif
var<workgroup> o_shmem: array<f16, Q_TILE * HEAD_DIM_V>; // output shmem
#ifdef MASK
// storage for mask values
var<workgroup> mask_shmem: array<f16, Q_TILE * KV_TILE>;
#endif
// storage for output of Q*K^T scores for online softmax (S matrix from paper)
// also storage for diagonal matrix during online softmax (P matrix from paper)
// note that we reuse the same storage for both since we only need one at a time
var<workgroup> inter_shmem: array<f16, Q_TILE * KV_TILE>;
// Storage for row max and exp sum during online softmax
var<workgroup> row_max_shmem: array<f32, Q_TILE>;
var<workgroup> exp_sum_shmem: array<f32, Q_TILE>;
fn calc_softmax_term(kv_idx: u32, q_tile_row: u32, slope: f32) -> f32 {
var v = select(FLOAT_MIN,
f32(inter_shmem[kv_idx + q_tile_row * KV_TILE]) * params.scale,
kv_idx < KV_TILE);
#ifdef LOGIT_SOFTCAP
v = params.logit_softcap * tanh(v);
#endif
#ifdef MASK
let mask_val = select(0.0, f32(mask_shmem[q_tile_row * KV_TILE + kv_idx]), kv_idx < KV_TILE);
let mask_term = slope * mask_val;
v += mask_term;
#endif
return v;
}
@compute @workgroup_size(WG_SIZE)
fn main(@builtin(workgroup_id) wg_id: vec3<u32>,
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(subgroup_id) subgroup_id: u32,
@builtin(subgroup_size) subgroup_size: u32,
@builtin(num_subgroups) num_subgroups: u32,
@builtin(subgroup_invocation_id) sg_inv_id: u32) {
// initialize row max for online softmax
for (var i = local_id.x; i < Q_TILE; i += WG_SIZE) {
row_max_shmem[i] = FLOAT_MIN;
exp_sum_shmem[i] = 0.0;
}
for (var i = local_id.x; i < Q_TILE * HEAD_DIM_V; i += WG_SIZE) {
o_shmem[i] = 0.0;
}
// workgroups per head/batch
let wg_per_head = (params.seq_len_q + Q_TILE - 1u) / Q_TILE;
let wg_per_batch = wg_per_head * params.n_heads;
let dst2_stride = HEAD_DIM_V * params.n_heads;
let dst3_stride = dst2_stride * params.seq_len_q;
// batch index
let batch_idx = wg_id.x / wg_per_batch;
let q_batch_offset = params.offset_q + batch_idx * params.stride_q3;
let k_batch_offset = params.offset_k + batch_idx * params.stride_k3;
let v_batch_offset = params.offset_v + batch_idx * params.stride_v3;
let dst_batch_offset = params.offset_dst + batch_idx * dst3_stride;
let wg_in_batch = wg_id.x % wg_per_batch;
// head index
let head_idx = wg_in_batch / wg_per_head;
let q_head_offset = q_batch_offset + head_idx * params.stride_q2;
let k_head_idx = head_idx / params.q_per_kv;
let v_head_idx = k_head_idx;
let k_head_offset = k_batch_offset + k_head_idx * params.stride_k2;
let v_head_offset = v_batch_offset + v_head_idx * params.stride_v2;
// starting Q row for this workgroup
let wg_in_head = wg_in_batch % wg_per_head;
let q_row_start = wg_in_head * Q_TILE;
#ifdef MASK
// mask offset
let mask_global_offset = params.offset_mask + batch_idx * params.stride_mask3 + q_row_start * params.seq_len_kv;
#endif
// note that the output is permuted, the layout is [head_dim_v, n_heads, seq_len_q, batch_size]
let dst_global_offset = dst_batch_offset + q_row_start * dst2_stride + head_idx * HEAD_DIM_V;
let head = f32(head_idx);
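// ALiBi-style slope: 1.0 when max_bias is disabled; otherwise heads below n_head_log2 use m0^(head+1)
// and the remaining heads use m1^(2*(head - n_head_log2) + 1).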
let slope = select(1.0, select(pow(params.m1, 2.0 * (head - params.n_head_log2) + 1.0), pow(params.m0, head + 1.0), head < params.n_head_log2), params.max_bias > 0);
// load q tile into shared memory
for (var elem_idx = local_id.x; elem_idx < Q_TILE * HEAD_DIM_QK; elem_idx += WG_SIZE) {
let q_row = elem_idx / HEAD_DIM_QK;
let q_col = elem_idx % HEAD_DIM_QK;
let head_q_row = q_row_start + q_row;
let global_q_row_offset = q_head_offset + head_q_row * params.stride_q1;
q_shmem[elem_idx] = f16(select(
0.0,
Q[global_q_row_offset + q_col],
head_q_row < params.seq_len_q && q_col < HEAD_DIM_QK));
}
for (var kv_tile = 0u; kv_tile < params.seq_len_kv; kv_tile += KV_TILE) {
// clear inter_shmem to ensure zero-initialized accumulators
for (var elem_idx = local_id.x; elem_idx < Q_TILE * KV_TILE; elem_idx += WG_SIZE) {
inter_shmem[elem_idx] = 0.0;
}
// load k tile into shared memory
#if defined(KV_Q4_0)
for (var elem_idx = local_id.x * NQ; elem_idx < KV_TILE * HEAD_DIM_QK; elem_idx += WG_SIZE * NQ) {
let blck_idx = elem_idx / BLOCK_SIZE;
let block_offset = (elem_idx % BLOCK_SIZE) / WEIGHTS_PER_F16;
let k_row = blck_idx / BLOCKS_K;
let global_k_row = kv_tile + k_row;
let block_k = blck_idx % BLOCKS_K;
let row_offset = k_row * HEAD_DIM_QK;
if (global_k_row < params.seq_len_kv) {
let global_block_idx = k_head_offset + global_k_row * params.stride_k1 + block_k;
let base_idx = global_block_idx * F16_PER_BLOCK;
let d = K[base_idx]; // scale
for (var j = 0u; j < F16_PER_THREAD; j += 2) {
let q_0 = K[base_idx + 1u + block_offset + j];
let q_1 = K[base_idx + 1u + block_offset + j + 1];
let q_packed = bitcast<u32>(vec2(q_0, q_1));
for (var k = 0u; k < 4u; k++) {
let q_byte = get_byte(q_packed, k);
let q_hi = (f16((q_byte >> 4) & 0xF) - 8.0) * d;
let q_lo = (f16(q_byte & 0xF) - 8.0) * d;
let idx = block_k * BLOCK_SIZE + block_offset * 2u + j * 2u + k;
kv_shmem[row_offset + idx] = q_lo;
kv_shmem[row_offset + idx + 16u] = q_hi;
}
}
}
}
#elif defined(KV_Q8_0)
for (var elem_idx = local_id.x * NQ; elem_idx < KV_TILE * HEAD_DIM_QK; elem_idx += WG_SIZE * NQ) {
let blck_idx = elem_idx / BLOCK_SIZE;
let block_offset = (elem_idx % BLOCK_SIZE) / WEIGHTS_PER_F16;
let k_row = blck_idx / BLOCKS_K;
let global_k_row = kv_tile + k_row;
let block_k = blck_idx % BLOCKS_K;
let row_offset = k_row * HEAD_DIM_QK;
if (global_k_row < params.seq_len_kv) {
let global_block_idx = k_head_offset + global_k_row * params.stride_k1 + block_k;
let base_idx = global_block_idx * F16_PER_BLOCK;
let d = K[base_idx]; // scale
for (var j = 0u; j < F16_PER_THREAD; j += 2) {
let q_0 = K[base_idx + 1u + block_offset + j];
let q_1 = K[base_idx + 1u + block_offset + j + 1];
let q_packed = bitcast<u32>(vec2(q_0, q_1));
for (var k = 0u; k < 4u; k++) {
let q_byte = get_byte_i32(q_packed, k);
let q_val = f16(q_byte) * d;
let idx = block_k * BLOCK_SIZE + block_offset * 2u + j * 2u + k;
kv_shmem[row_offset + idx] = q_val;
}
}
}
}
#elif defined(KV_DIRECT)
// Direct global loads for KV
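// No shared-memory staging here: the subgroupMatrixLoad calls below read K directly from the storage buffer.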
#else
for (var elem_idx = local_id.x; elem_idx < KV_TILE * HEAD_DIM_QK; elem_idx += WG_SIZE) {
let k_row = elem_idx / HEAD_DIM_QK;
let k_col = elem_idx % HEAD_DIM_QK;
let global_k_row = kv_tile + k_row;
let global_k_row_offset = k_head_offset + global_k_row * params.stride_k1;
kv_shmem[elem_idx] = f16(select(
0.0,
K[global_k_row_offset + k_col],
global_k_row < params.seq_len_kv && k_col < HEAD_DIM_QK));
}
#endif
workgroupBarrier();
// accumulate q block * k block into registers across the entire KV tile
// TODO: this loop seems to be the current largest bottleneck
for (var kv_block = subgroup_id; kv_block < KV_BLOCKS; kv_block += num_subgroups) {
let inter_offset = kv_block * SG_MAT_N;
var acc: subgroup_matrix_result<f16, SG_MAT_M, SG_MAT_N> = subgroupMatrixLoad<
subgroup_matrix_result<f16, SG_MAT_M, SG_MAT_N>>(&inter_shmem, inter_offset, false, KV_TILE);
#ifdef KV_DIRECT
let k_block_row = kv_tile + kv_block * SG_MAT_N;
let k_global_offset = k_head_offset + k_block_row * params.stride_k1;
#else
let k_block_offset = kv_block * SG_MAT_N * HEAD_DIM_QK;
#endif
for (var head_dim_block = 0u; head_dim_block < HEAD_DIM_QK; head_dim_block += SG_MAT_K) {
// load q submatrix from shared memory
var q_sg_mat: subgroup_matrix_left<f16, SG_MAT_M, SG_MAT_K> = subgroupMatrixLoad<subgroup_matrix_left<f16, SG_MAT_M, SG_MAT_K>>(
&q_shmem,
head_dim_block,
false,
HEAD_DIM_QK
);
// load k submatrix from device or shared memory
#ifdef KV_DIRECT
var k_sg_mat: subgroup_matrix_right<f16, SG_MAT_K, SG_MAT_N> = subgroupMatrixLoad<subgroup_matrix_right<f16, SG_MAT_K, SG_MAT_N>>(
&K,
k_global_offset + head_dim_block,
true,
params.stride_k1
);
#else
var k_sg_mat: subgroup_matrix_right<f16, SG_MAT_K, SG_MAT_N> = subgroupMatrixLoad<subgroup_matrix_right<f16, SG_MAT_K, SG_MAT_N>>(
&kv_shmem,
k_block_offset + head_dim_block,
true,
HEAD_DIM_QK
);
#endif
acc = subgroupMatrixMultiplyAccumulate(q_sg_mat, k_sg_mat, acc);
}
// store acc to shared memory for softmax (S matrix from paper)
subgroupMatrixStore(&inter_shmem, inter_offset, acc, false, KV_TILE);
}
#ifdef MASK
// load mask tile into shared memory for this KV block
// TODO: optimize and skip if mask is -INF for the entire tile
for (var elem_idx = local_id.x; elem_idx < Q_TILE * KV_TILE; elem_idx += WG_SIZE) {
let mask_row = elem_idx / KV_TILE;
let mask_col = elem_idx % KV_TILE;
let global_q_row = q_row_start + mask_row;
let global_k_col = kv_tile + mask_col;
let mask_in_bounds = global_q_row < params.seq_len_q && global_k_col < params.seq_len_kv;
let mask_idx = mask_global_offset + mask_row * params.seq_len_kv + global_k_col;
mask_shmem[elem_idx] = select(0.0, mask[mask_idx], mask_in_bounds);
}
#endif
workgroupBarrier();
// online softmax
for (var q_tile_row = subgroup_id; q_tile_row < Q_TILE; q_tile_row += num_subgroups) {
let global_q_row = q_row_start + q_tile_row;
if (global_q_row >= params.seq_len_q) {
break;
}
// initialize running max for this row
var prev_max = row_max_shmem[q_tile_row];
var final_max = prev_max;
// pass 1: compute final max across the full KV tile in chunks
for (var kv_offset = 0u; kv_offset < KV_TILE; kv_offset += subgroup_size) {
let kv_idx = kv_offset + sg_inv_id;
let softmax_term = calc_softmax_term(kv_idx, q_tile_row, slope);
final_max = subgroupMax(max(final_max, softmax_term));
}
var total_exp_term: f32 = 0.0;
// pass 2: compute exp sum and write P using final_max
for (var kv_offset = 0u; kv_offset < KV_TILE; kv_offset += subgroup_size) {
let kv_idx = kv_offset + sg_inv_id;
let softmax_term = calc_softmax_term(kv_idx, q_tile_row, slope);
let cur_p = select(0.0,
exp(softmax_term - final_max),
kv_tile + kv_idx < params.seq_len_kv && kv_idx < KV_TILE);
total_exp_term += subgroupAdd(cur_p);
if (kv_idx < KV_TILE) {
inter_shmem[kv_idx + q_tile_row * KV_TILE] = f16(cur_p);
}
}
let cur_exp = exp(prev_max - final_max);
if (sg_inv_id == 0) {
row_max_shmem[q_tile_row] = final_max;
exp_sum_shmem[q_tile_row] = exp_sum_shmem[q_tile_row] * cur_exp + total_exp_term;
}
for (var elem_idx = sg_inv_id; elem_idx < HEAD_DIM_V; elem_idx += subgroup_size) {
let idx = q_tile_row * HEAD_DIM_V + elem_idx;
o_shmem[idx] = f16(f32(o_shmem[idx]) * cur_exp);
}
}
// load v tile into shared memory
#if defined(KV_Q4_0)
for (var elem_idx = local_id.x * NQ; elem_idx < KV_TILE * HEAD_DIM_V; elem_idx += WG_SIZE * NQ) {
let blck_idx = elem_idx / BLOCK_SIZE;
let block_offset = (elem_idx % BLOCK_SIZE) / WEIGHTS_PER_F16;
let v_row = blck_idx / BLOCKS_V;
let global_v_row = kv_tile + v_row;
let block_k = blck_idx % BLOCKS_V;
let row_offset = v_row * HEAD_DIM_V;
if (global_v_row < params.seq_len_kv) {
let global_block_idx = v_head_offset + global_v_row * params.stride_v1 + block_k;
let base_idx = global_block_idx * F16_PER_BLOCK;
let d = V[base_idx]; // scale
for (var j = 0u; j < F16_PER_THREAD; j += 2) {
let q_0 = V[base_idx + 1u + block_offset + j];
let q_1 = V[base_idx + 1u + block_offset + j + 1];
let q_packed = bitcast<u32>(vec2(q_0, q_1));
for (var k = 0u; k < 4u; k++) {
let q_byte = get_byte(q_packed, k);
let q_hi = (f16((q_byte >> 4) & 0xF) - 8.0) * d;
let q_lo = (f16(q_byte & 0xF) - 8.0) * d;
let idx = block_k * BLOCK_SIZE + block_offset * 2u + j * 2u + k;
kv_shmem[row_offset + idx] = q_lo;
kv_shmem[row_offset + idx + 16u] = q_hi;
}
}
}
}
#elif defined(KV_Q8_0)
for (var elem_idx = local_id.x * NQ; elem_idx < KV_TILE * HEAD_DIM_V; elem_idx += WG_SIZE * NQ) {
let blck_idx = elem_idx / BLOCK_SIZE;
let block_offset = (elem_idx % BLOCK_SIZE) / WEIGHTS_PER_F16;
let v_row = blck_idx / BLOCKS_V;
let global_v_row = kv_tile + v_row;
let block_k = blck_idx % BLOCKS_V;
let row_offset = v_row * HEAD_DIM_V;
if (global_v_row < params.seq_len_kv) {
let global_block_idx = v_head_offset + global_v_row * params.stride_v1 + block_k;
let base_idx = global_block_idx * F16_PER_BLOCK;
let d = V[base_idx]; // scale
for (var j = 0u; j < F16_PER_THREAD; j += 2) {
let q_0 = V[base_idx + 1u + block_offset + j];
let q_1 = V[base_idx + 1u + block_offset + j + 1];
let q_packed = bitcast<u32>(vec2(q_0, q_1));
for (var k = 0u; k < 4u; k++) {
let q_byte = get_byte_i32(q_packed, k);
let q_val = f16(q_byte) * d;
let idx = block_k * BLOCK_SIZE + block_offset * 2u + j * 2u + k;
kv_shmem[row_offset + idx] = q_val;
}
}
}
}
#elif defined(KV_DIRECT)
// Direct global loads for KV
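// No shared-memory staging here: V is read directly from the storage buffer by subgroupMatrixLoad below.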
#else
for (var elem_idx = local_id.x; elem_idx < KV_TILE * HEAD_DIM_V; elem_idx += WG_SIZE) {
let v_row = elem_idx / HEAD_DIM_V;
let v_col = elem_idx % HEAD_DIM_V;
let global_v_row = kv_tile + v_row;
let global_v_row_offset = v_head_offset + global_v_row * params.stride_v1;
kv_shmem[elem_idx] = f16(select(
0.0,
V[global_v_row_offset + v_col],
global_v_row < params.seq_len_kv && v_col < HEAD_DIM_V));
}
#endif
workgroupBarrier();
// we have P (Q_TILE x KV_TILE) in inter_shmem and V (KV_TILE x head_dim_v) in kv_shmem
// we want to compute O += P * V across the full KV tile
for (var head_dim_block = subgroup_id * SG_MAT_N;
head_dim_block < HEAD_DIM_V;
head_dim_block += num_subgroups * SG_MAT_N) {
// load O submatrix from shared memory
var o_sg_mat: subgroup_matrix_result<f16, SG_MAT_M, SG_MAT_N> = subgroupMatrixLoad<subgroup_matrix_result<f16, SG_MAT_M, SG_MAT_N>>(
&o_shmem,
head_dim_block,
false,
HEAD_DIM_V
);
for (var kv_block = 0u; kv_block < KV_BLOCKS; kv_block++) {
let p_offset = kv_block * SG_MAT_N;
var p_sg_mat: subgroup_matrix_left<f16, SG_MAT_M, SG_MAT_K> = subgroupMatrixLoad<subgroup_matrix_left<f16, SG_MAT_M, SG_MAT_K>>(
&inter_shmem,
p_offset,
false,
KV_TILE
);
// load V submatrix from global or shared memory
#ifdef KV_DIRECT
let v_block_row = kv_tile + kv_block * SG_MAT_N;
let v_global_offset = v_head_offset + v_block_row * params.stride_v1 + head_dim_block;
var v_sg_mat: subgroup_matrix_right<f16, SG_MAT_K, SG_MAT_N> = subgroupMatrixLoad<subgroup_matrix_right<f16, SG_MAT_K, SG_MAT_N>>(
&V,
v_global_offset,
false,
params.stride_v1
);
#else
let v_block_offset = kv_block * SG_MAT_N * HEAD_DIM_V;
var v_sg_mat: subgroup_matrix_right<f16, SG_MAT_K, SG_MAT_N> = subgroupMatrixLoad<subgroup_matrix_right<f16, SG_MAT_K, SG_MAT_N>>(
&kv_shmem,
v_block_offset + head_dim_block,
false,
HEAD_DIM_V
);
#endif
// O += P * V
o_sg_mat = subgroupMatrixMultiplyAccumulate(p_sg_mat, v_sg_mat, o_sg_mat);
}
// store O back to shared memory
subgroupMatrixStore(&o_shmem, head_dim_block, o_sg_mat, false, HEAD_DIM_V);
}
workgroupBarrier();
}
#ifdef SINKS
// add sinks (applied once after processing all KV tiles)
for (var q_tile_row = subgroup_id;
q_tile_row < Q_TILE;
q_tile_row += num_subgroups) {
// no need to process rows beyond seq_len_q
let global_q_row = q_row_start + q_tile_row;
if (global_q_row >= params.seq_len_q) {
break;
}
var prev_max = row_max_shmem[q_tile_row];
// for non-sink threads, exp(FLOAT_MIN) effectively zeroes out their contribution to the sum
let sink_val = select(FLOAT_MIN, sinks[params.offset_sinks + head_idx], sg_inv_id == 0);
let new_max = subgroupMax(max(prev_max, sink_val));
let max_exp = exp(prev_max - new_max);
let sink_exp = exp(sink_val - new_max);
let sink_exp_sum = subgroupAdd(sink_exp);
if (sg_inv_id == 0) {
exp_sum_shmem[q_tile_row] = exp_sum_shmem[q_tile_row] * max_exp + sink_exp_sum;
}
for (var elem_idx = sg_inv_id; elem_idx < HEAD_DIM_V; elem_idx += subgroup_size) {
let idx = q_tile_row * HEAD_DIM_V + elem_idx;
let val = f32(o_shmem[idx]) * max_exp;
o_shmem[idx] = f16(val);
}
}
workgroupBarrier();
#endif
// write output back to global memory
for (var q_tile_row = subgroup_id;
q_tile_row < Q_TILE;
q_tile_row += num_subgroups) {
let global_q_row = q_row_start + q_tile_row;
if (global_q_row >= params.seq_len_q) {
break;
}
let exp_sum = exp_sum_shmem[q_tile_row];
let scale = select(0.0, 1.0 / exp_sum, exp_sum != 0);
for (var elem_idx = sg_inv_id; elem_idx < HEAD_DIM_V; elem_idx += subgroup_size) {
let o_val = o_shmem[q_tile_row * HEAD_DIM_V + elem_idx];
let scaled = f32(o_val) * scale;
dst[dst_global_offset + q_tile_row * dst2_stride + elem_idx] = scaled;
}
}
}
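For context on the host-side supports_op gate earlier in this diff: reading the var<workgroup> declarations off this shader with the default 64-element head dimensions and 8x8 subgroup matrices gives roughly the following (the exact accounting in ggml_webgpu_flash_attn_wg_mem_bytes is not shown here and may differ):

q_shmem (Q_TILE * HEAD_DIM_QK f16) = 8 * 64 * 2 B = 1024 B
kv_shmem (KV_TILE * max(HEAD_DIM_QK, HEAD_DIM_V) f16) = 16 * 64 * 2 B = 2048 B (absent with KV_DIRECT)
o_shmem (Q_TILE * HEAD_DIM_V f16) = 8 * 64 * 2 B = 1024 B
inter_shmem (Q_TILE * KV_TILE f16) = 8 * 16 * 2 B = 256 B
mask_shmem (Q_TILE * KV_TILE f16) = 8 * 16 * 2 B = 256 B (only with MASK)
row_max_shmem + exp_sum_shmem (2 * Q_TILE f32) = 64 B

That is 4672 bytes in the fully staged case, well under the 16 KB default maxComputeWorkgroupStorageSize limit; it is larger head dimensions that push the total past the limit checked in supports_op.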

View File

@ -282,6 +282,7 @@ class Keys:
HAS_LLAVA_PROJECTOR = "clip.has_llava_projector" HAS_LLAVA_PROJECTOR = "clip.has_llava_projector"
class ClipVision: class ClipVision:
PROJECTOR_TYPE = "clip.vision.projector_type" # for mixed modality models
IMAGE_SIZE = "clip.vision.image_size" IMAGE_SIZE = "clip.vision.image_size"
PREPROC_IMAGE_SIZE = "clip.vision.preproc_image_size" PREPROC_IMAGE_SIZE = "clip.vision.preproc_image_size"
PATCH_SIZE = "clip.vision.patch_size" PATCH_SIZE = "clip.vision.patch_size"
@ -307,6 +308,7 @@ class Keys:
SCALE_FACTOR = "clip.vision.projector.scale_factor" SCALE_FACTOR = "clip.vision.projector.scale_factor"
class ClipAudio: class ClipAudio:
PROJECTOR_TYPE = "clip.audio.projector_type" # for mixed modality models
NUM_MEL_BINS = "clip.audio.num_mel_bins" NUM_MEL_BINS = "clip.audio.num_mel_bins"
EMBEDDING_LENGTH = "clip.audio.embedding_length" EMBEDDING_LENGTH = "clip.audio.embedding_length"
FEED_FORWARD_LENGTH = "clip.audio.feed_forward_length" FEED_FORWARD_LENGTH = "clip.audio.feed_forward_length"
@ -465,6 +467,7 @@ class VISION_PROJECTOR_TYPE(IntEnum):
RESAMPLER = auto()
GLM_EDGE = auto()
MERGER = auto()
GEMMA3N = auto()
GEMMA3 = auto()
QWEN3VL = auto()
COGVLM = auto()
@ -675,6 +678,15 @@ class MODEL_TENSOR(IntEnum):
V_MM_INP_NORM = auto()
V_MM_INP_PROJ = auto() # gemma3
V_MM_SOFT_EMB_NORM = auto() # gemma3
V_MM_EMBEDDING = auto() # gemma3n
V_MM_HARD_EMB_NORM = auto() # gemma3n
V_ENC_CONV_STEM = auto() # gemma3n
V_ENC_CONV_STEM_NORM = auto() # gemma3n
V_ENC_MSFA_EXP = auto() # gemma3n
V_ENC_MSFA_EXP_NORM = auto() # gemma3n
V_ENC_MSFA_PROJ = auto() # gemma3n
V_ENC_MSFA_PROJ_NORM = auto() # gemma3n
V_ENC_MSFA_NORM = auto() # gemma3n
V_RESMPL_POS_EMBD_K = auto() # minicpmv
V_RESMPL_ATTN_Q = auto() # minicpmv
V_RESMPL_ATTN_K = auto() # minicpmv
@ -700,28 +712,39 @@ class MODEL_TENSOR(IntEnum):
# audio (mtmd)
A_ENC_EMBD_POS = auto()
A_ENC_EMBD_NORM = auto()
A_ENC_EMBD_TO_LOGITS = auto() # lfm2
A_ENC_CONV1D = auto()
A_ENC_CONV1D_NORM = auto() # gemma3n
A_PRE_NORM = auto()
A_POST_NORM = auto()
A_ENC_LAYER_PRE_NORM = auto() # gemma3n
A_ENC_ATTN_Q = auto()
A_ENC_ATTN_K = auto()
A_ENC_ATTN_V = auto()
A_ENC_PER_DIM_SCALE = auto() # gemma3n
A_ENC_INPUT_NORM = auto()
A_ENC_OUTPUT = auto()
A_ENC_OUTPUT_NORM = auto()
A_ENC_FFN_UP = auto()
A_ENC_FFN_NORM = auto()
A_ENC_FFN_POST_NORM = auto() # gemma3n
A_ENC_FFN_SCALE = auto() # gemma3n
A_ENC_FFN_GATE = auto()
A_ENC_FFN_DOWN = auto()
A_ENC_FFN_UP_1 = auto() # lfm2, gemma3n
A_ENC_FFN_NORM_1 = auto() # lfm2, gemma3n (pre-norm)
A_ENC_FFN_POST_NORM_1 = auto() # gemma3n
A_ENC_FFN_SCALE_1 = auto() # gemma3n
A_ENC_FFN_GATE_1 = auto() # lfm2, gemma3n
A_ENC_FFN_DOWN_1 = auto() # lfm2, gemma3n
A_MMPROJ = auto()
A_MMPROJ_FC = auto()
A_MM_NORM_PRE = auto()
A_MM_NORM_MID = auto()
A_MM_EMBEDDING = auto() # gemma3n
A_MM_HARD_EMB_NORM = auto() # gemma3n
A_MM_SOFT_EMB_NORM = auto() # gemma3n
A_MM_INP_PROJ = auto() # gemma3n
# nextn/mtp
NEXTN_EH_PROJ = auto()
NEXTN_EMBED_TOKENS = auto()
@ -1071,7 +1094,16 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
MODEL_TENSOR.V_MM_POST_NORM: "mm.post_norm",
MODEL_TENSOR.V_MM_INP_PROJ: "mm.input_projection",
MODEL_TENSOR.V_MM_INP_NORM: "mm.input_norm",
MODEL_TENSOR.V_MM_SOFT_EMB_NORM: "mm.soft_emb_norm", # gemma3n
MODEL_TENSOR.V_MM_EMBEDDING: "mm.embedding", # gemma3n
MODEL_TENSOR.V_MM_HARD_EMB_NORM: "mm.hard_emb_norm", # gemma3n
MODEL_TENSOR.V_ENC_CONV_STEM: "v.conv_stem.conv", # gemma3n
MODEL_TENSOR.V_ENC_CONV_STEM_NORM: "v.conv_stem.bn", # gemma3n
MODEL_TENSOR.V_ENC_MSFA_EXP: "v.msfa.ffn.pw_exp.conv", # gemma3n
MODEL_TENSOR.V_ENC_MSFA_EXP_NORM: "v.msfa.ffn.pw_exp.bn", # gemma3n
MODEL_TENSOR.V_ENC_MSFA_PROJ: "v.msfa.ffn.pw_proj.conv", # gemma3n
MODEL_TENSOR.V_ENC_MSFA_PROJ_NORM: "v.msfa.ffn.pw_proj.bn", # gemma3n
MODEL_TENSOR.V_ENC_MSFA_NORM: "v.msfa.norm", # gemma3n
MODEL_TENSOR.V_RESMPL_POS_EMBD_K: "resampler.pos_embd_k",
MODEL_TENSOR.V_RESMPL_ATTN_Q: "resampler.attn.q",
MODEL_TENSOR.V_RESMPL_ATTN_K: "resampler.attn.k",
@ -1100,19 +1132,26 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
MODEL_TENSOR.A_ENC_EMBD_NORM: "a.position_embd_norm",
MODEL_TENSOR.A_ENC_EMBD_TO_LOGITS: "a.embd_to_logits",
MODEL_TENSOR.A_ENC_CONV1D: "a.conv1d.{bid}",
MODEL_TENSOR.A_ENC_CONV1D_NORM: "a.conv1d.{bid}.norm",
MODEL_TENSOR.A_PRE_NORM: "a.pre_ln",
MODEL_TENSOR.A_POST_NORM: "a.post_ln",
MODEL_TENSOR.A_ENC_LAYER_PRE_NORM: "a.blk.{bid}.layer_pre_norm",
MODEL_TENSOR.A_ENC_ATTN_Q: "a.blk.{bid}.attn_q",
MODEL_TENSOR.A_ENC_ATTN_K: "a.blk.{bid}.attn_k",
MODEL_TENSOR.A_ENC_ATTN_V: "a.blk.{bid}.attn_v",
MODEL_TENSOR.A_ENC_PER_DIM_SCALE: "a.blk.{bid}.per_dim_scale",
MODEL_TENSOR.A_ENC_INPUT_NORM: "a.blk.{bid}.ln1",
MODEL_TENSOR.A_ENC_OUTPUT: "a.blk.{bid}.attn_out",
MODEL_TENSOR.A_ENC_OUTPUT_NORM: "a.blk.{bid}.ln2",
MODEL_TENSOR.A_ENC_FFN_NORM: "a.blk.{bid}.ffn_norm",
MODEL_TENSOR.A_ENC_FFN_POST_NORM: "a.blk.{bid}.ffn_post_norm",
MODEL_TENSOR.A_ENC_FFN_SCALE: "a.blk.{bid}.ffn_scale",
MODEL_TENSOR.A_ENC_FFN_UP: "a.blk.{bid}.ffn_up",
MODEL_TENSOR.A_ENC_FFN_GATE: "a.blk.{bid}.ffn_gate",
MODEL_TENSOR.A_ENC_FFN_DOWN: "a.blk.{bid}.ffn_down",
MODEL_TENSOR.A_ENC_FFN_NORM_1: "a.blk.{bid}.ffn_norm_1",
MODEL_TENSOR.A_ENC_FFN_POST_NORM_1: "a.blk.{bid}.ffn_post_norm_1",
MODEL_TENSOR.A_ENC_FFN_SCALE_1: "a.blk.{bid}.ffn_scale_1",
MODEL_TENSOR.A_ENC_FFN_UP_1: "a.blk.{bid}.ffn_up_1",
MODEL_TENSOR.A_ENC_FFN_GATE_1: "a.blk.{bid}.ffn_gate_1",
MODEL_TENSOR.A_ENC_FFN_DOWN_1: "a.blk.{bid}.ffn_down_1",
@ -1120,6 +1159,10 @@ TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
MODEL_TENSOR.A_MMPROJ_FC: "mm.a.fc",
MODEL_TENSOR.A_MM_NORM_PRE: "mm.a.norm_pre",
MODEL_TENSOR.A_MM_NORM_MID: "mm.a.norm_mid",
MODEL_TENSOR.A_MM_INP_PROJ: "mm.a.input_projection", # gemma3n
MODEL_TENSOR.A_MM_SOFT_EMB_NORM: "mm.a.soft_emb_norm", # gemma3n
MODEL_TENSOR.A_MM_EMBEDDING: "mm.a.embedding", # gemma3n
MODEL_TENSOR.A_MM_HARD_EMB_NORM: "mm.a.hard_emb_norm", # gemma3n
# lfm2 audio
MODEL_TENSOR.A_ENC_NORM_CONV: "a.blk.{bid}.norm_conv",
MODEL_TENSOR.A_ENC_LINEAR_POS: "a.blk.{bid}.linear_pos",
@ -1170,6 +1213,15 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.V_MM_INP_PROJ,
MODEL_TENSOR.V_MM_INP_NORM,
MODEL_TENSOR.V_MM_SOFT_EMB_NORM,
MODEL_TENSOR.V_MM_EMBEDDING,
MODEL_TENSOR.V_MM_HARD_EMB_NORM,
MODEL_TENSOR.V_ENC_CONV_STEM,
MODEL_TENSOR.V_ENC_CONV_STEM_NORM,
MODEL_TENSOR.V_ENC_MSFA_EXP,
MODEL_TENSOR.V_ENC_MSFA_EXP_NORM,
MODEL_TENSOR.V_ENC_MSFA_PROJ,
MODEL_TENSOR.V_ENC_MSFA_PROJ_NORM,
MODEL_TENSOR.V_ENC_MSFA_NORM,
MODEL_TENSOR.V_RESMPL_POS_EMBD_K,
MODEL_TENSOR.V_RESMPL_ATTN_Q,
MODEL_TENSOR.V_RESMPL_ATTN_K,
@ -1197,19 +1249,26 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.A_ENC_EMBD_NORM,
MODEL_TENSOR.A_ENC_EMBD_TO_LOGITS,
MODEL_TENSOR.A_ENC_CONV1D,
MODEL_TENSOR.A_ENC_CONV1D_NORM,
MODEL_TENSOR.A_PRE_NORM,
MODEL_TENSOR.A_POST_NORM,
MODEL_TENSOR.A_ENC_LAYER_PRE_NORM,
MODEL_TENSOR.A_ENC_ATTN_Q,
MODEL_TENSOR.A_ENC_ATTN_K,
MODEL_TENSOR.A_ENC_ATTN_V,
MODEL_TENSOR.A_ENC_PER_DIM_SCALE,
MODEL_TENSOR.A_ENC_INPUT_NORM,
MODEL_TENSOR.A_ENC_OUTPUT,
MODEL_TENSOR.A_ENC_OUTPUT_NORM,
MODEL_TENSOR.A_ENC_FFN_NORM,
MODEL_TENSOR.A_ENC_FFN_POST_NORM,
MODEL_TENSOR.A_ENC_FFN_SCALE,
MODEL_TENSOR.A_ENC_FFN_UP,
MODEL_TENSOR.A_ENC_FFN_GATE,
MODEL_TENSOR.A_ENC_FFN_DOWN,
MODEL_TENSOR.A_ENC_FFN_NORM_1,
MODEL_TENSOR.A_ENC_FFN_POST_NORM_1,
MODEL_TENSOR.A_ENC_FFN_SCALE_1,
MODEL_TENSOR.A_ENC_FFN_UP_1,
MODEL_TENSOR.A_ENC_FFN_GATE_1,
MODEL_TENSOR.A_ENC_FFN_DOWN_1,
@ -1226,6 +1285,10 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.A_ENC_CONV_NORM,
MODEL_TENSOR.A_ENC_CONV_PW1,
MODEL_TENSOR.A_ENC_CONV_PW2,
MODEL_TENSOR.A_MM_INP_PROJ,
MODEL_TENSOR.A_MM_SOFT_EMB_NORM,
MODEL_TENSOR.A_MM_EMBEDDING,
MODEL_TENSOR.A_MM_HARD_EMB_NORM,
],
MODEL_ARCH.LLAMA: [
MODEL_TENSOR.TOKEN_EMBD,
@ -1675,6 +1738,7 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_POST_NORM,
MODEL_TENSOR.ATTN_GATE,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.FFN_GATE_INP,
MODEL_TENSOR.FFN_GATE_INP_SHEXP,
MODEL_TENSOR.FFN_UP_SHEXP,
@ -3496,6 +3560,8 @@ class GGUFValueType(IntEnum):
class VisionProjectorType:
GEMMA3 = "gemma3"
GEMMA3NV = "gemma3nv"
GEMMA3NA = "gemma3na"
IDEFICS3 = "idefics3"
PIXTRAL = "pixtral"
LLAMA4 = "llama4"

View File

@ -1086,6 +1086,9 @@ class GGUFWriter:
def add_clip_projector_type(self, value: str) -> None:
self.add_string(Keys.Clip.PROJECTOR_TYPE, value)
def add_clip_vision_projector_type(self, value: str) -> None:
self.add_string(Keys.ClipVision.PROJECTOR_TYPE, value)
def add_vision_projection_dim(self, value: int) -> None:
self.add_uint32(Keys.ClipVision.PROJECTION_DIM, value)
@ -1168,6 +1171,9 @@ class GGUFWriter:
# audio models
def add_clip_audio_projector_type(self, value: str) -> None:
self.add_string(Keys.ClipAudio.PROJECTOR_TYPE, value)
def add_audio_projection_dim(self, value: int) -> None:
self.add_uint32(Keys.ClipAudio.PROJECTION_DIM, value)

View File

@ -123,6 +123,40 @@ class TensorNameMap:
MODEL_TENSOR.CONV1D: (
"backbone.embed", # roberta
),
MODEL_TENSOR.V_MM_EMBEDDING: (
"model.embed_vision.embedding", # gemma3n
),
MODEL_TENSOR.V_MM_HARD_EMB_NORM: (
"model.embed_vision.hard_embedding_norm", # gemma3n
),
MODEL_TENSOR.V_MM_INP_PROJ: (
"model.embed_vision.embedding_projection", # gemma3n
),
MODEL_TENSOR.V_MM_SOFT_EMB_NORM: (
"model.embed_vision.soft_embedding_norm", # gemma3n
),
MODEL_TENSOR.V_ENC_CONV_STEM: (
"model.vision_tower.timm_model.conv_stem.conv", # gemma3n
),
MODEL_TENSOR.V_ENC_CONV_STEM_NORM: (
"model.vision_tower.timm_model.conv_stem.bn", # gemma3n
),
MODEL_TENSOR.V_ENC_MSFA_EXP: (
"model.vision_tower.timm_model.msfa.ffn.pw_exp.conv", # gemma3n
),
MODEL_TENSOR.V_ENC_MSFA_EXP_NORM: (
"model.vision_tower.timm_model.msfa.ffn.pw_exp.bn", # gemma3n
),
MODEL_TENSOR.V_ENC_MSFA_PROJ: (
"model.vision_tower.timm_model.msfa.ffn.pw_proj.conv", # gemma3n
),
MODEL_TENSOR.V_ENC_MSFA_PROJ_NORM: (
"model.vision_tower.timm_model.msfa.ffn.pw_proj.bn", # gemma3n
),
MODEL_TENSOR.V_ENC_MSFA_NORM: (
"model.vision_tower.timm_model.msfa.norm", # gemma3n
),
}
block_mappings_cfg: dict[MODEL_TENSOR, tuple[str, ...]] = {
@ -1575,6 +1609,11 @@ class TensorNameMap:
MODEL_TENSOR.A_ENC_CONV1D: (
"audio_tower.conv{bid}", # ultravox
"conformer.pre_encode.conv.{bid}", # lfm2
"model.audio_tower.subsample_conv_projection.conv_{bid}.conv", # gemma3n
),
MODEL_TENSOR.A_ENC_CONV1D_NORM: (
"model.audio_tower.subsample_conv_projection.conv_{bid}.norm", # gemma3n
),
MODEL_TENSOR.A_PRE_NORM: (),
@ -1587,40 +1626,64 @@ class TensorNameMap:
MODEL_TENSOR.A_ENC_ATTN_Q: (
"audio_tower.layers.{bid}.self_attn.q_proj", # ultravox
"conformer.layers.{bid}.self_attn.linear_q", # lfm2
"conformer.layers.{bid}.attention.attn.q_proj", # gemma3n
),
MODEL_TENSOR.A_ENC_ATTN_K: (
"audio_tower.layers.{bid}.self_attn.k_proj", # ultravox
"conformer.layers.{bid}.self_attn.linear_k", # lfm2
"conformer.layers.{bid}.attention.attn.k_proj", # gemma3n
),
MODEL_TENSOR.A_ENC_ATTN_V: (
"audio_tower.layers.{bid}.self_attn.v_proj", # ultravox
"conformer.layers.{bid}.self_attn.linear_v", # lfm2
"conformer.layers.{bid}.attention.attn.v_proj", # gemma3n
),
MODEL_TENSOR.A_ENC_PER_DIM_SCALE: (
"conformer.layers.{bid}.attention.attn.per_dim_scale", # gemma3n
),
MODEL_TENSOR.A_ENC_LAYER_PRE_NORM: (
"conformer.layers.{bid}.norm", # gemma3n
),
MODEL_TENSOR.A_ENC_INPUT_NORM: (
"audio_tower.layers.{bid}.self_attn_layer_norm", # ultravox
"conformer.layers.{bid}.norm_self_att", # lfm2
"conformer.layers.{bid}.attention.pre_attn_norm", # gemma3n
),
MODEL_TENSOR.A_ENC_OUTPUT: (
"audio_tower.layers.{bid}.self_attn.out_proj", # ultravox
"conformer.layers.{bid}.self_attn.linear_out", # lfm2
"conformer.layers.{bid}.attention.post", # gemma3n
),
MODEL_TENSOR.A_ENC_OUTPUT_NORM: (
"audio_tower.layers.{bid}.final_layer_norm", # ultravox
"conformer.layers.{bid}.norm_out", # lfm2
"conformer.layers.{bid}.attention.post_norm", # gemma3n
),
MODEL_TENSOR.A_ENC_FFN_NORM: (
"conformer.layers.{bid}.norm_feed_forward1", # lfm2
"conformer.layers.{bid}.ffw_layer_start.pre_layer_norm", # gemma3n
),
MODEL_TENSOR.A_ENC_FFN_POST_NORM: (
"conformer.layers.{bid}.ffw_layer_start.post_layer_norm", # gemma3n
),
MODEL_TENSOR.A_ENC_FFN_SCALE: (
"conformer.layers.{bid}.ffw_layer_start.post_layer_scale", # gemma3n
), ),
MODEL_TENSOR.A_ENC_FFN_UP: ( MODEL_TENSOR.A_ENC_FFN_UP: (
"audio_tower.layers.{bid}.fc1", # ultravox "audio_tower.layers.{bid}.fc1", # ultravox
"conformer.layers.{bid}.feed_forward1.linear1", # lfm2 "conformer.layers.{bid}.feed_forward1.linear1", # lfm2
"conformer.layers.{bid}.ffw_layer_start.ffw_layer_1", # gemma3n
), ),
MODEL_TENSOR.A_ENC_FFN_GATE: (), MODEL_TENSOR.A_ENC_FFN_GATE: (),
@ -1628,22 +1691,35 @@ class TensorNameMap:
MODEL_TENSOR.A_ENC_FFN_DOWN: (
"audio_tower.layers.{bid}.fc2", # ultravox
"conformer.layers.{bid}.feed_forward1.linear2", # lfm2
"conformer.layers.{bid}.ffw_layer_start.ffw_layer_2", # gemma3n
),
MODEL_TENSOR.A_ENC_FFN_UP_1: (
"conformer.layers.{bid}.feed_forward2.linear1", # lfm2
"conformer.layers.{bid}.ffw_layer_end.ffw_layer_1", # gemma3n
),
MODEL_TENSOR.A_ENC_FFN_DOWN_1: (
"conformer.layers.{bid}.feed_forward2.linear2", # lfm2
"conformer.layers.{bid}.ffw_layer_end.ffw_layer_2", # gemma3n
),
MODEL_TENSOR.A_ENC_FFN_NORM_1: (
"conformer.layers.{bid}.norm_feed_forward2", # lfm2
"conformer.layers.{bid}.ffw_layer_end.pre_layer_norm", # gemma3n
),
MODEL_TENSOR.A_ENC_FFN_POST_NORM_1: (
"conformer.layers.{bid}.ffw_layer_end.post_layer_norm", # gemma3n
),
MODEL_TENSOR.A_ENC_FFN_SCALE_1: (
"conformer.layers.{bid}.ffw_layer_end.post_layer_scale", # gemma3n
),
MODEL_TENSOR.A_ENC_LINEAR_POS: (
"conformer.layers.{bid}.self_attn.linear_pos", # lfm2
"conformer.layers.{bid}.attention.attn.relative_position_embedding.pos_proj", # gemma3n
),
MODEL_TENSOR.A_ENC_POS_BIAS_U: (
@ -1656,6 +1732,7 @@ class TensorNameMap:
MODEL_TENSOR.A_ENC_OUT: (
"conformer.pre_encode.out", # lfm2
"model.audio_tower.subsample_conv_projection.input_proj_linear", # gemma3n
),
# note: some tensors below has "audio." pseudo-prefix, to prevent conflicts with vision tensors
@ -1681,22 +1758,40 @@ class TensorNameMap:
MODEL_TENSOR.A_ENC_CONV_DW: (
"conformer.layers.{bid}.conv.depthwise_conv", # lfm2
"conformer.layers.{bid}.lconv1d.depthwise_conv1d", # gemma3n
),
MODEL_TENSOR.A_ENC_CONV_NORM: (
"conformer.layers.{bid}.conv.batch_norm", # lfm2
"conformer.layers.{bid}.lconv1d.pre_layer_norm", # gemma3n
),
MODEL_TENSOR.A_ENC_CONV_PW1: (
"conformer.layers.{bid}.conv.pointwise_conv1", # lfm2
"conformer.layers.{bid}.lconv1d.linear_start", # gemma3n
),
MODEL_TENSOR.A_ENC_CONV_PW2: (
"conformer.layers.{bid}.conv.pointwise_conv2", # lfm2
"conformer.layers.{bid}.lconv1d.linear_end", # gemma3n
),
MODEL_TENSOR.A_ENC_NORM_CONV: (
"conformer.layers.{bid}.norm_conv", # lfm2
"conformer.layers.{bid}.lconv1d.conv_norm", # gemma3n
),
MODEL_TENSOR.A_MM_EMBEDDING: (
"model.embed_audio.embedding", # gemma3n
),
MODEL_TENSOR.A_MM_HARD_EMB_NORM: (
"model.embed_audio.hard_embedding_norm", # gemma3n
),
MODEL_TENSOR.A_MM_INP_PROJ: (
"model.embed_audio.embedding_projection", # gemma3n
),
MODEL_TENSOR.A_MM_SOFT_EMB_NORM: (
"model.embed_audio.soft_embedding_norm", # gemma3n
),
# NextN/MTP tensors for GLM4_MOE


@ -309,6 +309,7 @@ extern "C" {
// Keep the booleans together to avoid misalignment during copy-by-value.
bool vocab_only; // only load the vocabulary, no weights
bool use_mmap; // use mmap if possible
bool use_direct_io; // use direct io, takes precedence over use_mmap
bool use_mlock; // force system to keep model in RAM
bool check_tensors; // validate model tensor data
bool use_extra_bufts; // use extra buffer types (used for weight repacking)
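For reference, a minimal hedged sketch of how a client could opt into the new flag from the public API (the model path and surrounding boilerplate are illustrative, not part of this diff):

```cpp
// Minimal sketch, assuming the llama_model_params fields shown above.
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    mparams.use_direct_io = true;   // takes precedence over use_mmap (now the default)
    mparams.use_mmap      = true;   // effectively ignored while direct I/O is active

    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // placeholder path
    if (model == NULL) {
        return 1;
    }
    llama_model_free(model);
    return 0;
}
```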
@ -494,7 +495,7 @@ extern "C" {
struct llama_context_params * cparams,
float * tensor_split, // writable buffer for tensor split, needs at least llama_max_devices elements
struct llama_model_tensor_buft_override * tensor_buft_overrides, // writable buffer for overrides, needs at least llama_max_tensor_buft_overrides elements
size_t margin, // margin of memory to leave per device in bytes
size_t * margins, // margins of memory to leave per device in bytes
uint32_t n_ctx_min, // minimum context size to set when trying to reduce memory use
enum ggml_log_level log_level); // minimum log level to print during fitting, lower levels go to debug log
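For illustration only, a hedged sketch of a caller adapting to the per-device margins buffer (the helper names follow the parameter comments above; the 1 GiB value and model path are assumptions):

```cpp
// Sketch of calling llama_params_fit with one free-memory margin per device.
#include <vector>
#include "llama.h"

static void fit_example() {
    llama_model_params   mparams = llama_model_default_params();
    llama_context_params cparams = llama_context_default_params();

    std::vector<float> tensor_split(llama_max_devices(), 0.0f);
    std::vector<llama_model_tensor_buft_override> overrides(llama_max_tensor_buft_overrides());
    // previously a single size_t margin applied to every device; now one entry per device
    std::vector<size_t> margins(llama_max_devices(), 1024ull*1024ull*1024ull); // ~1 GiB each

    llama_params_fit("model.gguf", &mparams, &cparams,
                     tensor_split.data(), overrides.data(), margins.data(),
                     /*n_ctx_min=*/4096, GGML_LOG_LEVEL_INFO);
}
```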
@ -1291,6 +1292,8 @@ extern "C" {
// available samplers:
LLAMA_API struct llama_sampler * llama_sampler_init_greedy(void);
/// seed == LLAMA_DEFAULT_SEED to use a random seed.
LLAMA_API struct llama_sampler * llama_sampler_init_dist(uint32_t seed);
/// @details Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
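A small hedged example of the documented seed semantics (sketch only):

```cpp
// LLAMA_DEFAULT_SEED requests a random seed; any other value gives reproducible sampling.
llama_sampler * smpl_random = llama_sampler_init_dist(LLAMA_DEFAULT_SEED);
llama_sampler * smpl_fixed  = llama_sampler_init_dist(42);
// ... attach to a sampler chain or sample with them ...
llama_sampler_free(smpl_fixed);
llama_sampler_free(smpl_random);
```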


@ -1,9 +1,22 @@
COPYRIGHT AND PERMISSION NOTICE

Copyright (c) 1996 - 2025, Daniel Stenberg, daniel@haxx.se, and many contributors, see the THANKS file.
Copyright (c) 1996 - 2026, Daniel Stenberg, <daniel@haxx.se>, and many
contributors, see the THANKS file.

All rights reserved.

Permission to use, copy, modify, and distribute this software for any purpose
with or without fee is hereby granted, provided that the above copyright
notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN
NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not
be used in advertising or otherwise to promote the sale, use or other dealings
in this Software without prior written authorization of the copyright holder.


@ -1,26 +0,0 @@
Copyright (c) 2010-2014, Salvatore Sanfilippo <antirez at gmail dot com>
Copyright (c) 2010-2013, Pieter Noordhuis <pcnoordhuis at gmail dot com>
Copyright (c) 2025, Eric Curtin <ericcurtin17 at gmail dot com>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

scripts/pr2wt.sh Executable file

@ -0,0 +1,79 @@
#!/usr/bin/env bash
# initialize a new worktree from a PR number:
#
# - creates a new remote using the fork's clone URL
# - creates a local branch tracking the remote branch
# - creates a new worktree in a parent folder, suffixed with "-pr-$PR"
#
# sample usage:
# ./scripts/pr2wt.sh 12345
# ./scripts/pr2wt.sh 12345 opencode
# ./scripts/pr2wt.sh 12345 "cmake -B build && cmake --build build"
# ./scripts/pr2wt.sh 12345 "bash -l"
function usage() {
echo "usage: $0 <pr_number> [cmd]"
exit 1
}
# check we are in the right directory
if [[ ! -f "scripts/pr2wt.sh" ]]; then
echo "error: this script must be run from the root of the repository"
exit 1
fi
if [[ $# -lt 1 || $# -gt 2 ]]; then
usage
fi
PR=$1
[[ "$PR" =~ ^[0-9]+$ ]] || { echo "error: PR number must be numeric"; exit 1; }
url_origin=$(git config --get remote.origin.url) || {
echo "error: no remote named 'origin' in this repository"
exit 1
}
org_repo=$(echo $url_origin | cut -d/ -f4-)
org_repo=${org_repo%.git}
echo "org/repo: $org_repo"
meta=$(curl -sSLf -H "Accept: application/vnd.github+json" "https://api.github.com/repos/$org_repo/pulls/$PR")
url_remote=$(echo "$meta" | jq -r '.head.repo.clone_url')
head_ref=$(echo "$meta" | jq -r '.head.ref')
echo "url: $url_remote"
echo "head_ref: $head_ref"
url_remote_cur=$(git config --get "remote.pr/$PR.url" 2>/dev/null || true)
if [[ "$url_remote_cur" != "$url_remote" ]]; then
git remote rm pr/$PR 2> /dev/null
git remote add pr/$PR "$url_remote"
fi
git fetch "pr/$PR" "$head_ref"
dir=$(basename $(pwd))
git branch -D pr/$PR 2> /dev/null
git worktree add -b pr/$PR ../$dir-pr-$PR pr/$PR/$head_ref 2> /dev/null
wt_path=$(cd ../$dir-pr-$PR && pwd)
echo "git worktree created in $wt_path"
cd $wt_path
git branch --set-upstream-to=pr/$PR/$head_ref
git pull --ff-only || {
echo "error: failed to pull pr/$PR"
exit 1
}
if [[ $# -eq 2 ]]; then
echo "executing: $2"
eval "$2"
fi


@ -16,7 +16,8 @@ vendor = {
# "https://github.com/mackron/miniaudio/raw/refs/tags/0.11.23/miniaudio.h": "vendor/miniaudio/miniaudio.h", # "https://github.com/mackron/miniaudio/raw/refs/tags/0.11.23/miniaudio.h": "vendor/miniaudio/miniaudio.h",
"https://github.com/mackron/miniaudio/raw/669ed3e844524fcd883231b13095baee9f6de304/miniaudio.h": "vendor/miniaudio/miniaudio.h", "https://github.com/mackron/miniaudio/raw/669ed3e844524fcd883231b13095baee9f6de304/miniaudio.h": "vendor/miniaudio/miniaudio.h",
"https://raw.githubusercontent.com/yhirose/cpp-httplib/refs/tags/v0.28.0/httplib.h": "vendor/cpp-httplib/httplib.h", "https://raw.githubusercontent.com/yhirose/cpp-httplib/refs/tags/v0.30.1/httplib.h": "vendor/cpp-httplib/httplib.h",
"https://raw.githubusercontent.com/yhirose/cpp-httplib/refs/tags/v0.30.1/LICENSE": "vendor/cpp-httplib/LICENSE",
"https://raw.githubusercontent.com/sheredom/subprocess.h/b49c56e9fe214488493021017bf3954b91c7c1f5/subprocess.h": "vendor/sheredom/subprocess.h", "https://raw.githubusercontent.com/sheredom/subprocess.h/b49c56e9fe214488493021017bf3954b91c7c1f5/subprocess.h": "vendor/sheredom/subprocess.h",
} }


@ -950,6 +950,8 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
LLM_TENSOR_ATTN_K_NORM,
LLM_TENSOR_ATTN_V,
LLM_TENSOR_ATTN_OUT,
LLM_TENSOR_ATTN_QKV,
LLM_TENSOR_ATTN_GATE,
LLM_TENSOR_FFN_NORM,
LLM_TENSOR_FFN_GATE_INP,
LLM_TENSOR_FFN_GATE_EXPS,


@ -96,11 +96,9 @@ void llm_graph_input_pos_bucket::set_input(const llama_ubatch * ubatch) {
int32_t * data = (int32_t *) pos_bucket->data;
for (int h = 0; h < 1; ++h) {
for (int j = 0; j < n_tokens; ++j) {
for (int i = 0; i < n_tokens; ++i) {
data[h*(n_tokens*n_tokens) + j*n_tokens + i] = llama_relative_position_bucket(ubatch->pos[i], ubatch->pos[j], hparams.n_rel_attn_bkts, true);
data[j*n_tokens + i] = llama_relative_position_bucket(ubatch->pos[i], ubatch->pos[j], hparams.n_rel_attn_bkts, true);
}
}
}
}
@ -323,12 +321,11 @@ void llm_graph_input_attn_no_cache::set_input(const llama_ubatch * ubatch) {
const int64_t n_tokens = ubatch->n_tokens;
const auto fill_mask = [&](float * data, int n_swa, llama_swa_type swa_type) {
for (int h = 0; h < 1; ++h) {
for (int i1 = 0; i1 < n_tokens; ++i1) {
const llama_seq_id s1 = ubatch->seq_id[i1][0];
const llama_pos p1 = ubatch->pos[i1];
const uint64_t idst = h*(n_kv*n_tokens) + i1*n_kv;
const uint64_t idst = i1*n_kv;
for (int i0 = 0; i0 < n_tokens; ++i0) {
const llama_seq_id s0 = ubatch->seq_id[i0][0];
@ -352,7 +349,6 @@ void llm_graph_input_attn_no_cache::set_input(const llama_ubatch * ubatch) {
data[idst + i0] = hparams.use_alibi ? -std::abs(p0 - p1) : 0.0f;
}
}
}
};
{
@ -454,7 +450,6 @@ void llm_graph_input_attn_cross::set_input(const llama_ubatch * ubatch) {
float * data = (float *) cross_kq_mask->data;
for (int h = 0; h < 1; ++h) {
for (int i = 0; i < n_tokens; ++i) {
for (int j = 0; j < n_enc; ++j) {
float f = -INFINITY;
@ -467,14 +462,7 @@ void llm_graph_input_attn_cross::set_input(const llama_ubatch * ubatch) {
}
}
data[h*(n_enc*n_tokens) + i*n_enc + j] = f;
data[i*n_enc + j] = f;
}
}
for (int i = n_tokens; i < n_tokens; ++i) {
for (int j = 0; j < n_enc; ++j) {
data[h*(n_enc*n_tokens) + i*n_enc + j] = -INFINITY;
}
}
}
}


@ -110,7 +110,7 @@ struct llama_file::impl {
}
}
void read_raw(void * ptr, size_t len) const {
void read_raw(void * ptr, size_t len) {
size_t bytes_read = 0;
while (bytes_read < len) {
size_t chunk_size = std::min<size_t>(len - bytes_read, 64*1024*1024);
@ -127,7 +127,7 @@ struct llama_file::impl {
}
}
uint32_t read_u32() const {
uint32_t read_u32() {
uint32_t val;
read_raw(&val, sizeof(val));
return val;
@ -154,8 +154,8 @@ struct llama_file::impl {
write_raw(&val, sizeof(val));
}
void read_aligned_chunk(size_t offset, void * dest, size_t size) const {
throw std::runtime_error("DirectIO is not implemented on Windows.");
bool has_direct_io() const {
return true;
}
~impl() {
@ -164,11 +164,23 @@ struct llama_file::impl {
}
}
#else
impl(const char * fname, const char * mode, [[maybe_unused]] const bool use_direct_io = false) {
impl(const char * fname, const char * mode, [[maybe_unused]] const bool use_direct_io = false) : fname(fname) {
#ifdef __linux__
// Try unbuffered I/O for read only
if (use_direct_io && std::strcmp(mode, "rb") == 0) {
fd = open(fname, O_RDONLY | O_DIRECT);
if (init_fd()) {
return;
}
LLAMA_LOG_WARN("Failed to open file '%s' with error: %s. Falling back to buffered I/O",
fname, strerror(errno));
}
#endif
init_fp(mode);
}
#ifdef __linux__
bool init_fd() {
fd = open(fname.c_str(), O_RDONLY | O_DIRECT);
if (fd != -1) {
struct stat file_stats{};
@ -181,16 +193,16 @@ struct llama_file::impl {
if (ret == -1) {
throw std::runtime_error(format("seek error: %s", strerror(errno)));
}
return;
return true;
}
return false;
LLAMA_LOG_WARN("Failed to open model %s with error: %s. Falling back to buffered I/O",
fname, strerror(errno));
}
#endif
fp = ggml_fopen(fname, mode);
void init_fp(const char * mode) {
fp = ggml_fopen(fname.c_str(), mode);
if (fp == NULL) {
throw std::runtime_error(format("failed to open %s: %s", fname, strerror(errno)));
throw std::runtime_error(format("failed to open %s: %s", fname.c_str(), strerror(errno)));
}
seek(0, SEEK_END);
size = tell();
@ -226,7 +238,7 @@ struct llama_file::impl {
}
}
void read_raw(void * ptr, size_t len) const {
void read_raw_unsafe(void * ptr, size_t len) {
if (len == 0) {
return;
}
@ -249,6 +261,17 @@ struct llama_file::impl {
if (errno == EINTR) {
continue; // Interrupted by signal, retry
}
// Fallback to std::fread in case the DMA controller cannot access the buffer
if (errno == EFAULT) {
auto curr_off = tell();
close(fd);
fd = -1;
alignment = 1;
init_fp("rb");
seek(curr_off, SEEK_SET);
read_raw_unsafe(ptr, len);
return;
}
throw std::runtime_error(format("read error: %s", strerror(errno))); throw std::runtime_error(format("read error: %s", strerror(errno)));
} }
if (ret == 0) { if (ret == 0) {
@ -266,7 +289,8 @@ struct llama_file::impl {
}
}
void read_aligned_chunk(size_t offset, void * dest, size_t size) const {
void read_aligned_chunk(void * dest, size_t size) {
size_t offset = tell();
off_t aligned_offset = offset & ~(alignment - 1);
off_t offset_from_alignment = offset - aligned_offset;
size_t bytes_to_read = (offset_from_alignment + size + alignment - 1) & ~(alignment - 1);
@ -283,13 +307,21 @@ struct llama_file::impl {
std::unique_ptr<void, aligned_buffer_deleter> buffer(raw_buffer);
seek(aligned_offset, SEEK_SET);
read_raw(buffer.get(), bytes_to_read);
read_raw_unsafe(buffer.get(), bytes_to_read);
uintptr_t actual_data = reinterpret_cast<uintptr_t>(buffer.get()) + offset_from_alignment;
memcpy(dest, reinterpret_cast<void *>(actual_data), size);
}
uint32_t read_u32() const {
void read_raw(void * ptr, size_t len) {
if (has_direct_io()) {
read_aligned_chunk(ptr, len);
} else {
read_raw_unsafe(ptr, len);
}
}
uint32_t read_u32() {
uint32_t ret;
read_raw(&ret, sizeof(ret));
return ret;
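As a side note, the rounding done by read_aligned_chunk above can be checked with a small standalone sketch (the 4096-byte alignment and the offsets are illustrative values, not taken from this diff):

```cpp
// Standalone sketch of the O_DIRECT alignment arithmetic used by read_aligned_chunk.
#include <cstdio>

int main() {
    const size_t alignment = 4096;   // e.g. logical block size reported for the file
    const size_t offset    = 123456; // current file position (tell())
    const size_t size      = 1000;   // bytes the caller actually wants

    const size_t aligned_offset        = offset & ~(alignment - 1);           // round down: 122880
    const size_t offset_from_alignment = offset - aligned_offset;             // bytes to skip: 576
    const size_t bytes_to_read         = (offset_from_alignment + size + alignment - 1)
                                         & ~(alignment - 1);                  // round up: 4096

    std::printf("read %zu bytes at offset %zu, copy result from +%zu\n",
                bytes_to_read, aligned_offset, offset_from_alignment);
    return 0;
}
```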
@ -310,6 +342,10 @@ struct llama_file::impl {
write_raw(&val, sizeof(val));
}
bool has_direct_io() const {
return fd != -1 && alignment > 1;
}
~impl() {
if (fd != -1) {
close(fd);
@ -318,17 +354,9 @@ struct llama_file::impl {
}
}
int fd = -1;
std::string fname;
#endif
void read_raw_at(void * ptr, size_t len, size_t offset) const {
if (alignment != 1) {
read_aligned_chunk(offset, ptr, len);
} else {
seek(offset, SEEK_SET);
read_raw(ptr, len);
}
}
size_t read_alignment() const {
return alignment;
}
@ -347,6 +375,7 @@ size_t llama_file::tell() const { return pimpl->tell(); }
size_t llama_file::size() const { return pimpl->size; }
size_t llama_file::read_alignment() const { return pimpl->read_alignment(); }
bool llama_file::has_direct_io() const { return pimpl->has_direct_io(); }
int llama_file::file_id() const {
#ifdef _WIN32
@ -361,10 +390,14 @@ int llama_file::file_id() const {
}
}
void llama_file::seek(size_t offset, int whence) const { pimpl->seek(offset, whence); }
void llama_file::read_raw(void * ptr, size_t len) const { pimpl->read_raw(ptr, len); }
void llama_file::read_raw(void * ptr, size_t len) { pimpl->read_raw(ptr, len); }
void llama_file::read_raw_at(void * ptr, size_t len, size_t offset) const { pimpl->read_raw_at(ptr, len, offset); }
#ifdef _WIN32
void llama_file::read_raw_unsafe(void * ptr, size_t len) { pimpl->read_raw(ptr, len); }
#else
void llama_file::read_raw_unsafe(void * ptr, size_t len) { pimpl->read_raw_unsafe(ptr, len); }
#endif
uint32_t llama_file::read_u32() const { return pimpl->read_u32(); }
uint32_t llama_file::read_u32() { return pimpl->read_u32(); }
void llama_file::write_raw(const void * ptr, size_t len) const { pimpl->write_raw(ptr, len); }
void llama_file::write_u32(uint32_t val) const { pimpl->write_u32(val); }


@ -24,15 +24,16 @@ struct llama_file {
void seek(size_t offset, int whence) const;
void read_raw(void * ptr, size_t len) const;
void read_raw(void * ptr, size_t len);
void read_raw_at(void * ptr, size_t len, size_t offset) const;
void read_raw_unsafe(void * ptr, size_t len);
void read_aligned_chunk(size_t offset, void * dest, size_t size) const;
void read_aligned_chunk(void * dest, size_t size);
uint32_t read_u32() const;
uint32_t read_u32();
void write_raw(const void * ptr, size_t len) const;
void write_u32(uint32_t val) const;
size_t read_alignment() const;
bool has_direct_io() const;
private:
struct impl;
std::unique_ptr<impl> pimpl;


@ -495,6 +495,7 @@ llama_model_loader::llama_model_loader(
const std::string & fname,
std::vector<std::string> & splits,
bool use_mmap,
bool use_direct_io,
bool check_tensors,
bool no_alloc,
const llama_model_kv_override * param_overrides_p,
@ -527,9 +528,17 @@ llama_model_loader::llama_model_loader(
get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
llm_kv = LLM_KV(llm_arch_from_string(arch_name));
files.emplace_back(new llama_file(fname.c_str(), "rb", !use_mmap));
files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
contexts.emplace_back(ctx);
use_direct_io = use_direct_io && files.back()->has_direct_io();
// Disable mmap in case Direct I/O is enabled and available
if (use_direct_io && use_mmap) {
use_mmap = false;
LLAMA_LOG_WARN("%s: direct I/O is enabled, disabling mmap\n", __func__);
}
// Save tensors data offset of the main file.
// For subsidiary files, `meta` tensor data offset must not be used,
// so we build a unified tensors index for weights.
@ -595,7 +604,7 @@ llama_model_loader::llama_model_loader(
}
}
files.emplace_back(new llama_file(fname_split, "rb", !use_mmap));
files.emplace_back(new llama_file(fname_split, "rb", use_direct_io));
contexts.emplace_back(ctx);
// Save tensors data offset info of the shard.
@ -739,6 +748,7 @@ llama_model_loader::llama_model_loader(
}
this->use_mmap = use_mmap;
this->use_direct_io = use_direct_io;
this->check_tensors = check_tensors;
this->no_alloc = no_alloc;
}
@ -1100,7 +1110,8 @@ bool llama_model_loader::load_all_data(
const auto & file = files.at(weight->idx);
if (ggml_backend_buffer_is_host(cur->buffer)) {
file->read_raw_at(cur->data, n_size, weight->offs);
file->seek(weight->offs, SEEK_SET);
file->read_raw(cur->data, n_size);
if (check_tensors) {
validation_result.emplace_back(std::async(std::launch::async, [cur, n_size] {
return std::make_pair(cur, ggml_validate_row_data(cur->type, cur->data, n_size));
@ -1132,7 +1143,7 @@ bool llama_model_loader::load_all_data(
ggml_backend_event_synchronize(events[buffer_idx]);
// Read aligned chunk from file
file->read_raw(reinterpret_cast<void *>(ptr_dest_aligned), read_size);
file->read_raw_unsafe(reinterpret_cast<void *>(ptr_dest_aligned), read_size);
// Calculate actual data portion (excluding alignment padding)
uintptr_t ptr_data = ptr_dest_aligned;
@ -1162,7 +1173,8 @@ bool llama_model_loader::load_all_data(
}
} else {
read_buf.resize(n_size);
file->read_raw_at(read_buf.data(), n_size, weight->offs);
file->seek(weight->offs, SEEK_SET);
file->read_raw(read_buf.data(), n_size);
ggml_backend_tensor_set(cur, read_buf.data(), 0, n_size);
if (check_tensors && !ggml_validate_row_data(cur->type, read_buf.data(), n_size)) {
throw std::runtime_error(format("tensor '%s' has invalid data", ggml_get_name(cur)));


@ -70,6 +70,7 @@ struct llama_model_loader {
size_t n_bytes = 0;
bool use_mmap = false;
bool use_direct_io = false;
bool check_tensors;
bool no_alloc;
@ -97,6 +98,7 @@ struct llama_model_loader {
const std::string & fname,
std::vector<std::string> & splits, // optional, only need if the split does not follow naming scheme
bool use_mmap,
bool use_direct_io,
bool check_tensors,
bool no_alloc,
const llama_model_kv_override * param_overrides_p,


@ -2440,7 +2440,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
const bool use_mmap_buffer = true;
LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s)\n", __func__, ml.use_mmap ? "true" : "false");
LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s, direct_io = %s)\n",
__func__, ml.use_mmap ? "true" : "false", ml.use_direct_io ? "true" : "false");
// build a list of buffer types for the CPU and GPU devices
pimpl->cpu_buft_list = make_cpu_buft_list(devices, params.use_extra_bufts, params.no_host);
@ -2451,6 +2452,11 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
pimpl->gpu_buft_list.emplace(dev, std::move(buft_list));
}
ggml_backend_dev_t cpu_dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
if (cpu_dev == nullptr) {
throw std::runtime_error(format("%s: no CPU backend found", __func__));
}
// calculate the split points
bool all_zero = tensor_split == nullptr || std::all_of(tensor_split, tensor_split + n_devices(), [](float x) { return x == 0.0f; });
std::vector<float> splits(n_devices());
@ -2461,6 +2467,13 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
size_t total;
size_t free;
ggml_backend_dev_memory(dev, &free, &total);
// devices can return 0 bytes for free and total memory if they do not
// have any to report. in this case, we will use the host memory as a fallback
// fixes: https://github.com/ggml-org/llama.cpp/issues/18577
if (free == 0 && total == 0) {
ggml_backend_dev_memory(cpu_dev, &free, &total);
}
splits[i] = free;
}
} else {
@ -2477,10 +2490,6 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
splits[i] /= split_sum;
}
ggml_backend_dev_t cpu_dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
if (cpu_dev == nullptr) {
throw std::runtime_error(format("%s: no CPU backend found", __func__));
}
const int i_gpu_start = std::max(int(hparams.n_layer) + 1 - n_gpu_layers, 0);
const int act_gpu_layers = devices.empty() ? 0 : std::min(n_gpu_layers, int(n_layer) + 1);
auto get_layer_buft_list = [&](int il) -> llama_model::impl::layer_dev {
@ -6754,7 +6763,10 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
} else {
// Linear attention (gated delta net) specific tensors
// Create tensors with calculated dimensions
layer.ssm_in = create_tensor(tn(LLM_TENSOR_SSM_IN, "weight", i), { n_embd, qkvz_dim }, 0);
// note: ssm_in is used by legacy GGUF
layer.ssm_in = create_tensor(tn(LLM_TENSOR_SSM_IN, "weight", i), { n_embd, qkvz_dim }, TENSOR_NOT_REQUIRED);
layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", i), { n_embd, key_dim * 2 + value_dim }, TENSOR_NOT_REQUIRED);
layer.wqkv_gate = create_tensor(tn(LLM_TENSOR_ATTN_GATE, "weight", i), { n_embd, value_dim }, TENSOR_NOT_REQUIRED);
layer.ssm_conv1d = create_tensor(tn(LLM_TENSOR_SSM_CONV1D, "weight", i), { hparams.ssm_d_conv, conv_dim }, 0);
layer.ssm_dt = create_tensor(tn(LLM_TENSOR_SSM_DT, "bias", i), { hparams.ssm_dt_rank }, 0);
layer.ssm_a = create_tensor(tn(LLM_TENSOR_SSM_A_NOSCAN, i), { hparams.ssm_dt_rank }, 0);
@ -7973,6 +7985,7 @@ llama_model_params llama_model_default_params() {
/*.kv_overrides =*/ nullptr,
/*.vocab_only =*/ false,
/*.use_mmap =*/ true,
/*.use_direct_io =*/ true,
/*.use_mlock =*/ false,
/*.check_tensors =*/ false,
/*.use_extra_bufts =*/ true,


@ -596,7 +596,7 @@ static void llama_model_quantize_impl(const std::string & fname_inp, const std::
}
std::vector<std::string> splits = {};
llama_model_loader ml(fname_inp, splits, use_mmap, /*check_tensors*/ true, /*no_alloc*/ false, kv_overrides, nullptr);
llama_model_loader ml(fname_inp, splits, use_mmap, /*use_direct_io*/ true, /*check_tensors*/ true, /*no_alloc*/ false, kv_overrides, nullptr);
ml.init_mappings(false); // no prefetching
llama_model model(llama_model_default_params());


@ -111,8 +111,20 @@ static std::vector<llama_device_memory_data> llama_get_device_memory_data(
}
}
for (size_t i = 0; i < ret.size(); i++) {
size_t free, total;
size_t free;
size_t total;
ggml_backend_dev_memory(model->devices[i], &free, &total);
// devices can return 0 bytes for free and total memory if they do not
// have any to report. in this case, we will use the host memory as a fallback
// fixes: https://github.com/ggml-org/llama.cpp/issues/18577
if (free == 0 && total == 0) {
ggml_backend_dev_t cpu_dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
if (cpu_dev == nullptr) {
throw std::runtime_error(format("%s: no CPU backend found", __func__));
}
ggml_backend_dev_memory(cpu_dev, &free, &total);
}
ret[i].free = free;
ret[i].total = total;
}
@ -147,9 +159,8 @@ class llama_params_fit_exception : public std::runtime_error {
static void llama_params_fit_impl(
const char * path_model, struct llama_model_params * mparams, struct llama_context_params * cparams,
float * tensor_split, struct llama_model_tensor_buft_override * tensor_buft_overrides,
size_t margin_s, uint32_t n_ctx_min, enum ggml_log_level log_level) {
size_t * margins_s, uint32_t n_ctx_min, enum ggml_log_level log_level) {
constexpr int64_t MiB = 1024*1024;
const int64_t margin = margin_s; // this function uses int64_t rather than size_t for memory sizes to more conveniently handle deficits
typedef std::vector<llama_device_memory_data> dmds_t;
const llama_model_params default_mparams = llama_model_default_params();
@ -168,6 +179,12 @@ static void llama_params_fit_impl(
return;
}
std::vector<int64_t> margins; // this function uses int64_t rather than size_t for memory sizes to more conveniently handle deficits
margins.reserve(nd);
for (size_t id = 0; id < nd; id++) {
margins.push_back(margins_s[id]);
}
std::vector<std::string> dev_names;
{
dev_names.reserve(nd);
@ -187,9 +204,10 @@ static void llama_params_fit_impl(
int64_t sum_free = 0;
int64_t sum_projected_free = 0;
int64_t min_projected_free = INT64_MAX;
int64_t sum_projected_used = 0;
int64_t sum_projected_model = 0;
std::vector<int64_t> projected_free_per_device;
projected_free_per_device.reserve(nd);
if (nd > 1) {
LLAMA_LOG_INFO("%s: projected memory use with initial parameters [MiB]:\n", __func__);
@ -199,45 +217,63 @@ static void llama_params_fit_impl(
const int64_t projected_used = dmd.mb.total();
const int64_t projected_free = dmd.free - projected_used;
projected_free_per_device.push_back(projected_free);
sum_free += dmd.free;
sum_projected_used += projected_used;
sum_projected_free += projected_free;
min_projected_free = std::min(min_projected_free, projected_free);
sum_projected_model += dmd.mb.model;
if (nd > 1) {
LLAMA_LOG_INFO("%s: - %s: %6" PRId64 " total, %6" PRId64 " used, %6" PRId64 " %s\n",
__func__, dev_names[id].c_str(), dmd.total/MiB, projected_used/MiB, std::abs(projected_free)/MiB,
projected_free >= 0 ? "surplus" : "deficit");
LLAMA_LOG_INFO("%s: - %s: %6" PRId64 " total, %6" PRId64 " used, %6" PRId64 " free vs. target of %6" PRId64 "\n",
__func__, dev_names[id].c_str(), dmd.total/MiB, projected_used/MiB, projected_free/MiB, margins[id]/MiB);
}
}
assert(sum_free >= 0 && sum_projected_used >= 0);
LLAMA_LOG_INFO("%s: projected to use %" PRId64 " MiB of device memory vs. %" PRId64 " MiB of free device memory\n",
__func__, sum_projected_used/MiB, sum_free/MiB);
if (min_projected_free >= margin) {
if (nd == 1) {
if (projected_free_per_device[0] >= margins[0]) {
LLAMA_LOG_INFO("%s: will leave %" PRId64 " >= %" PRId64 " MiB of free device memory, no changes needed\n",
__func__, min_projected_free/MiB, margin/MiB);
__func__, projected_free_per_device[0]/MiB, margins[0]/MiB);
return;
}
LLAMA_LOG_INFO("%s: will leave at least %" PRId64 " >= %" PRId64 " MiB of free memory on all devices, no changes needed\n",
__func__, min_projected_free/MiB, margin/MiB);
} else {
bool changes_needed = false;
for (size_t id = 0; id < nd; id++) {
if (projected_free_per_device[id] < margins[id]) {
changes_needed = true;
break;
}
}
if (!changes_needed) {
LLAMA_LOG_INFO("%s: targets for free memory can be met on all devices, no changes needed\n", __func__);
return;
}
}
// step 2: try reducing memory use by reducing the context size
{
int64_t global_surplus = sum_projected_free - int64_t(nd)*margin;
int64_t global_surplus = sum_projected_free;
for (size_t id = 0; id < nd; id++) {
global_surplus -= margins[id];
}
if (global_surplus < 0) {
LLAMA_LOG_INFO(nd == 1 ?
"%s: cannot fulfill margin of %" PRId64 " MiB, need to reduce device memory by %" PRId64 " MiB\n" :
"%s: cannot fulfill margin of %" PRId64 " MiB on all devices, need to use %" PRId64 " MiB less in total\n",
__func__, margin/MiB, -global_surplus/MiB);
if (nd == 1) {
LLAMA_LOG_INFO("%s: cannot meet free memory target of %" PRId64 " MiB, need to reduce device memory by %" PRId64 " MiB\n",
__func__, margins[0]/MiB, -global_surplus/MiB);
} else {
LLAMA_LOG_INFO(
"%s: cannot meet free memory targets on all devices, need to use %" PRId64 " MiB less in total\n",
__func__, -global_surplus/MiB);
}
if (cparams->n_ctx == 0) {
if (hp_nct > n_ctx_min) {
int64_t sum_used_target = sum_free - nd*margin_s;
int64_t sum_used_target = sum_free;
for (size_t id = 0; id < nd; id++) {
sum_used_target -= margins[id];
}
if (nd > 1) {
// for multiple devices we need to be more conservative in terms of how much context we think can fit:
// - for dense models only whole layers can be assigned to devices
@ -448,9 +484,9 @@ static void llama_params_fit_impl(
const dmds_t dmds_cpu_moe = llama_get_device_memory_data(
path_model, mparams, cparams, devs, hp_ngl, hp_nct, hp_nex, log_level);
for (const llama_device_memory_data & dmd : dmds_cpu_moe) {
global_surplus_cpu_moe += dmd.free;
global_surplus_cpu_moe -= int64_t(dmd.mb.total()) + margin;
for (size_t id = 0; id < nd; id++) {
global_surplus_cpu_moe += dmds_cpu_moe[id].free;
global_surplus_cpu_moe -= int64_t(dmds_cpu_moe[id].mb.total()) + margins[id];
}
if (global_surplus_cpu_moe > 0) {
@ -469,7 +505,7 @@ static void llama_params_fit_impl(
std::vector<int64_t> targets; // maximum acceptable memory use per device
targets.reserve(nd);
for (size_t id = 0; id < nd; id++) {
targets.push_back(dmds_full[id].free - margin);
targets.push_back(dmds_full[id].free - margins[id]);
LLAMA_LOG_DEBUG("%s: id=%zu, target=%" PRId64 " MiB\n", __func__, id, targets[id]/MiB);
}
@ -701,11 +737,11 @@ static void llama_params_fit_impl(
enum llama_params_fit_status llama_params_fit(
const char * path_model, struct llama_model_params * mparams, struct llama_context_params * cparams,
float * tensor_split, struct llama_model_tensor_buft_override * tensor_buft_overrides,
size_t margin_s, uint32_t n_ctx_min, enum ggml_log_level log_level) {
size_t * margins, uint32_t n_ctx_min, enum ggml_log_level log_level) {
const int64_t t0_us = llama_time_us();
llama_params_fit_status status = LLAMA_PARAMS_FIT_STATUS_SUCCESS;
try {
llama_params_fit_impl(path_model, mparams, cparams, tensor_split, tensor_buft_overrides, margin_s, n_ctx_min, log_level);
llama_params_fit_impl(path_model, mparams, cparams, tensor_split, tensor_buft_overrides, margins, n_ctx_min, log_level);
LLAMA_LOG_INFO("%s: successfully fit params to free device memory\n", __func__);
} catch (const llama_params_fit_exception & e) {
LLAMA_LOG_WARN("%s: failed to fit params to free device memory: %s\n", __func__, e.what());
@ -794,7 +830,7 @@ static int llama_model_load(const std::string & fname, std::vector<std::string>
model.t_start_us = tm.t_start_us;
try {
llama_model_loader ml(fname, splits, params.use_mmap, params.check_tensors, params.no_alloc, params.kv_overrides, params.tensor_buft_overrides);
llama_model_loader ml(fname, splits, params.use_mmap, params.use_direct_io, params.check_tensors, params.no_alloc, params.kv_overrides, params.tensor_buft_overrides);
ml.print_info();

Some files were not shown because too many files have changed in this diff.