Commit Graph

1230 Commits

Author SHA1 Message Date
nullname 2cd429ca75
feat: perf opt part5 (#52)
* rename

* Refactor vector operations in vec_op_impl and vec_dot_product_impl for improved clarity and performance

* wip

* Enhance vector copy functions for improved performance and clarity in vec_ops.hpp

* wip

* wip

* wip

* Optimize vector dot product implementations for enhanced performance and efficiency

* Enhance flash attention implementation and type traits for improved vector operations and alignment checks

# Conflicts:
#	ggml/src/ggml-qnn/npu/device/type_traits.cpp

* remove align

* wip

* Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs

* Revert "Enhance vector dot product implementation for improved performance by adding parallel processing for multiple vector pairs"

This reverts commit 78cc24ed2285002ca29d6189fa61ba4ce24f8d16.

* Enhance flash attention implementation with type checks for tensor data types and improved constexpr usage

* wip

* opt mask calc

* Revert "opt mask calc"

This reverts commit bb1840876692a11511d5ab7828b8a707402e30b9.

* wip

* opt mul mat caching logic to add dst cache

* Revert "opt mul mat caching logic to add dst cache"

This reverts commit ab442fa9f763b3873c929936e4cb739cb1c83850.

* wip

* Refactor matrix multiplication implementation to include vector conversion and performance tracking

* wip

* wip

* wip

* create vec_ops.inl for more aggressive compiler inline

* wip

* refactor vector dot product implementations for improved readability and performance

* refactor vector conversion functions to use HVX_Vector_Dual for improved clarity and consistency

* wip

* wip

* wip

* implement row size caching logic and enhance type traits for F32 support

* refactor matrix multiplication functions to improve caching logic and simplify tensor alignment handling

* add vector zeroing functions for F32 and F16 types to optimize memory initialization

* Revert "add vector zeroing functions for F32 and F16 types to optimize memory initialization"

This reverts commit e374326dc74d049e6603e393ade418d9ef2b83f3.

* wip

* refactor alignment checks in dot product function to handle null pointers

* wip

* refactor load_block_generic and related functions for improved alignment handling

* wip

* refactor flash attention implementation and introduce type-erased dot function for improved type handling

* refactor dot product implementations for improved loop handling and clarity

* refactor thread_pool constructor to pre-allocate VTCM cache for each thread

* Revert "refactor thread_pool constructor to pre-allocate VTCM cache for each thread"

This reverts commit 00cdd3fa88d909feef44ddaa42095274b7627685.

* wip

* opt interfaces for tensor cleanup

* refactor mul_mat_impl to use aligned size for src0 row calculation

* refactor: update dequantized_row_size logic and add size alignment checks for tensors

* wip

* wip

* refactor: replace raw pointer initialization with invalid handle constants for better clarity

* wip
2025-07-23 00:38:09 +08:00
hongruichen fc45ad51d2 Merge branch 'master' into dev-refactoring 2025-07-21 23:22:46 +08:00
Charles Xu 922042601b
kleidiai: add support for get_rows (#14676)
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
2025-07-21 16:49:52 +03:00
Jeff Bolz c2e058f1b4
vulkan/cuda: Fix im2col when KW!=KH (#14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-21 13:35:40 +02:00
Ervin Áron Tasnádi a979ca22db
ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col),

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* * Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* * Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-19 21:59:08 +02:00
Peter0x44 d4b91ea7b2
vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (#14707) 2025-07-19 17:58:03 +02:00
0cc4m 83f5872404
Vulkan: Fix fprintf format-security warning (#14770) 2025-07-19 17:47:53 +02:00
Georgi Gerganov bf9087f59a
metal : fuse add, mul + add tests (#14596)
ggml-ci
2025-07-18 20:37:26 +03:00
hongruichen c5187054b3 Merge branch 'master' into dev-refactoring
# Conflicts:
#	ggml/CMakeLists.txt
2025-07-18 23:43:20 +08:00
Oliver Simons 021cc28bef
cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (#14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

Gemma3n uses Matrix-Matrix addition as part of their input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when batch-size
of 1 is used.

* Exclude `project_per_layer_input` by matching node names

This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.

* Revert unnecessary formatting changes
2025-07-18 04:35:32 -07:00
Aman Gupta f9a31eea06
CUDA: set_rows + cpy.cu refactor (#14712) 2025-07-18 14:54:18 +08:00
Neo Zhang Jianyu 349ea79fce
use max work group size for device to replace the magic number (#14732) 2025-07-18 10:23:14 +08:00
Reese Levine 21c021745d
ggml: Add initial WebGPU backend (#14521)
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults

* Initialize webgpu device

* Making progress on setting up the backend

* Finish more boilerplate/utility functions

* Organize file and work on alloc buffer

* Add webgpu_context to prepare for actually running some shaders

* Work on memset and add shader loading

* Work on memset polyfill

* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it

* Implement get_tensor and buffer_clear

* Finish rest of setup

* Start work on compute graph

* Basic mat mul working

* Work on emscripten build

* Basic WebGPU backend instructions

* Use EMSCRIPTEN flag

* Work on passing ci, implement 4d tensor multiplication

* Pass thread safety test

* Implement permuting for mul_mat and cpy

* minor cleanups

* Address feedback

* Remove division by type size in cpy op

* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends

* Fix name

* Fix macos dawn prefix path
2025-07-16 18:18:51 +03:00
Georgi Gerganov 225e7a1438
llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
hongruichen 9a43a23e0b fix compiling error at new hexagon sdk 2025-07-16 21:11:11 +08:00
Georgi Gerganov 64978340b0
ggml : add asserts (#14720)
* ggml : add asserts

ggml-ci

* cont : fix constant type

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-16 14:43:32 +03:00
Jeff Bolz ba1ceb3456
vulkan: fix noncontig check for mat_mul_id splitting (#14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
2025-07-15 21:51:09 +02:00
Jeff Bolz 10a0351a97
vulkan: add RTE variants for glu/add/sub/mul/div (#14653) 2025-07-15 21:32:11 +02:00
R0CKSTAR cbc68be51d
cuda: fix build warnings in set-rows.cu (unused variable) (#14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-15 15:28:53 +08:00
Anton Mitkov bdca38376f
sycl: Hotfix for non dnnl codepath (#14677) 2025-07-14 18:12:42 +01:00
shalinib-ibm 55c509daf5
ggml : refactor llamafile_sgemm PPC code (#14673)
Remove un-necessary templates from class definition and packing functions
Reduce deeply nested conditionals, if-else switching in mnapck function
Replace repetitive code with inline functions in Packing functions

2 ~ 7% improvement in Q8 Model
15 ~ 50% improvement in Q4 Model

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-07-14 16:16:42 +03:00
Akarshan Biswas 0f4c6ec0f1
SYCL: use 1D kernel for set_rows (#14618)
* SYCL: Use 1D kernel for set_rows

* Remove dangling comment

* Refactor and use ceil_div
2025-07-14 10:37:55 +01:00
Anton Mitkov 65a3ebb0aa
sycl: Batched mulmat rework for oneDNN dispatch (#14617) 2025-07-14 10:37:35 +01:00
Sigbjørn Skjæret 923e3ea2e3
cuda : add set rows for bf16 (#14664) 2025-07-13 15:01:24 +02:00
Yavor Ivanov e743cddb60
cuda : add ELU support (#14657) 2025-07-13 11:33:16 +02:00
Georgi Gerganov 05fec5bd29
ggml : add build-time message to remind about ggml_set_rows (#14661)
ggml-ci
2025-07-13 10:36:33 +03:00
Yavor Ivanov dcf7f2ea3c
metal : Add missing unary ops Metal support (#14660) 2025-07-13 08:38:13 +03:00
Aman Gupta 7de5c7cab6
CUDA: add set rows for f32 and f16 (#14551)
* CUDA: add set rows for f32 and f16

* Review: change kernel params, use strides from host

* Use 1-d kernel

* Review: use int64_t for blockDim.x, rename nb->s for clarity
2025-07-12 16:31:38 +03:00
Georgi Gerganov 3120413ccd vulkan : remove unused vars (#0)
ggml-ci
2025-07-12 14:25:44 +03:00
Acly 74bb294591 vulkan : implement bilinear interpolation (ggml/1291)
ggml-ci
2025-07-12 14:25:44 +03:00
Acly 3e303b1107 vulkan : implement ggml_roll (ggml/1290)
ggml-ci
2025-07-12 14:25:44 +03:00
Jeff Bolz b3ad3a0191
vulkan: support SET_ROWS (#14587)
* vulkan: support SET_ROWS

Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.

* vulkan: optimize set_rows

Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
2025-07-12 12:12:26 +02:00
Jeff Bolz 98197e5c98
vulkan: optimizations for deepseek prompt processing (#14555)
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

* vulkan: increase coopmat2 mul_mat_id tile size

* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
2025-07-12 11:51:58 +02:00
hongruichen 4a3a87409b disable broadcast on flash_attn_ext 2025-07-12 11:46:02 +08:00
Tarek Dakhran f5e96b368f
model : support LiquidAI LFM2 hybrid family (#14620)
**Important**
LFM2 was [merged ](https://github.com/huggingface/transformers/pull/39340)into transformers, but has not yet been released.
To convert into gguf, install transformers from source
```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```
2025-07-11 20:27:01 +02:00
Slobodan Josic 756aa1020a
HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634) 2025-07-11 18:55:00 +02:00
hongruichen 560729ed6f fix unit test failure 2025-07-12 00:39:14 +08:00
hongruichen b720e47606 fix compiling error 2025-07-11 17:31:21 +08:00
hongruichen ce1167de8c Merge branch 'master' into dev-refactoring
# Conflicts:
#	ggml/src/ggml-backend-reg.cpp
2025-07-11 17:29:26 +08:00
nullname a29243e7a4
feat: perf opt quant (#47)
* feat: add mixed precision dot product implementation and function declaration

* feat: implement mixed precision vector dot product and conversion functions

* fix: update data type handling in matrix multiplication implementation

* fix: adjust row count handling in matrix multiplication implementation for accurate slicing

* fix: optimize matrix multiplication implementation by unroll loop

* update performance tracking for matrix multiplication implementation

* add fetching

* wip

* fix: support F16 * F32 multiplication in is_mul_mat_supported function

* fix: improve src0 fetching logic in vec_dot_product_mixed_impl for better alignment handling

* fix test failure for row width 67

* try fix failed test

* fix: rename aligned_address to align_down for clarity in vector alignment handling

* wip

* qnn fix: update device capabilities for quantized types in qnn-lib to improve compatibility

* fix test failure at width == 193

* fix: replace zero vector initialization with previous vector in mixed dot product implementation

* wip

* fix: improve handling of last vector in mixed dot product implementation

* wip

* wip

* wip

* wip

* Enhance mul_mat_f32 function to support quantized types and improve static assertions

* rename

* Refactor dequantization functions to use npu_device_fp16_t and improve type handling

* Optimize dequantization in dequantize_row_q8_0 by replacing qf32 multiplication with qf16

* Optimize dequantization in dequantize_row_q4_0 by replacing qf32 multiplication with qf16

* Add hvx_vsf_convert_vhf function for improved vector conversion

* add perf logs

* Refactor dequantize_row_q4_0 for alignment

* Update logging in supports_op_impl and supports_op to use ggml_op_desc for better clarity

* Add support for ROPE operation in NPU capabilities and related functions

* Implement ROPE operation in tensor and op_rope, including cache initialization and correction dimension calculations

* enable ROPE by adding operation validation

* add support to freq is null case

* wip

* Refactor rope_f32 to improve indexing by introducing total_planes calculation

* reformat

* Refactor rope_f32 to optimize data access patterns by introducing row and plane pointers

* Add performance tracking to rope_f32 function for enhanced profiling

* Refactor rope_f32 to use a templated implementation

* Refactor rope_impl to replace loop with memcpy for improved performance

* Refactor mul_mat_impl to support quantization as a template parameter

* wip

* wip

* Refactor rope_impl to optimize plane indexing in the processing loop

* Add aligned vector dot product implementation for mixed precision types

* wip

* Enhance matrix multiplication for F32 and F16 types with alignment checks

* Optimize vec_dot_product_mix_aligned_impl for improved performance with additional vector sums

* Add alignment checks for matrix multiplication and vector dot products

* Refactor matrix multiplication to use function pointers for improved readability and maintainability

* Fix alignment check in is_dot_product_aligned to ensure correct vector size handling

* Remove unused f16_to_f32_table parameter from quantization and dequantization functions

* wip

* Add L2 fetch for src1 plane rows in matrix multiplication implementation

* wip

* Refactor hvx_vsf_convert_vhf to accept an additional parameter for flexibility in vector multiplication

* Refactor vec_dot_product_mix_aligned_impl to improve variable naming for clarity

* Refactor load_dual_block_generic and dequantize_row_q4_0 to improve performance

* Refactor vector operation functions to improve clarity and consistency in variable usage

* wip

* wip

* Refactor dequantize_row_q4_0_impl for improved clarity and performance in vector operations

* wip

* Update load_dual_block_generic to use intrinsics

* Refactor load_dual_block_generic and load_qual_block_generic for improved performance and clarity

* wip

* wip

* Optimize dequantize_row_q8_0 for improved performance by unrolling for loop

* wip

* wip

* fix typo
2025-07-11 16:58:45 +08:00
rmatif 6bdda13981
opencl: add tiled mul_mat_f16_f32 (#14535)
* add tiled mul_mat_f16_f32

* fix trailing whitespace

* add insightful comments
2025-07-10 14:58:12 -07:00
lhez 0b8855775c
opencl: add `set_rows` for `f16` and `f32` (#14547)
* opencl: add `set_rows` for `f16` and `f32`

* opencl: better choose workgroup size for `set_rows`
2025-07-10 11:48:52 -07:00
Akarshan Biswas 704bb7a71c
SYCL: Initial set_rows kernel implementation (#14562)
* SYCL: Initial set_rows kernel implementation

* Revert max_threads to 256

* Refactor set_rows and address review comments

* Deduplicate conversion function

* Remove guard before kernel launch and refactor

* Fix and add back SFINAE
2025-07-10 09:29:38 +01:00
compilade a57d1bcb3c
cuda : support Falcon-H1 state size for SSM_SCAN (#14602) 2025-07-09 23:54:38 -04:00
Xuan-Son Nguyen 98bab638fb
ggml : add ggml_scale_bias (#14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code looks more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-09 18:16:12 +02:00
Miaoqian Lin 26a48ad699
ggml : prevent integer overflow in gguf tensor size calculation (#14595) 2025-07-09 14:33:53 +02:00
Jeff Bolz 6efcd65945
vulkan: optimize flash attention split_k_reduce (#14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-08 20:11:42 +02:00
Jeff Bolz b8eeb8741d
vulkan : fix rope with partial rotation and non-cont src (#14582) 2025-07-08 15:21:21 +02:00
Georgi Gerganov 4d0dcd4a06
cuda : fix rope with partial rotation and non-cont src (#14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-08 10:15:21 +03:00
Aman Gupta 75c91de6e9
CUDA: add bilinear interpolation for upscale (#14563) 2025-07-08 10:11:18 +08:00