Commit Graph

289 Commits

Author SHA1 Message Date
Oliver Simons e652566139 Readd `cub::DeviceScan::InclusiveSum`-based CumSum
For single rows and large columns doing a for-loop over the function
`cub::DeviceScan::InclusiveSum` offered by CUB outperforms the
`cumsum_cub_kernel` where `cub::BlockScan` is used.

Numbers before this change

  Backend 1/3: CUDA0
  Device description: NVIDIA RTX 6000 Ada Generation
  Device memory: 48510 MB (48039 MB free)

  CUMSUM(type=f32,ne=[128,128,4,4]):                  311258 runs -     3.26 us/run -     2048 kB/run -  599.76 GB/s
  CUMSUM(type=f32,ne=[2048,16,5,4]):                  229390 runs -     4.40 us/run -     5120 kB/run - 1110.23 GB/s
  CUMSUM(type=f32,ne=[20000,10,4,1]):                  37583 runs -    29.63 us/run -     6250 kB/run -  201.18 GB/s
  CUMSUM(type=f32,ne=[128,1,1,1]):                    892819 runs -     1.12 us/run -        1 kB/run -    0.85 GB/s
  CUMSUM(type=f32,ne=[1024,1,1,1]):                   450505 runs -     2.25 us/run -        8 kB/run -    3.39 GB/s
  CUMSUM(type=f32,ne=[4096,1,1,1]):                   155629 runs -     6.61 us/run -       32 kB/run -    4.62 GB/s
  CUMSUM(type=f32,ne=[8192,1,1,1]):                    81910 runs -    12.60 us/run -       64 kB/run -    4.85 GB/s
  CUMSUM(type=f32,ne=[16384,1,1,1]):                   49146 runs -    23.99 us/run -      128 kB/run -    5.09 GB/s
  CUMSUM(type=f32,ne=[32768,1,1,1]):                   24573 runs -    47.10 us/run -      256 kB/run -    5.18 GB/s
  CUMSUM(type=f32,ne=[65536,1,1,1]):                   16382 runs -    93.57 us/run -      512 kB/run -    5.22 GB/s
  CUMSUM(type=f32,ne=[131072,1,1,1]):                   8191 runs -   184.79 us/run -     1024 kB/run -    5.29 GB/s
  CUMSUM(type=f32,ne=[200000,1,1,1]):                   8191 runs -   280.43 us/run -     1562 kB/run -    5.31 GB/s
  CUMSUM(type=f32,ne=[2000000,1,1,1]):                  2148 runs -  2771.23 us/run -    15625 kB/run -    5.38 GB/s
  CUMSUM(type=f32,ne=[128,4,1,1]):                    458696 runs -     2.21 us/run -        4 kB/run -    1.73 GB/s
  CUMSUM(type=f32,ne=[1024,4,1,1]):                   360404 runs -     2.82 us/run -       32 kB/run -   10.83 GB/s
  CUMSUM(type=f32,ne=[4096,4,1,1]):                   147438 runs -     7.12 us/run -      128 kB/run -   17.15 GB/s
  CUMSUM(type=f32,ne=[8192,4,1,1]):                    81910 runs -    12.90 us/run -      256 kB/run -   18.92 GB/s
  CUMSUM(type=f32,ne=[16384,4,1,1]):                   49146 runs -    24.32 us/run -      512 kB/run -   20.08 GB/s
  CUMSUM(type=f32,ne=[32768,4,1,1]):                   24573 runs -    47.28 us/run -     1024 kB/run -   20.66 GB/s
  CUMSUM(type=f32,ne=[65536,4,1,1]):                   16382 runs -    93.21 us/run -     2048 kB/run -   20.96 GB/s
  CUMSUM(type=f32,ne=[131072,4,1,1]):                   8191 runs -   185.04 us/run -     4096 kB/run -   21.11 GB/s
  CUMSUM(type=f32,ne=[200000,4,1,1]):                   5369 runs -   282.08 us/run -     6250 kB/run -   21.13 GB/s
  CUMSUM(type=f32,ne=[2000000,4,1,1]):                   537 runs -  2806.46 us/run -    62500 kB/run -   21.26 GB/s
  CUMSUM(type=f32,ne=[128,8,1,1]):                    458696 runs -     2.20 us/run -        8 kB/run -    3.47 GB/s
  CUMSUM(type=f32,ne=[1024,8,1,1]):                   360404 runs -     2.82 us/run -       64 kB/run -   21.66 GB/s
  CUMSUM(type=f32,ne=[4096,8,1,1]):                   147438 runs -     7.12 us/run -      256 kB/run -   34.28 GB/s
  CUMSUM(type=f32,ne=[8192,8,1,1]):                    81910 runs -    12.90 us/run -      512 kB/run -   37.84 GB/s
  CUMSUM(type=f32,ne=[16384,8,1,1]):                   49146 runs -    24.32 us/run -     1024 kB/run -   40.15 GB/s
  CUMSUM(type=f32,ne=[32768,8,1,1]):                   24573 runs -    47.28 us/run -     2048 kB/run -   41.31 GB/s
  CUMSUM(type=f32,ne=[65536,8,1,1]):                   16382 runs -    93.20 us/run -     4096 kB/run -   41.92 GB/s
  CUMSUM(type=f32,ne=[131072,8,1,1]):                   8194 runs -   185.05 us/run -     8192 kB/run -   42.22 GB/s
  CUMSUM(type=f32,ne=[200000,8,1,1]):                   5370 runs -   282.15 us/run -    12500 kB/run -   42.26 GB/s
  CUMSUM(type=f32,ne=[2000000,8,1,1]):                   269 runs -  4067.61 us/run -   125000 kB/run -   29.36 GB/s
  CUMSUM(type=f32,ne=[128,16,1,1]):                   303067 runs -     3.32 us/run -       16 kB/run -    4.60 GB/s
  CUMSUM(type=f32,ne=[1024,16,1,1]):                  303067 runs -     3.32 us/run -      128 kB/run -   36.76 GB/s
  CUMSUM(type=f32,ne=[4096,16,1,1]):                  147438 runs -     7.17 us/run -      512 kB/run -   68.13 GB/s
  CUMSUM(type=f32,ne=[8192,16,1,1]):                   81910 runs -    12.90 us/run -     1024 kB/run -   75.68 GB/s
  CUMSUM(type=f32,ne=[16384,16,1,1]):                  49146 runs -    24.33 us/run -     2048 kB/run -   80.28 GB/s
  CUMSUM(type=f32,ne=[32768,16,1,1]):                  24573 runs -    47.30 us/run -     4096 kB/run -   82.59 GB/s
  CUMSUM(type=f32,ne=[65536,16,1,1]):                  12291 runs -    93.24 us/run -     8192 kB/run -   83.80 GB/s
  CUMSUM(type=f32,ne=[131072,16,1,1]):                  6147 runs -   185.07 us/run -    16384 kB/run -   84.45 GB/s
  CUMSUM(type=f32,ne=[200000,16,1,1]):                  4029 runs -   282.40 us/run -    25000 kB/run -   84.46 GB/s
  CUMSUM(type=f32,ne=[2000000,16,1,1]):                  270 runs -  4118.40 us/run -   250000 kB/run -   58.11 GB/s
  Backend CUDA0: OK
Backend 2/3: CUDA1
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97250 MB (96677 MB free)

  CUMSUM(type=f32,ne=[128,128,4,4]):                  368595 runs -     2.73 us/run -     2048 kB/run -  715.83 GB/s
  CUMSUM(type=f32,ne=[2048,16,5,4]):                  216282 runs -     4.72 us/run -     5120 kB/run - 1035.32 GB/s
  CUMSUM(type=f32,ne=[20000,10,4,1]):                  32214 runs -    34.33 us/run -     6250 kB/run -  173.64 GB/s
  CUMSUM(type=f32,ne=[128,1,1,1]):                    810909 runs -     1.24 us/run -        1 kB/run -    0.77 GB/s
  CUMSUM(type=f32,ne=[1024,1,1,1]):                   401359 runs -     2.52 us/run -        8 kB/run -    3.03 GB/s
  CUMSUM(type=f32,ne=[4096,1,1,1]):                   139247 runs -     7.44 us/run -       32 kB/run -    4.10 GB/s
  CUMSUM(type=f32,ne=[8192,1,1,1]):                    73719 runs -    14.27 us/run -       64 kB/run -    4.28 GB/s
  CUMSUM(type=f32,ne=[16384,1,1,1]):                   40955 runs -    27.24 us/run -      128 kB/run -    4.48 GB/s
  CUMSUM(type=f32,ne=[32768,1,1,1]):                   24573 runs -    53.46 us/run -      256 kB/run -    4.57 GB/s
  CUMSUM(type=f32,ne=[65536,1,1,1]):                   16382 runs -   105.29 us/run -      512 kB/run -    4.64 GB/s
  CUMSUM(type=f32,ne=[131072,1,1,1]):                   8191 runs -   210.15 us/run -     1024 kB/run -    4.65 GB/s
  CUMSUM(type=f32,ne=[200000,1,1,1]):                   8191 runs -   318.22 us/run -     1562 kB/run -    4.68 GB/s
  CUMSUM(type=f32,ne=[2000000,1,1,1]):                  2148 runs -  3142.23 us/run -    15625 kB/run -    4.74 GB/s
  CUMSUM(type=f32,ne=[128,4,1,1]):                    303067 runs -     3.34 us/run -        4 kB/run -    1.14 GB/s
  CUMSUM(type=f32,ne=[1024,4,1,1]):                   253921 runs -     4.03 us/run -       32 kB/run -    7.58 GB/s
  CUMSUM(type=f32,ne=[4096,4,1,1]):                   122865 runs -     8.20 us/run -      128 kB/run -   14.89 GB/s
  CUMSUM(type=f32,ne=[8192,4,1,1]):                    73719 runs -    14.96 us/run -      256 kB/run -   16.32 GB/s
  CUMSUM(type=f32,ne=[16384,4,1,1]):                   40955 runs -    28.66 us/run -      512 kB/run -   17.04 GB/s
  CUMSUM(type=f32,ne=[32768,4,1,1]):                   24573 runs -    54.21 us/run -     1024 kB/run -   18.01 GB/s
  CUMSUM(type=f32,ne=[65536,4,1,1]):                   16382 runs -   106.49 us/run -     2048 kB/run -   18.34 GB/s
  CUMSUM(type=f32,ne=[131072,4,1,1]):                   8191 runs -   210.88 us/run -     4096 kB/run -   18.52 GB/s
  CUMSUM(type=f32,ne=[200000,4,1,1]):                   5369 runs -   321.77 us/run -     6250 kB/run -   18.53 GB/s
  CUMSUM(type=f32,ne=[2000000,4,1,1]):                   537 runs -  3191.79 us/run -    62500 kB/run -   18.69 GB/s
  CUMSUM(type=f32,ne=[128,8,1,1]):                    376786 runs -     2.67 us/run -        8 kB/run -    2.86 GB/s
  CUMSUM(type=f32,ne=[1024,8,1,1]):                   245730 runs -     4.10 us/run -       64 kB/run -   14.90 GB/s
  CUMSUM(type=f32,ne=[4096,8,1,1]):                   122865 runs -     8.20 us/run -      256 kB/run -   29.79 GB/s
  CUMSUM(type=f32,ne=[8192,8,1,1]):                    65528 runs -    16.38 us/run -      512 kB/run -   29.82 GB/s
  CUMSUM(type=f32,ne=[16384,8,1,1]):                   40955 runs -    28.69 us/run -     1024 kB/run -   34.04 GB/s
  CUMSUM(type=f32,ne=[32768,8,1,1]):                   24573 runs -    55.28 us/run -     2048 kB/run -   35.33 GB/s
  CUMSUM(type=f32,ne=[65536,8,1,1]):                   16382 runs -   108.50 us/run -     4096 kB/run -   36.00 GB/s
  CUMSUM(type=f32,ne=[131072,8,1,1]):                   8194 runs -   213.75 us/run -     8192 kB/run -   36.55 GB/s
  CUMSUM(type=f32,ne=[200000,8,1,1]):                   5370 runs -   326.31 us/run -    12500 kB/run -   36.54 GB/s
  CUMSUM(type=f32,ne=[2000000,8,1,1]):                   538 runs -  3252.68 us/run -   125000 kB/run -   36.72 GB/s
  CUMSUM(type=f32,ne=[128,16,1,1]):                   303067 runs -     3.32 us/run -       16 kB/run -    4.60 GB/s
  CUMSUM(type=f32,ne=[1024,16,1,1]):                  253921 runs -     4.06 us/run -      128 kB/run -   30.09 GB/s
  CUMSUM(type=f32,ne=[4096,16,1,1]):                  122865 runs -     8.20 us/run -      512 kB/run -   59.57 GB/s
  CUMSUM(type=f32,ne=[8192,16,1,1]):                   65528 runs -    16.38 us/run -     1024 kB/run -   59.63 GB/s
  CUMSUM(type=f32,ne=[16384,16,1,1]):                  40955 runs -    28.69 us/run -     2048 kB/run -   68.09 GB/s
  CUMSUM(type=f32,ne=[32768,16,1,1]):                  24573 runs -    55.28 us/run -     4096 kB/run -   70.67 GB/s
  CUMSUM(type=f32,ne=[65536,16,1,1]):                  12291 runs -   108.50 us/run -     8192 kB/run -   72.02 GB/s
  CUMSUM(type=f32,ne=[131072,16,1,1]):                  6147 runs -   213.60 us/run -    16384 kB/run -   73.17 GB/s
  CUMSUM(type=f32,ne=[200000,16,1,1]):                  4029 runs -   326.04 us/run -    25000 kB/run -   73.15 GB/s
  CUMSUM(type=f32,ne=[2000000,16,1,1]):                  270 runs -  5458.69 us/run -   250000 kB/run -   43.84 GB/s

----
Numbers after:

Backend 1/3: CUDA0
  Device description: NVIDIA RTX 6000 Ada Generation
  Device memory: 48510 MB (48039 MB free)

  CUMSUM(type=f32,ne=[128,128,4,4]):                  311258 runs -     3.25 us/run -     2048 kB/run -  601.62 GB/s
  CUMSUM(type=f32,ne=[2048,16,5,4]):                  229390 runs -     4.40 us/run -     5120 kB/run - 1110.14 GB/s
  CUMSUM(type=f32,ne=[20000,10,4,1]):                  37583 runs -    29.67 us/run -     6250 kB/run -  200.89 GB/s
  CUMSUM(type=f32,ne=[128,1,1,1]):                    892819 runs -     1.12 us/run -        1 kB/run -    0.85 GB/s
  CUMSUM(type=f32,ne=[1024,1,1,1]):                   458696 runs -     2.21 us/run -        8 kB/run -    3.45 GB/s
  CUMSUM(type=f32,ne=[4096,1,1,1]):                   376786 runs -     2.66 us/run -       32 kB/run -   11.46 GB/s
  CUMSUM(type=f32,ne=[8192,1,1,1]):                   393168 runs -     2.59 us/run -       64 kB/run -   23.57 GB/s
  CUMSUM(type=f32,ne=[16384,1,1,1]):                  393168 runs -     2.59 us/run -      128 kB/run -   47.15 GB/s
  CUMSUM(type=f32,ne=[32768,1,1,1]):                  376786 runs -     2.69 us/run -      256 kB/run -   90.69 GB/s
  CUMSUM(type=f32,ne=[65536,1,1,1]):                  327640 runs -     3.06 us/run -      512 kB/run -  159.65 GB/s
  CUMSUM(type=f32,ne=[131072,1,1,1]):                 311258 runs -     3.28 us/run -     1024 kB/run -  297.77 GB/s
  CUMSUM(type=f32,ne=[200000,1,1,1]):                 270303 runs -     3.74 us/run -     1562 kB/run -  398.14 GB/s
  CUMSUM(type=f32,ne=[2000000,1,1,1]):                137472 runs -     7.35 us/run -    15625 kB/run - 2026.94 GB/s
  CUMSUM(type=f32,ne=[128,4,1,1]):                    876437 runs -     1.14 us/run -        4 kB/run -    3.33 GB/s
  CUMSUM(type=f32,ne=[1024,4,1,1]):                   442314 runs -     2.28 us/run -       32 kB/run -   13.39 GB/s
  CUMSUM(type=f32,ne=[4096,4,1,1]):                   155629 runs -     6.69 us/run -      128 kB/run -   18.24 GB/s
  CUMSUM(type=f32,ne=[8192,4,1,1]):                    81910 runs -    12.53 us/run -      256 kB/run -   19.49 GB/s
  CUMSUM(type=f32,ne=[16384,4,1,1]):                   49146 runs -    24.18 us/run -      512 kB/run -   20.20 GB/s
  CUMSUM(type=f32,ne=[32768,4,1,1]):                   65528 runs -    15.34 us/run -     1024 kB/run -   63.66 GB/s
  CUMSUM(type=f32,ne=[65536,4,1,1]):                   73719 runs -    14.76 us/run -     2048 kB/run -  132.35 GB/s
  CUMSUM(type=f32,ne=[131072,4,1,1]):                  65528 runs -    16.01 us/run -     4096 kB/run -  244.07 GB/s
  CUMSUM(type=f32,ne=[200000,4,1,1]):                  64428 runs -    16.51 us/run -     6250 kB/run -  360.97 GB/s
  CUMSUM(type=f32,ne=[2000000,4,1,1]):                 33831 runs -    29.59 us/run -    62500 kB/run - 2016.08 GB/s
  CUMSUM(type=f32,ne=[128,8,1,1]):                    868246 runs -     1.16 us/run -        8 kB/run -    6.59 GB/s
  CUMSUM(type=f32,ne=[1024,8,1,1]):                   442314 runs -     2.28 us/run -       64 kB/run -   26.76 GB/s
  CUMSUM(type=f32,ne=[4096,8,1,1]):                   155629 runs -     6.69 us/run -      256 kB/run -   36.48 GB/s
  CUMSUM(type=f32,ne=[8192,8,1,1]):                    81910 runs -    12.53 us/run -      512 kB/run -   38.97 GB/s
  CUMSUM(type=f32,ne=[16384,8,1,1]):                   49146 runs -    24.17 us/run -     1024 kB/run -   40.41 GB/s
  CUMSUM(type=f32,ne=[32768,8,1,1]):                   24573 runs -    47.53 us/run -     2048 kB/run -   41.10 GB/s
  CUMSUM(type=f32,ne=[65536,8,1,1]):                   16382 runs -    61.25 us/run -     4096 kB/run -   63.77 GB/s
  CUMSUM(type=f32,ne=[131072,8,1,1]):                  32776 runs -    31.79 us/run -     8192 kB/run -  245.82 GB/s
  CUMSUM(type=f32,ne=[200000,8,1,1]):                  32220 runs -    32.90 us/run -    12500 kB/run -  362.35 GB/s
  CUMSUM(type=f32,ne=[2000000,8,1,1]):                  6725 runs -   151.99 us/run -   125000 kB/run -  785.77 GB/s
  CUMSUM(type=f32,ne=[128,16,1,1]):                   851864 runs -     1.18 us/run -       16 kB/run -   12.97 GB/s
  CUMSUM(type=f32,ne=[1024,16,1,1]):                  442314 runs -     2.30 us/run -      128 kB/run -   53.13 GB/s
  CUMSUM(type=f32,ne=[4096,16,1,1]):                  155629 runs -     6.68 us/run -      512 kB/run -   73.13 GB/s
  CUMSUM(type=f32,ne=[8192,16,1,1]):                   81910 runs -    12.68 us/run -     1024 kB/run -   77.00 GB/s
  CUMSUM(type=f32,ne=[16384,16,1,1]):                  40955 runs -    24.56 us/run -     2048 kB/run -   79.53 GB/s
  CUMSUM(type=f32,ne=[32768,16,1,1]):                  24573 runs -    47.52 us/run -     4096 kB/run -   82.21 GB/s
  CUMSUM(type=f32,ne=[65536,16,1,1]):                  12291 runs -    93.44 us/run -     8192 kB/run -   83.62 GB/s
  CUMSUM(type=f32,ne=[131072,16,1,1]):                 16392 runs -    63.36 us/run -    16384 kB/run -  246.68 GB/s
  CUMSUM(type=f32,ne=[200000,16,1,1]):                 16116 runs -    65.25 us/run -    25000 kB/run -  365.53 GB/s
  CUMSUM(type=f32,ne=[2000000,16,1,1]):                 3375 runs -   304.46 us/run -   250000 kB/run -  785.98 GB/s
  Backend CUDA0: OK
Backend 2/3: CUDA1
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97250 MB (96677 MB free)

  CUMSUM(type=f32,ne=[128,128,4,4]):                  376786 runs -     2.69 us/run -     2048 kB/run -  727.04 GB/s
  CUMSUM(type=f32,ne=[2048,16,5,4]):                  216282 runs -     4.64 us/run -     5120 kB/run - 1053.30 GB/s
  CUMSUM(type=f32,ne=[20000,10,4,1]):                  32214 runs -    34.21 us/run -     6250 kB/run -  174.27 GB/s
  CUMSUM(type=f32,ne=[128,1,1,1]):                    819100 runs -     1.22 us/run -        1 kB/run -    0.78 GB/s
  CUMSUM(type=f32,ne=[1024,1,1,1]):                   409550 runs -     2.47 us/run -        8 kB/run -    3.09 GB/s
  CUMSUM(type=f32,ne=[4096,1,1,1]):                   303067 runs -     3.31 us/run -       32 kB/run -    9.21 GB/s
  CUMSUM(type=f32,ne=[8192,1,1,1]):                   237539 runs -     4.33 us/run -       64 kB/run -   14.08 GB/s
  CUMSUM(type=f32,ne=[16384,1,1,1]):                  237539 runs -     4.33 us/run -      128 kB/run -   28.17 GB/s
  CUMSUM(type=f32,ne=[32768,1,1,1]):                  188393 runs -     5.37 us/run -      256 kB/run -   45.47 GB/s
  CUMSUM(type=f32,ne=[65536,1,1,1]):                  188393 runs -     5.41 us/run -      512 kB/run -   90.20 GB/s
  CUMSUM(type=f32,ne=[131072,1,1,1]):                 188393 runs -     5.41 us/run -     1024 kB/run -  180.41 GB/s
  CUMSUM(type=f32,ne=[200000,1,1,1]):                 188393 runs -     5.41 us/run -     1562 kB/run -  275.27 GB/s
  CUMSUM(type=f32,ne=[2000000,1,1,1]):                128880 runs -     7.76 us/run -    15625 kB/run - 1920.33 GB/s
  CUMSUM(type=f32,ne=[128,4,1,1]):                    802718 runs -     1.26 us/run -        4 kB/run -    3.03 GB/s
  CUMSUM(type=f32,ne=[1024,4,1,1]):                   401359 runs -     2.51 us/run -       32 kB/run -   12.18 GB/s
  CUMSUM(type=f32,ne=[4096,4,1,1]):                   139247 runs -     7.51 us/run -      128 kB/run -   16.26 GB/s
  CUMSUM(type=f32,ne=[8192,4,1,1]):                    73719 runs -    14.17 us/run -      256 kB/run -   17.23 GB/s
  CUMSUM(type=f32,ne=[16384,4,1,1]):                   40955 runs -    27.37 us/run -      512 kB/run -   17.84 GB/s
  CUMSUM(type=f32,ne=[32768,4,1,1]):                   40955 runs -    26.33 us/run -     1024 kB/run -   37.10 GB/s
  CUMSUM(type=f32,ne=[65536,4,1,1]):                   40955 runs -    26.19 us/run -     2048 kB/run -   74.59 GB/s
  CUMSUM(type=f32,ne=[131072,4,1,1]):                  40955 runs -    26.35 us/run -     4096 kB/run -  148.26 GB/s
  CUMSUM(type=f32,ne=[200000,4,1,1]):                  42952 runs -    24.18 us/run -     6250 kB/run -  246.51 GB/s
  CUMSUM(type=f32,ne=[2000000,4,1,1]):                 32757 runs -    31.01 us/run -    62500 kB/run - 1923.68 GB/s
  CUMSUM(type=f32,ne=[128,8,1,1]):                    786336 runs -     1.28 us/run -        8 kB/run -    5.95 GB/s
  CUMSUM(type=f32,ne=[1024,8,1,1]):                   393168 runs -     2.57 us/run -       64 kB/run -   23.73 GB/s
  CUMSUM(type=f32,ne=[4096,8,1,1]):                   131056 runs -     7.67 us/run -      256 kB/run -   31.82 GB/s
  CUMSUM(type=f32,ne=[8192,8,1,1]):                    73719 runs -    14.43 us/run -      512 kB/run -   33.84 GB/s
  CUMSUM(type=f32,ne=[16384,8,1,1]):                   40955 runs -    27.90 us/run -     1024 kB/run -   35.01 GB/s
  CUMSUM(type=f32,ne=[32768,8,1,1]):                   24573 runs -    54.63 us/run -     2048 kB/run -   35.75 GB/s
  CUMSUM(type=f32,ne=[65536,8,1,1]):                   16382 runs -    72.24 us/run -     4096 kB/run -   54.08 GB/s
  CUMSUM(type=f32,ne=[131072,8,1,1]):                  20485 runs -    52.66 us/run -     8192 kB/run -  148.37 GB/s
  CUMSUM(type=f32,ne=[200000,8,1,1]):                  21480 runs -    48.00 us/run -    12500 kB/run -  248.42 GB/s
  CUMSUM(type=f32,ne=[2000000,8,1,1]):                 16140 runs -    61.99 us/run -   125000 kB/run - 1926.51 GB/s
  CUMSUM(type=f32,ne=[128,16,1,1]):                   786336 runs -     1.28 us/run -       16 kB/run -   11.90 GB/s
  CUMSUM(type=f32,ne=[1024,16,1,1]):                  393168 runs -     2.57 us/run -      128 kB/run -   47.57 GB/s
  CUMSUM(type=f32,ne=[4096,16,1,1]):                  131056 runs -     7.65 us/run -      512 kB/run -   63.83 GB/s
  CUMSUM(type=f32,ne=[8192,16,1,1]):                   73719 runs -    14.42 us/run -     1024 kB/run -   67.74 GB/s
  CUMSUM(type=f32,ne=[16384,16,1,1]):                  40955 runs -    27.87 us/run -     2048 kB/run -   70.09 GB/s
  CUMSUM(type=f32,ne=[32768,16,1,1]):                  24573 runs -    54.54 us/run -     4096 kB/run -   71.63 GB/s
  CUMSUM(type=f32,ne=[65536,16,1,1]):                  12291 runs -   107.53 us/run -     8192 kB/run -   72.66 GB/s
  CUMSUM(type=f32,ne=[131072,16,1,1]):                 10245 runs -   105.10 us/run -    16384 kB/run -  148.70 GB/s
  CUMSUM(type=f32,ne=[200000,16,1,1]):                 10744 runs -    95.36 us/run -    25000 kB/run -  250.11 GB/s
  CUMSUM(type=f32,ne=[2000000,16,1,1]):                 5400 runs -   186.97 us/run -   250000 kB/run - 1279.90 GB/s
2025-12-05 16:26:18 +01:00
Oliver Simons 7668999518 Merge branch 'master' into gpu-sampling
Let's keep `master's` cumsum implementation for it's likely better AMD
perf and add back pure-CUB-implementation in follow-up commit
2025-12-05 14:41:08 +01:00
Oliver Simons dd11f6eb7b Add perf-tests for CUMSUM 2025-12-05 14:34:06 +01:00
Piotr Wilkin (ilintar) 96fe9badfc
Add support for CUMSUM and TRI for CUDA. (#17584)
* Add support for CUMSUM and TRI for CUDA.

* Minor optimizations.

* Correct warp_prefix_inclusive_sum in float2 variant to return float2

* Optimize TRI

* Whitespace

* Fix strides.

* Implement double loop

* Whitespace

* Fix HIP compilation bugs

* Optimizations + big case performance tests

* Implement using CUB with fallback to custom kernel

* Remove error message.

* Fixes from code review

* Comment out CPU-unsupported F16/BF16 cases to fix CI

* Fine, you win :P

* Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS

* Vary warp-size based on physical warp size

* Add GGML_UNUSED_VARS in tri as well

* Use constexpr and call prefix_inclusive with warp_size template param

* Update ggml/src/ggml-cuda/cumsum.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Change to tid % warp_size

* Fix strides; hardcode mask; add ggml_lane_mask_t

* Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()

* Too hasty...

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-04 22:19:51 +01:00
Daniel Bevenius c0b182f4d6
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-04 08:17:50 +01:00
Reese Levine 7ca5991d2b
ggml webgpu: add support for emscripten builds (#17184)
* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (#9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (#4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Fix .gitignore

* Add memory64 option and remove unneeded macros for setting threads to 1

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-12-03 10:25:34 +01:00
Georgi Gerganov 16451d6bc3
Merge branch 'master' into HEAD 2025-12-01 14:47:50 +02:00
Tarek Dakhran 2ba719519d
model: LFM2-VL fixes (#17577)
* Adjust to pytorch

* Add antialiasing upscale

* Increase number of patches to 1024

* Handle default marker insertion for LFM2

* Switch to flag

* Reformat

* Cuda implementation of antialias kernel

* Change placement in ops.cpp

* consistent float literals

* Pad only for LFM2

* Address PR feedback

* Rollback default marker placement changes

* Fallback to CPU implementation for antialias implementation of upscale
2025-11-30 21:57:31 +01:00
Georgi Gerganov d8d98bb4bb
Merge branch 'master' into HEAD 2025-11-29 22:38:44 +02:00
Jeff Bolz 59d8d4e963
vulkan: improve topk perf for large k, fix overflow in unit tests (#17582) 2025-11-29 08:39:57 +01:00
Piotr Wilkin (ilintar) cd0e3a7a3b
SOLVE_TRI CUDA kernel for small matrices (#17457) 2025-11-28 12:15:32 +08:00
Daniel Bevenius 7c2bfb352e
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-26 17:52:29 +01:00
Jeff Bolz 879d673759
vulkan: Implement top-k (#17418)
* vulkan: Implement top-k

Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.

* fix pipeline selection

* vulkan: Add N-ary search algorithm for topk

* microoptimizations
2025-11-26 16:45:43 +01:00
Oliver Simons f23b306cc5 CUDA: Add top-k implementation 2025-11-25 15:25:25 +01:00
Daniel Bevenius ec047e12ee
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 15:16:44 +01:00
Georgi Gerganov 583cb83416
ggml : add ggml_top_k (#17365)
* ggml : add ggml_top_k

* cont : add ggml_argsort_top_k

* metal : add top_k support

* ggml : cleanup

* tests : add virtual err() function for test_case

* ggml : add comments
2025-11-25 15:31:43 +02:00
Daniel Bevenius 53dca56d9b Merge remote-tracking branch 'upstream/master' into gpu-sampling 2025-11-25 08:20:50 +01:00
Jeff Bolz d414db02d3
vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (#17455) 2025-11-25 07:11:27 +01:00
Daniel Bevenius 7816f0bb56
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-24 07:44:06 +01:00
Sigbjørn Skjæret 96ac5a2329
cuda : support non-contiguous i32 to i32 copy (#17326)
* support non-contiguous i32 to i32 copy

* add tests

* rename cpy_flt to cpy_scalar and reindent params
2025-11-23 11:13:34 +01:00
Masato Nakasaka 3f3a4fb9c3
Revive MUL_MAT_ID to perf testing (#17397) 2025-11-22 10:55:43 +01:00
Daniel Bevenius 0c660e7390
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-20 06:57:24 +01:00
Giuseppe Scrivano 7d77f07325
vulkan: implement ADD1, ARANGE, FILL, SOFTPLUS, STEP, ROUND, CEIL, FLOOR, TRUNC (#17319)
* vulkan: initialize array

* vulkan: implement ADD1

* vulkan: implement ARANGE

* vulkan: implement FILL

* vulkan: implement SOFTPLUS

* vulkan: implement STEP

* vulkan: implement ROUND

* vulkan: implement CEIL

* vulkan: implement FLOOR

* vulkan: implement TRUNC

* docs: update Vulkan ops

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-19 17:29:45 +01:00
Jeff Bolz 1fa4551af0
vulkan: support larger argsort (#17313)
* vulkan: support larger argsort

This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory and when more than 1024 threads are needed
it runs multiple workgroups and synchronizes through a pipelinebarrier.

To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied this same change to the original shared memory
version of the shader, which is still used when ncols <= 1024.

* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost

* reduce loop overhead

* run multiple cols per invocation, to reduce barrier overhead
2025-11-19 17:25:50 +01:00
Piotr Wilkin (ilintar) 6fd4f95367
Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition (#17332)
* Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition

* Argh.

* Making CISC happy ;)

* Integrate CONT tests

* Use loopy loop

* Skip new tests for (B)F16 for now.
2025-11-19 10:36:33 +01:00
Oliver Simons 26be108be8 CUDA: Optimize argsort for gpu-based token sampling
Argsort is used for top-k currently. WE optimize argsort by 2 things:

1. Use `DeviceRadixSort` for single-row/sequence to parallelize it
   across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the
   correct entrypoint (the function chooses different execution paths,
   it contains `DeviceSegmentedRadixSort` as one of the paths and will
   choose the best one according to heuristics.
   https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview

Some perf numbers for a RTX PRO 6000:

On the kernel level, tested with
`GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   359.24 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                  8192 runs -   861.34 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -  1020.01 us/run
```

After:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   312.41 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                 16384 runs -    63.48 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -   874.36 us/run
```

---
On the model level, tested with
`llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is
the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling`

Before:
```
llama_perf_sampler_print:    sampling time =       0.25 ms /   207 runs   (    0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print:        load time =   18215.58 ms
llama_perf_context_print: prompt eval time =      28.20 ms /     7 tokens (    4.03 ms per token,   248.19 tokens per second)
llama_perf_context_print:        eval time =     714.79 ms /   199 runs   (    3.59 ms per token,   278.40 tokens per second)
llama_perf_context_print:       total time =     857.62 ms /   206 tokens
```

After
```
llama_perf_sampler_print:    sampling time =       0.25 ms /   207 runs   (    0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print:        load time =   18366.92 ms
llama_perf_context_print: prompt eval time =      35.92 ms /     7 tokens (    5.13 ms per token,   194.87 tokens per second)
llama_perf_context_print:        eval time =     532.79 ms /   199 runs   (    2.68 ms per token,   373.50 tokens per second)
llama_perf_context_print:       total time =     683.65 ms /   206 tokens
```
2025-11-18 18:17:44 +01:00
Georgi Gerganov 1a139644a8
metal : add cumsum (#17305) 2025-11-17 11:51:13 +02:00
Jeff Bolz 24dc769f1b
vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (#17287)
These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.
2025-11-15 19:54:23 +01:00
Georgi Gerganov 45c6ef7307
metal : support argsort for ne00 > 1024 (#17247)
* metal : refactor argsort

* cont : sort chunks

* cont : merge sorted buckets

* cont : cleanup
2025-11-14 09:36:06 +02:00
Piotr Wilkin (ilintar) 389ac78b26
ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (#17063)
* Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Code review

* Whitespace

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* This is actually sigmoid, duh.

* Add CONST, remove TRI_KEEP, other changes from review

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Remove extra script

* Update ggml/src/ggml.c

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* moving changes from laptop [no ci]

* pre-rebase

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Refactor tests

* ggml : cleanup

* cont : fix ggml_fill srcs

* tests : add note

* ggml : add ggml_fill_inplace

* ggml : add asserts

* ggml : fix ggml_fill constant cast

* cont : ggml_tri minor

* Use TENSOR_LOCALS

* Fix regression from #14596, regenerate

* Don't make commits at night...

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-13 20:54:47 +02:00
Diego Devesa 879dec341a
ggml-cpu : use template for argsort (#17222) 2025-11-13 10:59:05 +02:00
duduta 73460f6278
ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (#16805)
* extract rotate_pairs logic from ggml_compute_forward_rope_f32

* templateify ggml_compute_forward_rope_f32 and _f16

* abort when rope type not supported, remove GLM from test-rope

* add imrope branch to switch

* add rope tests for perf

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-11 13:33:24 +02:00
Acly 1032256ec9
cuda/vulkan : bicubic interpolation (#17022)
* vulkan : implement upscale with bicubic interpolation

* cuda : implement upscale with bicubic interpolation

* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests

* adapt OpenCL backend to not support the OP in that case so tests don't fail

* print scale mode & flags in test-backend-ops
2025-11-10 10:19:39 +01:00
Ruben Ortlam 8a3519b708
vulkan: fix mmq out of bounds reads (#17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings
2025-11-09 09:52:57 +01:00
Jeff Bolz 80a6cf6347
vulkan: fuse mul_mat_id + mul (#17095)
* vulkan: fuse mul_mat_id + mul

This comes up in qwen3 moe.

* split mul_mat_id fusion tests into a separate class
2025-11-09 09:48:42 +01:00
Aman Gupta 64fe17fbb8
Revert "CUDA: add expert reduce kernel (#16857)" (#17100) 2025-11-08 21:05:19 +08:00
Aman Gupta c1b187688d
CUDA: skip fusion for repeating adds in bias (#17080) 2025-11-08 16:58:05 +08:00
Jeff Bolz b4e335d8dc
vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977)
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
2025-11-08 08:52:15 +01:00
bssrdf 299f5d782c
CUDA: properly handle nb00=nb02 case for cpy (#17081) 2025-11-07 23:41:58 +01:00
Johannes Gäßler aa374175c3
CUDA: fix crash on uneven context without FA (#16988) 2025-11-06 14:05:47 +01:00
bssrdf 230d1169e5
improve CUDA cpy memory bandwidth when copying transposed tensor (#16841)
* WIP

* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added shwoing improved memory bandwidth

* added BF16 support

* more strict check to make sure src0 is a transpose

* reformulated to handle more complicated transpose cases

* bring back 2D transpose for higher performance

* allow build on windows

* tranpose copy more shapes

* minor tweak

* final clean up

* restore some test cases

* keep only the kernel for true tranposed case; updated with review suggestions

* make CI happy

* remove headers not needed

* reduced bank conflicts for fp16 and bf16

* add missing const*

* now bank conflicts free

* use padding instead of swizzling

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
2025-11-05 21:55:04 +01:00
Shagun Bera a2054e3a8f
test-backend-ops : fix segfault in moe-expert-reduce test in support mode and coverage (#16936)
* tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage

* tests: init gf and filter out fusion tests for support mode

* tests: filter out fusion cases before calling eval_support

* tests: filter out fusion cases from show_test_coverage as well, fix lint
2025-11-03 00:10:30 +01:00
Georgi Gerganov 2f966b8ed8
clip : use FA (#16837)
* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-11-02 21:21:48 +01:00
Aman Gupta 4146d6a1a6
CUDA: add expert reduce kernel (#16857)
* CUDA: add expert reduce kernel

* contigous checks, better formatting, use std::vector instead of array

* use vector empty instead of size

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-10-31 20:05:07 +08:00
Ruben Ortlam d2a2673dd1
vulkan: fix shmem overrun in mmq id shader (#16873)
* vulkan: fix shmem overrun in mmq id shader

* metal : fix mul_mm_id

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-31 08:14:49 +01:00
JJJYmmm d261223d24
model: add support for qwen3vl series (#16780)
* support qwen3vl series.

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>

* bugfix: fix the arch check for qwen3vl-moe.

* use build_ffn

* optimize deepstack structure

* optimize deepstack feature saving

* Revert "optimize deepstack feature saving" for temporal fix

This reverts commit f321b9fdf1.

* code clean

* use fused qkv in clip

* clean up / rm is_deepstack_layers for simplification

* add test model

* move test model to "big" section

* fix imrope check

* remove trailing whitespace

* fix rope fail

* metal : add imrope support

* add imrope support for sycl

* vulkan: add imrope w/o check

* fix vulkan

* webgpu: add imrope w/o check

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix tensor mapping

---------

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 16:19:14 +01:00
Sigbjørn Skjæret 229bf68628
cuda : fix argsort with 64k+ rows (#16849) 2025-10-30 08:56:28 +01:00
Jeff Bolz b9ce940177
vulkan: Fuse rope+set_rows (#16769)
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).

Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).

Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.

Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.

Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.

Add new backend tests.
2025-10-29 15:13:10 -05:00
Acly 10640e31aa
ggml : fix interpolate with align-corners and ne=1 (#16700)
* ggml : fix interpolate with align-corners and ne=1

* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken

* fix clang warning
2025-10-27 21:50:22 +01:00
Aman Gupta 75cbdd3fce
test-backend-ops: print failed tests at the end (#16785) 2025-10-27 09:25:10 +08:00