Commit Graph

7385 Commits

Author SHA1 Message Date
Georgi Gerganov b8eb3b3501
wip fix tests 2025-12-06 16:13:27 +02:00
Oliver Simons e652566139 Readd `cub::DeviceScan::InclusiveSum`-based CumSum
For single rows and large columns doing a for-loop over the function
`cub::DeviceScan::InclusiveSum` offered by CUB outperforms the
`cumsum_cub_kernel` where `cub::BlockScan` is used.

Numbers before this change

  Backend 1/3: CUDA0
  Device description: NVIDIA RTX 6000 Ada Generation
  Device memory: 48510 MB (48039 MB free)

  CUMSUM(type=f32,ne=[128,128,4,4]):                  311258 runs -     3.26 us/run -     2048 kB/run -  599.76 GB/s
  CUMSUM(type=f32,ne=[2048,16,5,4]):                  229390 runs -     4.40 us/run -     5120 kB/run - 1110.23 GB/s
  CUMSUM(type=f32,ne=[20000,10,4,1]):                  37583 runs -    29.63 us/run -     6250 kB/run -  201.18 GB/s
  CUMSUM(type=f32,ne=[128,1,1,1]):                    892819 runs -     1.12 us/run -        1 kB/run -    0.85 GB/s
  CUMSUM(type=f32,ne=[1024,1,1,1]):                   450505 runs -     2.25 us/run -        8 kB/run -    3.39 GB/s
  CUMSUM(type=f32,ne=[4096,1,1,1]):                   155629 runs -     6.61 us/run -       32 kB/run -    4.62 GB/s
  CUMSUM(type=f32,ne=[8192,1,1,1]):                    81910 runs -    12.60 us/run -       64 kB/run -    4.85 GB/s
  CUMSUM(type=f32,ne=[16384,1,1,1]):                   49146 runs -    23.99 us/run -      128 kB/run -    5.09 GB/s
  CUMSUM(type=f32,ne=[32768,1,1,1]):                   24573 runs -    47.10 us/run -      256 kB/run -    5.18 GB/s
  CUMSUM(type=f32,ne=[65536,1,1,1]):                   16382 runs -    93.57 us/run -      512 kB/run -    5.22 GB/s
  CUMSUM(type=f32,ne=[131072,1,1,1]):                   8191 runs -   184.79 us/run -     1024 kB/run -    5.29 GB/s
  CUMSUM(type=f32,ne=[200000,1,1,1]):                   8191 runs -   280.43 us/run -     1562 kB/run -    5.31 GB/s
  CUMSUM(type=f32,ne=[2000000,1,1,1]):                  2148 runs -  2771.23 us/run -    15625 kB/run -    5.38 GB/s
  CUMSUM(type=f32,ne=[128,4,1,1]):                    458696 runs -     2.21 us/run -        4 kB/run -    1.73 GB/s
  CUMSUM(type=f32,ne=[1024,4,1,1]):                   360404 runs -     2.82 us/run -       32 kB/run -   10.83 GB/s
  CUMSUM(type=f32,ne=[4096,4,1,1]):                   147438 runs -     7.12 us/run -      128 kB/run -   17.15 GB/s
  CUMSUM(type=f32,ne=[8192,4,1,1]):                    81910 runs -    12.90 us/run -      256 kB/run -   18.92 GB/s
  CUMSUM(type=f32,ne=[16384,4,1,1]):                   49146 runs -    24.32 us/run -      512 kB/run -   20.08 GB/s
  CUMSUM(type=f32,ne=[32768,4,1,1]):                   24573 runs -    47.28 us/run -     1024 kB/run -   20.66 GB/s
  CUMSUM(type=f32,ne=[65536,4,1,1]):                   16382 runs -    93.21 us/run -     2048 kB/run -   20.96 GB/s
  CUMSUM(type=f32,ne=[131072,4,1,1]):                   8191 runs -   185.04 us/run -     4096 kB/run -   21.11 GB/s
  CUMSUM(type=f32,ne=[200000,4,1,1]):                   5369 runs -   282.08 us/run -     6250 kB/run -   21.13 GB/s
  CUMSUM(type=f32,ne=[2000000,4,1,1]):                   537 runs -  2806.46 us/run -    62500 kB/run -   21.26 GB/s
  CUMSUM(type=f32,ne=[128,8,1,1]):                    458696 runs -     2.20 us/run -        8 kB/run -    3.47 GB/s
  CUMSUM(type=f32,ne=[1024,8,1,1]):                   360404 runs -     2.82 us/run -       64 kB/run -   21.66 GB/s
  CUMSUM(type=f32,ne=[4096,8,1,1]):                   147438 runs -     7.12 us/run -      256 kB/run -   34.28 GB/s
  CUMSUM(type=f32,ne=[8192,8,1,1]):                    81910 runs -    12.90 us/run -      512 kB/run -   37.84 GB/s
  CUMSUM(type=f32,ne=[16384,8,1,1]):                   49146 runs -    24.32 us/run -     1024 kB/run -   40.15 GB/s
  CUMSUM(type=f32,ne=[32768,8,1,1]):                   24573 runs -    47.28 us/run -     2048 kB/run -   41.31 GB/s
  CUMSUM(type=f32,ne=[65536,8,1,1]):                   16382 runs -    93.20 us/run -     4096 kB/run -   41.92 GB/s
  CUMSUM(type=f32,ne=[131072,8,1,1]):                   8194 runs -   185.05 us/run -     8192 kB/run -   42.22 GB/s
  CUMSUM(type=f32,ne=[200000,8,1,1]):                   5370 runs -   282.15 us/run -    12500 kB/run -   42.26 GB/s
  CUMSUM(type=f32,ne=[2000000,8,1,1]):                   269 runs -  4067.61 us/run -   125000 kB/run -   29.36 GB/s
  CUMSUM(type=f32,ne=[128,16,1,1]):                   303067 runs -     3.32 us/run -       16 kB/run -    4.60 GB/s
  CUMSUM(type=f32,ne=[1024,16,1,1]):                  303067 runs -     3.32 us/run -      128 kB/run -   36.76 GB/s
  CUMSUM(type=f32,ne=[4096,16,1,1]):                  147438 runs -     7.17 us/run -      512 kB/run -   68.13 GB/s
  CUMSUM(type=f32,ne=[8192,16,1,1]):                   81910 runs -    12.90 us/run -     1024 kB/run -   75.68 GB/s
  CUMSUM(type=f32,ne=[16384,16,1,1]):                  49146 runs -    24.33 us/run -     2048 kB/run -   80.28 GB/s
  CUMSUM(type=f32,ne=[32768,16,1,1]):                  24573 runs -    47.30 us/run -     4096 kB/run -   82.59 GB/s
  CUMSUM(type=f32,ne=[65536,16,1,1]):                  12291 runs -    93.24 us/run -     8192 kB/run -   83.80 GB/s
  CUMSUM(type=f32,ne=[131072,16,1,1]):                  6147 runs -   185.07 us/run -    16384 kB/run -   84.45 GB/s
  CUMSUM(type=f32,ne=[200000,16,1,1]):                  4029 runs -   282.40 us/run -    25000 kB/run -   84.46 GB/s
  CUMSUM(type=f32,ne=[2000000,16,1,1]):                  270 runs -  4118.40 us/run -   250000 kB/run -   58.11 GB/s
  Backend CUDA0: OK
Backend 2/3: CUDA1
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97250 MB (96677 MB free)

  CUMSUM(type=f32,ne=[128,128,4,4]):                  368595 runs -     2.73 us/run -     2048 kB/run -  715.83 GB/s
  CUMSUM(type=f32,ne=[2048,16,5,4]):                  216282 runs -     4.72 us/run -     5120 kB/run - 1035.32 GB/s
  CUMSUM(type=f32,ne=[20000,10,4,1]):                  32214 runs -    34.33 us/run -     6250 kB/run -  173.64 GB/s
  CUMSUM(type=f32,ne=[128,1,1,1]):                    810909 runs -     1.24 us/run -        1 kB/run -    0.77 GB/s
  CUMSUM(type=f32,ne=[1024,1,1,1]):                   401359 runs -     2.52 us/run -        8 kB/run -    3.03 GB/s
  CUMSUM(type=f32,ne=[4096,1,1,1]):                   139247 runs -     7.44 us/run -       32 kB/run -    4.10 GB/s
  CUMSUM(type=f32,ne=[8192,1,1,1]):                    73719 runs -    14.27 us/run -       64 kB/run -    4.28 GB/s
  CUMSUM(type=f32,ne=[16384,1,1,1]):                   40955 runs -    27.24 us/run -      128 kB/run -    4.48 GB/s
  CUMSUM(type=f32,ne=[32768,1,1,1]):                   24573 runs -    53.46 us/run -      256 kB/run -    4.57 GB/s
  CUMSUM(type=f32,ne=[65536,1,1,1]):                   16382 runs -   105.29 us/run -      512 kB/run -    4.64 GB/s
  CUMSUM(type=f32,ne=[131072,1,1,1]):                   8191 runs -   210.15 us/run -     1024 kB/run -    4.65 GB/s
  CUMSUM(type=f32,ne=[200000,1,1,1]):                   8191 runs -   318.22 us/run -     1562 kB/run -    4.68 GB/s
  CUMSUM(type=f32,ne=[2000000,1,1,1]):                  2148 runs -  3142.23 us/run -    15625 kB/run -    4.74 GB/s
  CUMSUM(type=f32,ne=[128,4,1,1]):                    303067 runs -     3.34 us/run -        4 kB/run -    1.14 GB/s
  CUMSUM(type=f32,ne=[1024,4,1,1]):                   253921 runs -     4.03 us/run -       32 kB/run -    7.58 GB/s
  CUMSUM(type=f32,ne=[4096,4,1,1]):                   122865 runs -     8.20 us/run -      128 kB/run -   14.89 GB/s
  CUMSUM(type=f32,ne=[8192,4,1,1]):                    73719 runs -    14.96 us/run -      256 kB/run -   16.32 GB/s
  CUMSUM(type=f32,ne=[16384,4,1,1]):                   40955 runs -    28.66 us/run -      512 kB/run -   17.04 GB/s
  CUMSUM(type=f32,ne=[32768,4,1,1]):                   24573 runs -    54.21 us/run -     1024 kB/run -   18.01 GB/s
  CUMSUM(type=f32,ne=[65536,4,1,1]):                   16382 runs -   106.49 us/run -     2048 kB/run -   18.34 GB/s
  CUMSUM(type=f32,ne=[131072,4,1,1]):                   8191 runs -   210.88 us/run -     4096 kB/run -   18.52 GB/s
  CUMSUM(type=f32,ne=[200000,4,1,1]):                   5369 runs -   321.77 us/run -     6250 kB/run -   18.53 GB/s
  CUMSUM(type=f32,ne=[2000000,4,1,1]):                   537 runs -  3191.79 us/run -    62500 kB/run -   18.69 GB/s
  CUMSUM(type=f32,ne=[128,8,1,1]):                    376786 runs -     2.67 us/run -        8 kB/run -    2.86 GB/s
  CUMSUM(type=f32,ne=[1024,8,1,1]):                   245730 runs -     4.10 us/run -       64 kB/run -   14.90 GB/s
  CUMSUM(type=f32,ne=[4096,8,1,1]):                   122865 runs -     8.20 us/run -      256 kB/run -   29.79 GB/s
  CUMSUM(type=f32,ne=[8192,8,1,1]):                    65528 runs -    16.38 us/run -      512 kB/run -   29.82 GB/s
  CUMSUM(type=f32,ne=[16384,8,1,1]):                   40955 runs -    28.69 us/run -     1024 kB/run -   34.04 GB/s
  CUMSUM(type=f32,ne=[32768,8,1,1]):                   24573 runs -    55.28 us/run -     2048 kB/run -   35.33 GB/s
  CUMSUM(type=f32,ne=[65536,8,1,1]):                   16382 runs -   108.50 us/run -     4096 kB/run -   36.00 GB/s
  CUMSUM(type=f32,ne=[131072,8,1,1]):                   8194 runs -   213.75 us/run -     8192 kB/run -   36.55 GB/s
  CUMSUM(type=f32,ne=[200000,8,1,1]):                   5370 runs -   326.31 us/run -    12500 kB/run -   36.54 GB/s
  CUMSUM(type=f32,ne=[2000000,8,1,1]):                   538 runs -  3252.68 us/run -   125000 kB/run -   36.72 GB/s
  CUMSUM(type=f32,ne=[128,16,1,1]):                   303067 runs -     3.32 us/run -       16 kB/run -    4.60 GB/s
  CUMSUM(type=f32,ne=[1024,16,1,1]):                  253921 runs -     4.06 us/run -      128 kB/run -   30.09 GB/s
  CUMSUM(type=f32,ne=[4096,16,1,1]):                  122865 runs -     8.20 us/run -      512 kB/run -   59.57 GB/s
  CUMSUM(type=f32,ne=[8192,16,1,1]):                   65528 runs -    16.38 us/run -     1024 kB/run -   59.63 GB/s
  CUMSUM(type=f32,ne=[16384,16,1,1]):                  40955 runs -    28.69 us/run -     2048 kB/run -   68.09 GB/s
  CUMSUM(type=f32,ne=[32768,16,1,1]):                  24573 runs -    55.28 us/run -     4096 kB/run -   70.67 GB/s
  CUMSUM(type=f32,ne=[65536,16,1,1]):                  12291 runs -   108.50 us/run -     8192 kB/run -   72.02 GB/s
  CUMSUM(type=f32,ne=[131072,16,1,1]):                  6147 runs -   213.60 us/run -    16384 kB/run -   73.17 GB/s
  CUMSUM(type=f32,ne=[200000,16,1,1]):                  4029 runs -   326.04 us/run -    25000 kB/run -   73.15 GB/s
  CUMSUM(type=f32,ne=[2000000,16,1,1]):                  270 runs -  5458.69 us/run -   250000 kB/run -   43.84 GB/s

----
Numbers after:

Backend 1/3: CUDA0
  Device description: NVIDIA RTX 6000 Ada Generation
  Device memory: 48510 MB (48039 MB free)

  CUMSUM(type=f32,ne=[128,128,4,4]):                  311258 runs -     3.25 us/run -     2048 kB/run -  601.62 GB/s
  CUMSUM(type=f32,ne=[2048,16,5,4]):                  229390 runs -     4.40 us/run -     5120 kB/run - 1110.14 GB/s
  CUMSUM(type=f32,ne=[20000,10,4,1]):                  37583 runs -    29.67 us/run -     6250 kB/run -  200.89 GB/s
  CUMSUM(type=f32,ne=[128,1,1,1]):                    892819 runs -     1.12 us/run -        1 kB/run -    0.85 GB/s
  CUMSUM(type=f32,ne=[1024,1,1,1]):                   458696 runs -     2.21 us/run -        8 kB/run -    3.45 GB/s
  CUMSUM(type=f32,ne=[4096,1,1,1]):                   376786 runs -     2.66 us/run -       32 kB/run -   11.46 GB/s
  CUMSUM(type=f32,ne=[8192,1,1,1]):                   393168 runs -     2.59 us/run -       64 kB/run -   23.57 GB/s
  CUMSUM(type=f32,ne=[16384,1,1,1]):                  393168 runs -     2.59 us/run -      128 kB/run -   47.15 GB/s
  CUMSUM(type=f32,ne=[32768,1,1,1]):                  376786 runs -     2.69 us/run -      256 kB/run -   90.69 GB/s
  CUMSUM(type=f32,ne=[65536,1,1,1]):                  327640 runs -     3.06 us/run -      512 kB/run -  159.65 GB/s
  CUMSUM(type=f32,ne=[131072,1,1,1]):                 311258 runs -     3.28 us/run -     1024 kB/run -  297.77 GB/s
  CUMSUM(type=f32,ne=[200000,1,1,1]):                 270303 runs -     3.74 us/run -     1562 kB/run -  398.14 GB/s
  CUMSUM(type=f32,ne=[2000000,1,1,1]):                137472 runs -     7.35 us/run -    15625 kB/run - 2026.94 GB/s
  CUMSUM(type=f32,ne=[128,4,1,1]):                    876437 runs -     1.14 us/run -        4 kB/run -    3.33 GB/s
  CUMSUM(type=f32,ne=[1024,4,1,1]):                   442314 runs -     2.28 us/run -       32 kB/run -   13.39 GB/s
  CUMSUM(type=f32,ne=[4096,4,1,1]):                   155629 runs -     6.69 us/run -      128 kB/run -   18.24 GB/s
  CUMSUM(type=f32,ne=[8192,4,1,1]):                    81910 runs -    12.53 us/run -      256 kB/run -   19.49 GB/s
  CUMSUM(type=f32,ne=[16384,4,1,1]):                   49146 runs -    24.18 us/run -      512 kB/run -   20.20 GB/s
  CUMSUM(type=f32,ne=[32768,4,1,1]):                   65528 runs -    15.34 us/run -     1024 kB/run -   63.66 GB/s
  CUMSUM(type=f32,ne=[65536,4,1,1]):                   73719 runs -    14.76 us/run -     2048 kB/run -  132.35 GB/s
  CUMSUM(type=f32,ne=[131072,4,1,1]):                  65528 runs -    16.01 us/run -     4096 kB/run -  244.07 GB/s
  CUMSUM(type=f32,ne=[200000,4,1,1]):                  64428 runs -    16.51 us/run -     6250 kB/run -  360.97 GB/s
  CUMSUM(type=f32,ne=[2000000,4,1,1]):                 33831 runs -    29.59 us/run -    62500 kB/run - 2016.08 GB/s
  CUMSUM(type=f32,ne=[128,8,1,1]):                    868246 runs -     1.16 us/run -        8 kB/run -    6.59 GB/s
  CUMSUM(type=f32,ne=[1024,8,1,1]):                   442314 runs -     2.28 us/run -       64 kB/run -   26.76 GB/s
  CUMSUM(type=f32,ne=[4096,8,1,1]):                   155629 runs -     6.69 us/run -      256 kB/run -   36.48 GB/s
  CUMSUM(type=f32,ne=[8192,8,1,1]):                    81910 runs -    12.53 us/run -      512 kB/run -   38.97 GB/s
  CUMSUM(type=f32,ne=[16384,8,1,1]):                   49146 runs -    24.17 us/run -     1024 kB/run -   40.41 GB/s
  CUMSUM(type=f32,ne=[32768,8,1,1]):                   24573 runs -    47.53 us/run -     2048 kB/run -   41.10 GB/s
  CUMSUM(type=f32,ne=[65536,8,1,1]):                   16382 runs -    61.25 us/run -     4096 kB/run -   63.77 GB/s
  CUMSUM(type=f32,ne=[131072,8,1,1]):                  32776 runs -    31.79 us/run -     8192 kB/run -  245.82 GB/s
  CUMSUM(type=f32,ne=[200000,8,1,1]):                  32220 runs -    32.90 us/run -    12500 kB/run -  362.35 GB/s
  CUMSUM(type=f32,ne=[2000000,8,1,1]):                  6725 runs -   151.99 us/run -   125000 kB/run -  785.77 GB/s
  CUMSUM(type=f32,ne=[128,16,1,1]):                   851864 runs -     1.18 us/run -       16 kB/run -   12.97 GB/s
  CUMSUM(type=f32,ne=[1024,16,1,1]):                  442314 runs -     2.30 us/run -      128 kB/run -   53.13 GB/s
  CUMSUM(type=f32,ne=[4096,16,1,1]):                  155629 runs -     6.68 us/run -      512 kB/run -   73.13 GB/s
  CUMSUM(type=f32,ne=[8192,16,1,1]):                   81910 runs -    12.68 us/run -     1024 kB/run -   77.00 GB/s
  CUMSUM(type=f32,ne=[16384,16,1,1]):                  40955 runs -    24.56 us/run -     2048 kB/run -   79.53 GB/s
  CUMSUM(type=f32,ne=[32768,16,1,1]):                  24573 runs -    47.52 us/run -     4096 kB/run -   82.21 GB/s
  CUMSUM(type=f32,ne=[65536,16,1,1]):                  12291 runs -    93.44 us/run -     8192 kB/run -   83.62 GB/s
  CUMSUM(type=f32,ne=[131072,16,1,1]):                 16392 runs -    63.36 us/run -    16384 kB/run -  246.68 GB/s
  CUMSUM(type=f32,ne=[200000,16,1,1]):                 16116 runs -    65.25 us/run -    25000 kB/run -  365.53 GB/s
  CUMSUM(type=f32,ne=[2000000,16,1,1]):                 3375 runs -   304.46 us/run -   250000 kB/run -  785.98 GB/s
  Backend CUDA0: OK
Backend 2/3: CUDA1
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97250 MB (96677 MB free)

  CUMSUM(type=f32,ne=[128,128,4,4]):                  376786 runs -     2.69 us/run -     2048 kB/run -  727.04 GB/s
  CUMSUM(type=f32,ne=[2048,16,5,4]):                  216282 runs -     4.64 us/run -     5120 kB/run - 1053.30 GB/s
  CUMSUM(type=f32,ne=[20000,10,4,1]):                  32214 runs -    34.21 us/run -     6250 kB/run -  174.27 GB/s
  CUMSUM(type=f32,ne=[128,1,1,1]):                    819100 runs -     1.22 us/run -        1 kB/run -    0.78 GB/s
  CUMSUM(type=f32,ne=[1024,1,1,1]):                   409550 runs -     2.47 us/run -        8 kB/run -    3.09 GB/s
  CUMSUM(type=f32,ne=[4096,1,1,1]):                   303067 runs -     3.31 us/run -       32 kB/run -    9.21 GB/s
  CUMSUM(type=f32,ne=[8192,1,1,1]):                   237539 runs -     4.33 us/run -       64 kB/run -   14.08 GB/s
  CUMSUM(type=f32,ne=[16384,1,1,1]):                  237539 runs -     4.33 us/run -      128 kB/run -   28.17 GB/s
  CUMSUM(type=f32,ne=[32768,1,1,1]):                  188393 runs -     5.37 us/run -      256 kB/run -   45.47 GB/s
  CUMSUM(type=f32,ne=[65536,1,1,1]):                  188393 runs -     5.41 us/run -      512 kB/run -   90.20 GB/s
  CUMSUM(type=f32,ne=[131072,1,1,1]):                 188393 runs -     5.41 us/run -     1024 kB/run -  180.41 GB/s
  CUMSUM(type=f32,ne=[200000,1,1,1]):                 188393 runs -     5.41 us/run -     1562 kB/run -  275.27 GB/s
  CUMSUM(type=f32,ne=[2000000,1,1,1]):                128880 runs -     7.76 us/run -    15625 kB/run - 1920.33 GB/s
  CUMSUM(type=f32,ne=[128,4,1,1]):                    802718 runs -     1.26 us/run -        4 kB/run -    3.03 GB/s
  CUMSUM(type=f32,ne=[1024,4,1,1]):                   401359 runs -     2.51 us/run -       32 kB/run -   12.18 GB/s
  CUMSUM(type=f32,ne=[4096,4,1,1]):                   139247 runs -     7.51 us/run -      128 kB/run -   16.26 GB/s
  CUMSUM(type=f32,ne=[8192,4,1,1]):                    73719 runs -    14.17 us/run -      256 kB/run -   17.23 GB/s
  CUMSUM(type=f32,ne=[16384,4,1,1]):                   40955 runs -    27.37 us/run -      512 kB/run -   17.84 GB/s
  CUMSUM(type=f32,ne=[32768,4,1,1]):                   40955 runs -    26.33 us/run -     1024 kB/run -   37.10 GB/s
  CUMSUM(type=f32,ne=[65536,4,1,1]):                   40955 runs -    26.19 us/run -     2048 kB/run -   74.59 GB/s
  CUMSUM(type=f32,ne=[131072,4,1,1]):                  40955 runs -    26.35 us/run -     4096 kB/run -  148.26 GB/s
  CUMSUM(type=f32,ne=[200000,4,1,1]):                  42952 runs -    24.18 us/run -     6250 kB/run -  246.51 GB/s
  CUMSUM(type=f32,ne=[2000000,4,1,1]):                 32757 runs -    31.01 us/run -    62500 kB/run - 1923.68 GB/s
  CUMSUM(type=f32,ne=[128,8,1,1]):                    786336 runs -     1.28 us/run -        8 kB/run -    5.95 GB/s
  CUMSUM(type=f32,ne=[1024,8,1,1]):                   393168 runs -     2.57 us/run -       64 kB/run -   23.73 GB/s
  CUMSUM(type=f32,ne=[4096,8,1,1]):                   131056 runs -     7.67 us/run -      256 kB/run -   31.82 GB/s
  CUMSUM(type=f32,ne=[8192,8,1,1]):                    73719 runs -    14.43 us/run -      512 kB/run -   33.84 GB/s
  CUMSUM(type=f32,ne=[16384,8,1,1]):                   40955 runs -    27.90 us/run -     1024 kB/run -   35.01 GB/s
  CUMSUM(type=f32,ne=[32768,8,1,1]):                   24573 runs -    54.63 us/run -     2048 kB/run -   35.75 GB/s
  CUMSUM(type=f32,ne=[65536,8,1,1]):                   16382 runs -    72.24 us/run -     4096 kB/run -   54.08 GB/s
  CUMSUM(type=f32,ne=[131072,8,1,1]):                  20485 runs -    52.66 us/run -     8192 kB/run -  148.37 GB/s
  CUMSUM(type=f32,ne=[200000,8,1,1]):                  21480 runs -    48.00 us/run -    12500 kB/run -  248.42 GB/s
  CUMSUM(type=f32,ne=[2000000,8,1,1]):                 16140 runs -    61.99 us/run -   125000 kB/run - 1926.51 GB/s
  CUMSUM(type=f32,ne=[128,16,1,1]):                   786336 runs -     1.28 us/run -       16 kB/run -   11.90 GB/s
  CUMSUM(type=f32,ne=[1024,16,1,1]):                  393168 runs -     2.57 us/run -      128 kB/run -   47.57 GB/s
  CUMSUM(type=f32,ne=[4096,16,1,1]):                  131056 runs -     7.65 us/run -      512 kB/run -   63.83 GB/s
  CUMSUM(type=f32,ne=[8192,16,1,1]):                   73719 runs -    14.42 us/run -     1024 kB/run -   67.74 GB/s
  CUMSUM(type=f32,ne=[16384,16,1,1]):                  40955 runs -    27.87 us/run -     2048 kB/run -   70.09 GB/s
  CUMSUM(type=f32,ne=[32768,16,1,1]):                  24573 runs -    54.54 us/run -     4096 kB/run -   71.63 GB/s
  CUMSUM(type=f32,ne=[65536,16,1,1]):                  12291 runs -   107.53 us/run -     8192 kB/run -   72.66 GB/s
  CUMSUM(type=f32,ne=[131072,16,1,1]):                 10245 runs -   105.10 us/run -    16384 kB/run -  148.70 GB/s
  CUMSUM(type=f32,ne=[200000,16,1,1]):                 10744 runs -    95.36 us/run -    25000 kB/run -  250.11 GB/s
  CUMSUM(type=f32,ne=[2000000,16,1,1]):                 5400 runs -   186.97 us/run -   250000 kB/run - 1279.90 GB/s
2025-12-05 16:26:18 +01:00
Oliver Simons 7668999518 Merge branch 'master' into gpu-sampling
Let's keep `master's` cumsum implementation for it's likely better AMD
perf and add back pure-CUB-implementation in follow-up commit
2025-12-05 14:41:08 +01:00
Oliver Simons dd11f6eb7b Add perf-tests for CUMSUM 2025-12-05 14:34:06 +01:00
Georgi Gerganov cf74b1a8ec
sampling : fix candidates logic 2025-12-05 14:24:28 +02:00
Pascal 1be97831e4
fix: prevent segfault in tokenizer on highly repetitive input (#17786)
Add nosubs|optimize flags to std::regex constructors to prevent
catastrophic backtracking when processing prompts with repeated
identical characters (e.g., 'A' * 10000).

The nosubs flag disables subgroup capture, significantly reducing
memory usage and backtracking on uniform token sequences
2025-12-05 13:52:23 +02:00
Adrien Gallouët a6cfc212ed
ci : fix winget workflow (#17790) 2025-12-05 19:44:17 +08:00
shalinib-ibm 3a0d10533a
Q4/Q8 Tiled Gemm Optimization. (#16999) 2025-12-05 19:41:51 +08:00
Piotr Wilkin (ilintar) 6648989673
Add pwilkin to CODEOWNERS for chat files (#17789)
* Add pwilkin to CODEOWNERS for chat files

* Reorder alphabetically
2025-12-05 12:00:57 +01:00
Johannes Gäßler e95d0bc8fd
CUDA: fix FA VKQ accumulator overflow (#17746) 2025-12-05 09:18:10 +01:00
Jiacheng (Jason) Chen 668ed76574
HIP: enable WMMA-MMQ INT kernels for RDNA 3 (#17576)
* enabled wmma instructions for most quantizations other than q2k

* fixed the last q2_k test case failure

* address comments: fix out of bound write for RDNA4, add comments after #endif

* clean up rebase: fix ne error in half2

* fix the EditorConfig CI
2025-12-05 09:17:37 +01:00
Sigbjørn Skjæret 03d9a77b85
ci : transform release binary root dir in tar to llama-bXXXX (#17773)
* transform release binary root dir in tar to llama-bXXXX

* bsdtar supports -s instead of --transform
2025-12-05 01:50:19 +01:00
Gabe Goodhart 3143a755c8
docs : update ops.md (Metal, BLAS) (#17768)
* docs: Regen Metal.csv

Branch: UpdateOpsMd

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* docs: Regen BLAS.csv

Branch: UpdateOpsMd

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* docs: Update ops.md

Branch: UpdateOpsMd

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-12-05 00:55:34 +01:00
Piotr Wilkin (ilintar) 96fe9badfc
Add support for CUMSUM and TRI for CUDA. (#17584)
* Add support for CUMSUM and TRI for CUDA.

* Minor optimizations.

* Correct warp_prefix_inclusive_sum in float2 variant to return float2

* Optimize TRI

* Whitespace

* Fix strides.

* Implement double loop

* Whitespace

* Fix HIP compilation bugs

* Optimizations + big case performance tests

* Implement using CUB with fallback to custom kernel

* Remove error message.

* Fixes from code review

* Comment out CPU-unsupported F16/BF16 cases to fix CI

* Fine, you win :P

* Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS

* Vary warp-size based on physical warp size

* Add GGML_UNUSED_VARS in tri as well

* Use constexpr and call prefix_inclusive with warp_size template param

* Update ggml/src/ggml-cuda/cumsum.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Change to tid % warp_size

* Fix strides; hardcode mask; add ggml_lane_mask_t

* Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()

* Too hasty...

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-04 22:19:51 +01:00
Georgi Gerganov 7864074fdb
sampling : fix outputs and device checks 2025-12-04 19:33:01 +02:00
Gabe Goodhart bde188d60f
metal: TRI, FILL, EXPM1, SOFTPLUS (#16623)
* feat(wip): Port initial TRI impl from pervious work

The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove argument for constant val override

This was added in the original draft, but later removed. With this, the
kernel now passes tests.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Move the ttype conditional to templating to avoid conditional in kernel

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Type fixes

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* feat: Add softplus for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add EXPM1 for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add FILL for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unused arguments

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use select instead of branch for softplus non-vec

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-04 19:12:19 +02:00
Georgi Gerganov abc19635a3
cont : keep backend sampling disabled for now 2025-12-04 17:42:09 +02:00
Xuan-Son Nguyen 9d0229967a
server: strip content-length header on proxy (#17734) 2025-12-04 16:32:57 +01:00
Georgi Gerganov 6958d41366
sampling : check backend support during init 2025-12-04 17:29:08 +02:00
Xuan-Son Nguyen c4c10bfb86
server: move msg diffs tracking to HTTP thread (#17740)
* server: move msg diffs tracking to HTTP thread

* wip

* tool call tests ok

* minor : style

* cont : fix

* move states to server_response_reader

* add safe-guard

* fix

* fix 2

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-04 15:46:08 +01:00
Daniel Bevenius 817d743cc1
examples : add missing code block end marker [no ci] (#17756)
This commit adds the missing code block end marker in simple-cmake-pkg
to correct the formatting.
2025-12-04 14:17:30 +01:00
Daniel Bevenius bd4ef13476
common : skip model validation when --help is requested (#17755)
This commit skips the model validation check when the user specifies the
--help option.

The motivation for this is that currently and error is thrown before the
--help could be processed. Now skips validation if params.usage is set,
allowing help to display without requiring --model.

Resolves: https://github.com/ggml-org/llama.cpp/issues/17754
2025-12-04 13:36:50 +01:00
Georgi Gerganov 1bde70785d
sampling : remove redundant calls to ggml_build_forward_expand 2025-12-04 14:25:28 +02:00
Georgi Gerganov fce571ee51
sampling : simplify temp sampling 2025-12-04 14:23:02 +02:00
Alberto Cabrera Pérez 87a2084c45
ggml-cpu : remove asserts always evaluating to false (#17728) 2025-12-04 13:16:38 +01:00
SmartestWashingMachine 3659aa28e9
convert: use existing local chat_template if mistral-format model has one. (#17749)
* conversion: use existing local chat_template.jinja file if mistral-format model has one.

* fix --mistral-format mistakenly assuming some <=v7 chat template names are file paths and reading them.

* Update convert_hf_to_gguf.py - change from exists() to is_file()

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-04 12:12:45 +01:00
Adrien Gallouët 2a73f81f8a
cmake : simplify build info detection using standard variables (#17423)
The current approach has several drawbacks. Mostly, when
cross-compiling, invoking the compiler binary directly to query the
machine hardware can behave unexpectedly depending on the toolchain
wrapper (using COMPILER_TARGET, CFLAGS, etc).

As CMake is the official tool to build llama.cpp, I propose to only rely
on it to get those variables (`CMAKE_SYSTEM_NAME` and
`CMAKE_SYSTEM_PROCESSOR`).

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-04 12:42:13 +02:00
Sigbjørn Skjæret 7dba049b07
ci : disable ggml-ci-x64-amd-* (#17753) 2025-12-04 11:25:08 +01:00
Adrien Gallouët 83c1171529
common: use native MultiByteToWideChar (#17738)
`std::codecvt_utf8<wchar_t>` is deprecated and produces warnings:

    common/common.cpp:792:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations]
      792 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
          |

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-04 12:06:49 +02:00
Daniel Bevenius ac9e164714
sampling : fix backend temp sampling to use logits masking 2025-12-04 09:39:20 +01:00
Georgi Gerganov 0d1324856f
metal : use params per pipeline instance (#17739) 2025-12-04 10:34:11 +02:00
Georgi Gerganov a67ef0f47f
llama : fix sanity checks during quantization (#17721) 2025-12-04 10:33:42 +02:00
Daniel Bevenius 10bd640aae
Revert "sampling : stop short if backend sampler sampled a token"
This reverts commit 87b2719eca.
2025-12-04 08:26:33 +01:00
Daniel Bevenius c0b182f4d6
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-04 08:17:50 +01:00
Daniel Bevenius 87b2719eca
sampling : stop short if backend sampler sampled a token
This commit modifies the graph building logic to immediately continue
when a token has already been sampled by the backend sampler.

It also updates the test for backend temporary sampling to include
top-k and distribution samplers in the chain to verify that they are not
producing any logits (they are not run).
2025-12-04 08:13:49 +01:00
Adrien Gallouët ef75a89fdb
build : move _WIN32_WINNT definition to headers (#17736)
Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds,
This caused "macro redefined" warnings with toolchains that define the version.

This also removes the `GGML_WIN_VER` variable as it is no longer needed.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-04 07:04:02 +01:00
Jeff Bolz d8b5cdc4fe
build: enable parallel builds in msbuild using MTT (#17708)
* build: enable parallel builds in msbuild using MTT

* check LLAMA_STANDALONE
2025-12-03 22:42:29 -06:00
Herman Semenoff dea9ba27cb
ggml-cpu: remove duplicate conditional check 'iid' (#17650) 2025-12-04 05:03:19 +08:00
Piotr Wilkin (ilintar) c6d1a00aa7
Add a couple of file types to the text section (#17670)
* Add a couple of file types to the text section

* Format + regenerate index

* Rebuild after rebase
2025-12-03 21:45:06 +01:00
SmartestWashingMachine 424c579455
convert : support latest mistral-common (fix conversion with --mistral-format) (#17712)
* fix convert_hf_to_gguf.py failing with --mistral-format using later mistral-common versions.

* use get_one_valid_tokenizer_file from mistral-common if available and fallback to old logic otherwise.

* use file name instead of file path for get_one_valid_tokenizer_file.

* fix --mistral-format tokenizer file failing for tokenizers in subdirectories.

* move get_one_valid_tokenizer_file import to avoid nested try-except.
2025-12-03 21:15:04 +01:00
Aleksander Grygier e9f9483464
Use OpenAI-compatible `/v1/models` endpoint by default (#17689)
* refactor: Data fetching via stores

* chore: update webui build output

* refactor: Use OpenAI compat `/v1/models` endpoint by default to list models

* chore: update webui build output

* chore: update webui build output
2025-12-03 20:49:09 +01:00
Andika Wasisto 41c5e02f42
webui: Fix zero pasteLongTextToFileLen to disable conversion being overridden (#17445)
* webui: Fix zero pasteLongTextToFileLen to disable conversion being overridden

Zero pasteLongTextToFileLen should disable the conversion, but it was
overwritten with 2500.

* Apply suggestions from code review

* Update webui build
2025-12-03 20:45:17 +01:00
Johannes Gäßler 2e1c9cd814
CUDA: generalized (mma) FA, add Volta support (#17505)
* CUDA: generalized (mma) FA, add Volta support

* use struct for MMA FA kernel config

---------

Co-authored-by: Aman Gupta <aman>
2025-12-03 16:57:05 +01:00
Georgi Gerganov 190c4838bd
chat : reserve memory in compute_diffs and improve naming (#17729) 2025-12-03 17:22:10 +02:00
Pascal e7c2cf1356
server: add router multi-model tests (#17704) (#17722)
* llama-server: add router multi-model tests (#17704)

Add 4 test cases for model router:
- test_router_unload_model: explicit model unloading
- test_router_models_max_evicts_lru: LRU eviction with --models-max
- test_router_no_models_autoload: --no-models-autoload flag behavior
- test_router_api_key_required: API key authentication

Tests use async model loading with polling and graceful skip when
insufficient models available for eviction testing.

utils.py changes:
- Add models_max, models_dir, no_models_autoload attributes to ServerProcess
- Handle JSONDecodeError for non-JSON error responses (fallback to text)

* llama-server: update test models to new HF repos

* add offline

* llama-server: fix router LRU eviction test and add preloading

Fix eviction test: load 2 models first, verify state, then load
3rd to trigger eviction. Previous logic loaded all 3 at once,
causing first model to be evicted before verification could occur.

Add module fixture to preload models via ServerPreset.load_all()
and mark test presets as offline to use cached models

* llama-server: fix split model download on Windows

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-12-03 15:10:37 +01:00
Adrien Gallouët 1257491047
server : fix bad fmt, size() is a size_type (#17735)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-03 15:47:22 +02:00
Adrien Gallouët 083e18b11c
cmake: explicitly link against crypt32 on non-MSVC Windows builds (#17727)
Some toolchains do not support linking via pragmas such as:

    #pragma comment(lib, "crypt32.lib")

so we need to add the library explicitly.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-03 15:47:02 +02:00
Georgi Gerganov cce3b2a8ad
sampling : minor cleanup 2025-12-03 15:39:44 +02:00
Georgi Gerganov 3d94e967a1
metal : fix data race in pipeline library (#17731) 2025-12-03 14:03:40 +02:00
jiahao su 7feb0a1005
ci : remove the build of openeuler-cann in release (#17724)
* Remove the build of openeuler-cann in release

* Remove the relevant release files
2025-12-03 12:24:59 +01:00