NVIDIA DGX Spark

System info

uname --all
Linux spark-17ed 6.11.0-1016-nvidia #16-Ubuntu SMP PREEMPT_DYNAMIC Sun Sep 21 16:52:46 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

g++ --version
g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

nvidia-smi
Fri Mar  6 11:39:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   52C    P0             13W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

ggml-org/nemotron-3-super-120b-GGUF

Model: https://huggingface.co/ggml-org/nemotron-3-super-120b-GGUF

  • llama-batched-bench

main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 99, n_threads = 20, n_threads_batch = 20

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 512 | 32 | 1 | 544 | 1.094 | 468.05 | 1.621 | 19.74 | 2.715 | 200.37 |
| 512 | 32 | 2 | 1088 | 1.463 | 700.16 | 2.437 | 26.26 | 3.900 | 279.01 |
| 512 | 32 | 4 | 2176 | 2.647 | 773.76 | 4.043 | 31.66 | 6.689 | 325.29 |
| 512 | 32 | 8 | 4352 | 5.291 | 774.14 | 6.151 | 41.62 | 11.442 | 380.37 |
| 512 | 32 | 16 | 8704 | 10.603 | 772.62 | 10.385 | 49.30 | 20.987 | 414.72 |
| 512 | 32 | 32 | 17408 | 21.231 | 771.69 | 18.235 | 56.16 | 39.466 | 441.09 |
| 4096 | 32 | 1 | 4128 | 5.340 | 767.05 | 1.616 | 19.81 | 6.956 | 593.47 |
| 4096 | 32 | 2 | 8256 | 10.673 | 767.55 | 2.454 | 26.08 | 13.127 | 628.94 |
| 4096 | 32 | 4 | 16512 | 21.348 | 767.46 | 4.072 | 31.44 | 25.420 | 649.57 |
| 4096 | 32 | 8 | 33024 | 42.714 | 767.15 | 6.277 | 40.78 | 48.991 | 674.08 |
| 4096 | 32 | 16 | 66048 | 85.385 | 767.54 | 10.596 | 48.32 | 95.981 | 688.14 |
| 4096 | 32 | 32 | 132096 | 170.819 | 767.32 | 18.619 | 55.00 | 189.437 | 697.31 |
| 8192 | 32 | 1 | 8224 | 10.690 | 766.32 | 1.619 | 19.76 | 12.310 | 668.10 |
| 8192 | 32 | 2 | 16448 | 21.382 | 766.24 | 2.467 | 25.94 | 23.850 | 689.65 |
| 8192 | 32 | 4 | 32896 | 42.782 | 765.92 | 4.098 | 31.23 | 46.881 | 701.69 |
| 8192 | 32 | 8 | 65792 | 85.582 | 765.77 | 6.368 | 40.20 | 91.951 | 715.52 |
| 8192 | 32 | 16 | 131584 | 171.066 | 766.21 | 10.774 | 47.52 | 181.840 | 723.62 |
| 8192 | 32 | 32 | 263168 | 342.140 | 766.19 | 18.969 | 53.98 | 361.109 | 728.78 |
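
The derived columns in the llama-batched-bench results appear to follow simple arithmetic on the measured timings. The sketch below recomputes them for one row, assuming (from the values above and `is_pp_shared = 0`) that `N_KV = B * (PP + TG)`, `S_PP = B * PP / T_PP`, `S_TG = B * TG / T_TG`, and `S = N_KV / T`. The function name `batched_bench_metrics` is hypothetical and not part of llama.cpp.

```python
def batched_bench_metrics(pp, tg, b, t_pp, t_tg):
    """Recompute the derived columns from one row's inputs and timings.

    Assumes prompts are not shared across sequences (is_pp_shared = 0),
    so every one of the `b` sequences holds its own pp + tg tokens.
    """
    n_kv = b * (pp + tg)   # total tokens held in the KV cache
    total = t_pp + t_tg    # total wall time T in seconds
    return {
        "n_kv": n_kv,
        "s_pp": b * pp / t_pp,  # prompt-processing throughput, t/s
        "s_tg": b * tg / t_tg,  # aggregate generation throughput, t/s
        "s": n_kv / total,      # overall throughput, t/s
    }


# First row of the results: PP=512, TG=32, B=1
row = batched_bench_metrics(pp=512, tg=32, b=1, t_pp=1.094, t_tg=1.621)
print(row)
```

Because the reported timings are rounded to three decimals, the recomputed speeds can differ from the reported ones in the last digits. Dividing `S_TG` by `B` gives an approximate per-sequence generation speed, which is useful when comparing batch sizes.
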
  • llama-bench

| model | size | params | backend | n_ubatch | fa | test | t/s |
|---|---:|---:|---|---:|---:|---:|---:|
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | pp2048 | 768.84 ± 0.90 |
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | tg32 | 19.94 ± 0.16 |
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | pp2048 @ d4096 | 764.51 ± 0.50 |
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | tg32 @ d4096 | 19.95 ± 0.18 |
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | pp2048 @ d8192 | 759.53 ± 0.71 |
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | tg32 @ d8192 | 19.83 ± 0.18 |
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | pp2048 @ d16384 | 747.98 ± 1.58 |
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | tg32 @ d16384 | 19.84 ± 0.18 |
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | pp2048 @ d32768 | 724.40 ± 2.70 |
| nemotron 120B.A12B Q4_K | 65.10 GiB | 120.67 B | CUDA | 2048 | 1 | tg32 @ d32768 | 19.45 ± 0.18 |

build: 04a65daab (8268)
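
To make the depth scaling in the llama-bench results easier to read, the following minimal sketch computes how far prompt-processing throughput falls relative to the zero-depth run. The mean `pp2048` values are copied from the results above; the helper name `relative_drop` is hypothetical, and the ± spread is ignored for simplicity.

```python
# Mean pp2048 throughput (t/s) keyed by context depth, copied from the results.
pp_tps = {
    0: 768.84,
    4096: 764.51,
    8192: 759.53,
    16384: 747.98,
    32768: 724.40,
}


def relative_drop(tps_by_depth):
    """Return the fractional throughput loss at each depth versus depth 0."""
    base = tps_by_depth[0]
    return {depth: 1.0 - tps / base for depth, tps in tps_by_depth.items()}


drops = relative_drop(pp_tps)
for depth, drop in sorted(drops.items()):
    print(f"d{depth}: {drop:.1%}")
```

With these numbers, prompt processing at depth 32768 is roughly 5.8% slower than at depth 0. The same helper can be applied to the `tg32` rows by substituting their values.
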