Cross-Backend Profiler
llama.cpp includes a built-in cross-backend profiler that captures per-operation timing, data transfer costs, and tensor shapes across all compute backends. It works with any application built on the ggml scheduler — no source changes needed.
Supported Backends
| Backend | Status | Timing method |
|---|---|---|
| CPU | Supported | Wall-clock (CLOCK_MONOTONIC_RAW) |
| CUDA | Supported | cudaEvent GPU timestamps |
| Vulkan | Supported | GPU timestamp queries |
| BLAS | Supported | Wall-clock |
| Metal | Not yet supported | — |
| OpenCL | Not yet supported | — |
The scheduler also profiles data copies (H2D, D2H, D2D) between backends regardless of which backends have native profiler support.
Enabling the Profiler
There are two independent ways to enable profiling. They can be used separately or together.
CLI flags (--profile, --profile-output)
Available in llama-cli, llama-completion, llama-server, and debug:
```sh
# Print summary to stdout
llama-completion -m model.gguf --profile -p "Hello world"

# Export to JSON
llama-completion -m model.gguf --profile --profile-output profile.json -p "Hello world"

# Export to plain text
llama-completion -m model.gguf --profile --profile-output profile.txt -p "Hello world"
```
The output format is chosen by file extension: .json for JSON, .txt for plain text. Any other extension defaults to JSON.
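The selection rule can be sketched as a small helper (illustrative only; `pick_format` is not part of the actual CLI):

```python
def pick_format(path: str) -> str:
    """Mirror the extension rule described above (illustrative helper,
    not the actual llama.cpp implementation)."""
    if path.endswith(".txt"):
        return "text"
    # .json and any other extension both produce JSON
    return "json"

print(pick_format("profile.txt"))   # text
print(pick_format("profile.json"))  # json
print(pick_format("profile.out"))   # json
```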
Environment variable (GGML_PROFILE)
The GGML_PROFILE environment variable enables profiling at the ggml scheduler level. This works with any application that uses the scheduler — including third-party tools like sd.cpp — without CLI flag support.
```sh
# Print summary to stdout
GGML_PROFILE=1 llama-completion -m model.gguf -p "Hello world"

# Export JSON
GGML_PROFILE=profile.json llama-completion -m model.gguf -p "Hello world"

# Export plain text
GGML_PROFILE=profile.txt llama-completion -m model.gguf -p "Hello world"

# Works with any ggml-based application
GGML_PROFILE=1 sd -m model.gguf -p "a cat"
```
| Value | Behavior |
|---|---|
| `1`, `stdout`, or empty | Print summary to stdout |
| `path.json` | Export JSON to file |
| `path.txt` | Export plain text to file |
| Any other path | Export JSON to file |
The export happens automatically when the scheduler is freed (typically at program exit).
Output Formats
Console summary (stdout)
The default when --profile is used without --profile-output, or GGML_PROFILE=1:
```
=== Profiling Summary ===
[OP ] backend 0 MUL_MAT    45.2% count=1200 total= 120.50 ms avg= 100.42 us ... 12.30 GB/s [4096 x 4096]
[OP ] backend 1 MUL_MAT_ID 30.1% count= 600 total=  80.20 ms avg= 133.67 us ...  0.08 GB/s [2688 x 1856 x 128]
[COPY] backend 0 copy_H2D   5.3% count= 200 total=  14.10 ms avg=  70.50 us ...  2.50 GB/s
...
```
Each line shows: event type (OP or COPY), backend index, operation name, percentage of total time, call count, timing stats, bandwidth, and representative tensor shape.
Plain text (.txt)
A more detailed report with three sections:
- Profiling Summary — total time, record count, unique ops
- Per-Backend Summary — ops and copies per backend with aggregate bandwidth
- Operations table — full breakdown with bandwidth and tensor shapes for all source tensors
JSON (.json)
Machine-readable format suitable for the Python analysis tool. Contains:
- `version`: Format version (currently `2`)
- `backends[]`: Backend metadata (name, device, device type)
- `records[]`: Every profiling event with:
  - `type`: `0` = OP, `1` = COPY
  - `name`: Operation name (e.g. `"MUL_MAT"`, `"copy_H2D"`)
  - `backend_id`, `split_id`: Scheduler indices
  - `start_ns`, `duration_ns`: Timing in nanoseconds
  - `bytes`: Output tensor size (OPs) or transfer size (COPYs)
  - `extra`: Fusion name for fused ops, or `null`
  - `ne_src0`, `ne_src1`, `ne_src2`: Source tensor dimensions (4-element arrays)

`ne_src2` is populated only for `MUL_MAT_ID` (expert selection indices); it is `[0,0,0,0]` for all other ops.
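As a sketch of consuming this format, the snippet below aggregates total time per operation name from an in-memory structure shaped like the export (field names follow the description above; the values are made up):

```python
from collections import defaultdict

# Minimal in-memory example matching the schema described above
# (illustrative values, not real profiler output).
profile = {
    "version": 2,
    "backends": [{"name": "CUDA0"}, {"name": "CPU"}],
    "records": [
        {"type": 0, "name": "MUL_MAT", "backend_id": 0,
         "start_ns": 0, "duration_ns": 120_000, "bytes": 1 << 20},
        {"type": 0, "name": "MUL_MAT", "backend_id": 0,
         "start_ns": 200_000, "duration_ns": 80_000, "bytes": 1 << 20},
        {"type": 1, "name": "copy_H2D", "backend_id": 0,
         "start_ns": 300_000, "duration_ns": 50_000, "bytes": 1 << 16},
    ],
}

# Aggregate total time per operation name, OP events only (type == 0)
totals_ns = defaultdict(int)
for rec in profile["records"]:
    if rec["type"] == 0:
        totals_ns[rec["name"]] += rec["duration_ns"]

print(dict(totals_ns))  # {'MUL_MAT': 200000}
```

With a real export you would load the dict via `json.load()` instead of building it inline.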
Python Analysis Tool
The tools/profiler/profiler.py script reads JSON exports and produces analysis reports and visualizations.
Basic usage
```sh
# Print summary
python -m tools.profiler.profiler profile.json

# Show top 10 operations by time
python -m tools.profiler.profiler profile.json --top-ops 10

# Show top 10 longest individual kernels
python -m tools.profiler.profiler profile.json --top-kernels 10

# Show inefficiency ranking (highest time-per-byte)
python -m tools.profiler.profiler profile.json --inefficiency
```
Export visualizations
```sh
# Interactive HTML timeline (self-contained, no dependencies)
python -m tools.profiler.profiler profile.json --html-viewer timeline.html

# Chrome Trace format (open in chrome://tracing or Perfetto)
python -m tools.profiler.profiler profile.json --chrome-trace trace.json

# Downsample large traces for the HTML viewer
python -m tools.profiler.profiler profile.json --html-viewer timeline.html --html-max-records 50000
```
Multiple exports can be combined in a single invocation:
```sh
python -m tools.profiler.profiler profile.json --html-viewer timeline.html --chrome-trace trace.json --top-ops 20
```
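If you want to consume the JSON without the bundled tool, each record maps onto a Chrome Trace "complete" event roughly as follows (a sketch based on the field layout described earlier; the actual `--chrome-trace` exporter may differ in details such as process/thread naming):

```python
import json

def to_chrome_trace(records):
    """Convert profiler records to Chrome Trace 'complete' (ph='X') events.

    The Trace Event format uses microseconds for ts/dur, while the
    profiler records nanoseconds, hence the /1000."""
    events = []
    for r in records:
        events.append({
            "name": r["name"],
            "ph": "X",                          # complete event (begin+duration)
            "ts": r["start_ns"] / 1000.0,       # microseconds
            "dur": r["duration_ns"] / 1000.0,   # microseconds
            "pid": 0,
            "tid": r["backend_id"],             # one lane per backend
        })
    return {"traceEvents": events}

trace = to_chrome_trace([
    {"name": "MUL_MAT", "start_ns": 0, "duration_ns": 120_000, "backend_id": 0},
])
print(json.dumps(trace))
```

The resulting file can be opened directly in `chrome://tracing` or Perfetto.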
CLI reference
| Argument | Description |
|---|---|
| `profile` (positional) | Path to profiler JSON file |
| `--chrome-trace FILE` | Export Chrome Trace Event format |
| `--html-viewer FILE` | Export interactive HTML timeline |
| `--html-max-records N` | Limit records in HTML output (0 = unlimited) |
| `--top-ops N` | Show top N operations by total time |
| `--top-kernels N` | Show top N longest individual kernels |
| `--inefficiency` | Rank operations by time per byte (higher = worse) |
HTML viewer features
The HTML viewer is a self-contained file with no external dependencies:
- Canvas timeline with per-backend lanes and color-coded operations
- Zoom controls (1s / 100ms / 1ms / 100us) and mouse drag navigation
- Minimap showing the full trace with a viewport indicator
- Hover tooltips with operation name, duration, shape, and bytes
- Stats table with collapsible tree: Operation → Backend → Tensor shape, showing % time, count, avg/min/max, and bandwidth
- Legend showing the most frequent operation types
What Gets Measured
OP events
Every tensor operation (`MUL_MAT`, `ADD`, `UNARY`, `FLASH_ATTN_EXT`, etc.) is recorded with:
- Timing: Start/end timestamps (nanosecond precision)
- Bytes: Output tensor size (`ggml_nbytes(node)`)
- Tensor shapes: Dimensions of `src[0]`, `src[1]`, and `src[2]` (when applicable)
- Bandwidth: Computed as `bytes / duration`; useful for identifying memory-bound vs compute-bound operations
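Because the records store bytes and nanoseconds, `bytes / duration_ns` is numerically equal to (decimal) GB/s; for example:

```python
# Illustrative numbers, not real profiler output:
# a 4096x4096 f32 output tensor computed in 5 ms
bytes_out = 4096 * 4096 * 4   # output tensor size: 64 MiB
duration_ns = 5_000_000       # 5 ms

# 1 byte/ns = 1e9 bytes/s, so bytes/ns reads directly as GB/s
gb_per_s = bytes_out / duration_ns
print(f"{gb_per_s:.2f} GB/s")  # 13.42 GB/s
```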
COPY events
Data transfers between backends:
- Direction: `copy_H2D` (host→device), `copy_D2H` (device→host), `copy_D2D` (device→device)
- Bytes: Exact transfer size
- Bandwidth: Transfer throughput
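A sketch of summarizing copy traffic per direction from the JSON records (made-up values; `type == 1` marks COPY events per the schema above):

```python
from collections import defaultdict

# Made-up COPY records following the export schema (type == 1)
records = [
    {"type": 1, "name": "copy_H2D", "duration_ns": 50_000, "bytes": 1 << 20},
    {"type": 1, "name": "copy_H2D", "duration_ns": 50_000, "bytes": 1 << 20},
    {"type": 1, "name": "copy_D2H", "duration_ns": 25_000, "bytes": 1 << 19},
]

total_ns = defaultdict(int)
total_bytes = defaultdict(int)
for r in records:
    if r["type"] == 1:  # COPY events only
        total_ns[r["name"]] += r["duration_ns"]
        total_bytes[r["name"]] += r["bytes"]

for name in sorted(total_ns):
    # bytes/ns reads directly as GB/s
    print(f"{name}: {total_bytes[name] / total_ns[name]:.2f} GB/s")
```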
MoE weight copies
When `--cpu-moe` is used, the scheduler selectively copies only the active experts. These partial copies are recorded as individual COPY events with the actual bytes transferred.
Programmatic API
For custom applications, the profiler can be controlled through the C API defined in ggml/include/ggml-profiler.h:
```c
// Enable profiling on a scheduler
ggml_backend_sched_set_profiling(sched, true);

// ... run inference ...

// Get raw records
const ggml_profile_record * records;
int n = ggml_backend_sched_get_profiling_records(sched, &records);

// Or export directly
ggml_backend_sched_print_profiling(sched);                       // stdout
ggml_backend_sched_export_profiling_json(sched, "profile.json"); // JSON file
ggml_backend_sched_export_profiling_text(sched, "profile.txt");  // text file
ggml_backend_sched_write_profiling_json(sched, fp);              // JSON to FILE*
ggml_backend_sched_write_profiling_text(sched, fp);              // text to FILE*

// Reset for next measurement window
ggml_backend_sched_reset_profiling(sched);
```
Records accumulate across multiple graph_compute calls until explicitly reset or the scheduler is freed.
Tips
- Prompt eval vs generation: The profiler captures all graph computes. During prompt evaluation you'll see larger batch sizes in tensor shapes; during generation, batch size is typically 1-2.
- Vulkan concurrent mode: When Vulkan dispatches multiple operations concurrently, they are reported as a single combined record spanning the full GPU time interval.
- Bandwidth interpretation: For compute ops, bandwidth = `output_bytes / duration`. This is not memory bandwidth; it is a proxy for throughput. `MUL_MAT` with low bandwidth typically indicates compute-bound behavior; high bandwidth indicates memory-bound.
- Large traces: For long inference runs, the JSON can be large. Use `--html-max-records` to downsample the HTML viewer, or use the Chrome Trace format, which handles large files well.
- Multiple backends: Backend IDs in the output correspond to the scheduler's priority order (0 = highest priority, typically GPU; last = CPU).