Monitoring GPU Usage on Linux Servers

Monitoring your GPU performance is essential to keep servers stable, avoid overheating, and ensure that workloads run efficiently.

This guide covers how to monitor GPUs on Linux servers using built-in and third-party tools.


1. Why GPU Monitoring Matters

GPU-intensive workloads such as AI training, rendering, or analytics draw a lot of power and generate significant heat.

Regular monitoring helps you:

  • 🖥️ Detect performance bottlenecks
  • 🌡️ Prevent overheating and thermal throttling
  • 📊 Optimize power and memory usage
  • 🧠 Plan scaling and capacity

2. Checking NVIDIA GPUs with nvidia-smi

For NVIDIA GPUs, the most commonly used tool is:

nvidia-smi

This displays:

  • GPU model and driver version
  • Power consumption
  • Memory usage
  • GPU utilization
  • Running processes

Example output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100         On   | 00000000:65:00.0 Off |                    0 |
|  30%   62C    P0   280W / 400W|  20000MiB / 81920MiB |    90%      Default  |
+-------------------------------+----------------------+----------------------+

Useful flags

  • nvidia-smi -l 1 → refresh every second
  • nvidia-smi --query-gpu=utilization.gpu,temperature.gpu,memory.used --format=csv → custom metrics
  • nvidia-smi dmon → scrolling per-device stats (power, temperature, utilization) refreshed in real time
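
These flags can be combined. For example, the sketch below (assuming the card of interest is GPU index 0) prints a compact report every two seconds:

nvidia-smi -i 0 -l 2 --query-gpu=utilization.gpu,temperature.gpu,memory.used --format=csv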

3. AMD GPUs with rocm-smi

For AMD GPUs (ROCm stack):

rocm-smi

This gives similar information to nvidia-smi, including:

  • Temperature
  • Power draw
  • Fan speed
  • VRAM usage

Example output

======================= ROCm System Management Interface =======================
GPU  Temp   AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    58.0c  220.0W  1500Mhz  800Mhz   45%  auto  300.0W  25%    80%
===============================================================================

💡 You may need to install the ROCm utilities first:

sudo apt install rocm-smi
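
Like nvidia-smi, rocm-smi accepts flags to limit the output to specific metrics. A hedged sketch (exact flag names can vary between ROCm releases, so check rocm-smi --help):

rocm-smi --showtemp --showuse --showmemuse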

4. Intel GPUs with intel_gpu_top

For Intel GPUs:

sudo intel_gpu_top

This provides a real-time dashboard, similar to htop for the CPU, showing:

  • GPU usage per engine (render, video, blitter, etc.)
  • Frequency and power draw
  • Memory bandwidth (IMC reads and writes)

📦 Install with:

sudo apt install intel-gpu-tools
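
For unattended logging rather than the interactive view, recent intel-gpu-tools builds can write periodic plain-text samples to a file. A sketch assuming a recent version (verify the flags with intel_gpu_top -h):

sudo intel_gpu_top -l -s 5000 -o intel_gpu_usage.log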


5. Using watch for Continuous Monitoring

You can combine monitoring commands with watch to update the display:

watch -n 1 nvidia-smi

or

watch -n 1 rocm-smi

This gives you live GPU metrics without external dashboards.
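
watch can also highlight values that changed between refreshes with its -d flag, which makes sudden spikes easier to spot:

watch -n 1 -d nvidia-smi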


6. Logging GPU Metrics Over Time

For long-running training or compute jobs, it’s useful to log metrics:

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu \
           --format=csv,noheader,nounits -l 5 >> gpu_usage_log.csv

  • Logs every 5 seconds
  • Can be plotted later for performance analysis
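
If the job outlives your SSH session, the logger can be detached from the shell. A minimal sketch using nohup so logging continues after you log out:

nohup nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu \
           --format=csv,noheader,nounits -l 5 >> gpu_usage_log.csv 2>&1 &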

📊 You can visualize these logs using tools like:

  • Python (Matplotlib / Pandas)
  • Grafana (with exporters)
  • Excel or Google Sheets

7. GPU Exporters for Grafana & Prometheus

For production environments, use exporters to integrate GPU metrics into monitoring dashboards.

Tool                               Vendor   Description
DCGM Exporter                      NVIDIA   Exposes GPU metrics to Prometheus
rocm-exporter                      AMD      AMD ROCm Prometheus exporter
Node Exporter + Intel GPU Plugin   Intel    Intel GPU monitoring integration
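
For example, NVIDIA's DCGM exporter is usually run as a container that exposes metrics on port 9400 for Prometheus to scrape (it requires the NVIDIA container toolkit on the host). A hedged sketch, with the image tag left as a placeholder, so check NVIDIA's documentation for the current release:

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
curl localhost:9400/metrics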

🖥️ Typical dashboard metrics:

  • GPU utilization
  • VRAM usage
  • Temperature and fan speed
  • Power consumption
  • ECC errors

8. Temperature & Thermal Control

Keeping the GPU at safe temperatures ensures:

  • Better performance
  • Longer lifespan
  • Fewer crashes

Temperature Range   Status               Action Needed
30–70 °C            Normal               None
70–85 °C            High load            Monitor closely
85 °C and above     Critical threshold   Check airflow, consider throttling
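
On NVIDIA GPUs you can compare the current reading against the card's own slowdown and shutdown thresholds:

nvidia-smi -q -d TEMPERATURE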

🔸 Consider using rack fans or liquid cooling for dense GPU setups.


9. Alerts & Automation

To avoid downtime:

  • Set Prometheus alerts on temperature or utilization
  • Use systemd services or cron jobs to monitor logs
  • Trigger shutdown or migration if temps exceed limits

Example (simple shell alert):

#!/bin/bash
# Check every GPU's temperature and send an alert if any exceeds 85 °C
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | while read -r TEMP; do
  if [ "$TEMP" -gt 85 ]; then
    echo "⚠️ GPU Overheating: $TEMP °C" | mail -s "GPU Alert" [email protected]
  fi
done
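
To run this check on a schedule, save the script (the path below is just a hypothetical example), make it executable, and register it as a cron job:

# Hypothetical crontab entry: run the temperature check every 5 minutes
*/5 * * * * /usr/local/bin/gpu_temp_alert.sh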

10. Best Practices

✅ Enable persistence mode on NVIDIA GPUs to reduce setup latency for repeated jobs:

sudo nvidia-smi -pm 1

✅ Monitor not just utilization, but also memory usage, temperature, and power.

✅ For clusters, centralize monitoring with Prometheus + Grafana.

✅ Automate alerts to prevent downtime.