Monitoring GPU Usage on Linux Servers
Monitoring GPU performance is essential for keeping servers stable, avoiding overheating, and ensuring that workloads run efficiently.
This guide covers how to monitor GPUs on Linux servers using built-in and third-party tools.
1. Why GPU Monitoring Matters
GPU-intensive workloads like AI training, rendering, or analytics can draw a lot of power and generate a lot of heat.
Regular monitoring helps you:
- Detect performance bottlenecks
- Prevent overheating and thermal throttling
- Optimize power and memory usage
- Plan scaling and capacity
2. Checking NVIDIA GPUs with nvidia-smi
For NVIDIA GPUs, the most commonly used tool is:
nvidia-smi
This displays:
- GPU model and driver version
- Power consumption
- Memory usage
- GPU utilization
- Running processes
Example output
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100          On  | 00000000:65:00.0 Off |                    0 |
| 30%   62C    P0   280W / 400W |  20000MiB / 81920MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
Useful flags
- nvidia-smi -l 1 (refresh every second)
- nvidia-smi --query-gpu=utilization.gpu,temperature.gpu,memory.used --format=csv (custom metrics as CSV)
- nvidia-smi dmon (real-time device monitoring, printed as scrolling rows of text rather than graphs)
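The CSV output of `--query-gpu` is straightforward to consume from a script. As a minimal sketch (the function name `summarize_gpu_csv` and the sample input are illustrative, not part of nvidia-smi), the following reads `csv,noheader,nounits` output on standard input and prints one labelled line per GPU, numbering GPUs by line order:

```shell
# summarize_gpu_csv: hypothetical helper, not part of nvidia-smi.
# Expects lines of "utilization, temperature, memory used", as produced by
#   nvidia-smi --query-gpu=utilization.gpu,temperature.gpu,memory.used \
#              --format=csv,noheader,nounits
summarize_gpu_csv() {
  local idx=0 util temp mem
  # Fields are separated by a comma plus a space; IFS=', ' strips both.
  while IFS=', ' read -r util temp mem; do
    echo "GPU ${idx}: util=${util}% temp=${temp}C mem=${mem}MiB"
    idx=$((idx + 1))
  done
}

# Sample input standing in for a real nvidia-smi call:
printf '90, 62, 20000\n' | summarize_gpu_csv
```

On a machine with an NVIDIA driver installed, the sample `printf` would be replaced by the real query command piped into the function.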
3. AMD GPUs with rocm-smi
For AMD GPUs (ROCm stack):
rocm-smi
This gives similar information to nvidia-smi, including:
- Temperature
- Power draw
- Fan speed
- VRAM usage
Example output
======================= ROCm System Management Interface =======================
GPU  Temp   AvgPwr  SCLK     MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
0    58.0c  220.0W  1500Mhz  800Mhz  45%  auto  300.0W  25%    80%
===============================================================================
You may need to install ROCm utilities:
sudo apt install rocm-smi
4. Intel GPUs with intel_gpu_top
For Intel GPUs:
sudo intel_gpu_top
This provides a real-time dashboard similar to htop for CPU:
- GPU usage per engine
- Frequency and power draw
- Memory bandwidth (on supported platforms)
Install with:
sudo apt install intel-gpu-tools
5. Using watch for Continuous Monitoring
You can combine monitoring commands with watch to update the display:
watch -n 1 nvidia-smi
or
watch -n 1 rocm-smi
This gives you live GPU metrics without external dashboards.
6. Logging GPU Metrics Over Time
For long-running training or compute jobs, it's useful to log metrics:
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu \
--format=csv,noheader,nounits -l 5 >> gpu_usage_log.csv
- Logs every 5 seconds
- Can be plotted later for performance analysis
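Before reaching for a plotting tool, a quick summary of the log is often enough. The sketch below assumes a single-GPU log in exactly the column order written by the command above (timestamp, utilization, memory used, temperature); `summarize_log` is a hypothetical helper name, and the sample rows are made up for illustration:

```shell
# summarize_log: hypothetical helper that reads the CSV log on stdin.
# Columns: timestamp, utilization (%), memory used (MiB), temperature (C).
summarize_log() {
  awk -F', *' '
    NF >= 4 {
      n++
      util_sum += $2
      if ($3 + 0 > max_mem)  max_mem  = $3 + 0
      if ($4 + 0 > max_temp) max_temp = $4 + 0
    }
    END {
      if (n == 0) { print "no samples"; exit }
      printf "samples=%d avg_util=%.1f%% max_mem=%dMiB max_temp=%dC\n",
             n, util_sum / n, max_mem, max_temp
    }'
}

# Made-up sample rows standing in for gpu_usage_log.csv:
printf '%s\n' \
  '2024/05/01 12:00:00.000, 90, 20000, 62' \
  '2024/05/01 12:00:05.000, 70, 21000, 65' | summarize_log
```

With a real log, the call would be `summarize_log < gpu_usage_log.csv`. On multi-GPU hosts the rows of all GPUs are interleaved, so adding `index` to the query and grouping by it would be needed.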
You can visualize these logs using tools like:
- Python (Matplotlib / Pandas)
- Grafana (with exporters)
- Excel or Google Sheets
7. GPU Exporters for Grafana & Prometheus
For production environments, use exporters to integrate GPU metrics into monitoring dashboards.
| Tool | Vendor | Description |
|---|---|---|
| DCGM Exporter | NVIDIA | Exposes GPU metrics to Prometheus |
| rocm-exporter | AMD | AMD ROCm Prometheus exporter |
| Node Exporter + Intel GPU Plugin | Intel | Intel GPU monitoring integration |
Typical dashboard metrics:
- GPU utilization
- VRAM usage
- Temperature and fan speed
- Power consumption
- ECC errors
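To make concrete what an exporter hands to Prometheus, the sketch below converts nvidia-smi CSV rows into the Prometheus text exposition format. It only illustrates the format: the helper name `to_prom_text` and the metric names are assumptions made for this example and are not the names DCGM Exporter uses.

```shell
# to_prom_text: hypothetical helper; metric names are illustrative only.
# Input: lines of "index, utilization, memory used, temperature" from
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,temperature.gpu \
#              --format=csv,noheader,nounits
to_prom_text() {
  awk -F', *' 'NF >= 4 {
    printf "gpu_utilization_percent{gpu=\"%s\"} %s\n", $1, $2
    printf "gpu_memory_used_mib{gpu=\"%s\"} %s\n", $1, $3
    printf "gpu_temperature_celsius{gpu=\"%s\"} %s\n", $1, $4
  }'
}

# Sample row standing in for a real nvidia-smi call:
printf '0, 90, 20000, 62\n' | to_prom_text
```

One way to use output like this is to write it to a `.prom` file in the directory that Node Exporter's textfile collector watches (set with `--collector.textfile.directory`). For production, a dedicated exporter from the table above is the more robust choice.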
8. Temperature & Thermal Control
Keeping the GPU at safe temperatures ensures:
- Better performance
- Longer lifespan
- Fewer crashes
| Temperature Range | Status | Action Needed |
|---|---|---|
| 30–70 °C | Normal | None |
| 70–85 °C | High load | Monitor closely |
| 85 °C+ | Critical threshold | Check airflow, consider throttling |
Consider using rack fans or liquid cooling for dense GPU setups.
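The temperature bands in the table can be expressed as a small shell function for use in scripts. This is a sketch: `classify_temp` is a hypothetical name, and because the table's ranges share their endpoints, it assumes that 70 °C counts as "High load", 85 °C counts as "Critical threshold", and anything below 70 °C (including values under 30 °C) counts as "Normal".

```shell
# classify_temp: hypothetical helper mapping an integer temperature in
# degrees Celsius to the status labels from the table.
# Assumption: boundary values belong to the higher band.
classify_temp() {
  local temp=$1
  if [ "$temp" -ge 85 ]; then
    echo "Critical threshold"
  elif [ "$temp" -ge 70 ]; then
    echo "High load"
  else
    echo "Normal"
  fi
}

classify_temp 62   # prints: Normal
classify_temp 78   # prints: High load
```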
9. Alerts & Automation
To avoid downtime:
- Set Prometheus alerts on temperature or utilization
- Use systemd services or cron jobs to monitor logs
- Trigger shutdown or migration if temps exceed limits
Example (simple shell alert):
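#!/bin/bash
# Placeholder recipient (the address in the original was not recoverable); set your own.
ALERT_EMAIL="admin@example.com"
# nvidia-smi prints one temperature per line, one line per GPU, so check each.
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | while read -r TEMP; do
  if [ "$TEMP" -gt 85 ]; then
    echo "GPU Overheating: $TEMP °C" | mail -s "GPU Alert" "$ALERT_EMAIL"
  fi
done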
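A single hot reading can be a transient spike, so alerts are often less noisy when they require several consecutive readings above the limit. As a hedged sketch, assuming a single-GPU log in the format written by the logging command in section 6 (temperature in the fourth column), the hypothetical helper `sustained_overheat` succeeds only when the last COUNT readings all exceed LIMIT:

```shell
# sustained_overheat LIMIT COUNT: hypothetical helper.
# Reads the CSV log on stdin and returns success (exit status 0) only if
# the last COUNT rows all have a temperature (column 4) above LIMIT.
sustained_overheat() {
  local limit=$1 count=$2
  tail -n "$count" | awk -F', *' -v limit="$limit" -v count="$count" '
    $4 + 0 > limit { hot++ }
    END { exit !(NR == count && hot == count) }'
}

# Made-up sample rows; the last three readings are 88, 90 and 91.
printf '%s\n' \
  '2024/05/01 12:00:00.000, 95, 20000, 84' \
  '2024/05/01 12:00:05.000, 96, 20000, 88' \
  '2024/05/01 12:00:10.000, 97, 20000, 90' \
  '2024/05/01 12:00:15.000, 97, 20000, 91' |
  sustained_overheat 85 3 && echo "sustained overheating"
```

Run from cron against the real log (`sustained_overheat 85 3 < gpu_usage_log.csv`), the exit status could gate the mail command or a shutdown step.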
10. Best Practices
Enable persistence mode on NVIDIA GPUs:
sudo nvidia-smi -pm 1
- Use persistence mode to reduce latency for repeated jobs.
- Monitor not just utilization, but also memory usage, temperature, and power.
- For clusters, centralize monitoring with Prometheus + Grafana.
- Automate alerts to prevent downtime.