NVIDIA GPU monitoring with Netdata
Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the `nvidia-smi` CLI tool.
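As a quick sanity check that the tool is installed and that your GPU reports the kind of metrics charted below, you can query it directly. The field names used here are standard `nvidia-smi` query properties, but the set of supported fields varies by GPU model and driver version:

```bash
# List the GPUs the driver can see.
nvidia-smi -L

# One-shot snapshot of the kind of per-GPU metrics this plugin collects.
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,fan.speed,power.draw,clocks.sm,temperature.gpu \
           --format=csv
```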
Requirements and Notes:
- You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support it. That mostly means the newer high-end models used for AI/ML and crypto, or the professional range; read more about nvidia_smi.
- You must enable this plugin, as it is disabled by default due to minor performance issues (see the sketch after this list).
- On some systems, when the GPU is idle, the `nvidia-smi` tool unloads and there is added latency again when it is next queried. If you are running GPUs under constant workload, this is not likely to be an issue.
- The `nvidia-smi` tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See the discussion here: https://github.com/netdata/netdata/pull/4357
- Contributions are welcome.
- Make sure the `netdata` user can execute `/usr/bin/nvidia-smi`, or wherever your binary is (also shown in the sketch after this list).
- If the `nvidia-smi` process is not killed after a Netdata restart, you need to turn off `loop_mode`.
- `poll_seconds` is how often, in seconds, the tool is polled, expressed as an integer.
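A minimal sketch of the two setup steps above, assuming the default config directory `/etc/netdata` and a binary at `/usr/bin/nvidia-smi`; adjust both paths to match your install:

```bash
# Enable the module: the python.d collector ships with nvidia_smi disabled,
# so open python.d.conf and set "nvidia_smi: yes".
cd /etc/netdata                        # replace if your Netdata config directory differs
sudo ./edit-config python.d.conf

# Confirm the netdata user can actually execute the binary the plugin will call.
sudo -u netdata /usr/bin/nvidia-smi -L
```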
It produces the following charts:

- GPU utilization
- memory allocation
- memory utilization
- fan speed
- power usage
- clock speed
- PCI bandwidth
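Once the module is enabled and the agent restarted (see the configuration notes below), you can confirm these charts are being collected by asking the agent's local API for its chart list. This sketch assumes the default dashboard port `19999` and that the module's chart IDs are prefixed with `nvidia_smi`:

```bash
# List the chart IDs exposed by the running agent and keep only the nvidia_smi ones.
curl -s http://localhost:19999/api/v1/charts | grep -o '"nvidia_smi[^"]*"' | sort -u
```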
Configuration:

Edit the `python.d/nvidia_smi.conf` configuration file using `edit-config` from your agent's config directory, which is typically at `/etc/netdata`.

```bash
cd /etc/netdata   # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d/nvidia_smi.conf
```
Sample configuration:

```yaml
loop_mode    : yes
poll_seconds : 1
```
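As with any `edit-config` change, the new settings take effect after the agent restarts. The command below assumes a systemd-based install; use your service manager's equivalent otherwise. If you hit the lingering `nvidia-smi` process issue mentioned in the notes above, set `loop_mode : no` in the same file before restarting:

```bash
sudo systemctl restart netdata
```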