DCGM Exporter: Fixing Incorrect MIG GPU Profile Reporting
Introduction
Hey guys! Today, we're diving into a tricky issue: the DCGM Exporter reporting an incorrect MIG GPU profile. This breaks Grafana dashboards that key on the profile label and makes your GPU monitoring inaccurate. We'll break down the issue, walk through the environment and configuration, and show how to reproduce it. If you're hitting the same problem, you're in the right place. Let's get started and figure this out together.
Background
The DCGM (Data Center GPU Manager) Exporter is a widely used tool for monitoring NVIDIA GPUs in Kubernetes environments. It collects and exposes metrics such as GPU utilization and memory usage. When MIG (Multi-Instance GPU) is enabled, the exporter should label each metric with the profile of the MIG instance it belongs to. Sometimes it reports the wrong profile, which leads to confusion and broken dashboards. This article covers one such case: the DCGM Exporter reports a 1g.11gb profile instead of the correct 1g.12gb profile.
Problem Description
The core issue is that the DCGM Exporter is reporting the wrong MIG GPU profile. Specifically, when an H100 NVL GPU is configured with the all-1g.12gb MIG profile, the exporter reports each instance as 1g.11gb. This discrepancy breaks Grafana dashboards that rely on the profile label. The problem occurs with both the single and mixed MIG strategies, so switching strategy does not work around it. Anything that filters or groups on the label, such as dashboards and alerts, ends up looking for a profile that is not configured on the node.
Technical Details
- Environment: Kubernetes
- GPU: H100 NVL
- MIG Profile: all-1g.12gb
- Reported Profile (Incorrect): 1g.11gb
- DCGM Exporter Version: 4.2.3-4.1.3
- Driver Version: 535.230.02
- Kubernetes Version: v1.32.3
- Operator Version: 25.3.2
Example Metrics
Here's a snippet of the metrics reported by the DCGM Exporter, showing the incorrect GPU_I_PROFILE label:
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-153bdcb7-df30-40e2-3f2c-379902ba72fa",pci_bus_id="00000000:00:06.0",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="7",Hostname="vijay-1-bodha-h100-nvl-nfm45-5h7w7-gj5nn-9sbrb",DCGM_FI_DRIVER_VERSION="535.230.02"} 12
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-153bdcb7-df30-40e2-3f2c-379902ba72fa",pci_bus_id="00000000:00:06.0",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="8",Hostname="vijay-1-bodha-h100-nvl-nfm45-5h7w7-gj5nn-9sbrb",DCGM_FI_DRIVER_VERSION="535.230.02"} 12
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-153bdcb7-df30-40e2-3f2c-379902ba72fa",pci_bus_id="00000000:00:06.0",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="9",Hostname="vijay-1-bodha-h100-nvl-nfm45-5h7w7-gj5nn-9sbrb",DCGM_FI_DRIVER_VERSION="535.230.02"} 12
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-153bdcb7-df30-40e2-3f2c-379902ba72fa",pci_bus_id="00000000:00:06.0",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="10",Hostname="vijay-1-bodha-h100-nvl-nfm45-5h7w7-gj5nn-9sbrb",DCGM_FI_DRIVER_VERSION="535.230.02"} 12
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-153bdcb7-df30-40e2-3f2c-379902ba72fa",pci_bus_id="00000000:00:06.0",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="11",Hostname="vijay-1-bodha-h100-nvl-nfm45-5h7w7-gj5nn-9sbrb",DCGM_FI_DRIVER_VERSION="535.230.02"} 12
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-153bdcb7-df30-40e2-3f2c-379902ba72fa",pci_bus_id="00000000:00:06.0",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="12",Hostname="vijay-1-bodha-h100-nvl-nfm45-5h7w7-gj5nn-9sbrb",DCGM_FI_DRIVER_VERSION="535.230.02"} 12
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-153bdcb7-df30-40e2-3f2c-379902ba72fa",pci_bus_id="00000000:00:06.0",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="13",Hostname="vijay-1-bodha-h100-nvl-nfm45-5h7w7-gj5nn-9sbrb",DCGM_FI_DRIVER_VERSION="535.230.02"} 12
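If you want to pull the profile label out of a metrics dump yourself, a rough Python sketch like the one below may help. It is not part of the DCGM Exporter; the helper names parse_labels and profiles_by_instance are made up for illustration, the sample text is a shortened version of the output above, and the regex assumes label values contain no escaped quotes (true for this sample, not guaranteed in general).

```python
import re
from collections import Counter

# Matches one label pair such as GPU_I_PROFILE="1g.11gb".
# Assumption: label values contain no escaped quotes.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_labels(line: str) -> dict[str, str]:
    """Return the label set of one exposition-format sample line."""
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end == -1:
        return {}
    return dict(LABEL_RE.findall(line[start + 1:end]))


def profiles_by_instance(text: str, metric: str = "DCGM_FI_DEV_FB_USED") -> dict[str, str]:
    """Map GPU_I_ID -> GPU_I_PROFILE for every sample of `metric`."""
    result: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith(metric + "{"):
            continue  # skips "# TYPE ..." comments and other metrics
        labels = parse_labels(line)
        if "GPU_I_ID" in labels:
            result[labels["GPU_I_ID"]] = labels.get("GPU_I_PROFILE", "")
    return result


# Shortened sample; the real lines carry more labels (UUID, Hostname, ...).
SAMPLE = """# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",GPU_I_PROFILE="1g.11gb",GPU_I_ID="7"} 12
DCGM_FI_DEV_FB_USED{gpu="0",GPU_I_PROFILE="1g.11gb",GPU_I_ID="8"} 12
"""

if __name__ == "__main__":
    found = profiles_by_instance(SAMPLE)
    print(found)                    # instance ID -> reported profile
    print(Counter(found.values()))  # how many instances report each profile
```

Feeding it the full text scraped from the exporter's metrics endpoint instead of SAMPLE should give one entry per MIG instance.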
Expected Outcome
Ideally, the GPU_I_PROFILE label should accurately reflect the configured MIG GPU profile, which in this case is 1g.12gb. This ensures that monitoring tools and dashboards display the correct information, allowing for accurate resource management and troubleshooting.
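To make the expectation concrete, here is a small Python sketch that compares reported profiles against the configured one. Treat it as an illustration only: expected_profile and find_mismatches are hypothetical helpers, and the rule "strip the all- prefix from the MIG config name" is an assumption based on the all-1g.12gb name used in this article, not a documented contract.

```python
def expected_profile(mig_config: str) -> str:
    """Derive the per-instance profile from a config name like 'all-1g.12gb'.

    Assumption: the config follows an 'all-<profile>' naming convention.
    Any other name is returned unchanged.
    """
    prefix = "all-"
    if mig_config.startswith(prefix):
        return mig_config[len(prefix):]
    return mig_config


def find_mismatches(reported: dict[str, str], mig_config: str) -> dict[str, str]:
    """Return the instances whose reported profile differs from the expected one."""
    want = expected_profile(mig_config)
    return {iid: profile for iid, profile in reported.items() if profile != want}


if __name__ == "__main__":
    # Instance IDs 7 through 13, as in the sample metrics above.
    reported = {str(i): "1g.11gb" for i in range(7, 14)}
    bad = find_mismatches(reported, "all-1g.12gb")
    print(f"expected {expected_profile('all-1g.12gb')}, mismatching instances: {sorted(bad, key=int)}")
```

A check like this could run as a smoke test after a MIG reconfiguration, so a wrong label is caught before a dashboard goes blank.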
Environment Details
Understanding the environment in which this issue occurs is crucial for troubleshooting. Here’s a breakdown of the key components and configurations:
Hardware and Software
- GPU Model: NVIDIA H100 NVL. A data center GPU aimed at AI, machine learning, and high-performance computing workloads. The issue was observed on this model.
- Driver Version: 535.230.02. Pinning the exact driver version helps isolate any driver-specific behavior that might contribute to the problem.
- Kubernetes Version: v1.32.3. Knowing the Kubernetes version helps rule out compatibility issues or version-specific bugs that might affect the DCGM Exporter.
- Operating System: Ubuntu 22.04. A widely used and supported Linux distribution, though OS-level configuration can still affect how GPU monitoring tools interact with the hardware.
- DCGM Exporter Image: nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04. Exporter versions can differ in MIG support and bug fixes, so the exact image matters for reproducing the issue and testing potential solutions.
- GPU Operator Version: 25.3.2. The GPU Operator deploys and manages NVIDIA GPU drivers and related components in Kubernetes, so its version can affect how the DCGM Exporter is configured and deployed.
Node Labels
Node labels provide additional context about the environment. Here are some relevant labels from the affected node:
{
"beta.kubernetes.io/arch": "amd64",
"beta.kubernetes.io/instance-type": "ahv-vm",
"beta.kubernetes.io/os": "linux",