How Do I Rectify Failures When the NVIDIA Driver Is Used to Start Containers on GPU Nodes?
Did a Resource Scheduling Failure Event Occur on a Cluster Node?
Symptom
A node is running properly and has GPU resources, but the following error is displayed:
0/9 nodes are available: 9 insufficient nvidia.com/gpu
Analysis
- Check whether the NVIDIA label is attached to the node.
- Check whether the NVIDIA driver is running properly.
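The first check can be sketched with kubectl. This is a minimal sketch, not the documented procedure: the label key is not named here, so the sketch only lists node labels that mention "nvidia", and the helper names `nodes_without_gpu` and `check_gpu_nodes` are hypothetical. It also assumes kubectl is configured for the cluster.

```shell
#!/usr/bin/env bash
# Read "NAME GPU" rows on stdin (header skipped) and print the nodes that
# advertise no allocatable nvidia.com/gpu ("<none>" or "0").
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Hypothetical wrapper: list nodes without allocatable GPUs, then show node
# labels that mention "nvidia" (the exact label key is an assumption).
check_gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    | nodes_without_gpu
  kubectl get nodes --show-labels | grep -i nvidia
}
```

Calling `check_gpu_nodes` from a machine with cluster access prints the nodes that report no schedulable GPU, which is what the "insufficient nvidia.com/gpu" event points at.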
Log in to the node where the add-on is running and view the driver installation log at the following path:
/opt/cloud/cce/nvidia/nvidia_installer.log
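As a rough aid for reading that log, the following sketch prints only the lines that look like failures. The search keywords are an assumption rather than a documented log format, and `installer_errors` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Path taken from the text above; pass another file as $1 to override it.
DEFAULT_LOG="/opt/cloud/cce/nvidia/nvidia_installer.log"

# Print numbered lines that contain "error" or "fail" (case-insensitive).
# The keywords are an assumption, not a documented log format.
installer_errors() {
  local log="${1:-$DEFAULT_LOG}"
  if [ ! -r "$log" ]; then
    echo "cannot read $log" >&2
    return 2
  fi
  grep -inE 'error|fail' "$log"
}
```

Run `installer_errors` on the node. If it prints nothing, `tail -n 50` on the same file shows how the installation ended.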
View the standard output logs of the NVIDIA container.
Obtain the container ID by running the following command:
docker ps -a | grep nvidia
View logs by running the following command:
docker logs <Container ID>
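The two commands above can be combined so that the container ID does not have to be copied by hand. This is a minimal sketch that assumes the container ID is the first column of the `docker ps -a` output; the function names and the default of 50 log lines are illustrative choices.

```shell
#!/usr/bin/env bash
# Print the IDs of containers whose `docker ps -a` row mentions "nvidia".
# Reads the listing on stdin, so the filter can be tried without Docker.
nvidia_container_ids() {
  awk 'NR > 1 && /nvidia/ { print $1 }'
}

# Show the last N log lines (default 50) of every matching container.
show_nvidia_logs() {
  local tail_lines="${1:-50}"
  docker ps -a | nvidia_container_ids | while read -r id; do
    echo "=== logs for container ${id} ==="
    docker logs --tail "${tail_lines}" "${id}" 2>&1
  done
}
```

For example, `show_nvidia_logs 200` prints the last 200 log lines of each NVIDIA container on the node, including exited ones.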
What Should I Do If the NVIDIA Version Reported by a Service and the CUDA Version Do Not Match?
Run the following command to check the CUDA version in the container:
cat /usr/local/cuda/version.txt
Check whether the CUDA versions supported by the NVIDIA driver installed on the node where the container runs include the CUDA version used in the container.
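The comparison can be sketched as follows. It rests on several assumptions: that version.txt contains a line of the form "CUDA Version X.Y.Z", that the highest CUDA version supported by the driver is read manually (for example from the "CUDA Version" field in the `nvidia-smi` header on the node), and that comparing major.minor is sufficient. The helper names are hypothetical, and `sort -V` requires GNU coreutils.

```shell
#!/usr/bin/env bash
# Extract "X.Y.Z" from version.txt content read on stdin.
# Assumes a line of the form "CUDA Version X.Y.Z".
parse_cuda_version_txt() {
  awk '/CUDA Version/ { print $3; exit }'
}

# Reduce a version string to major.minor, e.g. 10.1.243 -> 10.1.
major_minor() {
  echo "$1" | cut -d. -f1,2
}

# Succeed if the container CUDA version ($1) is not newer than the highest
# CUDA version supported by the node driver ($2), compared on major.minor.
cuda_supported() {
  local want max
  want=$(major_minor "$1")
  max=$(major_minor "$2")
  [ "$(printf '%s\n%s\n' "$want" "$max" | sort -V | head -n1)" = "$want" ]
}
```

Inside the container, `parse_cuda_version_txt < /usr/local/cuda/version.txt` yields the container version. Passing it together with the value read on the node, as in `cuda_supported 10.1.243 11.4`, succeeds when the driver is new enough.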
Node Running FAQs
- What Should I Do If a Cluster Is Available But Some Nodes Are Unavailable?
- How Do I Troubleshoot the Failure to Remotely Log In to a Node in a CCE Cluster?
- How Do I Log In to a Node Using a Password and Reset the Password?
- How Do I Collect Logs of Nodes in a CCE Cluster?
- What Can I Do If the Container Network Becomes Unavailable After yum update Is Used to Upgrade the OS?
- What Should I Do If the vdb Disk of a Node Is Damaged and the Node Cannot Be Recovered After Reset?
- Which Ports Are Used to Install kubelet on CCE Cluster Nodes?
- How Do I Configure a Pod to Use the Acceleration Capability of a GPU Node?
- What Should I Do If I/O Suspension Occasionally Occurs When SCSI EVS Disks Are Used?
- What Should I Do If Excessive Docker Audit Logs Affect the Disk I/O?
- How Do I Fix an Abnormal Container or Node Due to No Thin Pool Disk Space?
- Which Ports Does a Node Listen On?
- How Do I Rectify Failures When the NVIDIA Driver Is Used to Start Containers on GPU Nodes?
- What Should I Do If a Node Does Not Synchronize with the NTP Clock Source?
- What Should I Do If the Data Disk Usage Is High Because a Large Volume of Data Is Written Into the Log File?
- Why Does My Node Memory Usage Obtained by Running the kubelet top node Command Exceed 100%?