Why Is My Training Job Pending?
Common Reasons for Pending Jobs
When you submit a training job, it may remain in a pending state for various reasons. Below are some common issues that can cause a job to be pending and their solutions.
Resource Availability
If there are not enough resources available in the cluster to run your job, it will remain pending until resources become available. This can happen if the cluster is busy with other jobs or if your job requests more resources than are currently available.
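As an illustration of the check described above, the following minimal sketch compares a job's request against the resources currently free in the cluster and reports where the request falls short. The `Resources` type, its field names, and the `shortfall` function are hypothetical and are not part of any scheduler API; a real scheduler evaluates availability per node and applies additional rules.

```python
from dataclasses import dataclass, fields


@dataclass
class Resources:
    """Hypothetical resource bundle; field names are assumptions."""
    cpus: float
    gpus: int
    memory_gb: float


def shortfall(request: Resources, free: Resources) -> dict:
    """Return how much of each resource is missing for the request to fit.

    An empty dict means the request fits into the free resources.
    """
    missing = {}
    for f in fields(Resources):
        needed = getattr(request, f.name)
        available = getattr(free, f.name)
        if needed > available:
            missing[f.name] = needed - available
    return missing


if __name__ == "__main__":
    request = Resources(cpus=4, gpus=2, memory_gb=16)
    free = Resources(cpus=8, gpus=1, memory_gb=32)
    # A non-empty result means the job would stay pending.
    print(shortfall(request, free))
```

In this sketch, a job whose shortfall is non-empty would stay pending until other jobs finish and release the missing resources, or until the request is reduced.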
Resources Are Not Available for the Project
If your project is set to CPU-only and you submit a job that requests GPU resources, the job will remain pending until you either change the project settings or remove the GPU request from the job submission. The same applies to GPU-only projects that try to run CPU jobs.
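The rule above can be sketched as a simple pre-submission check. The mode names (`"cpu-only"`, `"gpu-only"`, `"any"`) and the `is_compatible` function are assumptions made for illustration; the actual project settings and their names depend on your cluster configuration.

```python
def is_compatible(project_mode: str, requested_gpus: int) -> bool:
    """Check whether a job's GPU request matches the project's resource mode.

    project_mode is a hypothetical setting: "cpu-only", "gpu-only", or "any".
    """
    if project_mode == "cpu-only":
        # CPU-only projects cannot run jobs that request GPUs.
        return requested_gpus == 0
    if project_mode == "gpu-only":
        # GPU-only projects cannot run jobs without a GPU request.
        return requested_gpus > 0
    # Any other mode is treated as unrestricted in this sketch.
    return True


if __name__ == "__main__":
    for mode, gpus in [("cpu-only", 1), ("gpu-only", 0), ("cpu-only", 0)]:
        status = "ok" if is_compatible(mode, gpus) else "would stay pending"
        print(f"{mode}, gpus={gpus}: {status}")
```

Running a check like this before submission helps catch a mismatch early instead of leaving the job pending.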
Scheduler Fairness
The scheduler is configured to ensure fairness among jobs. If other jobs in the queue are entitled to resources under their quota or have been waiting a long time, the scheduler may, based on your cluster usage and department quota, move your job back to pending until resources are available.
| This is not a bug, but a scheduler feature to ensure that all users have fair access to resources. |
| When a job is rescheduled, it restarts training from the beginning. Refer to checkpoints in the best practices guide for more information. |
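To illustrate why checkpoints matter when a job is rescheduled, here is a minimal, framework-agnostic sketch of a training loop that periodically saves its progress and resumes from the last checkpoint on restart. The function names, the JSON file format, and the `stop_at` parameter (which simulates an interruption) are all assumptions for this example; a real training job would save model and optimizer state using its framework's own checkpoint utilities.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write the checkpoint atomically so an interruption cannot corrupt it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, state) from the checkpoint, or a fresh start if none exists."""
    if not os.path.exists(path):
        return 0, {"total": 0}
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path, total_steps, checkpoint_every=10, stop_at=None):
    """Toy training loop: resumes from the last checkpoint if one exists.

    stop_at simulates the job being interrupted before it finishes.
    """
    step, state = load_checkpoint(path)
    while step < total_steps:
        if stop_at is not None and step >= stop_at:
            return step, state  # interrupted; work since last checkpoint is lost
        state["total"] += step  # stand-in for one training step
        step += 1
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, step, state)
    return step, state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        train(ckpt, total_steps=50, stop_at=25)   # first run is interrupted
        print(train(ckpt, total_steps=50))        # second run resumes and finishes
```

With this pattern, a rescheduled job repeats only the steps performed after the most recent checkpoint rather than the whole run. The checkpoint file must be stored somewhere that survives the restart, such as shared or persistent storage.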