Why Did My Training Job Fail?
Job failures can occur for various reasons, including configuration errors, resource limits, or incorrect job specifications. Below are some common problems and their solutions.
Investigating the Failure
Looking Up Logs
To investigate the failure of a training job, use the following command to view its logs:
$ runai-bgu logs <job-name>
This command will provide you with the logs of the specified job, which can help you identify whether the issue came from your code.
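Once the log output has been saved to a text file (for example by redirecting the command's output in your shell), a short script can flag common error signatures. The sketch below is illustrative only: the `SIGNATURES` table and the `scan_log` helper are hypothetical names, not part of `runai-bgu`, and the patterns cover only a few standard Python and PyTorch error messages.

```python
import re
from typing import Dict, List

# Hypothetical mapping from error patterns to a likely cause.
# Extend it with the messages your own code tends to produce.
SIGNATURES: Dict[str, str] = {
    r"ModuleNotFoundError|ImportError": "missing package or wrong Conda environment",
    r"FileNotFoundError|No such file or directory": "wrong path or missing file",
    r"MemoryError|CUDA out of memory": "job ran out of CPU or GPU memory",
}


def scan_log(text: str) -> List[str]:
    """Return a hint for every known signature found in the log text."""
    hints: List[str] = []
    for pattern, hint in SIGNATURES.items():
        if re.search(pattern, text):
            hints.append(hint)
    return hints


if __name__ == "__main__":
    sample = "Traceback (most recent call last):\nModuleNotFoundError: No module named 'torch'"
    for hint in scan_log(sample):
        print("Possible cause:", hint)
```

An empty result does not mean the job is healthy; it only means none of the listed patterns appeared in the log.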
Describing the Job
To get more details about the job, you can use the following command:
$ runai-bgu describe <job-name>
This command will give you detailed information about the job, including its status, resource usage, and any error messages that may have been generated during its execution. This information can help you diagnose the issue.
Common Errors and Solutions
OOMKill Error
An OOMKill error means the job was terminated because it exceeded the memory limit set for it, that is, the job needs more memory than was allocated. To resolve this issue, try the following:
- Increase the memory limit for the job by specifying a higher value in the job submission command.
- Optimize your code to use less memory, such as by reducing the batch size or using more efficient data structures.
- If you are using a large dataset, consider using data streaming or chunking techniques to process the data in smaller parts.
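The chunking idea from the last point can be sketched in plain Python. This is a minimal illustration, not a recommended pipeline: `iter_chunks` and `chunked_mean` are hypothetical helper names, and the default chunk size of 1024 is an arbitrary assumption. Only one chunk is held in memory at a time.

```python
from typing import Iterable, Iterator, List


def iter_chunks(items: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk: List[float] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk


def chunked_mean(items: Iterable[float], chunk_size: int = 1024) -> float:
    """Compute a mean while keeping only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(items, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


if __name__ == "__main__":
    # A generator stands in for a dataset that is too large to load at once.
    print(chunked_mean((float(i) for i in range(10)), chunk_size=3))
```

The same pattern applies when the input is a file read line by line or a data loader that yields batches: the peak memory depends on the chunk size rather than on the dataset size.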
Terminated Error without Logs
If your job is terminated without any logs, it may be due to a configuration issue or a CPU or GPU resource limit being exceeded. Here are some steps to troubleshoot this issue:
- Check the job configuration to ensure that all required parameters are set correctly, including resource limits.
- Verify that the job is not exceeding the CPU or GPU resource limits set for the project.
Common Errors in Logs
Here are some common problems you might encounter when submitting jobs via the CLI:
- The Conda environment was not activated, the wrong environment was activated, or a required package is not installed in the active environment.
- The path to the code file or executable is incorrect, or the file does not exist.
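Both problems can be caught before submitting by running a small pre-flight check inside the environment the job will use. The sketch below is an assumption-laden example: `preflight` is a hypothetical helper, and the script path, package names, and environment name in the usage section are placeholders. It relies on the `CONDA_DEFAULT_ENV` variable that Conda sets when an environment is activated, and it expects top-level package names.

```python
import importlib.util
import os
from pathlib import Path
from typing import List, Optional, Sequence


def preflight(
    script_path: str,
    required_packages: Sequence[str],
    expected_env: Optional[str] = None,
) -> List[str]:
    """Return a list of detected problems; an empty list means none were found."""
    problems: List[str] = []

    # 1. The code file must exist at the given path.
    if not Path(script_path).is_file():
        problems.append(f"script not found: {script_path}")

    # 2. Optionally compare the active Conda environment with the expected one.
    active_env = os.environ.get("CONDA_DEFAULT_ENV")
    if expected_env is not None and active_env != expected_env:
        problems.append(
            f"active Conda environment is {active_env!r}, expected {expected_env!r}"
        )

    # 3. Each required top-level package must be importable here.
    for name in required_packages:
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not importable: {name}")

    return problems


if __name__ == "__main__":
    # Placeholder values; replace them with your own script, packages, and environment.
    for problem in preflight("train.py", ["numpy"], expected_env="my-env"):
        print("Problem:", problem)
```

The check only reflects the environment it runs in, so it is meaningful only when executed with the same image and Conda environment as the job itself.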