Why Did My X11 Workspace Fail?
An X11 Workspace can fail for several reasons, including configuration issues, resource limits, or problems that occur after launch. This page explains how to investigate a failure and lists common causes and fixes.
Investigating the Failure
Looking Up Logs
To investigate a failed job, view its logs with the following command:
$ runai-bgu logs <job-name>
This command prints the logs of the specified job, which can help you identify the issue.
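Long logs are easier to search once they are saved to a file. The sketch below is a hypothetical helper, `scan_log`, that prints the lines of a saved log that commonly point to a failure. The function name and the search patterns are assumptions chosen for illustration, not part of `runai-bgu`, and the patterns are not exhaustive.

```shell
# Hypothetical helper: print log lines that commonly indicate a failure.
# The patterns below are illustrative assumptions, not an exhaustive list.
scan_log() {
    log_file="$1"
    # -n prefixes each match with its line number, -i ignores case,
    # -E enables the | alternation between patterns
    grep -n -i -E 'traceback|error|exception|killed|out of memory' "$log_file"
}

# Usage on the cluster (my-job is a placeholder job name):
#   runai-bgu logs my-job > my-job.log
#   scan_log my-job.log
```

The line numbers in the output make it easy to open the saved log at the right place and read the surrounding context.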
Describing the Job
To get more details about the job, you can use the following command:
$ runai-bgu describe <job-name>
This command shows detailed information about the job, including its status and any error messages generated during execution, which can help you diagnose the problem.
If the job status is Failed, the description typically records whether the cause was OOMKill or Terminated.
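To decide which of the sections below applies, the describe output can be saved to a file and checked for the failure type. This is a minimal sketch with a hypothetical helper, `failure_reason`. It assumes the saved text contains the word OOMKill or Terminated somewhere; the actual layout of the describe output may differ.

```shell
# Hypothetical helper: classify a saved describe output by failure keyword.
# Assumes the text contains the word OOMKill or Terminated somewhere;
# the real output layout may differ.
failure_reason() {
    desc_file="$1"
    if grep -q -i 'OOMKill' "$desc_file"; then
        echo "OOMKill"
    elif grep -q -i 'Terminated' "$desc_file"; then
        echo "Terminated"
    else
        echo "unknown"
    fi
}

# Usage on the cluster (my-job is a placeholder job name):
#   runai-bgu describe my-job > my-job.describe
#   failure_reason my-job.describe
```

OOMKill is checked first so that a description mentioning both words is treated as a memory problem.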
Common Failure Types and Fixes
OOMKill (Out of Memory)
Symptom: Logs may show memory allocation errors before the container is killed.
Why it happens: Your process used more memory than the workspace allows. This is usually CPU RAM; in some cases, running out of GPU memory also causes the process to fail.
What to do:
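- Increase the memory requested for the workspace (and its memory limit) when launching it, as long as the request stays within your quota.
- Reduce memory pressure, for example by using a smaller batch size or processing data in smaller chunks.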
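Before changing the request, it can help to confirm the limit that is actually in effect inside the workspace. The sketch below assumes the workspace runs as a Linux container whose memory limit is enforced through cgroups. The function name `memory_limit_bytes` is hypothetical, and neither file is guaranteed to exist in every environment.

```shell
# Sketch: print the container memory limit in bytes, if the kernel exposes it.
# Tries the cgroup v2 file first, then the cgroup v1 file.
# On cgroup v2 the value "max" means that no limit is set.
memory_limit_bytes() {
    if [ -r /sys/fs/cgroup/memory.max ]; then
        cat /sys/fs/cgroup/memory.max
    elif [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
        cat /sys/fs/cgroup/memory/memory.limit_in_bytes
    else
        echo "unknown"
    fi
}

# Usage inside the running workspace:
#   memory_limit_bytes
```

Comparing this value with the peak memory your process needs shows whether raising the limit or reducing memory use is the more realistic fix.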
Terminated
Symptom: Logs may be truncated or absent.
Common causes:
- The process exited with a non-zero exit code, for example after an uncaught exception.
- The job was stopped or restarted manually.
What to do:
- Check the logs and the job description for the last exception or traceback, and fix the root cause (paths, credentials, bad arguments).
- Ensure the requested resources (CPU/GPU/memory) match the workload's needs.
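Because logs may be truncated or absent after a termination, one option is to have the workload write its own record of how it ended. The following is a minimal sketch of a hypothetical wrapper, `run_and_record`, that appends a command's output and exit code to a file. The function name, the log path, and the example command are all assumptions; the log should live on storage that survives the workspace.

```shell
# Hypothetical wrapper: run a command, keep its output, and record its exit code.
run_and_record() {
    log_file="$1"
    shift
    # Append both stdout and stderr of the command to the log file
    "$@" >> "$log_file" 2>&1
    status=$?
    echo "exit code: $status" >> "$log_file"
    # Pass the original exit code back to the caller
    return "$status"
}

# Usage (the path and command are placeholders):
#   run_and_record /path/to/persistent/run.log python train.py
```

A non-zero exit code in the last line of the file, together with the traceback above it, usually identifies the root cause even when the job logs themselves are gone.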