Why Did My Interactive SSH Job Fail?

Interactive SSH jobs can fail for several reasons, including configuration issues, resource limits, or problems that occur after launch. This page explains how to investigate a failure and outlines common causes and fixes.

Investigating the Failure

Looking up Logs

To investigate a failure, use the following command to view logs:

$ runai-bgu logs <job-name>

The command prints the log output of the specified job, which is usually the quickest way to identify the issue.


Describing the Job

To get more details about the job, you can use the following command:

$ runai-bgu describe <job-name>

The output includes the job's status, resource usage, and any error messages generated during execution. If the job's status is Failed, the output typically also records whether the cause was OOMKill or Terminated.

Of the two commands, the logs often contain the first concrete hint (e.g., out-of-memory messages, Python exceptions, missing files). Scroll to the end of the logs for the most recent entries.
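When the logs are long, filtering them for typical failure hints can save some scrolling. The following is a minimal sketch, not part of runai-bgu: scan_log_hints is a hypothetical helper, and the search patterns are illustrative guesses that you may need to extend for your own workload.

```shell
# Hypothetical helper (not part of runai-bgu): read log text from stdin and
# print the numbered lines that look like failure hints.
# The patterns below are illustrative assumptions, not an exhaustive list.
scan_log_hints() {
  grep -n -i -E 'out of memory|cannot allocate memory|killed|traceback|no such file' \
    || echo "no known failure hints found"
}
```

Assuming runai-bgu is on your PATH and the function is defined in your shell, it could be used like this:

$ runai-bgu logs <job-name> | scan_log_hints

To see only the most recent entries instead, piping through the standard tail utility also works:

$ runai-bgu logs <job-name> | tail -n 50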

Common Failure Types and Fixes

OOMKill (Out of Memory)

Symptom: logs may show memory allocation errors before the container is killed.

Why it happens: Your process used more memory than the job's limit allows. This usually means CPU RAM; in some cases, exhausting GPU memory also causes the process to fail.

What to do:

  • Increase the memory request and limit for the workspace when launching it, provided this stays within your quota.

  • Reduce memory pressure in your code (e.g., smaller batch sizes, lower precision, stream/chunk data, free tensors/arrays sooner).
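Before changing the request, it can help to check which memory limit the running workspace actually has. The sketch below assumes a Linux container that exposes the standard cgroup files; show_mem_limit is a hypothetical helper, and the paths may differ on your cluster.

```shell
# Hypothetical helper: print the container memory limit as seen from inside
# the job. cgroup v2 exposes memory.max; cgroup v1 exposes
# memory.limit_in_bytes. Both paths are assumptions about the node setup.
show_mem_limit() {
  for f in /sys/fs/cgroup/memory.max \
           /sys/fs/cgroup/memory/memory.limit_in_bytes; do
    if [ -r "$f" ]; then
      echo "limit ($f): $(cat "$f")"
      return 0
    fi
  done
  echo "no cgroup memory limit file found"
}

show_mem_limit
```

The value is in bytes. On cgroup v2, the literal value "max" means that no limit is set at this level.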

Remember: the interactive workspace is designed primarily for code editing, debugging, and running small test jobs. If your workload is memory- or compute-intensive, it should be submitted as a training job.

Terminated

Symptom: logs may be truncated or absent.

Common causes:

  • The process exited with a non-zero exit code, for example after an uncaught exception.

  • Manual stop/restart actions.

What to do:

  • Check the logs for the last exception or traceback and fix the root cause (paths, credentials, missing files, bad arguments).

  • Ensure requested resources (CPU/GPU/memory) match the workload’s needs.
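If the logs are truncated, re-running the failing command by hand inside a fresh workspace and checking its exit status can narrow down the cause. A minimal sketch, assuming a POSIX shell; run_and_report is a hypothetical helper and train.py is a placeholder for your own script.

```shell
# Hypothetical helper: run the given command, then print its exit status.
run_and_report() {
  status=0
  "$@" || status=$?          # capture the status without aborting the shell
  echo "exit code: $status"
  return "$status"
}

# Example usage (train.py is a placeholder for your own script):
#   run_and_report python train.py
```

By shell convention, a status above 128 means the process was ended by a signal: 137 (128 + 9, SIGKILL) is what an out-of-memory kill commonly produces, and 143 (128 + 15, SIGTERM) usually indicates a regular stop request. Other non-zero values generally come from the program itself.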