Why Did My Interactive Workspace Fail?
Interactive workspaces can fail for several reasons, including configuration issues, resource limits, or problems that occur after launch. This page explains how to investigate a failure in the browser and outlines common causes and fixes.
Investigating the Failure (in the Browser)
Viewing Logs
To see the runtime logs of your workspace:
- Go to Workload Manager > Workloads
- Select your job
- Click Show details (top-right)
- Choose the Logs tab
Logs often contain the first concrete hint (e.g., out-of-memory messages, Python exceptions, missing files). Scroll to the end for the most recent entries.
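If you have copied the log text out of the Logs tab, a short script can help you spot these hints in a long log. The following is a minimal sketch, not a feature of the platform: the pattern list and the `find_hints` helper are assumptions made for illustration, and the sample log lines are invented.

```python
import re

# Hypothetical patterns for the kinds of hints mentioned above.
# Extend or adjust them to match what your own logs contain.
HINT_PATTERNS = {
    "out_of_memory": re.compile(
        r"out of memory|MemoryError|OOMKilled|Cannot allocate memory",
        re.IGNORECASE,
    ),
    "python_exception": re.compile(r"Traceback \(most recent call last\)"),
    "missing_file": re.compile(r"FileNotFoundError|No such file or directory"),
}


def find_hints(log_text, tail_lines=200):
    """Return (line_number, category, line) for matches in the last `tail_lines` lines."""
    lines = log_text.splitlines()
    start = max(0, len(lines) - tail_lines)  # the most recent entries are at the end
    hits = []
    for idx in range(start, len(lines)):
        for category, pattern in HINT_PATTERNS.items():
            if pattern.search(lines[idx]):
                hits.append((idx + 1, category, lines[idx].strip()))
    return hits


# Invented sample log, for demonstration only.
sample_log = """Loading dataset
Traceback (most recent call last):
  File "train.py", line 12, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
"""

for line_no, category, line in find_hints(sample_log):
    print(f"{line_no}: [{category}] {line}")
```

A match only tells you where to start reading; the lines around it usually carry the actual cause.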
Checking Event History and Status
To see the event history of your workspace:
- Go to Workload Manager > Workloads
- Select your job
- Click Show details (top-right)
- Choose the Event History tab
- Review status changes and reasons (e.g., Failed with OOMKilled or Terminated).
If the job failed, the Event History typically records whether the cause was OOMKilled or Terminated.
Common Failure Types and Fixes
OOMKill (Out of Memory)
Symptom: Event History indicates OOMKilled; logs may show memory allocation errors before the container is killed.
Why it happens: Your process used more memory than the workspace limit allows. This usually refers to system (CPU) RAM; in some cases, running out of GPU memory causes the process to fail.
What to do:
- Increase the memory requested/limited for the workspace when launching it.
- Reduce memory pressure in your code (e.g., smaller batch sizes, lower precision, stream/chunk data, free tensors/arrays sooner).
- Close memory-hungry background processes within the workspace.
Remember: the interactive workspace is designed primarily for code editing, debugging, and running small test jobs. If your workload is memory- or compute-intensive, it should be submitted as a training job, not run inside the workspace. This ensures better stability, performance, and fair resource usage across the cluster.
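As one illustration of the "stream/chunk data" suggestion above, the sketch below computes a column mean from a CSV file one row at a time instead of loading the whole file into memory. It uses only the Python standard library; the function names, the column name, and the sample data are hypothetical, and the same idea would need adapting to whatever data format and libraries your code actually uses.

```python
import csv
import io


def running_mean(rows, column):
    """Average `column` over an iterable of dict rows, keeping only a sum and a count in memory."""
    count = 0
    total = 0.0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count if count else None  # None when there are no rows


def mean_from_csv(path, column):
    """Stream a CSV file from disk; csv.DictReader yields one row at a time."""
    with open(path, newline="") as handle:
        return running_mean(csv.DictReader(handle), column)


# Invented in-memory sample standing in for a large file.
sample = io.StringIO("step,loss\n1,0.9\n2,0.7\n3,0.5\n")
print(running_mean(csv.DictReader(sample), "loss"))
```

Peak memory here stays roughly constant regardless of file size, because no list of rows is ever built.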
Terminated
Symptom: Event History shows Terminated; logs may be truncated or absent.
Common causes:
- The process exited with a non-zero code (for example, after an uncaught exception).
- Manual stop/restart actions.
What to do:
- Check Logs for the last exception/traceback and fix the root cause (paths, credentials, missing files, bad arguments).
- Ensure requested resources (CPU/GPU/memory) match the workload’s needs.
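To make the last exception easier to find in the Logs tab, you can wrap your entry point so that any failure is written out with a timestamp and a clear marker before the process exits. This is a minimal sketch using Python's standard `logging` module; `run_logged` and the logger name are hypothetical names chosen for illustration, not part of the platform.

```python
import logging
import sys

# Write timestamped records to stderr so they appear in the captured logs.
logging.basicConfig(
    level=logging.INFO,
    stream=sys.stderr,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("workspace-task")  # hypothetical logger name


def run_logged(task, *args, **kwargs):
    """Run `task`; return 0 on success, or log the full traceback and return 1."""
    try:
        task(*args, **kwargs)
    except Exception:
        # log.exception records the message plus the current traceback.
        log.exception("Task failed")
        return 1
    log.info("Task finished")
    return 0
```

In a script, you would typically end with `sys.exit(run_logged(main))`, where `main` is your own entry function, so that the exit code still reflects the failure.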