Why Did My Interactive Workspace Fail?

Interactive workspaces can fail for several reasons, including configuration issues, exceeded resource limits, or errors that occur after launch. This page explains how to investigate a failure in the browser and lists common causes and fixes.

Investigating the Failure (in the Browser)

Viewing Logs

To see the runtime logs of your workspace:

  1. Go to Workload Manager > Workloads

  2. Select your job

  3. Click Show details (top-right)

  4. Choose the Logs tab

Logs often contain the first concrete hint (e.g., out-of-memory messages, Python exceptions, missing files). Scroll to the end for the most recent entries.
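
If the log is long, a small script can help surface these hints. The following is a minimal sketch, assuming you have copied the log text out of the Logs tab into a string or a local file. The patterns and the find_hints helper are illustrative assumptions, not part of the platform, and should be adjusted to what your own logs contain.

```python
import re

# Illustrative patterns only; extend them to match your own log messages.
HINT_PATTERNS = {
    "out of memory": re.compile(r"out of memory|MemoryError|\bOOM", re.IGNORECASE),
    "python exception": re.compile(r"Traceback \(most recent call last\)"),
    "missing file": re.compile(r"FileNotFoundError|No such file or directory"),
}


def find_hints(log_text, tail_lines=200):
    """Return (line_number, label, line) for matches in the last tail_lines lines."""
    lines = log_text.splitlines()
    start = max(0, len(lines) - tail_lines)
    hints = []
    for number, line in enumerate(lines[start:], start=start + 1):
        for label, pattern in HINT_PATTERNS.items():
            if pattern.search(line):
                hints.append((number, label, line.strip()))
    return hints


if __name__ == "__main__":
    # Hypothetical log excerpt used only to demonstrate the helper.
    sample = "\n".join([
        "Loading dataset...",
        "Traceback (most recent call last):",
        "FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'",
    ])
    for number, label, line in find_hints(sample):
        print(f"{number}: [{label}] {line}")
```

Only the tail of the log is scanned by default, because the most recent entries are usually closest to the failure.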

Checking Event History and Status

To see the event history of your workspace:

  1. Go to Workload Manager > Workloads

  2. Select your job

  3. Click Show details (top-right)

  4. Choose the Event History tab

  5. Review status changes and reasons (e.g., Failed with OOMKilled or Terminated).

When a job has Failed, the Event History usually records the reason, typically OOMKilled or Terminated.


Common Failure Types and Fixes

OOMKill (Out of Memory)

Symptom: Event History indicates OOMKilled; logs may show memory allocation errors before the container is killed.

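Why it happens: Your process used more memory than the workspace's memory limit. This usually refers to system (CPU) RAM. Running out of GPU memory can also make the process fail, but it typically shows up as an error raised inside the process.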

What to do:

  • Increase the memory request and limit for the workspace when launching it.

  • Reduce memory pressure in your code (e.g., smaller batch sizes, lower precision, stream/chunk data, free tensors/arrays sooner).

  • Close memory-hungry background processes within the workspace.
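
As an illustration of the stream/chunk advice above, the following sketch compares building a full list in memory with streaming values through a generator. It uses only the Python standard library. The function names are made up for this example, and the exact numbers will vary by Python version and machine.

```python
import tracemalloc


def squares(n):
    """Yield values one at a time instead of building a full list."""
    for i in range(n):
        yield i * i


def sum_all_at_once(n):
    values = [i * i for i in range(n)]  # the whole list lives in memory
    return sum(values)


def sum_streaming(n):
    return sum(squares(n))  # values are consumed as they are produced


def peak_bytes(func, *args):
    """Return (result, peak traced Python allocation in bytes) for one call."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    for func in (sum_all_at_once, sum_streaming):
        result, peak = peak_bytes(func, 200_000)
        print(f"{func.__name__}: result={result}, peak={peak} bytes")
```

Both functions return the same result, but the streaming version keeps far fewer objects alive at once. The same idea applies to reading files or datasets in chunks instead of loading them whole.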

Remember: the interactive workspace is designed primarily for code editing, debugging, and running small test jobs. If your workload is memory- or compute-intensive, it should be submitted as a training job, not run inside the workspace. This ensures better stability, performance, and fair resource usage across the cluster.

Terminated

Symptom: Event History shows Terminated; logs may be truncated or absent.

Common causes:

  • The process exited with a non-zero exit code (for example, after an uncaught exception).

  • Manual stop/restart actions.

What to do:

  • Check Logs for the last exception/traceback and fix the root cause (paths, credentials, missing files, bad arguments).

  • Ensure requested resources (CPU/GPU/memory) match the workload’s needs.
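
Because logs may be truncated or absent, it can help to make your own scripts log the full traceback before they exit. The sketch below is one possible pattern, not a platform requirement. The run wrapper, the logger name, and the failing example_main are all hypothetical.

```python
import logging
import sys
import traceback

# Log to stdout so the messages are part of the process output.
logging.basicConfig(level=logging.INFO, stream=sys.stdout)
log = logging.getLogger("workspace-script")


def run(main):
    """Run main(), log any uncaught exception, and return an exit code."""
    try:
        main()
    except Exception:
        # Record the full traceback before the process exits.
        log.error("Unhandled exception:\n%s", traceback.format_exc())
        return 1
    return 0


def example_main():
    # Hypothetical failure: opening a path that does not exist.
    with open("/path/that/does/not/exist.txt") as handle:
        handle.read()


# At the bottom of a real script you would typically call:
#     sys.exit(run(example_main))
```

Returning a non-zero code keeps the failure visible to whatever launched the process, while the logged traceback points at the root cause (paths, credentials, missing files, bad arguments).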

Quick Triage Checklist

  1. Select your job and open Show details → Logs; scan the last output lines.

  2. Open Show details → Event History and note the exact reason and timestamp.

  3. Apply the corresponding fix (increase memory, fix code error, adjust resources).

  4. Relaunch the workspace and re-check Logs and Event History.