runai-bgu resume Manual

Introduction

runai-bgu resume is a command in the runai-bgu command-line interface (CLI) for resuming suspended workloads on the BGU HPC cluster. It restarts a previously suspended workload so that you can continue where you left off. The command automatically detects the workload type (workspace or training) and resumes the workload with its original configuration and resource request.

This manual explains how to use runai-bgu resume to restart your suspended workloads.

Quick Start

To resume a suspended workload, use:

$ runai-bgu resume my-workload

Resumes the specified workload in your default project.

Basic Usage

Resume a Workload

Restart a suspended workload by name:

$ runai-bgu resume research-job

Resumes the workload named research-job in your default project.

Resume in Specific Project

Resume a workload in a specific project:

$ runai-bgu resume research-job -p my-research-project

Resumes the workload research-job in the specified project my-research-project.
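
If you switch between projects often, the two forms above can be wrapped in a small shell function. This is only a sketch: the function name resume_workload and the RUNAI_CMD variable are illustrative and not part of runai-bgu. Only the resume subcommand and the -p option are taken from this manual.

```shell
# Hypothetical helper; not part of runai-bgu.
# RUNAI_CMD can be overridden, for example with a stub when testing.
RUNAI_CMD="${RUNAI_CMD:-runai-bgu}"

# Usage: resume_workload NAME [PROJECT]
resume_workload() {
    name="${1:-}"
    project="${2:-}"
    if [ -z "$name" ]; then
        echo "usage: resume_workload NAME [PROJECT]" >&2
        return 2
    fi
    if [ -n "$project" ]; then
        # Project given: pass it with -p, as in the example above.
        "$RUNAI_CMD" resume "$name" -p "$project"
    else
        # No project given: the default project is used.
        "$RUNAI_CMD" resume "$name"
    fi
}
```

For example, resume_workload research-job my-research-project runs the same command as the -p example above.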

Understanding Resume Operation

When you resume a workload, the following applies:

Resource Restoration

The workload is restarted with the same resource request (CPU, memory, GPU) it had before it was suspended.

State Continuation

For workspace workloads, your files and environment are preserved from when the workload was suspended.

Queue Position

The resumed workload re-enters the scheduling queue. It starts only when resources become available, so it may not start immediately.

Configuration Preservation

All original settings including environment variables, volumes, and network configuration are maintained.

Common Use Cases

Continue Interactive Work

Resume a suspended workspace to continue development:

$ runai-bgu resume my-workspace

Restart Long-Running Training

Resume a training job that was suspended to free resources:

$ runai-bgu resume training-experiment-1
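
If several workloads were suspended together, the same command can be repeated in a loop. The sketch below assumes that runai-bgu resume exits with a non-zero status when it fails, which this manual does not state. The function name resume_many and the RUNAI_CMD variable are illustrative.

```shell
# Hypothetical helper; not part of runai-bgu.
RUNAI_CMD="${RUNAI_CMD:-runai-bgu}"

# Usage: resume_many NAME [NAME ...]
# Resumes each named workload in the default project and keeps going on errors.
resume_many() {
    failed=0
    for name in "$@"; do
        # ASSUMPTION: a failed resume is signalled by a non-zero exit status.
        if "$RUNAI_CMD" resume "$name"; then
            echo "resume requested: $name"
        else
            echo "resume failed: $name" >&2
            failed=$((failed + 1))
        fi
    done
    # Return non-zero if at least one resume failed.
    [ "$failed" -eq 0 ]
}
```

For example, resume_many my-workspace training-experiment-1 requests a resume for both workloads shown above.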

Prerequisites for Resume

Before resuming a workload, ensure that the following conditions are met:

Suspended Status

The workload must be in "Suspended" status. Use runai-bgu list or runai-bgu describe to check the status of a workload.

Resource Availability

Sufficient cluster resources must be available to meet the workload’s requirements before it can start. Until then, the resumed workload waits in the scheduling queue.

Project Access

You must have appropriate permissions in the project containing the workload.

Valid Configuration

The original workload configuration must still be valid (images accessible, volumes available, etc.).
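
The status check can be done before each resume with a small script. The sketch below rests on two assumptions that this manual does not confirm: that runai-bgu describe accepts the workload name as its argument, and that its output contains the word "Suspended" for a suspended workload. The function name resume_if_suspended and the RUNAI_CMD variable are illustrative.

```shell
# Hypothetical helper; not part of runai-bgu.
RUNAI_CMD="${RUNAI_CMD:-runai-bgu}"

# Usage: resume_if_suspended NAME
resume_if_suspended() {
    name="${1:-}"
    if [ -z "$name" ]; then
        echo "usage: resume_if_suspended NAME" >&2
        return 2
    fi
    # ASSUMPTION: "describe NAME" prints the workload status as text.
    status_text="$("$RUNAI_CMD" describe "$name")" || return 1
    case "$status_text" in
        *Suspended*)
            "$RUNAI_CMD" resume "$name"
            ;;
        *)
            echo "$name is not reported as Suspended; not resuming" >&2
            return 1
            ;;
    esac
}
```

Check the real output of runai-bgu describe on the cluster and adjust the pattern before relying on this.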

Best Practices

Monitor Status

After resuming, check the workload status with runai-bgu list or runai-bgu describe to confirm that it has started successfully.
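
One way to follow this practice is to poll the status until the workload reaches the state you expect. The sketch below assumes that runai-bgu describe accepts the workload name and prints the status as text. The status word to wait for (for example "Running") is also an assumption, since this manual names only the "Suspended" status. The function name wait_for_status and the RUNAI_CMD variable are illustrative.

```shell
# Hypothetical helper; not part of runai-bgu.
RUNAI_CMD="${RUNAI_CMD:-runai-bgu}"

# Usage: wait_for_status NAME WANTED [TRIES] [INTERVAL_SECONDS]
wait_for_status() {
    name="$1"
    wanted="$2"
    tries="${3:-30}"
    interval="${4:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # ASSUMPTION: "describe NAME" prints the current status as text.
        text="$("$RUNAI_CMD" describe "$name" 2>/dev/null)" || text=""
        case "$text" in
            *"$wanted"*)
                echo "$name reached status: $wanted"
                return 0
                ;;
        esac
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$name did not reach status $wanted after $tries checks" >&2
    return 1
}
```

For example, wait_for_status my-workspace Running 30 10 checks up to 30 times, 10 seconds apart. A workload that stays queued for lack of resources will reach the timeout, which is expected behavior on a busy cluster.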

Resource Planning

Consider the cluster load before resuming resource-intensive workloads, since they may wait in the scheduling queue until enough resources are free.

Project Organization

Keep track of which projects contain your suspended workloads.

Documentation

Maintain notes about why workloads were suspended to inform resume decisions.