Workloads in the BGU HPC Cluster

In the BGU HPC cluster, workloads submitted via the Run:AI platform can be classified into two main types: Workspaces and Trainings. Understanding the differences between these workload types will help you choose the appropriate configuration for your workflows.

Workspaces

Workspaces provide a dynamic environment where you can interact with your workload in real time. These workloads are ideal for:

  • Exploratory Work: Exploratory data analysis (EDA), testing code snippets, tuning hyperparameters, or debugging.

  • Development: Writing and iterating on scripts or notebooks in a Jupyter environment.

  • Data Inspection: Analyzing datasets or viewing intermediate outputs interactively.

Features of Workspaces

  • Live Access: Connect to the workload environment in real time using tools like Jupyter Notebook, SSH, or an IDE.

  • Idle Timeout: Workspaces are automatically stopped if GPUs remain idle for a specified duration, ensuring efficient resource usage.
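
The idle-timeout rule above can be illustrated with a minimal sketch. This is not how Run:AI implements the check; the GpuSample type, the 1% utilization threshold, and the function names are all hypothetical and exist only to show the idea of "stop once the GPU has been continuously idle for the configured duration".

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One utilization reading (hypothetical structure, for illustration only)."""
    timestamp_s: float      # seconds since the workspace started
    utilization_pct: float  # GPU utilization, 0 to 100


def idle_seconds(samples, idle_threshold_pct=1.0):
    """Return how long the GPU has been continuously idle up to the latest sample."""
    if not samples:
        return 0.0
    latest = samples[-1].timestamp_s
    idle_start = None
    # Walk backwards from the newest sample until a busy reading is found.
    for sample in reversed(samples):
        if sample.utilization_pct > idle_threshold_pct:
            break
        idle_start = sample.timestamp_s
    if idle_start is None:
        return 0.0  # the most recent reading was busy
    return latest - idle_start


def should_stop(samples, timeout_s, idle_threshold_pct=1.0):
    """Return True once the continuous idle period reaches the timeout."""
    return idle_seconds(samples, idle_threshold_pct) >= timeout_s
```

In this sketch, any GPU activity resets the idle clock, so a Workspace that runs even a short computation from time to time would not be stopped.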

Training Workloads

Training workloads are designed for running machine learning models, simulations, or other long-running, non-interactive tasks. These jobs are best suited for:

  • Model Training: Executing deep learning or machine learning training scripts.

  • Batch Processing: Running jobs that process large datasets or perform simulations.

Features of Training Workloads

  • Batch Execution: The workload runs in the background without user interaction.

  • Scalability: Support for multi-GPU and distributed training setups.

  • Resilience: Can be configured to restart on failure or resume from checkpoints.
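
Resuming from checkpoints only works if the training script itself saves and reloads its state. The following is a minimal sketch of that pattern, assuming a JSON checkpoint file and a placeholder training step; the file layout and the function names are illustrative assumptions, not part of the Run:AI platform or the cluster configuration. A real job would typically save model and optimizer state with its framework's own tools and write to persistent storage.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}


def save_checkpoint(path, state):
    """Write the state atomically so a crash mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def train(path, total_epochs):
    """Run a placeholder training loop that resumes from the last saved epoch."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        loss = 1.0 / (epoch + 1)  # stand-in for a real training step
        state = {"epoch": epoch + 1, "loss": loss}
        save_checkpoint(path, state)
    return state
```

If the workload is restarted after a failure, calling train again with the same checkpoint path continues from the last completed epoch instead of starting over.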

Key Differences Between Workspace and Training Workloads

Feature            Workspaces                                 Training Workloads
-----------------  -----------------------------------------  ------------------------------------------
Purpose            Real-time interaction                      Background execution
Typical Use Case   Debugging, development, data inspection    Model training, batch processing
Duration           Short-term                                 Long-running
Resource Usage     Active during user interaction             Active for the workload’s entire duration

Choosing the Right Workload Type

  • Use a Workspace for tasks requiring real-time feedback or frequent adjustments.

  • Use training workloads for defined, repeatable tasks that can run autonomously.

By understanding the distinctions between these workload types, you can use GPU resources more efficiently and streamline your workflows in the BGU HPC cluster.