runai-bgu describe Manual

Introduction

runai-bgu describe is a command of the runai-bgu command-line interface (CLI) that describes a specific workload on the BGU HPC cluster. It reports the workload's general properties, compute resources, pods, events, and network configuration. The command automatically detects the workload type (workspace or training), so the same invocation works for both, and the details it retrieves are useful for monitoring, debugging, and resource management.

This manual explains how to use runai-bgu describe to get detailed information about your workloads.

Quick Start

To describe a workload, use:

$ runai-bgu describe my-workload

Shows comprehensive details about the specified workload.

Basic Usage

Describe a Workload

Get detailed information about a specific workload:

$ runai-bgu describe research-job

Displays complete information about the workload named research-job including status, resources, and configuration.

Check Workload Status

Quickly verify the current state of a workload:

$ runai-bgu describe my-workspace

Shows whether the workload is running, pending, failed, or in another state.

Review Resource Allocation

Examine how resources are allocated and used:

$ runai-bgu describe gpu-training-job

Displays CPU, memory, and GPU resource limits and current usage.

Understanding the Output

The describe command provides comprehensive information organized in sections:

General Information

Basic workload metadata:

Name and Type

The workload name and whether it’s a workspace or training job.

Status and Phase

Current operational state and lifecycle phase.

Creation Time

When the workload was originally submitted.

Project

Which project contains the workload.

Compute Resources

Resource allocation details:

CPU Allocation

Number of CPU cores requested and allocated.

Memory Limits

RAM allocation and usage information.

GPU Resources

GPU allocation including memory or compute units.

Resource Requests vs Limits

Minimum required resources versus maximum allowed resources.

Pods Information

Container execution details:

Pod Status

Status of individual pods running the workload.

Container States

Status of containers within each pod.

Resource Consumption

Actual resource usage by running containers.

Restart Counts

Number of times containers have been restarted.

Events

Recent activity and status changes:

State Transitions

History of workload state changes.

Scheduling Events

Information about pod scheduling and placement.

Error Messages

Any error conditions or warnings.

System Events

Cluster-level events affecting the workload.

Networks

Network configuration and connectivity:

Port Mappings

Exposed ports and their mappings.

Service Endpoints

Network endpoints for accessing the workload.

Network Policies

Applied network security policies.

Common Use Cases

Troubleshooting Failed Jobs

Investigate why a workload failed to start or crashed:

$ runai-bgu describe failed-training

Check the events section for error messages and scheduling issues.
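
Scanning the events section by eye gets tedious for long outputs. The following is a minimal sketch of a filter that keeps only lines mentioning common failure keywords. The function name find_problem_lines and the keyword list are illustrative assumptions; the exact wording of events in the describe output is not documented here, so adjust the pattern to what you actually see.

```shell
# Print lines from stdin that mention common failure keywords,
# prefixed with their line number in the input.
# Assumption: the keywords below are illustrative, not an official list.
find_problem_lines() {
    # "|| true" keeps the exit status at 0 when nothing matches,
    # so the function is safe to use in scripts that run with "set -e".
    grep -n -i -E 'error|failed|warning' || true
}
```

Intended usage, on a machine with cluster access: runai-bgu describe failed-training | find_problem_lines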

Monitoring Resource Usage

Track how efficiently resources are being used:

$ runai-bgu describe resource-intensive-job

Compare allocated resources with actual usage patterns.

Debugging Connectivity Issues

Examine network configuration for workspace access problems:

$ runai-bgu describe my-workspace

Review port mappings and network endpoints.

Pre-Deletion Review

Verify workload details before deletion:

$ runai-bgu describe old-experiment
$ runai-bgu delete old-experiment

Confirm you’re deleting the correct workload and understand its current state.
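
The review step can be enforced rather than remembered. Below is a rough sketch of a wrapper that shows the describe output and deletes only after the workload name is typed back. The function name confirm_delete and the RUNAI_CMD variable are hypothetical helpers introduced for this example; only the describe and delete commands themselves come from this manual.

```shell
# RUNAI_CMD is a hypothetical override (not part of runai-bgu) that lets
# the function be dry-run, for example with RUNAI_CMD=echo.
RUNAI_CMD="${RUNAI_CMD:-runai-bgu}"

# Show the workload, then delete it only if the user types its name exactly.
# Returns 1 if describe fails, 2 if the confirmation does not match.
confirm_delete() {
    name="$1"
    "$RUNAI_CMD" describe "$name" || return 1
    # The prompt goes to stderr so stdout carries only command output.
    printf 'Type the workload name to confirm deletion: ' >&2
    read -r answer
    if [ "$answer" = "$name" ]; then
        "$RUNAI_CMD" delete "$name"
    else
        echo "Aborted: '$answer' does not match '$name'." >&2
        return 2
    fi
}
```

Intended usage: confirm_delete old-experiment. Requiring the full name, instead of a y/n answer, makes it harder to delete the wrong workload by reflex.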

Advanced Usage

Cross-Project Description

Describe workloads in different projects:

$ runai-bgu describe shared-resource -p team-project

Useful when working with workloads across multiple projects.
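
When the same workload name may exist in several projects, the -p option can be applied in a loop. This is a small sketch under that assumption; describe_in_projects and RUNAI_CMD are hypothetical names used only for illustration, and the project names in the usage line are placeholders.

```shell
# RUNAI_CMD is a hypothetical override (not part of runai-bgu) for dry runs.
RUNAI_CMD="${RUNAI_CMD:-runai-bgu}"

# Describe the same workload name in each listed project.
# Usage: describe_in_projects <workload> <project> [<project> ...]
describe_in_projects() {
    name="$1"
    shift
    for project in "$@"; do
        echo "=== $name (project: $project) ==="
        # Continue with the next project even if this one fails.
        "$RUNAI_CMD" describe "$name" -p "$project" \
            || echo "describe failed for project $project" >&2
    done
}
```

Intended usage: describe_in_projects shared-resource team-project other-project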

Automation and Scripting

Extract specific information for scripts:

$ runai-bgu describe my-job | grep -A 5 "Status:"

Parse output for automated monitoring or reporting.
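
For scripts that need only the status value rather than the surrounding lines, a small filter is often easier to work with than grep -A. The sketch below assumes the output contains a line of the form "Status: <value>", which is inferred from the grep pattern above and not otherwise confirmed; extract_status is a hypothetical helper name.

```shell
# Print the value of the first "Status:" line read from stdin.
# Assumption: describe output contains a line like "Status: Running",
# possibly indented. Prints nothing if no such line is present.
extract_status() {
    awk '/^[[:space:]]*Status:/ {
        sub(/^[[:space:]]*Status:[[:space:]]*/, "")  # drop the label
        print
        exit                                         # first match only
    }'
}
```

Intended usage: status=$(runai-bgu describe my-job | extract_status). Because the describe output is formatted for people, treat any parser like this as fragile and re-check it when the tool is updated.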

Information Analysis

Resource Optimization

Use describe output to optimize resource allocation:

Over-allocation

Identify workloads using significantly less than allocated resources.

Under-allocation

Find workloads that might benefit from additional resources.

Efficiency Metrics

Compare resource requests with actual usage patterns.

Performance Monitoring

Track workload performance over time:

State History

Monitor how frequently workloads restart or fail.

Scheduling Delays

Identify patterns in scheduling delays or resource contention.

Error Patterns

Analyze recurring error conditions or failure modes.

Best Practices

Regular Monitoring

Use describe regularly to stay informed about workload health and performance.
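
Regular checks can be scripted as a simple polling loop. The following sketch repeatedly runs describe until a wanted status appears or a retry limit is reached. It assumes the output contains a "Status: <value>" line; wait_for_status, RUNAI_CMD, and the default retry and delay values are hypothetical choices for this example.

```shell
# RUNAI_CMD is a hypothetical override (not part of runai-bgu) for dry runs.
RUNAI_CMD="${RUNAI_CMD:-runai-bgu}"

# Poll a workload until its status matches, or give up.
# Usage: wait_for_status <workload> <status> [tries] [delay-seconds]
# Returns 0 when the status was seen, 1 after all tries are used up.
wait_for_status() {
    name="$1"
    wanted="$2"
    tries="${3:-10}"
    delay="${4:-30}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Assumption: output has a line such as "Status: Running".
        if "$RUNAI_CMD" describe "$name" | grep -q -E "Status:[[:space:]]*$wanted"; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only if another attempt follows.
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

Intended usage: wait_for_status my-workspace Running 20 15. Keep the delay generous so that polling does not put unnecessary load on the cluster.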

Proactive Troubleshooting

Check workload details at the first sign of issues rather than waiting for failures.

Resource Planning

Use resource information to plan future workload submissions and cluster capacity.

Documentation

Save describe output for important workloads as documentation for troubleshooting and optimization.
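
Saving output is easier to keep up when file naming is automatic. Here is a minimal sketch that writes the describe output to a timestamped text file. The function name save_describe, the RUNAI_CMD variable, and the file naming scheme are assumptions made for illustration.

```shell
# RUNAI_CMD is a hypothetical override (not part of runai-bgu) for dry runs.
RUNAI_CMD="${RUNAI_CMD:-runai-bgu}"

# Save describe output to <dir>/<workload>-<UTC timestamp>.txt
# and print the path of the file that was written.
# Usage: save_describe <workload> [dir]   (dir defaults to ".")
save_describe() {
    name="$1"
    dir="${2:-.}"
    mkdir -p "$dir" || return 1
    out="$dir/$name-$(date -u +%Y%m%dT%H%M%SZ).txt"
    # Remove the partial file if describe fails.
    "$RUNAI_CMD" describe "$name" > "$out" || { rm -f "$out"; return 1; }
    echo "$out"
}
```

Intended usage: save_describe research-job ~/workload-notes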

Integration with Other Commands

The describe command works well in combination with other runai-bgu commands:

Investigation Workflow

list → describe → logs for comprehensive troubleshooting.

Management Workflow

describe → suspend/resume/delete for informed workload management.

Monitoring Workflow

list → describe → bash for interactive investigation.