runai-bgu describe Manual

Introduction

runai-bgu describe is a command-line interface (CLI) command that shows detailed information about a single workload on the BGU HPC cluster: its general properties, compute resources, pods, events, and network configuration. The command automatically detects the workload type (workspace or training). Its output is useful for monitoring, debugging, and resource management.

This manual explains how to use runai-bgu describe to get detailed information about your workloads.

Quick Start

To describe a workload, use:

$ runai-bgu describe my-workload

Shows comprehensive details about the specified workload.

Basic Usage

Describe a Workload

Get detailed information about a specific workload:

$ runai-bgu describe research-job

Displays complete information about the workload named research-job including status, resources, and configuration.

Check Workload Status

Quickly verify the current state of a workload:

$ runai-bgu describe my-workspace

Shows whether the workload is running, pending, failed, or in another state.

Review Resource Allocation

Examine how resources are allocated and used:

$ runai-bgu describe gpu-training-job

Displays CPU, memory, and GPU resource limits and current usage.

Understanding the Output

The describe output is organized into the following sections:

General Information

Basic workload metadata:

Name and Type: The workload name and whether it’s a workspace or training job.
Status and Phase: Current operational state and lifecycle phase.
Creation Time: When the workload was originally submitted.
Project: Which project contains the workload.

Compute Resources

Resource allocation details:

CPU Allocation: Number of CPU cores requested and allocated.
Memory Limits: RAM allocation and usage information.
GPU Resources: GPU allocation including memory or compute units.
Resource Requests vs Limits: Minimum required resources versus maximum allowed resources.

Pods Information

Container execution details:

Pod Status: Status of individual pods running the workload.
Container States: Status of containers within each pod.
Resource Consumption: Actual resource usage by running containers.
Restart Counts: Number of times containers have been restarted.

Events

Recent activity and status changes:

State Transitions: History of workload state changes.
Scheduling Events: Information about pod scheduling and placement.
Error Messages: Any error conditions or warnings.
System Events: Cluster-level events affecting the workload.

Networks

Network configuration and connectivity:

Port Mappings: Exposed ports and their mappings.
Service Endpoints: Network endpoints for accessing the workload.
Network Policies: Applied network security policies.

Common Use Cases

Troubleshooting Failed Jobs

Investigate why a workload failed to start or crashed:

$ runai-bgu describe failed-training

Check the events section for error messages and scheduling issues.
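
When the output is long, a simple text filter can surface the relevant lines. The sketch below is a minimal helper, not part of runai-bgu: find_problems is a hypothetical function name, and the keywords it searches for (error, warn, fail) are assumptions about what problem lines typically contain rather than a documented output format.

```shell
#!/bin/sh
# find_problems: read describe output on stdin and print every line that
# mentions an error, warning, or failure, prefixed with its line number.
# Assumption: problem lines contain one of these keywords (case-insensitive).
# Exit status is 0 if at least one line matched and 1 if none did.
find_problems() {
    grep -i -n -E 'error|warn|fail'
}
```

Typical use would be: runai-bgu describe failed-training | find_problems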

Monitoring Resource Usage

Track how efficiently resources are being used:

$ runai-bgu describe resource-intensive-job

Compare allocated resources with actual usage patterns.

Debugging Connectivity Issues

Examine network configuration for workspace access problems:

$ runai-bgu describe my-workspace

Review port mappings and network endpoints.

Pre-Deletion Review

Verify workload details before deletion:

$ runai-bgu describe old-experiment
$ runai-bgu delete old-experiment

Confirm you’re deleting the correct workload and understand its current state.
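
The two steps above can be combined so that deletion only happens after an explicit confirmation. This is a sketch under stated assumptions: describe_then_delete and the RUNAI_CMD override are hypothetical names introduced here for illustration, and the function assumes that describe exits with a non-zero status when the workload cannot be found.

```shell
#!/bin/sh
# describe_then_delete NAME
# Show the workload first, then delete it only if the user answers "y".
# RUNAI_CMD is a hypothetical override (not a runai-bgu feature) that lets
# the function be tried out against a stub command.
describe_then_delete() {
    name=$1
    cli=${RUNAI_CMD:-runai-bgu}

    # Assumption: describe returns non-zero if the workload does not exist.
    "$cli" describe "$name" || return 1

    printf 'Delete %s? [y/N] ' "$name"
    read -r answer
    case $answer in
        y|Y) "$cli" delete "$name" ;;
        *)   echo "Skipped deletion of $name" ;;
    esac
}
```

Typical use would be: describe_then_delete old-experiment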

Advanced Usage

Cross-Project Description

Describe workloads in different projects:

$ runai-bgu describe shared-resource -p team-project

Here -p names the project (team-project) that contains the workload. Useful when working with workloads across multiple projects.
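
To look for the same workload name in several projects, the command can be wrapped in a loop. The following is a rough sketch: describe_in_projects and the RUNAI_CMD override are hypothetical names, and only the describe NAME -p PROJECT form shown above is used.

```shell
#!/bin/sh
# describe_in_projects NAME PROJECT...
# Run "describe NAME -p PROJECT" once per project, with a header per project.
# RUNAI_CMD is a hypothetical override (not a runai-bgu feature) for testing.
describe_in_projects() {
    name=$1
    shift
    cli=${RUNAI_CMD:-runai-bgu}
    for project in "$@"; do
        printf '=== %s (project: %s) ===\n' "$name" "$project"
        "$cli" describe "$name" -p "$project" ||
            echo "describe failed in project $project" >&2
    done
}
```

Typical use would be: describe_in_projects shared-resource team-project other-project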

Automation and Scripting

Extract specific information for scripts:

$ runai-bgu describe my-job | grep -A 5 "Status:"

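grep -A 5 prints each line that matches "Status:" together with the five lines that follow it. Parse output this way for automated monitoring or reporting.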
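
For scripts that need only the status value itself, awk can isolate it. This is a minimal sketch: extract_status is a hypothetical helper, and it assumes the output contains a line of the form "Status: <value>", as the grep pattern above suggests. Adjust the pattern if the actual layout differs.

```shell
#!/bin/sh
# extract_status: read describe output on stdin and print the value of the
# first "Status:" line. Prints nothing if no such line exists.
# Assumption: the line looks like "Status: Running" (optionally indented).
extract_status() {
    awk -F': *' '/^[ \t]*Status:/ { print $2; exit }'
}
```

Typical use would be: runai-bgu describe my-job | extract_status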

Information Analysis

Resource Optimization

Use describe output to optimize resource allocation:

Over-allocation: Identify workloads using significantly less than allocated resources.
Under-allocation: Find workloads that might benefit from additional resources.
Efficiency Metrics: Compare resource requests with actual usage patterns.
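
The comparison between allocated and used resources comes down to a simple ratio. The helper below is an illustrative sketch: utilization_pct is a hypothetical function, and the two numbers are assumed to be read from the describe output by hand and expressed in the same unit (for example, both in CPU cores or both in GiB).

```shell
#!/bin/sh
# utilization_pct USED ALLOCATED
# Print USED as a percentage of ALLOCATED with one decimal place.
# Prints "n/a" when ALLOCATED is zero or negative.
utilization_pct() {
    awk -v used="$1" -v alloc="$2" 'BEGIN {
        if (alloc + 0 <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * used / alloc
    }'
}
```

For example, utilization_pct 2 8 prints 25.0, which would suggest that a workload using 2 of its 8 allocated cores is over-allocated.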

Performance Monitoring

Track workload performance over time:

State History: Monitor how frequently workloads restart or fail.
Scheduling Delays: Identify patterns in scheduling delays or resource contention.
Error Patterns: Analyze recurring error conditions or failure modes.

Best Practices

Regular Monitoring: Use describe regularly to stay informed about workload health and performance.
Proactive Troubleshooting: Check workload details at the first sign of issues rather than waiting for failures.
Resource Planning: Use resource information to plan future workload submissions and cluster capacity.
Documentation: Save describe output for important workloads as documentation for troubleshooting and optimization.
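
Saving the output can be done with a plain shell redirect. The sketch below adds a timestamp to the file name so that successive snapshots do not overwrite each other. snapshot_describe, the file naming scheme, and the RUNAI_CMD override are assumptions made for illustration, not features of runai-bgu.

```shell
#!/bin/sh
# snapshot_describe NAME [DIR]
# Save the describe output to DIR/NAME-YYYYmmdd-HHMMSS.txt (DIR defaults to
# the current directory) and print the path of the file that was written.
# RUNAI_CMD is a hypothetical override (not a runai-bgu feature) for testing.
snapshot_describe() {
    name=$1
    dir=${2:-.}
    cli=${RUNAI_CMD:-runai-bgu}
    file="$dir/$name-$(date +%Y%m%d-%H%M%S).txt"

    mkdir -p "$dir" || return 1
    # Remove the partial file if describe fails, so no empty snapshots remain.
    "$cli" describe "$name" > "$file" || { rm -f "$file"; return 1; }
    echo "$file"
}
```

Typical use would be: snapshot_describe gpu-training-job ~/workload-notes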

Integration with Other Commands

The describe command works well in combination with other runai-bgu commands:

Investigation Workflow: list → describe → logs for comprehensive troubleshooting.
Management Workflow: describe → suspend/resume/delete for informed workload management.
Monitoring Workflow: list → describe → bash for interactive investigation.