runai-bgu describe Manual
Introduction
runai-bgu describe is a command-line interface (CLI) command that describes a specific workload on the BGU HPC cluster.
It reports detailed information about the workload, including its general properties, compute resources, pods, events, and network configuration. The command automatically detects the workload type (workspace or training) and retrieves details useful for monitoring, debugging, and resource management.
This manual explains how to use runai-bgu describe to get detailed information about your workloads.
Quick Start
To describe a workload, use:
$ runai-bgu describe my-workload
Shows comprehensive details about the specified workload.
Basic Usage
Describe a Workload
Get detailed information about a specific workload:
$ runai-bgu describe research-job
Displays complete information about the workload named research-job including status, resources, and configuration.
Understanding the Output
The describe command provides comprehensive information organized in sections:
General Information
Basic workload metadata:
- Name and Type: The workload name and whether it’s a workspace or training job.
- Status and Phase: Current operational state and lifecycle phase.
- Creation Time: When the workload was originally submitted.
- Project: Which project contains the workload.
Compute Resources
Resource allocation details:
- CPU Allocation: Number of CPU cores requested and allocated.
- Memory Limits: RAM allocation and usage information.
- GPU Resources: GPU allocation including memory or compute units.
- Resource Requests vs Limits: Minimum required resources versus maximum allowed resources.
Pods Information
Container execution details:
- Pod Status: Status of individual pods running the workload.
- Container States: Status of containers within each pod.
- Resource Consumption: Actual resource usage by running containers.
- Restart Counts: Number of times containers have been restarted.
Common Use Cases
Troubleshooting Failed Jobs
Investigate why a workload failed to start or crashed:
$ runai-bgu describe failed-training
Check the events section for error messages and scheduling issues.
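When the output is long, it can help to print only the events portion. The following is a minimal sketch, not part of runai-bgu itself: show_events is a hypothetical helper, and it assumes the events section begins at a line containing the word "Events" (in any letter case). If the real output uses a different heading, adjust the pattern.

```shell
# Hypothetical helper: print the describe output from the events heading onward.
# Assumption: the events section starts at a line containing "events"
# (case-insensitive). Everything from that line to the end is printed.
show_events() {
    workload="$1"
    runai-bgu describe "$workload" | awk '
        tolower($0) ~ /events/ { printing = 1 }   # found the events heading
        printing { print }                        # print it and all later lines
    '
}
```

After defining the function in your shell, call it with the workload name, for example show_events failed-training.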
Monitoring Resource Usage
Track how efficiently resources are being used:
$ runai-bgu describe resource-intensive-job
Compare allocated resources with actual usage patterns.
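Usage patterns only become visible over time, so one approach is to take several describe snapshots at a fixed interval and compare them. The sketch below uses only the describe command shown in this manual. The snapshot_workload function and its default values (3 snapshots, 60 seconds apart) are illustrative assumptions.

```shell
# Hypothetical helper: take repeated describe snapshots of one workload.
# Arguments: workload name, number of snapshots (default 3),
#            seconds between snapshots (default 60).
snapshot_workload() {
    workload="$1"
    count="${2:-3}"
    interval="${3:-60}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "=== $workload snapshot $i of $count at $(date '+%Y-%m-%d %H:%M:%S') ==="
        runai-bgu describe "$workload"
        # Wait between snapshots, but not after the last one.
        [ "$i" -lt "$count" ] && sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, snapshot_workload resource-intensive-job 5 120 would print five snapshots two minutes apart. Redirect the output to a file to compare the snapshots later.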
Advanced Usage
Information Analysis
Resource Optimization
Use describe output to optimize resource allocation:
- Over-allocation: Identify workloads using significantly less than allocated resources.
- Under-allocation: Find workloads that might benefit from additional resources.
- Efficiency Metrics: Compare resource requests with actual usage patterns.
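The comparison between allocated and used resources comes down to a simple percentage. The sketch below computes it from two numbers that you read from the describe output and type in by hand, so it makes no assumption about the output format. The function names and the 30% and 90% thresholds are arbitrary examples, not cluster policy.

```shell
# Percentage of an allocated resource that is actually in use.
# Arguments: allocated amount, used amount (same unit, e.g. cores or GiB).
utilization_pct() {
    awk -v alloc="$1" -v used="$2" 'BEGIN {
        if (alloc <= 0) { print "n/a"; exit }   # avoid division by zero
        printf "%.0f\n", 100 * used / alloc
    }'
}

# Label an allocation based on its utilization.
# Thresholds (below 30%, above 90%) are illustrative only.
classify_allocation() {
    pct=$(utilization_pct "$1" "$2")
    if [ "$pct" = "n/a" ]; then
        echo "unknown"
    elif [ "$pct" -lt 30 ]; then
        echo "over-allocated"
    elif [ "$pct" -gt 90 ]; then
        echo "under-allocated"
    else
        echo "ok"
    fi
}

utilization_pct 8 2        # prints 25
classify_allocation 8 2    # prints over-allocated
```

A workload that requested 8 CPU cores but uses about 2 is at 25% utilization, which suggests a smaller request on the next submission.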
Best Practices
- Regular Monitoring: Use describe regularly to stay informed about workload health and performance.
- Proactive Troubleshooting: Check workload details at the first sign of issues rather than waiting for failures.
- Resource Planning: Use resource information to plan future workload submissions and cluster capacity.
- Documentation: Save describe output for important workloads as documentation for troubleshooting and optimization.
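Saving describe output can be as simple as redirecting it to a file. The sketch below wraps that in a small function that adds a timestamp to the file name so earlier records are not overwritten. The save_describe function and the default ./describe-logs directory are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: save describe output to a timestamped text file.
# Arguments: workload name, optional output directory
#            (default ./describe-logs, an illustrative location).
save_describe() {
    workload="$1"
    outdir="${2:-./describe-logs}"
    mkdir -p "$outdir" || return 1
    outfile="$outdir/${workload}_$(date '+%Y%m%d-%H%M%S').txt"
    # Keep the file only if the describe command succeeded.
    if runai-bgu describe "$workload" > "$outfile"; then
        echo "$outfile"    # report where the record was written
    else
        rm -f "$outfile"
        return 1
    fi
}
```

Calling save_describe research-job writes the record and prints the path of the new file.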
Integration with Other Commands
Describe works well in combination with other runai-bgu commands:
- Investigation Workflow: list → describe → logs for comprehensive troubleshooting.
- Management Workflow: describe → suspend/resume/delete for informed workload management.
- Monitoring Workflow: list → describe → bash for interactive investigation.
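As an illustration, the investigation workflow can be chained in one small function. This is a sketch under assumptions: this manual documents only the describe syntax, so calling list without arguments and passing the workload name to logs in the same way as to describe are guesses. Check the manuals for those commands before relying on it.

```shell
# Hypothetical helper: run the investigation workflow for one workload.
# Assumptions: "runai-bgu list" takes no required arguments and
# "runai-bgu logs" accepts a workload name like describe does.
investigate() {
    workload="$1"
    echo "--- list ---"
    runai-bgu list || return 1
    echo "--- describe $workload ---"
    runai-bgu describe "$workload" || return 1
    echo "--- logs $workload ---"
    runai-bgu logs "$workload"
}
```

The function stops at the first failing step, so a missing workload is reported by describe before logs is attempted.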