kblaunch
kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.
Commands
- Launching GPU jobs with various configurations
- Monitoring GPU usage and job statistics
- Setting up user configurations and preferences
- Managing persistent volumes and Git authentication
Features
- Interactive and batch job support
- GPU resource management and constraints
- Environment variable handling from multiple sources
- Persistent Volume Claims (PVC) for storage
- Git SSH authentication
- VS Code integration with remote tunneling
- Slack notifications for job status
- Real-time cluster monitoring
Resource Types
- A100 GPUs (40GB and 80GB variants)
- H100 GPUs (80GB variant)
- CPU and RAM allocation
- Persistent storage volumes
Job Priority Classes
- default: Standard priority for most workloads
- batch: Lower priority for long-running jobs
- short: High priority for quick jobs (with GPU constraints)
Environment Integration
- Kubernetes secrets
- Local environment variables
- .env file support
- SSH key management
- NFS workspace mounting
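The sources listed above can be combined in a simple precedence order. The following is a minimal sketch, not kblaunch's actual implementation: `parse_dotenv` and `collect_env_vars` are hypothetical helpers, and the rule that local environment values override `.env` values is an assumption made for illustration.

```python
import os


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    result = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of matching-style quotes.
        result[key.strip()] = value.strip().strip('"').strip("'")
    return result


def collect_env_vars(names, dotenv_text="", environ=None):
    """Start from .env values, then overlay the named local variables.

    Assumption: a variable set in the local environment wins over .env.
    Names that are not set locally are silently skipped.
    """
    environ = os.environ if environ is None else environ
    merged = parse_dotenv(dotenv_text)
    for name in names:
        if name in environ:
            merged[name] = environ[name]
    return merged
```

Passing `environ` explicitly keeps the helper easy to test; in normal use it would default to `os.environ`.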
"""kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.

## Commands
* Launching GPU jobs with various configurations
* Monitoring GPU usage and job statistics
* Setting up user configurations and preferences
* Managing persistent volumes and Git authentication

## Features
* Interactive and batch job support
* GPU resource management and constraints
* Environment variable handling from multiple sources
* Persistent Volume Claims (PVC) for storage
* Git SSH authentication
* VS Code integration with remote tunneling
* Slack notifications for job status
* Real-time cluster monitoring

## Resource Types
* A100 GPUs (40GB and 80GB variants)
* H100 GPUs (80GB variant)
* CPU and RAM allocation
* Persistent storage volumes

## Job Priority Classes
* default: Standard priority for most workloads
* batch: Lower priority for long-running jobs
* short: High priority for quick jobs (with GPU constraints)

## Environment Integration
* Kubernetes secrets
* Local environment variables
* .env file support
* SSH key management
* NFS workspace mounting
"""

import importlib.metadata

__version__ = importlib.metadata.version("kblaunch")

__all__ = [
    "setup",
    "launch",
    "monitor_gpus",
    "monitor_users",
    "monitor_jobs",
    "monitor_queue",
]

from .cli import setup, launch, monitor_gpus, monitor_users, monitor_jobs, monitor_queue
@app.command()
def setup():
    """
    `kblaunch setup`

    Interactive setup wizard for kblaunch configuration.
    No arguments - all configuration is done through interactive prompts.

    This command walks users through the initial setup process, configuring:
    - User identity and email
    - Namespace and queue settings
    - Slack notifications webhook
    - Persistent Volume Claims (PVC) for storage
    - Git SSH authentication
    - NFS server configuration

    The configuration is stored in ~/.cache/.kblaunch/config.json.

    Configuration includes:
    - User: Kubernetes username for job ownership
    - Email: User email for notifications and Git configuration
    - Namespace: Kubernetes namespace for job deployment
    - Queue: Kueue queue name for job scheduling
    - Slack webhook: URL for job status notifications
    - PVC: Persistent storage configuration
    - Git SSH: Authentication for private repositories
    - NFS: Server address for mounting storage
    """
    config = load_config()

    # validate user
    default_user = os.getenv("USER")
    if "user" in config:
        default_user = config["user"]
    else:
        config["user"] = default_user

    if typer.confirm(
        f"Would you like to set the user? (default: {default_user})", default=False
    ):
        user = typer.prompt("Please enter your user", default=default_user)
        config["user"] = user

    # Get email
    existing_email = config.get("email", None)
    email = typer.prompt(
        f"Please enter your email (existing: {existing_email})", default=existing_email
    )
    config["email"] = email

    # Configure namespace
    existing_namespace = config.get("namespace", os.getenv("KUBE_NAMESPACE"))
    if typer.confirm("Would you like to configure your namespace?", default=True):
        namespace = typer.prompt(
            f"Please enter your namespace (existing: {existing_namespace})",
            default=existing_namespace,
        )
        config["namespace"] = namespace
        # Now that we have namespace, ask about queue
        existing_queue = config.get("queue", get_user_queue(namespace))
        if typer.confirm("Would you like to configure your queue?", default=True):
            queue = typer.prompt(
                f"Please enter your queue name (existing: {existing_queue})",
                default=existing_queue or f"{namespace}-user-queue",
            )
            config["queue"] = queue

    # Get NFS Server
    # Get the current NFS server from config or default
    current_nfs = config.get("nfs_server", NFS_SERVER)
    if typer.confirm("Would you like to configure the NFS server?", default=False):
        nfs_server = typer.prompt(
            f"Enter your NFS server address (existing: {current_nfs})",
            default=current_nfs,
        )
        config["nfs_server"] = nfs_server

    # Get Slack webhook
    if typer.confirm("Would you like to set up Slack notifications?", default=False):
        existing_webhook = config.get("slack_webhook", None)
        webhook = typer.prompt(
            f"Enter your Slack webhook URL (existing: {existing_webhook})",
            default=existing_webhook,
        )
        config["slack_webhook"] = webhook

    if typer.confirm("Would you like to use a PVC?", default=False):
        user = config["user"]
        current_default = config.get("default_pvc", f"{user}-pvc")

        pvc_name = typer.prompt(
            f"Enter the PVC name to use (default: {current_default}). We will help you create it if it does not exist.",
            default=current_default,
        )

        namespace = config.get("namespace", get_current_namespace(config))
        if check_if_pvc_exists(pvc_name, namespace):
            if typer.confirm(
                f"Would you like to set {pvc_name} as the default PVC?",
                default=True,
            ):
                config["default_pvc"] = pvc_name
        else:
            if typer.confirm(
                f"PVC '{pvc_name}' does not exist. Would you like to create it?",
                default=True,
            ):
                pvc_size = typer.prompt(
                    "Enter the desired PVC size (e.g. 10Gi)", default="10Gi"
                )
                try:
                    if create_pvc(user, pvc_name, pvc_size, namespace):
                        config["default_pvc"] = pvc_name
                except (ValueError, ApiException) as e:
                    logger.error(f"Failed to create PVC: {e}")

    # Git authentication setup
    if typer.confirm("Would you like to set up Git SSH authentication?", default=False):
        default_key_path = str(Path.home() / ".ssh" / "id_rsa")
        key_path = typer.prompt(
            "Enter the path to your SSH private key",
            default=default_key_path,
        )
        secret_name = f"{config['user']}-git-ssh"
        namespace = config.get("namespace", get_current_namespace(config))
        if create_git_secret(secret_name, key_path, namespace):
            config["git_secret"] = secret_name

    # validate slack webhook
    if "slack_webhook" in config:
        # test post to slack
        try:
            logger.info("Sending test message to Slack")
            message = "Hello :wave: from ```kblaunch```"
            response = requests.post(
                config["slack_webhook"],
                json={"text": message},
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Error sending test message to Slack: {e}")

    # Save config
    save_config(config)
    logger.info(f"Configuration saved to {CONFIG_FILE}")
kblaunch setup
Interactive setup wizard for kblaunch configuration. The command takes no arguments; all configuration is done through interactive prompts.
This command walks users through the initial setup process, configuring:
- User identity and email
- Namespace and queue settings
- Slack notifications webhook
- Persistent Volume Claims (PVC) for storage
- Git SSH authentication
- NFS server configuration
The configuration is stored in ~/.cache/.kblaunch/config.json.
Configuration includes:
- User: Kubernetes username for job ownership
- Email: User email for notifications and Git configuration
- Namespace: Kubernetes namespace for job deployment
- Queue: Kueue queue name for job scheduling
- Slack webhook: URL for job status notifications
- PVC: Persistent storage configuration
- Git SSH: Authentication for private repositories
- NFS: Server address for mounting storage
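To make the storage format concrete, here is a minimal sketch of reading and writing a JSON configuration file at the documented path. It is an illustration rather than kblaunch's own code: `load_config_sketch` and `save_config_sketch` are hypothetical names, and the choice to return an empty dict when no file exists is an assumption.

```python
import json
from pathlib import Path

# Location taken from the documentation above.
CONFIG_FILE = Path.home() / ".cache" / ".kblaunch" / "config.json"


def load_config_sketch(path: Path = CONFIG_FILE) -> dict:
    """Return the stored configuration, or an empty dict if the file is missing."""
    if not path.exists():
        return {}
    with path.open() as f:
        return json.load(f)


def save_config_sketch(config: dict, path: Path = CONFIG_FILE) -> None:
    """Write the configuration as JSON, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(config, f, indent=2)
```

A wizard such as `kblaunch setup` can then load the existing values, offer them as prompt defaults, and save the updated dict at the end.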
@app.command()
def launch(
    email: str = typer.Option(None, help="User email (overrides config)"),
    job_name: str = typer.Option(..., help="Name of the Kubernetes job"),
    docker_image: str = typer.Option(
        "nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04", help="Docker image"
    ),
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    queue_name: str = typer.Option(
        None, help="Kueue queue name (defaults to KUBE_USER_QUEUE)"
    ),
    interactive: bool = typer.Option(False, help="Run in interactive mode"),
    command: str = typer.Option(
        "", help="Command to run in the container"
    ),  # Made optional
    cpu_request: str = typer.Option("6", help="CPU request"),
    ram_request: str = typer.Option("40Gi", help="RAM request"),
    gpu_limit: int = typer.Option(1, help="GPU limit (0 for non-GPU jobs)"),
    gpu_product: GPU_PRODUCTS = typer.Option(
        "NVIDIA-A100-SXM4-40GB",
        help="GPU product type to use (ignored for non-GPU jobs)",
        show_choices=True,
        show_default=True,
    ),
    secrets_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of secret environment variables to export to the container",
    ),
    local_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of local environment variables to export to the container",
    ),
    load_dotenv: bool = typer.Option(
        True, help="Load environment variables from .env file"
    ),
    nfs_server: Optional[str] = typer.Option(
        None, help="NFS server (overrides config and environment)"
    ),
    pvc_name: str = typer.Option(None, help="Persistent Volume Claim name"),
    pvcs: str = typer.Option(
        None,
        help='Multiple PVCs with mount paths in JSON format (e.g., \'[{"name":"my-pvc","mount_path":"/data"}]\')',
    ),
    dry_run: bool = typer.Option(False, help="Dry run"),
    priority: PRIORITY = typer.Option(
        "default", help="Priority class name", show_default=True, show_choices=True
    ),
    vscode: bool = typer.Option(False, help="Install VS Code CLI in the container"),
    tunnel: bool = typer.Option(
        False,
        help="Start a VS Code SSH tunnel on startup. Requires SLACK_WEBHOOK and --vscode",
    ),
    startup_script: str = typer.Option(
        None, help="Path to startup script to run in container"
    ),
):
    """
    `kblaunch launch`
    Launch a Kubernetes job with specified configuration.

    This command creates and deploys a Kubernetes job with the given specifications,
    handling GPU allocation, resource requests, and environment setup.

    Args:
    * email (str, optional): User email for notifications
    * job_name (str, required): Name of the Kubernetes job
    * docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
    * namespace (str, optional): Kubernetes namespace (defaults to KUBE_NAMESPACE)
    * queue_name (str, optional): Kueue queue name (defaults to KUBE_USER_QUEUE)
    * interactive (bool, default=False): Run in interactive mode
    * command (str, default=""): Command to run in container (required unless --interactive is set)
    * cpu_request (str, default="6"): CPU cores request
    * ram_request (str, default="40Gi"): RAM request
    * gpu_limit (int, default=1): Number of GPUs
    * gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
    * secrets_env_vars (List[str], default=[]): Secret environment variables
    * local_env_vars (List[str], default=[]): Local environment variables
    * load_dotenv (bool, default=True): Load .env file
    * nfs_server (str, optional): NFS server IP (overrides config)
    * pvc_name (str, optional): PVC name for single PVC mounting at /pvc
    * pvcs (str, optional): Multiple PVCs with mount paths in JSON format (used for mounting multiple PVCs)
    * dry_run (bool, default=False): Print YAML only
    * priority (PRIORITY, default="default"): Job priority
    * vscode (bool, default=False): Install VS Code
    * tunnel (bool, default=False): Start VS Code tunnel (requires --vscode and a Slack webhook)
    * startup_script (str, optional): Path to startup script

    Examples:
        ```bash
        # Launch an interactive GPU job
        kblaunch launch --job-name test-job --interactive

        # Launch a batch GPU job with custom command
        kblaunch launch --job-name batch-job --command "python train.py"

        # Launch a CPU-only job
        kblaunch launch --job-name cpu-job --gpu-limit 0

        # Launch with VS Code support
        kblaunch launch --job-name dev-job --interactive --vscode --tunnel

        # Launch with multiple PVCs
        kblaunch launch --job-name multi-pvc-job --pvcs '[{"name":"data-pvc","mount_path":"/data"},{"name":"models-pvc","mount_path":"/models"}]'
        ```

    Notes:
    - Interactive jobs keep running until manually terminated
    - GPU jobs require appropriate queue and priority settings
    - VS Code tunnel requires Slack webhook configuration
    - Multiple PVCs can be mounted with custom paths using the --pvcs option
    """

    # Load config
    config = load_config()

    # Determine namespace if not provided
    if namespace is None:
        namespace = get_current_namespace(config)
        if namespace is None:
            # The two strings are concatenated into a single message; passing
            # them as separate arguments would set BadParameter's ctx parameter.
            raise typer.BadParameter(
                "Namespace not provided. "
                "Please provide --namespace or run 'kblaunch setup' to configure."
            )

    # Determine queue name if not provided
    if queue_name is None:
        queue_name = get_user_queue(namespace)
        if queue_name is None:
            raise typer.BadParameter(
                "Queue name not provided. "
                "Please provide --queue-name or run 'kblaunch setup' to configure."
            )

    # Use email from config if not provided
    if email is None:
        email = config.get("email")
        if email is None:
            raise typer.BadParameter(
                "Email not provided and not found in config. "
                "Please provide --email or run 'kblaunch setup' to configure."
            )

    # Determine which NFS server to use (priority: command-line > config > env var > default)
    if nfs_server is None:
        nfs_server = config.get("nfs_server", NFS_SERVER)
        if nfs_server is None:
            # warn if NFS server is not set
            logger.warning(
                "NFS server not set/found. Please provide --nfs-server or run 'kblaunch setup' to mount the NFS partition."
            )

    # Add SLACK_WEBHOOK to local_env_vars if configured
    if "slack_webhook" in config:
        os.environ["SLACK_WEBHOOK"] = config["slack_webhook"]
        if "SLACK_WEBHOOK" not in local_env_vars:
            local_env_vars.append("SLACK_WEBHOOK")

    if "user" in config and os.getenv("USER") is None:
        os.environ["USER"] = config["user"]

    if pvc_name is None:
        pvc_name = config.get("default_pvc")

    if pvc_name is not None:
        if not check_if_pvc_exists(pvc_name, namespace):
            raise typer.BadParameter(
                f"PVC '{pvc_name}' does not exist in namespace '{namespace}'"
            )

    # Parse multiple PVCs if provided
    parsed_pvcs = []
    if pvcs:
        try:
            parsed_pvcs = json.loads(pvcs)
            # Validate the format
            for pvc in parsed_pvcs:
                if (
                    not isinstance(pvc, dict)
                    or "name" not in pvc
                    or "mount_path" not in pvc
                ):
                    raise typer.BadParameter(
                        "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
                    )
                # Validate that the PVC exists
                if not check_if_pvc_exists(pvc["name"], namespace):
                    raise typer.BadParameter(
                        f"PVC '{pvc['name']}' does not exist in namespace '{namespace}'"
                    )

        except json.JSONDecodeError:
            raise typer.BadParameter("Invalid JSON format for pvcs parameter")

    # Add validation for command parameter
    if not interactive and command == "":
        raise typer.BadParameter("--command is required when not in interactive mode")

    # Validate GPU constraints only if requesting GPUs
    if gpu_limit > 0:
        try:
            validate_gpu_constraints(gpu_product.value, gpu_limit, priority.value)
        except ValueError as e:
            raise typer.BadParameter(str(e))

    is_completed = check_if_completed(job_name, namespace=namespace)
    if not is_completed:
        if typer.confirm(
            f"Job '{job_name}' already exists. Do you want to delete it and create a new one?",
            default=False,
        ):
            if not delete_namespaced_job_safely(
                job_name,
                namespace=namespace,
                user=config.get("user"),
            ):
                logger.error("Failed to delete existing job")
                return 1
        else:
            logger.info("Operation cancelled by user")
            return 1

    logger.info(f"Job '{job_name}' is completed. Launching a new job.")

    # Get local environment variables
    env_vars_dict = get_env_vars(
        local_env_vars=local_env_vars,
        load_dotenv=load_dotenv,
    )

    # Add USER and GIT_EMAIL to env_vars if git_secret is configured
    if config.get("git_secret"):
        env_vars_dict["USER"] = config.get("user", os.getenv("USER", "unknown"))
        env_vars_dict["GIT_EMAIL"] = email

    secrets_env_vars_dict = get_secret_env_vars(
        secrets_names=secrets_env_vars,
        namespace=namespace,
    )

    # Check for overlapping keys in local and secret environment variables
    intersection = set(secrets_env_vars_dict.keys()).intersection(env_vars_dict.keys())
    if intersection:
        logger.warning(
            f"Overlapping keys in local and secret environment variables: {intersection}"
        )
    # Combine the environment variables
    union = set(secrets_env_vars_dict.keys()).union(env_vars_dict.keys())

    # Handle startup script
    script_content = None
    if startup_script:
        script_content = read_startup_script(startup_script)
        # Create ConfigMap for startup script
        try:
            api = client.CoreV1Api()
            config_map = client.V1ConfigMap(
                metadata=client.V1ObjectMeta(
                    name=f"{job_name}-startup", namespace=namespace
                ),
                data={"startup.sh": script_content},
            )
            try:
                api.create_namespaced_config_map(namespace=namespace, body=config_map)
            except ApiException as e:
                if e.status == 409:  # Already exists
                    api.patch_namespaced_config_map(
                        name=f"{job_name}-startup", namespace=namespace, body=config_map
                    )
                else:
                    raise
        except Exception as e:
            raise typer.BadParameter(f"Failed to create startup script ConfigMap: {e}")

    if interactive:
        cmd = "while true; do sleep 60; done;"
    else:
        cmd = command
        logger.info(f"Command: {cmd}")

    logger.info(f"Creating job for: {cmd}")

    # Modify command to include startup script
    if script_content:
        cmd = f"bash /startup.sh && {cmd}"

    # Build the start command with optional VS Code installation
    start_command = send_message_command(union)
    if config.get("git_secret"):
        start_command += setup_git_command()
    if vscode:
        start_command += install_vscode_command()
        if tunnel:
            start_command += start_vscode_tunnel_command(union)
    elif tunnel:
        logger.error("Cannot start tunnel without VS Code installation")

    full_cmd = start_command + cmd

    job = KubernetesJob(
        name=job_name,
        cpu_request=cpu_request,
        ram_request=ram_request,
        image=docker_image,
        gpu_type="nvidia.com/gpu" if gpu_limit > 0 else None,
        gpu_limit=gpu_limit,
        gpu_product=gpu_product.value if gpu_limit > 0 else None,
        command=["/bin/bash", "-c", "--"],
        args=[full_cmd],
        env_vars=env_vars_dict,
        secret_env_vars=secrets_env_vars_dict,
        user_email=email,
        namespace=namespace,
        kueue_queue_name=queue_name,
        nfs_server=nfs_server,
        pvc_name=pvc_name,
        pvcs=parsed_pvcs,
        priority=priority.value,
        startup_script=script_content,
        git_secret=config.get("git_secret"),
    )
    job_yaml = job.generate_yaml()
    logger.info(job_yaml)
    # Run the Job on the Kubernetes cluster
    if not dry_run:
        job.run()
kblaunch launch
Launch a Kubernetes job with specified configuration.
This command creates and deploys a Kubernetes job with the given specifications, handling GPU allocation, resource requests, and environment setup.
Args:
- email (str, optional): User email for notifications
- job_name (str, required): Name of the Kubernetes job
- docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
- namespace (str, optional): Kubernetes namespace (defaults to KUBE_NAMESPACE)
- queue_name (str, optional): Kueue queue name (defaults to KUBE_USER_QUEUE)
- interactive (bool, default=False): Run in interactive mode
- command (str, default=""): Command to run in container (required unless --interactive is set)
- cpu_request (str, default="6"): CPU cores request
- ram_request (str, default="40Gi"): RAM request
- gpu_limit (int, default=1): Number of GPUs
- gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
- secrets_env_vars (List[str], default=[]): Secret environment variables
- local_env_vars (List[str], default=[]): Local environment variables
- load_dotenv (bool, default=True): Load .env file
- nfs_server (str, optional): NFS server IP (overrides config)
- pvc_name (str, optional): PVC name for single PVC mounting at /pvc
- pvcs (str, optional): Multiple PVCs with mount paths in JSON format (used for mounting multiple PVCs)
- dry_run (bool, default=False): Print YAML only
- priority (PRIORITY, default="default"): Job priority
- vscode (bool, default=False): Install VS Code
- tunnel (bool, default=False): Start VS Code tunnel (requires --vscode and a Slack webhook)
- startup_script (str, optional): Path to startup script
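Several of these options fall back to stored configuration or the environment when omitted. The sketch below illustrates one way such a fallback chain can be written, assuming the order command line, then config, then environment variable, then default. `resolve_option` is a hypothetical helper and is not part of kblaunch.

```python
import os


def resolve_option(cli_value, config, key, env_var=None, default=None, environ=None):
    """Pick the first available value: CLI > config > environment > default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if key in config:
        return config[key]
    if env_var is not None and env_var in environ:
        return environ[env_var]
    return default


# Example: no --namespace flag and nothing in config, so the environment is used.
namespace = resolve_option(
    None, {}, "namespace", env_var="KUBE_NAMESPACE",
    environ={"KUBE_NAMESPACE": "team-ns"},
)
```

If every source is empty the helper returns `default`, which lets the caller decide whether to warn or raise an error.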
Examples:
# Launch an interactive GPU job
kblaunch launch --job-name test-job --interactive
# Launch a batch GPU job with custom command
kblaunch launch --job-name batch-job --command "python train.py"
# Launch a CPU-only job
kblaunch launch --job-name cpu-job --gpu-limit 0
# Launch with VS Code support
kblaunch launch --job-name dev-job --interactive --vscode --tunnel
# Launch with multiple PVCs
kblaunch launch --job-name multi-pvc-job --pvcs '[{"name":"data-pvc","mount_path":"/data"},{"name":"models-pvc","mount_path":"/models"}]'
Notes:
- Interactive jobs keep running until manually terminated
- GPU jobs require appropriate queue and priority settings
- VS Code tunnel requires Slack webhook configuration
- Multiple PVCs can be mounted with custom paths using the --pvcs option
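The `--pvcs` value is a JSON list of objects with `name` and `mount_path` keys. A minimal, standalone sketch of validating that format is shown below. `parse_pvcs` is a hypothetical helper; it raises `ValueError` instead of a CLI error and does not check that the PVCs exist in the cluster.

```python
import json


def parse_pvcs(pvcs_json: str) -> list[dict]:
    """Parse and validate a --pvcs style JSON string."""
    try:
        parsed = json.loads(pvcs_json)
    except json.JSONDecodeError as exc:
        raise ValueError("Invalid JSON format for pvcs parameter") from exc
    if not isinstance(parsed, list):
        raise ValueError("pvcs must be a JSON list")
    for entry in parsed:
        if not isinstance(entry, dict) or "name" not in entry or "mount_path" not in entry:
            raise ValueError(
                "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
            )
    return parsed


mounts = parse_pvcs('[{"name":"data-pvc","mount_path":"/data"}]')
```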
@monitor_app.command("gpus")
def monitor_gpus(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor gpus`
    Display overall GPU statistics and utilization by type.

    Shows a comprehensive view of GPU allocation and usage across the cluster,
    including both running and pending GPU requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Total GPU count by type
    - Running vs. pending GPUs
    - Details of pending GPU requests
    - Wait times for pending requests

    Examples:
        ```bash
        kblaunch monitor gpus
        kblaunch monitor gpus --namespace custom-namespace
        ```
    """
    try:
        config = load_config()
        namespace = namespace or get_current_namespace(config)
        print_gpu_total(namespace=namespace)
    except Exception as e:
        print(f"Error displaying GPU stats: {e}")
kblaunch monitor gpus
Display overall GPU statistics and utilization by type.
Shows a comprehensive view of GPU allocation and usage across the cluster, including both running and pending GPU requests.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
Output includes:
- Total GPU count by type
- Running vs. pending GPUs
- Details of pending GPU requests
- Wait times for pending requests
Examples:
kblaunch monitor gpus
kblaunch monitor gpus --namespace custom-namespace
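As an illustration of the "running vs. pending by type" summary, the sketch below aggregates a list of GPU requests. The input format, the `summarize_gpus` helper, and the `example-gpu` type name are all hypothetical; kblaunch gathers this data from the cluster, and how it does so is not shown here.

```python
from collections import defaultdict


def summarize_gpus(requests):
    """Count GPUs per type, split into running and pending.

    `requests` is an iterable of (gpu_type, gpu_count, phase) tuples,
    a made-up format for this sketch. Other phases are ignored.
    """
    totals = defaultdict(lambda: {"running": 0, "pending": 0})
    for gpu_type, count, phase in requests:
        if phase in ("running", "pending"):
            totals[gpu_type][phase] += count
    return dict(totals)


sample = [
    ("NVIDIA-A100-SXM4-40GB", 2, "running"),
    ("NVIDIA-A100-SXM4-40GB", 1, "pending"),
    ("example-gpu", 4, "running"),
    ("example-gpu", 1, "succeeded"),
]
summary = summarize_gpus(sample)
```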
@monitor_app.command("users")
def monitor_users(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor users`
    Display GPU usage statistics grouped by user.

    Provides a user-centric view of GPU allocation and utilization,
    helping identify resource usage patterns across users.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - GPUs allocated per user
    - Average memory usage per user
    - Inactive GPU count per user
    - Overall usage totals

    Examples:
        ```bash
        kblaunch monitor users
        kblaunch monitor users --namespace custom-namespace
        ```
    """
    try:
        config = load_config()
        namespace = namespace or get_current_namespace(config)
        print_user_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying user stats: {e}")
kblaunch monitor users
Display GPU usage statistics grouped by user.
Provides a user-centric view of GPU allocation and utilization, helping identify resource usage patterns across users.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
Output includes:
- GPUs allocated per user
- Average memory usage per user
- Inactive GPU count per user
- Overall usage totals
Examples:
kblaunch monitor users
kblaunch monitor users --namespace custom-namespace
@monitor_app.command("jobs")
def monitor_jobs(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor jobs`
    Display detailed job-level GPU statistics.

    Shows comprehensive information about all running GPU jobs,
    including resource usage and job characteristics.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Job identification and ownership
    - Resource allocation (CPU, RAM, GPU)
    - GPU memory usage
    - Job status (active/inactive)
    - Job mode (interactive/batch)
    - Resource totals and averages

    Examples:
        ```bash
        kblaunch monitor jobs
        kblaunch monitor jobs --namespace custom-namespace
        ```
    """
    try:
        config = load_config()
        namespace = namespace or get_current_namespace(config)
        print_job_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying job stats: {e}")
kblaunch monitor jobs
Display detailed job-level GPU statistics.
Shows comprehensive information about all running GPU jobs, including resource usage and job characteristics.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
Output includes:
- Job identification and ownership
- Resource allocation (CPU, RAM, GPU)
- GPU memory usage
- Job status (active/inactive)
- Job mode (interactive/batch)
- Resource totals and averages
Examples:
kblaunch monitor jobs
kblaunch monitor jobs --namespace custom-namespace
@monitor_app.command("queue")
def monitor_queue(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    reasons: bool = typer.Option(False, help="Display queued job event messages"),
    include_cpu: bool = typer.Option(False, help="Show CPU jobs in the queue"),
):
    """
    `kblaunch monitor queue`
    Display statistics about queued workloads.

    Shows information about jobs waiting in the Kueue scheduler,
    including wait times and resource requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
    - reasons: Show detailed reason messages for queued jobs
    - include_cpu: Include CPU jobs in the queue

    Output includes:
    - Queue position and wait time
    - Resource requests (CPU, RAM, GPU)
    - Job priority
    - Queueing reasons (if --reasons flag is used)

    Examples:
        ```bash
        kblaunch monitor queue
        kblaunch monitor queue --reasons
        kblaunch monitor queue --namespace custom-namespace
        ```
    """
    try:
        config = load_config()
        namespace = namespace or get_current_namespace(config)
        print_queue_stats(namespace=namespace, reasons=reasons, include_cpu=include_cpu)
    except Exception as e:
        print(f"Error displaying queue stats: {e}")
kblaunch monitor queue
Display statistics about queued workloads.
Shows information about jobs waiting in the Kueue scheduler, including wait times and resource requests.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
- reasons: Show detailed reason messages for queued jobs
- include_cpu: Include CPU jobs in the queue
Output includes:
- Queue position and wait time
- Resource requests (CPU, RAM, GPU)
- Job priority
- Queueing reasons (if --reasons flag is used)
Examples:
kblaunch monitor queue
kblaunch monitor queue --reasons
kblaunch monitor queue --namespace custom-namespace
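To show how a position and wait time could be derived for queued workloads, here is a small sketch. The workload dictionaries and both helpers are hypothetical, and ordering purely by creation time is an assumption: the real scheduler may also take priority and quota into account.

```python
from datetime import datetime, timezone


def format_wait_time(created_at: datetime, now: datetime) -> str:
    """Format the elapsed time since creation as hours and minutes."""
    seconds = max(0, int((now - created_at).total_seconds()))
    hours, remainder = divmod(seconds, 3600)
    minutes = remainder // 60
    return f"{hours}h {minutes}m"


def rank_queue(workloads, now):
    """Sort workloads oldest first and return (position, name, wait) tuples."""
    ordered = sorted(workloads, key=lambda w: w["created_at"])
    return [
        (position, w["name"], format_wait_time(w["created_at"], now))
        for position, w in enumerate(ordered, start=1)
    ]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
queued = [
    {"name": "b", "created_at": datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)},
    {"name": "a", "created_at": datetime(2024, 1, 1, 9, 15, tzinfo=timezone.utc)},
]
ranking = rank_queue(queued, now)
```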