kblaunch
kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.
Commands
- Launching GPU jobs with various configurations
- Monitoring GPU usage and job statistics
- Setting up user configurations and preferences
- Managing persistent volumes and Git authentication
Features
- Interactive and batch job support
- GPU resource management and constraints
- Environment variable handling from multiple sources
- Persistent Volume Claims (PVC) for storage
- Git SSH authentication
- VS Code integration with remote tunneling
- Slack notifications for job status
- Real-time cluster monitoring
Resource Types
- A100 GPUs (40GB and 80GB variants)
- H100 GPUs (80GB variant)
- CPU and RAM allocation
- Persistent storage volumes
Job Priority Classes
- default: Standard priority for most workloads
- batch: Lower priority for long-running jobs
- short: High priority for quick jobs (with GPU constraints)
Environment Integration
- Kubernetes secrets
- Local environment variables
- .env file support
- SSH key management
- NFS workspace mounting
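The environment sources listed above end up as a single set of variables for the container. The following is a minimal sketch of one way such a merge could work, not the tool's actual implementation: `parse_dotenv`, `collect_env`, and `overlapping_keys` are hypothetical helpers, and the choice that a local variable wins over a `.env` entry with the same name is an assumption.

```python
from typing import Dict, Iterable, Mapping, Set


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    result: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip().strip('"').strip("'")
    return result


def collect_env(
    names: Iterable[str],
    environ: Mapping[str, str],
    dotenv_text: str = "",
) -> Dict[str, str]:
    """Start from .env entries, then add the requested local variables that are set.

    Assumption: a local variable overrides a .env entry with the same name.
    """
    merged = parse_dotenv(dotenv_text)
    for name in names:
        if name in environ:
            merged[name] = environ[name]
    return merged


def overlapping_keys(local: Mapping[str, str], secret: Mapping[str, str]) -> Set[str]:
    """Return names defined both locally and through a Kubernetes secret."""
    return set(local) & set(secret)


# Illustrative values only; in practice `environ` would be os.environ.
local = collect_env(
    ["HF_TOKEN", "MISSING"],
    {"HF_TOKEN": "abc"},
    "WANDB_PROJECT=demo\n# a comment\n",
)
secret = {"HF_TOKEN": "from-secret"}
print(local)
print(overlapping_keys(local, secret))
```

Reporting the overlap instead of silently picking a winner mirrors the warning that `launch` logs when local and secret variables share a key.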
"""kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.

## Commands
* Launching GPU jobs with various configurations
* Monitoring GPU usage and job statistics
* Setting up user configurations and preferences
* Managing persistent volumes and Git authentication

## Features
* Interactive and batch job support
* GPU resource management and constraints
* Environment variable handling from multiple sources
* Persistent Volume Claims (PVC) for storage
* Git SSH authentication
* VS Code integration with remote tunneling
* Slack notifications for job status
* Real-time cluster monitoring

## Resource Types
* A100 GPUs (40GB and 80GB variants)
* H100 GPUs (80GB variant)
* CPU and RAM allocation
* Persistent storage volumes

## Job Priority Classes
* default: Standard priority for most workloads
* batch: Lower priority for long-running jobs
* short: High priority for quick jobs (with GPU constraints)

## Environment Integration
* Kubernetes secrets
* Local environment variables
* .env file support
* SSH key management
* NFS workspace mounting
"""

import importlib.metadata

__version__ = importlib.metadata.version("kblaunch")

__all__ = [
    "setup",
    "launch",
    "monitor_gpus",
    "monitor_users",
    "monitor_jobs",
    "monitor_queue",
]

from .cli import setup, launch, monitor_gpus, monitor_users, monitor_jobs, monitor_queue
@app.command()
def setup():
    """
    `kblaunch setup`

    Interactive setup wizard for kblaunch configuration.
    No arguments - all configuration is done through interactive prompts.

    This command walks users through the initial setup process, configuring:
    - User identity and email
    - Namespace and queue settings
    - Slack notifications webhook
    - Persistent Volume Claims (PVC) for storage
    - Git SSH authentication
    - NFS server configuration

    The configuration is stored in ~/.cache/.kblaunch/config.json.

    Configuration includes:
    - User: Kubernetes username for job ownership
    - Email: User email for notifications and Git configuration
    - Namespace: Kubernetes namespace for job deployment
    - Queue: Kueue queue name for job scheduling
    - Slack webhook: URL for job status notifications
    - PVC: Persistent storage configuration
    - Git SSH: Authentication for private repositories
    - NFS: Server address for mounting storage
    """
    config = load_config()

    # validate user
    default_user = os.getenv("USER")
    if "user" in config:
        default_user = config["user"]
    else:
        config["user"] = default_user

    if typer.confirm(
        f"Would you like to set the user? (default: {default_user})", default=False
    ):
        user = typer.prompt("Please enter your user", default=default_user)
        config["user"] = user

    # Get email
    existing_email = config.get("email", None)
    email = typer.prompt(
        f"Please enter your email (existing: {existing_email})", default=existing_email
    )
    config["email"] = email

    # Configure namespace
    existing_namespace = config.get("namespace", os.getenv("KUBE_NAMESPACE"))
    if typer.confirm("Would you like to configure your namespace?", default=True):
        namespace = typer.prompt(
            f"Please enter your namespace (existing: {existing_namespace})",
            default=existing_namespace,
        )
        config["namespace"] = namespace
        # Now that we have namespace, ask about queue
        existing_queue = config.get("queue", get_user_queue(namespace))
        if typer.confirm("Would you like to configure your queue?", default=True):
            queue = typer.prompt(
                f"Please enter your queue name (existing: {existing_queue})",
                default=existing_queue or f"{namespace}-user-queue",
            )
            config["queue"] = queue

    # Get NFS Server
    # Get the current NFS server from config or default
    current_nfs = config.get("nfs_server", NFS_SERVER)
    if typer.confirm("Would you like to configure the NFS server?", default=False):
        nfs_server = typer.prompt(
            f"Enter your NFS server address (existing: {current_nfs})",
            default=current_nfs,
        )
        config["nfs_server"] = nfs_server

    # Get Slack webhook
    if typer.confirm("Would you like to set up Slack notifications?", default=False):
        existing_webhook = config.get("slack_webhook", None)
        webhook = typer.prompt(
            f"Enter your Slack webhook URL (existing: {existing_webhook})",
            default=existing_webhook,
        )
        config["slack_webhook"] = webhook

    if typer.confirm("Would you like to use a PVC?", default=False):
        user = config["user"]
        current_default = config.get("default_pvc", f"{user}-pvc")

        pvc_name = typer.prompt(
            f"Enter the PVC name to use (default: {current_default}). We will help you create it if it does not exist.",
            default=current_default,
        )

        namespace = config.get("namespace", get_current_namespace(config))
        if check_if_pvc_exists(pvc_name, namespace):
            if typer.confirm(
                f"Would you like to set {pvc_name} as the default PVC?",
                default=True,
            ):
                config["default_pvc"] = pvc_name
        else:
            if typer.confirm(
                f"PVC '{pvc_name}' does not exist. Would you like to create it?",
                default=True,
            ):
                pvc_size = typer.prompt(
                    "Enter the desired PVC size (e.g. 10Gi)", default="10Gi"
                )
                try:
                    if create_pvc(user, pvc_name, pvc_size, namespace):
                        config["default_pvc"] = pvc_name
                except (ValueError, ApiException) as e:
                    logger.error(f"Failed to create PVC: {e}")

    # Git authentication setup
    if typer.confirm("Would you like to set up Git SSH authentication?", default=False):
        default_key_path = str(Path.home() / ".ssh" / "id_rsa")
        key_path = typer.prompt(
            "Enter the path to your SSH private key",
            default=default_key_path,
        )
        secret_name = f"{config['user']}-git-ssh"
        namespace = config.get("namespace", get_current_namespace(config))
        if create_git_secret(secret_name, key_path, namespace):
            config["git_secret"] = secret_name

    # validate slack webhook
    if "slack_webhook" in config:
        # test post to slack
        try:
            logger.info("Sending test message to Slack")
            message = "Hello :wave: from ```kblaunch```"
            response = requests.post(
                config["slack_webhook"],
                json={"text": message},
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Error sending test message to Slack: {e}")

    # Save config
    save_config(config)
    logger.info(f"Configuration saved to {CONFIG_FILE}")
kblaunch setup
Interactive setup wizard for kblaunch configuration. No arguments - all configuration is done through interactive prompts.
This command walks users through the initial setup process, configuring:
- User identity and email
- Namespace and queue settings
- Slack notifications webhook
- Persistent Volume Claims (PVC) for storage
- Git SSH authentication
- NFS server configuration
The configuration is stored in ~/.cache/.kblaunch/config.json.
Configuration includes:
- User: Kubernetes username for job ownership
- Email: User email for notifications and Git configuration
- Namespace: Kubernetes namespace for job deployment
- Queue: Kueue queue name for job scheduling
- Slack webhook: URL for job status notifications
- PVC: Persistent storage configuration
- Git SSH: Authentication for private repositories
- NFS: Server address for mounting storage
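The wizard reads any existing configuration, updates individual keys, and writes the result back as JSON. A minimal sketch of that round trip follows. The helpers `load_config_from` and `save_config_to` are hypothetical stand-ins that take an explicit path (the tool's own `load_config` and `save_config` take no path argument and their internals are not shown here), and a temporary directory replaces `~/.cache/.kblaunch/` so the example has no side effects.

```python
import json
import tempfile
from pathlib import Path


def load_config_from(path: Path) -> dict:
    """Return the stored configuration, or an empty dict if missing or unreadable."""
    if not path.exists():
        return {}
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError:
        return {}


def save_config_to(config: dict, path: Path) -> None:
    """Write the configuration as JSON, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for ~/.cache/.kblaunch/config.json
    config_file = Path(tmp) / ".kblaunch" / "config.json"

    config = load_config_from(config_file)  # {} on first run
    config.setdefault("user", "alice")      # illustrative values
    config["namespace"] = "demo-ns"
    # Same fallback pattern the wizard offers for the queue prompt.
    config["queue"] = f"{config['namespace']}-user-queue"

    save_config_to(config, config_file)
    reloaded = load_config_from(config_file)

print(reloaded)
```

Treating a missing or corrupt file as an empty configuration is a design choice made for this sketch; it lets a first run proceed straight to the prompts.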
@app.command()
def launch(
    email: str = typer.Option(None, help="User email (overrides config)"),
    job_name: str = typer.Option(..., help="Name of the Kubernetes job"),
    docker_image: str = typer.Option(
        "nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04", help="Docker image"
    ),
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    queue_name: str = typer.Option(
        None, help="Kueue queue name (defaults to KUBE_USER_QUEUE)"
    ),
    interactive: bool = typer.Option(False, help="Run in interactive mode"),
    command: str = typer.Option(
        "", help="Command to run in the container"
    ),  # Made optional
    cpu_request: str = typer.Option("6", help="CPU request"),
    ram_request: str = typer.Option("40Gi", help="RAM request"),
    gpu_limit: int = typer.Option(1, help="GPU limit (0 for non-GPU jobs)"),
    gpu_product: GPU_PRODUCTS = typer.Option(
        "NVIDIA-A100-SXM4-40GB",
        help="GPU product type to use (ignored for non-GPU jobs)",
        show_choices=True,
        show_default=True,
    ),
    secrets_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of secret environment variables to export to the container",
    ),
    local_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of local environment variables to export to the container",
    ),
    load_dotenv: bool = typer.Option(
        True, help="Load environment variables from .env file"
    ),
    nfs_server: Optional[str] = typer.Option(
        None, help="NFS server (overrides config and environment)"
    ),
    pvc_name: str = typer.Option(None, help="Persistent Volume Claim name"),
    pvcs: str = typer.Option(
        None,
        help='Multiple PVCs with mount paths in JSON format (e.g., \'[{"name":"my-pvc","mount_path":"/data"}]\')',
    ),
    dry_run: bool = typer.Option(False, help="Dry run"),
    priority: PRIORITY = typer.Option(
        "default", help="Priority class name", show_default=True, show_choices=True
    ),
    vscode: bool = typer.Option(False, help="Install VS Code CLI in the container"),
    tunnel: bool = typer.Option(
        False,
        help="Start a VS Code SSH tunnel on startup. Requires SLACK_WEBHOOK and --vscode",
    ),
    startup_script: str = typer.Option(
        None, help="Path to startup script to run in container"
    ),
):
    """
    `kblaunch launch`
    Launch a Kubernetes job with specified configuration.

    This command creates and deploys a Kubernetes job with the given specifications,
    handling GPU allocation, resource requests, and environment setup.

    Args:
    * email (str, optional): User email for notifications
    * job_name (str, required): Name of the Kubernetes job
    * docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
    * namespace (str, optional): Kubernetes namespace (defaults to KUBE_NAMESPACE)
    * queue_name (str, optional): Kueue queue name (defaults to KUBE_USER_QUEUE)
    * interactive (bool, default=False): Run in interactive mode
    * command (str, default=""): Command to run in container (required unless --interactive is set)
    * cpu_request (str, default="6"): CPU cores request
    * ram_request (str, default="40Gi"): RAM request
    * gpu_limit (int, default=1): Number of GPUs
    * gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
    * secrets_env_vars (List[str], default=[]): Secret environment variables
    * local_env_vars (List[str], default=[]): Local environment variables
    * load_dotenv (bool, default=True): Load .env file
    * nfs_server (str, optional): NFS server IP (overrides config)
    * pvc_name (str, optional): PVC name for single PVC mounting at /pvc
    * pvcs (str, optional): Multiple PVCs with mount paths in JSON format (used for mounting multiple PVCs)
    * dry_run (bool, default=False): Print YAML only
    * priority (PRIORITY, default="default"): Job priority
    * vscode (bool, default=False): Install VS Code
    * tunnel (bool, default=False): Start VS Code tunnel
    * startup_script (str, optional): Path to startup script

    Examples:
        ```bash
        # Launch an interactive GPU job
        kblaunch launch --job-name test-job --interactive

        # Launch a batch GPU job with custom command
        kblaunch launch --job-name batch-job --command "python train.py"

        # Launch a CPU-only job
        kblaunch launch --job-name cpu-job --gpu-limit 0

        # Launch with VS Code support
        kblaunch launch --job-name dev-job --interactive --vscode --tunnel

        # Launch with multiple PVCs
        kblaunch launch --job-name multi-pvc-job --pvcs '[{"name":"data-pvc","mount_path":"/data"},{"name":"models-pvc","mount_path":"/models"}]'
        ```

    Notes:
    - Interactive jobs keep running until manually terminated
    - GPU jobs require appropriate queue and priority settings
    - VS Code tunnel requires --vscode and a configured Slack webhook (SLACK_WEBHOOK)
    - Multiple PVCs can be mounted with custom paths using the --pvcs option
    """

    # Load config
    config = load_config()

    # Determine namespace if not provided
    if namespace is None:
        namespace = get_current_namespace(config)
        if namespace is None:
            raise typer.BadParameter(
                "Namespace not provided. "
                "Please provide --namespace or run 'kblaunch setup' to configure."
            )

    # Determine queue name if not provided
    if queue_name is None:
        queue_name = get_user_queue(namespace)
        if queue_name is None:
            raise typer.BadParameter(
                "Queue name not provided. "
                "Please provide --queue-name or run 'kblaunch setup' to configure."
            )

    # Use email from config if not provided
    if email is None:
        email = config.get("email")
        if email is None:
            raise typer.BadParameter(
                "Email not provided and not found in config. "
                "Please provide --email or run 'kblaunch setup' to configure."
            )

    # Determine which NFS server to use (priority: command-line > config > env var > default)
    if nfs_server is None:
        nfs_server = config.get("nfs_server", NFS_SERVER)
        if nfs_server is None:
            # warn if NFS server is not set
            logger.warning(
                "NFS server not set/found. Please provide --nfs-server or run 'kblaunch setup' to mount the NFS partition."
            )

    # Add SLACK_WEBHOOK to local_env_vars if configured
    if "slack_webhook" in config:
        os.environ["SLACK_WEBHOOK"] = config["slack_webhook"]
        if "SLACK_WEBHOOK" not in local_env_vars:
            local_env_vars.append("SLACK_WEBHOOK")

    if "user" in config and os.getenv("USER") is None:
        os.environ["USER"] = config["user"]

    if pvc_name is None:
        pvc_name = config.get("default_pvc")

    if pvc_name is not None:
        if not check_if_pvc_exists(pvc_name, namespace):
            raise typer.BadParameter(
                f"PVC '{pvc_name}' does not exist in namespace '{namespace}'"
            )

    # Parse multiple PVCs if provided
    parsed_pvcs = []
    if pvcs:
        try:
            parsed_pvcs = json.loads(pvcs)
            # Validate the format
            for pvc in parsed_pvcs:
                if (
                    not isinstance(pvc, dict)
                    or "name" not in pvc
                    or "mount_path" not in pvc
                ):
                    raise typer.BadParameter(
                        "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
                    )
                # Validate that the PVC exists
                if not check_if_pvc_exists(pvc["name"], namespace):
                    raise typer.BadParameter(
                        f"PVC '{pvc['name']}' does not exist in namespace '{namespace}'"
                    )

        except json.JSONDecodeError:
            raise typer.BadParameter("Invalid JSON format for pvcs parameter")

    # Add validation for command parameter
    if not interactive and command == "":
        raise typer.BadParameter("--command is required when not in interactive mode")

    # Validate GPU constraints only if requesting GPUs
    if gpu_limit > 0:
        try:
            validate_gpu_constraints(gpu_product.value, gpu_limit, priority.value)
        except ValueError as e:
            raise typer.BadParameter(str(e))

    is_completed = check_if_completed(job_name, namespace=namespace)
    if not is_completed:
        if typer.confirm(
            f"Job '{job_name}' already exists. Do you want to delete it and create a new one?",
            default=False,
        ):
            if not delete_namespaced_job_safely(
                job_name,
                namespace=namespace,
                user=config.get("user"),
            ):
                logger.error("Failed to delete existing job")
                return 1
        else:
            logger.info("Operation cancelled by user")
            return 1

    logger.info(f"Job '{job_name}' is completed. Launching a new job.")

    # Get local environment variables
    env_vars_dict = get_env_vars(
        local_env_vars=local_env_vars,
        load_dotenv=load_dotenv,
    )

    # Add USER and GIT_EMAIL to env_vars if git_secret is configured
    if config.get("git_secret"):
        env_vars_dict["USER"] = config.get("user", os.getenv("USER", "unknown"))
        env_vars_dict["GIT_EMAIL"] = email

    secrets_env_vars_dict = get_secret_env_vars(
        secrets_names=secrets_env_vars,
        namespace=namespace,
    )

    # Check for overlapping keys in local and secret environment variables
    intersection = set(secrets_env_vars_dict.keys()).intersection(env_vars_dict.keys())
    if intersection:
        logger.warning(
            f"Overlapping keys in local and secret environment variables: {intersection}"
        )
    # Combine the environment variables
    union = set(secrets_env_vars_dict.keys()).union(env_vars_dict.keys())

    # Handle startup script
    script_content = None
    if startup_script:
        script_content = read_startup_script(startup_script)
        # Create ConfigMap for startup script
        try:
            api = client.CoreV1Api()
            config_map = client.V1ConfigMap(
                metadata=client.V1ObjectMeta(
                    name=f"{job_name}-startup", namespace=namespace
                ),
                data={"startup.sh": script_content},
            )
            try:
                api.create_namespaced_config_map(namespace=namespace, body=config_map)
            except ApiException as e:
                if e.status == 409:  # Already exists
                    api.patch_namespaced_config_map(
                        name=f"{job_name}-startup", namespace=namespace, body=config_map
                    )
                else:
                    raise
        except Exception as e:
            raise typer.BadParameter(f"Failed to create startup script ConfigMap: {e}")

    if interactive:
        cmd = "while true; do sleep 60; done;"
    else:
        cmd = command
        logger.info(f"Command: {cmd}")

    logger.info(f"Creating job for: {cmd}")

    # Modify command to include startup script
    if script_content:
        cmd = f"bash /startup.sh && {cmd}"

    # Build the start command with optional VS Code installation
    start_command = send_message_command(union)
    if config.get("git_secret"):
        start_command += setup_git_command()
    if vscode:
        start_command += install_vscode_command()
        if tunnel:
            start_command += start_vscode_tunnel_command(union)
    elif tunnel:
        logger.error("Cannot start tunnel without VS Code installation")

    full_cmd = start_command + cmd

    job = KubernetesJob(
        name=job_name,
        cpu_request=cpu_request,
        ram_request=ram_request,
        image=docker_image,
        gpu_type="nvidia.com/gpu" if gpu_limit > 0 else None,
        gpu_limit=gpu_limit,
        gpu_product=gpu_product.value if gpu_limit > 0 else None,
        command=["/bin/bash", "-c", "--"],
        args=[full_cmd],
        env_vars=env_vars_dict,
        secret_env_vars=secrets_env_vars_dict,
        user_email=email,
        namespace=namespace,
        kueue_queue_name=queue_name,
        nfs_server=nfs_server,
        pvc_name=pvc_name,
        pvcs=parsed_pvcs,
        priority=priority.value,
        startup_script=script_content,
        git_secret=config.get("git_secret"),
    )
    job_yaml = job.generate_yaml()
    logger.info(job_yaml)
    # Run the Job on the Kubernetes cluster
    if not dry_run:
        job.run()
kblaunch launch
Launch a Kubernetes job with specified configuration.
This command creates and deploys a Kubernetes job with the given specifications, handling GPU allocation, resource requests, and environment setup.
Args:
- email (str, optional): User email for notifications
- job_name (str, required): Name of the Kubernetes job
- docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
- namespace (str, optional): Kubernetes namespace (defaults to KUBE_NAMESPACE)
- queue_name (str, optional): Kueue queue name (defaults to KUBE_USER_QUEUE)
- interactive (bool, default=False): Run in interactive mode
- command (str, default=""): Command to run in container (required unless --interactive is set)
- cpu_request (str, default="6"): CPU cores request
- ram_request (str, default="40Gi"): RAM request
- gpu_limit (int, default=1): Number of GPUs
- gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
- secrets_env_vars (List[str], default=[]): Secret environment variables
- local_env_vars (List[str], default=[]): Local environment variables
- load_dotenv (bool, default=True): Load .env file
- nfs_server (str, optional): NFS server IP (overrides config)
- pvc_name (str, optional): PVC name for single PVC mounting at /pvc
- pvcs (str, optional): Multiple PVCs with mount paths in JSON format (used for mounting multiple PVCs)
- dry_run (bool, default=False): Print YAML only
- priority (PRIORITY, default="default"): Job priority
- vscode (bool, default=False): Install VS Code
- tunnel (bool, default=False): Start VS Code tunnel
- startup_script (str, optional): Path to startup script
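Several of these options fall back to the saved configuration when they are omitted on the command line. The sketch below illustrates that precedence for `email`, `nfs_server`, and `pvc_name` only. `first_set` and `resolve_launch_settings` are hypothetical names, and `default_nfs` stands in for whatever built-in default the tool ships with; this is an illustration of the fallback order, not the tool's code.

```python
from typing import Mapping, Optional


def first_set(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is not None, or None if all are unset."""
    for value in candidates:
        if value is not None:
            return value
    return None


def resolve_launch_settings(
    cli: Mapping[str, Optional[str]],
    config: Mapping[str, str],
    default_nfs: Optional[str] = None,
) -> dict:
    """Apply 'command line first, then saved config, then default' to each setting."""
    email = first_set(cli.get("email"), config.get("email"))
    if email is None:
        # An email is mandatory, so fail early with an actionable message.
        raise ValueError(
            "Email not provided and not found in config. "
            "Please provide --email or run 'kblaunch setup' to configure."
        )
    return {
        "email": email,
        "nfs_server": first_set(
            cli.get("nfs_server"), config.get("nfs_server"), default_nfs
        ),
        "pvc_name": first_set(cli.get("pvc_name"), config.get("default_pvc")),
    }


# Illustrative values: only --nfs-server was passed on the command line.
settings = resolve_launch_settings(
    cli={"nfs_server": "10.0.0.5"},
    config={
        "email": "alice@example.com",
        "nfs_server": "10.0.0.1",
        "default_pvc": "alice-pvc",
    },
)
print(settings)
```

Checking `is not None` instead of truthiness keeps an explicitly passed empty string from being silently replaced by the configured value.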
Examples:
# Launch an interactive GPU job
kblaunch launch --job-name test-job --interactive
# Launch a batch GPU job with custom command
kblaunch launch --job-name batch-job --command "python train.py"
# Launch a CPU-only job
kblaunch launch --job-name cpu-job --gpu-limit 0
# Launch with VS Code support
kblaunch launch --job-name dev-job --interactive --vscode --tunnel
# Launch with multiple PVCs
kblaunch launch --job-name multi-pvc-job --pvcs '[{"name":"data-pvc","mount_path":"/data"},{"name":"models-pvc","mount_path":"/models"}]'
Notes:
- Interactive jobs keep running until manually terminated
- GPU jobs require appropriate queue and priority settings
- VS Code tunnel requires --vscode and a configured Slack webhook (SLACK_WEBHOOK)
- Multiple PVCs can be mounted with custom paths using the --pvcs option
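The `--pvcs` value is a JSON list in which every entry needs a `name` and a `mount_path`. A standalone sketch of that validation follows. `parse_pvcs` is a hypothetical helper, it raises `ValueError` where the CLI raises `typer.BadParameter`, and the `pvc_exists` callback is a stand-in for the cluster lookup so the example runs without Kubernetes access. The extra check that the top-level value is a list is an addition made for this sketch.

```python
import json
from typing import Callable, Dict, List


def parse_pvcs(
    raw: str,
    pvc_exists: Callable[[str], bool] = lambda name: True,
) -> List[Dict[str, str]]:
    """Parse and validate a --pvcs JSON string.

    `pvc_exists` is a hypothetical hook; by default every PVC is assumed to exist.
    """
    try:
        entries = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("Invalid JSON format for pvcs parameter") from exc

    if not isinstance(entries, list):
        raise ValueError("pvcs must be a JSON list of objects")

    for entry in entries:
        if (
            not isinstance(entry, dict)
            or "name" not in entry
            or "mount_path" not in entry
        ):
            raise ValueError(
                "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
            )
        if not pvc_exists(entry["name"]):
            raise ValueError(f"PVC '{entry['name']}' does not exist")
    return entries


parsed = parse_pvcs(
    '[{"name":"data-pvc","mount_path":"/data"},'
    '{"name":"models-pvc","mount_path":"/models"}]'
)
print([(p["name"], p["mount_path"]) for p in parsed])
```

Passing the existence check in as a callback keeps the parsing logic testable on its own; a real caller would supply a function that queries the target namespace.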
@monitor_app.command("gpus")
def monitor_gpus(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor gpus`
    Display overall GPU statistics and utilization by type.

    Shows a comprehensive view of GPU allocation and usage across the cluster,
    including both running and pending GPU requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Total GPU count by type
    - Running vs. pending GPUs
    - Details of pending GPU requests
    - Wait times for pending requests

    Examples:
        ```bash
        kblaunch monitor gpus
        kblaunch monitor gpus --namespace custom-namespace
        ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_gpu_total(namespace=namespace)
    except Exception as e:
        print(f"Error displaying GPU stats: {e}")
kblaunch monitor gpus
Display overall GPU statistics and utilization by type.
Shows a comprehensive view of GPU allocation and usage across the cluster, including both running and pending GPU requests.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
Output includes:
- Total GPU count by type
- Running vs. pending GPUs
- Details of pending GPU requests
- Wait times for pending requests
Examples:
kblaunch monitor gpus
kblaunch monitor gpus --namespace custom-namespace
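The running versus pending breakdown by GPU type can be pictured as a simple tally. The sketch below works on hand-written records, not on live cluster data. The record layout (`gpu_product`, `gpus`, `phase`) and the helper `tally_gpus` are assumptions made for illustration, and the product names are example values.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def tally_gpus(pods: Iterable[Mapping]) -> Dict[str, Dict[str, int]]:
    """Sum requested GPUs per product, split into running and pending.

    Pods in any other phase (for example Succeeded or Failed) are ignored.
    """
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"running": 0, "pending": 0}
    )
    for pod in pods:
        if pod["phase"] == "Running":
            totals[pod["gpu_product"]]["running"] += pod["gpus"]
        elif pod["phase"] == "Pending":
            totals[pod["gpu_product"]]["pending"] += pod["gpus"]
    return {product: dict(counts) for product, counts in totals.items()}


# Hypothetical snapshot of pods in one namespace.
pods = [
    {"gpu_product": "NVIDIA-A100-SXM4-40GB", "gpus": 2, "phase": "Running"},
    {"gpu_product": "NVIDIA-A100-SXM4-40GB", "gpus": 1, "phase": "Pending"},
    {"gpu_product": "NVIDIA-A100-SXM4-80GB", "gpus": 4, "phase": "Running"},
    {"gpu_product": "NVIDIA-A100-SXM4-80GB", "gpus": 8, "phase": "Succeeded"},
]
summary = tally_gpus(pods)
for product, counts in summary.items():
    print(product, counts)
```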
@monitor_app.command("users")
def monitor_users(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor users`
    Display GPU usage statistics grouped by user.

    Provides a user-centric view of GPU allocation and utilization,
    helping identify resource usage patterns across users.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - GPUs allocated per user
    - Average memory usage per user
    - Inactive GPU count per user
    - Overall usage totals

    Examples:
        ```bash
        kblaunch monitor users
        kblaunch monitor users --namespace custom-namespace
        ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_user_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying user stats: {e}")
kblaunch monitor users
Display GPU usage statistics grouped by user.
Provides a user-centric view of GPU allocation and utilization, helping identify resource usage patterns across users.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
Output includes:
- GPUs allocated per user
- Average memory usage per user
- Inactive GPU count per user
- Overall usage totals
Examples:
kblaunch monitor users
kblaunch monitor users --namespace custom-namespace
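A per-user view like this amounts to grouping per-GPU readings by owner. The following sketch shows one possible aggregation over hand-written data. How the tool decides that a GPU is inactive is not documented here, so the `INACTIVE_BELOW_MIB` threshold, the record layout, and the helper `summarize_by_user` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping

# Hypothetical cutoff: a GPU using less memory than this counts as inactive.
INACTIVE_BELOW_MIB = 100


def summarize_by_user(gpus: Iterable[Mapping]) -> Dict[str, Dict[str, float]]:
    """Group per-GPU memory readings by user and derive summary figures."""
    readings: Dict[str, List[int]] = defaultdict(list)
    for gpu in gpus:
        readings[gpu["user"]].append(gpu["memory_used_mib"])

    return {
        user: {
            "gpus": len(values),
            "avg_memory_mib": mean(values),
            "inactive": sum(1 for value in values if value < INACTIVE_BELOW_MIB),
        }
        for user, values in readings.items()
    }


# Hypothetical readings, one record per allocated GPU.
gpus = [
    {"user": "alice", "memory_used_mib": 20000},
    {"user": "alice", "memory_used_mib": 0},
    {"user": "bob", "memory_used_mib": 40000},
]
stats = summarize_by_user(gpus)
for user, row in stats.items():
    print(user, row)
```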
@monitor_app.command("jobs")
def monitor_jobs(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor jobs`
    Display detailed job-level GPU statistics.

    Shows comprehensive information about all running GPU jobs,
    including resource usage and job characteristics.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Job identification and ownership
    - Resource allocation (CPU, RAM, GPU)
    - GPU memory usage
    - Job status (active/inactive)
    - Job mode (interactive/batch)
    - Resource totals and averages

    Examples:
        ```bash
        kblaunch monitor jobs
        kblaunch monitor jobs --namespace custom-namespace
        ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_job_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying job stats: {e}")
kblaunch monitor jobs
Display detailed job-level GPU statistics.
Shows comprehensive information about all running GPU jobs, including resource usage and job characteristics.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
Output includes:
- Job identification and ownership
- Resource allocation (CPU, RAM, GPU)
- GPU memory usage
- Job status (active/inactive)
- Job mode (interactive/batch)
- Resource totals and averages
Examples:
kblaunch monitor jobs
kblaunch monitor jobs --namespace custom-namespace
@monitor_app.command("queue")
def monitor_queue(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    reasons: bool = typer.Option(False, help="Display queued job event messages"),
    include_cpu: bool = typer.Option(False, help="Show CPU jobs in the queue"),
):
    """
    `kblaunch monitor queue`
    Display statistics about queued workloads.

    Shows information about jobs waiting in the Kueue scheduler,
    including wait times and resource requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
    - reasons: Show detailed reason messages for queued jobs
    - include_cpu: Include CPU jobs in the queue

    Output includes:
    - Queue position and wait time
    - Resource requests (CPU, RAM, GPU)
    - Job priority
    - Queueing reasons (if --reasons flag is used)

    Examples:
        ```bash
        kblaunch monitor queue
        kblaunch monitor queue --reasons
        kblaunch monitor queue --namespace custom-namespace
        ```
    """
    try:
        namespace = namespace or get_current_namespace(config)
        print_queue_stats(namespace=namespace, reasons=reasons, include_cpu=include_cpu)
    except Exception as e:
        print(f"Error displaying queue stats: {e}")
kblaunch monitor queue
Display statistics about queued workloads.
Shows information about jobs waiting in the Kueue scheduler, including wait times and resource requests.
Args:
- namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
- reasons: Show detailed reason messages for queued jobs
- include_cpu: Include CPU jobs in the queue
Output includes:
- Queue position and wait time
- Resource requests (CPU, RAM, GPU)
- Job priority
- Queueing reasons (if --reasons flag is used)
Examples:
kblaunch monitor queue
kblaunch monitor queue --reasons
kblaunch monitor queue --namespace custom-namespace
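Queue position and wait time can be derived from each workload's creation timestamp. The sketch below is a simplified illustration over hand-written records: it assumes position follows creation order alone (a real Kueue queue also takes priority into account), and the helpers `parse_timestamp`, `format_wait`, and `queue_report` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Mapping


def parse_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2024-01-01T10:00:00Z."""
    # Replace the trailing Z so fromisoformat also works on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def format_wait(seconds: float) -> str:
    """Render a duration in seconds as hours and zero-padded minutes."""
    hours, remainder = divmod(int(seconds), 3600)
    return f"{hours}h{remainder // 60:02d}m"


def queue_report(workloads: Iterable[Mapping], now: datetime) -> List[Dict]:
    """Order workloads by creation time and attach position and wait time."""
    ordered = sorted(workloads, key=lambda w: parse_timestamp(w["created"]))
    return [
        {
            "position": position,
            "name": workload["name"],
            "wait": format_wait(
                (now - parse_timestamp(workload["created"])).total_seconds()
            ),
        }
        for position, workload in enumerate(ordered, start=1)
    ]


# Fixed reference time and hypothetical workloads so the output is reproducible.
now = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
report = queue_report(
    [
        {"name": "job-b", "created": "2024-01-01T12:00:00Z"},
        {"name": "job-a", "created": "2024-01-01T10:00:00Z"},
    ],
    now,
)
for row in report:
    print(row)
```

Passing `now` as an argument instead of reading the clock inside the function keeps the report deterministic and easy to test.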