kblaunch

kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.

Commands

  • Launching GPU jobs with various configurations
  • Monitoring GPU usage and job statistics
  • Setting up user configurations and preferences
  • Managing persistent volumes and Git authentication

Features

  • Interactive and batch job support
  • GPU resource management and constraints
  • Environment variable handling from multiple sources
  • Persistent Volume Claims (PVC) for storage
  • Git SSH authentication
  • VS Code integration with remote tunneling
  • Slack notifications for job status
  • Real-time cluster monitoring

Resource Types

  • A100 GPUs (40GB and 80GB variants)
  • H100 GPUs (80GB variant)
  • CPU and RAM allocation
  • Persistent storage volumes

Job Priority Classes

  • default: Standard priority for most workloads
  • batch: Lower priority for long-running jobs
  • short: High priority for quick jobs (with GPU constraints)

Environment Integration

  • Kubernetes secrets
  • Local environment variables
  • .env file support
  • SSH key management
  • NFS workspace mounting
 1"""kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.
 2
 3## Commands
 4* Launching GPU jobs with various configurations
 5* Monitoring GPU usage and job statistics
 6* Setting up user configurations and preferences
 7* Managing persistent volumes and Git authentication
 8
 9## Features
10* Interactive and batch job support
11* GPU resource management and constraints
12* Environment variable handling from multiple sources
13* Persistent Volume Claims (PVC) for storage
14* Git SSH authentication
15* VS Code integration with remote tunneling
16* Slack notifications for job status
17* Real-time cluster monitoring
18
19## Resource Types
20* A100 GPUs (40GB and 80GB variants)
21* H100 GPUs (80GB variant)
22* CPU and RAM allocation
23* Persistent storage volumes
24
25## Job Priority Classes
26* default: Standard priority for most workloads
27* batch: Lower priority for long-running jobs
28* short: High priority for quick jobs (with GPU constraints)
29
30## Environment Integration
31* Kubernetes secrets
32* Local environment variables
33* .env file support
34* SSH key management
35* NFS workspace mounting
36"""
import importlib.metadata

__version__ = importlib.metadata.version("kblaunch")

__all__ = [
    "setup",
    "launch",
    "monitor_gpus",
    "monitor_users",
    "monitor_jobs",
    "monitor_queue",
]

from .cli import setup, launch, monitor_gpus, monitor_users, monitor_jobs, monitor_queue
@app.command()
def setup():
    """
    `kblaunch setup`

    Interactive setup wizard for kblaunch configuration.
    No arguments - all configuration is done through interactive prompts.

    This command walks users through the initial setup process, configuring:
    - User identity and email
    - Namespace and queue settings
    - Slack notifications webhook
    - Persistent Volume Claims (PVC) for storage
    - Git SSH authentication
    - NFS server configuration

    The configuration is stored in ~/.cache/.kblaunch/config.json.

    Configuration includes:
    - User: Kubernetes username for job ownership
    - Email: User email for notifications and Git configuration
    - Namespace: Kubernetes namespace for job deployment
    - Queue: Kueue queue name for job scheduling
    - Slack webhook: URL for job status notifications
    - PVC: Persistent storage configuration
    - Git SSH: Authentication for private repositories
    - NFS: Server address for mounting storage
    """
    config = load_config()

    # validate user
    default_user = os.getenv("USER")
    if "user" in config:
        default_user = config["user"]
    else:
        config["user"] = default_user

    if typer.confirm(
        f"Would you like to set the user? (default: {default_user})", default=False
    ):
        user = typer.prompt("Please enter your user", default=default_user)
        config["user"] = user

    # Get email
    existing_email = config.get("email", None)
    email = typer.prompt(
        f"Please enter your email (existing: {existing_email})", default=existing_email
    )
    config["email"] = email

    # Configure namespace
    existing_namespace = config.get("namespace", os.getenv("KUBE_NAMESPACE"))
    if typer.confirm("Would you like to configure your namespace?", default=True):
        namespace = typer.prompt(
            f"Please enter your namespace (existing: {existing_namespace})",
            default=existing_namespace,
        )
        config["namespace"] = namespace
        # Now that we have namespace, ask about queue
        existing_queue = config.get("queue", get_user_queue(namespace))
        if typer.confirm("Would you like to configure your queue?", default=True):
            queue = typer.prompt(
                f"Please enter your queue name (existing: {existing_queue})",
                default=existing_queue or f"{namespace}-user-queue",
            )
            config["queue"] = queue

    # Get the current NFS server from config or default
    current_nfs = config.get("nfs_server", NFS_SERVER)
    if typer.confirm("Would you like to configure the NFS server?", default=False):
        nfs_server = typer.prompt(
            f"Enter your NFS server address (existing: {current_nfs})",
            default=current_nfs,
        )
        config["nfs_server"] = nfs_server

    # Get Slack webhook
    if typer.confirm("Would you like to set up Slack notifications?", default=False):
        existing_webhook = config.get("slack_webhook", None)
        webhook = typer.prompt(
            f"Enter your Slack webhook URL (existing: {existing_webhook})",
            default=existing_webhook,
        )
        config["slack_webhook"] = webhook

    if typer.confirm("Would you like to use a PVC?", default=False):
        user = config["user"]
        current_default = config.get("default_pvc", f"{user}-pvc")

        pvc_name = typer.prompt(
            f"Enter the PVC name to use (default: {current_default}). We will help you create it if it does not exist.",
            default=current_default,
        )

        namespace = config.get("namespace", get_current_namespace(config))
        if check_if_pvc_exists(pvc_name, namespace):
            if typer.confirm(
                f"Would you like to set {pvc_name} as the default PVC?",
                default=True,
            ):
                config["default_pvc"] = pvc_name
        else:
            if typer.confirm(
                f"PVC '{pvc_name}' does not exist. Would you like to create it?",
                default=True,
            ):
                pvc_size = typer.prompt(
                    "Enter the desired PVC size (e.g. 10Gi)", default="10Gi"
                )
                try:
                    if create_pvc(user, pvc_name, pvc_size, namespace):
                        config["default_pvc"] = pvc_name
                except (ValueError, ApiException) as e:
                    logger.error(f"Failed to create PVC: {e}")

    # Git authentication setup
    if typer.confirm("Would you like to set up Git SSH authentication?", default=False):
        default_key_path = str(Path.home() / ".ssh" / "id_rsa")
        key_path = typer.prompt(
            "Enter the path to your SSH private key",
            default=default_key_path,
        )
        secret_name = f"{config['user']}-git-ssh"
        namespace = config.get("namespace", get_current_namespace(config))
        if create_git_secret(secret_name, key_path, namespace):
            config["git_secret"] = secret_name

    # validate slack webhook
    if "slack_webhook" in config:
        # test post to slack
        try:
            logger.info("Sending test message to Slack")
            message = "Hello :wave: from ```kblaunch```"
            response = requests.post(
                config["slack_webhook"],
                json={"text": message},
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Error sending test message to Slack: {e}")

    # Save config
    save_config(config)
    logger.info(f"Configuration saved to {CONFIG_FILE}")
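The wizard persists its answers through `load_config`/`save_config`, whose bodies are not shown above. Assuming the documented location (`~/.cache/.kblaunch/config.json`) and a flat JSON object, they plausibly reduce to the sketch below (`CONFIG_FILE` here is a stand-in, not necessarily the package's actual constant):

```python
import json
from pathlib import Path

# Hypothetical stand-in for the package's CONFIG_FILE constant.
CONFIG_FILE = Path.home() / ".cache" / ".kblaunch" / "config.json"


def load_config(path: Path = CONFIG_FILE) -> dict:
    """Return the stored configuration, or an empty dict on first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def save_config(config: dict, path: Path = CONFIG_FILE) -> None:
    """Write the configuration, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
```

Because every setting lives in one JSON file, re-running `kblaunch setup` sees the previous answers as defaults, which matches the "(existing: ...)" prompts above.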

@app.command()
def launch(
    email: str = typer.Option(None, help="User email (overrides config)"),
    job_name: str = typer.Option(..., help="Name of the Kubernetes job"),
    docker_image: str = typer.Option(
        "nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04", help="Docker image"
    ),
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    queue_name: str = typer.Option(
        None, help="Kueue queue name (defaults to KUBE_USER_QUEUE)"
    ),
    interactive: bool = typer.Option(False, help="Run in interactive mode"),
    command: str = typer.Option(
        "", help="Command to run in the container"
    ),  # Made optional
    cpu_request: str = typer.Option("6", help="CPU request"),
    ram_request: str = typer.Option("40Gi", help="RAM request"),
    gpu_limit: int = typer.Option(1, help="GPU limit (0 for non-GPU jobs)"),
    gpu_product: GPU_PRODUCTS = typer.Option(
        "NVIDIA-A100-SXM4-40GB",
        help="GPU product type to use (ignored for non-GPU jobs)",
        show_choices=True,
        show_default=True,
    ),
    secrets_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of secret environment variables to export to the container",
    ),
    local_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of local environment variables to export to the container",
    ),
    load_dotenv: bool = typer.Option(
        True, help="Load environment variables from .env file"
    ),
    nfs_server: Optional[str] = typer.Option(
        None, help="NFS server (overrides config and environment)"
    ),
    pvc_name: str = typer.Option(None, help="Persistent Volume Claim name"),
    pvcs: str = typer.Option(
        None,
        help='Multiple PVCs with mount paths in JSON format (e.g., \'[{"name":"my-pvc","mount_path":"/data"}]\')',
    ),
    dry_run: bool = typer.Option(False, help="Dry run"),
    priority: PRIORITY = typer.Option(
        "default", help="Priority class name", show_default=True, show_choices=True
    ),
    vscode: bool = typer.Option(False, help="Install VS Code CLI in the container"),
    tunnel: bool = typer.Option(
        False,
        help="Start a VS Code SSH tunnel on startup. Requires SLACK_WEBHOOK and --vscode",
    ),
    startup_script: str = typer.Option(
        None, help="Path to startup script to run in container"
    ),
):
    """
    `kblaunch launch`
    Launch a Kubernetes job with the specified configuration.

    This command creates and deploys a Kubernetes job with the given specifications,
    handling GPU allocation, resource requests, and environment setup.

    Args:
    * email (str, optional): User email for notifications (defaults to the configured email)
    * job_name (str, required): Name of the Kubernetes job
    * docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
    * namespace (str, optional): Kubernetes namespace (defaults to the KUBE_NAMESPACE environment variable)
    * queue_name (str, optional): Kueue queue name (defaults to the KUBE_USER_QUEUE environment variable)
    * interactive (bool, default=False): Run in interactive mode
    * command (str, default=""): Command to run in the container (required for non-interactive jobs)
    * cpu_request (str, default="6"): CPU cores request
    * ram_request (str, default="40Gi"): RAM request
    * gpu_limit (int, default=1): Number of GPUs (0 for CPU-only jobs)
    * gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
    * secrets_env_vars (List[str], default=[]): Secret environment variables
    * local_env_vars (List[str], default=[]): Local environment variables
    * load_dotenv (bool, default=True): Load .env file
    * nfs_server (str, optional): NFS server IP (overrides config)
    * pvc_name (str, optional): PVC name for single PVC mounting at /pvc
    * pvcs (str, optional): Multiple PVCs with mount paths, in JSON format
    * dry_run (bool, default=False): Print the job YAML without submitting it
    * priority (PRIORITY, default="default"): Job priority class
    * vscode (bool, default=False): Install the VS Code CLI in the container
    * tunnel (bool, default=False): Start a VS Code tunnel on startup (requires --vscode)
    * startup_script (str, optional): Path to a startup script to run in the container

    Examples:
        ```bash
        # Launch an interactive GPU job
        kblaunch launch --job-name test-job --interactive

        # Launch a batch GPU job with a custom command
        kblaunch launch --job-name batch-job --command "python train.py"

        # Launch a CPU-only job
        kblaunch launch --job-name cpu-job --gpu-limit 0

        # Launch with VS Code support
        kblaunch launch --job-name dev-job --interactive --vscode --tunnel

        # Launch with multiple PVCs
        kblaunch launch --job-name multi-pvc-job --pvcs '[{"name":"data-pvc","mount_path":"/data"},{"name":"models-pvc","mount_path":"/models"}]'
        ```

    Notes:
    - Interactive jobs keep running until manually terminated
    - GPU jobs require appropriate queue and priority settings
    - The VS Code tunnel requires a Slack webhook configuration
    - Multiple PVCs can be mounted at custom paths using the --pvcs option
    """

    # Load config
    config = load_config()

    # Determine namespace if not provided
    if namespace is None:
        namespace = get_current_namespace(config)
        if namespace is None:
            raise typer.BadParameter(
                "Namespace not provided. "
                "Please provide --namespace or run 'kblaunch setup' to configure."
            )

    # Determine queue name if not provided
    if queue_name is None:
        queue_name = get_user_queue(namespace)
        if queue_name is None:
            raise typer.BadParameter(
                "Queue name not provided. "
                "Please provide --queue-name or run 'kblaunch setup' to configure."
            )

    # Use email from config if not provided
    if email is None:
        email = config.get("email")
        if email is None:
            raise typer.BadParameter(
                "Email not provided and not found in config. "
                "Please provide --email or run 'kblaunch setup' to configure."
            )

    # Determine which NFS server to use (priority: command-line > config > env var > default)
    if nfs_server is None:
        nfs_server = config.get("nfs_server", NFS_SERVER)
        if nfs_server is None:
            # warn if NFS server is not set
            logger.warning(
                "NFS server not set. Please provide --nfs-server or run 'kblaunch setup' to configure NFS mounting."
            )

    # Add SLACK_WEBHOOK to local_env_vars if configured
    if "slack_webhook" in config:
        os.environ["SLACK_WEBHOOK"] = config["slack_webhook"]
        if "SLACK_WEBHOOK" not in local_env_vars:
            local_env_vars.append("SLACK_WEBHOOK")

    if "user" in config and os.getenv("USER") is None:
        os.environ["USER"] = config["user"]

    if pvc_name is None:
        pvc_name = config.get("default_pvc")

    if pvc_name is not None:
        if not check_if_pvc_exists(pvc_name, namespace):
            raise typer.BadParameter(
                f"PVC '{pvc_name}' does not exist in namespace '{namespace}'"
            )

    # Parse multiple PVCs if provided
    parsed_pvcs = []
    if pvcs:
        try:
            parsed_pvcs = json.loads(pvcs)
            # Validate the format
            for pvc in parsed_pvcs:
                if (
                    not isinstance(pvc, dict)
                    or "name" not in pvc
                    or "mount_path" not in pvc
                ):
                    raise typer.BadParameter(
                        "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
                    )
                # Validate that the PVC exists
                if not check_if_pvc_exists(pvc["name"], namespace):
                    raise typer.BadParameter(
                        f"PVC '{pvc['name']}' does not exist in namespace '{namespace}'"
                    )

        except json.JSONDecodeError:
            raise typer.BadParameter("Invalid JSON format for pvcs parameter")

    # Add validation for command parameter
    if not interactive and command == "":
        raise typer.BadParameter("--command is required when not in interactive mode")

    # Validate GPU constraints only if requesting GPUs
    if gpu_limit > 0:
        try:
            validate_gpu_constraints(gpu_product.value, gpu_limit, priority.value)
        except ValueError as e:
            raise typer.BadParameter(str(e))

    is_completed = check_if_completed(job_name, namespace=namespace)
    if not is_completed:
        if typer.confirm(
            f"Job '{job_name}' already exists. Do you want to delete it and create a new one?",
            default=False,
        ):
            if not delete_namespaced_job_safely(
                job_name,
                namespace=namespace,
                user=config.get("user"),
            ):
                logger.error("Failed to delete existing job")
                return 1
        else:
            logger.info("Operation cancelled by user")
            return 1

    logger.info(f"Job '{job_name}' is completed. Launching a new job.")

    # Get local environment variables
    env_vars_dict = get_env_vars(
        local_env_vars=local_env_vars,
        load_dotenv=load_dotenv,
    )

    # Add USER and GIT_EMAIL to env_vars if git_secret is configured
    if config.get("git_secret"):
        env_vars_dict["USER"] = config.get("user", os.getenv("USER", "unknown"))
        env_vars_dict["GIT_EMAIL"] = email

    secrets_env_vars_dict = get_secret_env_vars(
        secrets_names=secrets_env_vars,
        namespace=namespace,
    )

    # Check for overlapping keys in local and secret environment variables
    intersection = set(secrets_env_vars_dict.keys()).intersection(env_vars_dict.keys())
    if intersection:
        logger.warning(
            f"Overlapping keys in local and secret environment variables: {intersection}"
        )
    # Combine the environment variables
    union = set(secrets_env_vars_dict.keys()).union(env_vars_dict.keys())

    # Handle startup script
    script_content = None
    if startup_script:
        script_content = read_startup_script(startup_script)
        # Create ConfigMap for startup script
        try:
            api = client.CoreV1Api()
            config_map = client.V1ConfigMap(
                metadata=client.V1ObjectMeta(
                    name=f"{job_name}-startup", namespace=namespace
                ),
                data={"startup.sh": script_content},
            )
            try:
                api.create_namespaced_config_map(namespace=namespace, body=config_map)
            except ApiException as e:
                if e.status == 409:  # Already exists
                    api.patch_namespaced_config_map(
                        name=f"{job_name}-startup", namespace=namespace, body=config_map
                    )
                else:
                    raise
        except Exception as e:
            raise typer.BadParameter(f"Failed to create startup script ConfigMap: {e}")

    if interactive:
        cmd = "while true; do sleep 60; done;"
    else:
        cmd = command
        logger.info(f"Command: {cmd}")

    logger.info(f"Creating job for: {cmd}")

    # Modify command to include startup script
    if script_content:
        cmd = f"bash /startup.sh && {cmd}"

    # Build the start command with optional VS Code installation
    start_command = send_message_command(union)
    if config.get("git_secret"):
        start_command += setup_git_command()
    if vscode:
        start_command += install_vscode_command()
        if tunnel:
            start_command += start_vscode_tunnel_command(union)
    elif tunnel:
        logger.error("Cannot start tunnel without VS Code installation")

    full_cmd = start_command + cmd

    job = KubernetesJob(
        name=job_name,
        cpu_request=cpu_request,
        ram_request=ram_request,
        image=docker_image,
        gpu_type="nvidia.com/gpu" if gpu_limit > 0 else None,
        gpu_limit=gpu_limit,
        gpu_product=gpu_product.value if gpu_limit > 0 else None,
        command=["/bin/bash", "-c", "--"],
        args=[full_cmd],
        env_vars=env_vars_dict,
        secret_env_vars=secrets_env_vars_dict,
        user_email=email,
        namespace=namespace,
        kueue_queue_name=queue_name,
        nfs_server=nfs_server,
        pvc_name=pvc_name,
        pvcs=parsed_pvcs,
        priority=priority.value,
        startup_script=script_content,
        git_secret=config.get("git_secret"),
    )
    job_yaml = job.generate_yaml()
    logger.info(job_yaml)
    # Run the Job on the Kubernetes cluster
    if not dry_run:
        job.run()
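The `--pvcs` option packs several mounts into one JSON string. The shape-checking above can be factored into a standalone helper for illustration (this sketch deliberately omits the cluster-existence check done by `check_if_pvc_exists`, and raises plain `ValueError` instead of `typer.BadParameter`):

```python
import json


def parse_pvcs(pvcs_json: str) -> list[dict]:
    """Parse and shape-check a --pvcs JSON payload.

    Each entry must be an object with 'name' and 'mount_path' keys.
    Raises ValueError on malformed input.
    """
    try:
        entries = json.loads(pvcs_json)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON format for pvcs parameter: {e}") from e
    if not isinstance(entries, list):
        raise ValueError("pvcs must be a JSON array")
    for entry in entries:
        if not isinstance(entry, dict) or "name" not in entry or "mount_path" not in entry:
            raise ValueError(
                "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
            )
    return entries
```

For example, `parse_pvcs('[{"name":"data-pvc","mount_path":"/data"}]')` accepts the payload, while a missing `mount_path` key is rejected before anything touches the cluster.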

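When the same key is defined both as a Kubernetes secret and as a local environment variable, `launch` logs a warning and keeps both sources separate (secrets must reach the pod as secret references, not plain values); only the key names are unioned. That merge-and-warn step, isolated as a sketch:

```python
import logging

logger = logging.getLogger("kblaunch.sketch")


def combine_env_keys(local_env: dict[str, str], secret_env: dict[str, str]) -> set[str]:
    """Warn about keys present in both sources and return the union of key names."""
    overlap = set(secret_env) & set(local_env)
    if overlap:
        logger.warning(
            "Overlapping keys in local and secret environment variables: %s", overlap
        )
    return set(secret_env) | set(local_env)
```

The returned key set is what the startup notification reports, mirroring how the code above passes `union` to `send_message_command`.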
1328@monitor_app.command("gpus")
1329def monitor_gpus(
1330    namespace: str = typer.Option(
1331        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
1332    ),
1333):
1334    """
1335    `kblaunch monitor gpus`
1336    Display overall GPU statistics and utilization by type.
1337
1338    Shows a comprehensive view of GPU allocation and usage across the cluster,
1339    including both running and pending GPU requests.
1340
1341    Args:
1342    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
1343
1344    Output includes:
1345    - Total GPU count by type
1346    - Running vs. pending GPUs
1347    - Details of pending GPU requests
1348    - Wait times for pending requests
1349
1350    Examples:
1351        ```bash
1352        kblaunch monitor gpus
1353        kblaunch monitor gpus --namespace custom-namespace
1354        ```
1355    """
1356    try:
1357        namespace = namespace or get_current_namespace(config)
1358        print_gpu_total(namespace=namespace)
1359    except Exception as e:
1360        print(f"Error displaying GPU stats: {e}")
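`print_gpu_total` is defined elsewhere in the package; to illustrate the kind of aggregation it performs, here is a toy tally over pod-like records. The record shape (`gpu_product`, `gpus`, `phase`) is invented for the example and is not the Kubernetes API schema:

```python
from collections import defaultdict


def tally_gpus(pods: list[dict]) -> dict[str, dict[str, int]]:
    """Count running vs. pending GPUs per GPU product type."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: {"running": 0, "pending": 0})
    for pod in pods:
        if pod.get("gpus", 0) <= 0:
            continue  # ignore CPU-only pods
        bucket = "running" if pod["phase"] == "Running" else "pending"
        totals[pod["gpu_product"]][bucket] += pod["gpus"]
    return dict(totals)
```

The real command additionally inspects pending pods' events to report wait times and queueing reasons.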

1363@monitor_app.command("users")
1364def monitor_users(
1365    namespace: str = typer.Option(
1366        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
1367    ),
1368):
1369    """
1370    `kblaunch monitor users`
1371    Display GPU usage statistics grouped by user.
1372
1373    Provides a user-centric view of GPU allocation and utilization,
1374    helping identify resource usage patterns across users.
1375
1376    Args:
1377    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
1378
1379    Output includes:
1380    - GPUs allocated per user
1381    - Average memory usage per user
1382    - Inactive GPU count per user
1383    - Overall usage totals
1384
1385    Examples:
1386        ```bash
1387        kblaunch monitor users
1388        kblaunch monitor users --namespace custom-namespace
1389        ```
1390    """
1391    try:
1392        namespace = namespace or get_current_namespace(config)
1393        print_user_stats(namespace=namespace)
1394    except Exception as e:
1395        print(f"Error displaying user stats: {e}")

1398@monitor_app.command("jobs")
1399def monitor_jobs(
1400    namespace: str = typer.Option(
1401        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
1402    ),
1403):
1404    """
1405    `kblaunch monitor jobs`
1406    Display detailed job-level GPU statistics.
1407
1408    Shows comprehensive information about all running GPU jobs,
1409    including resource usage and job characteristics.
1410
1411    Args:
1412    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
1413
1414    Output includes:
1415    - Job identification and ownership
1416    - Resource allocation (CPU, RAM, GPU)
1417    - GPU memory usage
1418    - Job status (active/inactive)
1419    - Job mode (interactive/batch)
1420    - Resource totals and averages
1421
1422    Examples:
1423        ```bash
1424        kblaunch monitor jobs
1425        kblaunch monitor jobs --namespace custom-namespace
1426        ```
1427    """
1428    try:
1429        namespace = namespace or get_current_namespace(config)
1430        print_job_stats(namespace=namespace)
1431    except Exception as e:
1432        print(f"Error displaying job stats: {e}")

1435@monitor_app.command("queue")
1436def monitor_queue(
1437    namespace: str = typer.Option(
1438        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
1439    ),
1440    reasons: bool = typer.Option(False, help="Display queued job event messages"),
1441    include_cpu: bool = typer.Option(False, help="Show CPU jobs in the queue"),
1442):
1443    """
1444    `kblaunch monitor queue`
1445    Display statistics about queued workloads.
1446
1447    Shows information about jobs waiting in the Kueue scheduler,
1448    including wait times and resource requests.
1449
1450    Args:
1451    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
1452    - reasons: Show detailed reason messages for queued jobs
1453    - include_cpu: Include CPU jobs in the queue
1454
1455    Output includes:
1456    - Queue position and wait time
1457    - Resource requests (CPU, RAM, GPU)
1458    - Job priority
1459    - Queueing reasons (if --reasons flag is used)
1460
1461    Examples:
1462        ```bash
1463        kblaunch monitor queue
1464        kblaunch monitor queue --reasons
1465        kblaunch monitor queue --namespace custom-namespace
1466        ```
1467    """
1468    try:
1469        namespace = namespace or get_current_namespace(config)
1470        print_queue_stats(namespace=namespace, reasons=reasons, include_cpu=include_cpu)
1471    except Exception as e:
1472        print(f"Error displaying queue stats: {e}")
