kblaunch

kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.

Commands

  • Launching GPU jobs with various configurations
  • Monitoring GPU usage and job statistics
  • Setting up user configurations and preferences
  • Managing persistent volumes and Git authentication

Features

  • Interactive and batch job support
  • GPU resource management and constraints
  • Environment variable handling from multiple sources
  • Persistent Volume Claims (PVC) for storage
  • Git SSH authentication
  • VS Code integration with remote tunneling
  • Slack notifications for job status
  • Real-time cluster monitoring

Resource Types

  • A100 GPUs (40GB and 80GB variants)
  • H100 GPUs (80GB variant)
  • CPU and RAM allocation
  • Persistent storage volumes

Job Priority Classes

  • default: Standard priority for most workloads
  • batch: Lower priority for long-running jobs
  • short: High priority for quick jobs (with GPU constraints)

Environment Integration

  • Kubernetes secrets
  • Local environment variables
  • .env file support
  • SSH key management
  • NFS workspace mounting
 1"""kblaunch - A CLI tool for launching and monitoring GPU jobs on Kubernetes clusters.
 2
 3## Commands
 4* Launching GPU jobs with various configurations
 5* Monitoring GPU usage and job statistics
 6* Setting up user configurations and preferences
 7* Managing persistent volumes and Git authentication
 8
 9## Features
10* Interactive and batch job support
11* GPU resource management and constraints
12* Environment variable handling from multiple sources
13* Persistent Volume Claims (PVC) for storage
14* Git SSH authentication
15* VS Code integration with remote tunneling
16* Slack notifications for job status
17* Real-time cluster monitoring
18
19## Resource Types
20* A100 GPUs (40GB and 80GB variants)
21* H100 GPUs (80GB variant)
22* CPU and RAM allocation
23* Persistent storage volumes
24
25## Job Priority Classes
26* default: Standard priority for most workloads
27* batch: Lower priority for long-running jobs
28* short: High priority for quick jobs (with GPU constraints)
29
30## Environment Integration
31* Kubernetes secrets
32* Local environment variables
33* .env file support
34* SSH key management
35* NFS workspace mounting
36"""
37
38import importlib.metadata
39
40__version__ = importlib.metadata.version("kblaunch")
41
42__all__ = [
43    "setup",
44    "launch",
45    "monitor_gpus",
46    "monitor_users",
47    "monitor_jobs",
48    "monitor_queue",
49]
50
51from .cli import setup, launch, monitor_gpus, monitor_users, monitor_jobs, monitor_queue
@app.command()
def setup():
    """
    `kblaunch setup`

    Interactive setup wizard for kblaunch configuration.
    No arguments - all configuration is done through interactive prompts.

    This command walks users through the initial setup process, configuring:
    - User identity and email
    - Namespace and queue settings
    - Slack notifications webhook
    - Persistent Volume Claims (PVC) for storage
    - Git SSH authentication
    - NFS server configuration

    The configuration is stored in ~/.cache/.kblaunch/config.json.

    Configuration includes:
    - User: Kubernetes username for job ownership
    - Email: User email for notifications and Git configuration
    - Namespace: Kubernetes namespace for job deployment
    - Queue: Kueue queue name for job scheduling
    - Slack webhook: URL for job status notifications
    - PVC: Persistent storage configuration
    - Git SSH: Authentication for private repositories
    - NFS: Server address for mounting storage
    """
    config = load_config()

    # validate user
    default_user = os.getenv("USER")
    if "user" in config:
        default_user = config["user"]
    else:
        config["user"] = default_user

    if typer.confirm(
        f"Would you like to set the user? (default: {default_user})", default=False
    ):
        user = typer.prompt("Please enter your user", default=default_user)
        config["user"] = user

    # Get email
    existing_email = config.get("email", None)
    email = typer.prompt(
        f"Please enter your email (existing: {existing_email})", default=existing_email
    )
    config["email"] = email

    # Configure namespace
    existing_namespace = config.get("namespace", os.getenv("KUBE_NAMESPACE"))
    if typer.confirm("Would you like to configure your namespace?", default=True):
        namespace = typer.prompt(
            f"Please enter your namespace (existing: {existing_namespace})",
            default=existing_namespace,
        )
        config["namespace"] = namespace
        # Now that we have namespace, ask about queue
        existing_queue = config.get("queue", get_user_queue(namespace))
        if typer.confirm("Would you like to configure your queue?", default=True):
            queue = typer.prompt(
                f"Please enter your queue name (existing: {existing_queue})",
                default=existing_queue or f"{namespace}-user-queue",
            )
            config["queue"] = queue

    # Get the current NFS server from config or default
    current_nfs = config.get("nfs_server", NFS_SERVER)
    if typer.confirm("Would you like to configure the NFS server?", default=False):
        nfs_server = typer.prompt(
            f"Enter your NFS server address (existing: {current_nfs})",
            default=current_nfs,
        )
        config["nfs_server"] = nfs_server

    # Get Slack webhook
    if typer.confirm("Would you like to set up Slack notifications?", default=False):
        existing_webhook = config.get("slack_webhook", None)
        webhook = typer.prompt(
            f"Enter your Slack webhook URL (existing: {existing_webhook})",
            default=existing_webhook,
        )
        config["slack_webhook"] = webhook

    if typer.confirm("Would you like to use a PVC?", default=False):
        user = config["user"]
        current_default = config.get("default_pvc", f"{user}-pvc")

        pvc_name = typer.prompt(
            f"Enter the PVC name to use (default: {current_default}). We will help you create it if it does not exist.",
            default=current_default,
        )

        namespace = config.get("namespace", get_current_namespace(config))
        if check_if_pvc_exists(pvc_name, namespace):
            if typer.confirm(
                f"Would you like to set {pvc_name} as the default PVC?",
                default=True,
            ):
                config["default_pvc"] = pvc_name
        else:
            if typer.confirm(
                f"PVC '{pvc_name}' does not exist. Would you like to create it?",
                default=True,
            ):
                pvc_size = typer.prompt(
                    "Enter the desired PVC size (e.g. 10Gi)", default="10Gi"
                )
                try:
                    if create_pvc(user, pvc_name, pvc_size, namespace):
                        config["default_pvc"] = pvc_name
                except (ValueError, ApiException) as e:
                    logger.error(f"Failed to create PVC: {e}")

    # Git authentication setup
    if typer.confirm("Would you like to set up Git SSH authentication?", default=False):
        default_key_path = str(Path.home() / ".ssh" / "id_rsa")
        key_path = typer.prompt(
            "Enter the path to your SSH private key",
            default=default_key_path,
        )
        secret_name = f"{config['user']}-git-ssh"
        namespace = config.get("namespace", get_current_namespace(config))
        if create_git_secret(secret_name, key_path, namespace):
            config["git_secret"] = secret_name

    # validate slack webhook
    if "slack_webhook" in config:
        # test post to slack
        try:
            logger.info("Sending test message to Slack")
            message = "Hello :wave: from ```kblaunch```"
            response = requests.post(
                config["slack_webhook"],
                json={"text": message},
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Error sending test message to Slack: {e}")

    # Save config
    save_config(config)
    logger.info(f"Configuration saved to {CONFIG_FILE}")

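For orientation, a fully populated `~/.cache/.kblaunch/config.json` might look like the fragment below. The keys correspond to the values written by the setup wizard above; all values are illustrative placeholders, not defaults:

```json
{
  "user": "jdoe",
  "email": "jdoe@example.com",
  "namespace": "ml-team",
  "queue": "ml-team-user-queue",
  "nfs_server": "10.0.0.5",
  "slack_webhook": "https://hooks.slack.com/services/T000/B000/XXXX",
  "default_pvc": "jdoe-pvc",
  "git_secret": "jdoe-git-ssh"
}
```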
@app.command()
def launch(
    email: str = typer.Option(None, help="User email (overrides config)"),
    job_name: str = typer.Option(..., help="Name of the Kubernetes job"),
    docker_image: str = typer.Option(
        "nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04", help="Docker image"
    ),
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    queue_name: str = typer.Option(
        None, help="Kueue queue name (defaults to KUBE_USER_QUEUE)"
    ),
    interactive: bool = typer.Option(False, help="Run in interactive mode"),
    command: str = typer.Option(
        "", help="Command to run in the container"
    ),  # Made optional
    cpu_request: str = typer.Option("6", help="CPU request"),
    ram_request: str = typer.Option("40Gi", help="RAM request"),
    gpu_limit: int = typer.Option(1, help="GPU limit (0 for non-GPU jobs)"),
    gpu_product: GPU_PRODUCTS = typer.Option(
        "NVIDIA-A100-SXM4-40GB",
        help="GPU product type to use (ignored for non-GPU jobs)",
        show_choices=True,
        show_default=True,
    ),
    secrets_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of secret environment variables to export to the container",
    ),
    local_env_vars: list[str] = typer.Option(
        [],  # Use empty list as default instead of None
        help="List of local environment variables to export to the container",
    ),
    load_dotenv: bool = typer.Option(
        True, help="Load environment variables from .env file"
    ),
    nfs_server: Optional[str] = typer.Option(
        None, help="NFS server (overrides config and environment)"
    ),
    pvc_name: str = typer.Option(None, help="Persistent Volume Claim name"),
    pvcs: str = typer.Option(
        None,
        help='Multiple PVCs with mount paths in JSON format (e.g., \'[{"name":"my-pvc","mount_path":"/data"}]\')',
    ),
    dry_run: bool = typer.Option(False, help="Dry run"),
    priority: PRIORITY = typer.Option(
        "default", help="Priority class name", show_default=True, show_choices=True
    ),
    vscode: bool = typer.Option(False, help="Install VS Code CLI in the container"),
    tunnel: bool = typer.Option(
        False,
        help="Start a VS Code SSH tunnel on startup. Requires SLACK_WEBHOOK and --vscode",
    ),
    startup_script: str = typer.Option(
        None, help="Path to startup script to run in container"
    ),
):
    """
    `kblaunch launch`

    Launch a Kubernetes job with specified configuration.

    This command creates and deploys a Kubernetes job with the given specifications,
    handling GPU allocation, resource requests, and environment setup.

    Args:
    * email (str, optional): User email for notifications
    * job_name (str, required): Name of the Kubernetes job
    * docker_image (str, default="nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04"): Container image
    * namespace (str, optional): Kubernetes namespace (defaults to KUBE_NAMESPACE)
    * queue_name (str, optional): Kueue queue name (defaults to KUBE_USER_QUEUE)
    * interactive (bool, default=False): Run in interactive mode
    * command (str, default=""): Command to run in container
    * cpu_request (str, default="6"): CPU cores request
    * ram_request (str, default="40Gi"): RAM request
    * gpu_limit (int, default=1): Number of GPUs
    * gpu_product (GPU_PRODUCTS, default="NVIDIA-A100-SXM4-40GB"): GPU type
    * secrets_env_vars (List[str], default=[]): Secret environment variables
    * local_env_vars (List[str], default=[]): Local environment variables
    * load_dotenv (bool, default=True): Load .env file
    * nfs_server (str, optional): NFS server IP (overrides config)
    * pvc_name (str, optional): PVC name for single PVC mounting at /pvc
    * pvcs (str, optional): Multiple PVCs with mount paths in JSON format
    * dry_run (bool, default=False): Print YAML only
    * priority (PRIORITY, default="default"): Job priority
    * vscode (bool, default=False): Install VS Code
    * tunnel (bool, default=False): Start VS Code tunnel
    * startup_script (str, optional): Path to startup script

    Examples:
        ```bash
        # Launch an interactive GPU job
        kblaunch launch --job-name test-job --interactive

        # Launch a batch GPU job with custom command
        kblaunch launch --job-name batch-job --command "python train.py"

        # Launch a CPU-only job
        kblaunch launch --job-name cpu-job --gpu-limit 0

        # Launch with VS Code support
        kblaunch launch --job-name dev-job --interactive --vscode --tunnel

        # Launch with multiple PVCs
        kblaunch launch --job-name multi-pvc-job --pvcs '[{"name":"data-pvc","mount_path":"/data"},{"name":"models-pvc","mount_path":"/models"}]'
        ```

    Notes:
    - Interactive jobs keep running until manually terminated
    - GPU jobs require appropriate queue and priority settings
    - VS Code tunnel requires Slack webhook configuration
    - Multiple PVCs can be mounted with custom paths using the --pvcs option
    """

    # Load config
    config = load_config()

    # Determine namespace if not provided
    if namespace is None:
        namespace = get_current_namespace(config)
        if namespace is None:
            raise typer.BadParameter(
                "Namespace not provided. "
                "Please provide --namespace or run 'kblaunch setup' to configure."
            )

    # Determine queue name if not provided
    if queue_name is None:
        queue_name = get_user_queue(namespace)
        if queue_name is None:
            raise typer.BadParameter(
                "Queue name not provided. "
                "Please provide --queue-name or run 'kblaunch setup' to configure."
            )

    # Use email from config if not provided
    if email is None:
        email = config.get("email")
        if email is None:
            raise typer.BadParameter(
                "Email not provided and not found in config. "
                "Please provide --email or run 'kblaunch setup' to configure."
            )

    # Determine which NFS server to use (priority: command-line > config > env var > default)
    if nfs_server is None:
        nfs_server = config.get("nfs_server", NFS_SERVER)
        if nfs_server is None:
            # warn if NFS server is not set
            logger.warning(
                "NFS server not set/found. Please provide --nfs-server or run "
                "'kblaunch setup' to mount the NFS partition."
            )

    # Add SLACK_WEBHOOK to local_env_vars if configured
    if "slack_webhook" in config:
        os.environ["SLACK_WEBHOOK"] = config["slack_webhook"]
        if "SLACK_WEBHOOK" not in local_env_vars:
            local_env_vars.append("SLACK_WEBHOOK")

    if "user" in config and os.getenv("USER") is None:
        os.environ["USER"] = config["user"]

    if pvc_name is None:
        pvc_name = config.get("default_pvc")

    if pvc_name is not None:
        if not check_if_pvc_exists(pvc_name, namespace):
            raise typer.BadParameter(
                f"PVC '{pvc_name}' does not exist in namespace '{namespace}'"
            )

    # Parse multiple PVCs if provided
    parsed_pvcs = []
    if pvcs:
        try:
            parsed_pvcs = json.loads(pvcs)
            # Validate the format
            for pvc in parsed_pvcs:
                if (
                    not isinstance(pvc, dict)
                    or "name" not in pvc
                    or "mount_path" not in pvc
                ):
                    raise typer.BadParameter(
                        "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
                    )
                # Validate that the PVC exists
                if not check_if_pvc_exists(pvc["name"], namespace):
                    raise typer.BadParameter(
                        f"PVC '{pvc['name']}' does not exist in namespace '{namespace}'"
                    )

        except json.JSONDecodeError:
            raise typer.BadParameter("Invalid JSON format for pvcs parameter")

    # Add validation for command parameter
    if not interactive and command == "":
        raise typer.BadParameter("--command is required when not in interactive mode")

    # Validate GPU constraints only if requesting GPUs
    if gpu_limit > 0:
        try:
            validate_gpu_constraints(gpu_product.value, gpu_limit, priority.value)
        except ValueError as e:
            raise typer.BadParameter(str(e))

    is_completed = check_if_completed(job_name, namespace=namespace)
    if not is_completed:
        if typer.confirm(
            f"Job '{job_name}' already exists. Do you want to delete it and create a new one?",
            default=False,
        ):
            if not delete_namespaced_job_safely(
                job_name,
                namespace=namespace,
                user=config.get("user"),
            ):
                logger.error("Failed to delete existing job")
                return 1
        else:
            logger.info("Operation cancelled by user")
            return 1

    logger.info(f"Job '{job_name}' is completed. Launching a new job.")

    # Get local environment variables
    env_vars_dict = get_env_vars(
        local_env_vars=local_env_vars,
        load_dotenv=load_dotenv,
    )

    # Add USER and GIT_EMAIL to env_vars if git_secret is configured
    if config.get("git_secret"):
        env_vars_dict["USER"] = config.get("user", os.getenv("USER", "unknown"))
        env_vars_dict["GIT_EMAIL"] = email

    secrets_env_vars_dict = get_secret_env_vars(
        secrets_names=secrets_env_vars,
        namespace=namespace,
    )

    # Check for overlapping keys in local and secret environment variables
    intersection = set(secrets_env_vars_dict.keys()).intersection(env_vars_dict.keys())
    if intersection:
        logger.warning(
            f"Overlapping keys in local and secret environment variables: {intersection}"
        )
    # Combine the environment variables
    union = set(secrets_env_vars_dict.keys()).union(env_vars_dict.keys())

    # Handle startup script
    script_content = None
    if startup_script:
        script_content = read_startup_script(startup_script)
        # Create ConfigMap for startup script
        try:
            api = client.CoreV1Api()
            config_map = client.V1ConfigMap(
                metadata=client.V1ObjectMeta(
                    name=f"{job_name}-startup", namespace=namespace
                ),
                data={"startup.sh": script_content},
            )
            try:
                api.create_namespaced_config_map(namespace=namespace, body=config_map)
            except ApiException as e:
                if e.status == 409:  # Already exists
                    api.patch_namespaced_config_map(
                        name=f"{job_name}-startup", namespace=namespace, body=config_map
                    )
                else:
                    raise
        except Exception as e:
            raise typer.BadParameter(f"Failed to create startup script ConfigMap: {e}")

    if interactive:
        cmd = "while true; do sleep 60; done;"
    else:
        cmd = command
        logger.info(f"Command: {cmd}")

    logger.info(f"Creating job for: {cmd}")

    # Modify command to include startup script
    if script_content:
        cmd = f"bash /startup.sh && {cmd}"

    # Build the start command with optional VS Code installation
    start_command = send_message_command(union)
    if config.get("git_secret"):
        start_command += setup_git_command()
    if vscode:
        start_command += install_vscode_command()
        if tunnel:
            start_command += start_vscode_tunnel_command(union)
    elif tunnel:
        logger.error("Cannot start tunnel without VS Code installation")

    full_cmd = start_command + cmd

    job = KubernetesJob(
        name=job_name,
        cpu_request=cpu_request,
        ram_request=ram_request,
        image=docker_image,
        gpu_type="nvidia.com/gpu" if gpu_limit > 0 else None,
        gpu_limit=gpu_limit,
        gpu_product=gpu_product.value if gpu_limit > 0 else None,
        command=["/bin/bash", "-c", "--"],
        args=[full_cmd],
        env_vars=env_vars_dict,
        secret_env_vars=secrets_env_vars_dict,
        user_email=email,
        namespace=namespace,
        kueue_queue_name=queue_name,
        nfs_server=nfs_server,
        pvc_name=pvc_name,
        pvcs=parsed_pvcs,
        priority=priority.value,
        startup_script=script_content,
        git_secret=config.get("git_secret"),
    )
    job_yaml = job.generate_yaml()
    logger.info(job_yaml)
    # Run the Job on the Kubernetes cluster
    if not dry_run:
        job.run()
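The settings above are resolved in a fixed precedence order: command-line flag, then config file, then environment variable, then built-in default. The helper below is a minimal standalone sketch of that chain (`resolve` is an illustrative name, not part of kblaunch):

```python
import os


def resolve(cli_value, config, key, env_var, default=None):
    """Return the first value set, in precedence order:
    command-line flag > config file > environment variable > default."""
    if cli_value is not None:
        return cli_value
    if config.get(key) is not None:
        return config[key]
    env = os.environ.get(env_var)
    if env is not None:
        return env
    return default


# Example: with no CLI flag given, the config file wins over the environment.
os.environ["KUBE_NAMESPACE"] = "env-ns"
config = {"namespace": "config-ns"}
print(resolve(None, config, "namespace", "KUBE_NAMESPACE"))      # config-ns
print(resolve("cli-ns", config, "namespace", "KUBE_NAMESPACE"))  # cli-ns
```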

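The shape checks applied to the `--pvcs` value can be seen in isolation in the sketch below. It mirrors the parsing and per-entry validation in `launch`, but omits the cluster-side existence check; `validate_pvcs` is an illustrative name:

```python
import json


def validate_pvcs(pvcs_json: str) -> list:
    """Parse a --pvcs JSON string and check each entry has the expected shape.
    (The cluster-side PVC existence check is skipped in this sketch.)"""
    try:
        parsed = json.loads(pvcs_json)
    except json.JSONDecodeError:
        raise ValueError("Invalid JSON format for pvcs parameter")
    for pvc in parsed:
        if not isinstance(pvc, dict) or "name" not in pvc or "mount_path" not in pvc:
            raise ValueError(
                "Each PVC entry must be a dictionary with 'name' and 'mount_path' keys"
            )
    return parsed


pvcs = validate_pvcs('[{"name":"data-pvc","mount_path":"/data"}]')
print(pvcs[0]["mount_path"])  # /data
```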
@monitor_app.command("gpus")
def monitor_gpus(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor gpus`

    Display overall GPU statistics and utilization by type.

    Shows a comprehensive view of GPU allocation and usage across the cluster,
    including both running and pending GPU requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Total GPU count by type
    - Running vs. pending GPUs
    - Details of pending GPU requests
    - Wait times for pending requests

    Examples:
        ```bash
        kblaunch monitor gpus
        kblaunch monitor gpus --namespace custom-namespace
        ```
    """
    try:
        config = load_config()
        namespace = namespace or get_current_namespace(config)
        print_gpu_total(namespace=namespace)
    except Exception as e:
        print(f"Error displaying GPU stats: {e}")

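The per-type totals this command reports boil down to counting GPU requests grouped by product and pod phase. The record shape below is made up for illustration (the real data comes from the Kubernetes API), but the aggregation is a faithful sketch:

```python
from collections import Counter

# Hypothetical per-pod GPU requests: (gpu_product, gpu_count, pod_phase).
pods = [
    ("NVIDIA-A100-SXM4-40GB", 1, "Running"),
    ("NVIDIA-A100-SXM4-80GB", 2, "Running"),
    ("NVIDIA-A100-SXM4-40GB", 4, "Pending"),
    ("NVIDIA-H100-80GB-HBM3", 1, "Running"),
]

totals = Counter()   # all requested GPUs, by product
running = Counter()  # only GPUs on Running pods, by product
for product, count, phase in pods:
    totals[product] += count
    if phase == "Running":
        running[product] += count

for product in sorted(totals):
    pending = totals[product] - running[product]
    print(f"{product}: {totals[product]} total "
          f"({running[product]} running, {pending} pending)")
```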
@monitor_app.command("users")
def monitor_users(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor users`

    Display GPU usage statistics grouped by user.

    Provides a user-centric view of GPU allocation and utilization,
    helping identify resource usage patterns across users.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - GPUs allocated per user
    - Average memory usage per user
    - Inactive GPU count per user
    - Overall usage totals

    Examples:
        ```bash
        kblaunch monitor users
        kblaunch monitor users --namespace custom-namespace
        ```
    """
    try:
        config = load_config()
        namespace = namespace or get_current_namespace(config)
        print_user_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying user stats: {e}")

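The per-user view groups GPU samples by owner and derives averages and idle counts. As with the previous sketch, the sample records are invented for illustration; only the grouping logic reflects what the stats describe:

```python
from collections import defaultdict

# Hypothetical per-GPU samples: (user, gpu_memory_used_mib).
samples = [
    ("alice", 30000),
    ("alice", 0),      # an allocated but idle GPU
    ("bob", 12000),
]

by_user = defaultdict(list)
for user, mem in samples:
    by_user[user].append(mem)

for user, mems in sorted(by_user.items()):
    avg = sum(mems) / len(mems)
    inactive = sum(1 for m in mems if m == 0)
    print(f"{user}: {len(mems)} GPUs, avg {avg:.0f} MiB used, {inactive} inactive")
```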
@monitor_app.command("jobs")
def monitor_jobs(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
):
    """
    `kblaunch monitor jobs`

    Display detailed job-level GPU statistics.

    Shows comprehensive information about all running GPU jobs,
    including resource usage and job characteristics.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)

    Output includes:
    - Job identification and ownership
    - Resource allocation (CPU, RAM, GPU)
    - GPU memory usage
    - Job status (active/inactive)
    - Job mode (interactive/batch)
    - Resource totals and averages

    Examples:
        ```bash
        kblaunch monitor jobs
        kblaunch monitor jobs --namespace custom-namespace
        ```
    """
    try:
        config = load_config()
        namespace = namespace or get_current_namespace(config)
        print_job_stats(namespace=namespace)
    except Exception as e:
        print(f"Error displaying job stats: {e}")

@monitor_app.command("queue")
def monitor_queue(
    namespace: str = typer.Option(
        None, help="Kubernetes namespace (defaults to KUBE_NAMESPACE)"
    ),
    reasons: bool = typer.Option(False, help="Display queued job event messages"),
    include_cpu: bool = typer.Option(False, help="Show CPU jobs in the queue"),
):
    """
    `kblaunch monitor queue`

    Display statistics about queued workloads.

    Shows information about jobs waiting in the Kueue scheduler,
    including wait times and resource requests.

    Args:
    - namespace: Kubernetes namespace to monitor (default: KUBE_NAMESPACE)
    - reasons: Show detailed reason messages for queued jobs
    - include_cpu: Include CPU jobs in the queue

    Output includes:
    - Queue position and wait time
    - Resource requests (CPU, RAM, GPU)
    - Job priority
    - Queueing reasons (if --reasons flag is used)

    Examples:
        ```bash
        kblaunch monitor queue
        kblaunch monitor queue --reasons
        kblaunch monitor queue --namespace custom-namespace
        ```
    """
    try:
        config = load_config()
        namespace = namespace or get_current_namespace(config)
        print_queue_stats(namespace=namespace, reasons=reasons, include_cpu=include_cpu)
    except Exception as e:
        print(f"Error displaying queue stats: {e}")

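A queued workload's wait time can be derived from its Kubernetes creation timestamp (RFC 3339). The sketch below shows one way to format it; the function name and output format are illustrative, not kblaunch's actual implementation:

```python
from datetime import datetime, timezone


def wait_time(creation_timestamp: str, now: datetime) -> str:
    """Format elapsed time since a workload's RFC 3339 creation timestamp."""
    # fromisoformat() accepts "+00:00" but not the "Z" suffix on older Pythons.
    created = datetime.fromisoformat(creation_timestamp.replace("Z", "+00:00"))
    hours, rem = divmod(int((now - created).total_seconds()), 3600)
    minutes = rem // 60
    return f"{hours}h{minutes:02d}m"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(wait_time("2024-01-01T09:30:00Z", now))  # 2h30m
```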