Metadata-Version: 2.4
Name: tropiflo
Version: 1.1.8
Summary: A tool for agentic recursive model improvement
Project-URL: Homepage, https://github.com/TropiFloAI/co-datascientist
Project-URL: Issues, https://github.com/TropiFloAI/co-datascientist/issues
Author-email: David Gedalevich <davidgdalevich7@gmail.com>, Oz Kilim <oz.kilim@tropiflo.io>
License: Copyright (c) 2018 The Python Packaging Authority
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: click>=8.1.8
Requires-Dist: fastmcp>=2.2.5
Requires-Dist: httpx>=0.28.1
Requires-Dist: ipdb>=0.13.13
Requires-Dist: keyring>=25.6.0
Requires-Dist: keyrings-alt>=5.0.0
Requires-Dist: matplotlib>=3.0.0
Requires-Dist: pydantic-settings>=2.9.1
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: reportlab>=4.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: streamlit>=1.40.0
Requires-Dist: yaspin>=3.1.0
Description-Content-Type: text/markdown

# Tropiflo

**Automatically evolve your ML code to maximize a KPI — locally, securely, and reproducibly.**

---

## Is Tropiflo for you?

Tropiflo is for you if:

✓ **You already have working ML code** — not starting from scratch  
✓ **You know your metric (KPI)** — accuracy, RMSE, AUC, whatever you optimize for  
✓ **You want the system to rewrite parts of your code** — to improve that metric  
✓ **You do NOT want AutoML SaaS, data upload, or black boxes** — everything runs locally

If that's you, keep reading.

---

## How Tropiflo Thinks

Here's what actually happens when you run Tropiflo:

1. **You mark a code block** you want to evolve (e.g., your feature engineering)
2. **You define a KPI** by printing it (e.g., `print(f"KPI: {accuracy}")`)
3. **Tropiflo runs your baseline** and records the KPI
4. **Tropiflo proposes a hypothesis** about how to improve the code
5. **Tropiflo modifies ONLY the marked block** with the new approach
6. **Tropiflo executes your full project** to test the hypothesis
7. **Tropiflo scores the new KPI** and keeps the change if it's better
8. **Repeat** — the system keeps evolving toward higher KPIs
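
In miniature, steps 3 through 8 look like the sketch below. This is a conceptual illustration, not Tropiflo's implementation; the `run_and_score` helper and the KPI regex are assumptions about how a printed `KPI:` line can be read back from a run.

```python
import re
import subprocess

def run_and_score(entry_command: str = "python train.py") -> float:
    """Run the project once and parse the printed 'KPI: <number>' line."""
    result = subprocess.run(entry_command, shell=True, capture_output=True, text=True)
    match = re.search(r"KPI:\s*([-+0-9.eE]+)", result.stdout)
    if match is None:
        raise RuntimeError("no 'KPI: <number>' line found in stdout")
    return float(match.group(1))

baseline = run_and_score()          # step 3: record the baseline KPI
# ... the marked block is rewritten with a new hypothesis (steps 4-5) ...
candidate = run_and_score()         # steps 6-7: re-run the project and score it
keep_change = candidate > baseline  # step 8: keep only improvements
```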

### What Tropiflo is NOT

- **Not AutoML** — It doesn't just tune hyperparameters
- **Not parameter search** — It's code evolution, not grid search
- **Not a black box** — You see every change it makes to your code
- **Not a data platform** — Your data never leaves your machine

---

## Quickstart: See it work in 2 minutes

The fastest way to understand Tropiflo is to watch it improve a simple problem.

### Step 1: Install

```bash
pip install tropiflo
```

### Step 2: Mark Your Code

Create `train.py` and mark the block you want to evolve:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load your data
X = pd.read_csv("data/features.csv")
y = pd.read_csv("data/labels.csv")

# CO_DATASCIENTIST_BLOCK_START
# This is the block Tropiflo will evolve
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
preds = model.predict(X)
# CO_DATASCIENTIST_BLOCK_END

# Print your KPI
accuracy = accuracy_score(y, preds)
print(f"KPI: {accuracy:.4f}")
```

### Step 3: Create config.yaml

Minimal configuration:

```yaml
mode: local
entry_command: "python train.py"
```

With more options:

```yaml
mode: local
entry_command: "python train.py"

# Run multiple experiments in parallel
parallel: 3

# Mount external data directory
data_volume: "/path/to/your/data"

# AI evolution (get API key from tropiflo.io)
api_key: "sk_your_token_here"
```

### Step 4: Run

```bash
tropiflo run --config config.yaml
```

Track runs live in a local dashboard:

```bash
# Launch workflow + Streamlit tracking UI
tropiflo run --config config.yaml --dashboard

# Optional: choose a different dashboard port
tropiflo run --config config.yaml --dashboard --dashboard-port 8502

# Launch dashboard later (without starting a new workflow)
tropiflo dashboard
```

The dashboard opens at `http://127.0.0.1:8501` by default and reads local artifacts from `results/runs/`.

**What you'll see:**
- Baseline run with initial KPI
- Evolution hypotheses being tested
- Progress toward better KPIs
- Results saved to `results/runs/{memorable_name}/`

---

## Results: Traceable, Reproducible, Diffable

Every run is fully traceable and reproducible.

```
your_project/
└── results/
    └── runs/
        └── happy_panda_20260207_143025/    ← Memorable run name
            ├── timeline/                     ← Chronological history
            │   ├── 0001_kpi_0.8530_baseline/
            │   ├── 0002_kpi_0.8812_hypothesis_ensemble/
            │   └── 0003_kpi_0.9103_hypothesis_feature_eng/
            ├── by_performance/               ← Auto-sorted by KPI
            └── best → timeline/0003...       ← Symlink to best version
```

**Key features:**
- `timeline/` shows every hypothesis tested, in order
- `by_performance/` automatically sorts runs by KPI for easy comparison
- `best` symlink always points to your best-performing version
- Every checkpoint contains the full modified code + metadata
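
Because checkpoints are plain directories and `best` is an ordinary symlink, you can inspect a run with standard tools. A minimal sketch (the run name is the illustrative one from above):

```python
from pathlib import Path

run_dir = Path("results/runs/happy_panda_20260207_143025")

# Follow the `best` symlink to see which checkpoint won.
best = (run_dir / "best").resolve()
print(f"Best checkpoint: {best.name}")

# Walk the timeline in order; KPI and hypothesis name are encoded in each directory name.
for checkpoint in sorted((run_dir / "timeline").iterdir()):
    print(checkpoint.name)
```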

---

## Important Reassurances

### Your code outside the block is never modified

Tropiflo only touches code between `CO_DATASCIENTIST_BLOCK_START` and `CO_DATASCIENTIST_BLOCK_END`. Everything else stays exactly as you wrote it.

### If KPI doesn't improve, baseline is preserved

Tropiflo only keeps changes that improve your KPI. If a hypothesis performs worse, it's discarded and the previous best version is kept.

### You can Ctrl+C at any time safely

Press Ctrl+C anytime to stop. Docker images and containers are cleaned up automatically. No manual cleanup needed.

### All artifacts are local unless you opt in

Your data, code, and results stay on your machine. Nothing is uploaded unless you explicitly configure a cloud backend.

---

## Configuration

### Minimal Config (80% of users)

```yaml
mode: local
entry_command: "python train.py"
```

### Common Options

```yaml
mode: local
entry_command: "python train.py"

# Parallelization
parallel: 3

# Data mounting (if data is outside your project)
data_volume: "/home/user/datasets"

# API key for AI-powered evolution
api_key: "sk_your_token_here"
```

### Resource Control (Advanced)

```yaml
mode: local
entry_command: "python train.py"
parallel: 4

# GPU configuration
enable_gpu: true           # Force GPU (auto-detected by default)
gpus_per_task: 1           # GPUs per container

# CPU and memory limits
cpus_per_task: 4.0         # CPU cores per container
memory_per_task: "8g"      # Memory per container
```

### Cloud Backends (Optional)

<details>
<summary><strong>Google Cloud Run</strong></summary>

```yaml
mode: gcloud
entry_command: "python train.py"
project_id: "your-gcp-project"
region: "us-central1"
data_volume: "gs://your-bucket"
```

See [full GCloud setup guide](#google-cloud-run-jobs-integration) below.
</details>

<details>
<summary><strong>AWS ECS Fargate</strong></summary>

```yaml
mode: aws
entry_command: "python train.py"
aws:
  cluster: "my-cluster"
  task_definition: "my-task"
  region: "us-east-1"
```

See [full AWS setup guide](#aws-ecs-fargate-integration) below.
</details>

<details>
<summary><strong>Databricks</strong></summary>

```yaml
mode: databricks
entry_command: "python train.py"
databricks:
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/my_volume"
  timeout: "30m"
  job:
    tasks:
      - task_key: "t"
        existing_cluster_id: "your-cluster-id"
```

See [full Databricks setup guide](#databricks-integration) below.
</details>

---

## Using Your Data

Once the quickstart example works, here's how to use YOUR data:

### Method 1: Hardcoded Paths (Simplest)

Just put the full path in your code:

```python
import pandas as pd

X = pd.read_csv("/full/path/to/your/data.csv")
# ... rest of your code
```

### Method 2: Docker Volume Mounting (Recommended)

For data that lives outside your project:

**Update config.yaml:**
```yaml
mode: local
entry_command: "python train.py"
data_volume: "/home/user/my_datasets"
```

**Update your code:**
```python
import os
import pandas as pd

# Tropiflo automatically sets INPUT_URI to /data inside Docker
DATA_DIR = os.environ.get("INPUT_URI", "/data")
X = pd.read_csv(os.path.join(DATA_DIR, "train.csv"))
y = pd.read_csv(os.path.join(DATA_DIR, "labels.csv"))

# CO_DATASCIENTIST_BLOCK_START
# Your model code here (it should compute `score` for the KPI line below)
# CO_DATASCIENTIST_BLOCK_END

print(f"KPI: {score}")
```

**What happens:** Tropiflo mounts `/home/user/my_datasets` to `/data` inside the Docker container, so your code can access files like `train.csv`.

---

## Block Placement Rules

**Block markers MUST be at top level** (no indentation):

```python
# ✅ CORRECT - No indentation before the comment
# CO_DATASCIENTIST_BLOCK_START
def my_model():
    return LinearRegression()
# CO_DATASCIENTIST_BLOCK_END

# ❌ WRONG - Inside a function (has tabs/spaces before comment)
def train():
    # CO_DATASCIENTIST_BLOCK_START  ← This will NOT be detected!
    model = train_model()
    # CO_DATASCIENTIST_BLOCK_END
```

**Rule:** Block markers must start at column 0 (no tabs or spaces before `#`).
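
Since indented markers are silently skipped, a quick pre-flight check can save a wasted run. A minimal sketch (a hypothetical helper, not part of the Tropiflo CLI):

```python
import re
from pathlib import Path

# Flag any CO_DATASCIENTIST marker that is indented and would therefore not be detected.
indented_marker = re.compile(r"^[ \t]+#\s*CO_DATASCIENTIST_BLOCK_(START|END)")

for path in Path(".").rglob("*.py"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if indented_marker.match(line):
            print(f"{path}:{lineno}: indented marker will NOT be detected")
```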

---

## Multi-File Projects

Tropiflo supports both single-file scripts and multi-file projects:

- **Single File**: `tropiflo run python my_script.py`
- **Multi-File**: Auto-detects `run.sh`, `main.py`, or `run.py` in your project root
- **Custom Entry Point**: `tropiflo run bash custom_script.sh`

When you run Tropiflo on a multi-file project:

1. **Scanning**: Scans all `.py` files for `CO_DATASCIENTIST_BLOCK` markers
2. **Selection**: Each generation, randomly picks ONE file to evolve
3. **Evolution**: The AI generates hypotheses and modifies the selected block
4. **Testing**: Your entire project runs with the new code
5. **Checkpointing**: Best results are saved as complete directories with all files

This means you can have complex multi-file ML pipelines where each file evolves independently but is tested as a complete system.
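
A rough sketch of the scan-and-select step described above (illustrative only; the real scanner is internal to Tropiflo):

```python
import random
from pathlib import Path

# Steps 1-2 in miniature: find every file with a marked block, then pick one per generation.
candidates = [
    path for path in Path(".").rglob("*.py")
    if "CO_DATASCIENTIST_BLOCK_START" in path.read_text()
]
chosen = random.choice(candidates)
print(f"Evolving this generation: {chosen}")
```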

---

## Deployment

Take your best checkpoint and create a production-ready project:

```bash
# Deploy best checkpoint from latest run
tropiflo deploy results/runs/happy_panda_20260207/best/

# Deploy specific version
tropiflo deploy results/runs/happy_panda_20260207/timeline/0003_kpi_0.9103_feature_eng/

# Custom output directory
tropiflo deploy results/runs/happy_panda_20260207/best/ --output-dir my_optimized_v2
```

**What it does:**
1. Copies your entire original project (including data, configs, assets)
2. Integrates the evolved code from the checkpoint
3. Excludes Tropiflo artifacts (checkpoints, cache, etc.)
4. Creates a `deployment_info.json` with checkpoint metadata

The result is a **complete, standalone project** ready to deploy to production.
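
To confirm which checkpoint a deployed project came from, print the recorded metadata. The output directory below matches the `--output-dir` example above; the file's fields are whatever Tropiflo wrote at deploy time:

```python
import json
from pathlib import Path

info = json.loads(Path("my_optimized_v2/deployment_info.json").read_text())
print(json.dumps(info, indent=2))  # checkpoint metadata recorded at deploy time
```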

---

## Analysis Tools

### Live Local Tracking Dashboard

Run with a live dashboard to monitor experiments as checkpoints are saved:

```bash
tropiflo run --config config.yaml --dashboard
```

Open the same dashboard anytime (even when no run is active):

```bash
# Reads ./results/runs by default
tropiflo dashboard

# Point to another project directory
tropiflo dashboard --working-directory /path/to/project

# Or pass an explicit results root and custom port
tropiflo dashboard --results-root /path/to/project/results/runs --dashboard-port 8502
```

Dashboard highlights:
- KPI over time (all runs as points + running best line)
- Baseline marker and best-so-far trajectory
- Hypotheses table across the workflow
- Diff viewer vs baseline per file
- Stdout/stderr per checkpoint

If you run multiple workflows, select and compare them from the dashboard sidebar.  
Data is loaded from local `results/runs/` folders, so old and new runs appear together.

### Plot KPI Progression

Visualize how your KPI improves over iterations:

```bash
# Basic usage
tropiflo plot-kpi --checkpoints-dir results/runs/happy_panda_20260207/

# With options
tropiflo plot-kpi \
  --checkpoints-dir results/runs/happy_panda_20260207/ \
  --max-iteration 350 \
  --title "AUC Training Progress" \
  --kpi-label "AUC" \
  --output my_kpi_plot.png
```

### Generate PDF Code Diffs

Create professional PDF reports comparing two versions:

```bash
# Compare two Python files
tropiflo diff-pdf baseline.py improved.py

# With custom title
tropiflo diff-pdf \
  baseline.py \
  optimized.py \
  --output "optimization_report.pdf" \
  --title "XOR Problem Optimization Results"
```

---

## Air-Gapped / Offline Deployment

Need to run Tropiflo in an environment without internet access?

### Quick Setup (One-Time, Requires Internet)

```bash
# Run this once while connected to internet
tropiflo setup-airgap

# That's it! Now you can disconnect and work offline
```

### What It Does

1. Pulls base Python Docker image (one-time download)
2. Builds complete image with all your dependencies pre-installed
3. Updates your `config.yaml` to use the pre-built image
4. Everything runs locally - no internet required after setup

### After Setup

```bash
# Disconnect from internet (or work in isolated environment)
tropiflo run --config config.yaml  # Works offline!
```

**Perfect for:**
- Air-gapped production environments
- Isolated VPC deployments
- High-security environments
- Offline development

---

## Private/Self-Hosted Backend

If you run the backend on your own host (VPC, on-prem), point the CLI at it:

**In config.yaml:**
```yaml
backend_url: "https://your-private-backend.example.com"
backend_url_dev: "http://localhost:8000"  # Optional, for dev mode
```

**Or with environment variables:**
```bash
export CO_DATASCIENTIST_CO_DATASCIENTIST_BACKEND_URL="https://your-private-backend.example.com"
export CO_DATASCIENTIST_CO_DATASCIENTIST_BACKEND_URL_DEV="http://localhost:8000"
export CO_DATASCIENTIST_DEV_MODE=true  # To force dev URL
```

If neither the YAML settings nor the environment variables are set, the client defaults to `https://co-datascientist.io`.

---

## Resource Allocation (GPU, CPU, Memory)

Control how much hardware each Docker container gets.

### GPU Configuration

**Auto-detection (default):**
```yaml
# No configuration needed - GPUs auto-detected!
# If available: containers get GPU access
# If not available: containers run on CPU automatically
```

**Manual control:**
```yaml
# enable_gpu: false     # Force CPU-only (even if GPU available)
enable_gpu: true        # Force GPU (fails if not available)
gpus_per_task: 1        # Each container gets 1 GPU
```

### CPU & Memory Limits

```yaml
cpus_per_task: 4.0      # Each container gets 4 CPU cores
memory_per_task: "8g"   # Each container gets 8GB RAM
```

### Common Scenarios

**Single GPU Workstation:**
```yaml
entry_command: "python train.py"
parallel: 2
gpus_per_task: 1        # Each gets 1 GPU (total: 2 GPUs)
cpus_per_task: 4.0      # Each gets 4 cores (total: 8 cores)
memory_per_task: "8g"   # Each gets 8GB (total: 16GB)
```

**Multi-GPU Server:**
```yaml
entry_command: "python train.py"
parallel: 8
gpus_per_task: 1        # Each gets 1 GPU (total: 8 GPUs)
cpus_per_task: 2.0      # Each gets 2 cores (total: 16 cores)
memory_per_task: "4g"   # Each gets 4GB (total: 32GB)
```

**CPU-Only Machine:**
```yaml
entry_command: "python train.py"
parallel: 4
enable_gpu: false       # Force CPU mode
cpus_per_task: 2.0      # Each gets 2 cores (total: 8 cores)
memory_per_task: "2g"   # Each gets 2GB (total: 8GB)
```

---

## Before vs After Example

<table>
<tr>
<th>Before <br><sub>KPI ≈ 0.50</sub></th>
<th>After <br><sub>KPI = 1.00</sub></th>
</tr>
<tr>
<td>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'KPI: {accuracy:.4f}')
```

</td>
<td>

```python
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

class ChebyshevPolyExpansion(BaseEstimator, TransformerMixin):
    def __init__(self, degree=3):
        self.degree = degree
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = np.asarray(X)
        X_scaled = 2 * X - 1
        n_samples, n_features = X_scaled.shape
        features = []
        for f in range(n_features):
            x = X_scaled[:, f]
            T = np.empty((self.degree + 1, n_samples))
            T[0] = 1
            if self.degree >= 1:
                T[1] = x
            for d in range(2, self.degree + 1):
                T[d] = 2 * x * T[d - 1] - T[d - 2]
            features.append(T.T)
        return np.hstack(features)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('cheb', ChebyshevPolyExpansion(degree=3)),
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'KPI: {accuracy:.4f}')
```

</td>
</tr>
</table>

---

## Cloud Integrations

<details>
<summary><h3>Google Cloud Run Jobs Integration</h3></summary>

Execute your code at scale on Google Cloud infrastructure.

#### Prerequisites (One-Time, 5 Minutes)

1. **Install & authenticate gcloud CLI:**
```bash
# Install gcloud CLI (if not installed)
# See: https://cloud.google.com/sdk/docs/install

# Authenticate
gcloud auth login
gcloud auth application-default login

# Set your project
gcloud config set project YOUR_PROJECT_ID
```

2. **Enable required APIs:**
```bash
gcloud services enable artifactregistry.googleapis.com
gcloud services enable run.googleapis.com
```

3. **Create Artifact Registry repository:**
```bash
gcloud artifacts repositories create co-datascientist-repo \
  --repository-format=docker \
  --location=us-central1 \
  --description="Docker images for Co-DataScientist"
```

#### Configuration

**Minimal config.yaml for GCloud:**
```yaml
mode: gcloud
entry_command: "python train.py"
project_id: "your-gcp-project-id"
```

**With options:**
```yaml
mode: gcloud
entry_command: "python train.py"
project_id: "your-gcp-project-id"

# Optional
region: "us-central1"
repo: "co-datascientist-repo"
parallel: 2
data_volume: "gs://your-bucket"
api_key: "sk_your_token"
```

#### What Happens

When you run `tropiflo run --config config.yaml`:

1. Builds your Docker image locally
2. Pushes to GCP Artifact Registry
3. Creates & executes Cloud Run Job
4. Retrieves results and KPIs
5. Cleans up resources automatically

**Cost efficient:** Jobs and images are cleaned up automatically (configurable via `cleanup_job` and `cleanup_remote_image`).

#### Using Data from GCS

```yaml
mode: gcloud
project_id: "my-project"
entry_command: "python train.py"
data_volume: "gs://my-data-bucket"
```

Your code accesses data at `/data`:

```python
import os
DATA_DIR = os.environ.get("INPUT_URI", "/data")
df = pd.read_csv(os.path.join(DATA_DIR, "train.csv"))
```

**Note:** Your Cloud Run service account needs `storage.objectViewer` permission on the bucket.
</details>

<details>
<summary><h3>AWS ECS Fargate Integration</h3></summary>

Execute and optimize your Python code at scale using AWS ECS Fargate.

#### Setup

1. **Prerequisites:**
   - AWS account with ECS Fargate enabled
   - Authenticated AWS CLI: `aws configure`
   - An ECS cluster and task definition configured for your needs

2. **Create config.yaml:**
```yaml
mode: aws
entry_command: "python train.py"
aws:
  script_path: "/path/to/your/script.py"
  cluster: "my-cluster"
  task_definition: "my-job-taskdef"
  launch_type: "FARGATE"
  region: "us-east-1"
  network_configuration:
    subnets: ["subnet-abc123", "subnet-def456"]
    security_groups: ["sg-123456"]
    assign_public_ip: "ENABLED"
  timeout: 1800  # seconds
```

3. **Run:**
```bash
tropiflo run --config config.yaml
```

Your code will be executed in AWS ECS Fargate containers, with results and KPIs retrieved automatically. Perfect for serverless compute scaling!
</details>

<details>
<summary><h3>Databricks Integration</h3></summary>

Run Tropiflo evolution on a Databricks cluster instead of local Docker containers. Your code is uploaded to Databricks storage and executed as a Spark Python task.

#### Prerequisites

1. **Install the Databricks CLI (v2):**

```bash
# Linux / macOS
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sudo sh

# Windows — download the installer from:
# https://docs.databricks.com/en/dev-tools/cli/install.html
```

2. **Authenticate with a Personal Access Token:**

Generate a token in your Databricks workspace under **Settings > Developer > Access tokens**, then configure the CLI:

```bash
databricks configure
# Enter your workspace URL (e.g. https://dbc-xxxxx.cloud.databricks.com)
# Enter your access token
```

Verify it works:

```bash
databricks auth describe
databricks clusters list
```

#### Choosing Where to Store Files (`volume_uri`)

Tropiflo uploads your project files to Databricks so the cluster can run them. The `volume_uri` setting controls **where** those files go. There are three supported storage types:

| Storage type | `volume_uri` example | Best for |
|---|---|---|
| **Unity Catalog Volume** (recommended) | `dbfs:/Volumes/my_catalog/my_schema/my_volume` | Modern workspaces with Unity Catalog |
| **Workspace Files** | `/Workspace/Users/you@company.com/tropiflo` | Workspaces where DBFS is restricted |
| **Classic DBFS** | `dbfs:/FileStore/tropiflo` | Legacy workspaces without Unity Catalog |

**Unity Catalog Volumes** are recommended because they work with fine-grained permissions and don't require the broad `SELECT on any file` privilege.

> **Finding your Volume path**
>
> If you already have a Unity Catalog Volume, find its path with:
>
> ```bash
> databricks catalogs list
> databricks schemas list <catalog_name>
> databricks volumes list <catalog_name>.<schema_name>
> ```
>
> If you need to create one (run in a Databricks notebook or SQL editor):
>
> ```sql
> CREATE VOLUME my_catalog.my_schema.tropiflo_volume;
> ```
>
> Then use: `volume_uri: "dbfs:/Volumes/my_catalog/my_schema/tropiflo_volume"`

#### Configuration

**Minimal config.yaml:**

```yaml
mode: databricks
entry_command: "python train.py"

databricks:
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/my_volume"
  timeout: "30m"
  job:
    tasks:
      - task_key: "t"
        existing_cluster_id: "0324-151716-abc123"   # your cluster ID
```

Find your cluster ID in the Databricks UI under **Compute > your cluster > JSON view**, or run:

```bash
databricks clusters list
```

**Full config with all options:**

```yaml
mode: databricks
entry_command: "python train.py"

databricks:
  cli: "databricks"                                     # CLI binary name or path
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/my_volume"
  timeout: "30m"                                        # max job runtime
  cleanup_remote_files: true                            # delete uploaded files after each run

  job:
    tasks:
      - task_key: "t"
        existing_cluster_id: "0324-151716-abc123"
```

> **Note:** You don't need to set `spark_python_task.python_file` -- Tropiflo automatically sets it to the uploaded launcher script. If your project has a `requirements.txt`, dependencies are auto-detected and installed on the cluster before your code runs.

#### How It Works

When you run `tropiflo run --config config.yaml`:

1. Your project is zipped and uploaded to `{volume_uri}/runs/{run_id}/project.zip`
2. A launcher script is uploaded to `{volume_uri}/runs/{run_id}/launcher.py`
3. A Databricks job is submitted that runs the launcher on your cluster
4. The launcher extracts the project zip and runs your `entry_command`
5. Tropiflo polls for completion and retrieves stdout/stderr/KPI
6. If `cleanup_remote_files: true`, the run directory is deleted afterward

#### Environment & Dependencies

Your code runs inside the Python environment of the Databricks cluster. There is no Docker container -- packages, drivers, and hardware are whatever the cluster provides.

**Base environment:** Comes from the [Databricks Runtime](https://docs.databricks.com/en/release-notes/runtime/index.html) installed on your cluster. Standard runtimes include numpy, pandas, scikit-learn, etc. **ML Runtimes** (e.g. `15.4 LTS ML`) additionally include PyTorch, TensorFlow, XGBoost, and CUDA/cuDNN drivers.

**Adding packages via `requirements.txt` (recommended):** If your project has a `requirements.txt`, Tropiflo reads it and installs the packages automatically before your code runs. For existing clusters, packages are installed via the Databricks [task libraries](https://docs.databricks.com/en/jobs/task-library-dependencies.html) mechanism.

```
my_project/
├── config.yaml
├── train.py
└── requirements.txt   ← auto-detected
```

**Pre-installing on the cluster:** For packages that are slow to install or need special build steps, install them directly on the cluster via **Compute > your cluster > Libraries > Install new**. They'll be available to every job without per-run install overhead.

**Serverless compute:** If you omit `existing_cluster_id`, Tropiflo uses the `environments` spec for Databricks serverless. Dependencies from `requirements.txt` are passed as `environments[*].spec.dependencies`:

```yaml
databricks:
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/my_volume"
  timeout: "30m"
  job:
    tasks:
      - task_key: "t"
        environment_key: "default"
    environments:
      - environment_key: "default"
        spec:
          client: "1"
          dependencies:
            - "scikit-learn>=1.0.0"
```

#### GPU Clusters

Databricks GPU support works out of the box -- no Tropiflo configuration needed. Unlike local mode (which requires `enable_gpu` and `gpus_per_task` for Docker), Databricks mode runs directly on the cluster hardware with no container layer.

**Setup:** Just point `existing_cluster_id` to a GPU-enabled cluster:

```yaml
databricks:
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/my_volume"
  timeout: "30m"
  job:
    tasks:
      - task_key: "t"
        existing_cluster_id: "0324-151716-gpu-cluster"
```

Your code sees GPUs automatically:

```python
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")  # "Using: cuda"
```

**Recommended cluster setup for GPU workloads:**

- **Runtime:** Use a **ML Runtime** (e.g. `15.4 LTS ML GPU`) -- it comes with CUDA, cuDNN, PyTorch, and TensorFlow pre-installed
- **Node type:** Pick a GPU instance (e.g. `g4dn.xlarge` on AWS, `Standard_NC6s_v3` on Azure, `a2-highgpu-1g` on GCP)
- **Single-node mode:** Enable "Use as single node" under **Advanced options** -- this ensures the driver node (where your code runs) has GPU access. On multi-node clusters, only the driver runs your script via `spark_python_task`, so the driver node must have the GPU

#### Accessing Data on Databricks

Unlike local mode (which mounts a `data_volume` into Docker), Databricks mode runs your code on a remote cluster. Your script must read data from locations the cluster can access directly. There is no automatic `INPUT_URI` or `/data` mount.

**Common patterns:**

**Unity Catalog tables (recommended):**

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_catalog.my_schema.my_table").toPandas()
```

**Unity Catalog Volumes (files on a volume):**

```python
import pandas as pd

df = pd.read_csv("/Volumes/my_catalog/my_schema/my_volume/data/train.csv")
```

**Cloud storage (S3, ADLS, GCS):**

```python
import pandas as pd

df = pd.read_csv("s3://my-bucket/data/train.csv")
# or: "abfss://container@storage.dfs.core.windows.net/data/train.csv"
# or: "gs://my-bucket/data/train.csv"
```

**Classic DBFS (legacy):**

```python
import pandas as pd

df = pd.read_csv("/dbfs/FileStore/data/train.csv")
```

> **Tip:** Keep large datasets out of your project directory. Tropiflo zips your entire project and uploads it for each run. If you have a `data/` folder inside your project, it will be zipped and uploaded every time -- slow and wasteful. Instead, store data on Volumes / tables / cloud storage and reference it by path in your code.

#### Using Workspace Paths

If Unity Catalog Volumes aren't available, you can store files directly in the Databricks Workspace filesystem:

```yaml
databricks:
  volume_uri: "/Workspace/Users/you@company.com/tropiflo"
```

Tropiflo detects Workspace paths and automatically uses `databricks workspace` CLI commands (instead of `databricks fs`) for uploads. The Jobs API receives `/Workspace/...` paths, which don't require DBFS file privileges.

> If you accidentally write `dbfs:/Workspace/...`, Tropiflo strips the `dbfs:` prefix and logs a warning. It's better to use the correct form from the start.

#### Troubleshooting

**`INSUFFICIENT_PERMISSIONS: User does not have permission SELECT on any file`**

This means the cluster has Unity Catalog enabled but the job references a `dbfs:/` path. Solutions:

- **Best fix:** Switch `volume_uri` to a Unity Catalog Volume: `dbfs:/Volumes/<catalog>/<schema>/<volume>`
- **Alternative:** Use a Workspace path: `/Workspace/Users/you@company.com/tropiflo`
- **If you must use DBFS:** Ask your workspace admin to grant `SELECT on any file` (not recommended — it's a broad privilege)

**`Error: No operations allowed on this path` when running `databricks fs ls dbfs:/Volumes`**

You can't list the bare `/Volumes` root. You need the full path including catalog, schema, and volume name:

```bash
# Wrong
databricks fs ls dbfs:/Volumes

# Correct
databricks fs ls dbfs:/Volumes/my_catalog/my_schema/my_volume/
```

**`Failed to validate python file ...`**

Check that:
1. Your `volume_uri` points to a location the cluster can actually read
2. The cluster is running and accessible (`databricks clusters list`)
3. Your token has permission to submit jobs (`databricks jobs list`)

**Windows-specific: `databricks` not found**

Set the `cli` field to the full path or use `databricks.exe`:

```yaml
databricks:
  cli: "databricks.exe"
  # or the full path:
  # cli: "C:\\Users\\you\\AppData\\Local\\Programs\\databricks\\databricks.exe"
```

</details>

---

## Important Notes

- **Avoid `input()` or interactive prompts** — Tropiflo needs to run your code automatically
- **Mark the parts you want to evolve** — Use `CO_DATASCIENTIST_BLOCK_START` and `CO_DATASCIENTIST_BLOCK_END`
- **Add comments with context** — Tropiflo understands your domain! Explain your problem, constraints, and ideas in comments near your code

---

## Naming Note

**"Co-DataScientist" is the internal engine behind Tropiflo.**  
You only interact with the Tropiflo CLI. If you see references to "Co-DataScientist" in code, logs, or config keys, that's the underlying system. They're the same product.

---

## Need Help?

We'd love to chat: [oz.kilim@tropiflo.io](mailto:oz.kilim@tropiflo.io)

---

<p align="center"><em>Disclaimer: Tropiflo executes your scripts on your own machine. Make sure you trust the code you feed it!</em></p>

<p align="center">Made by the Tropiflo team</p>
