Metadata-Version: 2.4
Name: neotune
Version: 0.2.0
Summary: Supervised fine-tuning of LLMs with LoRA and DeepSpeed
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.40
Requires-Dist: peft>=0.10
Requires-Dist: datasets>=2.0
Requires-Dist: deepspeed>=0.12
Requires-Dist: accelerate>=0.30
Requires-Dist: scikit-learn>=1.0
Requires-Dist: numpy>=1.24
Requires-Dist: pyyaml>=6.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: tqdm>=4.60
Provides-Extra: ray
Requires-Dist: ray[train]>=2.0; extra == "ray"
Provides-Extra: logging
Requires-Dist: mlflow>=2.0; extra == "logging"
Requires-Dist: wandb>=0.15; extra == "logging"
Provides-Extra: all
Requires-Dist: neotune[logging,ray]; extra == "all"
Dynamic: license-file

# Supervised Fine-Tuning (SFT) with LoRA and DeepSpeed

This project provides a streamlined pipeline for fine-tuning Large Language Models (LLMs) like Llama 3.1 using Low-Rank Adaptation (LoRA) and DeepSpeed for efficient distributed training.

## 📂 Directory Structure

```
SFT/
├── configs/                 # Configuration files (planned)
├── k8s/                     # KubeRay manifests and image build (see Ray section)
├── <placeholder>/           # Datasets
├── <placeholder>/           # Output directory for checkpoints and adapters
├── run.sh                   # Main entry point script
├── lora_sft.py              # Main training and inference script
├── ray_train_lora_sft.py    # Ray Train entrypoint (see Ray section)
├── config.yaml              # Hyperparameters and paths configuration
├── ds_config.json           # DeepSpeed configuration
├── requirements.txt         # Python dependencies
└── README.md                # This file
```

## 🚀 Setup

1.  **Create and activate a virtual environment:**
    ```bash
    python3 -m venv venv
    source venv/bin/activate
    ```

2.  **Install dependencies:**
    ```bash
    pip install -r requirements.txt
    ```
    *(Ensure the installed `deepspeed` build is compatible with your CUDA version; see the compatibility check after this list.)*

3.  **Configure Environment:**
    Create a `.env` file in the root directory:
    ```bash
    HF_TOKEN=your_huggingface_token
    WANDB_API_KEY=your_wandb_key  # Optional, for logging
    ```
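
To confirm that the installed `deepspeed` build matches your CUDA toolkit (step 2), DeepSpeed ships a diagnostic command that reports the detected torch/CUDA versions and which DeepSpeed ops are compatible with your environment:

```bash
ds_report
```

The training scripts can pick up the `.env` values from step 3 via `python-dotenv`, which is a declared dependency. A minimal sketch of that pattern (illustrative; not necessarily how `lora_sft.py` does it):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment
hf_token = os.environ["HF_TOKEN"]  # fails loudly if the token is missing
```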

## 🛠️ Configuration

-   **`config.yaml`**: Controls model ID, dataset paths, training hyperparameters (learning rate, epochs, batch size), and LoRA settings.
-   **`ds_config.json`**: Configures DeepSpeed optimization (ZeRO stage, offloading, mixed precision).
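
For orientation, here is a sketch of what the two files might contain. The `config.yaml` field names below are purely illustrative, not the actual schema read by `lora_sft.py`:

```yaml
# Hypothetical config.yaml layout -- field names are placeholders
model_id: meta-llama/Llama-3.1-8B
data_dir: ./data_dir
output_dir: ./outputs
learning_rate: 2e-4
num_epochs: 3
per_device_batch_size: 4
lora:
  r: 16
  alpha: 32
  dropout: 0.05
```

The DeepSpeed keys below are real configuration options; the specific values are only an example:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```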

## 🏃 Usage

Use the provided `run.sh` wrapper for easy execution; it resolves the relevant directory paths automatically.

### Training

To start fine-tuning the model:

```bash
# Train on 2 GPUs (the default)
./run.sh train 2

# Train on 4 GPUs
./run.sh train 4
```
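
Under the hood, `run.sh` presumably wraps the DeepSpeed launcher. A direct invocation would look roughly like the line below; the `--config` and `--deepspeed` flag names for `lora_sft.py` are assumptions, not confirmed:

```bash
# Hypothetical equivalent of "./run.sh train 2"
deepspeed --num_gpus 2 lora_sft.py --config config.yaml --deepspeed ds_config.json
```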

### Inference

To evaluate the fine-tuned model on the test set:

```bash
./run.sh inference
```

### Custom Configuration

You can specify a custom DeepSpeed config file:

```bash
./run.sh --config custom_ds_config.json train 4
```

## 📊 Monitoring

Training progress (loss, accuracy, etc.) is logged to **MLflow** (and/or WandB if configured).

To view MLflow logs locally:
```bash
mlflow ui
```
Then open `http://localhost:5000` in your browser.
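
By default, `mlflow ui` serves the `mlruns/` directory under the current working directory; if you launch it from elsewhere, point it at the tracking store explicitly (the path below assumes the default local store):

```bash
mlflow ui --backend-store-uri ./mlruns --port 5000
```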

## 🐛 Troubleshooting

-   **`deepspeed: command not found`**: Ensure you have activated the virtual environment where `deepspeed` is installed.
-   **CUDA Errors**: Check `ds_config.json` to ensure batch sizes and offloading settings fit your GPU memory.

## ☸️ Ray Train + Kubernetes (KubeRay)

This repo now includes a Ray Train entrypoint for running multi-GPU training on a Ray cluster (including on Kubernetes via KubeRay).

- **Ray entrypoint**: `SFT/ray_train_lora_sft.py`
- **KubeRay RayJob template**: `SFT/k8s/rayjob-lora-sft.yaml`
- **Container build**: `SFT/k8s/Dockerfile`

### Local Ray (single node)

```bash
pip install -r requirements.txt
python ray_train_lora_sft.py --num_workers 2
```
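
Internally, a Ray Train entrypoint like this typically builds a `TorchTrainer`. The sketch below shows the general shape using real Ray 2.x APIs; it is not the actual contents of `ray_train_lora_sft.py`:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(loop_config: dict) -> None:
    # Each worker would load the YAML config, build the LoRA-wrapped model,
    # and run the fine-tuning loop here (omitted in this sketch).
    ...


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"config_path": "config.yaml"},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```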

### Kubernetes (high level)

1. Build/push the image from `SFT/k8s/Dockerfile` and set it in `rayjob-lora-sft.yaml`.
2. Create PVCs for:
   - `/workspace/SFT/data_dir` (your training data)
   - `/mnt/ray-results` (Ray Train run storage / checkpoints)
3. Apply the RayJob:

```bash
kubectl apply -f SFT/k8s/rayjob-lora-sft.yaml
```
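
4. Monitor the job via the RayJob custom resource (the name `lora-sft` below is a placeholder; use whatever `metadata.name` is set in `rayjob-lora-sft.yaml`):

```bash
kubectl get rayjobs -w
kubectl describe rayjob lora-sft
```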
