Metadata-Version: 2.3
Name: opensloth
Version: 0.1.7
Summary: A high-performance framework for fine-tuning large language models with multi-GPU support
License: MIT
Keywords: machine learning,deep learning,transformers,fine-tuning,multi-gpu,llm
Author: anhvth
Author-email: anhvth.226@gmail.com
Requires-Python: >=3.9
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: fastcore (>=1.7.29,<2.0.0)
Requires-Dist: fire (>=0.7.0,<0.8.0)
Requires-Dist: jupyterlab (>=4.3.5,<5.0.0)
Requires-Dist: pyyaml (>=6.0.2,<7.0.0)
Requires-Dist: speedy-utils (>=0.1.20,<0.2.0)
Requires-Dist: tensorboard (>=2.19.0,<3.0.0)
Requires-Dist: tensorboardx (>=2.6.2.2,<3.0.0.0)
Project-URL: Bug Tracker, https://github.com/anhvth/opensloth/issues
Project-URL: Documentation, https://github.com/anhvth/opensloth
Project-URL: Homepage, https://github.com/anhvth/opensloth
Project-URL: Repository, https://github.com/anhvth/opensloth
Description-Content-Type: text/markdown

<p align="center">
    <img src="images/opensloth.png" alt="opensloth Logo" width="200" />
</p>

# OpenSloth

A multi-GPU training framework that combines [Unsloth](https://github.com/unslothai/unsloth) with multi-GPU support and [sequence packing](https://huggingface.co/blog/sirluk/llm-sequence-packing) optimizations.

**Core Components:**
- **Unsloth**: 2x faster training with 75% VRAM savings
- **Multi-GPU**: Distributed training across multiple GPUs  
- **Sequence Packing**: Smart batching that reduces padding waste by up to 40%

**The Result:** Unsloth's efficiency × GPU count × sequence packing optimizations = speedups that often exceed theoretical maximums.

## 💾 Installation

```bash
conda create --name opensloth_env python=3.11
pip install uv
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install unsloth xformers opensloth
# or from source pip install git+https://github.com/anhvth/opensloth.git
```

## ⚡ Quickstart

```bash
# Basic multi-GPU training
python scripts/train.py
```

| Example | Description | Link/Command |
|---------|-------------|--------------|
| **Kaggle Notebook (T4x2)** | Live training example on Kaggle's dual T4 GPU environment | [🔗 Qwen3 OpenSloth 2GPUs](https://www.kaggle.com/code/anhvth226/qwen3-opensloth-2gpus?scriptVersionId=244825301) |
| **Local Training Script** | Check out the training script for configuration examples | `python scripts/train.py` |
| **Local Jupyter Notebook** | Interactive training notebook for local development | [`notebooks/train.ipynb`](notebooks/train.ipynb) |

## ⚡ Performance Benchmarks

**[📊 View Full WandB Comparison](https://wandb.ai/anhvth/CompareUnsloth)**

### opensloth vs Unsloth Direct Comparison

Controlled comparison with identical configurations:
- **Model**: Qwen3-8B-bnb-4bit  
- **Training Steps**: 100 steps
- **Global Batch Size**: 32

**Results:**
- **opensloth (2 GPUs)**: 8m 28s ⚡
- **Unsloth (1 GPU)**: 19m 34s
- **Performance Gain**: ~2.3x faster 

**Why 2.3x Speedup on 2 GPUs?**

OpenSloth achieves **2.3x speedup** through three optimizations:
- ✅ **Sequence packing**: Smart batching reduces padding waste ([learn more](https://huggingface.co/blog/sirluk/llm-sequence-packing))
- ✅ **Multi-GPU scaling**: Distributed training across GPUs
- ✅ **Load balancing**: Even workload distribution across GPUs

**Scaling Expectations:**
- **2 GPUs**: ~2.3x faster than single GPU
- **4 GPUs**: ~4.6x faster than single GPU
- **8 GPUs**: ~9.2x faster than single GPU


## 🔧 Quick Tips
- Enable packing, set bz=1, long sequence length (8k, 16k, etc.) with larger gradient accumulation steps (64, 128). Unsloth's will automatically handle sequence packing on global batch to optimize gpu utilization.

**For faster iteration:**
- Start with smaller models: `unsloth/Qwen3-0.6b-bnb-4bit`
- Test single GPU first: modify `gpus=[0]` in script
- Use fewer samples for quick testing

**Recommended Configuration:**
```python
# Optimize for sequence packing and multi-GPU efficiency
TrainingConfig(
    per_device_train_batch_size=4,      # Larger batches per GPU
    gradient_accumulation_steps=8,       # Fewer gradient sync operations
    # Effective batch size = 4 * 8 * num_gpus
)
```

## 🔧 Troubleshooting

   **Single GPU Testing:**
   ```python
   # In your training script, change:
   gpus = [0]  # Use only first GPU for debugging
   ```

## 📖 Documentation

- See [CHANGELOG.md](CHANGELOG.md) for recent updates and breaking changes.

## How to Prepare and Store a Trainer Dataset

Follow these steps to extract and save a dataset from an Unsloth notebook:

1. Visit the [Unsloth Notebooks Documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks).
2. Select the notebook for your target model.
3. Export the notebook to a Python script.
4. Copy all code up to (but not including) `trainer.train()`.
5. Run the code to initialize the trainer.
6. Save the trainer's dataset:

   ```python
   trainer.train_dataset.save_to_disk("data/cache_qwen3_dataset")
   ```

This will store the processed dataset for later use.

