Metadata-Version: 2.4
Name: amazingvmsloth
Version: 0.2.7
Summary: Blazing-fast LLM fine-tuning with minimal VRAM — multi-GPU, manual LoRA gradients, flash attention, 4-bit quant
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.1.0
Requires-Dist: transformers>=4.36.0
Requires-Dist: bitsandbytes>=0.41.0
Requires-Dist: peft>=0.7.0
Requires-Dist: triton>=2.1.0; sys_platform == "linux"
Requires-Dist: safetensors>=0.4.0
Requires-Dist: accelerate>=0.25.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: psutil>=5.9.0
Provides-Extra: flash-attn
Requires-Dist: flash-attn>=2.3.0; extra == "flash-attn"
Provides-Extra: multi-gpu
Requires-Dist: deepspeed>=0.12.0; extra == "multi-gpu"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: all
Requires-Dist: amazingvmsloth[dev,flash-attn,multi-gpu]; extra == "all"

# amazingvmsloth

Blazing-fast LLM fine-tuning with minimal VRAM.

```
  \   / |    amazingvmsloth - Fast LLM Fine-Tuning
   O^O / \_/ \   Minimal VRAM. Maximum Speed.
  \        /
   "-____-"
```

**Train 14B models on a 4GB GPU.** Multi-GPU, 4-bit quantization, LoRA, CPU offloading, gradient checkpointing, and sequence packing — all built for speed on consumer hardware.

---

## Install

```bash
pip install amazingvmsloth
```

Or from source:

```bash
git clone https://github.com/CollabVMgamez/amazingvmsloth.git
cd amazingvmsloth
pip install -e .
```

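**Requirements:** Python 3.9+ and PyTorch 2.1+. CUDA 11.8+ is optional; CPU-only training is supported.

**Optional extras:** `flash-attn`, `multi-gpu` (DeepSpeed), `dev`, and `all`, installed with, for example, `pip install "amazingvmsloth[flash-attn]"`.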

---

## Quick Start

### 1. Wizard — let it pick settings for your hardware

```bash
amazingvmsloth wizard --model Qwen/Qwen2.5-0.5B
```

The wizard analyzes your GPU and CPU and prints a ready-to-run command.

### 2. Train

```bash
amazingvmsloth train \
  --model Qwen/Qwen2.5-0.5B \
  --dataset tatsu-lab/alpaca \
  --epochs 3 \
  --batch-size 2 \
  --grad-accum 4 \
  --lora-r 16 \
  --output-dir ./output
```
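
The `--batch-size` and `--grad-accum` flags together determine how many samples contribute to each optimizer update. The sketch below works through that arithmetic for a single GPU. The helper `training_steps` and the 1000-sample dataset size are hypothetical, and it assumes the last partial batch is kept rather than dropped; the trainer's own step counting may differ.

```python
import math


def training_steps(num_samples, batch_size, grad_accum, epochs=1):
    """Return (effective_batch, micro_steps, optimizer_steps) for one GPU.

    Hypothetical helper for illustration; not part of the amazingvmsloth API.
    """
    # Samples that contribute to a single optimizer update.
    effective_batch = batch_size * grad_accum
    # Forward/backward passes per epoch (last partial batch kept: an assumption).
    micro_per_epoch = math.ceil(num_samples / batch_size)
    # Optimizer updates per epoch: one per `grad_accum` micro-batches.
    optimizer_per_epoch = math.ceil(micro_per_epoch / grad_accum)
    return effective_batch, micro_per_epoch * epochs, optimizer_per_epoch * epochs


# Same flags as the command above, applied to a hypothetical 1000-sample dataset.
eff, micro, opt = training_steps(num_samples=1000, batch_size=2, grad_accum=4, epochs=3)
print(f"effective batch={eff}, micro-steps={micro}, optimizer steps={opt}")
```

Raising `--grad-accum` while lowering `--batch-size` keeps the effective batch constant and reduces activation memory, at the cost of more forward/backward passes per update.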

Chat-format datasets are also supported:

```bash
amazingvmsloth train \
  --model Qwen/Qwen2.5-0.5B \
  --dataset angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k \
  --dataset-format chat \
  --max-samples 1000 \
  --output-dir ./thinking_lora
```

### 3. Merge the LoRA adapter into the base model

```bash
amazingvmsloth merge \
  --model Qwen/Qwen2.5-0.5B \
  --lora ./output \
  --output ./merged_model
```

### 4. Run inference

```python
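from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged weights; device_map="auto" places them on a GPU when one is available.
model = AutoModelForCausalLM.from_pretrained("./merged_model", device_map="auto")
# The tokenizer comes from the base model that was fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Build a chat-formatted prompt that ends with the assistant turn marker.
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# Decode without special tokens such as end-of-turn markers.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))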
```

---

## CLI Commands

| Command | Description |
|---------|-------------|
| `wizard` | Interactive config generator based on your hardware |
| `train` | Fine-tune a model with LoRA |
| `merge` | Merge LoRA weights into base model |
| `convert` | Merge + convert to GGUF (requires llama.cpp) |
| `info` | Show model info and VRAM estimates |
| `bench` | Benchmark vs unsloth |

---

## Hardware Tiers

| GPU VRAM | Strategy |
|----------|----------|
| 4-6 GB | 4-bit quant, batch=1, grad accum=8, seq=512, tiny LoRA |
| 6-12 GB | 4-bit quant, batch=1-2, grad accum=4, seq=1024 |
| 12-24 GB | 4-bit or full precision, batch=2-4, torch.compile |
| 24+ GB | Full precision, no grad checkpointing, large batch |
| CPU only | fp32/bf16, torch.compile, physical-core threading |
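
The table above can be read as a simple lookup from available VRAM to a strategy. The sketch below encodes it in plain Python. The function `pick_tier` is hypothetical and is not the wizard's actual logic; because the ranges in the table share their boundaries, it assumes each lower bound is inclusive.

```python
def pick_tier(vram_gb):
    """Map GPU memory in GB to a strategy row from the hardware tier table.

    Hypothetical helper for illustration. Pass None (or 0) for a CPU-only machine.
    """
    if vram_gb is None or vram_gb <= 0:
        return "CPU only: fp32/bf16, torch.compile, physical-core threading"

    # (inclusive lower bound in GB, strategy), checked from largest to smallest.
    tiers = [
        (24, "24+ GB: full precision, no grad checkpointing, large batch"),
        (12, "12-24 GB: 4-bit or full precision, batch=2-4, torch.compile"),
        (6, "6-12 GB: 4-bit quant, batch=1-2, grad accum=4, seq=1024"),
        (4, "4-6 GB: 4-bit quant, batch=1, grad accum=8, seq=512, tiny LoRA"),
    ]
    for lower_bound, strategy in tiers:
        if vram_gb >= lower_bound:
            return strategy
    # The table does not list a strategy for GPUs under 4 GB.
    return "below 4 GB: not covered by the table"


for gb in (None, 4, 8, 16, 24):
    print(gb, "->", pick_tier(gb))
```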

---

## Key Features

- **rsLoRA** scaling for stable training at any rank
- **4-bit/8-bit quantization** via bitsandbytes
- **xFormers/SDPA attention** patching (Flash Attention on Linux)
- **Sequence packing** for 2-3x throughput
- **Gradient checkpointing** with selective layer skipping
- **Multi-GPU**: DDP, FSDP, DeepSpeed, pipeline parallelism
- **Layer offloading** via `accelerate.dispatch_model`
- **CPU training** with IPEX, pre-packing, torch.compile
- **PagedAdamW8bit** optimizer for low-VRAM training
- **Checkpoint resume** with full RNG/optimizer state
- **tqdm progress bar** with live loss + VRAM display
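
To illustrate the rsLoRA item in the list above: standard LoRA multiplies the adapter update by `alpha / r`, while rank-stabilized LoRA uses `alpha / sqrt(r)`, so the update does not shrink as quickly when the rank grows. The sketch below compares the two factors. The function `lora_scale` is a hypothetical helper, and how `lora.py` applies the factor internally is not shown here.

```python
import math


def lora_scale(alpha, r, rslora=False):
    """Scaling factor applied to the low-rank update B @ A.

    Hypothetical helper: alpha / r for standard LoRA, alpha / sqrt(r) for rsLoRA.
    """
    return alpha / math.sqrt(r) if rslora else alpha / r


alpha = 16
for r in (8, 16, 64, 256):
    standard = lora_scale(alpha, r)
    stabilized = lora_scale(alpha, r, rslora=True)
    print(f"r={r:>3}  standard={standard:.4f}  rsLoRA={stabilized:.4f}")
```

With a fixed `alpha`, the standard factor falls in proportion to `1/r`, whereas the rsLoRA factor falls only with `1/sqrt(r)`, which is why larger ranks remain usable without retuning `alpha`.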

---

## Example: 500 Steps on Dolly

```bash
amazingvmsloth train \
  --model Qwen/Qwen2.5-0.5B \
  --dataset databricks/databricks-dolly-15k \
  --dataset-format alpaca \
  --epochs 1 --batch-size 2 --grad-accum 2 \
  --max-samples 1000 --max-seq-length 512 \
  --lora-r 16 --output-dir ./dolly_lora --packing
```

This run takes about 500 steps and finishes in about 10 minutes on an RTX 3050 with 4 GB of VRAM.
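
The `--packing` flag in the command above concatenates short samples so that fewer padded tokens are processed. The sketch below shows one common way to do this, greedy first-fit-decreasing bin packing over token lengths. It is an assumption for illustration, not the algorithm in `packing.py`; the function `pack_sequences` and the sample lengths are hypothetical, and a real collator must also keep packed samples from attending to each other.

```python
def pack_sequences(lengths, max_seq_length):
    """Group sample indices so each group's token total fits in max_seq_length.

    Hypothetical first-fit-decreasing sketch; returns a list of index groups.
    """
    # Place the longest samples first; they are the hardest to fit.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []  # sample indices in each packed sequence
    free = []  # remaining token budget of each packed sequence
    for i in order:
        # Assume over-long samples are truncated to max_seq_length.
        n = min(lengths[i], max_seq_length)
        for b in range(len(bins)):
            if free[b] >= n:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing packed sequence has room, so open a new one.
            bins.append([i])
            free.append(max_seq_length - n)
    return bins


lengths = [500, 120, 300, 200, 64, 400, 90]  # hypothetical token counts
packed = pack_sequences(lengths, max_seq_length=512)
print(f"{len(lengths)} samples -> {len(packed)} packed sequences: {packed}")
```

Fewer, fuller sequences mean fewer forward passes per epoch, so the step count of a packed run is lower than the unpacked sample count divided by the batch size.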

---

## Project Structure

```
amazingvmsloth/
├── lora.py              # LoRA with rsLoRA, device-aware init
├── quantization.py      # 4-bit/8-bit quant, kbit training prep
├── attention.py         # SDPA/xFormers patching
├── trainer.py           # AmazingTrainer with tqdm, packing, offloading
├── cpu_trainer.py       # CpuTrainer for CPU-only training
├── packing.py           # Sequence packing collators
├── gradient.py          # GradientAccumulator
├── optimizer.py         # PagedAdamW8bit, CpuOffloadedAdamW
├── offload.py           # Layer offloading via accelerate
├── cli.py               # CLI entrypoint
├── wizard.py            # Hardware-aware config generator
├── bench.py             # Benchmark vs unsloth
└── utils/
    ├── banner.py        # Startup banner with GPU info
    ├── memory.py        # VRAM estimation
    ├── patching.py      # LoRA save/load helpers
    └── save_load.py     # Model save/merge
```

---

## Benchmarks

Measured on an RTX 3050 4GB Laptop GPU:

| Library | Time (1 epoch, 500 samples) | Peak VRAM |
|---------|----------------------------|-----------|
| amazingvmsloth | **5.3s** | 1.07 GB |
| unsloth | 10.1s | 0.96 GB |

amazingvmsloth is about 1.91x faster than unsloth on this small run with pre-quantized models, at a slightly higher peak VRAM (1.07 GB vs 0.96 GB).
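
The speedup figure follows directly from the table. The short check below recomputes it and also expresses the peak VRAM difference as a percentage; it uses only the numbers reported above and does not rerun any benchmark.

```python
# Figures copied from the benchmark table above.
time_s = {"amazingvmsloth": 5.3, "unsloth": 10.1}
peak_vram_gb = {"amazingvmsloth": 1.07, "unsloth": 0.96}

# Speedup is the ratio of wall-clock times.
speedup = time_s["unsloth"] / time_s["amazingvmsloth"]
# Extra peak VRAM relative to unsloth, in percent.
vram_overhead_pct = (peak_vram_gb["amazingvmsloth"] / peak_vram_gb["unsloth"] - 1) * 100

print(f"speedup: {speedup:.2f}x, extra peak VRAM: {vram_overhead_pct:.1f}%")
```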

---

## License

MIT
