Metadata-Version: 2.4
Name: canifinetune
Version: 0.1.0
Summary: Estimate, benchmark, and generate fine-tuning recipes for LLMs on consumer GPUs.
Project-URL: Homepage, https://github.com/DaoyuanLi2816/can-i-finetune-this
Project-URL: Issues, https://github.com/DaoyuanLi2816/can-i-finetune-this/issues
Project-URL: Repository, https://github.com/DaoyuanLi2816/can-i-finetune-this
Author-email: Daoyuan Li <lidaoyuan2816@gmail.com>
License: MIT
License-File: LICENSE
Keywords: fine-tuning,gpu,huggingface,llm,lora,memory,peft,qlora,transformers
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: huggingface-hub>=0.25
Requires-Dist: jinja2>=3.1
Requires-Dist: platformdirs>=4.2
Requires-Dist: pydantic>=2.6
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.7
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev'
Provides-Extra: report
Requires-Dist: pandas>=2.2; extra == 'report'
Requires-Dist: tabulate>=0.9; extra == 'report'
Provides-Extra: train
Requires-Dist: accelerate>=0.33; extra == 'train'
Requires-Dist: bitsandbytes>=0.43; (platform_system != 'Darwin') and extra == 'train'
Requires-Dist: datasets>=2.20; extra == 'train'
Requires-Dist: peft>=0.12; extra == 'train'
Requires-Dist: protobuf>=4.25; extra == 'train'
Requires-Dist: sentencepiece>=0.2; extra == 'train'
Requires-Dist: torch>=2.3; extra == 'train'
Requires-Dist: transformers>=4.43; extra == 'train'
Requires-Dist: trl>=0.9; extra == 'train'
Description-Content-Type: text/markdown

# can-i-finetune-this

[![CI](https://github.com/DaoyuanLi2816/can-i-finetune-this/actions/workflows/ci.yml/badge.svg)](https://github.com/DaoyuanLi2816/can-i-finetune-this/actions/workflows/ci.yml)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org)
[![License](https://img.shields.io/badge/license-MIT-blue)](LICENSE)

**Estimate, benchmark, and generate fine-tuning recipes for LLMs on consumer GPUs.**

![can-i-finetune-this architecture](docs/architecture.png)

You have one consumer-grade NVIDIA GPU. You want to fine-tune an open-weight LLM
with LoRA or QLoRA, but you do not want to download 14 GB of weights just to
discover that your 12 GB / 16 GB / 24 GB card OOMs on step 1.

`canifinetune` answers, before you spend the disk and the time:

1. Can I fine-tune this model?
2. About how much VRAM will it use?
3. What batch size / sequence length / LoRA rank / quantization should I use?
4. If I can't, how should I downsize?
5. Is there local benchmark evidence for that answer?
6. Can I get a ready-to-run Hugging Face + PEFT + TRL training script for that config?

It is a single Python package with a CLI:

```bash
canifinetune doctor
canifinetune estimate --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --gpu-vram-gb 16 --seq-len 2048 --micro-batch-size 1 --lora-rank 16
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16
canifinetune bench    --model sshleifer/tiny-gpt2 --method lora --steps 3
canifinetune calibrate --benchmarks benchmarks/results
canifinetune recipe   --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --output recipes/qwen2.5-1.5b-qlora-4080
canifinetune report   --benchmarks benchmarks/results --out report.md
canifinetune compare  --benchmarks benchmarks/results --out compare.md
```

What `canifinetune estimate` actually prints:

```text
+-------- Qwen/Qwen2.5-1.5B-Instruct  (qlora) --------+
| feasible: YES    ratio = 0.20    confidence = medium |
+------------------------------------------------------+
       Memory breakdown (GB)
+---------------------------------+
| Component             |   Value |
|-----------------------+---------|
| static model          |   0.737 |
| quantization overhead |   0.018 |
| trainable params      |  4.4 MB |
| gradients             |   0.008 |
| optimizer states      |   0.010 |
| activations           |   0.328 |
| CUDA / fragmentation  |   1.280 |
| safety margin         |   0.800 |
| total                 |   3.163 |
+---------------------------------+
```

Static estimate says 3.16 GB; on a real RTX 4080 the same config measures
7.10 GB (heavy bitsandbytes unpacking buffers at seq_len=2048). `canifinetune
bench` and `canifinetune calibrate` close that gap on your machine —
that is the *point* of the project.

---

## Install

`canifinetune` runs in two layers:

| Layer | Install | What you get |
| --- | --- | --- |
| Core (estimate / recommend / recipe / report) | `pip install canifinetune` | All CLI commands. No PyTorch required. |
| Training (bench / real fine-tuning) | `pip install canifinetune[train]` | Adds `torch`, `transformers`, `peft`, `bitsandbytes`, `trl`, `datasets`. |
| Reporting extras | `pip install canifinetune[report]` | Pandas/tabulate for prettier tables. |
| Development | `pip install canifinetune[dev]` | pytest, ruff, mypy. |

If you use `uv`:

```bash
uv venv
uv pip install -e ".[dev,report]"
# Add training deps when you want to run benchmarks:
uv pip install -e ".[dev,train,report]"
```

PyTorch should generally be installed with the CUDA wheel that matches your driver,
e.g.

```bash
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
```

See `docs/troubleshooting.md` for Windows / WSL / bitsandbytes specifics.

---

## Quickstart

```bash
# 1. See what your machine looks like
canifinetune doctor

# 2. Ask if a model fits on your card
canifinetune estimate \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --gpu-vram-gb 16 \
  --seq-len 2048 \
  --micro-batch-size 1 \
  --lora-rank 16

# 3. Have it search for a feasible config
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16

# 4. Run a tiny real benchmark (downloads sshleifer/tiny-gpt2, ~5 MB)
canifinetune bench --model sshleifer/tiny-gpt2 --method lora --steps 3

# 5. Generate a ready-to-run training recipe
canifinetune recipe \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --seq-len 2048 \
  --output recipes/qwen2.5-1.5b-qlora-4080
```

---

## What's different from `accelerate estimate-memory`?

`accelerate estimate-memory` tells you how much memory **loading** a model takes.
That is not enough to know whether you can **train** it.

This project tries to answer the harder question. It models:

- Model weights, in fp32 / fp16 / bf16 / int8 / NF4 + double-quant
- LoRA / QLoRA trainable parameter count for typical `target_modules`
- Gradients only for trainable parameters
- AdamW vs 8-bit / paged AdamW optimizer states
- Activations as a function of `seq_len`, `batch_size`, `hidden_size`, `num_layers`,
  with and without gradient checkpointing
- A fragmentation / CUDA / buffer safety margin
- A feasibility decision against your actual GPU
- Concrete degradation suggestions when not feasible

Estimates are **always** marked with an `assumptions` block and a `confidence`
level, because activation memory in particular is hard to predict statically.
Run `canifinetune bench` and `canifinetune calibrate` to ground them in real
measurements on your machine.

---

## RTX 4080 baselines

`docs/rtx4080_baselines.md` contains real measurements collected on a single
RTX 4080 (16 GB). These are not synthetic. If a configuration was not run, the
table says "not run", not a guessed number.

Highlights (more in the doc):

| model | method | seq_len | measured peak | tok/sec |
| --- | --- | --- | --- | --- |
| `Qwen/Qwen2.5-0.5B-Instruct` | qlora | 1024 | 3.30 GB | 1995 |
| `Qwen/Qwen2.5-1.5B-Instruct` | qlora | 1024 | 4.36 GB | 1352 |
| `Qwen/Qwen2.5-1.5B-Instruct` | qlora | 2048 | 7.10 GB | 1470 |
| `Qwen/Qwen2.5-3B-Instruct` | qlora | 1024 | 5.54 GB | 1158 |
| `sshleifer/tiny-gpt2` (smoke) | lora | 128 | 0.12 GB | 1735 |

---

## Repository layout

```
src/canifinetune/        # package code (estimator, bench, recipes, reports, cli)
benchmarks/              # configs/, results/ (JSON), calibration/
docs/                    # design, memory model, troubleshooting
examples/                # end-to-end recipe folders
tests/                   # pytest tests (CPU-only, no large downloads)
scripts/                 # helper scripts for collecting baselines
.github/workflows/       # CI (ruff + pytest on CPU)
```

---

## Roadmap

The current scope is "single consumer GPU, single node, LoRA / QLoRA, causal LM,
Hugging Face stack". Possible directions, none committed:

- DeepSpeed ZeRO and FSDP estimation for multi-GPU setups
- Heuristics for sequence-classification / encoder-decoder training
- Throughput modeling (tokens / sec), not just feasibility
- Auto-tuning of `gradient_accumulation_steps` for a target effective batch size
- A web UI on top of the CLI

Contributions welcome.

---

## License

MIT. See `LICENSE`.
