Metadata-Version: 2.4
Name: sqlforge
Version: 0.1.0
Summary: Fine-tuned text-to-SQL: turn plain-English questions into correct SQL, measured with real execution accuracy on Spider.
Author-email: Abdullah Kousa <akousa360@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/abdullahkousa2/sqlforge
Project-URL: Repository, https://github.com/abdullahkousa2/sqlforge
Project-URL: Demo, https://huggingface.co/spaces/Abdullahkousa2/sqlforge
Project-URL: Experiment tracking, https://wandb.ai/akousa360-arab-international-university-/sqlforge-text2sql
Keywords: text-to-sql,nl2sql,llm,qlora,fine-tuning,spider,sql,qwen
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Database
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.45.0
Requires-Dist: peft>=0.13.0
Requires-Dist: accelerate>=1.0.0
Requires-Dist: huggingface_hub>=0.25.0
Provides-Extra: serve
Requires-Dist: fastapi>=0.111.0; extra == "serve"
Requires-Dist: uvicorn[standard]>=0.30.0; extra == "serve"
Provides-Extra: train
Requires-Dist: trl>=0.11.0; extra == "train"
Requires-Dist: datasets>=3.0.0; extra == "train"
Requires-Dist: bitsandbytes>=0.44.0; extra == "train"
Requires-Dist: wandb>=0.18.0; extra == "train"
Requires-Dist: pyyaml>=6.0; extra == "train"
Dynamic: license-file

<div align="center">

# SQLForge 🛠️

**A small open LLM, fine-tuned to turn plain-English questions into correct SQL — measured on the [Spider](https://yale-lily.github.io/spider) benchmark with real *execution accuracy*, not string matching.**

[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=flat&logo=python&logoColor=white)](https://www.python.org/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.7-EE4C2C?style=flat&logo=pytorch&logoColor=white)](https://pytorch.org/)
[![QLoRA](https://img.shields.io/badge/Fine--tuning-QLoRA%204bit-7C3AED?style=flat)](https://arxiv.org/abs/2305.14314)
[![Base](https://img.shields.io/badge/Base-Qwen2.5--Coder--1.5B-00A67E?style=flat)](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct)
[![W&B](https://img.shields.io/badge/Tracking-Weights%20%26%20Biases-FFBE00?style=flat&logo=weightsandbiases&logoColor=black)](https://wandb.ai/akousa360-arab-international-university-/sqlforge-text2sql)
[![PyPI](https://img.shields.io/pypi/v/sqlforge?style=flat&logo=pypi&logoColor=white&color=3775A9)](https://pypi.org/project/sqlforge/)
[![HF Space](https://img.shields.io/badge/🤗%20Demo-HF%20Spaces-FFD21E?style=flat)](https://huggingface.co/spaces/Abdullahkousa2/sqlforge)
[![License](https://img.shields.io/badge/License-MIT-green?style=flat)](LICENSE)

</div>

---

## 📊 Headline result

Fine-tuning a 1.5B model with a single LoRA adapter lifted execution accuracy by **+8.1 points** on the full Spider dev set — and cut the number of *crashing* queries by **35%**.

| | Execution accuracy | Correct (of 1034) | Crashing queries |
|---|:---:|:---:|:---:|
| **Base Qwen2.5-Coder-1.5B** (zero-shot) | 57.45% | 594 | 228 |
| **+ QLoRA fine-tune** (SQLForge) | **65.57%** | **678** | **148** |
| | **+8.12 pts** | **+84** | **−80 (−35%)** |

> Measured on the **full 1034-example Spider validation split**, by running every generated query against the real SQLite database and comparing result sets. Trained on a single **RTX 3070 (8GB)**. [See the runs on W&B →](https://wandb.ai/akousa360-arab-international-university-/sqlforge-text2sql)

---

## What this is

A **from-scratch fine-tuning project**, not an API wrapper. It takes a small open *code* model, teaches it text-to-SQL with QLoRA, and proves the improvement with a proper execution-accuracy harness on Spider.

The goal is to demonstrate the **full ML loop** — `data → train → evaluate → serve → ship` — with experiment tracking, an honest before/after, error analysis, and an interactive demo that runs the generated SQL against live databases.

## Why execution accuracy (and why it matters)

Most text-to-SQL demos compare the predicted query *string* to the gold string. That's fragile — two very different queries can be equally correct, and two near-identical strings can differ by one wrong column.

SQLForge instead **runs** each generated query against the actual SQLite database and compares the **result sets** (order-insensitive unless the gold query uses `ORDER BY`). If the gold query itself fails, the example is skipped rather than counted. This is the same metric used by the Spider execution-accuracy leaderboard, and it's the honest one.

## The agentic twist: self-correction

When a generated query *crashes* (e.g. `no such column`), SQLForge doesn't just mark it wrong — it feeds the SQL **and the database error** back to the model, hands it the exact valid table/column names, and lets it retry. You can watch this happen live in the demo's **self-correction trace**.

> **Honest note:** self-correction recovers some crashes but only nudges overall accuracy by ~1 point — a 1.5B model often repeats its own mistake. It's a genuinely useful, transparent feature, *and* an honest illustration of a small model's reasoning ceiling. See [Limitations](#limitations).

## Architecture

```mermaid
flowchart LR
    A[Spider<br/>question + schema] --> B[Qwen2.5-Coder-1.5B<br/>4-bit + QLoRA adapter]
    B --> C[generated SQL]
    C --> D{run on<br/>real SQLite DB}
    D -- error --> E[self-correction:<br/>feed error + valid names back]
    E --> B
    D -- rows --> F[compare result sets<br/>= execution accuracy]
```

## Live demo

A FastAPI server + a custom dark UI: pick a database, ask a question, and watch the model **write the SQL, run it, and show the results** — with the self-correction trace when it retries.

- 🤗 **Hosted:** [huggingface.co/spaces/Abdullahkousa2/sqlforge](https://huggingface.co/spaces/Abdullahkousa2/sqlforge)
- 💻 **Local:**
  ```bash
  python -m uvicorn app.server:app --port 8000
  # then open http://localhost:8000
  ```

## Install (library)

```bash
pip install sqlforge
```

Use the inference + evaluation toolkit directly:

```python
from sqlforge.inference import load_model, generate_sql

model, tok = load_model("Qwen/Qwen2.5-Coder-1.5B-Instruct",
                        adapter_path="Abdullahkousa2/sqlforge-qwen2.5-coder-1.5b")

schema = '''CREATE TABLE singer ("Singer_ID" int, "Name" text, "Age" int);'''
print(generate_sql(model, tok, schema, "How many singers are there?"))
# -> SELECT count(*) FROM singer
```

Or from the command line:

```bash
sqlforge --db mydata.sqlite --question "How many users signed up after 2020?"
```

## Reproduce the pipeline

The whole thing runs on a single 8GB GPU. Steps are also available as a guided notebook: [`run_pipeline.ipynb`](run_pipeline.ipynb).

```bash
# 0. install (torch is assumed pre-installed; see requirements.txt)
pip install -r requirements.txt

# 1. data
python scripts/download_data.py        # Spider question/SQL pairs (HF)
python scripts/download_databases.py   # Spider SQLite databases
python scripts/prepare_data.py         # -> data/processed/{train,validation}.jsonl

# 2. baseline (the "before" number)
python scripts/evaluate.py --no-4bit --wandb

# 3. fine-tune (QLoRA, ~3 epochs on an RTX 3070)
python scripts/train.py

# 4. fine-tuned eval (the "after" number)
python scripts/evaluate.py --adapter outputs/qwen2.5-coder-1.5b-sql --no-4bit --wandb

# 5. self-correction eval
python scripts/evaluate.py --adapter outputs/qwen2.5-coder-1.5b-sql --no-4bit --self-correct --wandb
```

## Training configuration

| | |
|---|---|
| Base model | `Qwen/Qwen2.5-Coder-1.5B-Instruct` |
| Method | QLoRA — 4-bit NF4 base + LoRA adapter |
| LoRA | r=16, α=32, dropout=0.05, all attention + MLP projections |
| Schedule | 3 epochs, lr 2e-4 cosine, warmup 0.03 |
| Batching | per-device 1 × grad-accum 16 (effective 16) |
| Memory | gradient checkpointing, paged AdamW 8-bit, bf16 |
| Seq length | 1280 (the ~3% longer examples are dropped, never truncated) |

Full config: [`configs/train_config.yaml`](configs/train_config.yaml).

## Repository layout

```
sqlforge/            # the library: prompt format, inference, execution-accuracy grader
  prompts.py         #   single source of truth for the prompt (no train/eval skew)
  inference.py       #   model loading, generation, self-correction loop
  exec_eval.py       #   run SQL on SQLite, compare result sets
scripts/             # download → prepare → train → evaluate
app/                 # FastAPI demo server + custom dark UI (static/)
configs/             # training config
run_pipeline.ipynb   # the whole pipeline as a guided notebook
```

## Limitations

Stated plainly, because honest evaluation is the point:

- **The model is small (1.5B).** Its dominant failure is *over-joining* — building an unnecessary JOIN and referencing a column on the wrong table (`no such column`). Fine-tuning reduced this by a third but couldn't eliminate it.
- **~90% needs a different class of system.** Top of the Spider leaderboard means a frontier model (GPT-4 / Claude) inside an agentic pipeline (DIN-SQL / DAIL-SQL: schema linking, decomposition, self-consistency). A locally-trained 1.5B realistically tops out in the 60s–70s.
- **A 200-example eval is noise.** An early 200-sample read showed 49.5%→51.5%; the full 1034-example set revealed the true 57.4%→65.6%. Always evaluate the full split.

## Roadmap

- [ ] Scale to a 3B/7B base for the mid-70s
- [ ] Stop-string decoding to speed up worst-case generation
- [ ] BIRD dataset augmentation
- [ ] Schema-linking pre-step before generation

## Tech stack

PyTorch · 🤗 Transformers · PEFT · TRL · bitsandbytes · Datasets · Weights & Biases · FastAPI

## License

[MIT](LICENSE) © Abdullah Kousa
