Metadata-Version: 2.4
Name: transfuzzy
Version: 0.1.0
Summary: TransFuzzy is a robust transliteration system that bridges the gap between Indic scripts and the Latin alphabet.
Author: Goutham
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: black>=26.3.1
Requires-Dist: build>=1.4.2
Requires-Dist: flask>=3.1.3
Requires-Dist: flask-cors>=6.0.2
Requires-Dist: fuzzywuzzy>=0.18.0
Requires-Dist: indic-transliteration>=2.3.81
Requires-Dist: jellyfish>=1.2.1
Requires-Dist: joblib>=1.5.3
Requires-Dist: langdetect>=1.0.9
Requires-Dist: matplotlib>=3.10.8
Requires-Dist: numpy>=2.4.4
Requires-Dist: pandas>=3.0.1
Requires-Dist: python-levenshtein>=0.27.3
Requires-Dist: ruff>=0.15.8
Requires-Dist: scikit-learn==1.5.2
Requires-Dist: scipy>=1.17.1
Requires-Dist: sentence-transformers>=5.3.0
Dynamic: license-file

<h1 align="center">
  <br>
  🔤 TransFuzzy
  <br>
</h1>

<h4 align="center">Multilingual Fuzzy Name Matching — phonetic + semantic + ML, all in one pipeline.</h4>

<p align="center">
  <a href="https://www.python.org/downloads/release/python-3110/">
    <img src="https://img.shields.io/badge/python-3.11+-blue.svg" alt="Python Version">
  </a>
  <a href="LICENSE">
    <img src="https://img.shields.io/badge/license-MIT-green.svg" alt="License">
  </a>
  <a href="https://github.com/astral-sh/uv">
    <img src="https://img.shields.io/badge/package%20manager-uv-blueviolet" alt="uv">
  </a>
  <img src="https://img.shields.io/badge/framework-Flask-lightgrey" alt="Flask">
  <img src="https://img.shields.io/badge/ML-RandomForest-orange" alt="Random Forest">
</p>

<p align="center">
  <a href="#-features">Features</a> •
  <a href="#%EF%B8%8F-architecture">Architecture</a> •
  <a href="#-quick-start">Quick Start</a> •
  <a href="#-api-reference">API Reference</a> •
  <a href="#-training-your-own-model">Training</a> •
  <a href="#-contributing">Contributing</a>
</p>

---

## ✨ Features

- 🌐 **Multilingual** — Supports English, Hindi (Devanagari), Telugu, Tamil, Kannada, Malayalam, Gujarati, and Gurmukhi out of the box
- 🔊 **Phonetic matching** — Soundex and Metaphone codes to catch phonetically similar spellings
- 📐 **String distance** — Levenshtein and Jaro-Winkler similarity
- 🧠 **Semantic embeddings** — `all-MiniLM-L6-v2` sentence transformer for semantic closeness
- 🌲 **ML classifier** — A trained Random Forest model that combines all metrics for a final confident prediction
- ⚡ **Fast** — Pre-filters candidates by first letter, batch-encodes embeddings, and loads models once at startup
- 🖥️ **Web UI** — Clean browser-based interface, zero frontend framework required

---

## 🏛️ Architecture

```
transfuzzy/
├── main.py               # Flask app — routes, transliteration, orchestration
├── dir/
│   ├── create_csv.py     # Step 1: pair input name against the names database
│   ├── calculate_ratios.py  # Step 2: compute 8 similarity metrics per pair
│   ├── compute_metrics.py   # Step 3: RF model predicts + hybrid scoring
│   ├── enrich_data.py    # (Training) generate positive/negative training pairs
│   └── train_model.py    # (Training) GridSearchCV to train & save best RF model
├── utils/
│   └── response.py       # Standardised JSON response helper
├── db/
│   ├── names_2.txt              # Names database (one name per line)
│   ├── names.csv                # Enriched training data
│   └── best_random_forest_model.pkl  # Pre-trained model (committed)
├── templates/
│   └── index.html        # Jinja2 template for the web UI
├── static/
│   ├── styles.css
│   ├── api.js
│   ├── ui.js
│   └── app.js
├── pyproject.toml        # Project metadata & dependencies (uv)
└── scripts/
    ├── dev.py            # Cross-platform dev launcher (uv run + open browser)
    ├── enrich.py         # Convenience wrapper: enrich_data pipeline
    └── train.py          # Convenience wrapper: train model pipeline
```

### Inference Pipeline

```
Input Name
    │
    ▼
[Script Detection]  ──── Devanagari/Telugu/etc? ──► Transliterate to ITRANS
    │
    ▼
[Create Pairs]      ──── Compare against ~73k names (pre-filtered by 1st char)
    │
    ▼
[Calculate Ratios]  ──── 8 metrics: Soundex, Metaphone, Levenshtein,
    │                    Jaro-Winkler, Cosine, Euclidean, Manhattan, Pearson
    ▼
[RF Classifier]     ──── Predict probability of match (class 'y')
    │
    ▼
[Hybrid Filter]     ──── Accept if: high RF confidence OR phonetic match
    │                    Reject if composite score < 0.70
    ▼
[Results]           ──── Sorted by composite score, transliterated back
```
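
The first two stages of the diagram can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the helper names `detect_script` and `prefilter_by_first_char` are hypothetical, and the real pipeline may detect scripts differently. The code-point ranges are the standard Unicode blocks for the scripts listed under Features.

```python
# Standard Unicode block ranges for the supported Indic scripts.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "gurmukhi": (0x0A00, 0x0A7F),
    "gujarati": (0x0A80, 0x0AFF),
    "tamil": (0x0B80, 0x0BFF),
    "telugu": (0x0C00, 0x0C7F),
    "kannada": (0x0C80, 0x0CFF),
    "malayalam": (0x0D00, 0x0D7F),
}


def detect_script(name: str) -> str:
    """Return the first Indic script found in `name`, else 'latin' (hypothetical helper)."""
    for ch in name:
        code_point = ord(ch)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                return script
    return "latin"


def prefilter_by_first_char(query: str, names: list[str]) -> list[str]:
    """Keep only candidates whose first character matches the query's, ignoring case."""
    if not query:
        return []
    first = query[0].lower()
    return [n for n in names if n and n[0].lower() == first]
```

A name detected as non-Latin would be transliterated to ITRANS before the pre-filter runs, so the first-character comparison happens on Latin text.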

---

## 🚀 Quick Start

### Prerequisites

| Tool | Version | Install |
|------|---------|---------|
| Python | ≥ 3.11 | [python.org](https://python.org) |
| uv | latest | `pip install uv` or [docs.astral.sh/uv](https://docs.astral.sh/uv/getting-started/installation/) |

### 1. Clone the repository

```bash
git clone https://github.com/your-username/transfuzzy.git
cd transfuzzy
```

### 2. Install dependencies

```bash
uv sync
```

That's it. `uv sync` reads `pyproject.toml`, creates a virtual environment (`.venv`), and installs all pinned dependencies from `uv.lock`.

### 3. Run the development server

```bash
# Cross-platform launcher — starts Flask AND opens your browser automatically
uv run python scripts/dev.py
```

Or start the app directly, without the launcher:

```bash
uv run python main.py
```

The app will be available at **http://localhost:5000**.

---

## 📡 API Reference

### `POST /similar_names`

Find names that are phonetically or semantically similar to the input.

**Request**

```http
POST /similar_names
Content-Type: application/json

{
  "name": "Rahul"
}
```

**Supported scripts** — you can also pass names in:
- Devanagari: `"राहुल"`
- Telugu: `"రాహుల్"`
- Tamil, Kannada, Malayalam, Gujarati, Gurmukhi

**Response (200 OK)**

```json
{
  "similar_names": ["Rahul", "Raahul", "Rahool", "Rahil"]
}
```

**Error Response**

```json
{
  "error": "name parameter is required"
}
```

| Status | Meaning |
|--------|---------|
| `200` | Success — `similar_names` array returned |
| `400` | Bad request — missing/invalid `name` field |
| `500` | Server error — model or database file issue |
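
The `400` case could be produced by a small validation step like the one below. This is a sketch under assumptions: `validate_request` is a hypothetical helper, and only the error message and the status codes are taken from this README. What counts as an "invalid" `name` in the real handler may differ.

```python
import json


def validate_request(raw_body: bytes) -> tuple[dict, int]:
    """Return (response_body, status) for a /similar_names request body.

    Hypothetical helper: treats a missing, non-string, or blank `name` as a 400.
    """
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        payload = None

    name = payload.get("name") if isinstance(payload, dict) else None
    if not isinstance(name, str) or not name.strip():
        return {"error": "name parameter is required"}, 400

    # On success, hand the cleaned name to the matching pipeline.
    return {"name": name.strip()}, 200
```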

**cURL Example**

```bash
curl -X POST http://localhost:5000/similar_names \
  -H "Content-Type: application/json" \
  -d '{"name": "Priya"}'
```
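
The same call can be made from Python using only the standard library. This is a minimal sketch: `build_request` and `similar_names` are illustrative names, and the base URL assumes the development server from Quick Start is running locally.

```python
import json
import urllib.request


def build_request(name: str, base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build the POST request for /similar_names without sending it."""
    body = json.dumps({"name": name}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/similar_names",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def similar_names(name: str, base_url: str = "http://localhost:5000") -> list[str]:
    """Send the request and return the `similar_names` array from the response."""
    with urllib.request.urlopen(build_request(name, base_url), timeout=30) as resp:
        return json.loads(resp.read())["similar_names"]
```

Calling `similar_names("Priya")` mirrors the cURL command above. Non-Latin input such as `"राहुल"` is sent as UTF-8.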

---

## 🎓 Training Your Own Model

If you want to retrain the Random Forest model on your own name data, follow these steps.

### Step 1 — Prepare your names data

Edit `db/names2.txt`. Each line defines a cluster of similar names:

```
Rahul > Raahul, Rahool, Rahil
Priya > Preya, Priyah, Pria
Arjun > Arjoon, Arjuun, Arjan
```

Names within the same cluster = **positive pairs**.
Names across different clusters (but starting with the same letter) = **hard negative pairs**.
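
The pairing rule above can be sketched as follows. This is an illustration, not the code in `dir/enrich_data.py`: the function names are hypothetical, and the `"n"` label for negatives is an assumption (the README only names the positive class `'y'`).

```python
from itertools import combinations


def parse_clusters(lines):
    """Parse 'Canonical > Variant1, Variant2' lines into lists of names."""
    clusters = []
    for line in lines:
        if ">" not in line:
            continue  # skip blank or malformed lines
        head, _, tail = line.partition(">")
        names = [head.strip()] + [v.strip() for v in tail.split(",")]
        clusters.append([n for n in names if n])
    return clusters


def positive_pairs(clusters):
    """Every pair of names inside the same cluster is a positive example."""
    return [(a, b, "y") for cluster in clusters for a, b in combinations(cluster, 2)]


def hard_negative_pairs(clusters):
    """Names from different clusters that share a first letter are hard negatives."""
    pairs = []
    for first, second in combinations(clusters, 2):
        for a in first:
            for b in second:
                if a[0].lower() == b[0].lower():
                    pairs.append((a, b, "n"))  # "n" is an assumed label
    return pairs
```

Restricting negatives to names with the same first letter keeps the classifier from learning the trivial rule that different initials mean different names.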

### Step 2 — Enrich the data (compute similarity metrics)

```bash
uv run python scripts/enrich.py
```

This runs `dir/enrich_data.py`, which:
1. Parses clusters from `db/names2.txt`
2. Generates positive + hard-negative name pairs
3. Computes all 8 similarity metrics for each pair
4. Saves enriched training data to `db/names.csv`

> ⚠️ This step loads the sentence-transformer model and may take **5–15 minutes** depending on the size of your dataset.

### Step 3 — Train the model

```bash
uv run python scripts/train.py
```

This runs `dir/train_model.py`, which:
1. Loads `db/names.csv`
2. Runs `GridSearchCV` over Random Forest hyperparameters
3. Evaluates on a 25% held-out test set
4. Saves the best model to `db/best_random_forest_model.pkl`

---

## 🧩 Similarity Metrics Explained

| Metric | Type | Description |
|--------|------|-------------|
| `soundex_ratio` | Phonetic | Similarity of Soundex codes (letter+digit hash) |
| `metaphone_ratio` | Phonetic | Similarity of Metaphone codes (pronunciation hash) |
| `levenshtein_ratio` | String | 1 − (edit distance / max length) |
| `jaro_winkler_ratio` | String | Jaro-Winkler similarity (best for short strings) |
| `cosine_similarity` | Embedding | Cosine of the angle between MiniLM embeddings |
| `euclidean_similarity` | Embedding | `1 / (1 + euclidean distance)` |
| `manhattan_similarity` | Embedding | `1 / (1 + L1 distance)` |
| `pearson_similarity` | Embedding | `(Pearson correlation + 1) / 2` |
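
The formulas in the table can be written out in plain Python. This is a sketch that follows the table's definitions literally. The project itself likely relies on library routines (`python-levenshtein`, `scipy`, and so on), whose exact normalisation may differ. Embeddings are represented here as plain lists of floats.

```python
import math


def levenshtein_ratio(a: str, b: str) -> float:
    """1 - (edit distance / max length), using the classic dynamic-programming recurrence."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def euclidean_similarity(u, v):
    """1 / (1 + Euclidean distance)."""
    return 1 / (1 + math.dist(u, v))


def manhattan_similarity(u, v):
    """1 / (1 + L1 distance)."""
    return 1 / (1 + sum(abs(x - y) for x, y in zip(u, v)))


def pearson_similarity(u, v):
    """(Pearson correlation + 1) / 2, which maps [-1, 1] onto [0, 1]."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    r = sum(x * y for x, y in zip(du, dv)) / (math.hypot(*du) * math.hypot(*dv))
    return (r + 1) / 2
```

The three distance-based metrics are all rescaled so that larger values mean "more similar", which gives every feature the same orientation for the classifier.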

The Random Forest classifier is trained on all 8 features.  
At inference, a candidate passes the **hybrid scoring system** if any one of these conditions holds:
- RF confidence ≥ 0.60, **OR**
- RF confidence ≥ 0.20 **AND** phonetic match (Soundex/Metaphone), **OR**
- Jaro-Winkler ≥ 0.92 (obvious variants)

Candidates that pass are then given a composite weighted score, and those scoring below 0.70 are dropped.
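
The acceptance rule can be expressed as a single predicate. This is a minimal sketch: the thresholds come from this section, but `passes_hybrid_filter` is a hypothetical name, and the composite score is taken as an input because the README does not state its weights.

```python
def passes_hybrid_filter(
    rf_confidence: float,
    phonetic_match: bool,
    jaro_winkler: float,
    composite_score: float,
) -> bool:
    """Apply the three acceptance conditions, then the composite-score threshold."""
    accepted = (
        rf_confidence >= 0.60                          # confident classifier
        or (rf_confidence >= 0.20 and phonetic_match)  # weaker classifier plus phonetic agreement
        or jaro_winkler >= 0.92                        # obvious spelling variant
    )
    return accepted and composite_score >= 0.70
```

For example, a pair with RF confidence 0.25 and a Soundex match passes the first stage, but is still dropped if its composite score is 0.65.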

---

## 🗂️ Project Structure Details

```
db/
├── names_2.txt          # Runtime names database (73k+ names, one per line)
├── names2.txt           # Clustered names for training (Name > Variant1, Variant2)
├── names.csv            # Training data with computed metrics (generated by enrich.py)
└── best_random_forest_model.pkl  # Trained classifier
```

> **Note:** `db/names.csv` is auto-generated and gitignored. `best_random_forest_model.pkl` IS committed so contributors can run the app without retraining.

---

## 🤝 Contributing

Contributions are welcome! Here are some ways you can help:

- 🌐 Add more names to the database (`db/names_2.txt`)
- 📊 Add more name clusters for training (`db/names2.txt`)
- 🔤 Add support for new Indic scripts
- 🐛 Report bugs via [GitHub Issues](https://github.com/your-username/transfuzzy/issues)
- ✨ Improve the matching pipeline or scoring thresholds

### Development Setup

```bash
git clone https://github.com/your-username/transfuzzy.git
cd transfuzzy
uv sync
uv run python scripts/dev.py
```

### Submitting a Pull Request

1. Fork the repository
2. Create a feature branch: `git checkout -b feat/your-feature`
3. Commit your changes: `git commit -m 'feat: add support for Bengali script'`
4. Push and open a PR

Please follow [Conventional Commits](https://www.conventionalcommits.org/) for commit messages.

---

## 📋 Requirements Summary

Dependencies are managed via `uv` and pinned in `uv.lock`. The main packages are:

| Package | Purpose |
|---------|---------|
| `flask` | Web framework |
| `flask-cors` | Cross-origin support |
| `fuzzywuzzy` | Fuzzy string matching |
| `python-levenshtein` | Fast Levenshtein distance |
| `jellyfish` | Phonetic and string-similarity algorithms (Soundex, Metaphone, Jaro-Winkler) |
| `sentence-transformers` | Semantic embeddings (`all-MiniLM-L6-v2`) |
| `scikit-learn` | Random Forest classifier |
| `indic-transliteration` | Transliteration between Indic scripts (Devanagari, Telugu, Tamil, etc.) and ITRANS |
| `pandas`, `numpy`, `scipy` | Data manipulation and math |
| `joblib` | Model serialization |
| `matplotlib` | Feature importance plots (training only) |

---

## 📄 License

MIT © [Goutham Dechineni](LICENSE)

---

<p align="center">
  Made with ❤️ for the open-source community
</p>
