Metadata-Version: 2.4
Name: transfuzzy
Version: 0.1.2
Summary: TransFuzzy is a robust transliteration system that bridges the gap between Indic scripts and the Latin alphabet.
Author: Goutham
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: flask>=3.1.3
Requires-Dist: flask-cors>=6.0.2
Requires-Dist: fuzzywuzzy>=0.18.0
Requires-Dist: indic-transliteration>=2.3.81
Requires-Dist: jellyfish>=1.2.1
Requires-Dist: joblib>=1.5.3
Requires-Dist: langdetect>=1.0.9
Requires-Dist: matplotlib>=3.10.8
Requires-Dist: numpy>=2.4.4
Requires-Dist: pandas>=3.0.1
Requires-Dist: python-levenshtein>=0.27.3
Requires-Dist: scikit-learn==1.5.2
Requires-Dist: scipy>=1.17.1
Requires-Dist: sentence-transformers>=5.3.0
Provides-Extra: dev
Requires-Dist: black>=26.3.1; extra == "dev"
Requires-Dist: build>=1.4.2; extra == "dev"
Requires-Dist: pytest>=8.4.2; extra == "dev"
Requires-Dist: ruff>=0.15.8; extra == "dev"
Requires-Dist: twine>=6.2.0; extra == "dev"
Dynamic: license-file

# TransFuzzy

TransFuzzy is a Python package for multilingual personal-name matching across Latin and several Indic scripts. It exposes the same matching pipeline through a CLI and a Flask API, and it supports switching between the bundled dataset and user-managed datasets.

## What It Does

- Accepts names in Latin, Devanagari, Telugu, Tamil, Kannada, Malayalam, Gujarati, and Gurmukhi.
- Transliterates non-Latin input before matching.
- Scores candidate names with phonetic, edit-distance, and embedding-based features.
- Returns the best matches through `transfuzzy predict` or the HTTP API.
- Lets you upload, activate, list, and delete datasets without modifying package files.

## Installation

### From PyPI

```bash
pip install transfuzzy
```

### Local development setup

```bash
uv sync
```

Python `3.11+` is required.

## Runtime Notes

TransFuzzy currently loads the `sentence-transformers/all-MiniLM-L6-v2` model during module import. On a fresh machine, the first CLI, API, or test run may download model files from Hugging Face before the command can complete.

That has two practical consequences:

- The first run can be noticeably slower than later runs.
- In offline or restricted-network environments, commands can fail before the CLI help text prints, the API server starts, or tests finish loading.

## Quick Start

### Start the API server

```bash
transfuzzy
```

or

```bash
transfuzzy serve --port 3000
```

The Flask server listens on `http://localhost:3000` and opens that URL in your default browser on startup.

### Query from the CLI

```bash
transfuzzy predict "Rahul"
```

Limit results:

```bash
transfuzzy predict "Rahul" --top 5
```

Return JSON:

```bash
transfuzzy predict "Rahul" --json
```

Use a specific text dataset file directly:

```bash
transfuzzy predict "Rahul" --db .\names.txt --top 5 --json
```

## Supported Input Scripts

Examples of valid input:

```text
Rahul
राहुल
రాహుల్
```

The output is transliterated back to the original script when the input was converted from a supported Indic script.
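Input-script detection of this kind can be done with Unicode block ranges. The sketch below is illustrative only (the `detect_script` helper and its range table are not part of the TransFuzzy API); it classifies a name by the first character that falls in a supported Indic block:

```python
# Hypothetical sketch of input-script detection via Unicode block ranges.
# The function name and range table are illustrative, not TransFuzzy's API.

SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Gurmukhi":   (0x0A00, 0x0A7F),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

def detect_script(name: str) -> str:
    """Return the first matching Indic script name, else 'Latin'."""
    for ch in name:
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                return script
    return "Latin"
```

For example, `detect_script("राहुल")` resolves to Devanagari, while plain ASCII input falls through to Latin.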

## CLI Reference

### `transfuzzy`

Starts the API server on port `3000` and opens the browser automatically.

### `transfuzzy serve`

Starts the API server explicitly.

```bash
transfuzzy serve --port 3000
```

Use `--no-browser` to skip opening the browser:

```bash
transfuzzy serve --port 3000 --no-browser
```

### `transfuzzy predict`

Find similar names for a single input.

```bash
transfuzzy predict <name> [--top N] [--json] [--db PATH]
```

Arguments:

- `<name>`: required input string.
- `--top`: maximum number of matches to return. Default: `10`.
- `--json`: print a JSON object with `similar_names`.
- `--db`: use a dataset file path directly instead of the active managed dataset.

### `transfuzzy db`

Manage datasets stored in the TransFuzzy home directory.

Add a dataset:

```bash
transfuzzy db add .\names.txt
```

List managed datasets:

```bash
transfuzzy db list
```

Set the active dataset:

```bash
transfuzzy db use names.txt
```

Delete a managed dataset:

```bash
transfuzzy db delete names.txt
```

## API Reference

### `POST /similar_names`

Request body:

```json
{
  "name": "Rahul"
}
```

Success response:

```json
{
  "similar_names": ["Rahul", "Raahul", "Rahool"]
}
```

Validation errors are returned as JSON with an `error` field and the appropriate HTTP status code.
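A minimal Python client for this endpoint might look like the following. The `query_similar_names` function is illustrative, and a server (`transfuzzy serve`) must already be listening on port `3000` for the call to succeed:

```python
import json
from urllib import request

# Illustrative client for POST /similar_names; assumes `transfuzzy serve`
# is already running on the default port 3000.
API_URL = "http://localhost:3000/similar_names"

def query_similar_names(name: str, url: str = API_URL) -> list[str]:
    """POST a name to the server and return its 'similar_names' list."""
    body = json.dumps({"name": name}).encode("utf-8")
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp).get("similar_names", [])
```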

### `POST /upload_db`

Uploads a dataset file using `multipart/form-data` with the field name `file`.

Success response shape:

```json
{
  "message": "Dataset 'demo.txt' uploaded",
  "dataset_name": "demo.txt",
  "active_db": null
}
```

### `GET /list_dbs`

Returns the stored managed datasets and the active dataset name.

```json
{
  "datasets": ["demo.txt"],
  "active_db": "demo.txt"
}
```

### `POST /use_db`

Request body:

```json
{
  "name": "demo.txt"
}
```

### `DELETE /delete_db`

Request body:

```json
{
  "name": "demo.txt"
}
```

## Dataset Management

There are two ways to provide names:

1. Pass a file path directly with `--db`.
2. Store datasets with `transfuzzy db ...` or the dataset API routes and switch the active dataset.

Managed datasets are stored under:

```text
%USERPROFILE%\.transfuzzy\datasets
```

The active dataset name is stored in:

```text
%USERPROFILE%\.transfuzzy\config.json
```

To override the base directory, set:

```powershell
$env:TRANSFUZZY_HOME = "C:\path\to\custom-home"
```

Each dataset should contain one name per line.
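Reading that one-name-per-line format takes only a few lines of Python. This sketch (the `load_names` helper is illustrative, not part of the package) drops blank lines and surrounding whitespace:

```python
from pathlib import Path

def load_names(path: str) -> list[str]:
    """Read a one-name-per-line dataset, skipping blanks and edge whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```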

## How Matching Works

The current pipeline is:

```text
Input name
-> transliteration to Latin when needed
-> candidate pair generation from the selected dataset
-> feature computation
   - Soundex ratio
   - Metaphone ratio
   - Levenshtein ratio
   - Jaro-Winkler similarity
   - Cosine similarity
   - Euclidean similarity
   - Manhattan similarity
   - Pearson similarity
-> trained model ranking
-> optional transliteration back to the input script
```
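One of the listed features, the Levenshtein ratio, can be sketched in pure Python. The package itself relies on `python-levenshtein` and `fuzzywuzzy` for this; the standalone version below uses one common normalization (distance divided by the longer string's length), which may differ from the library's exact formula:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))
```

For instance, `"Rahul"` and `"Raahul"` differ by a single insertion, giving a distance of 1 and a ratio of about 0.83.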

## Project Structure

```text
src/transfuzzy/
├── app.py              Flask app and HTTP routes
├── cli.py              CLI entrypoint
├── core/
│   ├── config.py       package constants and paths
│   ├── db_manager.py   managed dataset storage
│   └── pipeline.py     top-level matching pipeline
├── datasets/
│   └── default.txt     bundled dataset asset
├── db/                 packaged training/runtime artifacts
├── dir/                feature generation and training scripts
├── static/             browser-side assets
├── templates/          HTML templates
└── utils/              helper and response utilities
```

## Development

Run the app locally:

```bash
uv run transfuzzy serve
```

Run tests:

```bash
uv run python -m unittest discover -s tests -v
```

Run training-related scripts:

```bash
uv run python src/transfuzzy/dir/enrich_data.py
uv run python src/transfuzzy/dir/train_model.py
```

More development notes are in `docs/DEVELOPMENT.md`.

## Current Limitations

- Import-time model loading makes commands and tests depend on model availability.
- The package metadata describes the project as a transliteration system, while the implementation performs broader name matching.
- The repository contains packaged model and dataset artifacts in `src/transfuzzy/db`, so checkout and release sizes are coupled to those files.
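The import-time loading issue is commonly mitigated with lazy initialization. A minimal sketch, not the package's current structure (the `get_model` helper is hypothetical):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    # Deferred import: the CLI can start, print help, and run dataset
    # commands without touching the network; the model is fetched only
    # on the first call that actually needs embeddings.
    # Illustrative only -- not how TransFuzzy is currently structured.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
```

With this pattern, `transfuzzy --help` would no longer depend on model availability; only the first `predict` call would.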

## License

MIT. See `LICENSE`.
