Metadata-Version: 2.4
Name: synthedge
Version: 0.1.0
Summary: Diagnosis-first synthetic data augmentation for imbalanced tabular datasets
Author: Sagnik
License: MIT
Project-URL: Homepage, https://github.com/yourusername/synthedge
Project-URL: Issues, https://github.com/yourusername/synthedge/issues
Keywords: synthetic data,data augmentation,imbalanced learning,machine learning,SMOTE,CTGAN,gap detection
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.11
Requires-Dist: ctgan>=0.9
Requires-Dist: imbalanced-learn>=0.11
Requires-Dist: xgboost>=2.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"

# SynthEdge

![CI](https://github.com/Juzt-nik/SynthEdge/actions/workflows/ci.yml/badge.svg)
![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
![License](https://img.shields.io/badge/license-MIT-green)

**Diagnosis-first synthetic data augmentation for imbalanced tabular datasets.**

> SMOTE generated **114x more** synthetic samples and achieved **22 points worse recall**.  
> SynthEdge generated 18 targeted samples and preserved performance.  
> — *Framingham Heart Study, 3,658 rows*

---

![CI](https://github.com/Juzt-nik/SynthEdge/actions/workflows/ci.yml/badge.svg)
![Python](https://img.shields.io/pypi/pyversions/synthedge)

## The problem with SMOTE

SMOTE answers *"how do I get more data?"* — it generates blindly across all minority samples.

SynthEdge answers *"where is my data actually missing, and how do I fix exactly that?"*

When your dataset has structural gaps — specific patient subgroups, rare feature combinations, underrepresented demographics — SMOTE fills the wrong places and can actively hurt recall. SynthEdge finds the gaps first, then synthesizes surgically.

---

## Install

```bash
pip install synthedge
```

---

## Quick start

```python
from synthedge import SynthEdge

se = SynthEdge(df, target_col="target")

# Step 1: diagnose
report = se.analyze()
# Severity: SEVERE
# Will SynthEdge help? YES
# Recommendation: SMOTE will likely hurt recall. SynthEdge strongly recommended.

# Step 2: fill
aug_df = se.fill()
# [CTGAN] Training on 446 positive samples...
# [SE] Voxel (1,2,3): 9 samples via CTGAN
# [SE] Voxel (3,0,1): 6 samples via CTGAN
# Added 18 targeted positives (2915 -> 2933 rows)

# Step 3: check quality
report = se.quality_report()
# KL divergence in gap region: 1.969 (vs SMOTE: 1.989)
```

---

## CLI

```bash
# Diagnose your dataset
synthedge analyze data.csv --target diagnosis

# Fill gaps and save augmented CSV
synthedge fill data.csv --target diagnosis --n-top 3 --out augmented.csv
```

---

## How it works

### 1. 3D Local Density Scan (Gap Detector)

SynthEdge projects your training data into PCA space and tiles it with an adaptive 3D voxel grid. Each voxel is scored:

```
gap_score = 0.5 × sparsity + 0.3 × label_entropy + 0.2 × pos_rate
```

Voxels with high sparsity and minority-class presence are flagged as gaps. With `n_bins=6`, the grid has 216 voxels — fine-grained enough to isolate a 3-sample gap without merging it with nearby dense regions.

Unlike HDBSCAN (which needs `min_cluster_size` samples to form a cluster), this approach finds gaps with as few as 1 sample.

### 2. Severity Classifier

Before augmentation, SynthEdge tells you whether it will actually help:

| Severity | Score | Meaning |
|---|---|---|
| NONE | < 0.15 | Dataset is well-distributed. No augmentation needed. |
| MILD | 0.15–0.35 | Minor gaps. SynthEdge will match or slightly improve SMOTE. |
| MODERATE | 0.35–0.60 | Clear gaps. Meaningful recall improvement expected. |
| SEVERE | > 0.60 | Structural gaps. SMOTE will likely hurt you. Use SynthEdge. |

This is the only augmentation tool that tells you *not* to use it when it won't help.

### 3. CTGAN Targeted Synthesis

SynthEdge trains CTGAN on the positive-class samples to learn the real joint feature distribution. It then generates a large candidate pool and filters to samples that fall within identified gap voxels. A lightweight logistic discriminator rejects candidates that are too easy to identify as fake.

When CTGAN's pool is insufficient for a voxel (too sparse), it falls back to Gaussian sampling around the voxel centroid.

### 4. Multi-Dataset Gap Transfer

If you have multiple datasets covering the same domain, SynthEdge can find matching gap regions across them using centroid cosine similarity and transfer **real samples** from the less-sparse dataset into the more-sparse one — no synthesis needed, just rescaling.

```python
from synthedge.transfer import find_matching_gaps, transfer_samples, apply_transfers

datasets_info = [
    {"name": "Cleveland", "top_voxels": top_cl, "scaler": sc_cl,
     "X_tr_sc": X_cl, "y_tr": y_cl, "feature_names": feat_cl},
    {"name": "Framingham", "top_voxels": top_fr, "scaler": sc_fr,
     "X_tr_sc": X_fr, "y_tr": y_fr, "feature_names": feat_fr},
]

matches   = find_matching_gaps(datasets_info, similarity_threshold=0.70)
transfers = transfer_samples(matches, n_transfer=20)
X_aug, y_aug, n_added = apply_transfers("Cleveland", X_cl, y_cl, transfers)
```

---

## Benchmark results

Tested on three cardiovascular datasets with artificially carved gap regions (70% of minority samples in a specific demographic subgroup removed).

### KL divergence in gap region (lower = better recovery)

| Dataset | No aug | SMOTE | ADASYN | SynthEdge | SE vs SMOTE |
|---|---|---|---|---|---|
| Cleveland (297 rows) | 1.033 | 1.039 | 1.057 | **0.972** | **−0.067** ✓ |
| Framingham (3,658 rows) | 1.987 | 1.989 | 1.998 | **1.969** | **−0.020** ✓ |
| Cardiovascular (68,604 rows) | 0.653 | 0.646 | 0.646 | 0.653 | +0.006 |

### Minority-class recall

| Dataset | No aug | SMOTE | ADASYN | SynthEdge | SE vs SMOTE |
|---|---|---|---|---|---|
| Cleveland (297 rows) | 0.821 | 0.821 | 0.857 | 0.821 | 0.0 pp |
| Framingham (3,658 rows) | 0.514 | 0.288 | 0.351 | **0.486** | **+19.8 pp** ✓ |
| Cardiovascular (68,604 rows) | 0.695 | 0.690 | 0.690 | **0.693** | **+0.3 pp** ✓ |

### Synthesis efficiency

| Dataset | SMOTE added | SynthEdge added | Ratio |
|---|---|---|---|
| Cleveland | 22 | 27 | 1× |
| Framingham | 2,045 | 18 | **114× fewer** |
| Cardiovascular | 1,690 | 39 | **43× fewer** |

**Key finding:** On Framingham, SMOTE generated 114× more synthetic samples and achieved 22 points worse recall. SynthEdge generated 18 targeted samples and preserved performance.

**On balanced datasets** (Cardiovascular, 49.5% positive rate): SynthEdge correctly detects low severity and produces results identical to no augmentation — it does not over-generate when data is already sufficient.

---

## When to use SynthEdge

| Use SynthEdge | Use SMOTE instead |
|---|---|
| Clinical / healthcare data | Generic balanced datasets |
| Specific demographic subgroups underrepresented | Uniform minority-class sparsity |
| Audit / compliance requirements (gap report) | Quick baseline augmentation |
| Multiple datasets from same domain (transfer) | Single small dataset, no CTGAN data |
| Severity = MODERATE or SEVERE | Severity = NONE or MILD |

---

## API reference

### `SynthEdge(df, target_col, feature_cols, discrete_cols, verbose)`

| Method | Description |
|---|---|
| `.analyze(n_bins, top_k)` | Run gap detection + severity classification |
| `.fill(n_top, ctgan_epochs, use_ctgan)` | Synthesize targeted samples |
| `.quality_report(held_sc)` | KL divergence + feature drift metrics |
| `.gap_map` | DataFrame of top gap voxels with scores |
| `.severity` | Severity classification result dict |

### `synthedge.transfer`

| Function | Description |
|---|---|
| `find_matching_gaps(datasets_info, threshold)` | Match gap regions across datasets |
| `transfer_samples(matches, n_transfer)` | Extract real samples for transfer |
| `apply_transfers(name, X_sc, y, transfers)` | Inject transferred samples |

---

## Citation

If you use SynthEdge in research, please cite:

```bibtex
@software{synthedge2025,
  title  = {SynthEdge: Diagnosis-first synthetic data augmentation},
  author = {Sagnik},
  year   = {2025},
  url    = {https://github.com/yourusername/synthedge}
}
```

---

## License

MIT
