Metadata-Version: 2.1
Name: synthetic-eval
Version: 0.0.2
Summary: Package for Evaluation of Synthetic Tabular Data Quality
Home-page: https://github.com/an-seunghwan/synthetic_eval
Author: Seunghwan An
Author-email: dpeltms79@gmail.com
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: networkx ==3.4.2
Requires-Dist: numpy ==1.26.4
Requires-Dist: pandas ==2.2.3
Requires-Dist: scikit-learn ==1.5.2
Requires-Dist: scipy ==1.14.1
Requires-Dist: torch ==2.2.2
Requires-Dist: tqdm ==4.67.1

# Synthetic-Eval

**Synthetic-Eval** is a package for the comprehensive evaluation of synthetic tabular datasets.

### 1. Installation
Install using pip:
```
pip install synthetic-eval
```

### 2. Supported Metrics
- **Statistical Fidelity**
  1. KL-Divergence (`KL`)
  2. Goodness-of-Fit (Kolmogorov-Smirnov test & Chi-Squared test) (`GoF`)
  3. Maximum Mean Discrepancy (`MMD`)
  4. Cramer-Wold Distance (`CW`)
  5. (naive) $\alpha$-precision & $\beta$-recall (`alpha_precision`, `beta_recall`)
- **Machine Learning Utility** (classification task) 
  1. Accuracy (`base_cls`, `syn_cls`)
  2. Model Selection Performance (`model_selection`)
  3. Feature Selection Performance (`feature_selection`)
- **Privacy Preservation**
  1. $k$-Anonymity (`Kanon_base`, `Kanon_syn`)
  2. $k$-Map (`KMap`)
  3. Distance to Closest Record (`DCR_RS`, `DCR_RR`, `DCR_SS`)
  4. Attribute Disclosure (`AD`)
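
To give a feel for what a distance-based privacy metric measures, here is a minimal sketch of the general Distance to Closest Record idea: for each query row, the Euclidean distance to its nearest row in a reference set. The helper name `distance_to_closest_record` and the toy arrays are assumptions made for illustration; this is not necessarily how Synthetic-Eval computes `DCR_RS`, `DCR_RR`, or `DCR_SS` (for example, it may scale or encode features differently).

```python
import numpy as np


def distance_to_closest_record(queries, references):
    """Illustrative helper (not part of Synthetic-Eval).

    For each row in `queries`, return the Euclidean distance
    to its nearest row in `references`.
    """
    queries = np.asarray(queries, dtype=float)
    references = np.asarray(references, dtype=float)
    # Pairwise differences via broadcasting: (n_queries, n_references, n_features)
    diffs = queries[:, None, :] - references[None, :, :]
    # Pairwise Euclidean distances: (n_queries, n_references)
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep only the distance to the closest reference row
    return dists.min(axis=1)


# Toy data, made up for this sketch
real = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
synthetic = np.array([[0.0, 1.0], [4.0, 3.0]])

dcr = distance_to_closest_record(synthetic, real)
print(dcr)
```

Very small distances suggest that synthetic rows sit close to real records. The broadcasting approach above materializes every pairwise difference, so it only suits small arrays.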

### 3. Usage
```python
from synthetic_eval import evaluation
evaluation.evaluate # function for evaluating synthetic data quality
```
- See [example.ipynb](example.ipynb) for a detailed example and its results on the `loan` dataset.
  - The `loan` dataset can be downloaded from [https://www.kaggle.com/datasets/teertha/personal-loan-modeling](https://www.kaggle.com/datasets/teertha/personal-loan-modeling)

#### Example
```python
"""import libraries"""
import pandas as pd
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

"""the ground-truth (training, test) and synthetic dataset"""
data = pd.read_csv('./loan.csv')
# len(data) # 5,000
train = data.iloc[:2000]
test = data.iloc[2000:4000]
# for illustration only: a held-out slice of the real data stands in for the synthetic dataset
syndata = data.iloc[4000:]

"""specify column types"""
continuous_features = [
    'Age',
    'Experience',
    'Income', 
    'CCAvg',
    'Mortgage',
]
categorical_features = [
    'Family',
    'Personal Loan',
    'Securities Account',
    'CD Account',
    'Online',
    'CreditCard'
]
target = 'Personal Loan' # machine learning utility target column

"""load Synthetic-Eval"""
from synthetic_eval import evaluation
results = evaluation.evaluate(
    syndata, train, test, 
    target, continuous_features, categorical_features, device
)

"""print results"""
for x, y in results._asdict().items():
    print(f"{x}: {y:.3f}")
```
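
The `_asdict()` call in the example suggests that `evaluate` returns a namedtuple-like object. Under that assumption, the scores can also be collected into a pandas `Series` for easier inspection or export. The sketch below uses a hypothetical stand-in namedtuple with made-up placeholder values, not real evaluation results; the actual field names and values come from `evaluation.evaluate`.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the object returned by evaluation.evaluate;
# the field subset and the numbers are placeholders, not real results.
Results = namedtuple("Results", ["KL", "MMD", "DCR_RS"])
results = Results(KL=0.012, MMD=0.003, DCR_RS=0.45)

# Convert the namedtuple into a labeled Series, rounded to three decimals
summary = pd.Series(results._asdict(), name="score").round(3)
print(summary)

# A Series is easy to save or to combine across several synthetic datasets, e.g.:
# summary.to_csv("scores.csv")
```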

### 4. References
  - https://github.com/vanderschaarlab/synthcity/blob/main/src/synthcity/metrics
  - https://github.com/HLasse/data-centric-synthetic-data
