Metadata-Version: 2.4
Name: synloc
Version: 1.0.0
Summary: A Python package to create synthetic data from locally estimated distributions
Home-page: https://github.com/alfurka/synloc
Author: Ali Furkan Kalay
Author-email: Ali Furkan Kalay <alfurka@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/alfurka/synloc
Project-URL: Documentation, https://alfurka.github.io/synloc/
Keywords: copulas,distributions,sampling,synthetic-data,oversampling,nonparametric-distributions,semiparametric,nonparametric,knn,clustering,k-means,multivariate-distributions
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: matplotlib
Requires-Dist: scikit-learn
Requires-Dist: joblib
Requires-Dist: tqdm
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: sphinx; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx; extra == "docs"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

<div align="center">

# synloc: An Algorithm to Create Synthetic Tabular Data

<img src="https://raw.githubusercontent.com/alfurka/synloc/main/assets/logo_white_bc.png" alt="synloc">

[Overview](#overview) | [Data Requirements](#data-requirements) | [Installation](#installation) | [A Quick Example](#a-quick-example) | [Documentation](https://alfurka.github.io/synloc/) | [How to cite?](#how-to-cite) | [Replication](#replication)

[![PyPI](https://img.shields.io/pypi/v/synloc)](https://pypi.org/project/synloc) [![Python](https://img.shields.io/pypi/pyversions/synloc)](https://pypi.org/project/synloc) [![Downloads](https://static.pepy.tech/badge/synloc)](https://pepy.tech/project/synloc)

</div>

## Overview

`synloc` is an open-source Python package implementing the **Local Resampler (LR)** algorithm for generating synthetic tabular data while safeguarding privacy. It provides a computationally efficient and flexible approach to synthetic data generation, enabling researchers to work with privacy-preserving datasets that maintain statistical utility.

### Two Subsampling Strategies

Both approaches provide effective disclosure control. Choose based on your priorities:

| Approach | Best for | Key advantage |
|----------|----------|---------------|
| **k-Nearest Neighbors (k-NN)** | Stronger disclosure control | Naturally underrepresents outliers, reducing privacy risks |
| **Clustering-based** | Efficiency & accuracy | Better data utility and computational performance |

**Key features:**
- Natural disclosure risk reduction by underrepresenting outliers (k-NN variant)
- Accurate replication of complex distributions, including multimodal and non-convex-support data
- Flexible trade-off between data utility and privacy protection
- Built-in quality diagnostics, including Kolmogorov-Smirnov distances, Wasserstein distances, summary statistics, and correlation-difference metrics
- Compatible with parametric and nonparametric distributions

This implementation aligns with statistical agencies' safe data regulations, including the **k-anonymity** criterion and the **Five Safes** framework adopted by organizations such as the Australian Bureau of Statistics. For the full methodology and theoretical foundations, see the [paper referenced below](#how-to-cite).

## Data Requirements

`synloc` expects a numeric `pandas.DataFrame`.

- Categorical variables must be encoded before synthesis, for example with `pandas.get_dummies`.
- Boolean dummy variables are accepted and converted to `0`/`1`.
- Missing numeric values are filled with column medians during fitting.
- Columns with only missing values, duplicate column names, infinite values, and non-numeric columns raise clear errors.
- Integer-like variables can be rounded after synthesis with `round_integers`.
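
As a minimal sketch of the encoding step, the snippet below prepares a mixed-type table using only pandas. The toy data and column names are hypothetical and chosen for illustration. The missing `income` value is deliberately left in place, since the list above states that missing numeric values are handled during fitting.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "income": [52000.0, None, 38000.0, 61000.0],
    "region": ["north", "south", "south", "east"],
})

# One-hot encode the categorical column. dtype=int yields 0/1 columns directly;
# leaving dtype unset may produce boolean dummies, depending on the pandas version.
encoded = pd.get_dummies(raw, columns=["region"], dtype=int)

# Every column is now numeric, so the frame is ready to pass to a resampler.
print(encoded.dtypes)
```

The resulting `encoded` frame contains `age`, `income`, and one indicator column per region.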

## Installation

`synloc` can be installed through [PyPI](https://pypi.org/):

```
pip install synloc
```

## A Quick Example

Suppose we have a sample of three variables with the following distributions:

$$x \sim Beta(0.1,\,0.1)$$

$$y \sim Beta(0.1,\, 0.5)$$

$$z \sim 10 y + Normal(0,\,1)$$

This sample can be generated with the `tools` module in `synloc`:


```python
from synloc.tools import sample_trivariate_xyz
data = sample_trivariate_xyz()  # Generates a sample of size 1000 by default.
```
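
For readers who want to see the three distributions written out, the sketch below draws a comparable sample with NumPy alone. It assumes that `x` and `y` are drawn independently, and it is not necessarily identical to what `sample_trivariate_xyz` does internally. The seed and the name `manual_sample` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 1000

# Assumption: x and y are independent Beta draws.
x = rng.beta(0.1, 0.1, size=n)
y = rng.beta(0.1, 0.5, size=n)

# z depends linearly on y plus standard normal noise.
z = 10 * y + rng.normal(0.0, 1.0, size=n)

manual_sample = pd.DataFrame({"x": x, "y": y, "z": z})
print(manual_sample.describe())
```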

Initializing the resampler:


```python
from synloc import LocalCov
resampler = LocalCov(data = data, K = 30)
```

The **subsample** size is set with `K = 30`. The resampler then estimates a multivariate normal distribution locally for each subsample and draws synthetic values from each estimated distribution:


```python
syn_data = resampler.fit() 
```

`syn_data` is a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in which all variables are synthesized. To compare it with the original sample in a 3-D scatter plot:


```python
resampler.comparePlots(['x','y','z'])
```
![](https://raw.githubusercontent.com/alfurka/synloc/main/assets/README_7_0.png)
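
To make the idea of local estimation concrete, here is a minimal sketch of a k-nearest-neighbor resampler built on NumPy and scikit-learn. It is an illustration under simplifying assumptions (one synthetic row per original row, a plain sample covariance, Euclidean distances), not the implementation used by `LocalCov`. The function name `local_gaussian_resample` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def local_gaussian_resample(df, k=30, seed=0):
    """Draw one synthetic row per original row from a locally fitted normal."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy(dtype=float)

    # For every row, find the indices of its k nearest rows
    # (the row itself is typically among them).
    neighbors = NearestNeighbors(n_neighbors=k).fit(values)
    _, idx = neighbors.kneighbors(values)

    synthetic = np.empty_like(values)
    for i, rows in enumerate(idx):
        subsample = values[rows]
        mean = subsample.mean(axis=0)          # local mean vector
        cov = np.cov(subsample, rowvar=False)  # local covariance matrix
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return pd.DataFrame(synthetic, columns=df.columns)


# Toy data for illustration only.
toy_rng = np.random.default_rng(1)
toy = pd.DataFrame(toy_rng.normal(size=(200, 3)), columns=["x", "y", "z"])
toy_syn = local_gaussian_resample(toy, k=30)
print(toy_syn.head())
```

In this sketch a larger `k` smooths each local estimate over a wider neighborhood, while a smaller `k` tracks local structure more closely.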

You can also inspect utility diagnostics after fitting:

```python
variable_metrics = resampler.compareStats()
quality = resampler.qualityReport()

print(variable_metrics[["ks_statistic", "wasserstein_distance"]])
print(quality["overall"])
```
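
For intuition about the two distance columns, the sketch below computes per-variable Kolmogorov-Smirnov and Wasserstein distances directly with SciPy. It is independent of `compareStats` and may not match its output exactly. The helper name `marginal_distances` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp, wasserstein_distance


def marginal_distances(original, synthetic):
    """Compare each column of two frames with KS and Wasserstein distances."""
    rows = {}
    for col in original.columns:
        rows[col] = {
            "ks_statistic": ks_2samp(original[col], synthetic[col]).statistic,
            "wasserstein_distance": wasserstein_distance(original[col], synthetic[col]),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy frames for illustration: the second is a slightly perturbed copy of the first.
demo_rng = np.random.default_rng(2)
frame_a = pd.DataFrame(demo_rng.normal(size=(500, 2)), columns=["u", "v"])
frame_b = frame_a + demo_rng.normal(scale=0.1, size=frame_a.shape)

distances = marginal_distances(frame_a, frame_b)
print(distances)
```

Both measures are zero when the two marginal samples coincide, and smaller values indicate closer marginal distributions.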

## How to cite?

If you use `synloc` in your research, please cite the following paper:

```bibtex
@article{https://doi.org/10.1111/anzs.70032,
    author = {Kalay, Ali Furkan},
    title = {Generating Synthetic Data With Locally Estimated Distributions for Disclosure Control},
    journal = {Australian \& New Zealand Journal of Statistics},
    volume = {68},
    number = {1},
    pages = {e70032},
    doi = {https://doi.org/10.1111/anzs.70032},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/anzs.70032},
    eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/anzs.70032},
    year = {2026}
}
```

## Replication

For replication materials of the paper, see the [replication folder](replication/).
