Metadata-Version: 2.4
Name: metasyn-disclosure
Version: 0.2.0
Summary: Plugin package for metasyn that applies the disclosure control.
Author-email: Raoul Schram <r.d.schram@uu.nl>, Erik-Jan van Kesteren <e.vankesteren1@uu.nl>
License: MIT License
        
        Copyright (c) 2022 SoDa
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Keywords: metasyn,disclosure control,metadata,open-data,privacy,synthetic-data
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: metasyn>=2
Requires-Dist: polars
Requires-Dist: numpy>=1.20; python_version < "3.12"
Requires-Dist: numpy>1.24.4; python_version >= "3.12"
Provides-Extra: dev
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Provides-Extra: examples
Requires-Dist: matplotlib; extra == "examples"
Requires-Dist: seaborn; extra == "examples"
Dynamic: license-file

# Metasyn disclosure control
[![](https://img.shields.io/badge/metasyn-plugin-blue?logo=python&logoColor=white)](https://github.com/sodascience/metasyn)
[![Python package](https://github.com/sodascience/metasyn-disclosure-control/actions/workflows/python-package.yml/badge.svg)](https://github.com/sodascience/metasyn-disclosure-control/actions/workflows/python-package.yml)
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)

A privacy plugin for [metasyn](https://github.com/sodascience/metasyn), based on statistical disclosure control (SDC) rules of thumb as found in the following documents:

- The [SDC handbook](https://securedatagroup.org/guides-and-resources/sdc-handbook/) of the Secure Data group in the UK
- The Data Without Boundaries document [Guidelines for output checking](https://wayback.archive-it.org/12090/*/https:/cros-legacy.ec.europa.eu/system/files/dwb_standalone-document_output-checking-guidelines.pdf) (pdf)
- Statistics Netherlands' output guidelines

Producing synthetic data with [metasyn](https://github.com/sodascience/metasyn) is already a great first step towards protecting privacy, but on its own it does not adhere to official disclosure control standards. For example, fitting a uniform distribution discloses the lowest and highest values in the dataset, which may be a privacy issue for particularly sensitive data. This plugin addresses these kinds of problems.
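
To make the uniform-distribution example concrete, here is a minimal sketch of a naive maximum-likelihood fit. The `fit_uniform` helper and the `ages` values are made up for illustration and are not part of metasyn or this plugin; the point is only that the fitted parameters are exact values taken from the data.

```python
def fit_uniform(values):
    """Naive maximum-likelihood fit of a uniform distribution (hypothetical helper)."""
    # The fitted bounds are simply the smallest and largest observed values.
    return {"lower": min(values), "upper": max(values)}


ages = [34, 29, 81, 45, 2, 57]  # made-up data for illustration
params = fit_uniform(ages)

# The stored parameters reveal that someone in the data is 2 and someone is 81.
print(params)
```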

> [!WARNING]
> Currently, the disclosure control plugin is a work in progress. Especially in light of this, we disclaim
> any responsibility for the consequences of using this plugin.

## Installing the plugin

To install the package with pip, run the following:
```sh
pip install metasyn-disclosure
```

To get the development version, install the package directly from git with the following command:

 ```sh
 pip install git+https://github.com/sodascience/metasyn-disclosure-control.git
 ```

## Usage

Basic usage for our built-in titanic dataset is as follows:

```py
from metasyncontrib.disclosure import DisclosurePrivacy
from metasyncontrib.disclosure.string import DisclosureFaker

from metasyn import MetaFrame, VarSpec, demo_dataframe

df = demo_dataframe("titanic")

spec = [
    VarSpec(name="PassengerId", unique=True),
    VarSpec(name="Name", distribution=DisclosureFaker("name")),
]

mf = MetaFrame.fit_dataframe(
    df=df,
    var_specs=spec,
    privacy=DisclosurePrivacy(),
)

mf.synthesize(5)
```

```
shape: (5, 13)
┌─────────────┬────────────────────┬────────┬──────┬───┬────────────┬────────────┬─────────────────────┬────────┐
│ PassengerId ┆ Name               ┆ Sex    ┆ Age  ┆ … ┆ Birthday   ┆ Board time ┆ Married since       ┆ all_NA │
│ ---         ┆ ---                ┆ ---    ┆ ---  ┆   ┆ ---        ┆ ---        ┆ ---                 ┆ ---    │
│ i64         ┆ str                ┆ cat    ┆ i64  ┆   ┆ date       ┆ time       ┆ datetime[μs]        ┆ f32    │
╞═════════════╪════════════════════╪════════╪══════╪═══╪════════════╪════════════╪═════════════════════╪════════╡
│ 0           ┆ Benjamin Cox       ┆ female ┆ 27   ┆ … ┆ 1931-12-01 ┆ 14:33:06   ┆ 2022-07-30 02:16:37 ┆ null   │
│ 1           ┆ Mr. David Robinson ┆ female ┆ null ┆ … ┆ 1906-02-18 ┆ null       ┆ 2022-08-03 13:09:19 ┆ null   │
│ 2           ┆ Randy Mosley       ┆ male   ┆ 24   ┆ … ┆ 1933-01-06 ┆ 15:52:54   ┆ 2022-07-18 18:52:05 ┆ null   │
│ 3           ┆ Vincent Maddox     ┆ female ┆ 24   ┆ … ┆ 1937-02-10 ┆ 16:58:30   ┆ 2022-07-23 20:29:49 ┆ null   │
│ 4           ┆ Kristin Holland    ┆ male   ┆ 17   ┆ … ┆ 1939-12-09 ┆ 18:07:45   ┆ 2022-08-05 02:41:51 ┆ null   │
└─────────────┴────────────────────┴────────┴──────┴───┴────────────┴────────────┴─────────────────────┴────────┘
```


## Implementation details
The rules of thumb, roughly, are: 

- at least 10 units
- at least 10 degrees of freedom
- no group disclosure
- no dominance

For most distributions, we implemented micro-aggregation. This technique pre-averages a sorted version of the data, which is then supplied to the original fitting mechanism. The idea is that this pre-averaging step ensures the rules of thumb are followed, so that the fitting method doesn't need to do anything in particular. From a statistical point of view, this loses more information than is probably necessary, but it should ensure the safety of the data.
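
The pre-averaging idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's actual implementation: the `micro_aggregate` function, its `min_bin` parameter, and the way leftover values are spread over the groups are all hypothetical choices made for this example.

```python
from statistics import mean


def micro_aggregate(values, min_bin=10):
    """Sort the values and replace each group of >= min_bin values by its mean.

    Hypothetical helper for illustration; the plugin's own code may differ.
    """
    ordered = sorted(values)
    n_groups = len(ordered) // min_bin
    if n_groups == 0:
        raise ValueError(f"need at least {min_bin} values, got {len(ordered)}")
    # Spread the remainder over the groups so every group has >= min_bin values.
    base, extra = divmod(len(ordered), n_groups)
    averaged, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)
        averaged.append(mean(ordered[start:start + size]))
        start += size
    return averaged


values = list(range(1, 26))  # 25 made-up observations
aggregated = micro_aggregate(values)
print(aggregated)  # two group means instead of 25 individual values
```

A fitting routine that only ever sees `aggregated` cannot reproduce any single original observation, such as the minimum or maximum, because every number it receives is an average over at least `min_bin` units.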



<!-- CONTRIBUTING -->
## Contributing
You can contribute to this metasyn plugin by giving feedback in the "Issues" tab, or by creating a pull request.

To create a pull request:
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request


<!-- CONTACT -->
## Contact
This is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team. Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact [Raoul Schram](https://github.com/qubixes) or [Erik-Jan van Kesteren](https://github.com/vankesteren).

<img src="soda.png" alt="SoDa logo" width="250px"/> 
