Metadata-Version: 2.4
Name: tabdm
Version: 0.1.0
Summary: Transformed-space diffusion for mixed-type tabular data.
Author: TabDM Contributors
License-Expression: Apache-2.0
Keywords: synthetic-data,diffusion,tabular-data,pytorch
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26
Requires-Dist: pandas>=2.0
Requires-Dist: torch>=2.2
Provides-Extra: eval
Requires-Dist: scikit-learn>=1.4; extra == "eval"
Requires-Dist: scipy>=1.12; extra == "eval"
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: scikit-learn>=1.4; extra == "dev"
Requires-Dist: scipy>=1.12; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# TabDM

TabDM is a small Python library for generating synthetic mixed-type tabular
data with diffusion in a transformed feature space.

It is designed for tabular datasets with numeric, categorical, boolean,
ordinal, count, and positive continuous columns. The public API focuses on two
workflows:

- fit a model and generate synthetic rows in one call
- fit once, then generate repeatedly with optional target or subgroup controls

TabDM also includes evaluation helpers for schema compatibility, distribution
fidelity, downstream utility, validity checks, and privacy-screening metrics.

For exact function signatures, parameter semantics, and report shapes, see
[docs/API_REFERENCE.md](docs/API_REFERENCE.md).

## Install

From a local checkout:

```bash
pip install -e .
```

For evaluation helpers:

```bash
pip install -e ".[eval]"
```

For local development:

```bash
pip install -e ".[dev]"
```

## Quick Start

```python
import pandas as pd

from tabdm import generate_synthetic_data

real = pd.DataFrame(
    {
        "age": [21.0, 35.0, 44.0, 28.0, 31.0, 39.0],
        "job": ["admin", "tech", "admin", "services", "tech", "admin"],
        "owns_house": [True, False, True, False, True, False],
        "balance": [1000.0, 250.0, 1900.0, 750.0, 1200.0, 400.0],
    }
)

synthetic = generate_synthetic_data(
    real,
    num_rows=100,
    discrete_columns=["job", "owns_house"],
    epochs=50,
    timesteps=64,
    sample_steps=16,
    random_state=42,
)
```

`generate_synthetic_data` returns a `pandas.DataFrame` with the same column
order as the training dataframe. Numeric outputs are clipped to the training
range, count columns are rounded, and discrete columns are decoded to values
observed during fitting.

## Fit Once, Generate Many Times

Use `fit_tabdm` when you want to reuse a fitted model.

```python
from tabdm import fit_tabdm

model = fit_tabdm(
    real,
    discrete_columns=["job", "owns_house"],
    epochs=50,
    timesteps=64,
    sample_steps=16,
    random_state=42,
)

synthetic_a = model.generate(100, random_state=1)
synthetic_b = model.generate(100, random_state=2)
```

Passing the same `random_state` to `generate` produces the same sampled rows for
the same fitted model.
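
The reproducibility described above comes down to seeding the sampler's noise source. The sketch below is not TabDM code: `sample_noise` is a hypothetical stand-in, written with the standard library only, that shows why a fixed seed yields the same draws while a different seed does not.

```python
import random


def sample_noise(num_rows, num_features, random_state=None):
    """Hypothetical stand-in for a sampler's initial Gaussian noise."""
    # A private generator keeps the draw independent of global random state.
    rng = random.Random(random_state)
    return [
        [rng.gauss(0.0, 1.0) for _ in range(num_features)]
        for _ in range(num_rows)
    ]


noise_a = sample_noise(3, 2, random_state=1)
noise_b = sample_noise(3, 2, random_state=1)  # same seed as noise_a
noise_c = sample_noise(3, 2, random_state=2)  # different seed

print(noise_a == noise_b)
print(noise_a == noise_c)
```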

## Conditional Generation

TabDM can treat target, sensitive, or explicitly named columns as conditioning
columns. Condition columns are not generated by the diffusion model. They are
provided by the caller or sampled from the training rows, then recombined with
generated feature columns.

```python
import pandas as pd

from tabdm import generate_synthetic_data

real = pd.DataFrame(
    {
        "age": [21, 35, 44, 28, 31, 39],
        "job": ["admin", "tech", "admin", "services", "tech", "admin"],
        "sex": ["f", "m", "f", "m", "f", "m"],
        "default": ["yes", "no", "yes", "no", "yes", "no"],
    }
)

synthetic = generate_synthetic_data(
    real,
    num_rows=200,
    discrete_columns=["job", "sex", "default"],
    target_column="default",
    sensitive_columns=["sex"],
    conditions={"default": "yes"},
    condition_strategy="prior",
    epochs=50,
    random_state=42,
)
```

Conditioning controls:

| Argument | Meaning |
| --- | --- |
| `target_column` | Downstream label to hold fixed or sample separately. |
| `sensitive_columns` | Subgroup columns to preserve or control. |
| `condition_on` | Additional columns to use as generation conditions. |
| `conditions` | Fixed values or row-wise values to use at generation time. |
| `condition_strategy` | How unspecified condition columns are sampled: `prior` or `balanced`. |
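
One plausible reading of the two `condition_strategy` values is that `prior` draws from the training condition rows as they occur, while `balanced` draws uniformly from the unique condition rows. The sketch below illustrates that difference with the standard library. `sample_conditions` and the toy rows are hypothetical and are not part of the TabDM API.

```python
import random
from collections import Counter

# Toy condition rows as (default, sex) pairs; ("yes", "f") dominates.
training_conditions = [
    ("yes", "f"), ("yes", "f"), ("yes", "f"), ("yes", "f"),
    ("no", "m"), ("no", "m"),
    ("no", "f"),
]


def sample_conditions(rows, num_rows, strategy="prior", random_state=None):
    """Hypothetical helper: draw condition rows under either strategy."""
    rng = random.Random(random_state)
    if strategy == "prior":
        pool = list(rows)          # duplicates kept, so frequent rows win
    elif strategy == "balanced":
        pool = sorted(set(rows))   # each unique combination equally likely
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [rng.choice(pool) for _ in range(num_rows)]


prior_counts = Counter(
    sample_conditions(training_conditions, 6000, "prior", random_state=0)
)
balanced_counts = Counter(
    sample_conditions(training_conditions, 6000, "balanced", random_state=0)
)

print(prior_counts)
print(balanced_counts)
```

Under `prior`, subgroup proportions in the output track the training data. Under `balanced`, rare combinations are oversampled relative to their training frequency.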

`conditions` can be:

- `None`: sample all condition columns from the training condition rows
- a mapping of scalar values: fix those columns and sample the remaining
  condition columns from matching training rows
- a mapping of sequences: provide row-wise condition values for all condition
  columns
- a one-row dataframe: repeat the row for every generated sample
- a `num_rows`-row dataframe: use row-wise condition values directly
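
The scalar-mapping case above can be pictured as a filter followed by a draw. The sketch below is an illustration only: `resolve_scalar_conditions` is a hypothetical helper, and raising an error when no training row matches is an assumption rather than documented TabDM behavior.

```python
import random

# Toy training condition rows with two condition columns.
training_rows = [
    {"default": "yes", "sex": "f"},
    {"default": "no", "sex": "m"},
    {"default": "yes", "sex": "m"},
    {"default": "no", "sex": "m"},
]


def resolve_scalar_conditions(rows, fixed, num_rows, random_state=None):
    """Hypothetical helper: fix some condition columns, sample the rest."""
    rng = random.Random(random_state)
    # Keep only training rows that agree with every fixed value.
    matching = [
        row for row in rows
        if all(row[column] == value for column, value in fixed.items())
    ]
    if not matching:
        # Assumed behavior: fail loudly instead of inventing combinations.
        raise ValueError(f"no training rows match {fixed!r}")
    # Sampling whole rows keeps the unfixed columns jointly consistent.
    return [dict(rng.choice(matching)) for _ in range(num_rows)]


resolved = resolve_scalar_conditions(
    training_rows, {"default": "yes"}, num_rows=5, random_state=0
)
print(resolved)
```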

## Column Metadata

TabDM can use metadata for columns whose dtype alone is not enough.

```python
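# Assumes `real` has `grade_band`, `incidents`, `tuition`, and `district`
# columns; the Quick Start dataframe does not.
metadata = {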
    "grade_band": {"type": "ordinal", "order": ["low", "mid", "high"]},
    "incidents": {"type": "count"},
    "tuition": {"type": "positive_continuous"},
}

synthetic = generate_synthetic_data(
    real,
    discrete_columns=["district"],
    column_metadata=metadata,
    random_state=42,
)
```

Supported metadata types:

| Type | Transform behavior | Inverse behavior |
| --- | --- | --- |
| `ordinal` | Encoded as an ordered scalar using the supplied or inferred order. | Rounded to the nearest ordinal level. |
| `count` | Encoded with `log1p` after clipping to non-negative values. | Decoded with `expm1`, clipped to the training range, and rounded. |
| `positive_continuous` | Encoded with `log1p` after clipping to non-negative values. | Decoded with `expm1` and clipped to the training range. |
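
The transforms in the table can be sketched with the standard library. This is a rough illustration, not TabDM's implementation: the function names are hypothetical, and encoding an ordinal level as its integer position in `order` is an assumption about what "ordered scalar" means. A `positive_continuous` column would follow the same `log1p`/`expm1` path as `count`, minus the final rounding.

```python
import math


def encode_count(value):
    """log1p after clipping to non-negative values."""
    return math.log1p(max(value, 0.0))


def decode_count(encoded, train_min, train_max):
    """expm1, clip to the training range, then round to an integer."""
    raw = math.expm1(encoded)
    clipped = min(max(raw, train_min), train_max)
    return int(round(clipped))


def encode_ordinal(value, order):
    """Position of the level in the supplied order (assumed encoding)."""
    return float(order.index(value))


def decode_ordinal(encoded, order):
    """Round to the nearest level, staying inside the valid index range."""
    index = int(round(encoded))
    index = min(max(index, 0), len(order) - 1)
    return order[index]


order = ["low", "mid", "high"]
incidents = [0, 3, 12]

encoded = [encode_count(value) for value in incidents]
decoded = [decode_count(e, train_min=0, train_max=12) for e in encoded]

print(decoded)
print(decode_ordinal(1.4, order))
```

Clipping at decode time is what keeps a large sampled value, such as an encoded `10.0`, from producing a count far outside anything seen in training.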

Object, string, categorical, and boolean columns are inferred as discrete when
`discrete_columns` is not supplied. Numeric columns are continuous unless listed
in metadata.

## Generation Parameters

`generate_synthetic_data` exposes the fitting and sampling controls directly.

| Parameter | Default | Description |
| --- | --- | --- |
| `dataframe` | required | Training dataframe. Must contain at least one row. |
| `num_rows` | `len(dataframe)` | Number of synthetic rows. Must be positive when provided. |
| `discrete_columns` | inferred | Names of the categorical columns. Inferred from column dtypes when omitted. |
| `column_metadata` | `None` | Metadata for ordinal, count, or positive continuous columns. |
| `target_column` | `None` | Column to condition on rather than generate. |
| `sensitive_columns` | `None` | Additional condition columns, usually subgroup attributes. |
| `condition_on` | `None` | Other condition columns. |
| `conditions` | `None` | Fixed or row-wise generation conditions. |
| `condition_strategy` | `"prior"` | `prior` samples training condition rows; `balanced` samples unique condition rows uniformly. |
| `hidden_dims` | `(256, 256)` | MLP denoiser hidden layer sizes. |
| `time_embedding_dim` | `64` | Sinusoidal timestep embedding size. |
| `timesteps` | `96` | Number of training diffusion timesteps. |
| `sample_steps` | `24` | Number of deterministic reverse steps used during sampling. |
| `epochs` | `120` | Training epochs. |
| `batch_size` | `512` | Training and sampling batch size. |
| `learning_rate` | `1e-3` | AdamW learning rate. |
| `weight_decay` | `1e-6` | AdamW weight decay. |
| `beta_start` | `1e-4` | First value in the linear noise schedule. |
| `beta_end` | `0.02` | Last value in the linear noise schedule. |
| `dropout` | `0.0` | Dropout inside the denoiser MLP. |
| `discrete_loss_weight` | `2.0` | Multiplier for one-hot categorical spans in the training loss. |
| `prediction_clip` | `1.5` | Clamp applied to predicted transformed features. |
| `grad_clip_norm` | `1.0` | Gradient norm clipping threshold. Use `0` to disable. |
| `device` | `"cpu"` | `"cpu"` or a CUDA device string. Falls back to CPU if CUDA is unavailable. |
| `random_state` | `None` | Seeds Python, NumPy, and Torch during fitting, and seeds sampling noise when generating. |
| `verbose` | `False` | Print training loss periodically. |
| `return_model` | `False` | Return a `SyntheticDataResult` containing the fitted model and metadata instead of a dataframe. |
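
To show how `beta_start`, `beta_end`, `timesteps`, and `sample_steps` relate in a typical linear-schedule diffusion setup, here is a generic sketch using the defaults from the table. It is not TabDM's code: the function names are hypothetical, and the even spacing of sampling steps is an assumption that the table does not confirm.

```python
def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas from beta_start to beta_end, inclusive."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alpha(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def strided_timesteps(timesteps, sample_steps):
    """Pick sample_steps indices from the noisiest step down to t = 0."""
    if sample_steps >= timesteps:
        return list(range(timesteps - 1, -1, -1))
    if sample_steps == 1:
        return [timesteps - 1]
    # Assumed spacing: evenly spread over the training timesteps.
    stride = (timesteps - 1) / (sample_steps - 1)
    picked = {round(i * stride) for i in range(sample_steps)}
    return sorted(picked, reverse=True)


betas = linear_beta_schedule(96)       # default `timesteps`
alpha_bars = cumulative_alpha(betas)   # signal retained at each step
steps = strided_timesteps(96, 24)      # default `sample_steps`

print(betas[0], betas[-1])
print(steps[:3], steps[-3:])
```

Fewer `sample_steps` than `timesteps` trades some sampling fidelity for speed, since the reverse process visits only a subset of the training timesteps.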

## Lower-Level Model API

```python
from tabdm import TabDM, TabDMConfig

model = TabDM(
    TabDMConfig(
        hidden_dims=(256, 256),
        time_embedding_dim=64,
        timesteps=64,
        sample_steps=16,
        epochs=50,
        batch_size=512,
        random_state=42,
    )
)

# `real` is the Quick Start dataframe, which has `job` and `owns_house`.
model.fit(real, discrete_columns=["job", "owns_house"])
synthetic = model.sample(100, random_state=42)
```

Use `TabDM` directly if you want to hold a model object, inspect
`fit_history_`, or call `sample` repeatedly.

## Evaluation

Install the optional evaluation dependencies first:

```bash
pip install -e ".[eval]"
```

Then call `evaluate_synthetic`.

```python
from tabdm import evaluate_synthetic

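# `real` and `synthetic` come from the Conditional Generation example,
# which includes the `default` target column.
report = evaluate_synthetic(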
    real=real,
    synthetic=synthetic,
    target_column="default",
    random_state=42,
)

print(report["schema"])
print(report["distribution"])
print(report["validity"])
print(report["utility"])
print(report["trust"])
```

`evaluate_synthetic` can compute:

| Group | Included by default | Description |
| --- | --- | --- |
| `schema` | yes | Column presence, column order compatibility, and dtype mismatches. |
| `distribution` | yes | Categorical total variation distance, numeric KS distance, and numeric correlation delta. |
| `validity` | yes | Numeric bound violations and unseen categorical values. |
| `utility` | yes when `target_column` is provided | Train-on-synthetic, test-on-real downstream utility. |
| `trust` | yes | Exact row matches and nearest-neighbor privacy-screening metrics. |
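
For readers unfamiliar with the two main `distribution` metrics, the sketch below computes them from their textbook definitions using only the standard library. The `_sketch` functions are hypothetical and are not the signatures of TabDM's `categorical_tvd` or `numeric_ks` helpers.

```python
from bisect import bisect_right
from collections import Counter


def categorical_tvd_sketch(real_values, synthetic_values):
    """Total variation distance: half the L1 gap between category shares."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(
            real_freq[c] / len(real_values)
            - synth_freq[c] / len(synthetic_values)
        )
        for c in categories
    )


def numeric_ks_sketch(real_values, synthetic_values):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    gap = 0.0
    for x in sorted(set(real_sorted) | set(synth_sorted)):
        # bisect_right counts the values <= x in a sorted list.
        real_cdf = bisect_right(real_sorted, x) / len(real_sorted)
        synth_cdf = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(real_cdf - synth_cdf))
    return gap


tvd = categorical_tvd_sketch(["a", "a", "b", "b"], ["a", "a", "a", "b"])
ks = numeric_ks_sketch([1, 2, 3, 4], [3, 4, 5, 6])

print(tvd)
print(ks)
```

Both values lie in `[0, 1]`, where `0` means the two samples have identical marginal distributions for that column.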

You can select metric groups explicitly:

```python
report = evaluate_synthetic(
    real=real,
    synthetic=synthetic,
    target_column="default",
    metrics=("schema", "distribution", "validity"),
    include_trust=False,
)
```

Task type is inferred as classification for object, string, categorical,
boolean, and low-cardinality integer targets. Floating numeric targets are
treated as regression. Override with `task_type="classification"` or
`task_type="regression"` when needed.
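
The inference rule above can be approximated as follows. This is a hedged sketch over plain Python values, not the library's `infer_task_type`: the `max_classes` cutoff for "low-cardinality" is an assumed value, and treating high-cardinality integer targets as regression is also an assumption.

```python
def infer_task_type_sketch(target_values, max_classes=20):
    """Hypothetical rule of thumb; `max_classes` is an assumed cutoff."""
    values = [v for v in target_values if v is not None]
    # Strings and booleans are always treated as class labels.
    if all(isinstance(v, (str, bool)) for v in values):
        return "classification"
    # Integers count as labels only when there are few distinct values.
    if all(isinstance(v, int) for v in values):
        if len(set(values)) <= max_classes:
            return "classification"
        return "regression"
    # Anything containing floats falls through to regression.
    return "regression"


print(infer_task_type_sketch(["yes", "no", "yes"]))
print(infer_task_type_sketch([0, 1, 1, 0]))
print(infer_task_type_sketch([0.5, 1.25, 3.0]))
```

An integer-coded rating with many levels is the kind of target where an explicit `task_type` override is worth setting.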

### Evaluation Helpers

The public evaluation helpers are:

- `evaluate_synthetic`
- `evaluate_utility`
- `schema_report`
- `distribution_report`
- `validity_report`
- `trust_report`
- `exact_match_rate`
- `nearest_neighbor_privacy`
- `categorical_tvd`
- `numeric_ks`
- `numeric_correlation_delta`
- `infer_task_type`

Privacy-screening metrics are diagnostics only. They do not prove
anonymization, differential privacy, or legal compliance.
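
To make concrete what such screening diagnostics measure, the sketch below computes an exact-match rate and nearest-real-row distances on toy numeric rows. The function names are hypothetical and do not reflect the signatures of `exact_match_rate` or `nearest_neighbor_privacy`. The distances here are also unscaled, which a real implementation would likely handle differently.

```python
import math


def exact_match_rate_sketch(real_rows, synthetic_rows):
    """Share of synthetic rows that also appear verbatim in the real data."""
    real_set = set(real_rows)
    matches = sum(row in real_set for row in synthetic_rows)
    return matches / len(synthetic_rows)


def nearest_real_distances(real_rows, synthetic_rows):
    """Euclidean distance from each synthetic row to its closest real row."""
    return [
        min(math.dist(synth, real) for real in real_rows)
        for synth in synthetic_rows
    ]


# Rows are (age, balance) tuples; the first synthetic row copies a real one.
real_rows = [(21.0, 1000.0), (35.0, 250.0), (44.0, 1900.0)]
synthetic_rows = [(21.0, 1000.0), (30.0, 260.0), (40.0, 1800.0)]

match_rate = exact_match_rate_sketch(real_rows, synthetic_rows)
distances = nearest_real_distances(real_rows, synthetic_rows)

print(match_rate)
print(distances)
```

A distance of zero flags a copied record, but small nonzero distances can be just as revealing, which is one reason these numbers are screening signals and not guarantees.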

## Public API

- `tabdm.TabDM`
- `tabdm.TabDMConfig`
- `tabdm.DataTransformer`
- `tabdm.SyntheticDataResult`
- `tabdm.fit_tabdm`
- `tabdm.generate_synthetic_data`
- `tabdm.infer_discrete_columns`
- `tabdm.evaluate_synthetic`
- `tabdm.evaluate_utility`
- `tabdm.trust_report`

## Testing

```bash
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q
```

`PYTEST_DISABLE_PLUGIN_AUTOLOAD=1` avoids unrelated third-party pytest plugin
startup issues in environments with many globally installed plugins.

## License

TabDM is distributed under the Apache License 2.0.

## Scope

This package ships the core generation and evaluation APIs.

TabDM is an alpha research/development package. Always evaluate generated data
for the intended dataset, task, and privacy posture before use.
