Metadata-Version: 2.4
Name: nemo-safe-synthesizer
Version: 0.0.5rc0
Summary: Safe synthesizer
Author-email: NVIDIA <nemo@nvidia.com>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Human Machine Interfaces
Classifier: Topic :: Software Development
Requires-Python: <3.14,>=3.11
Requires-Dist: colorama>=0.4.6
Requires-Dist: faker>=20.0
Requires-Dist: huggingface-hub<1,>=0.34.4
Requires-Dist: jsonschema>=4.22.0
Requires-Dist: pandas>=2.1.3
Requires-Dist: pydantic-settings>=2.6.1
Requires-Dist: pydantic[email]>=2.12.5
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: rich>=14.1.0
Requires-Dist: setuptools>=80.0.0
Requires-Dist: structlog>=25.4.0
Requires-Dist: tqdm>=4.67.1
Provides-Extra: cpu
Requires-Dist: accelerate; extra == 'cpu'
Requires-Dist: bitsandbytes==0.49.1; extra == 'cpu'
Requires-Dist: flashinfer-cubin==0.6.6; (sys_platform == 'linux') and extra == 'cpu'
Requires-Dist: flashinfer-python==0.6.6; (sys_platform == 'linux') and extra == 'cpu'
Requires-Dist: gliner; extra == 'cpu'
Requires-Dist: kernels>=0.12.1; extra == 'cpu'
Requires-Dist: opacus; extra == 'cpu'
Requires-Dist: peft; extra == 'cpu'
Requires-Dist: sentence-transformers; extra == 'cpu'
Requires-Dist: torch==2.10.0+cpu; (sys_platform == 'linux') and extra == 'cpu'
Requires-Dist: torch==2.10.0; (sys_platform == 'darwin') and extra == 'cpu'
Requires-Dist: torchao==0.16.0; extra == 'cpu'
Requires-Dist: torchvision==0.25.0+cpu; (sys_platform == 'linux') and extra == 'cpu'
Requires-Dist: torchvision==0.25.0; (sys_platform == 'darwin') and extra == 'cpu'
Requires-Dist: transformers==4.57.3; extra == 'cpu'
Requires-Dist: triton>=2.0.0; (sys_platform == 'linux') and extra == 'cpu'
Requires-Dist: trl>=0.23.0; extra == 'cpu'
Requires-Dist: vllm==0.18.0; (sys_platform == 'linux') and extra == 'cpu'
Provides-Extra: cu128
Requires-Dist: accelerate; extra == 'cu128'
Requires-Dist: bitsandbytes==0.49.1; extra == 'cu128'
Requires-Dist: flashinfer-cubin==0.6.6; (sys_platform == 'linux') and extra == 'cu128'
Requires-Dist: flashinfer-jit-cache==0.6.6+cu128; (sys_platform == 'linux') and extra == 'cu128'
Requires-Dist: flashinfer-python==0.6.6; (sys_platform == 'linux') and extra == 'cu128'
Requires-Dist: gliner; extra == 'cu128'
Requires-Dist: kernels>=0.12.1; extra == 'cu128'
Requires-Dist: nvidia-cublas-cu12; (sys_platform == 'linux') and extra == 'cu128'
Requires-Dist: nvidia-ml-py; (sys_platform == 'linux') and extra == 'cu128'
Requires-Dist: opacus; extra == 'cu128'
Requires-Dist: peft; extra == 'cu128'
Requires-Dist: sentence-transformers; extra == 'cu128'
Requires-Dist: torch-c-dlpack-ext; extra == 'cu128'
Requires-Dist: torch==2.10.0+cu128; (sys_platform == 'linux') and extra == 'cu128'
Requires-Dist: torchao==0.16.0+cu128; (sys_platform == 'linux' and platform_machine == 'x86_64') and extra == 'cu128'
Requires-Dist: torchvision==0.25.0+cu128; (sys_platform == 'linux') and extra == 'cu128'
Requires-Dist: transformers==4.57.3; extra == 'cu128'
Requires-Dist: triton>=2.0.0; (sys_platform == 'linux') and extra == 'cu128'
Requires-Dist: trl>=0.23.0; extra == 'cu128'
Requires-Dist: vllm==0.18.0; (sys_platform == 'linux') and extra == 'cu128'
Requires-Dist: xformers==v0.0.34; (sys_platform == 'linux' and platform_machine == 'x86_64') and extra == 'cu128'
Provides-Extra: engine
Requires-Dist: anyascii; extra == 'engine'
Requires-Dist: betterproto; extra == 'engine'
Requires-Dist: cached-property; extra == 'engine'
Requires-Dist: category-encoders; extra == 'engine'
Requires-Dist: datasets; extra == 'engine'
Requires-Dist: dateparser; extra == 'engine'
Requires-Dist: dython; extra == 'engine'
Requires-Dist: faker; extra == 'engine'
Requires-Dist: flashtext; extra == 'engine'
Requires-Dist: huggingface-hub<1,>=0.34.4; extra == 'engine'
Requires-Dist: json-repair; extra == 'engine'
Requires-Dist: matplotlib; extra == 'engine'
Requires-Dist: outlines>=1.0.0; extra == 'engine'
Requires-Dist: pandas<3,>=2.1.3; extra == 'engine'
Requires-Dist: plotly; extra == 'engine'
Requires-Dist: prv-accountant; extra == 'engine'
Requires-Dist: pycountry; extra == 'engine'
Requires-Dist: python-stdnum; extra == 'engine'
Requires-Dist: range-regex>=0.1.0; extra == 'engine'
Requires-Dist: ratelimit; extra == 'engine'
Requires-Dist: scikit-learn; extra == 'engine'
Requires-Dist: smart-open==7.0.5; extra == 'engine'
Requires-Dist: tenacity==9.1.4; extra == 'engine'
Requires-Dist: tiktoken<1.0,>=0.7.0; extra == 'engine'
Requires-Dist: tldextract; extra == 'engine'
Requires-Dist: tqdm>=4.67.1; extra == 'engine'
Requires-Dist: urllib3>=2.6.1; extra == 'engine'
Requires-Dist: wandb==0.26.0; extra == 'engine'
Description-Content-Type: text/markdown

# 🛡️ NeMo Safe Synthesizer

NVIDIA NeMo Safe Synthesizer creates private, safe versions of sensitive tabular datasets -- entirely synthetic data with no one-to-one mapping to your original records. It is purpose-built for privacy compliance and sensitive information protection, while preserving data utility for downstream AI tasks.

## Quick Start

Read detailed usage below, or jump to the documentation with [Getting Started](https://nvidia-nemo.github.io/Safe-Synthesizer/user-guide/getting-started/) or the [Safe Synthesizer 101](https://nvidia-nemo.github.io/Safe-Synthesizer/tutorials/safe-synthesizer-101/) notebook.


### Prerequisites

- Python 3.11–3.13 (we pin a specific 3.11.x in `.python-version` for local/dev bootstrap; any 3.11, 3.12, or 3.13 interpreter works. Python 3.14+ is NOT supported because ray, a transitive dependency of vLLM, does not yet publish `cp314` wheels)
- [uv](https://docs.astral.sh/uv/) (recommended) or pip -- Python package manager
- NVIDIA GPU (A100 or larger) for training and generation
- Linux only -- macOS, Windows, and Apple Silicon are not supported for training or generation. A CPU-only install is available for development and configuration validation.
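
The prerequisites above can be checked before installing. The following is a minimal sketch, not part of the package: `check_environment` is a hypothetical helper, the version bounds come from the package metadata (`Requires-Python: <3.14,>=3.11`), and it does not attempt to detect a GPU.

```python
import sys

# Bounds taken from the package metadata: Requires-Python >=3.11,<3.14.
MIN_PYTHON = (3, 11)
MAX_PYTHON_EXCLUSIVE = (3, 14)


def check_environment(version=None, platform=None):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration only.
    """
    version = tuple(version or sys.version_info[:2])
    platform = platform or sys.platform
    problems = []
    if not (MIN_PYTHON <= version[:2] < MAX_PYTHON_EXCLUSIVE):
        problems.append(f"Python {version[0]}.{version[1]} is outside 3.11-3.13")
    if not platform.startswith("linux"):
        problems.append(f"platform {platform!r} cannot run training or generation")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

An empty result only means the interpreter and operating system match the documented requirements; a CPU-only install on another platform is still usable for development and configuration validation.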

### Installation

```bash
# With uv (recommended):
uv pip install "nemo-safe-synthesizer[cu128,engine]" \
  --index https://flashinfer.ai/whl/cu128 \
  --index https://download.pytorch.org/whl/cu128 \
  --index-strategy unsafe-best-match

# With pip:
pip install "nemo-safe-synthesizer[cu128,engine]" \
  --extra-index-url https://download.pytorch.org/whl/cu128 \
  --extra-index-url https://flashinfer.ai/whl/cu128
```

Or install from source:

```bash
git clone https://github.com/NVIDIA-NeMo/Safe-Synthesizer.git
cd Safe-Synthesizer
make setup # installs the pinned mise version (if missing) + pinned tool versions from mise.lock
make bootstrap-nss cuda
```

Development tools (`ruff`, `ty`, `yq`, `gh`, etc.) are managed via [mise](https://mise.jdx.dev/). Tool versions are declared in `.mise.toml` and locked in `mise.lock` (committed). mise also manages environment variables -- place project-local secrets or overrides in `.env` or `.env.local` (both git-ignored, auto-loaded by mise).

### Running

Activate your Python virtual environment, then run the CLI with `safe-synthesizer`:

```bash
> safe-synthesizer --help
Usage: safe-synthesizer [OPTIONS] COMMAND [ARGS]...

  NeMo Safe Synthesizer command-line interface. This application is used to
  run the Safe Synthesizer pipeline. It can be used to train a model, generate
  synthetic data, and evaluate the synthetic data. It can also be used to
  modify a config file.

Options:
  --help  Show this message and exit.

Commands:
  artifacts  Artifacts management commands.
  config     Manage Safe Synthesizer configurations.
  run        Run the Safe Synthesizer end-to-end pipeline.
```

## Running the Pipeline

The `run` command executes the Safe Synthesizer pipeline. Without a subcommand, it runs the full end-to-end pipeline:

```bash
> uv run safe-synthesizer run --help
Usage: safe-synthesizer run [OPTIONS] COMMAND [ARGS]...

  Run the Safe Synthesizer end-to-end pipeline.

  Without a subcommand, runs the full end-to-end pipeline. Use 'run train' or
  'run generate' for individual stages.

Options:
  --config TEXT                   path to a yaml config file
  --data-source TEXT              Dataset name, URL, or path to CSV dataset.
                                  For 'run generate', this is optional if a
                                  cached dataset exists in the workdir.
  --artifact-path DIRECTORY       Base directory for all runs. Runs are
                                  created as <artifact-
                                  path>/<config>---<dataset>/<timestamp>/. Can
                                  also be set via NSS_ARTIFACTS_PATH env var.
                                  [default: ./safe-synthesizer-artifacts]
  --run-path DIRECTORY            Explicit path for this run's output
                                  directory. When specified, outputs go
                                  directly to this path. Overrides --artifact-
                                  path.
  --output-file PATH              Path to output CSV file. Overrides the
                                  default workdir output location.
  --log-format [json|plain]       Log format for console output. File logging
                                  will always be JSON. Can also be set via
                                  NSS_LOG_FORMAT env var. [default: plain]
  --log-color / --no-log-color    Whether to colorize the log output on the
                                  console. [default: --log-color]
  --log-file PATH                 Path to log file. Defaults to a file nested
                                  under the run directory. Can also be set via
                                  NSS_LOG_FILE env var.
  --wandb-mode [online|offline|disabled]
                                  Wandb mode. 'online' will upload logs to
                                  wandb, 'offline' will save logs to a local
                                  file, 'disabled' will not upload logs to
                                  wandb. Can also be set via WANDB_MODE env
                                  var. [default: disabled]
  --wandb-project TEXT            Wandb project. Can also be set via
                                  WANDB_PROJECT env var.
  -v                              Verbose logging. 'v' shows debug info from
                                  main program, 'vv' shows debug from
                                  dependencies too
  --dataset-registry TEXT         URL or path of a dataset registry YAML file.
                                  If provided, datasets in the registry may be
                                  referenced by name in --data-source. Can
                                  also be set via NSS_DATASET_REGISTRY env
                                  var. If both env var and CLI option are
                                  provided, the CLI option takes precedence.
  --help                          Show this message and exit.

Commands:
  generate  Run the generation stage only.
  train     Run the training stage only.
```

### Subcommands

- `safe-synthesizer run train` - Run only the training stage, saving the adapter to the run directory.
- `safe-synthesizer run generate` - Run only the generation stage using a saved adapter.

```bash
> uv run safe-synthesizer run generate --help
Usage: safe-synthesizer run generate [OPTIONS]

  Run the generation stage only.

  This command loads a trained adapter and generates synthetic data. Requires
  'run train' to have been executed first.

  Use --run-path to specify the exact run directory containing the trained
  model, or use --auto-discover-adapter with --artifact-path to automatically
  find the latest trained run.

Options:
  --config TEXT                   path to a yaml config file
  --data-source TEXT              Dataset name, URL, or path to CSV dataset.
                                  [required]
  --artifact-path DIRECTORY       Base directory for all runs. Runs are
                                  created as <artifact-path>/<config>-
                                  <dataset>/<timestamp>/. [default: ./safe-
                                  synthesizer-artifacts]
  --run-path DIRECTORY            Explicit path for this run's output
                                  directory. When specified, outputs go
                                  directly to this path. Overrides --artifact-
                                  path.
  --output-file PATH              Path to output CSV file. Overrides the
                                  default workdir output location.
  --log-format [json|plain]       Log format for console output. File logging
                                  will always be JSON.
  --log-color / --no-log-color    Whether to colorize the log output on the
                                  console
  --log-file PATH                 Path to log file. Defaults to a file nested
                                  under the run directory.
  -v                              Verbose logging. 'v' shows debug info from
                                  main program, 'vv' shows debug from
                                  dependencies too
  --wandb-mode [online|offline|disabled]
                                  Wandb mode. 'online' will upload logs to
                                  wandb, 'offline' will save logs to a local
                                  file, 'disabled' will not upload logs to
                                  wandb.
  --wandb-project TEXT            Wandb project. If not specified, the project
                                  will be taken from the environment variable
                                  WANDB_PROJECT.
  --auto-discover-adapter         Automatically find the latest trained
                                  adapter in --artifact-path. Without this
                                  flag, --run-path must point to a specific
                                  trained run.
  --help                          Show this message and exit.
```
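
When the two stages are run separately, both need to point at the same run directory. The sketch below only builds the command lines and does not execute them. `stage_command` is a hypothetical helper, the paths are placeholders, and it assumes `run train` accepts the same `--config`, `--data-source`, and `--run-path` options that `run generate` lists above.

```python
import shlex


def stage_command(stage, run_path, data_source, config=None):
    """Build the argv for `safe-synthesizer run <stage>`.

    Hypothetical helper; only flags shown in the help text above are used.
    """
    if stage not in ("train", "generate"):
        raise ValueError(f"unknown stage: {stage}")
    argv = ["safe-synthesizer", "run", stage]
    if config is not None:
        argv += ["--config", config]
    argv += ["--data-source", data_source, "--run-path", run_path]
    return argv


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    run_path = "./runs/demo"
    for stage in ("train", "generate"):
        print(shlex.join(stage_command(stage, run_path, "my_data.csv", "config.yaml")))
```

Sharing one `--run-path` between the two commands is what lets the generation stage find the adapter saved by the training stage.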

## Managing Configurations

The `config` command provides tools to validate and modify configuration files:

```bash
> uv run safe-synthesizer config --help
Usage: safe-synthesizer config [OPTIONS] COMMAND [ARGS]...

  Manage Safe Synthesizer configurations.

Options:
  --help  Show this message and exit.

Commands:
  modify    Modify a Safe Synthesizer configuration.
  validate  Validate a Safe Synthesizer configuration.
```

## Attention Configuration

Safe Synthesizer exposes attention implementation settings for both training and generation.

### Training (`attn_implementation`)

Controls the HuggingFace attention backend used during model loading for training. Set via config YAML, CLI, or SDK:

```yaml
# config.yaml
training:
  attn_implementation: "kernels-community/vllm-flash-attn3"
```

```bash
# CLI override
safe-synthesizer run --training__attn_implementation sdpa --data-source my_data.csv
```
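
The CLI flag mirrors the YAML nesting, with a double underscore separating the section from the key. As a rough illustration of that mapping (not the CLI's actual parsing code; `apply_override` is a hypothetical helper):

```python
import copy


def apply_override(config, flag, value):
    """Return a copy of `config` with the `section__key` path set to `value`.

    Illustrative only: shows how `--training__attn_implementation` lines up
    with `training: {attn_implementation: ...}` in the YAML file.
    """
    updated = copy.deepcopy(config)
    *parents, leaf = flag.lstrip("-").split("__")
    node = updated
    for name in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(name, {})
    node[leaf] = value
    return updated


if __name__ == "__main__":
    config = {"training": {"attn_implementation": "kernels-community/vllm-flash-attn3"}}
    print(apply_override(config, "--training__attn_implementation", "sdpa"))
```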

| Value | Description | Requires |
|-------|-------------|----------|
| `kernels-community/vllm-flash-attn3` | Flash Attention 3 via HuggingFace Kernels Hub (default) | `kernels` pip package |
| `kernels-community/flash-attn2` | Flash Attention 2 via HuggingFace Kernels Hub | `kernels` pip package |
| `flash_attention_2` | Flash Attention 2 (traditional) | `flash-attn` pip package |
| `sdpa` | PyTorch scaled dot product attention | None (built-in) |
| `eager` | Standard PyTorch attention | None (built-in) |

If the default `kernels-community/vllm-flash-attn3` is configured but the `kernels` package is not installed, the backend automatically falls back to `sdpa`.
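
The fallback rule can be pictured as follows. This is a sketch of the documented behavior under stated assumptions, not the library's implementation: `resolve_attn_implementation` is a hypothetical name, and checking for the package with `importlib.util.find_spec` is an assumed detection method.

```python
import importlib.util

DEFAULT_ATTN = "kernels-community/vllm-flash-attn3"
FALLBACK_ATTN = "sdpa"


def resolve_attn_implementation(configured=DEFAULT_ATTN, kernels_installed=None):
    """Return the attention backend that would be used for training.

    Hypothetical helper mirroring the documented rule: the default backend
    falls back to `sdpa` when the `kernels` package is missing.
    """
    if kernels_installed is None:
        # Assumed detection method: look the package up without importing it.
        kernels_installed = importlib.util.find_spec("kernels") is not None
    if configured == DEFAULT_ATTN and not kernels_installed:
        return FALLBACK_ATTN
    return configured


if __name__ == "__main__":
    print(resolve_attn_implementation())
```

The sketch covers only the default value because that is the only fallback described here; what happens for the other `kernels-community/...` value without the `kernels` package is not specified.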

### Generation (`attention_backend`)

Controls the vLLM attention backend used during synthetic data generation. Defaults to `"auto"`, which lets vLLM auto-select the best available backend.

```yaml
# config.yaml
generation:
  attention_backend: "FLASH_ATTN"
```

Common values: `FLASHINFER`, `FLASH_ATTN`, `TORCH_SDPA`, `TRITON_ATTN`, `FLEX_ATTENTION`.

## NIM Integration

Column classification uses a NIM/OpenAI-compatible endpoint to detect entity types
in your data. `NSS_INFERENCE_ENDPOINT` defaults to `https://integrate.api.nvidia.com/v1`;
override it to use a different endpoint.

When using the CLI or Python SDK, set `NSS_INFERENCE_KEY` (and `NSS_INFERENCE_ENDPOINT` only if not
using the default) so column classification can run.
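
A quick way to see which endpoint and key would be picked up from the environment is sketched below. `inference_settings` is a hypothetical helper, not an SDK function; it only reads the two documented variables and applies the documented default endpoint.

```python
import os

# Default endpoint as documented above.
DEFAULT_ENDPOINT = "https://integrate.api.nvidia.com/v1"


def inference_settings(env=None):
    """Return (endpoint, key) as they would be read from the environment.

    Hypothetical helper. `key` is None when NSS_INFERENCE_KEY is unset,
    in which case column classification has no credentials to use.
    """
    env = os.environ if env is None else env
    endpoint = env.get("NSS_INFERENCE_ENDPOINT") or DEFAULT_ENDPOINT
    key = env.get("NSS_INFERENCE_KEY") or None
    return endpoint, key


if __name__ == "__main__":
    endpoint, key = inference_settings()
    print("endpoint:", endpoint)
    print("key set:", key is not None)
```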

### Local Endpoint

To point to a locally hosted LLM, add the variables to `.env.local` (git-ignored, auto-loaded by mise):

```bash
# .env.local
NSS_INFERENCE_ENDPOINT=https://your-local-nim-endpoint
NSS_INFERENCE_KEY=your-api-key  # pragma: allowlist secret
```

Or export them in your shell:

```bash
export NSS_INFERENCE_ENDPOINT="https://your-local-nim-endpoint"
export NSS_INFERENCE_KEY="your-api-key"  # pragma: allowlist secret
```

### Disable Classification

To disable classification entirely:

```yaml
replace_pii:
  globals:
    classify:
      enable_classify: false
```

When classification is disabled, NSS falls back to default entity types.

## Artifacts and Workdirs

Safe Synthesizer uses a structured directory format to manage artifacts (trained models, synthetic data, logs).

### Directory Layout

By default, runs are nested under `--artifact-path` using the project name (`<config>---<dataset>`) and a unique run name.

```text
<artifact-path>/<config>---<dataset>/<run_name>/
├── train/
│   ├── safe-synthesizer-config.json
│   └── adapter/                     # trained PEFT adapter
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── metadata_v2.json
│       └── dataset_schema.json
├── generate/
│   ├── logs.jsonl                   # generate-only workflow
│   ├── info.json                    # generate-only workflow
│   ├── synthetic_data.csv
│   ├── evaluation_report.html
│   └── evaluation_metrics.json      # machine-readable metrics
├── dataset/
│   ├── training.csv
│   ├── test.csv
│   ├── validation.csv               # when training.validation_ratio > 0
│   └── transformed_training.csv     # when PII replacement transforms the data
└── logs/
    └── <phase>.jsonl                # e.g. end_to_end.jsonl or train.jsonl
```

### Run Names

Unless you set `--run-path`, the run name is generated automatically from the current timestamp (shown as `<timestamp>` in the CLI help).

### Overriding Paths

- Use `--run-path` to specify an explicit directory for the run, bypassing the `<project>/<timestamp>` nesting.
- Use `--output-file` to specify an explicit path for the final synthetic CSV, overriding the default location in the `generate/` directory.
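
For scripts that post-process a run, the layout above translates into paths like these. This is a sketch with hypothetical helper names (`run_dir`, `expected_outputs`); the run name in the example is a placeholder, since the exact timestamp format is not specified here.

```python
from pathlib import Path


def run_dir(artifact_path, config_name, dataset_name, run_name):
    """Compose <artifact-path>/<config>---<dataset>/<run_name>."""
    return Path(artifact_path) / f"{config_name}---{dataset_name}" / run_name


def expected_outputs(run_path):
    """Map a few well-known artifacts to their default locations in a run."""
    run_path = Path(run_path)
    return {
        "adapter": run_path / "train" / "adapter",
        "synthetic_data": run_path / "generate" / "synthetic_data.csv",
        "evaluation_report": run_path / "generate" / "evaluation_report.html",
        "evaluation_metrics": run_path / "generate" / "evaluation_metrics.json",
    }


if __name__ == "__main__":
    # "my-run" stands in for the auto-generated run name.
    root = run_dir("./safe-synthesizer-artifacts", "config", "my_data", "my-run")
    for name, path in expected_outputs(root).items():
        print(f"{name}: {path} (exists: {path.exists()})")
```

If `--output-file` was used, the synthetic CSV lives at that path instead of the default `generate/synthetic_data.csv`.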

## WandB Logging

Safe Synthesizer supports Weights & Biases (WandB) for experiment tracking.

### Configuration

You can enable WandB logging using CLI options or environment variables:

- `--wandb-mode [online|offline|disabled]`: Set the WandB mode. Default is `disabled`.
- `--wandb-project <name>`: Specify the WandB project name.
- `WANDB_API_KEY`: Ensure your API key is set in your environment.

### Logged Data

The following information is logged to WandB:

- Configuration parameters
- Training metrics (if supported by the backend)
- Generation statistics
- Evaluation results
- Timing information

## Dataset Registry

Safe Synthesizer supports a *dataset registry* to simplify working with a standard set of datasets.
Datasets in the registry may be referenced by name, rather than repeatedly specifying long URLs or file paths on the command line.
Additionally, the registry supports custom config overrides or args that are specific to individual datasets.

### Providing a Dataset Registry

You can supply a dataset registry (YAML file) via either the CLI or an environment variable:

- CLI Option:
`--dataset-registry <path_or_url>`
- Environment Variable:
Set `NSS_DATASET_REGISTRY` to point to your YAML file (path or URL).

If both are provided, the CLI option takes precedence.
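
The precedence rule amounts to a two-step lookup. A minimal sketch, where `resolve_registry` is a hypothetical helper and not part of the CLI:

```python
import os


def resolve_registry(cli_value=None, env=None):
    """Return the registry location, preferring the CLI option over the env var.

    Hypothetical helper; returns None when neither source is set.
    """
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    return env.get("NSS_DATASET_REGISTRY") or None


if __name__ == "__main__":
    print(resolve_registry(cli_value=None))
```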

### Referencing Datasets

When a dataset registry is provided, you can use dataset names defined in the registry with the `--data-source` argument.
For example:

```bash
safe-synthesizer run --dataset-registry my_registry.yaml --data-source my_dataset
```

This loads the dataset from the `url` registered for `my_dataset` and applies any overrides defined for it in the registry YAML.

### Dataset Registry YAML Format

The registry file should conform to the pydantic model defined by `DatasetRegistry` in `cli/datasets.py`. For example:

```yaml
# registry.yaml
base_url: /root/data/location
datasets:
- name: dataset1
  url: dataset1.csv
- name: dataset2
  url: dataset2.jsonl
  overrides:
    data:
      group_training_examples_by: id
- name: dataset3
  url: /absolute/path/to/dataset.csv
- name: dataset4
  url: https://myhost.com/path/to/dataset.json
  load_args:
    keyword: custom_arg_for_data_reader
```

- Each entry in the `datasets:` list requires at least a `name` and a `url`.
`url` may be a URL or a file path: anything that data readers like `pd.read_csv` will accept.
- `base_url` - Any relative URLs or paths are prefixed with `base_url` before the dataset is loaded.
This only applies to named datasets in the registry that have a relative `url`.
A relative `--data-source` passed on the CLI is always loaded relative to your current working directory, regardless of whether a registry is provided or whether `base_url` is set.
`base_url` is optional; if it is not provided, absolute URLs or file paths are recommended for all entries.
- `overrides` - Dataset-specific config overrides, such as a dataset that should always be run with `group_training_examples_by`.
Config values passed as CLI arguments always take precedence, then any overrides from the registry, and finally values from the `--config` YAML file.
- `load_args` - Extra arguments needed by the data reader for a specific dataset.
For example, changing the separator used by `pd.read_csv` for a `.csv` file with a different delimiter.
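
The `base_url` and precedence rules above can be illustrated with a short sketch. It is not the registry's implementation: `resolve_url`, `deep_merge`, and `effective_config` are hypothetical helpers, and joining `base_url` and a relative `url` with a single `/` is an assumption about how the prefixing works.

```python
import posixpath
from urllib.parse import urlparse


def resolve_url(url, base_url=None):
    """Prefix relative registry entries with base_url; leave absolute ones alone."""
    has_scheme = bool(urlparse(url).scheme)  # e.g. https://...
    if has_scheme or posixpath.isabs(url) or not base_url:
        return url
    # Assumed joining rule: base_url + "/" + relative url.
    return base_url.rstrip("/") + "/" + url


def deep_merge(low, high):
    """Return a new dict in which values from `high` win over values from `low`."""
    merged = dict(low)
    for key, value in high.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_config(config_yaml, registry_overrides, cli_overrides):
    """Lowest to highest precedence: --config file, registry overrides, CLI."""
    return deep_merge(deep_merge(config_yaml, registry_overrides), cli_overrides)


if __name__ == "__main__":
    print(resolve_url("dataset1.csv", "/root/data/location"))
    print(effective_config(
        {"training": {"attn_implementation": "eager"}},
        {"data": {"group_training_examples_by": "id"}},
        {"training": {"attn_implementation": "sdpa"}},
    ))
```

Merging from lowest to highest precedence keeps the rule easy to audit: each later layer can only overwrite keys it sets explicitly.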

## License

NeMo Safe Synthesizer is licensed under the [Apache License 2.0](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/blob/main/LICENSE).

## Contact

- [Need help? Ask us a question](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/discussions)
- [Report a bug](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/issues/new?template=bug-report.yml)
- [Make a feature request](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/issues/new?template=feature-request.yml)
- [Report a security vulnerability](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/security/policy)
