Metadata-Version: 2.4
Name: synthforge
Version: 0.1.0
Summary: Next-generation synthetic data generation with LLM-augmented pipelines, diffusion models, and evaluation-first design
Author-email: SynthForge Team <synthforge@example.com>
Maintainer-email: SynthForge Team <synthforge@example.com>
License: PROPRIETARY SOFTWARE LICENSE AGREEMENT
        
        Copyright (c) 2025 SynthForge Team. All rights reserved.
        
        This software and associated documentation files (the "Software") are the
        proprietary and confidential property of the copyright holder.
        
        GRANT OF LICENSE:
        Subject to the terms of this agreement, the copyright holder grants you a
        limited, non-exclusive, non-transferable, revocable license to use the
        Software for your internal business purposes only.
        
        RESTRICTIONS:
        You may NOT:
          1. Copy, modify, or distribute the Software without prior written consent
          2. Reverse engineer, decompile, or disassemble the Software
          3. Sublicense, sell, lease, or rent the Software to any third party
          4. Remove or alter any proprietary notices or labels on the Software
          5. Use the Software to build a competing product or service
        
        DISCLAIMER OF WARRANTIES:
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
        
        LIMITATION OF LIABILITY:
        IN NO EVENT SHALL THE COPYRIGHT HOLDER BE LIABLE FOR ANY CLAIM, DAMAGES OR
        OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
        FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
        IN THE SOFTWARE.
        
        For licensing inquiries, contact: synthforge@example.com
        
Project-URL: Homepage, https://github.com/YOUR_USERNAME/synthforge
Project-URL: Documentation, https://github.com/YOUR_USERNAME/synthforge#readme
Project-URL: Repository, https://github.com/YOUR_USERNAME/synthforge
Project-URL: Issues, https://github.com/YOUR_USERNAME/synthforge/issues
Project-URL: Changelog, https://github.com/YOUR_USERNAME/synthforge/blob/main/CHANGELOG.md
Keywords: synthetic-data,privacy,tabular-data,machine-learning,llm,diffusion-models,data-generation,pii-detection,gaussian-copula,ctgan
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Security
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas<3.0,>=2.0
Requires-Dist: numpy<2.1,>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.11
Requires-Dist: pydantic>=2.0
Requires-Dist: tqdm>=4.65
Requires-Dist: faker>=20.0
Requires-Dist: joblib>=1.3
Provides-Extra: gan
Requires-Dist: torch>=2.1; extra == "gan"
Provides-Extra: diffusion
Requires-Dist: torch>=2.1; extra == "diffusion"
Provides-Extra: great
Requires-Dist: torch>=2.1; extra == "great"
Requires-Dist: transformers>=4.35; extra == "great"
Provides-Extra: llm
Requires-Dist: litellm>=1.40; extra == "llm"
Provides-Extra: presidio
Requires-Dist: presidio-analyzer>=2.2; extra == "presidio"
Requires-Dist: presidio-anonymizer>=2.2; extra == "presidio"
Provides-Extra: evaluation
Requires-Dist: sdmetrics>=0.14; extra == "evaluation"
Requires-Dist: xgboost>=2.0; extra == "evaluation"
Requires-Dist: matplotlib>=3.7; extra == "evaluation"
Requires-Dist: seaborn>=0.13; extra == "evaluation"
Provides-Extra: all
Requires-Dist: synthforge[diffusion,evaluation,gan,great,llm,presidio]; extra == "all"
Provides-Extra: dev
Requires-Dist: synthforge[all]; extra == "dev"
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: ruff>=0.3; extra == "dev"
Requires-Dist: mypy>=1.8; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# SynthForge

> Next-generation synthetic data generation with LLM-augmented pipelines.

SynthForge combines **statistical generative models** (Gaussian Copula, CTGAN, TVAE, Diffusion) with **LLM-powered intelligence** (schema enrichment, PII/MNPI detection, semantic validation) to produce high-fidelity synthetic tabular data from small production samples.

## Quick Start

```python
import pandas as pd
from synthforge import SynthForge

# Load a sample from production (e.g., 2500 rows from Redshift)
df = pd.read_csv("production_sample.csv")

# One-line generation
forge = SynthForge()
synthetic_df = forge.fit_generate(df, num_rows=100_000)

# With LLM enrichment (auto-detects PII, infers semantics)
forge = SynthForge(llm_provider="anthropic", llm_model="claude-sonnet-4-20250514")
forge.profile(df)                    # Schema enrichment + PII detection
forge.fit(df)                        # Train synthesizer
synthetic_df = forge.generate(100_000)  # Bulk generate
report = forge.evaluate(df, synthetic_df)  # Quality report
```

## Key Features

- **Intelligent Schema Detection**: LLM-powered column semantic inference beyond statistical type detection
- **PII/MNPI Detection**: Presidio + LLM augmentation for catching non-obvious sensitive data
- **Multiple Synthesizers**: Gaussian Copula (fast), CTGAN/TVAE (balanced), TabSyn (highest quality)
- **Data-Type Strategies**: Specialized pipelines for categorical, numerical, time-series, and mixed-type tables
- **Evaluation-First**: Built-in quality reports with statistical fidelity, ML utility, and privacy metrics
- **Configurable Scale**: From 1K to 10M+ rows with batch generation and optional GPU acceleration
- **LLM-Agnostic**: Works with Claude, OpenAI, Ollama, vLLM, or any LiteLLM-supported provider

## Installation

```bash
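pip install synthforge                  # Core (Gaussian Copula only)
pip install "synthforge[gan]"           # + CTGAN/TVAE
pip install "synthforge[diffusion]"     # + diffusion models (torch)
pip install "synthforge[great]"         # + torch and transformers
pip install "synthforge[llm]"           # + LLM enrichment
pip install "synthforge[presidio]"      # + Presidio PII detection
pip install "synthforge[evaluation]"    # + quality reports
pip install "synthforge[all]"           # All of the above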
```
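
After installing, it can be useful to confirm which optional extras are actually usable in the current environment. The sketch below is an assumption-laden helper, not part of SynthForge: `EXTRA_MODULES` and `available_extras` are hypothetical names, and the module lists are derived from the extras declared in the package metadata. It only checks that each module can be located, not that its version satisfies the declared minimum.

```python
from importlib.util import find_spec

# Extra name -> top-level modules that the extra pulls in
# (derived from the package's declared optional dependencies).
EXTRA_MODULES = {
    "gan": ["torch"],
    "diffusion": ["torch"],
    "great": ["torch", "transformers"],
    "llm": ["litellm"],
    "presidio": ["presidio_analyzer", "presidio_anonymizer"],
    "evaluation": ["sdmetrics", "xgboost", "matplotlib", "seaborn"],
}

def available_extras():
    """Return {extra_name: bool}; True when every module of the extra is found."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in EXTRA_MODULES.items()
    }

status = available_extras()
for extra, ok in status.items():
    print(f"{extra:<12}{'installed' if ok else 'missing'}")
```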

## Architecture

```
Production Sample (DataFrame/CSV)
        │
        ▼
┌─────────────────────────────────┐
│  1. PROFILE (LLM-augmented)     │
│  • Auto-detect metadata         │
│  • Semantic column inference    │
│  • PII / MNPI detection         │
│  • Business rule extraction     │
│  • Synthesizer recommendation   │
└─────────────┬───────────────────┘
              ▼
┌─────────────────────────────────┐
│  2. FIT (Statistical/Neural)    │
│  • Reversible data transforms   │
│  • Constraint-aware training    │
│  • Auto-select or user-pick:    │
│    GaussianCopula / CTGAN /     │
│    TVAE / TabSyn / Diffusion    │
└─────────────┬───────────────────┘
              ▼
┌─────────────────────────────────┐
│  3. GENERATE (Batch)            │
│  • Configurable row count       │
│  • Batch chunking for scale     │
│  • Constraint enforcement       │
│  • PII replacement (Faker)      │
└─────────────┬───────────────────┘
              ▼
┌─────────────────────────────────┐
│  4. EVALUATE (5-layer pipeline) │
│  • Diagnostic checks            │
│  • Statistical fidelity         │
│  • ML utility (TSTR)            │
│  • Privacy (MIA, Anonymeter)    │
│  • LLM semantic validation      │
└─────────────────────────────────┘
```
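
The "Batch chunking for scale" step in stage 3 can be pictured as a generator that asks a fitted sampler for fixed-size chunks until the requested row count is reached, so that only one chunk has to sit in memory at a time. This is a minimal sketch under stated assumptions: `generate_in_batches`, `sample_batch`, and `toy_sampler` are hypothetical names for illustration, and SynthForge's internal batching may work differently.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]

def generate_in_batches(
    sample_batch: Callable[[int], List[Row]],
    num_rows: int,
    batch_size: int = 10_000,
) -> Iterator[List[Row]]:
    """Yield chunks from sample_batch until exactly num_rows rows are produced."""
    # Validation runs when iteration starts, because this is a generator.
    if num_rows < 0 or batch_size <= 0:
        raise ValueError("num_rows must be >= 0 and batch_size must be > 0")
    remaining = num_rows
    while remaining > 0:
        size = min(batch_size, remaining)  # the last chunk may be smaller
        yield sample_batch(size)
        remaining -= size

def toy_sampler(n: int) -> List[Row]:
    """Stand-in for a fitted synthesizer: returns n placeholder rows."""
    return [{"id": i} for i in range(n)]

chunk_sizes = [len(chunk) for chunk in generate_in_batches(toy_sampler, 25, batch_size=10)]
print(chunk_sizes)  # [10, 10, 5]
```

Each chunk can be written to disk or post-processed (for example, constraint enforcement or PII replacement) before the next one is requested.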

## License

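Proprietary. All rights reserved. See the `LICENSE` file for the full terms. For licensing inquiries, contact synthforge@example.com.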
