Metadata-Version: 2.4
Name: tabular-bank
Version: 0.1.0
Summary: A contamination-proof tabular ML benchmark — drop-in replacement for TabArena with procedurally generated synthetic datasets
Author: tabular-bank contributors
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/jxucoder/tabular-bank
Project-URL: Repository, https://github.com/jxucoder/tabular-bank
Project-URL: Issues, https://github.com/jxucoder/tabular-bank/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Provides-Extra: benchmark
Requires-Dist: tabarena; extra == "benchmark"
Requires-Dist: autogluon.tabular>=1.0; extra == "benchmark"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"

# tabular-bank

A contamination-proof tabular ML benchmark — drop-in replacement for [TabArena](https://github.com/autogluon/tabarena) with procedurally generated synthetic datasets.

## Why tabular-bank?

TabArena is the leading benchmark for tabular ML models, but it uses real-world datasets that may already appear in the training data of LLMs and tabular foundation models. `tabular-bank` addresses this by generating datasets **procedurally from a secret seed**: the repo contains only the generation engine, and no dataset-specific information is ever committed.

### Anti-Contamination Architecture

- **Procedural structure**: Feature specs, DAG topology, mechanism families, coefficients, and noise models are generated from the seed
- **Cryptographic seed derivation**: HMAC-SHA256 ensures datasets are unpredictable without the master secret
- **Rotating benchmark rounds**: Each round uses a fresh seed; past rounds' seeds are published after expiry
- **Auditable fairness**: All generation code is public — anyone can verify the engine is unbiased
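
The seed-derivation idea can be illustrated with the standard library alone. This is a minimal sketch, not the engine's actual code: the `derive_seed` helper, the use of the round ID as the HMAC message, and the 64-bit truncation are all assumptions made for illustration.

```python
import hashlib
import hmac


def derive_seed(master_secret: str, label: str) -> int:
    """Derive a 64-bit integer seed from a secret and a public label.

    Hypothetical helper: the real message layout and seed width may differ.
    """
    digest = hmac.new(
        master_secret.encode("utf-8"),  # key: the private master secret
        label.encode("utf-8"),          # message: a public label such as the round ID
        hashlib.sha256,
    ).digest()
    # Keep the first 8 bytes so the seed fits in an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")


round_seed = derive_seed("your-secret", "round-001")
next_round_seed = derive_seed("your-secret", "round-002")
```

The same secret and label always give the same seed, while anyone who only knows the public label cannot feasibly predict it without the secret.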

## Installation

```bash
pip install tabular-bank

# With TabArena integration for official benchmarking
pip install "tabular-bank[benchmark]"
```

## Quick Start

### Generate Datasets

```bash
# Via CLI
tabular-bank generate --round round-001 --secret "your-secret" --n-scenarios 10
```

```python
# Via Python
from tabular_bank.generation.generate import generate_all

generate_all(master_secret="your-secret", round_id="round-001", n_scenarios=10)
```

### Run a Benchmark

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from tabular_bank.context import TabularBankContext
from tabular_bank.runner import run_benchmark
from tabular_bank.leaderboard import generate_leaderboard, format_leaderboard

# Models to benchmark
models = {
    "GBM": GradientBoostingClassifier(n_estimators=100),
    "RF": RandomForestClassifier(n_estimators=100),
}

# Run benchmark
result = run_benchmark(
    models=models,
    round_id="round-001",
    master_secret="your-secret",
)

# Generate leaderboard
leaderboard = generate_leaderboard(result)
print(format_leaderboard(leaderboard))
```

### Inspect Datasets

```bash
tabular-bank info --round round-001
```

Instead of passing them explicitly, you can supply the master secret and cache location through the `TABULAR_BANK_SECRET` and `TABULAR_BANK_CACHE` environment variables.
The legacy names `SYNTHETIC_TAB_SECRET` and `SYNTHETIC_TAB_CACHE` are still accepted.
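
One plausible way to resolve a setting with a legacy fallback is sketched below. The `resolve_setting` helper and the precedence rule (current name wins over legacy name) are assumptions for illustration; the package's own lookup logic may differ.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    current_name: str,
    legacy_name: str,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the value of current_name, else legacy_name, else None.

    Hypothetical helper: the precedence shown here is an assumption.
    """
    env = os.environ if env is None else env
    value = env.get(current_name)
    if value:
        return value
    # Fall back to the legacy variable; treat an empty string as unset.
    return env.get(legacy_name) or None


secret = resolve_setting("TABULAR_BANK_SECRET", "SYNTHETIC_TAB_SECRET")
cache_dir = resolve_setting("TABULAR_BANK_CACHE", "SYNTHETIC_TAB_CACHE")
```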

## Architecture

```mermaid
flowchart TD
    A["Secret + Round ID"] --> B["HMAC-SHA256"]
    B --> C["Round Seed"]
    C --> D["Scenario Sampler"]
    D --> E["Scenario Config\n(problem type, features, difficulty,\nmissing values, imbalance, etc.)"]

    E --> FS["Feature Seed"]
    E --> DS["DAG Seed"]
    E --> AS["Data Seed"]
    E --> SS["Split Seed"]

    FS --> FG["Feature Generator"]
    FG --> FO["Names · Types · Distributions"]

    DS --> DB["DAG Builder"]
    DB --> DO["Causal Graph · Sampled Mechanisms\n(spline, tanh, interaction, etc.)"]

    AS --> SM["Sampler"]
    SM --> SO["Tabular DataFrame\n+ Heteroscedastic Residuals"]

    SS --> SG["Split Generator"]
    SG --> SGO["Cross-Validation Folds\n(10 repeats × 3 folds)"]
```
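
The last stage of the diagram, repeated cross-validation driven by a split seed, can be sketched without any dependencies. This is an illustrative, unstratified version; `repeated_kfold` is a hypothetical name, and the real split generator may stratify by class or use scikit-learn's splitters instead.

```python
import random


def repeated_kfold(n_samples, n_repeats=10, n_folds=3, seed=0):
    """Return (train_idx, test_idx) pairs, one per repeat and fold.

    Hypothetical sketch: plain shuffled K-fold, no stratification.
    """
    rng = random.Random(seed)  # all randomness comes from the split seed
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for k in range(n_folds):
            # Every n_folds-th shuffled index goes into fold k.
            test_idx = sorted(order[k::n_folds])
            test_set = set(test_idx)
            train_idx = [i for i in range(n_samples) if i not in test_set]
            splits.append((train_idx, test_idx))
    return splits


splits = repeated_kfold(n_samples=12, seed=1234)
print(len(splits))  # 10 repeats x 3 folds = 30 train/test pairs
```

Because the shuffles depend only on the seed, two users with the same split seed evaluate on identical folds.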

## Parametric Scenario Sampling

Rather than relying on fixed hand-crafted templates, `tabular-bank` samples all scenario parameters from a continuous space, following a CausalProfiler-inspired coverage guarantee: any valid configuration has non-zero probability of being generated. This produces diverse, non-redundant benchmark tasks.

**Sampled axes include:**
- Problem type: binary classification, multiclass, regression
- Feature count, sample size, categorical ratio
- Difficulty: noise scale, nonlinearity probability, interaction probability, heteroscedastic noise probability, DAG edge density
- DAG complexity: confounder count and strength, max parent count
- Missing values: rate and mechanism (MCAR / MAR / MNAR)
- Class imbalance ratio (binary tasks)
- Temporal autocorrelation in root features
- Root feature correlations (multivariate Gaussian)
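
To make the three missing-value mechanisms in the list above concrete, here is a small dependency-free sketch. The `missing_mask` function and its rank-based probability rule are assumptions chosen for illustration, not the engine's implementation.

```python
import random


def missing_mask(values, other, rate, mechanism, rng):
    """Return a list of booleans, True where a value should be masked.

    Hypothetical sketch. MCAR ignores the data, MAR depends on another
    observed column, and MNAR depends on the masked column itself.
    """
    n = len(values)
    if mechanism == "MCAR":
        probs = [rate] * n
    elif mechanism in ("MAR", "MNAR"):
        driver = other if mechanism == "MAR" else values
        order = sorted(range(n), key=lambda i: driver[i])
        probs = [0.0] * n
        # Missingness probability grows linearly with the driver's rank,
        # so the average probability stays close to `rate`.
        for rank, i in enumerate(order):
            probs[i] = min(1.0, 2.0 * rate * rank / max(n - 1, 1))
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")
    return [rng.random() < p for p in probs]


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
z = [rng.gauss(0.0, 1.0) for _ in range(2000)]
mask = missing_mask(x, z, rate=0.2, mechanism="MNAR", rng=rng)
```

Under MNAR in this sketch, larger values of `x` are more likely to be masked, so the observed values are biased downward, which is exactly what makes this mechanism hard for imputation.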

Edges are not restricted to a small fixed enum of functional forms. Each edge samples a
structured mechanism specification, with families including linear, threshold,
sigmoid, tanh, piecewise-linear, sinusoidal, spline, and interaction effects.
Non-root nodes can also sample heteroscedastic residual noise models whose
variance depends on one of their parents.
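
The following sketch shows how a child node might combine per-edge mechanisms with parent-dependent noise. It covers only a few of the families named above (piecewise-linear, spline, and interaction effects are omitted), and the names `MECHANISMS` and `sample_child`, along with the `1 + |parent|` noise scaling, are assumptions for illustration.

```python
import math
import random

# A few mechanism families, each mapping a parent value to a contribution.
MECHANISMS = {
    "linear": lambda x: x,
    "threshold": lambda x: 1.0 if x > 0.0 else 0.0,
    "sigmoid": lambda x: 0.5 * (1.0 + math.tanh(0.5 * x)),  # overflow-safe form
    "tanh": math.tanh,
    "sinusoidal": math.sin,
}


def sample_child(parents, edges, rng, base_scale=0.1, hetero_parent=None):
    """Compute one child value from its parent values.

    parents: dict mapping parent name to value
    edges: list of (parent_name, family, weight) triples
    hetero_parent: optional parent whose magnitude inflates the noise scale
    """
    signal = sum(w * MECHANISMS[f](parents[p]) for p, f, w in edges)
    scale = base_scale
    if hetero_parent is not None:
        # Heteroscedastic residual: noise grows with the chosen parent.
        scale *= 1.0 + abs(parents[hetero_parent])
    return signal + rng.gauss(0.0, scale)


rng = random.Random(42)
value = sample_child(
    parents={"age": 0.3, "income": -1.2},
    edges=[("age", "tanh", 1.5), ("income", "threshold", -0.8)],
    rng=rng,
    hetero_parent="income",
)
```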

```python
from tabular_bank.generation.engine import generate_sampled_datasets

datasets = generate_sampled_datasets(
    master_secret="your-secret",
    round_id="round-001",
    n_scenarios=20,
)
```

## TabArena Compatibility

`tabular-bank` is designed as a drop-in replacement for TabArena. Generated datasets can be converted to TabArena's `UserTask` format for use with TabArena's full evaluation pipeline (8-fold bagging, standardized HPO, ELO leaderboards).

```python
from tabular_bank.context import TabularBankContext

ctx = TabularBankContext(round_id="round-001", master_secret="your-secret")
tabarena_tasks = ctx.get_tabarena_tasks()  # Requires tabarena package
```

## License

Apache-2.0
