Metadata-Version: 2.4
Name: imputify
Version: 0.1.0
Summary: A library for imputation of missing data in tabular datasets with comprehensive evaluation metrics
Project-URL: Homepage, https://github.com/gabfssilva/imputify
Project-URL: Repository, https://github.com/gabfssilva/imputify
Project-URL: Issues, https://github.com/gabfssilva/imputify/issues
Author-email: Gabriel Francisco dos Santos Silva <gabfssilva@gmail.com>
License: MIT
License-File: LICENSE
Keywords: autoencoder,data-preprocessing,deep-learning,imputation,machine-learning,missing-data,scikit-learn,tabular-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: accelerate>=0.25
Requires-Dist: jsonpickle>=3.0
Requires-Dist: numpy>=2.0
Requires-Dist: pandas>=2.0
Requires-Dist: peft>=0.7
Requires-Dist: pyampute>=0.0.3
Requires-Dist: scikit-learn>=1.5
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.36
Requires-Dist: xgboost>=3.1.3
Provides-Extra: all
Requires-Dist: bitsandbytes>=0.41; extra == 'all'
Requires-Dist: cyclopts>=4.5.1; extra == 'all'
Requires-Dist: hf-transfer>=0.1; extra == 'all'
Requires-Dist: modal>=1.0; extra == 'all'
Requires-Dist: optimum-quanto>=0.2; extra == 'all'
Requires-Dist: pandas-stubs>=2.0; extra == 'all'
Requires-Dist: plotext>=5.0; extra == 'all'
Requires-Dist: rich>=13.0; extra == 'all'
Requires-Dist: skypilot[runpod]>=0.11.1; extra == 'all'
Provides-Extra: cpu
Requires-Dist: torch>=2.0; extra == 'cpu'
Provides-Extra: cuda
Requires-Dist: bitsandbytes>=0.41; extra == 'cuda'
Requires-Dist: hf-transfer>=0.1; extra == 'cuda'
Provides-Extra: experiments
Requires-Dist: cyclopts>=4.5.1; extra == 'experiments'
Requires-Dist: modal>=1.0; extra == 'experiments'
Requires-Dist: plotext>=5.0; extra == 'experiments'
Requires-Dist: rich>=13.0; extra == 'experiments'
Requires-Dist: skypilot[runpod]>=0.11.1; extra == 'experiments'
Provides-Extra: gpu
Requires-Dist: torch>=2.0; extra == 'gpu'
Provides-Extra: quantization
Requires-Dist: optimum-quanto>=0.2; extra == 'quantization'
Provides-Extra: types
Requires-Dist: pandas-stubs>=2.0; extra == 'types'
Description-Content-Type: text/markdown

# Imputify

A Python library for **evaluating and performing missing data imputation**. It measures imputation quality across three dimensions: reconstruction (how close are imputed values to the truth?), distribution preservation (are statistical properties maintained?), and predictive utility (can downstream models still perform well?).

The library is fully compatible with scikit-learn's `fit`/`transform` API and provides ready-to-use imputers: KNN, statistical baselines, autoencoders (DAE, VAE), GAIN, and a decoder-only LLM fine-tuned for tabular imputation.

> This library is part of my master's research proposal, so apart from scikit-learn compatibility, expect breaking changes. The API will stabilize as the research progresses.

---

## Why missingness matters

Not all missing data is created equal. The *mechanism* behind missingness determines which imputation methods will work. **MCAR** (Missing Completely at Random) is the easy case: values disappear randomly with no pattern, like a sensor failing at random times. **MAR** (Missing at Random) is trickier: missingness depends on *other* observed values, like high earners being more likely to skip income questions. **MNAR** (Missing Not at Random) is the hardest: missingness depends on the *missing value itself*, like very sick patients being unable to complete health surveys.

Most imputation methods assume MCAR or MAR. MNAR breaks these assumptions because the data you're trying to recover is exactly what's causing it to be missing. This is where I think LLMs might help: they can learn complex conditional distributions from the observed data and extrapolate patterns that simpler methods miss.
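The difference between mechanisms is easy to see on synthetic data. A minimal NumPy sketch (illustrative only, not part of imputify): under MCAR the observed values remain a fair sample, while under MNAR the observed mean is systematically biased because the large values are the ones that go missing.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=1_000)

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(income.size) < 0.3

# MNAR: probability of missingness depends on the value itself,
# e.g. high earners skipping the income question 60% of the time.
p_missing = 0.6 * (income > np.quantile(income, 0.7))
mnar_mask = rng.random(income.size) < p_missing

# Compare the full mean with the means of the *observed* values.
print("full:", income.mean())
print("observed under MCAR:", income[~mcar_mask].mean())
print("observed under MNAR:", income[~mnar_mask].mean())
```

Under MNAR the observed mean underestimates the true mean, which is exactly why methods that only look at the observed marginal distribution struggle there.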

## Evaluation

A good imputation isn't just "close to the true value". Imputify measures quality from three complementary perspectives:

**Reconstruction** measures point-wise accuracy: MAE, RMSE, and NRMSE for numerical features, and accuracy for categorical features.

**Distribution** measures whether statistical properties are preserved: Wasserstein distance, KS statistic, and KL divergence (how much the distributions shifted), plus correlation shift (did we break relationships between variables?).

**Predictive utility** measures downstream impact: train a model on the original data and on the imputed data, then compare the performance gap.

> Predictive metrics:
> 
> **Classification**: accuracy, precision, recall, F1 
> 
> **Regression**: R², MAE, RMSE

The overall score combines these into a single number in [0, 1]. Reconstruction and distribution errors are normalized as `1 / (1 + error)`, and the predictive component as `1 - |Δmetric|`. The final score is the mean of the three.
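As a sketch of that scheme (the function and argument names below are illustrative, not imputify's actual API):

```python
def overall_score(reconstruction_error: float,
                  distribution_error: float,
                  predictive_gap: float) -> float:
    """Combine the three evaluation perspectives into one score in [0, 1]."""
    recon = 1 / (1 + reconstruction_error)  # e.g. NRMSE; 0 error -> 1.0
    dist = 1 / (1 + distribution_error)     # e.g. Wasserstein distance
    pred = 1 - abs(predictive_gap)          # e.g. |F1_original - F1_imputed|
    return (recon + dist + pred) / 3

print(overall_score(0.0, 0.0, 0.0))  # perfect imputation scores 1.0
```

Note the asymmetry in the normalization: unbounded error metrics are squashed through `1 / (1 + error)`, while the predictive gap, already a bounded difference of scores, is simply flipped.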

## Imputers

| Imputer | Category | Description |
|---------|----------|-------------|
| `StatisticalImputer` | Baseline | Mean/median for numerical, mode for categorical |
| `KNNImputer` | Baseline | k-nearest neighbors |
| `MICEImputer` | Baseline | Multiple Imputation by Chained Equations |
| `MissForestImputer` | Baseline | Random Forest-based iterative imputation |
| `XGBoostImputer` | Baseline | XGBoost-based iterative imputation |
| `DAEImputer` | Deep Learning | Denoising AutoEncoder with swap noise |
| `VAEImputer` | Deep Learning | Variational AutoEncoder (probabilistic latent space) |
| `GAINImputer` | Deep Learning | Generative Adversarial Imputation Nets |
| `DecoderOnlyImputer` | LLM | Fine-tuned decoder-only transformer via structured JSON serialization |

## Example

```python
from imputify.imputer import DAEImputer
from imputify.missing import introduce_missing, PatternConfig
from imputify.metrics import evaluate

# Create realistic missing data (MNAR pattern)
pattern = PatternConfig(incomplete_vars=['income'], mechanism='MNAR')
X_missing, mask = introduce_missing(X, proportion=0.3, patterns=[pattern])

# Impute
imputer = DAEImputer(hidden_dim=128, epochs=100)
X_imputed = imputer.fit_transform(X_missing)

# Evaluate across all dimensions against the complete data X
results = evaluate(X, X_imputed, mask, y=y)
print(f"Overall score: {results.overall_score:.3f}")
```

## Installation

If you don't have `uv` installed, do yourself a favor and:

```bash
# Linux & macOS
curl -LsSf https://astral.sh/uv/install.sh | sh

# macOS (Homebrew)
brew install uv

# Windows
# Well, check their installation page: https://docs.astral.sh/uv/getting-started/installation/
```

Once installed, simply clone the repo and run `uv sync` to install dependencies:

```bash
git clone https://github.com/gabfssilva/imputify
cd imputify
uv sync
```

Requires Python 3.10+.

Open the project using your favorite IDE and that's it.

## License

MIT
