Metadata-Version: 2.4
Name: dataset-preprocessing-agent
Version: 0.1.0
Summary: LLM-based automated dataset standardization and evaluation framework
License: MIT License
        
        Copyright (c) 2025 Ton Nom
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: datasets>=3.6.0
Requires-Dist: unitxt
Requires-Dist: transformers
Requires-Dist: torch
Requires-Dist: accelerate
Requires-Dist: sentence-transformers
Requires-Dist: openai
Requires-Dist: pandas
Requires-Dist: tasksource
Provides-Extra: notebook
Requires-Dist: seaborn; extra == "notebook"
Requires-Dist: matplotlib; extra == "notebook"
Requires-Dist: ipywidgets; extra == "notebook"
Dynamic: license-file

# dataset-preprocessing-agent

**Automated dataset standardization using LLM agents.**

Every HuggingFace dataset has a unique schema (`tweet_text`, `review_body`, `sentence1`, …), making it hard to reuse models across tasks without manual mapping work. This library automates that step: given a raw dataset, an LLM inspects a small sample and produces a JSON mapping of raw column names to a canonical schema, evaluated against [Unitxt](https://github.com/IBM/unitxt) and [tasksource](https://github.com/sileod/tasksource) ground truths.

---

## Installation

```bash
pip install dataset-preprocessing-agent
```

For notebook visualization support:

```bash
pip install "dataset-preprocessing-agent[notebook]"
```

**Python 3.10+ required.**

For the API backend, export your OpenRouter key:

```bash
export OPENROUTER_API_KEY="your_key_here"
```

---

## Quick Start

```python
from dataset_preprocessing_agent.standardize_api import load_standardized_dataset

result = load_standardized_dataset("glue", config="sst2")
print(result["mapping"])
# {"task": "classification", "text": "sentence", "label": "label"}
```

### Evaluate against Unitxt ground truth

```python
from dataset_preprocessing_agent.eval import evaluate

result = evaluate(hf_name="glue", hf_config="sst2", card_id="sst2")
print(result["score"])        
print(result["gt_cols"])      # e.g. ['label', 'sentence']
print(result["pred_cols"])    # e.g. ['label', 'sentence']
```

### Evaluate against tasksource ground truth

```python
from dataset_preprocessing_agent.eval_ts import evaluate_ts

result = evaluate_ts("glue", "rte")
print(result["score"])
print(result["ts_gt"])        # GT mapping from tasksource preprocessing
```

---

## Architecture

The pipeline runs in three stages:

1. **Standardization** — an LLM inspects 5–10 raw samples and outputs a JSON mapping of raw column names to canonical fields (`task`, `text` / `text_a` + `text_b`, `label`).
2. **Mapping application** — `apply_llm_mapping` renames columns and converts integer labels to class name strings.
3. **Evaluation** — the predicted raw column set is compared to the ground-truth column set using Jaccard similarity on raw HuggingFace column names.

### Backends

| Module | Backend |
|--------|---------|
| `standardize_api` | Cloud LLM via OpenRouter API |
| `standardize_local` | Local HuggingFace model |

### Baselines

| Baseline | Method |
|----------|--------|
| `baseline_keyword_match` | Synonym dictionary matching |
| `baseline_embedding_match` | Cosine similarity via `all-MiniLM-L6-v2` |

### Evaluation backends

| Module | Ground truth |
|--------|-------------|
| `eval` | Unitxt task cards |
| `eval_ts` | tasksource preprocessing objects |

---

## Dependencies

| Package | Purpose |
|---------|---------|
| `unitxt` | Ground-truth task cards |
| `tasksource` | Ground-truth preprocessing objects |
| `datasets` | HuggingFace dataset loading |
| `transformers` + `torch` + `accelerate` | Local model inference |
| `openai` | OpenRouter API client |
| `sentence-transformers` | Embedding baseline |
| `pandas` | Result DataFrames |
