Metadata-Version: 2.4
Name: open_autonlu
Version: 1.0.0
Summary: A framework for fine-tuning and few-shot learning of NLU models for text classification and NER.
Project-URL: Homepage, https://github.com/mts-ai/OpenAutoNLU
Project-URL: Repository, https://github.com/mts-ai/OpenAutoNLU
Project-URL: Issues, https://github.com/mts-ai/OpenAutoNLU/issues
Author-email: Grigory Arshinov <g.arshinov@mts.ai>, Daniil Karpov <dankarpov90@gmail.com>, Daria Samsonova <daria1208@ya.ru>, Anton Nenashev <anton.nenasheff@yandex.ru>, Alexander Boriskin <a.boriskin@mts.ai>, Ayaz Zaripov <a.zaripov1@mts.ai>
License: CC-BY-NC-4.0
License-File: LICENSE.md
Keywords: automl,data-augmentation,data-quality,few-shot,llm,ner,nlp,nlu,python,setfit,text-classification,transformers
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <3.13,>=3.12
Requires-Dist: accelerate==1.12.0
Requires-Dist: aiohttp==3.13.3
Requires-Dist: augmentex==1.3.1
Requires-Dist: crowd-kit==1.4.1
Requires-Dist: datasets==2.19.1
Requires-Dist: deprecated>=1.2.18
Requires-Dist: fonttools==4.60.2
Requires-Dist: h11==0.16.0
Requires-Dist: httpx==0.27.2
Requires-Dist: huggingface-hub==0.36.2
Requires-Dist: iobes==1.5.1
Requires-Dist: iterative-stratification==0.1.9
Requires-Dist: jsonlines==4.0.0
Requires-Dist: jupyter-core==5.8.0
Requires-Dist: matplotlib==3.9.0
Requires-Dist: mdpd==0.2.1
Requires-Dist: nltk==3.9.3
Requires-Dist: numpy==2.3.3
Requires-Dist: onnx==1.17.0
Requires-Dist: onnxruntime-gpu==1.19.2; sys_platform == 'linux' and platform_machine == 'x86_64'
Requires-Dist: onnxruntime==1.19.2; sys_platform == 'darwin' or (sys_platform == 'linux' and platform_machine != 'x86_64')
Requires-Dist: onnxscript==0.2.4
Requires-Dist: openai==1.55.3
Requires-Dist: optimum==1.27.0
Requires-Dist: optuna==3.6.1
Requires-Dist: outlines==0.1.14
Requires-Dist: pandas==2.3.3
Requires-Dist: pillow==12.1.1
Requires-Dist: plotly==6.5.2
Requires-Dist: protobuf==6.33.5
Requires-Dist: python-dotenv==1.1.0
Requires-Dist: python-fire==0.1.0
Requires-Dist: scikit-learn==1.6.1
Requires-Dist: seaborn==0.13.2
Requires-Dist: sentence-transformers==3.4.1
Requires-Dist: seqeval==1.2.2
Requires-Dist: setfit==1.1.1
Requires-Dist: setuptools==78.1.1
Requires-Dist: skl2onnx>=1.17.0
Requires-Dist: sklearn-pandas==2.2.0
Requires-Dist: spacy>=3.0.0
Requires-Dist: streamlit==1.54.0
Requires-Dist: tornado==6.5.4
Requires-Dist: transformers==4.53.0
Requires-Dist: typing-extensions==4.12.2
Requires-Dist: urllib3==2.6.3
Provides-Extra: cpu
Requires-Dist: torch==2.6.0; extra == 'cpu'
Requires-Dist: torchdata==0.9.0; (platform_machine == 'x86_64' or sys_platform == 'darwin') and extra == 'cpu'
Requires-Dist: torchsummary==1.5.1; (platform_machine == 'x86_64' or sys_platform == 'darwin') and extra == 'cpu'
Requires-Dist: torchtext==0.18.0; (platform_machine == 'x86_64' or sys_platform == 'darwin') and extra == 'cpu'
Provides-Extra: cuda
Requires-Dist: torch==2.6.0; extra == 'cuda'
Requires-Dist: torchdata==0.9.0; (platform_machine == 'x86_64' or sys_platform == 'darwin') and extra == 'cuda'
Requires-Dist: torchsummary==1.5.1; (platform_machine == 'x86_64' or sys_platform == 'darwin') and extra == 'cuda'
Requires-Dist: torchtext==0.18.0; (platform_machine == 'x86_64' or sys_platform == 'darwin') and extra == 'cuda'
Description-Content-Type: text/markdown

# OpenAutoNLU Pipeline

[![arXiv](https://img.shields.io/badge/arXiv-2603.01824-b31b1b.svg)](https://arxiv.org/abs/2603.01824)
[![Python 3.12](https://img.shields.io/badge/Python-3.12-blue.svg)](https://www.python.org/downloads/)

OpenAutoNLU is an open-source pipeline for training natural language understanding (NLU) models for **text classification** (multiclass) and **named entity recognition (NER)**. It supports few-shot learning (SetFit, AncSetFit with optional anchor labels), classic fine-tuning, data quality diagnostics, out-of-distribution (OOD) detection, optional LLM-based augmentation and synthetic test generation, and ONNX export for deployment.

You provide train (and optionally test) data; the high-level pipelines (`TextClassificationTrainingPipeline`, `TokenClassificationTrainingPipeline`) load it, run optional data-quality checks, and then **automatically choose the training method** based on the data: **AncSetFit** for very small datasets (2–5 samples per class), **SetFit** for medium-sized ones (6–80 samples per class), and **fine-tuning** for larger data. You can override configs (batch size, OOD method, augmentation, etc.) and save models in ONNX format. A Streamlit app and Docker images (CPU/GPU) are included for interactive use.
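
For intuition, here is a minimal sketch of that size-based selection rule (illustrative only: `choose_training_method` is a hypothetical name, and the real pipeline applies this logic internally):

```python
def choose_training_method(samples_per_class: int) -> str:
    # Thresholds follow the rule described above.
    if samples_per_class <= 5:    # very small: anchor-guided few-shot
        return "AncSetFit"
    if samples_per_class <= 80:   # medium: contrastive few-shot
        return "SetFit"
    return "fine-tuning"          # large: standard fine-tuning
```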

Built by MWS AI and contributors (see [pyproject.toml](pyproject.toml) for authors). Aimed at practitioners and researchers who want a single, data-driven workflow for few-shot and full-size NLU training without manually picking methods or tuning low-level knobs.

Requires Python >=3.12, <3.13.

Usage examples are located in the `examples` folder.

## Installation
To work with the repository in developer mode, install it as an editable package:
```bash
pip install -e .
```

This way you don't need to reinstall the package after code changes. To install all dependencies for development (two extras are available: `cpu` and `cuda`), run:
```bash
uv sync --extra cuda
```

## Documentation

To build and view the documentation locally:
```bash
uv sync
cd docs && uv run make html
open build/html/index.html
```
## Running with Docker


**With GPU** (recommended host: 16 GB RAM, 8 CPU cores, NVIDIA A100 40 GB, ~30 GB disk):

```bash
docker-compose up -d
```

**Without GPU** (macOS or CPU-only):

```bash
docker build --build-arg EXTRA=cpu -t open-autonlu .
docker run -p 8501:8501 open-autonlu
```

## Code examples with default parameters

### Training
```python
from open_autonlu.auto_classes import (
    TextClassificationTrainingPipeline,
    TokenClassificationTrainingPipeline
)
from open_autonlu.methods.data_types import SaveFormat

# Text Classification training
pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    test_path="test.csv",
    config_overrides={"language": "en"}  # "en" or "ru"
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)

# NER training
pipeline = TokenClassificationTrainingPipeline(
    train_path="train.json",
    test_path="test.json",
    config_overrides={"language": "en"}  # "en" or "ru"
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)
```


### Inference
```python
from open_autonlu.auto_classes import (
    TextClassificationInferenceManager,
    TokenClassificationInferenceManager
)

# Text Classification inference
inferer = TextClassificationInferenceManager("./model")
results = inferer.predict(["Hello world", "Goodbye"], batch_size=32)
for r in results:
    print(f"{r.most_probable.label}: {r.most_probable.score:.3f}")

# NER inference
ner_inferer = TokenClassificationInferenceManager("./ner_model")
results = ner_inferer.predict(["John works at Google"], batch_size=1)
for r in results:
    for entity in r.labels:
        print(f"{entity.text}: {entity.label}")
```

## Data Quality Diagnostics

The `diagnose()` method evaluates training data quality using multiple evaluators:
- `cartography` (MulticlassCLF) [Dataset Cartography](https://aclanthology.org/2020.emnlp-main.746.pdf)
- `vinfo` (MulticlassCLF) [V-Usable information](https://arxiv.org/abs/2110.08420)
- `uncertainty` (MulticlassCLF, NER)
- `retag` (MulticlassCLF, NER)
- `label aggregation` (NER)

Run the data quality stage:
```python
from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(train_path="train.csv")
evaluation_result = pipeline.diagnose()
```

## Configuration Overrides

The `config_overrides` parameter lets you customize training behavior by modifying the default configurations.

### Basic Usage

```python
from open_autonlu.auto_classes import TextClassificationTrainingPipeline
from open_autonlu.methods.data_types import OodMethod, SaveFormat

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    config_overrides={
        "language": "en",                # Prompt language for LLM pipelines ("en" or "ru")
        "ood_method": OodMethod.LOGIT,   # OOD detection method
        "batch_size": 32,                # Batch size
    }
)
result = pipeline.train()
pipeline.save("./model", SaveFormat.ONNX)
```

### OOD Detection Methods

Out-of-Distribution detection identifies inputs that don't belong to any trained class.

| Method | Description | Best for |
|--------|-------------|----------|
| `OodMethod.AUTO` | Auto-select based on training method | Default |
| `OodMethod.MARGINAL_MAHALANOBIS_OOD` | Mahalanobis distance from embedding distribution | Finetuning |
| `OodMethod.MSP_OOD` | Maximum Softmax Probability threshold | SetFit, AncSetFit |
| `OodMethod.LOGIT` | Adds an `outOfScope` class during training | Alternative approach |
| `OodMethod.NONE` | Disable OOD detection | When not needed |

The `threshold_factor` parameter controls OOD detection sensitivity. It is a multiplier applied to the OOD detection threshold. Higher values make detection more conservative (fewer samples are marked as OOD), while lower values make it more aggressive (more samples are flagged as OOD). Default value is `1.0`.

```python
from open_autonlu.methods.data_types import OodMethod

# Override ood_method and adjust sensitivity
config_overrides = {
    "ood_method": OodMethod.MARGINAL_MAHALANOBIS_OOD,
    "threshold_factor": 1.5,  # More conservative OOD detection
}
```

### LLM Data Augmentation

Automatically augment underrepresented classes using LLM generation. The `language` parameter controls which prompts are sent to the LLM (`"en"` for English, `"ru"` for Russian).

```python
import os
from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",
    config_overrides={
        "language": "en",
        "llm_augmentation": {
            "enabled": True,
            "use_domain_analysis": True,  # Analyze domain for better prompts
            "threshold": 81,               # Augment classes with < 81 samples
            "max_attempts": 10,            # Max generation attempts
            "num_shot": 5,                 # Examples in prompt
            "config_overrides": {
                "LlmClientConfig": {
                    "api_key": os.environ["MODEL_API_KEY"],
                    "model_id": "gpt-4",
                }
            }
        }
    }
)
```

### Synthetic Test Generation

Generate synthetic test data using LLM when no test set is provided.

```python
import os
from open_autonlu.auto_classes import TextClassificationTrainingPipeline

pipeline = TextClassificationTrainingPipeline(
    train_path="train.csv",  # No test_path provided
    config_overrides={
        "language": "en",  
        "llm_test_generation": {
            "enabled": True,
            "num_samples_per_class": 100,
            "use_domain_analysis": True,
            "synthetic_test_path": "./synthetic_test.csv",  # Save generated data
            "config_overrides": {
                "LlmClientConfig": {
                    "api_key": os.environ["MODEL_API_KEY"],
                    "model_id": "gpt-4",
                }
            }
        }
    }
)
result = pipeline.train()  # Test data generated automatically
```

### Method-Specific Overrides

```python
# SetFit configuration
config_overrides = {
    "SetFitMethodConfig": {
        "num_iterations": 25,
        "body_lr": 2e-5,
        "batch_size": 16,
    }
}

# Finetuner configuration
config_overrides = {
    "FinetunerConfig": {
        "num_hpo_trials": 15,  # Hyperparameter optimization trials
    }
}
```

## Data Formats

### Text Classification (CSV)

```csv
text,label,anc_label
"Remove my meeting tomorrow",calendar_remove,remove calendar event
"Add a dentist appointment on Friday",calendar_set,add calendar event
```

The `anc_label` column is optional: it holds a short, human-readable natural language description of what the class means (the anchor label used by AncSetFit).
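
If you build the training file programmatically, here is a minimal sketch (using pandas, which the package already depends on) that writes a CSV in this layout:

```python
import pandas as pd

# Column names match the CSV layout shown above; anc_label is optional.
rows = [
    {"text": "Remove my meeting tomorrow",
     "label": "calendar_remove", "anc_label": "remove calendar event"},
    {"text": "Add a dentist appointment on Friday",
     "label": "calendar_set", "anc_label": "add calendar event"},
]
pd.DataFrame(rows).to_csv("train.csv", index=False)
```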

### NER (JSON)

The package supports two NER data formats:

**Offsets format** — entities are defined by character spans with `start` and `end` positions:

```json
[
  {"text": "What time is it in Australia", "spans": [{"start": 19, "end": 28, "label": "place_name"}]},
  {"text": "What is the forecast today for Moscow", "spans": [{"start": 21, "end": 26, "label": "date"}, {"start": 31, "end": 37, "label": "place_name"}]}
]
```

**Brackets format** — entities are marked inline using `[label : entity]` notation:

```json
[
  {"text": "play a track by [artist : the rolling stones]"},
  {"text": "play [song : hello] by [artist : adele]"}
]
```

## Example data

The files in `examples/test_data/noise_n_shot_data/` (text classification) and `examples/test_data/noise_n_shot_data_ner/` (NER) were generated with external sampling scripts.

- **Text classification:** the scripts use the [**SNIPS**](https://huggingface.co/datasets/DeepPavlov/snips) dataset (intent/slot-style). They build train/test splits with optional n-shot sampling and label noise. In the included example, 1% of training labels were noised, i.e. randomly flipped to another class (see the sketch after this list). The resulting CSVs follow the formats described above.
- **NER:** the scripts use the [**MASSIVE**](https://huggingface.co/datasets/DeepPavlov/massive) dataset. They produce few-shot train/test subsets with optional label noise (1% of labels noised) and export data in the offsets/BIO-style JSON expected by the NER pipeline.
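
For reference, a minimal sketch of the label-noising step described above (the actual sampling scripts are external, so details may differ):

```python
import random

def flip_labels(labels: list[str], noise_rate: float = 0.01, seed: int = 42) -> list[str]:
    """Randomly reassign a fraction of labels to a different class (label noise).

    Assumes at least two distinct classes are present.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    noisy = list(labels)
    # Pick noise_rate * N indices and flip each to a uniformly chosen other class.
    for i in rng.sample(range(len(labels)), k=round(noise_rate * len(labels))):
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy
```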