Metadata-Version: 2.4
Name: label-lens
Version: 0.1.0
Summary: Training Data Quality Analyzer — analyze labeled text classification datasets for quality issues
Author-email: "Michael J. Noe" <mikejnoe@gmail.com>
License-Expression: MIT
Requires-Python: >=3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.1.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: scikit-learn>=1.3.0
Provides-Extra: app
Requires-Dist: streamlit>=1.30.0; extra == "app"
Requires-Dist: plotly>=5.18.0; extra == "app"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: ruff>=0.3.0; extra == "dev"
Dynamic: license-file

---
title: LabelLens
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false
license: mit
---

# Label Lens

Training data quality analyzer for text classification datasets. Upload a CSV with text and label columns, get an automated quality report with actionable recommendations.

## Features

- **Auto-detect columns** — Automatically identifies text and label columns in your CSV
- **Class distribution analysis** — Imbalance ratio, effective number of classes, long-tail detection, suggested focal loss weights
- **Duplicate detection** — Exact duplicates and near-duplicates via TF-IDF cosine similarity, with cross-class conflicts flagged as critical
- **Label noise scoring** — Cross-validated confidence scoring to surface likely mislabels
- **Actionable report** — Severity ratings (Critical/Warning/Info) with specific recommendations
- **Interactive visualizations** — Plotly charts for exploring your data

## Quick Start

### As a web app

```bash
pip install "label-lens[app]"
streamlit run app.py
```

Or with uv:

```bash
uv sync
uv run streamlit run app.py
```

A sample dataset is included for demo purposes.

### As a library

```bash
pip install label-lens
```

```python
import pandas as pd
from label_lens import (
    analyze_distribution,
    find_exact_duplicates,
    find_near_duplicates,
    score_label_noise,
    generate_report,
)

df = pd.read_csv("your_dataset.csv")  # must have 'text' and 'label' columns

dist = analyze_distribution(df)
dups = find_exact_duplicates(df)
near_dups = find_near_duplicates(df)
noise = score_label_noise(df)

report = generate_report(dist, dups, near_dups, noise)
print(report["overall_severity"])  # "Critical", "Warning", or "Info"
print(report["recommendations"])
```

If your CSV uses different column names, use `prepare_dataframe` to standardize them:

```python
from label_lens import prepare_dataframe

df = prepare_dataframe(raw_df, text_col="content", label_col="category")
```

## Installation

Requires Python 3.13+.

```bash
# Library only (pandas, numpy, scikit-learn)
pip install label-lens

# With Streamlit app and Plotly charts
pip install "label-lens[app]"

# Development
pip install "label-lens[dev]"
```

## How It Works

**Distribution analysis** computes the imbalance ratio and an entropy-based effective class count, and identifies long-tail classes (under 1% of samples). It also calculates inverse-frequency alpha values for focal loss.
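
As a rough illustration of these metrics, here is a minimal standard-library sketch. It is not the code in `distribution.py`: the function name `summarize_distribution`, the max/min definition of the imbalance ratio, and the normalization of the alpha weights are all assumptions made for this example.

```python
import math
from collections import Counter


def summarize_distribution(labels, tail_threshold=0.01):
    """Hypothetical helper: basic class-balance statistics for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}

    # Imbalance ratio, assumed here to be largest class count / smallest class count.
    imbalance_ratio = max(counts.values()) / min(counts.values())

    # Effective number of classes: exponential of the Shannon entropy (in nats).
    entropy = -sum(p * math.log(p) for p in shares.values())
    effective_classes = math.exp(entropy)

    # Long-tail classes: share of the dataset below the threshold (1% by default).
    long_tail = sorted(cls for cls, p in shares.items() if p < tail_threshold)

    # Inverse-frequency weights, normalized to sum to 1 (one possible convention).
    inverse = {cls: 1.0 / p for cls, p in shares.items()}
    norm = sum(inverse.values())
    alpha = {cls: w / norm for cls, w in inverse.items()}

    return {
        "imbalance_ratio": imbalance_ratio,
        "effective_classes": effective_classes,
        "long_tail": long_tail,
        "alpha": alpha,
    }


stats = summarize_distribution(["spam"] * 995 + ["ham"] * 5)
print(stats["imbalance_ratio"], stats["long_tail"])
```

For a perfectly balanced two-class dataset the effective class count is 2; it shrinks toward 1 as one class comes to dominate.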

**Duplicate detection** finds exact text matches and uses TF-IDF vectorization with chunked cosine similarity to find near-duplicates. Cross-class duplicates (same text, different labels) are flagged as critical since they represent definite labeling errors.
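
The chunked similarity search could look roughly like the sketch below. It uses scikit-learn's `TfidfVectorizer`, but the function name `near_duplicate_pairs`, the `0.9` threshold, and the chunk size are illustrative assumptions, not the values used in `duplicates.py`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def near_duplicate_pairs(texts, labels, threshold=0.9, chunk_size=1000):
    """Hypothetical helper: return (i, j, similarity, cross_class) for similar pairs."""
    # TfidfVectorizer L2-normalizes rows by default, so the dot product of
    # two rows equals their cosine similarity.
    matrix = TfidfVectorizer().fit_transform(texts)
    pairs = []
    # Compare one block of rows at a time so the dense similarity block
    # stays at chunk_size x n_samples instead of n_samples x n_samples.
    for start in range(0, matrix.shape[0], chunk_size):
        chunk = matrix[start:start + chunk_size]
        sims = (chunk @ matrix.T).toarray()
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + int(r), int(c)
            if i < j:  # skip self-matches and mirrored pairs
                cross_class = labels[i] != labels[j]
                pairs.append((i, j, float(sims[r, c]), cross_class))
    return pairs


texts = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "completely different sentence here",
]
labels = ["animal", "furniture", "animal"]
for i, j, sim, cross_class in near_duplicate_pairs(texts, labels):
    print(i, j, round(sim, 3), "cross-class" if cross_class else "same class")
```

Pairs where `cross_class` is true correspond to the conflicts that the report treats as critical.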

**Noise scoring** trains a logistic regression on TF-IDF features using stratified k-fold cross-validation. For each sample, it records the model's confidence in the given label. Samples in the bottom 5% by confidence are flagged as mislabel suspects.
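
A minimal sketch of this idea with scikit-learn follows. The function name `given_label_confidence`, the fold count, and the choice to fit TF-IDF inside each fold through a pipeline are assumptions for illustration; `noise.py` may differ in all of these.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline


def given_label_confidence(texts, labels, n_splits=5, percentile=5.0, seed=0):
    """Hypothetical helper: out-of-fold confidence in each sample's given label."""
    labels = np.asarray(labels)
    # The pipeline refits TF-IDF on each training fold (an assumption of this sketch).
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

    # Each sample is scored by a model that did not see it during training.
    proba = cross_val_predict(model, texts, labels, cv=cv, method="predict_proba")

    # Probability columns follow the sorted unique labels.
    classes = np.unique(labels)
    column = np.searchsorted(classes, labels)
    confidence = proba[np.arange(len(labels)), column]

    # Flag samples at or below the chosen confidence percentile.
    cutoff = np.percentile(confidence, percentile)
    suspects = np.flatnonzero(confidence <= cutoff)
    return confidence, suspects


texts = [f"good great happy review {i}" for i in range(10)]
texts += [f"bad awful sad review {i}" for i in range(10)]
labels = ["pos"] * 10 + ["neg"] * 10
confidence, suspects = given_label_confidence(texts, labels, n_splits=2)
print(suspects, confidence[suspects])
```

Low confidence is only a hint: rare classes and genuinely ambiguous examples also score low, so flagged samples are candidates for manual review rather than confirmed mislabels.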

## Project Structure

```
label_lens/
├── ingest.py         # Column detection, validation, DataFrame prep
├── distribution.py   # Class distribution analysis + visualization
├── duplicates.py     # Exact and near-duplicate detection
├── noise.py          # Label noise scoring via cross-validated confidence
├── report.py         # Aggregate findings and generate recommendations
└── utils.py          # Shared helpers
```

## Development

```bash
# Install dev dependencies
uv sync --all-extras

# Run tests
pytest tests/ -v

# Lint and format
ruff check .
ruff format .
```

## Deployment

Label Lens is designed for deployment on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker. The `Dockerfile` at the repo root handles the build.

## Tech Stack

- Python 3.13+
- Streamlit
- pandas / numpy
- scikit-learn (TF-IDF, logistic regression, cross-validation)
- Plotly

## License

MIT

---

Built by [Mike Noe](https://mikenoe.com)
