Metadata-Version: 2.4
Name: polartox
Version: 0.1.1
Summary: NLP toolkit for annotator polarization research: synthetic datasets, polarized trees, and disagreement metrics
Author-email: SwkratisCS <swkratisgiannoutsos@gmail.com>
License-Expression: MIT
Project-URL: Repository, https://github.com/SwkratisCS/polarizedtrees
Keywords: polarization,annotation,synthetic data,nlp,disagreement,crowdsourcing,demographics,polarized trees
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Provides-Extra: ndfu
Requires-Dist: ndfu; extra == "ndfu"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# polartox

NLP toolkit for **annotator polarization research**. It provides tools for generating synthetic datasets and detecting polarization in human annotation studies.

## Install

```bash
pip install polartox

# with nDFU support (Pavlopoulos & Likas, 2024 -- github.com/ipavlopoulos/ndfu)
pip install "polartox[ndfu]"
```

## Tools

| Module | Description | Status |
|---|---|---|
| `polartox.datagen` | Synthetic annotator pool with injected, ground-truth polarization | Stable |
| `polartox.trees` | Polarized Trees detection algorithm | Coming soon |

## `polartox.datagen`

Builds a pool of annotators with explicit demographic identities and generates annotation datasets where every text independently gets **k active dimensions** (0–4) that drive its disagreement:

- **k = 0** — no dimension explains anything, a true unimodal negative control
- **k ≥ 1** — a random subset of dimensions is active, each with its own random toxic/civil lean split and a continuous intensity (`alpha`) controlling how strongly it pulls toward its pole
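
The per-text sampling described above could look roughly like the sketch below. It is not polartox's implementation: `sample_text_config`, the example `dimensions` mapping, and the reading of depth weights as relative weights over k are all assumptions made for illustration. It uses only the standard library.

```python
import random


def sample_text_config(dimensions, depth_weights, intensity_range, rng):
    """Draw one text's generative config (hypothetical structure).

    dimensions:      dict mapping dimension name -> list of at least two groups
    depth_weights:   relative weights for k = 0, 1, ..., len(depth_weights) - 1
    intensity_range: (low, high) bounds for alpha
    """
    # Pick how many dimensions are active for this text.
    k = rng.choices(range(len(depth_weights)), weights=depth_weights, k=1)[0]
    # Pick which dimensions are active; k = 0 yields an empty config.
    active = rng.sample(sorted(dimensions), k)

    config = {}
    for dim in active:
        groups = list(dimensions[dim])
        rng.shuffle(groups)
        cut = rng.randint(1, len(groups) - 1)  # keep both sides non-empty
        config[dim] = {
            "toxic": sorted(groups[:cut]),   # groups leaning toward the toxic pole
            "civil": sorted(groups[cut:]),   # groups leaning toward the civil pole
            "alpha": rng.uniform(*intensity_range),
        }
    return config


# Assumed example inputs, not the package defaults.
dimensions = {
    "age": ["18-29", "30-49", "50+"],
    "gender": ["man", "woman"],
    "politics": ["left", "center", "right"],
    "education": ["secondary", "tertiary"],
}
depth_weights = [1, 2, 2, 1, 1]  # weights for k = 0..4
rng = random.Random(42)
configs = [
    sample_text_config(dimensions, depth_weights, (0.5, 3.0), rng)
    for _ in range(200)
]
```

An empty config corresponds to the k = 0 negative control: no dimension is given a lean, so nothing in the annotators' identities explains the ratings.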

Each identity's rating distribution is the **elementwise product** of its active-dimension shapes, so signal composes rather than averaging away, and the generated texts reach the full nDFU range instead of collapsing toward the middle.
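
To see why a product behaves differently from an average, here is a minimal sketch on a 5-point scale. The linear-ramp `lean_shape` is a made-up stand-in for whatever shape polartox uses internally, and `combine_product` / `combine_mean` are hypothetical helpers. Only the contrast between the two combination rules is the point.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def lean_shape(scale, pole, alpha):
    """Hypothetical per-dimension shape: a linear ramp toward one pole.

    alpha = 0 gives a flat shape; larger alpha pulls harder toward the pole.
    """
    ramp = [1.0 + alpha * r for r in range(scale)]  # rises toward the top rating
    if pole == "civil":
        ramp.reverse()  # mirror the ramp toward the bottom rating
    return normalize(ramp)


def combine_product(shapes):
    """Multiply shapes bin by bin, then renormalize."""
    combined = [1.0] * len(shapes[0])
    for shape in shapes:
        combined = [c * s for c, s in zip(combined, shape)]
    return normalize(combined)


def combine_mean(shapes):
    """Average shapes bin by bin (the alternative the product avoids)."""
    n = len(shapes)
    return normalize([sum(column) / n for column in zip(*shapes)])


# An identity that leans toxic on two active dimensions.
shapes = [lean_shape(5, "toxic", 1.0), lean_shape(5, "toxic", 2.0)]
product = combine_product(shapes)
mean = combine_mean(shapes)
```

With these inputs the product puts more mass on the top rating and less on the bottom rating than the mean does: two agreeing signals reinforce each other under multiplication, while averaging can never be sharper than the sharpest input.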

The generative config is returned alongside the dataset as ground truth, enabling direct validation of detection algorithms.

```python
from polartox.datagen import AnnotatorPool, DEFAULT_DIMENSIONS, DEFAULT_DEPTH_WEIGHTS, DEFAULT_INTENSITY_RANGE

pool = AnnotatorPool(
    dimensions=DEFAULT_DIMENSIONS,
    scale=5,
    intensity_range=DEFAULT_INTENSITY_RANGE,
    depth_weights=DEFAULT_DEPTH_WEIGHTS,
    annotators_per_identity=10,
)

# generate_dataset returns the annotations together with the generative config
dataset, ground_truth = pool.generate_dataset(
    n_texts=100,
    n_annotators_per_text=150,
    noise=0.05,
    seed=42,
)
# dataset columns: text_id, annotator_id, <dimensions>, rating
# ground_truth: per-text active_dims, lean (toxic/civil split), and alpha (intensity)
```

nDFU scoring is not reimplemented here. It is provided by the [`ndfu`](https://github.com/ipavlopoulos/ndfu) package (Pavlopoulos & Likas, 2024), which the `polartox[ndfu]` extra installs:

```python
from ndfu import dfu, pdf

text_data = dataset[dataset["text_id"] == 0]
hist = pdf(text_data["rating"].tolist(), range(1, pool.scale + 1))
score = dfu(hist)
```

Full API documentation is available on [GitHub](https://github.com/Swkratis210204/polartox).

## Changelog

See [CHANGELOG.md](https://github.com/Swkratis210204/polartox/blob/main/CHANGELOG.md) for release history.
