Metadata-Version: 2.4
Name: poljacc
Version: 0.1.1
Summary: Vocabulary separability diagnostics for text classification
Author: Jay Yu
License-Expression: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.21
Requires-Dist: scikit-learn>=1.0
Requires-Dist: matplotlib>=3.5
Requires-Dist: pandas>=1.3
Requires-Dist: scipy>=1.7
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: ruff; extra == "dev"

# poljacc

Vocabulary separability diagnostics for text classification.

Companion package for Oh and Yu (2026), *"When Sparse Beats Dense: Vocabulary Separability and Model Selection in Political Text Analysis."*

**Author**: Yongjai Yu (yongjai.yu@email.ucr.edu)

## Installation

```bash
pip install poljacc
```

For development:

```bash
git clone https://github.com/YongjaiYu/poljacc.git
cd poljacc
pip install -e ".[dev]"   # editable install plus the dev extra (pytest, ruff)
```

## Quick Start

```python
from poljacc import diagnose, compare

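# texts: document strings; labels: one class label per document.
# Both must be defined before running this snippet.
# Values in the comments below are example outputs.

# Diagnose vocabulary separability
result = diagnose(texts, labels)
print(result.jaccard)           # e.g. 0.860
print(result.recommendation)    # e.g. "High vocabulary overlap..."
result.report()                 # formatted summary
result.plot()                   # show heatmap

# Run TF-IDF baseline
baseline = compare(texts, labels)
print(baseline.f1)              # e.g. 0.724
print(baseline.classification_report)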
```

## API

### `diagnose(texts, labels, top_k=5000)`

Compute vocabulary separability between classes:

- **Jaccard similarity**: pairwise overlap of top-k vocabularies (ranked by document frequency)
- **Centroid distance**: Euclidean distance between TF-IDF class centroids
- **Recommendation**: model selection guidance based on overlap level

Returns a `DiagnosticResult` with `.jaccard`, `.jaccard_matrix`, `.centroid_distance`, `.centroid_matrix`, `.labels`, `.recommendation`, `.report()`, and `.plot()`.
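
To make the Jaccard part of this computation concrete, here is a minimal standard-library sketch. It is not the package's implementation: `top_k_vocab` and `pairwise_jaccard` are hypothetical helper names, and whitespace tokenization with lowercasing is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations


def top_k_vocab(docs, k):
    """Return the k terms with the highest document frequency in docs."""
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not term frequency).
        df.update(set(doc.lower().split()))
    # Sort by descending frequency, then alphabetically so ties break deterministically.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return {term for term, _ in ranked[:k]}


def pairwise_jaccard(texts, labels, top_k=5000):
    """Jaccard similarity of top-k vocabularies for every pair of classes."""
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    vocab = {label: top_k_vocab(docs, top_k) for label, docs in by_class.items()}
    scores = {}
    for a, b in combinations(sorted(vocab), 2):
        union = vocab[a] | vocab[b]
        scores[(a, b)] = len(vocab[a] & vocab[b]) / len(union) if union else 0.0
    return scores


# Toy data, invented for this example.
texts = [
    "cut taxes and reduce spending",
    "reduce taxes for small business",
    "expand healthcare and raise spending",
    "raise wages and expand healthcare",
]
labels = ["right", "right", "left", "left"]
scores = pairwise_jaccard(texts, labels)
print(scores)
```

On this toy data the two classes share only "and" and "spending" out of 12 distinct terms, so the single pairwise score is 2/12.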

### `compare(texts, labels, test_size=0.2, random_state=1017)`

One-click TF-IDF + LogisticRegression baseline:

- TF-IDF: unigrams + bigrams, max 50k features, sublinear TF
- Logistic Regression: C tuned via 5-fold CV over {0.01, 0.1, 1, 10, 100}

Returns a `ComparisonResult` with `.f1`, `.accuracy`, `.classification_report`, `.best_C`.
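
A baseline with these settings can be sketched directly in scikit-learn. This is an approximation under stated assumptions, not the package's code: the function name `tfidf_baseline`, the stratified split, the macro-averaged F1, and the use of `GridSearchCV` are all choices made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def tfidf_baseline(texts, labels, test_size=0.2, random_state=1017):
    """Hypothetical stand-in for a TF-IDF + logistic regression baseline."""
    # Assumption: a stratified hold-out split.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    pipeline = Pipeline([
        # Unigrams + bigrams, at most 50k features, sublinear TF scaling.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000, sublinear_tf=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Tune C with 5-fold cross-validation on the training split only.
    search = GridSearchCV(
        pipeline,
        param_grid={"clf__C": [0.01, 0.1, 1, 10, 100]},
        cv=5,
        scoring="f1_macro",  # assumption: macro-averaged F1
    )
    search.fit(X_train, y_train)
    predictions = search.predict(X_test)
    return {
        "best_C": search.best_params_["clf__C"],
        "f1": f1_score(y_test, predictions, average="macro"),
        "accuracy": accuracy_score(y_test, predictions),
    }


# Toy data, invented for this example: 20 documents per class.
texts = [f"cut taxes reduce spending bill {i}" for i in range(20)]
texts += [f"expand healthcare raise wages act {i}" for i in range(20)]
labels = ["right"] * 20 + ["left"] * 20
result = tfidf_baseline(texts, labels)
print(result)
```

Putting the vectorizer inside the pipeline means each cross-validation fold fits its own TF-IDF vocabulary, so no information from held-out folds leaks into the features.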

## Recommendation Thresholds

| Jaccard Range | Interpretation | Recommendation |
|---|---|---|
| > 0.7 | High overlap | TF-IDF recommended |
| > 0.4 and <= 0.7 | Moderate overlap | Consider both TF-IDF and neural models |
| <= 0.4 | Low overlap | Neural models may outperform |
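
The table can be read as a simple threshold function. The sketch below is illustrative only: `recommend` is a hypothetical name, and the returned strings paraphrase the table rather than reproduce the package's actual `.recommendation` text.

```python
def recommend(jaccard):
    """Map a Jaccard similarity in [0, 1] to the guidance in the table above."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard similarity must lie in [0, 1]")
    if jaccard > 0.7:
        return "High overlap: TF-IDF recommended"
    if jaccard > 0.4:
        return "Moderate overlap: consider both"
    # Remaining case: jaccard <= 0.4.
    return "Low overlap: neural models may outperform"


for value in (0.86, 0.55, 0.40):
    print(value, "->", recommend(value))
```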

## Dependencies

Requires Python >= 3.9 and the following packages: numpy (>= 1.21), scikit-learn (>= 1.0), matplotlib (>= 3.5), pandas (>= 1.3), scipy (>= 1.7).
