Metadata-Version: 2.4
Name: granuscore
Version: 1.0.0
Summary: Granularity scoring for natural language
Author-email: Lukas Ellinger <lukas.ellinger@tum.de>
License-Expression: MIT
Project-URL: Repository, https://github.com/lukasellinger/granuscore
Project-URL: Issues, https://github.com/lukasellinger/granuscore/issues
Keywords: granularity,natural language processing,nlp,text analysis,semantic abstraction,specificity
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: <3.13,>=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: faiss-cpu>=1.13.2
Requires-Dist: hierarchy_transformers>=0.1.1
Requires-Dist: lightgbm>=4.6.0
Requires-Dist: spacy>=3.8.11
Requires-Dist: torch>=2.9.1
Requires-Dist: numpy>=1.26.4
Requires-Dist: platformdirs>=3.10.0
Requires-Dist: requests>=2.31.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: openai>=2.21.0
Provides-Extra: dev
Requires-Dist: optuna>=4.6.0; extra == "dev"
Requires-Dist: pygam>=0.12.0; extra == "dev"
Requires-Dist: python-dotenv>=1.2.1; extra == "dev"
Requires-Dist: ollama>=0.6.1; extra == "dev"
Requires-Dist: google-genai>=1.66.0; extra == "dev"
Requires-Dist: wget>=3.2; extra == "dev"
Dynamic: license-file

[![PyPI version](https://img.shields.io/pypi/v/granuscore.svg)](https://pypi.org/project/granuscore/)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/lukasellinger/granuscore)
# Granuscore

**Granuscore** is a Python library for measuring the *semantic granularity* of natural language text.

It provides an end-to-end pipeline that:

1. splits text into referential units,
2. assigns continuous granularity scores to each unit,
3. aggregates these scores into document-level estimates.
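
The three stages above can be pictured with a minimal, self-contained sketch. It is an illustration only, not the library's implementation: `split_into_units`, `score_document`, and `dummy_scorer` are hypothetical names, the sentence-level split stands in for the real claim splitter, the mean aggregation is an assumption, and the dummy scorer carries no semantic meaning.

```python
import re
from statistics import mean
from typing import Callable


def split_into_units(text: str) -> list[str]:
    """Naive stand-in for stage 1: one unit per sentence."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def score_document(text: str, score_unit: Callable[[str], float]) -> dict:
    """Run the split -> score -> aggregate stages with a pluggable unit scorer."""
    units = split_into_units(text)                 # stage 1: split
    unit_scores = [score_unit(u) for u in units]   # stage 2: score each unit
    # Stage 3: aggregate. The mean is an assumption made for this sketch.
    document_score = mean(unit_scores) if unit_scores else None
    return {
        "units": units,
        "unit_scores": unit_scores,
        "document_score": document_score,
    }


def dummy_scorer(unit: str) -> float:
    """Placeholder score with no semantic meaning (inverse word count)."""
    return 1.0 / len(unit.split())


result = score_document(
    "Tony Hawk was born in San Diego. He is a skateboarder.", dummy_scorer
)
print(result["units"])
print(result["document_score"])
```

Passing the scorer as a callable keeps the splitting and aggregation logic independent of whichever model produces the per-unit scores.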

Granuscore is designed for analyzing how *fine-grained* or *coarse-grained* textual expressions are in applications such as question answering, educational dialogue, summarization, and scientific writing.

---

## Installation

Install from PyPI:

```bash
pip install granuscore
```

Or install the latest development version locally:

```bash
git clone https://github.com/lukasellinger/granuscore.git
cd granuscore
pip install -e .
```

Optional development dependencies:

```bash
pip install -e ".[dev]"
```

---

## Quick Start

```python
from granuscore import GranuScore

scorer = GranuScore()

text = """
Tony Hawk was born in San Diego.
"""

score = scorer(text)

print(score)
```

By default, Granuscore returns percentile scores, where higher values correspond to coarser-grained expressions.
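
To make the idea of a percentile score concrete, the sketch below maps a raw score onto a percentile relative to a reference distribution. It is a generic illustration under stated assumptions: `to_percentile` is a hypothetical helper, the reference values are made up, and the 0 to 100 scale is chosen for the example, since the scale Granuscore reports is not stated here.

```python
from bisect import bisect_right


def to_percentile(raw_score: float, reference: list[float]) -> float:
    """Return the percentage of reference values that are <= raw_score."""
    ordered = sorted(reference)
    rank = bisect_right(ordered, raw_score)  # number of values <= raw_score
    return 100.0 * rank / len(ordered)


# Made-up reference distribution, for illustration only.
reference_scores = [0.1, 0.2, 0.4, 0.7, 0.9]

print(to_percentile(0.4, reference_scores))  # 3 of the 5 values are <= 0.4
```

Under the convention stated above, a unit landing at a higher percentile would be read as coarser-grained than most of the reference distribution.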

---

## Default Configuration

The default configuration reproduces the setup used in the paper.

```python
scorer = GranuScore()
```

This is equivalent to:

```python
scorer = GranuScore(
    predictor_type="hit",
)
```

Default settings:

- `predictor_type="hit"`
- `model_name="Hierarchy-Transformers/HiT-MiniLM-L12-WordNetNoun"`
- `search_method="random_anchors"`
- `random_anchors_k=999`

The following required artifacts are automatically downloaded and cached on first use:

- FAISS indices,
- anchor vectors,
- LightGBM models,
- reference percentile distributions.
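
The download-on-first-use behavior follows a common pattern that can be sketched as below. This is not Granuscore's actual caching code: `ensure_cached`, the `fetch` callable, the file name, and the cache location are all assumptions made for the example, and the fake fetcher stands in for a real HTTP download.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_cached(name: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return the cached file for `name`, fetching it only if it is missing."""
    target = cache_dir / name
    if target.exists():
        return target  # cache hit: nothing is downloaded
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a partial file under the final name.
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(fetch(name))
    partial.replace(target)
    return target


calls = []


def fake_fetch(name: str) -> bytes:
    """Stand-in for a real download; records how often it is invoked."""
    calls.append(name)
    return b"demo-bytes"


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = Path(tmp_dir) / "granuscore-demo"  # hypothetical cache location
    first = ensure_cached("anchors.bin", cache, fake_fetch)
    second = ensure_cached("anchors.bin", cache, fake_fetch)
    print(first == second, len(calls))
```

The second call returns the same path without invoking the fetcher again, which is the property that makes repeated `GranuScore()` construction cheap after the first run.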

---

## Important Compatibility Note

The default configuration works out of the box and is the recommended setup.

If you customize any of the following components, you must ensure that all resources are compatible with each other:

- the embedding model,
- the search method,
- the FAISS index,
- the anchor vectors,
- the LightGBM model.

For example, a LightGBM model trained using:

```python
search_method="random_anchors"
```

should not be combined with:

```python
search_method="nearest_neighbor"
```

Similarly, FAISS indices, anchor vectors, percentile reference distributions, and LightGBM models must originate from the same embedding space and training configuration.

Compatibility between custom resources is not validated automatically.
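
Since no automatic validation is performed, one option is to record the provenance of each custom resource yourself and compare the records before building a scorer. The sketch below is a hypothetical helper, not part of the Granuscore API: `ResourceSpec` and `find_mismatches` are invented for the example, and it assumes you track the embedding model and search method each resource was built with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSpec:
    """Hand-maintained provenance record for one custom resource."""
    kind: str            # e.g. "faiss_index" or "lightgbm_model"
    model_name: str      # embedding model the resource was built with
    search_method: str   # search method used during training or indexing


def find_mismatches(specs: list[ResourceSpec]) -> list[str]:
    """Compare every spec against the first one and describe any differences."""
    problems = []
    if not specs:
        return problems
    base = specs[0]
    for spec in specs[1:]:
        for field in ("model_name", "search_method"):
            if getattr(spec, field) != getattr(base, field):
                problems.append(
                    f"{spec.kind}: {field}={getattr(spec, field)!r} "
                    f"differs from {base.kind} ({getattr(base, field)!r})"
                )
    return problems


hit = "Hierarchy-Transformers/HiT-MiniLM-L12-WordNetNoun"
specs = [
    ResourceSpec("faiss_index", hit, "random_anchors"),
    ResourceSpec("lightgbm_model", hit, "nearest_neighbor"),  # deliberate mismatch
]

for problem in find_mismatches(specs):
    print(problem)
```

A check like this only catches mismatches in the fields you record; it cannot confirm that two resources truly share an embedding space.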

---

## Notebook Tutorial

An interactive introduction is available in:

```text
notebooks/getting_started.ipynb
```

---

## Repository Structure

```text
granuscore/
├── src/
│   └── granuscore/
│       ├── pipeline.py
│       ├── granularity_predictor.py
│       ├── claim_splitter.py
│       ├── bucket_output.py
│       ├── cache.py
│       └── artifacts.py
├── notebooks/
│   └── getting_started.ipynb
├── training_scripts/
├── evaluation/
├── assets/
├── data/ (not in the repository; must be downloaded separately)
├── pyproject.toml
├── LICENSE
└── README.md
```

---

## Reproducing Paper Experiments

The datasets and precomputed resources required to reproduce the experiments from the paper are available here:

https://drive.google.com/drive/folders/1mJdUENOxHEiuYn-_f1KRQ1PZggXJDnb4?usp=sharing

Download the archive and extract it into the repository root:

```bash
unzip data.zip
```

This creates the directory structure that the training and evaluation scripts expect.

---

## Training Pipeline

Training uses precomputed `.pkl` feature files.

1. Generate the precomputed datasets with the scripts in:

```text
training_scripts/build_precalc_data/
```

2. Train LightGBM models:

```bash
python training_scripts/train_lgb_models.py
```

---

## Citation

Updated citation information will be added after publication.

```bibtex
@misc{ellinger2026granuscore,
  title={Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering},
  author={Ellinger, Lukas and Fichtl, Alexander M. and Anschütz, Miriam and Groh, Georg},
  year={2026}
}
```
