Metadata-Version: 2.4
Name: entroprisal
Version: 0.6.0
Summary: Calculate entropy-based linguistic metrics on text using reference corpora
Project-URL: Homepage, https://github.com/learlab/entroprisal
Project-URL: Repository, https://github.com/learlab/entroprisal
Author: Langdon Holmes
License: MIT License
        
        Copyright 2025 Langdon Holmes
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: entropy,linguistics,nlp,surprisal,text-analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: polars>=0.19.0
Requires-Dist: pyarrow>=21.0.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: all
Requires-Dist: huggingface-hub>=0.20.0; extra == 'all'
Requires-Dist: spacy>=3.5.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: build>=0.10.0; extra == 'dev'
Requires-Dist: ipykernel>=6.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: twine>=4.0.0; extra == 'dev'
Provides-Extra: hf
Requires-Dist: huggingface-hub>=0.20.0; extra == 'hf'
Provides-Extra: spacy
Requires-Dist: spacy>=3.5.0; extra == 'spacy'
Description-Content-Type: text/markdown

# entroprisal

Calculate information-theoretic linguistic metrics on text using reference corpora.

## Overview

`entroprisal` is a Python package that computes various entropy and surprisal metrics for text analysis. It provides three main calculators:

- **TokenEntropisalCalculator**: Token-level n-gram entropy and surprisal
- **CharacterEntropisalCalculator**: Character-level entropy and surprisal
- **RestOfWordEntropisalCalculator**: Character-level rest-of-word entropy and surprisal (bidirectional: left-to-right and right-to-left word completion)

These metrics are useful for analyzing text complexity, readability, and information content.

### Metrics at a glance

`n` denotes the conditioning context length (preceding tokens or characters for token / character calculators; prefix or suffix length for the rest-of-word calculator). Each cell lists the `n` values supported.

| Calculator | Direction | Surprisal | Entropy | Entropy reduction | Entropy difference |
|---|---|---|---|---|---|
| `TokenEntropisalCalculator` | forward | n = 1, 2, 3 | n = 1, 2, 3 | n = 1, 2, 3 | n = 1, 2, 3 |
| `CharacterEntropisalCalculator` | forward | n = 1, 2, 3 | n = 1, 2, 3 | n = 1, 2, 3 | n = 1, 2, 3 |
| `RestOfWordEntropisalCalculator` | forward and backward | n = 1, 2, 3 | n = 1, 2, 3 | n = 1, 2, 3 | &mdash; |

At `n=1`, entropy reduction uses the corpus-wide marginal entropy (`H(W_t)`, `H(c_i)`, or `H(W)`) as the Distribution A baseline. Averaged over contexts in proportion to their corpus frequency, this reduction equals the mutual information between the target and the single conditioning context element.
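
As a toy illustration of the `n=1` case, the sketch below uses made-up bigram counts (not the package's reference data) and only the standard library. It computes the marginal entropy `H(W_t)`, the per-context reductions `H(W_t) - H(W_t | w_{t-1})`, and their frequency-weighted average, and checks that average against the mutual information computed directly. All names here are hypothetical and are not part of the `entroprisal` API.

```python
import math
from collections import Counter

# Hypothetical toy bigram counts: (previous word, target word) -> frequency.
bigram_counts = {
    ("the", "dog"): 4,
    ("the", "cat"): 4,
    ("a", "dog"): 1,
    ("a", "bird"): 7,
}
total = sum(bigram_counts.values())


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)


# Marginal counts of the target word and of the context word.
target_counts = Counter()
context_counts = Counter()
for (prev, word), c in bigram_counts.items():
    target_counts[word] += c
    context_counts[prev] += c

h_marginal = entropy(target_counts.values())  # H(W_t)

# Per-context reduction: H(W_t) - H(W_t | w_{t-1} = prev).
reduction = {}
for prev in context_counts:
    cond = [c for (p, _), c in bigram_counts.items() if p == prev]
    reduction[prev] = h_marginal - entropy(cond)

# Weighting each reduction by P(prev) gives I(W_{t-1}; W_t).
expected_reduction = sum(
    context_counts[prev] / total * r for prev, r in reduction.items()
)

# Mutual information computed directly from the joint distribution.
mutual_information = sum(
    (c / total)
    * math.log2(
        (c / total)
        / ((context_counts[p] / total) * (target_counts[w] / total))
    )
    for (p, w), c in bigram_counts.items()
)

print(reduction)
print(expected_reduction, mutual_information)
```

The two printed totals agree up to floating-point error, while the individual per-context reductions differ from one context to the next.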

## Installation

### Basic Installation

```bash
pip install entroprisal[all]
```

The package will automatically download reference data files from Hugging Face Hub when first used (~4GiB total).

spaCy and Hugging Face Hub are optional dependencies that provide additional functionality. A minimal installation without them is also possible:

```bash
pip install entroprisal
```

The minimal installation is offered mainly as a more stable option, because the spaCy and Hugging Face Hub dependencies are updated frequently and can cause installation problems in some environments. The full installation (`pip install entroprisal[all]`) is still recommended for the best experience and performance.

### Optional Dependencies included in `all`

`huggingface-hub` is used for faster downloads with caching (recommended).

`spacy` is used for tokenization and for classifying content words vs. function words in your target text.

If you use spaCy, you also need to download a spaCy language model:

```bash
python -m spacy download en_core_web_lg
```

### Development Installation

```bash
# Clone the repository
git clone https://github.com/learlab/entroprisal.git
cd entroprisal

# Install in editable mode with dev dependencies
uv pip install -e .[dev]
```

## Data Files

The calculators rely on the reference files listed below. The n-gram files are downloaded automatically from [Hugging Face Hub](https://huggingface.co/datasets/langdonholmes/slimpajama-ngrams) on first use, while the word list ships with the package:

- `google-books-dictionary-words.txt` - Word frequencies (included in package)
- `4grams_aw.parquet` - All-word 4-gram frequencies (~2GiB)
- `4grams_cw.parquet` - Content-word 4-gram frequencies (~1.8GiB)

Files are cached locally to avoid re-downloading. To use the faster Hugging Face Hub downloader with resume capability, install with `pip install entroprisal[hf]` or `pip install entroprisal[all]`.

## Quick Start

### Text Preprocessing

For best results, preprocess your text using the `preprocess_text()` function, which uses spaCy for tokenization. This ensures consistency with how the reference corpora were prepared.

```python
from entroprisal import preprocess_text

# Preprocess text (requires spaCy: pip install entroprisal[spacy])
text = "The quick brown fox jumps over the lazy dog."
tokens = preprocess_text(text)
# [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']]

# For content-word-only analysis (nouns, verbs, adjectives, adverbs)
content_tokens = preprocess_text(text, content_words_only=True)
# [['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']]
```

### Token-Level Entropy and Surprisal

```python
from entroprisal import TokenEntropisalCalculator
from entroprisal.utils import load_4grams

# Load reference n-gram data
ngrams = load_4grams("aw")  # "aw" = all words, "cw" = content words

# Initialize calculator
calc = TokenEntropisalCalculator(ngrams, min_frequency=100)

# Calculate metrics for a list of tokens
tokens = ["the", "quick", "brown", "fox"]
metrics = calc.calculate_metrics(tokens)

print(metrics)
# Output includes (per-document means over attested positions):
# - ngram_surprisal_1, ngram_surprisal_2, ngram_surprisal_3
# - ngram_entropy_1, ngram_entropy_2, ngram_entropy_3
# - entropy_reduction_2, entropy_reduction_3
# - entropy_difference_2, entropy_difference_3
# - Support counts for each metric
```
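
The phrase "per-document means over attested positions", together with the support counts, can be pictured with a small standard-library sketch. It is an illustration only: the toy bigram counts, the `per_position_surprisal` and `mean_with_support` helpers, and the choice to mark unattested positions as NaN are assumptions made for this example, not the package's implementation.

```python
import math

# Hypothetical toy counts: context word -> {next word: frequency}.
toy_counts = {
    "the": {"quick": 2, "lazy": 6},
    "quick": {"brown": 3, "red": 1},
}


def per_position_surprisal(tokens, counts):
    """Return -log2 P(w_t | w_{t-1}) per token, NaN when it cannot be computed."""
    values = []
    for i, token in enumerate(tokens):
        followers = counts.get(tokens[i - 1]) if i > 0 else None
        if followers and token in followers:
            p = followers[token] / sum(followers.values())
            values.append(-math.log2(p))
        else:
            values.append(math.nan)  # no context or unattested transition
    return values


def mean_with_support(values):
    """Average the non-NaN values and report how many positions contributed."""
    attested = [v for v in values if not math.isnan(v)]
    mean = sum(attested) / len(attested) if attested else math.nan
    return mean, len(attested)


tokens = ["the", "quick", "brown", "fox"]
values = per_position_surprisal(tokens, toy_counts)
mean, support = mean_with_support(values)
print(values)
print(mean, support)
```

In this toy run only two of the four positions are attested, so the mean is taken over those two values and the support count is 2. Reporting the support alongside the mean makes it easier to judge how much of a document a given average actually covers.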

### Per-Position Token Metrics

In addition to per-document means, `TokenEntropisalCalculator` exposes **per-position**
metrics that return a `pandas.DataFrame` with one row per token. Throughout, the suffix
`n` is the conditioning context length, matching `ngram_surprisal_n` (so `n=3` is the
4-gram). Contexts that are unattested in the reference corpus yield `NaN` and a `False`
availability flag.

```python
tokens = ["the", "quick", "brown", "fox"]

# Surprisal: -log2 P(w_t | n preceding tokens), the information value of each token.
# n=3 (default) uses the full 4-gram context; n=2 the trigram; n=1 the bigram.
calc.surprisal(tokens)             # n=3 (4-gram)
calc.surprisal(tokens, n=2)        # trigram
# columns: position, token, surprisal, surprisal_available

# Entropy reduction (Hale-style, conditional mutual information):
#   H(W_t | w_{t-n}..w_{t-2}) - H(W_t | w_{t-n}..w_{t-1})
# How much observing the most recent context word reduced uncertainty about a fixed
# target. n=3 (default) is the 4-gram; n=2 the trigram. (n=1 is unsupported: it reduces
# to an affine function of the mean transitional entropy.) Clipped at 0 by default.
calc.entropy_reduction(tokens, n=3)
calc.entropy_reduction(tokens, n=2, signed=True)  # keep negative values
# columns: position, token, entropy_reduction, available

# Entropy difference (Lowder-style): E_n[t-1] - E_n[t], the change in next-word entropy
# from one position to the next. NOTE: this differences entropies over *different* random
# variables (adjacent positions), unlike entropy_reduction's H(X) - H(X|y) over a fixed
# target. n in {2, 3}; n=3 (default) reproduces the original Lowder et al. (2018)
# definition, n=2 the trigram form. (n=1 is unsupported: it telescopes to the final
# token's entropy.) Clipped at 0 by default.
calc.entropy_difference(tokens, n=3)
# columns: position, token, entropy_difference, available

# Everything at once, at every context length (best for comparative analysis):
calc.compute_all(tokens)
# columns: position, token,
#          surprisal_1, surprisal_2, surprisal_3,
#          entropy_reduction_2, entropy_reduction_3,
#          entropy_difference_2, entropy_difference_3,
#          and a matching *_available flag for each metric
```

All four methods accept a `base` argument (default `2.0` for bits); `entropy_reduction`,
`entropy_difference`, and `compute_all` additionally accept `signed` (default `False`,
clipping negatives to 0 per Hale's convention).
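
To make the distinction between the two entropy-based metrics and the effect of `signed` and `base` concrete, here is a minimal sketch on made-up trigram counts. It assumes the `n=2` form of entropy reduction, `H(W_t | w_{t-2}) - H(W_t | w_{t-2}, w_{t-1})`, and treats entropy difference as a subtraction over a list of per-position entropies. The function names mirror the method names for readability only; the bodies are hypothetical and not taken from the package.

```python
import math
from collections import defaultdict

# Hypothetical toy trigram counts: (w_{t-2}, w_{t-1}, w_t) -> frequency.
trigram_counts = {
    ("the", "quick", "brown"): 18,
    ("the", "odd", "brown"): 1,
    ("the", "odd", "red"): 1,
}


def entropy(counts, base=2.0):
    """Shannon entropy of a count distribution in the given log base."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log(c / total, base) for c in counts if c > 0)


def entropy_reduction(w2, w1, base=2.0, signed=False):
    """H(W_t | w_{t-2}) - H(W_t | w_{t-2}, w_{t-1}) from the toy counts."""
    before = defaultdict(int)  # target counts with w_{t-1} marginalized out
    after = defaultdict(int)   # target counts given the full context
    for (a, b, target), c in trigram_counts.items():
        if a == w2:
            before[target] += c
            if b == w1:
                after[target] += c
    if not after:
        return math.nan  # unattested context
    value = entropy(before.values(), base) - entropy(after.values(), base)
    return value if signed else max(value, 0.0)


def entropy_difference(entropies, signed=False):
    """E[t-1] - E[t] for adjacent positions; position 0 has no predecessor."""
    out = [math.nan]
    for prev, cur in zip(entropies, entropies[1:]):
        value = prev - cur
        out.append(value if signed else max(value, 0.0))
    return out


# A rare context can raise uncertainty, which gives a negative signed value.
print(entropy_reduction("the", "odd", signed=True))
print(entropy_reduction("the", "odd"))                # clipped at 0
print(entropy_reduction("the", "quick", base=math.e))  # nats instead of bits
print(entropy_difference([3.0, 1.5, 2.0]))
```

Entropy reduction compares two distributions over the same target word, whereas entropy difference subtracts entropies that belong to different positions. That is why the second helper needs nothing more than a list of per-position values.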

### Character-Level Entropy and Surprisal

```python
from entroprisal import CharacterEntropisalCalculator, preprocess_text
from entroprisal.utils import load_google_books_words

# Load word frequency data
words_df = load_google_books_words()

# Initialize calculator
calc = CharacterEntropisalCalculator(words_df)

# Preprocess text to get tokens
text = "The quick brown fox jumps over the lazy dog"
tokens = preprocess_text(text)[0]  # Get first document's tokens

# Calculate metrics for tokens
metrics = calc.calculate_metrics(tokens)

print(metrics)
# Output includes:
# - char_entropy, char_surprisal: Single character transition metrics
# - bigraph_entropy, bigraph_surprisal: Two-character context metrics
# - trigraph_entropy, trigraph_surprisal: Three-character context metrics
# - char_entropy_reduction_{2,3}, char_entropy_difference_{2,3}: see below
```

Per-position character metrics return a `pandas.DataFrame` with one row per target
character position within each boundary-padded word (`#word#`):

```python
# Character surprisal: -log2 P(c_i | n preceding chars) per character.
# n=3 (default) uses the trigraph context; n=2 the bigraph; n=1 the single-char.
calc.surprisal(tokens)             # n=3 (trigraph)
calc.surprisal(tokens, n=2)        # bigraph
# columns: token_index, word, position, target, surprisal, surprisal_available

# Entropy reduction (Hale-style, conditional mutual information) at the character
# level: H(c_i | c_{i-n}..c_{i-2}) - H(c_i | c_{i-n}..c_{i-1}). n=3 (default) is
# the trigraph context; n=2 is the bigraph. Clipped at 0 by default.
calc.entropy_reduction(tokens, n=3)

# Entropy difference (Lowder-style) across char positions within a word.
calc.entropy_difference(tokens, n=3)

# Everything at once.
calc.compute_all(tokens)
```
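
To picture what "one row per target character position within each boundary-padded word" means, the following sketch enumerates context and target pairs for a padded word. The `char_positions` helper is hypothetical, and it assumes that a position is emitted only when a full `n`-character context fits inside the padded string; the package may handle word-initial positions differently.

```python
def char_positions(word, n=3, boundary="#"):
    """List (position, context, target) triples for a boundary-padded word.

    Assumption for this sketch: a position is emitted only when a full
    n-character context is available inside the padded string.
    """
    padded = f"{boundary}{word}{boundary}"
    rows = []
    for i in range(n, len(padded)):
        rows.append((i, padded[i - n:i], padded[i]))
    return rows


for row in char_positions("fox", n=2):
    print(row)
# (2, '#f', 'o')
# (3, 'fo', 'x')
# (4, 'ox', '#')
```

Under this assumption, the closing boundary character is itself a target, so the final row captures how predictable the end of the word is given its last characters.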

### Rest-of-Word Entropy and Surprisal (Character-Level, Bidirectional)

```python
from entroprisal import RestOfWordEntropisalCalculator, preprocess_text
from entroprisal.utils import load_google_books_words

# Load word frequency data
words_df = load_google_books_words()

# Initialize calculator
calc = RestOfWordEntropisalCalculator(words_df)

# Preprocess text to get tokens
text = "The quick brown fox"
tokens = preprocess_text(text)[0]  # Get first document's tokens

# Calculate metrics for tokens
metrics = calc.calculate_metrics(tokens)

print(metrics)
# Output includes:
# - lr_c1_entropy, lr_c1_surprisal: Left-to-right, 1-char context
# - lr_c2_entropy, lr_c2_surprisal: Left-to-right, 2-char context
# - lr_c3_entropy, lr_c3_surprisal: Left-to-right, 3-char context
# - rl_c1_entropy, rl_c1_surprisal: Right-to-left, 1-char context
# - rl_c2_entropy, rl_c2_surprisal: Right-to-left, 2-char context
# - rl_c3_entropy, rl_c3_surprisal: Right-to-left, 3-char context
# - {lr,rl}_entropy_reduction_{1,2,3}: see below
# - mean_word_length
```

The per-word rest-of-word methods return a DataFrame with one row per token. They are
parameterized by `direction` (`"lr"` or `"rl"`) and the conditioning prefix or suffix length `n`:

```python
# Surprisal of the remaining characters given the n-char prefix/suffix.
calc.surprisal(tokens, direction="lr", n=2)
# columns: token_index, word, surprisal, surprisal_available

# Entropy reduction: H(W | (n-1) chars) - H(W | n chars), the reduction in
# uncertainty about word identity from observing one more character. n=1 uses
# the corpus-wide marginal H(W) as the Distribution A baseline -- the reduction
# from observing the very first character. This is the clean Hale-style
# conditional mutual information, directly parallel to Hale (2016)'s original
# parse-given-words formulation but with characters reducing uncertainty about
# word identity. Clipped at 0 by default.
calc.entropy_reduction(tokens, direction="lr", n=1)
calc.entropy_reduction(tokens, direction="rl", n=2)

# Both directions x all prefix lengths in one DataFrame.
calc.compute_all(tokens)
```
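
The rest-of-word quantities can be sketched with a toy word-frequency table. The example below is a hedged illustration, not the package's code: the frequencies are invented, the helper names are hypothetical, and it assumes that the candidate set for a prefix (or suffix) is simply every listed word that shares it.

```python
import math

# Hypothetical toy word frequencies (not the Google Books data).
word_freq = {"the": 50, "then": 20, "this": 20, "fox": 10}


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def candidates(word, n, direction="lr"):
    """Frequencies of words sharing the first (lr) or last (rl) n characters."""
    if n == 0:
        return dict(word_freq)  # no context: every word is a candidate
    key = word[:n] if direction == "lr" else word[-n:]
    match = str.startswith if direction == "lr" else str.endswith
    return {w: c for w, c in word_freq.items() if match(w, key)}


def rest_of_word_surprisal(word, n, direction="lr"):
    """-log2 P(word | n-character prefix or suffix); word must be in word_freq."""
    pool = candidates(word, n, direction)
    return -math.log2(pool[word] / sum(pool.values()))


def entropy_reduction(word, n, direction="lr", signed=False):
    """H(W | n-1 chars) - H(W | n chars); n=1 starts from the marginal H(W)."""
    before = entropy(candidates(word, n - 1, direction).values())
    after = entropy(candidates(word, n, direction).values())
    value = before - after
    return value if signed else max(value, 0.0)


print(rest_of_word_surprisal("then", n=2))           # given the prefix "th"
print(entropy_reduction("fox", n=1))                 # first letter "f" is decisive
print(entropy_reduction("the", n=1, direction="rl"))  # last letter "e"
```

In this toy table the first letter of "fox" identifies the word uniquely, so the `n=1` reduction equals the full marginal entropy `H(W)`. Words beginning with "t" leave more uncertainty to be resolved by later characters.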

### Batch Processing

All calculators support batch processing with token lists:

```python
from entroprisal import preprocess_text

# Preprocess multiple texts at once
texts = [
    "First text sample",
    "Second text sample",
    "Third text sample"
]
token_lists = preprocess_text(texts)  # Returns list of token lists

# `calc` is any calculator initialized as shown in the sections above.
# Returns a pandas DataFrame with one row per document
results_df = calc.calculate_batch(token_lists)
print(results_df)
```

## API Reference

### TokenEntropisalCalculator

Calculate token-level entropy and surprisal metrics using n-gram frequencies.

**Methods:**

- `calculate_metrics(tokens: List[str]) -> Dict[str, float]`: Per-document mean metrics for a token list
- `calculate_batch(token_lists: List[List[str]]) -> pd.DataFrame`: Batch processing
- `surprisal(tokens, *, n=3, base=2.0) -> pd.DataFrame`: Per-position surprisal (`n` in {1, 2, 3})
- `entropy_reduction(tokens, *, n=3, signed=False, base=2.0) -> pd.DataFrame`: Per-position entropy reduction (conditional mutual information; `n` in {2, 3})
- `entropy_difference(tokens, *, n=3, signed=False, base=2.0) -> pd.DataFrame`: Per-position entropy difference (Lowder-style; `n` in {2, 3})
- `compute_all(tokens, *, signed=False, base=2.0) -> pd.DataFrame`: All per-position metrics at every context length
- `get_detailed_ngram_analysis(tokens: List[str]) -> Dict[int, pd.DataFrame]`: Detailed per-token analysis

### CharacterEntropisalCalculator

Calculate character-level transition entropy and surprisal.

**Methods:**

- `calculate_metrics(tokens: List[str]) -> Dict[str, float]`: Per-document mean metrics for a token list
- `calculate_batch(token_lists: List[List[str]]) -> pd.DataFrame`: Batch processing
- `surprisal(tokens, *, n=3, base=2.0) -> pd.DataFrame`: Per-position character surprisal (`n` in {1, 2, 3})
- `entropy_reduction(tokens, *, n=3, signed=False, base=2.0) -> pd.DataFrame`: Per-position character entropy reduction (`n` in {2, 3})
- `entropy_difference(tokens, *, n=3, signed=False, base=2.0) -> pd.DataFrame`: Per-position character entropy difference (`n` in {2, 3})
- `compute_all(tokens, *, signed=False, base=2.0) -> pd.DataFrame`: All per-position character metrics
- `get_character_entropy(char: str) -> Optional[float]`: Look up entropy for a specific character
- `get_character_surprisal(context: str, target: str) -> Optional[float]`: Look up surprisal for a character transition
- `get_bigraph_entropy(bigraph: str) -> Optional[float]`: Look up entropy for a bigraph
- `get_bigraph_surprisal(bigraph: str) -> Optional[float]`: Look up surprisal for a bigraph
- `get_trigraph_entropy(trigraph: str) -> Optional[float]`: Look up entropy for a trigraph
- `get_trigraph_surprisal(trigraph: str) -> Optional[float]`: Look up surprisal for a trigraph

### RestOfWordEntropisalCalculator

Calculate character-level rest-of-word entropy and surprisal in both directions (predicting remaining characters from left-to-right and right-to-left contexts).

**Methods:**

- `calculate_metrics(tokens: List[str]) -> Dict[str, float]`: Per-document mean metrics for a token list
- `calculate_batch(token_lists: List[List[str]]) -> pd.DataFrame`: Batch processing
- `surprisal(tokens, *, direction="lr", n=2, base=2.0) -> pd.DataFrame`: Per-word rest-of-word surprisal
- `entropy_reduction(tokens, *, direction="lr", n=2, signed=False, base=2.0) -> pd.DataFrame`: Per-word entropy reduction over word identity (`n` in {1, 2, 3}; `n=1` uses the marginal `H(W)` baseline)
- `compute_all(tokens, *, signed=False, base=2.0) -> pd.DataFrame`: All per-word metrics, both directions
- `get_word_frequency(word: str) -> int`: Get frequency of a word in reference corpus

Attribute: `word_marginal_entropy` (float) — `H(W)` over the corpus, used as the Distribution A baseline for `entropy_reduction(..., n=1)`.

## Utilities

```python
from entroprisal.utils import (
    load_google_books_words,
    load_4grams,
    get_data_dir,
    preprocess_text,
    is_content_token
)

# Load reference data
words_df = load_google_books_words()
ngrams_aw = load_4grams("aw")
ngrams_cw = load_4grams("cw")

# Get data directory path
data_dir = get_data_dir()

# Preprocess text with spaCy tokenization
# Returns list of token lists (one per document)
tokens = preprocess_text("The quick brown fox jumps over the lazy dog.")
# [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']]

# Process multiple texts
texts = ["First sentence.", "Second sentence."]
token_lists = preprocess_text(texts)

# Extract only content words (nouns, verbs, adjectives, adverbs)
content_tokens = preprocess_text("The quick brown fox jumps.", content_words_only=True)
# [['quick', 'brown', 'fox', 'jumps']]  # 'the' filtered out

# Use a different spaCy model
tokens = preprocess_text("Some text", spacy_model_tag="en_core_web_sm")
```

## Examples

See `examples/usage_examples.ipynb` for comprehensive examples including:

- Loading and initializing calculators
- Processing single texts and batches
- Combining multiple metrics
- Visualizing results

## Development

### Running Tests

```bash
pytest tests/
```

### Code Style

```bash
# Format code
black src/

# Lint code
ruff check src/
```

## License

It's MIT licensed. Do what you want with it.

## Citation

On the other hand, if you are an academic, please cite the package as follows:

```bibtex
@software{entroprisal,
  title = {entroprisal: Entropy-based linguistic metrics},
  author = {Langdon Holmes and Scott Crossley},
  year = {2025},
  url = {https://github.com/learlab/entroprisal}
}
```

```apa
Holmes, L., & Crossley, S. (2025). entroprisal: Entropy-based linguistic metrics [Computer software].
```

## Acknowledgments

Reference data sources:

- Google Books word frequencies: [gwordlist](https://github.com/orgtre/google-books-words)
- N-gram token frequencies: derived from the [SlimPajama](https://www.cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama) test set
