Metadata-Version: 2.4
Name: pycorpdiff
Version: 0.1.0a27
Summary: Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.
Project-URL: Homepage, https://github.com/jturner-uofl/pycorpdiff
Project-URL: Documentation, https://github.com/jturner-uofl/pycorpdiff
Project-URL: Repository, https://github.com/jturner-uofl/pycorpdiff
Project-URL: Issues, https://github.com/jturner-uofl/pycorpdiff/issues
Author-email: Jason Turner <jason.s.turner@gmail.com>
License: MIT License
        
        Copyright (c) 2026 Jason Turner
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: collocation,comparative corpus analysis,computational social science,corpus linguistics,diachronic nlp,digital humanities,discourse analysis,keyness,semantic change,temporal text analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: numpy>=1.24
Requires-Dist: pandas<3,>=2.0
Requires-Dist: pyarrow>=14
Requires-Dist: scipy>=1.11
Provides-Extra: all
Requires-Dist: altair>=5; extra == 'all'
Requires-Dist: datasets>=2.14; extra == 'all'
Requires-Dist: duckdb>=0.10; extra == 'all'
Requires-Dist: jupyter>=1.0; extra == 'all'
Requires-Dist: matplotlib>=3.8; extra == 'all'
Requires-Dist: networkx>=3.1; extra == 'all'
Requires-Dist: polars>=1.0; extra == 'all'
Requires-Dist: pyarrow>=15; extra == 'all'
Requires-Dist: ruptures>=1.1; extra == 'all'
Requires-Dist: scikit-learn>=1.3; extra == 'all'
Requires-Dist: sentence-transformers>=2.2; extra == 'all'
Requires-Dist: spacy>=3.7; extra == 'all'
Requires-Dist: statsmodels>=0.14; extra == 'all'
Requires-Dist: vl-convert-python>=1.5; extra == 'all'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pandas-stubs>=2.2; extra == 'dev'
Requires-Dist: pre-commit>=3.6; extra == 'dev'
Requires-Dist: pytest-cov>=4.1; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: duckdb
Requires-Dist: duckdb>=0.10; extra == 'duckdb'
Provides-Extra: huggingface
Requires-Dist: datasets>=2.14; extra == 'huggingface'
Provides-Extra: nlp
Requires-Dist: spacy>=3.7; extra == 'nlp'
Provides-Extra: notebooks
Requires-Dist: jupyter>=1.0; extra == 'notebooks'
Requires-Dist: vl-convert-python>=1.5; extra == 'notebooks'
Provides-Extra: polars
Requires-Dist: polars>=1.0; extra == 'polars'
Requires-Dist: pyarrow>=15; extra == 'polars'
Provides-Extra: semantic
Requires-Dist: scikit-learn>=1.3; extra == 'semantic'
Requires-Dist: sentence-transformers>=2.2; extra == 'semantic'
Provides-Extra: showcase
Requires-Dist: pysofra>=0.1.0a3; extra == 'showcase'
Provides-Extra: temporal
Requires-Dist: ruptures>=1.1; extra == 'temporal'
Requires-Dist: statsmodels>=0.14; extra == 'temporal'
Provides-Extra: viz
Requires-Dist: altair>=5; extra == 'viz'
Requires-Dist: matplotlib>=3.8; extra == 'viz'
Requires-Dist: networkx>=3.1; extra == 'viz'
Description-Content-Type: text/markdown

# pycorpdiff

[![PyPI](https://img.shields.io/pypi/v/pycorpdiff.svg)](https://pypi.org/project/pycorpdiff/)
[![Python versions](https://img.shields.io/pypi/pyversions/pycorpdiff.svg)](https://pypi.org/project/pycorpdiff/)
[![CI](https://github.com/jturner-uofl/pycorpdiff/actions/workflows/ci.yml/badge.svg)](https://github.com/jturner-uofl/pycorpdiff/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Comparative corpus analysis for modern Python workflows.**

`pycorpdiff` is the **missing comparative layer** between R's
[`quanteda`](https://quanteda.io/), the closed-source Sketch Engine
platform, and the fragmented Python NLP stack
(`nltk`/`spaCy`/`gensim`/`sentence-transformers`). Three public verbs
— `compare(a, b)`, `track(c, term)`, `compare.before_after(c, event)` —
consolidate keyness, collocations, dispersion, temporal trajectories,
changepoint detection, interrupted time series, causal-impact analysis,
forecasting, online changepoint detection, and embedding-based semantic
shift under a single notebook-native API. Keyness and collocation
results carry their own KWIC evidence: `.explain(term)` returns the
source-text concordances behind any ranked term.
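
To make the idea of KWIC evidence concrete, here is a minimal stand-alone keyword-in-context sketch. It is not `pycorpdiff`'s implementation of `.explain(term)`: the `kwic` function, its `window` parameter, and the regex tokenisation are all assumptions chosen for illustration.

```python
import re


def kwic(texts, term, window=4):
    """Return (left, keyword, right) context tuples for each match of `term`.

    Hypothetical helper for illustration; tokenises on word characters only.
    """
    hits = []
    needle = term.lower()
    for text in texts:
        tokens = re.findall(r"\w+", text)
        for i, tok in enumerate(tokens):
            if tok.lower() == needle:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, tok, right))
    return hits


docs = [
    "The criminal gangs exploit vulnerable people.",
    "We must support families fleeing war.",
]
for left, kw, right in kwic(docs, "criminal", window=3):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>30} [{kw}] {right}")
```

Each hit keeps the matched token in its original casing, so a ranked term can be traced back to the sentences that produced it.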

The package targets questions that recur across corpus linguistics,
digital humanities, and computational social science:

- *How does corpus A differ from corpus B?* — `compare(a, b).keyness()`
- *How has discourse around X evolved over time?* — `track(c, "x").over_time()`
- *What did "migrant" mean in 2005 vs 2023?* — `compare(...).semantic_shift("migrant", embedder=...)`
- *Did this event actually shift the conversation?* — `track(...).causal_impact(event_date=...)`
- *Where is the discourse heading?* — `track(...).forecast(horizon=4)`

`pycorpdiff` is positioned as **orchestration**, not reinvention.
Tokenizers (`spaCy`, `Stanza`, `jieba`, `fugashi`) and embedders (any
`SBERT`-compatible model) plug in via two `typing.Protocol` extension
points — one-line adapters, no plugin registry. The base install's
direct runtime dependencies are `numpy`, `pandas`, `scipy`, and
`pyarrow`; everything else is opt-in via extras.
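
A `typing.Protocol` extension point means an adapter only has to match a call signature; it does not inherit from anything. The sketch below shows the general pattern. The `Tokenizer` protocol, its `__call__` signature, and `RegexTokenizer` are hypothetical names for illustration; the README does not show the package's actual protocol definitions.

```python
import re
from typing import Protocol, runtime_checkable


@runtime_checkable
class Tokenizer(Protocol):
    """Hypothetical extension point: anything callable as text -> tokens."""

    def __call__(self, text: str) -> list[str]: ...


class RegexTokenizer:
    """Small adapter around a regular expression; no base class needed."""

    def __init__(self, pattern: str = r"\w+") -> None:
        self._pattern = re.compile(pattern)

    def __call__(self, text: str) -> list[str]:
        return [t.lower() for t in self._pattern.findall(text)]


def count_tokens(text: str, tokenizer: Tokenizer) -> int:
    """Accepts any object that structurally satisfies the protocol."""
    return len(tokenizer(text))


tok = RegexTokenizer()
print(tok("Hello, World!"))
print(count_tokens("a b c", tok))
```

Because the check is structural, a wrapper around a `spaCy` or `jieba` pipeline would satisfy the same protocol as long as it exposes the same call signature.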

> **Status: alpha (0.1.0a27).** The public API is stable for the features
> described below, and the package installs from PyPI with
> `pip install pycorpdiff`. Alpha releases are intentionally rapid
> (audit-driven), each shipping fixes and tests behind the published
> version; dependency pins will tighten at beta.

## The three-layer architecture

| Layer | Purpose | Key surface |
|---|---|---|
| **1 — Ingestion + `Corpus`** | get text in, slice it, hash it | `from_dataframe`, `read_csv`, `read_parquet`, `read_txt`, `read_duckdb`, `from_huggingface`, `fetch_hansard`, `Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars` |
| **2 — Pure math** | statistics with no I/O | `keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}`; `collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}`; `semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift}`; `temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd}` |
| **3 — Verbs + Results** | public API | `compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each implementing the relevant subset of `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()` |
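
As an illustration of what "pure math with no I/O" looks like in Layer 2, the following is a minimal sketch of the Benjamini-Hochberg false-discovery-rate adjustment in plain Python. It is the textbook procedure, not the package's `benjamini_hochberg` source, and the function signature here is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order (textbook version)."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and
    # capping every adjusted value at 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The function takes and returns plain lists, which keeps it trivially testable and independent of how the p-values were produced.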

## Quick start

```bash
pip install "pycorpdiff[viz]"
```

```python
import pycorpdiff as pcd

# Bundled synthetic Hansard-style sample — runs offline, no data download.
corpus = pcd.load_hansard_sample()
immigration = corpus.slice(topic="immigration")

# Which words separate the humanising and criminalising frames?
keyness = pcd.compare(
    immigration.slice(frame="humanising"),
    immigration.slice(frame="criminalising"),
).keyness(min_count=3)

keyness.plot()                # volcano plot — picture the result
# keyness.table.head(10)      # or look at the ranked table directly
# keyness.explain("criminal") # KWIC concordances showing the textual evidence
```

That is the core workflow in five statements: import the package, load
a corpus, slice it, compare two slices, and plot the result. Every other
analytical method (collocation shifts, semantic drift, temporal
trajectories, changepoint detection, causal-impact analysis, forecasting,
co-occurrence networks, N-way keyness) follows the same shape. See
[the showcase notebook](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
for the full feature tour, or the cheat sheet below for one-line API previews.

### Cheat sheet — every analytical surface in one block

```python
# Compare verbs (returns Result objects; methods exposed vary by Result)
pcd.compare(a, b).keyness()                                                   # default formula="rayson" (LL Wizard)
pcd.compare(a, b).keyness(formula="dunning")                                  # full 4-cell G² (Dunning 1993; same family as quanteda / NLTK, edge-case tolerance not certified)
pcd.compare(a, b).keyness(ci="bootstrap", n_boot=999)                         # adds g2_ci_lower / g2_ci_upper columns
pcd.compare(a, b).collocation_shift("immigrant")
pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())   # [semantic]
# SBERTEmbedder downloads a sentence-transformers model on first call;
# use pcd.HashEmbedder() for offline / deterministic-test settings.

# Reference-baseline keyness (bundled or user-built)
pcd.against_baseline(corpus, "gutenberg_fiction")                             # vs bundled 19th-c. fiction baseline
pcd.against_baseline(corpus, pcd.baseline_from_corpus(reference_corpus))      # vs your own reference

# Sub-corpus balancing — Coarsened Exact Matching before keyness
m = pcd.match(a, b, on=["year", "party"], seed=0)                             # balances A and B on covariates
pcd.compare(m.a_matched, m.b_matched).keyness()                               # like-for-like comparison

# Lexical diversity (TTR, MATTR, MTLD, HD-D) — pooled and over time
pcd.lexical_diversity(corpus)                                                 # pooled corpus-level values
pcd.lexical_diversity(corpus, freq="Y", ci="bootstrap", n_boot=199)           # per-year trajectory + CIs

# Track over time (requires [temporal] for the changepoint + ITS + forecast + causal_impact methods).
# Note: ITS / causal_impact require sufficient pre/post-event periods to fit (min_pre_periods=15,
# min_post_periods=8 by default); the bundled Hansard sample is too small to exercise these
# lines literally -- they are shown here as API previews. See examples/jss_case_study.ipynb
# for a full-corpus run.
tr = pcd.track(corpus, "immigrant").over_time(freq="Y")
tr.changepoints()                                  # offline PELT
tr.changepoints_online(hazard=1/24)                # Bayesian online (Adams & MacKay 2007)
tr.burstiness()                                    # Kleinberg 2002 multi-state HMM (burst-intensity states)
# tr.interrupted_time_series(event_date="2016")    # segmented OLS [needs >=15 pre-periods]
# tr.causal_impact(event_date="2016")              # Bayesian counterfactual (Brodersen 2015) [needs >=15 pre-periods]
tr.forecast(horizon=4)                             # 4 periods at the over_time freq (state-space ETS)

# Before / after a known event
pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()

# N-way (≥ 2 corpora) — the four corpora `a, b, c, d` are illustrative placeholders
# (the cheat sheet's `a, b` from the keyness lines above; you supply `c, d`).
# pcd.keyness_multi([a, b, c, d], labels=["A", "B", "C", "D"])

# The discourse as a graph
pcd.cooccurrence_network(corpus, top_n=30).plot()
```

See [`examples/pycorpdiff_showcase.ipynb`](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
for a walkthrough on the synthetic Hansard-style corpus exercising
every analytical surface.
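
The cheat sheet distinguishes `formula="rayson"` from `formula="dunning"`. The sketch below shows the usual difference between the two log-likelihood variants: the two-cell form used by Rayson's LL Wizard versus the full four-cell G². The function names and argument order are assumptions for illustration, and the package's own edge-case handling may differ.

```python
import math


def _term(observed, expected):
    # By convention 0 * ln(0) is treated as 0.
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def ll_rayson(a, b, c, d):
    """Two-cell G2: a, b = term counts; c, d = corpus sizes in tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in corpus 1
    e2 = d * (a + b) / (c + d)  # expected count in corpus 2
    return 2.0 * (_term(a, e1) + _term(b, e2))


def ll_dunning(a, b, c, d):
    """Four-cell G2: also includes the tokens that are NOT the term."""
    n = c + d
    e_rest1 = c * (n - a - b) / n  # expected non-term tokens, corpus 1
    e_rest2 = d * (n - a - b) / n  # expected non-term tokens, corpus 2
    extra = _term(c - a, e_rest1) + _term(d - b, e_rest2)
    return ll_rayson(a, b, c, d) + 2.0 * extra


# A term seen 20 times in one 1,000-token corpus and 5 times in another.
print(round(ll_rayson(20, 5, 1000, 1000), 3))
print(round(ll_dunning(20, 5, 1000, 1000), 3))
```

For rare terms in large corpora the two extra cells contribute very little, so the two variants usually rank terms almost identically; the four-cell value is never smaller than the two-cell one.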

## Installation

```bash
pip install pycorpdiff                       # lexical-comparative core (MIT)
pip install "pycorpdiff[viz]"                # + altair / matplotlib / networkx
pip install "pycorpdiff[semantic]"           # + sentence-transformers
pip install "pycorpdiff[temporal]"           # + ruptures / statsmodels
pip install "pycorpdiff[notebooks]"          # + jupyter / vl-convert
pip install "pycorpdiff[all]"                # everything MIT-compatible
pip install "pycorpdiff[all,showcase]"       # + pysofra (GPL-3.0-or-later) for the JAMA-style showcase
```

The base install's direct runtime dependencies are `numpy`, `pandas`,
`scipy`, and `pyarrow`; optional extras land per analytical layer so
you only pay for what you use. `[showcase]` is broken out separately
because `pysofra` is GPL-3.0-or-later — pure `pycorpdiff` use without
that extra remains MIT-only.
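
To see which optional layers are usable in the current environment, one option is to probe for the underlying modules with the standard library. This is a stand-alone sketch, not a `pycorpdiff` API: the `EXTRAS` mapping and `available_extras` helper are hypothetical, and the module names are the import names of the packages each extra pulls in (note that `scikit-learn` imports as `sklearn`).

```python
from importlib.util import find_spec

# Extra name -> top-level import names of its dependencies (assumed mapping).
EXTRAS = {
    "viz": ["altair", "matplotlib", "networkx"],
    "semantic": ["sentence_transformers", "sklearn"],
    "temporal": ["ruptures", "statsmodels"],
}


def available_extras(extras=EXTRAS):
    """Map each extra to True if all of its modules can be located.

    find_spec only locates top-level modules; it does not import them.
    """
    return {
        name: all(find_spec(mod) is not None for mod in modules)
        for name, modules in extras.items()
    }


for extra, ok in available_extras().items():
    print(f"[{extra}] {'available' if ok else 'missing'}")
```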

To work from source:

```bash
git clone https://github.com/jturner-uofl/pycorpdiff
cd pycorpdiff
pip install -e ".[dev]"
pytest -q
```

## Cross-validation receipts

The math is checked against standard tools by automated tests. The
fast tier runs on every push (matrix CI); the slow tier needs heavy
optional dependencies (NLTK, Scattertext, Stanford SNAP downloads)
and runs on main pushes only.

Fast tier:

- **Rayson's LL Wizard** — hand-derived contingency-table reference
  triples ([`tests/integration/test_crossval_rayson.py`](https://github.com/jturner-uofl/pycorpdiff/blob/main/tests/integration/test_crossval_rayson.py))

Slow tier:

- **NLTK** `BigramAssocMeasures` — PMI + t-score agreement to ≤ 1e-12
  on every adjacent bigram
- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012
  US Conventions corpus
- **HistWords (Hamilton et al. 2016)** — known-shifter / stable-word
  sanity check on Stanford SNAP COHA decade embeddings (skips
  gracefully if the archive isn't reachable)
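
For readers who want to see what the bigram measures in the NLTK check compute, here is a minimal sketch of the textbook PMI and t-score formulas. It is an illustration under stated assumptions (base-2 logarithm, raw counts as inputs), not the test code from the repository; the function names and argument order are hypothetical.

```python
import math


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) of a bigram.

    f_xy: bigram count; f_x, f_y: word counts; n: total bigrams/tokens.
    """
    return math.log2(f_xy * n / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """Observed minus expected co-occurrence, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


# A bigram seen 8 times whose words occur 16 and 32 times in 1,024 tokens.
print(pmi(8, 16, 32, 1024))                 # 4.0
print(round(t_score(8, 16, 32, 1024), 4))   # 2.6517
```

PMI rewards exclusivity and so favours rare pairs, while the t-score favours frequent ones, which is why the two are usually reported side by side.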

## Citation

If you use `pycorpdiff` in academic work, please cite the software via
the `CITATION.cff` file in this repository — GitHub renders a "Cite this
repository" widget directly from it.

## License

MIT — see [LICENSE](https://github.com/jturner-uofl/pycorpdiff/blob/main/LICENSE).

## Case studies and demos (rendered)

GitHub's in-browser notebook renderer is unreliable on larger notebooks
with embedded SVG outputs. The links below point to the **pre-rendered
HTML artefacts** (the canonical versions for reading) and, where
available, to nbviewer fallbacks for the `.ipynb` source. Notebook
sources still live under `examples/` for re-execution.

- **Asylum case study: lexicalising asylum in the UK Parliament, 2010-2023.**
  [📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/jss_case_study.html)
  · [nbviewer](https://nbviewer.org/github/jturner-uofl/pycorpdiff/blob/main/examples/jss_case_study.ipynb)
  · [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/jss_case_study.ipynb)
- **Full feature tour (showcase).**
  [📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/pycorpdiff_showcase.html)
  · [nbviewer](https://nbviewer.org/github/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
  · [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
- **Tutorial.**
  [📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/pycorpdiff_tutorial.html)
  · [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_tutorial.ipynb)
- **Hansard demo.**
  [📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/hansard_demo.html)
  · [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/hansard_demo.ipynb)

## Further reading

- [`docs/design.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/design.md) — three-layer architecture
- [`docs/statistical-methods.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/statistical-methods.md) — every metric's formula + citation
- [`docs/rendered/`](https://github.com/jturner-uofl/pycorpdiff/tree/main/docs/rendered) — catalogue of static HTML renders for offline viewing
