Metadata-Version: 2.4
Name: thematic-analyser
Version: 0.2.0
Summary: Corpus-level inductive thematic analysis via multi-LLM consensus labelling — a member of the lens analyser family.
License: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: fastapi>=0.109.0
Requires-Dist: lens-contract>=0.2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: rich>=13.7.0
Requires-Dist: uvicorn[standard]>=0.27.0
Provides-Extra: dev
Requires-Dist: httpx>=0.27.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Provides-Extra: documents
Requires-Dist: document-analyser>=0.4.0; extra == 'documents'
Provides-Extra: irr
Requires-Dist: krippendorff>=0.6; extra == 'irr'
Provides-Extra: llm
Requires-Dist: anthropic>=0.40; extra == 'llm'
Requires-Dist: openai>=1.12; extra == 'llm'
Provides-Extra: topics
Requires-Dist: bertopic>=0.16; extra == 'topics'
Requires-Dist: scikit-learn>=1.3; extra == 'topics'
Description-Content-Type: text/markdown

# thematic-analyser

Corpus-level **inductive thematic analysis via multi-LLM consensus labelling** —
a member of the [lens analyser family](https://github.com/michael-borck/lens-analysers).

Most family members read *one artefact* for *fixed* signals. This one is the
family's first **corpus-level, inductive** member: it takes a whole corpus and
discovers a codebook. Like `cite-sight` it is `auto_routable=False` (a corpus
isn't implied by a file extension).

## The method

Harvested from a parked research project (*Unveiling Risks in AI Systems*, Borck
& Thompson 2024 — see `docs/method/`). The novelty is not the topic model; it's
what happens to its output:

1. **Topics** — a pluggable, optional topic model proposes candidate themes
   (BERTopic via the `[topics]` extra, or bring your own precomputed topics).
   This mirrors BERTopic's own split between clustering and representation.
2. **Independent** — two or more *coders* (different LLMs) label each topic
   **blind**, no peeking.
3. **Critique** — coders see each other's labels and argue over N rounds,
   revising toward the most defensible shared label.
4. **Resolve** — converged label if they agree; otherwise the majority of the
   final round, flagged `agreed=False` for a **human** to settle.
5. **Reliability** — Krippendorff's α (the `[irr]` extra) over the *blind*
   labels from step 2: the defensibility number. Without that extra, it falls
   back to percent agreement.
6. **Codebook** — a flat set of themes the human groups into a hierarchy
   (`apply_hierarchy`), exportable to REFI-QDA for QualCoder/NVivo/ATLAS.ti.

The human sets the hierarchy; the machine does the labelling and the bookkeeping.
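
As a rough illustration of steps 4 and 5, the sketch below shows one way the
resolve rule and the percent-agreement fallback could work. It is not the
package's implementation: `Consensus`, `resolve`, and `percent_agreement` are
hypothetical names, labels are compared after a simple lowercase/whitespace
normalisation (an assumption), and ties in the majority vote go to the first
label seen.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


def _norm(label: str) -> str:
    """Normalise a label before comparison (an assumption of this sketch)."""
    return " ".join(label.lower().split())


@dataclass
class Consensus:  # hypothetical container, not the package's model
    label: str
    agreed: bool


def resolve(final_round: list[str]) -> Consensus:
    """Step 4: the shared label if coders converged, else the majority label."""
    if not final_round:
        raise ValueError("need at least one label")
    counts = Counter(_norm(label) for label in final_round)
    label, votes = counts.most_common(1)[0]  # ties go to the first label seen
    return Consensus(label=label, agreed=(votes == len(final_round)))


def percent_agreement(blind_labels: list[list[str]]) -> float:
    """Step 5 fallback: share of coder pairs that agree, pooled over topics."""
    agree = total = 0
    for labels in blind_labels:  # one inner list of coder labels per topic
        for a, b in combinations(labels, 2):
            total += 1
            agree += _norm(a) == _norm(b)
    return agree / total if total else 0.0
```

Here `resolve(["persona", "role play", "persona"])` yields the label `persona`
with `agreed=False`, and `percent_agreement([["a", "a"], ["a", "b"]])` is `0.5`.
Percent agreement is not corrected for chance agreement, which is why
Krippendorff's α is the preferred figure when the `[irr]` extra is installed.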

## Install

```bash
uv venv && uv pip install -e '../lens-contract' -e '.[dev]'
uv run pytest                       # offline smoke (stub coders, no API key)

uv pip install -e '.[topics]'       # + fit topics from raw text (BERTopic)
uv pip install -e '.[llm]'          # + LLM SDKs (anthropic, openai); only Anthropic coders are wired so far
uv pip install -e '.[irr]'          # + Krippendorff's alpha
uv pip install -e '.[documents]'    # + .pdf/.docx ingestion via document-analyser
```

## CLI

```bash
thematic-analyser corpus.txt                      # fit topics, stub coders, human summary
thematic-analyser corpus.txt --topics topics.json # skip fitting; use precomputed topics
thematic-analyser corpus/ --rounds 3 --json       # directory of docs; JSON to stdout
thematic-analyser serve --port 8017               # HTTP API
thematic-analyser manifest                        # capability manifest
```

A bare positional argument runs an analysis. `--json` prints the `ThematicAnalysis`
model to stdout and nothing else; diagnostics go to stderr.
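
Because stdout carries only JSON, the CLI can be driven from a script. A minimal
sketch, assuming `thematic-analyser` is on `PATH`; `analyse_via_cli` is a
hypothetical helper, and the `run` parameter exists only so the call can be
swapped out in tests.

```python
import json
import subprocess
import sys


def analyse_via_cli(corpus: str, rounds: int = 3, run=subprocess.run) -> dict:
    """Run the CLI with --json and return the parsed stdout."""
    cmd = ["thematic-analyser", corpus, "--rounds", str(rounds), "--json"]
    proc = run(cmd, capture_output=True, text=True, check=True)
    if proc.stderr:
        # Diagnostics arrive on stderr; pass them through untouched.
        sys.stderr.write(proc.stderr)
    return json.loads(proc.stdout)
```

With `check=True`, a non-zero exit raises `subprocess.CalledProcessError`
instead of handing back partial output.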

## Python

```python
from thematic_analyser import ThematicAnalyser, LLMCoder

# Real two-model consensus (needs the [llm] extra + ANTHROPIC_API_KEY):
coders = [
    LLMCoder("claude", "claude-opus-4-8", context="jailbreak prompts"),
    LLMCoder("haiku",  "claude-haiku-4-5-20251001", context="jailbreak prompts"),
]
result = ThematicAnalyser(coders, rounds=3).analyse("corpus.txt", topics="topics.json")
print(result.reliability)            # alpha on the blind labels ([irr] extra), else percent agreement
print([(c.label, c.agreed) for c in result.consensus])
```

Without coders it defaults to two offline **stub** coders so everything runs with
no API key — that's what the test suite uses.
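
Items flagged `agreed=False` are the ones left for a human to settle, so a caller
may want to pull them out of `result.consensus`. A minimal sketch, assuming only
the `label` and `agreed` attributes used in the example above; `needs_review` is
a hypothetical helper and the demo objects are stand-ins, not real results.

```python
from types import SimpleNamespace


def needs_review(consensus):
    """Return the labels of items whose coders did not converge."""
    return [item.label for item in consensus if not item.agreed]


# Stand-in objects shaped like the (label, agreed) pairs printed above.
demo = [
    SimpleNamespace(label="role play", agreed=True),
    SimpleNamespace(label="persona override", agreed=False),
]
print(needs_review(demo))  # ['persona override']
```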

## HTTP

```bash
thematic-analyser serve --port 8017
curl -F file=@corpus.txt -F rounds=3 'http://127.0.0.1:8017/analyse'
curl http://127.0.0.1:8017/health
```

Endpoints: `GET /health`, `GET /manifest`, and `POST /analyse` (multipart corpus upload). The
HTTP face runs the cheap stub-coder default; the LLM tier and human-in-the-loop
curation live in the desktop app, which calls the Python surface directly.
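
The upload can also be mirrored from Python. A minimal sketch using only the
standard library, assuming the form fields `file` and `rounds` from the `curl`
call above; `build_analyse_request` is a hypothetical helper and the
`text/plain` part type is an assumption.

```python
import urllib.request
import uuid


def build_analyse_request(
    base_url: str, filename: str, content: bytes, rounds: int = 3
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to /analyse."""
    boundary = uuid.uuid4().hex
    rounds_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="rounds"\r\n\r\n'
        f"{rounds}\r\n"
    ).encode()
    file_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: text/plain\r\n\r\n"  # assumed part type
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/analyse",
        data=rounds_part + file_head + content + closing,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it to a running
`thematic-analyser serve`. Building the request separately keeps the sketch
checkable without a server.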

## Status

v0.2.0 scaffold. The offline path works end to end (corpus → topics → consensus → reliability →
codebook → REFI-QDA export). Seams still to flesh out: BERTopic fitting (`[topics]`),
real provider wiring beyond Anthropic, a full `.qdpx` writer, and the local
desktop curation app (forked from the debrief/insight-lens shell).
