Metadata-Version: 2.3
Name: archival-structures
Version: 0.3.0
Summary: A Python package for analysing and restructuring the output of Automatic Text Recognition (ATR) pipelines.
License: MIT
Keywords: archives,pagexml,atr,ocr,digital-humanities,document-analysis
Author: Marijn Koolen
Author-email: marijn.koolen@gmail.com
Requires-Python: >=3.11,<3.15
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: anthropic (>=0.40.0)
Requires-Dist: fuzzy-search (>=2.5.0,<4.0.0)
Requires-Dist: hdbscan (>=0.8.44)
Requires-Dist: ipycanvas (>=0.13.0)
Requires-Dist: ipython (>=8.20.0)
Requires-Dist: ipywidgets (>=8.1.0)
Requires-Dist: lxml (>=5.3.0,<7.0.0)
Requires-Dist: matplotlib (>=3.9.0)
Requires-Dist: numpy (>=2.3.5,<3.0.0)
Requires-Dist: opencv-python (>=4.13.0.92)
Requires-Dist: orjson (>=3.11.7)
Requires-Dist: pagexml-tools (>=0.8.0,<1.0.0)
Requires-Dist: pandas (>=2.3.3,<3.0.0)
Requires-Dist: pillow (>=11.0.0)
Requires-Dist: pyyaml (>=6.0)
Requires-Dist: scikit-image (>=0.26.0)
Requires-Dist: scikit-learn (>=1.9.0)
Requires-Dist: seaborn (>=0.13.0)
Requires-Dist: torch (>=2.12.0)
Requires-Dist: tqdm (>=4.65.0)
Requires-Dist: transformers (>=5.11.0)
Requires-Dist: umap-learn (>=0.5.12)
Project-URL: Homepage, https://github.com/Data-Scopes/archival-structures
Project-URL: Issues, https://github.com/Data-Scopes/archival-structures/issues
Project-URL: Repository, https://github.com/Data-Scopes/archival-structures
Description-Content-Type: text/markdown

# archival-structures

Tools for analysing PageXML/ATR transcriptions and scan images of archival documents:
detecting and splitting two-page book openings, clustering text lines and page layouts,
mining cross-page document-element sequences, ink-colour and missing-transcription detection,
and parsing EAD/METS archival finding-aid metadata.

Full documentation (including the per-module API reference) lives in [`docs/`](docs/) and is
built with Sphinx; see [Documentation](#documentation) below.

## Techniques and tasks

Archival images and transcriptions are organised as
`<institute_id>/<archive_id>/<inventory_num_id>/<scan>`. The core idea behind this package is
that one inventory number's worth of scans is a structured, ordered corpus, not a set of
independent images -- so the analysis is built up in layers:

1. **Opening detection and splitting** (`archival_structures.analysis.opening_detection`) --
   decide whether a scan is a two-page spread, split it into independent verso/recto pages, and
   classify a whole inventory number as a *book of openings* versus a *mixed* folder/booklet.
2. **Page-layout clustering** (`archival_structures.analysis.page_layout_clustering`) -- cluster
   whole pages by the spatial arrangement of their text lines, via a grid-pattern TF-IDF
   fingerprint. A complementary fingerprint, `archival_structures.analysis.relational_patterns`
   (clustered by `relational_layout_clustering`), instead encodes each line's own type and its
   RCC-8 spatial relation to its immediate below/right neighbour. These relational
   line-neighbourhood patterns are something a pixel-pattern fingerprint cannot represent.
   - **Structural whitespace** (`archival_structures.analysis.empty_regions`) -- detects and
     clusters significant whitespace regions within pages (computed geometrically, not from
     PageXML region markup) and scores which relational patterns are over-represented adjacent
     to those whitespace boundaries.
   - **Cross-page boundaries** (`archival_structures.analysis.boundary_detection`) -- detects
     blank or near-blank pages in the page sequence, and identifies which page-layout clusters
     systematically appear before or after them.
   - **Text-extent margins** (`archival_structures.analysis.text_extent`) -- measures how far
     from each page edge the first and last transcribed lines sit (relative top, bottom, left,
     right margins); classifies each page as `full_text`, `late_start`, `early_end`, or `short`;
     and characterises each inventory by its full-text page fraction -- a lightweight signal for
     distinguishing running-text books from sparse table registers or mixed-document archives.
3. **Line clustering** (`archival_structures.analysis.line_clustering`) -- cluster individual
   text lines by indentation/width/height into a vocabulary of recurring line types (body text,
   closing lines, marginalia, ...).
4. **Sequence-pattern mining** (`archival_structures.analysis.sequence_patterns`) -- order lines
   into a corpus-wide reading sequence and segment it into document elements, including elements
   that span a page break.
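
As an illustration of the text-extent labels in task 2, here is a minimal stand-alone sketch. It is not the package's implementation: the `LineBox` type, the `max_margin` threshold, the reading of the four labels, and the choice to treat an empty page as `short` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class LineBox:
    """Axis-aligned bounding box of one transcribed line, in page pixels (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int


def classify_text_extent(lines, page_height, max_margin=0.15):
    """Label a page by how close its text runs to the top and bottom edges.

    `max_margin` is an illustrative threshold (fraction of page height),
    not a value taken from the package.
    """
    if not lines:
        return "short"  # assumption: a page without lines counts as short
    # Relative margins: distance from the page edge to the nearest line.
    top = min(line.y for line in lines) / page_height
    bottom = 1 - max(line.y + line.h for line in lines) / page_height
    starts_high = top <= max_margin
    ends_low = bottom <= max_margin
    if starts_high and ends_low:
        return "full_text"
    if ends_low:
        return "late_start"  # text reaches the bottom but starts far down
    if starts_high:
        return "early_end"  # text starts at the top but stops early
    return "short"


def full_text_fraction(labels):
    """Share of pages labelled `full_text`; 0.0 for an empty inventory."""
    if not labels:
        return 0.0
    return sum(label == "full_text" for label in labels) / len(labels)


pages = [
    [LineBox(100, 50, 800, 40), LineBox(100, 900, 800, 40)],   # top to bottom
    [LineBox(100, 500, 800, 40), LineBox(100, 900, 800, 40)],  # starts halfway
    [LineBox(100, 50, 800, 40), LineBox(100, 400, 800, 40)],   # stops halfway
]
labels = [classify_text_extent(lines, page_height=1000) for lines in pages]
```

Under these assumptions an inventory of running text would score a high `full_text_fraction`, while a sparse register would score low.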

Tasks 2 and 3 both depend on splitting first (task 1): clustering whole two-page scans
merges the geometry of the left and right pages into a single coordinate frame.
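
A small sketch shows why the coordinate frame matters. It assumes the gutter position is already known and splits naively at that x value; the package's opening detection is more involved, and the function name here is hypothetical.

```python
def split_lines_at_gutter(line_boxes, gutter_x):
    """Assign (x, y, w, h) line boxes to verso/recto by their horizontal centre.

    Recto boxes are shifted so that x is measured from the gutter, giving
    each page its own coordinate frame. Naive illustration only.
    """
    verso, recto = [], []
    for x, y, w, h in line_boxes:
        centre = x + w / 2
        if centre < gutter_x:
            verso.append((x, y, w, h))
        else:
            recto.append((x - gutter_x, y, w, h))
    return verso, recto


# Two body-text lines with the same indentation, one on each page of a
# 2000 px wide two-page scan.
scan_lines = [(150, 300, 700, 40), (1150, 300, 700, 40)]
verso, recto = split_lines_at_gutter(scan_lines, gutter_x=1000)
```

Before splitting, the two lines differ by 1000 px in `x` and would look like different line types; after splitting, both have the same geometry relative to their own page.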

Alongside the text-analysis pipeline:

- **Ink colour, multi-colour text, and missing transcriptions**
  (`archival_structures.clustering.colour_clustering`) -- robust ink/paper separation via
  multiotsu + connected-component shape (resistant to small artefacts like a sticker or stain),
  screening pages for more than one ink colour via LAB chroma spread, and flagging untranscribed
  page regions whose pixels look like genuine ink rather than blank paper.
- **Coordinate-space bridging** (`archival_structures.model.image`,
  `archival_structures.image`) -- converting between a scan's native pixel coordinates, a
  thumbnail's, and a canvas rendering of a selection, via an affine `Transform`; converting
  between PageXML `Coords` and this package's own `Box` type; ipywidgets-based interactive
  region drawing/tagging.
- **Ground-truth annotation** (`archival_structures.datasets.annotations`) -- a multi-level
  `namespace:type(:subtype)?(#N)?` tag vocabulary (see
  [`docs/vocabulary.md`](docs/vocabulary.md)) for labelling scans/pages/lines/cross-page
  elements, plus ipywidgets notebook apps for producing it one scan
  (`archival_structures.datasets.annotations`) or one cluster
  (`archival_structures.datasets.bulk_tagging`) at a time.
- **Stream analysis** (`archival_structures.stream_analysis`) -- a separate concern from the
  PageXML pipeline: embeddings + UMAP/HDBSCAN clustering, layout features, optional VLM tagging,
  and active-learning ground-truth creation for a plain directory of document images (no PageXML
  required) -- see [`docs/stream_analysis.md`](docs/stream_analysis.md).
  - **Sequence pattern analysis** (`archival_structures.stream_analysis.sequence_analysis`)
    -- label-agnostic tools for analysing ordered sequences of cluster labels (from visual
    or layout clustering): run-length encoding and noise-run merging, cluster n-gram mining,
    tandem repeat detection (recurring cluster sub-sequences), and transition matrices.
  - **Subsequence detection** (`archival_structures.stream_analysis.overview.subsequence_detection`)
    -- detects visually homogeneous (book-like) subsequences within a heterogeneous scan sequence
    using adjacent cosine similarity between DINOv2 embeddings; threshold-based and optional
    change-point (ruptures) boundary detection; scores each segment by mean similarity, cluster
    entropy, and optional opening consistency.
- **EAD/METS parsing** (`archival_structures.parsers`) -- a separate concern from the
  PageXML/image pipeline: parsing the archival finding-aid metadata (series/subseries/file
  structure, page manifests) that describes an archive's holdings.
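
The `namespace:type(:subtype)?(#N)?` tag grammar mentioned above can be read as a regular expression. The sketch below is one possible reading, not the package's parser: it assumes that `N` is a non-negative integer and that components contain no `:`, `#`, or whitespace, and the example tags are invented for illustration (the real vocabulary is in `docs/vocabulary.md`).

```python
import re

# One reading of namespace:type(:subtype)?(#N)? -- an assumption, not the package's parser.
TAG_PATTERN = re.compile(
    r"^(?P<namespace>[^:#\s]+)"
    r":(?P<type>[^:#\s]+)"
    r"(?::(?P<subtype>[^:#\s]+))?"
    r"(?:#(?P<index>\d+))?$"
)


def parse_tag(tag):
    """Split a tag string into its components; absent optional parts are None."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a valid tag: {tag!r}")
    parts = match.groupdict()
    if parts["index"] is not None:
        parts["index"] = int(parts["index"])
    return parts


# Hypothetical tags, invented for this example.
simple = parse_tag("layout:table")
full = parse_tag("layout:table:header#2")
```

Here `simple` has `subtype` and `index` set to `None`, while `full` carries all four components.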

See [`docs/findings.md`](docs/findings.md) for the concrete, validated-against-real-data lessons
learned while building this -- several of the choices above (e.g. splitting before clustering,
chroma spread over luminosity-class counting for multi-colour detection) turned out to matter a
lot more than they first appeared to.

## Demo notebooks

All in [`notebooks/demo/`](notebooks/demo/):

- [`annotate-scans.ipynb`](notebooks/demo/annotate-scans.ipynb) -- ipywidgets ground-truth
  annotation app.
- [`bulk-tag-annotation-demo.ipynb`](notebooks/demo/bulk-tag-annotation-demo.ipynb) -- tagging
  many scans at once by cluster, with a structured namespace/type/subtype tag builder instead
  of free text.
- [`inventory-structure-demo.ipynb`](notebooks/demo/inventory-structure-demo.ipynb) --
  classifying a whole inventory number as a book of openings vs a mixed folder.
- [`opening-detection-demo.ipynb`](notebooks/demo/opening-detection-demo.ipynb) -- per-scan
  opening detection and splitting.
- [`line-clustering-demo.ipynb`](notebooks/demo/line-clustering-demo.ipynb) and
  [`line-clustering-table-vs-deeds-demo.ipynb`](notebooks/demo/line-clustering-table-vs-deeds-demo.ipynb)
  -- clustering text lines by indentation/width, and comparing that across a table-like register
  versus notary deeds.
- [`page-layout-clustering-demo.ipynb`](notebooks/demo/page-layout-clustering-demo.ipynb) and
  [`page-layout-clustering-table-vs-deeds-demo.ipynb`](notebooks/demo/page-layout-clustering-table-vs-deeds-demo.ipynb)
  -- clustering pages by text-line layout, and the same table-vs-deeds comparison.
- [`relational-layout-clustering-table-vs-deeds-demo.ipynb`](notebooks/demo/relational-layout-clustering-table-vs-deeds-demo.ipynb)
  -- clustering pages by line-type-and-neighbour-relation fingerprint instead of raw geometry,
  compared against the geometric clustering above.
- [`empty-region-clustering-demo.ipynb`](notebooks/demo/empty-region-clustering-demo.ipynb) --
  detecting and clustering significant whitespace regions within pages; contrasting the tiny
  inter-cell gaps in a table register against the structural blank areas in notary deed pages.
- [`boundary-within-pages-demo.ipynb`](notebooks/demo/boundary-within-pages-demo.ipynb) --
  which relational line-neighbourhood patterns (RCC-8 symbols) are over-represented immediately
  adjacent to significant whitespace regions -- the within-page boundary markers.
- [`boundary-across-pages-demo.ipynb`](notebooks/demo/boundary-across-pages-demo.ipynb) --
  which page-layout clusters appear near blank pages in the page sequence -- the across-page
  boundary markers; contrasts the table register's front-matter blanks against the notary deeds'
  regular blank-recto convention.
- [`full-text-page-detection-demo.ipynb`](notebooks/demo/full-text-page-detection-demo.ipynb) --
  detecting full-text pages from top/bottom text-extent margins; comparing six inventories
  (three HaNA table registers, two HaNA letter-copy books, one notary-deeds book) by their
  full-text page fraction, margin distribution, and line-width/equal-extent features.
- [`pagexml-image-region-linking.ipynb`](notebooks/demo/pagexml-image-region-linking.ipynb) --
  drawing PageXML regions on a thumbnail, and converting a manually-drawn selection back into a
  new PageXML region.
- [`pagexml-image-multicolour-explorer.ipynb`](notebooks/demo/pagexml-image-multicolour-explorer.ipynb)
  -- screening a sample of scans for multi-colour text and missing-transcription candidates.
- [`sequence-patterns-demo.ipynb`](notebooks/demo/sequence-patterns-demo.ipynb) -- mining
  recurring n-gram patterns and cross-page document elements, comparing the table register
  against the notary deeds.
- [`stream-analysis-overview-demo.ipynb`](notebooks/demo/stream-analysis-overview-demo.ipynb)
  and [`stream-analysis-groundtruth-demo.ipynb`](notebooks/demo/stream-analysis-groundtruth-demo.ipynb)
  -- embeddings + clustering, optional VLM tagging, and active-learning ground-truth creation
  for a plain directory of document images (no PageXML required).
- [`subsequence-detection-demo.ipynb`](notebooks/demo/subsequence-detection-demo.ipynb) --
  detecting book-like subsequences within a heterogeneous scan sequence (`NL-AsdSAA_89_3.1`)
  using adjacent DINOv2 cosine similarity; validates against a known book run and identifies
  additional candidates.
- [`cluster-sequence-analysis-demo.ipynb`](notebooks/demo/cluster-sequence-analysis-demo.ipynb) --
  sequence pattern analysis of cluster label sequences for `NL-HaNA_2.10.50_1` (visual and
  layout clustering) and `NL-AsnDA_0114.11_1` (layout clustering); demonstrates
  `run_length_encode`, `find_tandem_repeats`, `find_frequent_ngrams`, and `label_transition_matrix`.
- [`resolution-cluster-sequence-demo.ipynb`](notebooks/demo/resolution-cluster-sequence-demo.ipynb) --
  layout cluster sequence analysis for six resolution-book inventories from `NL-HaNA_1.01.02`
  (3771–3823); discovers candidate section boundaries from cluster sequence patterns without
  using the available ground-truth section metadata.
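
For readers unfamiliar with the sequence operations the cluster-sequence notebooks demonstrate, the stand-alone sketch below illustrates three of them on a toy label sequence. The function names and signatures are deliberately different from the package's (`run_length_encode`, `find_frequent_ngrams`, `label_transition_matrix`), whose actual interfaces are documented in `docs/`; this is only an assumption-laden illustration using the standard library.

```python
from collections import Counter
from itertools import groupby


def rle(labels):
    """Collapse consecutive repeats into (label, run_length) pairs."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(labels)]


def ngram_counts(labels, n):
    """Count every contiguous n-gram of labels."""
    return Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))


def transition_counts(labels):
    """Count how often each label is directly followed by each label."""
    return Counter(zip(labels, labels[1:]))


# Toy sequence of per-page cluster labels in scan order.
page_clusters = [0, 0, 1, 2, 1, 2, 1, 2, 0, 0]

runs = rle(page_clusters)
bigrams = ngram_counts(page_clusters, 2)
transitions = transition_counts(page_clusters)
```

In this toy sequence the bigram `(1, 2)` occurs three times in a row, which is the kind of recurring sub-sequence a tandem-repeat search would report.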

### Demo data

The notebooks above need real PageXML/thumbnail data (~341 MB across 7 inventory numbers) that
isn't committed to this repo -- only the package code is. Download `demo-data.zip` from the
[latest release](https://github.com/Data-Scopes/archival-structures/releases) and extract it at
the repository root:

```bash
unzip demo-data.zip -d .
```

This recreates `data/PageXML/`, `data/thumbs/`, and `data/annotations/` with exactly the
inventory numbers the demo notebooks reference, so they run unchanged once extracted.

## Installation

```bash
poetry install
```

Requires Python >=3.11,<3.15. The upper bound comes from `torch`, whose `triton` dependency
supports only Python <3.15, so the declared range matches that rather than the more typical
`<4.0`.

## Documentation

The documentation is built with Sphinx and requires the optional `docs` dependency group:

```bash
poetry install --with docs
cd docs
make html
```

