Metadata-Version: 2.4
Name: pyprocessors-reconciliation
Version: 0.6.25
Summary: Sherpa reconciliation processor
Author-email: Olivier Terrier <olivier.terrier@kairntech.com>
License-Expression: MIT
License-File: AUTHORS.md
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Requires-Dist: collections-extended
Requires-Dist: log-with-context
Requires-Dist: pymultirole-plugins<0.7.0,>=0.6.0
Provides-Extra: dev
Requires-Dist: bump2version; extra == 'dev'
Requires-Dist: pre-commit; extra == 'dev'
Provides-Extra: docs
Requires-Dist: lxml-html-clean; extra == 'docs'
Requires-Dist: m2r2; extra == 'docs'
Requires-Dist: sphinx; extra == 'docs'
Requires-Dist: sphinx-rtd-theme; extra == 'docs'
Requires-Dist: sphinxcontrib-apidoc; extra == 'docs'
Provides-Extra: sbom
Requires-Dist: cyclonedx-bom; extra == 'sbom'
Requires-Dist: pip-audit; extra == 'sbom'
Provides-Extra: test
Requires-Dist: dirty-equals; extra == 'test'
Requires-Dist: pytest-check; extra == 'test'
Requires-Dist: pytest-cov; extra == 'test'
Requires-Dist: pytest-mypy; extra == 'test'
Requires-Dist: pytest>=6.0.0; extra == 'test'
Requires-Dist: ruff; extra == 'test'
Description-Content-Type: text/markdown

# pyprocessors_reconciliation

[![license](https://img.shields.io/github/license/oterrier/pyprocessors_reconciliation)](https://github.com/oterrier/pyprocessors_reconciliation/blob/master/LICENSE)
[![tests](https://github.com/oterrier/pyprocessors_reconciliation/workflows/tests/badge.svg)](https://github.com/oterrier/pyprocessors_reconciliation/actions?query=workflow%3Atests)
[![codecov](https://img.shields.io/codecov/c/github/oterrier/pyprocessors_reconciliation)](https://codecov.io/gh/oterrier/pyprocessors_reconciliation)
[![docs](https://img.shields.io/readthedocs/pyprocessors_reconciliation)](https://pyprocessors_reconciliation.readthedocs.io)
[![version](https://img.shields.io/pypi/v/pyprocessors_reconciliation)](https://pypi.org/project/pyprocessors_reconciliation/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pyprocessors_reconciliation)](https://pypi.org/project/pyprocessors_reconciliation/)

Reconciles annotations coming from different annotators.

## Installation

```
pip install pyprocessors-reconciliation
```

## Overview

`ReconciliationProcessor` is a [pymultirole](https://github.com/kairntech/pymultirole-plugins) processor plugin that reconciles overlapping annotations produced by multiple annotators (NER models, knowledge-base linkers, white/kill lists) into a single coherent set.

The processor is registered under the `pyprocessors.plugins` entry point as `reconciliation`.

### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `type` | `ReconciliationType` | `linker` | Reconciliation strategy (currently only `linker`) |
| `kill_label` | `str \| None` | `None` | Label whose annotations suppress matching model annotations |
| `white_label` | `str \| None` | `None` | Label treated as authoritative (terms stripped so it acts like a model annotation) |
| `whitelisted_lexicons` | `list[str] \| None` | `None` | Lexicons whose annotations are duplicated as term-free model candidates |
| `person_label` | `str \| None` | `None` | Label used to identify person annotations for last-name resolution |
| `remove_suspicious` | `bool` | `True` | Drop model annotations that contain no capitalised word (numbers, percentages, etc.) |
| `resolve_lastnames` | `bool` | `False` | Resolve isolated last names / first names using full names seen earlier in the document |
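
The two boolean options can be illustrated on plain strings. The functions below are a simplified sketch of the behaviour described in the table, not the processor's code: `is_suspicious` and `resolve_single_tokens` are hypothetical names, and whitespace tokenisation is an assumption.

```python
def is_suspicious(text: str) -> bool:
    """True when no whitespace-separated word starts with an uppercase letter."""
    return not any(word[:1].isupper() for word in text.split())


def resolve_single_tokens(names: list[str]) -> list[str]:
    """Replace a single-token name by the first earlier full name containing it."""
    full_names: list[str] = []
    resolved: list[str] = []
    for name in names:
        if len(name.split()) > 1:
            # Multi-token names are kept and remembered for later mentions.
            full_names.append(name)
            resolved.append(name)
        else:
            # Fall back to the name itself when no earlier full name matches.
            match = next((f for f in full_names if name in f.split()), name)
            resolved.append(match)
    return resolved


print(is_suspicious("12 %"))
print(is_suspicious("Marie Curie"))
print(resolve_single_tokens(["Marie Curie", "Curie", "Pierre"]))
```

In this sketch a name is only resolved against full names that appear before it, which mirrors the "seen earlier in the document" wording of `resolve_lastnames`.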

### How it works

1. **Sentence filtering** — `sentence`-labelled annotations are removed before processing.
2. **Whitelist marking** (`mark_whitelisted`) — annotations matching `white_label` have their terms cleared so they behave like model candidates; annotations from `whitelisted_lexicons` get a term-free duplicate added alongside the original.
3. **Grouping** (`group_annotations`) — annotations are grouped by their first term's lexicon (empty string = model / no-lexicon). Same-span annotations in the same group have their term lists merged and deduplicated.
4. **Linker consolidation** (`consolidate_linker`):
   - Suspicious model annotations (no capitalised word) are optionally dropped.
   - Kill-list annotations suppress matching model annotations.
   - KB annotations at the same span enrich the matching model annotation with their terms.
   - Overlapping or mismatched-label KB matches are logged as warnings and skipped.
5. **Last-name resolution** — when `resolve_lastnames=True`, isolated person names (single token) are resolved to the full-name annotation seen earliest in the document.
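
Steps 3 and 4 can be sketched on a toy data model. Everything here is an assumption made for illustration: `Ann` is a hypothetical stand-in for the real annotation class, terms are reduced to `(lexicon, identifier)` tuples, and overlapping spans are ignored silently rather than logged.

```python
from dataclasses import dataclass, field


@dataclass
class Ann:
    """Hypothetical annotation stand-in; not the pymultirole class."""
    start: int
    end: int
    label: str
    terms: list[tuple[str, str]] = field(default_factory=list)  # (lexicon, id)


def group_by_lexicon(anns):
    """Group by the first term's lexicon ('' = model), merging same-span terms."""
    groups: dict[str, dict[tuple[int, int], Ann]] = {}
    for a in anns:
        lexicon = a.terms[0][0] if a.terms else ""
        by_span = groups.setdefault(lexicon, {})
        seen = by_span.get((a.start, a.end))
        if seen is None:
            terms = list(dict.fromkeys(a.terms))
            by_span[(a.start, a.end)] = Ann(a.start, a.end, a.label, terms)
        else:
            # Merge term lists, keeping first-seen order and dropping duplicates.
            seen.terms = list(dict.fromkeys(seen.terms + a.terms))
    return groups


def consolidate(anns, kill_label=None):
    """Apply kill-list suppression and same-span KB enrichment to model annotations."""
    groups = group_by_lexicon(anns)
    model = groups.pop("", {})
    for by_span in groups.values():
        for span, kb in by_span.items():
            if span not in model:
                continue  # no exact-span model match: skipped in this sketch
            if kb.label == kill_label:
                del model[span]  # kill-list match suppresses the model annotation
            elif kb.label == model[span].label:
                # Same span and label: add the KB terms to the model annotation.
                merged = model[span].terms + kb.terms
                model[span].terms = list(dict.fromkeys(merged))
            # Mismatched labels fall through and leave the model annotation as-is.
    return sorted(model.values(), key=lambda a: (a.start, a.end))


anns = [
    Ann(0, 11, "person"),                           # model annotation
    Ann(0, 11, "person", [("wikidata", "Q7186")]),  # KB annotation, same span
    Ann(20, 25, "org"),                             # model annotation
    Ann(20, 25, "kill", [("stoplist", "k1")]),      # kill-list annotation
]
result = consolidate(anns, kill_label="kill")
print(result)
```

With this input the `person` annotation keeps its span and gains the KB term, while the `org` annotation is removed by the kill-list entry at the same span.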

## Developing

### Prerequisites

[uv](https://github.com/astral-sh/uv) is required as the package manager.

```
pip install uv
```

Clone the repository:

```
git clone https://github.com/oterrier/pyprocessors_reconciliation
cd pyprocessors_reconciliation
```

### Install in development mode

```
uv sync --extra test
```

### Running the test suite

```
uv run pytest
```

### Linting and formatting

```
uv run ruff check .
uv run ruff format .
```

### Building the documentation

```
uv run --extra docs sphinx-build docs docs/_build
```

The built documentation is available at `docs/_build/index.html`.

### Building and publishing

```
uv build
uv publish
```

## SBOM & vulnerability check

Install the SBOM dependencies:

```
uv sync --extra sbom
```

Generate a CycloneDX SBOM from the current environment:

```
uv run cyclonedx-py environment -o sbom.cdx.json --output-format json
```

Audit dependencies for known vulnerabilities:

```
uv run pip-audit --format json --output audit-report.json
```
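
The JSON report can then be summarised with a few lines of Python. This sketch assumes the report has a top-level `dependencies` list whose entries carry `name`, `version` and a `vulns` list with `id` and `fix_versions`; check the layout produced by your pip-audit version before relying on it. `summarize` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize(report: dict) -> list[str]:
    """Return one line per vulnerability found in the report (assumed layout)."""
    lines = []
    for dep in report.get("dependencies", []):
        for vuln in dep.get("vulns", []):
            fixes = ", ".join(vuln.get("fix_versions", [])) or "no fix listed"
            version = dep.get("version", "?")
            lines.append(f"{dep['name']} {version}: {vuln['id']} (fix: {fixes})")
    return lines


path = Path("audit-report.json")
if path.exists():
    for line in summarize(json.loads(path.read_text())):
        print(line)
```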

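`pip-audit` already exits with a non-zero status when it finds a known vulnerability. To also fail when a dependency cannot be collected for auditing (useful in CI):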

```
uv run pip-audit --strict
```
