Metadata-Version: 2.4
Name: span-aligner
Version: 0.2.0
Summary: A utility for aligning and mapping text spans between different text representations.
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<3,>=1.26
Requires-Dist: scipy<2,>=1.13
Requires-Dist: networkx<4,>=3.3
Requires-Dist: rapidfuzz<4,>=3.13
Requires-Dist: regex>=2024.9
Requires-Dist: transformers<6,>=4.41
Requires-Dist: torch<3,>=2.5
Requires-Dist: scikit-learn<2,>=1.7
Requires-Dist: requests<3,>=2.32
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Dynamic: license-file

# Span Projecting & Alignment

A utility for aligning and mapping text spans between different text representations, and projecting annotations across languages using semantic alignment.

## Features

- **Span Alignment**: Sanitize boundaries, fuzzy match segments, map spans between text versions.
- **Span Projection**: Project annotations from a source text (e.g., English) to a target text (e.g., Dutch) using embeddings.

## Installation

Install the package and its dependencies with pip:

```bash
pip install span-aligner
```

## Usage

The package `span_aligner` provides two main classes: `SpanAligner` and `SpanProjector`.

*   **`SpanAligner`**:
    Uses regex and fuzzy search. It is highly efficient but restricted to **monolingual** tasks (same language). It serves as a strong baseline for correcting boundary offsets or mapping annotations between slightly different versions of a text.

*   **`SpanProjector`**:
    Uses **word embeddings** (Transformers) to align tokens semantically. It supports **cross-lingual** projection and handles significant paraphrasing. However, it is computationally more expensive.
    *   *Complexity*: The default `mwmf` (Max Weight Matching) algorithm has **O(n³)** complexity, so execution time grows cubically with text length: doubling the number of tokens makes matching roughly eight times slower.
    *   *Use Case*: Use when languages differ or when textual differences are too great for fuzzy matching.
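
To make the cubic growth concrete, here is a back-of-the-envelope sketch. It assumes the matching cost is proportional to n³ and ignores constant factors and embedding time; `relative_cost` is a hypothetical helper for illustration, not part of the package.

```python
def relative_cost(n_tokens: int, baseline_tokens: int = 100) -> float:
    """Cost of an O(n^3) step relative to a baseline length (constants ignored)."""
    return (n_tokens / baseline_tokens) ** 3


for n in (100, 200, 400, 800):
    print(f"{n:>4} tokens -> {relative_cost(n):>4.0f}x the cost of 100 tokens")
# Prints factors of 1, 8, 64 and 512.
```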

## Optimization & Best Practices

To achieve the best results while managing computational cost, follow these guidelines:

### 1. Choose the Right Tool for the Job
If the source and target texts are in the same language, **always start with `SpanAligner`**. It is significantly faster and produces precise span boundaries. Only switch to `SpanProjector` if fuzzy matching fails due to low textual overlap.

### 2. Manage Text Length (Chunking)
The `SpanProjector` (specifically with `mwmf`) slows down sharply on long sequences because of its O(n³) matching cost.
*   **Split Texts**: Break documents into logical segments (e.g., paragraphs, decisions, list items) before projection.
*   **Project Locally**: Align spans within their corresponding segments rather than projecting a small span against an entire document.
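
One possible way to prepare segments is sketched below. It splits on blank lines and keeps each chunk's character offsets, so spans projected inside a chunk can be shifted back to document coordinates. `split_paragraphs` is a hypothetical helper written with the standard library only; the package may offer its own facilities.

```python
import re
from typing import List, Tuple


def split_paragraphs(text: str) -> List[Tuple[int, int, str]]:
    """Split on blank lines; return (start, end, chunk) with offsets into `text`."""
    chunks = []
    start = 0
    # Treat the end of the text as a final separator.
    separators = list(re.finditer(r"\n\s*\n", text)) + [None]
    for sep in separators:
        end = sep.start() if sep else len(text)
        raw = text[start:end]
        trimmed = raw.strip()
        if trimmed:
            lead = len(raw) - len(raw.lstrip())  # whitespace skipped at the front
            chunks.append((start + lead, start + lead + len(trimmed), trimmed))
        start = sep.end() if sep else end
    return chunks


doc = "First para.\n\nSecond para\nstill second.\n\n\nThird."
for start, end, chunk in split_paragraphs(doc):
    print(start, end, repr(chunk))
```

A span found at offsets `(s, e)` inside a chunk maps back to `(start + s, start + e)` in the full document.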

### 3. Select the Appropriate Algorithm
*   **`mwmf`** (Max Weight Matching): The gold standard. Finds the globally optimal alignment but is slow. Use for final, high-quality output on segmented text.
*   **`inter`** (Intersection): Much faster. Works excellently for **short, distinct spans** (e.g., named entities like persons, locations, dates) where context is less critical.
*   **`itermax`**: A balanced heuristic that offers better speed than `mwmf` with comparable quality for many tasks.

### 4. Translation-Assisted Projection (Hybrid Approach)
If direct cross-lingual projection yields subpar results, consider an intermediate translation step to simplify the alignment task:

1.  **Translate Source**: Use an LLM or NMT model to translate the annotated source text (or just the spans) into the target language.
2.  **Align Locally**: Use `SpanAligner` (or `SpanProjector` with `inter`) to map the *translated* spans onto the *actual* target text.

**Tip**: The translation should mimic the vocabulary of the target text as closely as possible.
*   *Workflow*: `annotated_source` + `target_text` → **LLM** → `rough_translated_source` → **SpanAligner** → `final_annotated_target`
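
The "align locally" step can be pictured with a small stand-in built on the standard library's `difflib`. This is only an illustration of fuzzy window matching under simple assumptions (word-level windows, case-insensitive comparison); `best_fuzzy_window` is a hypothetical helper and not how `SpanAligner` is implemented.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple


def best_fuzzy_window(span: str, text: str, min_ratio: float = 0.7) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, ratio) of the word window in `text` most similar to `span`."""
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    n_words = max(1, len(span.split()))
    best = None
    # Try windows one word shorter, equal to, and one word longer than the span.
    for size in range(max(1, n_words - 1), n_words + 2):
        for i in range(len(words) - size + 1):
            start, end = words[i][0], words[i + size - 1][1]
            ratio = SequenceMatcher(None, span.lower(), text[start:end].lower()).ratio()
            if best is None or ratio > best[2]:
                best = (start, end, ratio)
    if best is None or best[2] < min_ratio:
        return None
    return best


target = "Budget Committee met on 2026-01-12 to view the central park prject."
hit = best_fuzzy_window("the central park project", target)
if hit is not None:
    start, end, ratio = hit
    print(repr(target[start:end]), round(ratio, 2))
```

The closer the translated span's vocabulary is to the target text, the higher the ratio and the less likely the match falls below `min_ratio`.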



### Span Aligner

Utilities for exact and fuzzy span mapping.

#### Get Annotations from Tagged Text

Extract structured spans and entities from a string with inline tags.

```python
from span_aligner import SpanAligner

tagged_input = "<administrative_body>Environmental Committee</administrative_body> discussed the <impact_location>central park</impact_location> renovation on <publication_date>2025-12-15</publication_date>."

ner_map = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "publication_date": "PUBLICATION DATE",
    "impact_location": "PRIMARY LOCATION"
}

span_map = {
    "motivation": "MOTIVATION"
}

annotations = SpanAligner.get_annotations_from_tagged_text(
    tagged_input,
    ner_map=ner_map,
    span_map=span_map
)

print(annotations["entities"])
# Output:
#[
#    {'start': 0, 'end': 23, 'text': 'Environmental Committee', 'labels': ['ADMINISTRATIVE BODY']},
#    {'start': 38, 'end': 50, 'text': 'central park', 'labels': ['PRIMARY LOCATION']},
#    {'start': 65, 'end': 75, 'text': '2025-12-15', 'labels': ['PUBLICATION DATE']}
#]
```
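
The offsets in the output refer to the text with the tags removed. The sketch below reproduces that convention for flat (non-nested) tags using only `re`; `extract_flat_tags` is a hypothetical helper for illustration and does not apply `ner_map` or `span_map`.

```python
import re


def extract_flat_tags(tagged: str):
    """Strip non-nested <tag>...</tag> pairs; offsets refer to the untagged text."""
    plain_parts, entities = [], []
    cursor = 0      # position in the tagged string
    plain_len = 0   # length of the untagged text built so far
    for m in re.finditer(r"<(\w+)>(.*?)</\1>", tagged, flags=re.DOTALL):
        before = tagged[cursor:m.start()]
        inner = m.group(2)
        plain_len += len(before)
        entities.append({
            "start": plain_len,
            "end": plain_len + len(inner),
            "text": inner,
            "labels": [m.group(1)],  # raw tag name, no label mapping
        })
        plain_parts += [before, inner]
        plain_len += len(inner)
        cursor = m.end()
    plain_parts.append(tagged[cursor:])
    return "".join(plain_parts), entities


demo = (
    "<administrative_body>Environmental Committee</administrative_body> discussed the "
    "<impact_location>central park</impact_location> renovation on "
    "<publication_date>2025-12-15</publication_date>."
)
plain, found = extract_flat_tags(demo)
print(plain)
print([(e["start"], e["end"]) for e in found])
# [(0, 23), (38, 50), (65, 75)]
```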

#### Rebuild Tagged Text

Reconstruct a string with XML-like tags from raw text and span/entity lists.

```python
from span_aligner import SpanAligner

text = "On 2026-01-12, the Budget Committee finalized the annual report."
# Character offsets of "Budget Committee" (the preceding "the " is not part of the span)
entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]

tagged, stats = SpanAligner.rebuild_tagged_text(text, entities=entities)
print(tagged)
# Output: On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.
```
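
The idea behind rebuilding can be sketched in a few lines: insert tags from right to left so that earlier offsets stay valid. This assumes non-overlapping entities and uses the first label as the tag name; `insert_tags` is a hypothetical helper, not the library's implementation (which also returns statistics).

```python
def insert_tags(text: str, entities: list) -> str:
    """Wrap each (non-overlapping) entity in <label>...</label>, working right to left."""
    out = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = ent["labels"][0]
        start, end = ent["start"], ent["end"]
        out = f"{out[:start]}<{tag}>{out[start:end]}</{tag}>{out[end:]}"
    return out


sample = "On 2026-01-12, the Budget Committee finalized the annual report."
sample_entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]
print(insert_tags(sample, sample_entities))
# On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.
```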

#### Map Tags to Original

Align annotated spans from a tagged string back to their positions in the original text, allowing for noisy text or translation differences.

```python
from span_aligner import SpanAligner

original_text = "Budget Committee met on 2026-01-12 to view\n\n the central park prject."
tagged_text = "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to review the <impact_location>central park</impact_location> project."

mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)
print(mapped_tagged_text)
# Output preserves original text errors:
# "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to view
#  the <impact_location>central park</impact_location> prject."
```

### Span Projector

Project annotations from one text to another using semantic alignment (e.g., cross-lingual projection).

The projector first generates token embeddings for the source and target texts, then builds a token-to-token similarity matrix, and finally selects alignment pairs from that matrix. Several algorithms are implemented for this matching phase, including `mwmf`, `inter`, `itermax`, `fwd`, `rev`, `greedy`, and `threshold`.
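
The pipeline can be pictured with toy data. The sketch below uses hand-written 2-dimensional vectors in place of real transformer embeddings, computes a cosine similarity matrix, and takes the best target token for each source token. All names here are illustrative assumptions, not the package's internals.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(src_vecs, tgt_vecs):
    """Rows are source tokens, columns are target tokens."""
    return [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]


# Toy "embeddings": 3 source tokens and 3 target tokens.
src_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_vecs = [[0.0, 2.0], [3.0, 0.1], [2.0, 2.0]]

sim = similarity_matrix(src_vecs, tgt_vecs)
# For each source token, pick the most similar target token.
forward = [max(range(len(row)), key=row.__getitem__) for row in sim]
print(forward)
# [1, 0, 2]
```

Here source token 0 aligns to target token 1, source token 1 to target token 0, and source token 2 to target token 2. The matching algorithms differ in how they turn the matrix into pairs.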



#### Project En -> En (Identity/Paraphrase)

Project annotations to a similar text in the same language. This works like `SpanAligner`, but with more tolerant (semantic) matching.

```python
from span_aligner import SpanProjector

# Initialize projector (uses BERT embeddings by default)
projector = SpanProjector(src_lang="en", tgt_lang="en")

src_text = "The <ent>cat</ent> \n\n sat. on the mat."
tgt_text = "The cat sat on the mat."

tagged_tgt, spans = projector.project_tagged_text(src_text, tgt_text)
print(tagged_tgt)
# Output: The <ent>cat</ent> sat on the mat.
```

#### Project En -> Nl (Cross-Lingual)

Project annotations from an English source text to a Dutch target translation.

```python
from span_aligner import SpanProjector

# Initialize projector
projector = SpanProjector(src_lang="en", tgt_lang="nl")

src_text = """DECISION LIST <contextual_location>Municipality of Zele</contextual_location>
 <administrative_body>Standing Committee</administrative_body> | <contextual_date>June 28, 2021</contextual_date>
  <title>1. Acceptance of candidacies for the examination procedure coordinator of Welfare</title>
  <decision>Acceptance of candidacies for the examination procedure coordinator of Welfare</decision>
  <title>2. Establishment of valuation rules for the integrated entity Municipality and Public Social Welfare Center (OCMW)</title>
  <decision>Establishment of valuation rules for the integrated entity Municipality and OCMW</decision>"""

tgt_text = """BESLUITENLIJST Gemeente Zele Vast bureau | 28 juni 20211.
 1. Aanvaarden kandidaturen examenprocedure coördinator Welzijn
 Aanvaarden kandidaturen examenprocedure coördinator Welzijn
 2. Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW
 Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW"""

tagged_tgt, spans = projector.project_tagged_text(src_text, tgt_text)
print(tagged_tgt)
# Output: BESLUITENLIJST <contextual_location>Gemeente Zele</contextual_location>
# <administrative_body>Vast bureau</administrative_body> <contextual_date>| 28 juni 20211</contextual_date>.
# <title>1. Aanvaarden kandidaturen examenprocedure coördinator Welzijn
# Aanvaarden kandidaturen examenprocedure coördinator</title> Welzijn
# <title>2. Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW</title>
# <decision>Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW</decision>

```

### Sentence Aligner

Low-level class for aligning tokens between two texts (sentences or paragraphs) using transformer embeddings. It is based on `simalign`, but optimized for span mapping (partial alignment instead of full-text alignment) and adapted to support different embedding providers (Ollama, SaaS providers, Transformers, Sentence-Transformers).

#### Initialize Aligner

```python
from span_aligner import SentenceAligner

# Use bert embeddings (default) with BPE tokenization
aligner = SentenceAligner(model="bert", token_type="bpe")

text_src = "This is a simple test sentence for alignment."
text_tgt = "Dit is een eenvoudige testzin voor uitlijning."
```

#### Get Text Embeddings

Retrieve tokens and embedding vectors for a string.

```python
tokens_src, vecs_src = aligner.get_text_embeddings(text_src)
print(f"Src tokens: {len(tokens_src)}, Vectors: {vecs_src.shape}")
# Output: Src tokens: 9, Vectors: (10, 768)
```

#### Align Partial Substring

Find the alignment of a specific substring from source to target.

```python
# Align "simple test"
res_sub = aligner.align_texts_partial_substring(text_src, text_tgt, "simple test")
print(f"Src tokens in result: {[t.text for t in res_sub.src_tokens]}")
# Output: Src tokens in result: ['simple', 'test']
```

## Configuration & Advanced Usage

### Embedding Models

The `model` parameter supports common transformer models:

- `"bert"`: `bert-base-multilingual-cased` (Default, robust multilingual performance)
- `"xlmr"`: `xlm-roberta-base` (Strong cross-lingual transfer)
- `"xlmr-large"`: `xlm-roberta-large` (Higher accuracy, more resource intensive)

```python
# Use xlm-roberta-base
projector = SpanProjector(model="xlmr")
```

### Matching Algorithms

The `matching_method` parameter controls how the token similarity matrix is converted into an alignment.

- `"mwmf"` (**Max Weight Matching**): Finds the globally optimal matching (independent edge set) over the similarity matrix. Best quality, O(n³) complexity.
- `"inter"` (**Intersection**): Keeps only the pairs found in both the forward (source-to-target) and reverse (target-to-source) alignments. High precision, lower recall, very fast.
- `"itermax"` (**Iterative Max**): Heuristic iterative maximization. Good speed/quality balance.
- `"greedy"` (**Greedy**): Selects the best remaining match at each step. Fast, but may settle on a local optimum.

```python
# Trade accuracy for speed with 'inter'
projector = SpanProjector(matching_method="inter")
```
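
The trade-offs are easier to see on a tiny similarity matrix. The sketch below is a plain-Python illustration under simplifying assumptions: `inter` intersects forward and reverse argmax choices, `greedy` takes the highest remaining score, and a brute-force search over permutations stands in for a global max-weight matching (feasible only at toy sizes). None of these functions are the package's implementations.

```python
from itertools import permutations


def forward_argmax(sim):
    """Best target column for every source row."""
    return {(i, max(range(len(row)), key=row.__getitem__)) for i, row in enumerate(sim)}


def reverse_argmax(sim):
    """Best source row for every target column."""
    return {(max(range(len(sim)), key=lambda i: sim[i][j]), j) for j in range(len(sim[0]))}


def inter(sim):
    """Pairs chosen in both directions."""
    return forward_argmax(sim) & reverse_argmax(sim)


def greedy(sim):
    """Repeatedly take the highest score whose row and column are still unused."""
    cells = sorted(((s, i, j) for i, row in enumerate(sim) for j, s in enumerate(row)), reverse=True)
    used_rows, used_cols, pairs = set(), set(), set()
    for _, i, j in cells:
        if i not in used_rows and j not in used_cols:
            pairs.add((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return pairs


def best_one_to_one(sim):
    """Brute-force global optimum for a small square matrix (toy sizes only)."""
    n = len(sim)
    best = max(permutations(range(n)), key=lambda p: sum(sim[i][p[i]] for i in range(n)))
    return set(enumerate(best))


toy_sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]
print("inter  ", sorted(inter(toy_sim)))            # [(0, 0), (2, 2)]
print("greedy ", sorted(greedy(toy_sim)))           # [(0, 0), (1, 1), (2, 2)]
print("optimal", sorted(best_one_to_one(toy_sim)))  # [(0, 1), (1, 0), (2, 2)]
```

In this toy case the intersection leaves source token 1 unaligned (high precision, lower recall), the greedy pass locks in `(0, 0)` and is left with a weak `(1, 1)` pair, and the global optimum gives up the single best score to obtain a higher total.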

### Tokenization: BPE vs Word

- `token_type="bpe"` (Recommended): Uses the transformer's subword tokenizer (e.g. WordPiece). Handles rare words better and aligns closer to the model's internal representation.
- `token_type="word"`: Splits by whitespace/punctuation. Simpler, but can result in `[UNK]` tokens for transformers.
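
As a rough picture of word-level splitting, the sketch below separates a sentence into words and punctuation marks with character offsets. The regular expression is an assumption chosen for illustration; the package's own word tokenizer may split differently.

```python
import re
from typing import List, Tuple


def word_tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Split into runs of word characters and single punctuation marks, with offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


sentence = "Dit is een eenvoudige testzin voor uitlijning."
tokens = word_tokenize(sentence)
print([tok for tok, _, _ in tokens])
# ['Dit', 'is', 'een', 'eenvoudige', 'testzin', 'voor', 'uitlijning', '.']
```

A subword tokenizer would typically break longer words such as `uitlijning` into several pieces, which is why token counts differ between the two modes.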
