Metadata-Version: 2.4
Name: span-aligner
Version: 0.1.2
Summary: A utility for aligning and mapping text spans between different text representations.
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rapidfuzz>=3.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Dynamic: license-file

# Span Aligner

A utility for aligning and mapping text spans between different representations of the same text. It is particularly useful for producing annotations compatible with Label Studio.

## Features

- Sanitize span boundaries to avoid special characters.
- Find exact and fuzzy matches of text segments in original documents.
- Map spans from one text representation to another.
- Rebuild tagged text with nested annotations.
- Merge result objects containing span annotations.
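
To make the first feature concrete, here is a minimal sketch of boundary sanitizing, assuming it means trimming whitespace and punctuation from the edges of a span. `sanitize_span` and its `strip_chars` default are hypothetical names chosen for illustration; they are not part of the `SpanAligner` API.

```python
from typing import Tuple


def sanitize_span(text: str, start: int, end: int,
                  strip_chars: str = " \t\r\n.,;:") -> Tuple[int, int]:
    """Shrink [start, end) until neither edge sits on a character in strip_chars."""
    # Move the left edge forward past unwanted characters.
    while start < end and text[start] in strip_chars:
        start += 1
    # Move the right edge backward past unwanted characters.
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end


text = "Decision: the Budget Committee, approved."
# The raw span covers " the Budget Committee," (leading space, trailing comma).
start, end = sanitize_span(text, 9, 31)
print(repr(text[start:end]))
# 'the Budget Committee'
```

Trimming the offsets rather than the substring keeps the span aligned with the source text, which matters when the offsets are later exported to an annotation tool.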

## Installation

Install with pip:

```bash
pip install span-aligner
```


## Usage


### Get Annotations from Tagged Text

Extract structured spans and entities from a string with inline tags.

```python
# Assumed import path (derived from the package name); adjust if SpanAligner lives elsewhere.
from span_aligner import SpanAligner

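# The whole sentence is wrapped in <motivation> so that the span output below is produced.
tagged_input = "<motivation><administrative_body>Environmental Committee</administrative_body> discussed the <impact_location>central park</impact_location> renovation on <publication_date>2025-12-15</publication_date>.</motivation>"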

ner_map = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "publication_date": "PUBLICATION DATE",
    "impact_location": "PRIMARY LOCATION"
}

span_map = {
    "motivation": "MOTIVATION"
}

annotations = SpanAligner.get_annotations_from_tagged_text(
    tagged_input,
    ner_map=ner_map,
    span_map=span_map
)

print(annotations["entities"])
# Output:
#[
#    {'start': 0, 'end': 23, 'text': 'Environmental Committee', 'labels': ['ADMINISTRATIVE BODY']},
#    {'start': 38, 'end': 50, 'text': 'central park', 'labels': ['PRIMARY LOCATION']},
#    {'start': 65, 'end': 75, 'text': '2025-12-15', 'labels': ['PUBLICATION DATE']}
#]

print(annotations["spans"])
# Output:
#[
#    {'start': 0, 'end': 76, 'text': 'Environmental Committee discussed the central park renovation on 2025-12-15.', 'labels': ['MOTIVATION']}
#]


print(annotations["plain_text"])
# Output: "Environmental Committee discussed the central park renovation on 2025-12-15."
```
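
The offsets in the output above refer to the plain text, that is, the input with all tags removed. The following is a minimal sketch of how such offsets can be computed with a regular expression and a stack of open tags. It is an illustration under stated assumptions, not the library's implementation: `strip_tags` and `TAG_RE` are hypothetical names, tags are assumed to have no attributes, and label mapping is left out.

```python
import re
from typing import Dict, List, Tuple

# Matches <name> and </name>; attributes are not supported in this sketch.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")


def strip_tags(tagged: str) -> Tuple[str, List[Dict]]:
    """Return the plain text and the tagged regions with plain-text offsets."""
    plain_parts: List[str] = []
    found: List[Dict] = []
    open_tags: List[Tuple[str, int]] = []  # stack of (tag name, start offset)
    pos = 0     # read position in the tagged string
    offset = 0  # length of the plain text produced so far
    for m in TAG_RE.finditer(tagged):
        chunk = tagged[pos:m.start()]
        plain_parts.append(chunk)
        offset += len(chunk)
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if not closing:
            open_tags.append((name, offset))
        elif open_tags and open_tags[-1][0] == name:
            _, start = open_tags.pop()
            found.append({"start": start, "end": offset, "tag": name})
        # Closing tags that do not match the innermost open tag are ignored.
    plain_parts.append(tagged[pos:])
    plain = "".join(plain_parts)
    for item in found:
        item["text"] = plain[item["start"]:item["end"]]
    # Outer regions first, then inner ones.
    found.sort(key=lambda d: (d["start"], -d["end"]))
    return plain, found


plain, regions = strip_tags(
    "<motivation>The <administrative_body>Budget Committee"
    "</administrative_body> met.</motivation>"
)
print(plain)
# The Budget Committee met.
for region in regions:
    print(region["tag"], region["start"], region["end"])
# motivation 0 25
# administrative_body 4 20
```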

### Rebuild Tagged Text

Reconstruct a string with XML-like tags from raw text and span/entity lists.

```python
text = "On 2026-01-12, the Budget Committee finalized the annual report."
# Spans carry the tag name ('motivation', the tag used for the 'MOTIVATION' label)
spans = [{"start": 0, "end": 64, "labels": ["motivation"]}]
# Entities carry the tag name ('administrative_body', the tag used for 'ADMINISTRATIVE BODY');
# offsets 19-35 cover exactly "Budget Committee"
entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]

tagged, stats = SpanAligner.rebuild_tagged_text(text, spans, entities)
print(tagged)
# Output: <motivation>On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.</motivation>
```
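
Rebuilding nested tags comes down to emitting opening and closing markup in the right order at each offset. The sketch below shows one way to do that with sorted events. `insert_tags` is a hypothetical helper, not the library's `rebuild_tagged_text`, and it assumes non-empty annotations that are either disjoint or strictly nested.

```python
from typing import Dict, List


def insert_tags(text: str, annotations: List[Dict]) -> str:
    """Wrap each annotated region in <tag>...</tag>, using labels[0] as the tag."""
    events = []
    for ann in annotations:
        tag = ann["labels"][0]
        length = ann["end"] - ann["start"]
        # Sort order at the same offset:
        #   closing tags (kind 0) before opening tags (kind 1);
        #   longer regions open first, shorter regions close first.
        events.append((ann["start"], 1, -length, f"<{tag}>"))
        events.append((ann["end"], 0, length, f"</{tag}>"))
    events.sort()

    out, pos = [], 0
    for offset, _kind, _key, markup in events:
        out.append(text[pos:offset])
        out.append(markup)
        pos = offset
    out.append(text[pos:])
    return "".join(out)


sentence = "On 2026-01-12, the Budget Committee finalized the annual report."
body_start = sentence.index("Budget Committee")
result = insert_tags(sentence, [
    {"start": 0, "end": len(sentence), "labels": ["motivation"]},
    {"start": body_start, "end": body_start + len("Budget Committee"),
     "labels": ["administrative_body"]},
])
print(result)
# <motivation>On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.</motivation>
```

Computing the entity offsets with `str.index` rather than typing them by hand avoids off-by-a-word errors in examples like this one.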

### Rebuild Tagged Text from Task

Generate tagged text directly from a Label Studio task object.

```python
# Assuming 'task' is a Label Studio task object (or similar structure)
# with .data['text'] and .annotations attributes
mapping = {
    "DECISION": "decision",
    "LEGAL FRAMEWORK": "legal_framework",
    "EXPIRATION DATE": "expiry_date"
}

tagged_output = SpanAligner.rebuild_tagged_text_from_task(task, mapping)
print(tagged_output)
```
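
For orientation, the sketch below flattens a task in the Label Studio JSON export layout (`data`, `annotations`, `result`, `value`) into the `start`/`end`/`labels` dictionaries used in the previous section. It does not call the library, and whether `rebuild_tagged_text_from_task` accepts a plain dictionary like this one is an assumption that has not been checked. `flatten_task` and the sample task are hypothetical.

```python
from typing import Dict, List


def flatten_task(task: Dict, mapping: Dict[str, str]) -> List[Dict]:
    """Collect labeled regions from every annotation, renaming labels to tag names."""
    regions = []
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            value = result.get("value", {})
            for label in value.get("labels", []):
                if label in mapping:  # labels without a mapping are skipped
                    regions.append({
                        "start": value["start"],
                        "end": value["end"],
                        "labels": [mapping[label]],
                    })
    return regions


# Hypothetical task shaped like a Label Studio JSON export entry.
sample_task = {
    "data": {"text": "The decision expires on 2027-01-01."},
    "annotations": [{
        "result": [
            {"type": "labels",
             "value": {"start": 4, "end": 12, "text": "decision",
                       "labels": ["DECISION"]}},
            {"type": "labels",
             "value": {"start": 24, "end": 34, "text": "2027-01-01",
                       "labels": ["EXPIRATION DATE"]}},
        ]
    }],
}
label_to_tag = {"DECISION": "decision", "EXPIRATION DATE": "expiry_date"}

for region in flatten_task(sample_task, label_to_tag):
    print(region)
# {'start': 4, 'end': 12, 'labels': ['decision']}
# {'start': 24, 'end': 34, 'labels': ['expiry_date']}
```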

### Map Tags to Original

Align annotated spans from a tagged string back to their positions in the original text. The output keeps the original text exactly as written, including its typos and formatting; only the tags are transferred.

```python
original_text = "Budget Budget Committee met on 2026-01-12 to view\n\n the central park prject."
# The tagged text is a slightly modified copy of the original (e.g. corrected or translated) that carries the tags
tagged_text = "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to review the <impact_location>central park</impact_location> project."

mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)
print(mapped_tagged_text)
# Output might look like: "Budget <administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to view\n\n the <impact_location>central park</impact_location> prject."
```
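
The role of `min_ratio` can be illustrated with a small fuzzy search. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for the package's `rapidfuzz` dependency, and a simple sliding window over whitespace-separated tokens. `fuzzy_find` is a hypothetical helper; the library's actual matching strategy may differ.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple


def fuzzy_find(segment: str, original: str,
               min_ratio: float = 0.7) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, ratio) of the best-matching token window, or None."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", original)]
    n = len(segment.split())
    best = None
    for i in range(len(tokens)):
        # Try windows one token shorter, equal, and one token longer than the segment.
        for width in (n - 1, n, n + 1):
            j = i + width
            if width < 1 or j > len(tokens):
                continue
            start, end = tokens[i][0], tokens[j - 1][1]
            ratio = SequenceMatcher(None, segment, original[start:end]).ratio()
            if ratio >= min_ratio and (best is None or ratio > best[2]):
                best = (start, end, ratio)
    return best


noisy = "Budget Committee met to view\n\n the central park prject today."
match = fuzzy_find("central park project", noisy)
if match is not None:
    start, end, _ratio = match
    print(repr(noisy[start:end]))
# 'central park prject'
```

The returned offsets point into the noisy original, so the typo "prject" is preserved while the tag can still be placed around it. Candidates scoring below `min_ratio` are discarded, in which case the function returns `None`.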



### Map Tags to Original and Get Positions

Combine both steps: map the tags onto the original text, then extract the entities with their labels and positions in the original.

```python
original_text = "Legal basis: Art. 5. The Env. Committee met on 2026-01-12."
tagged_text = "Legal basis: <article>Art. 5</article>. The <administrative_body>Environmental Committee</administrative_body> met on <session_date>2026-01-12</session_date>."

# 1. Map tags to the noisy original text
mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)

# 2. Extract annotations using the mapping
ner_label_mapping = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "session_date": "SESSION DATE",
    "article": "ARTICLE"
}

annotations = SpanAligner.get_annotations_from_tagged_text(
    mapped_tagged_text,
    ner_map=ner_label_mapping
)

print(annotations["entities"])
# Output:
# [
#  {'start': 13, 'end': 19, 'text': 'Art. 5', 'labels': ['ARTICLE']},
#  {'start': 47, 'end': 57, 'text': '2026-01-12', 'labels': ['SESSION DATE']}
# ]
```



