Metadata-Version: 2.4
Name: span-aligner
Version: 0.1.0
Summary: A utility for aligning and mapping text spans between different text representations.
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rapidfuzz>=3.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Dynamic: license-file

# Span Aligner

A utility for aligning and mapping text spans between different text representations, particularly useful for Label Studio annotation compatibility.

## Features

- Sanitize span boundaries so that spans do not start or end on special characters.
- Find exact and fuzzy matches of text segments in original documents.
- Map spans from one text representation to another.
- Rebuild tagged text with nested annotations.
- Merge result objects containing span annotations.
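
To make the first feature concrete, here is a minimal sketch of boundary sanitization. It is not the package's implementation: the helper name `trim_span` is hypothetical, and it assumes that "special characters" means whitespace and ASCII punctuation.

```python
import string
from typing import Tuple

# Assumption: "special characters" = whitespace and ASCII punctuation.
_EDGE_CHARS = set(string.whitespace) | set(string.punctuation)


def trim_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink the half-open span [start, end) until neither edge is an edge character."""
    while start < end and text[start] in _EDGE_CHARS:
        start += 1
    while end > start and text[end - 1] in _EDGE_CHARS:
        end -= 1
    return start, end


text = "Hello, World!"
start, end = trim_span(text, 5, 13)  # the raw span covers ", World!"
print(start, end, text[start:end])   # 7 12 World
```

A span made up entirely of edge characters collapses to an empty span in this sketch; a real implementation may prefer to drop or flag such spans instead.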

## Installation

Install from source:

```bash
pip install .
```

For development:

```bash
pip install -e ".[dev]"
```

## Usage

```python
from span_aligner import SpanAligner

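original = "Hello, World!"

# Annotation result whose span offsets should be aligned against `original`
result_obj = {
    "spans": [{"start": 0, "end": 5, "text": "Hello", "labels": ["greeting"]}],
    "entities": [],
    "task": {"data": {"text": ""}}
}

# Unpacks into a success flag and the result object with mapped spans
success, mapped = SpanAligner.map_spans_to_original(original, result_obj)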
print(mapped)
```

### Map Tags to Original

Align annotated spans from a tagged string back to their positions in the original text. The output keeps the original text exactly as written, including any typos and formatting, and only adds the tags.

```python
from span_aligner import SpanAligner

original_text = "The quick brown fox jumps\n\n over the dog."
# Imagine the text was slightly modified or translated, but tags are present
tagged_text = "The <adj>quick</adj> brown fox jumps over the <animal>dog</animal>."

mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.8
)
print(mapped_tagged_text)
# Output might look like: "The <adj>quick</adj> brown fox jumps\n\n over the <animal>dog</animal>."
# (If original text differed slightly, tags would be placed on best matching spans)
```
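
The idea of placing a tag on the best-matching span can be illustrated with a small standalone sketch. It is not the package's algorithm: it uses the standard library's `difflib` rather than `rapidfuzz`, the helper name `find_best_window` is hypothetical, and it only considers windows of the same length as the search text. The `min_ratio` parameter mirrors the threshold shown above.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def find_best_window(
    original: str, needle: str, min_ratio: float = 0.8
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the window in `original` most similar to `needle`.

    Returns None if no window reaches `min_ratio`.
    """
    width = len(needle)
    if width == 0 or width > len(original):
        return None
    best_ratio, best_start = 0.0, -1
    # Slide a window of len(needle) characters across the original text.
    for start in range(len(original) - width + 1):
        window = original[start:start + width]
        ratio = SequenceMatcher(None, needle, window, autojunk=False).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    if best_ratio < min_ratio:
        return None
    return best_start, best_start + width


original = "The quick brwon fox jumps"      # typo in the original is preserved
match = find_best_window(original, "brown")  # text as it appears inside a tag
if match is not None:
    start, end = match
    print(original[start:end])               # brwon
```

Scanning every window is quadratic in the worst case, so this is only suitable for short texts; it is meant to show the thresholding logic, not to be fast.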

### Rebuild Tagged Text

Reconstruct a string with XML-like tags from raw text and span/entity lists.

```python
from span_aligner import SpanAligner

text = "Hello World"
spans = [{"start": 0, "end": 11, "labels": ["sentence"]}]
entities = [{"start": 6, "end": 11, "labels": ["location"]}]

tagged, stats = SpanAligner.rebuild_tagged_text(text, spans, entities)
print(tagged)
# Output: <sentence>Hello <location>World</location></sentence>
```
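
For readers curious how nested tags can be produced from offset lists, the following is a minimal sketch under stated assumptions, not the package's implementation. The helper `wrap_with_tags` is hypothetical, it uses only the first label of each annotation, and it assumes spans are properly nested (no partially overlapping spans).

```python
from typing import Dict, List


def wrap_with_tags(text: str, annotations: List[Dict]) -> str:
    """Wrap each annotation in <label>...</label>; spans must nest without crossing."""
    events = []  # (position, kind, tie_breaker, order, tag)
    for i, ann in enumerate(annotations):
        start, end, label = ann["start"], ann["end"], ann["labels"][0]
        length = end - start
        # kind 0 = closing tag, kind 1 = opening tag, so at a shared position
        # closing tags are emitted before opening tags.
        events.append((end, 0, length, -i, f"</{label}>"))   # inner spans close first
        events.append((start, 1, -length, i, f"<{label}>"))  # outer spans open first
    events.sort(key=lambda e: e[:4])

    pieces, cursor = [], 0
    for position, _kind, _tie, _order, tag in events:
        pieces.append(text[cursor:position])  # untouched text before this tag
        pieces.append(tag)
        cursor = position
    pieces.append(text[cursor:])
    return "".join(pieces)


annotations = [
    {"start": 0, "end": 11, "labels": ["sentence"]},
    {"start": 6, "end": 11, "labels": ["location"]},
]
print(wrap_with_tags("Hello World", annotations))
# <sentence>Hello <location>World</location></sentence>
```

Sorting by position with length-based tie-breakers is what keeps the output well-formed when several tags open or close at the same offset.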

### Rebuild Tagged Text from Task

Generate tagged text directly from a Label Studio task object.

```python
from span_aligner import SpanAligner

# `task` is assumed to be a Label Studio task object (or a similar structure)
# exposing .data['text'] and .annotations
mapping = {"Location": "loc", "Person": "per"}

tagged_output = SpanAligner.rebuild_tagged_text_from_task(task, mapping)
print(tagged_output)
```
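
Since `task` is not defined in the snippet above, the sketch below shows one shape such an object might take and how a label mapping could be applied to it. Everything here is an assumption for illustration: the stand-in is built with `SimpleNamespace`, the annotation layout follows Label Studio's JSON export format (`result` items with a `value` holding `start`, `end`, `text`, and `labels`), and `collect_spans` is a hypothetical helper, not part of the package. The structure the library actually expects may differ.

```python
from types import SimpleNamespace
from typing import Dict, List

# Hypothetical stand-in for a task object with .data['text'] and .annotations.
task = SimpleNamespace(
    data={"text": "Ada lives in Paris."},
    annotations=[
        {
            "result": [
                {"type": "labels",
                 "value": {"start": 0, "end": 3, "text": "Ada", "labels": ["Person"]}},
                {"type": "labels",
                 "value": {"start": 13, "end": 18, "text": "Paris", "labels": ["Location"]}},
            ]
        }
    ],
)


def collect_spans(task, mapping: Dict[str, str]) -> List[Dict]:
    """Flatten annotation results into spans, renaming labels via `mapping`."""
    spans = []
    for annotation in task.annotations:
        for item in annotation.get("result", []):
            value = item.get("value", {})
            for label in value.get("labels", []):
                spans.append({
                    "start": value["start"],
                    "end": value["end"],
                    # Labels missing from the mapping are kept unchanged.
                    "labels": [mapping.get(label, label)],
                })
    return spans


mapping = {"Location": "loc", "Person": "per"}
print(collect_spans(task, mapping))
# [{'start': 0, 'end': 3, 'labels': ['per']}, {'start': 13, 'end': 18, 'labels': ['loc']}]
```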

### Get Annotations from Tagged Text

Extract structured spans and entities from a string with inline tags.

```python
from span_aligner import SpanAligner

tagged_input = "Visit <loc>Paris</loc> and see the <landmark>Eiffel Tower</landmark>."

annotations = SpanAligner.get_annotations_from_tagged_text(
    tagged_input,
    ner_map={"loc": "Location", "landmark": "Location"}
)

print(annotations["entities"])
# Output:
# [
#   {"start": 6, "end": 11, "text": "Paris", "labels": ["Location"]},
#   {"start": 24, "end": 36, "text": "Eiffel Tower", "labels": ["Location"]}
# ]
print(annotations["plain_text"])
# Output: "Visit Paris and see the Eiffel Tower."
```
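
The offsets in the output above are measured in the plain text, after the tags are removed. A minimal standalone sketch of that bookkeeping follows. It is not the package's implementation: `strip_tags` is a hypothetical helper, it recognizes only simple `<name>` and `</name>` tags without attributes, and it reports raw tag names as labels instead of applying an `ner_map`.

```python
import re
from typing import Dict, List, Tuple

# Assumption: tags are plain <name> / </name> pairs without attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")


def strip_tags(tagged: str) -> Tuple[str, List[Dict]]:
    """Return (plain_text, annotations) with offsets measured in plain_text."""
    plain_parts: List[str] = []
    plain_len = 0                           # length of plain text emitted so far
    cursor = 0                              # read position in the tagged string
    open_tags: List[Tuple[str, int]] = []   # stack of (tag name, plain-text start)
    found: List[Dict] = []

    for match in TAG_RE.finditer(tagged):
        chunk = tagged[cursor:match.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        cursor = match.end()
        is_close, name = match.group(1) == "/", match.group(2)
        if not is_close:
            open_tags.append((name, plain_len))
        elif open_tags and open_tags[-1][0] == name:
            _, start = open_tags.pop()
            found.append({"start": start, "end": plain_len, "labels": [name]})
        # Closing tags that do not match the innermost open tag are dropped.

    plain_parts.append(tagged[cursor:])
    plain = "".join(plain_parts)
    for ann in found:
        ann["text"] = plain[ann["start"]:ann["end"]]
    found.sort(key=lambda a: (a["start"], -a["end"]))  # outer spans first
    return plain, found


plain, anns = strip_tags(
    "Visit <loc>Paris</loc> and see the <landmark>Eiffel Tower</landmark>."
)
print(plain)  # Visit Paris and see the Eiffel Tower.
for ann in anns:
    print(ann["start"], ann["end"], ann["text"], ann["labels"])
# 6 11 Paris ['loc']
# 24 36 Eiffel Tower ['landmark']
```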
