Metadata-Version: 2.4
Name: flattentei
Version: 0.1.9
Summary: Transform TEI XML to a simple standoff format
Project-URL: Homepage, https://github.com/ottowg/flatten-tei
Author-email: Wolf Otto <wolfgang.otto@gesis.org>
License: BSD-2-Clause
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.10
Requires-Dist: lxml>=4.9.1
Requires-Dist: nltk>=3.8
Description-Content-Type: text/markdown

# flattentei

Convert TEI XML documents to plain text with standoff annotations — a simple, pipeline-friendly format for NLP workflows.

## What it does

**flattentei** reads TEI XML files and produces:

- A plain **text** string of the full document
- **Span annotations** (begin/end offsets into the text) for structural elements like paragraphs, sentences, section headings, references, and figures
- Structured **metadata** (authors, DOI, journal, affiliations, ORCID, …)
- A list of **figures** with captions
- Typed **`Doc` objects** that support sentence splitting, span lookup, and relation attachment for downstream NLP pipelines

The standoff JSON format (`flatdoc`) keeps text and annotations strictly separated, which makes it easy to feed into annotation tools, relation extraction pipelines, or fine-tuning workflows.
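
To make the separation concrete, here is a minimal sketch of a flatdoc-like dict built by hand. The `text` and `annotations` keys and the `begin`/`end` fields follow the descriptions in this README; grouping annotations by span type, the sample text, and the helper name `covered_text` are assumptions made for illustration, not parser output.

```python
# Hand-built, illustrative flatdoc-like structure (not real parser output).
flatdoc = {
    "text": "Detuned Resonances\nWe study wave triads. See Fig. 1.",
    "annotations": {
        "Head": [{"begin": 0, "end": 18}],
        "Paragraph": [{"begin": 19, "end": 52}],
    },
}


def covered_text(flatdoc, span):
    """Return the substring of the document text that a span points at."""
    return flatdoc["text"][span["begin"]:span["end"]]


head = covered_text(flatdoc, flatdoc["annotations"]["Head"][0])
para = covered_text(flatdoc, flatdoc["annotations"]["Paragraph"][0])
print(head)
print(para)
```

Because the text is never modified, any number of annotation layers can be added or dropped without invalidating the offsets of the others.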

---

## Supported TEI dialects

| Dialect key | Description |
|-------------|-------------|
| `"tei_wdm"` | WDM / ULB Darmstadt TEI — journal articles converted from JATS |

The original GROBID-based parser (`transform_xml`) is still available for backwards compatibility.

---

## Installation

```bash
pip install flattentei
```

Requires Python ≥ 3.10.

---

## Quick start

### Parse a WDM TEI file

```python
import flattentei

doc = flattentei.parse_xml("article.xml", dialect="tei_wdm")

# unpack the three main outputs
text, annotations, metadata = doc

print(doc.doc_id)               # e.g. "jz000102-0007"
print(doc.metadata["title"])    # "Detuned Resonances"
print(doc.metadata["authors"])  # [{"surname": "Colyer", "forename": "Greg", ...}, ...]
print(doc.metadata["doi"])      # "10.3390/fluids7090297"
```

`parse_xml` also accepts raw bytes:

```python
from pathlib import Path
doc = flattentei.parse_xml(Path("article.xml").read_bytes(), dialect="tei_wdm")
```

---

### Work with sentences and spans

`doc.sentences` returns a list of `Sentence` objects. If the XML contains sentence markup, that markup is used directly; otherwise NLTK `sent_tokenize` is applied paragraph by paragraph.

```python
for sent in doc.sentences:
    print(sent.sentence_idx, sent.text)

    for span in sent.spans:
        # span.span_type e.g. "ReferenceToBib", "Paragraph", …
        # span.begin / span.end  — offsets within the sentence
        # span.begin_in_doc / span.end_in_doc — offsets in the full document
        print(f"  [{span.span_type}] {span.text!r}")
```
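
The paragraph-by-paragraph fallback can be sketched as follows. This is not the library's implementation: a naive regex stands in for NLTK `sent_tokenize` so the example runs without NLTK data, and the function name `split_sentences` is hypothetical. The point it illustrates is that sentences never cross paragraph boundaries and that paragraph-relative offsets are shifted back into document offsets.

```python
import re

# Naive stand-in for nltk.sent_tokenize: a sentence runs up to and
# including ., ! or ? followed by whitespace, or to the paragraph end.
SENTENCE_RE = re.compile(r"\S.*?(?:[.!?](?=\s|$)|$)", flags=re.S)


def split_sentences(text, paragraphs):
    """Yield (begin, end) document offsets of sentences, one paragraph at a time."""
    for para in paragraphs:
        para_text = text[para["begin"]:para["end"]]
        for match in SENTENCE_RE.finditer(para_text):
            # Shift paragraph-relative offsets into document offsets.
            yield para["begin"] + match.start(), para["begin"] + match.end()


text = "Waves interact. Triads form!\nEnergy is exchanged."
paragraphs = [{"begin": 0, "end": 28}, {"begin": 29, "end": 49}]

sentence_offsets = list(split_sentences(text, paragraphs))
for begin, end in sentence_offsets:
    print(begin, end, text[begin:end])
```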

---

### Export to flatdoc JSON

The `to_json()` method returns a dict compatible with the original flatdoc format (`{"text": …, "annotations": …}`):

```python
import json

flat = doc.to_json()

# use a context manager so the file is flushed and closed
with open("article.json", "w", encoding="utf-8") as f:
    json.dump(flat, f)
```

---

### Access span annotations directly

```python
# all paragraph offsets
for para in doc.spans["Paragraph"]:
    print(doc.text[para["begin"]:para["end"]])

# all in-text citation spans with their targets
for ref in doc.spans.get("ReferenceToBib", []):
    print(ref["target"], doc.text[ref["begin"]:ref["end"]])
```

---

### Attach relations (NLP pipeline output)

`Relation` connects two `Span` objects with an optional label and confidence score. It is designed to hold the output of entity and relation extraction models.

```python
from flattentei import Relation

sents = doc.sentences
subj = sents[2].spans[0]
obj  = sents[2].spans[1]

# attach to a sentence
sents[2].relations.append(Relation(subject=subj, object=obj, label="cites", score=0.91))

# or to the whole document
doc.relations.append(Relation(subject=subj, object=obj, label="authored_by"))
```

---

### Load existing flatdoc JSON files

```python
import json
from flattentei import get_units

with open("article.json") as f:
    flatdoc = json.load(f)

# extract sentences with their text
sentences = get_units("Sentence", flatdoc)

# extract entities enriched with the surrounding sentence text
entities = get_units("Entity", flatdoc, enrich_container=["Sentence"])
for ent in entities:
    print(ent["text"], ent["container"]["Sentence"]["text"])
```
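
For intuition, container enrichment can be thought of as an offset containment check. The sketch below is a hypothetical re-implementation of that idea, not the code behind `get_units`: the function name `enrich_with_container` and the sample data are assumptions, and only the output shape (`ent["container"]["Sentence"]["text"]`) is taken from the example above.

```python
def enrich_with_container(units, containers, text, name="Sentence"):
    """Attach the first container span that fully encloses each unit."""
    enriched = []
    for unit in units:
        item = dict(unit, text=text[unit["begin"]:unit["end"]])
        for cont in containers:
            # A container encloses a unit if it starts no later and ends no earlier.
            if cont["begin"] <= unit["begin"] and unit["end"] <= cont["end"]:
                item["container"] = {
                    name: dict(cont, text=text[cont["begin"]:cont["end"]])
                }
                break
        enriched.append(item)
    return enriched


text = "GROBID parses PDFs. NLTK splits sentences."
sentences = [{"begin": 0, "end": 19}, {"begin": 20, "end": 42}]
entities = [{"begin": 0, "end": 6}, {"begin": 20, "end": 24}]

result = enrich_with_container(entities, sentences, text)
for ent in result:
    print(ent["text"], "|", ent["container"]["Sentence"]["text"])
```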

---

### Batch convert a folder of TEI XML files

```bash
flatten-tei-folder --source ./xml_files --target ./output
```

Or from Python:

```python
from flattentei.tei_to_text_and_standoff import transform_xml_folder
from pathlib import Path

transform_xml_folder(Path("xml_files"), Path("output"))
```

---

## Data model

```
Doc
├── doc_id: str
├── text: str
├── spans: dict[str, list[dict]]   # {"Paragraph": [{begin, end, idx, …}, …], …}
├── metadata: dict                  # title, authors, doi, journal, …
├── figures: list[dict]             # id, head, label, url
├── relations: list[Relation]
└── sentences → list[Sentence]     # property, computed on access

Sentence
├── doc_id, sentence_id, sentence_idx
├── text, begin_idx
├── spans: list[Span]
└── relations: list[Relation]

Span
├── doc_id, text, span_type
├── begin, end                     # relative to parent container
└── begin_in_doc, end_in_doc

Relation
├── subject: Span
├── object: Span
├── label: str | None
└── score: float | None
```
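
The two offset pairs on `Span` can be related with a small sketch. It assumes that the relative offsets are the document offsets minus the start offset of the parent sentence (presumably what `begin_idx` holds); the README does not spell this out. `SpanSketch` and `make_span` are hypothetical names, not classes or functions from the package.

```python
from dataclasses import dataclass


@dataclass
class SpanSketch:
    text: str
    span_type: str
    begin: int         # relative to the parent sentence
    end: int
    begin_in_doc: int  # relative to the full document text
    end_in_doc: int


def make_span(doc_text, sentence_begin, begin_in_doc, end_in_doc, span_type):
    """Build a span whose relative offsets are derived from document offsets."""
    return SpanSketch(
        text=doc_text[begin_in_doc:end_in_doc],
        span_type=span_type,
        begin=begin_in_doc - sentence_begin,
        end=end_in_doc - sentence_begin,
        begin_in_doc=begin_in_doc,
        end_in_doc=end_in_doc,
    )


doc_text = "Intro text. Waves interact [1]."
sentence_begin = 12  # assumed start offset of the second sentence
sentence_text = doc_text[sentence_begin:]

span = make_span(doc_text, sentence_begin, 27, 30, "ReferenceToBib")
print(span.text, span.begin, span.end)
```

Under this assumption both slices, `sentence_text[span.begin:span.end]` and `doc_text[span.begin_in_doc:span.end_in_doc]`, yield the same string.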

### Span types produced by the WDM parser

| Type | Description |
|------|-------------|
| `Abstract` | Abstract section |
| `Div` | Section div (with optional `id`) |
| `Head` | Section heading (with optional `n`, `id`) |
| `Paragraph` | Paragraph |
| `ReferenceToBib` | In-text citation (with `target` = bib entry id) |
| `ReferenceToFigure` | In-text figure reference |
| `ReferenceToSection` | In-text section cross-reference |
| `ReferenceString` | Full formatted reference entry |
| `SectionHeader` | Title + abstract region |
| `SectionMain` | Body text region |
| `SectionFootnote` | Back-matter notes region |
| `SectionReference` | Reference list region |
