Metadata-Version: 2.4
Name: flattentei
Version: 0.2.0
Summary: Transform TEI XML to a simple standoff format
Project-URL: Homepage, https://github.com/ottowg/flatten-tei
Author-email: Wolf Otto <wolfgang.otto@gesis.org>
License: PolyForm Noncommercial License 1.0.0
        
        Copyright (c) 2024 Wolf Otto
        
        Acceptance
        
        In order to get any license under these terms, you must agree to
        them as both strict obligations and conditions to all your licenses.
        
        Copyright License
        
        The licensor grants you a copyright license for the software to do
        everything you might do with the software that would otherwise infringe
        the licensor's copyright in it for any permitted purpose. However, you
        may only distribute the software according to Distribution License and
        make changes or new works based on the software according to Changes
        and New Works License.
        
        Distribution License
        
        The licensor grants you an additional copyright license to distribute
        copies of the software. Your license to distribute covers distributing
        the software with changes and new works permitted by Changes and New
        Works License.
        
        Notices
        
        You must ensure that anyone who gets a copy of any part of the software
        from you also gets a copy of these terms or the URL for them above, as
        well as copies of any plain-text lines beginning with Required Notice:
        that the licensor provided with the software. For example:
        
            Required Notice: Copyright (c) 2024 Wolf Otto
        
        Changes and New Works License
        
        The licensor grants you an additional copyright license to make changes
        and new works based on the software for any permitted purpose.
        
        Patent License
        
        The licensor grants you a patent license for the software that covers
        patent claims the licensor can license, or becomes able to license, that
        you would infringe by using the software.
        
        Noncommercial Purposes
        
        Any noncommercial purpose is a permitted purpose.
        
        Personal Uses
        
        Personal use for research, experiment, and testing for the benefit of
        public knowledge, personal study, private entertainment, hobby projects,
        amateur pursuits, or religious observance, without any anticipated
        commercial application, is use for a permitted purpose.
        
        Noncommercial Organizations
        
        Use by any charitable organization, educational institution, public
        research organization, public safety or health organization,
        environmental protection organization, or government institution is use
        for a permitted purpose regardless of the source of funding or
        obligations resulting from the funding.
        
        Fair Use
        
        You may have "fair use" rights for the software under the law. These
        terms do not limit them.
        
        No Other Rights
        
        These terms do not allow you to sublicense or transfer any of your
        licenses to anyone else, or prevent the licensor from granting licenses
        to anyone else. These terms do not imply any other licenses.
        
        Patent Defense
        
        If you make any written claim that the software infringes or contributes
        to infringement of any patent, your patent license for the software
        granted under these terms ends immediately. If your employer makes such
        a claim, your patent license ends immediately for work on behalf of your
        employer.
        
        Violations
        
        The first time you are notified in writing that you have violated any of
        these terms, or done anything with the software not covered by your
        licenses, you have 30 days to come into compliance. If you do not do so,
        your license ends immediately.
        
        No Liability
        
        As far as the law allows, the software comes as is, without any warranty
        or condition, and the licensor will not be liable to you for any damages
        arising out of these terms or the use or nature of the software, under
        any kind of legal claim.
        
        Definitions
        
        The licensor is the individual or entity offering these terms, and the
        software is the software the licensor makes available under these terms.
        
        You refers to the individual or entity agreeing to these terms.
        
        Your company is any legal entity, sole proprietorship, or other kind of
        organization that you work for, plus all organizations that have
        control over, are under the control of, or are under common control with
        that organization. Control means ownership of substantially all the
        assets of an entity, or the power to direct its management and policies
        by vote, contract, or otherwise. Control can be direct or indirect.
        
        Your licenses are all the licenses granted to you for the software under
        these terms.
        
        Use means anything you do with the software requiring one of your
        licenses.
License-File: LICENSE
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.10
Requires-Dist: lxml>=4.9.1
Requires-Dist: nltk>=3.8
Description-Content-Type: text/markdown

# flattentei

Convert TEI XML documents to plain text with standoff annotations — a simple, pipeline-friendly format for NLP workflows.

## What it does

**flattentei** reads TEI XML files and produces:

- A plain **text** string of the full document
- **Span annotations** (begin/end offsets into the text) for structural elements like paragraphs, sentences, section headings, references, and figures
- Structured **metadata** (authors, DOI, journal, affiliations, ORCID, …)
- A list of **figures** with captions
- Typed **`Doc` objects** that support sentence splitting, span lookup, and relation attachment for downstream NLP pipelines

The standoff JSON format (`flatdoc`) keeps text and annotations strictly separated, which makes it easy to feed into annotation tools, relation extraction pipelines, or fine-tuning workflows.
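
As a rough illustration of what "strictly separated" means, the sketch below builds a tiny flatdoc by hand and recovers the annotated text by slicing. Only the top-level `text` and `annotations` keys come from this README; the layout of `annotations` (a dict keyed by span type, mirroring `doc.spans`) and the helper `covered_text` are assumptions made for this example.

```python
# Hand-written, hypothetical flatdoc. The "annotations" layout is assumed
# to mirror doc.spans: {span_type: [{"begin": ..., "end": ...}, ...]}.
flatdoc = {
    "text": "Detuned Resonances\nWaves interact weakly.",
    "annotations": {
        "Head": [{"begin": 0, "end": 18}],
        "Paragraph": [{"begin": 19, "end": 41}],
    },
}


def covered_text(flatdoc, span_type):
    """Return the text covered by each annotation of the given type."""
    text = flatdoc["text"]
    return [
        text[ann["begin"]:ann["end"]]
        for ann in flatdoc["annotations"].get(span_type, [])
    ]


for span_type in ("Head", "Paragraph"):
    print(span_type, covered_text(flatdoc, span_type))
```

Because the text is never modified, annotations from different tools can be added side by side without invalidating each other's offsets.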

---

## Supported TEI dialects

| Dialect key | Description |
|-------------|-------------|
| `"tei_wdm"` | WDM / ULB Darmstadt TEI — journal articles converted from JATS |

The original GROBID-based parser (`transform_xml`) is still available for backwards compatibility.

---

## Installation

```bash
pip install flattentei
```

Requires Python ≥ 3.10.

---

## Quick start

### Parse a WDM TEI file

```python
import flattentei

doc = flattentei.parse_xml("article.xml", dialect="tei_wdm")

# unpack the three main outputs
text, annotations, metadata = doc

print(doc.doc_id)               # e.g. "jz000102-0007"
print(doc.metadata["title"])    # "Detuned Resonances"
print(doc.metadata["authors"])  # [{"surname": "Colyer", "forename": "Greg", ...}, ...]
print(doc.metadata["doi"])      # "10.3390/fluids7090297"
```

`parse_xml` also accepts raw bytes:

```python
from pathlib import Path

doc = flattentei.parse_xml(Path("article.xml").read_bytes(), dialect="tei_wdm")
```

---

### Work with sentences and spans

`doc.sentences` returns a list of `Sentence` objects. If the XML contains sentence markup, it is used directly; otherwise NLTK's `sent_tokenize` is applied paragraph by paragraph.

```python
for sent in doc.sentences:
    print(sent.sentence_idx, sent.text)

    for span in sent.spans:
        # span.span_type e.g. "ReferenceToBib", "Paragraph", …
        # span.begin / span.end  — offsets within the sentence
        # span.begin_in_doc / span.end_in_doc — offsets in the full document
        print(f"  [{span.span_type}] {span.text!r}")
```
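
The two coordinate systems mentioned in the comments above can be related with simple arithmetic. The sketch below assumes that a sentence-relative offset equals the document offset minus the offset at which the sentence starts (the role `Sentence.begin_idx` appears to play in the data model). `SpanSketch` and `localize` are hypothetical stand-ins written for this example, not classes or functions from the library.

```python
from dataclasses import dataclass


@dataclass
class SpanSketch:
    """Hypothetical stand-in for a Span; field names follow this README."""
    text: str
    begin: int          # offset within the sentence
    end: int
    begin_in_doc: int   # offset within the full document text
    end_in_doc: int


def localize(doc_text, sentence_begin, begin_in_doc, end_in_doc):
    """Derive sentence-relative offsets from document offsets (assumed relationship)."""
    return SpanSketch(
        text=doc_text[begin_in_doc:end_in_doc],
        begin=begin_in_doc - sentence_begin,
        end=end_in_doc - sentence_begin,
        begin_in_doc=begin_in_doc,
        end_in_doc=end_in_doc,
    )


doc_text = "First one. See Smith 2020 here."
sentence_begin = doc_text.index("See")      # start of the second sentence
cite_begin = doc_text.index("Smith 2020")
span = localize(doc_text, sentence_begin, cite_begin, cite_begin + len("Smith 2020"))

# Both coordinate systems address the same characters.
sentence_text = doc_text[sentence_begin:]
assert sentence_text[span.begin:span.end] == doc_text[span.begin_in_doc:span.end_in_doc]
```

Sentence-relative offsets are convenient when a model sees one sentence at a time; document offsets are what you need to write results back into the flatdoc.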

---

### Export to flatdoc JSON

The `to_json()` method returns a dict compatible with the original flatdoc format (`{"text": …, "annotations": …}`):

```python
import json

flat = doc.to_json()
# use a context manager so the file is flushed and closed reliably
with open("article.json", "w", encoding="utf-8") as f:
    json.dump(flat, f)
```

---

### Access span annotations directly

```python
# all paragraph offsets
for para in doc.spans["Paragraph"]:
    print(doc.text[para["begin"]:para["end"]])

# all in-text citation spans with their targets
for ref in doc.spans.get("ReferenceToBib", []):
    print(ref["target"], doc.text[ref["begin"]:ref["end"]])
```
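
Since the span entries are plain dicts, standard-library tools can summarise them directly. The following sketch counts how often each bibliography entry is cited. The sample text, the span list, and the target ids (`b1`, `b2`) are invented for illustration; only the `begin`, `end`, and `target` keys are taken from the example above.

```python
from collections import Counter

# Hypothetical data in the same shape as doc.text and doc.spans["ReferenceToBib"].
text = "See [1] and [2]; again [1]."
reference_spans = [
    {"begin": 4, "end": 7, "target": "b1"},
    {"begin": 12, "end": 15, "target": "b2"},
    {"begin": 23, "end": 26, "target": "b1"},
]


def citation_counts(spans):
    """Count in-text citations per bibliography target."""
    return Counter(span["target"] for span in spans)


counts = citation_counts(reference_spans)

# Collect the surface form of every mention, grouped by target.
mentions = {}
for span in reference_spans:
    mentions.setdefault(span["target"], []).append(text[span["begin"]:span["end"]])

for target, n in counts.most_common():
    print(target, n, mentions[target])
```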

---

### Attach relations (NLP pipeline output)

`Relation` connects two `Span` objects with an optional label and confidence score. It is designed to hold the output of entity and relation extraction models.

```python
from flattentei import Relation

sents = doc.sentences
subj = sents[2].spans[0]
obj  = sents[2].spans[1]

# attach to a sentence
sents[2].relations.append(Relation(subject=subj, object=obj, label="cites", score=0.91))

# or to the whole document
doc.relations.append(Relation(subject=subj, object=obj, label="authored_by"))
```

---

### Load existing flatdoc JSON files

```python
import json
from flattentei import get_units

with open("article.json") as f:
    flatdoc = json.load(f)

# extract sentences with their text
sentences = get_units("Sentence", flatdoc)

# extract entities enriched with the surrounding sentence text
entities = get_units("Entity", flatdoc, enrich_container=["Sentence"])
for ent in entities:
    print(ent["text"], ent["container"]["Sentence"]["text"])
```
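
To make the idea of container enrichment concrete, here is a minimal sketch of how a unit could be matched to the span that encloses it, using offset containment. This is not the library's implementation of `get_units`; the function `enrich_with_container`, the sample flatdoc, and the assumption that `annotations` is a dict keyed by span type are all hypothetical. The output shape follows the access pattern shown above (`ent["text"]`, `ent["container"]["Sentence"]["text"]`).

```python
def enrich_with_container(flatdoc, unit_type, container_type):
    """Attach to each unit the first container span that fully encloses it."""
    text = flatdoc["text"]
    containers = flatdoc["annotations"].get(container_type, [])
    units = []
    for unit in flatdoc["annotations"].get(unit_type, []):
        enriched = dict(unit, text=text[unit["begin"]:unit["end"]])
        for cont in containers:
            # containment test: the container covers the whole unit
            if cont["begin"] <= unit["begin"] and unit["end"] <= cont["end"]:
                enriched["container"] = {
                    container_type: dict(cont, text=text[cont["begin"]:cont["end"]])
                }
                break
        units.append(enriched)
    return units


# Hand-written, hypothetical flatdoc with two sentences and two entities.
flatdoc = {
    "text": "GROBID parses PDFs. TEI is XML.",
    "annotations": {
        "Sentence": [{"begin": 0, "end": 19}, {"begin": 20, "end": 31}],
        "Entity": [{"begin": 0, "end": 6}, {"begin": 20, "end": 23}],
    },
}

ents = enrich_with_container(flatdoc, "Entity", "Sentence")
for ent in ents:
    print(ent["text"], "|", ent["container"]["Sentence"]["text"])
```

The linear scan over containers is fine for single documents; for large corpora, sorting spans by `begin` and using `bisect` would be a reasonable refinement.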

---

### Batch convert a folder of TEI XML files

```bash
flatten-tei-folder --source ./xml_files --target ./output
```

Or from Python:

```python
from flattentei.tei_to_text_and_standoff import transform_xml_folder
from pathlib import Path

transform_xml_folder(Path("xml_files"), Path("output"))
```

---

## Data model

```
Doc
├── doc_id: str
├── text: str
├── spans: dict[str, list[dict]]   # {"Paragraph": [{begin, end, idx, …}, …], …}
├── metadata: dict                  # title, authors, doi, journal, …
├── figures: list[dict]             # id, head, label, url
├── relations: list[Relation]
└── sentences → list[Sentence]     # property, computed on access

Sentence
├── doc_id, sentence_id, sentence_idx
├── text, begin_idx
├── spans: list[Span]
└── relations: list[Relation]

Span
├── doc_id, text, span_type
├── begin, end                     # relative to parent container
└── begin_in_doc, end_in_doc

Relation
├── subject: Span
├── object: Span
├── label: str | None
└── score: float | None
```

### Span types produced by the WDM parser

| Type | Description |
|------|-------------|
| `Abstract` | Abstract section |
| `Div` | Section div (with optional `id`) |
| `Head` | Section heading (with optional `n`, `id`) |
| `Paragraph` | Paragraph |
| `ReferenceToBib` | In-text citation (with `target` = bib entry id) |
| `ReferenceToFigure` | In-text figure reference |
| `ReferenceToSection` | In-text section cross-reference |
| `ReferenceString` | Full formatted reference entry |
| `SectionHeader` | Title + abstract region |
| `SectionMain` | Body text region |
| `SectionFootnote` | Back-matter notes region |
| `SectionReference` | Reference list region |
