Metadata-Version: 2.4
Name: gsapere
Version: 0.2.1
Summary: Entity and Relation Extraction on scientific text using HGERE with a span-pruning stage.
Author: Wolfgang Otto
License: MIT
License-File: LICENSE
Keywords: entity-recognition,information-extraction,nlp,relation-extraction,scientific-text
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: <3.11,>=3.9
Requires-Dist: einops>=0.8.0
Requires-Dist: fastapi>=0.128.8
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: matplotlib>=3.9.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: requests>=2.32.0
Requires-Dist: scikit-learn>=1.6.0
Requires-Dist: setuptools>=75.6.0
Requires-Dist: tabulate>=0.9.0
Requires-Dist: torch>=2.8.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: transformers<5.0,>=4.40.0
Requires-Dist: wandb>=0.19.9
Provides-Extra: dev
Requires-Dist: httpx>=0.27.0; extra == 'dev'
Requires-Dist: lint>=1.2.1; extra == 'dev'
Requires-Dist: pytest>=8.4.2; extra == 'dev'
Requires-Dist: ruff>=0.15.5; extra == 'dev'
Requires-Dist: twine>=6.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# gsapere — Entity and Relation Extraction for Scientific Text

A fork of [HGERE](https://github.com/yanzhh/HGERE) adapted for scientific text, with a two-stage pipeline for **joint entity and relation extraction (ERE)**.

> **Paper under review.**
> Configs used for our experiments are in [`configs/`](configs/).

The pipeline consists of:

1. **Rule-based pre-filter** *(optional)* — removes deterministically non-entity spans (punctuation, function-word sequences, etc.) before the neural pruner sees training data, reducing trivial negatives and speeding up training
2. **Span Pruner** — a binary classifier that scores remaining candidate n-grams and filters them to a manageable set (target: ≥ 98 % entity recall)
3. **HGERE** — a Hypergraph GNN that jointly predicts entity types and relations on the pruned candidates
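
The first two stages operate on candidate n-gram spans. A minimal sketch of how such candidates might be enumerated, and how a deterministic rule could discard obvious non-entities, is shown below. The helper names `enumerate_spans` and `is_trivial_non_entity`, the maximum span width of 4, and the small function-word list are illustrative assumptions, not gsapere's actual implementation.

```python
import string
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), end inclusive

# Illustrative function-word list (assumption); the real pre-filter fits its patterns from data.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "we", "with"}


def enumerate_spans(tokens: List[str], max_width: int = 4) -> List[Span]:
    """Return every contiguous n-gram of up to max_width tokens."""
    return [
        (start, end)
        for start in range(len(tokens))
        for end in range(start, min(start + max_width, len(tokens)))
    ]


def is_trivial_non_entity(tokens: List[str], span: Span) -> bool:
    """True if every token in the span is punctuation or a function word."""
    words = tokens[span[0] : span[1] + 1]
    return all(
        all(ch in string.punctuation for ch in word) or word.lower() in FUNCTION_WORDS
        for word in words
    )


tokens = ["We", "train", "BERT", "."]
candidates = enumerate_spans(tokens)
kept = [span for span in candidates if not is_trivial_non_entity(tokens, span)]
print(len(candidates), len(kept))
```

Only the spans that survive this kind of cheap filter would need to be scored by the neural pruner.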

Supported datasets: **GSAP-ERE**, **SciER**, **SciNLP**, **SciERC**

---

## Changes from the original

- Large-scale code restructuring: Pydantic-first configs, typed signatures throughout, proper package layout under `src/`
- All dependencies updated to current versions
- The HuggingFace `transformers` version is **no longer hardcoded**: any release satisfying `>=4.40.0,<5.0` works
- Added rule-based pre-filter, span pruner stage, multi-dataset joint training, and full CLI entry points
- Tests for all major components

---

## Requirements

- **Python 3.9 or 3.10** (`>=3.9,<3.11`; tested on 3.9, the upper bound is required by some dependencies)
- CUDA 12.8 (adjust `pyproject.toml` for other CUDA versions)
- A GPU with roughly 24 GB of VRAM or more for the default batch sizes (tested on A40 / 40 GB)

---

## Installation

Install [uv](https://github.com/astral-sh/uv):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Clone the repository and install:

```bash
git clone <repo-url>
cd HGERE
uv sync
```

### Datasets

Datasets are loaded from their original sources via the download command:

```bash
uv run gsapere-download-dataset --list          # list available datasets
uv run gsapere-download-dataset gsap-ere
uv run gsapere-download-dataset scier
uv run gsapere-download-dataset scinlp
uv run gsapere-download-dataset scierc
uv run gsapere-download-dataset --all           # download everything
```

See [documentation/download-dataset.md](documentation/download-dataset.md) for split details and manual download fallbacks.

#### GSAP-ERE

Fine-grained entity and relation extraction focused on machine learning — 100 annotated full-text ML publications, 63K entities, 35K relations, 10 entity types, 18 relation types.
DOI: <https://doi.org/10.60914/c4c1d-s0587>

> Otto et al., "GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning", AAAI 2026.
> <https://ojs.aaai.org/index.php/AAAI/article/view/40537>

#### SciER

Entity and relation extraction dataset for datasets, methods, and tasks in scientific documents — 106 annotated full-text papers, 24k entities, 12k relations.

> Zhang et al., "SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents", EMNLP 2024.
> <https://aclanthology.org/2024.emnlp-main.726/>

#### SciNLP

Full-text entity and relation extraction benchmark for the NLP domain — 60 annotated ACL papers, 6,409 entities, 1,648 relations.

> "SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP", EMNLP 2025.
> <https://aclanthology.org/2025.emnlp-main.732/>

#### SciERC

Scientific information extraction benchmark — 500 annotated AI abstracts,
6 entity types, 7 relation types.

> Luan et al., "Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction", EMNLP 2018.
> <https://aclanthology.org/D18-1360/>

---

## Training

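Training has two required steps: first train the span pruner, then train HGERE on the pruner's output. An optional preliminary step fits the rule-based pre-filter, which gives the three steps below.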

### Step 1 — Fit the rule-based pre-filter (optional)

```bash
uv run gsapere-fit-rulebased-pruner configs/train/gsap/fit_rulebased_pruner.yaml
```

This fits token n-gram patterns from the training data that deterministically exclude non-entity spans. The saved JSON file is referenced in the pruner training config to speed up training.
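
One way to picture this fitting step: collect token n-grams that occur among the training candidates but never as a gold entity, and store them as exclusion patterns. The sketch below illustrates that idea only. The function name `fit_never_entity_ngrams`, the `min_count` cutoff, and the JSON layout are hypothetical and not taken from gsapere.

```python
import json
from collections import Counter
from typing import Iterable, List, Set, Tuple

Sentence = List[str]
Span = Tuple[int, int]  # (start, end), end inclusive


def fit_never_entity_ngrams(
    data: Iterable[Tuple[Sentence, Set[Span]]],
    max_width: int = 3,
    min_count: int = 2,
) -> List[List[str]]:
    """Return lowercased n-grams seen at least min_count times and never as a gold entity."""
    seen: Counter = Counter()
    entity_ngrams: Set[Tuple[str, ...]] = set()
    for tokens, gold_spans in data:
        for start in range(len(tokens)):
            for end in range(start, min(start + max_width, len(tokens))):
                ngram = tuple(token.lower() for token in tokens[start : end + 1])
                seen[ngram] += 1
                if (start, end) in gold_spans:
                    entity_ngrams.add(ngram)
    return sorted(
        list(ngram)
        for ngram, count in seen.items()
        if count >= min_count and ngram not in entity_ngrams
    )


# Toy training data: each item is (tokens, set of gold entity spans).
train = [
    (["We", "train", "BERT", "."], {(2, 2)}),
    (["We", "evaluate", "BERT", "."], {(2, 2)}),
]
patterns = fit_never_entity_ngrams(train)
print(json.dumps(patterns))
```

The `min_count` cutoff guards against excluding n-grams that were simply too rare to appear as an entity in the training split.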

### Step 2 — Train the span pruner

```bash
uv run gsapere-train-pruner configs/train/gsap/train_gsap_pruner.yaml
```

After training, run pruner inference on train/dev/test to produce the enriched input files for HGERE (see `scripts/pruner/`).

### Step 3 — Train HGERE (single dataset)

```bash
uv run gsapere-train-hgere configs/train/gsap/train_gsap_hgere.yaml
```

Example config:

```yaml
schema_version: "1.0"
label_set: gsap
model_dir: saves/hgere/gsap
base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
ner_prediction_dir: saves/pruner/gsap/output
max_seq_length: 512
n_iter: 3
layernorm: true
attn_self: true

train_params:
  learning_rate: 1e-5
  num_train_epochs: 8
  per_gpu_train_batch_size: 21
  fp16: true
  evaluate_during_training: true
  eval_epochs: 1
  loss_re_weight_alpha: 0.9
  log_wandb: true
```
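
A note on `learning_rate: 1e-5`: PyYAML follows YAML 1.1, whose float syntax requires a decimal point, so `yaml.safe_load` returns the string `'1e-5'` for this value and the config layer has to coerce it (Pydantic's default lax mode converts numeric strings to floats). The sketch below mimics that coercion with a stdlib dataclass rather than the package's Pydantic models, purely to stay dependency-free. The class `TrainParams`, its defaults, and its checks are hypothetical; only the field names come from the example config.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class TrainParams:
    """Hypothetical stand-in for the train_params section of the config."""

    learning_rate: float = 1e-5
    num_train_epochs: int = 8
    per_gpu_train_batch_size: int = 21
    fp16: bool = True
    evaluate_during_training: bool = True
    eval_epochs: int = 1
    loss_re_weight_alpha: float = 0.9
    log_wandb: bool = False

    def __post_init__(self) -> None:
        # Coerce numeric strings such as "1e-5" to the annotated type.
        for field in fields(self):
            value = getattr(self, field.name)
            if field.type in (float, int) and isinstance(value, str):
                setattr(self, field.name, field.type(value))
        # Illustrative sanity checks (assumptions, not gsapere's rules).
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.num_train_epochs < 1:
            raise ValueError("num_train_epochs must be at least 1")


def load_train_params(raw: Dict[str, Any]) -> TrainParams:
    """Build TrainParams from a parsed mapping; unknown keys raise TypeError."""
    return TrainParams(**raw)


# What a YAML 1.1 loader would hand over for the example above.
raw = {"learning_rate": "1e-5", "num_train_epochs": 8, "fp16": True}
params = load_train_params(raw)
print(params.learning_rate, type(params.learning_rate).__name__)
```

Writing the value as `1.0e-5` in the YAML file avoids the ambiguity altogether.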

### Step 3 (alt) — Train HGERE on multiple datasets jointly

Multi-dataset mode trains a shared encoder with per-dataset NER and relation heads. Each dataset must have its own pruner output directory.

```bash
uv run gsapere-train-hgere configs/multi-sciere-scinlp-gsap-ere/train/hgere/train_multi.yaml
```

Example config:

```yaml
schema_version: "1.0"
model_dir: saves/multi/hgere/run1
base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
max_seq_length: 512
n_iter: 3
layernorm: true
attn_self: true
sampling_temperature: 0.8   # 0 = always largest dataset, 1 = proportional to size
seeds: [42, 43, 44]          # run once per seed; _seed<n> appended to model_dir

datasets:
  - label_set: scier
    ner_prediction_dir: saves/pruner/scier/output
    train_file: ent_pred_train.json
    dev_file: ent_pred_dev.json
    test_file: ent_pred_test.json
  - label_set: scinlp
    ner_prediction_dir: saves/pruner/scinlp/output
    train_file: ent_pred_train.json
    dev_file: ent_pred_dev.json   # omit (null) to skip dev evaluation for this dataset
  - label_set: gsap
    ner_prediction_dir: saves/pruner/gsap/output
    train_file: ent_pred_train.json

train_params:
  learning_rate: 1e-5
  num_train_epochs: 8
  per_gpu_train_batch_size: 21
  fp16: true
  evaluate_during_training: true
  log_wandb: true
```
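
The comment on `sampling_temperature` describes temperature-scaled dataset sampling. A common formulation that matches both endpoints is `p_i ∝ n_i^(1/T)`, where `n_i` is the size of dataset `i`. Whether gsapere uses exactly this formula is an assumption. The function name and the sizes below (document counts from the dataset descriptions above) are for illustration only.

```python
from typing import Dict


def sampling_probabilities(sizes: Dict[str, int], temperature: float) -> Dict[str, float]:
    """Temperature-scaled sampling: p_i proportional to n_i ** (1 / T).

    T = 1 gives size-proportional sampling; T = 0 is treated as the limit
    in which all probability mass goes to the largest dataset.
    """
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0.0:
        largest = max(sizes, key=sizes.get)
        return {name: float(name == largest) for name in sizes}
    # Normalise by the largest size first so the power cannot overflow.
    top = max(sizes.values())
    weights = {name: (n / top) ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Assumed unit: annotated documents per dataset.
sizes = {"scier": 106, "scinlp": 60, "gsap": 100}
for t in (0.0, 0.8, 1.0):
    probs = sampling_probabilities(sizes, t)
    print(t, {name: round(p, 3) for name, p in probs.items()})
```

Under this formulation, a value such as 0.8 shifts probability mass slightly towards the larger datasets compared with purely proportional sampling.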

---

## Inference

### Full pipeline (pruner → HGERE)

```bash
CUDA_VISIBLE_DEVICES=0 uv run gsapere-pipeline \
    --config configs/inference/gsap-pipeline-best.yaml \
    --input input/ \
    --output output/
```

`--input` can be a `.jsonl` file or a directory of `.jsonl` files.
Ready-to-use configs for all supported datasets are in [`configs/inference/`](configs/inference/).
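
As a sketch of preparing such input, the snippet below writes one JSON document per line, using the `doc_key` and `sentences` fields that appear in the REST API example in this README. That the `.jsonl` files share this schema is an assumption, so check the shipped configs and documentation for the authoritative format. The whitespace tokeniser and the helper names `write_jsonl` and `make_doc` are placeholders.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_jsonl(docs: Iterable[Dict], path: Path) -> int:
    """Write one JSON object per line and return the number of documents written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for doc in docs:
            handle.write(json.dumps(doc, ensure_ascii=False) + "\n")
            count += 1
    return count


def make_doc(doc_key: str, sentences: List[str]) -> Dict:
    """Naive whitespace tokenisation; real input should use a proper tokeniser."""
    return {"doc_key": doc_key, "sentences": [s.split() for s in sentences]}


out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / "docs.jsonl"
n_written = write_jsonl([make_doc("doc1", ["We train BERT ."])], out_file)
print(n_written, out_file.read_text(encoding="utf-8").strip())
```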

The pipeline config combines pruner and HGERE settings in a single YAML file:

```yaml
label_set: gsap

pruner:
  model_dir: saves/pruner/gsap/best
  base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
  model_type: bertspanmarkerpruner
  max_seq_length: 256
  per_gpu_eval_batch_size: 32
  final_pruning:
    method: threshold
    threshold: 0.0005

hgere:
  model_dir: saves/hgere/gsap/best
  base_model_name_or_path: pretrained_models/scibert_scivocab_uncased
  model_type: hyper
  max_seq_length: 512
  per_gpu_eval_batch_size: 32
  n_iter: 3
  layernorm: true
  attn_self: true
  pre_filter_params:
    method: threshold
    value: 0.0125
```
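
With `method: threshold`, candidates scoring below the configured value are dropped. The sketch below shows that idea together with the entity-recall check that motivates such low thresholds. The scores, spans, and function names are invented for illustration, and the real scoring and filtering logic may differ (for example in whether the comparison is inclusive).

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # (start, end), end inclusive


def prune_by_threshold(scores: Dict[Span, float], threshold: float) -> Set[Span]:
    """Keep every span whose score is at least the threshold (inclusive by assumption)."""
    return {span for span, score in scores.items() if score >= threshold}


def entity_recall(kept: Set[Span], gold: Set[Span]) -> float:
    """Fraction of gold entity spans that survive pruning."""
    return len(kept & gold) / len(gold) if gold else 1.0


# Made-up pruner scores for the candidate spans of one sentence.
scores = {(0, 0): 0.0001, (1, 1): 0.0004, (2, 2): 0.93, (1, 2): 0.002, (3, 3): 0.00001}
gold = {(2, 2)}
kept = prune_by_threshold(scores, threshold=0.0005)
print(sorted(kept), entity_recall(kept, gold))
```

A low threshold keeps recall high at the cost of passing more candidates on to HGERE, which is the trade-off the two thresholds in the config control.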

---

## Docker API

The pipeline can be served as a REST API. Build and run with Docker (requires `--gpus all`):

```bash
docker build -t gsapere-api .

docker run --gpus all \
    -v /path/to/models:/app/models \
    -v /path/to/config.yaml:/app/config.yaml \
    -e PIPELINE_CONFIG=/app/config.yaml \
    -p 8000:8000 \
    gsapere-api
```

Models and the pipeline config are mounted at runtime — the image itself contains only the code.

**Endpoints:**

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness check |
| `POST` | `/predict` | Run the pipeline on a batch of documents |

**Example request:**

```bash
curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"documents": [{"doc_key": "doc1", "sentences": [["We", "train", "BERT", "."]]}]}'
```
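
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call. The base URL assumes the port mapping from the `docker run` example, and because the response schema is not documented here, the result is returned as parsed JSON without further interpretation. `build_predict_request` and `predict` are illustrative helper names.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_predict_request(
    documents: List[Dict[str, Any]], base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the POST /predict request without sending it."""
    body = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(
    documents: List[Dict[str, Any]],
    base_url: str = "http://localhost:8000",
    timeout: float = 60.0,
) -> Any:
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(documents, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


documents = [{"doc_key": "doc1", "sentences": [["We", "train", "BERT", "."]]}]
request = build_predict_request(documents)
print(request.get_method(), request.full_url)
# predict(documents) needs the container from the example above to be running.
```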

---

## CLI reference

| Command | Description |
|---|---|
| `gsapere-train-pruner` | Train the span pruner |
| `gsapere-train-hgere` | Train the HGERE ERE model |
| `gsapere-pipeline` | Run the full two-stage pipeline on new documents |
| `gsapere-download-dataset` | Download supported datasets |
| `gsapere-tune-pruner` | Threshold sweep and optimisation for the pruner |
| `gsapere-fit-rulebased-pruner` | Fit the rule-based pre-filter (optional training step 1) |
| `infer-fixed-spans` | Run HGERE on fixed (gold) spans |
| `infer-pruner-augmented` | Run HGERE on pruner-predicted spans |
| `gsap-ere-benchmark-pipeline` | Benchmark pipeline throughput |
| `gsapere-fix-gold-annos` | Add gold annotations to prediction files |
| `gsapere-analysis-ner-length-distribution` | Analyse entity length distributions |
| `gsapere-generate-pruner-docs` | Regenerate parameter docs in `documentation/api/` |

---

## Development

```bash
uv run pytest                          # run tests
uv run ruff format src/ tests/         # format
uv run ruff check src/ tests/          # lint
```

---

## Building and publishing

```bash
uv build                               # produces dist/ wheel + sdist
bash publish.sh                        # build + upload to PyPI (requires .pypi token file)
```

---

## Citation

Please cite this work and the original HGERE:

```bibtex
@article{Otto2026GSAP-ERE,
  title   = {{GSAP-ERE}: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning},
  author  = {Otto, Wolfgang and Gan, Lu and Upadhyaya, Sharmila and Karmakar, Saurav and Dietze, Stefan},
  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume  = {40},
  number  = {38},
  pages   = {32600--32609},
  year    = {2026},
  month   = {Mar.},
  doi     = {10.1609/aaai.v40i38.40537},
  url     = {https://ojs.aaai.org/index.php/AAAI/article/view/40537},
}

@misc{yan2023joint,
  title         = {Joint Entity and Relation Extraction with Span Pruning and Hypergraph Neural Networks},
  author        = {Zhaohui Yan and Songlin Yang and Wei Liu and Kewei Tu},
  year          = {2023},
  eprint        = {2310.17238},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```

---

## License

MIT — see [LICENSE](LICENSE).
