Metadata-Version: 2.4
Name: poster2json
Version: 0.4.4
Summary: Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models
License: MIT
License-File: LICENSE.md
Keywords: poster,json,metadata,extraction,llm,scientific,pdf,ocr,machine-learning,datacite,fair-data
Author: FAIR Data Innovations Hub
Author-email: contact@fairdataihub.org
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing
Provides-Extra: vision
Requires-Dist: Pillow (>=9.0.0)
Requires-Dist: accelerate (>=0.20.0)
Requires-Dist: art (>=6.0,<7.0)
Requires-Dist: bitsandbytes (>=0.44.0)
Requires-Dist: click (>=8.0,<9.0)
Requires-Dist: jsonschema (>=4.17.0)
Requires-Dist: lingua-language-detector (>=2.1,<2.2)
Requires-Dist: numpy
Requires-Dist: pymupdf (>=1.22.0)
Requires-Dist: rouge-score
Requires-Dist: safetensors
Requires-Dist: sentencepiece
Requires-Dist: torch (>=2.0.0)
Requires-Dist: transformers (>=4.40.0)
Project-URL: Documentation, https://fairdataihub.github.io/poster2json
Project-URL: Homepage, https://github.com/fairdataihub/poster2json
Project-URL: Repository, https://github.com/fairdataihub/poster2json
Description-Content-Type: text/markdown

<div align="center">

<img src="https://cdn.posters.science/logos/poster-fairy.png" alt="logo" width="200" height="auto" title="This image was generated by AI" />

<br />

<h1>poster2json</h1>

<p>
Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models.
</p>

<br />

<p>
  <a href="https://github.com/fairdataihub/poster2json/graphs/contributors">
    <img src="https://img.shields.io/github/contributors/fairdataihub/poster2json.svg?style=flat-square" alt="contributors" />
  </a>
  <a href="https://github.com/fairdataihub/poster2json/stargazers">
    <img src="https://img.shields.io/github/stars/fairdataihub/poster2json.svg?style=flat-square" alt="stars" />
  </a>
  <a href="https://github.com/fairdataihub/poster2json/issues/">
    <img src="https://img.shields.io/github/issues/fairdataihub/poster2json.svg?style=flat-square" alt="open issues" />
  </a>
  <a href="https://github.com/fairdataihub/poster2json/blob/main/LICENSE">
    <img src="https://img.shields.io/github/license/fairdataihub/poster2json.svg?style=flat-square" alt="license" />
  </a>
</p>
<p>
  <a href="https://pypi.org/project/poster2json">
    <img src="https://img.shields.io/pypi/v/poster2json.svg" alt="PyPI Version" />
  </a>
  <a href="https://pypistats.org/packages/poster2json">
    <img src="https://img.shields.io/pypi/dm/poster2json.svg?color=orange" alt="PyPI Downloads" />
  </a>
  <a href="https://zenodo.org/badge/latestdoi/1105067405">
    <img src="https://zenodo.org/badge/1105067405.svg" alt="DOI" />
  </a>
</p>

<h4>
    <a href="https://fairdataihub.github.io/poster2json/">Documentation</a>
  <span> · </span>
    <a href="https://fairdataihub.github.io/poster2json/about/changelog/">Changelog</a>
  <span> · </span>
    <a href="https://github.com/fairdataihub/poster2json/issues/">Report Bug</a>
  <span> · </span>
    <a href="https://github.com/fairdataihub/poster2json/issues/">Request Feature</a>
</h4>
</div>

<br />

---

## Description

**poster2json** extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the [poster-json-schema](https://github.com/fairdataihub/poster-json-schema).

The pipeline uses:

- [**Llama-3.1-8B-Instruct**](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) (a verbatim mirror of Meta's release; swap with any HuggingFace instruct model via `--model`) for JSON structuring
- **Qwen2-VL-7B** for vision-based OCR of image posters
- **pdfalto** for layout-aware PDF text extraction
- **lingua-language-detector** for ISO 639-1 language detection on body text (the detected value overrides whatever the model emits, because body text is a more reliable signal than guessing from metadata fragments)
- **ROR** (`https://api.ror.org`) for affiliation and publisher canonicalisation; matched names get a ROR identifier attached
- **SPDX** matching (with integer-exact version handling) for license normalisation in `rightsList`

## Quick Start

### Installation

```bash
pip install poster2json
```

### CLI Usage

```bash
# Extract metadata from a poster (default: Llama-3.1-8B-Instruct @ 4bit)
poster2json extract poster.pdf -o result.json

# Use a different instruct model (any HuggingFace repo id works)
poster2json extract poster.pdf --model google/gemma-2-9b-it --quantization 4bit

# Trade VRAM for quality
poster2json extract poster.pdf --quantization 8bit
poster2json extract poster.pdf --quantization fp16

# Validate extracted JSON
poster2json validate result.json

# Process multiple posters
poster2json batch ./posters/ -o ./output/
```

### Python API

```python
from poster2json import extract_poster, validate_poster

# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])

# Validate the result
is_valid = validate_poster(result)
```
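
To keep the extracted metadata, the result can be written to disk. The sketch below assumes `extract_poster` returns a plain, JSON-serialisable dict (the indexing above suggests this, but it is not confirmed here). `save_result` is a hypothetical helper, not part of the poster2json API, and the `result` dict is a stand-in so the snippet runs without a GPU.

```python
import json
import tempfile
from pathlib import Path


def save_result(result: dict, path: str) -> Path:
    """Hypothetical helper: write an extraction result to a JSON file."""
    out = Path(path)
    # Create the output directory if it does not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps non-ASCII author names and titles readable.
    out.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")
    return out


# Stand-in for the dict returned by extract_poster("poster.pdf") (assumed shape).
result = {"titles": [{"title": "Example poster"}], "language": "en"}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_result(result, f"{tmp}/output/result.json")
    reloaded = json.loads(saved.read_text(encoding="utf-8"))

print(reloaded["titles"][0]["title"])
```

In real use, replace the stand-in dict with the value returned by `extract_poster` and pass a permanent output path.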

## Output Format

Output conforms to the [poster-json-schema](https://github.com/fairdataihub/poster-json-schema) (DataCite 4.7):

```json
{
  "$schema": "https://posters.science/schema/v0.2/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": [
        {
          "name": "Stanford University",
          "affiliationIdentifier": "https://ror.org/00f54p054",
          "affiliationIdentifierScheme": "ROR",
          "schemeUri": "https://ror.org/"
        }
      ]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "publicationYear": 2025,
  "language": "en",
  "researchField": "Health Sciences",
  "subjects": [
    { "subject": "Machine Learning" },
    { "subject": "Diabetic Retinopathy" }
  ],
  "descriptions": [
    { "description": "We present a deep learning model...", "descriptionType": "Abstract" }
  ],
  "publisher": { "name": "Zenodo" },
  "rightsList": [
    {
      "rights": "Creative Commons Attribution 4.0 International",
      "rightsIdentifier": "CC-BY-4.0",
      "rightsIdentifierScheme": "SPDX",
      "schemeUri": "https://spdx.org/licenses/",
      "rightsUri": "https://creativecommons.org/licenses/by/4.0/"
    }
  ],
  "content": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "id": "fig1", "caption": "Figure 1. ROC curves showing..." }],
  "tableCaptions": [{ "id": "table1", "caption": "Table 1. Performance metrics" }]
}
```

Notes on the auto-populated fields:

- `language` is detected from the raw body text (lingua heuristic). The value is null when the text is too short (<200 chars / <50 non-ASCII codepoints) or the detector is unsure.
- `researchField` must be one of the four OpenAlex top-level domains: `Health Sciences`, `Life Sciences`, `Physical Sciences`, `Social Sciences`. Null when the model can't pick one confidently.
- `affiliation` and `publisher` get ROR enrichment when the matcher returns a high-confidence chosen result. Strings without a confident match pass through unchanged. Set `POSTER2JSON_ROR=0` to disable.
- `rightsList` entries are matched against an SPDX table; the matcher is conservative on version numbers (e.g. `CC-BY-4.0` and `CC-BY-4.1` are never confused).
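
When consuming the output, it can help to check which of these fields were actually populated. The following is a minimal sketch that works on a result dict shaped like the example above. `summarize_auto_fields` and the `sample` record are hypothetical illustrations, not part of poster2json.

```python
# The four OpenAlex top-level domains listed in the notes above.
RESEARCH_FIELDS = {
    "Health Sciences",
    "Life Sciences",
    "Physical Sciences",
    "Social Sciences",
}


def summarize_auto_fields(record: dict) -> dict:
    """Hypothetical helper: report the state of the auto-populated fields."""
    affiliations = [
        aff
        for creator in record.get("creators", [])
        for aff in creator.get("affiliation", [])
    ]
    field = record.get("researchField")
    return {
        # None when detection was skipped or unsure.
        "language": record.get("language"),
        # Null is allowed; any other value must be one of the four domains.
        "research_field_ok": field is None or field in RESEARCH_FIELDS,
        # Affiliations that carry a ROR identifier versus all affiliations.
        "ror_matched": sum(
            1 for a in affiliations if a.get("affiliationIdentifierScheme") == "ROR"
        ),
        "affiliations_total": len(affiliations),
        # SPDX identifiers attached to rightsList entries.
        "spdx_ids": [
            r["rightsIdentifier"]
            for r in record.get("rightsList", [])
            if r.get("rightsIdentifierScheme") == "SPDX" and "rightsIdentifier" in r
        ],
    }


# Illustrative record only; the second affiliation has no ROR match.
sample = {
    "language": "en",
    "researchField": "Health Sciences",
    "creators": [
        {
            "name": "Garcia, Sofia",
            "affiliation": [
                {
                    "name": "Stanford University",
                    "affiliationIdentifier": "https://ror.org/00f54p054",
                    "affiliationIdentifierScheme": "ROR",
                },
                {"name": "Unmatched Lab"},
            ],
        }
    ],
    "rightsList": [
        {"rightsIdentifier": "CC-BY-4.0", "rightsIdentifierScheme": "SPDX"}
    ],
}

summary = summarize_auto_fields(sample)
print(summary)
```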

## System Requirements

| Requirement | Specification                    |
| ----------- | -------------------------------- |
| GPU         | NVIDIA CUDA-capable, ≥8GB VRAM (default 4bit); ≥16GB for `--quantization fp16` or image/OCR posters |
| RAM         | ≥32GB recommended                |
| Python      | 3.10+                            |
| OS          | Linux, macOS, Windows (via WSL2) |

## Performance

Validated on 10 manually annotated scientific posters:

| Metric           | Score | Threshold |
| ---------------- | ----- | --------- |
| Word Capture     | 0.96  | ≥0.75     |
| ROUGE-L          | 0.89  | ≥0.75     |
| Number Capture   | 0.93  | ≥0.75     |
| Field Proportion | 0.99  | 0.50–2.00 |

**Pass Rate**: 10/10 (100%)
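
The thresholds in the table can be expressed as a simple check. This sketch assumes a poster passes only when all four metrics meet their thresholds; the exact pass criterion is not stated here. The metric key names and the `passes` function are hypothetical and do not reflect the project's evaluation code.

```python
# (lower bound, upper bound) per metric; None means no upper bound.
# Key names are illustrative assumptions.
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 2.00),
}


def passes(scores: dict) -> bool:
    """Return True when every metric falls inside its threshold range."""
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low:
            return False
        if high is not None and value > high:
            return False
    return True


# Scores reported in the table above.
reported = {
    "word_capture": 0.96,
    "rouge_l": 0.89,
    "number_capture": 0.93,
    "field_proportion": 0.99,
}
print(passes(reported))
```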

## Documentation

| Document                             | Description                     |
| ------------------------------------ | ------------------------------- |
| [Architecture](docs/architecture.md) | Technical details & methodology |
| [Evaluation](docs/evaluation.md)     | Validation metrics & results    |

## Development Setup

```bash
# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
source .venv/bin/activate # On Linux/macOS
.venv\Scripts\activate # On Windows

# Install poetry
pip install poetry

# Install dependencies
poetry install

# Run tests
poe test

# Format code
poe format
```

If you are on Windows and have multiple Python versions installed, you can use the `py` launcher to list them and create the virtual environment with a specific one:

```bash
py -0p # list all python versions

py -3.12 -m venv .venv
```

## License

MIT License - see [LICENSE](LICENSE.md) for details.

## Citation

```bibtex
@software{poster2json2026,
  title = {poster2json: Scientific Poster to JSON Metadata Extraction},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  version = {0.4.3},
  url = {https://github.com/fairdataihub/poster2json},
  doi = {10.5281/zenodo.18320010}
}
```

## Funding

This project is funded by [The Navigation Fund](https://www.navigation.org/) ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)).

## Contributing

Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

