Metadata-Version: 2.4
Name: pdf2data-tools
Version: 0.1.1
Summary: Transforms PDF files into machine readable JSON files
Author-email: Daniel Pereira Costa <daniel.pereira.costa@tecnico.ulisboa.pt>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/pocoyo7798/pdf2data
Project-URL: Repository, https://github.com/pocoyo7798/pdf2data
Project-URL: Issues, https://github.com/pocoyo7798/pdf2data/issues
Keywords: pdf,data-extraction,json,tables,figures,document-analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Click<=8.3.1
Requires-Dist: PyMuPDF<=1.26.7
Requires-Dist: pylatexenc<=2.10
Requires-Dist: pydantic<=2.12.5
Requires-Dist: beautifulsoup4<=4.14.3
Requires-Dist: pdf2doi<=1.7
Requires-Dist: Levenshtein<=0.27.3
Requires-Dist: trieregex<=1.0.0
Requires-Dist: bibtexparser<=1.4.3
Requires-Dist: pypdf>=3.1.0
Provides-Extra: test
Requires-Dist: pytest>=3; extra == "test"
Provides-Extra: pdf2data-pipeline
Requires-Dist: torch<=2.10.0; extra == "pdf2data-pipeline"
Requires-Dist: opencv-python<=4.13.0.92; extra == "pdf2data-pipeline"
Requires-Dist: tensorflow<=2.20.0; extra == "pdf2data-pipeline"
Requires-Dist: doclayout_yolo<=0.0.4; extra == "pdf2data-pipeline"
Requires-Dist: pdf2image<=1.17.0; extra == "pdf2data-pipeline"
Requires-Dist: paddleocr<=3.4.0; extra == "pdf2data-pipeline"
Requires-Dist: paddlepaddle<=3.3.0; extra == "pdf2data-pipeline"
Dynamic: license-file

# pdf2data

[![PyPI version](https://badge.fury.io/py/pdf2data-tools.svg)](https://pypi.org/project/pdf2data-tools/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

Transforms PDF files into machine-readable JSON files. Extracts tables, figures, text blocks, metadata, and references from scientific papers and documents.

> **Note:** This repository is under active development ahead of an article publication, so some errors are expected. Please report any issues on the [issues page](https://github.com/Pocoyo7798/pdf2data/issues).

## Installation

### From PyPI (recommended)

```bash
pip install pdf2data-tools
```

### With optional dependencies

```bash
# For the full PDF2Data pipeline (layout detection, OCR, etc.)
# The quotes keep shells such as zsh from expanding the square brackets
pip install "pdf2data-tools[pdf2data-pipeline]"
```
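After installing the extra, it can help to confirm that the heavy dependencies are actually importable. The sketch below is only an illustration: the import names (for example `cv2` for `opencv-python` and `paddle` for `paddlepaddle`) are assumptions based on how those packages are usually imported, and `missing_modules` is a hypothetical helper, not part of pdf2data.

```python
from importlib.util import find_spec

# Assumed import names for the packages behind the pipeline extra.
PIPELINE_MODULES = [
    "torch",
    "cv2",            # opencv-python
    "tensorflow",
    "doclayout_yolo",
    "pdf2image",
    "paddleocr",
    "paddle",         # paddlepaddle
]


def missing_modules(module_names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


# An empty list means every assumed module was found.
print(missing_modules(PIPELINE_MODULES))
```

`find_spec` only locates the modules without importing them, so the check stays fast even with large packages such as `torch` or `tensorflow` installed.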

### From source (development)

```bash
conda create --name pdf2data python=3.10
conda activate pdf2data
git clone git@github.com:Pocoyo7798/pdf2data.git
cd pdf2data
pip install -e .
```

## Usage

### As a library

```python
from pdf2data.pdf2data_pipeline import PDF2Data

pipeline = PDF2Data(
    layout_model="DocLayout-YOLO-DocStructBench",
    input_folder="path/to/pdfs",
    output_folder="path/to/results",
)
```

### Command line

```bash
# Extract tables and figures
pdf2data_block path_to_folder path_to_results

# Extract text
pdf2data_text path_to_folder path_to_results

# Extract metadata
pdf2data_metadata path_to_folder path_to_results

# Extract references
pdf2data_references path_to_folder path_to_results
```
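To run all four extractors over the same folder, the commands above can be driven from a short script. This is a minimal sketch that assumes the console scripts are on `PATH` and that they can share one results folder; `build_commands` and `run_all` are hypothetical helper names, not part of pdf2data.

```python
import subprocess

# Console scripts listed in the command-line section above.
COMMANDS = [
    "pdf2data_block",       # tables and figures
    "pdf2data_text",        # text
    "pdf2data_metadata",    # metadata
    "pdf2data_references",  # references
]


def build_commands(input_folder, output_folder):
    """Build one argument list per extractor, in the order listed above."""
    return [[name, str(input_folder), str(output_folder)] for name in COMMANDS]


def run_all(input_folder, output_folder):
    """Run each extractor in turn and stop at the first failure."""
    for cmd in build_commands(input_folder, output_folder):
        subprocess.run(cmd, check=True)


# Example call (commented out so importing this file has no side effects):
# run_all("path_to_folder", "path_to_results")
```

Passing `check=True` makes `subprocess.run` raise `CalledProcessError` as soon as one command exits with a non-zero status, so a failed step is not silently skipped.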

## Update and Publish (PyPI)

Use this flow when you make changes and want to publish a new package version.

```bash
# 1) Bump version in pyproject.toml (must be higher than the last release)
# [project]
# version = "0.1.2"  # example: the release after 0.1.1

# 2) (Optional) Keep __version__ in sync
# edit pdf2data/__init__.py

# 3) Install/reinstall build tools
python -m pip install --upgrade build twine

# 4) Clean previous artifacts
rm -rf dist build *.egg-info

# 5) Build package
python -m build

# 6) Validate distribution files
python -m twine check dist/*

# 7) Upload to PyPI
python -m twine upload dist/*
```

When prompted by `twine`:
- Username: `__token__`
- Password: your PyPI token (`pypi-...`)
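Step 2 of the flow above is easy to forget, so a small pre-build check can compare the two version strings. The sketch below assumes that `pyproject.toml` declares a static `version` under `[project]` and that `pdf2data/__init__.py` defines `__version__` as a plain string literal; both helper functions are hypothetical and use simple pattern matching rather than a full TOML parser.

```python
import re


def read_project_version(pyproject_text):
    """Return the version from the [project] table, or None if absent."""
    in_project = False
    for line in pyproject_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether we are inside the [project] table.
            in_project = stripped == "[project]"
            continue
        if in_project:
            match = re.match(r'version\s*=\s*"([^"]+)"', stripped)
            if match:
                return match.group(1)
    return None


def read_dunder_version(init_text):
    """Return the value assigned to __version__, or None if absent."""
    match = re.search(
        r'^__version__\s*=\s*["\']([^"\']+)["\']', init_text, re.MULTILINE
    )
    return match.group(1) if match else None


# Example usage from the repository root (paths are assumptions):
# from pathlib import Path
# project = read_project_version(Path("pyproject.toml").read_text())
# dunder = read_dunder_version(Path("pdf2data/__init__.py").read_text())
# assert project == dunder, f"version mismatch: {project} vs {dunder}"
```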

Verify the release:

```bash
pip install --upgrade pdf2data-tools
pip show pdf2data-tools
```
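The same check can be done from Python with the standard library, which is handy in scripts or CI. This is a minimal sketch; `installed_version` is a hypothetical helper, and the distribution name is the one published on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="pdf2data-tools"):
    """Return the installed version of a distribution, or None if missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None when the package is absent.
print(installed_version())
```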

## License

Apache License 2.0. See the `LICENSE` file for the full text.
