Metadata-Version: 2.1
Name: pyspark-pdf
Version: 0.1.0rc5
Summary: Spark-Pdf is a library for processing documents using Apache Spark
Author: Mykola Melnyk
Author-email: mykola@stabrise.com
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: ml
Requires-Dist: PyMuPDF (==1.24.11)
Requires-Dist: imagesize (==1.4.1)
Requires-Dist: numpy (>=1.26.4,<2.0.0)
Requires-Dist: pandas (>=2.2.2,<3.0.0)
Requires-Dist: pillow (>=10.4.0,<11.0.0)
Requires-Dist: pyarrow (==17.0.0)
Requires-Dist: pyspark (==3.5.3)
Requires-Dist: pytesseract (==0.3.13)
Requires-Dist: pytest (>=7.4.4,<8.0.0)
Requires-Dist: torch (>=2.3.0,<3.0.0) ; extra == "ml"
Requires-Dist: transformers (>=4.42.0,<5.0.0) ; extra == "ml"
Description-Content-Type: text/markdown

<img src="./images/SparkPdfLogo.png">

<p align="center">
    <a href="https://pypi.org/project/pyspark-pdf/" alt="Package on PyPI"><img src="https://img.shields.io/pypi/v/pyspark-pdf.svg" /></a>
    <a href="https://github.com/stabrise/spark-pdf/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/stabrise/spark-pdf.svg?color=blue"></a>
    <a href="https://stabrise.com"><img alt="StabRise" src="https://img.shields.io/badge/powered%20by-StabRise-orange.svg?style=flat&colorA=E1523D&colorB=007D8A"></a>
</p>



# Spark Pdf

Spark-Pdf is a library for processing documents using Apache Spark.

It includes the following features:

- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run NER (named entity recognition) on text extracted from PDF documents and images
- Visualize NER results

## Installation

### Requirements

- Python 3.10 or higher
- Apache Spark 3.5 or higher
- Java 8
- Tesseract 5.0 or higher

Install the package from PyPI:

```bash
  pip install pyspark-pdf
```
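
The requirements listed in this section can be checked before a first run. The following is a minimal sketch that uses only the Python standard library. `check_environment` is a hypothetical helper and not part of Spark-Pdf. It only tests that the `java` and `tesseract` executables are on `PATH` and that the `pyspark-pdf` distribution is installed; it does not verify the Java, Spark, or Tesseract versions.

```python
import shutil
import sys
from importlib import metadata


def check_environment() -> dict[str, bool]:
    """Report which Spark-Pdf prerequisites are visible from this interpreter.

    Hypothetical helper for illustration; not part of the Spark-Pdf API.
    """
    report = {
        # The package metadata declares Requires-Python >=3.10,<4.0.
        "python": (3, 10) <= sys.version_info[:2] < (4, 0),
        # Spark needs a Java runtime; this only checks that `java` is on PATH.
        "java": shutil.which("java") is not None,
        # pytesseract calls the `tesseract` binary, so it must be on PATH too.
        "tesseract": shutil.which("tesseract") is not None,
    }
    try:
        # Raises PackageNotFoundError when the distribution is not installed.
        metadata.version("pyspark-pdf")
        report["pyspark-pdf"] = True
    except metadata.PackageNotFoundError:
        report["pyspark-pdf"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:12} {'ok' if ok else 'MISSING'}")
```

Running the script prints one line per prerequisite. An entry marked `MISSING` for `java` or `tesseract` usually means the tool is not installed or not on `PATH`.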

## Development

### Setup

```bash
  git clone https://github.com/stabrise/spark-pdf
  cd spark-pdf
```

### Install dependencies

```bash
  poetry install
```

### Run tests

```bash
  poetry run pytest --cov=sparkpdf --cov-report=html:coverage_report tests/
```

### Build package

```bash
  poetry build
```

### Build documentation

```bash
  poetry run sphinx-build -M html source build
```

### Docker

Build image:

```bash
  docker build -t spark-pdf .
```

Run container:

```bash
  docker run --rm -it --entrypoint bash spark-pdf:latest
```

### Release

```bash
  poetry version patch
  poetry publish --build
```

