Metadata-Version: 2.4
Name: smart_pdf_for_business
Version: 1.0.0
Summary: A Python library for business PDF related content analysis
Project-URL: Homepage, https://github.com/Sagraetor/Smart-Pdf-for-Business
Author-email: Sagraetor <sagraetor@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: Ai,Business automation,Data extraction,Document parsing,Machine Learning,NLP,OCR,PDF,PDF analysis,Semantic search,Signatures,Text extraction
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Filters
Requires-Python: >=3.10
Requires-Dist: accelerate==1.11.0
Requires-Dist: annotated-types==0.7.0
Requires-Dist: antlr4-python3-runtime==4.9.3
Requires-Dist: attrs==25.4.0
Requires-Dist: beautifulsoup4==4.14.2
Requires-Dist: blis==1.3.0
Requires-Dist: catalogue==2.0.10
Requires-Dist: certifi==2025.11.12
Requires-Dist: charset-normalizer==3.4.4
Requires-Dist: click==8.3.0
Requires-Dist: cloudpathlib==0.23.0
Requires-Dist: colorama==0.4.6
Requires-Dist: coloredlogs==15.0.1
Requires-Dist: colorlog==6.10.1
Requires-Dist: confection==0.1.5
Requires-Dist: cymem==2.0.11
Requires-Dist: dill==0.4.0
Requires-Dist: docling-core==2.50.1
Requires-Dist: docling-ibm-models==3.10.2
Requires-Dist: docling-parse==4.7.1
Requires-Dist: docling==2.61.2
Requires-Dist: easyocr==1.7.2
Requires-Dist: et-xmlfile==2.0.0
Requires-Dist: faker==38.0.0
Requires-Dist: filelock==3.20.0
Requires-Dist: filetype==1.2.0
Requires-Dist: flatbuffers==25.9.23
Requires-Dist: fsspec==2025.10.0
Requires-Dist: huggingface-hub==0.36.0
Requires-Dist: humanfriendly==10.0
Requires-Dist: idna==3.11
Requires-Dist: imageio==2.37.2
Requires-Dist: jinja2==3.1.6
Requires-Dist: joblib==1.5.2
Requires-Dist: jsonlines==4.0.0
Requires-Dist: jsonref==1.1.0
Requires-Dist: jsonschema-specifications==2025.9.1
Requires-Dist: jsonschema==4.25.1
Requires-Dist: latex2mathml==3.78.1
Requires-Dist: lazy-loader==0.4
Requires-Dist: lxml==6.0.2
Requires-Dist: markdown-it-py==4.0.0
Requires-Dist: marko==2.2.1
Requires-Dist: markupsafe==3.0.3
Requires-Dist: mdurl==0.1.2
Requires-Dist: mpire==2.10.2
Requires-Dist: mpmath==1.3.0
Requires-Dist: multiprocess==0.70.18
Requires-Dist: murmurhash==1.0.13
Requires-Dist: networkx==3.5
Requires-Dist: ninja==1.13.0
Requires-Dist: numpy==2.2.6
Requires-Dist: omegaconf==2.3.0
Requires-Dist: onnxruntime==1.23.2
Requires-Dist: opencv-python-headless==4.12.0.88
Requires-Dist: opencv-python==4.12.0.88
Requires-Dist: openpyxl==3.1.5
Requires-Dist: packaging==25.0
Requires-Dist: pandas==2.3.3
Requires-Dist: pillow==11.3.0
Requires-Dist: pluggy==1.6.0
Requires-Dist: polyfactory==2.22.4
Requires-Dist: preshed==3.0.10
Requires-Dist: protobuf==6.33.1
Requires-Dist: psutil==7.1.3
Requires-Dist: pyclipper==1.3.0.post6
Requires-Dist: pydantic-core==2.41.5
Requires-Dist: pydantic-settings==2.12.0
Requires-Dist: pydantic==2.12.4
Requires-Dist: pygments==2.19.2
Requires-Dist: pylatexenc==2.10
Requires-Dist: pymupdf==1.26.6
Requires-Dist: pypdfium2==4.30.0
Requires-Dist: pyreadline3==3.5.4
Requires-Dist: python-bidi==0.6.7
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: python-docx==1.2.0
Requires-Dist: python-dotenv==1.2.1
Requires-Dist: python-pptx==1.0.2
Requires-Dist: pytz==2025.2
Requires-Dist: pywin32==311
Requires-Dist: pyyaml==6.0.3
Requires-Dist: rapidocr==3.4.2
Requires-Dist: referencing==0.37.0
Requires-Dist: regex==2025.11.3
Requires-Dist: requests==2.32.5
Requires-Dist: rich==14.2.0
Requires-Dist: rpds-py==0.28.0
Requires-Dist: rtree==1.4.1
Requires-Dist: safetensors==0.6.2
Requires-Dist: scikit-image==0.25.2
Requires-Dist: scikit-learn==1.7.2
Requires-Dist: scipy==1.16.3
Requires-Dist: semchunk==2.2.2
Requires-Dist: sentence-transformers==5.1.2
Requires-Dist: setuptools==80.9.0
Requires-Dist: shapely==2.1.2
Requires-Dist: shellingham==1.5.4
Requires-Dist: six==1.17.0
Requires-Dist: smart-open==7.5.0
Requires-Dist: soupsieve==2.8
Requires-Dist: spacy-layout==0.0.12
Requires-Dist: spacy-legacy==3.0.12
Requires-Dist: spacy-loggers==1.0.5
Requires-Dist: spacy==3.8.8
Requires-Dist: srsly==2.5.1
Requires-Dist: sympy==1.14.0
Requires-Dist: tabulate==0.9.0
Requires-Dist: thinc==8.3.8
Requires-Dist: threadpoolctl==3.6.0
Requires-Dist: tifffile==2025.10.16
Requires-Dist: tokenizers==0.22.1
Requires-Dist: torch==2.9.0
Requires-Dist: torchvision==0.24.0
Requires-Dist: tqdm==4.67.1
Requires-Dist: transformers==4.57.1
Requires-Dist: typer-slim==0.20.0
Requires-Dist: typer==0.19.2
Requires-Dist: typing-extensions==4.15.0
Requires-Dist: typing-inspection==0.4.2
Requires-Dist: tzdata==2025.2
Requires-Dist: urllib3==2.5.0
Requires-Dist: wasabi==1.1.3
Requires-Dist: weasel==0.4.2
Requires-Dist: wrapt==2.0.1
Requires-Dist: xlsxwriter==3.2.9
Description-Content-Type: text/markdown


# Smart PDF for Business: Extract Text, Headings, and Signatures from PDFs



**Smart PDF for Business** is a Python library for structured PDF content analysis, built on top of **[spaCy-layout](https://github.com/explosion/spacy-layout/tree/main?tab=readme-ov-file)**.
It uses spaCy-layout's page-, block-, and layout-aware text extraction and adds higher-level features for business workflows and document automation. Smart PDF for Business provides a single interface to extract, search, and analyze PDF documents:

* Load PDFs from files, folders, or raw bytes.
* Extract headings, body text, sections, and layout-aware content.
* Search text using **keywords** or **semantic similarity** powered by SentenceTransformers.
* Detect handwritten or scanned **signatures** and report their bounding boxes, page numbers, and optional cropped images.
* Export results as **plain text**, **Markdown**, **CSV**, **Excel**, or **JSON**.
* Process **multiple PDFs at once** with batch utilities.

[![PyPI Version](https://img.shields.io/pypi/v/smart_pdf_for_business.svg?style=flat-square&logo=pypi)](https://pypi.org/project/smart_pdf_for_business/)
[![Python Version](https://img.shields.io/badge/python-3.10%2B-blue.svg?style=flat-square)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-MIT-green.svg?style=flat-square)](LICENSE)

## 📝 Usage

> ⚠️ This package requires Python 3.10 or above.

```bash
pip install smart_pdf_for_business
```

After initialising a `PDFDoc` object with one of the factory methods (`from_file`, `from_folder`, `from_bytes`, or `from_byte_list`), you can call its search methods. Most of them return a new `PDFDoc` object containing the result, so you can pass that result directly to the export methods such as `to_csv` and `to_excel`.

```python
from smart_pdf_for_business import PDFDoc

# Load a PDF
pdf = PDFDoc.from_file("example.pdf")

# Perform semantic search
clauses = pdf.search_by_meaning("termination clause", chunk_by="semantic")

# Export results
clauses.to_csv("output.csv")
clauses.to_excel("output.xlsx")
```
Alternatively, pass `as_tuple=True` to get the results as a list of tuples instead of a new `PDFDoc`.
```python
from smart_pdf_for_business import PDFDoc

# Load a PDF
pdf = PDFDoc.from_file("example.pdf")

# Perform semantic search
clauses = pdf.search_by_meaning("termination clause", chunk_by="semantic", as_tuple=True)

# Print the results as a list of (text, score) tuples
print(clauses)
```
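
The API table below documents the `as_tuple=True` return type of `search_by_meaning` as `list[tuple[str, float]]`. The following is a minimal sketch of how such a list could be filtered and ranked with plain Python. The `top_matches` helper and the sample data are hypothetical and not part of the library, and the sketch assumes the float is a similarity score where a higher value means a closer match.

```python
from typing import List, Tuple


def top_matches(
    results: List[Tuple[str, float]],
    min_score: float = 0.5,
    limit: int = 3,
) -> List[Tuple[str, float]]:
    """Keep results scoring at least min_score, best first, capped at limit."""
    kept = [(text, score) for text, score in results if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Hypothetical stand-in for a list returned with as_tuple=True.
sample = [
    ("Either party may terminate this agreement with 30 days notice.", 0.82),
    ("Payment is due within 14 days of invoice.", 0.21),
    ("Termination for cause takes effect immediately.", 0.67),
]

for text, score in top_matches(sample):
    print(f"{score:.2f}  {text}")
```

The `threshold` and `max_results` parameters of `search_by_meaning` may already cover simple cases; a helper like this is mainly useful when the same result list is reused with different cut-offs.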


## 📚 API

### <kbd>class</kbd> PDFDoc
| Attribute | Description |
|------|-------------|
| `data: bytes` | Raw byte content of the PDF document. |
| `path: Optional[Path]` | File system path to the PDF file, if loaded from disk. |
| `name: Optional[str]` | Name of the PDF document (usually the filename). |
| `spacy_doc: Optional[Doc]` | Processed spaCy Doc object, containing text, layout, and markdown. |
| `sections: List[Tuple[str, str]]` | List of `(title, body)` tuples extracted from the document. |
| `weighted_sections: List[Tuple[Tuple[str, str], float]]` | List of sections paired with a semantic match weight. |
| `signatures: List[Signature]` | Detected signatures stored as `Signature` objects. |

---

| Method | Description | Parameters |
|---------|-------------|-----------|
| `from_file` | Creates a `PDFDoc` from a PDF file. <br>**Return:** `PDFDoc` | `path: str \| Path` <br>- Path to the PDF file |
| `from_folder` | Loads all PDFs from a folder (optionally recursively). <br>**Return:** `list[PDFDoc]` | `folder_path: str \| Path` <br>- Folder containing PDFs <br>`recursive: bool = True` <br>- Include subfolders |
| `from_bytes` | Creates a `PDFDoc` from raw PDF byte content. <br>**Return:** `PDFDoc` | `data: bytes` <br>- PDF bytes <br>`name: Optional[str] = None` <br>- Name assigned to document |
| `from_byte_list` | Creates multiple `PDFDoc` objects from PDF byte streams. <br>**Return:** `list[PDFDoc]` | `byte_list: List[bytes]` <br>- List of PDF byte streams <br>`names: Optional[List[str]] = None` <br>- Corresponding PDF names |
| `search_signature` | Detects signatures near keyword regions and optionally saves cropped images. Updates `self.signatures`. <br>**Return:** `PDFDoc` | `keywords: list[str]` <br>- Signature-related keywords <br>`exact=False` <br>- Require exact matches (no substring matching) <br>`max_distance=70` <br>- Max pixel distance <br>`save_folder=None` <br>- Folder to save crops <br>`min_contrast=30` <br>- Minimum image contrast <br>`filter_stroke_density=True` <br>- Stroke-density filtering <br>`enforce_text_type: bool = True` <br>- Filter signature text using NLP |
| `search_header` | Searches section headers for keyword matches. <br>**Return:** `PDFDoc` or `list[tuple[str, str]]` | `keywords: list[str]` <br>- Header keywords <br>`exact: bool = True` <br>- Exact match <br>`as_tuple: bool = False` <br>- Return tuples |
| `search_body` | Searches section bodies for keyword matches. <br>**Return:** `PDFDoc` or `list[tuple[str, str]]` | `keywords: list[str]` <br>- Body keywords <br>`exact: bool = True` <br>- Exact match <br>`as_tuple: bool = False` <br>- Return tuples |
| `search_by_meaning` | Performs semantic search across sections, sentences, paragraphs, or words. <br>**Return:** `PDFDoc` or `list[tuple[str, float]]` | `query: str` <br>- Search query <br>`threshold: float = 0` <br>- Minimum similarity <br>`chunk_by: str = "section"` <br>- Chunking strategy <br>`as_tuple: bool = False` <br>- Return raw tuples <br>`word_count: int = 5` <br>- Words per chunk <br>`buffer_size: int = 1` <br>- Chunk overlap <br>`chunk_text_semantically_threshold: float = 0.3` <br>- Semantic chunk threshold <br>`max_results: int = 5` <br>- Max results |
| `to_text` | Converts the PDF to plain text with optional signature annotations. <br>**Return:** `str` | `annotate_signatures: bool = True` <br>- Annotate signature positions |
| `to_markdown` | Converts the PDF to Markdown with optional signature annotations. <br>**Return:** `str` | `annotate_signatures: bool = True` <br>- Annotate signature positions |
| `to_dataframe` | Converts the document to a structured pandas DataFrame. <br>**Return:** `pandas.DataFrame` | None |
| `to_excel` | Exports the document DataFrame to an Excel (.xlsx) file. <br>**Return:** `None` | `path: str \| Path` <br>- Output .xlsx path <br>`sheet_name: str = "Sheet1"` <br>- Sheet name |
| `to_csv` | Exports the document DataFrame to a CSV file. <br>**Return:** `None` | `path: str \| Path` <br>- Output CSV file path |
| `to_json` | Serializes the document into JSON, optionally writing to a file. <br>**Return:** `str` (JSON) or `None` (if saved to file) | `path: Optional[str \| Path] = None` <br>- Optional save path <br>`indent: int = 2` <br>- JSON formatting |

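The methods above can be chained, because the search methods return a new `PDFDoc` by default. The following is a rough sketch of one such chain, using only the method and parameter names listed in the table. The `export_matching_sections` helper is hypothetical, and the comment on `exact=False` reflects an assumption about partial matching rather than confirmed behaviour.

```python
from pathlib import Path


def export_matching_sections(pdf_path, keywords, json_path):
    """Load a PDF from raw bytes, keep sections whose header matches, save as JSON."""
    # Imported lazily so the helper can be defined without the package installed.
    from smart_pdf_for_business import PDFDoc

    path = Path(pdf_path)
    # from_bytes takes the raw content plus an optional document name.
    pdf = PDFDoc.from_bytes(data=path.read_bytes(), name=path.name)

    # Assumption: exact=False allows partial header matches.
    matches = pdf.search_header(keywords=keywords, exact=False)

    # With a path, to_json is documented to write the file and return None.
    matches.to_json(path=json_path, indent=2)
    return matches
```

A call such as `export_matching_sections("contract.pdf", ["Termination"], "termination.json")` would then leave the matching sections in a JSON file, with the file names here being placeholders.
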
### <kbd>class</kbd> PDFDocBatch
| Attribute | Description |
|-----------|-------------|
| `pdfdocs: list[PDFDoc]` | List containing all `PDFDoc` instances in the batch. |

---

| Method | Description | Parameters |
|---------|-------------|-----------|
| `from_folder` | Creates a batch from all PDFs in a folder. <br>**Return:** `PDFDocBatch` | `path: str \| Path` <br>- Path to the folder containing PDFs <br>`recursive: bool = True` <br>- Whether to include subfolders |
| `from_byte_list` | Creates a batch from a list of PDF byte streams. <br>**Return:** `PDFDocBatch` | `byte_list: List[bytes]` <br>- List of PDF bytes <br>`names: Optional[List[str]] = None` <br>- Optional list of names corresponding to each PDF |
| `from_pdfdoc_list` | Creates a batch from an existing list of `PDFDoc` objects. <br>**Return:** `PDFDocBatch` | `pdfdoc_list: List[PDFDoc]` <br>- List of `PDFDoc` instances |
| `extend` | Extends the batch with another batch or list of PDFs. <br>**Return:** `None` | `other: Union[List[PDFDoc], PDFDocBatch]` <br>- Batch or list of PDFs to append |
| `append` | Appends a single `PDFDoc` to the batch. <br>**Return:** `None` | `pdfdoc: PDFDoc` <br>- `PDFDoc` instance to append |
| `search_signature` | Detects handwritten or scanned signatures associated with keyword regions for multiple PDFs. <br>**Return:** `PDFDocBatch` | `keywords: List[str]` <br>- Keywords used to locate signature-related regions <br>`exact: bool = False` <br>- Whether keyword matching must be exact <br>`max_distance: int = 70` <br>- Maximum pixel distance between a keyword block and signature <br>`save_folder: Optional[str \| Path] = None` <br>- Folder to save cropped signature images, if provided <br>`min_contrast: int = 30` <br>- Minimum pixel contrast threshold <br>`filter_stroke_density: bool = True` <br>- Filter image regions based on stroke density <br>`enforce_text_type: bool = True` <br>- Filter signature text using NLP |
| `search_header` | Searches headers in every PDF for keywords. <br>**Return:** `PDFDocBatch` | `keywords: list[str]` <br>- Header keywords <br>`exact: bool = True` <br>- Exact or partial matching |
| `search_body` | Searches body text in every PDF for keywords. <br>**Return:** `PDFDocBatch` | `keywords: list[str]` <br>- Body keywords <br>`exact: bool = True` <br>- Exact or partial matching |
| `search_by_meaning` | Performs semantic search across all PDFs. <br>**Return:** `PDFDocBatch` | `query: str` <br>- Natural language query <br>`threshold: float = 0` <br>- Min similarity score <br>`chunk_by: str = "section"` <br>- Chunking method <br>`word_count: int = 5` <br>- Words per chunk <br>`buffer_size: int = 1` <br>- Overlap size <br>`chunk_text_semantically_threshold: float = 0.3` <br>- Semantic similarity threshold <br>`max_results: int = 5` <br>- Max results per PDF |
| `to_dataframe` | Combines all PDFs in the batch into a DataFrame. PDFs with the same name overwrite each other. <br>**Return:** `pandas.DataFrame` | None |
| `to_excel` | Exports batch data to an Excel file. <br>**Return:** `None` | `path: str \| Path` <br>- Output Excel path <br>`sheet_name: str = "Sheet1"` <br>- Sheet name |
| `to_csv` | Exports batch data to a CSV file. <br>**Return:** `None` | `path: str \| Path` <br>- Output CSV file path |

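Because `to_dataframe` notes that PDFs with the same name overwrite each other, it can help to make names unique before building a batch from byte streams. The following is a small sketch of one way to do that with the standard library only. The `make_unique_names` helper is hypothetical, not part of the library, and it assumes the names are bare file names without directory parts.

```python
from pathlib import PurePath
from typing import List


def make_unique_names(names: List[str]) -> List[str]:
    """Return the names in order, adding a numeric suffix to any repeat."""
    used = set()
    unique = []
    for name in names:
        parts = PurePath(name)  # splits "report.pdf" into stem and suffix
        candidate = name
        counter = 1
        # Try report_2.pdf, report_3.pdf, ... until the name is unused.
        while candidate in used:
            counter += 1
            candidate = f"{parts.stem}_{counter}{parts.suffix}"
        used.add(candidate)
        unique.append(candidate)
    return unique


print(make_unique_names(["report.pdf", "invoice.pdf", "report.pdf"]))
```

The resulting list could then be passed as the `names` argument of `PDFDocBatch.from_byte_list`, in the same order as the byte streams.
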
### <kbd>dataclass</kbd> Signature

| Attribute  | Description                               |
| ---------- | ----------------------------------------- |
| `keyword`  | Keyword used for search                   |
| `text`     | OCR text of signature region              |
| `bbox`     | Bounding box coordinates (x1, y1, x2, y2) |
| `page`     | Page number                               |
| `img`      | Cropped image as `numpy.ndarray`          |
| `distance` | Distance to keyword                       |

---
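The `bbox` and `page` attributes of `Signature` are enough to do simple geometric filtering on detected signatures. The following is a minimal sketch that picks the largest bounding box on each page. `SignatureBox` is a hypothetical stand-in that mirrors only two of the documented attributes so the example runs without the library, and the helpers `bbox_size` and `largest_per_page` are likewise illustrative. It assumes `bbox` follows the documented `(x1, y1, x2, y2)` order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class SignatureBox:
    """Hypothetical stand-in with two of the documented Signature attributes."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def bbox_size(bbox: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return (width, height) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return abs(x2 - x1), abs(y2 - y1)


def largest_per_page(signatures: Iterable[SignatureBox]) -> Dict[int, SignatureBox]:
    """Map each page number to the signature with the largest box area."""
    best = {}
    for sig in signatures:
        width, height = bbox_size(sig.bbox)
        area = width * height
        # Keep the first signature seen on ties.
        if sig.page not in best or area > best[sig.page][0]:
            best[sig.page] = (area, sig)
    return {page: sig for page, (_, sig) in best.items()}


# Hypothetical sample values, not real detector output.
sigs = [
    SignatureBox(page=1, bbox=(0, 0, 10, 10)),
    SignatureBox(page=1, bbox=(0, 0, 30, 5)),
    SignatureBox(page=2, bbox=(5, 5, 6, 6)),
]
print(largest_per_page(sigs))
```

The same functions should work on real `Signature` objects from `PDFDoc.signatures`, since they only read `page` and `bbox`.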

## Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

---
