Metadata-Version: 2.4
Name: doxtract
Version: 0.0.9
Summary: Structured document processor with diagram/image/text extraction with optional langchain or dataset output
Home-page: https://github.com/EthanRyne/Advanced_pdf_extractor
Author: Bhavesh Kumar
Author-email: bhaveshk@gmail.com
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: PyMuPDF
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# 📄 doxtract

**doxtract** is a high-level document preprocessing toolkit that extracts per-page structured metadata from PDFs, DOCX, PPTX, or TXT files — with optional diagram/image detection and native support for RAG pipelines via 🦜 LangChain and 🤗 HuggingFace `datasets.Dataset`.

---

## ✨ Features

- 🔍 Detects and skips repeating headers and footers
- 🧠 Heuristically filters out Table of Contents pages
- 🖼 Extracts vector diagrams and embedded raster images
- 📑 Reconstructs clean plain-text or Markdown layouts
- 🦜 **LangChain Integration:** Native export to `langchain_core.documents.Document` with enriched metadata.
- 🔁 **Flexible Return Formats:**
  - A nested Python dictionary (`dict[doc_name → list[pages]]`)
  - A 🤗 `datasets.Dataset` for ML/NLP pipelines
  - A list of LangChain `Document` objects
- 🚫 Aborts with a warning on scanned PDFs that have no text layer, rather than guessing at content

---

## 📦 Installation

```bash
pip install doxtract
```

Or for local development:

```bash
git clone https://github.com/EthanRyne/Advanced_pdf_extractor
cd Advanced_pdf_extractor
pip install -e .
```

Make sure you have [LibreOffice](https://www.libreoffice.org/) installed and available as `soffice` in your `PATH` (required for `.docx`, `.pptx`, `.txt` conversion).
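
Before converting office files, it can help to confirm that `soffice` is actually reachable. The snippet below is a minimal sketch using only the standard library; the helper name `find_soffice` is hypothetical and not part of doxtract.

```python
import shutil


def find_soffice(candidates=("soffice",)):
    """Return the full path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


exe = find_soffice()
if exe:
    print(f"LibreOffice found at: {exe}")
else:
    print("soffice not found on PATH; .docx/.pptx/.txt conversion will likely fail")
```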

---

## 🧪 Quick Example

```python
from doxtract.processor import preprocess

output = preprocess(
    ["input/spec_sheet.pdf", "notes.docx"],
    markdown=True,               # Output GitHub-flavored Markdown
    extract_vectors=True,        # Extract vector diagrams
    extract_images=True,         # Extract raster images
    strip_headers_footers=True,  # Remove headers/footers from text
    preserve_layout=True,        # Keep the exact spacing from the PDF
    max_workers=None,            # Set to an int to process documents in parallel
    as_dataset=True              # Return a HuggingFace Dataset
)
print(output)
```

---

## ⚙️ Parameters

| Name                      | Type          | Description                                                    |
| ------------------------- | ------------- | -------------------------------------------------------------- |
| `paths`                   | `list[str]`   | List of input files (`.pdf`, `.docx`, `.pptx`, `.txt`)         |
| `markdown`                | `bool`        | If `True`, output uses GitHub‑flavored Markdown                |
| `extract_vectors`         | `bool`        | Save and log bounding boxes of detected diagrams               |
| `extract_images`          | `bool`        | Save visible images per page                                   |
| `output_root`             | `str or Path` | Directory to store outputs and extracted media                 |
| `strip_headers_footers`   | `bool`        | Remove recurring headers/footers from output text              |
| `preserve_layout`         | `bool`        | If `True`, use exact spacing from the PDF                      |
| `max_workers`             | `int`         | Number of workers for parallel document processing (optional)  |
| `as_langchain_docs`       | `bool`        | Return a list of `langchain_core.documents.Document` objects   |
| `as_dataset`              | `bool`        | Return a HuggingFace `datasets.Dataset`                        |
| *(advanced tuning knobs)* |               |                                                                |
| `vector_margin`           | `int`         | Padding around diagrams (in px)                                |
| `page_top_pct`            | `float`       | % height for detecting headers                                 |
| `page_bottom_pct`         | `float`       | % height for detecting footers                                 |
| `min_header_pages`        | `int`         | Min pages with similar header/footer to consider valid         |
| `toc_threshold`           | `int`         | TOC detection sensitivity                                      |
| `y_tol`                   | `int`         | Line grouping tolerance (vertical)                             |
| `space_thresh`            | `int`         | Horizontal gap → one space                                     |

---

## 🛑 OCR Handling

If a PDF is detected to be a **scanned document with no embedded text**, `doxtract` will **abort the run with a warning**:

> ⚠️ `scanned_file.pdf` looks like a scanned PDF with no text layer. Please run OCR first; aborting.

To preprocess such files, run OCR first using [OCRmyPDF](https://ocrmypdf.readthedocs.io/) or similar tools.
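
If you want to screen files yourself before calling `preprocess`, a simple check is whether any page carries extractable text. The sketch below is an assumption about how such a check could look, not the heuristic doxtract uses internally; `looks_scanned` and `pdf_page_texts` are hypothetical helpers, and the latter relies on PyMuPDF's `Page.get_text()`.

```python
def looks_scanned(page_texts, min_chars=1):
    """Return True when no page has at least min_chars non-whitespace characters."""
    return all(len("".join(text.split())) < min_chars for text in page_texts)


def pdf_page_texts(path):
    """Extract the plain text of every page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so looks_scanned works without it

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


# Usage with pre-extracted page texts:
print(looks_scanned(["", "  \n"]))        # no text on any page
print(looks_scanned(["", "Hello world"]))  # at least one page has text
```

For a real file, `looks_scanned(pdf_page_texts("scanned_file.pdf"))` combines the two helpers.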

---

## 📁 Output Example (simplified)

Each output "page" is a dictionary with:

```json
{
  "document_name": "spec.pdf",
  "page_number": 3,
  "page_content": "...",
  "is_toc_page": false,
  "headers": ["My Spec Sheet"],
  "footers": [],
  "diagrams": [
    {"path": "Doc Data/spec/diagrams/p003_1.png", "bbox": [12.1, 55.2, 430.6, 310.4]}
  ],
  "images_on_this_page": [
    "Doc Data/spec/images/p003_xref12.png"
  ]
}
```
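
With the default dictionary output, you can walk the pages directly. The following sketch skips Table of Contents pages and yields the remaining text; `content_pages` is a hypothetical helper, and the sample data (including the use of the file name as the dictionary key) is made up to mirror the structure shown above.

```python
def content_pages(output):
    """Yield (document_name, page_number, page_content) for every non-TOC page."""
    for doc_name, pages in output.items():
        for page in pages:
            if page.get("is_toc_page"):
                continue  # skip pages flagged as Table of Contents
            yield doc_name, page["page_number"], page["page_content"]


# Hand-written sample in the same shape as the page dictionaries above.
sample_output = {
    "spec.pdf": [
        {"document_name": "spec.pdf", "page_number": 1,
         "page_content": "Contents ...", "is_toc_page": True},
        {"document_name": "spec.pdf", "page_number": 2,
         "page_content": "1. Overview", "is_toc_page": False},
    ]
}

for name, number, text in content_pages(sample_output):
    print(f"{name} p.{number}: {text[:40]}")
```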

---

## 📑 Metadata & LangChain Compatibility

When using `as_langchain_docs=True`, **doxtract** populates each `Document`'s metadata with the keys below, so RAG citations can point back to the exact source file and page:

| Metadata Key   | Description                                           |
| -------------- | ----------------------------------------------------- |
| `source`       | The full path to the source file                      |
| `page`         | The 0-indexed page number                             |
| `total_pages`  | Total pages in the document                           |
| `creationdate` | Normalized PDF creation timestamp                     |
| `moddate`      | Normalized PDF modification timestamp                 |
| `title/author` | Title and author from the PDF document metadata       |
| `is_toc_page`  | Boolean flag indicating if the page is a TOC          |
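
Because `page` is 0-indexed, a human-readable citation usually needs a `+ 1`. Here is a small sketch of turning a metadata dictionary into a citation string; `format_citation` is a hypothetical helper, and the sample values are invented for illustration.

```python
def format_citation(metadata):
    """Build a short citation such as 'docs/spec.pdf, p. 3 of 10' from page metadata."""
    source = metadata.get("source", "unknown source")
    title = metadata.get("title")
    prefix = f"{title} ({source})" if title else source

    page = metadata.get("page")
    if page is None:
        return prefix

    label = f"p. {page + 1}"  # convert the 0-indexed page to a 1-indexed label
    total = metadata.get("total_pages")
    if total:
        label += f" of {total}"
    return f"{prefix}, {label}"


print(format_citation({"source": "docs/spec.pdf", "page": 2, "total_pages": 10}))
```

With LangChain output, the same helper can be called as `format_citation(doc.metadata)` for each returned `Document`.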

---

## 🤗 Dataset Mode

If `as_dataset=True`, the output is a HuggingFace-compatible `datasets.Dataset`, ideal for training/evaluation workflows:

```python
from doxtract.processor import preprocess

ds = preprocess(["spec.pdf"], as_dataset=True)
print(ds[0]["page_content"])
```

---

## 🦜 LangChain Mode

If `as_langchain_docs=True`, the output is a list of LangChain `Document` objects, ready to use in LangChain RAG pipelines:

```python
from doxtract.processor import preprocess

output = preprocess(
    ["spec.pdf"],
    markdown=True,
    preserve_layout=True,
    as_langchain_docs=True
)

# You can immediately plug into a Vector Store
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=output, embedding=OpenAIEmbeddings())
```

---

## 🧱 Dependencies

* `PyMuPDF` (fitz)
* `langchain` (optional, for LangChain output)
* `datasets` (optional, for dataset output)
* LibreOffice (`soffice`) for office conversion

---

## 🧑‍💻 License

MIT License © 2025

---

## 📬 Contributing

Pull requests welcome! For major changes, please open an issue first to discuss what you’d like to change or improve.
