Metadata-Version: 2.4
Name: use-page-filter
Version: 0.1.0
Summary: Detect blank or content pages in PDFs using OCR and image analysis
Author: Sujan Sharma
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: opencv-python
Requires-Dist: numpy
Requires-Dist: pytesseract
Requires-Dist: pymupdf
Dynamic: license-file

# Use Page Filter

**Use Page Filter** is a lightweight Python package that detects whether a page in a PDF contains useful content or should be considered blank.

It is designed for real-world document pipelines such as tax documents, scanned PDFs, and mixed digital/scanned files where pages may contain very small text, thin characters, shapes, or tables.  
*It can run independently in your pipeline to cut down on unnecessary calls to VLLMs and similar downstream tasks.*

The package prioritizes **not missing real text**, even if it is only a single character.

---

## Features

- Detect blank pages in PDFs  
- Works with scanned and digital PDFs  
- Detects very thin characters (e.g., `A`, `S`)  
- Handles pages with tables or shapes  
- Avoids OCR hallucinations from simple shapes  
- Multi-stage OCR detection pipeline  
- Lightweight and CPU-friendly  

---

## How It Works

Each page is processed through a **progressive detection pipeline**.  
The system stops as soon as text is detected.

### Processing Flow

1. **Native PDF Text Extraction**
   - If the page already contains digital text → `CONTENT`

2. **Direct OCR**
   - Run OCR on the rendered page

3. **Rotated OCR**
   - Try OCR with rotations: `90°`, `180°`, `270°`

4. **Scaled OCR**
   - Downscale image (`50%`, `25%`) and run OCR

5. **Dilated + Scaled OCR**
   - Strengthen thin strokes  
   - Run OCR again

6. **Shape Validation**
   - Prevent shapes like boxes or lines from being interpreted as text
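
The early-exit behaviour of this flow can be sketched in a few lines. This is only an illustration, not the package's actual code: `first_stage_with_text` and the stand-in stage functions are hypothetical names, and each stage is assumed to return `True` when it finds valid text.

```python
from typing import Callable, Iterable, Optional, Tuple

# A stage takes a page-like object and reports whether it found valid text.
Stage = Callable[[object], bool]


def first_stage_with_text(page: object, stages: Iterable[Tuple[str, Stage]]) -> Optional[str]:
    """Run stages in order and return the name of the first one that finds text."""
    for name, stage in stages:
        if stage(page):
            return name  # early exit: later, costlier stages never run
    return None  # no stage found text


# Hypothetical stand-ins; real stages would call PyMuPDF or Tesseract.
calls = []


def native_text(page: object) -> bool:
    calls.append("native")
    return False


def direct_ocr(page: object) -> bool:
    calls.append("direct")
    return True


def rotated_ocr(page: object) -> bool:
    calls.append("rotated")
    return True


stages = [
    ("Native text", native_text),
    ("OCR direct", direct_ocr),
    ("OCR rotated", rotated_ocr),
]

print(first_stage_with_text("page-1", stages))  # OCR direct
print(calls)  # ['native', 'direct']
```

Ordering the stages from cheapest to most expensive means most pages are settled before any rotation, scaling, or dilation is attempted.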

---

### Final Decision

- If any stage detects a valid letter or number → `CONTENT`  
- Otherwise → `BLANK`
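
One plausible way to express this rule is to accept OCR output only if it contains at least one alphanumeric character. The function name `classify_ocr_text` is hypothetical, and whether the package applies exactly this test is an assumption; the sketch only shows why stray marks that OCR sometimes emits for lines and boxes would not count as content.

```python
def classify_ocr_text(text: str) -> str:
    """Return "CONTENT" if text holds at least one letter or digit, else "BLANK"."""
    # str.isalnum() is False for punctuation, whitespace, and underscores,
    # so output such as "|" or "__" from ruled lines is ignored.
    return "CONTENT" if any(ch.isalnum() for ch in text) else "BLANK"


print(classify_ocr_text(" | _ -- \n"))  # BLANK
print(classify_ocr_text("  S "))        # CONTENT
print(classify_ocr_text("7"))           # CONTENT
```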

---

## Installation

### Install with pip

```bash
pip install use-page-filter
```
---

## Install Tesseract OCR (Required)

Tesseract OCR must be installed separately.

Ubuntu:

```bash
sudo apt install tesseract-ocr
```

Mac:

```bash
brew install tesseract
```

Windows:
https://github.com/UB-Mannheim/tesseract/wiki


## Quick Example

```python
from use_page_filter import process_pdf

results = process_pdf("document.pdf")

for page in results:
    print(page)
```

Example output:
```
Page 1  | CONTENT | Native text
Page 2  | BLANK   | Solid page
Page 3  | CONTENT | OCR scaled
```

## Detect a Single Page

```python
import fitz  # PyMuPDF

from use_page_filter import detect_page

doc = fitz.open("document.pdf")

for page in doc:
    is_blank, reason, confidence = detect_page(page)
    print(is_blank, reason, confidence)
```
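
To pass only useful pages to a later step, the per-page results can be reduced to a list of page numbers. This is a minimal sketch: `content_page_numbers` is a hypothetical helper, the tuple shape mirrors the one unpacked from `detect_page` above, `confidence` is assumed to be a float, and the sample values are made up.

```python
from typing import Iterable, List, Tuple

# (is_blank, reason, confidence), as unpacked from detect_page above.
PageResult = Tuple[bool, str, float]


def content_page_numbers(results: Iterable[PageResult]) -> List[int]:
    """Return the 1-based numbers of pages that were not flagged as blank."""
    return [
        number
        for number, (is_blank, _reason, _confidence) in enumerate(results, start=1)
        if not is_blank
    ]


# Made-up results for a three-page document.
sample = [
    (False, "Native text", 1.0),
    (True, "Solid page", 1.0),
    (False, "OCR scaled", 0.8),
]

print(content_page_numbers(sample))  # [1, 3]
```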


## Limitations

* Very stylized fonts may not be detected.
* Complex graphical pages may require additional heuristics.
* OCR accuracy depends on the Tesseract engine.

---

**Note:** This was made in a day for personal use.

*Feel free to contribute.*
