Metadata-Version: 2.2
Name: pdf-masking-library
Version: 0.1.6
Summary: A library for processing PDFs with OCR and masking sensitive information
Author: Demo
Author-email: demo@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3
Description-Content-Type: text/markdown
Requires-Dist: pytesseract
Requires-Dist: pdf2image
Requires-Dist: pdfrw
Requires-Dist: lxml
Requires-Dist: reportlab
Requires-Dist: Pillow
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# PDF Masking Library

**pdf-masking-library** is a Python library that masks sensitive information in PDF files using Optical Character Recognition (OCR). It can mask predefined patterns such as Aadhaar numbers and PAN numbers, as well as custom patterns supplied by the user.

## A Simple Example

```python
import base64
from pdf_masking_library import process_pdf

base64_pdf_input = "Your base64 here"  # Base64-encoded contents of the input PDF
custom_pattern = [r"\b\d{2}\b"]  # List of regular expressions to mask
psm = 6  # OCR page segmentation mode (default: 6)
lang = "eng+kan"  # OCR languages (default: English and Kannada)
aadhar = True  # Enable Aadhaar masking
pan = True  # Enable PAN masking

# process_pdf returns the masked PDF as base64
base64_pdf_output = process_pdf(
    base64_pdf_input,
    custom_pattern=custom_pattern,
    psm=psm,
    lang=lang,
    aadhar=aadhar,
    pan=pan,
)

# Save the masked PDF to a file
with open("masked_output.pdf", "wb") as output_file:
    output_file.write(base64.b64decode(base64_pdf_output))
```
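
The example above expects the input PDF as a base64 string. As a minimal sketch of how that string could be produced from a file on disk, the helpers below use only the standard library. The names `pdf_file_to_base64` and `base64_to_pdf_file` are hypothetical and not part of this library, and the sketch assumes `process_pdf` accepts a base64 `str` rather than `bytes`.

```python
import base64
from pathlib import Path


def pdf_file_to_base64(path: str) -> str:
    """Read a PDF from disk and return its contents as a base64 string.

    Hypothetical helper; assumes process_pdf accepts a str, not bytes.
    """
    raw_bytes = Path(path).read_bytes()
    return base64.b64encode(raw_bytes).decode("ascii")


def base64_to_pdf_file(encoded: str, path: str) -> None:
    """Decode a base64 string and write the resulting bytes to disk."""
    Path(path).write_bytes(base64.b64decode(encoded))
```

With these helpers, `base64_pdf_input = pdf_file_to_base64("input.pdf")` could replace the placeholder string, and `base64_to_pdf_file(base64_pdf_output, "masked_output.pdf")` could replace the final `with open(...)` block.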

### Masking Information
The library allows Aadhaar, PAN, and custom-pattern masking to be enabled independently of one another:
-   Aadhaar numbers: 12-digit Indian identification numbers (enabled with aadhar=True in Python, or --aadhar on the command line).
-   PAN numbers: 10-character alphanumeric Permanent Account Numbers (enabled with pan=True in Python, or --pan on the command line).
-   Custom patterns: user-defined regular expressions (passed as custom_pattern in Python, or --custom-pattern on the command line).
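
To make the three pattern types above more concrete, here is a small sketch of what such regular expressions can look like when tested against plain text. The Aadhaar-like and PAN-like patterns are illustrative assumptions, not the patterns this library uses internally; `find_matches` is a hypothetical helper, and the sample numbers are made up.

```python
import re

# Illustrative patterns only; the library's built-in patterns may differ.
AADHAAR_LIKE = r"\b\d{4}\s?\d{4}\s?\d{4}\b"  # 12 digits, optionally in groups of 4
PAN_LIKE = r"\b[A-Z]{5}\d{4}[A-Z]\b"         # 5 letters, 4 digits, 1 letter
TWO_DIGITS = r"\b\d{2}\b"                    # custom pattern from the example above


def find_matches(text, patterns):
    """Return every substring of text matched by any of the patterns."""
    return [m.group(0) for p in patterns for m in re.finditer(p, text)]


sample = "Aadhaar: 1234 5678 9012, PAN: ABCDE1234F, ref 42"
print(find_matches(sample, [AADHAAR_LIKE, PAN_LIKE]))
# → ['1234 5678 9012', 'ABCDE1234F']
print(find_matches(sample, [TWO_DIGITS]))
# → ['42']
```

Trying a custom pattern against sample text like this before passing it to the library can help catch patterns that are too broad, such as a two-digit pattern that would also mask page numbers or dates.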



## Command-Line Interface (CLI)
The library includes a CLI tool for easy integration into scripts and workflows.

- Mask Aadhaar Numbers:
    > python -m pdf_masking_library input.pdf output.pdf --aadhar

- Mask PAN Numbers:
    > python -m pdf_masking_library input.pdf output.pdf --pan

- Mask Using Custom Patterns:
    > python -m pdf_masking_library input.pdf output.pdf --custom-pattern "\b\d{2}\b"


- Specify OCR Page Segmentation Mode (psm):
    > python -m pdf_masking_library input.pdf output.pdf --psm 3

    Default is 6 if not specified.

- Specify OCR Language (lang):
    > python -m pdf_masking_library input.pdf output.pdf --lang eng+kan+tel

    Default is eng+kan. Multiple languages should be separated using +.
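
For use in scripts, the CLI invocations above can be assembled programmatically. The following is a rough sketch, not part of the library: `build_mask_command` and `mask_pdf` are hypothetical names, only the flags documented above are used, and it assumes (unconfirmed) that several flags can be combined in a single invocation.

```python
import subprocess
import sys


def build_mask_command(input_pdf, output_pdf, aadhar=False, pan=False,
                       custom_pattern=None, psm=None, lang=None):
    """Build the argument list for `python -m pdf_masking_library`.

    Assumption: the documented flags may be combined in one call.
    """
    cmd = [sys.executable, "-m", "pdf_masking_library", input_pdf, output_pdf]
    if aadhar:
        cmd.append("--aadhar")
    if pan:
        cmd.append("--pan")
    if custom_pattern is not None:
        cmd += ["--custom-pattern", custom_pattern]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if lang is not None:
        cmd += ["--lang", lang]
    return cmd


def mask_pdf(input_pdf, output_pdf, **options):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_mask_command(input_pdf, output_pdf, **options), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with patterns such as `\b\d{2}\b`.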

