Metadata-Version: 2.4
Name: doculens
Version: 0.1.0
Summary: Document screening and quality reporting pipeline for RAG preprocessing, PII detection, readability, and compliance workflows
Project-URL: Homepage, https://github.com/DHS-IT-Solutions/doculens
Project-URL: Repository, https://github.com/DHS-IT-Solutions/doculens
Project-URL: Issues, https://github.com/DHS-IT-Solutions/doculens/issues
Author-email: MARIA SELCIYA M <MARIA.SELCIYA@dhsit.co.uk>
License: MIT License
        
        Copyright (c) 2025 MARIA SELCIYA M
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: NLP,PII detection,RAG preprocessing,compliance,document audit,document quality,document screening,pipeline,readability,text analysis
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: jinja2>=3.1
Requires-Dist: langdetect>=1.0
Requires-Dist: pdfplumber>=0.9
Requires-Dist: pydantic>=2.0
Requires-Dist: python-docx>=0.8
Requires-Dist: rich>=13.0
Requires-Dist: textstat>=0.7
Requires-Dist: typer>=0.12
Provides-Extra: all
Requires-Dist: language-tool-python>=2.7; extra == 'all'
Requires-Dist: presidio-analyzer>=2.2; extra == 'all'
Requires-Dist: presidio-anonymizer>=2.2; extra == 'all'
Requires-Dist: spacy>=3.5; extra == 'all'
Provides-Extra: dev
Requires-Dist: build; extra == 'dev'
Requires-Dist: hatchling; extra == 'dev'
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: twine; extra == 'dev'
Provides-Extra: grammar
Requires-Dist: language-tool-python>=2.7; extra == 'grammar'
Provides-Extra: pii
Requires-Dist: presidio-analyzer>=2.2; extra == 'pii'
Requires-Dist: presidio-anonymizer>=2.2; extra == 'pii'
Requires-Dist: spacy>=3.5; extra == 'pii'
Description-Content-Type: text/markdown

# doculens

> Screen documents for quality, PII, readability, and compliance — before they enter your RAG pipeline or review workflow.

[![PyPI version](https://badge.fury.io/py/doculens.svg)](https://badge.fury.io/py/doculens)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## What is doculens?

**doculens** is a document screening and quality reporting pipeline. It ingests documents (PDF, DOCX, TXT, HTML), runs them through a configurable set of screeners, and produces structured reports in JSON or HTML.

## Features

- **Multi-format ingestion** — PDF, DOCX, TXT, HTML
- **6 built-in screeners** — readability, word count, language detection, PII detection, grammar, duplicate detection
- **Structured reports** — JSON and HTML output with color-coded score bars
- **CLI interface** — single file and batch screening with Rich-formatted terminal output
- **Pluggable architecture** — add custom screeners to the pipeline
- **Configurable thresholds** — min/max word count, expected language, readability floor, and more
- **PII redaction** — replace detected sensitive data with labelled placeholders like `[EMAIL_REDACTED]`
- **Auto-detection** — PII and grammar screeners are automatically enabled when their dependencies are installed

## Installation

```bash
pip install doculens
```

### Optional extras

```bash
pip install "doculens[pii]"       # PII detection (Presidio + spaCy)
pip install "doculens[grammar]"   # Grammar checking (LanguageTool)
pip install "doculens[all]"       # Everything (quotes keep shells such as zsh from expanding the brackets)
```

## Quickstart

### CLI

```bash
# Screen a single document
doculens run report.pdf

# Choose specific screeners and output format
doculens run report.pdf --screeners readability,wordcount,pii --format html -o report.html

# Screen all documents in a folder
doculens batch ./documents --recursive --format json --output-dir ./reports

# Redact PII from a document
doculens redact contract.pdf -o contract_clean.txt

# List available screeners
doculens list-screeners
```

### Python API

```python
from doculens import ScreeningConfig, ScreeningPipeline

# Configure and run
config = ScreeningConfig(
    screeners=["readability", "wordcount", "language"],
    min_word_count=50,
    expected_language="en",
)
pipeline = ScreeningPipeline(config)
report = pipeline.screen_file("report.pdf")

print(report.overall_passed)        # True / False
print(report.summary)               # {'total_screeners': 3, 'passed': 3, ...}

for result in report.results:
    print(f"{result.screener_name}: {result.score:.2f} — {'PASS' if result.passed else 'FAIL'}")
```
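
One way to use `overall_passed` as an ingestion gate is sketched below. `partition_by_result` is a hypothetical helper, not part of doculens. It assumes only the `screen_file` method and the `overall_passed` attribute shown above, so any object with that shape will work.

```python
from pathlib import Path
from typing import Any, Iterable


def partition_by_result(
    pipeline: Any, paths: Iterable[Path]
) -> tuple[list[Path], list[Path]]:
    """Split paths into (accepted, rejected) based on each report's overall result."""
    accepted: list[Path] = []
    rejected: list[Path] = []
    for path in paths:
        # Assumption: screen_file accepts a string path, as in the example above.
        report = pipeline.screen_file(str(path))
        (accepted if report.overall_passed else rejected).append(path)
    return accepted, rejected
```

With a configured pipeline, `partition_by_result(pipeline, Path("documents").glob("*.pdf"))` would return the files that passed every screener and those that did not, so only the first list moves on to indexing.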

### PII redaction

```python
# Requires the optional PII extra: pip install "doculens[pii]"
from doculens.screeners.pii import PIIScreener

screener = PIIScreener()
redacted = screener.redact("Contact John Smith at john@example.com")
print(redacted)
# Contact [NAME_REDACTED] at [EMAIL_REDACTED]
```
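
The labelled-placeholder idea can be illustrated without Presidio. The sketch below is a simplified, regex-only stand-in and not doculens's implementation. The `PATTERNS` table, the `IP` label, and the `redact_with_labels` function are hypothetical, and the patterns catch only simple email addresses and IPv4-looking strings.

```python
import re

# Hypothetical label -> pattern table; real PII detection needs far more than this.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact_with_labels(text: str) -> str:
    """Replace every match of each pattern with a [LABEL_REDACTED] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text


print(redact_with_labels("Mail john@example.com from 10.0.0.1"))
# Mail [EMAIL_REDACTED] from [IP_REDACTED]
```

Regexes cannot recognise names or other context-dependent entities, which is why the `pii` screener relies on Presidio and spaCy.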

### Generate reports

```python
from doculens import HTMLReportGenerator, JSONReportGenerator

# `report` is the result of pipeline.screen_file(...) from the Python API example

# HTML report
html = HTMLReportGenerator()
html.save(report, "report.html")

# JSON report
json_gen = JSONReportGenerator()
json_gen.save(report, "report.json")
```

## Screeners

| Screener | Key | What it checks | Library |
|---|---|---|---|
| Readability | `readability` | Flesch score, grade level, Gunning Fog, SMOG | `textstat` |
| Word Count | `wordcount` | Min/max words, line count, avg word length | built-in |
| Language | `language` | Detects language, validates against expected | `langdetect` |
| PII Detection | `pii` | Emails, phones, names, credit cards, IPs | `presidio` + `spaCy` |
| Grammar | `grammar` | Spelling and grammar errors | `language-tool-python` |
| Duplicates | `duplicate` | Exact and near-duplicate paragraphs | built-in |

## CLI Options
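To make the duplicate screener's similarity threshold concrete, here is a minimal sketch of near-duplicate paragraph detection using the standard library's `difflib.SequenceMatcher`. The technique is an assumption chosen for illustration: this README does not say which similarity measure doculens uses, and `find_near_duplicates` is a hypothetical name. The default of 0.8 mirrors the `--dup-threshold` default.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(
    text: str, threshold: float = 0.8
) -> list[tuple[int, int, float]]:
    """Return (i, j, ratio) for each paragraph pair at or above the threshold."""
    # Paragraphs are blocks separated by blank lines; case and whitespace are normalised.
    paragraphs = [" ".join(p.split()).lower() for p in text.split("\n\n") if p.strip()]
    hits = []
    for (i, a), (j, b) in combinations(enumerate(paragraphs), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            hits.append((i, j, round(ratio, 2)))
    return hits
```

A ratio of 1.0 indicates an exact duplicate after normalisation. Comparing every pair is quadratic in the number of paragraphs, so this approach suits individual documents rather than large corpora.
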

```
doculens run <file> [OPTIONS]
  --screeners, -s    Comma-separated screener names (auto-includes pii/grammar if installed)
  --format, -f       Report format: json or html (default: json)
  --output, -o       Save report to file
  --min-words        Minimum word count threshold (default: 50)
  --lang             Expected language code, e.g. "en"
  --dup-threshold    Similarity threshold for duplicate detection (default: 0.8)
  --verbose, -v      Show warnings in output

doculens batch <folder> [OPTIONS]
  --screeners, -s    Comma-separated screener names
  --format, -f       Report format: json or html (default: json)
  --output-dir, -o   Directory to save individual reports
  --min-words        Minimum word count threshold (default: 50)
  --recursive, -r    Scan subdirectories
  --dup-threshold    Similarity threshold for duplicate detection (default: 0.8)
  --verbose, -v      Show warnings in output

doculens redact <file> [OPTIONS]
  --output, -o       Save redacted text to file (default: print to stdout)
```

### Sample CLI output

```
doculens — screening report.pdf

╭────────────┬────────┬──────────────────────┬──────────────────────────────────╮
│ Screener   │ Status │ Score                │ Summary                          │
├────────────┼────────┼──────────────────────┼──────────────────────────────────┤
│ Readability│  PASS  │ ████████████     72% │ Standard — suitable for most     │
│            │        │                      │ business documents               │
│ Word Count │  PASS  │ ████████████    100% │ 1,243 words, 48 lines            │
│ Language   │  PASS  │ ████████████    100% │ English (100% confident)         │
│ PII        │  FAIL  │ ██████           42% │ 3 PII items: 2 emails, 1 name    │
╰────────────┴────────┴──────────────────────┴──────────────────────────────────╯
╭──────────────────────────────────────────────────────────────╮
│ Overall: FAILED  |  Words: 1,243  |  Screeners: 3/4 passed  │
╰──────────────────────────────────────────────────────────────╯
```

## Supported formats

| Format | Extensions | Library |
|---|---|---|
| PDF | `.pdf` | `pdfplumber` |
| Word | `.docx` | `python-docx` |
| HTML | `.html`, `.htm` | `beautifulsoup4` |
| Plain text | `.txt` | built-in |

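The extension-to-format mapping in the table can be expressed as a small lookup, which is handy for pre-filtering a folder before a batch run. The sketch below is illustrative only: `FORMAT_BY_EXTENSION` and `detect_format` are hypothetical names, and doculens may resolve formats differently internally.

```python
from pathlib import Path

# Mirrors the supported-formats table; the keys are lowercase file suffixes.
FORMAT_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".txt": "txt",
}


def detect_format(path: str) -> str:
    """Return the format key for a path, or raise ValueError if it is unsupported."""
    suffix = Path(path).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Lowercasing the suffix means `Report.PDF` and `report.pdf` are treated the same way.
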
## Contributing

Contributions are welcome. Please open an issue first to discuss what you would like to change.

## License

MIT. See the `LICENSE` file for the full text.
