Metadata-Version: 2.4
Name: warraqa
Version: 1.0.0
Summary: Warraqa (ورّاقة) — Document Scribe Agent. Converts PDFs and Word/PowerPoint files to clean Markdown with self-scoring.
Project-URL: Homepage, https://github.com/AALAM-Studio/warraqa
Project-URL: Repository, https://github.com/AALAM-Studio/warraqa
Project-URL: Issues, https://github.com/AALAM-Studio/warraqa/issues
Project-URL: Changelog, https://github.com/AALAM-Studio/warraqa/blob/main/CHANGELOG.md
Project-URL: Commercial License, https://www.aalam.consulting/
Author-email: AALAM Studio <contact@aalam.consulting>
Maintainer-email: AALAM Studio <contact@aalam.consulting>
License: # PolyForm Noncommercial License 1.0.0
        
        <https://polyformproject.org/licenses/noncommercial/1.0.0>
        
        ## Acceptance
        
        In order to get any license under these terms, you must agree to them as both strict obligations and conditions to all your licenses.
        
        ## Copyright License
        
        The licensor grants you a copyright license for the software to do everything you might do with the software that would otherwise infringe the licensor's copyright in it for any permitted purpose.  However, you may only distribute the software according to [Distribution License](#distribution-license) and make changes or new works based on the software according to [Changes and New Works License](#changes-and-new-works-license).
        
        ## Distribution License
        
        The licensor grants you an additional copyright license to distribute copies of the software.  Your license to distribute covers distributing the software with changes and new works permitted by [Changes and New Works License](#changes-and-new-works-license).
        
        ## Notices
        
        You must ensure that anyone who gets a copy of any part of the software from you also gets a copy of these terms or the URL for them above, as well as copies of any plain-text lines beginning with `Required Notice:` that the licensor provided with the software.  For example:
        
        > Required Notice: Copyright Yoyodyne, Inc. (http://example.com)
        
        ## Changes and New Works License
        
        The licensor grants you an additional copyright license to make changes and new works based on the software for any permitted purpose.
        
        ## Patent License
        
        The licensor grants you a patent license for the software that covers patent claims the licensor can license, or becomes able to license, that you would infringe by using the software.
        
        ## Noncommercial Purposes
        
        Any noncommercial purpose is a permitted purpose.
        
        ## Personal Uses
        
        Personal use for research, experiment, and testing for the benefit of public knowledge, personal study, private entertainment, hobby projects, amateur pursuits, or religious observance, without any anticipated commercial application, is use for a permitted purpose.
        
        ## Noncommercial Organizations
        
        Use by any charitable organization, educational institution, public research organization, public safety or health organization, environmental protection organization, or government institution is use for a permitted purpose regardless of the source of funding or obligations resulting from the funding.
        
        ## Fair Use
        
        You may have "fair use" rights for the software under the law. These terms do not limit them.
        
        ## No Other Rights
        
        These terms do not allow you to sublicense or transfer any of your licenses to anyone else, or prevent the licensor from granting licenses to anyone else.  These terms do not imply any other licenses.
        
        ## Patent Defense
        
        If you make any written claim that the software infringes or contributes to infringement of any patent, your patent license for the software granted under these terms ends immediately. If your company makes such a claim, your patent license ends immediately for work on behalf of your company.
        
        ## Violations
        
        The first time you are notified in writing that you have violated any of these terms, or done anything with the software not covered by your licenses, your licenses can nonetheless continue if you come into full compliance with these terms, and take practical steps to correct past violations, within 32 days of receiving notice.  Otherwise, all your licenses end immediately.
        
        ## No Liability
        
        ***As far as the law allows, the software comes as is, without any warranty or condition, and the licensor will not be liable to you for any damages arising out of these terms or the use or nature of the software, under any kind of legal claim.***
        
        ## Definitions
        
        The **licensor** is the individual or entity offering these terms, and the **software** is the software the licensor makes available under these terms.
        
        **You** refers to the individual or entity agreeing to these terms.
        
        **Your company** is any legal entity, sole proprietorship, or other kind of organization that you work for, plus all organizations that have control over, are under the control of, or are under common control with that organization.  **Control** means ownership of substantially all the assets of an entity, or the power to direct its management and policies by vote, contract, or otherwise.  Control can be direct or indirect.
        
        **Your licenses** are all the licenses granted to you for the software under these terms.
        
        **Use** means anything you do with the software requiring one of your licenses.
        
        ---
        
        Required Notice: Copyright (c) 2026 AALAM Studio (https://github.com/AALAM-Studio).
        
        Warraqa (ورّاقة) is published under the PolyForm Noncommercial License 1.0.0 for personal, research, educational, and other noncommercial purposes. Commercial use requires a separate license from AALAM Studio — please contact `contact@aalam.consulting`.
License-File: LICENSE
Keywords: document-conversion,docx,knowledge-base,markdown,marker,markitdown,ocr,pdf,pptx,pymupdf,rag
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.10
Requires-Dist: marker-pdf
Requires-Dist: markitdown[all]
Requires-Dist: pymupdf4llm
Requires-Dist: python-magic-bin; sys_platform == 'win32'
Requires-Dist: python-magic; sys_platform != 'win32'
Requires-Dist: pywin32; sys_platform == 'win32'
Requires-Dist: pyyaml
Requires-Dist: rich
Requires-Dist: watchdog
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

<div align="center">

# Warraqa (ورّاقة)

### The Document Scribe Agent

*Named after the **Warrāqūn** — the master scribes and paper-makers of the Islamic Golden Age.*

[![License: PolyForm NC 1.0.0](https://img.shields.io/badge/License-PolyForm%20NC%201.0.0-orange.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)
[![Version](https://img.shields.io/badge/version-1.0.0-magenta.svg)](CHANGELOG.md)
[![CI](https://github.com/AALAM-Studio/warraqa/actions/workflows/ci.yml/badge.svg)](https://github.com/AALAM-Studio/warraqa/actions/workflows/ci.yml)
[![AALAM Studio](https://img.shields.io/badge/from-AALAM%20Studio-black.svg)](https://github.com/AALAM-Studio)

**Warraqa converts PDF, Word, and PowerPoint documents into clean, accurate Markdown — and scores her own work.**

</div>

---

## Why Warraqa?

Most document-to-Markdown tools are one-trick ponies: great at clean PDFs, terrible at scans; great at `.docx`, blind to `.doc`; or they silently produce garbage and let you discover it three pipelines later.

Warraqa is a **specialist agent**. She picks the right engine for each file, falls back gracefully, scores her output from 0 to 100 with a letter grade, and tells you which conversions to trust. She's built to feed RAG pipelines, knowledge bases, and downstream agents, where Markdown quality directly determines retrieval quality.

## Features

- **Dual-engine architecture** — best specialized tool for each format
  - **Marker** (deep learning) for scanned PDFs: tables, equations, multi-column, OCR
  - **PyMuPDF4LLM** (fast, CPU-only) for native-text PDFs
  - **MarkItDown** (Microsoft) for `.docx` and `.pptx`
  - **MS Office COM** (Windows, with Office installed) auto-converts legacy `.doc` and `.ppt` to `.docx` and `.pptx` before MarkItDown runs
  - **Pandoc** as a fallback when MarkItDown fails on a `.docx`
- **Smart triage** — every PDF is pre-scanned to detect native vs. scanned content; routing is automatic
- **Two-phase batch processing** — fast files (native PDFs, Word, PowerPoint) run first; slow OCR work is deferred to a single trailing pass so you don't wait on Marker mid-batch
- **Quality scoring** — every conversion gets a 0–100 confidence score with an A–F grade across 5 dimensions (completeness, structure, encoding, density, readability)
- **Crash-resistant** — sanitizes invalid Unicode from upstream engines so a single bad PDF can't kill a 1000-file run
- **Folder workflow** — input → convert → output + move originals to `processed/` or `failed/`
- **Watch mode** — continuous monitoring for new files
- **Inter-agent API** — designed for other agents to call programmatically
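
The smart triage and two-phase batching described above can be sketched roughly as follows. This is a minimal illustration, not Warraqa's actual implementation: `looks_scanned`, `plan_batch`, the 50-character threshold, and the "more than half the pages" rule are all hypothetical. The per-page texts could come from any extractor, for example PyMuPDF's `page.get_text()`.

```python
from pathlib import Path

# Hypothetical threshold: the value Warraqa really uses is not documented here.
MIN_CHARS_PER_PAGE = 50


def looks_scanned(page_texts, min_chars=MIN_CHARS_PER_PAGE):
    """Treat a PDF as scanned when most pages carry almost no extractable text."""
    if not page_texts:
        return True  # nothing extractable at all, so OCR is the only option
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse / len(page_texts) > 0.5


def plan_batch(files, is_scanned):
    """Split files into a fast first phase and a trailing OCR phase.

    `is_scanned` is a callable taking a Path; only PDFs are ever deferred.
    """
    fast, slow = [], []
    for path in files:
        path = Path(path)
        if path.suffix.lower() == ".pdf" and is_scanned(path):
            slow.append(path)
        else:
            fast.append(path)
    return fast, slow
```

Keeping the scanned-or-native decision separate from the scheduling makes it possible to test the batch plan with a stub predicate, without opening any real PDFs.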

## Quick Start

### Option 1 — pip (recommended)

```bash
pip install warraqa
warraqa --folder "C:\path\to\documents"
```

You still need **Pandoc** on `PATH` for `.docx` fallback, and **MS Office** (Windows) for legacy `.doc`/`.ppt`. The Marker engine downloads its ML models on first use (~2–3 GB).
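
A quick preflight check for these optional dependencies might look like the sketch below. The function name `missing_optional_tools` is hypothetical and not part of Warraqa. It only checks the platform, so it cannot tell whether MS Office is actually installed on a Windows machine.

```python
import shutil
import sys


def missing_optional_tools(which=shutil.which, platform=sys.platform):
    """Return human-readable warnings for optional tools that look unavailable."""
    warnings = []
    if which("pandoc") is None:
        warnings.append("Pandoc not found on PATH: the .docx fallback is unavailable.")
    if platform != "win32":
        warnings.append("Not running on Windows: legacy .doc/.ppt need MS Office COM.")
    return warnings
```

Calling `missing_optional_tools()` with no arguments inspects the current machine; the two parameters exist only so the check can be exercised in tests.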

### Option 2 — Clone + bootstrap script

```bash
git clone https://github.com/AALAM-Studio/warraqa.git
cd warraqa
python bootstrap.py        # creates .venv, installs deps, auto-installs Pandoc on Windows
.venv\Scripts\activate     # Linux/macOS: source .venv/bin/activate
python run.py
```

### Option 3 — Docker (for cloud / headless use)

```bash
docker build -t warraqa .
docker run --rm -v "/path/to/docs:/data" warraqa --folder /data
```

Note: the Docker image is CPU-only and does **not** include MS Office, so legacy `.doc`/`.ppt` will be skipped with a clean error message.

## Usage

```bash
warraqa                              # Manual mode — opens a folder picker dialog
warraqa --folder "C:\path"           # Process a specific folder
warraqa --file path/to/document.pdf  # Convert a single file
warraqa --watch --folder "C:\path"   # Watch mode — continuously monitor
warraqa --folder "C:\path" --no-save --no-move    # Dry run
warraqa --help                       # All options
```

## Output Structure

```
output/
├── md_files/        # Converted Markdown files
├── processed/       # Successfully converted originals
├── failed/          # Failed conversion originals
├── reports/         # JSON reports with scores and metadata
├── scanned_pdfs/    # Staging area for OCR-bound PDFs (auto-cleaned per run)
└── warraqa.log
```

## Quality Scoring

Every conversion is scored across 5 weighted dimensions:

| Dimension | Weight | What It Measures |
|:--|:--|:--|
| Text Completeness | 30% | Word count vs. expected density for file size |
| Structure Integrity | 25% | Headings, lists, tables, formatting |
| Encoding Quality | 20% | Garbled text, mojibake, Unicode issues |
| Content Density | 15% | Meaningful text vs. noise |
| Readability | 10% | Line length, paragraph structure |

Grades: **A** (90–100) → **B** (75–89) → **C** (60–74) → **D** (40–59) → **F** (0–39).
Originals of files that score below 40 (grade F) are moved to `output/failed/` automatically.
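
As an illustration of how the weights and grade bands above combine, here is a minimal sketch. It assumes each dimension is itself scored from 0 to 100 and that the weighted total is rounded to the nearest integer; neither detail is stated above, and the names `WEIGHTS`, `overall_score`, and `grade_for` are hypothetical.

```python
# Weights in percent, taken from the table above.
WEIGHTS = {
    "completeness": 30,
    "structure": 25,
    "encoding": 20,
    "density": 15,
    "readability": 10,
}

# Lowest score for each letter grade, highest band first.
GRADE_FLOORS = [(90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F")]


def overall_score(dimension_scores):
    """Combine per-dimension scores (assumed 0-100 each) into one 0-100 score."""
    total = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return round(total / 100)


def grade_for(score):
    """Map a 0-100 score to its letter grade."""
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    raise ValueError(f"score must be non-negative, got {score}")
```

With dimension scores of 90, 80, 95, 85, and 70 in table order, the weighted total is 85.75, which this sketch rounds to 86, a B.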

## Inter-Agent API

```python
from warraqa import Warraqa

agent = Warraqa()

# Convert a single file
result = agent.convert_file("document.pdf")
print(result.confidence_score)    # 87
print(result.grade)               # "B"
print(result.markdown_content)    # "# Title\n\n..."
print(result.output_path)         # Path to saved .md file

# Process a folder
results = agent.process_folder("C:/Users/you/Academia")
for r in results:
    print(f"{r.source_file.filename}: {r.grade} ({r.confidence_score}/100)")
```
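
A calling agent will often want to separate conversions it can trust from ones that need review. The sketch below relies only on the `grade` attribute shown above; `split_by_trust` and the choice of A and B as trusted grades are assumptions, and the demo uses stand-in objects so it runs without Warraqa installed.

```python
from types import SimpleNamespace

# Assumption: which grades count as trustworthy is the caller's decision.
TRUSTED_GRADES = {"A", "B"}


def split_by_trust(results, trusted=TRUSTED_GRADES):
    """Partition results into (trusted, needs_review) by letter grade."""
    ok = [r for r in results if r.grade in trusted]
    review = [r for r in results if r.grade not in trusted]
    return ok, review


# Stand-ins for the result objects returned by process_folder().
demo = [
    SimpleNamespace(grade="A", confidence_score=93),
    SimpleNamespace(grade="C", confidence_score=64),
]
ok, review = split_by_trust(demo)
print(f"{len(ok)} trusted, {len(review)} need review")
```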

## Configuration

Edit [`config.yaml`](config.yaml) to customize:

- Default mode (manual / watch)
- Output directories
- Engine preferences (primary / fallback per format)
- Scoring thresholds
- Logging level

## Supported Formats

| Extension | Engine | Notes |
|:--|:--|:--|
| `.pdf` (native text) | PyMuPDF4LLM | Fast, CPU-only |
| `.pdf` (scanned) | Marker | Deferred to Phase 2 OCR pass |
| `.docx` | MarkItDown → Pandoc | Pandoc runs only as a fallback |
| `.doc` | MS Office COM → MarkItDown | Windows + Office required |
| `.pptx` | MarkItDown | — |
| `.ppt` | MS Office COM → MarkItDown | Windows + Office required |

## License

**Warraqa is published under the [PolyForm Noncommercial License 1.0.0](LICENSE)** — a source-available license that allows free use for:

- Personal projects, research, study, and experimentation
- Academic and educational institutions
- Charitable, public-safety, health, and government organizations
- Internal evaluation by any organization

**Commercial use** — including using Warraqa as part of a product or service offered to paying customers, internal business operations at a for-profit company, or any revenue-generating workflow — **requires a separate commercial license**. Contact **`contact@aalam.consulting`** to discuss licensing.

Note on terminology: PolyForm Noncommercial is *source-available*, not *open source* in the OSI sense (which by definition allows commercial use). The full text is in [`LICENSE`](LICENSE).

## Versioning Policy

This repository contains **Warraqa v1.0.0**, the inaugural public, source-available release. Future versions of Warraqa are developed privately and are available under commercial license terms. Critical bug fixes may be backported to v1.x at AALAM Studio's discretion.

See [`CHANGELOG.md`](CHANGELOG.md) for the release history.

## Citation

If Warraqa contributes to academic research, please cite it. A machine-readable [`CITATION.cff`](CITATION.cff) is provided, or use the GitHub "Cite this repository" button.

## Acknowledgements

Warraqa stands on the shoulders of excellent open-source projects:

- [Marker](https://github.com/VikParuchuri/marker) — Vik Paruchuri's deep-learning PDF parser
- [PyMuPDF4LLM](https://github.com/pymupdf/PyMuPDF) — Artifex's LLM-optimized PDF extraction
- [MarkItDown](https://github.com/microsoft/markitdown) — Microsoft's universal-to-markdown converter
- [Pandoc](https://pandoc.org/) — John MacFarlane's document conversion swiss-army knife
- [Rich](https://github.com/Textualize/rich) — Will McGugan's terminal beautifier

## Part of AALAM Studio

Warraqa is the first publicly released agent in the **AALAM Studio** ecosystem. Other agents access her output at a predictable path:

```python
WARRAQA_OUTPUT = "c:/projects/aalam-studio/warraqa/output/"
```
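
A downstream agent could pick up the converted files with something like the sketch below. It assumes the Markdown files sit directly inside the `md_files/` folder shown under Output Structure and are UTF-8 encoded; `iter_markdown` is a hypothetical helper name.

```python
from pathlib import Path


def iter_markdown(output_root):
    """Yield (path, text) for every Markdown file in <output_root>/md_files."""
    md_dir = Path(output_root) / "md_files"
    for path in sorted(md_dir.glob("*.md")):
        yield path, path.read_text(encoding="utf-8")
```

Passing `WARRAQA_OUTPUT` as `output_root` would iterate over everything converted so far, in filename order.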

---

<div align="center">

*She reads. She transcribes. She scores her own work.*

**Built with care by [AALAM Studio](https://github.com/AALAM-Studio).**

</div>
