Metadata-Version: 2.4
Name: warraqa
Version: 1.0.1
Summary: Warraqa (ورّاقة) — Document Scribe Agent. Converts PDFs and Word/PowerPoint files to clean Markdown with self-scoring.
Project-URL: Homepage, https://github.com/AALAM-Studio/warraqa
Project-URL: Repository, https://github.com/AALAM-Studio/warraqa
Project-URL: Issues, https://github.com/AALAM-Studio/warraqa/issues
Project-URL: Changelog, https://github.com/AALAM-Studio/warraqa/blob/main/CHANGELOG.md
Project-URL: Commercial License, https://www.aalam.consulting/
Author-email: AALAM Studio <contact@aalam.consulting>
Maintainer-email: AALAM Studio <contact@aalam.consulting>
License: # PolyForm Noncommercial License 1.0.0
        
        <https://polyformproject.org/licenses/noncommercial/1.0.0>
        
        ## Acceptance
        
        In order to get any license under these terms, you must agree to them as both strict obligations and conditions to all your licenses.
        
        ## Copyright License
        
        The licensor grants you a copyright license for the software to do everything you might do with the software that would otherwise infringe the licensor's copyright in it for any permitted purpose.  However, you may only distribute the software according to [Distribution License](#distribution-license) and make changes or new works based on the software according to [Changes and New Works License](#changes-and-new-works-license).
        
        ## Distribution License
        
        The licensor grants you an additional copyright license to distribute copies of the software.  Your license to distribute covers distributing the software with changes and new works permitted by [Changes and New Works License](#changes-and-new-works-license).
        
        ## Notices
        
        You must ensure that anyone who gets a copy of any part of the software from you also gets a copy of these terms or the URL for them above, as well as copies of any plain-text lines beginning with `Required Notice:` that the licensor provided with the software.  For example:
        
        > Required Notice: Copyright Yoyodyne, Inc. (http://example.com)
        
        ## Changes and New Works License
        
        The licensor grants you an additional copyright license to make changes and new works based on the software for any permitted purpose.
        
        ## Patent License
        
        The licensor grants you a patent license for the software that covers patent claims the licensor can license, or becomes able to license, that you would infringe by using the software.
        
        ## Noncommercial Purposes
        
        Any noncommercial purpose is a permitted purpose.
        
        ## Personal Uses
        
        Personal use for research, experiment, and testing for the benefit of public knowledge, personal study, private entertainment, hobby projects, amateur pursuits, or religious observance, without any anticipated commercial application, is use for a permitted purpose.
        
        ## Noncommercial Organizations
        
        Use by any charitable organization, educational institution, public research organization, public safety or health organization, environmental protection organization, or government institution is use for a permitted purpose regardless of the source of funding or obligations resulting from the funding.
        
        ## Fair Use
        
        You may have "fair use" rights for the software under the law. These terms do not limit them.
        
        ## No Other Rights
        
        These terms do not allow you to sublicense or transfer any of your licenses to anyone else, or prevent the licensor from granting licenses to anyone else.  These terms do not imply any other licenses.
        
        ## Patent Defense
        
        If you make any written claim that the software infringes or contributes to infringement of any patent, your patent license for the software granted under these terms ends immediately. If your company makes such a claim, your patent license ends immediately for work on behalf of your company.
        
        ## Violations
        
        The first time you are notified in writing that you have violated any of these terms, or done anything with the software not covered by your licenses, your licenses can nonetheless continue if you come into full compliance with these terms, and take practical steps to correct past violations, within 32 days of receiving notice.  Otherwise, all your licenses end immediately.
        
        ## No Liability
        
        ***As far as the law allows, the software comes as is, without any warranty or condition, and the licensor will not be liable to you for any damages arising out of these terms or the use or nature of the software, under any kind of legal claim.***
        
        ## Definitions
        
        The **licensor** is the individual or entity offering these terms, and the **software** is the software the licensor makes available under these terms.
        
        **You** refers to the individual or entity agreeing to these terms.
        
        **Your company** is any legal entity, sole proprietorship, or other kind of organization that you work for, plus all organizations that have control over, are under the control of, or are under common control with that organization.  **Control** means ownership of substantially all the assets of an entity, or the power to direct its management and policies by vote, contract, or otherwise.  Control can be direct or indirect.
        
        **Your licenses** are all the licenses granted to you for the software under these terms.
        
        **Use** means anything you do with the software requiring one of your licenses.
        
        ---
        
        Required Notice: Copyright (c) 2026 AALAM Studio (https://github.com/AALAM-Studio).
        
        Warraqa (ورّاقة) is published under the PolyForm Noncommercial License 1.0.0 for personal, research, educational, and other noncommercial purposes. Commercial use requires a separate license from AALAM Studio — please contact `contact@aalam.consulting`.
License-File: LICENSE
Keywords: document-conversion,docx,knowledge-base,markdown,marker,markitdown,ocr,pdf,pptx,pymupdf,rag
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: <3.14,>=3.10
Requires-Dist: marker-pdf
Requires-Dist: markitdown[all]
Requires-Dist: pymupdf4llm
Requires-Dist: python-magic-bin; sys_platform == 'win32'
Requires-Dist: python-magic; sys_platform != 'win32'
Requires-Dist: pywin32; sys_platform == 'win32'
Requires-Dist: pyyaml
Requires-Dist: rich
Requires-Dist: watchdog
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

<div align="center">

# Warraqa (ورّاقة)

### The Document Scribe Agent

*Named after the **Warrāqūn** — the master scribes and paper-makers of the Islamic Golden Age.*

[![License: PolyForm NC 1.0.0](https://img.shields.io/badge/License-PolyForm%20NC%201.0.0-orange.svg)](LICENSE)
[![Python 3.10–3.13](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue.svg)](https://www.python.org/)
[![Version](https://img.shields.io/badge/version-1.0.1-magenta.svg)](CHANGELOG.md)
[![CI](https://github.com/AALAM-Studio/warraqa/actions/workflows/ci.yml/badge.svg)](https://github.com/AALAM-Studio/warraqa/actions/workflows/ci.yml)
[![AALAM Studio](https://img.shields.io/badge/from-AALAM%20Studio-black.svg)](https://github.com/AALAM-Studio)

**Warraqa converts PDF, Word, and PowerPoint documents into clean, accurate Markdown — and scores her own work.**

</div>

---

## Why Warraqa?

Most document-to-Markdown tools are one-trick ponies: great at clean PDFs, terrible at scans; great at `.docx`, blind to `.doc`; or they silently produce garbage and let you discover it three pipelines later.

Warraqa is a **specialist agent**. She picks the right engine for each file, falls back gracefully, scores her output from 0–100 with letter grades, and tells you which conversions to trust. She's built to feed RAG pipelines, knowledge bases, and downstream agents — where Markdown quality directly determines retrieval quality.

## Features

- **Dual-engine architecture** — best specialized tool for each format
  - **Marker** (deep learning) for scanned PDFs: tables, equations, multi-column, OCR
  - **PyMuPDF4LLM** (fast, CPU-only) for native-text PDFs
  - **MarkItDown** (Microsoft) for `.docx` and `.pptx`
  - **MS Office COM** auto-converts legacy `.doc` and `.ppt` to modern formats first
  - Pandoc fallback for `.docx` resilience
- **Smart triage** — every PDF is pre-scanned to detect native vs. scanned content; routing is automatic
- **Two-phase batch processing** — fast files (native PDFs, Word, PowerPoint) run first; slow OCR work is deferred to a single trailing pass so you don't wait on Marker mid-batch
- **Quality scoring** — every conversion gets a 0–100 confidence score with an A–F grade across 5 dimensions (completeness, structure, encoding, density, readability)
- **Crash-resistant** — sanitizes invalid Unicode from upstream engines so a single bad PDF can't kill a 1000-file run
- **Folder workflow** — input → convert → output + move originals to `processed/` or `failed/`
- **Watch mode** — continuous monitoring for new files
- **Inter-agent API** — designed for other agents to call programmatically
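
The two-phase ordering and the Unicode sanitizing listed above can be illustrated with a short sketch. It is not Warraqa's actual code: `plan_batch`, `sanitize_text`, and the `is_scanned` callback are hypothetical names, and the real triage logic may differ.

```python
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical helpers for illustration; Warraqa's internals may differ.


def plan_batch(
    files: Iterable[Path],
    is_scanned: Callable[[Path], bool],
) -> tuple[list[Path], list[Path]]:
    """Split a batch into a fast first pass and a deferred OCR pass."""
    fast: list[Path] = []
    deferred: list[Path] = []
    for path in files:
        # Only PDFs can be scanned; Word and PowerPoint files stay in the fast pass.
        if path.suffix.lower() == ".pdf" and is_scanned(path):
            deferred.append(path)
        else:
            fast.append(path)
    return fast, deferred


def sanitize_text(text: str) -> str:
    """Replace characters that cannot be encoded as UTF-8, such as lone surrogates."""
    return text.encode("utf-8", errors="replace").decode("utf-8")
```

Running the fast list to completion before touching the deferred list gives the "single trailing pass" behaviour, and passing engine output through `sanitize_text` before writing it keeps one malformed string from raising `UnicodeEncodeError` mid-run.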

---

## Quick Start

### Step 1 — Install Python 3.11 or 3.12

> **Important:** Warraqa requires Python **3.10 to 3.13**; the steps below use 3.12. Python 3.14 is not yet supported by some upstream dependencies (Pillow, regex), so installation fails on it.

**Windows:**

1. Go to [python.org/downloads](https://www.python.org/downloads/) and download the **Python 3.12** installer (look for "Python 3.12.x" under Stable Releases).
2. Run the installer.
3. **On the very first screen, check the box that says "Add python.exe to PATH"** — this is the most commonly missed step.
4. Click "Install Now".

**Verify** — open a new PowerShell window and run:

```powershell
python --version
```

You should see `Python 3.12.x`. If you see an error, close and reopen PowerShell and try again.

---

### Step 2 — Install Pandoc (for Word document fallback)

**Windows (recommended):**

```powershell
winget install --id JohnMacFarlane.Pandoc -e
```

**Alternative:** download the `.msi` installer from [pandoc.org/installing.html](https://pandoc.org/installing.html).

**Verify:**

```powershell
pandoc --version
```

---

### Step 3 — Install Warraqa

Open PowerShell (**not** the Python prompt — if you see `>>>`, type `exit()` first) and run:

```powershell
pip install warraqa
```

This downloads Warraqa and all its dependencies (~300–400 MB including PyTorch).

**Verify:**

```powershell
warraqa --help
```

---

### Step 4 — Run

```powershell
warraqa --folder "C:\path\to\your\documents"
```

> The **first time you convert a scanned PDF**, Marker downloads its deep-learning models (~2–3 GB). This is a one-time download; subsequent runs use the cached models.

---

### Option B — Clone + bootstrap script

```bash
git clone https://github.com/AALAM-Studio/warraqa.git
cd warraqa
python bootstrap.py        # creates .venv, installs deps, auto-installs Pandoc on Windows
.venv\Scripts\activate     # Linux/macOS: source .venv/bin/activate
python run.py
```

### Option C — Docker (for cloud / headless use)

```bash
docker build -t warraqa .
docker run --rm -v "/path/to/docs:/data" warraqa --folder /data
```

The Docker image is CPU-only and does **not** include MS Office, so legacy `.doc`/`.ppt` files are skipped with a clear error message.

---

## Usage

```bash
warraqa                              # Manual mode — opens a folder picker dialog
warraqa --folder "C:\path"           # Process a specific folder
warraqa --file path/to/document.pdf  # Convert a single file
warraqa --watch --folder "C:\path"   # Watch mode — continuously monitor
warraqa --folder "C:\path" --no-save --no-move    # Dry run: nothing saved, originals not moved
warraqa --help                       # All options
```

## Output Structure

```
output/
├── md_files/        # Converted Markdown files
├── processed/       # Successfully converted originals
├── failed/          # Failed conversion originals
├── reports/         # JSON reports with scores and metadata
├── scanned_pdfs/    # Staging area for OCR-bound PDFs (auto-cleaned per run)
└── warraqa.log
```
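
For a quick health check after a run, a script can count the files in each subfolder of the tree above. This is a minimal sketch: the folder names come from the tree, while `summarize_output` is a hypothetical helper, not part of Warraqa.

```python
from pathlib import Path

# Folder names are taken from the tree above; the helper name is hypothetical.
SUBFOLDERS = ("md_files", "processed", "failed", "reports")


def summarize_output(output_dir: Path) -> dict[str, int]:
    """Count regular files in each known subfolder; a missing folder counts as 0."""
    root = Path(output_dir)
    counts: dict[str, int] = {}
    for name in SUBFOLDERS:
        folder = root / name
        if folder.is_dir():
            counts[name] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            counts[name] = 0
    return counts
```

A non-zero `failed` count is the signal to open the matching reports in `reports/`.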

## Quality Scoring

Every conversion is scored across 5 weighted dimensions:

| Dimension | Weight | What It Measures |
|:--|:--|:--|
| Text Completeness | 30% | Word count vs. expected density for file size |
| Structure Integrity | 25% | Headings, lists, tables, formatting |
| Encoding Quality | 20% | Garbled text, mojibake, Unicode issues |
| Content Density | 15% | Meaningful text vs. noise |
| Readability | 10% | Line length, paragraph structure |

Grades: **A** (90–100) → **B** (75–89) → **C** (60–74) → **D** (40–59) → **F** (0–39).
Files scoring below 40 are moved to `output/failed/` automatically.
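
The weighting and grade bands above amount to a weighted average followed by a threshold lookup. The sketch below uses the weights and bands from this section; how each dimension is actually measured is not covered, the function names are hypothetical, and rounding to the nearest integer is an assumption.

```python
# Weights (in percent) and grade bands come from the table and grade line above.
# Per-dimension scores are plain 0-100 inputs; measuring them is out of scope here.
WEIGHTS = {
    "completeness": 30,
    "structure": 25,
    "encoding": 20,
    "density": 15,
    "readability": 10,
}


def overall_score(dimension_scores: dict[str, int]) -> int:
    """Combine five 0-100 dimension scores into one weighted 0-100 score."""
    total = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return round(total / 100)  # rounding rule is an assumption


def grade(score: int) -> str:
    """Map a 0-100 score to the letter bands A/B/C/D/F."""
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 60:
        return "C"
    if score >= 40:
        return "D"
    return "F"
```

For example, dimension scores of 90, 80, 100, 80, and 60 (in table order) give (2700 + 2000 + 2000 + 1200 + 600) / 100 = 85, which is a **B**.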

## Inter-Agent API

```python
from warraqa import Warraqa

agent = Warraqa()

# Convert a single file
result = agent.convert_file("document.pdf")
print(result.confidence_score)    # 87
print(result.grade)               # "B"
print(result.markdown_content)    # "# Title\n\n..."
print(result.output_path)         # Path to saved .md file

# Process a folder
results = agent.process_folder("C:/Users/you/Academia")
for r in results:
    print(f"{r.source_file.filename}: {r.grade} ({r.confidence_score}/100)")
```
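
A calling agent will usually want to separate conversions it can trust from those that need review. The sketch below relies only on the `confidence_score` attribute shown in the example above; `split_by_trust` and the default threshold of 75 (the lower bound of grade B) are assumptions, not part of the Warraqa API.

```python
from typing import Any, Iterable


def split_by_trust(
    results: Iterable[Any],
    min_score: int = 75,  # assumed threshold: lower bound of grade B
) -> tuple[list[Any], list[Any]]:
    """Partition result objects into (trusted, needs_review) by confidence_score."""
    trusted: list[Any] = []
    needs_review: list[Any] = []
    for result in results:
        if result.confidence_score >= min_score:
            trusted.append(result)
        else:
            needs_review.append(result)
    return trusted, needs_review
```

With the API above, this could be called as `trusted, needs_review = split_by_trust(agent.process_folder(...))`.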

## Configuration

Edit [`config.yaml`](config.yaml) to customize:

- Default mode (manual / watch)
- Output directories
- Engine preferences (primary / fallback per format)
- Scoring thresholds
- Logging level

## Supported Formats

| Extension | Engine | Notes |
|:--|:--|:--|
| `.pdf` (native text) | PyMuPDF4LLM | Fast, CPU-only |
| `.pdf` (scanned) | Marker | Deferred to Phase 2 OCR pass |
| `.docx` | MarkItDown → Pandoc | — |
| `.doc` | MS Office COM → MarkItDown | Windows + Office required |
| `.pptx` | MarkItDown | — |
| `.ppt` | MS Office COM → MarkItDown | Windows + Office required |

---

## Troubleshooting

### `pip install warraqa` fails with "Failed building wheel for Pillow" or "Microsoft Visual C++ required"

**Cause:** You are running Python 3.14. Warraqa's OCR engine requires Pillow 10.x, which has no pre-built Windows package for Python 3.14.

**Fix:** Install **Python 3.12** from [python.org/downloads](https://www.python.org/downloads/). You can have multiple Python versions installed. Then run:

```powershell
py -3.12 -m pip install warraqa
```
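
To confirm that a given interpreter is inside the supported range before installing, a small check against the 3.10 to 3.13 window is enough. This is a sketch; `is_supported_python` is a hypothetical helper, not something Warraqa ships.

```python
import sys


def is_supported_python(version=None) -> bool:
    """Return True if the version falls within Python 3.10 to 3.13 inclusive."""
    # Defaults to the running interpreter; accepts any (major, minor, ...) tuple.
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and 10 <= minor <= 13
```

Calling `is_supported_python()` with no argument checks the interpreter that runs the script, so run it with the same `python` or `py -3.12` command you plan to use for `pip`.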

---

### `pip install warraqa` gives `SyntaxError: invalid syntax`

**Cause:** You typed `pip install warraqa` inside the Python REPL (the `>>>` prompt). `pip` is a terminal command, not a Python command.

**Fix:** Type `exit()` to leave Python, then run `pip install warraqa` in PowerShell.

---

### `warraqa` is not recognized after install

**Cause:** Either the install failed, or Python's `Scripts` folder is not on your PATH.

**Check if installed:**

```powershell
pip show warraqa
# If it shows version info, the scripts folder isn't on PATH — run via:
python -m warraqa --help
```
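
To find out which folder needs to be on PATH, Python can report where it places console scripts. The sketch below uses only the standard library; `locate_warraqa` is a hypothetical helper, and the reported folder is the default install scheme of the interpreter that runs it (a `pip install --user` install may use a different folder).

```python
import shutil
import sysconfig


def locate_warraqa() -> str:
    """Report where the warraqa command is, or where pip would have put it."""
    found = shutil.which("warraqa")
    if found:
        return f"warraqa found at {found}"
    # Default scripts folder for this interpreter's install scheme.
    scripts_dir = sysconfig.get_path("scripts")
    return f"warraqa is not on PATH; console scripts are installed to {scripts_dir}"


print(locate_warraqa())
```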

**Permanent PATH fix:** search Windows for "Edit the system environment variables" → Environment Variables → User `Path` → add `C:\Users\<YourName>\AppData\Local\Programs\Python\Python312\Scripts`.

---

### `pandoc` is not recognized

Reinstall via `winget install --id JohnMacFarlane.Pandoc -e` and open a new PowerShell window. Warraqa still converts `.docx` without Pandoc — it just loses the Pandoc fallback if MarkItDown fails.

---

### First scanned-PDF conversion is very slow

Normal — Marker downloads ~2–3 GB of model weights on first use. After that, models are cached.

---

### Legacy `.doc` / `.ppt` files are skipped

These require Microsoft Word / PowerPoint (Windows only). If you see a "COM not available" warning, install Microsoft Office or convert the files to `.docx`/`.pptx` format first.

---

## License

**Warraqa is published under the [PolyForm Noncommercial License 1.0.0](LICENSE)** — a source-available license that allows free use for:

- Personal projects, research, study, and experimentation
- Academic and educational institutions
- Charitable, public-safety, health, and government organizations
- Internal evaluation by any organization

**Commercial use** — including using Warraqa as part of a product or service offered to paying customers, internal business operations at a for-profit company, or any revenue-generating workflow — **requires a separate commercial license**. Contact **`contact@aalam.consulting`** to discuss licensing.

Note on terminology: PolyForm Noncommercial is *source-available*, not *open source* in the OSI sense (which by definition allows commercial use). The full text is in [`LICENSE`](LICENSE).

## Versioning Policy

This repository contains the public `1.x` line of Warraqa. Future major versions are developed privately and available under commercial license terms. Critical bug fixes are backported to `1.x` at AALAM Studio's discretion.

See [`CHANGELOG.md`](CHANGELOG.md) for the release history.

## Citation

If Warraqa contributes to academic research, please cite it. A machine-readable [`CITATION.cff`](CITATION.cff) is provided, or use the GitHub "Cite this repository" button.

## Acknowledgements

Warraqa stands on the shoulders of excellent open-source projects:

- [Marker](https://github.com/VikParuchuri/marker) — Vik Paruchuri's deep-learning PDF parser
- [PyMuPDF4LLM](https://github.com/pymupdf/PyMuPDF) — Artifex's LLM-optimized PDF extraction
- [MarkItDown](https://github.com/microsoft/markitdown) — Microsoft's universal-to-markdown converter
- [Pandoc](https://pandoc.org/) — John MacFarlane's document conversion swiss-army knife
- [Rich](https://github.com/Textualize/rich) — Will McGugan's terminal beautifier

## Part of AALAM Studio

Warraqa is the first publicly released agent in the **AALAM Studio** ecosystem. Other agents access her output at a predictable path:

```python
WARRAQA_OUTPUT = "c:/projects/aalam-studio/warraqa/output/"
```
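
A downstream agent could read the converted documents from that path along these lines. This is a sketch under assumptions: it expects `.md` files directly inside the `md_files/` subfolder shown in the output structure, encoded as UTF-8, and `load_markdown` is a hypothetical helper.

```python
from pathlib import Path


def load_markdown(output_dir: Path) -> dict[str, str]:
    """Map each converted document's stem to its Markdown text."""
    md_dir = Path(output_dir) / "md_files"
    # Assumes .md files sit directly in md_files/ and are UTF-8 encoded.
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(md_dir.glob("*.md"))
    }
```

If `md_files/` does not exist yet, the glob yields nothing and the function returns an empty dict.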

---

<div align="center">

*She reads. She transcribes. She scores her own work.*

**Built with care by [AALAM Studio](https://github.com/AALAM-Studio).**

</div>
