Metadata-Version: 2.4
Name: papercutter
Version: 2.0.1
Summary: Automated evidence synthesis pipeline for systematic literature reviews
Project-URL: Homepage, https://github.com/rawatpranjal/papercutter
Project-URL: Repository, https://github.com/rawatpranjal/papercutter
Author: Pranjal Rawat
License-Expression: MIT
License-File: LICENSE
Keywords: docling,evidence-synthesis,llm,pdf,research,systematic-review
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Requires-Dist: arxiv>=2.1.0
Requires-Dist: bibtexparser>=1.4.0
Requires-Dist: certifi>=2023.0.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: pdfplumber>=0.11.0
Requires-Dist: pydantic-settings>=2.1.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pypdf>=4.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer[all]>=0.12.0
Provides-Extra: all
Requires-Dist: docling>=2.0.0; extra == 'all'
Requires-Dist: instructor>=1.0.0; extra == 'all'
Requires-Dist: jinja2>=3.0.0; extra == 'all'
Requires-Dist: litellm>=1.30.0; extra == 'all'
Requires-Dist: mypy>=1.8.0; extra == 'all'
Requires-Dist: pillow>=10.0.0; extra == 'all'
Requires-Dist: pymupdf>=1.24.0; extra == 'all'
Requires-Dist: pytest-cov>=4.0.0; extra == 'all'
Requires-Dist: pytest-timeout>=2.0.0; extra == 'all'
Requires-Dist: pytest>=8.0.0; extra == 'all'
Requires-Dist: ruff>=0.3.0; extra == 'all'
Requires-Dist: thefuzz>=0.22.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.8.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.3.0; extra == 'dev'
Provides-Extra: docling
Requires-Dist: docling>=2.0.0; extra == 'docling'
Provides-Extra: equations
Requires-Dist: pillow>=10.0.0; extra == 'equations'
Requires-Dist: pymupdf>=1.24.0; extra == 'equations'
Provides-Extra: equations-nougat
Requires-Dist: pillow>=10.0.0; extra == 'equations-nougat'
Requires-Dist: pymupdf>=1.24.0; extra == 'equations-nougat'
Requires-Dist: torch>=2.0.0; extra == 'equations-nougat'
Requires-Dist: transformers>=4.30.0; extra == 'equations-nougat'
Provides-Extra: equations-pix2tex
Requires-Dist: pillow>=10.0.0; extra == 'equations-pix2tex'
Requires-Dist: pix2tex>=0.1.0; extra == 'equations-pix2tex'
Requires-Dist: pymupdf>=1.24.0; extra == 'equations-pix2tex'
Provides-Extra: factory
Requires-Dist: docling>=2.0.0; extra == 'factory'
Requires-Dist: instructor>=1.0.0; extra == 'factory'
Requires-Dist: jinja2>=3.0.0; extra == 'factory'
Requires-Dist: litellm>=1.30.0; extra == 'factory'
Requires-Dist: thefuzz>=0.22.0; extra == 'factory'
Provides-Extra: fast
Requires-Dist: pymupdf>=1.24.0; extra == 'fast'
Provides-Extra: llm
Requires-Dist: instructor>=1.0.0; extra == 'llm'
Requires-Dist: litellm>=1.30.0; extra == 'llm'
Description-Content-Type: text/markdown

# Papercutter Factory

### Automated Evidence Synthesis Pipeline for Research

**Papercutter Factory** is a local, batch-processing pipeline designed to transform unstructured academic PDF collections into structured datasets and systematic review reports.

It addresses the tooling gap between reference managers (Zotero, Mendeley) and analysis software (R, Stata). Unlike generic "Chat with PDF" tools, Papercutter is built for **extraction reliability, reproducibility, and scale**. It uses **Docling** to convert PDFs into structured Markdown and JSON before applying LLM-based extraction, so that tabular data and complex layouts are preserved.

---

## Key Capabilities

*   **Pipeline Architecture:** A resumable workflow. Processing status is tracked per file, so large batches can be paused and resumed without data loss.
*   **High-Fidelity Digitization:** Uses IBM's **Docling** to convert PDFs into structured Markdown, preserving table geometry and section hierarchy more faithfully than plain text extraction.
*   **Intelligent Splitting:** Automatically detects large volumes (e.g., handbooks, dissertations) and splits them into chapter-level units for granular analysis.
*   **Schema Validation (Pilot Mode):** A "Pilot Protocol" tests the extraction schema on a random sample and attaches a source quote to every extracted data point, so accuracy can be verified before the full library is processed.
*   **Bibliographic Linking:** Fuzzy-matches PDF contents to existing BibTeX records to ensure metadata consistency.
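
The fuzzy matching behind Bibliographic Linking can be illustrated with a minimal sketch. It is not Papercutter's actual matcher: it uses `difflib` from the standard library to stay dependency-free, and the function names, the 0.85 threshold, and the sample titles are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def best_bib_match(pdf_title: str, bib_titles: dict[str, str], threshold: float = 0.85):
    """Return (citation_key, score) for the closest BibTeX title, or None.

    The 0.85 cutoff is an illustrative assumption, not a documented default.
    """
    target = normalize(pdf_title)
    best_key, best_score = None, 0.0
    for key, title in bib_titles.items():
        score = SequenceMatcher(None, target, normalize(title)).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_key is not None and best_score >= threshold:
        return best_key, best_score
    return None


# Hypothetical BibTeX entries (placeholder titles, not real references).
bib = {
    "smith2020": "Minimum Wage Effects on Retail Employment: Evidence from Border Counties",
    "doe2018": "School Attendance Laws and Later Earnings",
}
match = best_bib_match(
    "Minimum wage effects on retail employment - evidence from border counties", bib
)
```

Normalizing before comparison means that differences in case and punctuation do not lower the score, while the threshold keeps unrelated titles from being linked.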

---

## Installation

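Papercutter requires Python 3.10+. The base package installs the CLI and core PDF tooling. The full pipeline described here relies on Docling and PyTorch for document layout analysis and on an LLM client for extraction; these dependencies are provided by the `factory` extra.

```bash
# Base install (CLI and core PDF tooling)
pip install papercutter

# Full pipeline: Docling conversion, LLM extraction, fuzzy BibTeX matching
pip install "papercutter[factory]"
```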

**System Requirements:**
*   **Hardware:** A GPU is recommended for optimal OCR and layout analysis speed, though the system functions on CPU.
*   **API Access:** Requires an active API key for OpenAI (`export OPENAI_API_KEY=...`) or Anthropic.
*   **Optional:** Tesseract OCR (for legacy scanned documents).

---

## Workflow Overview

The system operates in four distinct phases to ensure data integrity.

### 1. Ingest (Digitization)
Initializes the project structure and converts raw PDFs into a unified internal format.

```bash
# Initialize a new review project
papercutter init my_project

# Process PDFs and link to metadata
cd my_project
papercutter ingest ./raw_pdfs/ --bib references.bib
```

*   **Process:** Scans directories, identifies duplicates via SHA256, splits large volumes, and runs Docling conversion.
*   **Metadata:** If a BibTeX file is provided, PDFs are linked to citations via fuzzy title matching.
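
Duplicate detection via SHA256 can be sketched as follows. This is an illustration of content hashing in general, not Papercutter's internal code; `sha256_of_file` and `find_duplicates` are hypothetical names, and the demo writes throwaway files to a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are never loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(pdf_dir: Path) -> dict[str, list[Path]]:
    """Group PDFs by content hash and keep only groups with more than one file."""
    groups: dict[str, list[Path]] = {}
    for path in sorted(pdf_dir.rglob("*.pdf")):
        groups.setdefault(sha256_of_file(path), []).append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


# Demo with placeholder byte content standing in for real PDFs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.pdf").write_bytes(b"same content")
    (root / "b.pdf").write_bytes(b"same content")
    (root / "c.pdf").write_bytes(b"different content")
    duplicate_names = [
        [p.name for p in paths] for paths in find_duplicates(root).values()
    ]
```

Hashing file contents rather than comparing filenames catches the same paper saved under two different names, although it will not catch two distinct PDF renderings of the same article.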

### 2. Configure (Schema Definition)
Defines the variables to be extracted from the literature.

```bash
papercutter configure
```

*   **Process:** The system analyzes abstracts from the ingested library and proposes a draft schema. The user finalizes the schema in `config.yaml`, which enforces strict types on the extracted data.

**Example `config.yaml`:**
```yaml
columns:
  - key: sample_size
    description: "The total number of observations (N). Exclude year ranges."
    type: integer
  - key: estimation_method
    description: "The primary statistical strategy (e.g. DiD, RDD, OLS)."
    type: string
  - key: treatment_effect
    description: "The extracted coefficient for the main treatment."
    type: float
```
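
To show what enforcing strict types might look like in practice, here is a minimal sketch that casts raw extracted strings to the types declared in the example schema. The `CASTERS` mapping and `coerce_row` function are assumptions made for illustration; Papercutter's own validation may work differently.

```python
from typing import Any

# Mirrors the example config.yaml, written as a Python literal so the
# sketch needs no YAML parser.
COLUMNS = [
    {"key": "sample_size", "type": "integer"},
    {"key": "estimation_method", "type": "string"},
    {"key": "treatment_effect", "type": "float"},
]

# Assumed mapping from schema type names to Python constructors.
CASTERS = {"integer": int, "string": str, "float": float}


def coerce_row(raw: dict[str, Any], columns: list[dict[str, str]]) -> dict[str, Any]:
    """Cast each raw value to its declared type; store None when the cast fails."""
    row: dict[str, Any] = {}
    for column in columns:
        key = column["key"]
        value = raw.get(key)
        try:
            row[key] = None if value is None else CASTERS[column["type"]](value)
        except (TypeError, ValueError):
            row[key] = None  # leave a gap rather than keep a malformed value
    return row


clean = coerce_row(
    {"sample_size": "1204", "estimation_method": "DiD", "treatment_effect": "n/a"},
    COLUMNS,
)
# clean == {"sample_size": 1204, "estimation_method": "DiD", "treatment_effect": None}
```

Recording `None` for values that fail the cast keeps the resulting dataset uniformly typed, which matters when it is later loaded into statistical software.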

### 3. Grind (Extraction Loop)
Executes the LLM-based extraction and summarization.

```bash
# Step A: Pilot Run (Validation)
papercutter grind --pilot
```
*   Processes a random 5-paper sample.
*   Generates a **Traceability Report** (`pilot_trace.csv`) containing the extracted value alongside the *exact quote* from the text used to derive it. This allows researchers to audit LLM performance.
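
Part of the audit can be automated by checking that each quote really occurs in the source text. The sketch below is one possible check under stated assumptions: the row layout (`key`, `value`, `quote`) and the helper names are hypothetical, not the report's documented format.

```python
import re


def _squash(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_supported(quote: str, source_text: str) -> bool:
    """True when a non-empty quote appears in the source text."""
    return bool(quote.strip()) and _squash(quote) in _squash(source_text)


# Hypothetical source passage and extracted rows.
source = "We estimate the model on a sample of\n1,204 firms observed in 2015."
rows = [
    {"key": "sample_size", "value": 1204, "quote": "a sample of 1,204 firms"},
    {"key": "sample_size", "value": 1500, "quote": "a sample of 1,500 firms"},
]
flags = [quote_is_supported(row["quote"], source) for row in rows]
# flags == [True, False]
```

A quote that cannot be found in the source is a strong signal that the extracted value was hallucinated and should be reviewed by hand.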

```bash
# Step B: Full Execution
papercutter grind --full
```
*   Processes the remaining library. This step is idempotent; already processed papers are skipped.
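
The skip-if-done behaviour can be sketched with a small per-paper status file. This is an illustration of the idea only: the JSON layout, the `run_batch` function, and the `extract` callback are assumptions, not Papercutter's internal state format.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_batch(
    paper_ids: list[str], state_path: Path, extract: Callable[[str], None]
) -> list[str]:
    """Process only papers not yet marked done; return the IDs handled in this run."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    processed = []
    for paper_id in paper_ids:
        if state.get(paper_id) == "done":
            continue  # finished in an earlier run
        extract(paper_id)
        state[paper_id] = "done"
        # Persist after every paper so an interruption loses at most one item.
        state_path.write_text(json.dumps(state, indent=2))
        processed.append(paper_id)
    return processed


# Demo: the second run skips the papers completed by the first.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "state.json"
    first = run_batch(["p1", "p2"], state_file, extract=lambda pid: None)
    second = run_batch(["p1", "p2", "p3"], state_file, extract=lambda pid: None)
# first == ["p1", "p2"], second == ["p3"]
```

Writing the state file after each paper, rather than once at the end, is what makes a paused or crashed batch safe to resume.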

### 4. Report (Synthesis)
Compiles final artifacts for analysis and reading.

```bash
papercutter report
```

*   **Outputs:**
    *   `matrix.csv`: A flattened dataset of all extracted variables, ready for import into R/Stata/Pandas.
    *   `systematic_review.pdf`: A compiled LaTeX document containing:
        *   **Structured Summaries:** One-page standardized syntheses of every paper.
        *   **Contribution Grid:** A consolidated appendix layout for rapid comparison.

---

## Project Structure

Papercutter enforces a standardized directory structure to manage state.

```text
my_project/
├── input/                  # Raw PDF repository
├── config.yaml             # Extraction schema definition
├── .papercutter/           # Internal state (Markdown cache, Inventory)
└── output/
    ├── matrix.csv          # Final dataset for analysis
    ├── systematic_review.pdf
    └── pilot_trace.csv     # Audit trail for verification
```

---

## Common Use Cases

**Meta-Regression Analysis**
> *Goal:* Extract specific regression coefficients and standard errors from 50+ empirical papers.
> *Workflow:* Define `coefficient`, `standard_error`, and `model_specification` in the schema. Use the Pilot Mode to ensure the LLM distinguishes between "Main Results" and "Robustness Checks."
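
As an illustration of downstream use, extracted coefficients and standard errors can be combined with inverse-variance weights (a fixed-effect pooled estimate). The sketch assumes rows keyed by `coefficient` and `standard_error`, matching the schema fields named above; the numbers are hypothetical.

```python
import math


def pooled_fixed_effect(rows: list[dict[str, float]]) -> tuple[float, float]:
    """Return the inverse-variance weighted mean and its standard error."""
    weights = [1.0 / (row["standard_error"] ** 2) for row in rows]
    total = sum(weights)
    estimate = sum(w * row["coefficient"] for w, row in zip(weights, rows)) / total
    return estimate, math.sqrt(1.0 / total)


# Hypothetical extracted values from two papers.
rows = [
    {"coefficient": 0.10, "standard_error": 0.05},
    {"coefficient": 0.20, "standard_error": 0.05},
]
estimate, se = pooled_fixed_effect(rows)
```

With equal standard errors the pooled estimate is the simple average of the two coefficients; more precise studies would otherwise receive proportionally more weight.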

**Large Volume Processing**
> *Goal:* Analyze a Handbook or multi-chapter Report.
> *Workflow:* The Ingest phase detects the volume size. The Splitter module separates chapters into individual units. The Report phase generates a "Flashcard" style appendix for rapid review.
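
One simple way to separate chapters, assuming the converted Markdown marks each chapter with a level-1 heading, is sketched below. How the Splitter module actually detects chapter boundaries is not documented here, so `split_chapters` is a hypothetical stand-in.

```python
import re


def split_chapters(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (title, body) pairs at level-1 headings."""
    chapters: list[tuple[str, str]] = []
    title, lines = None, []
    for line in markdown.splitlines():
        heading = re.match(r"^# (.+)$", line)
        if heading:
            if title is not None:
                chapters.append((title, "\n".join(lines).strip()))
            title, lines = heading.group(1).strip(), []
        elif title is not None:
            lines.append(line)  # text before the first heading is ignored
    if title is not None:
        chapters.append((title, "\n".join(lines).strip()))
    return chapters


doc = "# Chapter 1\nIntro text.\n## Section\nMore.\n# Chapter 2\nSecond body."
chapters = split_chapters(doc)
# chapters == [("Chapter 1", "Intro text.\n## Section\nMore."), ("Chapter 2", "Second body.")]
```

Lower-level headings stay inside their chapter. A production splitter would also need to ignore `#` lines inside fenced code blocks, which this sketch does not handle.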

**Library Remediation**
> *Goal:* Organize a messy folder of PDFs with inconsistent filenames.
> *Workflow:* The Ingest phase uses header analysis to identify papers and links them to a clean BibTeX file, generating a structured inventory of the collection.

---

## License

MIT License. Open for academic and commercial use.
