Metadata-Version: 2.4
Name: papercutter
Version: 1.2.0
Summary: Extract and map content from academic papers for LLM processing
Project-URL: Homepage, https://github.com/rawatpranjal/papercutter
Project-URL: Repository, https://github.com/rawatpranjal/papercutter
Author: Pranjal Rawat
License-Expression: MIT
License-File: LICENSE
Keywords: academic,arxiv,extraction,papers,pdf,research
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Requires-Dist: arxiv>=2.1.0
Requires-Dist: bibtexparser>=1.4.0
Requires-Dist: certifi>=2023.0.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: pdfplumber>=0.11.0
Requires-Dist: pydantic-settings>=2.1.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pypdf>=4.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer[all]>=0.12.0
Provides-Extra: all
Requires-Dist: docling>=2.0.0; extra == 'all'
Requires-Dist: instructor>=1.0.0; extra == 'all'
Requires-Dist: jinja2>=3.0.0; extra == 'all'
Requires-Dist: litellm>=1.30.0; extra == 'all'
Requires-Dist: mypy>=1.8.0; extra == 'all'
Requires-Dist: pillow>=10.0.0; extra == 'all'
Requires-Dist: pymupdf>=1.24.0; extra == 'all'
Requires-Dist: pytest-cov>=4.0.0; extra == 'all'
Requires-Dist: pytest-timeout>=2.0.0; extra == 'all'
Requires-Dist: pytest>=8.0.0; extra == 'all'
Requires-Dist: ruff>=0.3.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.8.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.3.0; extra == 'dev'
Provides-Extra: docling
Requires-Dist: docling>=2.0.0; extra == 'docling'
Provides-Extra: equations
Requires-Dist: pillow>=10.0.0; extra == 'equations'
Requires-Dist: pymupdf>=1.24.0; extra == 'equations'
Provides-Extra: equations-nougat
Requires-Dist: pillow>=10.0.0; extra == 'equations-nougat'
Requires-Dist: pymupdf>=1.24.0; extra == 'equations-nougat'
Requires-Dist: torch>=2.0.0; extra == 'equations-nougat'
Requires-Dist: transformers>=4.30.0; extra == 'equations-nougat'
Provides-Extra: equations-pix2tex
Requires-Dist: pillow>=10.0.0; extra == 'equations-pix2tex'
Requires-Dist: pix2tex>=0.1.0; extra == 'equations-pix2tex'
Requires-Dist: pymupdf>=1.24.0; extra == 'equations-pix2tex'
Provides-Extra: factory
Requires-Dist: docling>=2.0.0; extra == 'factory'
Requires-Dist: instructor>=1.0.0; extra == 'factory'
Requires-Dist: jinja2>=3.0.0; extra == 'factory'
Requires-Dist: litellm>=1.30.0; extra == 'factory'
Provides-Extra: fast
Requires-Dist: pymupdf>=1.24.0; extra == 'fast'
Provides-Extra: llm
Requires-Dist: instructor>=1.0.0; extra == 'llm'
Requires-Dist: litellm>=1.30.0; extra == 'llm'
Description-Content-Type: text/markdown

# Papercutter

[![PyPI version](https://badge.fury.io/py/papercutter.svg)](https://pypi.org/project/papercutter/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![CI](https://github.com/rawatpranjal/papercutter/actions/workflows/ci.yml/badge.svg)](https://github.com/rawatpranjal/papercutter/actions/workflows/ci.yml)

Extract knowledge from academic papers. A CLI-first Python package for researchers.

## Installation

```bash
pip install papercutter
```

With LLM features (summarization, reports, study aids):
```bash
pip install "papercutter[llm]"
```

With fast PDF processing (PyMuPDF):
```bash
pip install "papercutter[fast]"
```

All optional dependencies:
```bash
pip install "papercutter[all]"
```

### Development Installation

```bash
git clone https://github.com/rawatpranjal/papercutter.git
cd papercutter
pip install -e ".[dev]"
```

## Quick Start

### Fetch Papers

Download papers from various academic sources:

```bash
# From arXiv
papercutter fetch arxiv 2301.00001

# From DOI
papercutter fetch doi 10.1257/aer.20180779

# From SSRN
papercutter fetch ssrn 4123456

# From NBER
papercutter fetch nber w29000

# From direct URL
papercutter fetch url "https://example.com/paper.pdf" --name smith_2024
```
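
To fetch several papers in one go, the commands above can be driven from a short script. The following is a minimal sketch, not part of Papercutter itself: `fetch_command` and `fetch_all` are hypothetical helper names, and the only CLI calls used are the `papercutter fetch <source> <identifier>` forms shown above.

```python
from __future__ import annotations

import subprocess

# Sources taken from the examples above; each is a `papercutter fetch` subcommand.
KNOWN_SOURCES = {"arxiv", "doi", "ssrn", "nber", "url"}


def fetch_command(source: str, identifier: str) -> list[str]:
    """Build the argv for one `papercutter fetch <source> <identifier>` call."""
    if source not in KNOWN_SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    return ["papercutter", "fetch", source, identifier]


def fetch_all(papers: list[tuple[str, str]], dry_run: bool = False) -> list[list[str]]:
    """Run one fetch per (source, identifier) pair and return the commands used."""
    commands = [fetch_command(source, identifier) for source, identifier in papers]
    if not dry_run:
        for argv in commands:
            # check=False: one failed download should not abort the whole batch.
            subprocess.run(argv, check=False)
    return commands
```

Calling `fetch_all([("arxiv", "2301.00001"), ("nber", "w29000")], dry_run=True)` returns the commands without running them, which is handy for checking a reading list before downloading.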

### Extract Text

Extract clean text from PDFs:

```bash
# Full text to stdout
papercutter extract text paper.pdf

# Save to file
papercutter extract text paper.pdf --output paper.txt

# Chunk for LLM processing
papercutter extract text paper.pdf --chunk-size 4000 --overlap 200

# Extract specific pages
papercutter extract text paper.pdf --pages "1-10,15"
```
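
For intuition about `--chunk-size` and `--overlap`, here is a minimal sketch of overlapping chunking. It is an illustration only, not Papercutter's implementation: it assumes sizes are counted in characters (this README does not say whether the CLI counts characters or tokens), and `chunk_text` is a hypothetical name.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With `chunk_size=4000` and `overlap=200`, each window starts 3800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary visible in at least one chunk.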

### Extract Tables

Extract tables from PDFs as CSV or JSON:

```bash
# All tables to stdout as JSON
papercutter extract tables paper.pdf

# Save as CSV files
papercutter extract tables paper.pdf --output ./tables/ --format csv

# Extract from specific pages
papercutter extract tables paper.pdf --pages "5-10" --format json
```
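
The `--pages` option, here and in `extract text`, takes a spec such as `"5-10"` or `"1-10,15"`. The sketch below shows one plausible reading of that syntax. It assumes 1-based page numbers and inclusive ranges, which this README does not state explicitly, and `parse_pages` is a hypothetical helper rather than a Papercutter function.

```python
from __future__ import annotations


def parse_pages(spec: str) -> list[int]:
    """Expand a spec like "1-10,15" into a sorted list of unique page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # assumed inclusive on both ends
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)
```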

### Extract References

Extract bibliography as BibTeX:

```bash
# BibTeX to stdout
papercutter extract refs paper.pdf

# Save to file
papercutter extract refs paper.pdf --output refs.bib

# As JSON
papercutter extract refs paper.pdf --format json
```

## Configuration

Papercutter stores configuration in `~/.papercutter/config.yaml`:

```yaml
output:
  directory: ~/papers

extraction:
  backend: pdfplumber
  text:
    chunk_size: null
    chunk_overlap: 200
  tables:
    format: csv

# LLM settings (v0.2)
llm:
  default_provider: anthropic
  default_model: claude-sonnet-4-20250514
```

Environment variables override values from the config file:
```bash
export PAPERCUTTER_ANTHROPIC_API_KEY=sk-ant-...
export PAPERCUTTER_OPENAI_API_KEY=sk-...
```
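
The precedence rule can be pictured as a simple lookup: use the environment variable if it is set, otherwise fall back to the configured value. This is a sketch of the rule only, with a hypothetical `resolve_setting` helper; how Papercutter loads settings internally is not described here.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def resolve_setting(
    env_name: str,
    config_value: str | None,
    environ: Mapping[str, str] | None = None,
) -> str | None:
    """Return the environment value if set and non-empty, else the config value."""
    env = os.environ if environ is None else environ
    value = env.get(env_name)
    # Treating an empty variable as unset is an assumption of this sketch.
    return value if value else config_value
```

For example, `resolve_setting("PAPERCUTTER_ANTHROPIC_API_KEY", None)` returns the exported key when the variable is set and `None` otherwise.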

## Migration from Papercut

Papercutter is a direct rename of the original Papercut project. To upgrade an existing installation:

1. Reinstall the package: `pip uninstall papercut && pip install papercutter`.
2. Update scripts and shell aliases to call `papercutter` instead of `papercut`.
3. Rename your config directory if you have custom settings: `mv ~/.papercut ~/.papercutter`.
4. (Optional) Rename the cache directory to retain cached artifacts: `mv ~/.cache/papercut ~/.cache/papercutter`.
5. Update any `PAPERCUT_*` environment variables to the new `PAPERCUTTER_*` prefix.
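
Steps 3 to 5 can be scripted. The sketch below is one way to do it, using only the paths and prefixes named in the list above; `migrate_dirs` and `renamed_env_vars` are hypothetical helpers, not something Papercutter ships. It never overwrites an existing `papercutter` directory.

```python
from __future__ import annotations

from collections.abc import Mapping
from pathlib import Path


def migrate_dirs(home: Path) -> list[tuple[Path, Path]]:
    """Rename the old config and cache directories under `home`, if present."""
    renames = [
        (home / ".papercut", home / ".papercutter"),
        (home / ".cache" / "papercut", home / ".cache" / "papercutter"),
    ]
    moved = []
    for old, new in renames:
        # Skip when there is nothing to move or the target already exists.
        if old.is_dir() and not new.exists():
            old.rename(new)
            moved.append((old, new))
    return moved


def renamed_env_vars(environ: Mapping[str, str]) -> dict[str, str]:
    """Map each legacy PAPERCUT_* variable name to its PAPERCUTTER_* name."""
    old_prefix, new_prefix = "PAPERCUT_", "PAPERCUTTER_"
    return {
        name: new_prefix + name[len(old_prefix):]
        for name in environ
        if name.startswith(old_prefix)
    }
```

Running `migrate_dirs(Path.home())` performs the two renames, and `renamed_env_vars(os.environ)` (after `import os`) lists the variables that still need updating in your shell profile.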

## Development

Run tests:
```bash
pytest tests/
```

Run linting:
```bash
ruff check src/
mypy src/
```

## License

MIT License - see [LICENSE](LICENSE) for details.
