Metadata-Version: 2.4
Name: revpdf
Version: 0.1.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Utilities
Summary: A triage-and-recovery toolkit for PDFs saved with incremental updates.
Author-email: Rekhet <anubis@snu.ac.kr>
License-Expression: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/Rekhet/revpdf

# revpdf

`revpdf` is a triage-and-recovery toolkit for PDFs saved with incremental updates. It helps you inspect a PDF's edit history and safely roll it back to an earlier revision when later saves added unwanted markup.

## Installation

```bash
pip install revpdf
```

## Quick Start

### 1. Inspect a PDF

```bash
revpdf-inspect input.pdf
```

This prints revision numbers, byte ranges, and counts of suspicious annotation markers.

### 2. List object-level changes per revision

```bash
revpdf-list input.pdf
```

This identifies which objects each revision appends and distinguishes likely annotation-only changes from content-affecting changes.

### 3. Extract a chosen revision

```bash
revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf
```

### 4. Automated Sanitization (New)

```bash
revpdf-clean input.pdf --output cleaned.pdf --strategies acrobat samsung --manifest integrity.json
```

This uses best-guess heuristics to automatically identify and surgically remove platform-specific markup while regenerating the XRef stream for a clean file.

**Forensic Features:**
- **--manifest (-m):** Generates a machine-readable JSON report mapping original object IDs to their cryptographic hashes.
- **Global Signature:** A cumulative SHA-256 fingerprint of all kept content that was hashed, giving evidence that the retained textbook data is byte-identical to the original. Large binary assets are skipped by default (see Tiered Hashing).

## Forensic Integrity Module

`revpdf` is designed for high-assurance environments where data provenance is critical.

### Tiered Hashing
To maintain high performance on large files, the integrity engine uses a tiered approach:
- **Hashed:** Text content streams, page geometry, and structural objects.
- **Skipped:** Large binary assets like embedded images and fonts (can be enabled via API).
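
As a rough illustration of the tiered idea, the sketch below hashes each kept object with SHA-256, skips kinds flagged as large binary assets, and folds the per-object digests into one cumulative digest. The `kind` labels, the `SKIPPED_KINDS` set, and the way the cumulative digest is formed are assumptions made for illustration; they are not taken from `revpdf`'s implementation.

```python
import hashlib
from typing import Dict, Iterable, Tuple

# Hypothetical kind labels; revpdf's internal classification may differ.
SKIPPED_KINDS = {"image", "font"}

def tiered_hashes(
    objects: Iterable[Tuple[int, str, bytes]],
) -> Tuple[str, Dict[int, str]]:
    """Hash (object_id, kind, raw_bytes) triples, skipping large binary kinds.

    Returns (cumulative_digest, {object_id: sha256_hex}).
    """
    cumulative = hashlib.sha256()
    per_object: Dict[int, str] = {}
    for object_id, kind, raw in sorted(objects, key=lambda item: item[0]):
        if kind in SKIPPED_KINDS:
            continue  # skipped tier: not hashed, so not covered by the digest
        digest = hashlib.sha256(raw).hexdigest()
        per_object[object_id] = digest
        # Fold the ID and digest into the running fingerprint, in ID order.
        cumulative.update(f"{object_id}:{digest}\n".encode("ascii"))
    return cumulative.hexdigest(), per_object
```

Under this scheme a changed image does not change the cumulative digest, which is the trade-off the skipped tier accepts for speed.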

### Integrity Manifest (JSON)
The manifest provides a verifiable audit trail:
- `global_signature`: The unique hash for the entire "cleaned" document state.
- `object_hashes`: A dictionary of `New_ID -> SHA-256_Hash`.
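
A minimal sketch of reading such a manifest and checking that the two documented fields are well-formed. It assumes the hashes are stored as 64-character hexadecimal strings; the helper name `load_manifest` is hypothetical, and the check does not recompute any hash.

```python
import json
import re
from pathlib import Path

# Assumption: SHA-256 digests are hex-encoded (64 hex characters).
SHA256_HEX = re.compile(r"[0-9a-fA-F]{64}")

def load_manifest(path):
    """Load a manifest and validate the shape of its two documented fields."""
    manifest = json.loads(Path(path).read_text(encoding="utf-8"))
    signature = manifest["global_signature"]
    object_hashes = manifest["object_hashes"]
    if not SHA256_HEX.fullmatch(str(signature)):
        raise ValueError("global_signature is not a 64-character hex digest")
    bad = [
        key for key, value in object_hashes.items()
        if not SHA256_HEX.fullmatch(str(value))
    ]
    if bad:
        raise ValueError(f"malformed hashes for object IDs: {bad}")
    return signature, object_hashes
```

Comparing the returned `global_signature` across two cleaning runs of the same input is one simple way to confirm that both runs kept the same content.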

## Developer SDK (v0.2.0+)

`revpdf` now provides a tiered Python SDK with a high-performance Rust backend. It supports `asyncio` for non-blocking I/O.

### Object-Model API

Designed for ease of use and integration into Python workflows.

```python
import asyncio
from revpdf import PdfDocument, Sanitizer

async def main():
    # Load document (Lazy loading handled by Rust)
    doc = await PdfDocument.open("textbook.pdf")
    
    # Apply automated sanitization strategies
    sanitizer = Sanitizer(strategies=["acrobat", "samsung"])
    removed_count = await sanitizer.apply(doc)
    print(f"Identified {removed_count} objects for removal")
    
    # Surgical Save (Modern XRef Stream regeneration)
    manifest = await doc.save("cleaned.pdf", surgical=True)
    print("Surgical Save Complete!")
    print(f"Global Document Signature: {manifest.global_signature}")
    print(f"Verified Objects: {len(manifest.object_hashes)}")

if __name__ == "__main__":
    asyncio.run(main())
```

## When this works well

Rollback is appropriate when the PDF was modified by incremental saves. In that format, each save appends a new revision to the end of the file and leaves the earlier bytes untouched.

Typical signs:

- multiple `%%EOF` markers in the file
- trailer dictionaries with `/Prev` pointers
- later revisions containing annotation markers such as `/Type /Annot`, `/Subtype /Stamp`, `/InkList`, `/AAPL:AKExtras`, or `/PPKType (draw)`
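
The signs above can be counted directly in the raw bytes. The following is a sketch only: the function names are hypothetical, and linearized PDFs can also show a second `%%EOF` and a `/Prev` entry, so treat a positive result as a hint rather than proof.

```python
import re

# Patterns for the signs listed above; \s* tolerates writers that omit the space.
EOF_RE = re.compile(rb"%%EOF")
PREV_RE = re.compile(rb"/Prev\s+\d+")
ANNOT_RE = re.compile(
    rb"/Type\s*/Annot|/Subtype\s*/Stamp|/InkList|/AAPL:AKExtras|/PPKType \(draw\)"
)

def incremental_signs(data: bytes) -> dict:
    """Count raw-byte signs of incremental saves and later annotation."""
    return {
        "eof_markers": len(EOF_RE.findall(data)),
        "prev_pointers": len(PREV_RE.findall(data)),
        "annotation_markers": len(ANNOT_RE.findall(data)),
    }

def looks_incremental(data: bytes) -> bool:
    """True when there is more than one %%EOF and at least one /Prev pointer."""
    signs = incremental_signs(data)
    return signs["eof_markers"] > 1 and signs["prev_pointers"] >= 1
```

Call it with `Path("input.pdf").read_bytes()` to get the counts for a real file.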

## Safety Rule

Only roll back to an earlier revision if the later revisions contain unwanted annotations or annotation appearance streams and do not replace the actual textbook page content you need to keep.

Do not roll back blindly if later revisions also change:

- page content streams
- text objects
- fonts
- images
- the page tree structure, when the change reflects real content (added, removed, or reordered pages)

If those appear in the later revisions, you need a more careful repair strategy.

## The Manual Workflow

### 1. Find the revision boundaries

Each incremental revision normally ends with `%%EOF`.

Run:

```bash
python3 - <<'PY'
from pathlib import Path

data = Path("input.pdf").read_bytes()
cursor = 0
index = 1
while True:
    pos = data.find(b"%%EOF", cursor)
    if pos == -1:
        break
    print(f"Revision {index}: EOF at byte {pos}")
    cursor = pos + 1
    index += 1
PY
```

If you see more than one `%%EOF`, the file contains multiple revisions.

### 2. Inspect the trailer chain

Later trailers usually point back to the previous revision's cross-reference section through a `/Prev` byte offset.

Run:

```bash
python3 - <<'PY'
from pathlib import Path

data = Path("input.pdf").read_bytes()
cursor = 0
while True:
    pos = data.find(b"%%EOF", cursor)
    if pos == -1:
        break
    snippet = data[max(0, pos - 400):pos + 20]
    print(snippet.decode("latin1", "replace"))
    print("-----")
    cursor = pos + 1
PY
```

This helps confirm that the file was saved incrementally rather than rewritten from scratch.
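
If reading the raw snippets by eye is tedious, the chain can be pulled out with two regular expressions. This sketch assumes classic `trailer` dictionaries that sit within about 400 bytes of `startxref`, mirroring the window used in the snippet above; files that use xref streams keep `/Prev` in the stream dictionary, which may be further away. The function name is hypothetical.

```python
import re

PREV_RE = re.compile(rb"/Prev\s+(\d+)")
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")

def trailer_chain(data: bytes):
    """Return (startxref_offset, prev_offset_or_None) per revision, in file order."""
    chain = []
    previous_end = 0
    for match in STARTXREF_RE.finditer(data):
        # Search only the bytes just before this startxref, and never reach
        # back into the previous revision's trailer.
        window_start = max(previous_end, match.start() - 400)
        prev = PREV_RE.findall(data[window_start:match.start()])
        chain.append((int(match.group(1)), int(prev[-1]) if prev else None))
        previous_end = match.end()
    return chain
```

In an incrementally saved file, each `/Prev` value normally equals the `startxref` value of the revision before it, and the first revision has no `/Prev` at all.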

### 3. Search for suspicious annotation markers

Search the raw PDF bytes for common overlay markers:

```bash
rg -a -n '/Subtype /(Stamp|Ink|FreeText|Square|Circle|Highlight)|/InkList|/AAPL:AKExtras|/PPKType \(draw\)|Mobile User' input.pdf
```

Common signs of hand-drawn markup include:

- `/Subtype /Stamp`
- `/Subtype /Ink`
- `/InkList`
- `/PPKType (draw)`
- `/AAPL:AKExtras`
- `Mobile User`

### 4. Compare revisions, not just the whole file

The important question is not whether the PDF contains annotations somewhere, but in which revision those objects first appear.

For each appended revision, inspect whether it adds:

- only annotation objects and annotation appearance streams
- page `/Annots` references pointing to those annotations

A revision that adds only these is usually safe to roll back.

If the appended revision adds or replaces actual page contents, treat it as unsafe for blind rollback.
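
One way to see when markers first appear is to split the bytes at each `%%EOF` and count markers per chunk. The sketch below reuses the marker pattern from step 3; the function names are hypothetical. It only sees uncompressed bytes, so markers stored inside compressed object streams will not be counted, and a literal `%%EOF` inside binary stream data would create a false boundary.

```python
import re

MARKER_RE = re.compile(
    rb"/Subtype\s*/(?:Stamp|Ink|FreeText|Square|Circle|Highlight)"
    rb"|/InkList|/AAPL:AKExtras|/PPKType \(draw\)|Mobile User"
)

def revision_slices(data: bytes):
    """Split raw PDF bytes into per-revision chunks, each ending at a %%EOF."""
    slices, start, cursor = [], 0, 0
    while True:
        pos = data.find(b"%%EOF", cursor)
        if pos == -1:
            break
        end = pos + len(b"%%EOF")
        slices.append(data[start:end])
        start = cursor = end
    return slices

def first_marked_revision(data: bytes):
    """Return (1-based index of first revision with markers or None, counts)."""
    counts = [len(MARKER_RE.findall(chunk)) for chunk in revision_slices(data)]
    first = next((i for i, n in enumerate(counts, start=1) if n), None)
    return first, counts
```

A count of zero for a revision is not proof that it is annotation-free; it only means no uncompressed marker was found there.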

### 5. Choose the rollback target

Choose the last revision before the unwanted markers first appear.

Example logic:

- revision 1: no unwanted annotation markers
- revision 2: unwanted drawing markers appear
- revision 3: more of the same unwanted drawing markers

In that case, revision 1 is the clean rollback target.
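
The same logic as a small function, given one marker count per revision. The function name and the choice to return `None` are illustrative assumptions; revisions are numbered from 1, as in the example above.

```python
from typing import Optional, Sequence

def choose_rollback_target(marker_counts: Sequence[int]) -> Optional[int]:
    """Return the 1-based index of the last revision before markers appear.

    Returns None when revision 1 already contains markers (there is no clean
    revision) or when no revision contains markers (nothing to roll back).
    """
    for index, count in enumerate(marker_counts, start=1):
        if count > 0:
            # Markers first appear here; the previous revision is the target.
            return index - 1 if index > 1 else None
    return None
```

For the example above, `choose_rollback_target([0, 4, 7])` selects revision 1.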

### 6. Extract the earlier revision into a new file

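Once you know the correct revision boundary, copy the file only up to the end of that revision's final `%%EOF` marker. The offset must include the five bytes of `%%EOF` itself and its trailing line ending, if any. The script in step 1 prints where each marker starts, so add at least 5 to that value.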

Manual example:

```bash
head -c <SAFE_END_OFFSET> input.pdf > cleaned.pdf
```

Do this into a new output file. Leave the original untouched.
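
A Python equivalent that computes the offset for you may be less error-prone than working it out by hand. This is a sketch with hypothetical function names; it counts `%%EOF` markers in the raw bytes, so it shares the caveat that a literal `%%EOF` inside stream data would be miscounted.

```python
from pathlib import Path

def safe_end_offset(data: bytes, revision: int) -> int:
    """Byte offset just past the chosen revision's %%EOF and its line ending."""
    if revision < 1:
        raise ValueError("revision is 1-based")
    cursor = 0
    for _ in range(revision):
        pos = data.find(b"%%EOF", cursor)
        if pos == -1:
            raise ValueError(f"file has fewer than {revision} %%EOF markers")
        cursor = pos + len(b"%%EOF")
    # Keep one trailing end-of-line sequence (\r\n, \n, or \r) if present.
    if data[cursor:cursor + 2] == b"\r\n":
        cursor += 2
    elif data[cursor:cursor + 1] in (b"\n", b"\r"):
        cursor += 1
    return cursor

def truncate_to_revision(src: str, dst: str, revision: int) -> int:
    """Write the first `revision` revisions of src to dst; return bytes written."""
    data = Path(src).read_bytes()
    end = safe_end_offset(data, revision)
    Path(dst).write_bytes(data[:end])  # the source file is never modified
    return end
```

For example, `truncate_to_revision("input.pdf", "cleaned.pdf", 1)` keeps only the first revision.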

### 7. Verify the cleaned file

Run:

```bash
pdfinfo cleaned.pdf
```

Then confirm the unwanted markers are gone:

```bash
python3 - <<'PY'
from pathlib import Path

data = Path("cleaned.pdf").read_bytes()
for token in [
    b"Mobile User",
    b"/PPKType (draw)",
    b"/Subtype /Stamp",
    b"/Subtype /Ink",
    b"/AAPL:AKExtras",
]:
    print(token.decode("latin1"), data.count(token))
PY
```

Check:

- page count still matches what you expect
- the cleaned file opens normally
- the unwanted annotation markers are gone
- the original content remains intact
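
For a pure rollback, the cleaned file should be an exact byte prefix of the original, which gives a cheap check that the original content remains intact. A sketch, with a hypothetical function name; it does not apply to files that were rewritten rather than truncated.

```python
import hashlib

def verify_rollback(original: bytes, cleaned: bytes) -> dict:
    """Check that cleaned is a byte-for-byte prefix of original ending at %%EOF."""
    return {
        "is_prefix_of_original": cleaned == original[:len(cleaned)],
        "ends_with_eof": cleaned.rstrip(b"\r\n").endswith(b"%%EOF"),
        "sha256": hashlib.sha256(cleaned).hexdigest(),
    }
```

Pass in `Path("input.pdf").read_bytes()` and `Path("cleaned.pdf").read_bytes()`. Both boolean fields should be true after a correct truncation.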

### Reusable Commands

The package includes four core commands:

- `revpdf-inspect`
- `revpdf-extract`
- `revpdf-list`
- `revpdf-clean`

#### Inspect a PDF

```bash
revpdf-inspect input.pdf
```


This prints:

- revision number
- revision byte range
- end offset
- trailer `/Prev`
- trailer `/Size`
- counts of suspicious annotation markers in that revision

#### List object-level changes per revision

```bash
revpdf-list input.pdf
```

This prints, for each revision:

- object count
- per-kind counts such as page, annot, xobject, font, and generic objects
- whether each appended object was added, redefined, or repeated
- a revision assessment such as `likely_annotation_only` or `content_affecting_or_mixed`
- notable details such as `/Rect`, `/Annots`, `/Contents`, stream filters, compressed-object containers, and vendor markers when present

Useful options:

```bash
revpdf-list input.pdf --revision-index 3
revpdf-list input.pdf --show-baseline-objects
revpdf-list input.pdf --summary-only
revpdf-list input.pdf --json
```

#### Extract a chosen revision

```bash
revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf
```

This writes a new PDF containing only the bytes through the selected revision.

You can also extract by byte offset:

```bash
revpdf-extract input.pdf --end-offset 229086321 --output cleaned.pdf
```

## Practical Decision Checklist

Use rollback when all of these are true:

- the PDF has multiple revisions
- the unwanted changes were introduced in later revisions
- those later revisions are annotation-only or annotation-dominant
- the earlier revision already contains the correct textbook content

Do not use blind rollback when any of these are true:

- the later revisions contain actual content changes you need
- you cannot tell whether the later objects are only annotations
- the PDF was fully rewritten instead of saved incrementally
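
The checklist can be encoded as a small predicate, which is handy when triaging many files. The field names below are hypothetical; `None` for `later_revisions_annotation_only` stands for "could not tell".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RollbackFacts:
    multiple_revisions: bool
    unwanted_changes_in_later_revisions: bool
    # None means "could not tell", which the checklist treats as a blocker.
    later_revisions_annotation_only: Optional[bool]
    earlier_revision_has_correct_content: bool
    later_revisions_have_needed_content: bool = False
    fully_rewritten: bool = False

def rollback_is_appropriate(facts: RollbackFacts) -> bool:
    """Apply the checklist: any blocker vetoes; otherwise all conditions must hold."""
    blockers = (
        facts.later_revisions_have_needed_content,
        facts.later_revisions_annotation_only is None,
        facts.fully_rewritten,
    )
    if any(blockers):
        return False
    return all((
        facts.multiple_revisions,
        facts.unwanted_changes_in_later_revisions,
        bool(facts.later_revisions_annotation_only),
        facts.earlier_revision_has_correct_content,
    ))
```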

## Notes

- Some tools warn about broken or invalid linearization tables. That does not automatically mean the PDF is unusable.
- Some annotation systems render hand-drawn marks as `/Stamp` objects with appearance streams rather than `/Ink` objects. Search broadly.
- Some editors store changed objects inside compressed object streams (`/ObjStm`) or xref streams. `revpdf-list` detects these and expands common Flate-compressed object streams.
- Always work on a copy when the document is important.

## Recommended Sequence

1. Run `revpdf-inspect`.
2. Run `revpdf-list` to see exactly which objects each revision added or redefined.
3. Identify the first revision that introduces unwanted annotation markers.
4. Choose between:
   - **Rollback:** Use `revpdf-extract` to truncate at a safe revision.
   - **Surgical Cleanup:** Use `revpdf-clean` to remove specific markup layers while keeping recent content.
5. Verify the page count, metadata, and marker counts.

