Metadata-Version: 2.4
Name: receipt-renamer
Version: 0.1.0
Summary: Watch a scans folder and auto-rename receipt images/PDFs (and course/document scans) from their OCR'd text.
Author: smolkai
License: MIT
Keywords: ocr,receipts,tesseract,rename,watcher,snap,ebt
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: Pillow>=10.0
Requires-Dist: watchdog>=4.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: pdf2image>=1.17
Provides-Extra: test
Requires-Dist: pytest>=8.0; extra == "test"
Dynamic: license-file

# receipt-renamer

Point it at the folder your scanner app drops files into. For every receipt
image/PDF it OCRs the page and renames it to something searchable:

```
Receipt 2024-03-14 1042 Safeway SNAP EBT Adobe Scan.jpg
```

Non-receipt scans (lecture notes, handouts) are handled by a separate rule
layer: it reads the **page-top title** and expands course codes into every form
you might later search for, so `CHE2A Midterm Review.pdf` becomes:

```
CHE2A Midterm Review CHE 2A Chem 2A Genius Scan.pdf
```

and `OChem Lecture Notes.pdf` becomes:

```
OChem Lecture Notes Organic Chemistry CHE8A CHE 8A Chem 8A.pdf
```

Everything — the store list, the SNAP/EBT patterns, the receipt heuristics, the
scanner-app markers, the course rules, and the filename templates — lives in one
YAML file you can edit.

## How the name is built

**Receipts** use the template:

```
Receipt {datetime} {store} {SNAP EBT if present} {scanner-app} {your notes}
```

- `{datetime}` — parsed from the receipt (`YYYY-MM-DD HHMM`); many date/time
  formats are understood (US `MM/DD/YYYY`, ISO `YYYY-MM-DD`, `Jan 5, 2024`,
  12h/24h times). If no date is found it falls back to `Receipt {store} …`.
- `{store}` — matched against a known-store list (Safeway, Costco, Trader Joe's,
  Whole Foods, Target, Walmart, Kroger, CVS, Walgreens, Aldi, Sprouts, and
  more — all editable).
- `SNAP EBT` — inserted only when a SNAP/EBT line is detected.
- `{scanner-app}` — detected from the OCR text or the original filename
  (Genius Scan, Adobe Scan, CamScanner, Microsoft Lens, …).
- `{your notes}` — anything you pass with `--notes`.

Empty fields collapse cleanly — no double spaces, no dangling separators.

**Documents** (anything not detected as a receipt) use:

```
{title} {course-code aliases} {scanner-app} {your notes}
```

## Install

Requires the **Tesseract** OCR engine on your PATH, plus **poppler** if you want
to OCR PDFs.

```sh
# macOS
brew install tesseract poppler
# Debian/Ubuntu
sudo apt-get install tesseract-ocr poppler-utils

pip install -e .
```

## Usage

```sh
# Dry run over a folder (prints planned renames, changes nothing):
receipt-renamer batch ~/Scans

# Apply the renames:
receipt-renamer batch ~/Scans --commit

# Move renamed files into a tidy archive instead of renaming in place:
receipt-renamer batch ~/Scans --commit --dest ~/Receipts

# One file, with a note:
receipt-renamer one ~/Scans/IMG_0001.jpg --commit --notes "reimburse work"

# Watch the folder and rename new scans as they arrive:
receipt-renamer watch ~/Scans --commit

# Print the default config so you can copy and tweak it:
receipt-renamer dump-config > my-rules.yaml
receipt-renamer batch ~/Scans --config my-rules.yaml --commit
```

> **Dry run is the default.** Nothing is renamed until you pass `--commit`.

## Configuration

Run `receipt-renamer dump-config` to see the full annotated default. Highlights:

- **`stores`** — canonical name + aliases/spellings. Whole-word matching, so
  `Target` won't match "targeting". Longest matching alias wins.
- **`snap_patterns`** — regexes that flag a SNAP/EBT receipt.
- **`receipt_signals` / `receipt_min_signals`** — a page is a receipt if a known
  store is found, or if it hits at least this many signals (TOTAL, TAX, `$x.xx`,
  card brands, …). This catches receipts from stores not in your list.
- **`courses`** — both a generic `DEPT + number` pattern (with a `subject_map`
  so `CHE` → `Chem`) and explicit `rules` (`OChem` → `Organic Chemistry` +
  the `CHE8A`/`CHE 8A`/`Chem 8A` family).
- **`templates`** / **`datetime_format`** — the output filename shapes.

## Architecture

The core operates entirely on **text**, never on pixels, so it's fast to test
and the OCR backend is swappable:

| module | responsibility |
| --- | --- |
| `config` | load + validate the YAML rule table |
| `ocr` | pluggable OCR (`OcrFn`); default backend = Tesseract via pytesseract / pdf2image |
| `stores` | whole-word store recognition |
| `receipts` | date/time parsing + SNAP/EBT detection |
| `courses` | course-code expansion |
| `classify` | receipt vs document decision |
| `rename` | filename assembly, sanitization, collision-safe targets (pure) |
| `processor` | OCR → plan → (optional) rename, never throws on a bad file |
| `watcher` | watchdog folder watcher with a write-settle delay |
| `cli` | `batch` / `one` / `watch` / `dump-config` |

## Tests

```sh
pip install -e ".[test]"
pytest
```

The bulk of the suite runs against **saved OCR-text fixtures** (in
`tests/fixtures/`) and a **mocked OCR function**, so it needs no Tesseract.
`tests/test_ocr_end_to_end.py` renders a synthetic receipt with Pillow and runs
the **real Tesseract** pipeline; it skips automatically if the binary is absent.

## Notes & limitations

- OCR quality is bounded by Tesseract and the scan. Faded thermal receipts and
  skewed photos will parse worse; the heuristics are deliberately forgiving.
- Date parsing favours the first plausible date on the page. Very unusual
  layouts may pick the wrong one — review a dry run before `--commit`.
- For PDFs, only the first couple of pages are OCR'd (configurable via
  `--pdf-pages`).

## License

MIT — see [LICENSE](LICENSE).
