Metadata-Version: 2.4
Name: nophi
Version: 0.1.0
Summary: Detect and redact PHI/PII from documents (.txt, .csv, .docx, .xlsx, .pdf)
License: MIT
Project-URL: Homepage, https://github.com/kshen3778/no-phi
Project-URL: Repository, https://github.com/kshen3778/no-phi
Keywords: phi,pii,redaction,anonymization,healthcare,privacy
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer[all]<1.0,>=0.26
Requires-Dist: click<9.0,>=8.4
Requires-Dist: rich<16.0,>=15.0
Requires-Dist: presidio-analyzer>=2.2.362
Requires-Dist: presidio-anonymizer>=2.2.362
Requires-Dist: spacy<4.0,>=3.8
Requires-Dist: python-docx<2.0,>=1.2
Requires-Dist: openpyxl<4.0,>=3.1
Requires-Dist: pymupdf<2.0,>=1.27
Requires-Dist: certifi>=2024.0
Requires-Dist: drug-named-entity-recognition<3.0,>=2.0.9
Dynamic: license-file

# no-phi

A command-line tool for detecting and redacting **PHI/PII** (protected health
information / personally identifiable information) from documents. It reads
`.txt`, `.csv`, `.docx`, `.xlsx`, and `.pdf` files, finds personal data, writes
redacted copies, and produces an Excel findings report.

It is tuned for **healthcare documents**: a layer of biomedical recognizers
suppresses the false positives that general-purpose NER produces on clinical
text (e.g. tagging a drug name like *Perindopril* as a PERSON, or *Cardiology*
as an ORGANIZATION).

```bash
# Scan a file or folder, write redacted copies + phi_report.xlsx
python main.py scan report.pdf
python main.py scan ./records/ --output ./records_cleaned/

# Detect only, don't write redacted files
python main.py scan report.docx --dry-run

# Restrict to specific entity types
python main.py scan data.csv --entities PERSON,PHONE_NUMBER,US_SSN

# Map detected values to stable IDs instead of <ENTITY_TYPE> (CSV cols: id,mapped_id)
python main.py scan notes.txt --mappings mappings.csv

# Ignore known-safe values (.txt/.csv/.xlsx/.json) — not redacted or reported
python main.py scan ./records/ --exclude allowlist.txt

# Pre-download all NLP models (otherwise downloaded on first scan)
python main.py download-models
```

---

## Pipeline

Every file flows through four stages: **extract → recognize → redact →
report**. The tools used at each stage are listed below.

```
                ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
   file  ─────► │ 1. EXTRACT  ├──►│ 2. RECOGNIZE├──►│ 3. REDACT   ├──►│ 4. REPORT   │
                │  text +     │   │  PII spans  │   │  anonymize/ │   │  Excel      │
                │  positions  │   │  (Presidio) │   │  black-box  │   │  findings   │
                └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
```

The CLI orchestration lives in [nophi/cli.py](nophi/cli.py): it collects input
files ([`_collect_files`](nophi/cli.py)), dispatches each by extension to a
handler in [nophi/handlers/](nophi/handlers/), aggregates findings, and prints a
Rich summary table. ([main.py](main.py) is a thin shim that calls into it.)
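
The collect-and-dispatch step described above can be pictured with a short sketch. It is an illustration only, not the code in `nophi/cli.py`: the names `collect_files` and `dispatch`, the recursive folder walk, and the handler signature (a callable that takes a path and returns a list of findings) are all assumptions.

```python
from pathlib import Path

# Extensions this README lists as supported.
SUPPORTED = {".txt", ".csv", ".docx", ".xlsx", ".pdf"}


def collect_files(target: Path) -> list[Path]:
    """Return supported files under `target` (a single file or a folder).

    Assumption: folders are walked recursively and matching is
    case-insensitive on the extension.
    """
    if target.is_file():
        return [target] if target.suffix.lower() in SUPPORTED else []
    return sorted(
        p for p in target.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def dispatch(path: Path, handlers: dict) -> list:
    """Route one file to the handler registered for its extension.

    `handlers` maps an extension such as ".pdf" to a hypothetical
    callable that returns that file's findings.
    """
    handler = handlers[path.suffix.lower()]
    return handler(path)
```

Keeping the extension-to-handler table in a plain dict means a new format needs only one new entry and one new handler module.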

| Layer | Package / tool |
| --- | --- |
| CLI, options, sub-commands | [Typer](https://typer.tiangolo.com/) |
| Terminal progress bars & tables | [Rich](https://rich.readthedocs.io/) |
| Entity detection engine | [Presidio Analyzer](https://microsoft.github.io/presidio/) |
| Anonymization engine | [Presidio Anonymizer](https://microsoft.github.io/presidio/) |
| General NER backend | [spaCy](https://spacy.io/) `en_core_web_lg` |
| Biomedical NER | [scispaCy](https://allenai.github.io/scispacy/) `en_ner_bc5cdr_md`, `en_ner_bionlp13cg_md` |
| Drug-name matching | [drug-named-entity-recognition](https://pypi.org/project/drug-named-entity-recognition/) (DrugBank) + bundled [RxNorm](https://www.nlm.nih.gov/research/umls/rxnorm/) name list |
| Report output | [openpyxl](https://openpyxl.readthedocs.io/) |

---

### 1. Extract — text + positions

Each file type has a handler in [nophi/handlers/](nophi/handlers/) that pulls out
the text to scan. For formats with layout (PDF), it also tracks where each piece
of text sits so redactions can be placed precisely.

| Type | Handler | Library | Notes |
| --- | --- | --- | --- |
| `.txt` | [text.py](nophi/handlers/text.py) | stdlib | Whole file read as one string. |
| `.csv` | [text.py](nophi/handlers/text.py) | `csv` | Dialect auto-sniffed; scanned **per cell**. |
| `.docx` | [docx.py](nophi/handlers/docx.py) | `python-docx` | Each paragraph and table cell. |
| `.xlsx` | [xlsx.py](nophi/handlers/xlsx.py) | `openpyxl` | Every string cell across all sheets. |
| `.pdf` | [pdf.py](nophi/handlers/pdf.py) | `PyMuPDF` | Words + bounding boxes via `get_text("words")`, reassembled into text with a char-offset → word-box map. |
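
The char-offset → word-box map in the `.pdf` row can be sketched without PyMuPDF, using tuples shaped like the ones `page.get_text("words")` returns: `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. This is a minimal sketch under assumptions: the names `WordSpan`, `build_text_and_map`, and `boxes_for_span` are hypothetical, and joining words with single spaces is a simplification that the real handler may not share.

```python
from typing import NamedTuple


class WordSpan(NamedTuple):
    start: int  # offset of the word's first character in the joined text
    end: int    # offset one past the word's last character
    box: tuple  # (x0, y0, x1, y1) bounding box on the page


def build_text_and_map(words):
    """Join words with single spaces and record each word's char offsets."""
    parts, spans, pos = [], [], 0
    for x0, y0, x1, y1, text, *_ in words:
        if parts:
            pos += 1  # account for the joining space
        spans.append(WordSpan(pos, pos + len(text), (x0, y0, x1, y1)))
        parts.append(text)
        pos += len(text)
    return " ".join(parts), spans


def boxes_for_span(spans, start, end):
    """Return the boxes of every word overlapping the range [start, end)."""
    return [s.box for s in spans if s.start < end and s.end > start]


# Sample word tuples with made-up coordinates.
words = [
    (72, 100, 110, 112, "Patient", 0, 0, 0),
    (114, 100, 140, 112, "John", 0, 0, 1),
    (144, 100, 170, 112, "Smith", 0, 0, 2),
]
text, spans = build_text_and_map(words)
start = text.index("John Smith")
name_boxes = boxes_for_span(spans, start, start + len("John Smith"))
```

A detection reported as character offsets in `text` can then be turned back into page rectangles, which is what the redaction stage needs.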

### 2. Recognize — find PII spans

[nophi/analyzer.py](nophi/analyzer.py) builds a Presidio `AnalyzerEngine` (backed
by the spaCy `en_core_web_lg` model) and exposes [`scan_text`](nophi/analyzer.py),
which returns the detected entities with character offsets and confidence scores.

Detection comes from three sources working together:

- **Presidio built-ins** — spaCy NER for `PERSON`, `ORGANIZATION`, `LOCATION`,
  `DATE_TIME`, `NRP`, plus pattern/checksum recognizers for `PHONE_NUMBER`,
  `EMAIL_ADDRESS`, `CREDIT_CARD`, `US_SSN`, `IBAN_CODE`, `IP_ADDRESS`, `URL`,
  `MEDICAL_LICENSE`, and other structured identifiers.

- **Custom biomedical recognizers** ([nophi/recognizers.py](nophi/recognizers.py)) — these
  do **not** add PII. They recognize medical vocabulary and tag it with the
  internal type `MEDICAL_TERM`, which is used to *protect* that text from being
  scrubbed — **not** to redact it.

  They form **four complementary layers**, each catching what the others miss
  (a curated deny-list, two drug-name lists, and an ML model). All four emit
  `MEDICAL_TERM`:

  | # | Recognizer | Backed by | Catches | Matching |
  | --- | --- | --- | --- | --- |
  | 1 | `MedicalTermRecognizer` | deny-list in [nophi/data/medical_terms.py](nophi/data/medical_terms.py) | Hospital departments, specialties, wards, symptoms, diagnoses, procedures, labs/imaging, shorthand. | Exact (case-insensitive) |
  | 2 | `MedicationRecognizer` | `drug-named-entity-recognition` (DrugBank) | Drug names, incl. common misspellings. | Fuzzy |
  | 3 | `RxNormRecognizer` | bundled RxNorm name list ([nophi/data/rxnorm_names.txt.gz](nophi/data/rxnorm_names.txt.gz)) | Drug brand + ingredient names from RxNorm (incl. many vitamins/minerals under their ingredient names). | Exact n-gram |
  | 4 | `BiomedicalNerRecognizer` | scispaCy `en_ner_bc5cdr_md` + `en_ner_bionlp13cg_md` | Chemicals, diseases, anatomy, genes, organisms, tissues; recognized **by ML context**, so it also catches substances that appear in no list. | ML model |

  Layers 2–4 overlap on purpose: the two drug lists give high-precision exact/fuzzy
  hits, and the ML layer is the backstop for substances not in any list. Coverage
  of supplements/vitamins is therefore good for clinical/ingredient names
  (e.g. *ascorbic acid*, *cholecalciferol*) but thinner for lay/botanical names
  (e.g. *fish oil*, *ginkgo biloba*); the ML layer is the main net for those.

  **Suppression logic** in [`scan_text`](nophi/analyzer.py): any
  `PERSON` / `ORGANIZATION` / `NRP` / `LOCATION` detection that overlaps a
  `MEDICAL_TERM` span is dropped, and the `MEDICAL_TERM` spans themselves are
  removed from the output (they are not PII). The net effect is that genuine
  names/places survive while clinical vocabulary stops being mislabeled as
  identifiers.

- **`StreetAddressRecognizer`** ([nophi/recognizers.py](nophi/recognizers.py)) — unlike the
  biomedical recognizers, this one *adds* PII that Presidio's defaults miss.
  spaCy NER tags cities/regions (`Scarborough`) but not street lines, so a
  regex matches a house number + 1–3 street-name words + a known street-type
  suffix (`Rd`, `Street`, `Ave`, `Blvd`, `Dr`, …) and reports it as `LOCATION`.
  It handles bare addresses (`2867 Ellesmere Rd`) as well as full ones, plus
  alphanumeric house numbers (`221B`) and ordinal street names (`350 5th
  Avenue`). Requiring a leading number keeps it from matching a `Dr.` title or
  dosages like `100 mg tablet`.
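
Two pieces of the list above lend themselves to a short sketch: the street-address pattern and the `MEDICAL_TERM` suppression pass. The code below is an illustration under assumptions, not the code in `nophi/recognizers.py` or `nophi/analyzer.py`. The `Span` class, the function names, the short suffix list, and the requirement that street-name words be capitalized or ordinal are all hypothetical choices.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int


# Assumed, abbreviated suffix list; the real recognizer's list is longer.
_SUFFIX = r"(?:Street|St|Road|Rd|Avenue|Ave|Boulevard|Blvd|Drive|Dr|Lane|Ln)"
# A street-name word: capitalized word or an ordinal such as "5th".
_WORD = r"(?:[A-Z][A-Za-z]*|\d+(?:st|nd|rd|th))"
# House number (optionally alphanumeric) + 1-3 name words + street type.
STREET_RE = re.compile(rf"\b\d+[A-Za-z]?\s+(?:{_WORD}\s+){{1,3}}{_SUFFIX}\b")


def find_street_addresses(text: str) -> list[Span]:
    """Report each street-line match as a LOCATION span."""
    return [Span("LOCATION", m.start(), m.end()) for m in STREET_RE.finditer(text)]


# Types that are dropped when they overlap medical vocabulary.
SUPPRESSIBLE = {"PERSON", "ORGANIZATION", "NRP", "LOCATION"}


def suppress(results: list[Span]) -> list[Span]:
    """Drop suppressible spans overlapping a MEDICAL_TERM, then drop the
    MEDICAL_TERM spans themselves (they are not PII)."""
    medical = [r for r in results if r.entity_type == "MEDICAL_TERM"]
    kept = []
    for r in results:
        if r.entity_type == "MEDICAL_TERM":
            continue
        overlaps_medical = any(r.start < m.end and m.start < r.end for m in medical)
        if r.entity_type in SUPPRESSIBLE and overlaps_medical:
            continue
        kept.append(r)
    return kept
```

Because the pattern must start with a digit, a bare `Dr.` title never matches, and a dosage such as `100 mg tablet` fails in this sketch because `mg` is neither capitalized nor followed by a street type.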

### 3. Redact — anonymize or black-box

[nophi/redactor.py](nophi/redactor.py) builds a Presidio `AnonymizerEngine` and the
operator set used to replace each entity. By default, an entity becomes
`<ENTITY_TYPE>`. With `--mappings` (CSV columns `id,mapped_id`), a detection whose
text matches an `id` is replaced by its `mapped_id` instead; the match is by token
overlap and is applied across all entity types, so it works regardless of how
Presidio classified the value. The `--exclude` option takes a
`.txt`/`.csv`/`.xlsx`/`.json` list of values to ignore: any detection matching one
(case-insensitive) is dropped in [`scan_text`](nophi/analyzer.py) before redaction
or reporting.
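
The replacement choice can be sketched as follows. This is a rough illustration, not the code in `nophi/redactor.py`: the function names are hypothetical, and "token overlap" is read here as "the detected text shares at least one whitespace-separated, case-folded token with an `id`", which is an assumption about the real matching rule.

```python
import csv
import io


def load_mappings(csv_text: str) -> dict[str, str]:
    """Parse CSV text with `id,mapped_id` columns into a lookup dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"].strip(): row["mapped_id"].strip() for row in reader}


def _tokens(value: str) -> set[str]:
    """Lower-cased whitespace tokens of a value."""
    return set(value.lower().split())


def replacement_for(text: str, entity_type: str, mappings: dict[str, str]) -> str:
    """Return the mapped ID when the detection shares a token with an `id`,
    otherwise the default `<ENTITY_TYPE>` placeholder."""
    detected = _tokens(text)
    for key, mapped in mappings.items():
        if detected & _tokens(key):
            return mapped
    return f"<{entity_type}>"


def is_excluded(text: str, excluded: set[str]) -> bool:
    """Case-insensitive allowlist check; `excluded` holds lower-cased values."""
    return text.strip().lower() in excluded
```

Under this reading, a detection of just `Smith` would still map to the ID registered for `John Smith`, whichever entity type it was assigned.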

How the replacement is applied depends on the format:

- **`.txt` / `.csv`** — Presidio rewrites the string in place
  ([`anonymize_text`](nophi/redactor.py)).
- **`.docx`** — the anonymized text is written back into the paragraph/cell,
  preserving document structure.
- **`.xlsx`** — matching cell values are overwritten.
- **`.pdf`** — each detected span is mapped back to the exact word bounding
  boxes it covers; `page.add_redact_annot()` draws a filled black box with a
  short white label, and `page.apply_redactions()` **permanently removes** the
  underlying text from the PDF content stream (a true irreversible redaction,
  not just a visual cover).

### 4. Report — Excel findings

[nophi/reporter.py](nophi/reporter.py) writes an `openpyxl` workbook (default
`phi_report.xlsx`) with two sheets:

- **Findings** — one row per detection: file, entity type, original text,
  replacement, character position.
- **Summary** — entity-type counts and per-file PHI counts.
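
The Summary sheet's two tallies amount to a pair of counts over the findings rows. The sketch below shows only that aggregation and leaves out the `openpyxl` writing; the row keys `file` and `entity_type` and the function name `summarize` are assumptions.

```python
from collections import Counter


def summarize(findings):
    """Return (entity-type counts, per-file counts) for a list of findings.

    Each finding is assumed to be a dict with at least the keys
    "file" and "entity_type".
    """
    by_type = Counter(f["entity_type"] for f in findings)
    by_file = Counter(f["file"] for f in findings)
    return by_type, by_file


# Sample rows with made-up values.
findings = [
    {"file": "a.txt", "entity_type": "PERSON"},
    {"file": "a.txt", "entity_type": "US_SSN"},
    {"file": "b.pdf", "entity_type": "PERSON"},
]
by_type, by_file = summarize(findings)
```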

---

## Models & first run

The NLP models are **downloaded on first use and cached** under
`~/.cache/no-phi/models/` — they are not bundled into the program.
[nophi/models.py](nophi/models.py) handles fetching, extracting, and
(for the scispaCy models) patching them to load under the installed spaCy
version.

| Model | Size | Source |
| --- | --- | --- |
| `en_core_web_lg` (base NER) | ~560 MB | spaCy GitHub releases (pip wheel) |
| `en_ner_bc5cdr_md` | ~115 MB | scispaCy S3 release (`.tar.gz`) |
| `en_ner_bionlp13cg_md` | ~120 MB | scispaCy S3 release (`.tar.gz`) |

> The scispaCy biomedical models load with **plain spaCy** — the heavyweight
> `scispacy` package (and its `nmslib`/`scipy`/`scikit-learn` dependencies) is
> **not** required. [nophi/models.py](nophi/models.py) rewrites a stale boolean
> in each model's `config.cfg` during extraction so it validates under spaCy 3.8.

Run `python main.py download-models` to fetch everything ahead of time, or just
run a scan and the models download automatically on first invocation.
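
The download-on-first-use behavior boils down to a cache check before each load. A minimal sketch, assuming one directory per model under the cache root: the `ensure_model` name and the `fetch` callable are hypothetical, and the real download, extraction, and config patching in `nophi/models.py` are not shown.

```python
from pathlib import Path
from typing import Callable, Optional


def default_cache_root() -> Path:
    """Cache location stated in this README."""
    return Path.home() / ".cache" / "no-phi" / "models"


def ensure_model(
    name: str,
    fetch: Callable[[Path], None],
    cache_root: Optional[Path] = None,
) -> Path:
    """Return the cached model directory, calling `fetch` only if it is missing.

    `fetch` is a hypothetical callable that downloads and unpacks the
    model into the target directory it is given.
    """
    root = cache_root if cache_root is not None else default_cache_root()
    target = root / name
    if not target.exists():
        root.mkdir(parents=True, exist_ok=True)
        fetch(target)
    return target
```

A second call with the same name returns the cached path without invoking `fetch` again, which is why only the first scan pays the download cost.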

---

## Project layout

```
pyproject.toml       # packaging + dependencies + `nophi` console entry point
main.py              # thin entry-point shim → nophi.cli:main (used by Nuitka build)
nophi/               # the package
├── __main__.py      # enables `python -m nophi`
├── cli.py           # Typer app + orchestration
├── analyzer.py      # build_analyzer() + scan_text()  (detection)
├── recognizers.py   # custom recognizers (MEDICAL_TERM layers + street addresses)
├── redactor.py      # anonymization
├── reporter.py      # Excel findings report
├── models.py        # model download / cache
├── handlers/        # per-format read/write/redact (text, docx, xlsx, pdf)
└── data/            # medical_terms.py + bundled rxnorm_names.txt.gz
scripts/             # build_rxnorm_list.py (refreshes the bundled RxNorm list)
docs/                # expansion_notes.md (user guide lives in the repo-root docs/)
```

---

## Install

### As a pip package (recommended)

Installs a `nophi` command on your PATH:

```bash
pip install .                  # or `pip install nophi` once published to PyPI
nophi download-models         # one-time: fetch all NLP models (~800 MB total)
nophi scan report.pdf
```

Once the package is installed, you can also run it as a module instead of the console script: `python -m nophi scan ...`.

### For development

```bash
pip install -e .               # editable install (deps come from pyproject.toml)
# or: pip install -r requirements.txt
python main.py download-models # optional: pre-fetch models
```

See [the user guide](../../docs/nophi-user-guide.md) for end-user instructions.
