Metadata-Version: 2.4
Name: nophi-ui
Version: 0.1.1
Summary: Local web review interface for nophi / nophi-av PHI redaction
License: MIT
Project-URL: Homepage, https://github.com/kshen3778/no-phi
Project-URL: Repository, https://github.com/kshen3778/no-phi
Keywords: phi,pii,redaction,review,ui
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: nophi>=0.1
Requires-Dist: nophi-av>=0.1
Requires-Dist: fastapi>=0.110
Requires-Dist: uvicorn[standard]>=0.29
Requires-Dist: python-multipart>=0.0.9
Dynamic: license-file

# nophi-ui

A local web review interface for the [`nophi`](../nophi) (documents) and
[`nophi-av`](../nophi-av) (audio/video) PII/PHI redaction engines.

It lets you run detection locally from the browser, **remove false-positive detections**
(and, for audio, add a missed segment by hand) before redaction is applied, and
view the redacted result.

## Run

```bash
nophi-ui            # opens http://127.0.0.1:8000
nophi-ui --port 9000 --no-open
```

You select a server-side **input directory** and **output directory** (raw paths),
preview the files that will be processed, then start detection.

## Prerequisites

- **Python 3.10+ — 3.12 recommended.** 3.12 is what the app is tested
  against; very new releases (e.g. 3.14) may not work.
- `pip install nophi-ui` pulls in everything else automatically: the document and
  audio/video engines (`nophi`, `nophi-av`), FastAPI, and the ML stack. No separate installs are needed and no API keys are required.
- Models are downloaded on first use and then cached.

  To fetch them ahead of time instead of on the first run:

  ```bash
  nophi download-models        # document NLP models
  nophi-av download-models     # audio/video models
  ```

## Usage

The interface is a single page with two phases.

**1. Setup.** Type or **Browse…** to an input and output folder, pick the options
below, then **Preview files** to confirm what will be processed and **Start
detection** to run:

- **Audio redaction** — how PHI is removed from audio: `beep` (overlay a tone) or
  `silence` (mute the span).
- **Whisper model** — the speech-to-text model used to transcribe audio/video:
  - `tiny` — fastest, least accurate
  - `base` — middle ground
  - `small` (default) — most accurate of the three, slowest

  Documents don't use Whisper; their detection runs automatically with spaCy and
  biomedical NER, so there is nothing to choose.
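
To make the difference between the two audio redaction modes concrete, here is a minimal sketch on a plain list of float samples. It is not the `nophi-av` implementation: the `scrub` function, its parameters, the 1 kHz tone, and the choice to replace the span with the tone (rather than mix it over the speech) are all assumptions made for illustration.

```python
import math


def scrub(samples, rate, start_s, end_s, mode="beep", tone_hz=1000.0):
    """Return a copy of `samples` with the span [start_s, end_s) redacted.

    Hypothetical helper, not part of nophi-av. `samples` are floats in
    [-1.0, 1.0] and `rate` is the sample rate in Hz.
    """
    if mode not in ("beep", "silence"):
        raise ValueError(f"unknown mode: {mode!r}")

    out = list(samples)  # never modify the original audio
    lo = max(0, int(start_s * rate))
    hi = min(len(out), int(end_s * rate))

    for i in range(lo, hi):
        if mode == "silence":
            out[i] = 0.0  # mute the span
        else:
            # Assumption: the tone replaces the speech so none of it survives.
            out[i] = 0.5 * math.sin(2 * math.pi * tone_hz * i / rate)
    return out
```

Because the function returns a copy, the same original can be scrubbed again with a different set of spans or a different mode.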

**2. Review & apply.** When detection finishes, open each file to see its
detections, remove any false positives (and, for audio, add missed segments), then
**Apply** to write the redacted result. Apply is re-runnable — toggle detections
and re-apply until you're satisfied; nothing is final until you stop the server.
See [What it does](#what-it-does) below for the per-format specifics. Redacted
files and an Excel report are written to your output folder.

## What it does

- **Documents** (`.txt .csv .docx .xlsx .pdf`): detect → review the detection
  list → uncheck false positives → apply. PDF previews inline; docx/xlsx are
  download-only.
- **Audio**: detect → review (play the original clip per detection) → uncheck
  false positives and/or add missed `start/end` segments → apply (re-scrubs from
  the original; no re-transcription).
- **Video**: view-only. Video is redacted in one shot, and detections are shown
  for reference only. Video redaction is still in development.
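
The audio review step above amounts to combining two lists: the detections that are still checked, and any segments added by hand. The sketch below shows one way that combination could work. The field names (`start`, `end`, `checked`) and the merging of overlapping spans are assumptions for illustration, not the app's confirmed behavior.

```python
def spans_to_apply(detections, manual_segments):
    """Build the final list of (start, end) spans, in seconds, to scrub.

    Hypothetical helper. `detections` is a list of dicts with "start",
    "end", and "checked" keys; `manual_segments` is a list of
    (start, end) tuples added by the reviewer.
    """
    # Keep only detections the reviewer left checked.
    spans = [(d["start"], d["end"]) for d in detections if d["checked"]]
    # Add hand-entered segments, ignoring empty or inverted ones.
    spans += [(s, e) for s, e in manual_segments if e > s]
    spans.sort()

    # Merge overlapping or touching spans so each region is scrubbed once.
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Since the spans are always recomputed from the review state and applied to the original file, unchecking a detection and applying again restores that part of the audio.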

## PDF redaction-box labels

In redacted PDFs, each box is stamped with a 2-letter code rendered as `<XX>`
instead of the full entity name, because full names like `<ORGANIZATION>` don't
fit over short spans such as "LLC":

| Code | Entity type    | Code | Entity type        |
| ---- | -------------- | ---- | ------------------ |
| `PR` | PERSON         | `SS` | US_SSN             |
| `OR` | ORGANIZATION   | `BK` | US_BANK_NUMBER     |
| `LO` | LOCATION       | `DL` | US_DRIVER_LICENSE  |
| `DT` | DATE_TIME      | `PP` | US_PASSPORT        |
| `PH` | PHONE_NUMBER   | `IT` | US_ITIN            |
| `EM` | EMAIL_ADDRESS  | `ML` | MEDICAL_LICENSE    |
| `CC` | CREDIT_CARD    | `IB` | IBAN_CODE          |
| `IP` | IP_ADDRESS     | `NR` | NRP                |
| `UR` | URL            |      |                    |

The review table in the output report always shows the full entity type; the
abbreviations appear only inside the PDF boxes.
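
If you post-process redacted PDFs or reports, the table above can be expressed as a lookup. The mapping values come straight from the table; the names `PDF_LABEL_CODES` and `box_label` are made up for this sketch and are not part of the `nophi` API.

```python
# Entity type -> 2-letter code, transcribed from the table above.
PDF_LABEL_CODES = {
    "PERSON": "PR",
    "ORGANIZATION": "OR",
    "LOCATION": "LO",
    "DATE_TIME": "DT",
    "PHONE_NUMBER": "PH",
    "EMAIL_ADDRESS": "EM",
    "CREDIT_CARD": "CC",
    "IP_ADDRESS": "IP",
    "URL": "UR",
    "US_SSN": "SS",
    "US_BANK_NUMBER": "BK",
    "US_DRIVER_LICENSE": "DL",
    "US_PASSPORT": "PP",
    "US_ITIN": "IT",
    "MEDICAL_LICENSE": "ML",
    "IBAN_CODE": "IB",
    "NRP": "NR",
}


def box_label(entity_type):
    """Return the `<XX>` label for an entity type (hypothetical helper).

    Raises KeyError for entity types that are not in the table.
    """
    return f"<{PDF_LABEL_CODES[entity_type]}>"


print(box_label("ORGANIZATION"))  # <OR>
```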

## Security

This tool serves PII/PHI, so by design it:

- binds locally to **`127.0.0.1` only** (refuses other hosts),
- locks **CORS** to its own origin,
- requires a **per-launch token** on every API call,
- serves files by **opaque job/file id** (never a client-supplied path),
- marks PHI responses **`Cache-Control: no-store`** and serves only the clipped
  segment for audio review.
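
Two of these measures, the per-launch token and the opaque file ids, can be sketched with the standard library alone. This shows the general pattern rather than the `nophi-ui` code: every name below (`LAUNCH_TOKEN`, `token_ok`, `register_file`, `resolve_file`, `PHI_HEADERS`) is hypothetical.

```python
import hmac
import secrets
import uuid

# A fresh random token per process start; a token from an earlier launch
# is useless after a restart.
LAUNCH_TOKEN = secrets.token_urlsafe(32)

# Header attached to any response that carries PHI.
PHI_HEADERS = {"Cache-Control": "no-store"}

# Opaque id -> server-side path. Clients only ever see the id.
_files = {}


def token_ok(presented):
    """Check a client-supplied token in constant time."""
    if not isinstance(presented, str):
        return False
    return hmac.compare_digest(presented.encode(), LAUNCH_TOKEN.encode())


def register_file(path):
    """Store a server-side path and return an unguessable id for it."""
    file_id = uuid.uuid4().hex
    _files[file_id] = path
    return file_id


def resolve_file(file_id):
    """Look up a path by id; unknown ids (including path-like strings) return None."""
    return _files.get(file_id)
```

Because lookups go through the id table, a request for something like `../../etc/passwd` matches no entry and never reaches the filesystem.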

State is held **in memory** for the process lifetime; closing the server clears
it (a restart means re-running detection).
