Metadata-Version: 2.4
Name: pleno-pii-scanner
Version: 0.3.1
Summary: Japanese-first PII scanner for source repositories with gitleaks/trufflehog-like UX
Project-URL: Homepage, https://github.com/plenoai/pleno-anonymize
Project-URL: Source, https://github.com/plenoai/pleno-anonymize
Project-URL: Issues, https://github.com/plenoai/pleno-anonymize/issues
Author-email: pleno <ai@egahika.dev>
License: AGPL-3.0-or-later
Keywords: gitleaks,japanese,pii,presidio,scanner,trufflehog
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.12
Requires-Dist: aiosmtplib>=3.0
Requires-Dist: aiosqlite>=0.20
Requires-Dist: charset-normalizer>=3.4
Requires-Dist: click>=8.1
Requires-Dist: cryptography>=43
Requires-Dist: httpx>=0.27.0
Requires-Dist: pathspec>=0.12
Requires-Dist: pleno-recognizers>=0.1.1
Requires-Dist: presidio-analyzer>=2.2
Requires-Dist: spacy[ja]>=3.8
Provides-Extra: columnar
Requires-Dist: fastavro>=1.9; extra == 'columnar'
Requires-Dist: pyarrow>=17; extra == 'columnar'
Provides-Extra: hf
Requires-Dist: huggingface-hub>=0.23; extra == 'hf'
Requires-Dist: optimum[onnxruntime]>=1.21; extra == 'hf'
Requires-Dist: transformers>=4.40; extra == 'hf'
Provides-Extra: keyring
Requires-Dist: keyring>=24; extra == 'keyring'
Provides-Extra: office
Requires-Dist: openpyxl>=3.1; extra == 'office'
Requires-Dist: python-docx>=1.1; extra == 'office'
Requires-Dist: python-pptx>=1.0; extra == 'office'
Provides-Extra: otlp
Requires-Dist: opentelemetry-api>=1.27; extra == 'otlp'
Requires-Dist: opentelemetry-exporter-otlp>=1.27; extra == 'otlp'
Requires-Dist: opentelemetry-sdk>=1.27; extra == 'otlp'
Provides-Extra: pdf
Requires-Dist: pypdfium2>=4.30; extra == 'pdf'
Description-Content-Type: text/markdown

# pleno-pii-scanner

CLI that detects Japanese PII in repository contents, commit history, and staged hunks.

## Setup

Run straight from PyPI — no clone, no `uv sync`:

```sh
uvx pleno-pii-scanner --help
```

The `ja_ner_ja` spaCy model is downloaded into the uvx-managed environment on first NER invocation. To pin a persistent install, use `uv tool install pleno-pii-scanner` instead. Workspace contributors get the model wheel preinstalled via `uv sync` (it lives in the `dev` dependency group).

### Higher-precision HF backend (opt-in)

For DLP-grade workloads where false-positive `<ORGANIZATION>` masks are unacceptable, opt into the HuggingFace token-classification backend (model `model/v0.13.0`, HF Hub `0xhikae/ja-ner-onnx@v0.13.0`). It applies a per-label confidence floor (default `ORGANIZATION=0.99`). On the `v0.12.0/ja` adversarial corpus, overall F1 is 0.701, against 0.452 for the spaCy baseline.

```sh
PLENO_PII_SCANNER_BACKEND=hf \
  uvx --with 'pleno-pii-scanner[hf]' pleno-pii-scanner dir <path>
```

Tunables:

- `PLENO_PII_SCANNER_THRESHOLDS=ORGANIZATION=0.99,PERSON=0.0`: comma-separated per-label confidence floors (default `ORGANIZATION=0.99`).
- `PLENO_PII_SCANNER_HF_MODEL` / `PLENO_PII_SCANNER_HF_REVISION`: pin to a custom HF Hub repo / revision (default `0xhikae/ja-ner-onnx@v0.13.0`).

The HF backend adds ~600 MB of torch + transformers; the default install stays lightweight.
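
A minimal sketch of how a `LABEL=floor` string in the `PLENO_PII_SCANNER_THRESHOLDS` format could be parsed and applied, assuming predictions arrive as `(label, score)` pairs. `parse_thresholds` and `apply_floors` are hypothetical names, not the scanner's API. The treatment of labels without an entry is not documented here, so the `default_floor` of 0.0 is an assumption.

```python
def parse_thresholds(spec: str) -> dict:
    """Parse 'LABEL=floor,LABEL=floor' into {label: floor}."""
    floors = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        label, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected LABEL=FLOAT, got {item!r}")
        floors[label.strip().upper()] = float(value)
    return floors


def apply_floors(predictions, floors, default_floor=0.0):
    """Keep (label, score) pairs whose score meets that label's floor.

    default_floor is an assumption for labels with no explicit entry.
    """
    return [
        (label, score)
        for label, score in predictions
        if score >= floors.get(label, default_floor)
    ]


floors = parse_thresholds("ORGANIZATION=0.99,PERSON=0.0")
kept = apply_floors(
    [("ORGANIZATION", 0.95), ("ORGANIZATION", 0.995), ("PERSON", 0.30)],
    floors,
)
```

With these inputs only the lower-scoring `ORGANIZATION` prediction is dropped, which is the trade the backend makes: fewer organization masks in exchange for fewer false positives.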

## Subcommands

```sh
uvx pleno-pii-scanner dir <path>                # walk a directory
uvx pleno-pii-scanner git <path>                # working tree plus commit history
uvx pleno-pii-scanner github <owner>/<repo>     # shallow clone, then scan
uvx pleno-pii-scanner github --org <org>        # enumerate org repos via gh CLI, then scan all
uvx pleno-pii-scanner baseline <path>           # write current findings as a suppression list
uvx pleno-pii-scanner protect                   # scan only staged hunks for pre-commit hooks
```

## Local vs. offload

Default mode runs Presidio, spaCy NER, and regex on this machine. Pass `--base-url` to offload the same pipeline to a remote pleno-anonymize endpoint.

```sh
uvx pleno-pii-scanner dir ./my-repo --base-url https://pleno-anonymize.fly.dev
PLENO_BASE_URL=... uvx pleno-pii-scanner dir ./my-repo
uvx pleno-pii-scanner dir ./my-repo --base-url ... --api-key "$PLENO_API_KEY"
```

Both modes return the same entity set. Git history scans are always regex-only, because per-line NER is not worth the cost on short diff lines.

## Detected entities

NER from `ja_ner_ja` plus Presidio: `PERSON` `ADDRESS` `ORGANIZATION` `DATE_OF_BIRTH` `BANK_ACCOUNT`

Regex plus checksum: `PHONE_NUMBER` `MY_NUMBER` `MY_NUMBER_CORPORATE` `CREDIT_CARD` `PASSPORT` `DRIVER_LICENSE` `HEALTH_INSURANCE` `RESIDENCE_CARD` `POSTAL_CODE` `EMAIL_ADDRESS` `IP_ADDRESS` `URL`

`URL`, `HEALTH_INSURANCE`, and `DRIVER_LICENSE` are excluded from the default profile because they fire too often in source repos. Pass `--entities ALL` to include them, or `--entities PHONE_NUMBER,EMAIL_ADDRESS` to scan a specific subset.
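
The selection rules above can be sketched as a small resolver. This is an illustration only: `resolve_entities` is a hypothetical helper, and it assumes the default profile is every listed entity except the three noisy ones, which the text implies but does not state outright.

```python
from typing import Optional

NER_ENTITIES = ["PERSON", "ADDRESS", "ORGANIZATION", "DATE_OF_BIRTH", "BANK_ACCOUNT"]
REGEX_ENTITIES = [
    "PHONE_NUMBER", "MY_NUMBER", "MY_NUMBER_CORPORATE", "CREDIT_CARD",
    "PASSPORT", "DRIVER_LICENSE", "HEALTH_INSURANCE", "RESIDENCE_CARD",
    "POSTAL_CODE", "EMAIL_ADDRESS", "IP_ADDRESS", "URL",
]
ALL_ENTITIES = NER_ENTITIES + REGEX_ENTITIES
# Excluded from the default profile because they fire too often in source repos.
NOISY_ENTITIES = {"URL", "HEALTH_INSURANCE", "DRIVER_LICENSE"}


def resolve_entities(option: Optional[str]) -> list:
    """Turn an --entities value into a concrete entity list."""
    if option is None:
        # Assumed default profile: everything except the noisy entities.
        return [e for e in ALL_ENTITIES if e not in NOISY_ENTITIES]
    if option.strip().upper() == "ALL":
        return list(ALL_ENTITIES)
    requested = [name.strip().upper() for name in option.split(",") if name.strip()]
    unknown = [name for name in requested if name not in ALL_ENTITIES]
    if unknown:
        raise ValueError(f"unknown entities: {', '.join(unknown)}")
    return requested
```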

## Verification

Each finding carries one of three labels.

- `passed` — checksum validated by Luhn, My Number, or corporate-number rules, or a contextual keyword sits within range.
- `failed` — checksum failed; likely a false positive.
- `unverified` — no validator matched and no contextual keyword was found.

`--only-verified` keeps `passed` only.
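
To make the three labels concrete, here is a sketch that pairs a standard Luhn check (the checksum used for card numbers) with a label decision. `luhn_valid` and `verification_label` are hypothetical names. The precedence is also an assumption: this sketch lets a failed checksum win over a contextual keyword, which the list above does not specify.

```python
from typing import Optional


def luhn_valid(number: str) -> bool:
    """Standard Luhn check; separators such as '-' and ' ' are ignored."""
    digits = [int(c) for c in number if c.isascii() and c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def verification_label(checksum_ok: Optional[bool], has_context_keyword: bool) -> str:
    """checksum_ok is None when no validator applies to the entity."""
    if checksum_ok is False:
        return "failed"  # assumption: a failed checksum overrides context
    if checksum_ok or has_context_keyword:
        return "passed"
    return "unverified"


card_label = verification_label(luhn_valid("4111-1111-1111-1111"), False)
typo_label = verification_label(luhn_valid("4111-1111-1111-1112"), False)
```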

## Output

| `--report-format` | Use case |
|---|---|
| `human` (default) | colorized table on stdout |
| `json` | machine-readable |
| `sarif` | SARIF 2.1.0 for GitHub Code Scanning |

`--report-path FILE` writes to a file. Exit code is `0` for no findings, `1` when findings are present, `2` for usage errors.
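
A CI wrapper might branch on those exit codes as sketched below. `build_command`, `describe_exit`, and `run_scan` are hypothetical helpers. The command uses only the flags documented in this README, and the sketch assumes `uvx` is on `PATH`.

```python
import subprocess


def build_command(path: str, report_path: str) -> list:
    """Assemble a SARIF scan of a directory using the documented flags."""
    return [
        "uvx", "pleno-pii-scanner", "dir", path,
        "--report-format", "sarif",
        "--report-path", report_path,
    ]


def describe_exit(code: int) -> str:
    """Map the documented exit codes to a short status string."""
    return {0: "clean", 1: "findings", 2: "usage-error"}.get(code, "unexpected")


def run_scan(path: str = ".", report_path: str = "pleno.sarif") -> str:
    """Run the scan without raising on non-zero exit, then classify the result."""
    result = subprocess.run(build_command(path, report_path), check=False)
    return describe_exit(result.returncode)
```

Treating `1` separately from `2` lets a pipeline upload the SARIF report when findings exist while still failing hard on a misconfigured invocation.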

## DB-cluster mode (recommended for repo audits)

Repository-level PII risk follows database shape, not single mentions. A
contact email in a CODE_OF_CONDUCT is one identifiable person, not an
exfiltration target; a CSV row with name + phone + email + my_number is.

`--db-only` keeps a finding only when its file or folder forms a cluster
of co-occurring detections with multiple distinct values:

```sh
uvx pleno-pii-scanner dir ./my-repo --db-only
uvx pleno-pii-scanner github owner/repo --db-only
```

Tunables (defaults shown):

| Flag | Default | Meaning |
|---|---|---|
| `--db-file-threshold` | `2` | Minimum findings in one file to qualify as a DB cluster. |
| `--db-folder-threshold` | `3` | Minimum findings in one folder (for sharded-DB shape). |

`verification=failed` findings (e.g. an ISBN matched as `MY_NUMBER`) are
excluded from cluster computation, so an awesome-list of book links
cannot promote a folder to DB-shaped. On the v0.2.4 ten-repo Japanese eval,
this mode takes 6/10 repos from "findings to triage" to zero while
keeping every real exposure (resumes, PII fixture banks, contributor
lists).
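
The cluster rule can be sketched as follows. This is an illustration under stated assumptions, not the scanner's implementation: the `Finding` type and `db_only` function are hypothetical, "multiple distinct values" is interpreted as counting distinct `(entity, value)` pairs, and a finding is kept when either its file or its parent folder reaches the threshold.

```python
from collections import defaultdict
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Finding:  # hypothetical shape, for illustration only
    path: str
    entity: str
    value: str
    verification: str = "unverified"


def db_only(findings, file_threshold=2, folder_threshold=3):
    """Keep findings whose file or folder forms a cluster of distinct detections."""
    by_file = defaultdict(set)
    by_folder = defaultdict(set)
    for f in findings:
        if f.verification == "failed":
            continue  # failed findings never count toward a cluster
        key = (f.entity, f.value)
        by_file[f.path].add(key)
        by_folder[str(PurePosixPath(f.path).parent)].add(key)
    kept = []
    for f in findings:
        folder = str(PurePosixPath(f.path).parent)
        if (len(by_file[f.path]) >= file_threshold
                or len(by_folder[folder]) >= folder_threshold):
            kept.append(f)
    return kept


sample = [
    Finding("CODE_OF_CONDUCT.md", "EMAIL_ADDRESS", "contact@example.com"),
    Finding("data/users.csv", "PERSON", "山田太郎"),
    Finding("data/users.csv", "PHONE_NUMBER", "090-1234-5678"),
    Finding("books/list.md", "MY_NUMBER", "978410101001", "failed"),
    Finding("books/list.md", "MY_NUMBER", "978410101002", "failed"),
]
clustered = db_only(sample)
```

In this sample the lone contact email and the ISBN-like matches drop out, while both rows from `data/users.csv` survive.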

## Suppression

A `.plenoignore` file at the repo root is read automatically.

```
docs/samples/**          # path glob in gitignore syntax
PHONE_NUMBER             # entity-wide
finding:7a3b8c9d         # specific finding fingerprint
```
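
The three rule kinds in that file could be told apart as sketched here. The classification heuristic is an assumption: `finding:` prefix for fingerprints, an all-caps token for entities, anything else as a path glob. So is the handling of trailing `#` comments, and the scanner's real parser may differ. `parse_plenoignore` and `IgnoreRules` are hypothetical names.

```python
import re
from dataclasses import dataclass, field

# Heuristic: entity names are upper-case words joined by underscores.
ENTITY_RE = re.compile(r"[A-Z][A-Z_]*")


@dataclass
class IgnoreRules:
    path_globs: list = field(default_factory=list)
    entities: set = field(default_factory=set)
    fingerprints: set = field(default_factory=set)


def parse_plenoignore(text: str) -> IgnoreRules:
    rules = IgnoreRules()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        if line.startswith("finding:"):
            rules.fingerprints.add(line[len("finding:"):])
        elif ENTITY_RE.fullmatch(line):
            rules.entities.add(line)
        else:
            rules.path_globs.append(line)
    return rules


rules = parse_plenoignore(
    "docs/samples/**          # path glob in gitignore syntax\n"
    "PHONE_NUMBER             # entity-wide\n"
    "finding:7a3b8c9d         # specific finding fingerprint\n"
)
```

One limitation of this heuristic is that an all-caps file name such as `LICENSE` would be read as an entity. The collected globs could then be matched with `pathspec`, which the package already depends on.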

Inline directives:

```py
SUPPORT_PHONE = "0120-123-456"  # pleno:ignore PHONE_NUMBER
EXAMPLE_EMAIL = "user@example.com"  # pleno:ignore
```
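
A sketch of how such directives might be recognized, assuming only the `# pleno:ignore [ENTITY]` comment form shown above. `inline_ignores` and `is_suppressed` are hypothetical helpers, and support for a comma-separated entity list is an assumption.

```python
import re
from typing import Optional

# Matches "# pleno:ignore" at end of line, optionally followed by entity names.
DIRECTIVE_RE = re.compile(r"#\s*pleno:ignore(?:[ \t]+([A-Z_, \t]+))?\s*$")


def inline_ignores(line: str) -> Optional[frozenset]:
    """None: no directive. Empty set: ignore everything. Else: named entities."""
    match = DIRECTIVE_RE.search(line)
    if match is None:
        return None
    names = match.group(1)
    if not names:
        return frozenset()
    return frozenset(n for n in re.split(r"[,\s]+", names.strip()) if n)


def is_suppressed(line: str, entity: str) -> bool:
    ignored = inline_ignores(line)
    return ignored is not None and (not ignored or entity in ignored)


phone_line = 'SUPPORT_PHONE = "0120-123-456"  # pleno:ignore PHONE_NUMBER'
email_line = 'EXAMPLE_EMAIL = "user@example.com"  # pleno:ignore'
```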

`pleno-pii-scanner baseline` writes a fingerprint list of current findings; passing `--baseline FILE` later suppresses those known findings.

## Key flags

| Flag | Default | Role |
|---|---|---|
| `--entities` | default profile | restrict detection set, e.g. `PHONE_NUMBER,EMAIL_ADDRESS` or `ALL` |
| `--language` | `ja` | analysis language, `ja` or `en` |
| `--base-url` | unset | offload to a remote pleno-anonymize |
| `--api-key` | unset | Bearer token for offload |
| `--concurrency` | 8 | parallel HTTP requests in offload mode |
| `--include` / `--exclude` | unset | gitignore-style file filters |
| `--max-file-size` | 1 MB | files larger than this are skipped |
| `--only-verified` | off | keep `passed` findings only |
| `--report-format` | `human` | `human`, `json`, or `sarif` |
| `--baseline` | unset | fingerprint JSON of known findings to suppress |

Three file filters are on by default: `.gitignore` rules, a built-in skip list for `.git`, `node_modules`, `.venv`, `dist`, `build`, `vendor`, and similar directories, and a NUL-byte binary check.
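
The default skip checks can be approximated as below. This is a sketch only: `skip_reason` is a hypothetical helper, the skip set lists just the directories named above, `.gitignore` handling is left out, and treating 1 MB as 1,048,576 bytes is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional

SKIP_DIRS = {".git", "node_modules", ".venv", "dist", "build", "vendor"}
MAX_FILE_SIZE = 1024 * 1024  # assumption: "1 MB" means 1 MiB


def skip_reason(rel_path: str, size: int, head: bytes) -> Optional[str]:
    """Return why a file would be skipped, or None if it should be scanned.

    rel_path is relative to the scan root; head is the first chunk of the file.
    """
    parent_dirs = PurePosixPath(rel_path).parts[:-1]
    if SKIP_DIRS.intersection(parent_dirs):
        return "skip-list"
    if size > MAX_FILE_SIZE:
        return "too-large"
    if b"\x00" in head:
        return "binary"  # NUL byte in the sampled chunk
    return None
```

A caller would read the first few kilobytes of each file in binary mode and pass them as `head`. The chunk size is a tuning choice this README does not specify.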
