Metadata-Version: 2.4
Name: besshouka
Version: 0.1.1a2
Summary: Local-first Japanese PII anonymization engine
Project-URL: Homepage, https://github.com/akhi/besshouka
Project-URL: Repository, https://github.com/akhi/besshouka
Project-URL: Issues, https://github.com/akhi/besshouka/issues
Author: akhi
License-Expression: MIT
License-File: LICENSE
Keywords: anonymization,ginza,japanese,nlp,pii,privacy
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Security
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: <3.14,>=3.11
Requires-Dist: cryptography<47.0,>=44.0.1
Requires-Dist: ginza<6.0,>=5.2
Requires-Dist: ja-ginza<6.0,>=5.2
Requires-Dist: pyyaml<7.0,>=6.0
Requires-Dist: rich<14.0,>=13.0
Requires-Dist: spacy<4.0,>=3.7
Requires-Dist: sudachidict-core>=20240109
Requires-Dist: sudachipy<1.0,>=0.6
Requires-Dist: typer<1.0,>=0.9
Provides-Extra: dev
Requires-Dist: pytest-cov<7.0,>=5.0; extra == 'dev'
Requires-Dist: pytest<10.0,>=8.0; extra == 'dev'
Description-Content-Type: text/markdown

# Besshouka (別称化)

A local-first Japanese PII anonymization engine. Besshouka detects personally identifiable information (PII), payment card data (PCI), and protected health information (PHI) in Japanese text and transforms it using configurable rules — all without sending data to any external service.

> **Note:** Besshouka is in early development (alpha). It is not yet recommended for production use. Contributions to improve accuracy, coverage, and robustness are welcome — see [CONTRIBUTING.md](CONTRIBUTING.md).

## Why Besshouka?

- **Japanese-native** — built specifically for Japanese data patterns: マイナンバー, Japanese phone formats, postal codes, full-width character handling, and GiNZA-powered NER for names, organizations, and locations.
- **Local-first** — everything runs on your machine. No cloud APIs, no data leaves the device.
- **Pluggable** — add custom regex recognizers via YAML, write your own operators in Python, or plug in any importable function as a custom operator. No forking required.
- **Auditable** — every anonymization operation is logged in an audit trail with the original text, the operator used, and the new indices.

## Quick Start

```bash
pip install besshouka
```

### Anonymize text

```bash
besshouka anonymize "田中太郎の電話番号は090-1234-5678です"
# Output: <氏名>の電話番号は090-1234-****です
```

### Analyze (detect only)

```bash
besshouka analyze --explain "田中太郎の電話番号は090-1234-5678です"
```

### Adjust confidence threshold

Both commands support `--threshold` / `-t` to filter by confidence score:

```bash
# Anonymize: only anonymize detections with confidence >= threshold (default: 0.5)
besshouka anonymize --threshold 0.3 "番号は123456789018です"

# Analyze: only display detections with confidence >= threshold (default: 0.0)
besshouka analyze --threshold 0.5 --explain "マイナンバーは123456789018です"
```

Detections below the threshold are still detected internally but excluded from output. For example, a 12-digit number matching the My Number check digit but lacking context keywords scores 0.4 and is left untouched at the default anonymization threshold.

### Use custom rules

```bash
besshouka anonymize \
  --recognizers my_patterns.yaml \
  --rules my_operators.yaml \
  --input document.txt \
  --output anonymized.txt
```

## Programmatic Usage

```python
from besshouka.config.loader import load_recognizer_config, load_operator_config
from besshouka.orchestrator.pipeline import run

rec_config = load_recognizer_config("path/to/recognizers.yaml")
op_config = load_operator_config("path/to/operators.yaml")

ctx = run("田中太郎の電話番号は090-1234-5678です", rec_config, op_config,
          score_threshold=0.5)

print(ctx.engine_result.text)   # anonymized text
print(ctx.engine_result.items)  # audit trail
```

## Architecture

```
Text In → [Analyzer] → [Anonymizer] → Text Out
```

| Module        | Role                                              |
|---------------|---------------------------------------------------|
| **Analyzer**  | Detects PII using regex patterns + GiNZA NER      |
| **Anonymizer**| Transforms PII using pluggable operators           |
| **Orchestrator** | Wires analyzer and anonymizer into a pipeline  |

Each module has its own README with extension guides. See the [`besshouka/`](besshouka/) directory.

## Built-in Recognizers

| Pattern           | Entity Type       |
|-------------------|-------------------|
| Mobile phone      | `PHONE_NUMBER`    |
| Landline phone    | `PHONE_NUMBER`    |
| Toll-free phone   | `PHONE_NUMBER`    |
| Email address     | `EMAIL`           |
| マイナンバー       | `MY_NUMBER` (check digit + context-aware scoring) |
| Postal code       | `POSTAL_CODE`     |
| Credit card       | `CREDIT_CARD`     |
| Bank account      | `BANK_ACCOUNT`    |
| Driver's license  | `DRIVERS_LICENSE` |
| Passport          | `PASSPORT`        |
| Person names      | `PERSON` (GiNZA)  |
| Organizations     | `ORGANIZATION` (GiNZA) |
| Locations         | `LOCATION` (GiNZA)|

## Built-in Operators

| Operator  | What it does                          |
|-----------|---------------------------------------|
| `replace` | Substitute with a fixed value         |
| `mask`    | Mask characters from end with a symbol|
| `redact`  | Remove entirely                       |
| `hash`    | Salted SHA-256 hex digest             |
| `encrypt` | Fernet symmetric encryption           |
| `keep`    | Pass through unchanged                |
| `custom`  | Call any importable Python function    |

## Extending Besshouka

### Add a regex recognizer (no code)

Add an entry to your recognizers YAML:

```yaml
recognizers:
  - name: employee_id
    entity_type: EMPLOYEE_ID
    pattern: 'EMP-[A-Z]{2}\d{6}'
    score: 1.0
    source: custom
```

### Add a custom operator (no subclassing)

Write a function anywhere importable:

```python
def my_transform(text: str, params: dict) -> str:
    return text[::-1]  # reverse it, or whatever you need
```

Reference it in your operators YAML:

```yaml
operators:
  EMPLOYEE_ID:
    method: custom
    function: "my_module.my_transform"
```

## Development

```bash
git clone https://github.com/akhi/besshouka.git
cd besshouka
pip install -e ".[dev]"
```

### Running Tests

```bash
# All tests (excluding slow GiNZA model tests)
pytest tests/ -m "not slow"

# All tests including GiNZA
pytest tests/

# With coverage
pytest tests/ --cov=besshouka --cov-report=term-missing
```

## Requirements

- **Python >=3.11, <3.14** — Python 3.14 is not yet supported due to [PyO3](https://github.com/PyO3/pyo3) compatibility with SudachiPy (GiNZA's tokenizer). Python 3.13 is recommended.
- GiNZA / spaCy (for NER)
- See [requirements.txt](requirements.txt) for full list

## License

[MIT](LICENSE)
