Metadata-Version: 2.4
Name: yomail
Version: 0.1.1
Summary: Extract body text from Japanese business emails
Project-URL: Homepage, https://github.com/terallite/yomail
Project-URL: Repository, https://github.com/terallite/yomail
Project-URL: Issues, https://github.com/terallite/yomail/issues
Author: terallite, Linus Tan
License-Expression: MIT
License-File: LICENSE
Keywords: crf,email,japanese,nlp,text-extraction
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Japanese
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Communications :: Email
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.12
Requires-Dist: neologdn>=0.5.6
Requires-Dist: python-crfsuite>=0.9.12
Requires-Dist: pyyaml>=6.0
Description-Content-Type: text/markdown

# yomail (読メール)

yomail extracts the body text from Japanese business emails. It uses a CRF (Conditional Random Field) model to label each line, then assembles the body from the labeled lines.

## Features

- Handles formal and informal Japanese business emails
- Detects and excludes signatures, greetings, closings, and quoted content
- Works with forwarded emails, replies, and inline quotes
- Returns confidence scores for quality control
- Small model size (12 KB)
- Fast inference (~10-30 ms)

## Installation

```bash
pip install yomail
```

Requires Python 3.12+.

## Usage

```python
from yomail import EmailBodyExtractor

extractor = EmailBodyExtractor()
# email_text (used below) is a str containing the email text

# Raises on failure
body = extractor.extract(email_text)

# Returns None on failure
body = extractor.extract_safe(email_text)

# Full result with metadata
result = extractor.extract_with_metadata(email_text)
print(result.body)
print(result.confidence)
print(result.signature_detected)
```
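
The confidence score can drive a simple review queue. The sketch below is illustrative only: `triage` and the 0.8 cutoff are hypothetical, and `SimpleNamespace` stands in for the object returned by `extract_with_metadata` so the snippet runs without the library or a trained model. It reads only the `body` and `confidence` attributes shown above.

```python
from types import SimpleNamespace


def triage(result, review_below=0.8):
    """Route an extraction result to 'accept' or 'review' (hypothetical helper)."""
    # Only attributes from the usage example are read: body and confidence.
    if not result.body or result.confidence < review_below:
        return "review"
    return "accept"


# Stand-in objects; real ones would come from extractor.extract_with_metadata(...)
confident = SimpleNamespace(body="資料を添付いたします。", confidence=0.97)
uncertain = SimpleNamespace(body="資料を添付いたします。", confidence=0.55)

print(triage(confident))
print(triage(uncertain))
```

The cutoff is a policy choice rather than a library setting; tune it against your own mail.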

### Example

Input:

```
株式会社サンプル
田中様

お世話になっております。
山田です。

先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。

以上

--
山田太郎
株式会社テスト
TEL: 03-1234-5678
```

Output:

```
お世話になっております。
山田です。

先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。

以上
```

## How It Works

The extraction pipeline:

1. **Normalize** — Line endings, neologdn normalization, NFKC
2. **Analyze structure** — Quote depth, forward/reply headers, delimiters
3. **Extract features** — Position, character ratios, pattern matches
4. **Label with CRF** — GREETING, BODY, CLOSING, SIGNATURE, QUOTE, OTHER
5. **Assemble body** — Find signature boundary, handle inline quotes, merge blocks

See [ARCHITECTURE.md](ARCHITECTURE.md) for details.
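
As a rough illustration of steps 1 to 3, the standard-library sketch below normalizes text, measures quote depth, and computes a few per-line features. It is not yomail's implementation: the function names, the feature set, and the `>` quote convention are assumptions, and the neologdn pass is omitted so the snippet has no third-party dependencies.

```python
import re
import unicodedata


def normalize(text):
    """Unify line endings and apply NFKC (the neologdn step is omitted here)."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFKC", text)


def quote_depth(line):
    """Count leading '>' markers, allowing whitespace between them."""
    prefix = re.match(r"[>\s]*", line).group(0)
    return prefix.count(">")


def is_japanese(char):
    """True for hiragana, katakana, and common CJK ideographs."""
    return "\u3040" <= char <= "\u30ff" or "\u4e00" <= char <= "\u9fff"


def line_features(line, index, total):
    """Position and character-ratio features for one line (illustrative set)."""
    chars = [c for c in line if not c.isspace()]
    n = len(chars) or 1  # avoid division by zero on blank lines
    return {
        "position": index / max(total - 1, 1),
        "japanese_ratio": sum(is_japanese(c) for c in chars) / n,
        "digit_ratio": sum(c.isdigit() for c in chars) / n,
        "quote_depth": quote_depth(line),
        "is_delimiter": bool(re.fullmatch(r"\s*[-=_*]{2,}\s*", line)),
    }


raw = "田中様\r\n> 了解です\r\nＴＥＬ: 03-1234-5678\r\n--"
lines = normalize(raw).split("\n")
features = [line_features(line, i, len(lines)) for i, line in enumerate(lines)]
print(lines[2])
print(features[1]["quote_depth"], features[3]["is_delimiter"])
```

NFKC folds full-width Latin letters and symbols to their ASCII forms, which is why the third line prints with a plain `TEL` prefix and why pattern matching can run on a single canonical form.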

## Label Scheme

| Label     | Description                        |
| --------- | ---------------------------------- |
| GREETING  | Opening (お世話になっております)   |
| BODY      | Main content                       |
| CLOSING   | Closing (よろしくお願いいたします) |
| SIGNATURE | Sender information                 |
| QUOTE     | Quoted content                     |
| OTHER     | Separators, blank lines            |
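
To make the labels concrete, here is a deliberately simplified assembly pass over lines that already carry labels. It is a sketch under assumptions, not the library's algorithm: `assemble_body` is a hypothetical helper, the labels in the sample are hand-assigned, and the set of labels to keep is a parameter because this sketch does not try to mirror the library's exact policy. Inline-quote handling and block merging are left out.

```python
def assemble_body(labeled_lines, keep=("GREETING", "BODY", "CLOSING")):
    """Join kept lines that precede the first SIGNATURE line (hypothetical helper)."""
    kept = []
    for text, label in labeled_lines:
        if label == "SIGNATURE":
            break  # everything from the signature boundary onward is dropped
        if label in keep:
            kept.append(text)
        elif label == "OTHER" and kept and kept[-1] != "":
            kept.append("")  # collapse separators/blank lines into one blank
        # QUOTE lines and leading OTHER lines are skipped
    while kept and kept[-1] == "":
        kept.pop()  # trim trailing blanks
    return "\n".join(kept)


# Hand-labeled sample for illustration; real labels come from the CRF.
labeled = [
    ("田中様", "OTHER"),
    ("", "OTHER"),
    ("お世話になっております。", "GREETING"),
    ("資料を添付いたします。", "BODY"),
    ("", "OTHER"),
    ("よろしくお願いいたします。", "CLOSING"),
    ("--", "OTHER"),
    ("山田太郎", "SIGNATURE"),
]
print(assemble_body(labeled))
```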

## Performance

Evaluated on 19,642 synthetic test emails:

| Metric          | Value |
| --------------- | ----- |
| Content match   | 97.9% |
| Acceptable rate | 98.0% |
| Confident wrong | 0.14% |

See [PERFORMANCE.md](PERFORMANCE.md) for details.

## Exceptions

```python
from yomail import (
    ExtractionError,      # Base class
    InvalidInputError,    # Empty or invalid input
    NoBodyDetectedError,  # No body found
    LowConfidenceError,   # Confidence below threshold
)
```
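
A common handling pattern catches the specific errors before the base class. The sketch below assumes the three specific errors subclass `ExtractionError`, as the "Base class" comment suggests. The exception classes and `fake_extract` are local stand-ins so the snippet runs on its own, and `extract_or_flag` is a hypothetical helper. In real code, import the exceptions from `yomail` and pass `EmailBodyExtractor().extract` instead.

```python
# Local stand-ins mirroring the names exported by yomail (hierarchy assumed).
class ExtractionError(Exception):
    pass


class InvalidInputError(ExtractionError):
    pass


class NoBodyDetectedError(ExtractionError):
    pass


class LowConfidenceError(ExtractionError):
    pass


def extract_or_flag(extract, email_text):
    """Call extract(email_text) and map failures to a (body, status) pair."""
    try:
        return extract(email_text), "ok"
    except InvalidInputError:
        return None, "invalid input"
    except LowConfidenceError:
        return None, "needs review"
    except ExtractionError:
        # NoBodyDetectedError and any other subclass end up here
        return None, "extraction failed"


def fake_extract(text):
    """Stand-in for the real extract method, for demonstration only."""
    if not text.strip():
        raise InvalidInputError("empty input")
    return text.strip()


print(extract_or_flag(fake_extract, ""))
print(extract_or_flag(fake_extract, "資料を添付いたします。\n"))
```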

## Configuration

```python
extractor = EmailBodyExtractor(
    model_path="path/to/model.crfsuite",  # Custom model
    confidence_threshold=0.5,              # Minimum confidence
)
```

## Development

```bash
# Setup
uv sync

# Run tests
uv run pytest

# Type check
uv run ty check

# Lint
uv run ruff check .
```

### Training

Training data is generated by the [yasumail](https://github.com/terallite/yasumail) project.

```bash
# Train model
python scripts/train.py data/training.jsonl -o models/email_body.crfsuite

# Evaluate
python scripts/evaluate.py data/test.jsonl
```

## Dependencies

- [neologdn](https://github.com/ikegami-yukino/neologdn) — Japanese text normalization
- [python-crfsuite](https://github.com/scrapinghub/python-crfsuite) — CRF implementation
- [PyYAML](https://pyyaml.org/) — Name data loading
