Metadata-Version: 2.4
Name: smssafe
Version: 0.0.1
Summary: Turn arbitrary text into guaranteed GSM-7 deliverable SMS — no UCS-2 fallback, deliverable on feature phones.
Project-URL: Homepage, https://github.com/BRIQ-BLOCK/smssafe
Project-URL: Repository, https://github.com/BRIQ-BLOCK/smssafe
Project-URL: Changelog, https://github.com/BRIQ-BLOCK/smssafe/blob/main/CHANGELOG.md
Project-URL: Issues, https://github.com/BRIQ-BLOCK/smssafe/issues
Author-email: Eddie Gulay <groundhalt@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: a2p,encoding,gsm-7,gsm0338,sanitize,sms,transliterate,ucs2
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Communications :: Telephony
Classifier: Topic :: Text Processing :: Filters
Classifier: Typing :: Typed
Requires-Python: >=3.10
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Description-Content-Type: text/markdown

# smssafe

[![CI](https://github.com/BRIQ-BLOCK/smssafe/actions/workflows/ci.yml/badge.svg)](https://github.com/BRIQ-BLOCK/smssafe/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/smssafe.svg)](https://pypi.org/project/smssafe/)
[![Python](https://img.shields.io/pypi/pyversions/smssafe.svg)](https://pypi.org/project/smssafe/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

**Turn arbitrary text into guaranteed GSM-7 deliverable SMS — no UCS-2 fallback.**

A single non-GSM-7 character (a smart quote, an em-dash, a stray emoji) forces an
*entire* SMS into UCS-2 encoding: 70 characters per segment instead of 160, up to
triple the cost, and silent delivery failure on feature phones that can't render it.
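
The cost difference can be checked with a little arithmetic. The sketch below is illustrative and is not part of `smssafe`; the `segments` helper and the limit constants are hypothetical names. It assumes the standard per-segment capacities: 160 (single) or 153 (concatenated) for GSM-7, and 70 or 67 for UCS-2.

```python
import math

# (single-message capacity, per-part capacity when concatenated).
# Concatenated parts are smaller because each part carries a User Data Header.
GSM7_LIMITS = (160, 153)   # measured in septets
UCS2_LIMITS = (70, 67)     # measured in 16-bit code units


def segments(length: int, limits: tuple[int, int]) -> int:
    """Return how many SMS parts a message of `length` encoded units needs."""
    single, multipart = limits
    if length <= single:
        return 1
    return math.ceil(length / multipart)


# A 150-character message fits in one GSM-7 segment...
print(segments(150, GSM7_LIMITS))  # 1
# ...but needs three segments once a single character forces UCS-2.
print(segments(150, UCS2_LIMITS))  # 3
```

The ratio is not always three: for example, a 100-character message goes from one segment to two.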

`smssafe` sanitizes text **before** you send it — transliterating, replacing, or
dropping every non-GSM-7 character while preserving as much meaning as possible.

- ✅ **Zero dependencies** — pure Python standard library, 3.10+
- ✅ **Fully typed** — ships `py.typed` (PEP 561)
- ✅ Homoglyph-aware — Cyrillic/Greek/full-width lookalikes are *transliterated*, not dropped
- ✅ Accurate segment & cost accounting (extended chars correctly counted as 2)
- ✅ Deterministic, side-effect-free, full audit trail of every change

## Install

```bash
pip install smssafe
```

## Usage

```python
from smssafe import sanitize

result = sanitize("Hi — “there”, pay ~5,000 now… 🙂")

result.sanitized          # 'Hi - "there", pay ~5,000 now...'
result.encoding           # 'gsm7'
result.char_count         # encoded length (extended chars count as 2)
result.segments           # number of SMS parts
result.replacements       # list[dict] — audit trail of every substitution
result.remaining_unsafe   # list[str] — chars that could not be mapped (dropped)
result.is_clean           # True if no changes were needed
```

### What it handles

| Input | Output | Notes |
|---|---|---|
| `“ ” ‘ ’` smart quotes | `" '` | Word / Google Docs / AI output |
| `– — …` dash & ellipsis | `- ...` | |
| `• · ‣` bullets | `-` | |
| Cyrillic `аеор`, Greek `Α`, full-width `Ａ１` | `aeop A A1` | lookalikes transliterated, not dropped |
| `ÀÈÌÒÙ` uppercase grave accents | `AEIOU` | *not* in GSM-7, although lowercase `àèìòù` is |
| `™ © ®` | `TM (c) (R)` | |
| `₹ ₽ ₿` | `INR RUB BTC` | (`€ £ $ ¥` are kept — they're valid GSM-7) |
| emoji, math-alphanumerics, non-BMP | _(stripped)_ | recorded in `remaining_unsafe` |
| zero-width / BOM / exotic spaces | _(stripped / normalized)_ | |

> **Tilde:** `~` is a GSM-7 extended character (encoded as the escape sequence
> `0x1B 0x3D`, so it costs 2 septets). It passes through unchanged: `~5,000` stays
> `~5,000` rather than becoming a misleading `-5,000`. Non-ASCII tilde lookalikes
> (`˜ ∼ ～`) normalize to `~`.
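
To make the "costs 2 septets" rule concrete, here is a minimal sketch of septet counting. It is not the library's implementation; `GSM7_EXTENDED` and `septet_count` are hypothetical names, and the function assumes its input has already been reduced to GSM-7 characters.

```python
# Characters from the GSM-7 extension table. Each one is sent as ESC (0x1B)
# followed by a second code, so it occupies two septets instead of one.
GSM7_EXTENDED = frozenset("^{}\\[~]|€\f")


def septet_count(text: str) -> int:
    """Count septets for text that is assumed to be valid GSM-7 already."""
    return sum(2 if ch in GSM7_EXTENDED else 1 for ch in text)


print(septet_count("-5,000"))  # 6
print(septet_count("~5,000"))  # 7: the tilde counts twice
```

Keeping the tilde therefore costs one extra septet, which only matters when a message sits right at a segment boundary.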

### Drop vs. replace unknowns

By default, characters with no safe mapping are dropped. Pass `drop_unknown=False`
to replace them with `?` instead:

```python
sanitize("A中B", drop_unknown=False).sanitized   # 'A?B'
```
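
The two behaviours can be pictured as one scan over the text. The following is a sketch under stated assumptions, not the code in `smssafe.core`: `final_scan` and the alphabet constants are hypothetical names, and the alphabet is the GSM 03.38 default table plus its extension table.

```python
# GSM 03.38 default alphabet (without the ESC code itself) and extension table.
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM7_EXTENDED = "^{}\\[~]|€\f"
GSM7_ALL = frozenset(GSM7_BASIC + GSM7_EXTENDED)


def final_scan(text: str, drop_unknown: bool = True) -> tuple[str, list[str]]:
    """Return (safe_text, unsafe_chars) for a single pass over `text`."""
    kept: list[str] = []
    unsafe: list[str] = []
    for ch in text:
        if ch in GSM7_ALL:
            kept.append(ch)
            continue
        unsafe.append(ch)          # remember what could not be mapped
        if not drop_unknown:
            kept.append("?")       # keep a visible placeholder instead
    return "".join(kept), unsafe


print(final_scan("A中B"))                      # ('AB', ['中'])
print(final_scan("A中B", drop_unknown=False))  # ('A?B', ['中'])
```

Replacing with `?` preserves message length and signals that something was lost; dropping keeps the text shorter.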

## How it works

A deterministic 8-step pipeline (`smssafe.core`), each step independently testable:

0. Strip non-BMP / surrogate codepoints (emoji, math-alphanumerics)
1. Apply the homoglyph map (Cyrillic/Greek/full-width → Latin)
2. Apply the explicit replacement map (quotes, dashes, currency, symbols…)
3. NFD-normalize **per character** and strip diacritics for remaining accented chars
4. Normalize whitespace (tabs, exotic/zero-width spaces)
5. Collapse artifacts (runs of dashes/spaces)
6. Final GSM-7 scan — drop or `?`-replace anything left
7. Compute encoding, character count (extended = 2), and segment count
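
Steps 0 and 3 can be sketched with the standard library alone. This is an illustrative approximation rather than the code in `smssafe.core`; the function names and the `GSM7_ACCENTED` set are assumptions. Working per character is assumed here to be what lets accented letters that GSM-7 already supports (such as `é` or `ñ`) pass through untouched.

```python
import unicodedata

# Non-ASCII letters that already exist in the GSM-7 basic alphabet.
GSM7_ACCENTED = frozenset("èéùìòÇØøÅåÆæßÉÄÖÑÜäöñüà")


def strip_non_bmp(text: str) -> str:
    """Step 0 (sketch): drop surrogates and anything above U+FFFF."""
    return "".join(
        ch for ch in text
        if ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF
    )


def strip_diacritics(text: str) -> str:
    """Step 3 (sketch): NFD-decompose each character and drop combining marks."""
    out: list[str] = []
    for ch in text:
        if ch.isascii() or ch in GSM7_ACCENTED:
            out.append(ch)  # already deliverable, leave untouched
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # Characters with no decomposition (e.g. CJK) survive for the final scan.
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(strip_non_bmp("ok 🙂"))       # 'ok '
print(strip_diacritics("ÀÈÌÒÙ"))   # 'AEIOU'
print(strip_diacritics("café"))    # 'café' (é is valid GSM-7)
```

Characters this step cannot decompose are left in place so that a later scan can decide whether to drop or replace them.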

## Development

```bash
git clone https://github.com/BRIQ-BLOCK/smssafe
cd smssafe
pip install -e ".[dev]"
pytest            # 689 tests
ruff check .
mypy
```

## License

MIT © Eddie Gulay
