Metadata-Version: 2.4
Name: phonemenal
Version: 0.2.0
Summary: General-purpose phonetic similarity detection — homophones, near-homophones, and sound-alike analysis
Project-URL: Homepage, https://github.com/brokensound77/phonemenal
Project-URL: Documentation, https://brokensound77.github.io/phonemenal/
Project-URL: Repository, https://github.com/brokensound77/phonemenal
Project-URL: Issues, https://github.com/brokensound77/phonemenal/issues
Author: Justin Ibarra (br0k3ns0und)
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: homophones,phonemes,phonetics,security,similarity,supply-chain,typosquatting
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Security
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: click>=8.0
Requires-Dist: nltk>=3.8
Requires-Dist: rapidfuzz>=3.0
Requires-Dist: rich>=13.0
Requires-Dist: wordninja>=2.0
Provides-Extra: g2p
Requires-Dist: g2p-en>=2.0; extra == 'g2p'
Provides-Extra: llm
Requires-Dist: anthropic>=0.40; extra == 'llm'
Requires-Dist: openai>=1.50; extra == 'llm'
Description-Content-Type: text/markdown

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: light)" srcset="docs/assets/logo.svg">
    <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo.svg">
    <img src="https://raw.githubusercontent.com/brokensound77/phonemenal/main/docs/assets/logo.svg" alt="phonemenal logo" width="120">
  </picture>
</p>

<h1 align="center">phonemenal</h1>

<p align="center">
  Phonetic similarity and homophone detection library for Python — near-homophones, sound-alike collisions, and variant generation.
</p>

<p align="center">
  <a href="https://brokensound77.github.io/phonemenal/"><img src="https://img.shields.io/badge/docs-GitHub%20Pages-blue" alt="Docs"></a>
  <a href="https://pypi.org/project/phonemenal/"><img src="https://img.shields.io/pypi/v/phonemenal" alt="PyPI"></a>
  <a href="https://pypi.org/project/phonemenal/"><img src="https://img.shields.io/pypi/pyversions/phonemenal" alt="Python"></a>
  <a href="https://github.com/brokensound77/phonemenal/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-Apache--2.0-green" alt="License"></a>
</p>

---

## Features

- **Four scoring algorithms** (all normalized 0.0–1.0):
  - **PPC-A** — Positional Phoneme Correlation (Absolute)
  - **PLD** — Phoneme Levenshtein Distance at syllable level
  - **PED** — Phoneme edit distance at phoneme level
  - **LCS** — Longest Common Subsequence ratio on phoneme sequences
- **Composite scoring** with configurable weights
- **Exact homophone discovery** via CMU Pronouncing Dictionary inversion
- **Near-homophone search** with threshold-based fuzzy matching
- **Variant generation** — phonetic substitutions, morphological variants, and separator permutations
- **Compound word splitting** with homophone permutation recombination
- **Fast fallback encoder** for words not in the dictionary (brand names, neologisms)
- **Batch collision scanning** — forward and reverse scanning pipelines
- **LLM-powered deep analysis** (optional, via Anthropic/OpenAI API or local agents)
- **Rich CLI** with formatted tables and JSON output

## Install

```bash
pip install phonemenal

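# With LLM support (quotes keep shells such as zsh from expanding the brackets)
pip install "phonemenal[llm]"

# With the optional g2p extra (pulls in g2p-en)
pip install "phonemenal[g2p]"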
```

## Quick Start

```python
from phonemenal import similarity, homophones, variants, splitting, fallback, scanning

# Compare two words (all scores 0.0–1.0)
similarity.ppc("crowd", "crown")        # PPC-A
similarity.pld("elastic", "fantastic")   # PLD
similarity.ped("cat", "bat")             # PED
similarity.lcs("packaging", "packages")  # LCS
similarity.composite("crowd", "crown")   # Weighted average

# Find exact homophones
homophones.find("blue")  # → ["blew"]

# Find near-homophones
homophones.find_similar("crowd", min_score=0.7)

# Generate sound-alike variants
variants.generate("flask")  # → {"phlask", "flazk", ...}
variants.generate_morphological("packaging")  # → {"packaged", "packager", ...}

# Split compound words & generate permutations
splitting.split("bluevoyage")  # → ["blue", "voyage"]
splitting.homophone_permutations("bluevoyage")  # → all recombinations

# Fallback for non-dictionary words
fallback.phonetic_key("numpy")   # → "nAmpY"
fallback.phonetic_key("numpie")  # → "nAmpY" (same key)

# Batch collision scanning
matches = scanning.scan(
    candidates=["numpie", "phlask"],
    known_names=["numpy", "flask"],
)

# Composite tuning for CMU-backed comparisons
matches = scanning.scan(
    candidates=["cat"],
    known_names=["bat"],
    use_composite=True,
    edit_mode="length",
)
```

## CLI

```bash
phonemenal similarity crowd crown           # compare with all algorithms
phonemenal similarity crowd crown -a ppc    # specific algorithm
phonemenal homophones blue                  # exact homophones
phonemenal variants flask -m                # phonetic + morphological variants
phonemenal split bluevoyage -p              # split & show permutations
phonemenal compare crowd crown              # full comparison report
phonemenal compare crowd crown -j           # JSON output
phonemenal analyze numpy --provider anthropic  # LLM deep analysis
phonemenal prompt numpy | pbcopy            # copy the raw prompt to the clipboard (macOS)
```

## Algorithms

### PPC-A (Positional Phoneme Correlation — Absolute)

Builds positional phoneme combinations by traversing forward and reverse directions with padding, then measures set intersection. Captures how much of the positional phoneme structure two words share.
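
The description above leaves the exact construction open, so the following is only one plausible reading, not phonemenal's implementation. It assumes the "combinations" are adjacent phoneme pairs tagged with their position, that a single pad symbol is added on each side, and that the score is the shared count divided by the larger set. The names `PAD`, `positional_pairs`, and `ppc_like` are hypothetical.

```python
PAD = "_"  # hypothetical padding symbol


def positional_pairs(phonemes):
    """Position-tagged adjacent pairs from a forward and a reverse pass."""
    padded = [PAD] + list(phonemes) + [PAD]
    reverse = padded[::-1]
    forward_pairs = {
        ("fwd", i, a, b) for i, (a, b) in enumerate(zip(padded, padded[1:]))
    }
    reverse_pairs = {
        ("rev", i, a, b) for i, (a, b) in enumerate(zip(reverse, reverse[1:]))
    }
    return forward_pairs | reverse_pairs


def ppc_like(a, b):
    """Shared positional pairs divided by the larger set (assumed normalization)."""
    set_a, set_b = positional_pairs(a), positional_pairs(b)
    # Padding guarantees both sets are non-empty, so the division is safe.
    return len(set_a & set_b) / max(len(set_a), len(set_b))


# Stress-stripped CMU-style phonemes for "crowd" and "crown".
crowd = ["K", "R", "AW", "D"]
crown = ["K", "R", "AW", "N"]
score = ppc_like(crowd, crown)
```

Under these assumptions the two words share six of ten positional pairs, because only the pairs touching the final phoneme differ in each direction.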

### PLD (Phoneme Levenshtein Distance)

Syllable-level edit distance using the CMU dict's stress markers to split phonemes into syllable groups. Each syllable is an atomic unit, so the distance reflects how many whole syllables differ — matching how sound flows in speech.

### PED (Phoneme Edit Distance)

Phoneme-level edit distance on stress-stripped CMU pronunciations. This complements PLD for short and monosyllabic words where syllable-level scoring is too coarse.
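
To make the contrast between PLD and PED concrete, here is a minimal sketch that runs the same Levenshtein routine over syllable groups and over individual phonemes. It is not the library's code. The syllable grouping rule (each group ends at a stress-marked vowel, with trailing consonants attached to the last group) and the `1 - distance / longest` normalization are assumptions, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance over two sequences of hashable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_similarity(a, b):
    """Map an edit distance to 0.0-1.0 (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def strip_stress(phonemes):
    """Drop the CMU stress digits, e.g. 'AE1' -> 'AE'."""
    return [p.rstrip("012") for p in phonemes]


def syllables(phonemes):
    """Group phonemes into syllables; a stress digit marks a vowel (assumed rule)."""
    groups, current = [], []
    for p in phonemes:
        current.append(p.rstrip("012"))
        if p[-1].isdigit():          # vowel closes the current group
            groups.append(tuple(current))
            current = []
    if current:                       # trailing consonants join the last group
        if groups:
            groups[-1] = groups[-1] + tuple(current)
        else:
            groups.append(tuple(current))
    return groups


cat = ["K", "AE1", "T"]
bat = ["B", "AE1", "T"]
pld_like = normalized_similarity(syllables(cat), syllables(bat))
ped_like = normalized_similarity(strip_stress(cat), strip_stress(bat))
```

In this sketch the syllable-level score for `cat` and `bat` is 0.0, since the single syllable differs as a whole, while the phoneme-level score is 2/3. That is the coarseness on monosyllables that PED is meant to offset.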

### LCS (Longest Common Subsequence)

Ratio of the longest common subsequence length to the total sequence length. Applied to phoneme sequences from CMU dict, or to raw character strings as a fallback.
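
As a rough illustration, the sketch below computes an LCS ratio with standard dynamic programming. "Total sequence length" is read here as the combined length of both sequences, giving `2 * LCS / (len(a) + len(b))`. That reading is an assumption, and the library may normalize differently. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a, b):
    """LCS length relative to the combined length (assumed normalization)."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2.0 * lcs_length(a, b) / total


# Works on phoneme lists or, as a fallback, on raw character strings.
chars = lcs_ratio("packaging", "packages")
phones = lcs_ratio(["K", "R", "AW", "D"], ["K", "R", "AW", "N"])
```

For the character fallback, `packaging` and `packages` share the subsequence `packag`, so this sketch yields 12/17.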

### Composite

Weighted average of PPC-A, an adaptive edit channel, and LCS. By default the edit channel uses `max(PLD, PED)`, and callers can switch to a length-based selector for monosyllables vs. longer words. Default weights are `(1.0, 2.0, 1.0)`, which emphasizes edit similarity. Because every input score lies in 0.0–1.0, the weighted average does too.

Note: the default composite score changed in `0.2.0`, so scores are not directly comparable with `0.1.x`.
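
The weighting can be sketched as follows. This is an illustration of the description above, not the library's function. The mode name `"max"`, the `syllable_count` parameter, and the rule that length mode picks PED for monosyllables and PLD otherwise are assumptions. Only `edit_mode="length"` and the default weights appear in this README.

```python
def composite_sketch(ppc, pld, ped, lcs, *, weights=(1.0, 2.0, 1.0),
                     edit_mode="max", syllable_count=None):
    """Weighted average of PPC-A, one edit score, and LCS (hypothetical helper)."""
    if edit_mode == "max":
        edit = max(pld, ped)
    elif edit_mode == "length":
        if syllable_count is None:
            raise ValueError("length mode needs a syllable count")
        # Assumed rule: phoneme-level score for monosyllables, syllable-level otherwise.
        edit = ped if syllable_count <= 1 else pld
    else:
        raise ValueError(f"unknown edit_mode: {edit_mode!r}")

    w_ppc, w_edit, w_lcs = weights
    total = w_ppc + w_edit + w_lcs
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_ppc * ppc + w_edit * edit + w_lcs * lcs) / total


# Illustrative input scores, not output from phonemenal.
default_score = composite_sketch(0.6, 0.0, 0.75, 0.75)
length_score = composite_sketch(0.6, 0.0, 0.75, 0.75,
                                edit_mode="length", syllable_count=3)
```

With the default weights the edit channel counts for half of the total, so the choice between PLD and PED moves the result more than either other channel.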

### Fallback Encoder

Simplified Metaphone-inspired encoding for words not in the CMU dict. Applies digraph replacement, vowel normalization, and character collapsing to produce phonetic keys. Sound-alike names produce the same or similar keys — e.g. `numpy` and `numpie` both map to `nAmpY`.
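
A toy encoder along these lines is sketched below. The specific rules (the digraph table, the treatment of trailing `y`/`ie`/`ey`, collapsing vowel runs to `A`) are invented for illustration and were chosen so the `numpy`/`numpie` example works. They are not the library's actual rule set, and `toy_phonetic_key` is a hypothetical name.

```python
import re

# Hypothetical digraph table; the real encoder's rules may differ.
DIGRAPHS = [("ph", "f"), ("ck", "k")]


def toy_phonetic_key(word):
    """Toy phonetic key: digraphs, vowel normalization, run collapsing."""
    key = re.sub(r"[^a-z]", "", word.lower())   # keep ASCII letters only
    key = re.sub(r"(ie|ey|y)$", "Y", key)       # assumed: y-like endings share one symbol
    for source, target in DIGRAPHS:             # digraph replacement
        key = key.replace(source, target)
    key = re.sub(r"[aeiou]+", "A", key)         # vowel normalization
    key = re.sub(r"(.)\1+", r"\1", key)         # collapse repeated characters
    return key


same_key = toy_phonetic_key("numpy") == toy_phonetic_key("numpie")
```

Exact key equality catches the closest sound-alikes. Names whose keys differ only slightly could then be compared with a string similarity measure.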

## Documentation

Full documentation is available at [**brokensound77.github.io/phonemenal**](https://brokensound77.github.io/phonemenal/).

## License

Apache-2.0

## Background and Research

This project stems from previous research on homophonic collisions conducted by [Reagan Short](https://x.com/ReaganShort) and
[Justin Ibarra](https://x.com/br0k3ns0und). See their [TROOPERS](https://troopers.de/troopers23/talks/mmtwsy/)
2023 talk, _Homophonic Collisions: Hold me closer Tony Danza_, for more background on the problem space and their approach
to phonetic similarity detection.
- [YouTube](https://www.youtube.com/watch?v=nj4fZAM_IDg)
- [slides](https://troopers.de/downloads/troopers23/TR23_HomophonicCollisions.pdf)

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: light)" srcset="docs/assets/hc-c3.png">
    <source media="(prefers-color-scheme: dark)" srcset="docs/assets/hc-c3.png">
    <img src="https://raw.githubusercontent.com/brokensound77/phonemenal/main/docs/assets/hc-c3.png" alt="Coils of Communication Chaos" width="500">
  </picture>
</p>
