Metadata-Version: 2.4
Name: merkmal
Version: 0.2.0
Summary: Standalone phonological feature systems for historical linguistics
Author-email: Tiago Tresoldi <tiago.tresoldi@lingfil.uu.se>
License-Expression: MIT
Project-URL: Homepage, https://github.com/tresoldi/merkmal
Project-URL: Documentation, https://github.com/tresoldi/merkmal#readme
Project-URL: Repository, https://github.com/tresoldi/merkmal
Project-URL: Bug Tracker, https://github.com/tresoldi/merkmal/issues
Keywords: phonology,phonological features,historical linguistics,sequence alignment,cognate detection
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: build; extra == "dev"
Dynamic: license-file

# merkmal

`merkmal` is a standalone Python package for manipulating phonological
features. It has no runtime dependencies and requires Python 3.12 or later.

It provides:

- bundled phonological feature datasets
- pluggable feature systems (9 built-in)
- feature geometry and distance functions (Clements & Hume 1995)
- tonal geometry (Yip/Bao)
- query and analysis helpers for graphemes and feature sets
- UPA transcription support

## Installation

Install from PyPI:

```bash
pip install merkmal
```

Development install:

```bash
git clone https://github.com/tresoldi/merkmal.git
cd merkmal
pip install -e ".[dev]"
```

Run checks:

```bash
ruff check .
mypy src
pytest -q
```

## Quick start

```python
import merkmal

# Built-in systems
print(merkmal.list_systems())
# ['descriptive', 'broad', 'distinctive', 'pbase-hc', 'pbase-jfh',
#  'pbase-spe', 'pbase-uftc', 'phoible', 'classfeat']

# Basic grapheme lookup
print(merkmal.get_features("p"))
# frozenset({'consonant', 'voiceless', 'bilabial', 'stop'})

# Predefined sound classes
print(merkmal.get_class_features("V"))
# frozenset({'vowel'})

# Distance
print(merkmal.distance("a", "e"))
print(merkmal.distance("p", "b", system="classfeat"))
```

## Systems

| System | Type | Features | Distance |
|--------|------|----------|----------|
| `descriptive` | categorical | articulatory | geometry-weighted |
| `broad` | categorical | simplified | geometry-weighted |
| `distinctive` | privative | Clements & Hume | geometry-weighted |
| `pbase-hc`, `pbase-jfh`, `pbase-spe`, `pbase-uftc` | multi-state | 4 theoretical families | geometry-weighted |
| `phoible` | binary | 37 features | Hamming |
| `classfeat` | hybrid | sound classes + continuous | trained weights |

All systems implement the same `FeatureSystem` protocol. Distances, queries,
matrices, and natural class derivation work across all of them.
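
The `Hamming` entry in the table refers to counting the features on which two binary bundles disagree. The following is a minimal sketch of that idea, not merkmal's implementation; the helper name `hamming_distance` and the three-feature bundles are illustrative assumptions, not actual PHOIBLE data.

```python
def hamming_distance(a: dict[str, bool], b: dict[str, bool]) -> int:
    """Count the features on which two binary feature bundles disagree."""
    if a.keys() != b.keys():
        raise ValueError("both bundles must define the same features")
    return sum(a[name] != b[name] for name in a)


# Hypothetical three-feature bundles, for illustration only.
p = {"voiced": False, "labial": True, "continuant": False}
b = {"voiced": True, "labial": True, "continuant": False}
s = {"voiced": False, "labial": False, "continuant": True}

print(hamming_distance(p, b))  # 1
print(hamming_distance(p, s))  # 2
```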

## Working with systems

You can use the lazy default registry through top-level helpers, or work
with a specific system object.

```python
import merkmal

descriptive = merkmal.get_system("descriptive")
distinctive = merkmal.get_system("distinctive")
pbase = merkmal.get_system("pbase-hc")

print(descriptive.grapheme_to_features("a"))
print(distinctive.grapheme_to_features("a"))
print(pbase.grapheme_to_representation("a"))
```

Exact reverse lookup is available when a native representation maps directly to
a known grapheme.

```python
descriptive = merkmal.get_system("descriptive")

grapheme = descriptive.features_to_grapheme(
    frozenset({"consonant", "voiced", "bilabial", "stop"})
)
print(grapheme)
# 'b'
```

## Feature queries

Use `features_to_graphemes(...)` to find all graphemes matching a feature set.
Matching is partial by default.

```python
import merkmal

vowels = merkmal.features_to_graphemes(frozenset({"vowel"}))
print(vowels[:10])

# Exact matching
features = merkmal.get_features("a")
print(merkmal.features_to_graphemes(features, exact=True))
```
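One way to picture the two modes is as a subset test versus an equality test. The sketch below assumes that partial matching means "the query is a subset of the grapheme's features"; the toy inventory and the helper `match_graphemes` are hypothetical and do not reflect how merkmal implements the query.

```python
# Toy inventory, assumed for illustration; not the bundled dataset.
INVENTORY: dict[str, frozenset[str]] = {
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
    "a": frozenset({"vowel", "open", "front", "unrounded"}),
}


def match_graphemes(query: frozenset[str], exact: bool = False) -> list[str]:
    """Return graphemes whose features contain (or, if exact, equal) the query."""
    if exact:
        return sorted(g for g, feats in INVENTORY.items() if feats == query)
    return sorted(g for g, feats in INVENTORY.items() if query <= feats)


print(match_graphemes(frozenset({"bilabial", "stop"})))
# ['b', 'p']
print(match_graphemes(frozenset({"bilabial", "stop"}), exact=True))
# []
```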

## Natural classes and matrices

```python
import merkmal

# Shared features of a segment set
print(merkmal.derive_class_features(["p", "t", "k"]))
# frozenset({'consonant', 'voiceless', 'stop'})

# Minimal distinguishing matrix
matrix = merkmal.minimal_matrix(["t", "d", "s"])
print(merkmal.tabulate_matrix(matrix))
```

```text
grapheme | continuant | voiced
---------+------------+-------
t        | False      | False
d        | False      | True
s        | True       | False
```
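Conceptually, the shared features of a segment set are an intersection, and a minimal matrix is the smallest set of features that still gives every segment a distinct row. The brute-force sketch below illustrates both ideas on toy binary values; `shared_values`, `minimal_features`, and the data are assumptions made for this example, and merkmal's own algorithm may differ.

```python
from itertools import combinations

# Toy binary feature values, assumed for illustration.
SEGMENTS: dict[str, dict[str, bool]] = {
    "t": {"voiced": False, "continuant": False, "coronal": True},
    "d": {"voiced": True, "continuant": False, "coronal": True},
    "s": {"voiced": False, "continuant": True, "coronal": True},
}


def shared_values(graphemes: list[str]) -> dict[str, bool]:
    """Feature/value pairs that every listed segment has in common."""
    first, *rest = (SEGMENTS[g] for g in graphemes)
    return {f: v for f, v in first.items() if all(r[f] == v for r in rest)}


def minimal_features(graphemes: list[str]) -> list[str]:
    """Smallest feature subset giving each segment a unique row (brute force)."""
    names = sorted(SEGMENTS[graphemes[0]])
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            rows = {tuple(SEGMENTS[g][f] for f in subset) for g in graphemes}
            if len(rows) == len(graphemes):
                return list(subset)
    return names  # no subset separates the segments; fall back to all features


print(shared_values(["t", "d", "s"]))     # {'coronal': True}
print(minimal_features(["t", "d", "s"]))  # ['continuant', 'voiced']
```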

## Distance

```python
import merkmal

print(merkmal.distance("a", "e"))
print(merkmal.distance("a", "u"))
print(merkmal.distance("p", "b"))
print(merkmal.distance("t", "d", system="pbase-hc"))
```

You can also supply precomputed distances as a nested dictionary:

```python
precomputed = {"a": {"e": 1.5, "u": 2.0}, "p": {"b": 0.5}}
print(merkmal.distance("a", "e", precomputed=precomputed))
```

## Multi-state systems (P-base)

P-base-derived systems expose multi-state values (`+`, `-`, `n`, `.`, `o`, `x`)
through `FeatureState`.

```python
import merkmal

rep = merkmal.get_representation("a", system="pbase-hc")
print(rep.values["syllabic"])
# FeatureState.POSITIVE
```

The bundled P-base table is derived, not verbatim. Duplicate rows with
conflicting values have the conflicting cells downgraded to `.`
(`FeatureState.DOT`). The P-base data retains its own attribution and license
notice in `src/merkmal/data/pbase/`.
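
The downgrade rule can be sketched as a merge over duplicate rows: a cell keeps its value while all rows agree and becomes `.` as soon as two rows disagree. This is an illustration under assumed inputs (rows as plain dictionaries of state symbols) with a hypothetical helper name, not the script used to build the bundled table.

```python
def merge_duplicate_rows(rows: list[dict[str, str]]) -> dict[str, str]:
    """Merge duplicate rows for one segment; conflicting cells become '.'."""
    merged: dict[str, str] = {}
    for row in rows:
        for feature, value in row.items():
            if feature not in merged:
                merged[feature] = value   # first value seen for this feature
            elif merged[feature] != value:
                merged[feature] = "."     # rows disagree: downgrade the cell
    return merged


# Two hypothetical rows for the same segment that disagree on "voice".
rows = [
    {"syllabic": "+", "voice": "+", "nasal": "-"},
    {"syllabic": "+", "voice": "-", "nasal": "-"},
]
print(merge_duplicate_rows(rows))
# {'syllabic': '+', 'voice': '.', 'nasal': '-'}
```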

## Custom datasets

```python
from merkmal import create_registry, load_dataset

dataset = load_dataset(directory="my_feature_data")
registry = create_registry(dataset=dataset)
system = registry.get_system("descriptive")
print(system.grapheme_to_features("k"))
```

The `my_feature_data/` directory is expected to contain `sounds.tsv`, `classes.tsv`, and `features.tsv`.

## Cognator export

`merkmal export-cognator` writes a small, byte-stable bundle of TSV + JSON
files that downstream consumers (in particular the `cognator` Go package)
can read without any Python dependency on merkmal.

```sh
# single system → ./cognator_export/descriptive/
merkmal export-cognator --system=descriptive

# every built-in system → ./cognator_export/<system>/
merkmal export-cognator --all-systems --out=./cognator_export --force
```

The bundle contains:

- `distances.tsv` — full Cartesian pairwise distances, normalized to
  `[0.0, 1.0]` via `d' = clip(d_raw / d_max_raw, 0, 1)`.
- `classes.tsv` — sound-class reduction (only for systems that expose
  one, e.g. `classfeat`).
- `prosody.tsv` — per-grapheme role tag (`C`, `R`, `V`, `G`, `T`, `S`,
  `X`).
- `fallback.tsv` — optional grapheme-normalization table for
  out-of-inventory inputs (initially empty, populated over time).
- `manifest.json` — merkmal version, export date, grapheme count, and
  SHA-256 hashes of every file in the bundle.
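
Because the manifest records a SHA-256 hash for every file, a consumer can check a bundle before reading it. The sketch below shows one way to do that with the standard library. The key layout of `manifest.json` is not described here, so the helper takes the expected digests as a plain mapping from file name to hex digest; `sha256_of` and `mismatched_files` are hypothetical names, not part of merkmal.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hexadecimal SHA-256 digest of a file's bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(bundle_dir: Path, expected: dict[str, str]) -> list[str]:
    """List file names whose digest differs from the expected value."""
    return sorted(
        name
        for name, digest in expected.items()
        if sha256_of(bundle_dir / name) != digest
    )


# Demonstration on a throwaway directory standing in for a real bundle.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "distances.tsv").write_bytes(b"a\te\t0.750000\n")
    good = {"distances.tsv": sha256_of(bundle / "distances.tsv")}
    print(mismatched_files(bundle, good))                          # []
    print(mismatched_files(bundle, {"distances.tsv": "0" * 64}))   # ['distances.tsv']
```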

All text files are UTF-8 with NFC-normalized graphemes, LF line endings,
and deterministic row ordering. Floats use fixed `%.6f` formatting. Pin
`SOURCE_DATE_EPOCH` to produce byte-identical bundles across runs.
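
As a worked example of the normalization formula and the fixed float format, the sketch below scales a few raw distances by their maximum, clips them to `[0.0, 1.0]`, and renders them with `%.6f` precision in sorted order. The raw values and the three-column row layout are assumptions made for illustration; the actual column layout of `distances.tsv` is not specified here.

```python
def normalize(d_raw: float, d_max_raw: float) -> float:
    """Scale a raw distance by the maximum and clip it to [0.0, 1.0]."""
    if d_max_raw <= 0:
        raise ValueError("d_max_raw must be positive")
    return min(max(d_raw / d_max_raw, 0.0), 1.0)


# Hypothetical raw distances, for illustration only.
raw = {("a", "e"): 1.5, ("a", "u"): 2.0, ("a", "a"): 0.0}
d_max = max(raw.values())

# Sort the pairs so that the row order is deterministic across runs.
lines = [
    f"{x}\t{y}\t{normalize(d, d_max):.6f}"
    for (x, y), d in sorted(raw.items())
]
print("\n".join(lines))
# a	a	0.000000
# a	e	0.750000
# a	u	1.000000
```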

The same capability is available as a library function:

```python
import merkmal

merkmal.export_cognator("descriptive", "./cognator_export/descriptive")
merkmal.export_all_systems("./cognator_export")
```

## Documentation

See the [tutorials](docs/tutorials/) for worked examples covering phonological
features, typology, historical linguistics, cognate detection, and UPA
transcription.

## License

MIT. See [LICENSE](LICENSE).
