Metadata-Version: 2.4
Name: simmetry
Version: 1.0.1
Summary: Blazing-fast similarity scores for strings, vectors, points, and sets.
Project-URL: Homepage, https://pypi.org/project/simmetry/
Project-URL: Repository, https://github.com/algumusrende/simmetry
Author-email: Ali Can Gumusrende <algumusrende@gmail.com>
License: MIT
License-File: LICENSE
Keywords: cosine,distance,haversine,jaccard,levenshtein,similarity
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.10
Requires-Dist: numpy>=1.23
Provides-Extra: ann
Requires-Dist: hnswlib>=0.8.0; extra == 'ann'
Provides-Extra: ann-faiss
Requires-Dist: faiss-cpu>=1.7.4; extra == 'ann-faiss'
Provides-Extra: ann-hnsw
Requires-Dist: hnswlib>=0.8.0; extra == 'ann-hnsw'
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: fast
Requires-Dist: numba>=0.58; extra == 'fast'
Description-Content-Type: text/markdown

# simmetry

Blazing-fast similarity scores for **strings**, **vectors**, **points**, and **sets** — with a simple API.

## Install

```bash
pip install simmetry            # core package (depends only on NumPy)
pip install "simmetry[fast]"    # optional: adds Numba acceleration
```

`simmetry[fast]` enables optional Numba acceleration for `pairwise(..., metric="euclidean_sim")` and `pairwise(..., metric="manhattan_sim")`.

## Quickstart

### One function
```python
from simmetry import similarity

similarity("kitten", "sitting", metric="levenshtein")          # strings
similarity([1,2,3], [1,2,4], metric="cosine")                  # vectors
similarity((41.1, 29.0), (41.2, 29.1), metric="haversine_km")  # geographic points
similarity({1,2,3}, {2,3,4}, metric="jaccard")                 # sets
```

### Pairwise matrices (fast for vectors)
```python
import numpy as np
from simmetry import pairwise

X = np.random.randn(1000, 128)
S = pairwise(X, metric="cosine")          
```

### Top-k search
```python
import numpy as np
from simmetry import topk

X = np.random.randn(5000, 64)
q = np.random.randn(64)
idx, scores = topk(q, X, k=10, metric="cosine")
```
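
To make "exact top-k" concrete, here is a minimal pure-Python sketch of brute-force search: score every row against the query, then keep the `k` best. The helper names `cosine` and `exact_topk` are hypothetical and are not part of simmetry; the return order `(indices, scores)` simply mirrors the `idx, scores` unpacking shown above. The library itself presumably does this with vectorized NumPy rather than Python loops.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length (assumed convention)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def exact_topk(query, rows, k):
    """Score every row against the query and return the k best, highest score first."""
    scored = ((cosine(query, row), i) for i, row in enumerate(rows))
    best = heapq.nlargest(k, scored)  # O(n log k), no full sort needed
    return [i for _, i in best], [s for s, _ in best]


rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
idx, scores = exact_topk([1.0, 0.0], rows, k=2)
print(idx, scores)
```

Because every row is scored, the cost grows linearly with the corpus size, which is why the ANN options further down exist.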

## Available metrics

```python
from simmetry import available
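available()             # all metric names
available("vector")     # vector metrics only
available("string")     # string metrics only
available("point")      # point / geo metrics only
available("set")        # set metrics only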
```

### Vectors
- `cosine`, `dot`, `euclidean_sim`, `manhattan_sim`, `pearson`
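
As a reference for what these names usually mean, the sketch below gives textbook definitions of `dot`, `cosine`, and `pearson` in plain Python. These are illustrative stand-ins, not simmetry's implementation, and the zero-vector handling is an assumption. How `euclidean_sim` and `manhattan_sim` map a distance to a similarity is not stated in this README, so they are left out.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine of the angle between a and b; 0.0 if either is all zeros (assumed convention)."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)


def pearson(a, b):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])


print(cosine([1, 2, 3], [1, 2, 4]))
print(pearson([1, 2, 3], [3, 2, 1]))
```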

### Strings
- `levenshtein` (normalized similarity)
- `jaro_winkler`
- `ngram_jaccard` (character n-gram set Jaccard)
- `token_jaccard` (whitespace token set Jaccard)
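
The sketch below shows one common way to define three of these string metrics. It assumes that "normalized similarity" means `1 - distance / max(len(a), len(b))`, that the n-gram size defaults to 2, and that two empty inputs score 1.0; none of these choices is confirmed by this README, and the function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        prev = curr
    return prev[-1]


def levenshtein_sim(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def set_jaccard(s, t):
    """|s & t| / |s | t|, with two empty sets treated as identical."""
    return 1.0 if not (s or t) else len(s & t) / len(s | t)


def ngram_jaccard(a, b, n=2):
    """Jaccard over the sets of character n-grams."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    return set_jaccard(grams(a), grams(b))


def token_jaccard(a, b):
    """Jaccard over the sets of whitespace-separated tokens."""
    return set_jaccard(set(a.split()), set(b.split()))


print(levenshtein_sim("kitten", "sitting"))
```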

### Points / Geo
- `euclidean_2d`
- `haversine_km`
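
For readers unfamiliar with the haversine formula, here is a minimal sketch of the great-circle distance in kilometers. It assumes points are `(latitude, longitude)` in degrees and uses a mean Earth radius of 6371.0088 km. This README does not say how the library turns that distance into a similarity score, so `distance_to_sim` is only one possible mapping; both function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)


def haversine_distance_km(p, q):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def distance_to_sim(d, scale=1.0):
    """One possible distance-to-similarity mapping: 1.0 at d = 0, decaying toward 0."""
    return 1.0 / (1.0 + d / scale)


d = haversine_distance_km((41.1, 29.0), (41.2, 29.1))
print(d, distance_to_sim(d, scale=10.0))
```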

### Sets
- `jaccard`, `dice`, `overlap`
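
The three set metrics differ only in how they normalize the size of the intersection. The sketch below uses the standard definitions; treating two empty sets as identical (score 1.0) and an empty set against a non-empty one as 0.0 is an assumed convention, not something this README specifies.

```python
def jaccard(a, b):
    """|A & B| / |A | B|"""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dice(a, b):
    """2 * |A & B| / (|A| + |B|)"""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a, b):
    """|A & B| / min(|A|, |B|); reaches 1.0 whenever one set contains the other."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


A, B = {1, 2, 3}, {2, 3, 4}
print(jaccard(A, B), dice(A, B), overlap(A, B))
```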

## License

MIT

## Batch string APIs

If you need many string-to-string similarities (e.g., deduping names), use:

```python
from simmetry.strings import pairwise_strings, topk_strings

S = pairwise_strings(["item_one", "item_two"], ["item_one", "item_alt"], metric="jaro_winkler")
idx, scores = topk_strings("samplecorp", ["samplecorp", "examplefinance", "testgroup"], k=2, metric="levenshtein")
```

## ANN top-k (optional, does NOT bloat core)

For very large vector corpora (100k+ rows), exact `topk()` can be slow. Approximate nearest neighbor (ANN) indexes return results much faster, at the cost of occasionally missing a true neighbor.
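
A common way to quantify how approximate the results are is recall@k: the fraction of the exact top-k neighbors that the approximate search also returned. The helper below is a hypothetical sketch, not part of simmetry; you would feed it the index arrays returned by exact `topk()` and by an ANN query for the same query vector.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact neighbors that also appear in the approximate result."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)


# Two of the four true neighbors were found.
print(recall_at_k([1, 2, 3, 4], [1, 2, 5, 6]))
```

Averaging this value over a sample of queries gives a quick sanity check before switching a workload from exact to approximate search.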

### hnswlib (recommended)
```bash
pip install "simmetry[ann-hnsw]"
```

```python
import numpy as np
from simmetry.ann import build_hnsw

X = np.random.randn(200_000, 128).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each row to unit length

index = build_hnsw(X, space="cosine")
labels, distances = index.query(X[0], k=10)
```

### faiss
```bash
pip install "simmetry[ann-faiss]"
```

```python
import numpy as np
from simmetry.ann import build_faiss

X = np.random.randn(200_000, 128).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows: inner product equals cosine

index = build_faiss(X, metric="ip")
labels, scores = index.query(X[0], k=10)
```
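
The normalization step matters because, for unit-length vectors, the inner product is exactly the cosine similarity, so an inner-product index ranks neighbors by cosine. The small pure-Python check below illustrates the identity; the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
ip_after_norm = dot(l2_normalize(a), l2_normalize(b))
print(ip_after_norm, cosine(a, b))  # the two values agree up to rounding
```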


## SimIndex (exact or ANN)

Exact search (no extras):

```python
import numpy as np
from simmetry import SimIndex

X = np.random.randn(50_000, 128).astype("float32")
index = SimIndex(metric="cosine", backend="exact").add(X)

idx, scores = index.query(X[0], k=10)
```

ANN (optional):

```bash
pip install "simmetry[ann-hnsw]"
```

```python
import numpy as np
from simmetry import SimIndex

X = np.random.randn(200_000, 128).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)

index = SimIndex(metric="cosine", backend="hnsw").add(X)
labels, distances = index.query(X[0], k=10)
```

## Auto similarity and composite records

Auto metric selection:

```python
from simmetry import similarity

similarity("samplecorp", "sample corp")
similarity((41.0, 29.0), (41.1, 29.1))
similarity({1,2,3}, {2,3,4})
```
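
This README does not document how the metric is chosen when `metric` is omitted. Purely as an illustration, dispatch on the Python type of the input could look like the sketch below; `guess_kind` and its rules are assumptions, not simmetry's actual logic.

```python
def guess_kind(x):
    """Hypothetical type-based dispatch: map an input to a metric family."""
    if isinstance(x, str):
        return "string"
    if isinstance(x, (set, frozenset)):
        return "set"
    if isinstance(x, dict):
        return "record"
    if (isinstance(x, tuple) and len(x) == 2
            and all(isinstance(v, (int, float)) for v in x)):
        return "point"  # a 2-tuple of numbers is treated as a point
    return "vector"     # any other sequence falls back to a vector metric


for value in ("samplecorp", (41.0, 29.0), {1, 2, 3}, [1, 2, 3]):
    print(type(value).__name__, "->", guess_kind(value))
```

Passing `metric=` explicitly avoids any ambiguity, for example between a 2-element vector and a point.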

Composite similarity over dict fields:

```python
a = {"name": "Entity One", "city": "CityAlpha", "loc": (41.0, 29.0)}
b = {"name": "Entity One Extended", "city": "CityAlpha", "loc": (41.01, 28.99)}

score = similarity(
    a, b,
    metric={"name": "jaro_winkler", "loc": "haversine_km"},
    weights={"name": 0.7, "loc": 0.3},
)
```
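
One natural reading of `weights` is a weighted mean of per-field scores. The sketch below implements that reading with stand-in scorers; `composite_score`, the renormalization of weights, and the toy `exact_match` scorer are all assumptions for illustration, and the library may combine fields differently (for example, in how it treats fields such as `city` that have no metric assigned).

```python
def composite_score(a, b, scorers, weights):
    """Weighted mean of per-field similarity scores over the fields in `scorers`."""
    total = sum(weights[field] for field in scorers)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[f] * scorers[f](a[f], b[f]) for f in scorers)
    return weighted / total  # renormalize so the result stays in the scorers' range


def exact_match(x, y):
    """Toy scorer: 1.0 for equal values, 0.0 otherwise."""
    return 1.0 if x == y else 0.0


a = {"name": "Entity One", "city": "CityAlpha"}
b = {"name": "Entity One", "city": "CityBeta"}

score = composite_score(
    a, b,
    scorers={"name": exact_match, "city": exact_match},
    weights={"name": 0.7, "city": 0.3},
)
print(score)  # name matches, city does not
```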
