Metadata-Version: 2.4
Name: polars-strsim
Version: 0.2.5
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Programming Language :: Rust
Requires-Dist: polars>=1.0,<2.0
License-File: LICENSE
Summary: Polars extension for string similarity
Keywords: polars-extension,string-similarity
Author: Jeremy Foxcroft
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Repository, https://github.com/foxcroftjn/polars-strsim
Project-URL: Issues, https://github.com/foxcroftjn/polars-strsim/issues

<a href="https://pypi.org/project/polars-strsim/">
    <img src="https://img.shields.io/pypi/v/polars-strsim.svg" alt="PyPI Latest Release"/>
</a>

# String Similarity Measures for Polars

This package provides Python bindings for computing string similarity measures directly on a Polars DataFrame. All similarity measures are implemented in Rust and computed in parallel.

The similarity measures that have been implemented are:

- Levenshtein
- Jaro
- Jaro-Winkler
- Jaccard
- Sørensen-Dice

Each similarity measure returns a value normalized between 0.0 and 1.0 (inclusive), where 0.0 means the two strings are maximally different and 1.0 means they are maximally similar.
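
To make the normalization concrete, here is a minimal pure-Python sketch of a normalized Levenshtein similarity. It assumes the edit distance is divided by the length of the longer string and that two empty strings count as identical. Both assumptions agree with the example output in this README, but the package's actual Rust implementation may differ in its details. The function names `levenshtein_distance` and `levenshtein_similarity` are illustrative and are not part of the `polars_strsim` API.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep a match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance onto [0.0, 1.0]; assumed normalization, see above."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings are treated as maximally similar.
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_similarity("phillips", "philips"))
```

Under these assumptions, `"phillips"` and `"philips"` differ by one deletion over a longest length of 8, which gives 1 - 1/8.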

## Installing the Library

### With pip

```bash
pip install polars-strsim
```

### From Source

To build and install this library from source, first ensure you have [cargo](https://doc.rust-lang.org/cargo/getting-started/installation.html) installed. You will also need maturin, which you can install with `pip install 'maturin[patchelf]'`.

You can then install polars-strsim into your current Python environment by running `maturin develop --release` from the root of the repository.

## Using the Library

**Input:**

```python
import polars as pl
from polars_strsim import levenshtein, jaro, jaro_winkler, jaccard, sorensen_dice

df = pl.DataFrame(
    {
        "name_a": ["phillips", "phillips", ""        , "", None      , None],
        "name_b": ["phillips", "philips" , "phillips", "", "phillips", None],
    }
).with_columns(
    levenshtein=levenshtein("name_a", "name_b"),
    jaro=jaro("name_a", "name_b"),
    jaro_winkler=jaro_winkler("name_a", "name_b"),
    jaccard=jaccard("name_a", "name_b"),
    sorensen_dice=sorensen_dice("name_a", "name_b"),
)

with pl.Config(ascii_tables=True):
    print(df)
```

**Output:**

```
shape: (6, 7)
+----------+----------+-------------+----------+--------------+---------+---------------+
| name_a   | name_b   | levenshtein | jaro     | jaro_winkler | jaccard | sorensen_dice |
| ---      | ---      | ---         | ---      | ---          | ---     | ---           |
| str      | str      | f64         | f64      | f64          | f64     | f64           |
+=======================================================================================+
| phillips | phillips | 1.0         | 1.0      | 1.0          | 1.0     | 1.0           |
| phillips | philips  | 0.875       | 0.958333 | 0.975        | 0.875   | 0.933333      |
|          | phillips | 0.0         | 0.0      | 0.0          | 0.0     | 0.0           |
|          |          | 1.0         | 1.0      | 1.0          | 1.0     | 1.0           |
| null     | phillips | null        | null     | null         | null    | null          |
| null     | null     | null        | null     | null         | null    | null          |
+----------+----------+-------------+----------+--------------+---------+---------------+
```
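
The `jaccard` and `sorensen_dice` values for `"phillips"` vs. `"philips"` can be reproduced by hand if each string is treated as a multiset (bag) of characters. The sketch below is an inference from the output above, not a description of the package's Rust code. The helper names are hypothetical, and how the library tokenizes strings (for example characters vs. bytes) is an assumption here.

```python
from collections import Counter


def _shared_count(a: str, b: str) -> int:
    """Size of the multiset intersection of the characters in a and b."""
    return sum((Counter(a) & Counter(b)).values())


def jaccard_similarity(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over character multisets (assumed tokenization)."""
    shared = _shared_count(a, b)
    union = len(a) + len(b) - shared
    # Assumption: two empty strings are treated as maximally similar.
    return 1.0 if union == 0 else shared / union


def sorensen_dice_similarity(a: str, b: str) -> float:
    """2 * |A ∩ B| / (|A| + |B|) over character multisets."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2 * _shared_count(a, b) / total


print(jaccard_similarity("phillips", "philips"))
print(round(sorensen_dice_similarity("phillips", "philips"), 6))
```

The two strings share 7 characters, so the sketch yields 7/8 for Jaccard and 14/15 for Sørensen-Dice, in line with the second row of the table. The sketch does not model null handling; in the table, a null in either input column produces null for every measure.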

