Metadata-Version: 2.4
Name: mi-chainlink
Version: 0.0.9
Summary: A flexible record linkage framework that enables matching between multiple datasets using both exact and fuzzy matching techniques.
Author-email: "Mansueto Institute,Austin Steinhart" <asteinhart3@gmail.com>
Project-URL: Homepage, https://mansueto-institute.github.io/mi-chainlink/
Project-URL: Repository, https://github.com/mansueto-institute/mi-chainlink
Project-URL: Documentation, https://mansueto-institute.github.io/mi-chainlink/
Keywords: python
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <4.0,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: duckdb>=1.2.0
Requires-Dist: jsonschema>=4.23.0
Requires-Dist: numpy>=2.0.2
Requires-Dist: polars>=1.22.0
Requires-Dist: pyarrow>=19.0.0
Requires-Dist: python-levenshtein>=0.26.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: rich>=14.0.0
Requires-Dist: scikit-learn>=1.6.1
Requires-Dist: scipy>=1.15.1
Requires-Dist: sparse-dot-topn>=1.1.5
Requires-Dist: sqlalchemy-mate<=2.0.0.0
Requires-Dist: typer>=0.15.2
Requires-Dist: us>=3.2.0
Requires-Dist: usaddress==0.5.11
Requires-Dist: usaddress-scourgify>=0.6.0
Requires-Dist: uszipcode>=1.0.1
Dynamic: license-file

# mi-chainlink

A flexible framework for entity resolution and record linkage that uses DuckDB as its database engine. It builds on [Who Owns Chicago](https://github.com/mansueto-institute/who-owns-chi/) by the [Mansueto Institute for Urban Innovation](https://miurban.uchicago.edu/), including the work of [Kevin Bryson](https://github.com/cmdkev), [Ana (Anita) Restrepo Lachman](https://github.com/anitarestrepo16), [Caitlin P.](https://github.com/CaitlinCP), [Joaquin Pinto](https://github.com/joaquinpinto), and [Divij Sinha](https://github.com/divij-sinha).

This package enables you to load data from various sources, clean and standardize entity names and addresses, and create links between entities based on exact and fuzzy matching techniques.

Source: [https://github.com/mansueto-institute/mi-chainlink](https://github.com/mansueto-institute/mi-chainlink)

Documentation: [https://mansueto-institute.github.io/mi-chainlink/](https://mansueto-institute.github.io/mi-chainlink/)

Issues: [https://github.com/mansueto-institute/mi-chainlink/issues](https://github.com/mansueto-institute/mi-chainlink/issues)

## Overview

This framework helps you solve the entity resolution problem by:

1. Loading data from multiple sources into a DuckDB database
2. Cleaning and standardizing entity names and addresses
3. Creating exact matches between entities based on names and addresses
4. Generating fuzzy matches using TF-IDF similarity
5. Exporting the resulting linked data for further analysis

The system is designed to be configurable through YAML files and supports incremental updates to an existing database.
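
To make steps 2 to 4 of the list above concrete, here is a minimal sketch of name cleaning, exact matching, and TF-IDF fuzzy matching. It is not the package's implementation: the package depends on scikit-learn and sparse-dot-topn and runs inside DuckDB, whereas this sketch uses only the standard library. The cleaning rules, the character-trigram tokenization, the sample names, and the `THRESHOLD` value are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

THRESHOLD = 0.6  # hypothetical similarity cutoff, not a package default


def clean_name(name: str) -> str:
    """Uppercase, replace punctuation with spaces, collapse whitespace (illustrative rules)."""
    name = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    return re.sub(r"\s+", " ", name).strip()


def trigrams(text: str) -> list[str]:
    """Split a string into overlapping character trigrams, padded with one space per side."""
    padded = f" {text} "
    return [padded[i : i + 3] for i in range(len(padded) - 2)]


def tfidf_vectors(names: list[str]) -> list[dict[str, float]]:
    """Build L2-normalized TF-IDF vectors over character trigrams."""
    docs = [Counter(trigrams(n)) for n in names]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF: trigrams shared by many names get a lower weight.
        vec = {
            gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
            for gram, tf in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(w * b[gram] for gram, w in a.items() if gram in b)


# Made-up sample records.
raw = ["Acme Holdings, LLC", "ACME HOLDINGS LLC", "Acme Holdngs LLC", "Lakeview Properties Inc."]
names = [clean_name(n) for n in raw]
vectors = tfidf_vectors(names)

exact_matches = []  # pairs whose cleaned names are identical
fuzzy_matches = []  # pairs that differ but score at or above THRESHOLD
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if names[i] == names[j]:
            exact_matches.append((i, j))
        else:
            score = cosine(vectors[i], vectors[j])
            if score >= THRESHOLD:
                fuzzy_matches.append((i, j, score))

print("exact:", exact_matches)
print("fuzzy:", fuzzy_matches)
```

Comparing every pair is quadratic in the number of records, which is acceptable for a toy list but not for large datasets; that is the kind of cost a sparse top-n similarity search is meant to avoid.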

## Installation

The package is available on PyPI. You can install it with pip or uv:

```bash
pip install mi-chainlink
```

```bash
uv add mi-chainlink
```

## Usage

### Command Line Interface

```bash
# Run interactive session
chainlink

# Run with path to config yaml
chainlink path/to/config.yaml
```

### Python Function

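```python
# Run as a Python function
from pathlib import Path

from chainlink import chainlink

# Signature: chainlink(config: dict, config_path: str | Path = DIR / "configs/config.yaml")
config: dict = {}  # placeholder only: replace with a dict holding your config details

chainlink(
    config=config,  # dict with config details
    config_path=Path("configs/config.yaml"),  # example path where the dict is stored after processing
)
```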
