Metadata-Version: 2.4
Name: emm
Version: 2.1.11
Summary: Entity Matching Model package
Author-email: Max Baak <max.baak@ing.com>, Stephane Collot <stephane.collot@gmail.com>, Apoorva Mahajan <apoorva.mahajan@ing.com>, Tomasz Waleń <tomasz.walen@ing.com>, Simon Brugman <simon.brugman@ing.com>
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: setuptools<81
Requires-Dist: numpy>=1.20.1
Requires-Dist: scipy
Requires-Dist: scikit-learn<1.6.0
Requires-Dist: pandas!=1.5.0,>=1.1.0
Requires-Dist: jinja2
Requires-Dist: rapidfuzz<3.0.0
Requires-Dist: regex
Requires-Dist: urllib3
Requires-Dist: recordlinkage
Requires-Dist: cleanco>=2.2
Requires-Dist: xgboost
Requires-Dist: sparse-dot-topn>=1.1.1
Requires-Dist: joblib
Requires-Dist: pyarrow>=6.0.1
Requires-Dist: requests
Provides-Extra: spark
Requires-Dist: pyspark>=3.1; python_version <= "3.11" and extra == "spark"
Provides-Extra: dev
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: gitpython; extra == "dev"
Requires-Dist: nbconvert; extra == "dev"
Requires-Dist: jupyter_client>=5.2.3; extra == "dev"
Requires-Dist: ipykernel>=5.1.3; extra == "dev"
Requires-Dist: matplotlib; extra == "dev"
Requires-Dist: pygments; extra == "dev"
Requires-Dist: pandoc; extra == "dev"
Requires-Dist: pympler; extra == "dev"
Provides-Extra: preprocessing
Requires-Dist: unidecode; extra == "preprocessing"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-ordering; extra == "test"
Requires-Dist: virtualenv; extra == "test"
Requires-Dist: unidecode; extra == "test"
Provides-Extra: test-cov
Requires-Dist: coverage; extra == "test-cov"
Requires-Dist: pytest-cov; extra == "test-cov"
Provides-Extra: test-bench
Requires-Dist: pytest-benchmark; extra == "test-bench"
Provides-Extra: test-notebook
Requires-Dist: pytest-notebook>=0.6.1; extra == "test-notebook"
Requires-Dist: ipykernel>=5.1.3; extra == "test-notebook"
Requires-Dist: matplotlib; extra == "test-notebook"
Requires-Dist: nbdime<4; extra == "test-notebook"
Provides-Extra: doc
Requires-Dist: matplotlib; extra == "doc"
Requires-Dist: seaborn; extra == "doc"
Requires-Dist: sphinx; extra == "doc"
Requires-Dist: sphinx-material; extra == "doc"
Requires-Dist: furo; extra == "doc"
Requires-Dist: sphinx-copybutton; extra == "doc"
Requires-Dist: sphinx-autodoc-typehints; extra == "doc"
Requires-Dist: jupyter_contrib_nbextensions; extra == "doc"
Requires-Dist: nbstripout; extra == "doc"
Requires-Dist: nbsphinx; extra == "doc"
Requires-Dist: nbsphinx-link; extra == "doc"
Requires-Dist: ipywidgets; extra == "doc"
Requires-Dist: jinja2; extra == "doc"
Requires-Dist: jinja-cli; extra == "doc"
Requires-Dist: markupsafe; extra == "doc"
Requires-Dist: pandoc; extra == "doc"
Requires-Dist: jupyter_client>=5.2.3; extra == "doc"
Requires-Dist: myst_parser; extra == "doc"
Dynamic: license-file

# Entity Matching model

[![Build](https://github.com/ing-bank/EntityMatchingModel/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/ing-bank/EntityMatchingModel/actions)
[![Latest Github release](https://img.shields.io/github/v/release/ing-bank/EntityMatchingModel)](https://github.com/ing-bank/EntityMatchingModel/releases)
[![GitHub release date](https://img.shields.io/github/release-date/ing-bank/EntityMatchingModel)](https://github.com/ing-bank/EntityMatchingModel/releases)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v1.json)](https://github.com/astral-sh/ruff)
[![Downloads](https://static.pepy.tech/badge/emm)](https://pepy.tech/project/emm)


Entity Matching Model (EMM) solves the problem of matching company names between two possibly very
large datasets. EMM can match millions of names against millions of names with a distributed approach.
It uses well-established candidate selection techniques from string matching,
namely TF-IDF vectorization combined with cosine similarity (with significant optimization),
both word-based and character-based, and sorted neighbourhood indexing.
These so-called indexers complement each other in selecting realistic name-pair candidates.
On top of the indexers, EMM has a classifier with optimized string-based, rank-based, and
legal-entity-based features to estimate how confident a company name match is.
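
To make the candidate selection idea concrete, here is a minimal pure-Python sketch of character-based TF-IDF with cosine similarity. It is not EMM's optimized implementation; the helper names (`char_ngrams`, `fit_idf`, `tfidf_vector`, `top_candidates`), the smoothed IDF formula, and the example names are assumptions made for illustration only.

```python
import math
from collections import Counter


def char_ngrams(name, n=2):
    """Split a lower-cased name into overlapping character n-grams."""
    text = name.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(names, n=2):
    """Compute a smoothed inverse document frequency per n-gram (assumed formula)."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_ngrams(name, n)))
    total = len(names)
    return {gram: math.log((1 + total) / (1 + df)) + 1.0 for gram, df in doc_freq.items()}


def tfidf_vector(name, idf, n=2):
    """Build an L2-normalised TF-IDF vector; n-grams unseen during fitting are ignored."""
    counts = Counter(g for g in char_ngrams(name, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two normalised sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def top_candidates(name, ground_truth, idf, num_candidates=5, lower_bound=0.2):
    """Return up to num_candidates (score, name) pairs with score >= lower_bound."""
    query = tfidf_vector(name, idf)
    scored = [(cosine(query, tfidf_vector(gt, idf)), gt) for gt in ground_truth]
    kept = [(score, gt) for score, gt in scored if score >= lower_bound]
    return sorted(kept, reverse=True)[:num_candidates]


# Hypothetical example data.
ground_truth = ["Apple Inc", "Alphabet Inc", "Tesla Motors Ltd", "Tesco Plc"]
idf = fit_idf(ground_truth)
candidates = top_candidates("Tesla Motor Limited", ground_truth, idf)
print(candidates)
```

The sketch compares the query against every ground-truth name, which is only workable for small inputs. At the scale described above, a sparse matrix product that keeps only the top candidates per row would be needed.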

The classifier can be trained to give a string similarity score or a probability of match.
Both types of score are useful, in particular when there are many good-looking matches to choose between.
Optionally, the EMM package can also match a group of company names that belong together
(for example, all the different names used to address one external bank account)
to a common company name in the ground truth.
This step aggregates the name-matching scores from the supervised layer into a single match.
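
As an illustration of what aggregating name-level scores into one match can look like, the sketch below uses a frequency-weighted mean per ground-truth candidate. This is one possible aggregation rule, assumed for the example; the aggregation method that EMM actually applies may differ, and the function name, input layout, and data are hypothetical.

```python
from collections import defaultdict


def aggregate_scores(name_counts, scored_pairs):
    """Pick one ground-truth candidate for a group of names.

    name_counts: dict mapping each name in the group to how often it was used.
    scored_pairs: list of (name, gt_id, score) candidate pairs for that group.
    Returns (best_gt_id, aggregated_score), or None if nothing can be aggregated.
    """
    total = sum(name_counts.values())
    if total == 0:
        return None
    weighted = defaultdict(float)
    for name, gt_id, score in scored_pairs:
        # A name without a pair for a given candidate implicitly contributes 0.
        weighted[gt_id] += name_counts[name] * score
    if not weighted:
        return None
    best_id = max(weighted, key=weighted.get)
    return best_id, weighted[best_id] / total


# Hypothetical group: three spellings used to address the same account.
name_counts = {"ING Bank NV": 3, "ING Bank": 1, "I.N.G.": 1}
scored_pairs = [
    ("ING Bank NV", "gt-1", 0.95),
    ("ING Bank", "gt-1", 0.80),
    ("ING Bank", "gt-2", 0.40),
    ("I.N.G.", "gt-2", 0.30),
]
best = aggregate_scores(name_counts, scored_pairs)
print(best)
```

Weighting by frequency lets the spelling that is used most often dominate the group decision, so a single noisy variant is less likely to flip the match.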

The package is modular in design and works with both Pandas and Spark. A classifier trained with the former
can be used with the latter and vice versa.

For release history see [GitHub Releases](https://github.com/ing-bank/EntityMatchingModel/releases).

## Notebooks

For detailed examples of the code please see the notebooks under `notebooks/`.

- `01-entity-matching-pandas-version.ipynb`: Using the Pandas version of EMM for name-matching.
- `02-entity-matching-spark-version.ipynb`: Using the Spark version of EMM for name-matching.
- `03-entity-matching-training-pandas-version.ipynb`: Fitting the supervised model and setting a discrimination threshold (Pandas).
- `04-entity-matching-aggregation-pandas-version.ipynb`: Using the aggregation layer and setting a discrimination threshold (Pandas).

## Documentation

For documentation, design, and API see [the documentation](https://entitymatchingmodel.readthedocs.io/en/latest/).
Or read our Medium blog [Entity Matching at Scale!](https://medium.com/p/af20429a80c7)

## Check it out

The Entity Matching Model library requires Python >= 3.8 and is pip friendly. To get started, simply do:

```shell
pip install emm
```

or check out the code from our repository:

```shell
git clone https://github.com/ing-bank/EntityMatchingModel.git
pip install -e EntityMatchingModel/
```

In this example the code is installed in editable mode (option `-e`).

Additional dependencies can be installed with, e.g.:

```shell
pip install "emm[spark,dev,test]"
```
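
To see which optional dependencies ended up in the current environment, a small check like the following can help. It is a sketch using only the standard library; the mapping from extra name to import name is an assumption written out by hand, and it covers just two of the extras. Note that the package metadata only pulls in `pyspark` for Python versions up to 3.11.

```python
import importlib.util


def has_module(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Assumed mapping of extra name -> import name of one package it installs.
optional_modules = {"spark": "pyspark", "preprocessing": "unidecode"}
available = {extra: has_module(mod) for extra, mod in optional_modules.items()}
print(available)
```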

You can now use the package in Python with:


```python
import emm
```

**Congratulations, you are now ready to use the Entity Matching model!**

## Quick run

As a quick example, you can do:

```python
from emm import PandasEntityMatching
from emm.data.create_data import create_example_noised_names

# generate example ground-truth names and matching noised names, with typos and missing words.
ground_truth, noised_names = create_example_noised_names(random_seed=42)
train_names, test_names = noised_names[:5000], noised_names[5000:]

# two example name-pair candidate generators: character-based cosine similarity and sorted neighbourhood indexing
indexers = [
  {
      'type': 'cosine_similarity',
      'tokenizer': 'characters',   # character-based cosine similarity. alternative: 'words'
      'ngram': 2,                  # 2-character tokens only
      'num_candidates': 5,         # max 5 candidates per name-to-match
      'cos_sim_lower_bound': 0.2,  # lower bound on cosine similarity
  },
  {'type': 'sni', 'window_length': 3}  # sorted neighbourhood indexing window of size 3.
]
em_params = {
  'name_only': True,         # only consider name information for matching
  'entity_id_col': 'Index',  # important to set both index and name columns to pick up
  'name_col': 'Name',
  'indexers': indexers,
  'supervised_on': False,    # no supervised model (yet) to select best candidates
  'with_legal_entity_forms_match': True,   # add feature that indicates match of legal entity forms (e.g. ltd != co)
}
# 1. initialize the entity matcher
p = PandasEntityMatching(em_params)

# 2. fitting: prepare the indexers based on the ground truth names, e.g. fit the tfidf matrix of the first indexer.
p.fit(ground_truth)

# 3. create and fit a supervised model for the PandasEntityMatching object, to pick the best match (this takes a while)
#    input is "positive" names column 'Name' that are all supposed to match to the ground truth,
#    and an id column 'Index' to check which candidate name-pairs match and which do not.
#    A fraction of these names may be turned into negative names (= no match to the ground truth).
#    (internally, candidate name-pairs are automatically generated, these are the input to the classification)
p.fit_classifier(train_names, create_negative_sample_fraction=0.5)

# 4. scoring: generate pandas dataframe of all name-pair candidates.
#    The classifier-based probability of match is provided in the column 'nm_score'.
#    Note: can also call p.transform() without training the classifier first.
candidates_scored_pd = p.transform(test_names)

# 5. scoring: for each name-to-match, select the best ground-truth candidate.
best_candidates = candidates_scored_pd[candidates_scored_pd.best_match]
best_candidates.head()
```

For Spark, you can use the class `SparkEntityMatching` instead, with the same API as the Pandas version.
For all available examples, please see the tutorial notebooks under `notebooks/`.
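
The `sni` indexer in the quick-run configuration stands for sorted neighbourhood indexing. The sketch below shows the general idea in plain Python: sort all names alphabetically and pair each name-to-match with the ground-truth names that fall inside a small window around it. The function name is hypothetical, and the reading of `window_length` as a window centred on the name (so 3 means one neighbour on each side) is an assumption; EMM's own implementation may treat it differently.

```python
def sni_candidates(names_to_match, ground_truth, window_length=3):
    """Pair each name-to-match with ground-truth names that sort next to it."""
    half = (window_length - 1) // 2  # neighbours considered on each side (assumption)
    # Tag every entry with its origin, then sort everything alphabetically.
    tagged = [(gt.lower(), "gt", gt) for gt in ground_truth]
    tagged += [(nm.lower(), "query", nm) for nm in names_to_match]
    tagged.sort()
    pairs = set()
    for pos, (_, origin, name) in enumerate(tagged):
        if origin != "query":
            continue
        lo, hi = max(0, pos - half), min(len(tagged), pos + half + 1)
        for _, other_origin, other in tagged[lo:hi]:
            if other_origin == "gt":
                pairs.add((name, other))
    return sorted(pairs)


# Hypothetical example data.
ground_truth = ["Alphabet Inc", "Apple Inc", "Tesco Plc", "Tesla Motors Ltd"]
pairs = sni_candidates(["Tesla Motor Ltd", "Aple Inc"], ground_truth)
print(pairs)
```

Because sorting is sensitive to the first characters of a name, this approach finds names with errors near the end but misses names with a typo at the start. That is one reason to combine it with a cosine-similarity indexer, as the quick-run example does.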

## Project contributors

This package was authored by ING Analytics Wholesale Banking.

## Contact and support

Contact the WBAA team via GitHub issues.
Please note that INGA-WB provides support only on a best-effort basis.

## License

Copyright ING WBAA 2023. Entity Matching Model is completely free, open-source and licensed under the [MIT license](https://en.wikipedia.org/wiki/MIT_License).
