Metadata-Version: 2.4
Name: easybiz-companyname-normalisation
Version: 1.0.0
Summary: Canonical company-name normalisation, shared byte-for-byte across EasyBiz services (L-MDS-CR-08).
Project-URL: Homepage, https://github.com/EasyBizIO/easybiz-companyname-normalisation
Project-URL: Repository, https://github.com/EasyBizIO/easybiz-companyname-normalisation
Author: EasyBiz
License: Proprietary
License-File: LICENSE
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Provides-Extra: test
Requires-Dist: pytest>=8.0; extra == 'test'
Description-Content-Type: text/markdown

# easybiz-companyname-normalisation

Canonical company-name normalisation, shared **byte-for-byte** by the MDS
company-resolver and accounting-service (locked decision **L-MDS-CR-08**).

It is a **library, not a service**: `normalise_name()` is a pure, deterministic
function on the resolution hot path. A service would add a network hop and an
availability dependency for zero benefit.

```python
from easybiz_companyname_normalisation import normalise_name, NORMALISER_VERSION

normalise_name("ACME S.à r.l.")   # -> "acme sarl"
NORMALISER_VERSION                  # -> "1.0.0"
```

## Why this is a shared package and not copy-pasted logic

Both services store **normalised values** (`EntitySynonym.normalised_value`,
embedding source text). Those stored values are a **derived cache**; the
**raw value is the source of truth**. If the two services normalised differently,
a name stored by one would silently fail to match a query from the other. So the
logic lives in exactly one place and both import it.

## Versioning policy (the important part)

| Bump | When | Consequence |
|---|---|---|
| **MAJOR** | `normalise_name()` output changes for *any* input (e.g. adding a legal form) | Requires a **coordinated re-normalise backfill** of stored values from `raw_value`, and a re-embed, in MDS. Both services upgrade in lockstep. |
| **MINOR / PATCH** | tests, perf, docs, *adding* fixture rows | Must **never** change output. |

`NORMALISER_VERSION` equals the package version. MDS persists it alongside each
stored value (`EntitySynonym.normaliser_version`, mirroring
`EntityEmbedding.embedding_model`) so stale rows are detectable; a
`renormalise_stale` management command re-derives, from `raw_value`, every row
whose stored version differs from the current one. **Version skew only ever
degrades a match to a _miss_ (→ human review), never to a _wrong match_**, which
is the safe failure direction.
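
The stale-row re-derivation described above can be sketched as follows. This is
a minimal illustration, not the actual `renormalise_stale` command:
`SynonymRow` is a hypothetical in-memory stand-in for an `EntitySynonym` row,
and the normaliser and current version are passed in as parameters so the
sketch runs without the package installed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class SynonymRow:
    # Hypothetical stand-in for an EntitySynonym row (not the real model).
    raw_value: str
    normalised_value: str
    normaliser_version: str


def renormalise_stale(
    rows: Iterable[SynonymRow],
    normalise: Callable[[str], str],
    current_version: str,
) -> int:
    """Re-derive every row stamped with a different version; return the count."""
    updated = 0
    for row in rows:
        if row.normaliser_version == current_version:
            continue  # already derived by the current normaliser
        # Re-derive from the raw value (source of truth), never from the cache.
        row.normalised_value = normalise(row.raw_value)
        row.normaliser_version = current_version
        updated += 1
    return updated
```

In a real consumer the arguments would presumably be `normalise_name` and
`NORMALISER_VERSION`, with the rows coming from a database query rather than a
list.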

## The parity contract

`tests/test_normalisation.py` runs against `fixtures/lu_names.csv`, which ships
**inside** the package. Running this test in each consumer's CI proves that
consumer uses the canonical behaviour for the version it pinned.

**Never edit an existing `expected` value without a MAJOR bump** — that is, by
definition, an output change. *Adding* rows is fine (MINOR) and is the encouraged
way to harden coverage as real LU supplier-name shapes surface. The 20 starter
rows are illustrative, not a target; there is no requirement to reach any count.
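
As an illustration of the fixture-driven check, the sketch below compares a
normaliser against CSV rows and reports every disagreement. It is not the
contents of `tests/test_normalisation.py`: the `raw` column name is an
assumption (only `expected` is named above), and the fixture text and the
`str.lower` normaliser are toy stand-ins.

```python
import csv
import io
from typing import Callable, TextIO


def find_parity_failures(
    fixture: TextIO, normalise: Callable[[str], str]
) -> list[tuple[str, str, str]]:
    """Return (raw, expected, actual) for every fixture row that disagrees."""
    failures = []
    for row in csv.DictReader(fixture):
        actual = normalise(row["raw"])  # "raw" is an assumed column name
        if actual != row["expected"]:
            failures.append((row["raw"], row["expected"], actual))
    return failures


# Toy fixture and toy normaliser, for illustration only.
sample = io.StringIO("raw,expected\nACME,acme\nBeta Corp,beta corp\n")
assert find_parity_failures(sample, str.lower) == []
```

Collecting all failures, rather than stopping at the first, makes it easier to
see whether a mismatch is a single bad row or a wholesale version skew.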

## Installation

Distribution is **deliberately deferred** while the project is in solo
development. The import path is identical across all options below; only the
install source changes.

- **Now (solo inner loop):** clone next to the service repos and install editable.
  Both services point at one working copy; edits are picked up instantly.
  ```shell
  pip install -e ../easybiz-companyname-normalisation
  ```
- **When CI or a teammate appears — switch to a pinned git tag:**
  ```
  pip install git+https://github.com/<org>/easybiz-companyname-normalisation@v1.0.0
  ```
  Pin a **tag, not a branch** — a branch ref lets two installs drift, which breaks
  the version-sync contract.
- **Target later (optional):** publish to an AWS CodeArtifact private PyPI
  repository. Build and upload with
  `python -m build && twine upload --repository codeartifact dist/*`; consumers
  authenticate via `aws codeartifact login --tool pip ...` and pin `==1.0.0`.

## Scope

Company-name normalisation only. Identifier checksum validators (IBAN/VAT/RCS)
stay accounting-side (the resolver trusts pre-validated input); they are **not**
part of this package.

## TODO (deferred, not yet built)

- **Cross-service CI parity guard** — once both services have CI, add a check that
  fails the build if MDS and accounting pin *different* `easybiz-companyname-normalisation`
  versions. Not needed while both use one editable install.
