Metadata-Version: 2.4
Name: idri
Version: 2.0.0rc2603130931.dev0
Summary: Deterministic approximate vectorization for short identifiers and labels
License-Expression: MPL-2.0
Project-URL: Repository, https://github.com/gocova/idri_py
Project-URL: Issues, https://github.com/gocova/idri_py/issues
Project-URL: Funding, https://buymeacoffee.com/gocova
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Dynamic: license-file

# idri — deterministic approximate vectorization for short identifiers and labels
[![Buy Me a Coffee](https://img.shields.io/badge/Buy%20Me%20a%20Coffee-Support-orange?logo=buy-me-a-coffee&style=flat-square)](https://buymeacoffee.com/gocova)

`idri` is a tiny, low-level Python vectorization engine for short, noisy identifiers and labels.

It is designed for reproducible approximate matching over short strings, and for composing multiple identifier signals into a single vector before retrieval.

## What It Is Good At

- Short business identifiers
- Names and tags
- Prefix/suffix-heavy codes
- OCR-truncated fields
- Partial multiword labels

Examples:

- `ACME INDUSTRIAL SA DE CV` vs `ACME INDUSTRIAL`
- `INV-2024-001-03` vs `INV-2024-001`
- `TOPER PULIDORA 4 1/2` vs `TOPER PULIDORA 4 1 2`

## What It Is Not For

- Semantic similarity
- Long natural-language embeddings
- Full entity resolution workflows
- Business-rule matching engines

## Determinism and Distributed Use

`idri` is deterministic by construction: the same normalized input, profile, family, dimension, and seed should produce the same vector, given the same library version.

That makes it distributed-friendly for systems that need reproducible encoding across workers, services, or edge nodes. This is a reproducibility property, not yet a fully validated distributed-systems guarantee.
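
This kind of reproducibility rests on using a seeded, stable hash rather than Python's per-process randomized `hash()`. A minimal sketch of the idea (illustrative only, not `idri`'s actual hashing scheme; `stable_bucket` and its parameters are invented for this example):

```python
import hashlib

def stable_bucket(token: str, dim: int = 2048, seed: int = 0) -> int:
    """Map a token to a fixed bucket index, identically on every worker."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(16, "little"),  # blake2b accepts up to 16 salt bytes
    ).digest()
    return int.from_bytes(digest, "little") % dim

# Same input + seed -> same bucket, across processes and machines.
assert stable_bucket("acme") == stable_bucket("acme")
```

Because nothing depends on process state or insertion order, two workers that agree on the encoder settings agree on every bucket.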

## Use Cases

`idri` is strongest when the problem is short-string matching, candidate generation, clustering, or blocking rather than semantic understanding.

- E-commerce and retail catalog deduplication for messy product titles, vendor labels, and SKU-like strings
- Procurement and ERP item crosswalks across supplier catalogs and internal part masters
- Invoice and payment-reference matching for noisy parent/child identifiers
- Marketplace or PIM listing clustering before rules or human review
- Media tag and label search over large collections
- CRM and analytics normalization for campaign names, UTM values, and source labels
- Fraud and payments clustering for noisy merchant descriptors
- IoT and edge-device matching where local deterministic encoding matters more than a cloud model

## Why This Exists Instead of Just Fuzzy Matching or TF-IDF

Compared with classic fuzzy matching:

- vectors can be precomputed and indexed once instead of rescoring every candidate pair
- weighted composition lets you match multi-field records in one ANN search instead of several rule stages
- word plus character n-grams preserve robustness to OCR truncation, token reordering, and small textual variation

Compared with TF-IDF:

- no vocabulary fitting at runtime
- no corpus dependency
- same output dimension always
- easier ANN integration
- easier deployment
- simpler persistence

Additional practical advantages:

- deterministic output for the same input under the same encoder settings
- can be used online with zero fitting
- easy to port to Rust later
- compact fixed-size vectors
- supports lightweight family-aware encoding
- supports weighted composition of multiple identifier signals into one query vector

## Complex Matching in One Pass

Weighted composition is a practical advantage, not just an API feature.

Example:

- specific invoice id: `x2437-1`
- parent invoice family: `x2437`

You can encode both and compose them into one retrieval vector:

```python
from idri import IdentifierEncoder, compose_texts, compose_vectors

enc = IdentifierEncoder()

specific = enc.encode("x2437-1", family="invoice")
parent = enc.encode("x2437", family="invoice")

query = compose_vectors([
    (specific, 0.7),
    (parent, 0.3),
])

query_fast = compose_texts(
    enc,
    [("x2437-1", 0.7), ("x2437", 0.3)],
    family="invoice",
)
```

That lets a downstream ANN index perform one search over a vector that captures both the specific invoice and its family context.

The same pattern works for provider or merchant names. Word features keep strong token identity, while character n-grams add protection against OCR truncation, spacing variation, and partial suffix loss.
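
For intuition, weighted composition can be pictured as an l2-normalized weighted sum of unit vectors. The sketch below is plain numpy, not the library's code, and only illustrates the shape-and-norm behavior you can expect from `compose_vectors`:

```python
import numpy as np

def l2norm(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v / n if n else v

def compose(pairs: list[tuple[np.ndarray, float]]) -> np.ndarray:
    """Weighted sum of unit vectors, re-normalized to unit length."""
    acc = sum(w * l2norm(v) for v, w in pairs)
    return l2norm(acc).astype(np.float32)

v1 = l2norm(np.random.default_rng(0).normal(size=8))
v2 = l2norm(np.random.default_rng(1).normal(size=8))
q = compose([(v1, 0.7), (v2, 0.3)])  # unit-length float32 query vector
```

The composed vector stays in the same space as its inputs, which is why a single ANN search over it can reflect both signals at once.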

## Install

```bash
pip install idri
```

For development inside the repository, use `uv sync` instead (see Development below).

## Quick Start

```python
from idri import IdentifierEncoder, normalize_text

enc = IdentifierEncoder(profile="word_ngram")

vec = enc.encode("ACME Industrial SA de CV")
score = enc.similarity("ACME Industrial", "ACME Industrial SA de CV")
explanation = enc.explain("ACME Industrial", "ACME Industrial SA de CV")

print(normalize_text("  ACME   Industrial SA de CV  "))
print(vec.shape)              # (2048,)
print(type(vec).__name__)     # ndarray
print(score)
print(explanation)
```

## Family-Aware Encoding

```python
from idri import IdentifierEncoder

enc = IdentifierEncoder()
vec = enc.encode("INV-2024-001-03", family="invoice")
score = enc.similarity("INV-2024-001", "INV-2024-001-03", family="invoice")
```

If `family=None`, `idri` uses `generic`.

## Composition Is First-Class

```python
from idri import (
    IdentifierEncoder,
    compose_texts,
    compose_vectors,
    cosine_similarity,
    l2_distance,
)

enc = IdentifierEncoder()

v1 = enc.encode("F0032-3", family="invoice")
v2 = enc.encode("F0032", family="invoice")

composed = compose_vectors([(v1, 0.6), (v2, 0.4)])
composed_from_text = compose_texts(enc, [("F0032-3", 0.6), ("F0032", 0.4)], family="invoice")

composer = enc.start_composer()
composer.add_text("F0032-3", weight=0.6, family="invoice")
composer.add_text("F0032", weight=0.4, family="invoice")
incremental = composer.build()

cos_score = cosine_similarity(composed, incremental)
euclidean_gap = l2_distance(composed, incremental)
```

`compose_texts(...)` is the short path for "encode these weighted texts and combine them once". `VectorComposer` remains the better fit when you need incremental building or explainability.

## Retrieval Helpers

```python
from idri import IdentifierEncoder, cosine_similarity_matrix, topk_by_similarity

enc = IdentifierEncoder()
query = enc.encode("invoice 001")
candidates = enc.batch_encode(["invoice 001", "invoice 002", "hammer"])

scores = cosine_similarity_matrix(query, candidates)
indices, top_scores = topk_by_similarity(query, candidates, k=2)
```
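
The semantics of `topk_by_similarity` can be sketched in plain numpy as an argsort over dot products of unit vectors (illustrative only; the real helper may differ in tie-breaking and dtype handling):

```python
import numpy as np

def topk(query: np.ndarray, candidates: np.ndarray, k: int):
    """Indices and scores of the k best candidate rows by cosine.

    Assumes the query and each candidate row are already l2-normalized,
    so the dot product equals cosine similarity.
    """
    scores = candidates @ query
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

query = unit(np.array([1.0, 0.0, 1.0]))
cands = np.stack([
    unit(np.array([1.0, 0.0, 1.0])),   # exact match
    unit(np.array([1.0, 0.2, 1.0])),   # near match
    unit(np.array([0.0, 1.0, 0.0])),   # unrelated
])
idx, scores = topk(query, cands, k=2)  # idx -> [0, 1]
```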

## Profiles

- `word`: family + words only
- `word_ngram` (default): family + words + character n-grams
- `word_ngram_position`: same as `word_ngram` with weak positional decay
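
As an illustration of the kind of features these profiles combine (the exact tokenization, n-gram sizes, and weighting are internal to `idri`), a toy word-plus-trigram extractor:

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Sliding-window character n-grams, padded so affixes get features."""
    padded = f"^{text}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def features(text: str) -> set[str]:
    """Lowercased word tokens plus per-word character trigrams."""
    words = text.lower().split()
    grams = [g for w in words for g in char_ngrams(w)]
    return set(words) | set(grams)

# Shared trigrams keep 'INV-2024-001' close to 'INV-2024-001-03'
# even though the word tokens differ.
overlap = features("INV-2024-001") & features("INV-2024-001-03")
```

Word features carry exact-token identity, while the character trigrams are what survive truncation and small spelling or spacing changes.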

## API Summary

- `IdentifierEncoder.encode(...) -> numpy.ndarray`
- `IdentifierEncoder.batch_encode(...) -> numpy.ndarray`
- `IdentifierEncoder.similarity(...) -> float`
- `IdentifierEncoder.start_composer(...) -> VectorComposer`
- `IdentifierEncoder.explain(...) -> dict`
- `compose_texts(...) -> numpy.ndarray`
- `compose_vectors(...) -> numpy.ndarray`
- `cosine_similarity(...) -> float`
- `cosine_similarity_matrix(...) -> numpy.ndarray`
- `l2_distance(...) -> float`
- `l2_normalize(...) -> numpy.ndarray`
- `available_profiles() -> tuple[str, ...]`
- `normalize_text(...) -> str`
- `topk_by_similarity(...) -> tuple[numpy.ndarray, numpy.ndarray]`

Most helpers accept `numpy.typing.ArrayLike` inputs and return dense `numpy.ndarray` outputs. `encode(...)`, `batch_encode(...)`, `l2_normalize(...)`, and the similarity-matrix helpers return `float32` arrays; `compose_vectors(...)` and `compose_texts(...)` also default to `float32` but still allow an explicit output dtype.

## Exact Cosine vs ANN Search

The exact cosine similarity this library computes over normalized vectors is the reference behavior.

ANN/vector database search runs in the same vector space but may not return bit-identical rankings. Approximate indexes can reorder near-ties and sometimes change lower-ranked results depending on index settings.
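
A useful detail behind this: for unit-length vectors, cosine similarity reduces to a plain dot product, so reference scores are cheap to verify in numpy:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=16)
b = rng.normal(size=16)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# For unit vectors, the full cosine formula collapses to a dot product.
cos = float(a @ b)
full = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
assert abs(cos - full) < 1e-12
```

When ANN results look surprising, recomputing exact dot products over the shortlisted candidates is a quick sanity check.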

## Development

```bash
uv sync
uv run pytest
uv run pytest --cov=idri --cov-report=term-missing
uv run python examples/basic_usage.py
uv run python examples/composition_usage.py
uv build
```

## PyPI Release

This repository is configured to publish `idri` to PyPI through GitHub Actions Trusted Publishing.

Workflow details:

- repository owner: `gocova`
- repository name: `idri_py`
- workflow filename: `publish.yml`
- GitHub environment: `pypi`
- PyPI project name: `idri`

PyPI setup:

1. If `idri` does not exist yet on PyPI, create a pending publisher for project name `idri`.
2. If `idri` already exists on PyPI, add a Trusted Publisher to that project with the workflow details above.
3. Push a version tag such as `v0.1.0` to trigger the release workflow.

The workflow builds the package with `uv build`, runs the test suite, and publishes with `pypa/gh-action-pypi-publish` using GitHub OIDC. No long-lived PyPI API token is required.

## License

This project is open source under the Mozilla Public License 2.0 (`MPL-2.0`).  
See [LICENSE](LICENSE).

## Consulting / Commercial Terms

For custom integration, proprietary licensing without attribution, or high-performance optimization, contact the maintainer for consulting.

## Contributing

Contributions are welcome, but they are subject to the contributor terms in [CLA.md](CLA.md).
See [CONTRIBUTING.md](CONTRIBUTING.md) for workflow and test requirements.
