Metadata-Version: 2.4
Name: omop-emb
Version: 0.4.1
Summary: Embedding extension to omop-graph
Author-email: Nico Loesch <n.loesch@unsw.edu.au>
License-Expression: Apache-2.0
Keywords: LLM-grounding,OHDSI,OMOP,clinical-data,health-informatics,knowledge-graph,sqlalchemy
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Python: >=3.12
Requires-Dist: numpy
Requires-Dist: numpy>=1.26
Requires-Dist: omop-alchemy>=0.5.7
Requires-Dist: openai
Requires-Dist: orm-loader>=0.3.15
Requires-Dist: psycopg2-binary>=2.9.11
Requires-Dist: requests
Requires-Dist: sqlalchemy>=2.0.45
Requires-Dist: typer
Requires-Dist: typing-extensions>=4.15.0
Provides-Extra: all
Requires-Dist: faiss-cpu>=1.8.0; extra == 'all'
Requires-Dist: h5py; extra == 'all'
Requires-Dist: pgvector; extra == 'all'
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.8.0; extra == 'faiss'
Requires-Dist: h5py; extra == 'faiss'
Provides-Extra: pgvector
Requires-Dist: pgvector; extra == 'pgvector'
Description-Content-Type: text/markdown

# omop-emb
Embedding layer for OMOP CDM.

`omop-emb` now separates model metadata from embedding storage:

- model metadata is stored locally in SQLite (`metadata.db`)
- embedding vectors are stored by the selected backend (`pgvector` or `faiss`)
- OMOP concept metadata remains in the OMOP CDM database

## Installation

`omop-emb` now exposes backend-specific optional dependencies so installation
can match the embedding backend you actually intend to use.

```bash
pip install "omop-emb[pgvector]"
pip install "omop-emb[faiss]"
pip install "omop-emb[all]"
```

Notes:

- `pgvector` installs the PostgreSQL/pgvector dependencies.
- `faiss` installs the FAISS-based backend dependencies. This currently only includes CPU support
- `all` installs both backend stacks for development or mixed environments.
- A plain `pip install omop-emb` installs the shared core package only.
- PostgreSQL-specific embedding dependencies are optional, but `omop-emb`
  still requires OMOP CDM database access.
- Non-PostgreSQL database backends have not yet been tested.

## Runtime Configuration

Common environment variables:

- `OMOP_EMB_BACKEND`: backend name (`pgvector` or `faiss`) used by the backend factory.
- `OMOP_EMB_BASE_STORAGE_DIR`: local base directory for `omop-emb` artifacts, including local metadata (`metadata.db`) and FAISS files. If unset, `omop-emb` defaults to `./.omop_emb` in the current working directory.
- `OMOP_DATABASE_URL`: SQLAlchemy URL for the OMOP CDM database.
- `OMOP_EMB_DOCUMENT_EMBEDDING_PREFIX`: task prefix prepended to concept texts at index time. Required for asymmetric models (e.g. `search_document: ` for nomic-embed-text, `passage: ` for E5).
- `OMOP_EMB_QUERY_EMBEDDING_PREFIX`: task prefix prepended to search queries at query time. Required for asymmetric models (e.g. `search_query: ` for nomic-embed-text, `query: ` for E5).

The prefix variables default to `""` and are safe to omit for symmetric models. See [Asymmetric Embeddings](https://AustralianCancerDataNetwork.github.io/omop-emb/usage/asymmetric-embeddings/) for details.

Extended documentation can be found [here](https://AustralianCancerDataNetwork.github.io/omop-emb).

# Project Roadmap

- [x] Interface for PostgreSQL storage of vectors
- [x] Interface for FAISS storage of embeddings
- [x] Extensive unit testing
    - [x] Backend testing
    - [x] Corruption and restoration of DB testing
- [ ] Support importing and exporting of calculated embeddings
- [ ] Support non-Flat indices for each backend
- [ ] `faiss` GPU support
- [ ] [`pgvectorscale`](https://github.com/timescale/pgvectorscale) support
- [ ] Vector-quantisation for more efficient storage
