Metadata-Version: 2.4
Name: omop-emb
Version: 0.4.0
Summary: Embedding extension to omop-graph
Author-email: Nico Loesch <n.loesch@unsw.edu.au>
License-Expression: Apache-2.0
Keywords: LLM-grounding,OHDSI,OMOP,clinical-data,health-informatics,knowledge-graph,sqlalchemy
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Python: >=3.12
Requires-Dist: numpy>=1.26
Requires-Dist: omop-alchemy>=0.5.7
Requires-Dist: openai
Requires-Dist: orm-loader>=0.3.15
Requires-Dist: psycopg2-binary>=2.9.11
Requires-Dist: requests
Requires-Dist: sqlalchemy>=2.0.45
Requires-Dist: typer
Requires-Dist: typing-extensions>=4.15.0
Provides-Extra: all
Requires-Dist: faiss-cpu>=1.8.0; extra == 'all'
Requires-Dist: h5py; extra == 'all'
Requires-Dist: pgvector; extra == 'all'
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.8.0; extra == 'faiss'
Requires-Dist: h5py; extra == 'faiss'
Provides-Extra: pgvector
Requires-Dist: pgvector; extra == 'pgvector'
Description-Content-Type: text/markdown

# omop-emb
Embedding layer for OMOP CDM.

`omop-emb` now separates model metadata from embedding storage:

- model metadata is stored locally in SQLite (`metadata.db`)
- embedding vectors are stored by the selected backend (`pgvector` or `faiss`)
- OMOP concept metadata remains in the OMOP CDM database

## Installation

`omop-emb` now exposes backend-specific optional dependencies so installation
can match the embedding backend you actually intend to use.

```bash
pip install "omop-emb[pgvector]"
pip install "omop-emb[faiss]"
pip install "omop-emb[all]"
```

Notes:

- `pgvector` installs the PostgreSQL/pgvector dependencies.
- `faiss` installs the FAISS-based backend dependencies. Only CPU support is currently included.
- `all` installs both backend stacks for development or mixed environments.
- A plain `pip install omop-emb` installs the shared core package only.
- PostgreSQL-specific embedding dependencies are optional, but `omop-emb`
  still requires OMOP CDM database access.
- Non-PostgreSQL database backends have not yet been tested.
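
To confirm which backend extras are usable in the current environment, a check along the following lines may help. It is a minimal sketch, not part of `omop-emb`: the `installed_backends` helper and the `EXTRA_MODULES` mapping are hypothetical names, and the module lists simply mirror the extras declared in the package metadata (`faiss-cpu` is imported as `faiss`).

```python
from importlib.util import find_spec

# Import names for the packages each optional extra pulls in.
# Assumption: this mapping mirrors the declared extras; it is not read
# from omop-emb itself.
EXTRA_MODULES = {
    "pgvector": ("pgvector",),
    "faiss": ("faiss", "h5py"),
}


def installed_backends(spec_finder=find_spec):
    """Return the backend names whose optional modules are all importable."""
    return [
        name
        for name, modules in EXTRA_MODULES.items()
        if all(spec_finder(module) is not None for module in modules)
    ]


if __name__ == "__main__":
    print("Backends with dependencies installed:", installed_backends())
```

The `spec_finder` parameter exists only so the check can be exercised without installing or uninstalling packages.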

## Runtime Configuration

Common environment variables:

- `OMOP_EMB_BACKEND`: backend name (`pgvector` or `faiss`) used by the backend factory.
- `OMOP_EMB_BASE_STORAGE_DIR`: local base directory for `omop-emb` artifacts, including local metadata (`metadata.db`) and FAISS files. If unset, `omop-emb` defaults to `./.omop_emb` in the current working directory.
- `OMOP_DATABASE_URL`: SQLAlchemy URL for the OMOP CDM database.
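
The variables above could be resolved roughly as follows. This is an illustrative sketch only: `EmbSettings` and `load_settings` are hypothetical names rather than the `omop-emb` API, placing `metadata.db` directly inside the base storage directory is an assumption, and treating a missing backend or database URL as an error is a choice made for this example.

```python
import os
from dataclasses import dataclass
from pathlib import Path

SUPPORTED_BACKENDS = ("pgvector", "faiss")


@dataclass(frozen=True)
class EmbSettings:
    """Hypothetical settings container; not part of omop-emb."""

    backend: str
    storage_dir: Path
    metadata_db: Path
    database_url: str


def load_settings(env=None):
    """Resolve settings from a mapping of environment variables."""
    env = os.environ if env is None else env

    backend = env.get("OMOP_EMB_BACKEND", "")
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"OMOP_EMB_BACKEND must be one of {SUPPORTED_BACKENDS}, got {backend!r}"
        )

    database_url = env.get("OMOP_DATABASE_URL")
    if not database_url:
        raise ValueError("OMOP_DATABASE_URL is not set")

    # Documented default: ./.omop_emb in the current working directory.
    storage_dir = Path(env.get("OMOP_EMB_BASE_STORAGE_DIR") or Path.cwd() / ".omop_emb")

    # Assumption: metadata.db sits at the top level of the storage directory.
    return EmbSettings(backend, storage_dir, storage_dir / "metadata.db", database_url)
```

Passing a plain dictionary as `env` makes the behaviour easy to check without touching the real process environment.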

Extended documentation is available at [AustralianCancerDataNetwork.github.io/omop-emb](https://AustralianCancerDataNetwork.github.io/omop-emb).

## Project Roadmap

- [x] Interface for PostgreSQL storage of vectors
- [x] Interface for FAISS storage of embeddings
- [x] Extensive unit testing
    - [x] Backend testing
    - [x] Database corruption and restoration testing
- [ ] Support importing and exporting of calculated embeddings
- [ ] Support non-Flat indices for each backend
- [ ] `faiss` GPU support
- [ ] [`pgvectorscale`](https://github.com/timescale/pgvectorscale) support
- [ ] Vector quantisation for more efficient storage
