Metadata-Version: 2.4
Name: thirawat-mapper
Version: 0.2.1
Summary: Minimal indexing and inference toolkit for terminology mapping.
Project-URL: Repository, https://sidata.plus/projects/thirawat-mapper-beta
Project-URL: Issues, https://sidata.plus/support
Author-email: Natthawut Max Adulyanukosol <max@sidata.plus>
Requires-Python: >=3.10
Requires-Dist: datasets>=4.0
Requires-Dist: duckdb>=0.10
Requires-Dist: lance>=0.13
Requires-Dist: lancedb>=0.9
Requires-Dist: numpy>=1.24
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pandas>=2.0
Requires-Dist: protobuf>=4.25
Requires-Dist: pyarrow>=14.0
Requires-Dist: pydantic>=2.7
Requires-Dist: pylate>=1.3
Requires-Dist: rerankers>=0.7
Requires-Dist: rich>=13.7
Requires-Dist: sentence-transformers>=2.4
Requires-Dist: sentencepiece>=0.2.1
Requires-Dist: tiktoken>=0.7
Requires-Dist: torch>=2.2
Requires-Dist: tqdm>=4.66
Requires-Dist: transformers>=4.38
Description-Content-Type: text/markdown

# THIRAWAT Mapper 

**T**erminology **H**armonization using Late-**I**nteraction **R**eranker **W**ith **A**lignment-tuned **T**ransformers

## Prerequisites

### Environment

1. Python 3.10+
2. [`uv`](https://docs.astral.sh/uv/getting-started/installation/)

### OHDSI Standard Concepts

1. Request and download the standard concepts in CSV format from [https://athena.ohdsi.org/](https://athena.ohdsi.org/)
2. Convert the CSV files into a DuckDB database using [sidataplus/athena2duckdb](https://github.com/sidataplus/athena2duckdb)

### Models

Fine-tuned reranker models are hosted on [Hugging Face](https://huggingface.co/collections/sidataplus/thirawat).

Pre-built indexes will be made available soon.

## Install from PyPI

```bash
pip install thirawat-mapper
# or (recommended for global CLI installs)
pipx install thirawat-mapper

thirawat --help
```

Command mapping:

- `thirawat index build ...` → `python -m thirawat_mapper.index.build ...`
- `thirawat infer bulk ...` → `python -m thirawat_mapper.infer.bulk ...`
- `thirawat infer query ...` → `python -m thirawat_mapper.infer.query ...`



## 1. Build a LanceDB Index

```bash
thirawat index build \
  --duckdb data/derived/concepts.duckdb \
  --profiles-table concept_profiles \
  --concepts-table concept \
  --domain-id Drug \
  --concept-class-id "Clinical Drug,Quant Clinical Drug,Clinical Drug Comp,Clinical Drug Form,Ingredient" \
  --exclude-concept-class-id "Clinical Drug Box,Branded Drug Box,Branded Pack Box,Clinical Pack Box,Marketed Product,Quant Branded Box,Quant Clinical Box" \
  --extra-column "concept_name,domain_id,vocabulary_id,concept_class_id" \
  --out-db data/lancedb/db \
  --table concepts_drug \
  --batch-size 256 \
  --device cuda
```

Key options:

- `--duckdb` - DuckDB file produced by [`sidataplus/athena2duckdb`](https://github.com/sidataplus/athena2duckdb).
- `--profiles-table` - Preferred table containing `concept_id` and `profile_text`. If the table is missing, the builder falls back to generating profiles inline from `concept` (and `concept_synonym` when available).
- `--concepts-table` - OMOP concept table (defaults to `concept`). The builder always joins to this table and keeps only standard, valid concepts (`standard_concept = 'S' AND invalid_reason IS NULL`).
- `--domain-id`, `--concept-class-id` - Optional filters; accept comma-separated lists or repeated flags.
- `--exclude-concept-class-id` - Exclude specific classes (comma-separated or repeat flag). Default empty; recommended exclusions: Clinical Drug Box, Branded Drug Box, Branded Pack Box, Clinical Pack Box, Marketed Product, Quant Branded Box, Quant Clinical Box.
- `--extra-column` - Carry additional columns from the profiles table into LanceDB (repeat flag).
- `--max-synonyms` - Number of synonyms appended when inline profile generation is used.
- `--include-codes-in-text` - Include `concept_code` in generated inline profile text.
- `--model-id`, `--pooling`, `--max-length` - Encoder controls for building the index vectors (also written into the index manifest for inference defaults).
- `--out-db` / `--table` - Target LanceDB directory and table name.

If your Athena-to-DuckDB file does not contain a `concept_profiles` table, the command still works via inline profile generation:

```bash
thirawat index build \
  --duckdb data/derived/concepts.duckdb \
  --profiles-table concept_profiles \
  --concepts-table concept \
  --out-db data/lancedb/db \
  --table concepts_drug \
  --max-synonyms 3 \
  --include-codes-in-text
```

Device matrix:

- `index build --device`: explicit `cuda|mps|cpu`. If omitted, the encoder uses `cuda` when available, otherwise `cpu`.
- `infer bulk/query --device`: `auto|cuda|mps|cpu` (default `cpu` for stability; `auto` prefers `cuda`, then `mps`, then `cpu`).
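The `auto` preference order can be sketched as a small helper (illustrative only; the real CLI checks backend availability itself, which is abstracted here behind boolean flags):

```python
def resolve_device(requested: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Resolve a --device value to a concrete backend.

    `auto` prefers CUDA, then MPS, then CPU, mirroring the documented
    preference order. Explicit values pass through unchanged.
    """
    if requested != "auto":
        return requested
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"
```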

Apple Silicon example:

```bash
thirawat index build \
  --duckdb data/derived/concepts.duckdb \
  --profiles-table concept_profiles \
  --concepts-table concept \
  --out-db data/lancedb/db \
  --table concepts_drug \
  --device mps
```

The command will:

1. Load profiles (and apply filters if provided).
2. Normalize `profile_text` and embed with SapBERT vectors (via `transformers`; pooling configurable).
3. Write a LanceDB table where `vector` is a `FixedSizeList<float32>[768]` column.
4. Emit a `<table>_manifest.json` manifest describing the build (model id, filters, counts).
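As a rough illustration of step 4, the manifest is a small JSON document describing the build. The field names below are assumptions for illustration, not the CLI's exact schema:

```python
import json

# Hypothetical manifest fields (illustrative only; the real manifest's
# exact keys are not documented here).
manifest = {
    "model_id": "cambridgeltl/SapBERT-from-PubMedBERT-fulltext",
    "pooling": "cls",
    "vector_dim": 768,
    "filters": {"domain_id": ["Drug"]},
    "row_count": 123456,
}

manifest_json = json.dumps(manifest, indent=2)
```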

## 2. Interactive Query (REPL)

```bash
thirawat infer query \
  --db data/lancedb/db \
  --table concepts_drug \
  --device cpu \
  --reranker-id sidataplus/THIRAWAT-BioLORD  # optional override; defaults to sidataplus/THIRAWAT-SapBERT
```

Type a query and press Enter to see the post-scored top results:

```
query> amoxicillin clavulanate 875 mg
concept_id   | score  | s_sim | name
--------------------------------------------------------------------------------
123456       | 0.841  | 0.990 | Amoxicillin / Clavulanate 875 MG Oral Tablet
...
```

Commands:

- Type `:q`, `:quit`, or `:exit` to leave.
- Use `--candidate-topk` to change the candidate pool and `--show-topk` to limit display rows.
- `--reranker-id` works here too if you want to test a local or alternative reranker in the REPL.


## 3. Bulk Inference

```bash
export TOKENIZERS_PARALLELISM=false

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/usagi.csv \
  --out runs/mapping \
  --candidate-topk 200 \
  --n-limit 20 \
  --device cuda
```

Add `--reranker-id` to point at a different reranker checkpoint. The flag accepts either a Hugging Face model ID or a local path, e.g. `--reranker-id models/nde_biolord`.

Input formats: CSV, TSV, Parquet, or Excel. By default the CLI expects the following columns (override via flags):

- `sourceName` (required)
- `sourceCode` (optional)
- `conceptId` (optional ground truth)
- `mappingStatus` (used for Usagi detection). When the input already follows the Usagi CSV schema (see `data/eval/tmt_to_rxnorm.csv`), the CLI validates a sample of rows through a Pydantic schema and surfaces a clear error if the structure is invalid. Otherwise, it synthesizes a minimal Usagi row per record so downstream exports stay consistent.

Selected flags:

- `--source-name-column`, `--source-code-column` - Override input headers.
- `--label-column` - Column containing gold concept IDs (optional, default `conceptId`).
- `--status-column`, `--approved-value` - Configure Usagi approval detection.
- `--batch-size` - Query embedding batch size (increase for better GPU throughput).
- `--n-limit` - Limit to the first N rows (smoke runs).
- `--where` - Optional LanceDB filter, e.g., `vocabulary_id = 'RxNorm' AND concept_class_id != 'Ingredient'` (when those columns exist in the index).
- `--device` - `auto|cuda|mps|cpu` (default `cpu` for stability; use `auto` to prefer `cuda`, then `mps`, then `cpu`).
- `--encoder-model-id`, `--encoder-pooling`, `--encoder-max-length` - Override the query encoder used for retrieval (defaults to the index manifest when present).
- `--post-mode` - Post-score behavior: `blend|tiebreak|lex` (default `tiebreak`).
- `--post-weight` - Blend weight (only when `--post-mode blend`, default `0.05`).
- `--tiebreak-eps`, `--tiebreak-topn` - Controls near-tie grouping for `--post-mode tiebreak`.
- `--brand-strict` - For bracketed brand queries, drop brand-mismatched candidates when possible.
- `--inn2usan/--no-inn2usan` - Normalize INN/BAN drug names to USAN during inference (default enabled).
- `--atc-scope` - Boost candidates matching per-row `atc_ids`/`atc_codes` (requires `--vocab` or a DuckDB path in the index manifest).
- `--reranker-id` - Override the default reranker (`sidataplus/THIRAWAT-SapBERT`) with another HF model ID or a local directory/filename. Relative paths are resolved to absolute paths so you can pass `models/nde_biolord`.

### Deterministic post-ranking modes

`--post-mode` controls how post features influence ranking:

- `tiebreak` (default): keeps the ML relevance ordering globally, but reorders only near-tied candidates (gap `<= --tiebreak-eps`) within the first `--tiebreak-topn` rows.
- `lex`: full lexicographic sort by relevance + post features across all rows.
- `blend`: computes a weighted final score.

For `blend`, the score is:

`final_score = (1 - post_weight) * relevance + post_weight * post_score`
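In code, the blend is a plain convex combination of the two scores:

```python
def blend_score(relevance: float, post_score: float,
                post_weight: float = 0.05) -> float:
    # final_score = (1 - post_weight) * relevance + post_weight * post_score
    return (1.0 - post_weight) * relevance + post_weight * post_score
```

Setting `--post-weight 0.0` makes the blend reduce to the raw relevance score, which is why that combination disables post-scoring.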

For `lex` and `tiebreak`, tie-break keys are applied in this deterministic order (descending):

1. `brand_strength_exact`
2. `top20_strength_form_exact`
3. `brand_score`
4. `rerank_top20`
5. `strength_exact`
6. `strength_sim`
7. `form_route_score`
8. `release_score`

Pipeline steps per row:

1. Build query text (`sourceName` with `sourceCode` appended in parentheses when present).
2. Embed with SapBERT.
3. Vector search (cosine) against the LanceDB table to gather `--candidate-topk` entries.
4. Rerank with the THIRAWAT reranker. Beta is vector-only; no FTS/BM25/hybrid.
5. Apply post-scoring per `--post-mode` (default `tiebreak`: only reorders within near-ties of the ML score). Disable post-scoring via `--post-mode blend --post-weight 0.0`.

Outputs (written to `--out`):

- `results.csv` - Classic relabel layout (wide, block-per-query). Columns: leading `rank` 1..K, then for each query three adjacent columns `[match_rank_or_unmatched, source_concept_name, source_concept_code]` with K rows beneath. Non-Usagi inputs preserve the original row order; Usagi inputs continue to sort matched rows first so reviewers can focus on confirmed gold IDs.
- `results_with_input.csv` - Original input row with candidate columns appended.
- `results_usagi.csv` - Always emitted. Each processed row is coerced into the Usagi schema (using the sample in `data/eval/tmt_to_rxnorm.csv` as ground truth). The top candidate populates `conceptId`, `conceptName`, `domainId`, and `matchScore` when available; otherwise those fields remain blank. Every row is marked `mappingStatus=UNCHECKED`, `statusSetBy=THIRAWAT-mapper`, `mappingType=MAPS_TO` so reviewers can import the file directly into Usagi even when the source sheet was not originally in that format.
- `metrics.json` - When ground-truth IDs are available (either via `conceptId` or Usagi rows with `mappingStatus == APPROVED`) the file reports Hit@{1,2,5,10,20,50,100}, MRR@100, coverage, and counts.
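Hit@k and MRR follow their standard definitions over each query's ranked candidate list; a minimal sketch:

```python
def hit_at_k(ranked_ids: list[int], gold: int, k: int) -> bool:
    # True when the gold concept appears in the top k candidates.
    return gold in ranked_ids[:k]

def reciprocal_rank(ranked_ids: list[int], gold: int, cutoff: int = 100) -> float:
    # 1/rank of the gold concept within the cutoff, else 0.0;
    # averaging this over queries gives MRR@cutoff.
    for rank, cid in enumerate(ranked_ids[:cutoff], start=1):
        if cid == gold:
            return 1.0 / rank
    return 0.0
```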

### LLM-assisted RAG reranking

Bulk inference can optionally send the top reranked candidates to an LLM for tie-breaking or abstention logic. Enable this flow with `--rag-provider` and supply provider-specific flags. The CLI saves every prompt/response pair to `rag_prompts.md` under the chosen `--out` directory so you can audit exactly what was sent.

LLM output must be structured JSON with a `concept_ids` array, e.g. `{"concept_ids":[123,456,789]}`. If a provider returns invalid JSON for a query, that query falls back to the non-LLM ranking and logs an error.

General RAG knobs:

```bash
--rag-provider {ollama,llamacpp,openrouter,cloudflare}
--rag-model MODEL_ID                # default openai/gpt-oss-20b
--rag-candidate-limit 50            # number of reranked candidates passed to the LLM
--rag-profile-char-limit 512        # truncate long profile_text snippets
--rag-include-retrieval-score/--no-rag-include-retrieval-score
--rag-include-final-score/--no-rag-include-final-score
--rag-extra-context-column COLUMN   # optional extra context column from the input sheet
--rag-stop-sequence TEXT (repeatable)
--rag-use-normalized-query/--no-rag-use-normalized-query
```

> **Tip:** RAG is isolated to `infer.bulk`. The interactive REPL intentionally remains retrieval-only in this beta.

#### Ollama (local GGUF/chat server)

```bash
thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/ollama_rag \
  --n-limit 100 \
  --rag-provider ollama \
  --ollama-base-url http://localhost:11434 \
  --ollama-model "gpt-oss:20b"
```

Ollama-specific flags:

```bash
--ollama-base-url URL          # default http://localhost:11434
--ollama-model MODEL_TAG       # defaults to --rag-model value
--ollama-timeout 120           # seconds
--ollama-keep-alive "5m"       # optional keep-alive hint sent to server
```

#### llama.cpp server (local HTTP API)

Use `--rag-provider llamacpp` only when a [llama.cpp `llama-server`](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) process is already running (default `http://127.0.0.1:8080`). Launch the server separately with your desired context and batching flags (for example: `llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -fa on`). Point the CLI at that HTTP endpoint, not at GGUF files directly:

```bash
thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/llamacpp_rag \
  --rag-provider llamacpp \
  --llamacpp-base-url http://127.0.0.1:8080 \
  --rag-model ggml-org/gpt-oss-20b-GGUF
```

llama.cpp flags:

```bash
--llamacpp-base-url URL          # default http://127.0.0.1:8080
--llamacpp-timeout 120           # HTTP timeout in seconds
--llamacpp-chat-format FORMAT    # e.g., qwen, llama
--llamacpp-system-prompt TEXT    # optional instruction prefix
--llamacpp-n-ctx 8192            # forwarded via query parameters when supported
--llamacpp-model-path /path/model.gguf   # fallback to llama-cpp-python bindings when no base URL is set
```

If you omit `--llamacpp-base-url`, the CLI falls back to the `llama-cpp-python` bindings and expects `--llamacpp-model-path` to point to a local GGUF file (plus any `--llamacpp-n-*` overrides). In that mode, `--rag-model` is ignored; the model file itself determines which model loads.

For all providers, the CLI logs each prompt/response pair and the parsed candidate ordering to `rag_prompts.md` in the `--out` directory for downstream review.

#### OpenRouter (hosted multi-model API)

```bash
export OPENROUTER_API_KEY=<YOUR_KEY>

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/openrouter_rag \
  --rag-provider openrouter \
  --rag-model openrouter/polaris-alpha
```

Set `OPENROUTER_API_KEY` in your environment; the CLI will refuse to call OpenRouter without it.

#### Cloudflare Workers AI (remote)

```bash
export CLOUDFLARE_ACCOUNT_ID=<ACCOUNT_ID>
export CLOUDFLARE_API_TOKEN=<API_TOKEN>

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/cf_rag \
  --n-limit 100 \
  --rag-provider cloudflare \
  --rag-model openai/gpt-oss-20b
```

Cloudflare-specific flags:

```bash
--cloudflare-base-url https://api.cloudflare.com/client/v4
--cloudflare-use-responses-api / --no-cloudflare-use-responses-api
--gpt-reasoning-effort {low,medium,high}
--cf-reasoning-summary {auto,concise,detailed}
```

Set `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` in your environment before invoking the Cloudflare provider; the CLI reads only from those variables.

- Models under `@cf/openai/*` (for example `@cf/openai/gpt-oss-120b`) use the Workers AI Responses API, so leave `--cloudflare-use-responses-api` enabled to send the prompt as an `input` payload.
- Meta's `@cf/meta/llama-4-*` family is served via the `/ai/run/<model>` endpoint; pass `--no-cloudflare-use-responses-api` when targeting those models so the CLI emits the `messages` payload the endpoint expects.


## Development

```bash
# 1. Install dependencies into a local virtual environment (creates .venv/)
uv sync

# 2. (Optional) Activate the environment for interactive shells
source .venv/bin/activate

# 3. Or just run commands directly via uv
uv run python -m thirawat_mapper.index.build --help
```

`uv sync` reads the project metadata and installs the required packages (PyTorch, LanceDB, transformers, etc.) against Python 3.10+. Subsequent `uv run ...` invocations reuse the same environment. Replace the paths in the examples to match your workspace. All text used for indexing and inference is normalized (lower-cased, whitespace collapsed) for stable matching.
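The normalization described here (assumed to be exactly lower-casing plus whitespace collapsing) amounts to a one-liner:

```python
def normalize_text(text: str) -> str:
    # Lower-case and collapse every run of whitespace to a single space.
    return " ".join(text.lower().split())
```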


## Notes & Requirements

- Vector-only retrieval + reranking (no FTS/BM25/hybrid in beta).
- Text is normalized (lowercase + collapsed whitespace) for indexing and inference.
- The reranker default is `sidataplus/THIRAWAT-SapBERT`. As verified on February 10, 2026 via the Hugging Face model API, this model is public (`gated=false`, `private=false`). If upstream access settings change later, authenticate with Hugging Face as needed.
- LanceDB tables must expose a float32 fixed-size vector column (named `vector` when built with this CLI).
- Index build keeps only standard, valid OMOP concepts (`standard_concept='S' AND invalid_reason IS NULL`).
- This beta uses the `transformers` encoder path directly (no `--backend st` switch in this CLI).

### Troubleshooting: SapBERT warning during startup

You may see this warning while loading SapBERT-related components:

`No sentence-transformers model found with name cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR.`

In this project that warning is often benign fallback behavior when loading through `transformers`/ColBERT wrappers. Treat it as an error only when model loading or inference actually fails (for example, a raised exception, process exit, or no embeddings produced).
