Metadata-Version: 2.4
Name: biometaharmonizer
Version: 0.6.0
Summary: Harmonize messy NCBI BioSample metadata at scale
License: MIT License
        
        Copyright (c) 2026 Rustam Heydarov
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5
Requires-Dist: numpy>=1.24
Requires-Dist: biopython>=1.80
Requires-Dist: requests>=2.28
Requires-Dist: pycountry>=22.3
Requires-Dist: python-dateutil>=2.8
Requires-Dist: openpyxl>=3.0
Requires-Dist: pyarrow>=12.0
Requires-Dist: rapidfuzz>=3.0.0
Provides-Extra: docs
Requires-Dist: sphinx>=7.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=2.0; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints>=1.24; extra == "docs"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: sphinx>=7.0; extra == "dev"
Requires-Dist: sphinx-rtd-theme>=2.0; extra == "dev"
Dynamic: license-file

# BioMetaHarmonizer

[![version](https://img.shields.io/badge/version-0.6.0-blue)](#)
[![python](https://img.shields.io/badge/python-3.9%2B-blue)](#)
[![license](https://img.shields.io/badge/license-MIT-green)](#)
[![Docs](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://rustam-bioinfo.github.io/BioMetaHarmonizer/)

A Python package for fetching, parsing, and standardizing NCBI BioSample metadata for large-scale genomic epidemiology.

---

## What it does

NCBI BioSample metadata is free-text, crowd-sourced, and inconsistent across submitters. BioMetaHarmonizer fetches BioSample XML records via the Entrez API, maps raw attribute names to a fixed set of standard columns, normalizes placeholder null values, parses dates and geographic strings, and assigns One Health categories. The result is a pandas DataFrame that can be written to CSV, TSV, Excel, or Parquet.

Input can be BioSample accessions (`SAMN`, `SAME`, `SAMD`), assembly accessions (`GCF_`, `GCA_`), or a mix of both. Assembly accessions are resolved to BioSample IDs through locally cached NCBI assembly summary flat files.
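
As a rough illustration of how a mixed input list can be separated before resolution, the sketch below sorts accessions by prefix. The regular expressions and the helper names (`classify_accession`, `split_accessions`) are assumptions made for this example, not the package's internal implementation.

```python
import re

# Assumed patterns: SAMN/SAME/SAMD followed by an optional letter and digits
# (e.g. SAMEA...), and GCF_/GCA_ followed by digits and an optional version.
BIOSAMPLE_RE = re.compile(r"^SAM[NED][A-Z]?\d+$")
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d+(\.\d+)?$")


def classify_accession(acc: str) -> str:
    """Return 'biosample', 'assembly', or 'unknown' for one accession string."""
    acc = acc.strip()
    if BIOSAMPLE_RE.match(acc):
        return "biosample"
    if ASSEMBLY_RE.match(acc):
        return "assembly"
    return "unknown"


def split_accessions(accessions):
    """Group accessions by type, preserving input order within each group."""
    groups = {"biosample": [], "assembly": [], "unknown": []}
    for acc in accessions:
        groups[classify_accession(acc)].append(acc.strip())
    return groups


groups = split_accessions(["SAMN12345678", "GCF_000001405.39", "not-an-id"])
```

Assembly accessions in the `assembly` group would still need to be mapped to BioSample IDs through the cached assembly summary files.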

Records submitted under NCBI pathogen packages (e.g. `Pathogen.cl.1.0`, `Pathogen.env.1.0`) often carry a structured `<Antibiogram>` section alongside standard attributes. BioMetaHarmonizer parses the antibiogram and serializes it as a compact JSON list in `_extra_attributes["antibiogram"]` so that MIC and phenotype data are never silently discarded.

---

## Installation

```bash
git clone https://github.com/rustam-bioinfo/BioMetaHarmonizer.git
cd BioMetaHarmonizer
pip install -e .
```

Requires Python 3.9+. Dependencies are declared in `pyproject.toml` and installed automatically.

The package ships with pre-built schema files (`unified.json`, `one_health_dictionaries.json`, `ncbi_attributes.xml`). The rebuild scripts in `scripts/` are only needed when you want to refresh those files from upstream sources — see [Rebuilding schema files](#rebuilding-schema-files).

---

## Quick start

### Command line

```bash
biometaharmonizer run \
    --input  accessions.txt \
    --email  your@email.com \
    --output harmonized.csv
```

| Flag | Default | Description |
|---|---|---|
| `--input FILE` | required | Path to accession list (one per line) |
| `--email EMAIL` | required | Valid contact email for NCBI Entrez — must contain `@` and a domain |
| `--output FILE` | required | Output file path |
| `--api-key KEY` | — | NCBI API key; raises rate limit from 3 to 10 requests/second |
| `--cache-dir DIR` | `~/.biometaharmonizer/cache/` | Directory for assembly summary flat files |
| `--format FORMAT` | inferred from file extension | `csv`, `tsv`, `excel`, `parquet` |
| `--summary FILE` | — | Write a per-column fill-rate CSV |
| `--fetch-batch-size N` | `200` | Number of records per efetch request |
| `--esearch-batch-size N` | `200` | Number of accessions per esearch term |
| `--refresh-cache` | off | Force re-download of assembly summary flat files regardless of age |
| `--verbose` | off | Enable DEBUG-level logging |

### Python API

```python
from biometaharmonizer.ingestion import set_email, ingest
from biometaharmonizer import KeyMapper, DateEngine, GeoEngine, OneHealthClassifier
from biometaharmonizer import write, write_summary

# Ingest: accepts a file path, a Python list, or a mix of both accession types
set_email("your@email.com")
df = ingest("accessions.txt")
# or: df = ingest(["SAMN12345678", "GCF_000001405.39"])

# Force re-download of assembly summary flat files (bypasses 7-day TTL):
# df = ingest("accessions.txt", refresh_cache=True)

# Key harmonization — renames raw columns to standard keys, coalesces duplicates
# Needed only if you bring your own DataFrame; ingest() already applies the schema
mapper = KeyMapper()
df = mapper.map_columns(df)

# Date parsing: 40+ input formats -> ISO 8601 (YYYY / YYYY-MM / YYYY-MM-DD)
de = DateEngine()
date_df = de.parse_with_range(df["collection_date"])
df["collection_date"] = date_df["collection_date"]
df["collection_date_range"] = date_df["collection_date_range"]

# Geography: splits geo_loc_name into country, region, locality, ISO code, sea
ge = GeoEngine()
geo_df = ge.parse(df["geo_loc_name"])
for col in geo_df.columns:
    df[col] = geo_df[col]

# One Health classification across multiple source columns simultaneously
oh = OneHealthClassifier()
src = {col: df[col] for col in
       ["isolation_source", "env_broad_scale", "env_local_scale",
        "env_medium", "sample_type", "host"]
       if col in df.columns}
oh_df = oh.classify_multi_field(**src)
for col in oh_df.columns:
    df[col] = oh_df[col]

# Write output
write(df, "harmonized.csv")
write_summary(df, "fill_rates.csv")
```

---

## Output columns

The output DataFrame contains the following 58 columns. Columns with no data for a given dataset are present and filled with `NaN`. Attributes that do not map to any column are preserved as a JSON string in `_extra_attributes`.

The Source column shows where each value originates. Columns 28 to 34 are produced by `OneHealthClassifier.classify_multi_field()`, columns 15 to 20 by `GeoEngine`, and column 12 by `DateEngine`. The remaining columns come from ingestion (BioSample XML, BioSample attributes, and the assembly index).

| # | Column | Source | Description |
|---|--------|--------|-------------|
| 1 | `biosample_accession` | BioSample XML | NCBI BioSample accession (e.g. `SAMN07597573`) |
| 2 | `biosample_id` | BioSample XML | NCBI internal numeric BioSample ID |
| 3 | `sra_accession` | BioSample XML | Linked SRA accession, if present |
| 4 | `bioproject_accession` | BioSample XML / assembly index | Parent BioProject accession |
| 5 | `assembly_accession_refseq` | Assembly index | RefSeq assembly accession (GCF\_) |
| 6 | `assembly_accession_genbank` | Assembly index | GenBank assembly accession (GCA\_) |
| 7 | `sample_name_id` | BioSample XML | Submitter sample name from `<Id db_label="Sample name">` |
| 8 | `taxonomy_id` | BioSample XML | NCBI Taxonomy numeric ID |
| 9 | `taxonomy_name` | BioSample XML | Taxon name for the assigned taxonomy_id |
| 10 | `organism_name` | BioSample XML | Organism name from `<OrganismName>`; falls back to taxonomy_name |
| 11 | `collection_date` | BioSample attribute → DateEngine | Collection date normalized to ISO 8601 |
| 12 | `collection_date_range` | DateEngine | Inferred date range when only year or year-month was provided |
| 13 | `geo_loc_name` | BioSample attribute | Raw geographic location string as submitted |
| 14 | `lat_lon` | BioSample attribute | Decimal lat/lon as submitted |
| 15 | `geo_country` | GeoEngine | Country resolved from `geo_loc_name` |
| 16 | `geo_region` | GeoEngine | Sub-national region; populated only from colon-format inputs (`"Country: Region, Locality"`); `NaN` for comma-only inputs |
| 17 | `geo_locality` | GeoEngine | Locality after the region in colon format, or the part after the first comma in comma-only inputs |
| 18 | `geo_iso3166` | GeoEngine | ISO 3166-1 alpha-2 country code; historical names tagged `HISTORICAL` |
| 19 | `geo_sea_ocean` | GeoEngine | Sea or ocean name for marine locations |
| 20 | `geo_loc_raw` | GeoEngine | Preserved raw string for coordinate-only inputs (e.g. `"40.71 N, 74.00 W"`); `NaN` for all other inputs |
| 21 | `host` | BioSample attribute | Host organism name |
| 22 | `host_disease` | BioSample attribute | Disease associated with host at sampling |
| 23 | `host_age` | BioSample attribute | Age of host |
| 24 | `host_sex` | BioSample attribute | Biological sex of host |
| 25 | `host_tissue_sampled` | BioSample attribute | Tissue or body site sampled |
| 26 | `isolation_source` | BioSample attribute | Material or environment from which the isolate was obtained |
| 27 | `sample_type` | BioSample attribute | Sample type or specimen classification |
| 28 | `one_health_category` | OneHealthClassifier | One of: Human, Animal, Aquatic, Wildlife, Plant, Food, Environmental, Lab, Unclassified |
| 29 | `one_health_term` | OneHealthClassifier | The specific term or phrase that triggered the classification |
| 30 | `one_health_confidence` | OneHealthClassifier | Float in [0, 1] — see [One Health classification](#one-health-classification) |
| 31 | `one_health_evidence_level` | OneHealthClassifier | Discretized confidence: `high` (≥0.85), `medium` (≥0.60), `low` (≥0.30), `unresolved` |
| 32 | `one_health_processing` | OneHealthClassifier | Processing/handling term detected in the field text (e.g. `pasteurized`, `frozen`), if any |
| 33 | `one_health_setting` | OneHealthClassifier | Setting term detected in the field text (e.g. `clinical`, `farm`, `retail`), if any |
| 34 | `one_health_source_field` | OneHealthClassifier | Which input field produced the winning classification |
| 35 | `isolate` | BioSample attribute | Isolate identifier |
| 36 | `strain` | BioSample attribute | Strain designation |
| 37 | `sub_strain` | BioSample attribute | Sub-strain designation |
| 38 | `serotype` | BioSample attribute | Serotype |
| 39 | `serovar` | BioSample attribute | Serovar |
| 40 | `genotype` | BioSample attribute | Genotype or sequence type |
| 41 | `culture_collection` | BioSample attribute | Culture collection identifier |
| 42 | `outbreak` | BioSample attribute | Outbreak identifier |
| 43 | `env_broad_scale` | BioSample attribute | Broad environmental context (ENVO) |
| 44 | `env_local_scale` | BioSample attribute | Local environmental feature (ENVO) |
| 45 | `env_medium` | BioSample attribute | Environmental medium (ENVO) |
| 46 | `sequencing_method` | BioSample attribute | Sequencing platform |
| 47 | `assembly_method` | BioSample attribute | Genome assembly software |
| 48 | `collected_by` | BioSample attribute; `<Owner/Name>` fallback | Collector name or institution |
| 49 | `ncbi_package` | BioSample XML | NCBI BioSample package (e.g. `Microbe.1.0`) |
| 50 | `submission_date` | BioSample XML | Date first submitted |
| 51 | `last_update` | BioSample XML | Date last modified |
| 52 | `publication_date` | BioSample XML | Date made publicly available |
| 53 | `access` | BioSample XML | `public` or `controlled-access` |
| 54 | `status` | BioSample XML | Record status (e.g. `live`, `suppressed`) |
| 55 | `status_date` | BioSample XML | Date current status was assigned |
| 56 | `title` | BioSample XML | Free-text title of the BioSample record |
| 57 | `description_comment` | BioSample XML | Free-text description or comment block |
| 58 | `_extra_attributes` | JSON | All attributes that could not be mapped to a schema column, serialized as a JSON dict. Also contains `submission_owner` and `submission_contact` when `<Owner>` provenance is present alongside an explicit collector. For records submitted under pathogen packages, contains an `antibiogram` key (see [Antibiogram data](#antibiogram-data)). |

---

## Antibiogram data

BioSample records submitted under NCBI pathogen packages (`Pathogen.cl.1.0`, `Pathogen.env.1.0`, etc.) may include a structured `<Antibiogram>` section that is a sibling of `<Attributes>` in the XML — not a child. Standard attribute parsers that only iterate `<Attributes>` silently drop this section. BioMetaHarmonizer parses it explicitly.

When an antibiogram is present, `_extra_attributes["antibiogram"]` contains a compact JSON-encoded list of dicts, one per antibiotic row. Each dict includes whichever of the following fields NCBI populated for that row:

| Field | Description |
|---|---|
| `antibiotic_name` | Antibiotic name (e.g. `amikacin`) |
| `resistance_phenotype` | `susceptible`, `resistant`, or `intermediate` |
| `measurement_sign` | `==`, `<=`, `>=`, `<`, `>` |
| `measurement` | Numeric MIC or disk diffusion value |
| `measurement_units` | `mg/L`, `mm`, etc. |
| `laboratory_typing_method` | `MIC`, `disk diffusion`, etc. |
| `laboratory_typing_platform` | Instrument or method platform |
| `vendor` | Reagent/kit vendor |
| `laboratory_typing_method_version_or_reagent` | Version or reagent identifier |
| `testing_standard` | `CLSI`, `EUCAST`, etc. |

Fields with null or missing values are omitted from each row dict so the JSON payload stays compact. Rows where all fields resolved to null are excluded entirely.
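
The compaction rule described above can be sketched as follows. This is a minimal illustration that assumes each row arrives as a plain dict and that "null" means `None` or an empty string; `compact_antibiogram` is a hypothetical helper name, not a function exported by the package.

```python
import json


def compact_antibiogram(rows):
    """Drop null fields from each row, drop empty rows, return compact JSON."""
    compact = []
    for row in rows:
        # Keep only fields that carry a value (assumption: None and "" are null)
        kept = {k: v for k, v in row.items() if v is not None and v != ""}
        if kept:  # rows where every field was null are excluded entirely
            compact.append(kept)
    # separators=(",", ":") removes the default whitespace after , and :
    return json.dumps(compact, separators=(",", ":"))


example = compact_antibiogram([
    {"antibiotic_name": "amikacin", "resistance_phenotype": "susceptible", "vendor": None},
    {"antibiotic_name": None, "measurement": ""},
])
```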

**Extracting antibiogram data from a result DataFrame:**

```python
import json
import pandas as pd

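def extract_antibiogram(df):
    """Return one row per antibiotic test, tagged with its BioSample accession."""
    rows = []
    for _, rec in df.iterrows():
        extras = rec.get("_extra_attributes")
        if not extras:
            continue
        try:
            # NaN or malformed JSON raises here and the record is skipped
            d = json.loads(extras)
        except (ValueError, TypeError):
            continue
        if not isinstance(d, dict):
            continue
        ab = d.get("antibiogram")
        if not ab:
            continue
        # The antibiogram may be a JSON-encoded string or an already-decoded list
        ab_rows = json.loads(ab) if isinstance(ab, str) else ab
        for row in ab_rows:
            row["biosample_accession"] = rec["biosample_accession"]
            rows.append(row)
    return pd.DataFrame(rows)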

antibiogram_df = extract_antibiogram(df)
```

---

## Attribute resolution order

For each `<Attribute>` element in BioSample XML, the column mapping is resolved in this order:

1. **`harmonized_name` direct match** — if the NCBI-assigned `harmonized_name` matches a schema column exactly, it is used without any synonym lookup.
2. **Synonym lookup on `harmonized_name`** — if not a direct match, the `harmonized_name` is looked up in the synonym table. If the resolved key is in the schema, it is used; otherwise the resolved key is stored in `_extra_attributes`.
3. **Synonym lookup on `attribute_name`** — if `harmonized_name` is absent or unresolvable, the raw `attribute_name` is tried.
4. **`_extra_attributes`** — any attribute that could not be resolved by any of the above is written to `_extra_attributes` as a JSON key-value pair.
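
One way to read the resolution order above is sketched below. `SCHEMA` and `SYNONYMS` are tiny hypothetical stand-ins for the real schema and synonym table, and `resolve_attribute` is an illustrative name; the actual logic in `ingestion.py` may differ in detail.

```python
# Hypothetical subset of schema columns and synonym entries, for illustration only
SCHEMA = {"host", "isolation_source", "collection_date"}
SYNONYMS = {
    "host organism": "host",
    "sample source": "isolation_source",
    "ph level": "ph",  # resolves to a key that is not a schema column
}


def resolve_attribute(harmonized_name, attribute_name):
    """Return ('column', key) or ('extra', key) for one <Attribute> element."""
    # Step 1: direct match on harmonized_name
    if harmonized_name and harmonized_name in SCHEMA:
        return ("column", harmonized_name)
    # Steps 2 and 3: synonym lookup, harmonized_name first, then attribute_name
    for candidate in (harmonized_name, attribute_name):
        if not candidate:
            continue
        resolved = SYNONYMS.get(candidate.strip().lower())
        if resolved is None:
            continue
        if resolved in SCHEMA:
            return ("column", resolved)
        return ("extra", resolved)
    # Step 4: nothing matched, keep the attribute under its raw name
    return ("extra", attribute_name or harmonized_name)
```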

The synonym table is built from two layers in `synonyms.py` and cached for the lifetime of the process:

- **Layer 1 — `schemas/unified.json`** — manually curated synonym lists for all standard keys.
- **Layer 2 — `schemas/ncbi_attributes.xml`** — the official NCBI BioSample harmonization table. Optional; loaded only if present.

Both `ingestion.py` and `key_mapper.py` use the same `build_synonym_lookup()` function.

---

## Null normalization

During XML parsing, placeholder values are converted to `None` before any downstream processing. The full pattern list covers:

- `missing`, `missing: lab stock`, `missing: data agreement established pre-2023`
- `N/A`, `na`, `null`, `none`, `nil`, `-`, `.`
- `unknown`, `not provided`, `not collected`, `not applicable`, `not available`, `not determined`, `not recorded`, `not reported`
- `unavailable`, `unspecified`, `undetermined`, `unidentified`
- `restricted`, `restricted access`, `withheld`, `confidential`
- `tbd`, `tba`

Common misspellings (`misssing`, `unkown`, `unknwon`) are also matched. Matching is case-insensitive.
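
A minimal sketch of case-insensitive placeholder matching is shown below. It covers only a handful of the patterns listed above, and `normalize_null` is a hypothetical helper; treating empty strings as null is an assumption of this example.

```python
# Small illustrative subset of the placeholder list, stored lowercase
NULL_PLACEHOLDERS = frozenset({
    "missing", "n/a", "na", "null", "none", "nil", "-", ".",
    "unknown", "not provided", "not collected", "not applicable",
    "unkown",  # common misspelling
})


def normalize_null(value):
    """Return None for placeholder values, otherwise the stripped string."""
    if value is None:
        return None
    text = str(value).strip()
    # casefold() makes the comparison case-insensitive
    if not text or text.casefold() in NULL_PLACEHOLDERS:
        return None
    return text
```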

---

## Assembly summary cache

On the first run, `ingest()` downloads two NCBI flat files to resolve assembly accessions and BioProject links:

- `assembly_summary_refseq.txt` (~100–300 MB)
- `assembly_summary_genbank.txt` (~100–300 MB)

These are cached in `~/.biometaharmonizer/cache/` (overridable with `--cache-dir` or `set_cache_dir()`). Files older than 7 days are automatically deleted and re-downloaded on the next run.

To force a refresh before the 7-day TTL expires — for example, immediately after a large batch of new assemblies is added to NCBI — pass `refresh_cache=True` to `ingest()` or use `--refresh-cache` on the CLI:

```bash
biometaharmonizer run --input ids.txt --email you@example.com \
    --output out.csv --refresh-cache
```

```python
df = ingest("ids.txt", email="you@example.com", refresh_cache=True)
```
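
The decision behind the 7-day TTL can be sketched as a file-age check. This is an illustration using only the standard library; `needs_download` and its `refresh` and `now` parameters are hypothetical names, and the package's own cache logic may be organized differently.

```python
import os
import time

TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days


def needs_download(path, refresh=False, now=None):
    """Return True if the cached file is missing, stale, or a refresh is forced."""
    if refresh or not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    # Compare the file's last-modified time against the TTL
    return (now - os.path.getmtime(path)) > TTL_SECONDS
```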

In Colab:

```python
from biometaharmonizer.ingestion import set_cache_dir
set_cache_dir("/content/bmh_cache")
```

---

## Entrez rate limits

Without an API key, NCBI allows 3 requests per second. With a key, the limit is 10 requests per second. BioMetaHarmonizer enforces inter-request sleep intervals automatically based on whether an API key is set.
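
The sleep interval follows directly from those limits: 1/3 s between requests without a key and 1/10 s with one. The sketch below shows one simple way to enforce it; the `Throttle` class and `min_interval` function are assumptions for illustration, not the package's internals.

```python
import time


def min_interval(api_key=None):
    """Minimum seconds between requests: 10/s with a key, 3/s without."""
    return 1.0 / (10 if api_key else 3)


class Throttle:
    """Sleep just long enough to keep requests under the NCBI rate limit."""

    def __init__(self, api_key=None, clock=time.monotonic, sleep=time.sleep):
        self.interval = min_interval(api_key)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before each Entrez request keeps the pace under the limit; the injectable `clock` and `sleep` arguments exist only to make the class easy to test.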

Register a free API key at https://www.ncbi.nlm.nih.gov/account/ and pass it as:

```bash
biometaharmonizer run --input ids.txt --email you@example.com \
    --api-key YOUR_KEY --output out.csv
```

or:

```python
df = ingest("ids.txt", email="you@example.com", api_key="YOUR_KEY")
```

---

## Geospatial parsing

`GeoEngine` splits `geo_loc_name` into `geo_country`, `geo_region`, `geo_locality`, `geo_iso3166`, `geo_sea_ocean`, and `geo_loc_raw`.

The parser recognizes two input formats:

- **Colon format** `"Country: Region, Locality"` — the part before `:` becomes `geo_country`, the first segment after `:` becomes `geo_region`, and any remainder after the comma becomes `geo_locality`.
- **Comma-only format** `"Country, Locality"` — the part before the first `,` becomes `geo_country` and the remainder becomes `geo_locality`. `geo_region` is left `NaN`.

Parenthetical qualifiers (e.g. `"United Kingdom (England, Wales & N. Ireland)"`, `"Pacific Ocean (NE)"`) are stripped from the country token before any lookup. This means ocean and sea names with qualifiers are still correctly routed to `geo_sea_ocean` rather than falling through to the country resolver.
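
The two formats can be sketched with plain string splitting. This simplified example only separates the tokens: it does not resolve ISO codes, route sea and ocean names, or detect coordinates, and `split_geo` is a hypothetical helper rather than the `GeoEngine` implementation. For comma-only inputs it strips parentheticals from the whole string, which is a simplification.

```python
import re

PAREN = re.compile(r"\s*\([^)]*\)")  # a parenthetical plus leading whitespace


def split_geo(raw):
    """Split a geo_loc_name string into country, region, and locality tokens."""
    if ":" in raw:
        # Colon format: "Country: Region, Locality"
        head, tail = raw.split(":", 1)
        country = PAREN.sub("", head).strip()
        region, _, locality = tail.partition(",")
        return {
            "country": country,
            "region": region.strip() or None,
            "locality": locality.strip() or None,
        }
    # Comma-only format: "Country, Locality"; region stays empty
    cleaned = PAREN.sub("", raw)
    country, _, locality = cleaned.partition(",")
    return {
        "country": country.strip(),
        "region": None,
        "locality": locality.strip() or None,
    }
```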

| Input | Result |
|---|---|
| `"USA: California, Los Angeles"` | country=USA, region=California, locality=Los Angeles, iso=US |
| `"USA: California"` | country=USA, region=California, iso=US |
| `"Germany, Bavaria"` | country=Germany, locality=Bavaria, iso=DE |
| `"France"` | country=France, iso=FR |
| `"Pacific Ocean"` | sea\_ocean=Pacific Ocean |
| `"Pacific Ocean (NE)"` | sea\_ocean=Pacific Ocean |
| `"Pacific Ocean: Mariana Trench"` | sea\_ocean=Pacific Ocean, locality=Mariana Trench |
| `"Red Sea (sampling site 3): surface"` | sea\_ocean=Red Sea, locality=surface |
| `"40.71 N, 74.00 W"` | geo\_loc\_raw preserved; all other geo columns NaN |
| `"Gaza Strip"` | country=Gaza Strip, iso=PS |
| `"West Bank"` | country=West Bank, iso=PS |
| `"United Kingdom (England, Wales & N. Ireland)"` | country=United Kingdom, iso=GB |
| `"not applicable"` | all geo columns NaN |

Handling notes:

- `England`, `Scotland`, `Wales`, `Northern Ireland` → `United Kingdom`, iso `GB`
- `United Kingdom (England, Wales & N. Ireland)` and similar compound UK variants → `United Kingdom`, iso `GB`
- `Gaza Strip`, `West Bank`, `Gaza`, `Palestine`, `Palestinian territories` → iso `PS`
- `Korea` (bare, no qualifier) → South Korea (`KR`); logged at INFO level
- Historical country names (`USSR`, `Yugoslavia`, `Zaire`, `East Germany`, etc.) → preserved in `geo_country`, `geo_iso3166 = HISTORICAL`
- Coordinate-only strings are preserved in `geo_loc_raw` and not reverse-geocoded; all other geo columns are `NaN`
- `Turkey` / `Türkiye`, `Namibia`, `Burma`, `DR Congo` and several aliases are resolved via a hardcoded table before pycountry fuzzy lookup
- All unique `geo_loc_name` values are resolved once and cached; pycountry fuzzy lookup runs at most once per unique country string regardless of row count
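
The resolve-once behavior in the last note can be illustrated with a small memoizing helper. `resolve_unique` is a hypothetical function and `resolver` stands in for whatever expensive lookup is being cached; the real engine may cache differently.

```python
def resolve_unique(values, resolver):
    """Apply resolver to each distinct value once and reuse the result."""
    cache = {}
    resolved = []
    for value in values:
        if value not in cache:
            cache[value] = resolver(value)  # expensive lookup runs once per value
        resolved.append(cache[value])
    return resolved


calls = []


def fake_lookup(name):
    """Stand-in for a fuzzy country lookup; records how often it is called."""
    calls.append(name)
    return name.upper()


result = resolve_unique(["France", "Germany", "France", "France"], fake_lookup)
```

With four input rows but only two distinct strings, `fake_lookup` runs twice.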

---

## One Health classification

`OneHealthClassifier` loads all biological knowledge from `schemas/one_health_dictionaries.json` and assigns each record one of nine categories: **Human**, **Animal**, **Aquatic**, **Wildlife**, **Plant**, **Food**, **Environmental**, **Lab**, **Unclassified**.

`classify_multi_field()` accepts up to six named `pd.Series` and returns a DataFrame with seven columns:

| Column | Type | Description |
|---|---|---|
| `one_health_category` | str | Assigned category; always a string, never NaN |
| `one_health_term` | str / NaN | The specific term or phrase that triggered the classification |
| `one_health_confidence` | float | Score in [0, 1]; computed as `term_specificity × field_weight + corroboration_bonus` |
| `one_health_evidence_level` | str | `high` (≥0.85), `medium` (≥0.60), `low` (≥0.30), `unresolved` |
| `one_health_processing` | str / NaN | Processing/handling term detected in the text (e.g. `pasteurized`, `frozen`) |
| `one_health_setting` | str / NaN | Setting term detected in the text (e.g. `clinical`, `farm`, `retail`) |
| `one_health_source_field` | str / NaN | Input field that produced the winning classification |

**Confidence model.** For each field, `confidence = min(1.0, term_specificity × field_weight + corroboration_bonus)`:

- `term_specificity`: 1.0 for host dictionary or unambiguous list hits; 0.90/0.75/0.50 for tier1 phrases by length; `WRatio / 100` for rapidfuzz fallback; 0.30 for ambiguous terms.
- `field_weight`: `isolation_source` / host dict hit → 1.00; host text hit → 0.90; `env_medium` → 0.85; `env_local_scale` → 0.80; `sample_type` → 0.70; `env_broad_scale` → 0.50.
- `corroboration_bonus`: +0.10 when a second independent field agrees with the same category.
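
The formula and the evidence-level thresholds can be written out directly. The function names below are illustrative; the weights and cutoffs are the ones stated in this section.

```python
def confidence(term_specificity, field_weight, corroborated=False):
    """min(1.0, term_specificity * field_weight + corroboration_bonus)."""
    bonus = 0.10 if corroborated else 0.0
    return min(1.0, term_specificity * field_weight + bonus)


def evidence_level(score):
    """Discretize a confidence score using the documented thresholds."""
    if score >= 0.85:
        return "high"
    if score >= 0.60:
        return "medium"
    if score >= 0.30:
        return "low"
    return "unresolved"


# An unambiguous term (1.0) found in env_medium (0.85), corroborated by a second field
score = confidence(1.0, 0.85, corroborated=True)
level = evidence_level(score)
```

Here the score is 1.0 × 0.85 + 0.10 = 0.95, which falls in the `high` band. An ambiguous term (0.30) seen only in `env_broad_scale` (0.50) scores 0.15 and stays `unresolved`.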

**Classification pipeline per record:**

1. `host` field: institution guard (strips culture collection prefixes; returns Lab if residual < 4 chars), then `host_to_category` dictionary lookup, then text classification fallback.
2. `isolation_source`, `env_medium`, `env_local_scale`: matched against unambiguous human/animal term lists, then tier1 patterns, then rapidfuzz fuzzy fallback against the ontology map.
3. `sample_type`: domain-level signal; used to set category if no specimen field matched.
4. `env_broad_scale`: supporting signal only; contributes a corroboration bonus but does not set the primary category on its own.
5. Pass 2 resolves the winning category from accumulated domain/specimen/supporting evidence.

---

## `collected_by` priority

1. **Explicit BioSample attribute** — any `<Attribute harmonized_name="collected_by">` or synonym is always preferred.
2. **`<Owner/Name>` fallback** — used only if no explicit collector attribute was found.

When both are present, the submission-side provenance is written to `_extra_attributes`:

- `submission_owner` — `<Owner/Name>` value
- `submission_contact` — full name from `<Owner/Contacts/Contact>`
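
A minimal sketch of this priority rule, assuming the three values have already been pulled from the XML. `resolve_collector` and its parameter names are hypothetical.

```python
def resolve_collector(explicit, owner_name=None, contact_name=None):
    """Return (collected_by, extras) following the priority rule above."""
    extras = {}
    if explicit:
        # Explicit attribute wins; keep submitter provenance in the extras
        if owner_name:
            extras["submission_owner"] = owner_name
        if contact_name:
            extras["submission_contact"] = contact_name
        return explicit, extras
    # No explicit collector: fall back to <Owner/Name>
    return owner_name, extras
```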

---

## Output formats

```python
from biometaharmonizer import write, write_summary

write(df, "out.csv")                        # CSV
write(df, "out.tsv", fmt="tsv")             # TSV
write(df, "out.xlsx", fmt="excel")          # Excel
write(df, "out.parquet", fmt="parquet")     # Parquet

write_summary(df, "fill_rates.csv")         # column, non_null_count, fill_pct
```

Format strings are case-insensitive. If `--format` is not specified on the CLI, the format is inferred from the output file extension.
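
Extension-based inference might look like the sketch below. The mapping and the `infer_format` helper are assumptions for illustration; in particular, which extensions the package actually recognizes beyond those shown in the examples above is not confirmed here.

```python
from pathlib import Path

# Assumed mapping from file extension to format name
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".xlsx": "excel",
    ".parquet": "parquet",
}


def infer_format(path, fmt=None):
    """Return the output format: explicit fmt wins, else use the extension."""
    if fmt:
        fmt = fmt.lower()  # format strings are case-insensitive
        if fmt not in EXTENSION_TO_FORMAT.values():
            raise ValueError(f"unsupported format: {fmt!r}")
        return fmt
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        raise ValueError(f"cannot infer format from extension {suffix!r}")
    return EXTENSION_TO_FORMAT[suffix]
```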

---

## Rebuilding schema files

The package ships with pre-built schema files. Rebuild them only when you want to incorporate upstream ontology or NCBI updates.

### `one_health_dictionaries.json`

Generated by `scripts/build_dictionaries.py`. It queries OLS4 (ENVO, FoodOn, UBERON, Plant Ontology), downloads the NCBI Taxonomy dump (~65 MB), and optionally queries the UMLS API for synonym expansion. Hand-curated entries in the base file always win over ontology-derived ones.

```bash
# Full rebuild (downloads taxdmp.zip from NCBI automatically)
python scripts/build_dictionaries.py \
    --base   src/biometaharmonizer/schemas/one_health_dictionaries.json \
    --output src/biometaharmonizer/schemas/one_health_dictionaries.json

# Use a pre-downloaded taxdmp.zip
python scripts/build_dictionaries.py --taxdmp /path/to/taxdmp.zip

# Skip NCBI Taxonomy entirely
python scripts/build_dictionaries.py --skip-ncbi

# Add UMLS synonym expansion (requires a free UMLS API key)
python scripts/build_dictionaries.py --umls-key YOUR_UMLS_KEY
```

### `ncbi_attributes.xml`

Generated by `scripts/build_ncbi_attribute_cache.py`. Downloads the official NCBI BioSample attribute harmonization table and stores it as `schemas/ncbi_attributes.xml`, which becomes Layer 2 of the synonym lookup.

```bash
python scripts/build_ncbi_attribute_cache.py
```

---

## Repository structure

```
BioMetaHarmonizer/
├── src/biometaharmonizer/
│   ├── __init__.py             # public API, version 0.6.0
│   ├── cli.py                  # CLI entrypoint
│   ├── ingestion.py            # Entrez fetching, XML parsing, schema definition
│   ├── synonyms.py             # two-layer synonym lookup (unified.json + NCBI XML)
│   ├── key_mapper.py           # column rename, coalesce, reindex
│   ├── date_engine.py          # date parsing, ISO 8601 output
│   ├── geo_engine.py           # geo_loc_name splitting, ISO-3166 resolution
│   ├── one_health.py           # One Health categorization
│   ├── output.py               # write CSV / TSV / Excel / Parquet
│   └── schemas/
│       ├── unified.json                      # standard keys + synonym lists
│       ├── one_health_dictionaries.json      # One Health keyword/ontology dict
│       └── ncbi_attributes.xml               # NCBI harmonization table (optional)
├── scripts/
│   ├── build_dictionaries.py               # rebuild one_health_dictionaries.json
│   └── build_ncbi_attribute_cache.py       # rebuild ncbi_attributes.xml
├── tests/
│   ├── test_ingestion.py
│   ├── test_key_mapper.py
│   ├── test_date_engine.py
│   ├── test_geo_engine.py
│   ├── test_one_health.py
│   ├── test_output.py
│   └── test_pipeline.py
└── pyproject.toml
```

---

## Running tests

```bash
pip install pytest
pytest tests/ -v --tb=short
```

All tests use synthetic data — no live NCBI calls are made.

---

## License

MIT
