Metadata-Version: 2.4
Name: ontology-transformer
Version: 0.1.6
Summary: End-to-end ontology embedding via fine-tuning sentence transformers with hyperbolic geometry and role-based rotation for existential restrictions.
Author: Hui
License: Apache-2.0
Keywords: ontology,embedding,transformer,hyperbolic,knowledge-graph
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: geoopt>=0.5
Requires-Dist: deeponto>=0.9
Requires-Dist: datasets>=2.0
Requires-Dist: click>=8.0
Requires-Dist: yacs>=0.1.8
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: scikit-learn>=1.3
Requires-Dist: tqdm>=4.60
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == "test"
Requires-Dist: pytest-timeout>=2.0; extra == "test"
Dynamic: license-file

# Ontology-Transformer

**End-to-end ontology embedding** by fine-tuning sentence transformers with **hyperbolic geometry** and **role-based rotation** for embedding EL concepts (e.g., ∃r.C).

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

## Features

- **One-line training**: `OntologyTransformer.fit("ontology.owl")` → fine-tuned embeddings
- **Hyperbolic space**: Poincaré ball embeddings for hierarchical structures
- **Role-aware existential restrictions**: ∃r.C encoded via learned rotation transformations
- **Automatic data preparation**: Converts OWL/OFN axioms to training data (no manual preprocessing)
- **Best-lambda auto-tuning**: the centripetal weight λ is optimized on evaluation data and saved with the model
- **Flexible evaluation**: Use training ontology samples or separate eval/test ontologies

## Installation

```bash
pip install ontology-transformer
```

### Requirements

- Python ≥ 3.9
- PyTorch ≥ 2.0 (with CUDA recommended)
- `sentence-transformers`, `geoopt`, `deeponto`, `datasets`

## Quick Start

### 1. End-to-end: OWL → Fine-tune → Embeddings

```python
from ont import OntologyTransformer

# Train on any OWL/OFN ontology (all axioms used for training)
model = OntologyTransformer.fit(
    owl_path="path/to/ontology.owl",
    output_dir="./output",
    num_epochs=3,
    batch_size=64,        # training batch size (sentences per step)
    eval_batch_size=32,   # evaluation batch size (queries scored per step)
    eval_ratio=0.1,       # 10% of axioms sampled for evaluation
    max_eval=1000,        # max 1000 eval samples
)

# The best lambda (centripetal weight) is determined during training
print(f"Best lambda: {model.best_lambda}")

# Encode concepts
emb = model.encode("food product")

# Encode ∃r.C (existential restrictions) via role rotation
exist_emb = model.encode_existence("has ingredient", "sugar")
```

### 2. Use separate ontologies for evaluation/testing

```python
model = OntologyTransformer.fit(
    owl_path="train_ontology.owl",
    eval_owl_path="eval_ontology.owl",   # optional: separate eval ontology
    test_owl_path="test_ontology.owl",   # optional: separate test ontology
    output_dir="./output",
    num_epochs=3,
)
```

### 3. Load a pre-trained model

```python
from ont import OntologyTransformer

# Load model (best_lambda is automatically restored)
model = OntologyTransformer.from_pretrained("./output/final")
print(f"Loaded best_lambda: {model.best_lambda}")

# Encode
emb = model.encode("heart disease")
exist_emb = model.encode_existence("has part", "cell membrane")
```
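
Embeddings live in the Poincaré ball, so hyperbolic distance is the natural way to compare them. Below is a minimal scoring sketch using `geoopt`; the curvature value and the tensor conversion are assumptions (check `wrapper_config.json` for the actual ball settings), not the package's built-in scoring API:

```python
import torch
import geoopt

ball = geoopt.PoincareBall(c=1.0)  # assumed curvature; verify against wrapper_config.json

# Score a concept against an existential restriction: a smaller hyperbolic
# distance suggests a stronger predicted subsumption/relatedness.
concept = torch.as_tensor(model.encode("heart disease"))
restriction = torch.as_tensor(model.encode_existence("has part", "cell membrane"))

print(ball.dist(concept, restriction).item())
```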

### 4. CLI

```bash
# Basic training
ont-train --owl ontology.owl --output ./output --epochs 3

# With explicit batch sizes
ont-train --owl ontology.owl --output ./output \
    --epochs 20 --batch-size 256 --eval-batch-size 64

# With separate eval ontology
ont-train --owl train.owl --eval-owl eval.owl --output ./output --epochs 3

# Balanced mode (adds C_neg contrastive loss)
ont-train --owl ontology.owl --output ./output --balanced --epochs 3
```

### Key training parameters

| Parameter (Python) | CLI flag | Default | Description |
|---|---|---|---|
| `num_epochs` | `--epochs` | `1` | Number of training epochs |
| `batch_size` | `--batch-size` | `64` | Sentences per training step. Increase for larger GPUs (e.g. 256 on 40+ GB). |
| `eval_batch_size` | `--eval-batch-size` | `32` | Queries scored per evaluation step. Increase to speed up evaluation when GPU memory allows. |
| `learning_rate` | `--lr` | `1e-5` | Learning rate |
| `balanced` | `--balanced` | `False` | Add C_neg contrastive loss for existential restrictions |
| `balanced_negatives` | `--balanced-negatives` | `1` | Number of negative samples in balanced mode |
| `eval_ratio` | — | `0.1` | Fraction of axioms sampled for eval (Python API only) |
| `max_eval` | — | `1000` | Max number of eval samples (Python API only) |

## Data Preparation Flow

**By default** (no separate eval/test ontologies):
1. **All axioms** from input ontology → training data (`train.jsonl`, `train_exist.jsonl`, `train_conj.jsonl`)
2. **10% of axioms** (max 1000) randomly sampled → evaluation data (`val.json`)
3. **No test split** created (unless `test_owl_path` is provided)

**With external eval/test ontologies**:
- `eval_owl_path`: evaluation data prepared from this ontology
- `test_owl_path`: test evaluation performed after training

This design ensures **all available training data is used** while still enabling hyperparameter tuning (best lambda) via evaluation.
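
For intuition, the default split behaves roughly like the sketch below. This is a conceptual illustration, not the package's internal code; `axioms` stands in for the verbalized axiom list:

```python
import random

def split_for_eval(axioms, eval_ratio=0.1, max_eval=1000, seed=42):
    """All axioms go to training; a capped fraction is also sampled for eval."""
    rng = random.Random(seed)
    n_eval = min(max_eval, int(len(axioms) * eval_ratio))
    eval_axioms = rng.sample(axioms, n_eval)
    return axioms, eval_axioms  # (train, eval); eval overlaps train by design
```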

## Training Modes

### Non-balanced (default)
Standard contrastive losses over taxonomy and existential axioms (the first two terms are sketched after this list):
- Clustering loss: pull each child toward its parent, away from sampled negatives
- Centripetal loss: keep parents nearer the ball's origin than their children (weighted by λ)
- Conjunction loss: C₁ ⊓ C₂ ⊑ D
- Existential loss: ∃r.C encoded via rotation
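
A minimal sketch of the clustering and centripetal terms, assuming a HiT-style formulation in the Poincaré ball; the margins, curvature, and reductions here are illustrative placeholders, not the package's exact loss code:

```python
import torch
import geoopt

ball = geoopt.PoincareBall(c=1.0)  # assumed curvature

def clustering_loss(child, parent, negative, margin=1.0):
    # Triplet objective: a child should be hyperbolically closer to its
    # parent than to a sampled non-parent.
    return torch.relu(ball.dist(child, parent) - ball.dist(child, negative) + margin)

def centripetal_loss(child, parent, margin=0.1):
    # Parents should sit nearer the origin than their children, so deeper
    # concepts drift outward (this is the term weighted by lambda).
    return torch.relu(parent.norm(dim=-1) - child.norm(dim=-1) + margin)
```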

### Balanced
Adds an extra contrastive loss with negative concept samples (C_neg) for existential restrictions:
```python
model = OntologyTransformer.fit(
    owl_path="ontology.owl",
    balanced=True,
    balanced_negatives=5,  # number of negative samples
)
```

## Architecture

- **Base model**: `SentenceTransformer` fine-tuned in the Poincaré ball (hyperbolic space)
- **Role model**: linear layer mapping role embeddings to rotation angles (`rotation` or `transition` mode)
- **Existential encoding**: ∃r.C = rotate(embed(C), f_r(embed(r))); see the sketch below
- **Best lambda**: Centripetal weight λ optimized on eval data, saved in `wrapper_config.json`
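
A rough sketch of the rotation step in `rotation` mode, assuming RotatE-style planar rotations over consecutive coordinate pairs; the actual role model and mode handling may differ:

```python
import torch

def rotate(concept: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive coordinate pairs of `concept` by per-pair angles.

    concept: (..., 2k) embedding of C
    angles:  (..., k) angles produced by the role model f_r
    """
    x = concept[..., 0::2]
    y = concept[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(concept)
    out[..., 0::2] = x * cos - y * sin
    out[..., 1::2] = x * sin + y * cos
    # A pure rotation preserves the norm, so the result stays in the ball.
    return out
```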

## Model Saving & Loading

Models are saved with:
- Base sentence transformer weights
- Role model weights (`role_model.pt`)
- Configuration (`wrapper_config.json`) including `best_lambda`
- Concept/role vocabularies

```python
# Save
model.save("./my_model")

# Load (best_lambda automatically restored)
loaded = OntologyTransformer.from_pretrained("./my_model")
```

## Running Tests

```bash
# Install with test dependencies
pip install -e ".[test]"

# Run all tests
pytest tests/ -v

# Skip integration tests (large ontologies)
pytest tests/ -v -m "not integration"

# Run specific test
pytest tests/test_pipeline.py::TestPipeline::test_fit_tiny_owl -v
```

## Examples

See `examples/` directory for:
- Training on the FoodOn, SNOMED CT, and GALEN ontologies
- Evaluating embeddings for subsumption prediction
- Using external eval/test ontologies

## Citation

If you use this package, please cite:

```bibtex
@inproceedings{yang2025language,
  title={Language Models as Ontology Encoder},
  author={Yang, Hui and Chen, Jiaoyan and Horrocks, Ian},
  booktitle={International Semantic Web Conference (ISWC)},
  year={2025},
  organization={Springer}
}
```

GitHub: https://github.com/HuiYang1997/OnT

## Changelog
### 0.1.5 (2026-04-02)
- **README updates**

### 0.1.4
- **Fix: axiom duplication in data preparation** — `create_dataset()` previously
  counted every axiom twice because `getImportsClosure()` already includes the
  ontology itself. The duplicated data inflated training set size 2–3× and
  degraded embedding quality.
- **Fix: OOM during evaluation on large ontologies** — `OnTEvaluator` now scores
  candidates in GPU chunks (`cand_chunk_size=4096`, configurable) instead of
  broadcasting the full `(batch, N, dim)` tensor, eliminating OOM errors for
  ontologies with 100K+ concepts (e.g. SNOMED CT ~364K concepts).
- **Improvement: skip repeated data preparation** — `pipeline.fit()` reuses
  already-prepared `data/` directory on restart, avoiding the 5-minute OWL
  parsing step when resuming crashed runs.

### 0.1.2
- Initial public release.

## License

Apache License 2.0 - see [LICENSE](LICENSE) file for details.
