Metadata-Version: 2.4
Name: scibite_toolkit
Version: 1.5.0a2
Summary: scibite-toolkit - python library for calling SciBite applications: TERMite, TExpress, SciBite Search, CENtree and Workbench. The library also enables processing of the JSON results from such requests
Author-email: SciBite <help@scibite.com>
License-Expression: CC-BY-NC-SA-4.0
Project-URL: Homepage, https://github.com/elsevier-health/scibite-toolkit
Project-URL: Repository, https://github.com/elsevier-health/scibite-toolkit
Project-URL: Issues, https://github.com/elsevier-health/scibite-toolkit/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: beautifulsoup4>=4.9.0
Requires-Dist: boto3>=1.26.0
Requires-Dist: pandas<3.0,>=1.0.0
Requires-Dist: pyyaml>=5.4
Requires-Dist: qdrant-client>=1.7.0
Requires-Dist: requests>=2.25.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Requires-Dist: httpx>=0.24.0; extra == "dev"
Requires-Dist: coverage; extra == "dev"
Requires-Dist: sphinx; extra == "dev"
Requires-Dist: sphinx-js; extra == "dev"
Requires-Dist: rst2pdf; extra == "dev"
Provides-Extra: async
Requires-Dist: httpx>=0.24.0; extra == "async"
Provides-Extra: workbench
Requires-Dist: openpyxl>=3.0.0; extra == "workbench"
Provides-Extra: oml
Requires-Dist: rdflib>=6.0.0; extra == "oml"
Requires-Dist: sentence-transformers>=2.2.0; extra == "oml"
Provides-Extra: all
Requires-Dist: httpx>=0.24.0; extra == "all"
Requires-Dist: openpyxl>=3.0.0; extra == "all"
Requires-Dist: rdflib>=6.0.0; extra == "all"
Requires-Dist: sentence-transformers>=2.2.0; extra == "all"
Dynamic: license-file

# SciBite Toolkit

Python library for making API calls to [SciBite](https://www.scibite.com/)'s suite of products and processing the JSON responses.

## Supported Products

- **TERMite** - Entity recognition and semantic enrichment (version 6.x)
- **TERMite 7** - Next-generation entity recognition with modern OAuth2 authentication
- **TExpress** - Pattern-based entity relationship extraction
- **SciBite Search** - Semantic search, document and entity analytics
- **CENtree** - Ontology management, navigation, and integration
- **CENtree VectorDB Uploader** - Upload ontology embedding CSVs from S3 or local files to Qdrant
- **CENtree Vector Generator** - End-to-end ontology→embedding CSV pipeline
- **CENtree Ontology ML** - OWL→sentence corpus, embedding generation, and Qdrant indexing
- **Workbench** - Dataset annotation and management

## Installation

```bash
pip install scibite-toolkit
```

See the available versions on [PyPI](https://pypi.org/project/scibite-toolkit/).

## Quick Start Examples

- [TERMite 7](#termite-7-examples) - Modern client with OAuth2
- [TERMite 6](#termite-6-examples) - Legacy client
- [TExpress](#texpress-examples) - Pattern matching
- [SciBite Search](#scibite-search-example)
- [CENtree](#centree-examples) - Ontology navigation
- [CENtree VectorDB Uploader](#centree-vectordb-uploader-examples) - S3/local→Qdrant upload
- [CENtree Vector Generator](#centree-vector-generator-examples) - Ontology→embedding CSV
- [CENtree Ontology ML](#centree-ontology-ml-examples) - OWL→embeddings pipeline
- [Workbench](#workbench-example)

---

## TERMite 7 Examples

TERMite 7 is the modern version, with enhanced OAuth2 authentication and an improved API.

### OAuth2 Client Credentials (SaaS - Recommended)

For modern SaaS deployments using a separate authentication server:

```python
from scibite_toolkit import termite7

# Initialize with context manager for automatic cleanup
with termite7.Termite7RequestBuilder() as t:
    # Set URLs
    t.set_url('https://termite.saas.scibite.com')
    t.set_token_url('https://auth.saas.scibite.com')

    # Authenticate with OAuth2 client credentials
    if not t.set_oauth2('your_client_id', 'your_client_secret'):
        print("Authentication failed!")
        raise SystemExit(1)

    # Annotate text
    t.set_entities('DRUG,INDICATION')
    t.set_subsume(True)
    t.set_text('Aspirin is used to treat headaches and reduce inflammation.')

    response = t.annotate_text()

    # Process the response
    df = termite7.process_annotation_output(response)
    print(df.head())
```
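
The placeholder strings above are fine for a quick test, but shared code usually reads credentials from the environment. A minimal sketch of that, assuming the hypothetical variable names `TERMITE_CLIENT_ID` and `TERMITE_CLIENT_SECRET`; the helper `load_oauth2_credentials` is illustrative and not part of scibite-toolkit:

```python
import os
from typing import Mapping, Optional, Tuple

# Hypothetical variable names; pick whatever fits your deployment.
ID_VAR = "TERMITE_CLIENT_ID"
SECRET_VAR = "TERMITE_CLIENT_SECRET"


def load_oauth2_credentials(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (client_id, client_secret), raising if either variable is unset."""
    env = os.environ if env is None else env
    missing = [name for name in (ID_VAR, SECRET_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env[ID_VAR], env[SECRET_VAR]
```

The returned pair can then be passed to `t.set_oauth2(client_id, client_secret)` in place of the literal strings.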

### OAuth2 Password Grant (Legacy)

For on-premise deployments using username/password authentication:

```python
from scibite_toolkit import termite7

t = termite7.Termite7RequestBuilder()

# Set main TERMite URL and token URL (same server for legacy)
t.set_url('https://termite.example.com')
t.set_token_url('https://termite.example.com')

# Authenticate with username and password
if not t.set_oauth2_legacy('client_id', 'username', 'password'):
    print("Authentication failed!")
    raise SystemExit(1)

# Annotate a document
t.set_entities('INDICATION,DRUG')
t.set_parser_id('generic')
t.set_file('path/to/document.pdf')

response = t.annotate_document()

# Process the response
df = termite7.process_annotation_output(response)
print(df)

# Clean up file handles
t.close()
```

### Get System Status

```python
from scibite_toolkit import termite7

t = termite7.Termite7RequestBuilder()
t.set_url('https://termite.example.com')
t.set_token_url('https://auth.example.com')
t.set_oauth2('client_id', 'client_secret')

# Get system status
status = termite7.get_system_status(t.url, t.headers)
print(f"Server Version: {status['data']['serverVersion']}")

# Get available vocabularies
vocabs = termite7.get_vocabs(t.url, t.headers)
print(f"Available vocabularies: {len(vocabs['data'])}")

# Get runtime options
rtos = termite7.get_runtime_options(t.url, t.headers)
print(rtos)
```

---

## TERMite 6 Examples

For legacy TERMite 6.x deployments.

### SciBite Hosted (SaaS)

```python
from scibite_toolkit import termite

# Initialize
t = termite.TermiteRequestBuilder()

# Configure
t.set_url('https://termite.saas.scibite.com')
t.set_saas_login_url('https://login.saas.scibite.com')

# Authenticate
t.set_auth_saas('username', 'password')

# Set runtime options
t.set_entities('INDICATION')
t.set_input_format('medline.xml')
t.set_output_format('json')
t.set_binary_content('path/to/file.xml')
t.set_subsume(True)

# Execute and process
response = t.execute()
df = termite.get_termite_dataframe(response)
print(df.head(3))
```

### Local Instance (Customer Hosted)

```python
from scibite_toolkit import termite

t = termite.TermiteRequestBuilder()
t.set_url('https://termite.local.example.com')

# Basic authentication for local instances
t.set_basic_auth('username', 'password')

# Configure and execute
t.set_entities('INDICATION')
t.set_input_format('medline.xml')
t.set_output_format('json')
t.set_binary_content('path/to/file.xml')
t.set_subsume(True)

response = t.execute()
df = termite.get_termite_dataframe(response)
print(df.head(3))
```

---

## TExpress Examples

Pattern-based entity relationship extraction.

### SciBite Hosted

```python
from scibite_toolkit import texpress

t = texpress.TexpressRequestBuilder()

t.set_url('https://texpress.saas.scibite.com')
t.set_saas_login_url('https://login.saas.scibite.com')
t.set_auth_saas('username', 'password')

# Set pattern to find relationships
t.set_entities('INDICATION,DRUG')
t.set_pattern(':(DRUG):{0,5}:(INDICATION)')  # Find DRUG within 5 words of INDICATION
t.set_input_format('medline.xml')
t.set_output_format('json')
t.set_binary_content('path/to/file.xml')

response = t.execute()
df = texpress.get_texpress_dataframe(response)
print(df.head())
```
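
The pattern string above pairs two entity types with a word-gap range. A small helper for composing strings of that shape might look like the sketch below. It only reproduces the `:(TYPE):{min,max}:(TYPE)` form shown in this README, and `proximity_pattern` is a hypothetical name, not a toolkit function:

```python
def proximity_pattern(left: str, right: str, max_gap: int = 5, min_gap: int = 0) -> str:
    """Build a pattern for `left` followed by `right` within a word-gap range."""
    if min_gap < 0 or max_gap < min_gap:
        raise ValueError("expected 0 <= min_gap <= max_gap")
    # Doubled braces emit literal '{' and '}' around the gap range.
    return f":({left}):{{{min_gap},{max_gap}}}:({right})"


pattern = proximity_pattern("DRUG", "INDICATION")
print(pattern)  # :(DRUG):{0,5}:(INDICATION)
```

The result can be handed to `t.set_pattern(pattern)` exactly like the literal string in the example.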

### Local Instance

```python
from scibite_toolkit import texpress

t = texpress.TexpressRequestBuilder()
t.set_url('https://texpress.local.example.com')
t.set_basic_auth('username', 'password')

t.set_entities('INDICATION,DRUG')
t.set_pattern(':(INDICATION):{0,5}:(INDICATION)')
t.set_input_format('pdf')
t.set_output_format('json')
t.set_binary_content('/path/to/file.pdf')

response = t.execute()
df = texpress.get_texpress_dataframe(response)
print(df.head())
```

---

## SciBite Search Example

Semantic search with entity-based queries and aggregations.

```python
from scibite_toolkit import scibite_search

# Configure
s = scibite_search.SBSRequestBuilder()
s.set_url('https://yourdomain-search.saas.scibite.com/')
s.set_auth_url('https://yourdomain.saas.scibite.com/')

# Authenticate with OAuth2
s.set_oauth2('your_client_id', 'your_client_secret')

# Search documents
query = 'schema_id="clinical_trial" AND (title~INDICATION$D011565 AND DRUG$*)'
# Preferred: request specific fields using the new 'fields' parameter (legacy: 'additional_fields')
response = s.get_docs(query=query, markup=True, limit=100, fields=['*'])

# Get co-occurrence aggregations
# Find top 50 genes co-occurring with psoriasis
response = s.get_aggregates(
    query='INDICATION$D011565',
    vocabs=['HGNCGENE'],
    limit=50
)
```

> **Note:** Preferred parameter name is `fields`. The legacy `additional_fields` is still supported for backward compatibility. When both are provided, `fields` takes precedence.

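The precedence rule in the note can be pictured with a short sketch. This illustrates the documented behaviour only and is not the toolkit's implementation; `resolve_fields` and the field names `title` and `abstract` are hypothetical:

```python
from typing import List, Optional, Sequence


def resolve_fields(
    fields: Optional[Sequence[str]] = None,
    additional_fields: Optional[Sequence[str]] = None,
) -> Optional[List[str]]:
    """Pick the field list to request, preferring `fields` over the legacy name."""
    if fields is not None:
        return list(fields)
    if additional_fields is not None:
        return list(additional_fields)
    return None


print(resolve_fields(fields=["title"], additional_fields=["abstract"]))  # ['title']
print(resolve_fields(additional_fields=["abstract"]))  # ['abstract']
```
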
---

## CENtree Examples

Ontology navigation and search.

### Modern Client (Recommended)

The modern `centree_clients` module provides better error handling, retries, and context manager support.

```python
from scibite_toolkit.centree_clients import CENtreeReaderClient

# Use context manager for automatic cleanup
with CENtreeReaderClient(
    base_url="https://centree.example.com",
    bearer_token="your_token",
    timeout=(3.0, None)  # Quick connect, unlimited read
) as reader:

    # Search by exact label
    hits = reader.get_classes_by_exact_label("efo", "neuron")
    print(f"Found {len(hits)} matches")

    # Get ontology roots
    roots = reader.get_root_entities("efo", "classes", size=10)

    # Get paths from root to target (great for LLM grounding)
    paths = reader.get_paths_from_root("efo", "MONDO_0007739", as_="labels")
    for path in paths:
        print(" → ".join(path))

# Or authenticate with OAuth2 client credentials instead of a bearer token
reader = CENtreeReaderClient(base_url="https://centree.example.com")
if reader.set_oauth2(client_id="...", client_secret="..."):
    hits = reader.get_classes_by_exact_label("efo", "lung")
    print(hits)
```
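
Since root-to-target paths are suggested above for LLM grounding, one way to turn them into prompt context is sketched below. It assumes the result is a list of label lists, as the `" → ".join(path)` loop implies; `paths_to_context` and the sample labels are hypothetical:

```python
from typing import Iterable, List, Sequence


def paths_to_context(paths: Iterable[Sequence[str]], separator: str = " > ") -> str:
    """Render label paths as one line each, dropping exact duplicates."""
    seen = set()
    lines: List[str] = []
    for path in paths:
        line = separator.join(path)
        if line not in seen:
            seen.add(line)
            lines.append(line)
    return "\n".join(lines)


# Made-up labels standing in for a real `get_paths_from_root` result.
example_paths = [
    ["disease", "nervous system disease", "example disease"],
    ["disease", "genetic disease", "example disease"],
    ["disease", "nervous system disease", "example disease"],
]
print(paths_to_context(example_paths))
```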

---

## CENtree VectorDB Uploader Examples

Upload ontology embedding CSVs from S3 or local files to Qdrant for vector search.

> **Qdrant version compatibility:** The `qdrant-client` Python package must match your Qdrant server version within one minor version (e.g. client 1.7.x for server 1.7.x or 1.8.x). A mismatch may cause silent data corruption or connection errors. Pin the client version to match your server: `pip install qdrant-client==1.7.0`

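The compatibility rule in the note above can be checked before an upload. The sketch below is one reading of "within one minor version", namely the same major version and minor versions at most one apart. The function names are hypothetical, and how you obtain the two version strings is left to you (the installed client version is available from `importlib.metadata.version("qdrant-client")`):

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.7.3' or 'v1.7.3'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def within_one_minor(client_version: str, server_version: str) -> bool:
    """True when major versions match and minor versions differ by at most one."""
    c_major, c_minor = parse_major_minor(client_version)
    s_major, s_minor = parse_major_minor(server_version)
    return c_major == s_major and abs(c_minor - s_minor) <= 1


print(within_one_minor("1.7.0", "1.8.2"))  # True
print(within_one_minor("1.7.0", "1.9.0"))  # False
```
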
### CLI Usage

```bash
# Upload all datasets under the configured S3 prefix
centree2vec-upload --config config.yaml

# Upload only specific ontologies
centree2vec-upload --config config.yaml --ontology efo mondo

# Upload local embedding files directly (no S3 required)
centree2vec-upload --config config.yaml --local efo_embeddings.csv.gz

# Replace existing vectors for each ontology before uploading
centree2vec-upload --config config.yaml --replace

# Combine --local and --replace to re-upload a single ontology
centree2vec-upload --config config.yaml --local efo_embeddings.csv.gz --replace

# Dry-run to preview which files would be processed
centree2vec-upload --config config.yaml --dry-run

# Public S3 bucket with anonymous access
centree2vec-upload --config config.yaml --anonymous
```

### Python API

```python
from scibite_toolkit.centree_vectordb_uploader import run, load_config

# Load YAML configuration
cfg = load_config("config.yaml")

# Run the upload pipeline
results = run(cfg)
for r in results:
    print(f"{r['ontology']}: {r['total_rows']} vectors uploaded")

# Replace existing vectors for each ontology before uploading
results = run(cfg, replace=True)

# Dry-run to inspect what would be uploaded
results = run(cfg, dry_run=True)
```

### Generate a Starter Config

```bash
# Write the bundled example config to the current directory
centree2vec-upload --init

# Or specify a custom path
centree2vec-upload --init my-config.yaml
```

### Configuration Reference

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `qdrant.url` | str | — | **Required.** Qdrant server URL |
| `qdrant.collection_name` | str | — | **Required.** Target collection name |
| `qdrant.distance` | str | `cosine` | Distance metric: `cosine`, `euclid`, `dot`, `manhattan` |
| `qdrant.api_key_env` | str | — | Env var name holding the Qdrant API key |
| `qdrant.hnsw_config.m` | int | `32` | HNSW graph connectivity |
| `qdrant.hnsw_config.ef_construct` | int | `256` | HNSW index build search depth |
| `qdrant.hnsw_config.full_scan_threshold` | int | `10000` | Point count below which brute-force is used |
| `s3.bucket` | str | — | **Required** (S3 mode). S3 bucket name |
| `s3.prefix` | str | — | **Required** (S3 mode). S3 key prefix for embedding files |
| `s3.anonymous` | bool | `false` | Use unsigned requests for public buckets |
| `s3.endpoint_url` | str | — | Custom S3-compatible endpoint URL |
| `s3.region` | str | `eu-west-2` | AWS region |
| `ingest.vector_size` | int | `384` | Embedding dimension |
| `ingest.batch_size` | int | `1024` | Points per Qdrant upload batch |
| `ingest.chunk_size` | int | `500000` | Rows per pandas read chunk |
| `ingest.parallel_uploads` | int | `4` | Parallel upload threads |
| `ingest.build_indices_after_upload` | bool | `true` | Build payload indexes after upload |
| `ingest.payload_index_fields` | list | `[metadata.iri, metadata.id, metadata.ontology]` | Fields to index |
| `selection.ontologies` | list | — | Ontology names to ingest (all if omitted) |
| `selection.include_files` | list | — | S3 keys to force-include |
| `selection.exclude_files` | list | — | S3 keys to always skip (highest priority) |

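To make the required keys and defaults in the table concrete, the sketch below mirrors the YAML structure as a Python dict and reports which required keys are missing. It is not the toolkit's own validation (`load_config` may behave differently); `lookup` and `missing_keys` are hypothetical helpers, and only some of the defaults are copied over:

```python
from typing import Any, Dict, List

# A subset of the defaults from the table above, keyed by dotted path.
DEFAULTS: Dict[str, Any] = {
    "qdrant.distance": "cosine",
    "s3.anonymous": False,
    "s3.region": "eu-west-2",
    "ingest.vector_size": 384,
    "ingest.batch_size": 1024,
    "ingest.chunk_size": 500000,
    "ingest.parallel_uploads": 4,
}
REQUIRED = ["qdrant.url", "qdrant.collection_name"]
REQUIRED_S3 = ["s3.bucket", "s3.prefix"]


def lookup(cfg: Dict[str, Any], dotted: str) -> Any:
    """Fetch a nested value by dotted path, falling back to the table default."""
    node: Any = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return DEFAULTS.get(dotted)
        node = node[part]
    return node


def missing_keys(cfg: Dict[str, Any], s3_mode: bool = True) -> List[str]:
    """List required keys that are absent or empty in the config."""
    required = REQUIRED + (REQUIRED_S3 if s3_mode else [])
    return [key for key in required if not lookup(cfg, key)]


cfg = {
    "qdrant": {"url": "http://localhost:6333", "collection_name": "my_ontology"},
    "ingest": {"batch_size": 512},
}
print(missing_keys(cfg))                  # ['s3.bucket', 's3.prefix']
print(missing_keys(cfg, s3_mode=False))   # []
print(lookup(cfg, "ingest.vector_size"))  # 384
```
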
---

## CENtree Vector Generator Examples

End-to-end pipeline that takes a local ontology file, generates a sentence corpus via `Owl2Sentence`, encodes embeddings with `sentence-transformers`, and writes a gzipped CSV ready for Qdrant upload. Requires the `oml` extras:

```bash
pip install scibite-toolkit[oml]
```

### CLI Usage

```bash
# Generate embeddings from an OWL file (outputs <name>_embeddings.csv.gz)
centree2vec-generate ontology.owl

# Custom output path and model
centree2vec-generate ontology.owl -o output.csv.gz --model all-MiniLM-L6-v2

# With debug logging and custom batch size
centree2vec-generate ontology.owl --debug --batch-size 64
```

### Python API

```python
import argparse
from scibite_toolkit.centree_vector_generator import (
    validate_format,
    derive_ontology_name,
    generate_corpus,
    generate_embeddings,
    write_output,
    run,
)

# Use the full pipeline via run()
args = argparse.Namespace(
    input_file="ontology.owl",
    output="embeddings.csv.gz",
    model="sentence-transformers/all-MiniLM-L6-v2",
    batch_size=128,
    debug=False,
    include_sentences=False,
)
run(args)

# Or use individual stages
fmt = validate_format("ontology.owl")       # "xml"
name = derive_ontology_name("ontology.owl")  # "ontology"
df = generate_corpus("ontology.owl", name)
df = generate_embeddings(df, "sentence-transformers/all-MiniLM-L6-v2", batch_size=128)
write_output(df, "ontology_embeddings.csv.gz")
```

### Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `input_file` | *(required)* | Path to the ontology file |
| `--output`, `-o` | `<name>_embeddings.csv.gz` | Output file path |
| `--model` | `sentence-transformers/all-MiniLM-L6-v2` | Sentence-transformers model |
| `--batch-size` | `128` | Encoding batch size |
| `--debug` | `false` | Enable verbose Owl2Sentence logging |

### Output Format

Gzipped CSV with columns:

| Column | Description |
|--------|-------------|
| `id` | Unique identifier for the sentence |
| `iri` | IRI of the ontology class |
| `label` | Human-readable class label |
| `ontology` | Ontology name (derived from filename) |
| `content` | Generated sentence text |
| `embeddings` | JSON-encoded 384-dimensional float array |

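Reading a file with the columns listed above needs only the standard library. The sketch below assumes the `embeddings` column holds a JSON array string, as the table describes. It first writes a tiny stand-in file with a made-up 3-dimensional vector so that it runs on its own; `read_embeddings` is a hypothetical helper:

```python
import csv
import gzip
import json
import os
import tempfile

COLUMNS = ["id", "iri", "label", "ontology", "content", "embeddings"]


def read_embeddings(path):
    """Yield one dict per row, with `embeddings` decoded into a list of floats."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            row["embeddings"] = json.loads(row["embeddings"])
            yield row


# Write a small sample file (placeholder values) to read back.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_embeddings.csv.gz")
with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow({
        "id": "1",
        "iri": "http://example.org/onto#A",
        "label": "example class",
        "ontology": "ontology",
        "content": "example class is a placeholder.",
        "embeddings": json.dumps([0.1, 0.2, 0.3]),
    })

rows = list(read_embeddings(sample_path))
print(rows[0]["label"], len(rows[0]["embeddings"]))
```
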
### Pipeline

```
ontology.owl ──▶ Owl2Sentence ──▶ corpus (DataFrame) ──▶ SentenceTransformer ──▶ embeddings.csv.gz
                 (parse & generate     (id, iri, label,     (encode content         (ready for
                  sentences)            ontology, content)    column)                 Qdrant upload)
```

The output is directly compatible with the CENtree VectorDB Uploader (`centree2vec-upload`), for example via its `--local` option.

---

## CENtree Ontology ML Examples

Convert OWL ontologies to natural-language corpora, generate sentence embeddings, and index them in Qdrant. Requires the `oml` extras:

```bash
pip install scibite-toolkit[oml]
```

### Python API

```python
from scibite_toolkit.centree_ontology_ml import Owl2Sentence, generate_embeddings

# Load ontology and generate sentence corpus
o2s = Owl2Sentence(owl_file="ontology.owl")
documents = o2s.run()

# Generate embeddings
texts = [doc.content for doc in documents]
embeddings = generate_embeddings(texts, model_name="sentence-transformers/all-MiniLM-L6-v2")
```

### CLI Usage

The `owl2sentence` command exposes three pipeline stages:

```bash
# 1. Convert OWL to sentence corpus
owl2sentence corpus -i ontology.owl -o corpus.csv

# 2. Generate embeddings
owl2sentence embed -i corpus.csv -o embeddings.csv -m sentence-transformers/all-MiniLM-L6-v2

# 3. Index in Qdrant
owl2sentence index -i embeddings.csv --url http://localhost:6333 --collection my_ontology

# Pipeline chaining via stdout/stdin
owl2sentence corpus -i ontology.owl -o - | owl2sentence embed -i - -o - | owl2sentence index -i - --url http://localhost:6333 --collection my_ontology
```

---

## Workbench Example

Dataset management and annotation.

```python
from scibite_toolkit import workbench

# Initialize
wb = workbench.WorkbenchRequestBuilder()
wb.set_url('https://workbench.example.com')

# Authenticate
wb.set_oauth2('client_id', 'username', 'password')

# Create dataset
wb.set_dataset_name('My Analysis Dataset')
wb.set_dataset_desc('Dataset for clinical trial analysis')
wb.create_dataset()

# Upload file
wb.set_file_input('path/to/data.xlsx')
wb.upload_file_to_dataset()

# Configure and run annotation
vocabs = [[5, 6], [8, 9]]  # Vocabulary IDs
attrs = [200, 201]  # Attribute IDs
wb.set_termite_config('', vocabs, attrs)
wb.auto_annotate_dataset()
```

---

## Key Features

### Context Manager Support (TERMite 7, CENtree Clients)

Modern clients support context managers for automatic resource cleanup:

```python
from scibite_toolkit import termite7

with termite7.Termite7RequestBuilder() as t:
    t.set_url('...')
    # ... work with client ...
# File handles are closed automatically when the block exits
```

### Error Handling

All OAuth2 methods return a boolean status, so failures are easy to handle:

```python
# `t` is a client such as termite7.Termite7RequestBuilder()
if not t.set_oauth2(client_id, client_secret):
    print("Authentication failed - check credentials")
    raise SystemExit(1)
```
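
Because authentication reports success as a boolean, it is straightforward to wrap in a retry loop for transient failures such as a briefly unreachable auth server. A minimal sketch, with `authenticate_with_retry` as a hypothetical helper and a stand-in callable in place of a real `set_oauth2` call:

```python
import time
from typing import Callable


def authenticate_with_retry(
    authenticate: Callable[[], bool],
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> bool:
    """Call a boolean-returning auth function until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        if authenticate():
            return True
        if attempt < attempts:
            time.sleep(delay_seconds * attempt)  # linear backoff between tries
    return False


# Stand-in for `lambda: t.set_oauth2(client_id, client_secret)`:
# fails twice, then succeeds.
outcomes = iter([False, False, True])
print(authenticate_with_retry(lambda: next(outcomes), delay_seconds=0.0))  # True
```

Retrying only helps with transient problems; wrong credentials will fail on every attempt, so keep the attempt count small.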

### Logging

Enable detailed logging for debugging:

```python
import logging

from scibite_toolkit import termite7

logging.basicConfig(level=logging.DEBUG)

# Or set per-client
t = termite7.Termite7RequestBuilder(log_level='DEBUG')
```
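
Global `DEBUG` logging also surfaces messages from the HTTP stack. A sketch of keeping those quieter using only the standard `logging` module; `urllib3` is the logger name used by the library underneath `requests`, and `my_script` is a hypothetical name for your own logger:

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Keep HTTP connection chatter at WARNING while everything else stays at DEBUG.
logging.getLogger("urllib3").setLevel(logging.WARNING)

log = logging.getLogger("my_script")  # hypothetical logger name for your own code
log.debug("debug logging enabled")
```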

### Session Management

All clients use `requests.Session()` for efficient connection pooling and automatic retry handling.

---



## License

Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). See `LICENSE.txt`.
