Metadata-Version: 2.4
Name: rettxmutation
Version: 0.3.6
Summary: Extract Rett Syndrome mutations from genetic diagnosis reports
Author-email: Pedro Rocha <procha@rettsyndrome.eu>
License: MIT License
Project-URL: Homepage, https://github.com/rett-europe/rettxmutation
Project-URL: Issues, https://github.com/rett-europe/rettxmutation/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-dotenv
Requires-Dist: azure-core
Requires-Dist: azure-ai-textanalytics
Requires-Dist: mutalyzer_hgvs_parser
Requires-Dist: pydantic
Requires-Dist: openai
Requires-Dist: backoff
Requires-Dist: jmespath
Requires-Dist: azure-search-documents
Requires-Dist: numpy
Dynamic: license-file

# RettX Mutation Analysis Library

[![PyPI version](https://badge.fury.io/py/rettxmutation.svg)](https://badge.fury.io/py/rettxmutation)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A Python library for extracting and validating genetic mutations from clinical reports using an AI-powered agentic pipeline. Supports **6 Rett Syndrome-related genes** across **multiple languages**, returning fully normalized HGVS nomenclature with genomic coordinates on both GRCh37 and GRCh38 assemblies.

## 🚀 Quick Start

### Installation

```bash
pip install rettxmutation
```

### Basic Usage

```python
import asyncio
from rettxmutation import RettxServices, DefaultConfig

async def extract():
    config = DefaultConfig()  # loads from .env / environment variables

    with RettxServices(config) as services:
        result = await services.agent_extraction_service.extract_mutations(
            "The patient carries the mutation NM_004992.4:c.916C>T (p.Arg306Cys) in MECP2."
        )

        for key, mutation in result.mutations.items():
            pt = mutation.primary_transcript
            print(f"Gene:       {pt.gene_id}")
            print(f"Transcript: {pt.hgvs_transcript_variant}")
            print(f"Protein:    {pt.protein_consequence_tlr}")
            print(f"Type:       {mutation.variant_type}")
            for assembly, coord in mutation.genomic_coordinates.items():
                print(f"{assembly}:    {coord.hgvs}  (pos {coord.start:,}–{coord.end:,})")

asyncio.run(extract())
```

### CLI Example

```bash
python examples/extract_from_file.py path/to/genetic_report.txt --verbose
```

## ✨ Key Features

- **🧬 Multi-Gene Support**: MECP2, FOXG1, SLC6A1, CDKL5, EIF2B2, MEF2C — with curated RefSeq transcripts
- **🌍 Multilingual**: Processes reports in English, Spanish, Greek, Turkish, and more
- **🤖 Agentic Extraction**: Azure OpenAI-powered agent with tool-calling (gene registry lookup, variant validation, complex variant handling)
- **✅ HGVS Validation**: Every mutation is validated via VariantValidator with automatic coordinate liftover (GRCh37 ↔ GRCh38)
- **🔒 PHI Redaction**: Automatic removal of personal health information before LLM processing
- **⚡ Production Ready**: Type-safe Pydantic v2 models, exponential backoff, connection pooling
- **🔄 Dual Assembly Output**: Every mutation includes genomic coordinates on both GRCh37 and GRCh38
- **🏗️ Modular Architecture**: Lazy-initialized services with dependency injection and context manager support

## 🧬 Supported Genes

| Gene | Chromosome | Primary Transcript | Condition |
|------|-----------|-------------------|-----------|
| **MECP2** | Xq28 | NM_004992.4 (+NM_001110792.2) | Classic Rett Syndrome |
| **FOXG1** | 14q12 | NM_005249.5 | Congenital variant Rett |
| **SLC6A1** | 3p25.3 | NM_003042.4 | Myoclonic-atonic epilepsy |
| **CDKL5** | Xp22.13 | NM_001323289.2 | CDKL5 deficiency disorder |
| **EIF2B2** | 14q24.3 | NM_014239.4 | Vanishing white matter disease |
| **MEF2C** | 5q14.3 | NM_002397.5 (+NM_001131005.2) | MEF2C haploinsufficiency |
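
The table above can be mirrored as a small lookup structure, for example to check whether a gene named in a report is supported before running an extraction. This is a minimal sketch built only from the table values; `GeneEntry`, `SUPPORTED_GENES`, and `lookup_gene` are hypothetical names and not part of the `rettxmutation` API.

```python
from typing import Dict, List, NamedTuple, Optional


class GeneEntry(NamedTuple):
    cytoband: str
    transcripts: List[str]  # primary transcript first, secondary (if any) after
    condition: str


# Values copied from the "Supported Genes" table; the structure itself is illustrative.
SUPPORTED_GENES: Dict[str, GeneEntry] = {
    "MECP2": GeneEntry("Xq28", ["NM_004992.4", "NM_001110792.2"], "Classic Rett Syndrome"),
    "FOXG1": GeneEntry("14q12", ["NM_005249.5"], "Congenital variant Rett"),
    "SLC6A1": GeneEntry("3p25.3", ["NM_003042.4"], "Myoclonic-atonic epilepsy"),
    "CDKL5": GeneEntry("Xp22.13", ["NM_001323289.2"], "CDKL5 deficiency disorder"),
    "EIF2B2": GeneEntry("14q24.3", ["NM_014239.4"], "Vanishing white matter disease"),
    "MEF2C": GeneEntry("5q14.3", ["NM_002397.5", "NM_001131005.2"], "MEF2C haploinsufficiency"),
}


def lookup_gene(symbol: str) -> Optional[GeneEntry]:
    """Case-insensitive lookup; returns None for genes outside the table."""
    return SUPPORTED_GENES.get(symbol.strip().upper())
```

`lookup_gene("mecp2")` returns the MECP2 entry, while a gene outside the table returns `None`.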

## 📊 Output Structure

The `ExtractionResult` contains:

```
ExtractionResult
├── mutations: Dict[str, GeneMutation]    ← keyed by GRCh38 genomic HGVS
├── genes_detected: List[str]             ← e.g. ["MECP2"]
├── extraction_log: List[str]             ← agent reasoning trace
└── tool_calls_count: int                 ← total tool invocations
```

Each `GeneMutation` provides:

```
GeneMutation
├── genomic_coordinates:
│   ├── GRCh38: { assembly, hgvs, start, end, size }
│   └── GRCh37: { assembly, hgvs, start, end, size }
├── variant_type: "SNV" | "deletion" | "duplication" | "insertion" | "indel"
├── primary_transcript:
│   ├── gene_id, transcript_id
│   ├── hgvs_transcript_variant      ← e.g. NM_004992.4:c.916C>T
│   ├── protein_consequence_tlr      ← e.g. NP_004983.1:p.(Arg306Cys)
│   └── protein_consequence_slr      ← e.g. NP_004983.1:p.(R306C)
└── secondary_transcript (optional)
```
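
As an illustration of how the two structures above nest, the sketch below flattens a result into one dictionary per mutation, which is a convenient shape for CSV export or a registry import. It relies only on the attribute names shown in the trees; `flatten_result` is a hypothetical helper, and the demo uses `SimpleNamespace` stand-ins with placeholder genomic strings so it runs without any Azure credentials.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def flatten_result(result: Any) -> List[Dict[str, Any]]:
    """Turn an ExtractionResult-shaped object into one flat dict per mutation."""
    rows = []
    for key, mutation in result.mutations.items():
        pt = mutation.primary_transcript
        row = {
            "key": key,
            "gene": pt.gene_id,
            "transcript_hgvs": pt.hgvs_transcript_variant,
            "protein": pt.protein_consequence_tlr,
            "variant_type": mutation.variant_type,
        }
        # One column per assembly, e.g. "GRCh38_hgvs" and "GRCh37_hgvs".
        for assembly, coord in mutation.genomic_coordinates.items():
            row[f"{assembly}_hgvs"] = coord.hgvs
        rows.append(row)
    return rows


# Stand-in data: the genomic strings are placeholders, not real coordinates.
demo = SimpleNamespace(
    mutations={
        "<GRCh38 genomic HGVS>": SimpleNamespace(
            variant_type="SNV",
            primary_transcript=SimpleNamespace(
                gene_id="MECP2",
                hgvs_transcript_variant="NM_004992.4:c.916C>T",
                protein_consequence_tlr="NP_004983.1:p.(Arg306Cys)",
            ),
            genomic_coordinates={
                "GRCh38": SimpleNamespace(hgvs="<GRCh38 genomic HGVS>"),
                "GRCh37": SimpleNamespace(hgvs="<GRCh37 genomic HGVS>"),
            },
        )
    }
)
rows = flatten_result(demo)
```

With a real `ExtractionResult`, the same function should work unchanged as long as the attribute names match the trees above.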

## 🛠️ Requirements

### Python Version
- Python 3.8 or higher

### Azure Services

| Service | Required? | Purpose |
|---------|-----------|---------|
| **Azure OpenAI** | ✅ Required | Agentic mutation extraction |
| **Azure AI Search** | Optional | Semantic search for keyword detection |
| **Azure Cognitive Services** | Optional | Text analytics enrichment |

### Environment Variables

```bash
# Required — Azure OpenAI
RETTX_OPENAI_ENDPOINT=https://your-openai.openai.azure.com/
RETTX_OPENAI_KEY=your-openai-key
RETTX_OPENAI_MODEL_NAME=gpt-4o           # deployment name

# Optional — Agent model (defaults to RETTX_OPENAI_MODEL_NAME if not set)
RETTX_OPENAI_AGENT_DEPLOYMENT=gpt-4o     # agent-specific deployment
RETTX_OPENAI_AGENT_MODEL_VERSION=2024-11-20

# Optional — Embeddings
RETTX_EMBEDDING_DEPLOYMENT=text-embedding-ada-002

# Optional — Azure AI Search
RETTX_AI_SEARCH_SERVICE=your-search-service
RETTX_AI_SEARCH_API_KEY=your-search-key
RETTX_AI_SEARCH_INDEX_NAME=your-index-name

# Optional — Azure Cognitive Services
RETTX_COGNITIVE_SERVICES_ENDPOINT=https://your-cognitive-services.cognitiveservices.azure.com/
RETTX_COGNITIVE_SERVICES_KEY=your-cognitive-services-key
```
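
Since only the Azure OpenAI variables are required, it can help to fail fast when one of them is missing instead of discovering it on the first API call. The sketch below is a standalone pre-flight check using the variable names listed above; `REQUIRED_VARS` and `missing_required` are hypothetical names, and whether `DefaultConfig` performs a similar check itself is not stated here.

```python
import os
from typing import List, Mapping

# Names taken from the "Required" block of the environment variable listing.
REQUIRED_VARS = (
    "RETTX_OPENAI_ENDPOINT",
    "RETTX_OPENAI_KEY",
    "RETTX_OPENAI_MODEL_NAME",
)


def missing_required(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Demo with an explicit mapping so the result does not depend on the real environment.
example_env = {
    "RETTX_OPENAI_ENDPOINT": "https://your-openai.openai.azure.com/",
    "RETTX_OPENAI_KEY": "",
}
print(missing_required(example_env))
# → ['RETTX_OPENAI_KEY', 'RETTX_OPENAI_MODEL_NAME']
```

Calling `missing_required()` with no argument checks the current process environment; an empty list means all three required variables are set.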

## 📋 Processing Pipeline

The agentic extraction pipeline works as follows:

```
Input Text (any language)
    │
    ▼
┌──────────────────────┐
│  PHI Redaction        │  Remove patient names, DOBs, IDs
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│  AI Agent (OpenAI)    │  Reads text, identifies mutations
│                      │
│  Tools available:    │
│  • lookup_gene_registry  → gene info + RefSeq transcripts
│  • validate_variant      → HGVS validation + coordinates
│  • validate_complex      → CNV / genomic coordinate validation
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│  Structured Output    │  ExtractionResult with validated
│                      │  GeneMutation objects
└──────────────────────┘
```

Key capabilities:
- **Ensembl → RefSeq remapping**: Handles reports that cite Ensembl transcripts (ENST*) by remapping them to RefSeq through genomic coordinates
- **Minus-strand awareness**: Correctly complements alleles for genes on the reverse strand
- **Old nomenclature**: Normalizes legacy formats (e.g., `502C->T`, `R168X`) to current HGVS
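
To make two of these capabilities concrete, here is a minimal sketch of what minus-strand complementing and legacy-substitution normalization involve. It is not the library's implementation: `reverse_complement`, `coding_to_genomic_alleles`, and `normalize_legacy_substitution` are hypothetical helpers, and the sketch covers only simple nucleotide substitutions. Protein shorthand such as `R168X` needs an amino-acid code mapping and is left unhandled.

```python
import re
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (for one base this is just the complement)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def coding_to_genomic_alleles(ref: str, alt: str, minus_strand: bool) -> Tuple[str, str]:
    """Map transcript-strand alleles to forward-strand genomic alleles."""
    if minus_strand:
        return reverse_complement(ref), reverse_complement(alt)
    return ref.upper(), alt.upper()


# Accepts '502C->T', '502C>T', or 'c.502C>T' (case-insensitive, optional spaces).
_LEGACY_SUBSTITUTION = re.compile(
    r"^(?:c\.)?(\d+)\s*([ACGT])\s*-?>\s*([ACGT])$", re.IGNORECASE
)


def normalize_legacy_substitution(text: str) -> Optional[str]:
    """Rewrite forms like '502C->T' as 'c.502C>T'; return None if not recognized."""
    match = _LEGACY_SUBSTITUTION.match(text.strip())
    if match is None:
        return None
    pos, ref, alt = match.groups()
    return f"c.{pos}{ref.upper()}>{alt.upper()}"
```

For a gene on the reverse strand, `coding_to_genomic_alleles("C", "T", minus_strand=True)` yields `("G", "A")`, which matches the way this README pairs `c.916C>T` with a genomic `G>A` description for MECP2.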

## 💻 Available Services

All services are lazily initialized via `RettxServices`:

```python
with RettxServices(config) as services:
    # Core extraction
    services.agent_extraction_service    # AI-powered mutation extraction
    
    # Validation & analysis
    services.variant_validator_service   # HGVS validation + coordinate liftover
    services.mutation_tokenizator        # Mutation string tokenization
    
    # Search & embeddings
    services.embedding_service           # Azure OpenAI embeddings
    services.ai_search_service           # Azure AI Search integration
    services.keyword_detector_service    # Multi-layer keyword detection
```
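
Lazy initialization means a service is only constructed the first time it is accessed, so unused services never need their configuration. The sketch below shows one common way to get that behaviour in Python with `functools.cached_property` plus a context manager. It is a toy illustration of the pattern, not the actual `RettxServices` code; `LazyServices` and its attributes are hypothetical.

```python
from functools import cached_property


class LazyServices:
    """Toy container: each service is built on first access, then cached."""

    def __init__(self, config):
        self._config = config
        self.created = []  # records construction order, for illustration only

    @cached_property
    def validator(self):
        self.created.append("validator")
        return object()  # stand-in for a real service instance

    @cached_property
    def search(self):
        self.created.append("search")
        return object()  # never built unless someone accesses .search

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # A real container would close HTTP sessions or API clients here.
        return False  # do not swallow exceptions


with LazyServices(config={}) as services:
    first = services.validator
    second = services.validator  # cached: no second construction
    built = list(services.created)
```

After the block runs, `built` contains only `"validator"`: the `search` service was never touched, so it was never created.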

### Direct Variant Validation

```python
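from rettxmutation import RettxServices, DefaultConfig

config = DefaultConfig()  # loads from .env / environment variables

with RettxServices(config) as services: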
    vvs = services.variant_validator_service

    # Validate a transcript-level variant
    result = vvs.get_gene_mutation_from_transcript("NM_004992.4:c.916C>T")
    print(result.primary_transcript.protein_consequence_tlr)
    # → NP_004983.1:p.(Arg306Cys)

    # Validate a complex / genomic variant
    result = vvs.create_gene_mutation_from_complex_variant(
        assembly_build="GRCh38",
        assembly_refseq="NC_000023.11",
        variant_description="NC_000023.11:g.154030912G>A",
        gene_symbol="MECP2"
    )
```

### Custom Configuration

```python
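import asyncio
from rettxmutation import RettxServices

class MyConfig:
    """Custom configuration for production (e.g., from Key Vault)."""
    RETTX_OPENAI_ENDPOINT = "https://my-openai.openai.azure.com/"
    RETTX_OPENAI_KEY = get_secret("openai-key")  # get_secret: placeholder for your own secret lookup
    RETTX_OPENAI_MODEL_NAME = "gpt-4o"
    # Only set fields needed for the services you use

async def extract(text: str):
    # extract_mutations is awaited, so it must run inside a coroutine
    with RettxServices(MyConfig()) as services:
        return await services.agent_extraction_service.extract_mutations(text)

result = asyncio.run(extract(text))  # text: the report contents as a string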
```

## 🧪 Golden Test Suite

The library includes a comprehensive golden test suite with 11 real-world genetic reports:

| Gene | Variant Type | Language | Key Feature |
|------|-------------|----------|-------------|
| MECP2 | SNV, splicing, deletion, duplication | EN, ES, EL, TR | Multiple transcripts |
| FOXG1 | Frameshift deletion | TR | Non-MECP2 gene |
| SLC6A1 | Whole-gene CNV (~20kb) | ES | Copy number variant |

Run golden tests:
```bash
# Mock mode (no API calls, uses recorded responses)
python -m pytest tests/golden/ --golden-mode=mock -v

# Live mode (calls real APIs)
python -m pytest tests/golden/ --golden-mode=live -v
```
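
For readers unfamiliar with custom pytest flags, an option such as `--golden-mode` is typically registered in a `conftest.py` through the `pytest_addoption` hook. The sketch below shows that general mechanism only. The repository's real `conftest.py` may differ; the default value of `mock` and the `is_live` helper are assumptions.

```python
# conftest.py (illustrative sketch, not the repository's actual file)


def pytest_addoption(parser):
    """Register --golden-mode so tests can switch between recorded and real APIs."""
    parser.addoption(
        "--golden-mode",
        action="store",
        default="mock",  # assumed default: avoid real API calls unless asked
        choices=("mock", "live"),
        help="mock: replay recorded responses; live: call real APIs",
    )


def is_live(config) -> bool:
    """Hypothetical helper: True when the run was started with --golden-mode=live."""
    return config.getoption("--golden-mode") == "live"
```

Inside a test or fixture, `request.config` can be passed to `is_live` to decide whether to replay a recording or call the live service.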

## 🎯 Use Cases

- **🏥 Clinical Genetics**: Extract mutations from diagnostic reports in any language
- **🔬 Research**: Analyze genetic data across Rett Syndrome and related conditions
- **📊 Patient Registries**: Populate genetic databases with normalized HGVS nomenclature
- **🤖 Bioinformatics Pipelines**: Integrate as a library or via the CLI example
- **📱 Clinical Applications**: Build tools with structured mutation data (dual-assembly coordinates)

## 🔧 Reliability

- **Exponential Backoff**: Automatic retry for VariantValidator and OpenAI API calls
- **Graceful Degradation**: Core extraction depends only on Azure OpenAI; the optional services (AI Search, Cognitive Services) can be left unconfigured
- **PHI Redaction**: Patient data is stripped before any LLM processing
- **Type Safety**: Pydantic v2 models with runtime validation
- **Context Manager**: Automatic resource cleanup via `with` statement
- **Comprehensive Logging**: Structured extraction logs with tool call traces
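
To illustrate the exponential backoff item above: the package lists `backoff` among its dependencies, which presumably provides the retry logic internally. The standalone sketch below shows the same idea using only the standard library. `retry_with_backoff` and all of its default values are assumptions chosen for illustration, not the library's settings.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_tries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_tries:
                raise  # out of attempts: surface the last error
            # Upper bound doubles each attempt (0.5s, 1s, 2s, ...), capped at max_delay.
            # The actual wait is drawn uniformly below that bound ("full jitter").
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the helper can be exercised in tests without real waiting, and errors outside `retry_on` propagate immediately instead of being retried.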

## 🤝 Contributing

We welcome contributions! Please see our [GitHub repository](https://github.com/rett-europe/rettxmutation) for:
- Issue reporting
- Feature requests
- Pull request guidelines
- Development setup instructions

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🆘 Support

- **Issues**: [GitHub Issues](https://github.com/rett-europe/rettxmutation/issues)
- **Documentation**: [API Documentation](https://github.com/rett-europe/rettxmutation)
- **Contact**: procha@rettsyndrome.eu

## 🔮 Roadmap

- **Additional Genes**: Expand the gene registry beyond the current 6 genes
- **Batch Processing**: Process multiple reports in parallel with rate limiting
- **Confidence Scoring**: Per-mutation confidence metrics based on report quality
- **Structured Report Parsing**: Native support for VCF, JSON, and HL7 FHIR formats
- **Cloud Deployment**: Docker containers and Azure deployment templates
