Metadata-Version: 2.4
Name: ashmatics-fda-pipeline
Version: 0.1.1
Summary: FDA 510(k) and De Novo document extraction, enrichment, and structuring pipeline
Project-URL: Homepage, https://github.com/AshMatics/ashmatics-fda-pipeline
Project-URL: Repository, https://github.com/AshMatics/ashmatics-fda-pipeline
Project-URL: Documentation, https://github.com/AshMatics/ashmatics-fda-pipeline#readme
Project-URL: Bug Tracker, https://github.com/AshMatics/ashmatics-fda-pipeline/issues
Author-email: Asher Informatics PBC <engineering@asherinformatics.com>
License: Proprietary
License-File: LICENSE
Keywords: 510k,ai-ml,de-novo,document-processing,fda,medical-device,regulatory
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Healthcare Industry
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Python: >=3.11
Requires-Dist: aiofiles>=23.0
Requires-Dist: aiohttp>=3.9
Requires-Dist: ashmatics-datamodels<0.4.0,>=0.3.2
Requires-Dist: ashmatics-tools<0.8.0,>=0.7.2
Requires-Dist: docling>=2.0.0
Requires-Dist: pandas>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.12
Provides-Extra: all
Requires-Dist: anthropic>=0.20; extra == 'all'
Requires-Dist: azure-storage-blob>=12.0; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: pymongo>=4.0; extra == 'all'
Requires-Dist: tiktoken>=0.5; extra == 'all'
Provides-Extra: azure
Requires-Dist: azure-storage-blob>=12.0; extra == 'azure'
Provides-Extra: dev
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pre-commit>=3.6; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.3; extra == 'dev'
Provides-Extra: llm
Requires-Dist: anthropic>=0.20; extra == 'llm'
Requires-Dist: openai>=1.0; extra == 'llm'
Requires-Dist: tiktoken>=0.5; extra == 'llm'
Provides-Extra: mongodb
Requires-Dist: pymongo>=4.0; extra == 'mongodb'
Description-Content-Type: text/markdown

# Ashmatics FDA Pipeline

**Version: 0.1.1** | **Last Updated: 2025-11-29**

Copyright 2025 Asher Informatics PBC - Proprietary and Confidential

---

FDA 510(k) and De Novo document extraction, enrichment, and structuring pipeline for the Ashmatics Knowledge Base platform.

## Overview

The Ashmatics FDA Pipeline provides a comprehensive solution for extracting structured metadata from FDA regulatory documents (510(k) summaries and De Novo decision summaries). It transforms unstructured PDF documents into MongoDB-ready structured documents suitable for AI/ML analysis and knowledge base integration.

### Key Features

- **PDF Parsing**: High-quality document parsing using DocLing
- **Metadata Extraction**: Regex-based extraction with LLM validation
- **Table Processing**: Multi-page table consolidation and classification
- **Predicate Extraction**: Multi-source predicate device identification
- **AI/ML Data Extraction**: Training data and performance metrics extraction
- **Domain Knowledge**: Product code-aware extraction with confidence scoring
- **Batch Processing**: Concurrent document processing with structured output
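
To illustrate the regex-based extraction step, here is a minimal, self-contained sketch of matching FDA submission numbers in raw text. The patterns and function are illustrative only (510(k) numbers are "K" plus six digits; De Novo numbers typically use a "DEN" prefix); the package's real extractors are more involved and include LLM validation.

```python
import re

# Illustrative patterns for FDA submission identifiers:
#   510(k):  "K" followed by 6 digits, e.g. K123456
#   De Novo: "DEN" followed by 6 digits, e.g. DEN200001
K_NUMBER_RE = re.compile(r"\bK\d{6}\b")
DEN_NUMBER_RE = re.compile(r"\bDEN\d{6}\b")

def find_submission_numbers(text: str) -> list[str]:
    """Return all 510(k) and De Novo numbers found in raw text."""
    return K_NUMBER_RE.findall(text) + DEN_NUMBER_RE.findall(text)

sample = "The subject device (K123456) cites predicate K234567 and DEN200001."
print(find_submission_numbers(sample))  # ['K123456', 'K234567', 'DEN200001']
```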

## Installation

### Prerequisites

- Python 3.11+
- [uv](https://docs.astral.sh/uv/) package manager (recommended)
- Access to JFK-Ashmatics private repositories

### Install from GitHub

```bash
# Using uv (recommended)
uv add "ashmatics-fda-pipeline @ git+https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git"

# With optional dependencies
uv add "ashmatics-fda-pipeline[all] @ git+https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git"
```

### Development Installation

```bash
git clone https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git
cd ashmatics-fda-pipeline
uv sync --all-extras
```

## Quick Start

### CLI Usage

```bash
# Process a batch of PDFs
fda-pipeline process /path/to/pdfs --output /path/to/output

# Process a single document
fda-pipeline process-single /path/to/K123456.pdf

# Show version
fda-pipeline version
```

### Python API

```python
import asyncio
from pathlib import Path

from ashmatics_fda_pipeline import FDA510kPipeline, PipelineConfig

# Configure pipeline
config = PipelineConfig(
    enable_llm_validation=True,
    enable_performance_extraction=True,
    llm_provider="azure_openai",
)

# Create pipeline
pipeline = FDA510kPipeline(config)

async def main() -> None:
    # Process a single document
    result = await pipeline.process_single(Path("K123456.pdf"))
    print(f"K-Number: {result.k_number}")
    print(f"Manufacturer: {result.metadata['manufacturer']}")

    # Process a batch
    results = await pipeline.process_batch([Path("K123456.pdf"), Path("K234567.pdf")])

asyncio.run(main())
```

## Architecture

```
ashmatics_fda_pipeline/
├── __init__.py          # Public API exports
├── config.py            # PipelineConfig dataclass
├── pipeline.py          # FDA510kPipeline main class
├── pipeline_registry.py # Factory pattern for pipelines
├── cli.py               # Typer CLI entry point
│
├── extractors/          # Metadata extraction
│   ├── base.py          # DocumentExtractor ABC
│   ├── metadata_extractor.py  # FDA510kExtractor
│   └── llm_validator.py       # LLM validation
│
├── enrichers/           # Content enrichment
│   ├── table_classifier.py       # Table classification
│   ├── table_consolidator.py     # Multi-page table merge
│   ├── predicate_extractor.py    # Predicate device extraction
│   ├── training_data_extractor.py    # AI/ML training data
│   └── performance_data_extractor.py # Validation results
│
├── mappers/             # Schema mapping
│   ├── base.py          # DocumentMapper ABC
│   └── document_mapper.py # RegulatoryDocumentMapper
│
├── storage/             # Output management
│   └── batch_output.py  # BatchOutputManager
│
└── domain_knowledge/    # FDA domain patterns
    ├── __init__.py      # DomainKnowledge, DocumentPatternLoader
    ├── document_patterns.py
    ├── ai_device_expectations.yaml
    ├── 510k_summary_document_patterns.yaml
    └── de_novo_document_patterns.yaml
```
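
The `pipeline_registry.py` module is described as a factory pattern for pipelines. A hypothetical sketch of that pattern is shown below; all names here (`register_pipeline`, `create_pipeline`, the `"510k"` key) are illustrative assumptions, not the package's actual API.

```python
from typing import Callable

# Registry mapping a document-type key to a pipeline factory.
_REGISTRY: dict[str, Callable[[], object]] = {}

def register_pipeline(doc_type: str):
    """Class decorator that records a pipeline class under doc_type."""
    def decorator(cls):
        _REGISTRY[doc_type] = cls
        return cls
    return decorator

def create_pipeline(doc_type: str):
    """Instantiate the pipeline registered for doc_type."""
    try:
        return _REGISTRY[doc_type]()
    except KeyError:
        raise ValueError(f"no pipeline registered for {doc_type!r}") from None

@register_pipeline("510k")
class Demo510kPipeline:
    pass

print(type(create_pipeline("510k")).__name__)  # Demo510kPipeline
```

A registry like this lets the CLI choose a 510(k) or De Novo pipeline from a string argument without hard-coding class names at the call site.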

## Configuration

### Environment Variables

```bash
# Azure OpenAI (default LLM provider)
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o

# OpenAI (alternative)
OPENAI_API_KEY=your-api-key

# Anthropic (alternative)
ANTHROPIC_API_KEY=your-api-key
```

### PipelineConfig Options

```python
@dataclass
class PipelineConfig:
    # LLM configuration
    enable_llm_validation: bool = True
    llm_provider: str = "azure_openai"
    enable_performance_extraction: bool = True

    # Processing
    max_concurrent: int = 3
    batch_size: int = 5

    # Output
    enable_batch_output: bool = True
    write_markdown: bool = False
    save_figures: bool = True
    save_tables: bool = True
```
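
Because `PipelineConfig` is a plain dataclass, variants can be derived from the defaults with `dataclasses.replace`. The sketch below mirrors the fields shown above to stay self-contained; for example, a fully offline run disables both LLM-backed steps.

```python
from dataclasses import dataclass, replace

@dataclass
class PipelineConfig:  # mirrors the fields documented above
    enable_llm_validation: bool = True
    llm_provider: str = "azure_openai"
    enable_performance_extraction: bool = True
    max_concurrent: int = 3
    batch_size: int = 5
    enable_batch_output: bool = True
    write_markdown: bool = False
    save_figures: bool = True
    save_tables: bool = True

# Derive an offline configuration (no LLM calls) from the defaults.
offline = replace(
    PipelineConfig(),
    enable_llm_validation=False,
    enable_performance_extraction=False,
)
print(offline.enable_llm_validation)  # False
```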

## Output Structure

When `enable_batch_output=True`, the pipeline creates a structured output:

```
batch-YYYYMMDD-HHMMSS/
├── manifest.json          # Batch metadata and processing summary
├── K123456/
│   ├── K123456_parsed.md  # Parsed markdown
│   ├── K123456_metadata.json
│   ├── K123456_mongo_doc.json
│   ├── figures/
│   │   ├── figure_1.png
│   │   └── figure_metadata.json
│   └── tables/
│       ├── table_1.md
│       ├── table_1.json
│       └── table_classifications.json
└── K234567/
    └── ...
```
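
Given this layout, batch results can be collected with ordinary filesystem traversal. The helper below is a sketch that assumes only the directory structure shown above; the fields inside `manifest.json` are not documented here, so the manifest is returned as-is.

```python
import json
from pathlib import Path

def summarize_batch(batch_dir: Path) -> dict:
    """Collect the manifest and per-document outputs from a batch directory."""
    manifest = json.loads((batch_dir / "manifest.json").read_text())
    documents = {}
    for doc_dir in sorted(p for p in batch_dir.iterdir() if p.is_dir()):
        tables_dir = doc_dir / "tables"
        documents[doc_dir.name] = {
            # Structured document named <K-number>_mongo_doc.json
            "mongo_doc": doc_dir / f"{doc_dir.name}_mongo_doc.json",
            # Per-table JSON outputs, if any tables were extracted
            "tables": sorted(tables_dir.glob("table_*.json")) if tables_dir.exists() else [],
        }
    return {"manifest": manifest, "documents": documents}
```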

## Dependencies

### Core Ashmatics Packages

- **ashmatics-tools**: Base utilities, parsers, LLM clients
- **ashmatics-datamodels**: Shared Pydantic data models

### Optional Dependencies

```bash
# LLM enrichment (Azure OpenAI, OpenAI, Anthropic)
uv add "ashmatics-fda-pipeline[llm]"

# Azure storage support
uv add "ashmatics-fda-pipeline[azure]"

# MongoDB support
uv add "ashmatics-fda-pipeline[mongodb]"

# All optional dependencies
uv add "ashmatics-fda-pipeline[all]"
```

## Development

### Running Tests

```bash
# Run all tests
uv run pytest

# With coverage
uv run pytest --cov=ashmatics_fda_pipeline

# Run specific test
uv run pytest tests/unit/test_extractors.py -v
```

### Code Quality

```bash
# Format code
uv run ruff format .

# Lint
uv run ruff check .

# Type check
uv run mypy src/
```

## License

Copyright 2025 Asher Informatics PBC. All rights reserved.

This software is proprietary and confidential. Unauthorized copying, modification, distribution, or use is strictly prohibited.

See [LICENSE](LICENSE) for details.

## Support

For licensing inquiries: legal@asherinformatics.com

For technical support: engineering@asherinformatics.com
