Metadata-Version: 2.4
Name: askemblaex
Version: 0.5.1
Summary: Document extraction and reconciliation for genealogical sources.
Project-URL: Homepage, https://github.com/askembla/askemblaex
Project-URL: Issues, https://github.com/askembla/askemblaex/issues
Author: Askembla
License: MIT
License-File: LICENSE
Keywords: azure,document intelligence,extraction,genealogy,ocr,openai,pdf
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.11
Requires-Dist: azure-ai-documentintelligence>=1.0.0
Requires-Dist: azure-cognitiveservices-vision-computervision>=0.9.0
Requires-Dist: azure-core>=1.29.0
Requires-Dist: msrest>=0.7.1
Requires-Dist: openai>=1.30.0
Requires-Dist: pdf2image>=1.16.0
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pypdf>=4.0.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: black>=24.0.0; extra == 'dev'
Requires-Dist: mypy>=1.9.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Description-Content-Type: text/markdown

# askemblaex

Document extraction, reconciliation, entity extraction, and embedding for genealogical sources.

askemblaex processes PDF and image-based genealogical documents through a multi-stage pipeline:

1. **Extract** — run multiple OCR providers per page
2. **Reconcile** — merge OCR outputs into a single best-quality transcription via OpenAI
3. **Entities** — extract structured persons, events, and claims from each page via windowed AI analysis
4. **Embed** — generate vector embeddings from reconciled text or person descriptions

Each stage is independently runnable and fully incremental — nothing is overwritten unless explicitly forced.

---

## Features

- **Multi-method OCR extraction** — Azure Computer Vision, Azure Document Intelligence, PyMuPDF, and pdfplumber run per page
- **AI reconciliation** — OpenAI merges all OCR outputs into the most accurate transcription, preferring Azure sources
- **Windowed entity extraction** — structured persons, events, claims, and places extracted from each page with surrounding page context
- **Per-person embeddings** — each person mention gets an embedding-ready text string and optional vector
- **Structured output** — every document gets a hash-keyed folder; every page gets its own JSON file
- **Incremental processing** — partial runs pick up where they left off; model-change detection triggers re-reconciliation automatically
- **Interactive preflight** — missing credentials trigger a prompt to enter them, skip the service, or quit

---

## Installation

```bash
pip install askemblaex
```

### System dependencies

`pdf2image` requires Poppler:

```bash
# macOS
brew install poppler

# Ubuntu / Debian
sudo apt-get install poppler-utils

# Windows — download from https://github.com/oschwartz10612/poppler-windows
```

---

## Configuration

Create a `.env` file in your working directory (or set environment variables directly):

```env
# Azure Computer Vision (optional — enables azure_computer_vision method)
AZURE_VISION_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
AZURE_VISION_KEY=<your-key>

# Azure Document Intelligence (optional — enables azure_docint method)
AZURE_DOCINT_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
AZURE_DOCINT_KEY=<your-key>

# OpenAI (required for --reconcile and --entities)
OPENAI_KEY=<your-key>
OPENAI_MODEL=gpt-4o
OPENAI_BASE_URL=https://api.openai.com/v1   # override for proxies / compatible APIs

# OpenAI entity extraction model (optional — falls back to OPENAI_MODEL then gpt-4o)
OPENAI_ENTITY_MODEL=gpt-4o

# Ollama embeddings (optional — takes priority over OpenAI for --embed)
OLLAMA_ENDPOINT=http://localhost:11434
OLLAMA_EMODEL=nomic-embed-text
OLLAMA_EDIM=768       # optional dimension validation

# OpenAI embeddings (used if Ollama is not configured)
OPENAI_EMODEL=text-embedding-3-small
OPENAI_EDIM=1536      # optional dimension validation

# Override recognised image extensions (default: .png,.jpg,.jpeg,.tiff,.tif)
ASKEMBLAEX_IMG_EXTS=.png,.jpg,.tif
```

---

## CLI usage

```
askemblaex [--source DIR] [--output DIR]
           [--extract] [--reconcile] [--entities] [--embed]
           [--methods METHOD ...] [--skip-methods METHOD ...]
           [--force-extract] [--force-reconcile] [--force-entities] [--force-embed]
           [--dpi DPI] [--recursive] [-v|-vv|-vvv]
           [--list-methods]
```

### Flags reference

#### Directories

| Flag | Description |
|------|-------------|
| `--source DIR`, `-s DIR` | Source directory containing PDFs or images |
| `--output DIR`, `-o DIR` | Output directory. Defaults to `--source` |

#### Pipeline stages

| Flag | Description | Requires |
|------|-------------|----------|
| `--extract` | Run multi-method OCR extraction | `--source` |
| `--reconcile` | Merge OCR outputs into best-quality text via OpenAI | Extracted pages, `OPENAI_KEY` |
| `--entities` | Extract structured persons/events/claims per page via OpenAI | Reconciled pages, `OPENAI_KEY` |
| `--embed` | Generate vector embeddings from reconciled page text | Reconciled pages, Ollama or OpenAI embeddings |
| `--list-methods` | Print available extraction methods and exit | — |

#### Method control (extraction only)

| Flag | Description |
|------|-------------|
| `--methods METHOD [...]` | Whitelist specific extraction methods (default: all) |
| `--skip-methods METHOD [...]` | Exclude specific extraction methods |

#### Force flags

| Flag | Description |
|------|-------------|
| `--force-extract` | Re-extract even if already marked complete |
| `--force-reconcile` | Re-reconcile even if model matches |
| `--force-entities` | Re-extract entities even if already done with same model |
| `--force-embed` | Re-embed even if already embedded with same model |

#### Other options

| Flag | Default | Description |
|------|---------|-------------|
| `--dpi INT` | `300` | DPI for PDF page rendering |
| `--recursive`, `-r` | off | Search source directory recursively |
| `-v` / `-vv` / `-vvv` | — | Verbosity: INFO / DEBUG / DEBUG + tracebacks |

---

### Common workflows

```bash
# List available extraction methods
askemblaex --list-methods

# Extract all PDFs in a folder (output defaults to source)
askemblaex --source /path/to/pdfs --extract

# Extract to a separate output folder
askemblaex --source /path/to/pdfs --output /path/to/output --extract

# Full pipeline in one command
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --reconcile --entities --embed

# Reconcile previously extracted pages
askemblaex --output /path/to/output --reconcile

# Entity extraction only (requires reconciled text)
askemblaex --output /path/to/output --entities

# Generate embeddings only (requires reconciled text)
askemblaex --output /path/to/output --embed

# Only run specific extraction methods
askemblaex --source /path/to/pdfs --extract \
    --methods azure_computer_vision pymupdf

# Skip a method
askemblaex --source /path/to/pdfs --extract --skip-methods pdfplumber

# Custom DPI for PDF rendering (default: 300)
askemblaex --source /path/to/pdfs --extract --dpi 400

# Force re-extraction (will not touch reconciled or entity data)
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --force-extract

# Force re-reconciliation only
askemblaex --output /path/to/output --reconcile --force-reconcile

# Force re-entity-extraction only
askemblaex --output /path/to/output --entities --force-entities

# Recursive source tree + verbose output
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --reconcile --recursive -vv
```

### Verbosity levels

| Flag | Level | Output |
|------|-------|--------|
| _(none)_ | WARNING | Errors and service failures only |
| `-v` | INFO | Per-file progress |
| `-vv` | DEBUG | Per-page detail |
| `-vvv` | DEBUG | Full tracebacks on errors |

---

## Output structure

Each processed document produces a hash-keyed folder under the output root:

```
output/
  2e7698fc.../
    2e7698fc....metadata._.json       ← document metadata + processing state
    2e7698fc....log                   ← per-document log
    2e7698fc....pymupdf.image.0.0.png ← embedded images extracted by PyMuPDF
    2e7698fc....pdfplumber.table.0.0.csv
    pages/
      2e7698fc....page.0000.json      ← all extraction data for page 0
      2e7698fc....page.0001.json
      ...
      2e7698fc....person.0000.0000.json  ← person mention, page 0, person 0
      2e7698fc....person.0000.0001.json  ← person mention, page 0, person 1
      ...
    images/
      page.0000.png                   ← rendered page images (for Azure CV)
      page.0000.pdf                   ← single-page PDFs (for Azure DocInt)
      ...
    logs/
      askemblaex.log
```
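Because page files follow a fixed naming scheme, downstream tooling can enumerate a document's pages with a simple glob. This is an illustrative snippet against a scratch layout mirroring the tree above, not part of the package API:

```python
import tempfile
from pathlib import Path

def list_page_files(doc_folder: Path) -> list[Path]:
    """Return a document's page JSON files in page order (illustrative helper)."""
    # glob order is not guaranteed, so sort; zero-padded page numbers sort correctly
    return sorted((doc_folder / "pages").glob("*.page.*.json"))

root = Path(tempfile.mkdtemp())
pages = root / "pages"
pages.mkdir()
for n in (1, 0, 2):
    (pages / f"2e7698fc.page.{n:04d}.json").write_text("{}")

print([p.name for p in list_page_files(root)])
# ['2e7698fc.page.0000.json', '2e7698fc.page.0001.json', '2e7698fc.page.0002.json']
```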

### Metadata file (`<hash>.metadata._.json`)

```json
{
  "_key": "2e7698fc...",
  "source": {
    "filename": "The Freewill Baptist Register.pdf",
    "type": ".pdf",
    "title": "The Freewill Baptist Register",
    "created_utc": "2026-02-24T10:00:00Z",
    "local": true,
    "uris": []
  },
  "raw": {
    "page_count": 42
  },
  "extraction": {
    "complete": true,
    "started_utc": "2026-02-24T10:01:00Z",
    "completed_utc": "2026-02-24T10:05:00Z",
    "steps": {
      "azure_computer_vision": true,
      "azure_docint": true,
      "pymupdf": true,
      "pdfplumber": true,
      "reconciled": true,
      "entities": true,
      "embeddings": true
    }
  }
}
```
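A quick way to inspect this metadata, for instance to list pipeline steps that have not yet run (field names taken from the example above; the JSON here is a trimmed illustration):

```python
import json

metadata_json = """
{
  "_key": "2e7698fc",
  "raw": {"page_count": 42},
  "extraction": {
    "complete": false,
    "steps": {
      "pymupdf": true,
      "pdfplumber": true,
      "reconciled": false,
      "entities": false
    }
  }
}
"""

meta = json.loads(metadata_json)
pending = [step for step, done in meta["extraction"]["steps"].items() if not done]
print(pending)  # ['reconciled', 'entities']
```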

### Page file (`<hash>.page.<NNNN>.json`)

```json
{
  "schema_version": 1.0,
  "doc_id": "2e7698fc...",
  "page_num": 0,
  "default": "reconciled",
  "created_at": "...",
  "updated_at": "...",
  "extractions": {
    "azure_computer_vision": {
      "text": "...",
      "text_hash": "...",
      "method": "azure_computer_vision",
      "extracted_at": "..."
    },
    "azure_docint":  { "text": "...", "..." : "..." },
    "pymupdf":       { "text": "...", "..." : "..." },
    "pdfplumber":    { "text": "...", "..." : "..." },
    "reconciled": {
      "text": "...",
      "model": "gpt-4o",
      "provider": "openai",
      "source_methods": ["azure_computer_vision", "azure_docint", "pymupdf", "pdfplumber"],
      "extracted_at": "..."
    },
    "entities": {
      "raw": { "persons": [...], "events": [...], "claims": [...], "places": [...] },
      "model": "gpt-4o",
      "extracted_at": "...",
      "person_count": 3,
      "window_pages": [0, 1, 2],
      "person_files": ["2e7698fc....person.0000.0000.json", "..."]
    },
    "embedding": {
      "values": [0.012, -0.034, "..."],
      "model": "nomic-embed-text",
      "provider": "ollama",
      "dim": 768,
      "created_at": "..."
    }
  }
}
```

### Person mention file (`<hash>.person.<page>.<idx>.json`)

One file is written per person mention found on a page.

```json
{
  "_key": "person_john_smith_2e7698fc_p0_n0",
  "type": { "entity": "person", "source": "mention" },
  "process_version": 2,
  "schema_version": 4,
  "method": "windowed_default",
  "datetime": "...",
  "person_name": "John Smith",
  "source_document": "2e7698fc...",
  "page_num": 0,
  "person_num": 0,
  "page_text": "...",
  "window_text": "=== PAGE 0 (TARGET) ===\n...",
  "embedding_text": "TYPE: Person\nNAME: John Smith\nVITALS: born 1842 County Cork...",
  "embedding_model": "nomic-embed-text",
  "embedding": [0.012, -0.034, "..."],
  "structure": { "persons": [...], "events": [...] }
}
```

---

## Available extraction methods

| Method | Description | Credentials required |
|--------|-------------|----------------------|
| `azure_computer_vision` | Azure Computer Vision Read OCR | `AZURE_VISION_ENDPOINT`, `AZURE_VISION_KEY` |
| `azure_docint` | Azure Document Intelligence layout + key-value pairs | `AZURE_DOCINT_ENDPOINT`, `AZURE_DOCINT_KEY` |
| `pymupdf` | PyMuPDF embedded text layer | None |
| `pdfplumber` | pdfplumber embedded text layer | None |

---

## Module reference

### `askemblaex.main`

CLI entrypoint and pipeline orchestration.

| Symbol | Description |
|--------|-------------|
| `main()` | Parse args, run preflight, execute pipeline stages |
| `build_parser()` | Return configured `argparse.ArgumentParser` |
| `run_extraction(source, output, *, ...)` | Run multi-method OCR extraction |
| `run_reconciliation(output, *, ...)` | Run OpenAI reconciliation |
| `run_entities(output, *, ...)` | Run windowed entity extraction |
| `run_embedding(output, *, ...)` | Generate page-level embeddings |

---

### `askemblaex.extract`

File discovery and per-page text extraction.

| Symbol | Description |
|--------|-------------|
| `discover_files(root, *, recursive)` | Find all PDFs and images under a directory |
| `process_one(src_path, output_root, logger, *, ...)` | Extract a single source file into a hash-keyed output folder |
| `process_all(source_root, output_root, logger)` | Extract all files found under a directory |
| `extract_page_text(path, logger, *, ...)` | Low-level per-page extraction returning `List[PageExtraction]` |
| `PageExtraction` | Dataclass holding `page_index` and per-method text dict |

---

### `askemblaex.reconcile`

OpenAI-powered multi-source reconciliation.

| Symbol | Description |
|--------|-------------|
| `reconcile_folder(folder, *, client, model, ...)` | Reconcile all pages in a hash-keyed document folder |
| `reconcile_page_file(page_file, doc_id, page_num, parent_folder, *, ...)` | Load, reconcile, and write back a single page file |
| `reconcile_page(page_data, *, client, model)` | Reconcile one page's extraction data dict; returns text or `None` |

---

### `askemblaex.window`

Dynamic context-window builder for entity extraction.

| Symbol | Description |
|--------|-------------|
| `build_dynamic_extraction_window(pages, anchor_page, *, context_pages, max_chars)` | Build an `ExtractionWindow` centred on `anchor_page` with surrounding context pages |
| `ExtractionWindow` | Dataclass with `anchor_page`, `pages_included`, `text`, `char_count` |

**How windowing works:** the anchor page is always included in full. Up to `context_pages` (default 2) pages before and after are added alternately until the `max_chars` (default 30,000) budget is exhausted. The anchor page is labelled `=== PAGE N (TARGET) ===` in the text so the model knows which page to extract from.
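The alternating expansion can be sketched as follows. This is a simplified illustration of the rule described above, not the package's actual implementation; defaults match the stated ones:

```python
def build_window(pages: dict[int, str], anchor: int,
                 context_pages: int = 2, max_chars: int = 30_000) -> list[int]:
    """Pick page numbers to include: anchor first, then neighbours
    alternately (before, then after) while the character budget allows."""
    included = [anchor]
    budget = max_chars - len(pages[anchor])  # anchor always included in full
    for offset in range(1, context_pages + 1):
        for candidate in (anchor - offset, anchor + offset):
            text = pages.get(candidate)
            if text is not None and len(text) <= budget:
                included.append(candidate)
                budget -= len(text)
    return sorted(included)

pages = {0: "a" * 10, 1: "b" * 10, 2: "c" * 10, 3: "d" * 10}
print(build_window(pages, anchor=2, context_pages=1))  # [1, 2, 3]
```

When the budget runs out, farther pages are simply skipped, which is why the real `ExtractionWindow` records `pages_included` and `char_count`.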

---

### `askemblaex.entities`

Structured entity extraction and person embedding generation.

| Symbol | Description |
|--------|-------------|
| `entity_extract_folder(folder, *, client, model, embed_provider, ...)` | Extract entities for all pages in a document folder |
| `entity_extract_page_file(page_file, parent_folder, page_num, all_page_texts, *, ...)` | Extract entities for one page, write person files and update page JSON |
| `call_entity_extraction(window_text, *, client, model)` | Call OpenAI and return parsed entity JSON dict |
| `build_person_embeddings(extraction_json)` | Build one embedding-ready text string per person from entity JSON |

**Entity schema** (returned by `call_entity_extraction`):

```json
{
  "persons": [
    {
      "name": "John Smith",
      "alt_names": ["J. Smith"],
      "summary": "Head of household in the 1881 census.",
      "roles": ["head of household"],
      "birth": { "date_text": "1842", "place": "County Cork, Ireland" },
      "death": { "date_text": "1910", "place": "Melbourne, Victoria" },
      "residences": ["Melbourne"],
      "relationships": {
        "parents": ["Thomas Smith", "Mary O'Brien"],
        "spouse": ["Catherine Murphy"],
        "children": ["William Smith"],
        "siblings": []
      },
      "attributes": ["farmer", "Catholic"],
      "events": ["Arrived Melbourne 1865"],
      "evidence_phrases": ["head of household", "born County Cork"]
    }
  ],
  "events": [ { "type": "marriage", "date_text": "1865", "people": ["John Smith", "Catherine Murphy"], "details": [] } ],
  "claims": [],
  "places": ["Melbourne", "County Cork"],
  "notes": []
}
```
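Consumers of this schema often only need a flat name-and-vitals view. A hedged example of walking the structure (keys as shown above; missing `birth`/`death` blocks are tolerated):

```python
entities = {
    "persons": [
        {"name": "John Smith",
         "birth": {"date_text": "1842", "place": "County Cork, Ireland"},
         "death": {"date_text": "1910", "place": "Melbourne, Victoria"}},
    ],
    "places": ["Melbourne", "County Cork"],
}

rows = []
for p in entities["persons"]:
    birth = p.get("birth") or {}
    death = p.get("death") or {}
    rows.append(f'{p["name"]}: b. {birth.get("date_text", "?")}, '
                f'd. {death.get("date_text", "?")}')

print(rows[0])  # John Smith: b. 1842, d. 1910
```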

---

### `askemblaex.embed`

Vector embedding generation for reconciled page text.

| Symbol | Description |
|--------|-------------|
| `embed_folder(folder, *, provider, force, verbosity)` | Embed all reconciled pages in a document folder |
| `embed_page_file(page_file, doc_id, page_num, parent_folder, *, provider, ...)` | Embed a single page file's reconciled text |
| `generate_embedding(text, provider)` | Generate an embedding vector via Ollama or OpenAI |
| `detect_provider()` | Return `"ollama"`, `"openai"`, or `None` based on environment |

---

### `askemblaex.preflight`

Service credential checks and interactive recovery.

| Symbol | Description |
|--------|-------------|
| `run_preflight(requested_methods, needs_reconcile, needs_entities, needs_embed, *, verbose)` | Run all required preflight checks; returns `PreflightResult` |
| `PreflightResult` | Dataclass with `active_methods`, `openai_available`, `openai_client`, `openai_model`, `embed_provider`, `services` |
| `ServiceStatus` | Dataclass with `name`, `available`, `reason`, `env_vars` |

---

### `askemblaex.pages`

Per-page JSON file I/O utilities.

| Symbol | Description |
|--------|-------------|
| `save_or_merge_page(parent_folder, doc_id, page, data)` | Create or update a page JSON file, merging extraction data |
| `load_page(parent_folder, doc_id, page)` | Load a page JSON file; returns `None` if missing |
| `build_page_schema(doc_id, page)` | Build a fresh page dict with empty extraction slots |
| `get_page_number(filepath)` | Parse zero-based page number from a page filename |
| `page_file_path(out_dir, doc_id, page)` | Return canonical path for a page JSON file |
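Since page filenames embed a zero-padded page number, a `get_page_number`-style parse reduces to a split on dots. An illustrative sketch (the package's own helper may differ):

```python
from pathlib import Path

def page_number_from_name(path: Path) -> int:
    """Parse the zero-based page number out of '<hash>.page.<NNNN>.json'."""
    parts = path.name.split(".")
    return int(parts[parts.index("page") + 1])

print(page_number_from_name(Path("2e7698fc.page.0003.json")))  # 3
```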

---

### `askemblaex.metadata`

Document metadata building, reading, and writing.

| Symbol | Description |
|--------|-------------|
| `build_metadata(src_path, *, file_hash)` | Build a fresh metadata dict from a source file |
| `write_metadata(out_dir, file_hash, metadata)` | Write metadata dict to `<out_dir>/<hash>.metadata._.json` |
| `load_metadata(out_dir, file_hash)` | Load metadata dict; returns `None` if missing |
| `merge_metadata(existing, new, *, overwrite)` | Shallow-merge two metadata dicts |
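A shallow merge with an `overwrite` switch can be sketched as below. This is illustrative of the signature in the table, not necessarily how `merge_metadata` is implemented:

```python
def shallow_merge(existing: dict, new: dict, *, overwrite: bool = False) -> dict:
    """Top-level merge: keys from `new` win only when overwrite=True
    or the key is absent from `existing`."""
    merged = dict(existing)
    for key, value in new.items():
        if overwrite or key not in merged:
            merged[key] = value
    return merged

print(shallow_merge({"a": 1}, {"a": 2, "b": 3}))                  # {'a': 1, 'b': 3}
print(shallow_merge({"a": 1}, {"a": 2, "b": 3}, overwrite=True))  # {'a': 2, 'b': 3}
```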

---

### `askemblaex.hash`

Content-based file hashing.

| Symbol | Description |
|--------|-------------|
| `hash_file(path, algo)` | Hash a file by content using chunked reads; returns hex digest |
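Chunked reads keep memory flat even for very large PDFs. A minimal stdlib sketch of the same idea (not the package's exact code):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def hash_file_chunked(path: Path, algo: str = "sha256",
                      chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks and return the hex digest."""
    h = hashlib.new(algo)
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Matches hashing the whole file at once:
fd, name = tempfile.mkstemp()
os.close(fd)
tmp = Path(name)
tmp.write_bytes(b"example bytes" * 1000)
assert hash_file_chunked(tmp) == hashlib.sha256(tmp.read_bytes()).hexdigest()
```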

---

## Python API examples

### Extract a single file

```python
from askemblaex.extract import process_one
from pathlib import Path
import logging

logger = logging.getLogger("myapp")

process_one(
    Path("registers/freewill_baptist.pdf"),
    Path("output/"),
    logger,
    active_methods={"azure_computer_vision", "pymupdf"},
    pdf_dpi=300,
)
```

### Reconcile a document folder

```python
from askemblaex.reconcile import reconcile_folder
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="...")
reconcile_folder(
    Path("output/2e7698fc.../"),
    client=client,
    model="gpt-4o",
)
```

### Extract entities from a reconciled folder

```python
from askemblaex.entities import entity_extract_folder
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="...")
extracted, skipped = entity_extract_folder(
    Path("output/2e7698fc.../"),
    client=client,
    model="gpt-4o",
    embed_provider="ollama",   # or "openai", or None to skip embeddings
)
print(f"{extracted} pages extracted, {skipped} skipped")
```

### Build a context window manually

```python
from askemblaex.window import build_dynamic_extraction_window

pages = {
    0: "Page zero text...",
    1: "Page one text...",
    2: "Page two text...",
    3: "Page three text...",
}

window = build_dynamic_extraction_window(pages, anchor_page=2, context_pages=1)
print(f"Window covers pages {window.pages_included}, {window.char_count} chars")
# Window covers pages [1, 2, 3], 123 chars
```

### Build person embedding text

```python
from askemblaex.entities import build_person_embeddings
import json

structured = {
    "persons": [{
        "name": "John Smith",
        "summary": "Head of household.",
        "birth": {"date_text": "1842", "place": "County Cork"},
        "death": {"date_text": "1910", "place": "Melbourne"},
        "relationships": {"spouse": ["Catherine Murphy"], "children": ["William Smith"]},
        "attributes": ["farmer"],
        "events": ["Arrived Melbourne 1865"],
        "evidence_phrases": ["head of household"],
    }],
    "events": [],
    "claims": [],
}

texts = build_person_embeddings(json.dumps(structured))
print(texts[0])
# TYPE: Person
# NAME: John Smith
# ALT_NAMES: none
# SUMMARY: Head of household.
# VITALS: born 1842 County Cork; died 1910 Melbourne
# ATTRIBUTES: farmer
# RELATIONSHIPS:
# - spouse: Catherine Murphy
# - children: William Smith
# EVENTS:
# - Arrived Melbourne 1865
# EVIDENCE_ANCHORS:
# - head of household
# YEAR_HINT: 1842–1910
```

### Generate an embedding

```python
from askemblaex.embed import generate_embedding

vector = generate_embedding("John Smith, farmer, born 1842 County Cork", provider="ollama")
print(len(vector))  # e.g. 768
```

---

## Publishing to PyPI

```bash
pip install hatch

# Build
hatch build

# Upload to TestPyPI first
hatch publish --repo test

# Upload to PyPI
hatch publish
```

---

## Development

```bash
git clone https://github.com/askembla/askemblaex
cd askemblaex
pip install -e ".[dev]"
pytest
```

---

## License

MIT — see [LICENSE](LICENSE).
