Metadata-Version: 2.4
Name: coresdk-extractor
Version: 1.2.1
Summary: Production-grade document extraction to versioned RAG-ready JSON/JSONL
License: MIT
License-File: LICENSE
Keywords: chunking,document,extraction,jsonl,llm,nlp,pdf,rag
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: charset-normalizer>=3
Requires-Dist: ebooklib>=0.18
Requires-Dist: lxml>=5
Requires-Dist: mistletoe>=1.3
Requires-Dist: openpyxl>=3.1
Requires-Dist: pdfplumber>=0.11
Requires-Dist: pikepdf>=8
Requires-Dist: puremagic>=1.28
Requires-Dist: pydantic>=2.0
Requires-Dist: pymupdf4llm>=0.0.17
Requires-Dist: pymupdf>=1.24
Requires-Dist: python-docx>=1.1
Requires-Dist: python-pptx>=0.6
Requires-Dist: requests>=2.32
Requires-Dist: rich>=13
Requires-Dist: tiktoken>=0.7
Requires-Dist: trafilatura>=1.12
Requires-Dist: typer>=0.12
Provides-Extra: audio
Requires-Dist: faster-whisper>=1.0; extra == 'audio'
Requires-Dist: pyannote-audio>=3.1; extra == 'audio'
Requires-Dist: torch>=2.0; extra == 'audio'
Requires-Dist: whisperx>=3.1; extra == 'audio'
Provides-Extra: azure
Requires-Dist: azure-identity>=1.16; extra == 'azure'
Requires-Dist: azure-storage-blob>=12.19; extra == 'azure'
Provides-Extra: canon-embeddings
Requires-Dist: numpy>=1.24; extra == 'canon-embeddings'
Requires-Dist: sentence-transformers>=2.2; extra == 'canon-embeddings'
Provides-Extra: clickhouse
Requires-Dist: clickhouse-connect>=0.7.0; extra == 'clickhouse'
Provides-Extra: dev
Requires-Dist: dirty-equals>=0.7; extra == 'dev'
Requires-Dist: inline-snapshot>=0.13; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest-regressions>=0.5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: types-beautifulsoup4; extra == 'dev'
Requires-Dist: types-requests; extra == 'dev'
Provides-Extra: elasticsearch
Requires-Dist: elasticsearch>=8.0; extra == 'elasticsearch'
Provides-Extra: email
Provides-Extra: full
Requires-Dist: azure-identity>=1.16; extra == 'full'
Requires-Dist: azure-storage-blob>=12.19; extra == 'full'
Requires-Dist: boto3>=1.34; extra == 'full'
Requires-Dist: clickhouse-connect>=0.7.0; extra == 'full'
Requires-Dist: confluent-kafka>=2.3; extra == 'full'
Requires-Dist: elasticsearch>=8.0; extra == 'full'
Requires-Dist: faster-whisper>=1.0; extra == 'full'
Requires-Dist: gliner2>=0.1.0; extra == 'full'
Requires-Dist: google-cloud-storage>=2.14; extra == 'full'
Requires-Dist: langdetect>=1.0.9; extra == 'full'
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc>=1.20; extra == 'full'
Requires-Dist: opentelemetry-sdk>=1.20; extra == 'full'
Requires-Dist: pillow>=10; extra == 'full'
Requires-Dist: psycopg2-binary>=2.9; extra == 'full'
Requires-Dist: pyannote-audio>=3.1; extra == 'full'
Requires-Dist: pymongo>=4.6; extra == 'full'
Requires-Dist: qdrant-client>=1.8; extra == 'full'
Requires-Dist: rapidfuzz>=3.0.0; extra == 'full'
Requires-Dist: surya-ocr>=0.4; extra == 'full'
Requires-Dist: tomli>=2.0; (python_version < '3.11') and extra == 'full'
Requires-Dist: torch>=2.0; extra == 'full'
Requires-Dist: weaviate-client>=3.26; extra == 'full'
Requires-Dist: whisperx>=3.1; extra == 'full'
Provides-Extra: gcs
Requires-Dist: google-cloud-storage>=2.14; extra == 'gcs'
Provides-Extra: kafka
Requires-Dist: confluent-kafka>=2.3; extra == 'kafka'
Provides-Extra: lang
Requires-Dist: langdetect>=1.0.9; extra == 'lang'
Provides-Extra: mongodb
Requires-Dist: pymongo>=4.6; extra == 'mongodb'
Provides-Extra: ner
Requires-Dist: gliner2>=0.1.0; extra == 'ner'
Requires-Dist: rapidfuzz>=3.0.0; extra == 'ner'
Requires-Dist: torch>=2.0.0; extra == 'ner'
Provides-Extra: ocr
Requires-Dist: pillow>=10; extra == 'ocr'
Requires-Dist: surya-ocr>=0.4; extra == 'ocr'
Provides-Extra: otel
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc>=1.20; extra == 'otel'
Requires-Dist: opentelemetry-sdk>=1.20; extra == 'otel'
Provides-Extra: postgres
Requires-Dist: psycopg2-binary>=2.9; extra == 'postgres'
Provides-Extra: qdrant
Requires-Dist: qdrant-client>=1.8; extra == 'qdrant'
Provides-Extra: s3
Requires-Dist: boto3>=1.34; extra == 's3'
Provides-Extra: scientific
Provides-Extra: sftp
Requires-Dist: paramiko>=3.4; extra == 'sftp'
Provides-Extra: sources
Requires-Dist: azure-identity>=1.16; extra == 'sources'
Requires-Dist: azure-storage-blob>=12.19; extra == 'sources'
Requires-Dist: boto3>=1.34; extra == 'sources'
Requires-Dist: google-cloud-storage>=2.14; extra == 'sources'
Provides-Extra: tomli
Requires-Dist: tomli>=2.0; (python_version < '3.11') and extra == 'tomli'
Provides-Extra: weaviate
Requires-Dist: weaviate-client>=3.26; extra == 'weaviate'
Description-Content-Type: text/markdown

# extractor

> Production-grade document extraction CLI + Python library.  
> Converts **any file format** into a versioned, schema-stable, RAG-ready JSONL stream.

[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)
[![Schema v1.2.0](https://img.shields.io/badge/schema-v1.2.0-green)](docs/SPEC.md)
[![License: MIT](https://img.shields.io/badge/license-MIT-brightgreen)](LICENSE)

---

## Why extractor?

Most RAG pipelines treat documents as bags of text. **extractor** preserves structure:

- Every chunk carries a **full breadcrumb** (`section_path`) so your retriever knows where it came from
- **Tables are atomic** — never split across chunks; emitted in both Markdown and structured JSON
- **Sections drive chunking** — boundaries follow headings, not page numbers or character counts
- **Content-addressed IDs** — stable across re-runs, safe to use as vector store keys
- **Streaming JSONL** — process terabytes without loading files into memory
- **Versioned protocol** — `schema_version` on every record; breaking changes bump the major version
- **Named entity extraction** — GLiNER2 NER inline on every element and chunk; `entity_types` field for fast vector store payload filtering; local inference via `fastino/gliner2-base-v1`

---

## Installation

```bash
pip install coresdk-extractor                      # core (PDF, DOCX, XLSX, PPTX, HTML, Markdown, EPUB, JSON, XML, CSV, LaTeX)
pip install "coresdk-extractor[audio]"             # + audio transcription (faster-whisper + pyannote)
pip install "coresdk-extractor[ocr]"               # + scanned PDF OCR (surya-ocr)
pip install "coresdk-extractor[ner]"               # + GLiNER2 NER, classification, relations, structured extraction
pip install "coresdk-extractor[lang]"              # + language detection (langdetect)
pip install "coresdk-extractor[otel]"              # + OpenTelemetry tracing

# Source connectors
pip install "coresdk-extractor[s3]"                # S3 / MinIO
pip install "coresdk-extractor[azure]"             # Azure Blob Storage / ADLS Gen2
pip install "coresdk-extractor[gcs]"               # Google Cloud Storage
pip install "coresdk-extractor[sources]"           # all three cloud connectors
# HTTP/HTTPS and IMAP/email connectors are included in the core install

# Database sinks
pip install "coresdk-extractor[clickhouse]"        # ClickHouse
pip install "coresdk-extractor[mongodb]"           # MongoDB
pip install "coresdk-extractor[postgres]"          # PostgreSQL
pip install "coresdk-extractor[elasticsearch]"     # Elasticsearch
pip install "coresdk-extractor[qdrant]"            # Qdrant
pip install "coresdk-extractor[weaviate]"          # Weaviate
pip install "coresdk-extractor[kafka]"             # Kafka (confluent-kafka)
# Webhook sink requires no extra install

pip install "coresdk-extractor[full]"              # everything above
```

**Scientific PDFs** (GROBID): run a GROBID server and set `GROBID_URL=http://localhost:8070`.
Without it, scientific PDFs fall back to pymupdf4llm automatically.
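
If you want to confirm the GROBID route will be used before a large run, a quick reachability check helps. The sketch below assumes a GROBID server exposing its standard `/api/isalive` health endpoint at `GROBID_URL`:

```python
import os

import requests

# If GROBID is not reachable, scientific PDFs silently use the pymupdf4llm fallback.
grobid_url = os.environ.get("GROBID_URL", "http://localhost:8070")
try:
    alive = requests.get(f"{grobid_url}/api/isalive", timeout=5)
    print("GROBID reachable:", alive.ok)
except requests.RequestException:
    print("GROBID not reachable; scientific PDFs will fall back to pymupdf4llm")
```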

### Verify installation

```bash
extractor info          # lists all supported formats
extractor run README.md # quick smoke test on any local file
```

> **Heavy optional dependencies:** `extractor[audio]` pulls PyTorch (~2 GB). `extractor[ocr]` requires surya-ocr with PyTorch. `extractor[ner]` pulls PyTorch (~2 GB) for local GLiNER2 inference. Install these only when needed.

---

## Quick start

```bash
# Extract a PDF — stream elements to stdout
extractor run paper.pdf

# Extract in RAG-ready chunks mode, write to file
extractor run paper.pdf --mode chunks --out paper.chunks.jsonl

# Extract a whole directory, write each file alongside it
extractor run ./docs/ --mode chunks --out ./out/

# View what was extracted
extractor view paper.chunks.jsonl
extractor view paper.chunks.jsonl --count        # element type breakdown
extractor view paper.chunks.jsonl --types table:simple,code:block

# Inspect a file before extracting
extractor info paper.pdf

# List all parsers and supported formats
extractor info

# Validate output against the schema
extractor validate paper.chunks.jsonl --level invariants

# List all element types
extractor schema
```

---

## Source connectors

extractor can pull documents directly from cloud storage, HTTP, and email — no manual download step needed.

```bash
# S3 bucket or prefix
extractor run s3://my-bucket/docs/ --out ./output/

# MinIO (S3-compatible)
EXTRACTOR_S3_ENDPOINT_URL=http://minio:9000 extractor run s3://my-bucket/docs/ --out ./output/

# Azure Blob Storage
extractor run az://my-container/reports/ --out ./output/

# Azure Data Lake Storage Gen2
extractor run abfs://my-container/data/ --out ./output/

# Google Cloud Storage
extractor run gcs://my-bucket/papers/ --out ./output/

# Single file via HTTPS (no extra install needed)
extractor run https://example.com/report.pdf

# Email attachments via IMAP (no extra install needed)
extractor run imap://inbox --out ./output/

# Filter to PDF files only, download up to 8 files in parallel
extractor run s3://my-bucket/docs/ --source-filter "*.pdf" --source-concurrency 8 --out ./output/
```

Every downloaded file passes through the same quarantine gate as local files before extraction.

### Auth env vars by connector

| Connector | Required env vars | Optional env vars |
|-----------|-------------------|-------------------|
| S3 | `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` (or `AWS_PROFILE`, or IAM role) | `AWS_SESSION_TOKEN`, `EXTRACTOR_S3_ENDPOINT_URL`, `EXTRACTOR_S3_REGION` |
| MinIO | `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` + `EXTRACTOR_S3_ENDPOINT_URL` | `EXTRACTOR_S3_REGION` |
| Azure Blob / ADLS | `AZURE_STORAGE_CONNECTION_STRING` **or** `AZURE_STORAGE_ACCOUNT` + `AZURE_STORAGE_KEY` | `AZURE_STORAGE_ACCOUNT` alone uses DefaultAzureCredential (managed identity / service principal) |
| GCS | `GOOGLE_APPLICATION_CREDENTIALS` (path to service account JSON) | On GKE/Cloud Run: Workload Identity — no env var needed |
| HTTP/HTTPS | none | `EXTRACTOR_HTTP_HEADERS_JSON` (JSON dict), `EXTRACTOR_HTTP_VERIFY_SSL`, `EXTRACTOR_HTTP_MAX_BYTES` |
| IMAP/email | `EXTRACTOR_IMAP_HOST`, `EXTRACTOR_IMAP_USERNAME`, `EXTRACTOR_IMAP_PASSWORD` | `EXTRACTOR_IMAP_PORT` (default: 993), `EXTRACTOR_IMAP_FOLDER` (default: INBOX), `EXTRACTOR_IMAP_SEARCH` (default: UNSEEN) |

See [docs/sources.md](docs/sources.md) for full connector documentation.
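
For programmatic runs the same URIs can be used from Python. The sketch below is illustrative only: it assumes the `extract()` entry point accepts remote source URIs the same way the CLI does, and that credentials come from the env vars in the table above (values shown are placeholders):

```python
import os

from extractor import extract

# Same env vars the CLI uses (see the table above); values are placeholders.
os.environ.setdefault("AWS_ACCESS_KEY_ID", "AKIA...")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "...")
# Optional: point the S3 connector at MinIO instead of AWS.
# os.environ["EXTRACTOR_S3_ENDPOINT_URL"] = "http://minio:9000"

for chunk in extract("s3://my-bucket/docs/report.pdf", mode="chunks"):
    print(chunk.id, chunk.token_count)
```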

---

## Database sinks

Stream extracted records directly into a database with `--sink`. The sink writes in batches alongside normal JSONL output.

```bash
# Write to Qdrant (chunk payloads only — add embeddings separately)
extractor run ./docs/ --mode chunks --sink qdrant --sink-uri http://localhost:6333

# Write to MongoDB
extractor run ./docs/ --mode chunks --sink mongodb --sink-uri mongodb://localhost:27017 --sink-database mydb --sink-table chunks

# Write to PostgreSQL (DSN from env: EXTRACTOR_PG_DSN)
extractor run ./docs/ --mode chunks --sink postgres

# Write to Elasticsearch
extractor run ./docs/ --mode chunks --sink elasticsearch --sink-uri http://localhost:9200 --sink-table my_index

# Write to ClickHouse
extractor run ./docs/ --mode chunks --sink clickhouse --sink-uri localhost:8123

# Write to Kafka topic
extractor run ./docs/ --mode chunks --sink kafka --sink-uri broker:9092 --sink-table my_topic

# POST batches to a webhook
extractor run ./docs/ --mode chunks --sink webhook --sink-uri https://my-api.example.com/ingest

# Adjust batch size (default: 1000)
extractor run ./docs/ --mode chunks --sink postgres --sink-batch 500
```

| Sink | Install | Connection |
|------|---------|------------|
| `clickhouse` | `coresdk-extractor[clickhouse]` | `--sink-uri host:port` or defaults to `localhost:8123` |
| `mongodb` | `coresdk-extractor[mongodb]` | `--sink-uri mongodb://...` or defaults to `mongodb://localhost:27017` |
| `postgres` / `postgresql` | `coresdk-extractor[postgres]` | `--sink-uri postgresql://user:pass@host/db` or `EXTRACTOR_PG_DSN` |
| `elasticsearch` / `es` | `coresdk-extractor[elasticsearch]` | `--sink-uri http://...` or `EXTRACTOR_ES_URL` |
| `qdrant` | `coresdk-extractor[qdrant]` | `--sink-uri http://...` or `EXTRACTOR_QDRANT_URL` |
| `weaviate` | `coresdk-extractor[weaviate]` | `--sink-uri http://...` or `EXTRACTOR_WEAVIATE_URL` |
| `kafka` | `coresdk-extractor[kafka]` | `--sink-uri broker:9092` or `EXTRACTOR_KAFKA_BROKERS` |
| `webhook` / `http_post` | none (core) | `--sink-uri https://...` or `EXTRACTOR_WEBHOOK_URL` |

See [docs/sinks.md](docs/sinks.md) for schema mapping details, auth env vars, and custom sink plugins.
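
The Qdrant sink writes chunk payloads without vectors (see the note in the example above), so embeddings are added in a second pass. A minimal sketch of that pass, assuming you embed chunk text yourself with `sentence-transformers` and upsert into the same Qdrant instance; the collection name, model, and file name are placeholders:

```python
import json
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")      # 384-dim embeddings
client = QdrantClient(url="http://localhost:6333")
client.create_collection(                            # run once
    collection_name="chunks_embedded",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

with open("docs.chunks.jsonl") as fh:
    records = [json.loads(line) for line in fh if line.strip()]
chunks = [r for r in records if "element_type" in r]   # skip envelope / stream_end

vectors = model.encode([c["text"] for c in chunks])
points = [
    PointStruct(
        # Qdrant point IDs must be ints or UUIDs, so derive a stable UUID
        # from the content-addressed element ID.
        id=str(uuid.uuid5(uuid.NAMESPACE_URL, c["id"])),
        vector=vec.tolist(),
        payload=c,
    )
    for c, vec in zip(chunks, vectors)
]
client.upsert(collection_name="chunks_embedded", points=points)
```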

---

## Config file

Place `.extractor.toml` in your project directory (or pass `--config path/to/extractor.toml`) to set defaults without repeating CLI flags.

```toml
[run]
mode = "chunks"
chunk_size = 512
tokenizer = "cl100k_base"

[ner]
enabled = true
model = "fastino/gliner2-base-v1"

[sink]
type = "qdrant"
uri = "http://localhost:6333"

[source]
concurrency = 8

[quality_gates]
min_chunks = 1
max_extraction_error_rate = 0.05
```

---

## New in v1.2.0

- **GLiNER2 extended capabilities** — classification (`--classify-as`), relation triples (`--relations`), structured field extraction (`--extract-schema`)
- **Source connectors** — pull documents from S3/MinIO, Azure Blob/ADLS, GCS, HTTP/HTTPS, and IMAP email directly via URI
- **Database sinks** — stream records into ClickHouse, MongoDB, PostgreSQL, Elasticsearch, Qdrant, Weaviate, Kafka, or any HTTP webhook
- **Table serialization modes** — `--table-text-mode markdown|nl-rows|nl-columns|hybrid` controls how tables are serialized into chunk text
- **Chunk quality scoring** — `--quality` emits `ChunkQuality` with lexical density, entity density, compression ratio, and heading coverage
- **Language detection** — `metadata.language` populated per element when `extractor[lang]` is installed
- **Quality gates** — configurable pass/fail thresholds in `.extractor.toml` under `[quality_gates]`; gate failures are recorded in the manifest
- **Figure extraction** — `--figures-dir` exports figure assets (PNG/JPEG) alongside JSONL output
- **OpenTelemetry** — `extractor[otel]` emits spans per document; configure via standard OTEL env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`, etc.)
- **Dual-chunk mode** — `--mode dual-chunks` produces both coarse parent chunks and fine child chunks linked by `parent_chunk_id` (see the sketch after this list)
- **Parallel workers** — `--workers N` for local directory extraction; cloud sources use `--source-concurrency N`
- **Incremental processing** — `--incremental` skips files unchanged since last run (SHA256 + run-config keyed JSON cache)
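
For dual-chunk output, parent and child chunks arrive in the same JSONL stream. A small sketch of stitching them back together for small-to-big retrieval, assuming `parent_chunk_id` appears on each child record (the file name is a placeholder):

```python
import json

# Build a parent lookup so a retrieved child chunk can be expanded
# to its coarse parent before being sent to the LLM.
parents, children = {}, []
with open("docs.dual.jsonl") as fh:
    for line in fh:
        rec = json.loads(line)
        if "element_type" not in rec:
            continue  # skip envelope / stream_end records
        if rec.get("parent_chunk_id"):
            children.append(rec)
        else:
            parents[rec["id"]] = rec

for child in children[:3]:
    parent = parents.get(child["parent_chunk_id"])
    print(child["id"], "->", parent["id"] if parent else None)
```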

---

## Entity extraction (NER)

```bash
# Extract with named entities (requires extractor[ner])
extractor run paper.pdf --mode chunks --entities

# Disable NER
extractor run paper.pdf --mode chunks --no-entities

# Custom entity types
extractor run paper.pdf --entities-types "person,organization,product" --entities-threshold 0.6
```

```python
from extractor import extract

for chunk in extract("paper.pdf", mode="chunks", entities=True):
    print(chunk.entities)           # list of EntityAnnotation
    print(chunk.chunk_metadata.entity_types)  # ["organization", "person"]
```

---

## Python API

```python
from extractor import extract

# Elements mode — fine-grained semantic units
for el in extract("paper.pdf", mode="elements"):
    print(el.element_type, el.section_path, el.text[:120])

# Chunks mode — pre-committed, RAG-ready chunks
for chunk in extract("paper.pdf", mode="chunks", chunk_size=512):
    print(chunk.id, chunk.token_count, chunk.text[:120])

# Filter to only tables and headings
for el in extract("report.docx", include_types=["table:simple", "structural:section_header"]):
    if el.table:
        print(el.table.structured)  # {"headers": [...], "rows": [[...]]}
```

### Error handling

```python
from extractor import extract, QuarantineError, UnsupportedFormatError, ParserError

try:
    for el in extract("untrusted_file.pdf"):
        print(el.element_type, el.text[:80])
except QuarantineError as e:
    print(f"File rejected by security check: {e}")
except UnsupportedFormatError:
    print("Format not supported. Run `extractor info` for the full list.")
except ParserError as e:
    print(f"Parser failed: {e}")
```

---

## Output schema

Every record is a JSON object. Key fields:

| Field | Type | Description |
|-------|------|-------------|
| `id` | `string` | Content-addressed ID (`el_` + 16 hex chars) |
| `element_type` | `string` | One of the canonical element types (see below) |
| `text` | `string` | Plain-text content |
| `section_path` | `string[]` | Heading breadcrumb, e.g. `["Introduction", "Methods"]` |
| `section_path_tier` | `int` | Quality: 1=native, 2=font-heuristic, 3=keyword, 4=positional |
| `sequence_index` | `int` | Document order, 0-based |
| `page` | `int\|null` | Source page (1-based) |
| `schema_version` | `string` | `"1.2.0"` |
| `source_filename` | `string` | Source file name |
| `source_sha256` | `string` | SHA-256 of source file |
| `entities` | `EntityAnnotation[]\|absent` | Named entity annotations (absent = NER not run; `[]` = NER ran, nothing found) |
| `table` | `object\|null` | `{markdown, structured, has_header_row, row_count, col_count}` |
| `equation` | `object\|null` | `{latex, plain_text, mathml}` |
| `figure` | `object\|null` | `{caption, image_ref, image_sha256, ocr_text}` |
| `transcript` | `object\|null` | `{speaker, start_time_s, end_time_s, word_timestamps}` |
| `admonition` | `object\|null` | `{kind, title}` |
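
A minimal consumer of this schema: count element types and print each record's breadcrumb, using only the field names from the table above (the file name is a placeholder):

```python
import json
from collections import Counter

counts = Counter()
with open("paper.chunks.jsonl") as fh:
    for line in fh:
        rec = json.loads(line)
        if "element_type" not in rec:
            continue  # envelope / stream_end records
        counts[rec["element_type"]] += 1
        breadcrumb = " > ".join(rec.get("section_path") or [])
        print(f'{rec["id"]}  [{breadcrumb}]  {rec["text"][:80]}')

print(counts.most_common())
```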

### Element types

```
structural:  title  section_header  subtitle  divider  page_header  page_footer
text:        narrative  abstract  admonition  pull_quote  footnote  caption  sidebar  transcript_segment
table:       simple  complex  continuation
code:        block  cell  inline
list:        item  item_ordered  item_definition
media:       figure  image  audio  video
scientific:  equation_display  equation_inline  citation  reference_entry  theorem  definition  proof
meta:        document_title  author  date  url  email  page_number  extraction_error
form:        field  label  checkbox
composite:   chunk
```

**Atomic types** (never split across chunks): `table:simple`, `table:complex`, `table:continuation`, `media:figure`, `media:image`, `code:block`, `scientific:equation_display`

---

## JSONL envelope format

```jsonl
{"type":"envelope","extractor_version":"1.2.0","source":{...},"run_config":{...},"created_at":"..."}
{"id":"el_a1b2c3d4e5f6a7b8","element_type":"structural:title","text":"Introduction",...}
{"id":"el_...","element_type":"text:narrative","text":"...",...}
...
{"type":"stream_end","status":"complete","total_elements":42,"schema_version":"1.0.0"}
```

A `manifest.json` companion file is written alongside every `--out` file with full stats.
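
When consuming a stream programmatically it is worth verifying that the trailer actually arrived, since a truncated run has no `stream_end` record. A sketch, assuming the field names shown above:

```python
import json

envelope, trailer, elements = None, None, 0
with open("paper.chunks.jsonl") as fh:
    for line in fh:
        rec = json.loads(line)
        if rec.get("type") == "envelope":
            envelope = rec
        elif rec.get("type") == "stream_end":
            trailer = rec
        else:
            elements += 1

assert trailer is not None, "stream truncated: no stream_end record"
assert trailer["status"] == "complete"
assert trailer["total_elements"] == elements
print(envelope["extractor_version"], elements, "elements")
```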

---

## Supported formats

| Format | Library | Notes |
|--------|---------|-------|
| PDF (digital) | pymupdf4llm | Fast, heading-aware |
| PDF (scientific) | GROBID TEI → pymupdf4llm fallback | Equations, citations, references |
| PDF (scanned) | surya-ocr → pymupdf fallback | Layout detection + OCR |
| DOCX | python-docx | Headings, tables, runs, images |
| XLSX | openpyxl | Sheet-per-section, dual-format tables |
| PPTX | python-pptx | Slide titles + body, speaker notes |
| HTML | trafilatura + lxml | Boilerplate removal, GFM alerts |
| Markdown | mistletoe (GFM) | Headings, tables, alerts, code fences |
| EPUB | ebooklib + BS4 | Spine-order chapter extraction |
| LaTeX | pure-regex parser | Sections, equations, tables, figures, bibliography |
| JSON | stdlib | Key-value pairs as narrative |
| XML | lxml | Title/paragraph heuristics |
| CSV | stdlib csv | Entire file as dual-format table |
| Plain text | heuristic | Heading pattern detection |
| Audio (mp3/wav/m4a/flac) | faster-whisper + pyannote | Diarization, word timestamps |

---

## CLI reference

```
extractor run <target> [options]

  Output
  --mode            elements | chunks | dual-chunks  (default: elements)
  --out             Output file or directory (default: stdout)
  --quiet / -q      Suppress progress output
  --debug           Show full tracebacks on errors
  --include-full-path  Store absolute source path instead of filename

  Chunking
  --chunk-size      Max tokens per chunk (default: 512)
  --overlap         Overlap tokens between chunks (default: 0)
  --tokenizer       tiktoken encoding (default: cl100k_base)
  --context-prefix  Prepend section breadcrumb to each chunk text
  --parent-size     Token budget for coarse chunks in dual-chunks mode (default: 512)
  --child-size      Token budget for fine chunks in dual-chunks mode (default: 128)

  Extraction
  --strategy        fast | accurate | ocr  (default: fast)
  --include-types   Comma-separated element types to emit
  --exclude-types   Comma-separated element types to suppress
  --table-text-mode markdown | nl-rows | nl-columns | hybrid  (default: markdown)
  --figures-dir     Directory to export figure assets (PNG/JPEG)

  NER / GLiNER2
  --entities/--no-entities      Run GLiNER2 NER (default: on when extractor[ner] installed)
  --entities-model              Local GLiNER2 model (default: fastino/gliner2-base-v1)
  --entities-types              Comma-separated NER label list
  --entities-threshold          Min confidence score (default: 0.50)
  --classify-as                 Comma-separated classification labels
  --relations/--no-relations    Extract (subject, predicate, object) triples
  --relation-types              Comma-separated relation predicates
  --extract-schema              JSON file with schema dict for structured field extraction
  --extract-on                  Comma-separated element types for structured extraction
  --canonicalize/--no-canonicalize  Cross-document entity canonicalization
  --registry-path               Path to EntityRegistry JSON file

  Quality
  --quality/--no-quality        Emit ChunkQuality scores on each chunk

  Parallel / incremental
  --workers / -w    Parallel workers for local directory extraction (default: 1)
  --incremental     Skip files unchanged since last run (requires --out)

  Source connectors
  --source-filter       Glob pattern to filter remote files, e.g. "*.pdf"
  --source-concurrency  Max parallel downloads from remote sources (default: 4)
  --source-tmp-dir      Directory for temp files during remote download

  Database sinks
  --sink            clickhouse | mongodb | postgres | elasticsearch | qdrant | weaviate | kafka | webhook
  --sink-uri        Connection URI or host:port
  --sink-database   Database name (default: extractor)
  --sink-table      Table/collection/index/topic name (default: elements)
  --sink-batch      Batch size for database writes (default: 1000)

  Config
  --config          Path to extractor.toml config file

extractor view <jsonl-file>
  --max-text        Max chars per element (default: 200)
  --types / -t      Comma-separated element types to show
  --count / -n      Print element counts by type and exit
  --no-meta         Hide envelope/manifest lines

extractor validate <jsonl-file>
  --level           basic | schema | invariants  (default: schema)

extractor info [file]
extractor schema [element-type] [--json] [--type element|chunk|manifest] [--out file]
extractor cache clear [cache-file] [--older-than N]
```

---

## Environment variables

All environment variables are **optional**. The library works out of the box without any of them — each one unlocks a specific optional capability.

**PDF (scientific)**

| Variable | Default | Description |
|----------|---------|-------------|
| `GROBID_URL` | `http://localhost:8070` | URL of a running [GROBID](https://grobid.readthedocs.io) server. When set and reachable, scientific PDFs are parsed via GROBID TEI (better equation/citation/reference extraction). Without it, scientific PDFs automatically fall back to the standard digital PDF parser — no errors. |

**Audio transcription** — only relevant if you install `extractor[audio]` and pass audio files (MP3/WAV/M4A)

| Variable | Default | Description |
|----------|---------|-------------|
| `HF_TOKEN` | — | Hugging Face API token. Required only for **speaker diarization** ("who said what"). Without it, you still get a full transcript — just without speaker labels. Get a free token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and accept the [pyannote.audio](https://huggingface.co/pyannote/speaker-diarization-3.1) model license. |
| `WHISPER_MODEL` | `base` | Whisper model size controlling accuracy vs. speed. `base` (~150 MB) is fast and good for most uses. Use `large-v2` (~3 GB) for production-quality transcription. Options: `tiny` / `base` / `small` / `medium` / `large-v2`. |
| `WHISPER_DEVICE` | auto | Hardware to run Whisper on. Auto-detected: uses NVIDIA GPU (`cuda`), Apple Silicon (`mps`), or falls back to `cpu`. Set explicitly if auto-detection picks the wrong device. |
| `WHISPER_COMPUTE` | auto | Float precision for faster-whisper. `int8` on CPU (faster, less RAM). `float16` on GPU (fastest). `float32` for maximum accuracy. Auto-set based on device. |
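
These are plain environment variables, so they can also be set from Python before calling `extract()`. A sketch forcing the higher-accuracy model on GPU (values are illustrative):

```python
import os

from extractor import extract

# Set before calling extract() so faster-whisper picks up the model and device.
os.environ["WHISPER_MODEL"] = "large-v2"
os.environ["WHISPER_DEVICE"] = "cuda"
os.environ["WHISPER_COMPUTE"] = "float16"
os.environ["HF_TOKEN"] = "hf_..."   # only needed for speaker diarization

for seg in extract("meeting.mp3", mode="elements"):
    print(seg.element_type, seg.text[:80])
```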

**Entity extraction (NER)** — only relevant if you install `extractor[ner]`

> NER uses local inference only (`fastino/gliner2-base-v1`). No external API key required.

---

## Exit codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Partial success (some files failed) |
| 2 | Quarantine failure (file rejected) |
| 3 | Unsupported format |
| 4 | Parser crash |
| 5 | Output write error |
| 6 | Configuration error |
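
In batch pipelines the codes above can separate retryable failures from hard rejects. A sketch of driving the CLI from Python and branching on the exit code:

```python
import subprocess

# Exit codes follow the table above.
result = subprocess.run(
    ["extractor", "run", "paper.pdf", "--mode", "chunks", "--out", "paper.chunks.jsonl"]
)
if result.returncode == 0:
    print("extraction complete")
elif result.returncode == 1:
    print("partial success: some files failed, check the manifest")
elif result.returncode == 2:
    print("quarantined: file rejected by the security gate, do not retry")
else:
    print(f"failed with exit code {result.returncode}")
```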

---

## Documentation

| Document | Description |
|----------|-------------|
| [Protocol Specification](docs/SPEC.md) | Full schema and protocol spec v1.2.0 |
| [Source Connectors](docs/sources.md) | S3, Azure, GCS, HTTP, IMAP connectors |
| [Database Sinks](docs/sinks.md) | ClickHouse, MongoDB, Postgres, ES, Qdrant, Weaviate, Kafka, Webhook |
| [Architecture](docs/architecture.md) | System architecture and design decisions |
| [Writing a Parser](docs/writing-a-parser.md) | How to add support for a new format |
| [Contributing](CONTRIBUTING.md) | Dev setup, test workflow, PR guidelines |

---

## License

MIT — see [LICENSE](LICENSE).
