Metadata-Version: 2.4
Name: sift-kg
Version: 0.6.0
Summary: Zero-config document-to-knowledge-graph pipeline
Project-URL: Homepage, https://github.com/civictable/sift-kg
Project-URL: Documentation, https://github.com/civictable/sift-kg#readme
Project-URL: Repository, https://github.com/civictable/sift-kg
Project-URL: Issues, https://github.com/civictable/sift-kg/issues
Author-email: Juan Ceresa <jcere@umich.edu>
License: MIT
License-File: LICENSE
Keywords: document-processing,entity-extraction,knowledge-graph,llm,nlp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: inflect>=7.0.0
Requires-Dist: litellm>=1.0.0
Requires-Dist: networkx>=3.2
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: pydantic-settings>=2.1.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: python-docx>=1.0.0
Requires-Dist: pyvis>=0.3.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: rich>=13.0.0
Requires-Dist: semhash>=0.4.0
Requires-Dist: typer[all]>=0.9.0
Requires-Dist: unidecode>=1.3.0
Provides-Extra: all
Requires-Dist: google-cloud-vision>=3.4.0; extra == 'all'
Requires-Dist: pymupdf>=1.23.0; extra == 'all'
Requires-Dist: scikit-learn>=1.3.0; extra == 'all'
Requires-Dist: sentence-transformers>=2.0.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.9.0; extra == 'dev'
Provides-Extra: embeddings
Requires-Dist: scikit-learn>=1.3.0; extra == 'embeddings'
Requires-Dist: sentence-transformers>=2.0.0; extra == 'embeddings'
Provides-Extra: ocr
Requires-Dist: google-cloud-vision>=3.4.0; extra == 'ocr'
Requires-Dist: pymupdf>=1.23.0; extra == 'ocr'
Description-Content-Type: text/markdown

# sift-kg

**Turn any collection of documents into a knowledge graph.**

No code, no database, no infrastructure — just a CLI and your documents. Define what to extract in YAML (or use the built-in defaults), and get a browsable, exportable knowledge graph. sift-kg handles the rest: entity extraction, duplicate resolution with your approval, and narrative generation that traces connections across your entire collection.

**[Live demos →](https://juanceresa.github.io/sift-kg/)** — graphs generated entirely by sift-kg

```bash
pip install sift-kg

sift init                           # create sift.yaml + .env.example
sift extract ./documents/           # extract entities & relations
sift build                          # build knowledge graph
sift resolve                        # find duplicate entities
sift review                         # approve/reject merges interactively
sift apply-merges                   # apply your decisions
sift narrate                        # generate narrative summary
sift view                           # interactive graph in your browser
sift export graphml                 # export to Gephi, yEd, Cytoscape, etc.
```

## How It Works

```
Documents (PDF, DOCX, text, HTML)
       ↓
  Text Extraction (pdfplumber, local) — or OCR for scanned PDFs (Google Cloud Vision)
       ↓
  Entity & Relation Extraction (LLM)
       ↓
  Knowledge Graph (NetworkX, JSON)
       ↓
  Entity Resolution (LLM proposes → you review)
       ↓
  Narrative Generation (LLM)
       ↓
  Interactive Viewer (browser) / Export (GraphML, GEXF, CSV)
```

Every entity and relation links back to the source document and passage. You control what gets merged. The graph is yours.

## Features

- **Zero-config start** — point at a folder, get a knowledge graph. Or drop a `sift.yaml` in your project for persistent settings
- **Any LLM provider** — OpenAI, Anthropic, Mistral, Ollama (local/private), or any LiteLLM-compatible provider
- **Domain-configurable** — define custom entity types and relation types in YAML
- **Human-in-the-loop** — sift proposes entity merges, you approve or reject in an interactive terminal UI
- **CLI search** — `sift search "SBF"` finds entities by name or alias, with optional relation and description output
- **Interactive viewer** — explore your graph in-browser with focus mode (double-click to isolate neighborhoods), keyboard navigation (arrow keys to step through connections), search, type/community toggles, and degree filtering
- **Export anywhere** — GraphML (yEd, Cytoscape), GEXF (Gephi), CSV, or native JSON for advanced analysis
- **Narrative generation** — investigative-style reports with relationship chains, timelines, and community-grouped entity profiles
- **Source provenance** — every extraction links to the document and passage it came from
- **Multilingual** — extracts from documents in any language, outputs a unified English knowledge graph. Proper names stay as-is, non-Latin scripts are romanized automatically
- **OCR for scanned PDFs** — optional Google Cloud Vision integration for court records, FOIA dumps, and historical archives (`--ocr` flag)
- **Budget controls** — set `--max-cost` to cap LLM spending
- **Runs locally** — your documents stay on your machine

## Use Cases

- **Investigative journalism** — analyze FOIA releases, court filings, and document leaks
- **OSINT research** — map entity networks from public records
- **Academic research** — map how theories, methods, and findings connect across a body of literature
- **Legal review** — extract and connect entities across document collections
- **Genealogy** — trace family relationships across vital records

## Bundled Domains

sift-kg ships with specialized domains you can use out of the box:

```bash
sift domains                              # list available domains
sift extract ./docs/ --domain-name osint  # use a bundled domain
```

Set a domain in `sift.yaml` so you don't need the flag every time:

```yaml
domain: academic
```

Works with bundled names (`academic`, `osint`, `default`) or a path to a custom YAML file.

| Domain | Focus | Key Entity Types | Key Relation Types |
|--------|-------|------------------|--------------------|
| `default` | General document analysis | PERSON, ORGANIZATION, LOCATION, EVENT, DOCUMENT | ASSOCIATED_WITH, MEMBER_OF, LOCATED_IN |
| `osint` | Investigations & FOIA | SHELL_COMPANY, FINANCIAL_ACCOUNT | BENEFICIAL_OWNER_OF, TRANSACTED_WITH, SIGNATORY_OF |
| `academic` | Literature review & topic mapping | CONCEPT, THEORY, METHOD, SYSTEM, FINDING, PHENOMENON, RESEARCHER, PUBLICATION, FIELD, DATASET | SUPPORTS, CONTRADICTS, EXTENDS, IMPLEMENTS, EXPLAINS, PROPOSED_BY, USES_METHOD, APPLIED_TO, INVESTIGATES |

The **academic** domain maps the intellectual landscape of a research area — feed in papers and get a graph of how theories, methods, systems, findings, and concepts connect. Distinguishes abstract ideas (THEORY, METHOD) from concrete artifacts (SYSTEM — e.g. GPT-2, BERT, GLUE). Designed for literature reviews, topic mapping, and understanding where ideas agree, contradict, or build on each other.

The **osint** domain adds entity types for shell companies, financial accounts, and offshore jurisdictions, plus relation types for tracing beneficial ownership and financial flows.

Nothing gets merged without your approval — the LLM proposes, you verify. Every extraction links back to the source document and passage.

See [`examples/ftx/`](examples/ftx/) for a pipeline run on 9 articles about the FTX collapse (431 entities, 1,201 relations) and [`examples/epstein/`](examples/epstein/) for the Giuffre v. Maxwell depositions (190 entities, 387 relations). [**Explore both graphs live**](https://juanceresa.github.io/sift-kg/) — no install, no API key.

## Civic Table

Looking for a hosted platform with forensic legal analysis and analyst verification?

[**Civic Table**](https://github.com/juanceresa/forensic_analysis_platform) is a forensic intelligence platform built on the sift-kg pipeline. It adds a 4-tier verification system where analysts and JDs validate AI-extracted facts before they're treated as evidence, LaTeX dossier generation for legal submissions, and a web interface for sharing results with clients and families. Built for property restitution, investigative journalism, and any context where documentary provenance matters.

sift-kg is the open-source CLI. Civic Table is the full platform — and where the output gets vetted by analysts and JDs before it carries evidentiary weight.

## Installation

Requires Python 3.11+.

```bash
pip install sift-kg
```

For scanned PDF support via Google Cloud Vision OCR (optional):

```bash
pip install sift-kg[ocr]
```

For semantic clustering during entity resolution (optional, ~2GB for PyTorch):

```bash
pip install sift-kg[embeddings]
```

For development:

```bash
git clone https://github.com/juanceresa/sift-kg.git
cd sift-kg
pip install -e ".[dev]"
```

## Quick Start

### 1. Initialize and configure

```bash
sift init                     # creates sift.yaml + .env.example
cp .env.example .env          # copy and add your API key
```

`sift init` generates a `sift.yaml` project config so you don't need flags on every command:

```yaml
# sift.yaml
domain: domain.yaml           # or a bundled name like "osint"
model: openai/gpt-4o-mini
ocr: true                     # for scanned PDFs (requires sift-kg[ocr])
```

Set your API key in `.env`:
```
SIFT_OPENAI_API_KEY=sk-...
```

Or use Anthropic, Mistral, Ollama, or any LiteLLM provider:
```
SIFT_ANTHROPIC_API_KEY=sk-ant-...
SIFT_MISTRAL_API_KEY=...
```

Settings priority: CLI flags > env vars > `.env` > `sift.yaml` > defaults. You can override anything from `sift.yaml` with a flag on any command.
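
The precedence order above behaves like a layered lookup where the first source that defines a key wins. A minimal sketch using `collections.ChainMap`, assuming each source has already been parsed into a dict. The `resolve_settings` helper and the sample values are hypothetical and are not sift-kg's actual loader:

```python
from collections import ChainMap


def resolve_settings(cli_flags, env_vars, dotenv, sift_yaml, defaults):
    """Merge setting sources; earlier arguments win over later ones."""
    # Drop unset (None) values so they don't mask lower-priority sources.
    layers = [
        {k: v for k, v in layer.items() if v is not None}
        for layer in (cli_flags, env_vars, dotenv, sift_yaml, defaults)
    ]
    return dict(ChainMap(*layers))


settings = resolve_settings(
    cli_flags={"model": None, "ocr": True},       # only --ocr was passed
    env_vars={},
    dotenv={},
    sift_yaml={"model": "openai/gpt-4o-mini", "ocr": False},
    defaults={"model": "placeholder-model", "ocr": False, "domain": "default"},
)
print(settings["model"], settings["ocr"], settings["domain"])
# → openai/gpt-4o-mini True default
```

Here the flag overrides `ocr` from `sift.yaml`, while `model` falls through to the project config because no higher-priority source sets it.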

### 2. Extract entities and relations

```bash
sift extract ./my-documents/
sift extract ./my-documents/ --ocr    # for scanned PDFs
```

Reads PDFs, DOCX, text files, and HTML. Extracts entities and relations using your configured LLM. Results saved as JSON in `output/extractions/`.

The `--ocr` flag enables Google Cloud Vision OCR for scanned PDFs (requires `pip install sift-kg[ocr]` and [Google Cloud credentials](https://cloud.google.com/docs/authentication/application-default-credentials)). It autodetects which PDFs need it: text-rich PDFs use pdfplumber as usual, and only near-empty pages fall back to OCR, so it is safe for mixed folders. Without `--ocr`, sift will warn if a PDF appears to be scanned. You can also set `ocr: true` in `sift.yaml` for projects that always need it.
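
The fallback rule can be pictured as a per-page check on how much text the regular extractor returned. This is a rough sketch, not sift-kg's detection code: the `pages_needing_ocr` helper and the `min_chars=20` cutoff are assumptions made for illustration.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return indexes of pages whose extracted text is near-empty."""
    return [
        i for i, text in enumerate(page_texts)
        # Treat a missing result (None) the same as an empty string.
        if len((text or "").strip()) < min_chars
    ]


pages = ["Deposition of the witness, taken on the record.", "", "  \n", None]
print(pages_needing_ocr(pages))  # → [1, 2, 3]
```

Only the flagged pages would be sent to OCR; the text-rich first page keeps its locally extracted text.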

### 3. Build the knowledge graph

```bash
sift build
```

Constructs a NetworkX graph from all extractions. Automatically deduplicates near-identical entity names (plurals, Unicode variants, case differences) before they become graph nodes. Fixes reversed edge directions when the LLM swaps source/target types vs. the domain schema. Flags low-confidence relations for review. Saves to `output/graph_data.json`.
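
To illustrate the direction fix: if a relation's endpoint types only match the domain schema when reversed, the endpoints are swapped. The sketch below assumes relations and the schema are plain dicts; `fix_direction` and the field names are hypothetical, not the package's internals.

```python
def fix_direction(relation, entity_types, schema):
    """Swap source/target when the types only fit the schema in reverse."""
    rule = schema.get(relation["type"])
    if rule is None:
        return relation  # unknown relation type: leave untouched
    src_type = entity_types[relation["source"]]
    tgt_type = entity_types[relation["target"]]
    fits = src_type in rule["source_types"] and tgt_type in rule["target_types"]
    fits_reversed = (
        tgt_type in rule["source_types"] and src_type in rule["target_types"]
    )
    if not fits and fits_reversed:
        return {**relation, "source": relation["target"], "target": relation["source"]}
    return relation


schema = {"EMPLOYED_BY": {"source_types": ["PERSON"], "target_types": ["COMPANY"]}}
entity_types = {"person:alice_smith": "PERSON", "company:acme_corp": "COMPANY"}
swapped = {
    "type": "EMPLOYED_BY",
    "source": "company:acme_corp",   # the LLM put the company first
    "target": "person:alice_smith",
}
fixed = fix_direction(swapped, entity_types, schema)
print(fixed["source"], "->", fixed["target"])
# → person:alice_smith -> company:acme_corp
```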

### 4. Resolve duplicate entities

See [Entity Resolution Workflow](#entity-resolution-workflow) below for the full guide — especially important for genealogy, legal, and investigative use cases where accuracy matters.

### 5. Explore and export

**Interactive viewer** — for exploration and investigation:

```bash
sift view                     # → opens output/graph.html in your browser
```

Opens a force-directed graph in your browser with color-coded entity types, semantic edge colors, search, type/community/relation toggles, a degree filter, and a detail sidebar. Smart defaults hide low-signal edges (MENTIONED_IN) and low-degree nodes on first load so you start with a readable graph.
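
The "smart defaults" amount to two filters applied before the first render. The sketch below shows the idea in Python under stated assumptions: edges are `(source, relation, target)` tuples, degree is counted after hiding low-signal edges, and `min_degree=2` is an invented cutoff. The viewer's real logic and thresholds may differ.

```python
from collections import Counter


def default_view(edges, hidden_relations=("MENTIONED_IN",), min_degree=2):
    """Return the nodes and edges shown on first load."""
    # 1. Drop low-signal relation types.
    visible_edges = [e for e in edges if e[1] not in hidden_relations]
    # 2. Count each node's degree over the remaining edges.
    degree = Counter()
    for source, _, target in visible_edges:
        degree[source] += 1
        degree[target] += 1
    # 3. Keep well-connected nodes and the edges between them.
    nodes = {n for n, d in degree.items() if d >= min_degree}
    kept = [e for e in visible_edges if e[0] in nodes and e[2] in nodes]
    return nodes, kept


edges = [
    ("alice", "ASSOCIATED_WITH", "bob"),
    ("alice", "MEMBER_OF", "acme"),
    ("bob", "MEMBER_OF", "acme"),
    ("carol", "ASSOCIATED_WITH", "alice"),
    ("alice", "MENTIONED_IN", "doc1"),
    ("bob", "MENTIONED_IN", "doc1"),
]
nodes, kept = default_view(edges)
print(sorted(nodes))  # → ['acme', 'alice', 'bob']
```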

**Focus mode:** Double-click any entity to isolate its neighborhood. Use arrow keys to step through connections one by one — each pair is shown in isolation with labeled edges. Press Enter/Right to shift focus to a neighbor, Backspace/Left to go back along your path, Escape to exit. This is the intended way to explore dense graphs — zoom in on what matters, trace connections, read the evidence.

**CLI search** — query entities directly from the terminal:

```bash
sift search "Sam Bankman"          # search by name
sift search "SBF"                  # search by alias
sift search "Caroline" -r          # show relations
sift search "FTX" -d -t ORGANIZATION  # descriptions + type filter
```

**Static exports** — for analysis tools where you want custom layout, filtering, or styling:

```bash
sift export graphml           # → output/graph.graphml (Gephi, yEd, Cytoscape)
sift export gexf              # → output/graph.gexf (Gephi native)
sift export csv               # → output/csv/entities.csv + relations.csv
sift export json              # → output/graph.json
```

Use GraphML/GEXF when you want to control node sizing, edge weighting, custom color schemes, or apply graph algorithms (centrality, community detection) in dedicated tools.

### 6. Generate narrative

```bash
sift narrate
sift narrate --communities-only   # regenerate community labels only (~$0.01)
```

Produces `output/narrative.md` — an investigative-style report with an overview, key relationship chains between top entities, a timeline (when dates exist in the data), and entity profiles grouped by thematic community (discovered via Louvain community detection). Entity descriptions are written in active voice with specific actions, not role summaries.

## Domain Configuration

sift-kg ships with three bundled domains (see [Bundled Domains](#bundled-domains) above for details).

Use a bundled domain:
```bash
sift extract ./docs/ --domain-name osint
```

Or create your own `domain.yaml`:
```yaml
name: My Domain
entity_types:
  PERSON:
    description: People and individuals
    extraction_hints:
      - Look for full names with titles
  COMPANY:
    description: Business entities
relation_types:
  EMPLOYED_BY:
    description: Employment relationship
    source_types: [PERSON]
    target_types: [COMPANY]
  OWNS:
    description: Ownership relationship
    symmetric: false
    review_required: true
```

```bash
sift extract ./docs/ --domain path/to/domain.yaml
```
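
A custom domain is easy to get subtly wrong, for example by referencing an entity type in `source_types` that was never declared. The check below is an illustrative sketch that works on the parsed YAML as a plain dict. `undefined_types` is a hypothetical helper, and whether sift-kg performs an equivalent validation itself is not stated here.

```python
def undefined_types(domain):
    """List (relation, type) pairs that reference undeclared entity types."""
    declared = set(domain.get("entity_types", {}))
    problems = []
    for rel_name, rel in domain.get("relation_types", {}).items():
        for key in ("source_types", "target_types"):
            for type_name in rel.get(key, []):
                if type_name not in declared:
                    problems.append((rel_name, type_name))
    return problems


domain = {
    "name": "My Domain",
    "entity_types": {
        "PERSON": {"description": "People and individuals"},
        "COMPANY": {"description": "Business entities"},
    },
    "relation_types": {
        "EMPLOYED_BY": {"source_types": ["PERSON"], "target_types": ["COMPNAY"]},  # typo
        "OWNS": {"symmetric": False, "review_required": True},
    },
}
print(undefined_types(domain))  # → [('EMPLOYED_BY', 'COMPNAY')]
```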

## Library API

Use sift-kg from Python — Jupyter notebooks, scripts, web apps:

```python
from sift_kg import load_domain, run_extract, run_build, run_narrate, export_graph
from sift_kg import KnowledgeGraph
from pathlib import Path

# Load domain and run extraction
domain = load_domain()  # or load_domain(bundled_name="osint")
results = run_extract(Path("./docs"), "openai/gpt-4o-mini", domain, Path("./output"))

# Build graph
kg = run_build(Path("./output"), domain)
print(f"{kg.entity_count} entities, {kg.relation_count} relations")

# Export
export_graph(kg, Path("./output/graph.graphml"), "graphml")

# Or run the full pipeline
from sift_kg import run_pipeline
run_pipeline(Path("./docs"), "openai/gpt-4o-mini", domain, Path("./output"))
```

## Project Structure

After running the pipeline, your output directory contains:

```
output/
├── extractions/               # Per-document extraction JSON
│   ├── document1.json
│   └── document2.json
├── graph_data.json            # Knowledge graph (native format)
├── merge_proposals.yaml       # Entity merge proposals (DRAFT/CONFIRMED/REJECTED)
├── relation_review.yaml       # Flagged relations for review
├── narrative.md               # Generated narrative summary
├── entity_descriptions.json   # Entity descriptions (loaded by viewer)
├── communities.json           # Community assignments (shared by narrate + viewer)
├── graph.html                 # Interactive graph visualization
├── graph.graphml              # GraphML export (if exported)
├── graph.gexf                 # GEXF export (if exported)
└── csv/                       # CSV export (if exported)
    ├── entities.csv
    └── relations.csv
```

## Entity Resolution Workflow

When you're building a knowledge graph from family records, legal filings, or any documents where accuracy matters, you want full control over which entities get merged. sift-kg never merges anything without your approval.

The workflow has four layers. The first three each catch a different kind of duplicate; the fourth applies your decisions:

### Layer 1: Automatic Pre-Dedup (during `sift build`)

Before entities become graph nodes, sift deterministically collapses names that are obviously the same. No LLM involved, no cost, no review needed:

- **Unicode normalization** — "José García" and "Jose Garcia" become one node
- **Title stripping** — "Detective Joe Recarey" and "Joe Recarey" merge (strips ~35 common prefixes: Dr., Mr., Judge, Senator, etc.)
- **Singularization** — "Companies" and "Company" merge
- **Fuzzy string matching** — [SemHash](https://github.com/MinishLab/semhash) at 0.95 threshold catches near-identical strings like "MacAulay" vs "Mac Aulay"

This happens automatically every time you run `sift build`. These are the trivial cases — spelling variants that would clutter your graph without adding information.
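
The first two rules can be approximated with the standard library by reducing each name to a comparison key. This is a simplified sketch, not the package's implementation: `dedup_key` is a hypothetical helper, the title list is a small illustrative subset, and singularization and SemHash fuzzy matching are left out.

```python
import re
import unicodedata

TITLES = {"dr", "mr", "mrs", "ms", "judge", "senator", "detective"}  # illustrative subset


def dedup_key(name):
    """Build a comparison key: fold accents and case, drop a leading title."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = re.findall(r"[a-z0-9]+", ascii_name.lower())
    if len(words) > 1 and words[0] in TITLES:
        words = words[1:]
    return " ".join(words)


print(dedup_key("José García") == dedup_key("Jose Garcia"))            # → True
print(dedup_key("Detective Joe Recarey") == dedup_key("Joe Recarey"))  # → True
```

Names that share a key would collapse into one node; names with different keys are left for the later layers.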

### Layer 2: LLM Proposes Merges (during `sift resolve`)

The LLM sees batches of entities (all types except DOCUMENT) and identifies ones that likely refer to the same real-world thing. It also detects cross-type duplicates (same name, different entity type) and proposes variant relationships (EXTENDS) when it finds parent/child patterns. Results go to `merge_proposals.yaml` (entity merges) and `relation_review.yaml` (variant relations), all starting as `DRAFT`:

```bash
sift resolve                  # uses domain from sift.yaml
sift resolve --domain-name osint   # or specify a bundled domain explicitly
```

If you have a domain configured, the LLM uses that context to make better judgments about entity names specific to your field.

This generates proposals like:

```yaml
proposals:
- canonical_id: person:samuel_benjamin_bankman_fried
  canonical_name: Samuel Benjamin Bankman-Fried
  entity_type: PERSON
  status: DRAFT                    # ← you decide
  members:
  - id: person:bankman_fried
    name: Bankman-Fried
    confidence: 0.99
  reason: Same person referenced with full name vs. surname only.

- canonical_id: person:stephen_curry
  canonical_name: Stephen Curry
  entity_type: PERSON
  status: DRAFT                    # ← you decide
  members:
  - id: person:steph_curry
    name: Steph Curry
    confidence: 0.99
  reason: Same basketball player referenced with nickname 'Steph' and full name 'Stephen'.
```

**Nothing is merged yet.** The LLM is proposing, not deciding.

### Layer 3: You Review and Decide

You have two options for reviewing proposals:

**Option A: Interactive terminal review**

```bash
sift review
```

Walks through each `DRAFT` proposal one by one. For each, you see the canonical entity, the proposed merge members, and the LLM's confidence and reasoning. You approve, reject, or skip.

High-confidence proposals (>0.85 by default) are auto-approved, and low-confidence relations (<=0.5 by default) are auto-rejected:
```bash
sift review                        # uses defaults: --auto-approve 0.85, --auto-reject 0.5
sift review --auto-approve 0.90    # raise the auto-approve threshold
sift review --auto-reject 0.3      # lower the auto-reject threshold
sift review --auto-approve 1.0     # disable auto-approve, review everything manually
```
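
The threshold semantics can be summarized in a few lines. The sketch assumes both thresholds are applied to a single confidence value and uses the comparison operators quoted above (`>` for approve, `<=` for reject). `triage` is a hypothetical helper, not part of the CLI.

```python
def triage(confidence, auto_approve=0.85, auto_reject=0.5):
    """Classify one item by its confidence score."""
    if confidence > auto_approve:
        return "CONFIRMED"
    if confidence <= auto_reject:
        return "REJECTED"
    return "DRAFT"  # left for manual review


for score in (0.99, 0.85, 0.7, 0.5):
    print(score, triage(score))
# → 0.99 CONFIRMED
# → 0.85 DRAFT
# → 0.7 DRAFT
# → 0.5 REJECTED
```

Because the approve comparison is strict, a threshold of `1.0` can never be exceeded, which is why `--auto-approve 1.0` sends every proposal to manual review.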

**Option B: Edit the YAML directly**

Open `output/merge_proposals.yaml` in any text editor. Change `status: DRAFT` to `CONFIRMED` or `REJECTED`:

```yaml
- canonical_id: person:stephen_curry
  canonical_name: Stephen Curry
  entity_type: PERSON
  status: CONFIRMED                # ← approve this merge
  members:
  - id: person:steph_curry
    name: Steph Curry
    confidence: 0.99
  reason: Same basketball player...

- canonical_id: person:winklevoss_twins
  canonical_name: Winklevoss twins
  entity_type: PERSON
  status: REJECTED                 # ← these are distinct people, don't merge
  members:
  - id: person:cameron_winklevoss
    name: Cameron Winklevoss
    confidence: 0.95
  reason: ...
```

**For high-accuracy use cases** (genealogy, legal review), we recommend editing the YAML directly so you can study each proposal carefully. The file is designed to be human-readable.

### Layer 3b: Relation Review

During `sift build`, relations below the confidence threshold (default 0.7) or of types marked `review_required` in your domain config get flagged in `output/relation_review.yaml`:

```yaml
review_threshold: 0.7
relations:
- source_name: Alice Smith
  target_name: Acme Corp
  relation_type: WORKS_FOR
  confidence: 0.45
  evidence: "Alice mentioned she used to work near the Acme building."
  status: DRAFT                    # ← you decide: CONFIRMED or REJECTED
  flag_reason: Low confidence (0.45 < 0.7)
```

Same workflow: review with `sift review` or edit the YAML, then apply.
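
The two flagging conditions can be sketched as a single function. The field names mirror the YAML above, but `flag_reason` itself and the wording of the `review_required` message are assumptions for illustration.

```python
def flag_reason(relation, relation_types, review_threshold=0.7):
    """Return why a relation needs review, or None if it passes."""
    rule = relation_types.get(relation["relation_type"], {})
    if rule.get("review_required"):
        # Message wording is hypothetical.
        return f"Relation type {relation['relation_type']} requires review"
    if relation["confidence"] < review_threshold:
        return f"Low confidence ({relation['confidence']} < {review_threshold})"
    return None


relation_types = {"OWNS": {"review_required": True}, "WORKS_FOR": {}}
weak = {"relation_type": "WORKS_FOR", "confidence": 0.45}
print(flag_reason(weak, relation_types))
# → Low confidence (0.45 < 0.7)
```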

### Layer 4: Apply Your Decisions

Once you've reviewed everything:

```bash
sift apply-merges
```

This does three things:
1. **Confirmed entity merges** — member entities are absorbed into the canonical entity. All their relations are rewired. Source documents are combined. The member nodes are removed.
2. **Rejected relations** — removed from the graph entirely.
3. **DRAFT proposals** — left untouched. You can come back to them later.

The graph is saved back to `output/graph_data.json`. You can re-export, narrate, or visualize the cleaned graph.
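
What a confirmed merge does to the graph can be shown with plain dicts and tuples. This is a sketch under assumptions: the node and edge shapes are invented for the example, and dropping self-loops and duplicate edges after rewiring is a design choice of the sketch rather than documented behavior.

```python
def apply_merge(nodes, edges, canonical_id, member_ids):
    """Absorb member nodes into the canonical node and rewire their edges."""
    members = set(member_ids)
    canonical = nodes[canonical_id]
    for member_id in members:
        member = nodes.pop(member_id)  # the member node is removed
        # Combine source documents without duplicates, keeping order.
        for doc in member["source_documents"]:
            if doc not in canonical["source_documents"]:
                canonical["source_documents"].append(doc)
    rewired = []
    for source, relation, target in edges:
        source = canonical_id if source in members else source
        target = canonical_id if target in members else target
        edge = (source, relation, target)
        if source != target and edge not in rewired:  # skip self-loops, duplicates
            rewired.append(edge)
    return rewired


nodes = {
    "person:alice_smith": {"source_documents": ["a.pdf"]},
    "person:a_smith": {"source_documents": ["a.pdf", "b.pdf"]},
    "company:acme_corp": {"source_documents": ["b.pdf"]},
}
edges = [
    ("person:alice_smith", "EMPLOYED_BY", "company:acme_corp"),
    ("person:a_smith", "EMPLOYED_BY", "company:acme_corp"),
]
rewired = apply_merge(nodes, edges, "person:alice_smith", ["person:a_smith"])
print(sorted(nodes))
# → ['company:acme_corp', 'person:alice_smith']
print(rewired)
# → [('person:alice_smith', 'EMPLOYED_BY', 'company:acme_corp')]
```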

### Iterating

Entity resolution isn't always one-pass. After merging, new duplicates may become apparent. You can re-run:

```bash
sift resolve                  # find new duplicates in the cleaned graph
sift review                   # review the new proposals
sift apply-merges             # apply again
```

Each run is additive — previous `CONFIRMED`/`REJECTED` decisions in `merge_proposals.yaml` are preserved.

### Recommended Workflow by Use Case

| Use Case | Suggested Approach |
|---|---|
| **Quick exploration** | `sift review --auto-approve 0.85` — approve high-confidence, review the rest |
| **Genealogy / family records** | Edit YAML manually, `--auto-approve 1.0` — review every single merge |
| **Legal / investigative** | `sift resolve --embeddings`, edit YAML manually, use `sift view` to inspect between rounds |
| **Large corpus (1000+ entities)** | `sift resolve --embeddings` for better batching, then interactive review |

## Deduplication Internals

The pre-dedup and LLM batching techniques are inspired by [KGGen](https://github.com/stochastic-sisyphus/KGGen) (NeurIPS 2025) by [@stochastic-sisyphus](https://github.com/stochastic-sisyphus). KGGen uses SemHash for deterministic entity deduplication and embedding-based clustering for grouping entities before LLM comparison. sift-kg adapts these into its human-in-the-loop review workflow.

### Embedding-Based Clustering (optional)

By default, `sift resolve` sorts entities alphabetically and splits them into overlapping batches for LLM comparison. This works well when duplicates have similar spelling, but "Robert Smith" (R) and "Bob Smith" (B) end up in different batches and never get compared. To catch those cases, install the optional embeddings extra and pass `--embeddings`:

```bash
pip install sift-kg[embeddings]    # sentence-transformers + scikit-learn (~2GB, pulls PyTorch)
sift resolve --embeddings
```

This replaces alphabetical batching with KMeans clustering on sentence embeddings (all-MiniLM-L6-v2). Semantically similar names cluster together regardless of spelling.

| | Default (alphabetical) | `--embeddings` |
|---|---|---|
| Install size | Included | ~2GB (PyTorch) |
| First-run overhead | None | ~90MB model download |
| Per-run overhead | Sorting only | Encoding (<1s for hundreds of entities) |
| Cross-alphabet duplicates | Missed if in different batches | Caught |
| Small graphs (<100/type) | Same result | Same result |

Falls back to alphabetical batching if dependencies aren't installed or clustering fails.
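
The limitation of the default strategy is easy to reproduce. The sketch below sorts names and cuts them into overlapping batches. `overlapping_batches`, the batch size of 4, and the overlap of 1 are assumptions chosen to keep the example small, not sift-kg's actual parameters.

```python
def overlapping_batches(names, batch_size=4, overlap=1):
    """Sort names and split them into batches that share `overlap` items."""
    ordered = sorted(names, key=str.lower)
    step = batch_size - overlap  # requires batch_size > overlap
    batches = []
    for start in range(0, len(ordered), step):
        batches.append(ordered[start:start + batch_size])
        if start + batch_size >= len(ordered):
            break  # the last batch already reaches the end
    return batches


names = ["Bob Smith", "Alice Jones", "Carol White", "Dan Brown",
         "Erin Green", "Frank Black", "Robert Smith"]
batches = overlapping_batches(names)
for batch in batches:
    print(batch)
# → ['Alice Jones', 'Bob Smith', 'Carol White', 'Dan Brown']
# → ['Dan Brown', 'Erin Green', 'Frank Black', 'Robert Smith']
```

"Bob Smith" and "Robert Smith" never share a batch, so an LLM comparing one batch at a time has no chance to propose merging them. Clustering on embeddings groups by meaning instead of first letter.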

## License

MIT
