Metadata-Version: 2.4
Name: mcp-fact-finder
Version: 0.1.2
Summary: Semantic fact indexer over markdown knowledge bases — extracts atomic facts, entities, and authority scores, then exposes search via MCP
Project-URL: Homepage, https://github.com/bobmatnyc/mcp-fact-finder
Project-URL: Repository, https://github.com/bobmatnyc/mcp-fact-finder
Project-URL: Bug Tracker, https://github.com/bobmatnyc/mcp-fact-finder/issues
Author-email: Robert Matsuoka <bob@matsuoka.com>
License: Copyright (c) 2024-2026 Robert Matsuoka
        
        All rights reserved.
        
        This software and associated documentation files (the "Software") are proprietary
        and confidential.  Redistribution, use, or modification in source or binary forms
        is permitted for personal and commercial use by the copyright holder only.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
License-File: LICENSE
Keywords: entity-resolution,fact-extraction,knowledge-base,lancedb,llm,local-inference,markdown,mcp,semantic-search
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.12
Requires-Dist: aioboto3>=13.0
Requires-Dist: dateparser>=1.2
Requires-Dist: httpx>=0.27
Requires-Dist: instructor>=1.0
Requires-Dist: lancedb>=0.20
Requires-Dist: mcp>=1.0
Requires-Dist: nltk>=3.8
Requires-Dist: numpy>=1.26
Requires-Dist: pyarrow>=15.0
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: rich>=13.0
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: sentencepiece>=0.2.1
Requires-Dist: tqdm>=4.0
Requires-Dist: transformers>=5.3.0
Requires-Dist: watchdog>=4.0
Provides-Extra: dev
Requires-Dist: mypy>=1.9; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: web
Requires-Dist: fastapi>=0.111; extra == 'web'
Requires-Dist: uvicorn[standard]>=0.29; extra == 'web'
Description-Content-Type: text/markdown

# mcp-fact-finder

Semantic fact indexer over markdown knowledge bases.

Crawls markdown documents, extracts atomic facts using **local NLP (free, no API key)**
or an optional LLM enrichment pass, stores facts in LanceDB with authority scoring and
entity relationships, and exposes structured search via the Model Context Protocol.

---

## Why local inference?

> **Pass 1 (rebel-large NLP) is completely free.**

No API keys, no rate limits, no cost.  Runs locally on CPU or Apple Silicon MPS
(Metal Performance Shaders).  A full 7,600-document knowledge base indexes in ~5 hours
at $0.

**Pass 2 (LLM enrichment) is optional.**  It adds canonical fact forms, tense
classification, and better entity linking.  Use a local Ollama instance (free) or
OpenRouter (~$5–9 for 7,600 docs).

> **Recommendation:** Run Pass 1 first.  For most knowledge bases, Pass 1 alone
> is sufficient for useful search results.

---

## Status: v0.1.2

| Component             | Status   |
|-----------------------|----------|
| Data models           | ✅ Complete |
| Markdown crawler      | ✅ Complete |
| Authority scoring     | ✅ Complete |
| rebel-large extractor | ✅ Complete |
| Ollama extractor      | ✅ Complete |
| OpenRouter/Bedrock    | ✅ Complete |
| Embedder (bge-small)  | ✅ Complete |
| LanceDB store         | ✅ Complete |
| MCP server (6 tools)  | ✅ Complete |
| Inconsistency checker | ✅ Complete |
| Indexing pipeline     | ✅ Complete |
| Setup script          | ✅ Complete |

---

## Quick start

### 1. Install

```bash
pip install mcp-fact-finder
# or
uv add mcp-fact-finder
```

Or clone and develop locally:

```bash
git clone https://github.com/bobmatnyc/mcp-fact-finder
cd mcp-fact-finder
uv sync
```

### 2. Index your knowledge base (Pass 1 — free, no API key needed)

```bash
# Index a directory of markdown files
uv run scripts/index.py --path ~/my-docs/

# First 100 docs only (quick smoke test)
uv run scripts/index.py --path ~/my-docs/ --limit 100

# Show current stats
uv run scripts/index.py --path ~/my-docs/ --stats
```

The index is stored in `{path}/.mcp-fact-finder/db` — co-located with your docs,
trivial to delete and rebuild.  Incremental re-runs only process new or changed files
(content-hash based — moving a file does not trigger reindexing).
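
The content-hash check described above can be sketched in a few lines. This is a minimal illustration, not the project's actual crawler: the SHA-256 digest, the `*.md` glob, and the function names `content_hash` and `files_to_index` are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return a SHA-256 hex digest of the file's bytes; the path plays no part."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_index(root: Path, seen_hashes: set[str]) -> list[Path]:
    """Return markdown files under root whose content hash is not yet indexed."""
    pending = []
    for md_file in sorted(root.rglob("*.md")):
        if content_hash(md_file) not in seen_hashes:
            pending.append(md_file)
    return pending
```

Because identity depends only on the bytes, a renamed or moved file hashes to the same value and is skipped, while any edit to the content produces a new hash and puts the file back in the queue.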

### 3. Set up Claude Code MCP integration

```bash
# In the project you want to query:
cd ~/my-project
uv run --project ~/Projects/mcp-fact-finder scripts/setup.py
```

The setup script creates `.mcp.json` (Claude Code MCP config) and
`.claude/skills/mcp-fact-finder.md` (explains when/how to use each tool),
then adds `.mcp-fact-finder/` to `.gitignore`.  Restart Claude Code to activate.

### 4. Optional: LLM enrichment (Pass 2)

```bash
# Via local Ollama (free, no API key):
FACT_FINDER_INFERENCE_BACKEND=ollama \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force

# Via OpenRouter (~$5–9 for 7,600 docs):
FACT_FINDER_OPENROUTER_API_KEY=sk-... \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force --workers 150
```

---

## MCP Tools

Six tools are exposed to Claude Code once the index is built and the MCP server is running:

| Tool | Description |
|------|-------------|
| `search_facts` | Natural language semantic + keyword search, authority-ranked |
| `get_document_facts` | All facts extracted from one source document |
| `compare_sources` | How different docs describe the same topic |
| `get_entity_facts` | All facts about a named entity (person, system, concept) |
| `check_inconsistencies` | Contradictions and conflicts for an entity |
| `get_conflict_report` | Detailed explanation of a specific pairwise inconsistency |

### Example queries

```
"What was decided about the Pricerator architecture?"
→ search_facts(query="Pricerator architecture", fact_type="decided", min_authority=0.8)

"What do all documents say about latency?"
→ compare_sources(topic="P90 latency")

"Everything known about Thomas Evans"
→ get_entity_facts(entity="Thomas Evans")

"Are there contradictions about the GCP migration?"
→ check_inconsistencies(entity="GCP migration")
```

---

## Authority scoring

Facts are ranked by source document type × recency × LLM confidence:

| Score | Document type |
|-------|--------------|
| 1.0   | Technical specs, RFCs, ADRs, TDDs |
| 0.8   | Product requirements (PRDs) |
| 0.5   | Meeting notes, retros, sprint notes |
| 0.3   | Summaries, digests, inferred content |

Pass `min_authority=0.8` to `search_facts` to restrict results to high-authority sources.
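
To make the three-factor ranking concrete, here is a rough sketch of how such a score could be computed. The tier values come from the table above; everything else is an assumption for illustration: the filename keywords, the one-year half-life used for recency, the plain product used to combine the factors, and the function names themselves.

```python
import re
from datetime import date

# Tier scores come from the table above. The keyword sets are illustrative
# guesses, not the project's actual filename rules.
TIER_KEYWORDS: list[tuple[float, frozenset[str]]] = [
    (1.0, frozenset({"spec", "rfc", "adr", "tdd"})),
    (0.8, frozenset({"prd", "requirements"})),
    (0.5, frozenset({"meeting", "retro", "sprint"})),
    (0.3, frozenset({"summary", "digest"})),
]
DEFAULT_TIER = 0.3  # assumption: unrecognized documents get the lowest tier


def document_tier(filename: str) -> float:
    """Map a filename to a tier score by matching whole keyword tokens."""
    tokens = set(re.split(r"[^a-z0-9]+", filename.lower()))
    for score, keywords in TIER_KEYWORDS:
        if tokens & keywords:
            return score
    return DEFAULT_TIER


def recency_factor(modified: date, today: date, half_life_days: float = 365.0) -> float:
    """Exponential decay: 1.0 for a document changed today, 0.5 after one half-life."""
    age_days = max((today - modified).days, 0)
    return 0.5 ** (age_days / half_life_days)


def authority(filename: str, modified: date, today: date, confidence: float = 1.0) -> float:
    """Combine document tier, recency, and confidence as a plain product."""
    return document_tier(filename) * recency_factor(modified, today) * confidence
```

Under these assumptions, a PRD last modified a year ago with an extraction confidence of 0.9 scores roughly 0.8 × 0.5 × 0.9 = 0.36, so an old high-tier document can rank below a fresh lower-tier one.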

---

## Fact types

| Type | Pattern | Example |
|------|---------|---------|
| `is` | X is / has Y | "Pricerator is the pricing engine" |
| `said` | X stated / believes | "Thomas said the migration was overdue" |
| `happened` | X occurred (past) | "GCP migration completed Q3 2024" |
| `planned` | X will / intends | "Team plans to deprecate the monolith" |
| `decided` | It was decided | "Team decided to use LanceDB" |
| `metric` | X = N (quantified) | "P90 latency is 340ms" |

---

## Architecture

```
Markdown files
    │
    ▼
Crawler (discover_documents)
  • Confluence / YAML frontmatter parsing
  • Authority tier from filename keywords
  • Content-hash document identity (stable across moves)
    │
    ▼
Pass 1: rebel-large NLP  [FREE — local CPU/MPS]
  • Relation extraction (SPO triples)
  • ~20 docs/min on Apple M-series
    │   (optional)
    ▼
Pass 2: LLM enrichment  [Ollama local = free | OpenRouter ~$5–9]
  • Canonical fact forms + tense + fact_type
  • Better entity linking
    │
    ▼
LanceDB  (vector + FTS hybrid search)
  • facts, entities, documents, inconsistencies tables
    │
    ▼
MCP Server  (stdio transport, Claude Code)
  • 6 structured search tools
```

---

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `FACT_FINDER_DB_PATH` | `{corpus}/.mcp-fact-finder/db` | LanceDB path |
| `FACT_FINDER_INFERENCE_BACKEND` | `rebel` | `rebel` \| `ollama` \| `openrouter` \| `bedrock` |
| `FACT_FINDER_OPENROUTER_API_KEY` | — | OpenRouter key (Pass 2) |
| `FACT_FINDER_OPENROUTER_MODEL_ID` | `google/gemini-2.0-flash-lite-001` | Pass 2 model |
| `FACT_FINDER_BEDROCK_REGION` | `us-east-1` | AWS Bedrock region |

Copy `.env.example` to `.env` and fill in as needed.

---

## Development

```bash
uv run pytest          # run tests
uv run ruff check src/ # lint
uv run mypy src/       # type check

make build             # build wheel + sdist
make publish           # bump patch, publish to PyPI, tag, push
make publish-minor     # bump minor version
```

See `docs/architecture.md` for the full design document.

---

## License

Proprietary. See LICENSE.
