Metadata-Version: 2.4
Name: mcp-fact-finder
Version: 0.1.3
Summary: Semantic fact indexer over markdown knowledge bases — extracts atomic facts, entities, and authority scores, then exposes search via MCP
Project-URL: Homepage, https://github.com/bobmatnyc/mcp-fact-finder
Project-URL: Repository, https://github.com/bobmatnyc/mcp-fact-finder
Project-URL: Bug Tracker, https://github.com/bobmatnyc/mcp-fact-finder/issues
Author-email: Robert Matsuoka <bob@matsuoka.com>
License: Copyright (c) 2024-2026 Robert Matsuoka
        
        Elastic License 2.0
        
        URL: https://www.elastic.co/licensing/elastic-license
        
        ## Acceptance
        
        By using the software, you agree to all of the terms and conditions below.
        
        ## Copyright License
        
        The licensor grants you a non-exclusive, royalty-free, worldwide,
        non-sublicensable, non-transferable license to use, copy, distribute, make
        available, and prepare derivative works of the software, in each case subject
        to the limitations and conditions below.
        
        ## Limitations
        
        You may not provide the software to third parties as a hosted or managed
        service, where the service provides users with access to any substantial set of
        the features or functionality of the software.
        
        You may not move, change, disable, or circumvent the license key functionality
        in the software, and you may not remove or obscure any functionality in the
        software that is protected by the license key.
        
        You may not alter, remove, or obscure any licensing, copyright, or other
        notices of the licensor in the software. Any use of the licensor's trademarks
        is subject to applicable law.
        
        ## Patents
        
        The licensor grants you a license, under any patent claims the licensor can
        license, or becomes able to license, to make, have made, use, sell, offer for
        sale, import and have imported the software, in each case subject to the
        limitations and conditions in this license. This license does not cover any
        patent claims that you cause to be infringed by modifications or additions to
        the software. If you or your company make any written claim that the software
        infringes or contributes to infringement of any patent, your patent license for
        the software granted under these terms ends immediately.
        
        ## Notices
        
        You must ensure that anyone who gets a copy of any part of the software from
        you also gets a copy of these terms.
        
        If you modify the software, you must include in any modified copies of the
        software prominent notices stating that you have modified the software.
        
        ## No Other Rights
        
        These terms do not imply any licenses other than those expressly granted in
        these terms.
        
        ## Termination
        
        If you use the software in violation of these terms, such use is not licensed,
        and your licenses will automatically terminate. If the licensor provides you
        with a notice of your violation, and you cease all violation of this license no
        later than 30 days after you receive that notice, your licenses will be
        reinstated retroactively. However, if you violate these terms after such
        reinstatement, any additional violation of these terms will cause your licenses
        to terminate automatically and permanently.
        
        ## No Liability
        
        As far as the law allows, the software comes as is, without any warranty or
        condition, and the licensor will not be liable to you for any damages arising
        out of these terms or the use or nature of the software, under any kind of
        legal claim.
        
        ## Definitions
        
        The "licensor" is the entity offering these terms, and the "software" is the
        software the licensor makes available under these terms, including any portion
        of it.
        
        "you" refers to the individual or entity agreeing to these terms.
        
        "your company" is any legal entity, sole proprietorship, or other kind of
        organization that you work for, plus all organizations that have control over,
        are under the control of, or are under common control with that organization.
        "control" means ownership of substantially all the assets of an entity, or the
        power to direct its management and legal affairs.
        
        "your licenses" are all the licenses granted to you for the software under
        these terms.
        
        "use" includes copying, distributing, making available, or preparing derivative
        works of the software.
License-File: LICENSE
Keywords: entity-resolution,fact-extraction,knowledge-base,lancedb,llm,local-inference,markdown,mcp,semantic-search
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.12
Requires-Dist: aioboto3>=13.0
Requires-Dist: dateparser>=1.2
Requires-Dist: httpx>=0.27
Requires-Dist: instructor>=1.0
Requires-Dist: lancedb>=0.20
Requires-Dist: mcp>=1.0
Requires-Dist: nltk>=3.8
Requires-Dist: numpy>=1.26
Requires-Dist: pyarrow>=15.0
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: rich>=13.0
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: sentencepiece>=0.2.1
Requires-Dist: tqdm>=4.0
Requires-Dist: transformers>=5.3.0
Requires-Dist: watchdog>=4.0
Provides-Extra: dev
Requires-Dist: mypy>=1.9; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: web
Requires-Dist: fastapi>=0.111; extra == 'web'
Requires-Dist: uvicorn[standard]>=0.29; extra == 'web'
Description-Content-Type: text/markdown

# mcp-fact-finder

Semantic fact indexer over markdown knowledge bases.

Crawls markdown documents, extracts atomic facts using **local NLP (free, no API key)**
or an optional LLM enrichment pass, stores facts in LanceDB with authority scoring and
entity relationships, and exposes structured search via the Model Context Protocol.

---

## Why local inference?

> **Pass 1 (rebel-large NLP) is completely free.**

No API keys, no rate limits, no cost.  Runs locally on CPU or Apple Silicon MPS
(Metal Performance Shaders).  A full 7,600-document knowledge base indexes in ~5 hours
at $0.

**Pass 2 (LLM enrichment) is optional.**  It adds canonical fact forms, tense
classification, and better entity linking.  Use a local Ollama instance (free) or
OpenRouter (~$5–9 for 7,600 docs).

> **Recommendation:** Run Pass 1 first.  For most knowledge bases, Pass 1 alone
> is sufficient for useful search results.

---

## Status: v0.1.3

| Component             | Status   |
|-----------------------|----------|
| Data models           | ✅ Complete |
| Markdown crawler      | ✅ Complete |
| Authority scoring     | ✅ Complete |
| rebel-large extractor | ✅ Complete |
| Ollama extractor      | ✅ Complete |
| OpenRouter/Bedrock    | ✅ Complete |
| Embedder (bge-small)  | ✅ Complete |
| LanceDB store         | ✅ Complete |
| MCP server (6 tools)  | ✅ Complete |
| Inconsistency checker | ✅ Complete |
| Indexing pipeline     | ✅ Complete |
| Setup script          | ✅ Complete |

---

## Quick start

### 1. Install

```bash
pip install mcp-fact-finder
# or
uv add mcp-fact-finder
```

Or clone and develop locally:

```bash
git clone https://github.com/bobmatnyc/mcp-fact-finder
cd mcp-fact-finder
uv sync
```

### 2. Index your knowledge base (Pass 1 — free, no API key needed)

```bash
# Index a directory of markdown files
uv run scripts/index.py --path ~/my-docs/

# First 100 docs only (quick smoke test)
uv run scripts/index.py --path ~/my-docs/ --limit 100

# Show current stats
uv run scripts/index.py --path ~/my-docs/ --stats
```

The index is stored in `{path}/.mcp-fact-finder/db` — co-located with your docs,
trivial to delete and rebuild.  Incremental re-runs only process new or changed files
(content-hash based — moving a file does not trigger reindexing).
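
A minimal sketch of content-hash identity, to illustrate why moving a file does not trigger reindexing. The function names and the choice of SHA-256 are assumptions made for illustration; the project's actual hashing scheme may differ.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return a hex digest of the file's bytes (hypothetical document ID)."""
    # SHA-256 is an assumption; any stable content digest would behave the same way.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reindex(path: Path, indexed_hashes: set[str]) -> bool:
    """A file needs processing only if its content hash has not been seen before."""
    return content_hash(path) not in indexed_hashes
```

Because the identity depends only on the bytes, renaming or moving a file leaves its hash unchanged, while any edit to the content produces a new hash.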

### 3. Set up Claude Code MCP integration

```bash
# In the project you want to query
# (--project should point at your local mcp-fact-finder checkout; adjust the path):
cd ~/my-project
uv run --project ~/Projects/mcp-fact-finder scripts/setup.py
```

Creates `.mcp.json` (Claude Code MCP config) and
`.claude/skills/mcp-fact-finder.md` (explains when/how to use each tool),
then adds `.mcp-fact-finder/` to `.gitignore`.  Restart Claude Code to activate.

### 4. Optional: LLM enrichment (Pass 2)

```bash
# Via local Ollama (free, no API key):
FACT_FINDER_INFERENCE_BACKEND=ollama \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force

# Via OpenRouter (~$5–9 for 7,600 docs):
FACT_FINDER_OPENROUTER_API_KEY=sk-... \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force --workers 150
```

---

## MCP Tools

Six tools are exposed to Claude Code once the MCP server is running against a built index:

| Tool | Description |
|------|-------------|
| `search_facts` | Natural language semantic + keyword search, authority-ranked |
| `get_document_facts` | All facts extracted from one source document |
| `compare_sources` | How different docs describe the same topic |
| `get_entity_facts` | All facts about a named entity (person, system, concept) |
| `check_inconsistencies` | Contradictions and conflicts for an entity |
| `get_conflict_report` | Detailed explanation of one specific pairwise inconsistency |

### Example queries

```
"What was decided about the Pricerator architecture?"
→ search_facts(query="Pricerator architecture", fact_type="decided", min_authority=0.8)

"What do all documents say about latency?"
→ compare_sources(topic="P90 latency")

"Everything known about Thomas Evans"
→ get_entity_facts(entity="Thomas Evans")

"Are there contradictions about the GCP migration?"
→ check_inconsistencies(entity="GCP migration")
```

---

## Authority scoring

Facts are ranked by source document type × recency × LLM confidence:

| Score | Document type |
|-------|--------------|
| 1.0   | Technical specs, RFCs, ADRs, TDDs |
| 0.8   | Product requirements (PRDs) |
| 0.5   | Meeting notes, retros, sprint notes |
| 0.3   | Summaries, digests, inferred content |

Use `min_authority=0.8` to restrict results to high-authority sources.
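
The product of document type, recency, and confidence can be sketched as follows. The tier weights come from the table above; the tier names, the exponential recency decay, and its one-year half-life are assumptions chosen for illustration, not the project's actual formula.

```python
from datetime import date

# Tier weights copied from the table above; the key names are hypothetical.
TIER_WEIGHTS = {"spec": 1.0, "prd": 0.8, "meeting": 0.5, "summary": 0.3}


def recency_factor(doc_date: date, today: date, half_life_days: float = 365.0) -> float:
    """Assumed decay: the factor halves every `half_life_days`."""
    age_days = max((today - doc_date).days, 0)
    return 0.5 ** (age_days / half_life_days)


def authority(tier: str, doc_date: date, confidence: float, today: date) -> float:
    """Combine document type, recency, and extractor confidence into one score."""
    return TIER_WEIGHTS[tier] * recency_factor(doc_date, today) * confidence
```

Under these assumptions, a spec written today with confidence 1.0 scores 1.0, while a one-year-old PRD with confidence 0.5 scores 0.8 × 0.5 × 0.5 = 0.2.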

---

## Fact types

| Type | Pattern | Example |
|------|---------|---------|
| `is` | X is / has Y | "Pricerator is the pricing engine" |
| `said` | X stated / believes | "Thomas said the migration was overdue" |
| `happened` | X occurred (past) | "GCP migration completed Q3 2024" |
| `planned` | X will / intends | "Team plans to deprecate the monolith" |
| `decided` | It was decided | "Team decided to use LanceDB" |
| `metric` | X = N (quantified) | "P90 latency is 340ms" |
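
The six fact types map naturally onto an enumeration. The sketch below uses hypothetical `FactType`, `Fact`, and `filter_facts` names (the package's real data models are not shown here) to illustrate how a `fact_type` and `min_authority` filter might combine.

```python
from dataclasses import dataclass
from enum import Enum


class FactType(str, Enum):
    """The six fact types from the table above."""
    IS = "is"
    SAID = "said"
    HAPPENED = "happened"
    PLANNED = "planned"
    DECIDED = "decided"
    METRIC = "metric"


@dataclass(frozen=True)
class Fact:
    """Hypothetical minimal fact record, for illustration only."""
    text: str
    fact_type: FactType
    authority: float


def filter_facts(facts: list[Fact], fact_type: str, min_authority: float = 0.0) -> list[Fact]:
    """Keep facts of one type whose authority meets the threshold."""
    wanted = FactType(fact_type)  # raises ValueError for an unknown type
    return [f for f in facts if f.fact_type is wanted and f.authority >= min_authority]
```

Validating the type string through the enum means a misspelled `fact_type` fails loudly instead of silently returning no results.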

---

## Architecture

```
Markdown files
    │
    ▼
Crawler (discover_documents)
  • Confluence / YAML frontmatter parsing
  • Authority tier from filename keywords
  • Content-hash document identity (stable across moves)
    │
    ▼
Pass 1: rebel-large NLP  [FREE — local CPU/MPS]
  • Relation extraction (SPO triples)
  • ~20 docs/min on Apple M-series
    │   (optional)
    ▼
Pass 2: LLM enrichment  [Ollama local = free | OpenRouter ~$5–9]
  • Canonical fact forms + tense + fact_type
  • Better entity linking
    │
    ▼
LanceDB  (vector + FTS hybrid search)
  • facts, entities, documents, inconsistencies tables
    │
    ▼
MCP Server  (stdio transport, Claude Code)
  • 6 structured search tools
```
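
The "authority tier from filename keywords" step in the crawler could look roughly like this. The keyword sets are guesses derived from the document types in the authority table, and the default tier for unmatched files is an assumption; the real crawler's rules may differ.

```python
import re
from pathlib import Path

# Hypothetical keyword sets, ordered from highest to lowest authority.
TIER_KEYWORDS = [
    (1.0, {"spec", "rfc", "adr", "tdd"}),
    (0.8, {"prd", "requirements"}),
    (0.5, {"meeting", "retro", "sprint"}),
    (0.3, {"summary", "digest"}),
]
DEFAULT_TIER = 0.5  # assumption: unmatched files get a middle tier


def authority_tier(path: str) -> float:
    """Pick a tier by matching whole filename tokens against the keyword sets."""
    # Split the stem on non-alphanumerics so "inspection" does not match "spec".
    tokens = set(re.split(r"[^a-z0-9]+", Path(path).stem.lower()))
    for score, keywords in TIER_KEYWORDS:
        if tokens & keywords:
            return score
    return DEFAULT_TIER
```

Matching whole tokens rather than substrings avoids accidental hits, and checking the highest tier first means a file named with both `spec` and `summary` is treated as a spec.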

---

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `FACT_FINDER_DB_PATH` | `{corpus}/.mcp-fact-finder/db` | LanceDB path |
| `FACT_FINDER_INFERENCE_BACKEND` | `rebel` | `rebel` \| `ollama` \| `openrouter` \| `bedrock` |
| `FACT_FINDER_OPENROUTER_API_KEY` | — | OpenRouter key (Pass 2) |
| `FACT_FINDER_OPENROUTER_MODEL_ID` | `google/gemini-2.0-flash-lite-001` | Pass 2 model |
| `FACT_FINDER_BEDROCK_REGION` | `us-east-1` | AWS Bedrock region |

Copy `.env.example` to `.env` and fill in as needed.

---

## Development

```bash
uv run pytest          # run tests
uv run ruff check src/ # lint
uv run mypy src/       # type check

make build             # build wheel + sdist
make publish           # bump patch, publish to PyPI, tag, push
make publish-minor     # bump minor version
```

See `docs/architecture.md` for the full design document.

---

## License

Elastic License 2.0 (source-available, not an OSI open-source license). See LICENSE.
