Metadata-Version: 2.4
Name: watson-lite
Version: 0.1.2
Summary: Extractive QA pipeline — no LLM, no training. BM25 + FAISS + Wikidata + cross-encoder re-ranking + roberta-base-squad2.
Project-URL: Homepage, https://github.com/daedalus/watson-lite
Project-URL: Repository, https://github.com/daedalus/watson-lite
Project-URL: Issues, https://github.com/daedalus/watson-lite/issues
Author-email: Dario Clavijo <clavijodario@gmail.com>
License: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: bm25s>=0.1.0
Requires-Dist: faiss-cpu>=1.7.4
Requires-Dist: numpy>=1.26.0
Requires-Dist: requests>=2.31.0
Requires-Dist: scipy>=1.13.0
Requires-Dist: sentence-transformers>=2.7.0
Requires-Dist: spacy>=3.7.0
Requires-Dist: sparqlwrapper>=2.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: wikipedia-api>=0.6.0
Provides-Extra: all
Requires-Dist: hatch; extra == 'all'
Requires-Dist: prospector[with-mypy,with-ruff]; extra == 'all'
Requires-Dist: pytest; extra == 'all'
Requires-Dist: pytest-cov; extra == 'all'
Requires-Dist: ruff; extra == 'all'
Provides-Extra: dev
Requires-Dist: hatch; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: lint
Requires-Dist: prospector[with-mypy,with-ruff]; extra == 'lint'
Provides-Extra: test
Requires-Dist: pytest; extra == 'test'
Requires-Dist: pytest-cov; extra == 'test'
Description-Content-Type: text/markdown

# watson-lite

A Watson-inspired extractive QA system that runs on a laptop.  
**No LLM. No trained weights of your own. No paid APIs.**

[![Python](https://img.shields.io/pypi/pyversions/watson-lite.svg)](https://pypi.org/project/watson-lite/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/master/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

## Install

```bash
pip install watson-lite
python -m spacy download en_core_web_sm
```

## Usage

### CLI

```bash
# Single question
watson-lite "Who designed the Eiffel Tower?"
watson-lite "Who was the 44th president of the United States?"

# Interactive mode
watson-lite

# Toggle optional features (ablation-style)
watson-lite --no-vector-retrieval --no-graph-enrichment "Who designed the Eiffel Tower?"

# Query across multiple online datasets
watson-lite --datasets wikipedia,wikibooks "What is Python?"

# Benchmark/eval run from dataset
watson-lite \
  --benchmark-dataset /path/to/benchmark.json \
  --benchmark-output-json /tmp/watson_benchmark.json \
  --benchmark-output-csv /tmp/watson_benchmark.csv

# Full ablation sweep + regression gate against baseline
watson-lite \
  --benchmark-dataset /path/to/benchmark.json \
  --ablation-sweep \
  --regression-check \
  --max-accuracy-drop 0.02 \
  --max-f1-drop 0.02
```
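
The regression gate can be pictured as a per-metric comparison against a stored baseline. The sketch below is only an illustration of that idea: the `accuracy` and `f1` dictionary keys, the function name, and the numbers are assumptions, and how watson-lite actually stores its baseline is not described here.

```python
def regression_failures(baseline, current, max_drops):
    """Return metrics whose drop from the baseline exceeds the allowed tolerance."""
    failures = {}
    for metric, max_drop in max_drops.items():
        drop = baseline[metric] - current[metric]
        if drop > max_drop:
            failures[metric] = round(drop, 6)
    return failures


# Illustrative numbers only, not measured results.
baseline = {"accuracy": 0.70, "f1": 0.75}
current = {"accuracy": 0.69, "f1": 0.70}
failed = regression_failures(baseline, current, {"accuracy": 0.02, "f1": 0.02})
print(failed or "no regression")
```

A gate of this shape would exit non-zero whenever the returned dictionary is non-empty.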

Benchmark dataset format (`.json` or `.jsonl`):

```json
[
  {
    "question": "Who designed the Eiffel Tower?",
    "answers": ["Gustave Eiffel"],
    "evidence_passages": ["designed by Gustave Eiffel"]
  }
]
```
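
A file in this format can be read with the standard library alone. The loader below is a minimal sketch, not watson-lite's own parser: it assumes a `.jsonl` file holds one such object per line, and it treats `question` and `answers` as required while leaving `evidence_passages` optional, which is a guess.

```python
import json
from pathlib import Path


def load_benchmark(path):
    """Load benchmark records from a .json array or a .jsonl file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # One JSON object per non-empty line (assumed layout).
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a list of benchmark records")
    for i, record in enumerate(records):
        missing = {"question", "answers"} - record.keys()
        if missing:
            raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
    return records
```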

### Python

```python
from watson_lite import WatsonLite

watson = WatsonLite()
answer = watson.answer("Who designed the Eiffel Tower?")

print(answer.answer)        # "Gustave Eiffel"
print(answer.confidence)    # 0.752
print(answer.source)        # "Eiffel Tower"
```

### KPI evaluation

```python
from watson_lite import WatsonLite
from watson_lite.evaluation import BenchmarkLabel, evaluate_kpis

watson = WatsonLite()
answers = [
    watson.answer("Who designed the Eiffel Tower?", verbose=False),
    watson.answer("What is the capital of France?", verbose=False),
]

labels = [
    BenchmarkLabel(
        answers=["Gustave Eiffel"],
        evidence_passages=["designed by Gustave Eiffel"],
    ),
    BenchmarkLabel(
        answers=["Paris"],
        evidence_passages=["capital of France"],
    ),
]

report = evaluate_kpis(answers, labels, recall_k=10, calibration_bins=10)
print(report.answer_success_rate)
print(report.latency_p95_s)
print(report.confidence_calibration_ece)
```

Each `FinalAnswer` includes `diagnostics` with stage latencies, cache hit/miss
counters, retrieval/extraction counts, and top retrieved passages for KPI rollups.
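
`confidence_calibration_ece` presumably reports expected calibration error (ECE): answers are grouped into confidence bins, and the gap between mean confidence and observed accuracy is averaged, weighted by bin size. The function below is a sketch of the textbook equal-width-bin definition. Whether `evaluate_kpis` bins the same way is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE over confidences in [0, 1] and boolean outcomes."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

A value near 0 means reported confidence tracks accuracy closely. Larger values indicate over- or under-confidence.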

### Example output

```
$ watson-lite "Who was the 44th president of the United States?"

  ANSWER:     Barack Hussein Obama
  CONFIDENCE: 43.6%
  SOURCE:     Barack Obama
  URL:        https://en.wikipedia.org/wiki/Barack Obama

  Confidence breakdown:
    extraction_model: 0.592
    span_agreement: 0.2
    graph_corroboration: 0.0
    passage_rank_signal: 1.0

  Time: 44.60s
```

## API

- **`WatsonLite`** — Main orchestrator. `answer(question)` runs the full 6-stage pipeline.
- **`NLPProcessor`** — spaCy-based question classification, NER, decomposition.
- **`DatasetQueryEngine`** — Modular dataset querying and aggregation across pluggable providers.
- **`BM25Retriever`** — BM25 retrieval over aggregated online passages.
- **`VectorRetriever`** — Dense vector retrieval (sentence-transformers + FAISS).
- **`WikidataGraph`** — Structured fact enrichment from Wikidata.
- **`Ranker`** — RRF fusion + cross-encoder re-ranking.
- **`ExtractiveReader`** — Span extraction via roberta-base-squad2.
- **`ConfidenceScorer`** — Multi-signal confidence scoring.
- **`Cache`** — SQLite3 cache for Wikipedia/Wikidata/type-coercion responses with TTL expiry, namespace metrics, and bounded-size pruning.
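
To make the `Cache` description concrete, here is a minimal sketch of a SQLite-backed cache with TTL expiry and a size cap. The class name, schema, and pruning rule (drop the entries that expire soonest) are assumptions for illustration and are not taken from the watson-lite source. Namespace metrics are left out.

```python
import sqlite3
import time


class TTLCache:
    """Tiny SQLite key/value cache with per-entry expiry and a size cap."""

    def __init__(self, path=":memory:", ttl_s=3600.0, max_entries=1000):
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "namespace TEXT, key TEXT, value TEXT, expires_at REAL, "
            "PRIMARY KEY (namespace, key))"
        )

    def set(self, namespace, key, value, now=None):
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (namespace, key, value, now + self.ttl_s),
        )
        # Bounded size: keep the max_entries rows with the latest expiry.
        self.db.execute(
            "DELETE FROM cache WHERE rowid IN ("
            "SELECT rowid FROM cache ORDER BY expires_at DESC LIMIT -1 OFFSET ?)",
            (self.max_entries,),
        )
        self.db.commit()

    def get(self, namespace, key, now=None):
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT value, expires_at FROM cache WHERE namespace = ? AND key = ?",
            (namespace, key),
        ).fetchone()
        if row is None or row[1] <= now:
            return None  # missing or expired
        return row[0]
```

The optional `now` argument exists only so expiry can be exercised deterministically in tests.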

## Feature inventory

Core (always on):
- NLP parse
- Dataset query engine fetch
- BM25 retrieve
- Span extraction
- Final scoring shell

Optional toggles (enabled by default; pass the flag to disable):
- Vector retrieval (`--no-vector-retrieval`)
- Query expansion variants (`--no-query-expansion`)
- Wikidata graph enrichment (`--no-graph-enrichment`)
- Cross-encoder reranking (`--no-cross-encoder-reranking`)
- Question-type bonus (`--no-question-type-bonus`)
- Type-coercion signal (`--no-type-coercion`)
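
One simple way to use these flags for a manual ablation is a leave-one-out run: one invocation with everything enabled, then one per disabled feature. The helper below only builds the command lines from the flags listed above. It is a sketch with a hypothetical function name, and it is not a description of what `--ablation-sweep` does internally.

```python
import shlex

TOGGLE_FLAGS = [
    "--no-vector-retrieval",
    "--no-query-expansion",
    "--no-graph-enrichment",
    "--no-cross-encoder-reranking",
    "--no-question-type-bonus",
    "--no-type-coercion",
]


def leave_one_out_commands(question):
    """Build a baseline command plus one command per disabled feature."""
    commands = [["watson-lite", question]]
    for flag in TOGGLE_FLAGS:
        commands.append(["watson-lite", flag, question])
    return commands


for argv in leave_one_out_commands("Who designed the Eiffel Tower?"):
    print(shlex.join(argv))
```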

## Development

```bash
git clone https://github.com/daedalus/watson-lite.git
cd watson-lite
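pip install -e ".[all]"   # pytest, ruff, prospector (the "test" extra alone lacks the lint tools)
pip install vulture       # not included in any extra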

# run tests
pytest

# format
ruff format src/ tests/

# lint + type check
prospector --with-tool ruff --with-tool mypy src/

# find unused code
vulture --min-confidence 90 src/
```

## Architecture

```
User Question → NLP (spaCy) → Decomposition → Entity Extraction
  → Parallel Retrieval (BM25 + FAISS) → Graph (Wikidata)
  → RRF Fusion → Cross-Encoder Rerank → Span Extraction → Confidence Score
```
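
The RRF fusion step merges the BM25 and FAISS rankings without needing comparable scores: each passage earns `1 / (k + rank)` from every list it appears in. The function below is a minimal sketch of standard reciprocal rank fusion. The constant `k=60` is the conventional default, and the value used by watson-lite's `Ranker` is an assumption, as are the passage ids.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["p3", "p1", "p7"]    # hypothetical passage ids
dense_hits = ["p1", "p9", "p3"]
print(rrf_fuse([bm25_hits, dense_hits]))
```

Passages found by both retrievers rise to the top, after which a cross-encoder can re-score only the fused shortlist.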

## Models Used (all pretrained, inference only)

| Model | Purpose | Size |
|---|---|---|
| `en_core_web_sm` | spaCy NLP | ~12MB |
| `all-MiniLM-L6-v2` | Passage embeddings | ~90MB |
| `ms-marco-MiniLM-L-6-v2` | Cross-encoder reranking | ~90MB |
| `deepset/roberta-base-squad2` | Extractive span QA | ~480MB |

Total: ~670MB — runs CPU-only.

## Data Sources

- **Wikipedia REST API** — Live article retrieval
- **Wikibooks REST API** — Live educational content retrieval
- **Wikidata REST API** — Structured entity facts (no SPARQL)

## Extending

- **Add a domain corpus**: Plug a new provider into `DatasetQueryEngine`.
- **Add more graph sources**: Wikidata REST API pattern is reusable.
- **Offline mode**: Download Wikipedia dumps and index locally with BM25 + FAISS.
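
As a rough picture of what a pluggable provider could look like, the sketch below defines a small protocol and a toy in-memory implementation. Everything here (`Passage`, `DatasetProvider`, the `fetch` signature) is hypothetical. The interface that `DatasetQueryEngine` actually expects is not documented in this README, so check the source before adapting it.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Passage:  # hypothetical passage record
    text: str
    source: str
    url: str = ""


class DatasetProvider(Protocol):  # hypothetical provider interface
    name: str

    def fetch(self, query: str, limit: int) -> list[Passage]: ...


class InMemoryProvider:
    """Toy provider that ranks local documents by keyword overlap."""

    name = "local-notes"

    def __init__(self, documents):
        self.documents = documents  # mapping of title -> text

    def fetch(self, query, limit=5):
        terms = set(query.lower().split())
        scored = []
        for title, text in self.documents.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scored.append((overlap, Passage(text=text, source=title)))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [passage for _, passage in scored[:limit]]
```

A real provider would replace the keyword overlap with calls to the domain corpus and return passages for the retrievers to index.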
