Metadata-Version: 2.4
Name: lookback-ai
Version: 0.1.0
Summary: Local-first multimodal semantic memory for your machine — searchable text + screenshots, MCP-native, runs on CPU.
Author: Ayush Chaurasia
License: MIT
License-File: LICENSE
Keywords: embeddings,lancedb,local-first,mcp,mobileclip,multimodal,nomic,semantic-search
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.11
Requires-Dist: fastmcp>=3.3.1
Requires-Dist: huggingface-hub>=0.24
Requires-Dist: lancedb>=0.16
Requires-Dist: numpy>=1.26
Requires-Dist: onnxruntime>=1.18
Requires-Dist: pillow>=10.3
Requires-Dist: pyarrow>=17
Requires-Dist: pydantic>=2.7
Requires-Dist: pypdf>=4.3
Requires-Dist: rich>=13.7
Requires-Dist: tokenizers>=0.20
Requires-Dist: typer>=0.12
Requires-Dist: watchfiles>=0.24
Description-Content-Type: text/markdown

# Lookback

Local-first, multimodal semantic memory for your machine.

Index your files, code, PDFs, browser history, and **screenshots** into a
[LanceDB](https://lancedb.com) store on disk. Query by meaning from the CLI
*or* from any MCP-capable AI tool (Claude Code, Cursor, Continue, ChatGPT
Desktop, Windsurf, Zed). Everything runs on-device — no cloud, no GPU.

## Highlights

- **Multimodal.** Real semantic search over text + screenshots in a single
  index. Cross-modal: search for *"fluffy clouds in the sky"* and you'll get
  back the screenshot, not just text mentioning clouds.
- **Local-first.** Models (Nomic Embed v1.5 + MobileCLIP2-S2) run on CPU
  via ONNX Runtime. Your data and your queries never leave your laptop.
- **MCP-native.** A single `lookback serve` makes the index available as a
  tool to every modern AI assistant. See [`MCP_SETUP.md`](MCP_SETUP.md).
- **Dev-grade DX.** Single `pip install`, sensible defaults, one config
  file, every subcommand documented.
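
To make the cross-modal claim concrete, here is a minimal sketch of how a text query can rank images once both live in a shared embedding space. The 3-d vectors are made-up stand-ins for real 512-d MobileCLIP embeddings, and `rank_images` is a hypothetical helper, not part of Lookback's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query_vec, image_vecs):
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors, invented for illustration only.
query = [0.9, 0.1, 0.0]          # stand-in for the text-tower embedding of the query
images = {
    "sky.png": [0.8, 0.2, 0.1],  # stand-ins for vision-tower embeddings
    "dog.png": [0.1, 0.9, 0.3],
}
ranking = rank_images(query, images)
```

With these toy values `sky.png` ranks first because its vector points in nearly the same direction as the query. In the real pipeline the vectors would come from the model's text and vision towers rather than being written by hand.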

## Status

| Milestone | Scope | State |
|---|---|---|
| M0 | Design + scaffold | ✅ |
| M1 | Lance schema + store | ✅ |
| M2 | Text embedder ABC + mock + Nomic adapter; chunking; markdown extractor; indexer | ✅ |
| M3 | Image embedder mock + screenshot extractor | ✅ |
| M4 | PDF + code extractors | ✅ |
| M5 | CLI: init / index / search / stats / models | ✅ |
| M6 | Model registry, system probe, recommendation, `init` model selection | ✅ |
| M7 | Real Nomic + MobileCLIP weights wired end-to-end, `@needs_models` smoke tests | ✅ |
| M8a | Cross-modal text→image search via MobileCLIP joint text tower; `--modality` flag | ✅ |
| M8b | File watcher (`lookback watch`); MCP server (`lookback serve`); hybrid FTS + vector (`--hybrid`); MCP setup docs | ✅ |

**194 tests, all green.** Ten of them need real model weights and are skipped
automatically when the weights are absent. Run `uv sync && uv run pytest -q` to verify.

## Quick start

```bash
# Install
pip install lookback-ai    # PyPI distribution; imports as `lookback`
# OR for local development:
uv sync && uv tool install --editable .

# Bootstrap config with system-aware model recommendation
# (if you installed with pip, you can call `lookback` directly without the `uv run` prefix)
uv run lookback init
# Detected: Darwin · arm64 · Apple Silicon · 16.0 GB RAM · 8 CPU
# Recommended: text=nomic-v1.5  image=mobileclip-s2

# Download weights (~700 MB total — Nomic v1.5 + MobileCLIP2-S2 vision + text + tokenizer)
uv run lookback models download nomic-v1.5 mobileclip-s2

# First-time index pass over directories you care about
uv run lookback index ~/Documents
uv run lookback index ~/Pictures/Screenshots

# Search
uv run lookback search "transformer attention notes"
uv run lookback search "a diagram with red and blue arrows" --modality image
uv run lookback search "IVF_PQ tuning" --hybrid       # FTS + vector RRF fusion

# Keep the index up to date as files change
uv run lookback watch ~/Documents

# Expose to AI tools via MCP
uv run lookback serve                                  # stdio (IDE-friendly)
uv run lookback serve --transport http --port 7777     # HTTP for remote clients
```

See **[MCP_SETUP.md](MCP_SETUP.md)** for Claude Code / Cursor / Continue /
ChatGPT Desktop / Windsurf / Zed configuration snippets.

## Commands at a glance

| Command | What it does |
|---|---|
| `lookback init` | Detect system, recommend models, write `~/.lookback/config.toml`. Flags: `--text-model`, `--image-model`, `--interactive`. |
| `lookback models list` | Show every registered model with HF repo and disk-size estimate. |
| `lookback models download <name> [<name> …]` | Fetch weights into `models_dir`. |
| `lookback index <path>` | Walk a path, hash + skip-if-unchanged, embed new/changed files, write to Lance. |
| `lookback search <query>` | Semantic search. Flags: `--modality text\|image\|all`, `--source-kind <kind>`, `--hybrid`, `--limit N`, `--json`. |
| `lookback stats` | Row counts per table. |
| `lookback watch <path>` | Foreground watcher — re-indexes on FS events. |
| `lookback serve` | MCP server. `--transport stdio\|http`, `--host`, `--port`. |
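
For scripting, the `search` flags listed above can be driven from Python. The sketch below is one possible wrapper: it assumes `lookback` is on `PATH` and that `--json` prints a single JSON document to stdout. The output schema is not documented here, so the parsed result is returned as-is. Both function names are hypothetical.

```python
import json
import subprocess


def build_search_command(query, modality="all", hybrid=False, limit=None):
    """Build the argv list for `lookback search` from the documented flags."""
    if modality not in {"text", "image", "all"}:
        raise ValueError(f"unsupported modality: {modality!r}")
    argv = ["lookback", "search", query, "--modality", modality, "--json"]
    if hybrid:
        argv.append("--hybrid")
    if limit is not None:
        argv += ["--limit", str(limit)]
    return argv


def run_search(query, **options):
    """Run the CLI and parse stdout as JSON (assumes one JSON document is printed)."""
    completed = subprocess.run(
        build_search_command(query, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Passing the command as a list avoids shell quoting problems with queries that contain spaces or quotes.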

## Storage layout

```
~/.lookback/
├── config.toml         # one TOML, hand-editable
├── models/
│   ├── nomic-v1.5/
│   │   ├── onnx/model.onnx
│   │   └── tokenizer.json
│   └── mobileclip-s2/
│       ├── onnx/s2/vision_model.onnx
│       ├── onnx/s2/text_model.onnx
│       └── tokenizer.json
└── data/
    ├── chunks_text.lance       (Nomic 768-d)
    ├── chunks_image.lance      (MobileCLIP 512-d)
    └── files.lance             (file-level state for incremental indexing)
```
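
The file-level state in `files.lance` is what makes incremental indexing possible. As a rough illustration of the hash and skip-if-unchanged idea, the sketch below uses SHA-256 and a plain dict in place of the Lance table. Both choices are assumptions for the example, not a description of the actual implementation.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in 1 MiB chunks to bound memory use."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def files_to_reindex(paths, known_digests):
    """Return {path: digest} for files whose content differs from the stored digest.

    `known_digests` maps str(path) -> digest from the previous run; it stands in
    for the per-file state table.
    """
    changed = {}
    for path in paths:
        digest = file_digest(path)
        if known_digests.get(str(path)) != digest:
            changed[str(path)] = digest
    return changed
```

Only the files returned by `files_to_reindex` would need to be re-embedded; everything else can be skipped on the next pass.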

## What it indexes by default

Tier 1 (configured in `roots`, on by default):

- **Markdown / plaintext** — `.md`, `.markdown`, `.mdx`, `.txt`, `.log`, `.rst`
- **PDFs** — text-layer extraction via pypdf (OCR for image-only PDFs is M9)
- **Source code** — 40+ languages (Python, TS/JS, Go, Rust, Java, Swift, C/C++, Ruby, …) with language tags as `source_kind`
- **Screenshots** — `.png`, `.jpg`, `.webp`, `.gif`, `.bmp`. Visually searchable via MobileCLIP.

Skipped: hidden directories, `.gitignore`'d paths, `node_modules`/`.venv`/`target`/`build`/`dist`/etc., files larger than `max_file_bytes` (50 MiB default), symlinks (unless `follow_symlinks = true`).
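
The skip rules above can be pictured as a filter over a directory walk. This is a simplified sketch under stated assumptions: the excluded-directory set only contains the names listed above (the real list is longer), `.gitignore` handling is left out, and `iter_indexable_files` is a hypothetical name.

```python
import os
from pathlib import Path

EXCLUDED_DIRS = {"node_modules", ".venv", "target", "build", "dist"}
MAX_FILE_BYTES = 50 * 1024 * 1024  # 50 MiB, matching the documented default


def iter_indexable_files(root, max_file_bytes=MAX_FILE_BYTES, follow_symlinks=False):
    """Yield files under `root` that pass the skip rules (no .gitignore support)."""
    for dirpath, dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        # Prune in place so os.walk never descends into hidden or excluded directories.
        dirnames[:] = [
            d for d in dirnames
            if not d.startswith(".") and d not in EXCLUDED_DIRS
        ]
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink() and not follow_symlinks:
                continue
            try:
                size = path.stat().st_size
            except OSError:
                continue  # e.g. broken symlink or file removed mid-walk
            if size > max_file_bytes:
                continue
            yield path
```

Pruning `dirnames` in place is what keeps the walk cheap: a large `node_modules` tree is never listed at all, rather than being listed and then filtered.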

## Hero demos (with real weights)

```bash
$ lookback search "fluffy clouds in the sky" --modality image
Image hits
┃ score ┃ kind       ┃ meta                    ┃
│ 0.779 │ screenshot │ {"filename": "sky.png"} │
│ 0.926 │ screenshot │ {"filename": "dog.png"} │

$ lookback search "transformer attention paper"
Text hits
│ 0.309 │ markdown │ {"section": "Attention is all you need", ...} │
Image hits
│ 0.836 │ screenshot │ {"filename": "diagram.png"} │

$ lookback search "IVF_PQ tuning" --hybrid
Text hits
│ 0.033 │ markdown │ {"section": "IVF_PQ index tuning", ...} │   # exact-keyword boost
```
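
The `--hybrid` mode fuses full-text and vector results with reciprocal rank fusion (RRF). Here is a minimal sketch of the standard RRF formula, where each document scores the sum of `1 / (k + rank)` over the lists it appears in. The constant `k = 60` is the conventional default and an assumption here, and the file names are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum of 1 / (k + rank), with rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fts_hits = ["ivf_pq_tuning.md", "lance_notes.md"]       # hypothetical keyword ranking
vector_hits = ["ivf_pq_tuning.md", "hnsw_overview.md"]  # hypothetical vector ranking
fused = reciprocal_rank_fusion([fts_hits, vector_hits])
```

A document ranked first in both lists scores `2 / 61 ≈ 0.033` under these assumptions, which would be consistent with the hybrid score shown in the demo, although the constant actually used is not stated here. Because RRF works on ranks rather than raw scores, it needs no normalization between keyword and vector scores.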

## Design + architecture

See [DESIGN.md](DESIGN.md) for:
- Lance schema (chunks_text + chunks_image + files) and the perf-guide-driven decisions
- Embedding choices, dim selection, distance metrics
- Per-extractor chunking strategies
- Index types (IVF_PQ vector + bitmap/btree scalar + FTS inverted)
- Session-by-session implementation log

## License

MIT
