Metadata-Version: 2.4
Name: distillr
Version: 0.10.0
Summary: Source-to-intelligence platform: turn YouTube, websites, and arXiv papers into a structured, reusable corpus with per-source insights, cross-source synthesis, and Deep Research reports.
Keywords: research,intelligence,synthesis,arxiv,youtube,rag,knowledge-base,mcp
Author: Nick Seal
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: yt-dlp>=2025.1.0
Requires-Dist: openai>=1.0.0
Requires-Dist: google-genai>=1.50.0
Requires-Dist: typer>=0.9.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: rich>=13.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: python-docx>=1.0.0
Requires-Dist: playwright>=1.52.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: uvicorn[standard]>=0.30.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: markdown>=3.5.0
Requires-Dist: nh3>=0.2.0
Requires-Dist: pypdf>=4.0.0
Requires-Dist: requests>=2.32.0
Requires-Dist: httpx>=0.28.0
Requires-Dist: defusedxml>=0.7.1
Requires-Python: >=3.12
Project-URL: Homepage, https://github.com/blisspixel/distillr
Project-URL: Repository, https://github.com/blisspixel/distillr
Project-URL: Issues, https://github.com/blisspixel/distillr/issues
Description-Content-Type: text/markdown

# Distill

*Installed as [`distillr`](https://pypi.org/project/distillr/) on PyPI; the CLI is `distill`.*

[![CI](https://github.com/blisspixel/distillr/actions/workflows/ci.yml/badge.svg)](https://github.com/blisspixel/distillr/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/distillr.svg)](https://pypi.org/project/distillr/)
[![Python](https://img.shields.io/pypi/pyversions/distillr.svg)](https://pypi.org/project/distillr/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)

> Distill helps you get insights on the topics you care about — and keep them current. Point it at a research goal; it finds the papers, videos, sites, and posts worth reading, analyzes each into structured insights with source receipts, and synthesizes across them into a durable plain-Markdown corpus on your disk. You browse it in Obsidian, your agents read it as files or query it over MCP, and it refreshes on a cadence instead of going stale.

```bash
pip install distillr
distill papers "temporal knowledge graph" --topic tkg --limit 20
```

That one command searches arXiv, downloads 20 PDFs, extracts full text, runs structured analysis on each, and writes a cross-paper synthesis. For a 20-paper run like the example below, expect single-digit minutes and under a dollar in model spend on the `grok-4.3` default. Terminal output during the run looks like this (illustrative run; see the labelled sample-output note below):

```
Papers: temporal knowledge graph
Topic: tkg | Selected papers: 20

  [1/20] Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge
         Graphs and Agentic Memory
  [2/20] Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities
  ...

  6m 47s  ~$0.58 (391,278 in / 38,117 out)

  time_is_not_a_label_260411544_Paper.md     90.4 KB
  time_is_not_a_label_260411544_Insights.md   8.1 KB
  ...
  tkg_Paper_Synthesis.md  11.8 KB
  tkg_Corpus_Synthesis.md 10.5 KB
```

## Where distill sits

Four kinds of tools orbit this space, and distill is deliberately none of them:

- **Deep Research oracles** (ChatGPT, Gemini, Perplexity) are excellent at one-shot answers — and the work evaporates after each session. No corpus, no receipts you can re-check, nothing that compounds. Distill is the engine under that pattern: every run leaves transcripts, extracted paper text, per-source insights, and cross-source synthesis on disk, refreshable on a cadence.
- **Grounded notebooks** (NotebookLM) keep a persistent corpus, but in a silo: you find and feed the sources by hand, and the corpus exports to Google Docs/Sheets only. Distill *finds* the sources against your goal, and the corpus is plain files you own.
- **LLM-wiki maintainers** (the post-Karpathy wave of agent-curated Markdown vaults) assume you already have the content and tidy it. Distill is the acquisition half they leave out — goal-aware discovery across papers, videos, sites, and X, transcript-grade capture, and provenance on every claim — producing exactly the kind of vault those tools maintain.
- **Academic literature tools** (Elicit, Semantic Scholar, scite, Consensus) are stronger for pure paper search, citation graphs, and systematic review. Distill treats papers as one source type inside a broader corpus that also holds talks, vendor docs, and posts.

The short version: those are **report and search layers**; distill is the **corpus layer underneath repeated research** — capture, per-source insights, cross-source synthesis, refresh, receipts. And plain Markdown is the substrate, not the moat: anyone can write Markdown. The moat is the acquisition-and-maintenance loop that fills it and keeps it current.

That matters when you are doing thesis work, competitive analysis, technical due diligence, or building a startup knowledge base: you can verify the receipts, watch how a topic evolves, query the same folder through MCP from Claude Desktop / Cursor / other agents, and open it in Obsidian, Logseq, VS Code, or plain filesystem search. Nothing is locked into a single tool.

## What you get

One local `library/` directory of plain Markdown. No database, no cloud lock-in, no proprietary format. Files use globally descriptive names plus YAML frontmatter so knowledge-base tools, Dataview-style plugins, and AI coding assistants can understand them without guessing from generic `insights.md` tabs.

Four source types, same pipeline shape (capture -> analyze -> synthesize -> report):

- **YouTube** — channels, topic searches, videos, Shorts
- **Websites** — vendor sites, research hubs, curated URL sets (browser-first crawl with PDF/embedded-video ingestion)
- **arXiv papers** — phrase-matched search, full-PDF extraction, structured per-paper insights, cross-paper synthesis
- **X (Twitter) posts** — via `distill ingest <tweet-url>`; uses the public syndication embed endpoint (no anti-bot scraping). When a tweet has a native video attachment, the audio is transcribed via local-first Whisper (`faster-whisper` on GPU/CPU, OpenAI Whisper as cloud fallback) with a vocabulary hint derived from the source metadata to keep proper nouns intact.

Plus an MCP server so AI assistants and agent systems can query the library directly.

## Quick start

Distill runs on Linux, macOS, and Windows (Python 3.12+); local models run on consumer GPUs via Ollama or LM Studio.

```bash
pip install distillr
playwright install chromium     # for YouTube search + website capture
distill doctor                  # verify API keys + system health
```

Set two keys in `.env` (copy from `.env.example`):

```bash
XAI_API_KEY=xai-...             # Grok models
GEMINI_API_KEY=AIza...          # Gemini Deep Research (reports + briefings)
```

Or run locally with Ollama (no API keys needed for ingestion):

```bash
ollama pull qwen3.5:27b         # download recommended model for 24GB GPU
echo "DISTILL_PROVIDER=ollama" >> .env
distill doctor                  # verify local setup
```

Then try any of:

```bash
# Goal-aware cross-source discovery (papers + videos + curated sites, reranked against a goal)
distill discover "help an AI become a great music composer" --topic music --preview
distill discover --goal-file private/my-goal.md --topic research --yes
distill discover --goal-file private/agent365-goal.md --topic agent365 --site-seeds private/agent365_sites.json --site-limit 10 --preview

# Get smart on a YouTube topic, fast
distill latest "Microsoft Fabric best practices" --limit 10 --report

# Discover and ingest arXiv papers — expands the query, LLM-reranks candidates,
# picks the top N (use --preview to see the shortlist without ingesting)
distill papers "agent memory systems" --topic memory --limit 20
distill papers "agent memory systems" --topic memory --limit 20 --preview

# Distill a vendor/research site
distill site-batch configs/example_seeds.json --topic example --seed-only
```

The full command reference lives in [`docs/usage.md`](docs/usage.md).

## Mental model

```
library/
  └── topics/<topic>/
       ├── channels/<creator>/videos/<video>/
       │     ├── <video-slug>_Transcript.txt
       │     └── <video-slug>_Insights.md
       ├── sites/<hostname>/pages/<page>/
       │     ├── <page-slug>_Content.md
       │     └── <page-slug>_Insights.md
       ├── papers/<paper>/
       │     ├── <paper-slug>_Paper.md
       │     └── <paper-slug>_Insights.md
       ├── <topic>_Topic_Synthesis.md      # cross-source
       └── <topic>_Corpus_Synthesis.md     # mixed-source view
```

You build a topic library over time. Ingest once, refresh on a cadence, generate a report or briefing when you need one. Older `insights.md`-style libraries are still readable, but new Markdown writes use the stable knowledge-base naming scheme.

See [`docs/outputs.md`](docs/outputs.md) for what every artifact contains.
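
Because the layout is plain files with stable suffixes, a few lines of Python can inventory a topic. The following is a minimal sketch, not part of distill: `group_artifacts` and `ARTIFACT_KINDS` are hypothetical names, and the suffix list is read off the tree above rather than taken from the code base.

```python
from collections import defaultdict
from pathlib import Path

# Suffixes read off the directory tree above; files with other names are ignored.
ARTIFACT_KINDS = ("Transcript", "Content", "Paper", "Insights", "Synthesis")


def group_artifacts(topic_dir: Path) -> dict[str, list[Path]]:
    """Group files under one topic directory by the kind encoded in their name."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(topic_dir.rglob("*")):
        if not path.is_file():
            continue
        # "<slug>_Insights.md" -> "Insights"; "<topic>_Topic_Synthesis.md" -> "Synthesis"
        kind = path.stem.rsplit("_", 1)[-1]
        if kind in ARTIFACT_KINDS:
            groups[kind].append(path)
    return dict(groups)
```

Calling `group_artifacts(Path("library/topics/tkg"))` would return one list of paths per kind found, which is enough to count insights per topic or to feed the files to another tool.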

## Sample output

*The excerpts below are synthetic examples: the file shapes, frontmatter fields, and section structure are exactly what distill writes, but the papers, authors, and numbers are invented for illustration. For a provenance-first tool that distinction matters, so it is stated. A **real, unedited example corpus** (6 papers on claim verification, $0.19 of analysis) ships in [`examples/`](examples/README.md).*

A cross-paper `<topic>_Paper_Synthesis.md` (excerpt):

```markdown
## Strongest Research Signals

- Append-only temporal representations improve long-horizon extrapolation:
  RoMem (arXiv:2604.11544), EST (arXiv:2602.12389v3), and CID-TKG converge on
  persistent or dual-view entity state over destructive overwriting, with
  consistent MRR/Hits@K gains on ICEWS and GDELT.

- Semantic gating scales better than manual relation tagging: RoMem's Semantic
  Speed Gate and EST's energy-barrier gate both learn relational volatility
  from text embeddings rather than schema tags…
```

<details>
<summary>Per-paper <code>&lt;paper-slug&gt;_Insights.md</code> excerpt (click to expand)</summary>

```markdown
---
title: "Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs"
type: "insights"
topic: "tkg"
source: "arxiv"
source_id: "2604.11544v1"
url: "https://arxiv.org/abs/2604.11544v1"
authors: ["Alice Example", "Bob Example"]
tags: ["distill/tkg", "source/arxiv", "cs.AI"]
synthesis_scope: "single-paper"
analyzed_by: grok-4.3
source_mode: full_pdf
---

### Core Contribution
1. Continuous functional rotation θ_r(τ) = s · α_r · τ · ω instead of discrete
   timestamp lookup tables. Zero-shot interpolation of unseen dates.
2. Semantic Speed Gate: MLP that reads only text embedding ϕ(r) and outputs α_r.
   Learns relational volatility from data.
3. Geometric shadowing in complex space: obsolete facts rotated out of phase so
   the correct fact outranks contradictions via the scoring function alone.

### Methods and Evidence
- On ICEWS05-15, RoMem-ChronoR reaches 72.6 MRR (vs vanilla ChronoR 68.4).
- Zero-shot domain transfer to FinTMMBench: 0.728 MRR, 0.673 R@5.
- All baselines use identical answer LLM and judge for fairness.

### Limits and Open Questions
- Computational cost at millions-of-facts scale is cited as motivation, but no
  latency, memory, or throughput numbers are reported.
- Gate pretrained only on ICEWS05-15 political events; generalization to
  highly ambiguous relations is not quantified.
```

</details>
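
The frontmatter in the excerpt above is what makes the files machine-readable. As a rough illustration, the sketch below reads it with the standard library only. It assumes flat `key: value` lines whose values are either JSON-compatible (quoted strings, inline lists) or bare scalars, as in the excerpt; `read_frontmatter` is a hypothetical helper, and a real YAML parser would be the more robust choice for anything beyond this shape.

```python
import json


def read_frontmatter(text: str) -> dict[str, object]:
    """Parse flat `key: value` frontmatter between the leading `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, object] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter: frontmatter is complete
        # Split on the first colon only, so titles and URLs keep theirs.
        key, sep, raw = line.partition(":")
        if not sep:
            continue
        raw = raw.strip()
        try:
            # Quoted strings and inline lists in the excerpt are valid JSON.
            meta[key.strip()] = json.loads(raw)
        except json.JSONDecodeError:
            meta[key.strip()] = raw  # bare scalar such as grok-4.3
    return {}  # no closing delimiter: treat as no frontmatter
```

With the excerpt's header, `read_frontmatter(text)["authors"]` would be a Python list and `["analyzed_by"]` the plain string `grok-4.3`.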

For **multi-topic** literature reviews, stakeholder briefings, or agent grounding, `distill research-brief` (Gemini Deep Research, web-augmented) and `distill synthesize` (grok-4.3 single-call, corpus-only) take a user-written context file that shapes the output. See [`docs/usage.md#research-briefings-and-deep-synthesis`](docs/usage.md#research-briefings-and-deep-synthesis).

## Dashboard

```bash
distill                         # terminal home screen
distill serve                   # local web dashboard at http://127.0.0.1:8899
```

The terminal home screen shows tracked topics, channel and topic watches, recent runs, failures, and rolling spend. The web dashboard adds clickable drill-downs to per-topic, per-channel, and per-video views with rendered Markdown, plus cost history and watchlist status. Both auto-refresh and read directly from library files, with no database.

## MCP server, and agent-discoverable directories

Distill is built for two parallel agent-integration paths:

**Path 1 — MCP (structured queries).** Claude Desktop / Claude Code config:

```json
{ "mcpServers": { "distill": { "command": "distill-mcp" } } }
```

Distill exposes 21 tools (a deliberately small surface, shrinking toward workflow-shaped tools — the JIT read layer returns ranked `path`/`preview`/`score` tuples with `read_insight` drill-down, never full payloads by default), plus 12 resources and 4 prompts. See [`docs/mcp.md`](docs/mcp.md) for the list.
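
If the client config already lists other MCP servers, the `distill` entry has to be merged in rather than pasted over them. A minimal sketch of that merge follows; `add_distill_server` is a hypothetical helper, and the location of the config file depends on the client, so the path is left to the caller.

```python
import json
from pathlib import Path


def add_distill_server(config_path: Path) -> dict:
    """Add the distill MCP entry to a JSON config, keeping other servers."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    # Create the "mcpServers" map if missing, then set only the distill key.
    servers = config.setdefault("mcpServers", {})
    servers["distill"] = {"command": "distill-mcp"}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Running it twice is harmless: the second call rewrites the same entry and leaves every other server untouched.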

**Path 2 — file system (the corpus IS the interface).** When a coding agent `cd`s into `library/topics/<your-topic>/`, the directory is plain Markdown with stable filenames and YAML frontmatter, so `grep`, `cat`, `ls`, and `find` are first-class query primitives — no schema to learn, no MCP setup required. Every topic directory (and the library root) ships auto-generated **`CLAUDE.md` and `AGENTS.md`** orientation files with identical content — `CLAUDE.md` for Claude Code, `AGENTS.md` for Codex, Cursor, Gemini CLI, and the 30+ tools on the cross-vendor AGENTS.md standard — so any agent that enters the directory gets oriented. This matches what Anthropic's Agent SDK material recommends for agent design: file system + composable tools as the substrate, with structured APIs layered on top when they help, not as the only entry point.
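
The same file-first access works from a script as well as from a shell. The sketch below is a rough Python equivalent of `grep -rin` restricted to insight files; `search_corpus` is a hypothetical name, and the default suffix is taken from the filenames shown in this README.

```python
from collections.abc import Iterator
from pathlib import Path


def search_corpus(
    topic_dir: Path, needle: str, suffix: str = "_Insights.md"
) -> Iterator[tuple[Path, int, str]]:
    """Yield (path, line_number, line) for case-insensitive matches."""
    wanted = needle.lower()
    for path in sorted(topic_dir.rglob(f"*{suffix}")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if wanted in line.lower():
                yield path, number, line.strip()
```

For example, `list(search_corpus(Path("library/topics/tkg"), "ICEWS"))` would list every insight line that mentions the benchmark, with the file and line number to open next.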

There's also a canonical **Agent Skill** at [`skills/distill-corpus/SKILL.md`](skills/distill-corpus/SKILL.md) — one vendor-neutral file teaching an agent how to read the corpus and drive the CLI (drop it into `~/.claude/skills/` or `~/.agents/skills/`).

## Cost

On the `grok-4.3` default ($1.25/$2.50 per 1M tokens), bulk video analysis runs ~$0.03/video and a full paper ~$0.03; Gemini Deep Research dominates paid reports (~$2–3/report); `distill synthesize` is ~$0.20–0.40 for a multi-topic corpus pass. grok-4.3 is the cloud floor — xAI retired the cheaper fast tiers (grok-4-1-fast etc.) on 2026-05-15, and those slugs now redirect to grok-4.3 and bill at grok-4.3 rates ([migration guide](docs/migration-grok-4.3.md)). The only cheaper path is running analysis on a **local model** (Ollama/LM Studio) — `distill eval --models grok-4.3,<local-model>` measures the cost × quality tradeoff over frozen fixtures and recommends the cheapest model that clears your quality bar before you switch. Every run logs actual vs estimated cost to `cost_log.jsonl`, and the pre-run estimate self-calibrates against that history; `distill costs` shows it. The estimator's goal is **accuracy**, not safe padding — a padded estimate discourages runs you'd happily pay for, so calibration error is tracked and shrunk over time.
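
The per-run figures follow directly from the token counts. As a worked check, the sketch below prices the illustrative 20-paper run from the top of this README (391,278 tokens in, 38,117 out), assuming the quoted $1.25 is the input rate and $2.50 the output rate per 1M tokens; `token_cost` is a hypothetical helper, not the estimator distill ships.

```python
def token_cost(
    tokens_in: int,
    tokens_out: int,
    usd_per_m_in: float = 1.25,   # assumed input rate, USD per 1M tokens
    usd_per_m_out: float = 2.50,  # assumed output rate, USD per 1M tokens
) -> float:
    """Return the model spend in USD for one run."""
    return (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000


run_cost = token_cost(391_278, 38_117)
print(f"run: ${run_cost:.2f}, per paper: ${run_cost / 20:.3f}")
# prints "run: $0.58, per paper: $0.029"
```

That reproduces both the `~$0.58` shown in the sample terminal output and the roughly $0.03 per paper quoted above.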

Providers are adapters behind a workload router: grok + gemini are the calibrated cloud defaults, Ollama/LM Studio the local route, and **Anthropic and OpenAI adapters ship in-tree** (wireable, opt-in). Broader backends (AWS Bedrock, Microsoft Foundry) and **plan-quota compute** — routing batch analysis through agent CLIs your existing subscriptions already license (Claude, Codex, Gemini, and others), eval-gated for quality — are committed on the [roadmap](ROADMAP.md#looking-beyond-10).

Full cost model in [`docs/cost.md`](docs/cost.md).

## Reliability and trust boundaries

What's enforced (every release clears the same CI gate): ~1,950 tests at 81% **branch** coverage (the floor only ratchets upward, toward the ≥95% gate for 1.0), ruff + import-linter dependency-direction contracts + pyright + bandit + pip-audit, pinned dependencies via a committed `uv.lock`, SHA-pinned Actions, and PEP 740 build provenance on every PyPI release. Default tests mock all LLM and network boundaries, so contributors never burn API spend; live integration tests are marked and opt-in.

Trust boundaries, stated plainly: everything ingested (transcripts, pages, PDFs, tweets) is treated as **untrusted input** — injection-resistance rules are threaded through first- and second-hop prompts, the dashboard sanitizes rendered HTML, and MCP file access is confined to the library root. Distill never bypasses login walls, captchas, or anti-bot defenses. Known-fragile edge: YouTube extraction depends on yt-dlp, which churns with YouTube's countermeasures — failures degrade with messages, not corrupted corpora. Analysis output is LLM-generated and can err; provenance fields on every artifact exist so you can check receipts, and since 0.10.0 a **write-time verify hook** grounds every numeric claim in a paper or video insight against its source receipt, flagging unsupported numbers in a `_Verify.json` sidecar (`DISTILL_VERIFY=warn|off`; the entailment-checker tier and audit surface are next — [roadmap](ROADMAP.md#0100--verified-corpus-run-time-verify--self-maintaining-audit)). Full posture: [`docs/SECURITY.md`](docs/SECURITY.md) and the [security section of the roadmap](ROADMAP.md#security-posture).
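
To make the idea of numeric grounding concrete, here is a deliberately simple sketch of the technique: collect the numbers in an insight and flag any that never occur in the source text. This is an assumption-laden illustration, not distill's verify hook. `unsupported_numbers` is a hypothetical name, matching is by exact token (so `72.6` and `72.60` count as different), and the structure of the real `_Verify.json` sidecar is not modelled.

```python
import re

# Integers and simple decimals; signs, percentages, and units are ignored.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(insight: str, source: str) -> list[str]:
    """Return numbers in the insight that never appear in the source text."""
    seen = set(NUMBER.findall(source))
    flagged: list[str] = []
    for number in NUMBER.findall(insight):
        # Compare whole numeric tokens, so "2.6" is not matched inside "72.6".
        if number not in seen and number not in flagged:
            flagged.append(number)
    return flagged
```

A caller could then warn on a non-empty result, which mirrors the flag-rather-than-block behaviour described above.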

## Docs

- [`docs/usage.md`](docs/usage.md) — full command reference
- [`docs/invariants.md`](docs/invariants.md) — design charter: what distill is, is not, and the rules that don't bend
- [`docs/architecture.md`](docs/architecture.md) — data flow, 4-phase report pipeline, model routing, security hardening
- [`docs/outputs.md`](docs/outputs.md) — what every artifact contains
- [`docs/cost.md`](docs/cost.md) — cost model, examples, guardrails
- [`docs/mcp.md`](docs/mcp.md) — MCP tools, resources, prompts
- [`docs/migration-grok-4.3.md`](docs/migration-grok-4.3.md) — Grok 4.3 migration guide (model retirement May 15, 2026)
- [`docs/briefing-contexts/TEMPLATE.md`](docs/briefing-contexts/TEMPLATE.md) — starting point for `--context-file` prompts
- [`private/README.md`](private/README.md) — where personal/client-specific files go (git-ignored)

## Roadmap and changelog

- [`docs/CHANGELOG.md`](docs/CHANGELOG.md) — what shipped
- [`ROADMAP.md`](ROADMAP.md) — what's next

Feature work is interleaved with recurring **bug-hunt + harden passes** — see the [release rhythm](ROADMAP.md#path-to-10) note in the roadmap.

## Contributing

See [`docs/CONTRIBUTING.md`](docs/CONTRIBUTING.md) for dev setup, quality gates, and scope. Security disclosures go through [`docs/SECURITY.md`](docs/SECURITY.md).

## License

MIT — see [`LICENSE`](LICENSE).
