# maestro-fetch

> Fetch everything, for agents. Universal data acquisition with smart routing.

maestro-fetch is a Python CLI and SDK that fetches data from any URL --
web pages, APIs, PDFs, Excel/CSV, cloud storage, video/audio, binary formats,
and authenticated pages -- and returns agent-friendly markdown or structured data.
It orchestrates multiple browser backends (bb-browser, Cloudflare, Playwright)
behind a unified interface.

## Quick Start

pip install maestro-fetch
mfetch "https://any-url.com"

## CLI Commands

### mfetch <url>
Default fetch command. Routes the URL to the matching adapter automatically.
Options: --output (markdown|json|csv|raw), --dir, --no-cache, --timeout, --backend
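
When driving the CLI from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags listed above; the helper name `build_mfetch_argv` is hypothetical, and the value format for `--timeout` (assumed here to be a plain number) is not specified in this README.

```python
import shlex


def build_mfetch_argv(url, output=None, out_dir=None, no_cache=False,
                      timeout=None, backend=None):
    """Assemble an argv list for `mfetch <url>` from the documented options."""
    argv = ["mfetch", url]
    if output is not None:
        # The four formats come from the Options line above.
        if output not in {"markdown", "json", "csv", "raw"}:
            raise ValueError(f"unsupported output format: {output}")
        argv += ["--output", output]
    if out_dir is not None:
        argv += ["--dir", str(out_dir)]
    if no_cache:
        argv.append("--no-cache")
    if timeout is not None:
        # Assumption: the flag takes a plain number; the unit is not documented here.
        argv += ["--timeout", str(timeout)]
    if backend is not None:
        argv += ["--backend", backend]
    return argv


argv = build_mfetch_argv("https://example.com/report.pdf", output="json", no_cache=True)
print(shlex.join(argv))
```

With maestro-fetch installed, the resulting list could be passed to `subprocess.run(argv, check=True)`.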

### mfetch batch <file>
Batch fetch URLs from a text file.
Options: --dir, --concurrency
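
To illustrate what a concurrency limit means in practice, here is a sketch of bounded parallel fetching with `asyncio.Semaphore`. It is not taken from the maestro-fetch source and may differ from how `--concurrency` is implemented; `fake_fetch` and `bounded_gather` are hypothetical names, and `fake_fetch` stands in for a real network call.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real fetch call; sleeps briefly and echoes the URL."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


async def bounded_gather(urls, concurrency):
    """Run fake_fetch over urls with at most `concurrency` tasks in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)  # track the highest observed parallelism
            try:
                return await fake_fetch(url)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input URLs.
    results = await asyncio.gather(*(worker(u) for u in urls))
    return results, peak


urls = [f"https://example.com/page/{i}" for i in range(10)]
results, peak = asyncio.run(bounded_gather(urls, concurrency=3))
print(len(results), "results, peak parallelism", peak)
```

The semaphore caps the number of simultaneous requests regardless of how many URLs the input file contains.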

### mfetch source update|list|info|run
Manage and execute community source adapters from maestro-fetch-sources.
- source update: git pull latest adapters
- source list [--category]: browse available adapters
- source info <name>: show adapter args and examples
- source run <name> [args]: execute adapter

### mfetch session start|click|fill|snapshot|screenshot|eval|end
Interactive Playwright browser sessions for authenticated or complex pages.

### mfetch cache list|clear
Manage the SQLite cache. Use clear --older-than <duration> to evict entries older than the given duration.

### mfetch config init|show
Generate or display ~/.maestro-fetch/config.toml.

## Python SDK

import asyncio
from maestro_fetch import fetch, batch_fetch

async def main():
    result = await fetch("https://example.com/data")
    result.content      # markdown text
    result.source_type  # "web" | "doc" | "cloud" | "media" | "binary" | ...
    result.tables       # list[pd.DataFrame]
    result.metadata     # provenance dict
    result.raw_path     # Path to cached raw file

    # Fetch several URLs concurrently (placeholder URLs)
    urls = ["https://example.com/a", "https://example.com/b"]
    results = await batch_fetch(urls, concurrency=10)

    # LLM extraction (requires ANTHROPIC_API_KEY or OPENAI_API_KEY)
    result = await fetch(urls[0], schema={"field": str}, provider="anthropic")

asyncio.run(main())
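
The `schema` argument above maps field names to Python types. How maestro-fetch validates extracted data is not described here; the following is only a sketch of one plausible check that a caller could run on an extracted record. The function `validate_record` and the field names are hypothetical.

```python
def validate_record(record, schema):
    """Return a list of problems found when checking record against schema."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return problems


# Hypothetical schema in the same {"field": type} shape as the example above.
schema = {"title": str, "year": int}
good = validate_record({"title": "Annual report", "year": 2023}, schema)
bad = validate_record({"title": "Annual report", "year": "2023"}, schema)
print(good, bad)
```

An empty list means the record matches the schema; otherwise each entry names the offending field.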

## Architecture

URL -> Router (smart detection) -> Adapter -> FetchResult

### Router Decision Chain
1. Match community source adapter (@meta pattern)
2. Match built-in adapter:
   - pan.baidu.com -> BaiduPanAdapter
   - dropbox/gdrive/gdocs -> CloudAdapter
   - youtube/vimeo -> MediaAdapter
   - *.pdf/xlsx/csv -> DocAdapter
   - *.zip/tif/nc/parquet -> BinaryAdapter
3. Web fallback chain: crawl4ai -> httpx -> Cloudflare -> bb-browser -> playwright-stealth
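
The decision chain above can be sketched as a small routing function. This is an illustration under stated assumptions, not the project's router: the function names, the returned adapter labels, the substring match used for community patterns, and the concrete hostnames for Dropbox, Google Drive/Docs, YouTube, and Vimeo are all guesses at what the shorthand in the list stands for.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Host and extension tables mirror the list above; the exact hostnames that
# maestro-fetch matches are assumptions made for this sketch.
HOST_RULES = [
    (("pan.baidu.com",), "baidu_pan"),
    (("dropbox.com", "drive.google.com", "docs.google.com"), "cloud"),
    (("youtube.com", "youtu.be", "vimeo.com"), "media"),
]
EXTENSION_RULES = {
    ".pdf": "doc", ".xlsx": "doc", ".csv": "doc",
    ".zip": "binary", ".tif": "binary", ".nc": "binary", ".parquet": "binary",
}


def host_matches(host, domain):
    """True when host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def route(url, community_patterns=()):
    """Pick an adapter label for url following the three-step chain."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Step 1: community source adapters take precedence (substring match here).
    for pattern, name in community_patterns:
        if pattern in url:
            return f"source:{name}"
    # Step 2: built-in adapters, first by host, then by file extension.
    for domains, adapter in HOST_RULES:
        if any(host_matches(host, d) for d in domains):
            return adapter
    suffix = PurePosixPath(parsed.path).suffix.lower()
    if suffix in EXTENSION_RULES:
        return EXTENSION_RULES[suffix]
    # Step 3: everything else goes to the web fallback chain.
    return "web"


for example in ("https://pan.baidu.com/s/abc", "https://example.com/grid.nc?download=1"):
    print(example, "->", route(example))
```

Checking hosts before extensions means a PDF shared through a cloud host would go to the cloud adapter in this sketch; the real precedence within step 2 is not stated in this README.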

### 7 Built-in Adapters
- web: general web content (multi-backend fallback chain)
- doc: PDF, Excel, CSV parsing
- binary: archives, geospatial, data science formats
- cloud: Dropbox, Google Drive/Docs/Sheets
- media: YouTube, Vimeo (yt-dlp + Whisper)
- baidu_pan: Baidu Pan via OAuth + PCS API
- browser: BrowserBackend dispatch
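
The multi-backend fallback used by the web adapter can be pictured as "try each backend in order, return the first success". The sketch below shows that pattern with stub backends; `BackendError`, `fetch_with_fallback`, and the stub functions are hypothetical and do not call any real backend.

```python
class BackendError(Exception):
    """Raised by a backend that could not retrieve the page."""


def fetch_with_fallback(url, backends):
    """Try each (name, callable) pair in order; return the first success."""
    errors = []
    for name, backend in backends:
        try:
            return name, backend(url)
        except BackendError as exc:
            errors.append(f"{name}: {exc}")  # remember why this backend failed
    raise BackendError("all backends failed: " + "; ".join(errors))


def blocked(url):
    """Stub backend that always fails."""
    raise BackendError("blocked by anti-bot check")


def rendered(url):
    """Stub backend that always succeeds."""
    return f"# Rendered content of {url}"


chain = [("httpx", blocked), ("cloudflare", rendered), ("playwright-stealth", rendered)]
used, content = fetch_with_fallback("https://example.com", chain)
print(used)
```

Collecting the per-backend errors keeps the final failure message useful when every backend in the chain gives up.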

### 3 Pluggable Browser Backends
- bb-browser: real Chrome + login state + 100+ site adapters
- Cloudflare Browser Rendering: anti-bot, zero install, free tier
- Playwright: interactive sessions, screenshots, PDF generation

### Community Source Adapters
Separate repo: maestro-ai-stack/maestro-fetch-sources
Categories: economics, finance, politics, climate, social, academic, government, industrial, internet

## Storage

~/.maestro-fetch/
  config.toml       # TOML configuration
  cache.db          # SQLite cache index (URL -> hash, TTL, metadata)
  cache/            # content-addressed files
  sources/          # git clone of community adapters
  custom/           # user private adapters (override community)
  sessions/         # temporary session state
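
As an illustration of how a SQLite index and content-addressed files can work together, here is a toy cache that follows the layout above. The table schema, the choice of SHA-256, and the class name `ContentCache` are assumptions for this sketch; the actual maestro-fetch schema may differ.

```python
import hashlib
import sqlite3
import tempfile
import time
from pathlib import Path


class ContentCache:
    """Toy content-addressed cache: SQLite index plus files named by SHA-256."""

    def __init__(self, root):
        self.files = Path(root) / "cache"
        self.files.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(Path(root) / "cache.db")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "url TEXT PRIMARY KEY, hash TEXT NOT NULL, expires_at REAL NOT NULL)"
        )

    def put(self, url, body, ttl=86400):
        """Store body under its digest and index it by URL with a TTL in seconds."""
        digest = hashlib.sha256(body).hexdigest()
        (self.files / digest).write_bytes(body)
        self.db.execute(
            "INSERT OR REPLACE INTO entries VALUES (?, ?, ?)",
            (url, digest, time.time() + ttl),
        )
        self.db.commit()
        return digest

    def get(self, url):
        """Return cached bytes, or None when the URL is unknown or expired."""
        row = self.db.execute(
            "SELECT hash, expires_at FROM entries WHERE url = ?", (url,)
        ).fetchone()
        if row is None or row[1] <= time.time():
            return None
        return (self.files / row[0]).read_bytes()


with tempfile.TemporaryDirectory() as root:
    cache = ContentCache(root)
    digest_a = cache.put("https://example.com/a", b"same bytes")
    digest_b = cache.put("https://example.com/b", b"same bytes")
    hit = cache.get("https://example.com/a")
    miss = cache.get("https://example.com/missing")
    stored_files = len(list(cache.files.iterdir()))
    cache.db.close()
```

Because files are named by their digest, two URLs that return identical bytes share a single file on disk while keeping separate index rows.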

## Configuration (config.toml)

[cache]
max_size = "2GB"
default_ttl = 86400

[backends]
priority = ["bb-browser", "cloudflare", "playwright"]

[backends.cloudflare]
enabled = false
account_id = ""
api_token = ""

## Installation Extras

pip install maestro-fetch[pdf]       # Docling, openpyxl
pip install maestro-fetch[media]     # yt-dlp, Whisper
pip install maestro-fetch[browser]   # Playwright
pip install maestro-fetch[anthropic] # Claude
pip install maestro-fetch[openai]    # GPT
pip install maestro-fetch[all]       # everything

## Project Info

- Python: 3.11+
- License: MIT
- GitHub: https://github.com/maestro-ai-stack/maestro-fetch
- Community adapters: https://github.com/maestro-ai-stack/maestro-fetch-sources
