Metadata-Version: 2.4
Name: alembic-proxy
Version: 1.61.0
Summary: Distil any webpage into clean Markdown for LLM pipelines — 84–98% token reduction.
License-Expression: MIT
Project-URL: Homepage, https://github.com/InunuNet/Alembic
Project-URL: Source, https://github.com/InunuNet/Alembic
Project-URL: Issues, https://github.com/InunuNet/Alembic/issues
Project-URL: Changelog, https://github.com/InunuNet/Alembic/blob/main/docs/CHANGELOG.md
Keywords: llm,web-scraping,proxy,markdown,token-reduction,ai,agent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.1
Requires-Dist: httpx>=0.27
Requires-Dist: readability-lxml>=0.8
Requires-Dist: trafilatura>=1.12
Requires-Dist: lxml>=5.0
Requires-Dist: tiktoken>=0.7
Requires-Dist: aiohttp>=3.9
Requires-Dist: certifi>=2024.2
Requires-Dist: pip-system-certs>=5.0
Requires-Dist: curl-cffi>=0.7
Requires-Dist: youtube-transcript-api>=0.6
Requires-Dist: jmespath>=1.0
Provides-Extra: js
Requires-Dist: playwright>=1.49; extra == "js"
Provides-Extra: saas
Provides-Extra: pdf
Requires-Dist: pypdf>=4.0; extra == "pdf"
Provides-Extra: stealth
Requires-Dist: patchright>=0.1; extra == "stealth"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: pyyaml>=6.0; extra == "dev"
Requires-Dist: pytest-reportlog>=0.4; extra == "dev"
Dynamic: license-file

# Alembic

> Distil any webpage into clean Markdown for LLM pipelines — 84–98% token reduction.

Alembic is a local HTTP proxy and CLI that sits between your agent and the open web. It fetches a URL, strips navigation, ads, scripts, and boilerplate through a multi-stage extraction cascade, and returns clean LLM-ready Markdown — typically at 84–98% token reduction. It also rewrites registry/documentation URLs to their LLM-optimised equivalents, extracts PDF text, and searches the web via Brave Search. Everything runs locally; no API keys required for basic use.

Named for the alchemical distillation apparatus — we turn raw web pages into the pure essence an agent needs.

---

## Install

```bash
pip install alembic-proxy
```

Or from source:

```bash
git clone https://github.com/InunuNet/Alembic.git
cd Alembic
pip install -e .
```

---

## Quick Start

```bash
# Start the proxy (recommended for agent workflows)
alembic serve

# Distil a URL — returns clean Markdown
curl http://localhost:7077/https://example.com

# Search + distil + synthesise
curl "http://localhost:7077/?q=python+async+patterns&fetch=true"

# JSON response with full metadata
curl -H "Accept: application/json" http://localhost:7077/https://example.com

# CLI fetch with token savings report
alembic fetch https://example.com --stats
```
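
The same endpoints can be reached from Python with the standard library alone. The following is a minimal sketch, assuming the proxy is running on `localhost:7077` as in the commands above; `proxy_url` and `distil` are illustrative helper names, not part of Alembic.

```python
import json
import urllib.request

PROXY = "http://localhost:7077"  # assumed: the address used in the curl examples


def proxy_url(target: str, base: str = PROXY) -> str:
    """Build the proxy URL by appending the target URL to the proxy base."""
    return f"{base.rstrip('/')}/{target}"


def distil(target: str, as_json: bool = False, timeout: float = 30.0):
    """Fetch a page through the proxy.

    Returns Markdown text by default, or the parsed JSON envelope
    when as_json is True (sent as an Accept: application/json header).
    """
    request = urllib.request.Request(proxy_url(target))
    if as_json:
        request.add_header("Accept", "application/json")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if as_json else body
```

With the proxy running, `distil("https://example.com")` should return the Markdown body, and `distil("https://example.com", as_json=True)` the JSON envelope.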

---

## Features

### URL Normalization

Before fetching, Alembic rewrites certain "human-readable" URLs to their LLM-optimised equivalents — documentation hosts and structured APIs instead of noisy HTML pages:

| From | To | Quality gain |
|------|----|-------------|
| `arxiv.org/pdf/{id}` | `arxiv.org/abs/{id}` | PDF → clean abstract (quality 0→85) |
| `github.com/{owner}/{repo}/blob/{branch}/{file}` | `raw.githubusercontent.com/...` | HTML → raw file (quality 35→84) |
| `hex.pm/packages/{name}` | `hexdocs.pm/{name}/` | Version list → docs (quality 30→88) |
| `rubygems.org/gems/{name}` | `rubydoc.info/gems/{name}/` | Version list → docs (quality 35→93) |
| `crates.io/crates/{name}` | `docs.rs/{name}/` | og-description → API docs (quality 85→100) |
| `formulae.brew.sh/formula/{name}` | `formulae.brew.sh/api/formula/{name}.json` | HTML → JSON API (quality 52→85) |
| `formulae.brew.sh/cask/{name}` | `formulae.brew.sh/api/cask/{name}.json` | HTML → JSON API |
| `npmjs.com/package/{name}` | `registry.npmjs.org/{name}/latest` | HTML → JSON API (quality 35→85); supports `@scope/pkg` |
| `opam.ocaml.org/packages/{name}` | `ocaml.org/p/{name}/latest` | HTML → docs (quality 33→88) |
| `mvnrepository.com/artifact/{G}/{A}` | `search.maven.org/artifact/{G}/{A}` | HTML → Maven Central (quality 12→70) |
| `gopkg.in/{package}` | `pkg.go.dev/gopkg.in/{package}` | install page → full API docs (quality 64→100) |
| `swiftpackageindex.com/{owner}/{name}` | `github.com/{owner}/{name}` | SPI page → GitHub llms.txt (quality 55→100) |
| `lib.rs/crates/{name}` | `docs.rs/{name}/` | browser page → full API docs (quality 35→100) |
| `cran.r-project.org/package={name}` | `rdocumentation.org/packages/{name}` | link-heavy → clean R docs (quality 55→100) |
| `clojars.org/{group}/{artifact}` | `clojars.org/api/artifacts/{group}/{artifact}` | HTML → JSON API (quality 69→85) |
| `pypi.org/project/{name}` | `pypi.org/pypi/{name}/json` | HTML → JSON API (quality 81→85; 46 → 6K-160K words) |
| `registry.terraform.io/providers/{ns}/{type}` | `registry.terraform.io/v1/providers/{ns}/{type}` | HTML → JSON API (10 → 43K words) |
| `registry.terraform.io/modules/{ns}/{name}/{provider}` | `registry.terraform.io/v1/modules/{ns}/{name}/{provider}` | HTML → JSON API (10 → 52K words) |
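
A rewrite table like this can be modelled as an ordered list of pattern and replacement pairs. The sketch below covers four rows only; the regular expressions and the `normalize_url` name are assumptions made for illustration and are not taken from Alembic's source.

```python
import re

# Illustrative subset of the table above. The patterns are assumptions,
# not Alembic's actual rules.
_RULES = [
    (re.compile(r"^https?://arxiv\.org/pdf/(?P<id>[^/?#]+?)(?:\.pdf)?$"),
     r"https://arxiv.org/abs/\g<id>"),
    (re.compile(r"^https?://crates\.io/crates/(?P<name>[^/?#]+)/?$"),
     r"https://docs.rs/\g<name>/"),
    # The optional "@scope/" prefix keeps scoped npm packages intact.
    (re.compile(r"^https?://(?:www\.)?npmjs\.com/package/(?P<name>(?:@[^/?#]+/)?[^/?#]+)/?$"),
     r"https://registry.npmjs.org/\g<name>/latest"),
    (re.compile(r"^https?://pypi\.org/project/(?P<name>[^/?#]+)/?$"),
     r"https://pypi.org/pypi/\g<name>/json"),
]


def normalize_url(url: str) -> str:
    """Return the rewritten URL for the first matching rule, else the input unchanged."""
    for pattern, replacement in _RULES:
        if pattern.match(url):
            return pattern.sub(replacement, url)
    return url
```

URLs that match no rule pass through untouched, so the function is safe to apply to every request.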

### Extraction Cascade

Every URL goes through a cascade. Alembic stops at the first stage that produces clean content:

| Stage | Strategy | What it handles |
|-------|----------|----------------|
| Pre | **URL normalization** | Registry/doc URL rewrites (see table above) |
| Pre | **arXiv abstract adapter** | `arxiv.org/abs/{id}` → title + authors + abstract via lxml (quality 85) |
| Pre | **PDF extraction** | `application/pdf` → text via pypdf (5MB limit, encrypted/scanned = fallback) |
| 0a | **Sitemap adapter** | XML sitemap / sitemap index → clean URL list |
| 0b | **RSS/Atom adapter** | Feeds → structured Markdown with title + items |
| 0c | **Page-type adapters** | Recipes, forums (Lobste.rs, Reddit, HN, SE), products |
| 0d | **SVG adapter** | `image/svg+xml` → title + desc + text nodes |
| 0e | **Code-file detection** | `text/plain` with `.py/.ts/.go/.rs/.yaml/.json` ext → fenced code block |
| 1 | **`llms.txt` discovery** | Sites with pre-built LLM index (+ URL-targeted excerpt; falls through if quality is below 25 or the result has fewer than 50 words) |
| 1.5 | **Hydration extraction** | Next.js `__NEXT_DATA__`, Nuxt 3, Remix — SSR state without Playwright |
| 1.8 | **JSON-LD `articleBody`** | Articles, HowTo, FAQPage, Event, Course embedded in structured data |
| 2 | **Content negotiation** | Servers that return `text/markdown` natively |
| 3 | **Trafilatura** | Production article extractor — handles most pages |
| 4 | **Readability** | Mozilla's DOM scoring — unusual layouts |
| 5 | **FitCleaner** | Heuristic block scoring — dev docs and engineering blogs |
| 6 | **og:description fallback** | Used when extraction yields fewer than 50 words and `og:description` has at least 30 characters |
| 7 | **Fallback** | Basic tag stripping — always succeeds |

`strategy: llms.txt` is the best possible result. `strategy: fallback` or `og-description` is a yellow flag that usually points to a JS-heavy SPA or a paywall; check `X-Alembic-JS-Hint-Score` to decide whether to retry with `?js=true`.
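
The "stop at the first stage that produces clean content" rule can be sketched as an ordered list of extractors with a guaranteed fallback. Everything here is illustrative: `extract_article` is a toy stand-in for a real extractor such as Trafilatura, and the 50-word acceptance threshold is an assumption borrowed from the figures in the table.

```python
import re
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]

MIN_WORDS = 50  # assumed acceptance threshold for a "clean" result


def strip_tags(html: str) -> str:
    """Last-resort fallback: drop every tag and collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def extract_article(html: str) -> Optional[str]:
    """Toy extractor: return the text of the first <article> element, if any."""
    match = re.search(r"<article[^>]*>(.*?)</article>", html, re.S | re.I)
    return strip_tags(match.group(1)) if match else None


# Ordered from most to least specific; the last entry must always succeed.
CASCADE: list[tuple[str, Extractor]] = [
    ("article", extract_article),
    ("fallback", strip_tags),
]


def run_cascade(html: str) -> tuple[str, str]:
    """Return (strategy, content) from the first stage that yields enough text."""
    for name, extractor in CASCADE[:-1]:
        content = extractor(html)
        if content and len(content.split()) >= MIN_WORDS:
            return name, content
    name, extractor = CASCADE[-1]
    return name, extractor(html) or ""
```

Returning the stage name alongside the content mirrors how the proxy reports which strategy produced a response.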

### Bot Protection Bypass

Alembic ships a pool of 50 correlated synthetic browser personas — each with a consistent OS, browser version, screen, GPU, timezone, and language fingerprint. The `curl_cffi` fetch stage uses TLS impersonation matched to the persona's browser version to defeat Cloudflare Bot Management at the JA3/JA4 layer.

Fetch stage 5 (optional, and separate from stage 5 of the extraction cascade above) adds `patchright` (stealth Playwright) plus a residential proxy for DataDome and Akamai Bot Manager.

| Site | Protection | Result |
|------|-----------|--------|
| AllRecipes | Cloudflare | Passes via curl_cffi |
| Reuters | Cloudflare | Passes via curl_cffi |
| Leboncoin | DataDome | Blocked (Stage 5 target) |
| Glassdoor | Cloudflare Bot Management | Blocked (Stage 5 target) |

Enable Stage 5:

```bash
ALEMBIC_STEALTH=1 ALEMBIC_PROXY_URL=http://user:pass@proxy:port alembic serve
```

### JSON API Distillation

Pass a JMESPath filter to extract fields from JSON APIs without writing glue code:

```bash
# Filter a JSON API response
curl "http://localhost:7077/https://api.example.com/users" \
  -H "X-Alembic-JQ: data[*].email"
```

Invalid expressions return HTTP 400 with `{"error": "...", "expression": "..."}`.
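
For readers new to JMESPath, the projection `data[*].email` behaves roughly like the plain-Python function below. This is an approximation written for illustration, not how Alembic evaluates the filter; `project_field` is a hypothetical name.

```python
from typing import Any, List, Optional


def project_field(document: Any, list_key: str, field: str) -> Optional[List[Any]]:
    """Approximate the JMESPath expression `<list_key>[*].<field>`."""
    items = document.get(list_key) if isinstance(document, dict) else None
    if not isinstance(items, list):
        return None  # a projection over a non-list evaluates to null
    values = (item.get(field) if isinstance(item, dict) else None for item in items)
    return [value for value in values if value is not None]  # null results are dropped


users = {"data": [{"email": "a@example.com"}, {"name": "no-email"}, {"email": "b@example.com"}]}
emails = project_field(users, "data", "email")
```

Note that items without the requested field are omitted from the result rather than reported as `null`.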

### Response Headers

Every response carries telemetry headers:

| Header | Value |
|--------|-------|
| `X-Alembic-Strategy` | Extraction strategy: `trafilatura`, `llms.txt`, `llms.txt:excerpt`, `hydration-*`, `rss-feed`, `sitemap`, `svg-text`, `code-file`, `adapter:arxiv`, `pdf-text`, `pdf-unsupported`, `json-passthrough`, `json-jmespath`, `og-description`, `plain-text`, `fallback` |
| `X-Alembic-Page-Type` | `article`, `recipe`, `forum`, `product`, `api`, `youtube`, `unknown` |
| `X-Alembic-Title` | Extracted page title |
| `X-Alembic-Author` | Extracted author (if available) |
| `X-Alembic-Date` | Extracted publish date (if available) |
| `X-Alembic-Language` | Page language as BCP-47 primary subtag (`en`, `fr`, `de`, …). Empty when unknown. |
| `X-Alembic-Word-Count` | Word count of the clean extracted content |
| `X-Alembic-Link-Count` | Number of unique links extracted from the page (available in JSON envelope as `links[{url,text}]`) |
| `X-Alembic-JS-Hint` | `true` if the page shows strong JavaScript-rendering signals |
| `X-Alembic-JS-Hint-Score` | JS hint confidence score 0–10. ≥6 = likely SPA, retry with `?js=true` |
| `X-Alembic-Cached` | `true` / `false` |
| `X-Alembic-Original-Tokens` | Token count before extraction |
| `X-Alembic-Clean-Tokens` | Token count after extraction |
| `X-Alembic-Saved-Pct` | Percentage of raw tokens saved (e.g. `93%`) |
| `X-Alembic-Yield-Pct` | Percentage of raw tokens retained in the clean output. A low yield (< 1%) suggests an SPA or paywall; use `X-Alembic-JS-Hint-Score` to decide whether to retry with JS |
| `X-Alembic-Quality-Score` | 0–100 content quality score. 80+ = clean prose/docs; 45–79 = moderate; 0–19 = challenge page or empty |
| `X-Alembic-Blocked` | `true` if a bot-wall interstitial was detected |
| `X-Alembic-Blocked-By` | Blocker name: `cloudflare`, `datadome`, `perimeterx`, `incapsula`, `kasada`, `bot_wall`, `unknown` |
| `X-Alembic-Retry` | `1` if a second persona was tried automatically after first block |
| `X-Alembic-Upstream-Status` | HTTP status from the upstream server (4xx/5xx only) |
| `X-Alembic-Wait-Status` | Playwright wait-for-selector outcome |
| `X-Alembic-Search-Backend` | `brave` / `searxng` |
| `X-Alembic-Search-Count` | Number of search results returned |

---

## Configuration

| Variable | Default | Purpose |
|----------|---------|---------|
| `ALEMBIC_PROXY_URL` | — | Outbound proxy for fetch requests (`http://user:pass@host:port` or `socks5://…`) |
| `ALEMBIC_STEALTH` | `0` | Set to `1` to enable Stage 5 patchright stealth browser |
| `ALEMBIC_RATE_LIMIT_RPM` | `0` | Per-IP rate limit in requests/minute. `0` = disabled. Exceeded requests get HTTP 429 + `Retry-After` header. Health endpoint exempt. |
| `ALEMBIC_BLOCK_RETRY` | `1` | Automatically retry once with a fresh browser persona when a block is detected. Defeats probabilistic ML scoring (Cloudflare BM). Set to `0` to disable. |
| `ALEMBIC_SEARXNG_URL` | — | SearXNG instance URL for web search (e.g. `http://localhost:8080`). Takes priority over Brave when set. |
| `ALEMBIC_SEARCH_BRAVE_API_KEY` | — | Brave Search API key (2,000 queries/month free) |
| `BRAVE_SEARCH_API_KEY` | — | Alias for the Brave API key |
| `ANTHROPIC_API_KEY` | — | Claude Haiku for search synthesis (`?fetch=true`) |
| `FIRECRAWL_API_KEY` | — | Firecrawl SaaS JS rendering |
| `BROWSERLESS_API_TOKEN` | — | Browserless SaaS JS rendering |
| `GITHUB_TOKEN` | — | GitHub personal access token → 5,000 req/hr on `api.github.com` (default: 60/hr unauthenticated) |
| `SEMANTIC_SCHOLAR_API_KEY` | — | Semantic Scholar API key → per-key rate limit on `api.semanticscholar.org` |
| `HF_TOKEN` | — | HuggingFace token → private model access + higher rate limits on `huggingface.co` |

---

## Authentication

When deploying Alembic as a public service, set `ALEMBIC_API_KEY` to require authentication:

```bash
export ALEMBIC_API_KEY=your-secret-key
alembic serve
```

Clients must then include one of:

```bash
curl -H "Authorization: Bearer your-secret-key" http://your-server:7077/https://example.com
curl -H "X-API-Key: your-secret-key" http://your-server:7077/https://example.com
```

The health endpoint (`GET /`) is always accessible without auth so you can check service status.
Without `ALEMBIC_API_KEY`, no auth is required (default — local dev mode).
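
The rules above amount to a small decision function. This is a sketch of the described behaviour, not Alembic's server code: `is_authorized` is a hypothetical name, and the constant-time comparison via `hmac.compare_digest` is a design choice of this example.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(method: str, path_and_query: str,
                  headers: Mapping[str, str], api_key: Optional[str]) -> bool:
    """Apply the documented rules: no key means open access, health check is exempt."""
    if not api_key:
        return True  # local dev mode: ALEMBIC_API_KEY is unset
    if method == "GET" and path_and_query == "/":
        return True  # health endpoint; "/?q=..." searches are not exempt here
    h = {key.lower(): value for key, value in headers.items()}
    supplied = h.get("x-api-key", "")
    authorization = h.get("authorization", "")
    if authorization.lower().startswith("bearer "):
        supplied = authorization[len("bearer "):].strip()
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(supplied.encode(), api_key.encode())
```

Either header satisfies the check, matching the two `curl` forms shown above.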

---

## Docker

```bash
# Pull and run
docker compose up -d

# Or build from source
docker build -t alembic-proxy .
docker run -p 7077:7077 -e ALEMBIC_API_KEY=secret alembic-proxy
```

See [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for Fly.io, Railway, and Render deployment guides.

---

## Health Check

```bash
curl -H "Accept: application/json" http://localhost:7077/

# → {"status": "ok", "version": "1.61.0", "cache": "active"}
```

Plain text health check:

```bash
curl http://localhost:7077/
# → Alembic Proxy v1.61.0
```

---

## CLI Reference

| Command | Purpose |
|---------|---------|
| `alembic <url>` | Fetch and print clean content |
| `alembic fetch <url> --stats` | Fetch with token savings report |
| `alembic batch <urls…>` | Fetch multiple URLs in parallel |
| `alembic search "query"` | Web search via Brave / SearXNG |
| `alembic search "query" --fetch` | Search + distil + synthesise |
| `alembic serve` | Start the HTTP proxy on `localhost:7077` |
| `alembic clear` | Clear the entire cache |
| `alembic clear-url <url>` | Evict a single URL from cache |
| `alembic vacuum` | Remove expired entries, reclaim disk space |
| `alembic lifetime` | Show lifetime token savings stats |

Key flags for `fetch`:

| Flag | Effect |
|------|--------|
| `--format markdown\|json\|text` | Output format (default: markdown) |
| `--stats` | Print token savings report |
| `--no-cache` | Bypass cache, always refetch |
| `--js` | Use Playwright for JS rendering |
| `--auto-js` | Auto-escalate to JS if page is heavily dynamic |
| `--saas firecrawl\|browserless` | Use cloud rendering |
| `-H "Key: Value"` | Forward custom header to target site |
| `--ls key=value` | Inject into browser localStorage |
| `--ss key=value` | Inject into sessionStorage |

---

## Advanced Usage

### Authenticated SPA (localStorage injection)

```bash
curl "http://localhost:7077/https://app.example.com/dashboard?js=true" \
  -H "X-Alembic-LocalStorage: session_token=eyJ..."
```

### JSON API with JMESPath filtering

```bash
curl "http://localhost:7077/https://api.example.com/users" \
  -H "Authorization: Bearer token" \
  -H "X-Alembic-JQ: data[*].email"
```

### Python API

```python
import asyncio

from src.processor import Processor
from src.config import DEFAULT_CONFIG

# Build a processor from the default configuration and distil one URL
processor = Processor(DEFAULT_CONFIG)
result = asyncio.run(processor.process("https://example.com", fmt="markdown"))

print(result.content)          # clean Markdown
print(result.strategy)         # trafilatura / llms.txt / hydration / etc.
print(result.original_tokens)  # token count before
print(result.clean_tokens)     # token count after
print(result.page_type)        # article / recipe / forum / product / unknown
print(result.author)           # extracted author (if available)
print(result.publish_date)     # extracted date (if available)
```

---

## Development

```bash
make test          # unit tests (685 passing)
make install-daemon  # install as launchd service (macOS)
```

```bash
# Run tests directly
pytest tests/ -q
# Expected: 685+ passing, 0 failures

# Live integration tests (30 URLs, real proxy)
pytest tests/integration/ -q
```

---

## Documentation

| File | Contents |
|------|---------|
| [`llms.txt`](llms.txt) | AI-to-AI quick reference — comprehensive, machine-readable |
| [`docs/API.md`](docs/API.md) | Complete CLI, proxy, and Python API reference |
| [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) | System design, data flow, module map |
| [`docs/GUIDE.md`](docs/GUIDE.md) | Integration patterns and practical recipes |
| [`docs/CHANGELOG.md`](docs/CHANGELOG.md) | Version history |

---

## License

MIT. See [`LICENSE`](LICENSE) for the full text.
