Metadata-Version: 2.4
Name: subsift
Version: 0.1.0a6
Summary: Subdomain reconnaissance with LLM-powered interestingness scoring for bug bounty hunters and pentesters.
Project-URL: Homepage, https://github.com/Ataraxia-ia-labs/Subsift
Project-URL: Repository, https://github.com/Ataraxia-ia-labs/Subsift
Project-URL: Issues, https://github.com/Ataraxia-ia-labs/Subsift/issues
Author: KaiserCode
License: AGPL-3.0-or-later
License-File: LICENSE
Keywords: bug-bounty,llm,pentest,recon,security,subdomain
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Requires-Python: >=3.11
Requires-Dist: aiosqlite>=0.20
Requires-Dist: alembic>=1.14
Requires-Dist: anthropic>=0.39
Requires-Dist: fastapi>=0.115
Requires-Dist: greenlet>=3.0
Requires-Dist: httpx>=0.27
Requires-Dist: jinja2>=3.1
Requires-Dist: pydantic-settings>=2.6
Requires-Dist: pydantic>=2.9
Requires-Dist: python-multipart>=0.0.12
Requires-Dist: rich>=13.9
Requires-Dist: sqlalchemy[asyncio]>=2.0
Requires-Dist: sqlmodel>=0.0.22
Requires-Dist: structlog>=24.4
Requires-Dist: tenacity>=9.0
Requires-Dist: typer>=0.13
Requires-Dist: tzdata>=2024.2; platform_system == 'Windows'
Requires-Dist: uvicorn[standard]>=0.32
Provides-Extra: screenshots
Requires-Dist: pillow>=11.0; extra == 'screenshots'
Requires-Dist: playwright>=1.49; extra == 'screenshots'
Provides-Extra: storage-s3
Requires-Dist: boto3>=1.35; extra == 'storage-s3'
Description-Content-Type: text/markdown

<div align="center">

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-dark-bg.png">
  <img src="docs/assets/logo-512.png" alt="SubSift" width="200">
</picture>

# SubSift

**Subdomain reconnaissance that actually ranks what matters.**

`subfinder` gives you 5 000 subdomains. SubSift gives you the 20 that probably have a vulnerability — and tells you _why_.

[![CI](https://github.com/Ataraxia-ia-labs/Subsift/actions/workflows/ci.yml/badge.svg)](https://github.com/Ataraxia-ia-labs/Subsift/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/subsift.svg)](https://pypi.org/project/subsift/)
[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)
[![Type checked: mypy strict](https://img.shields.io/badge/type%20checked-mypy%20strict-blue.svg)](https://mypy.readthedocs.io/)
[![Tests: 239 passing](https://img.shields.io/badge/tests-239%20passing-brightgreen.svg)](#)

</div>

---

## Why

The standard recon pipeline (subfinder → httpx → eyeball every line) doesn't scale. A large enterprise can expose 5–10 k subdomains, far too many to triage by hand. SubSift bolts an **interestingness model** onto the pipeline: every subdomain gets a 0–100 score with one-sentence reasoning, and the UI/CLI ranks them so the suspicious ones surface first.

The scoring rubric — admin / dev / staging / vpn names, auth-boundary status codes (401/403), outdated tech, exposed cloud storage — is in [`src/subsift/llm/prompts.py`](src/subsift/llm/prompts.py). Tune it for your engagement.

### Real-world result — `tesla.com`, 2 m 11 s

A single scan against `tesla.com` from the private Fly.io deploy (subfinder + crt.sh, httpx probe, OpenAI `gpt-5-mini` scorer):

```
Subdomains discovered  726
Probed (live HTTP)     258
Scored                 347
High-score (≥70)       220
```

The model surfaced VPN endpoints, authentication services, password-reset (SSPR) backends, financial gateways, MFA origins, and production vehicle-file storage. The first six rows of the ranked output:

| Score | FQDN | Status | Reasoning |
| --- | --- | --- | --- |
| **98** | `origin-finplat-prd.tesla.com` | — | Production financial platform origin — extremely sensitive backend |
| **95** | `apacvpn.tesla.com` | — | Regional VPN endpoint, almost certainly auth / remote access |
| **95** | `auth-global-stage.tesla.com` | 403 | Global staging auth service behind an Access-Denied boundary |
| **95** | `auth.prd.usw.vn.cloud.tesla.com` | 200 | Production auth service exposing the login flow (Envoy / hCaptcha) |
| **95** | `sspr.tesla.com` | 403 | Self-service password reset behind 403 — top-tier account-takeover risk |
| **95** | `vehicle-files.prd.usw2.vn.cloud.tesla.com` | 403 | Production vehicle-file storage behind auth |

The rest of the long tail (marketing, CDN edges, redirects) sits comfortably below 50 — exactly where you want it during triage.

## The pipeline at a glance

```
                    ┌──────────────────────────────────────┐
   subsift scan ──▶ │ ScanOrchestrator                     │
   POST /scans  ──▶ │   1. enumerate (subfinder + crt.sh)  │
   POST /ui/scans   │   2. dedupe + scope-filter (RFC1035) │
                    │   3. upsert subdomains (+ junction)  │
                    │   4. probe (httpx PD: code/tech/ip)  │
                    │   5. score 0-100 (Ollama / Claude)   │
                    │   6. persist Probe + ScoreResult     │
                    └──────────────────────────────────────┘
                          │
   ┌──────────────────────┼──────────────────────────────────┐
   ▼                      ▼                                  ▼
 CLI tables            HTML UI at /ui                  JSON API at /scans
 (Rich, ranked)        (HTMX, ranked, filterable)      (REST, paginated)
                                                          │
                                                          └─▶ exports:
                                                              .json .csv .txt .md
```

| Tool | Output | Ranking |
| --- | --- | --- |
| `subfinder` | raw subdomain list | none |
| `amass` | subdomains + DNS data | none |
| `httpx` | live hosts + tech | by status code |
| **SubSift** | **subdomains + probes + LLM scores + diffs over time** | **by interestingness, with reasoning + history** |

### Enumeration sources

Seven passive sources run concurrently behind an asyncio `Semaphore`. Each is a `Protocol` impl — adding more is a one-file change (no schema migration: the scan records its sources in a single `sources_used` column).

| Source | Kind | Default in registry | Notes |
| --- | --- | --- | --- |
| `subfinder` | ProjectDiscovery binary | yes | broad passive recon, fast |
| `crtsh` | Certificate Transparency logs | yes | finds names from TLS certs only |
| `wayback` | Internet Archive CDX API | yes | historical URLs → hostnames |
| `otx` | AlienVault OTX passive DNS | yes | optional API key boosts rate limits |
| `amass` | OWASP binary, `-passive` mode | yes | slower but very thorough |
| `anubis` | jldc.me Anubis DB | yes | free JSON API, no key |
| `hackertarget` | HackerTarget hostsearch | yes | free, rate-limited (fails soft) |

Use `subsift scan example.com -s crtsh -s wayback` to run a subset. Sources whose binaries aren't installed (`amass`, `subfinder`) fail soft and the scan continues with the rest.

## Quickstart

### One-line scan

```bash
subsift scan example.com
```

This enumerates subdomains from the enabled sources (crt.sh and subfinder in the sample below), probes live hosts with httpx, then asks the LLM to score each subdomain. Output:

```
 scan_id       1
 domain        example.com
 duration      18.42s
 total unique  87
 inserted      87
 updated       0
 probes        62 persisted

Per-source results
┌──────────┬────────┬───────┬────────┐
│ Source   │ Status │ Count │ Time   │
├──────────┼────────┼───────┼────────┤
│ crtsh    │ ok     │   142 │ 4.10s  │
│ subfinder│ ok     │    71 │ 6.85s  │
└──────────┴────────┴───────┴────────┘

LLM scoring
┌──────────┬───────────────────┬──────┬───────────┬────────┐
│ Provider │ Model             │ Stat │ Persisted │ Time   │
├──────────┼───────────────────┼──────┼───────────┼────────┤
│ ollama   │ llama3.2:3b       │ ok   │       62  │ 4.92s  │
└──────────┴───────────────────┴──────┴───────────┴────────┘
```

Then run `subsift scores 1` to see the ranked table (highest first):

```
Score  FQDN                            Reasoning
  92   admin.staging.example.com       admin keyword + 401 auth boundary
  88   jenkins.example.com             exposed Jenkins UI, default branding
  74   gitlab-internal.example.com     internal name leaked publicly
  ...
  12   www.example.com                 marketing site behind CDN
```

### Web UI

```bash
subsift serve --reload
# open http://localhost:8000/ui
```

Three pages: home (recent scans + form), scan detail (ranked table with live filter + export buttons + polling badge), diff view (added/removed/score-changed buckets).

### Diff against last week's scan

```bash
subsift diff --domain example.com
# or explicitly:
subsift diff 1 2 --threshold 20
```

Shows what appeared, what disappeared, and which scores moved significantly between two scans of the same domain.

### Alert when a high-score subdomain appears

Wire a webhook so SubSift pings you (Slack, Discord, PagerDuty, your own endpoint) the moment a new finding with `score ≥ 80` lands:

```bash
subsift alerts add "admin-watch" "https://hooks.slack.com/..." \
    --domain example.com --min-score 80 --trigger added
subsift alerts test 1   # synthetic payload, audited in alert_deliveries
```

Then cron a nightly scan:

```cron
0 3 * * *  subsift scan example.com
```

Every scan that has a previous scan to diff against evaluates every active rule against the diff and POSTs a JSON payload to webhooks whose threshold matched. Failures are isolated — one broken endpoint never affects other rules or the scan itself, and every attempt (sent / failed / skipped) gets a row in `alert_deliveries` for audit.
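
The isolation guarantee is easiest to see in code. This is a rough sketch under stated assumptions, not the real rule engine: `AlertRule`, `Delivery` and `evaluate_rules` are hypothetical names, the payload shape is invented for illustration, and the HTTP call is injected as a `send` callable so the example runs without a network.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    url: str
    min_score: int
    active: bool = True


@dataclass
class Delivery:
    rule: str
    status: str  # "sent", "failed" or "skipped"
    detail: str = ""


def evaluate_rules(rules, added, send) -> list[Delivery]:
    """Check each active rule against newly added findings (fqdn -> score)."""
    log: list[Delivery] = []
    for rule in rules:
        if not rule.active:
            continue
        hits = {fqdn: score for fqdn, score in added.items() if score >= rule.min_score}
        if not hits:
            log.append(Delivery(rule.name, "skipped", "no finding met the threshold"))
            continue
        try:
            send(rule.url, {"rule": rule.name, "findings": hits})
            log.append(Delivery(rule.name, "sent"))
        except Exception as exc:
            # One broken endpoint must not stop the remaining rules.
            log.append(Delivery(rule.name, "failed", str(exc)))
    return log
```

Every branch appends exactly one `Delivery`, which mirrors the idea of one audit row per attempt.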

## Install

### Requirements

- **Python 3.11+** with [uv](https://docs.astral.sh/uv/) for dependency management
- **ProjectDiscovery binaries** (`subfinder`, `httpx`, `dnsx`) — install with Go, see [docs/CONFIGURATION.md](docs/CONFIGURATION.md)
- **LLM** — choose one:
  - [Ollama](https://ollama.com) running locally with any 3B+ instruct model (default, free)
  - [Anthropic API key](https://console.anthropic.com) — opt-in via `SUBSIFT_LLM_PROVIDER=claude`

### From PyPI

```bash
pip install subsift          # or: uv tool install subsift / pipx install subsift
subsift init-db              # create the local SQLite schema
subsift scan example.com
```

Optional extras: `pip install "subsift[screenshots]"` (Playwright capture +
thumbnails) and `pip install "subsift[storage-s3]"` (S3-compatible blob
storage). You still need the ProjectDiscovery binaries on PATH and an LLM
(Ollama running locally, or an API key) — see Requirements above.

### From source (for development)

```bash
git clone https://github.com/Ataraxia-ia-labs/Subsift.git
cd Subsift
cp .env.example .env
uv sync
uv run alembic upgrade head   # migration-managed schema (vs. init-db)
uv run subsift --help
```

### Docker (when WSL2 / Docker Desktop is available)

```bash
cp .env.example .env
docker compose up --build -d
docker compose exec ollama ollama pull llama3.2:3b
curl http://localhost:8000/health
```

The `docker-compose.yml` ships an Ollama service alongside the app so a fresh clone works without external dependencies.

## Configuration

Everything is driven by environment variables prefixed `SUBSIFT_`. Copy `.env.example` to `.env` and edit. Full reference in [docs/CONFIGURATION.md](docs/CONFIGURATION.md).

Key knobs:

| Variable | Default | What it does |
| --- | --- | --- |
| `SUBSIFT_LLM_PROVIDER` | `ollama` | `ollama` (local) or `claude` (API) |
| `SUBSIFT_OLLAMA_MODEL` | `llama3.1:8b` | Any chat-completion model your Ollama has |
| `SUBSIFT_ANTHROPIC_API_KEY` | — | Required when provider = `claude` |
| `SUBSIFT_TOOL_RUNNER` | `native` | `native` (binaries on PATH) or `docker` (image per tool) |
| `SUBSIFT_HTTPX_BIN` | `httpx` | Absolute path needed on Windows — see [docs/CONFIGURATION.md](docs/CONFIGURATION.md#windows-quirk-httpx-binary-name-clash) |

## Deploy (private, on Fly.io)

SubSift ships a complete production-deploy story: HTTPBasic-gated app on
Fly.io (São Paulo region), OpenAI (`gpt-5-mini`) as the LLM by default
(Claude and Ollama are one secret swap away), persistent SQLite volume,
and idle machines auto-stopped to keep the bill at ~$0 for personal use.

```bash
fly auth login
fly apps create subsift
fly volumes create subsift_data --region gru --size 1
fly secrets set \
    SUBSIFT_AUTH_PASSWORD="$(openssl rand -base64 24)" \
    SUBSIFT_OPENAI_API_KEY="sk-..."
fly deploy
```

Full step-by-step — smoke tests, log tailing, password rotation, volume
resizing, the Tailscale-to-local-Ollama variant — in
[`docs/DEPLOY.md`](docs/DEPLOY.md). The committed `fly.toml` already
wires the release-command Alembic migration, the volume mount at
`/app/data`, the `/health` check, and auto-stop on idle.

## Documentation

- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — module layout, request lifecycle, data model.
- **[docs/CLI.md](docs/CLI.md)** — every command, every flag, examples.
- **[docs/API.md](docs/API.md)** — REST endpoint reference with curl examples.
- **[docs/CONFIGURATION.md](docs/CONFIGURATION.md)** — env var reference + install troubleshooting.
- **[docs/DEPLOY.md](docs/DEPLOY.md)** — Fly.io deploy guide, secrets, cost expectations.
- **[CHANGELOG.md](CHANGELOG.md)** — release notes.
- **[CONTRIBUTING.md](CONTRIBUTING.md)** — dev setup, style, PR workflow.
- **[CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md)** — Contributor Covenant 2.1.
- **[SECURITY.md](SECURITY.md)** — vulnerability disclosure + responsible-use.
- **[DISCLAIMER.md](DISCLAIMER.md)** — ethical-use terms + legal warning.

## Roadmap

| Phase | Status |
| --- | --- |
| 1 — Scaffolding (Python 3.11, FastAPI, SQLModel, uv) | :white_check_mark: |
| 2 — Enumeration + persistence (subfinder, crt.sh, SQLite, Alembic) | :white_check_mark: |
| 3 — Probing + enrichment (httpx PD) | :white_check_mark: |
| 4 — LLM scoring (Ollama + Claude via tool-use) | :white_check_mark: |
| 5 — Web UI (Jinja2 + HTMX + Alpine + compiled Tailwind) | :white_check_mark: |
| 6 — Exports (JSON / CSV / TXT / Markdown) | :white_check_mark: |
| 7 — Historical diffs with junction table | :white_check_mark: |
| 8 — Docs + v0.1.0-alpha release | :white_check_mark: |
| 9 — Webhook alerts on new high-scored findings | :white_check_mark: |
| 10 — Wayback + Amass + AlienVault OTX enumerators | :white_check_mark: |
| 11a — Screenshot capture per probe (Playwright, local storage) | :white_check_mark: |
| 11b — Storage abstraction (S3-compatible) + thumbnails | :white_check_mark: |
| 12 — HTTPBasic auth + Fly.io deploy (gru, persistent volume, auto-stop) | :white_check_mark: |

## Architecture notes

- **`Enumerator` Protocol** + a registry — adding a new source is one file (see [`src/subsift/core/enumerators/crtsh.py`](src/subsift/core/enumerators/crtsh.py) for the smallest example).
- **`Prober` and `LLMClient` Protocols** for the same reason — swap httpx for naabu, swap Ollama for OpenAI, no orchestrator changes.
- **`ToolRunner` abstraction** so binaries can run native or via Docker without the wrappers caring.
- **Repository pattern** so the CLI / API / UI never construct SQL — testable with an in-memory engine.
- **Junction table `scan_subdomains`** so diffs are set operations, not heuristics over `first_seen` boundaries.
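
For readers unfamiliar with the Protocol-plus-registry pattern in the first bullet, a minimal sketch follows. It shows the general technique only: the `Enumerator` signature, `REGISTRY`, `register` and `StaticEnumerator` are assumptions made for illustration and may not match the names or signatures in `src/subsift/core/enumerators/`.

```python
import asyncio
from typing import Protocol


class Enumerator(Protocol):
    """Structural interface: anything with a name and an async enumerate()."""

    name: str

    async def enumerate(self, domain: str) -> set[str]: ...


REGISTRY: dict[str, Enumerator] = {}


def register(enumerator: Enumerator) -> Enumerator:
    """Add a source to the registry, rejecting duplicate names."""
    if enumerator.name in REGISTRY:
        raise ValueError(f"source {enumerator.name!r} is already registered")
    REGISTRY[enumerator.name] = enumerator
    return enumerator


class StaticEnumerator:
    """Toy source with fixed output; it does not inherit from Enumerator."""

    name = "static"

    async def enumerate(self, domain: str) -> set[str]:
        return {f"www.{domain}"}


register(StaticEnumerator())

# The caller looks sources up by name and never imports concrete classes.
found = asyncio.run(REGISTRY["static"].enumerate("example.com"))
```

Because `Protocol` is structural, a new source only needs the right attributes; nothing in the orchestrator has to change when one is added.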

## Quality gates

Every push to `main` runs:

- `pre-commit run --all-files` — ruff (lint + format), mypy `--strict`, detect-secrets, file hygiene.
- `pytest --cov` on Python 3.11 **and** 3.12.
- `uvx pip-audit --strict` over exported runtime deps — fails on any known CVE.
- `docker build --target runtime` followed by a `/health` smoke-test inside the container.

Local: `make check` (POSIX) or `scripts\tasks.ps1 check` (Windows) reproduces lint + types + tests in one shot.

## Legal

SubSift is for **authorised security testing only** — bug bounty programs, your own assets, contracted pentests, CTFs. Unauthorised scanning of third-party infrastructure may violate the Computer Fraud and Abuse Act (US), the Computer Misuse Act (UK), and equivalent legislation elsewhere. You are responsible for your use. Full terms in [DISCLAIMER.md](DISCLAIMER.md).

## License

**AGPL-3.0-or-later** © 2026 KaiserCode. See [LICENSE](LICENSE).

SubSift is copyleft: if you run a modified version as a network service, the
AGPL requires you to offer that modified source to its users. This keeps the
free/core tier open while leaving room for a separately-licensed Pro tier.
