Metadata-Version: 2.4
Name: domvault
Version: 1.0.0
Summary: DomVault extraction runtime with session identity, residential proxy orchestration, challenge auto-solving, and warm session caching.
Author: DomVault Maintainers
License-Expression: MIT
Keywords: web-extraction,browser-automation,anti-bot,playwright,scraping
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: apify-fingerprint-datapoints==0.12.0
Requires-Dist: browserforge==1.2.4
Requires-Dist: camoufox[geoip]==0.4.11
Requires-Dist: curl_cffi==0.15.0
Requires-Dist: lxml==6.1.0
Requires-Dist: msgspec==0.20.0
Requires-Dist: patchright==1.58.2
Requires-Dist: Pillow==12.1.1
Requires-Dist: protego==0.6.0
Requires-Dist: psutil==7.2.2
Requires-Dist: pydantic==2.12.5
Requires-Dist: playwright==1.58.0
Requires-Dist: python-dotenv==1.0.1
Requires-Dist: requests==2.33.0
Requires-Dist: scrapling==0.4.7
Requires-Dist: tenacity==9.1.4
Provides-Extra: dev
Requires-Dist: build==1.4.3; extra == "dev"
Requires-Dist: lxml-stubs==0.5.1; extra == "dev"
Requires-Dist: mypy==1.20.1; extra == "dev"
Requires-Dist: pytest==9.0.2; extra == "dev"
Requires-Dist: pytest-cov==7.1.0; extra == "dev"
Requires-Dist: ruff==0.15.10; extra == "dev"
Requires-Dist: types-psutil==7.2.2.20260408; extra == "dev"
Requires-Dist: types-requests==2.33.0.20260408; extra == "dev"

# DomVault 3.0

DomVault is a production-grade extraction runtime for protected, modern websites.
It captures page evidence, preserves session identity, routes through residential
proxy leases, solves supported challenge families in-place, and emits a structured
manifest for downstream reconstruction workflows.

This package is the extraction boundary only. It does not ship discovery, ranking,
vault curation, or code-generation systems. Its job is to get into the page cleanly,
extract high-fidelity artifacts, and explain exactly how it did so.

## Unified Core Advantage

DomVault ships as a **Unified Core** runtime: one install with browser-capable anti-bot extraction ready out of the box.

What this means:
- Playwright, Camoufox, Patchright, and Scrapling are installed as core dependencies, not as optional extras.
- Protected-target posture is available immediately after install.
- Escalation between HTTP and browser transports is policy-gated and fully observable.

Competitive position:
- DomVault intentionally accepts a heavier runtime footprint to maximize first-run success on protected targets.
- Tuning happens through runtime policy knobs, not through post-install dependency surgery.

## Architecture

DomVault 3.0 is packaged as a standard `src/domvault` Python library with deterministic runtime contracts.

Key architectural points:
- **Canonical Layout**: `src/domvault` is the packaging and import root.
- **Runtime Discipline**: Browser backends start only when selected by routing or explicit fallback policy.
- **Dependency Hygiene**: Runtime dependencies stay in core packaging; tooling stays in dev channels.
- **Python 3.12+ Compatibility**: The package requires Python 3.12 or newer and declares classifiers for 3.12, 3.13, and 3.14.

## DomVault Features

### Session Identity

Each run is bound to a typed `SessionIdentity` that carries:

- transport profile
- browser profile
- locale and timezone
- viewport
- cookie jar
- solver history
- proxy lease affinity

DomVault keeps browser and HTTP transport aligned to the same logical identity unless
policy explicitly rotates it.

### Residential Proxy Orchestrator

Protected-mode traffic is routed through a leased proxy identity with:

- sticky session semantics
- leak-control policy enforcement
- DNS-via-proxy preference
- WebRTC blocking for browser paths
- lease scoring, challenge penalties, and ejection thresholds
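
Lease scoring with challenge penalties and an ejection threshold might work roughly as follows. This is a sketch only: `LeaseScore`, the weights, and the default threshold are assumptions, not DomVault's actual values.

```python
from dataclasses import dataclass


@dataclass
class LeaseScore:
    """Hypothetical per-lease health score in the range (-inf, 1.0]."""

    lease_id: str
    score: float = 1.0

    def record(self, *, success: bool, challenged: bool = False) -> None:
        # Assumed weights: small reward for success, larger penalty for
        # failure, and an extra penalty whenever a challenge was served.
        if success:
            self.score = min(1.0, self.score + 0.05)
        else:
            self.score -= 0.2
        if challenged:
            self.score -= 0.1

    def should_eject(self, threshold: float = 0.5) -> bool:
        # A lease is dropped once its score falls below the threshold.
        return self.score < threshold
```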

### Challenge Auto-Solving

DomVault detects and routes supported challenge families with explicit policy:

- Cloudflare Turnstile: solve in place when configured
- DataDome: solve in place and reapply session cookies
- Kasada: rotate identity instead of pretending a token flow is safe

Solver providers are environment-driven and provider-ordered. The runtime records
which provider was attempted, which one succeeded, and when identity rotation was
required instead.

### Warm Session Caching

Successful sessions are stored domain-by-domain with:

- browser cookies
- local storage values
- proxy affinity hints
- prior challenge outcomes
- preferred backend/provider history

When a repeat domain is captured, DomVault attempts to restore a warm identity before
starting cold. Poisoned or stale sessions are excluded automatically.

### Explainable Extraction Provenance

Every capture emits provenance describing:

- transport backend used
- session identity id
- transport profile id
- proxy lease behavior
- challenge routing decisions
- solver provider results
- fallback and degradation reasons

The output is intended to be auditable, not magical.

## Installation

Install the runtime package:

```bash
pip install .
```

For development, install in editable mode with the `dev` extra (build, lint, typing, and test tooling):

```bash
pip install -e ".[dev]"
```

## Runtime Tuning Knobs

Unified Core keeps browser capabilities installed by default. You can tune runtime cost and behavior with policy controls:

- `DOMVAULT_ENABLE_CAMOUFOX_FALLBACK`: enable or disable Camoufox fallback attempts.
- `DOMVAULT_ENABLE_PATCHRIGHT_FALLBACK`: enable or disable Patchright compatibility fallback.
- `DOMVAULT_CAMOUFOX_TIMEOUT_MS` and `DOMVAULT_PATCHRIGHT_TIMEOUT_MS`: bound browser escalation cost.
- `DOMVAULT_BACKEND_CAMOUFOX_HEADLESS` and `DOMVAULT_BACKEND_CAMOUFOX_VIRTUAL_DISPLAY`: control execution mode.
- `DOMVAULT_TRANSPORT_PROXY_PROTECTED_MODE` and related proxy settings: harden protected-target routing.
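
If you want to scope these knobs to a single run from Python, a small context manager can set and restore them. This is a generic sketch: `env_overrides` is a hypothetical helper, the value formats (`"false"`, `"20000"`) are assumptions, and it only has an effect if DomVault reads the variables at call time rather than at import time.

```python
import os
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def env_overrides(**values: str) -> Iterator[None]:
    """Temporarily set environment variables, restoring the old state on exit."""
    previous = {name: os.environ.get(name) for name in values}
    os.environ.update(values)
    try:
        yield
    finally:
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = old


# Assumed value formats; check the runtime's configuration reference.
with env_overrides(
    DOMVAULT_ENABLE_PATCHRIGHT_FALLBACK="false",
    DOMVAULT_CAMOUFOX_TIMEOUT_MS="20000",
):
    pass  # call extract(...) here
```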

## Public API

DomVault exposes a stable top-level API. Consumers should not need to import from
internal modules for normal usage.

```python
from domvault import CaptureResult, extract

result: CaptureResult = extract(
    "https://example.com",
    selector="main",
    output_dir="_scraped_raw/example",
)

print(result.capture_status)
print(result.manifest_path)
print(result.target_profile)
```

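Async usage (outside an existing event loop, drive the coroutine with `asyncio.run`):

```python
import asyncio

from domvault import extract_async


async def main():
    return await extract_async(
        "https://example.com",
        selector="main",
        output_dir="_scraped_raw/example-async",
    )


result = asyncio.run(main())
```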

Primary public exports:

- `extract`
- `extract_async`
- `CaptureResult`
- `DomVaultManifest`
- `SessionIdentity`
- `SessionStore`
- `WarmSessionRecord`
- `RuntimeConfig`
- `ProxyOrchestrator`

## Output Model

The runtime writes a manifest and artifact bundle under the selected output directory.
Typical outputs include:

- `manifest.json`
- `structured-extraction.json`
- page HTML
- DOM snapshots
- computed styles
- hydration state
- shadow DOM coverage
- frame tree coverage
- anti-bot signals
- animation and token mapping artifacts

The `CaptureResult` returned by the API gives you the high-value runtime summary while
the manifest preserves the deeper artifact references.

## Environment Configuration

DomVault is configured through environment variables. Important groups include:

### Identity

- `DOMVAULT_IDENTITY_STORAGE_ROOT`
- `DOMVAULT_IDENTITY_DEFAULT_LOCALE`
- `DOMVAULT_IDENTITY_DEFAULT_TIMEZONE`
- `DOMVAULT_IDENTITY_DEFAULT_ACCEPT_LANGUAGE`

### Transport And Proxy

- `DOMVAULT_TRANSPORT_HTTP_IMPERSONATION`
- `DOMVAULT_TRANSPORT_PROXY_PROVIDER`
- `DOMVAULT_TRANSPORT_PROXY_URL`
- `DOMVAULT_TRANSPORT_PROXY_COUNTRY`
- `DOMVAULT_TRANSPORT_PROXY_REQUIRE_LEASE`
- `DOMVAULT_TRANSPORT_PROXY_BLOCK_WEBRTC`

### Challenge Solvers

- `DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER`
- `DOMVAULT_CAPSOLVER_API_KEY`
- `DOMVAULT_2CAPTCHA_API_KEY`
- `DOMVAULT_CHALLENGE_SOLVER_TURNSTILE_ENABLED`
- `DOMVAULT_CHALLENGE_SOLVER_DATADOME_ENABLED`
- `DOMVAULT_CHALLENGE_SOLVER_KASADA_ENABLED`
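
A preflight check can catch a provider that is listed in the order but has no API key. The sketch below assumes the provider order is a comma-separated string and that the provider names are `capsolver` and `2captcha`; neither format is confirmed here, and `missing_solver_keys` is a hypothetical helper.

```python
from collections.abc import Mapping

# Assumed provider name -> API key variable mapping.
KEY_VARS = {
    "capsolver": "DOMVAULT_CAPSOLVER_API_KEY",
    "2captcha": "DOMVAULT_2CAPTCHA_API_KEY",
}


def missing_solver_keys(env: Mapping[str, str]) -> list[str]:
    """Return the key variables that are unset for providers in the order."""
    order = env.get("DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER", "")
    providers = [p.strip().lower() for p in order.split(",") if p.strip()]
    return [
        KEY_VARS[p]
        for p in providers
        if p in KEY_VARS and not env.get(KEY_VARS[p])
    ]
```

Passing `os.environ` as `env` checks the live process environment; passing a plain dict keeps the check easy to test.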

### Warm Session Store

- `DOMVAULT_SESSION_STORE_ROOT`
- `DOMVAULT_SESSION_STORE_ENABLED`
- `DOMVAULT_SESSION_STORE_MAX_AGE_HOURS`

## Operational Notes

- Protected captures are expected to run with a real proxy strategy.
- Solver credentials must be injected through environment variables.
- Warm cache reuse is domain-scoped and identity-scoped.
- Unresolved or poisoned sessions are not silently reused.
- The package is strictly typed and validated with `mypy`, `ruff`, and `pytest`.

## Crawl4AI Worker

The isolated Crawl4AI worker is intentionally not bundled into the main runtime
environment because of dependency constraints around `lxml`. If you need the offline
worker, install `requirements-crawl4ai.txt` into a separate virtual environment and
set `DOMVAULT_CRAWL4AI_PYTHON` to that interpreter path.

## CLI

The package also exposes a CLI entrypoint:

```bash
domvault clone https://example.com --selector main --output _scraped_raw/example
```

The CLI is a thin wrapper around the same extraction pipeline used by the Python API.

## Release Standard

DomVault 3.0 is packaged with:

- a typed public API
- exact dependency pins
- manifest-first extraction outputs
- identity-aware challenge handling
- warm-session persistence for repeat domains

This package is meant for deterministic, explainable extraction under real production
pressure, not just best-effort scraping.

