Metadata-Version: 2.4
Name: webweavex
Version: 2.0.0
Summary: Deterministic universal runtime extraction infrastructure with replay-safe graphs and Kaalka persistence
Author: Piyush Mishra
License-Expression: Apache-2.0
Project-URL: Homepage, https://pypi.org/project/webweavex/
Project-URL: Repository, https://github.com/ni-sh-a-char/WebWeaveX
Project-URL: Documentation, https://github.com/ni-sh-a-char/WebWeaveX#readme
Project-URL: Issues, https://github.com/ni-sh-a-char/WebWeaveX/issues
Keywords: extraction,web-scraping,semantic-analysis,runtime-cognition,deterministic,browser-automation
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
License-File: AUTHORS
Requires-Dist: requests>=2.28.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=5.2.0
Requires-Dist: markdownify>=0.12.0
Requires-Dist: pypdf>=4.2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Provides-Extra: browser
Requires-Dist: playwright>=1.40.0; extra == "browser"
Provides-Extra: native
Provides-Extra: ocr
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Requires-Dist: Pillow>=10.0.0; extra == "ocr"
Provides-Extra: distributed
Provides-Extra: connectors
Provides-Extra: parsers
Requires-Dist: tree-sitter-languages>=1.10.0; extra == "parsers"
Provides-Extra: ingestion
Requires-Dist: python-docx>=1.0.0; extra == "ingestion"
Requires-Dist: pytesseract>=0.3.10; extra == "ingestion"
Requires-Dist: Pillow>=10.0.0; extra == "ingestion"
Provides-Extra: llm
Requires-Dist: groq>=0.9.0; extra == "llm"
Provides-Extra: full
Requires-Dist: playwright>=1.40.0; extra == "full"
Requires-Dist: tree-sitter-languages>=1.10.0; extra == "full"
Requires-Dist: python-docx>=1.0.0; extra == "full"
Requires-Dist: pytesseract>=0.3.10; extra == "full"
Requires-Dist: Pillow>=10.0.0; extra == "full"
Requires-Dist: groq>=0.9.0; extra == "full"
Dynamic: license-file

<p align="center">
  <br/>
  <img src="https://img.shields.io/badge/WebWeaveX-v2.0.0-0f172a?style=for-the-badge&logo=python&logoColor=white" alt="WebWeaveX v2.0.0"/>
  <br/><br/>
  <strong>Deterministic runtime extraction and replay-safe operational cognition infrastructure</strong>
  <br/><br/>
</p>

<p align="center">
  <a href="https://pypi.org/project/webweavex/"><img src="https://img.shields.io/pypi/v/webweavex?style=flat-square&logo=pypi&logoColor=white" alt="PyPI version"/></a>
  <img src="https://img.shields.io/badge/Python-3.10+-3776AB?style=flat-square&logo=python&logoColor=white" alt="Python 3.10+"/>
  <a href="LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-2EA44F?style=flat-square" alt="Apache 2.0"/></a>
  <img src="https://img.shields.io/badge/tests-760%2B%20passing-22c55e?style=flat-square" alt="Tests passing"/>
  <img src="https://img.shields.io/badge/coverage-90%25%2B%20scoped-6366f1?style=flat-square" alt="Coverage 90%+"/>
  <img src="https://img.shields.io/badge/build-passing-22c55e?style=flat-square" alt="Build passing"/>
  <img src="https://img.shields.io/badge/deterministic%20runtime-0ea5e9?style=flat-square" alt="Deterministic runtime"/>
  <img src="https://img.shields.io/badge/replay--safe-14b8a6?style=flat-square" alt="Replay-safe"/>
  <img src="https://img.shields.io/badge/Kaalka-verified-7c3aed?style=flat-square" alt="Kaalka verified"/>
  <img src="https://img.shields.io/badge/production%20ready-15803d?style=flat-square" alt="Production ready"/>
  <img src="https://img.shields.io/badge/OSS-infrastructure-64748b?style=flat-square" alt="Open Source"/>
</p>

<p align="center">
  <a href="https://buymeacoffee.com/piyushmishra00"><img src="https://img.shields.io/badge/Buy%20Me%20a%20Coffee-Support%20WebWeaveX-FFDD00?style=for-the-badge&logo=buy-me-a-coffee&logoColor=black" alt="Buy Me a Coffee"/></a>
</p>


---

## Contents

- [What is WebWeaveX?](#what-is-webweavex)
- [What WebWeaveX is NOT](#what-webweavex-is-not)
- [Why existing systems fail](#why-existing-systems-fail)
- [Core capabilities](#core-capabilities)
- [Authenticated runtime continuation](#authenticated-runtime-continuation)
- [Architecture](#architecture)
- [Canonical pipeline](#canonical-pipeline)
- [Quick start](#quick-start)
- [Code examples](#real-code-examples)
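- [Determinism](#determinism)
- [Reconstruction engine](#reconstruction-engine)
- [Validation](#real-validation)
- [Security](#security-model)
- [Architecture guarantees](#architecture-guarantees)
- [Repository structure](#repository-structure)
- [Contributing](#contributing)
- [Roadmap](#roadmap)
- [License](#license)
- [Final positioning](#final-positioning)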

---

## What is WebWeaveX?

**WebWeaveX** is deterministic **runtime extraction and operational cognition infrastructure**. It captures how software actually runs—browser DOM, authenticated sessions, Electron state, native UI, workflows, and connector surfaces—and compiles that into **replay-safe runtime graphs** with **Kaalka-encrypted persistence**.

### Why it exists

Modern systems are **authenticated**, **stateful**, **runtime-driven**, **SPA-based**, **Electron-based**, **synchronized**, and **operationally dynamic**. Operators need continuity across runs, not another HTML snapshot.

Traditional extraction falls short in the following ways:

| Failure mode | Consequence |
|--------------|-------------|
| HTML-only parsing | Misses hydration, storage, IPC, native UI |
| Stateless requests | Loses session and workflow continuity |
| No authenticated persistence | Re-login and drift between runs |
| No replay contract | Cannot prove equivalence after rebuild |
| No reconstruction | Cannot rebuild operational topology from IR |
| Weak SPA/Electron support | Unstable IDs, routes, and storage break diffs |

WebWeaveX exists to deliver **deterministic runtime extraction** and **replay-safe operational reconstruction** through one **canonical pipeline**.

---

## What WebWeaveX is NOT

WebWeaveX is **not**:

| Category | Clarification |
|----------|----------------|
| **Auth bypass tooling** | Does not defeat MFA, CAPTCHA, or login controls |
| **Malware or exploit infrastructure** | Not designed for unauthorized access |
| **Credential theft tooling** | Does not harvest secrets you do not already hold |
| **CAPTCHA bypass software** | No circumvention of bot defenses |
| **Browser exploitation tooling** | Not a vulnerability framework |
| **AGI or “autonomous hacking”** | No probabilistic agent that “figures out” sites |
| **Hacking infrastructure** | No unauthorized intrusion features |
| **An LLM wrapper** | Core path is deterministic; optional plugins fail safe |
| **A chatbot** | Infrastructure library, not conversational AI |

WebWeaveX only operates on **authorized authenticated runtimes** and data **you explicitly provide**.

---

## Why existing systems fail

| System | Strength | Limitation for operational runtime |
|--------|----------|-----------------------------------|
| **BeautifulSoup** | Fast static HTML parse | No live session, storage, or runtime graph |
| **Selenium** | Browser automation | No unified IR, Kaalka fabric, or replay equivalence layer |
| **Playwright** | Reliable browser control | Automation driver—not extraction + memory + reconstruction |
| **Puppeteer** | Chromium scripting | Same gap: no federated sync or deterministic checkpoints |
| **Traditional crawlers** | Scale on public pages | Stateless; poor on authenticated SPAs |
| **Generic AI agents** | Flexible tasks | Probabilistic; weak replay and audit guarantees |

Common gaps WebWeaveX addresses:

- Lack of **runtime continuity** across processes
- Lack of **replay** and fingerprint equivalence
- Lack of **authenticated persistence** (encrypted, deterministic)
- Lack of **reconstruction** from structured IR
- Lack of **synchronization** between browser, semantic, workflow, and memory layers

---

## Core capabilities

| Capability | Description |
|------------|-------------|
| **Browser runtime extraction** | Bounded Playwright capture, network/session envelopes |
| **SPA stabilization** | DOM and route stabilization for framework noise |
| **Electron extraction** | Routes, IPC, storage metadata, deterministic Electron hash |
| **Native runtime cognition** | Desktop, terminal, VM, remote (graceful OS fallbacks) |
| **Terminal runtime** | Shell-oriented cognition fixtures |
| **Distributed extraction** | Autonomous workers + Kaalka checkpoints |
| **Runtime causality** | Event chains and propagation in extraction fabrics |
| **Semantic cognition** | Entities, ontology, semantic graphs |
| **Workflow runtime** | Plans, objectives, workflow memory |
| **Synchronization runtime** | Multi-source runtime alignment |
| **Reconstruction engine** | Replay-safe rebuild from IR |
| **Federated memory** | Deterministic merge and stable hashes |
| **Execution sandbox** | Allowlisted actions only |
| **Runtime replay** | `validate_replay_equivalence()` |
| **Runtime graph** | Normalized universal runtime graph |
| **Deterministic fingerprints** | Global and pipeline hashes |
| **Authenticated runtime continuation** | Encrypted session reload |
| **Kaalka deterministic encryption** | Stable ciphertext; cross-language vectors |
| **Connector runtime fabric** | Database, API, container, K8s, telemetry (bounded) |

---

## Authenticated runtime continuation

Modern applications authenticate with **cookies**, **localStorage**, **sessionStorage**, **tokens**, **runtime identity**, and **cross-navigation continuity**. Electron adds **IndexedDB metadata**, **IPC**, and **route state**. Multi-tab products add **synchronization state** across surfaces.

WebWeaveX supports:

- **Encrypted authenticated session persistence** (`save_encrypted_session`, session paths on `extract_web`)
- **Runtime continuation** across extractions when you supply the same Kaalka key and session file
- **Deterministic replay-safe reconstruction** of operational graphs from IR

Persistence uses **Kaalka deterministic encryption** (`algorithm: kaalka`)—not plaintext JSON checkpoints on disk.

| Stored surface | Mechanism |
|----------------|-----------|
| Cookies / headers | Encrypted session store |
| Browser snapshot | Session + identity engines |
| Electron storage | Native/Electron cognition (bounded) |
| Workflow / sync state | Kaalka checkpoint engines |

**WebWeaveX does not:** bypass auth, defeat MFA, bypass security controls, or access systems without authorization.

**WebWeaveX only operates on authorized authenticated runtimes explicitly provided by the user.**

```python
from webweavex import extract_web

result = extract_web(
    "https://app.example.com/dashboard",
    authenticated=True,
    session_path="./session.kaalka",
    encryption_key="your-kaalka-master-key",
)
```

---

## Architecture

```
                              ┌──────────────────┐
                              │      Input       │
                              │  UniversalInput  │
                              └────────┬─────────┘
                                       │
                                       ▼
                              ┌───────────────────┐
                              │ Canonical Pipeline│
                              │ run_canonical_    │
                              │   pipeline()      │
                              └────────┬──────────┘
                                       │
                                       ▼
                              ┌───────────────────┐
                              │ Runtime Cognition │
                              │ web·native·repo   │
                              └────────┬──────────┘
                                       │
           ┌───────────────────────────┼───────────────────────────┐
           ▼                           ▼                           ▼
    ┌─────────────┐            ┌─────────────┐            ┌─────────────┐
    │  Semantic   │            │  Causality  │            │  Workflow   │
    │   Layer     │            │   Layer     │            │  Runtime    │
    └──────┬──────┘            └──────┬──────┘            └──────┬──────┘
           │                          │                          │
           └──────────────────────────┼──────────────────────────┘
                                      ▼
                             ┌─────────────────┐
                             │ Synchronization │
                             │    Runtime      │
                             └────────┬────────┘
                                      ▼
                             ┌─────────────────┐
                             │ Federated Memory│
                             └────────┬────────┘
                                      ▼
                             ┌─────────────────┐
                             │ Execution Fabric│
                             └────────┬────────┘
                                      ▼
                             ┌─────────────────┐
                             │ Reconstruction  │
                             │    Engine       │
                             └────────┬────────┘
                                      ▼
                             ┌──────────────────┐
                             │ Universal Runtime│
                             │     Graph        │
                             └──────────────────┘
```

Source: [`core/kernel/runtime_pipeline.py`](core/kernel/runtime_pipeline.py)

---

## Canonical pipeline

All production execution goes through a single path, `run_canonical_pipeline()`; there are no shadow orchestrators.

```python
from webweavex import UniversalInput, run_canonical_pipeline

result = run_canonical_pipeline(
    UniversalInput(source="https://example.com", source_type="web"),
)

print(result["pipeline_hash"])
print(len(result["unified_runtime_graph"].get("nodes", [])))
```

| Property | Detail |
|----------|--------|
| Single execution path | `run_canonical_pipeline()` only |
| Deterministic normalization | `RuntimeGraphContract.normalize()` |
| Replay-safe runtime | Fingerprint at pipeline boundary |
| Canonical IR generation | Per-kind extraction → kernel phases |

---

## Quick start

```bash
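pip install webweavex             # core package only
pip install "webweavex[browser]"  # adds Playwright for browser extraction
pip install "webweavex[full]"     # browser, parsers, ingestion, OCR, and LLM extras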
```

```bash
python -c "import webweavex; print(webweavex.__version__)"
# 2.0.0
```

---

## Real code examples

<details>
<summary><strong>Browser, auth, replay, semantic, reconstruction, distributed, native</strong></summary>

### Browser extraction

```python
from webweavex import extract_web, compute_global_runtime_fingerprint

out = extract_web("https://example.com")
print(out.get("bounded"), compute_global_runtime_fingerprint(out))
```

### Authenticated runtime persistence

```python
from webweavex import save_encrypted_session, extract_web

save_encrypted_session(
    "./session.kaalka",
    {"cookies": [], "headers": {}, "auth_tokens": []},
    "your-kaalka-master-key",
)

out = extract_web(
    "https://app.example.com",
    authenticated=True,
    session_path="./session.kaalka",
    encryption_key="your-kaalka-master-key",
)
```

Runnable: [`examples/authenticated_extraction.py`](examples/authenticated_extraction.py)

### Replay equivalence

```python
from webweavex import validate_replay_equivalence

# `original` and `replayed` are placeholders for two runtime results captured earlier.
assert validate_replay_equivalence(original, replayed)["equivalent"]
```
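
As a rough illustration of what a check of this kind might compare, the sketch below tests two graphs for matching node sets, matching topology, and a matching fingerprint. It is not the implementation of `validate_replay_equivalence()`; `check_equivalence`, `fingerprint`, and the graph shape are hypothetical, and only the `"equivalent"` key mirrors the call above.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, compact separators)."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def check_equivalence(original: dict, replayed: dict) -> dict:
    """Compare two graphs by node set, edge set, and combined fingerprint."""

    def node_ids(graph):
        return sorted(n["id"] for n in graph.get("nodes", []))

    def edge_pairs(graph):
        return sorted([e["source"], e["target"]] for e in graph.get("edges", []))

    checks = {
        "nodes": node_ids(original) == node_ids(replayed),
        "topology": edge_pairs(original) == edge_pairs(replayed),
        "fingerprint": fingerprint([node_ids(original), edge_pairs(original)])
        == fingerprint([node_ids(replayed), edge_pairs(replayed)]),
    }
    return {"equivalent": all(checks.values()), "checks": checks}


first = {"nodes": [{"id": "a"}, {"id": "b"}], "edges": [{"source": "a", "target": "b"}]}
second = {"nodes": [{"id": "b"}, {"id": "a"}], "edges": [{"source": "a", "target": "b"}]}
drifted = {"nodes": [{"id": "a"}], "edges": []}

print(check_equivalence(first, second)["equivalent"])   # True
print(check_equivalence(first, drifted)["equivalent"])  # False
```

Returning the individual checks alongside the verdict makes it easier to see which property drifted when a replay fails.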

### Semantic runtime

```python
from webweavex import extract_web
out = extract_web("https://example.com", semantic_runtime=True)
```

### Reconstruction

```python
from webweavex import run_reconstruction_runtime

# `prior` is a placeholder for an extraction result captured earlier.
rebuilt = run_reconstruction_runtime(
    sources={"extraction": prior},
    runtime_type="browser",
)
```

### Distributed extraction

```python
from webweavex import run_autonomous_extraction

out = run_autonomous_extraction(
    tasks=[{"task_id": "t1", "url": "https://example.com", "priority": 0}],
)
```

### Native extraction

```python
from webweavex import extract_native

out = extract_native(runtime="desktop", application="notepad")
```

</details>

---

## Determinism

| Mechanism | Role |
|-----------|------|
| `compute_global_runtime_fingerprint()` | Cross-run runtime digest |
| `validate_replay_equivalence()` | Graph + fingerprint + topology checks |
| `compute_stable_dom_hash()` | DOM meaning stable under attribute noise |
| SPA stabilizer | Framework route/state freeze |
| `stable_memory_hash()` | Ordered federated memory merge |
| Kaalka `encrypt_value` | Identical plaintext + key → identical ciphertext |

**Python ↔ JS consistency:** reference vectors in `validation/kaalka_cross_language/` validate hash and encrypt stability across runtimes.

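**Limitation:** determinism applies to captured input, not to the live site. Two live fetches of a dynamic SPA may return different bytes and therefore different hashes; identical captured bytes always yield identical stabilized hashes.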
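
The idea of a DOM hash that is stable under attribute noise can be sketched with the standard library alone. This is not `compute_stable_dom_hash()`; the class `StableDomDigest`, the function `stable_dom_hash`, and the choice of which attributes count as noise are assumptions for illustration. The sketch hashes tags, a filtered and sorted attribute list, and whitespace-normalized text.

```python
import hashlib
from html.parser import HTMLParser

# Assumption: attributes treated as noise for this illustration.
VOLATILE_ATTRS = {"nonce", "style"}


class StableDomDigest(HTMLParser):
    """Feed tags, kept attributes, and normalized text into a SHA-256 digest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hash = hashlib.sha256()

    def handle_starttag(self, tag, attrs):
        # Drop volatile attributes and sort the rest so order does not matter.
        kept = sorted(
            (name, value or "")
            for name, value in attrs
            if name not in VOLATILE_ATTRS and not name.startswith("data-")
        )
        self._hash.update(repr(("open", tag, kept)).encode("utf-8"))

    def handle_endtag(self, tag):
        self._hash.update(repr(("close", tag)).encode("utf-8"))

    def handle_data(self, data):
        # Collapse runs of whitespace; skip whitespace-only text nodes.
        text = " ".join(data.split())
        if text:
            self._hash.update(repr(("text", text)).encode("utf-8"))

    def hexdigest(self) -> str:
        return self._hash.hexdigest()


def stable_dom_hash(html: str) -> str:
    parser = StableDomDigest()
    parser.feed(html)
    parser.close()
    return parser.hexdigest()


a = '<div class="card" data-reactid="17"><p>Total: 42</p></div>'
b = '<div data-reactid="93" class="card">\n  <p>Total:   42</p>\n</div>'
c = '<div class="card"><p>Total: 43</p></div>'

print(stable_dom_hash(a) == stable_dom_hash(b))  # True
print(stable_dom_hash(a) == stable_dom_hash(c))  # False
```

Snippets `a` and `b` differ only in a framework-generated attribute, attribute order, and whitespace, so they hash alike; `c` changes visible text and does not.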

---

## Reconstruction engine

WebWeaveX reconstructs **operational structure** from runtime IR:

- Runtime topology and unified graphs
- Workflow and application memory views
- Browser/application state envelopes
- Semantic operational graphs

| Property | Meaning |
|----------|---------|
| Runtime reconstruction | IR → bounded runtime view |
| Operational graph rebuilding | Normalized nodes/edges |
| Replay-safe reconstruction | Tested equivalence paths |
| Deterministic recreation | Sorted, canonical structures |

This is **not** full machine cloning or sci-fi simulation—it is **auditable operational recreation** for engineering workflows.
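
To make "deterministic recreation" concrete, here is a minimal sketch that rebuilds a sorted adjacency view from flat IR records. The record format and the `rebuild_topology` helper are hypothetical and are not the `run_reconstruction_runtime` API; the `bounded` flag and node cap only echo the bounded-contract idea described in this README.

```python
def rebuild_topology(ir_records: list[dict], max_nodes: int = 1000) -> dict:
    """Rebuild a sorted adjacency view from flat IR records, capped at max_nodes."""
    node_ids = sorted({r["id"] for r in ir_records if r["kind"] == "node"})[:max_nodes]
    kept = set(node_ids)
    adjacency = {node_id: [] for node_id in node_ids}
    for record in ir_records:
        if (
            record["kind"] == "edge"
            and record["source"] in kept
            and record["target"] in kept
        ):
            adjacency[record["source"]].append(record["target"])
    return {
        "bounded": True,
        "nodes": node_ids,
        # Deduplicate and sort targets so record order cannot leak into the output.
        "adjacency": {k: sorted(set(v)) for k, v in adjacency.items()},
    }


# Hypothetical IR records, deliberately out of order.
ir = [
    {"kind": "edge", "source": "login", "target": "dashboard"},
    {"kind": "node", "id": "dashboard"},
    {"kind": "node", "id": "login"},
    {"kind": "edge", "source": "dashboard", "target": "settings"},
    {"kind": "node", "id": "settings"},
]

view = rebuild_topology(ir)
print(view["adjacency"])
# {'dashboard': ['settings'], 'login': ['dashboard'], 'settings': []}
print(view == rebuild_topology(list(reversed(ir))))  # True
```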

---

## Real validation

<details>
<summary><strong>Validation commands and CI gates</strong></summary>

| Metric | Value |
|--------|--------|
| Tests | **760+ passing** (`pytest -q`) |
| Scoped coverage | **≥ 90%** (production packages in `pyproject.toml`) |
| Wheel | `webweavex-2.0.0-py3-none-any.whl` |
| Replay | `validate_replay_equivalence` suite |
| Determinism | Kaalka cross-language + fingerprint tests |
| Playwright | Browser extraction paths (optional extra) |
| Native | Orchestrator + platform fallbacks |
| Distributed | Autonomous extraction tests |

```bash
pytest -q
python -m build
python validation/final_production_master.py
```

</details>

---

## Security model

| Control | Implementation |
|---------|----------------|
| Allowlisted execution | `core/execution/` sandbox |
| No arbitrary eval/exec | Forbidden in production paths |
| Sandboxed runtime | Bounded simulate/rollback |
| Deterministic persistence | Kaalka-only checkpoints |
| Encrypted memory/session | `encrypt_value`, session wrappers |
| Replay-safe recovery | Deterministic reload envelopes |

See [SECURITY.md](SECURITY.md). Report issues responsibly.
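
Allowlisted execution can be sketched as a fixed dispatch table: a request names an action, and anything not in the table is rejected rather than evaluated. The action names and the `run_action` helper below are hypothetical and do not describe the contents of `core/execution/`.

```python
from typing import Callable

# Hypothetical action table: only names listed here can ever run.
ALLOWED_ACTIONS: dict[str, Callable[[dict], dict]] = {
    "read_title": lambda state: {"title": state.get("title", "")},
    "count_links": lambda state: {"links": len(state.get("links", []))},
}


def run_action(name: str, state: dict) -> dict:
    """Dispatch an action by name; unknown names are rejected, never evaluated."""
    handler = ALLOWED_ACTIONS.get(name)
    if handler is None:
        return {"ok": False, "error": f"action not allowlisted: {name}"}
    # Pass a shallow copy so a handler cannot rebind keys in the caller's state.
    return {"ok": True, "result": handler(dict(state))}


state = {"title": "Dashboard", "links": ["/a", "/b"]}

print(run_action("count_links", state))
# {'ok': True, 'result': {'links': 2}}
print(run_action("__import__('os').system('id')", state)["ok"])  # False
```

Because the request string is only ever used as a dictionary key, there is no path from user input to `eval` or `exec`.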

---

## Architecture guarantees

| Guarantee | How |
|-----------|-----|
| **Deterministic outputs** | Canonical ordering, stable hashes |
| **Replay-safe persistence** | Kaalka + equivalence validation |
| **Bounded execution** | Explicit `bounded: True` contracts |
| **Graceful degradation** | Playwright/native/connectors fail soft |
| **Canonical normalization** | Graph and DOM contracts |
| **Stable graph generation** | `build_runtime_graph` + normalize |
| **Cross-language consistency** | Kaalka reference vectors |

Contract document: [WEBWEAVEX_v2_ARCHITECTURE_LOCK_REPORT.md](WEBWEAVEX_v2_ARCHITECTURE_LOCK_REPORT.md)
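
The graceful-degradation guarantee can be pictured as a soft fallback when an optional extra is not installed. The sketch below is an assumption about the general pattern, not WebWeaveX's code: `pick_backend`, `capture`, and the `"browser"` and `"static"` labels are illustrative names.

```python
import importlib.util


def pick_backend(optional_module: str = "playwright") -> str:
    """Return "browser" if the optional module is importable, else "static"."""
    if importlib.util.find_spec(optional_module) is not None:
        return "browser"
    return "static"


def capture(url: str, optional_module: str = "playwright") -> dict:
    """Describe a bounded capture plan without raising when an extra is missing."""
    backend = pick_backend(optional_module)
    return {
        "url": url,
        "backend": backend,
        "degraded": backend == "static",
        "bounded": True,
    }


# A module name that is not installed triggers the soft fallback.
print(capture("https://example.com", optional_module="no_such_module_demo"))
```

Checking availability with `importlib.util.find_spec` avoids importing the heavy dependency just to learn whether it exists.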

---

## Repository structure

```
WebWeaveX/
├── core/           # Runtime infrastructure (kernel, browser, memory, sync, …)
├── webweavex/      # Public Python package
├── tests/          # 760+ tests
├── docs/           # Architecture, API, security, Kaalka, replay, validation
├── examples/       # Runnable scripts
├── validation/     # Production and real-world validators
└── .github/        # CI, templates, code of conduct, funding
```

| Package | Role |
|---------|------|
| `core/kernel/` | Canonical pipeline, `RuntimeKernel` |
| `core/browser/` | Web extraction, DOM/SPA stabilization |
| `core/crypto/` | Kaalka engines |
| `core/memory/` | Federated memory fabric |
| `core/synchronization/` | Sync runtime |
| `core/reconstruction/` | Reconstruction orchestrator |
| `core/replay/` | Replay equivalence |
| `webweavex/` | Stable public API |

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) and [.github/CODE_OF_CONDUCT.md](.github/CODE_OF_CONDUCT.md).

| Rule | Requirement |
|------|-------------|
| Determinism | No `random` / `uuid4` in runtime paths |
| Replay safety | Preserve graph normalization semantics |
| Canonical pipeline | No parallel mega-orchestrators |
| Persistence | Kaalka for new checkpoints |
| Tests | `pytest -q` must pass; coverage gate ≥ 90% scoped |
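
One way to satisfy the determinism rule is to derive identifiers from content instead of calling `uuid4`. The sketch below uses the standard library's `uuid.uuid5`; the namespace URL and the `node_id` helper are hypothetical and are not part of the WebWeaveX API.

```python
import uuid

# Hypothetical namespace for this illustration; any fixed UUID works.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/runtime-graph")


def node_id(kind: str, locator: str) -> str:
    """Derive an ID from content, so reruns over the same input agree."""
    return str(uuid.uuid5(DEMO_NAMESPACE, f"{kind}:{locator}"))


first_run = [node_id("route", "/dashboard"), node_id("form", "#login")]
second_run = [node_id("route", "/dashboard"), node_id("form", "#login")]

print(first_run == second_run)  # True
```

A `uuid4`-based ID would differ on every run, which would change graph fingerprints even when the captured input is identical.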

---

## Roadmap

See [ROADMAP.md](ROADMAP.md).

**v2.1 focus:**

- Deeper native bindings (UIA, AX, AT-SPI)
- Distributed runtime infrastructure hardening
- Stronger SPA normalization
- Real connector runtimes (live Postgres, Redis, K8s validation)
- Native OS integrations behind optional extras

---

## License

Apache 2.0 — see [LICENSE](LICENSE).

---

## Final positioning

**WebWeaveX is an attempt to build deterministic runtime cognition infrastructure for the authenticated operational web**—where extraction means encrypted continuity, structured graphs, replay equivalence, and reconstruction, not disposable HTML dumps.

If this work helps your team, consider supporting it:

[![Buy Me a Coffee](https://img.shields.io/badge/Buy%20Me%20a%20Coffee-piyushmishra00-FFDD00?style=for-the-badge&logo=buy-me-a-coffee&logoColor=black)](https://buymeacoffee.com/piyushmishra00)

---

<p align="center">
  <sub>Documentation · <a href="docs/README.md">docs/</a> · <a href="examples/README.md">examples/</a> · <a href="WEBWEAVEX_v2_RELEASE_REPORT.md">release report</a></sub>
</p>
