Metadata-Version: 2.4
Name: site-mapper-agents
Version: 0.1.0
Summary: LLM-driven self-healing API discovery for undocumented SaaS portals via CDP
Project-URL: Homepage, https://github.com/axumquant/site-mapper-agents
Project-URL: Repository, https://github.com/axumquant/site-mapper-agents
Project-URL: Issues, https://github.com/axumquant/site-mapper-agents/issues
Author: axumquant
License-Expression: MIT
License-File: LICENSE
Keywords: api-discovery,cdp,llm,pydantic-ai,scraping,self-healing,site-mapping
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic-ai>=0.0.10
Requires-Dist: pydantic>=2.5
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Description-Content-Type: text/markdown

# site-mapper-agents

**LLM-once API discovery + self-healing extraction for any browser-accessible portal.**

Burst-record CDP network traffic from a portal you have a browser session on,
hand it to a three-agent team, get back a typed schema + signatures you can
extract from forever — with auto-repair when the portal's API shape drifts.

[![PyPI](https://img.shields.io/pypi/v/site-mapper-agents.svg)](https://pypi.org/project/site-mapper-agents/)
![Python](https://img.shields.io/badge/python-3.11%2B-blue)
![License](https://img.shields.io/badge/license-MIT-green)

---

## The problem

Every SaaS portal has a different API. Writing extractors for each is a
treadmill — and the schemas change without warning, so your extractors
silently break.

Pre-built connectors only cover the top 20 platforms. For everything else
(internal CRMs, niche-vertical tools, undocumented partner portals) you
either pay someone to reverse-engineer the API, or you give up and scrape
the DOM.

This library is the third option.

## What this solves

1. **Onboarding**: you point the system at a portal you have a real browser
   session on. It records a burst of CDP network traffic while you click
   around, then asks an LLM **once** to classify which endpoints carry the
   data you want and to map response JSON keys to your fields. Output is a
   typed `SiteSchema` and a list of `NetworkSignature` patterns.
2. **Extraction**: from that point forward, every CDP event is matched
   against the saved signatures with pure Pydantic validation —
   sub-millisecond, no LLM calls, no cost.
3. **Self-healing**: when the portal changes its response shape, an
   `ExtractionFailed` event fires. The Healer compares the old key map
   against the new response, fixes what it can deterministically, and
   asks the LLM to semantically match the rest. Confident patches
   auto-apply. Borderline patches surface for human review.

## The three agents

```
                   ┌──────────────────────────────────────────────┐
                   │  Browser session → CDP forwarder → events    │
                   └──────────────────────┬───────────────────────┘
                                          │
   ┌──────────────────────┐               │              ┌───────────────────────┐
   │     Architect        │ ◀── once ──── │ ──── live ──▶│     Eavesdropper      │
   │  (LLM classifies     │               │              │  (Pydantic only,      │
   │   endpoints, builds  │               │              │   sub-ms hot path)    │
   │   SiteSchema +       │               │              │                       │
   │   signatures)        │               │              │ emits ExtractionResult│
   └──────────────────────┘               │              │   or ExtractionFailed │
              │                           │              └───────────┬───────────┘
              ▼                           │                          │
   ╔══════════════════════╗               │              ┌───────────▼───────────┐
   ║   MappedSite +       ║◀──── heals ───┼──────────────│       Healer          │
   ║   NetworkSignatures  ║               │              │  (LLM re-maps stale   │
   ╚══════════════════════╝               │              │   keys, auto-applies  │
                                          │              │   confident patches)  │
                                          │              └───────────────────────┘
```

- **Architect** — runs once. Expensive. Produces the schema.
- **Eavesdropper** — runs on every event. Free. Pure validation.
- **Healer** — runs only on failures. Costs nothing when nothing breaks.

## Install

```bash
pip install site-mapper-agents
```

For the runnable examples you'll also want a pydantic-ai provider:

```bash
pip install 'pydantic-ai[anthropic]'   # or [openai]; Ollama is reached through the OpenAI-compatible client
```

## Quickstart

```python
import asyncio
from pydantic_ai.models.test import TestModel

from site_mapper_agents import (
    Architect,
    CDPNetworkEvent,
    Eavesdropper,
    TargetField,
    UserIntent,
)

# 1. Tell the system what you want to extract.
intent = UserIntent(
    description="Customer account details",
    target_fields=[
        TargetField(name="account_id", description="Account UUID"),
        TargetField(name="email", description="Primary contact email"),
    ],
)

# 2. Construct the Architect. Replace TestModel with a real provider.
architect = Architect(model=TestModel())  # or AnthropicModel("claude-sonnet-4-5")

# 3. Feed it a burst of CDP traffic (your forwarder produced these).
architect.record_traffic(CDPNetworkEvent(
    request_id="r1",
    url="https://crm.example.com/api/v2/accounts/42",
    method="GET",
    body={"data": {"client": {"id": "acct_42", "email": "ada@example.com"}}},
))

# 4. Ask the Architect to propose a schema.
async def onboard():
    proposal = await architect.propose(
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    site = architect.build_mapped_site(
        proposal=proposal,
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    return site

site = asyncio.run(onboard())

# 5. From now on, every live CDP event runs through the Eavesdropper.
eaves = Eavesdropper()
result, event = eaves.ingest(
    CDPNetworkEvent(
        request_id="r2",
        url="https://crm.example.com/api/v2/accounts/99",
        method="GET",
        body={"data": {"client": {"id": "acct_99", "email": "g@example.com"}}},
    ),
    sites=[site],
)
print(result.data_payload if result else "no match")
```

## API reference

### `Architect(model=None, vocabulary=None, policy=DEFAULT_ONBOARDING_POLICY, model_settings=None)`

The onboarding agent. LLM-once.

| Parameter        | Type                        | Notes                                         |
| ---------------- | --------------------------- | --------------------------------------------- |
| `model`          | `pydantic_ai.Model \| None` | Any pydantic-ai model. `None` → heuristic.    |
| `vocabulary`     | `list[EndpointType] \| None`| Caller-supplied classifications. See below.   |
| `policy`         | `OnboardingPolicy`          | Sample-count thresholds.                      |
| `model_settings` | `ModelSettings \| None`     | max_tokens, temperature, etc.                 |

**Methods:**

- `record_traffic(event)` — buffer a CDP event during onboarding.
- `record_click()` — mark that the user clicked something.
- `has_enough_samples()` → `bool` — policy check.
- `detect_endpoints()` → `list[DetectedEndpoint]` — deterministic
  pre-processing.
- `await propose(*, target_url, user_intent, llm_classify=None)` →
  `ArchitectProposal` — the main entry point.
- `build_mapped_site(*, proposal, target_url, user_intent)` →
  `MappedSite` — promote an approved proposal to an active site.
- `emit_event(site, *, success=True, reason="")` → `SiteMapped |
  OnboardingFailed`.
- `reset()` — clear buffers for the next onboarding session.

### `Eavesdropper(policy=DEFAULT_EXTRACTION_POLICY)`

The runtime agent. No LLM. Pure Pydantic validation.

**Methods:**

- `ingest(event, sites)` → `(ExtractionResult | None, ExtractionSucceeded | ExtractionFailed | None)`.

### `Healer(model=None, policy=DEFAULT_HEALING_POLICY, model_settings=None)`

The self-healing agent.

**Methods:**

- `await diagnose(*, site, failed_event, new_response_body=None, llm_semantic_match=None)` →
  `HealerPatch`.
- `apply_patch(site, patch)` → `(bool, SchemaHealed | HealingFailed | SiteDegraded)`.

### Models

| Class               | Purpose                                                                 |
| ------------------- | ----------------------------------------------------------------------- |
| `CDPNetworkEvent`   | One captured network response. Library input.                           |
| `TargetField`       | One data point the caller wants extracted.                              |
| `UserIntent`        | A bundle of target fields with a human description.                     |
| `EndpointType`      | One entry in the Architect's classification vocabulary.                 |
| `DetectedEndpoint`  | Pre-LLM view of a unique endpoint.                                      |
| `NetworkSignature`  | URL pattern + JSON-key map. Saved per site.                             |
| `SiteSchema`        | The extraction contract for one intent.                                 |
| `ArchitectProposal` | Architect's structured output before user confirms.                     |
| `HealerPatch`       | Healer's structured output for one repair attempt.                      |
| `MappedSite`        | Aggregate root — schemas + signatures + status.                         |
| `ExtractionResult`  | Eavesdropper's output for one matched event.                            |

### Domain events

`SiteMapped`, `OnboardingFailed`, `ExtractionSucceeded`,
`ExtractionFailed`, `SchemaHealed`, `HealingFailed`, `SiteDegraded`.

All extend `AutomationEvent` (frozen Pydantic model).

## Endpoint vocabularies

The Architect's LLM prompt embeds a list of `EndpointType` definitions
that tell the model "you may only classify endpoints into one of these
categories". The default vocabulary covers generic CRUD shapes:

| name              | what it means                                                  |
| ----------------- | -------------------------------------------------------------- |
| `list_records`    | Paginated list of records (grid/table views).                  |
| `detail_view`     | One record's full detail (after click-through).                |
| `search`          | Filtered records based on user query.                          |
| `create_record`   | POST/PUT that creates a new record.                            |
| `update_record`   | PATCH/PUT that mutates an existing record.                     |
| `delete_record`   | DELETE.                                                        |
| `reference_data`  | Lookup / enum / config data.                                   |
| `metrics`         | Dashboard counts/aggregates.                                   |
| `unknown`         | Fallback when nothing fits.                                    |

You'll usually want to extend this with site-specific categories:

```python
from site_mapper_agents import (
    Architect,
    default_vocabulary,
    define_endpoint_type,
    merge_vocabularies,
)

vocab = merge_vocabularies(
    default_vocabulary(),
    [
        define_endpoint_type(
            name="invoice_pdf_download",
            description="Streaming download of a generated invoice PDF",
            expected_fields=["invoice_id", "pdf_url"],
        ),
        define_endpoint_type(
            name="webhook_subscription",
            description="Webhook registration endpoint that returns the subscription id",
            expected_fields=["subscription_id", "target_url", "events"],
        ),
    ],
)

architect = Architect(model=my_model, vocabulary=vocab)
```

## LLM providers

The library binds to any provider pydantic-ai supports — just pass a
`Model` instance (or its name) to the agent constructor:

```python
# Anthropic
from pydantic_ai.models.anthropic import AnthropicModel
architect = Architect(model=AnthropicModel("claude-sonnet-4-5"))

# OpenAI
from pydantic_ai.models.openai import OpenAIModel
architect = Architect(model=OpenAIModel("gpt-4o"))

# Ollama (or any OpenAI-compatible local server)
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider
architect = Architect(model=OpenAIModel(
    "llama3.1:8b",
    provider=OpenAIProvider(base_url="http://localhost:11434/v1"),
))

# Deterministic stub for tests
from pydantic_ai.models.test import TestModel
architect = Architect(model=TestModel())
```

## CDP burst format

`CDPNetworkEvent` is the only input shape the library cares about:

```python
CDPNetworkEvent(
    request_id="<unique-id>",
    url="https://...",
    method="GET",
    status_code=200,
    headers={"content-type": "application/json"},
    body={"data": {"...": "..."}},   # parsed JSON
    frame_origin=None,                # set for iframe traffic
    target_id=None,                   # CDP target id, for multi-frame disambiguation
    timestamp=1715760000.0,
)
```

The library does not capture CDP traffic itself. Use a sibling tool —
e.g. **[axumquant/cdp-network-interceptor](https://github.com/axumquant/cdp-network-interceptor)**
— or your own Chrome extension / Puppeteer / Playwright session that
emits this shape.
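
If you write your own forwarder, the fields above have to be assembled from several CDP messages. A rough sketch, assuming you have already collected the params of `Network.requestWillBeSent` and `Network.responseReceived` plus the result of `Network.getResponseBody` for one request; `to_event_fields` is a hypothetical helper that returns a plain dict of keyword arguments rather than a library object. It leaves out `frame_origin` and `target_id`, which need frame and target bookkeeping that is not shown here.

```python
import base64
import json
from typing import Any


def to_event_fields(
    request_sent: dict[str, Any],
    response_received: dict[str, Any],
    response_body: dict[str, Any],
) -> dict[str, Any]:
    """Combine three CDP payloads into kwargs shaped like CDPNetworkEvent.

    request_sent      -> params of Network.requestWillBeSent
    response_received -> params of Network.responseReceived
    response_body     -> result of Network.getResponseBody
    """
    response = response_received["response"]
    raw = response_body["body"]
    if response_body.get("base64Encoded"):
        raw = base64.b64decode(raw).decode("utf-8")
    return {
        "request_id": response_received["requestId"],
        "url": response["url"],
        "method": request_sent["request"]["method"],
        "status_code": response["status"],
        # Lower-casing header names is a choice of this sketch.
        "headers": {k.lower(): v for k, v in response["headers"].items()},
        "body": json.loads(raw),
        # wallTime is seconds since the epoch; using it here is an assumption.
        "timestamp": request_sent["wallTime"],
    }


fields = to_event_fields(
    {"requestId": "r1", "wallTime": 1715760000.0, "request": {"method": "GET"}},
    {"requestId": "r1", "response": {
        "url": "https://crm.example.com/api/v2/accounts/42",
        "status": 200,
        "headers": {"Content-Type": "application/json"},
    }},
    {"body": '{"data": {"id": "acct_42"}}', "base64Encoded": False},
)
print(fields["method"], fields["status_code"], fields["body"])
```

The resulting dict could then be unpacked into the constructor, for example `CDPNetworkEvent(**fields)`, provided the field names match the installed version.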

## Self-healing flow

When does the Healer fire?

1. The Eavesdropper validates an incoming event and detects missing
   fields against a registered signature.
2. It emits `ExtractionFailed` and returns it from `ingest()`.
3. Your orchestrator passes the failed event (plus the raw response
   body) to `Healer.diagnose()`.
4. The Healer runs **structural** matching first: if a field's key still
   exists in the new response, only its path needs updating. If every
   field resolves structurally, no LLM call happens.
5. Otherwise the Healer calls its pydantic-ai Agent with the old key
   map + new available keys + unresolved field names.
6. The returned `HealerPatch` has an aggregate confidence:
   - `≥ auto_approve_above` (default 0.90) → `apply_patch()` succeeds,
     emits `SchemaHealed`, signature is replaced in-place.
   - `[min_semantic_confidence, require_human_review_below)` (default
     0.70–0.75) → `apply_patch()` returns `HealingFailed` with reason
     `requires human review`. Surface this to the user.
   - `< min_semantic_confidence` → site is marked DEGRADED, retried
     up to `max_attempts` times, then marked BROKEN.
7. Persistence is the caller's job — the library mutates the
   `MappedSite` aggregate in memory but doesn't write it anywhere.
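
The structural pass in step 4 can be pictured with a small standard-library sketch. It assumes key maps are dotted paths and that a field counts as structurally resolved when exactly one key in the new response has the same final segment as the old path. Both are illustrative assumptions, not a description of the Healer's actual algorithm, and `structural_rematch` is a hypothetical name.

```python
from typing import Any, Iterator


def iter_paths(body: Any, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every key in a nested dict (lists are not descended)."""
    if isinstance(body, dict):
        for key, value in body.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from iter_paths(value, path)


def structural_rematch(
    key_map: dict[str, str], new_body: dict
) -> tuple[dict[str, str], list[str]]:
    """Re-point each field at the unique new path whose last segment matches the old one."""
    paths = list(iter_paths(new_body))
    patched: dict[str, str] = {}
    unresolved: list[str] = []
    for field, old_path in key_map.items():
        leaf = old_path.rsplit(".", 1)[-1]
        candidates = [p for p in paths if p.rsplit(".", 1)[-1] == leaf]
        if len(candidates) == 1:
            patched[field] = candidates[0]
        else:
            unresolved.append(field)  # missing or ambiguous: left for semantic matching
    return patched, unresolved


old_map = {"account_id": "data.client.id", "email": "data.client.email"}
new_body = {"result": {"account": {"id": "acct_99", "contact_email": "g@example.com"}}}
patched, unresolved = structural_rematch(old_map, new_body)
print(patched, unresolved)
```

Here `account_id` resolves structurally because `id` still exists under a new parent, while `email` was renamed to `contact_email` and stays unresolved, which is the case step 5 hands to the LLM.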

## Use cases

- **Salesforce custom-object extraction** — Salesforce's API surface is
  huge and per-tenant. Onboard once against the tenant you have a
  session on, extract from then on.
- **HubSpot scraping** — undocumented internal endpoints powering the UI.
- **Internal CRM discovery** — your customer is on some no-name CRM you've
  never seen. Onboarding takes minutes.
- **Pre-acquisition portal audits** — point it at a target's admin
  portal, get back a structured map of their data surface.
- **Partner integrations** with companies who refuse to ship an API.

## Pitfalls

- **The Architect costs money** — it's an LLM call with a non-trivial
  prompt + context. Budget for one call per site you map. The
  Eavesdropper is free; the Healer only fires when something breaks.
- **Schema drift is real** — sites change shapes monthly. Wire the
  Healer or you'll be debugging in production.
- **Auth-protected endpoints** — the library never authenticates for
  you. You drive a real browser session; the CDP forwarder captures
  authenticated traffic. The library only sees the resulting bodies.
- **Rate limits** — your scraping cadence is your problem. Polite
  pacing is on you.
- **Iframe traffic** — the library handles `frame_origin` matching
  correctly, but your CDP forwarder MUST populate it. Without
  `frame_origin`, iframe responses match parent-frame signatures, which
  produces garbage extractions.
- **The vocabulary matters** — generic CRUD works for most sites, but
  niche portals benefit a lot from a custom vocabulary that names the
  domain entities (e.g. `invoice_line_items` vs generic `list_records`).

## License

MIT — see [LICENSE](LICENSE).
