Metadata-Version: 2.4
Name: guestlist-tools
Version: 0.2.0
Summary: Check whether AI agents can crawl a website. Filter URLs by agent-friendliness before sending them to your computer-use agent.
Project-URL: Homepage, https://guestlist.tools
Author-email: Felix Müller <felixmueller0205@gmail.com>
License: MIT
License-File: LICENSE
Keywords: agents,ai,bot-detection,browser-automation,computer-use,crawling,cua,scraping
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: httpx<1.0,>=0.25
Description-Content-Type: text/markdown

# guestlist

**Filter URLs by whether AI agents can crawl them — before sending them to your computer-use agent.**

Most computer-use agents are still routinely blocked by bot-detection services such as Cloudflare, Akamai, and DataDome. When your agent runs a Google search and gets 10 results, the top 3 might all fail silently. `guestlist` lets you ask, in one call, *which of these will actually let an agent in?*

```python
from guestlist import check

results = check([
    "https://example.com",
    "https://instagram.com",
    "https://en.wikipedia.org/wiki/Web_scraping",
])

for r in results:
    print(r.url, r.tier, r.success_rate)
```

Illustrative output (columns aligned and the last URL shortened for readability):

```
https://example.com           green   0.98
https://instagram.com         red     0.04
https://en.wikipedia.org/...  green   0.99
```

## Install

```bash
pip install guestlist-tools
```

## Quickstart

Set your API key (get one at [guestlist.tools](https://guestlist.tools)):

```bash
export GUESTLIST_API_KEY=gst_...
```

Then:

```python
from guestlist import check, Tier

results = check(["https://example.com", "https://instagram.com"])

allowed = [r.url for r in results if r.tier in (Tier.GREEN, Tier.YELLOW)]
```

`check()` accepts up to **500 URLs per call** and automatically splits them into requests of up to 100 URLs each. Passing more than 500 raises `ValueError`. A bare string is treated as a single URL. An empty list returns an empty list without making an HTTP request.
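
The batching rules above can be sketched as a standalone function. This is an illustration only, not the library's internals: `plan_batches`, `MAX_URLS_PER_CALL`, and `BATCH_SIZE` are hypothetical names, and only the 500 and 100 limits come from this README.

```python
from typing import List, Sequence, Union

MAX_URLS_PER_CALL = 500  # per-call limit stated in this README
BATCH_SIZE = 100         # per-request batch size stated in this README


def plan_batches(urls: Union[str, Sequence[str]]) -> List[List[str]]:
    """Hypothetical helper: split URLs the way check() is described to."""
    if isinstance(urls, str):
        urls = [urls]  # a bare string counts as a single URL
    if len(urls) > MAX_URLS_PER_CALL:
        raise ValueError(
            f"at most {MAX_URLS_PER_CALL} URLs per call, got {len(urls)}"
        )
    # An empty input yields no batches, and therefore no HTTP requests.
    return [list(urls[i:i + BATCH_SIZE]) for i in range(0, len(urls), BATCH_SIZE)]


batches = plan_batches([f"https://site{i}.example" for i in range(250)])
print([len(b) for b in batches])  # [100, 100, 50]
```

Under these assumptions, a 250-URL call would translate into three HTTP requests.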

## What you get back

```python
@dataclass(frozen=True)
class Result:
    url: str                    # echoed from the request
    domain: str                 # registrable domain the API matched against
    tier: Tier                  # green | yellow | orange | red | unknown
    success_rate: float | None  # successes / samples over the last 90 days
    n_samples: int              # how many probes back that rate
    confidence: float           # 0.0 to 1.0, scales with sample count
    blocker_detected: Blocker | None
    last_tested_at: datetime | None
```

### Tier bands

| Tier      | What it means                              |
| --------- | ------------------------------------------ |
| `green`   | Agents work reliably.                      |
| `yellow`  | Agents usually work; expect some friction. |
| `orange`  | Agents are often blocked.                  |
| `red`     | Agents are almost always blocked.          |
| `unknown` | Not enough data yet.                       |

### About `unknown`

The dataset doesn't cover every domain on the web yet. You'll see `tier="unknown"` for sites we haven't probed enough times to be confident. Treat it as "no signal" rather than "safe to crawl." Coverage is expanding.
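
One way to act on the tiers while keeping `unknown` separate from both "allowed" and "blocked" is a three-way triage. The sketch below is self-contained: the `Tier` enum is a stand-in that mirrors the tier names in the table above (the member names `ORANGE`, `RED`, and `UNKNOWN` are assumptions), and `triage` is a hypothetical helper, not part of the library.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class Tier(Enum):
    # Stand-in for guestlist.Tier; values follow the tier table above.
    GREEN = "green"
    YELLOW = "yellow"
    ORANGE = "orange"
    RED = "red"
    UNKNOWN = "unknown"


def triage(pairs: Iterable[Tuple[str, Tier]]) -> Dict[str, List[str]]:
    """Hypothetical helper: bucket URLs into visit / skip / no_signal."""
    buckets: Dict[str, List[str]] = {"visit": [], "skip": [], "no_signal": []}
    for url, tier in pairs:
        if tier in (Tier.GREEN, Tier.YELLOW):
            buckets["visit"].append(url)
        elif tier is Tier.UNKNOWN:
            # No signal: neither safe nor blocked, so keep it apart.
            buckets["no_signal"].append(url)
        else:
            buckets["skip"].append(url)
    return buckets


print(triage([
    ("https://example.com", Tier.GREEN),
    ("https://instagram.com", Tier.RED),
    ("https://new-site.example", Tier.UNKNOWN),
]))
```

What to do with the `no_signal` bucket (try anyway, deprioritize, or drop) is a policy choice for the caller.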

## How URLs are matched

Each URL is resolved to its **registrable domain** (e.g. `https://api.x.com/users/123` → `x.com`), and the verdict is for that domain's *apex page*. The path, query, fragment, and subdomain are not currently part of the lookup. The `Result.domain` field shows what we matched against.

For example: `instagram.com` is apex-blocked (hard login wall) even though deep public paths like `instagram.com/user/p/<post_id>` may load fine. Per-path tiering is on the v2 roadmap.

Different effective TLDs are distinguished: `bbc.co.uk` is not the same as `bbc.com`.
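
To make the matching rule concrete, here is a rough sketch of registrable-domain extraction. It is not how the API resolves domains: a real implementation would consult the full Public Suffix List, whereas `MULTI_LABEL_SUFFIXES` below is a tiny hand-picked subset, and `registrable_domain` is a hypothetical name. URLs are assumed to include a scheme.

```python
from urllib.parse import urlsplit

# Illustrative subset only; the real Public Suffix List is far larger.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def registrable_domain(url: str) -> str:
    """Hypothetical helper: drop path, query, fragment, and subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Two-label suffixes such as co.uk need three labels for the domain.
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


print(registrable_domain("https://api.x.com/users/123"))    # x.com
print(registrable_domain("https://www.bbc.co.uk/news?x=1"))  # bbc.co.uk
print(registrable_domain("https://www.bbc.com/news"))        # bbc.com
```

Grouping URLs by this key before calling the API would show which inputs share a single verdict.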

## Configuration

| Setting   | Constructor arg | Env var               | Default                       |
| --------- | --------------- | --------------------- | ----------------------------- |
| API key   | `api_key=`      | `GUESTLIST_API_KEY`   | *(required)*                  |
| Base URL  | `base_url=`     | `GUESTLIST_API_BASE`  | `https://api.guestlist.tools` |
| Timeout   | `timeout=`      | —                     | `30.0` seconds                |

For finer control, instantiate `Guestlist` directly:

```python
from guestlist import Guestlist

urls = ["https://example.com", "https://instagram.com"]

with Guestlist(api_key="gst_...", timeout=10.0) as gl:
    results = gl.check(urls)
```
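
The settings table can be read as a lookup order. The sketch below assumes that an explicit constructor argument wins over the environment variable, which in turn wins over the default; this README does not state that precedence, so treat it as an assumption. `resolve_settings` is a hypothetical helper and `ConfigError` here is a local stand-in for the library's class of the same name.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

DEFAULT_BASE_URL = "https://api.guestlist.tools"  # default from the table
DEFAULT_TIMEOUT = 30.0                            # seconds, from the table


class ConfigError(Exception):
    """Local stand-in for guestlist's ConfigError."""


def resolve_settings(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    timeout: float = DEFAULT_TIMEOUT,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Hypothetical helper; assumed order: argument, env var, default."""
    env = os.environ if env is None else env
    key = api_key or env.get("GUESTLIST_API_KEY")
    if not key:
        raise ConfigError("missing API key")
    base = base_url or env.get("GUESTLIST_API_BASE") or DEFAULT_BASE_URL
    parts = urlsplit(base)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ConfigError(f"malformed base URL: {base!r}")
    return {"api_key": key, "base_url": base, "timeout": timeout}


print(resolve_settings(env={"GUESTLIST_API_KEY": "gst_example"}))
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.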

## Errors

All errors inherit from `GuestlistError`:

| Class                  | When                                                          |
| ---------------------- | ------------------------------------------------------------- |
| `ConfigError`          | Missing API key or malformed base URL.                        |
| `AuthenticationError`  | API returned 401 (bad key).                                   |
| `RateLimitError`       | API returned 429, retry exhausted. `.retry_after` may be set. |
| `APIError`             | Other 4xx/5xx after any retries. `.status_code`, `.detail`.   |
| `NetworkError`         | Connection or timeout error after retry.                      |

Retry policy:

- `429`: respects `Retry-After`, one retry.
- `5xx`: exponential backoff (250 ms, 1 s), two retries.
- Network / timeout: one retry.
- `4xx` other than 429: no retry.
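
The policy above can be expressed as a single decision function. This is a sketch under stated assumptions, not the client's actual code: `retry_delay` is a hypothetical name, network retries are assumed to happen immediately, and a `429` without a `Retry-After` value is assumed to be retried without waiting. Neither of those two details is specified in this README.

```python
from typing import Optional

SERVER_ERROR_BACKOFF_S = (0.25, 1.0)  # 250 ms, then 1 s, as listed above


def retry_delay(
    status: Optional[int],
    retries_so_far: int,
    retry_after: Optional[float] = None,
) -> Optional[float]:
    """Return seconds to wait before the next retry, or None to give up.

    status=None stands for a connection or timeout error.
    """
    if status is None:
        # Network / timeout: one retry (immediate retry is an assumption).
        return 0.0 if retries_so_far < 1 else None
    if status == 429:
        if retries_so_far >= 1:
            return None  # one retry only
        # Waiting 0 s when Retry-After is absent is an assumption.
        return retry_after if retry_after is not None else 0.0
    if 500 <= status <= 599:
        if retries_so_far < len(SERVER_ERROR_BACKOFF_S):
            return SERVER_ERROR_BACKOFF_S[retries_so_far]
        return None  # two retries exhausted
    return None  # other 4xx (and anything else): no retry


print(retry_delay(503, 0), retry_delay(503, 1), retry_delay(503, 2))
```

A caller would loop while the function returns a number, sleeping for that long before re-sending the request, and raise the matching error once it returns `None`.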

## Status

`guestlist` is a small public API on top of an ongoing data-collection effort. The library surface is stable; the underlying dataset is growing. If you hit a domain you'd like covered, open an issue.

## License

MIT.
