Metadata-Version: 2.4
Name: ghostcrawl
Version: 2.2.1
Summary: Official Python SDK for the Ghostcrawl local orchestration API.
Author: Ghostcrawl
License: Proprietary — see LICENSE
Project-URL: Homepage, https://github.com/ghostcrawl/ghostcrawl
Keywords: ghostcrawl,scraping,browser,automation,agent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.28.1
Requires-Dist: pydantic>=2.13
Requires-Dist: typer>=0.12
Requires-Dist: mcp>=1.27.1
Requires-Dist: microsoft-kiota-abstractions>=1.7.0
Requires-Dist: microsoft-kiota-http>=1.10.0
Requires-Dist: microsoft-kiota-serialization-json>=1.7.0
Requires-Dist: microsoft-kiota-serialization-text>=1.0.0
Requires-Dist: microsoft-kiota-serialization-form>=1.0.0
Requires-Dist: microsoft-kiota-serialization-multipart>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: respx>=0.23.1; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Dynamic: license-file

# ghostcrawl — Python SDK

The official Python client for the [GhostCrawl](https://ghostcrawl.io) API. Collect web data at scale — scrape, crawl, search, extract structured data, manage browser sessions, and automate the full data-collection pipeline.

## Install

```bash
pip install ghostcrawl
```

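Requires Python 3.10+. Runtime dependencies are `httpx>=0.28.1`, `pydantic>=2.13`, `typer>=0.12`, `mcp>=1.27.1`, and the `microsoft-kiota-*` packages; `pip` installs them automatically.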

## Quickstart

```python
from ghostcrawl import GhostcrawlClient

# Pass token= explicitly, or omit it to read GHOSTCRAWL_API_KEY from the environment
client = GhostcrawlClient(token="gck_live_YOUR_KEY")

# Scrape a URL
result = client.scrape(url="https://example.com", format="markdown")
print(result["content"])

# Start a crawl
run = client.crawl_runs.start(url="https://example.com", max_depth=2, max_pages=50)
print(run["run_id"])

# Web search
results = client.search(query="latest AI research", engine="google", limit=10)
for r in results["results"]:
    print(r["title"], r["url"])
```

## Authentication

```python
import os
from ghostcrawl import GhostcrawlClient

# Option 1: pass token directly
client = GhostcrawlClient(token="gck_live_YOUR_KEY")

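# Option 2: read the token from the GHOSTCRAWL_API_KEY environment variable
# (recommended for production). Set it in your shell or deployment config;
# it is assigned here only to keep the example self-contained.
os.environ["GHOSTCRAWL_API_KEY"] = "gck_live_YOUR_KEY"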
client = GhostcrawlClient()
```

Every request sends `Authorization: Bearer <token>`. This is the only auth scheme the API accepts.
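
The client handles this for you. As a minimal sketch of what it amounts to, the helpers below read the key from the environment and build the header. `load_token` and `auth_headers` are hypothetical names used for illustration and are not part of the SDK.

```python
import os


def load_token(env=os.environ):
    """Return the API key from GHOSTCRAWL_API_KEY, failing fast if it is unset."""
    token = env.get("GHOSTCRAWL_API_KEY", "").strip()
    if not token:
        raise RuntimeError("GHOSTCRAWL_API_KEY is not set")
    return token


def auth_headers(token):
    """Build the bearer-token header described above (illustrative helper)."""
    return {"Authorization": f"Bearer {token}"}


# A literal mapping stands in for os.environ so the example runs without a real key.
print(auth_headers(load_token({"GHOSTCRAWL_API_KEY": "gck_live_YOUR_KEY"})))
```

Failing fast on a missing variable gives a clearer error at startup than an authentication failure on the first request.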

## Extract structured data

```python
from ghostcrawl import GhostcrawlClient

client = GhostcrawlClient(token="gck_live_YOUR_KEY")

# Define a schema and extract matching data
data = client.extract(
    url="https://example.com/product",
    schema={
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "description": {"type": "string"},
        },
    },
)
print(data["name"], data["price"])
```
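
The schema above marks no field as required, so a page that lacks a price or description may plausibly produce a result without those keys. The sketch below guards against that, assuming the result is a plain dict as in the example. `pick_fields` is a hypothetical helper, not part of the SDK.

```python
def pick_fields(data, fields):
    """Split a result dict into found values and a list of missing field names."""
    found = {name: data[name] for name in fields if data.get(name) is not None}
    missing = [name for name in fields if name not in found]
    return found, missing


# Stand-in for the dict returned by client.extract(...) in the example above.
sample = {"name": "Widget", "price": 9.99}
found, missing = pick_fields(sample, ["name", "price", "description"])
print(found)
print(missing)
```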

## Browser sessions

```python
from ghostcrawl import GhostcrawlClient

client = GhostcrawlClient(token="gck_live_YOUR_KEY")

# Create a session
session = client.sessions.create(profile_name="my-profile")
session_id = session["session_id"]

# Extend and release
client.sessions.extend(session_id, duration_seconds=600)
client.sessions.release(session_id)
```
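
If the code between `create` and `release` raises, the session is never released. One way to guard against that is a small context manager around the two calls shown above. This is a sketch, and `managed_session` is a hypothetical helper rather than part of the SDK. It relies only on `sessions.create` and `sessions.release` as used in the example.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(client, profile_name):
    """Create a session and always release it, even if the body raises."""
    session = client.sessions.create(profile_name=profile_name)
    session_id = session["session_id"]
    try:
        yield session_id
    finally:
        # Runs on normal exit and on exceptions alike.
        client.sessions.release(session_id)
```

Used as `with managed_session(client, "my-profile") as session_id:`, the body can call `client.sessions.extend(session_id, duration_seconds=600)` and the release happens on exit.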

## Error handling

```python
from ghostcrawl import GhostcrawlClient, AuthenticationError, RateLimitError, APIError

client = GhostcrawlClient(token="gck_live_YOUR_KEY")

try:
    result = client.scrape(url="https://example.com")
except AuthenticationError:
    print("Invalid API key — check your token")
except RateLimitError:
    print("Rate limit reached — retry after a short delay")
except APIError as e:
    print(f"API error: {e.status_code}")
```
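
The "retry after a short delay" advice can be sketched as a generic exponential-backoff helper. `retry_on` is a hypothetical name, not an SDK function, and the delay schedule is an assumption: this README does not say whether the API returns a retry hint.

```python
import time


def retry_on(exc_type, fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on exc_type, wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except exc_type:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With the SDK this would be called as `retry_on(RateLimitError, lambda: client.scrape(url="https://example.com"))`. The `sleep` parameter is injectable so the helper can be tested without waiting.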

## Context manager

```python
from ghostcrawl import GhostcrawlClient

with GhostcrawlClient(token="gck_live_YOUR_KEY") as client:
    result = client.scrape(url="https://example.com")
    print(result)
# HTTP connection is closed automatically
```

## All resources

| Resource | Client attribute | Key operations |
|----------|-----------------|----------------|
| Scraping | `client.scrape(url=…)` | Render and return page content |
| Web search | `client.search(query=…)` | Search Google, Bing, DuckDuckGo |
| Data extraction | `client.extract(url=…, schema=…)` | Structured JSON from any page |
| Deep crawl | `client.crawl(url=…)` | Crawl a site depth-first |
| URL map | `client.map(url=…)` | Discover all reachable URLs |
| Crawl runs | `client.crawl_runs` | start, list, get, cancel |
| Sessions | `client.sessions` | create, extend, release |
| Profiles | `client.profiles` | list, get, create, update, delete |
| Webhooks | `client.webhooks` | list, get, create, delete, rotate-secret |
| Schedules | `client.schedules` | list, get, create, delete |
| Datasets | `client.datasets` | list, get, create, delete, append rows |
| Recordings | `client.recordings` | list, get, delete |
| Key-Value Store | `client.kv` | get, set, delete |
| Account | `client.me()` | Get account info and usage |

## LangChain integration

```bash
pip install ghostcrawl-langchain
```

```python
from ghostcrawl_langchain import GhostcrawlScrape, GhostcrawlSearch

scrape_tool = GhostcrawlScrape()
search_tool = GhostcrawlSearch()
```

## Self-hosted

```python
from ghostcrawl import GhostcrawlClient

client = GhostcrawlClient(
    token="gck_live_YOUR_KEY",
    base_url="http://localhost:8080",  # your self-hosted instance
)
```

## License

Proprietary — GhostCrawl Software License. See [LICENSE](LICENSE).
