Metadata-Version: 2.4
Name: geonode-scraper-tools-core
Version: 0.3.1
Summary: Shared runtime and schemas for Geonode Scraper framework tools
Author: Geonode Team
License-Expression: MIT
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: geonode-scraper-sdk>=0.3.0
Requires-Dist: pydantic>=2.11
Requires-Dist: typing-extensions>=4.7.1
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.12.11; extra == "dev"

# Geonode Scraper Tools Core

Shared runtime, schemas, and operation registry for Geonode Scraper tool
integrations.

Most users should install one of the framework packages instead:

- `geonode-scraper-langchain`
- `geonode-scraper-crewai`

Install the core package directly only if you are building your own wrapper layer
on top of the shared service.

## Installation

```sh
pip install geonode-scraper-tools-core
```

## Public API

- `ScraperToolSettings`
- `ScraperToolService`
- `OperationSpec`
- `OPERATIONS`
- `get_operations()`

## Configuration

```python
from geonode_scraper_tools_core import ScraperToolSettings, ScraperToolService

settings = ScraperToolSettings(
    host="https://api.example.com",
    api_key="your-api-key",
)

service = ScraperToolService(settings)
```
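
If you would rather not hard-code credentials, one option is to read them from the environment before constructing the settings object. The sketch below is a minimal illustration, not part of the package: the helper `settings_kwargs_from_env` and the variable names `GEONODE_SCRAPER_HOST` and `GEONODE_SCRAPER_API_KEY` are hypothetical. It only assumes that `ScraperToolSettings` accepts the `host` and `api_key` keyword arguments shown above.

```python
import os
from typing import Mapping, Optional

# Hypothetical environment variable names; the package does not define these.
HOST_VAR = "GEONODE_SCRAPER_HOST"
API_KEY_VAR = "GEONODE_SCRAPER_API_KEY"


def settings_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the `host` / `api_key` keyword arguments from environment variables."""
    env = os.environ if env is None else env

    # Fail early with a clear message if either value is missing or empty.
    missing = [name for name in (HOST_VAR, API_KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variable(s): {', '.join(missing)}")

    return {"host": env[HOST_VAR], "api_key": env[API_KEY_VAR]}
```

The result can then be unpacked into the constructor, for example `ScraperToolSettings(**settings_kwargs_from_env())`.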

## Exposed Operations

The shared service normalizes SDK responses into JSON-friendly dictionaries and
exposes the following 17 operations:

**Extraction**
- `extract` — extract content from a single URL (sync or async)
- `get_job_result` — fetch the current state or result of an async extraction job
- `wait_for_job` — poll an async extraction job until it reaches a terminal state
- `list_jobs` — list previously submitted extraction jobs with optional filters

**Batch**
- `create_batch` — submit a list of URLs for asynchronous batch extraction
- `get_batch_status` — fetch the current status and partial results of a batch job
- `wait_for_batch` — poll a batch job until it reaches a terminal state
- `list_batch_jobs` — list previously submitted batch jobs with optional filters

**Crawl**
- `create_crawl` — start a crawl job from a seed URL
- `get_crawl_status` — fetch the current status and results of a crawl job
- `wait_for_crawl` — poll a crawl job until it reaches a terminal state
- `list_crawl_jobs` — list previously submitted crawl jobs with optional filters

**Map**
- `map_urls` — discover all URLs under a base URL via sitemap and HTML link extraction
- `list_map_jobs` — list previously submitted map jobs with optional filters
- `get_map_job` — fetch the status and discovered URLs for a single map job

**Statistics & Health**
- `get_statistics` — retrieve aggregated extraction statistics
- `health_check` — check the scraper service health and version
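
The `wait_for_*` operations above already handle polling for you. If you need a custom loop on top of the `get_*` operations (for example, to log progress between checks), the general pattern might look like the sketch below. It is an illustration under stated assumptions: `poll_until_terminal` is a hypothetical helper, and the `"status"` key and the terminal state names are guesses, not the service's documented response shape.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical terminal states; the real service may use different names.
TERMINAL_STATES = frozenset({"completed", "failed", "cancelled"})


def poll_until_terminal(
    fetch_state: Callable[[], Dict[str, Any]],
    *,
    interval: float = 2.0,
    timeout: float = 300.0,
) -> Dict[str, Any]:
    """Call `fetch_state` repeatedly until it reports a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        # Assumes the response dictionary carries a "status" field.
        if state.get("status") in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No terminal state within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a real status call, so the sketch runs without the service.
fake_states = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed"},
])
final_state = poll_until_terminal(lambda: next(fake_states), interval=0.0)
```

In real use, `fetch_state` would wrap whichever status operation you are tracking, and the status check would be adapted to the dictionaries the service actually returns.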

## Selecting a Subset of Operations

Pass an `operations` list to `get_operations()` or to any framework wrapper to
expose only the operations you need.

```python
from geonode_scraper_tools_core import get_operations

ops = get_operations(["extract", "map_urls", "create_batch", "wait_for_batch"])
```
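
When the list of names comes from user configuration, it can help to check it against the operations documented in this README before passing it on. The following is a minimal sketch: `validate_operation_names` is a hypothetical helper, and the name set is copied by hand from the list above rather than read from the package. How `get_operations()` itself reacts to an unknown name is not covered here.

```python
from typing import Iterable, List

# The 17 operation names listed in this README, copied by hand.
KNOWN_OPERATIONS = frozenset({
    "extract", "get_job_result", "wait_for_job", "list_jobs",
    "create_batch", "get_batch_status", "wait_for_batch", "list_batch_jobs",
    "create_crawl", "get_crawl_status", "wait_for_crawl", "list_crawl_jobs",
    "map_urls", "list_map_jobs", "get_map_job",
    "get_statistics", "health_check",
})


def validate_operation_names(names: Iterable[str]) -> List[str]:
    """Return the names de-duplicated in order, or raise on unknown names."""
    names = list(names)
    unknown = sorted(set(names) - KNOWN_OPERATIONS)
    if unknown:
        raise ValueError(f"Unknown operation name(s): {', '.join(unknown)}")
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(names))
```

The validated list can then be handed to `get_operations()`. In a real wrapper, the package's own `OPERATIONS` registry is likely a better source of valid names than a hand-copied set, since it stays in sync with the installed version.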
