Metadata-Version: 2.3
Name: divparser
Version: 0.1.0
Summary: Python SDK for DivParser API - Web scraping and HTML parsing with AI-powered extraction
Author: solomon344
Author-email: solomon344 <willamssolomon672@gmail.com>
Requires-Dist: requests>=2.28.0
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# DivParser Python SDK

A Python SDK for [DivParser](https://www.divparser.com/) - AI-powered web scraping and HTML parsing.

## Features

- **Web Scraping**: Extract structured data from web pages
- **HTML Parsing**: Parse raw HTML content directly
- **Async Job Handling**: Non-blocking job submission with status polling
- **Pagination Support**: Scrape multiple URLs in a single batch
- **Simple API**: Pythonic interface to the DivParser REST API

## Installation

```bash
pip install divparser
```

Or, if you use `uv`:

```bash
uv pip install divparser
```

## Quick Start

### Setup

```python
from divparser import DivParser

# Initialize the client with your API key
client = DivParser(api_key="your_api_key_here")
```

Get your API key from the [DivParser Console](https://divparser.com/dashboard/settings).

### Scraping a Web Page

```python
# Scrape a single page and wait for results
result = client.scrape_and_parse(
    url="https://example.com/products",
    schema="Extract product name, price, and rating from each item"
)

# Access the extracted data
for item in result["results"][0]["data"]:
    print(item)
```

### Parsing HTML Content

```python
# Parse HTML content directly
html_content = "<html><body><h1>Title</h1><p>Content</p></body></html>"

result = client.parse_and_wait(
    html=html_content,
    schema="Extract all headings and paragraphs"
)

# Get the parsed data
data = result["results"][0]["data"]
print(data)
```

### Paginated Scraping

```python
from divparser.utils import flatten_results

# Scrape multiple URLs
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3"
]

result = client.scrape_paginated(
    urls=urls,
    schema="Extract product name and price",
    wait=True
)

# Combine results from all pages
all_items = flatten_results(result["results"])
```

## API Reference

### Scraping

#### `scrape(url, schema, name=None, page_type="LISTING", wait=False, timeout=300)`

Create a scrape job for a single URL.

**Parameters:**
- `url` (str): Target page URL
- `schema` (str): Extraction instructions (plain English or Nestlang)
- `name` (str, optional): Friendly label for this scrape
- `page_type` (str): `"LISTING"` (default) or `"DETAIL"`
- `wait` (bool): Wait for completion before returning
- `timeout` (int): Max seconds to wait (only used when `wait=True`)

**Returns:** Dictionary with `scrapeId`, `jobId`, and, when `wait=True`, `results`

#### `scrape_paginated(urls, schema, name=None, page_type="LISTING", wait=False, timeout=300)`

Create a scrape job for multiple URLs.

**Parameters:**
- `urls` (List[str]): List of URLs to scrape
- `schema` (str): Extraction instructions
- `name` (str, optional): Friendly label
- `page_type` (str): `"LISTING"` (default) or `"DETAIL"`
- `wait` (bool): Wait for completion
- `timeout` (int): Max seconds to wait

**Returns:** Dictionary with `scrapeId`, `jobId`, and, when `wait=True`, `results`

#### `list_scrapes(limit=20, cursor=None)`

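List all scrapes for the authenticated user.

**Parameters:**
- `limit` (int): Max number of scrapes to return per call (default 20)
- `cursor` (str, optional): Pagination cursor for fetching the next page

**Returns:** Dictionary with a list of scrapes and pagination info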
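
Since `list_scrapes` accepts a `cursor`, reading the full history means calling it repeatedly. The sketch below shows one way to do that. `iter_scrapes` is a hypothetical helper, not part of the SDK, and the response keys `scrapes` and `nextCursor` are assumptions rather than confirmed field names, so inspect a real response and adjust `items_key` and `cursor_key` accordingly.

```python
from typing import Any, Dict, Iterator


def iter_scrapes(
    client: Any,
    page_size: int = 20,
    items_key: str = "scrapes",      # assumed key holding the list of scrapes
    cursor_key: str = "nextCursor",  # assumed key holding the next-page cursor
) -> Iterator[Dict[str, Any]]:
    """Yield every scrape by following the pagination cursor page by page."""
    cursor = None
    while True:
        page = client.list_scrapes(limit=page_size, cursor=cursor)
        # Hand back the items of this page one at a time.
        yield from page.get(items_key, [])
        # A missing or empty cursor is treated as the last page.
        cursor = page.get(cursor_key)
        if not cursor:
            break
```

With a client from the Quick Start, `for scrape in iter_scrapes(client): ...` would walk every page lazily, requesting the next page only once the current one is consumed.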

#### `get_scrape(scrape_id)`

Retrieve a scrape and its results by ID.

**Parameters:**
- `scrape_id` (str): The `scrapeId` returned when the scrape was created

**Returns:** Dictionary with scrape details and results
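
The methods above can be combined into a non-blocking workflow: submit with `wait=False`, wait on the `jobId`, then fetch by `scrapeId`. The following is a minimal sketch of that sequence. `scrape_then_fetch` is a hypothetical helper, and it assumes `wait_for_completion` (documented under Utilities) is a method on the same client.

```python
from typing import Any, Dict


def scrape_then_fetch(
    client: Any, url: str, schema: str, timeout: int = 300
) -> Dict[str, Any]:
    """Submit a scrape without blocking, wait for the job, then fetch it."""
    # With wait=False the call returns as soon as the job is queued.
    job = client.scrape(url=url, schema=schema, wait=False)

    # Other work could run here; this call blocks until the job completes
    # and raises TimeoutError if it takes longer than `timeout` seconds.
    client.wait_for_completion(job["jobId"], timeout=timeout)

    # Retrieve the stored scrape, including its results, by scrapeId.
    return client.get_scrape(job["scrapeId"])
```

Calling `scrape_then_fetch(client, "https://example.com/products", "Extract product name and price")` should be roughly equivalent to `scrape(..., wait=True)`, while leaving room to do other work between submission and retrieval.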

### Parsing

#### `parse(html, schema, name=None, wait=False, timeout=300)`

Submit raw HTML for structured extraction.

**Parameters:**
- `html` (str): Full HTML content to parse
- `schema` (str): Extraction instructions
- `name` (str, optional): Friendly label
- `wait` (bool): Wait for completion
- `timeout` (int): Max seconds to wait

**Returns:** Dictionary with `scrapeId`, `jobId`, and, when `wait=True`, `results`

#### `get_parse(parse_id)`

Retrieve results for a completed parse job.

**Parameters:**
- `parse_id` (str): The `scrapeId` returned when the parse job was created

**Returns:** Dictionary with parse details and results

### Utilities

#### `check_status(job_id)`

Poll the status of a job.

**Parameters:**
- `job_id` (str): The jobId returned from creation

**Returns:** Dictionary with `completed` (bool) and `state` (str)

#### `wait_for_completion(job_id, timeout=300, poll_interval=1.0)`

Wait for a job to complete.

**Parameters:**
- `job_id` (str): The jobId to poll
- `timeout` (int): Max seconds to wait
- `poll_interval` (float): Seconds between polls

**Returns:** Status dictionary when completed

**Raises:** `TimeoutError` if the job does not complete within `timeout` seconds
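
For finer control, such as logging each poll, a similar loop can be built on `check_status` directly. This is a sketch of the general polling pattern, not the SDK's actual implementation. `poll_until_done` and its `on_poll` callback are hypothetical names, and the code relies only on the documented `completed` field of the status dictionary.

```python
import time
from typing import Any, Callable, Dict, Optional


def poll_until_done(
    client: Any,
    job_id: str,
    timeout: float = 300,
    poll_interval: float = 1.0,
    on_poll: Optional[Callable[[Dict[str, Any]], None]] = None,
) -> Dict[str, Any]:
    """Poll check_status until the job completes or the timeout elapses."""
    # time.monotonic() is unaffected by system clock adjustments.
    deadline = time.monotonic() + timeout
    while True:
        status = client.check_status(job_id)
        if on_poll is not None:
            on_poll(status)  # e.g. log status["state"]
        if status.get("completed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {job_id} did not complete within {timeout} seconds"
            )
        time.sleep(poll_interval)
```

For example, `poll_until_done(client, job["jobId"], on_poll=lambda s: print(s["state"]))` prints the job state on every poll.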

## Utility Functions

The `divparser.utils` module provides helper functions for working with results:

```python
from divparser.utils import (
    extract_data_from_results,
    flatten_results,
    filter_results_by_status,
    get_results_by_url,
    get_result_stats
)

# `results` is the list stored under result["results"] of a completed job

# Flatten nested results
all_items = flatten_results(results)

# Get statistics
stats = get_result_stats(results)
print(f"Success rate: {stats['success_rate']:.1f}%")

# Group by URL
by_url = get_results_by_url(results)
```
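
To make "flattening" concrete, here is a rough stand-in for what such a helper might do. It is not the library's implementation. `flatten_data` is a hypothetical name, and the code assumes only what the Quick Start shows: each entry in `results` carries its extracted items under a `data` key.

```python
from typing import Any, Dict, Iterable, List


def flatten_data(results: Iterable[Dict[str, Any]]) -> List[Any]:
    """Merge the `data` of every result entry into one flat list."""
    items: List[Any] = []
    for entry in results:
        # Entries without data (for example, failed pages) contribute nothing.
        data = entry.get("data") or []
        # Tolerate a single object where a list of items was expected.
        items.extend(data if isinstance(data, list) else [data])
    return items
```

For a three-page scrape, the returned list holds the items of all three pages in page order.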

## Examples

### Example 1: Extract Job Listings

```python
from divparser import DivParser

client = DivParser(api_key="your_api_key")

result = client.scrape_and_parse(
    url="https://example-jobs.com/listings",
    schema="""
    Extract the following for each job:
    - job title
    - company name
    - location
    - salary range (if available)
    """,
    name="Job Listings Scrape"
)

# Key names depend on what the extraction returns for your schema
for job in result["results"][0]["data"]:
    print(f"{job['title']} at {job['company']} in {job['location']}")
```

### Example 2: Parse Product Information from HTML

```python
from divparser import DivParser

client = DivParser(api_key="your_api_key")

html_content = """
<html>
<body>
    <div class="product">
        <h2>Widget Pro</h2>
        <p class="price">$49.99</p>
        <p class="rating">4.5 stars</p>
    </div>
    <div class="product">
        <h2>Widget Lite</h2>
        <p class="price">$19.99</p>
        <p class="rating">4.2 stars</p>
    </div>
</body>
</html>
"""

result = client.parse_and_wait(
    html=html_content,
    schema="Extract product name, price, and rating"
)

for product in result["results"][0]["data"]:
    print(f"{product['name']}: {product['price']} ({product['rating']})")
```

### Example 3: Batch Scraping Multiple Pages

```python
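from divparser import DivParser
from divparser.utils import flatten_results

client = DivParser(api_key="your_api_key")

pages = [f"https://example.com/products?page={i}" for i in range(1, 4)]

# wait=True blocks until the job finishes, so "results" is present
result = client.scrape_paginated(
    urls=pages,
    schema="Extract product ID, name, and price",
    wait=True
)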

# Get all products from all pages
all_products = flatten_results(result["results"])
print(f"Total products: {len(all_products)}")
```

## Error Handling

```python
from divparser import DivParser
import requests

client = DivParser(api_key="your_api_key")

try:
    result = client.scrape_and_parse(
        url="https://example.com",
        schema="Extract content"
    )
except requests.exceptions.HTTPError as e:
    print(f"API Error: {e}")
except TimeoutError as e:
    print(f"Job timed out: {e}")
```

## Best Practices

1. **Use Descriptive Schemas**: Clear instructions in your schema lead to better extraction
2. **Set Appropriate Timeouts**: Complex extractions may need longer timeouts
3. **Batch Operations**: Use `scrape_paginated` for multiple URLs instead of individual requests
4. **Handle Errors**: Always catch exceptions in production code
5. **Reuse Clients**: Create one client instance and reuse it

## API Documentation

For more detailed information, visit [DivParser API Reference](https://www.divparser.com/docs?p=API+Reference).

## License

MIT

## Support

For issues, questions, or feature requests, visit [DivParser Support](https://www.divparser.com/).
