Metadata-Version: 2.2
Name: spiderforce4ai
Version: 1.1
Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
Home-page: https://petertam.pro
Author: Piotr Tamulewicz
Author-email: Piotr Tamulewicz <pt@petertam.pro>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: rich>=10.0.0
Requires-Dist: aiofiles>=0.8.0
Requires-Dist: httpx>=0.24.0
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# SpiderForce4AI Python Wrapper

A Python package for web content crawling and HTML-to-Markdown conversion. Built for seamless integration with the SpiderForce4AI service.

## Quick Start (Minimal Setup)

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig

# Initialize with your service URL
spider = SpiderForce4AI("http://localhost:3004")

# Create default config
config = CrawlConfig()

# Crawl a single URL
result = spider.crawl_url("https://example.com", config)
```

## Installation

```bash
pip install spiderforce4ai
```

## Crawling Methods

### 1. Single URL

```python
# Basic usage
result = spider.crawl_url("https://example.com", config)

# Async version
import asyncio

async def crawl():
    return await spider.crawl_url_async("https://example.com", config)

result = asyncio.run(crawl())
```

### 2. Multiple URLs

```python
urls = [
    "https://example.com/page1",
    "https://example.com/page2"
]

# Client-side parallel (using multiprocessing)
results = spider.crawl_urls_parallel(urls, config)

# Server-side parallel (single request)
results = spider.crawl_urls_server_parallel(urls, config)

# Async version
import asyncio

async def crawl():
    return await spider.crawl_urls_async(urls, config)

results = asyncio.run(crawl())
```

### 3. Sitemap Crawling

```python
# Server-side parallel (recommended)
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)

# Client-side parallel
results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)

# Async version
import asyncio

async def crawl():
    return await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)

results = asyncio.run(crawl())
```

## Configuration Options

All configuration options are optional with sensible defaults:

```python
from pathlib import Path

config = CrawlConfig(
    # Content Selection (all optional)
    target_selector="article",              # Specific element to extract
    remove_selectors=[                      # Elements to remove
        ".ads", 
        "#popup",
        ".navigation",
        ".footer"
    ],
    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
    
    # Processing Settings
    max_concurrent_requests=1,              # For client-side parallel processing
    request_delay=0.5,                     # Delay between requests (seconds)
    timeout=30,                            # Request timeout (seconds)
    
    # Output Settings
    output_dir=Path("spiderforce_reports"),  # Default directory for files
    webhook_url="https://your-webhook.com",  # Real-time notifications
    webhook_timeout=10,                      # Webhook timeout (seconds)
    webhook_headers={                        # Optional custom headers for webhook
        "Authorization": "Bearer your-token",
        "X-Custom-Header": "value"
    },
    # Optional custom webhook payload template
    webhook_payload_template='''{
        "crawled_url": "{url}",
        "content": "{markdown}",
        "crawl_status": "{status}",
        "crawl_error": "{error}",
        "crawl_time": "{timestamp}",
        "custom_field": "your-value"
    }''',
    save_reports=False,                      # Whether to save crawl reports (default: False)
    report_file=Path("crawl_report.json")    # Report location (used only if save_reports=True)
)
```
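The `webhook_payload_template` supports the placeholders shown above: `{url}`, `{markdown}`, `{status}`, `{error}`, and `{timestamp}`. Any other text in the template (such as `custom_field` here) is passed through verbatim.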

## Real-World Examples

### 1. Basic Blog Crawling

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path

spider = SpiderForce4AI("http://localhost:3004")
config = CrawlConfig(
    target_selector="article.post-content",
    output_dir=Path("blog_content")
)

result = spider.crawl_url("https://example.com/blog-post", config)
```

### 2. Parallel Website Crawling

```python
config = CrawlConfig(
    remove_selectors=[
        ".navigation",
        ".footer",
        ".ads",
        "#cookie-notice"
    ],
    max_concurrent_requests=5,
    output_dir=Path("website_content"),
    webhook_url="https://your-webhook.com/endpoint"
)

# Using server-side parallel processing
results = spider.crawl_urls_server_parallel([
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
], config)
```

### 3. Full Sitemap Processing

```python
config = CrawlConfig(
    target_selector="main",
    remove_selectors=[".sidebar", ".comments"],
    output_dir=Path("site_content"),
    save_reports=True,
    report_file=Path("crawl_report.json")
)

results = spider.crawl_sitemap_server_parallel(
    "https://example.com/sitemap.xml",
    config
)
```

## Output Structure

### 1. Directory Layout
```
spiderforce_reports/           # Default output directory
├── example-com-page1.md      # Converted markdown files
├── example-com-page2.md
└── crawl_report.json         # Crawl report (if save_reports=True)
```

### 2. Markdown Files
Each file is named using a slugified version of the URL:
```markdown
# Page Title

Content converted to clean markdown...
```
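The exact slug format is internal to the package, but a minimal sketch of the idea (illustrative only, not the package's actual function) produces names like `example-com-page1.md`:

```python
import re
from urllib.parse import urlparse

def slugify_url(url: str) -> str:
    # Illustrative only: drop the scheme, then collapse every
    # non-alphanumeric run into a single hyphen.
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", raw).strip("-").lower()
    return f"{slug}.md"

print(slugify_url("https://example.com/page1"))  # example-com-page1.md
```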

### 3. Crawl Report
```json
{
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"]
  },
  "results": {
    "successful": [
      {
        "url": "https://example.com/page1",
        "status": "success",
        "markdown": "# Page Title\n\nContent...",
        "timestamp": "2025-02-15T10:30:00.123456"
      }
    ],
    "failed": [
      {
        "url": "https://example.com/page2",
        "status": "failed",
        "error": "HTTP 404: Not Found",
        "timestamp": "2025-02-15T10:30:01.123456"
      }
    ]
  },
  "summary": {
    "total": 2,
    "successful": 1,
    "failed": 1
  }
}
```
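Because the report is plain JSON, it is easy to inspect after a run. A minimal sketch, assuming `save_reports=True` and the default report path:

```python
import json
from pathlib import Path

report = json.loads(Path("crawl_report.json").read_text())

# Field names follow the report format shown above.
summary = report["summary"]
print(f"{summary['successful']}/{summary['total']} URLs succeeded")

for failure in report["results"]["failed"]:
    print(f"{failure['url']}: {failure['error']}")
```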

### 4. Webhook Notifications
If configured, real-time updates are sent for each processed URL:
```json
{
  "url": "https://example.com/page1",
  "status": "success",
  "markdown": "# Page Title\n\nContent...",
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"]
  }
}
```
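Any HTTP endpoint that accepts JSON can receive these notifications. A minimal sketch of a receiver using FastAPI (not a dependency of this package; the endpoint path is illustrative):

```python
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/endpoint")
async def receive_crawl_update(request: Request):
    # One notification arrives per processed URL.
    payload = await request.json()
    if payload.get("status") == "success":
        print(f"Crawled {payload['url']} ({len(payload.get('markdown', ''))} chars)")
    else:
        print(f"Failed {payload['url']}: {payload.get('error')}")
    return {"ok": True}
```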

## Error Handling

The package handles various types of errors gracefully:
- Network errors
- Timeout errors
- Invalid URLs
- Missing content
- Service errors

All errors are:
1. Logged in the console
2. Included in the JSON report
3. Sent via webhook (if configured)
4. Available in the results list (see the sketch below)
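
Assuming each result object exposes `url`, `status`, and `error` fields mirroring the crawl report format, failures can be pulled out of the results list like this:

```python
urls = ["https://example.com/page1", "https://example.com/page2"]
results = spider.crawl_urls_server_parallel(urls, config)

# Field names assumed to mirror the crawl report format.
failed = [r for r in results if r.status == "failed"]
for r in failed:
    print(f"{r.url}: {r.error}")

print(f"{len(results) - len(failed)}/{len(results)} URLs succeeded")
```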

## Requirements

- Python 3.11 or later
- Running SpiderForce4AI service
- Internet connection

## Performance Considerations

1. Server-side Parallel Processing
   - Best for most cases
   - Single HTTP request for multiple URLs
   - Less network overhead
   - Use: `crawl_urls_server_parallel()` or `crawl_sitemap_server_parallel()`

2. Client-side Parallel Processing
   - Good for special cases requiring local control
   - Uses Python multiprocessing
   - More network overhead (see the timing sketch after this list)
   - Use: `crawl_urls_parallel()` or `crawl_sitemap_parallel()`

3. Async Processing
   - Best for integration with async applications
   - Good for real-time processing
   - Use: `crawl_url_async()`, `crawl_urls_async()`, or `crawl_sitemap_async()`
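
To see which mode fits your deployment, you can time the same URL list through both parallel paths. A rough, illustrative sketch (numbers depend entirely on your service and network):

```python
import time

def timed(label, crawl_fn, *args):
    # Time a single crawl call and report the wall-clock duration.
    start = time.perf_counter()
    results = crawl_fn(*args)
    print(f"{label}: {len(results)} results in {time.perf_counter() - start:.1f}s")
    return results

timed("server-side", spider.crawl_urls_server_parallel, urls, config)
timed("client-side", spider.crawl_urls_parallel, urls, config)
```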

## License

MIT License

## Credits

Created by [Peter Tam](https://petertam.pro)
