Metadata-Version: 2.2
Name: spiderforce4ai
Version: 0.1.3
Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
Home-page: https://petertam.pro
Author: Piotr Tamulewicz
Author-email: Piotr Tamulewicz <pt@petertam.pro>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: asyncio>=3.4.3
Requires-Dist: rich>=10.0.0
Requires-Dist: aiofiles>=0.8.0
Requires-Dist: httpx>=0.24.0
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# SpiderForce4AI Python Wrapper (a Jina AI Reader / Firecrawl alternative)

A Python wrapper for SpiderForce4AI, a powerful HTML-to-Markdown conversion service. The package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown.

## Features

- 🔄 Simple synchronous and asynchronous APIs
- 📁 Automatic Markdown file saving with URL-based filenames
- 📊 Real-time progress tracking in console
- 🪝 Webhook support for real-time notifications
- 📝 Detailed crawl reports in JSON format
- ⚡ Concurrent crawling with rate limiting
- 🔍 Support for sitemap.xml crawling
- 🛡️ Comprehensive error handling

## Installation

```bash
pip install spiderforce4ai
```

## Quick Start

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig

# Initialize the client
spider = SpiderForce4AI("http://localhost:3004")

# Use default configuration
config = CrawlConfig()

# Crawl a single URL
result = spider.crawl_url("https://example.com", config)

# Crawl multiple URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2"
]
results = spider.crawl_urls(urls, config)

# Crawl from sitemap
results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
```
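
Each call returns result objects that expose `url`, `status`, and `markdown` attributes (the same attributes used in the agent example later in this README):

```python
# Print the converted Markdown for each successfully crawled page
for result in results:
    if result.status == "success":
        print(f"--- {result.url} ---")
        print(result.markdown)
```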

## Configuration

The `CrawlConfig` class provides various configuration options. All parameters are optional with sensible defaults:

```python
config = CrawlConfig(
    # Content Selection (all optional)
    target_selector="article",              # Specific element to target
    remove_selectors=[".ads", "#popup"],   # Elements to remove
    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
    
    # Processing Settings
    max_concurrent_requests=1,              # Default: 1
    request_delay=0.5,                     # Delay between requests in seconds
    timeout=30,                            # Request timeout in seconds
    
    # Output Settings
    output_dir="spiderforce_reports",      # Default output directory
    webhook_url="https://your-webhook.com", # Optional webhook endpoint
    webhook_timeout=10,                     # Webhook timeout in seconds
    report_file=None                        # Optional custom report location
)
```
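
As an illustration, a larger crawl might raise concurrency and shorten the delay; the values below are arbitrary examples, not recommendations:

```python
# Illustrative tuning for a bulk crawl, using only the parameters shown above
bulk_config = CrawlConfig(
    max_concurrent_requests=5,   # crawl up to 5 URLs in parallel
    request_delay=0.2,           # 200 ms between requests
    timeout=60,                  # give slow pages a full minute
    output_dir="bulk_reports",   # keep bulk output separate
)
results = spider.crawl_urls(urls, bulk_config)
```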

### Default Directory Structure

```
./
└── spiderforce_reports/
    ├── example-com-page1.md
    ├── example-com-page2.md
    └── crawl_report.json
```
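
The exact filename scheme is internal to the package, but the mapping appears to slugify each URL. A minimal sketch of that idea (a hypothetical helper, not part of the public API):

```python
from urllib.parse import urlparse

def url_to_filename(url: str) -> str:
    """Approximate the URL-based naming shown above (hypothetical helper)."""
    parsed = urlparse(url)
    slug = f"{parsed.netloc}{parsed.path}".strip("/")
    # Replace dots and slashes so the name is filesystem-safe
    slug = slug.replace(".", "-").replace("/", "-")
    return f"{slug}.md"

print(url_to_filename("https://example.com/page1"))  # example-com-page1.md
```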

## Webhook Notifications

If `webhook_url` is configured, the crawler sends a POST request for each crawled URL with the following JSON structure:

```json
{
  "url": "https://example.com/page1",
  "status": "success",
  "markdown": "# Page Title\n\nContent...",
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"],
    "remove_selectors_regex": ["modal-\\d+"]
  }
}
```
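
A minimal receiver for these notifications, using only the Python standard library (point `webhook_url` at it, e.g. `http://localhost:8000`):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    """Accept the POST payloads described above and log each result."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        print(f"{payload['status']}: {payload['url']}")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), WebhookHandler).serve_forever()
```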

## Crawl Report

A comprehensive JSON report is automatically generated in the output directory:

```json
{
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"],
    "remove_selectors_regex": ["modal-\\d+"]
  },
  "results": {
    "successful": [
      {
        "url": "https://example.com/page1",
        "status": "success",
        "markdown": "# Page Title\n\nContent...",
        "timestamp": "2025-02-15T10:30:00.123456"
      }
    ],
    "failed": [
      {
        "url": "https://example.com/page2",
        "status": "failed",
        "error": "HTTP 404: Not Found",
        "timestamp": "2025-02-15T10:30:01.123456"
      }
    ]
  },
  "summary": {
    "total": 2,
    "successful": 1,
    "failed": 1
  }
}
```
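
Because the report is plain JSON, it is easy to post-process. For example, to summarize a run and list failures (paths follow the default layout shown earlier):

```python
import json
from pathlib import Path

report = json.loads(Path("spiderforce_reports/crawl_report.json").read_text())

summary = report["summary"]
print(f"{summary['successful']}/{summary['total']} pages crawled successfully")
for failure in report["results"]["failed"]:
    print(f"  {failure['url']}: {failure['error']}")
```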

## Async Usage

```python
import asyncio
from spiderforce4ai import SpiderForce4AI, CrawlConfig

async def main():
    config = CrawlConfig()
    spider = SpiderForce4AI("http://localhost:3004")
    
    async with spider:
        results = await spider.crawl_urls_async(
            ["https://example.com/page1", "https://example.com/page2"],
            config
        )
    
    return results

if __name__ == "__main__":
    results = asyncio.run(main())
```

## Error Handling

The crawler is designed to be resilient:
- Continues processing even if some URLs fail; failed results are returned alongside successes (see the sketch below)
- Records all errors in the crawl report
- Sends error notifications via webhook if configured
- Provides clear error messages in console output
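
A small sketch of handling failures programmatically; the `.error` attribute on failed results is an assumption here, mirroring the `error` field in the crawl report:

```python
# Partition the returned results and report what went wrong
failures = [r for r in results if r.status == "failed"]

for r in failures:
    # .error is assumed to mirror the "error" field of the JSON report
    print(f"{r.url} failed: {r.error}")
```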

## Progress Tracking

The crawler provides real-time progress tracking in the console:

```
🔄 Crawling URLs... [####################] 100% 
✓ Successful: 95
✗ Failed: 5
📊 Report saved to: ./spiderforce_reports/crawl_report.json
```

## Usage with AI Agents

The package is designed to be easily integrated with AI agents and chat systems:

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig

def fetch_content_for_ai(urls):
    spider = SpiderForce4AI("http://localhost:3004")
    config = CrawlConfig()
    
    # Crawl content
    results = spider.crawl_urls(urls, config)
    
    # Return successful results
    return {
        result.url: result.markdown 
        for result in results 
        if result.status == "success"
    }

# Use with AI agent
urls = ["https://example.com/article1", "https://example.com/article2"]
content = fetch_content_for_ai(urls)
```

## Requirements

- Python 3.11 or later
- Docker (for running the SpiderForce4AI service)

## License

MIT License

## Credits

Created by [Peter Tam](https://petertam.pro)
