Metadata-Version: 2.1
Name: webrover
Version: 0.1.5
Summary: Generate high-quality datasets from web content for AI training
Home-page: https://github.com/Area-25/webrover
Author: Area-25
Author-email: jasonquist.ssh@gmail.com
Keywords: web-scraping,dataset-generation,machine-learning,ai-training,deep-learning
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: aiohttp==3.11.8
Requires-Dist: beautifulsoup4==4.12.3
Requires-Dist: googlesearch-python==1.2.5
Requires-Dist: pyyaml==6.0.2
Requires-Dist: setuptools==75.6.0
Requires-Dist: tqdm==4.67.1

# WebRover 🚀

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**WebRover is a powerful Python library for generating high-quality datasets from web content, designed specifically for training Large Language Models and AI applications.**

---

## 🌟 Features

- **Smart Web Scraping**: Automatically find and scrape relevant content based on topics
- **Multiple Input Formats**: Support for JSON, YAML, TXT, and Markdown topic files
- **Async Processing**: Fast, concurrent scraping with built-in rate limiting
- **Quality Control**: Built-in content validation and cleaning
- **LLM-Ready Output**: Structured JSONL format perfect for model training
- **Error Handling**: Robust error tracking and recovery mechanisms

## 🚀 Quick Start

### Installation
```bash
pip install webrover
```

### Basic Usage
```python
from webrover import WebRover

# Initialize WebRover
rover = WebRover()

# Scrape content from topics
rover.scrape_topics(
    topics=["artificial intelligence", "machine learning"],
    num_websites=100
)

# Save the dataset
rover.save_dataset("my_dataset.jsonl")
```

### Using Topic Files
```python
# From JSON file
rover.scrape_topics(
    topics="topics.json",
    num_websites=100
)

# From Markdown list
rover.scrape_topics(
    topics="topics.md",
    num_websites=100
)
```

## 📖 Documentation

### Supported Topic File Formats

#### JSON
```json
{
    "topics": [
        "AI basics",
        "machine learning",
        "deep learning"
    ]
}
```

#### YAML
```yaml
topics:
  - AI basics
  - machine learning
  - deep learning
```

#### Markdown
```markdown
- AI basics
- machine learning
- deep learning
```
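
All three formats above reduce to a flat list of topic strings, so they are easy to parse by hand. The sketch below is **not** WebRover's internal loader (its implementation isn't shown here); it illustrates the idea for JSON and Markdown/TXT lists, with YAML handled the same way via `pyyaml`'s `yaml.safe_load`:

```python
import json
from pathlib import Path

def load_topics(path):
    """Load a topic list from a .json, .md, or .txt file.

    A minimal sketch, not WebRover's actual loader. A YAML file could
    be handled analogously with yaml.safe_load(text)["topics"].
    """
    text = Path(path).read_text(encoding="utf-8")
    if Path(path).suffix.lower() == ".json":
        return json.loads(text)["topics"]
    # Markdown/TXT: one topic per line, with an optional "- " bullet prefix
    topics = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            line = line[2:]
        if line:
            topics.append(line)
    return topics
```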

### Output Structure
```python
{
    'url': 'https://example.com/article',
    'title': 'Article Title',
    'content': 'Article content...',
    'metadata': {
        'length': 1234,
        'has_title': True,
        'domain': 'example.com'
    }
}
```
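
Because the dataset is JSONL (one JSON object per line), it can be read back without WebRover itself. A small helper, assuming the record layout shown above:

```python
import json

def load_dataset(path):
    """Read a JSONL dataset: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. keep only longer articles (the 500-character cutoff is arbitrary):
# long_docs = [d for d in load_dataset("my_dataset.jsonl")
#              if d["metadata"]["length"] > 500]
```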

## 🛠️ Advanced Usage

```python
# Initialize with custom output directory
rover = WebRover(output_dir="my_datasets")

# Get scraping statistics
stats = rover.get_stats()
print(f"Success rate: {stats['success_rate']*100:.1f}%")

# Access dataset programmatically
dataset = rover.get_dataset()
```

## 📊 Output Files

- `final_dataset/dataset.jsonl`: Main dataset in JSONL format
- `websites_master.json`: List of all discovered URLs
- `websites_completed.json`: Successfully scraped URLs
- `websites_errors.json`: Failed attempts with error details
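
These bookkeeping files also make it possible to recompute a success rate without calling `get_stats()`. The sketch below assumes each file holds a JSON list of entries; that schema is an assumption for illustration, not a documented guarantee:

```python
import json

def success_rate(completed_path, errors_path):
    """Recompute the scrape success rate from WebRover's output files.

    Assumes both files contain JSON lists (an assumption, not a
    documented schema guarantee).
    """
    with open(completed_path, encoding="utf-8") as f:
        completed = json.load(f)
    with open(errors_path, encoding="utf-8") as f:
        errors = json.load(f)
    total = len(completed) + len(errors)
    return len(completed) / total if total else 0.0
```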

## 🔄 Error Handling

WebRover automatically handles common issues:
- Rate limiting
- Network timeouts
- Invalid URLs
- Blocked requests
- Malformed content
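
The recovery strategies listed above typically boil down to retrying transient failures with exponential backoff. A generic, library-agnostic sketch of that pattern (not WebRover's actual implementation):

```python
import random
import time

def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0):
    """Retry a fetch callable on transient errors with exponential backoff.

    A sketch of the pattern, not WebRover's internal code. `fetch` is any
    callable that takes a URL and raises ConnectionError or TimeoutError
    on a transient failure.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Back off 1s, 2s, 4s, ... plus jitter to spread out retries
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```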

## 🚧 Limitations

- Respects robots.txt and site rate limits
- Some sites may block automated access
- Large datasets require more processing time
- Google search may throttle excessive requests

## 🗺️ Roadmap

See [FUTURE.md](FUTURE.md) for planned features and improvements.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📜 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

Built with ❤️ by Area-25. Special thanks to all contributors.

---

**WebRover: Build better datasets, train better models.** 🚀
