Metadata-Version: 2.4
Name: philiprehberger-web-scraper
Version: 0.1.9
Summary: Lightweight web scraper with rate limiting and CSS selectors
Project-URL: Homepage, https://github.com/philiprehberger/py-web-scraper#readme
Project-URL: Repository, https://github.com/philiprehberger/py-web-scraper
Project-URL: Issues, https://github.com/philiprehberger/py-web-scraper/issues
Project-URL: Changelog, https://github.com/philiprehberger/py-web-scraper/blob/main/CHANGELOG.md
Author: Philip Rehberger
License-Expression: MIT
License-File: LICENSE
Keywords: crawler,extraction,html,scraper,web
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: requests>=2.28
Description-Content-Type: text/markdown

# philiprehberger-web-scraper

[![Tests](https://github.com/philiprehberger/py-web-scraper/actions/workflows/publish.yml/badge.svg)](https://github.com/philiprehberger/py-web-scraper/actions/workflows/publish.yml)
[![PyPI version](https://img.shields.io/pypi/v/philiprehberger-web-scraper.svg)](https://pypi.org/project/philiprehberger-web-scraper/)
[![Last updated](https://img.shields.io/github/last-commit/philiprehberger/py-web-scraper)](https://github.com/philiprehberger/py-web-scraper/commits/main)

Lightweight web scraper with rate limiting and CSS selectors.

## Installation

```bash
pip install philiprehberger-web-scraper
```

## Usage

```python
from philiprehberger_web_scraper import Scraper

scraper = Scraper(rate_limit=2.0, retry_attempts=3)

# Fetch a single page
page = scraper.get("https://example.com")
titles = page.select_all("h2.title")
link = page.select_one("a.next")
all_links = page.links()

# Extract data
for el in page.select_all(".product"):
    print(el.select_one(".name").text)
    print(el.select_one("a").attr("href"))

# Crawl multiple pages
for page in scraper.crawl("https://example.com/blog", max_pages=20):
    for article in page.select_all("article"):
        print(article.select_one("h2").text)

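# Export (data is assumed here to be a list of dicts, one per row)
data = [{"title": el.text} for el in titles]
Scraper.export_csv(data, "output.csv")
Scraper.export_json(data, "output.json")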
```
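
Crawl mode is described as filtering to the same domain. The sketch below illustrates that idea with the standard library only; it is not the package's implementation, and the helper name `same_domain_links` and its exact rules (HTTP(S) only, fragments dropped, duplicates skipped) are assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Hypothetical helper: keep only links on the same host as base_url."""
    base_host = urlparse(base_url).netloc.lower()
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Resolve relative links against the page URL and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc.lower() != base_host:
            continue  # skip links that leave the starting domain
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

A crawler built this way would feed only the returned URLs back into its queue, so a crawl started at `https://example.com/blog` never follows links to other hosts.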

## Features

- Built-in rate limiting (token bucket)
- Retry with backoff on 429/5xx errors
- CSS selector API wrapping BeautifulSoup
- Crawl mode with same-domain filtering
- Link and image extraction
- CSV and JSON export helpers
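
The list above names a token bucket for rate limiting. As a rough illustration of that technique (not the package's actual code), the sketch below refills tokens at `rate` per second and blocks when the bucket is empty. The class name `TokenBucket` and the `capacity`, `clock`, and `sleep` parameters are hypothetical; the injectable clock is there only to make the behaviour easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative token bucket allowing roughly `rate` acquisitions per second."""

    def __init__(
        self,
        rate: float,
        capacity: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self._clock = clock
        self._sleep = sleep
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last refill, up to capacity.
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def acquire(self) -> float:
        """Block until one token is available; return the seconds waited."""
        self._refill()
        waited = 0.0
        if self.tokens < 1.0:
            # Wait exactly as long as it takes to accumulate the missing fraction.
            waited = (1.0 - self.tokens) / self.rate
            self._sleep(waited)
            self._refill()
        self.tokens -= 1.0
        return waited
```

With `rate=2.0` and the default capacity of one token, the first `acquire()` returns immediately and each immediate follow-up call waits about half a second. A larger `capacity` would allow short bursts before throttling starts.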

## Options

```python
Scraper(
    rate_limit=2.0,        # max requests per second
    retry_attempts=3,      # number of retries on failure
    retry_delay=1.0,       # base delay between retries, in seconds
    timeout=30.0,          # request timeout, in seconds
    headers={...},         # custom headers
)
```
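
To show how `retry_attempts` and `retry_delay` might interact, here is a minimal sketch of retry with backoff on 429/5xx responses. It assumes the delay doubles on every retry, which this README does not confirm, and the names `is_retryable`, `fetch_with_retry`, and `send` are hypothetical stand-ins rather than part of the package.

```python
import time
from typing import Callable


def is_retryable(status: int) -> bool:
    """Treat 429 (Too Many Requests) and any 5xx status as retryable."""
    return status == 429 or 500 <= status < 600


def fetch_with_retry(
    send: Callable[[], int],
    retry_attempts: int = 3,
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() and retry on retryable statuses; return the last status seen."""
    status = send()
    for attempt in range(retry_attempts):
        if not is_retryable(status):
            break
        # Assumed exponential backoff: retry_delay, 2 * retry_delay, 4 * retry_delay, ...
        sleep(retry_delay * 2 ** attempt)
        status = send()
    return status
```

Under these assumptions the defaults wait 1, 2, and 4 seconds before giving up, and a non-retryable status such as 404 is returned on the first attempt without any delay.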

## API

| Function / Class | Description |
|------------------|-------------|
| `Scraper(rate_limit, retry_attempts, retry_delay, timeout, headers)` | Web scraper with rate limiting and retry; fetches pages via `get()` and `crawl()`, and exports data via `export_csv()` and `export_json()` |
| `Page` | A fetched web page with `select_one()`, `select_all()`, `links()`, `images()`, and `title`/`text` properties |
| `Element` | Wrapper around a parsed element with `text`, `html`, `attr()`, `select_one()`, `select_all()` |

## Development

```bash
pip install -e .
python -m pytest tests/ -v
```

## Support

If you find this project useful:

⭐ [Star the repo](https://github.com/philiprehberger/py-web-scraper)

🐛 [Report issues](https://github.com/philiprehberger/py-web-scraper/issues?q=is%3Aissue+is%3Aopen+label%3Abug)

💡 [Suggest features](https://github.com/philiprehberger/py-web-scraper/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement)

❤️ [Sponsor development](https://github.com/sponsors/philiprehberger)

🌐 [All Open Source Projects](https://philiprehberger.com/open-source-packages)

💻 [GitHub Profile](https://github.com/philiprehberger)

🔗 [LinkedIn Profile](https://www.linkedin.com/in/philiprehberger)

## License

[MIT](LICENSE)
