Metadata-Version: 2.4
Name: scraper4ai
Version: 1.0.0
Summary: A web scraper that prepares web content for LLM applications.
Author-email: Keisuke Miyamoto <aichiboyhighschool@gmail.com>
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: aiofiles
Requires-Dist: aiohappyeyeballs
Requires-Dist: aiohttp
Requires-Dist: aiosignal
Requires-Dist: aiosqlite
Requires-Dist: alphashape
Requires-Dist: annotated-types
Requires-Dist: anyio
Requires-Dist: attrs
Requires-Dist: beautifulsoup4
Requires-Dist: Brotli
Requires-Dist: certifi
Requires-Dist: cffi
Requires-Dist: chardet
Requires-Dist: charset-normalizer
Requires-Dist: click
Requires-Dist: click-log
Requires-Dist: cloudscraper
Requires-Dist: colorama
Requires-Dist: Crawl4AI
Requires-Dist: cryptography
Requires-Dist: cssselect
Requires-Dist: curl_cffi
Requires-Dist: Deprecated
Requires-Dist: distro
Requires-Dist: fake-http-header
Requires-Dist: fake-useragent
Requires-Dist: filelock
Requires-Dist: frozenlist
Requires-Dist: fsspec
Requires-Dist: greenlet
Requires-Dist: h11
Requires-Dist: h2
Requires-Dist: hf-xet
Requires-Dist: hpack
Requires-Dist: httpcore
Requires-Dist: httpx
Requires-Dist: huggingface-hub
Requires-Dist: humanize
Requires-Dist: hyperframe
Requires-Dist: idna
Requires-Dist: importlib_metadata
Requires-Dist: Jinja2
Requires-Dist: jiter
Requires-Dist: joblib
Requires-Dist: jsonschema
Requires-Dist: jsonschema-specifications
Requires-Dist: lark
Requires-Dist: litellm
Requires-Dist: lxml
Requires-Dist: markdown-it-py
Requires-Dist: markdownify
Requires-Dist: MarkupSafe
Requires-Dist: mdurl
Requires-Dist: mpmath
Requires-Dist: multidict
Requires-Dist: networkx
Requires-Dist: nltk
Requires-Dist: numpy
Requires-Dist: openai
Requires-Dist: outcome
Requires-Dist: packaging
Requires-Dist: pillow
Requires-Dist: playwright
Requires-Dist: propcache
Requires-Dist: psutil
Requires-Dist: pycparser
Requires-Dist: pydantic
Requires-Dist: pydantic_core
Requires-Dist: pyee
Requires-Dist: Pygments
Requires-Dist: pyOpenSSL
Requires-Dist: pyparsing
Requires-Dist: pyperclip
Requires-Dist: PySocks
Requires-Dist: python-dotenv
Requires-Dist: PyYAML
Requires-Dist: rank-bm25
Requires-Dist: referencing
Requires-Dist: regex
Requires-Dist: requests
Requires-Dist: requests-toolbelt
Requires-Dist: rich
Requires-Dist: rpds-py
Requires-Dist: rtree
Requires-Dist: safetensors
Requires-Dist: scikit-learn
Requires-Dist: scipy
Requires-Dist: selenium
Requires-Dist: sentence-transformers
Requires-Dist: setuptools
Requires-Dist: shapely
Requires-Dist: six
Requires-Dist: sniffio
Requires-Dist: snowballstemmer
Requires-Dist: sortedcontainers
Requires-Dist: soupsieve
Requires-Dist: sympy
Requires-Dist: tf-playwright-stealth
Requires-Dist: threadpoolctl
Requires-Dist: tiktoken
Requires-Dist: tokenizers
Requires-Dist: torch
Requires-Dist: tqdm
Requires-Dist: transformers
Requires-Dist: trimesh
Requires-Dist: trio
Requires-Dist: trio-websocket
Requires-Dist: typing-inspection
Requires-Dist: typing_extensions
Requires-Dist: undetected-chromedriver
Requires-Dist: urllib3
Requires-Dist: websocket-client
Requires-Dist: websockets
Requires-Dist: wrapt
Requires-Dist: wsproto
Requires-Dist: xxhash
Requires-Dist: yarl
Requires-Dist: zipp

# scraper4ai

`scraper4ai` is a powerful and easy-to-use Python library for web scraping, specifically designed to prepare web content for AI and Large Language Model (LLM) applications. It fetches web pages, cleans the HTML, and converts the main content into clean, structured Markdown. It also extracts valuable data like links, images, and videos. The library is built with asynchronous support from the ground up, allowing for efficient scraping of multiple URLs concurrently.

## Features

*   **AI-Ready Content**: Converts messy HTML into clean Markdown, perfect for LLM processing.
*   **Asynchronous Support**: Scrape multiple URLs concurrently with `invoke_all` for high performance.
*   **Rich Data Extraction**: Extracts not just the main content, but also hyperlinks, images, and video sources.
*   **JA3/TLS Fingerprint Spoofing**: Uses `curl_cffi` to impersonate real browser profiles (like Chrome 136), helping to bypass many anti-bot measures.
*   **Optimized Performance**: Session reuse and connection pooling for improved efficiency and reduced overhead.
*   **Customizable Cleaning**: Easily specify which HTML tags or CSS selectors to remove before Markdown conversion.
*   **Resource Management**: Automatic session handling with proper cleanup methods.
*   **Simple API**: Get started in just a few lines of code with an intuitive API.

## Installation

```bash
pip install scraper4ai
```

## Usage

### Basic Usage

Here's a simple example of how to scrape a single URL and get the clean Markdown content.

```python
from scraper4ai import WebScraper

# Initialize the scraper
scraper = WebScraper()

# Scrape a single URL
url = "https://example.com"
result = scraper.invoke(url)

if result.status_code == 200:
    print(result.markdown)
else:
    print(f"Failed to scrape {url}. Status code: {result.status_code}")
```

### Batch Scraping

Use `invoke_all` to efficiently process a list of URLs concurrently.

```python
from scraper4ai import WebScraper

# Initialize the scraper
scraper = WebScraper()

urls = ["https://www.python.org/", "https://github.com/"]

# Scrape all URLs concurrently
results = scraper.invoke_all(urls)

for result in results:
    if result.status_code == 200:
        print(f"--- Content from {result.url} ---")
        print(result.markdown)
        print("-" * 20)
    else:
        print(f"Failed to scrape {result.url}. Status code: {result.status_code}")
```

### Customizing HTML Cleaning

You can easily remove unwanted HTML tags or elements matching CSS selectors before the content is converted to Markdown.

```python
from scraper4ai import WebScraper

scraper = WebScraper()

# Add custom rules to remove navigation and footer elements
scraper.ignore_these_tags_in_markdown(["nav", "footer"])
# Add custom rule to remove any element with class="cookie-banner"
scraper.ignore_these_css_in_markdown([".cookie-banner"])

# These rules will be applied to all subsequent .invoke() or .invoke_all() calls
result = scraper.invoke("https://example.com")
print(result.markdown)

# Don't forget to close the session when done to free resources
scraper.close()
```

## The `ScrapedResult` Object

The `invoke()` method returns a single `ScrapedResult` object, and `invoke_all()` returns a list of them. This object holds all the data scraped from a page.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LinkData:
    url: str
    text: Optional[str] = None

@dataclass
class ImageData:
    url: str
    alt_text: Optional[str] = None

@dataclass
class VideoData:
    url: str
    title: Optional[str] = None

@dataclass
class ScrapedResult:
    url: str
    status_code: int
    raw_html: Optional[str]
    markdown: Optional[str]
    links: Optional[List[LinkData]] = field(default_factory=list)
    image_links: Optional[List[ImageData]] = field(default_factory=list)
    video_links: Optional[List[VideoData]] = field(default_factory=list)
```

*   `url` (str): The original URL that was scraped.
*   `status_code` (int): The HTTP status code of the response. On failure, this is `-1` if no response was received, or the actual HTTP error code.
*   `raw_html` (Optional[str]): The original, unmodified HTML content of the page. `None` on failure.
*   `markdown` (Optional[str]): The cleaned, converted Markdown content. `None` on failure.
*   `links` (Optional[List[LinkData]]): A list of all hyperlinks found on the page. `None` on failure.
*   `image_links` (Optional[List[ImageData]]): A list of all images found on the page. `None` on failure.
*   `video_links` (Optional[List[VideoData]]): A list of all videos found on the page. `None` on failure.
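
For example, the extracted link and image data can be iterated directly:

```python
from scraper4ai import WebScraper

scraper = WebScraper()
result = scraper.invoke("https://example.com")

if result.status_code == 200:
    # Print each hyperlink with its anchor text
    for link in result.links:
        print(f"{link.text or '(no text)'} -> {link.url}")
    # Print each image with its alt text
    for image in result.image_links:
        print(f"{image.alt_text or '(no alt)'} -> {image.url}")

scraper.close()
```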

## Advanced Features

### Browser Impersonation

The library uses Chrome 136 browser fingerprints for broad compatibility and to help evade anti-bot detection. The impersonation profile automatically switches to a mobile variant when needed.
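
Under the hood this relies on `curl_cffi`'s `impersonate` option. As a rough illustration of the mechanism (this uses `curl_cffi` directly, not the `WebScraper` API):

```python
from curl_cffi import requests

# curl_cffi sends TLS (JA3) and HTTP/2 fingerprints that match a real
# Chrome build, so the request looks like a genuine browser at the
# transport level. "chrome" targets the latest supported Chrome profile.
response = requests.get("https://example.com", impersonate="chrome")
print(response.status_code)
```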

### Retry Logic

An intelligent retry mechanism with exponential backoff handles temporary network issues gracefully without overwhelming servers.
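
The exact retry parameters are internal to the library, but the pattern is the standard one: double the wait after each failed attempt. A minimal sketch of the idea (not the library's actual implementation):

```python
import time

def fetch_with_backoff(fetch, retries=3, base_delay=1.0):
    """Call `fetch`, doubling the delay after each failure."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            # Waits 1s, 2s, 4s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```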

## Error Handling

If the scraper fails to fetch a URL after several retries, it **will not raise an exception**. Instead, it returns a `ScrapedResult` object where:
*   `status_code` is set to `-1` (or the actual HTTP error status code if one was received).
*   `raw_html`, `markdown`, and the link lists are set to `None`.

This design allows you to handle failures gracefully without crashing, especially during batch processing.
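
This makes it straightforward to separate successes from failures after a batch run:

```python
from scraper4ai import WebScraper

scraper = WebScraper()
results = scraper.invoke_all([
    "https://example.com",
    "https://nonexistent.invalid",
])

succeeded = [r for r in results if r.status_code == 200]
failed = [r for r in results if r.status_code != 200]

print(f"{len(succeeded)} pages scraped successfully")
for r in failed:
    # status_code is -1 for network-level failures,
    # or the HTTP error code if the server responded
    print(f"Could not scrape {r.url} (status {r.status_code})")

scraper.close()
```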

## Performance Tips

- **Session Reuse**: The WebScraper automatically reuses HTTP sessions for better performance when making multiple sequential requests.
- **Batch Processing**: Use `invoke_all()` for concurrent processing of multiple URLs with optimized connection pooling.
- **Resource Cleanup**: Call `scraper.close()` when finished to properly release session resources.
- **Connection Limits**: The async session limits concurrent connections to prevent overwhelming target servers.

```python
from scraper4ai import WebScraper

# Create scraper instance
scraper = WebScraper()

# Process multiple URLs efficiently
results = scraper.invoke_all([
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
])

# Clean up resources
scraper.close()
```
