Metadata-Version: 2.4
Name: fsm-crawl
Version: 0.2.0
Summary: A parallel web crawler with consent management using Playwright
License: MIT
Keywords: crawler,playwright,web-scraping,consent
Author: Henry Schwerdtner
Author-email: henry.schwerdtner@web.de
Requires-Python: >=3.11,<3.14
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: dev
Requires-Dist: adblockparser (>=0.7,<0.8)
Requires-Dist: d3graph (>=2.6.1,<3.0.0)
Requires-Dist: graphviz (>=0.21,<0.22)
Requires-Dist: ipysigma (>=0.24.6,<0.25.0)
Requires-Dist: matplotlib (>=3.10.8,<4.0.0)
Requires-Dist: networkx (>=3.6.1,<4.0.0)
Requires-Dist: pandas (>=2.3.3,<3.0.0)
Requires-Dist: playwright (>=1.57.0,<2.0.0)
Requires-Dist: pytest (>=9.0.2,<10.0.0) ; extra == "dev"
Requires-Dist: pytest-mock (>=3.15.1,<4.0.0) ; extra == "dev"
Requires-Dist: pyyaml (>=6.0.3,<7.0.0)
Requires-Dist: rich (>=14.2.0,<15.0.0)
Requires-Dist: scikit-learn (>=1.8.0,<2.0.0)
Requires-Dist: seaborn (>=0.13.2,<0.14.0)
Requires-Dist: tensorflow (>=2.20.0,<3.0.0)
Requires-Dist: tldextract (>=5.3.1,<6.0.0)
Requires-Dist: tqdm (>=4.67.1,<5.0.0)
Requires-Dist: transitions (>=0.9.3,<0.10.0)
Project-URL: Repository, https://github.com/yourusername/fsm-crawl
Description-Content-Type: text/markdown

# FSM-Crawl

A high-performance parallel web crawler built with Playwright and Python. Features automatic cookie consent management and configurable crawling strategies.

## Features

- **Parallel Tab Crawling**: Crawl with multiple tabs at once (default: 10, configurable via `--num-tabs`)
- **Automatic Cookie Consent**: Intelligently accepts cookies across multiple sites
- **Multiple Crawling Strategies**: 
  - Normal crawl (no consent)
  - Normal crawl with cookies
  - Explorative crawl (probability-based navigation)
- **Request/Response Logging**: Detailed CSV logs of all network activity
- **Distributed Crawling**: Support for sharded crawls across multiple machines
- **Headless & Headed Modes**: Run with or without browser UI

## Installation

### Option 1: PyPI (Recommended)
```bash
pip install fsm-crawl
```

### Option 2: From Git
```bash
pip install git+https://github.com/yourusername/fsm-crawl.git
```

### Option 3: Development Install
```bash
git clone https://github.com/yourusername/fsm-crawl.git
cd fsm-crawl
pip install -e .
```

## Quick Start

### Basic Usage

```bash
# Run default normal crawl with 10 parallel tabs on first 1000 URLs
fsm-crawl

# Run with cookies enabled
fsm-crawl --experiment normal_with_cookies

# Run explorative crawl strategy
fsm-crawl --experiment explorative
```

### Configuration

```bash
# Specify input URL file
fsm-crawl --path urls.csv

# Set output prefix for logs
fsm-crawl --prefix my_crawl

# Custom number of parallel tabs (1-20 recommended)
fsm-crawl --num-tabs 5

# Run in headless mode (no browser window)
fsm-crawl --headless

# Distributed crawling with shards
fsm-crawl --shard-index 0 --shard-count 4  # First shard of 4
fsm-crawl --shard-index 1 --shard-count 4  # Second shard of 4
```
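Sharded runs split one URL list across machines so each shard crawls a disjoint subset. A common way to do this is round-robin assignment by index; the sketch below illustrates that scheme (it is an assumption about fsm-crawl's partitioning, not its actual code):

```python
# Illustrative sketch of round-robin shard partitioning: URL at index i
# is handled by shard (i % shard_count). fsm-crawl's real scheme may differ.

def shard_urls(urls, shard_index, shard_count):
    """Return the subset of URLs this shard is responsible for."""
    return [u for i, u in enumerate(urls) if i % shard_count == shard_index]

urls = [f"https://site{i}.example" for i in range(10)]
# With 4 shards, shard 0 gets the URLs at indices 0, 4, 8.
print(shard_urls(urls, 0, 4))
```

Every URL lands in exactly one shard, so the four machines together cover the full list without overlap.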

## CLI Commands

```
usage: fsm-crawl [-h] [--shard-index SHARD_INDEX] [--shard-count SHARD_COUNT]
                  [-e {normal,normal_with_cookies,explorative}]
                  [-p PATH] [--prefix PREFIX] [--path2 PATH2]
                  [--prefix2 PREFIX2] [--headless] [--engine {playwright}]
                  [--num-tabs NUM_TABS]

Run FSM web crawler experiments

options:
  -h, --help            show this help message and exit
  --shard-index SHARD_INDEX
                        Shard ID (for distributed runs)
  --shard-count SHARD_COUNT
                        Total number of shards (for distributed runs)
  -e {normal,normal_with_cookies,explorative}, --experiment {normal,normal_with_cookies,explorative}
                        Which experiment to run
  -p PATH, --path PATH  Path to input CSV for URL manager
  --prefix PREFIX       Filename prefix for output logs
  --path2 PATH2         Path to second CSV for two-run mode
  --prefix2 PREFIX2     Filename prefix for second run
  --headless            Run browser in headless mode
  --engine {playwright}
                        Browser engine to use
  --num-tabs NUM_TABS   Number of parallel tabs (default: 10)
```

## Input Format

The input CSV file should list one URL per line:

```csv
https://example.com
https://another-site.com
https://third-site.org
...
```
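A URL file in this format can be generated with Python's standard library, for example:

```python
import csv

urls = [
    "https://example.com",
    "https://another-site.com",
    "https://third-site.org",
]

# Write one URL per row; with a single column this is equivalent
# to a plain one-URL-per-line text file.
with open("urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for url in urls:
        writer.writerow([url])
```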

## Output

The crawler generates two CSV files in the `crawl_logs/` directory:

- **request_*.csv**: All HTTP requests made during crawling
- **response_*.csv**: All HTTP responses received

Each row contains:
- Timestamp
- Request method, URL, headers
- Response status, headers, cookies
- Classification labels for blocking

## Examples

### Crawl with custom tabs and output prefix
```bash
fsm-crawl --experiment normal_with_cookies --path my_urls.csv --prefix my_crawl --num-tabs 8
```

### Distributed crawling across 4 machines
```bash
# Machine 1
fsm-crawl --shard-index 0 --shard-count 4 --prefix distributed_crawl

# Machine 2
fsm-crawl --shard-index 1 --shard-count 4 --prefix distributed_crawl

# Machine 3
fsm-crawl --shard-index 2 --shard-count 4 --prefix distributed_crawl

# Machine 4
fsm-crawl --shard-index 3 --shard-count 4 --prefix distributed_crawl
```

### Headless mode with explorative strategy
```bash
fsm-crawl --experiment explorative --headless --num-tabs 10
```

## Development

### Install dev dependencies
```bash
pip install -e ".[dev]"
```

### Run tests
```bash
pytest
```

### Run with Poetry
```bash
poetry run fsm-crawl --help
```

## Configuration Files

The crawler uses a `blocking/consent-manager.yaml` file to define CMP (Consent Management Platform) detection and cookie acceptance rules.
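The schema of this file is project-specific. Purely as an illustration (the field names below are assumptions, not fsm-crawl's actual format), such rule files commonly map each CMP to a detection selector and an accept-button selector:

```yaml
# Illustrative only: field names are assumptions, not the actual schema.
cmps:
  - name: example-cmp
    detect: "div#cookie-banner"   # selector that identifies the CMP
    accept: "button.accept-all"   # selector for the accept button
```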

## Architecture

- **PlaywrightEngine**: Manages browser tabs and page interactions
- **BrowserManager**: Coordinates parallel crawling across tabs
- **ConsentManager**: Handles cookie acceptance automation
- **RequestResponseLoggingPipeline**: Logs all network activity to CSV
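The tab-capping idea behind the BrowserManager can be illustrated with a plain `asyncio` semaphore. This is a conceptual sketch, not the package's actual code; `crawl_one` is a stand-in for real Playwright page work:

```python
import asyncio

NUM_TABS = 10  # mirrors --num-tabs

async def crawl_one(url, sem):
    # At most NUM_TABS coroutines are inside this section at once,
    # just as the crawler caps the number of open browser tabs.
    async with sem:
        await asyncio.sleep(0)  # placeholder for real page work
        return f"crawled {url}"

async def crawl_all(urls):
    sem = asyncio.Semaphore(NUM_TABS)
    return await asyncio.gather(*(crawl_one(u, sem) for u in urls))

results = asyncio.run(crawl_all([f"https://site{i}.example" for i in range(25)]))
print(len(results))  # 25
```

`asyncio.gather` preserves input order, so results line up with the URL list even though crawls overlap.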

## Requirements

- Python 3.11+
- Chromium browser (installed automatically via Playwright)
- 2 GB RAM minimum (4 GB+ recommended for 10 tabs)

## License

MIT

## Support

For issues, please open a GitHub issue or contact the maintainers.

## Citation

If you use FSM-Crawl in your research, please cite:

```bibtex
@software{fsm-crawl,
  title={FSM-Crawl: A Parallel Web Crawler with Consent Management},
  author={Schwerdtner, Henry},
  year={2026},
  url={https://github.com/yourusername/fsm-crawl}
}
```

