Metadata-Version: 2.2
Name: scrapy-mongo
Version: 1.1.0
Summary: MongoDB plugins for Scrapy
Author: Fabien Vauchelles
Project-URL: Homepage, https://github.com/scrapoxy/scrapy-mongo
Project-URL: Issues, https://github.com/scrapoxy/scrapy-mongo/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: scrapy
Requires-Dist: pymongo

# MongoDB plugins for Scrapy

## Installation

```shell
pip install scrapy-mongo
```


## Pipeline

This pipeline stores scraped items into a MongoDB collection.

Each item must have a unique `id` field to avoid duplicates. 
This field is automatically mapped to MongoDB’s `_id` field.

Each item must include a `collection` field that specifies the name of the target MongoDB collection.

Items are upserted in batches of `100` by default. 
The batch size can be adjusted using the `PIPELINE_MONGO_BATCH_SIZE` setting.
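
The mapping and batching described above can be sketched in plain Python. This is only an illustration, not the plugin's actual code: `to_upsert` and `batches` are hypothetical helpers, and both the `$set` update document and the removal of `collection` from the stored document are assumptions.

```python
from itertools import islice


def to_upsert(item):
    """Split an item into (collection name, filter, update) for an upsert.

    Hypothetical helper: the real pipeline may build its operations differently.
    """
    doc = dict(item)                    # copy so the caller's item is untouched
    collection = doc.pop("collection")  # name of the target collection
    doc_id = doc.pop("id")              # mapped to MongoDB's _id
    return collection, {"_id": doc_id}, {"$set": doc}


def batches(items, size=100):
    """Yield lists of at most `size` items (100 mirrors the documented default)."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# 250 sample items, all targeting a made-up "products" collection
items = [{"id": n, "collection": "products", "name": f"item-{n}"} for n in range(250)]
sizes = [len(chunk) for chunk in batches(items)]
first = to_upsert(items[0])
```

With the default batch size, 250 items are split into two full batches and one partial batch.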

To enable the pipeline, include the following lines in `settings.py`:

```python
ITEM_PIPELINES = {
    'scrapy_mongo.MongoPipeline': 300,
}
PIPELINE_MONGO_URL = "mongodb://localhost:27017"
PIPELINE_MONGO_DATABASE = "mycollection"
```

**Note:** Update `PIPELINE_MONGO_URL` and `PIPELINE_MONGO_DATABASE` 
with the appropriate values for the specific environment.


## Cache

The cache component stores scraped responses in a MongoDB collection to avoid downloading the same pages multiple times.
It uses [Scrapy's request fingerprint](https://docs.scrapy.org/en/2.10/topics/request-response.html#request-fingerprints) mechanism to identify responses.

To enable caching, include the following lines in `settings.py`:

```python
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_mongo.MongoCacheStorage'
HTTPCACHE_MONGO_URL = "mongodb://localhost:27017"
HTTPCACHE_MONGO_DATABASE = "scraping"
HTTPCACHE_EXPIRATION_SECS = 604800  # Default is 1 week
```

**Note:** Update `HTTPCACHE_MONGO_URL` and `HTTPCACHE_MONGO_DATABASE` 
with the appropriate values for the specific environment.

The default expiration time is set to **1 week** (604800 seconds). 
This value can be modified via `HTTPCACHE_EXPIRATION_SECS`.
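
As a small worked example of the arithmetic, the expiration value can be derived with `datetime.timedelta` instead of being hard-coded. The names `ONE_WEEK` and `ONE_DAY` are illustrative only and are not settings read by the plugin.

```python
from datetime import timedelta

# 7 days * 24 hours * 3600 seconds
ONE_WEEK = int(timedelta(weeks=1).total_seconds())  # 604800
ONE_DAY = int(timedelta(days=1).total_seconds())    # 86400

# Example: expire cached responses after one day instead of one week
HTTPCACHE_EXPIRATION_SECS = ONE_DAY
```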


## Cache policy

A cache policy with whitelist support is also available.
It caches only responses whose HTTP status code is whitelisted,
either through an explicit **list** or through a **regular expression**.

To enable the cache policy, add the following lines to `settings.py`:

```python
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy_mongo.CacheOnlyPolicy'
HTTPCACHE_ACCEPT_HTTP_CODES = [302]
HTTPCACHE_ACCEPT_HTTP_CODES_REGEX = r'2\d\d'
```

This configuration caches responses with any `2XX` HTTP status code, plus `302` redirects.
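
The whitelist check can be sketched as follows. This is a minimal illustration of combining an explicit list with a regular expression, not the code of `CacheOnlyPolicy`: `should_cache` is a hypothetical function, and matching the whole status code with `re.fullmatch` is an assumption.

```python
import re

ACCEPT_CODES = [302]     # explicit list, as in the settings above
ACCEPT_REGEX = r"2\d\d"  # regular expression, as in the settings above


def should_cache(status, codes=ACCEPT_CODES, pattern=ACCEPT_REGEX):
    """Return True if the status is listed explicitly or matches the regex."""
    if status in codes:
        return True
    # Assumption: the pattern must match the entire status code
    return re.fullmatch(pattern, str(status)) is not None


results = {code: should_cache(code) for code in (200, 204, 302, 301, 404, 500)}
```

Under these assumptions, `200`, `204` and `302` are cached, while `301`, `404` and `500` are not.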


## Error

The error component stores error logs in a MongoDB collection.
It catches errors raised in both the downloader and the spider middleware chains.

To enable error logging, include the following lines in `settings.py`:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_mongo.TraceErrorDownloaderMiddleware': 1000,
}

SPIDER_MIDDLEWARES = {
    'scrapy_mongo.TraceErrorSpiderMiddleware': 1000,
}

ERROR_MONGO_URL = "mongodb://localhost:27017"
ERROR_MONGO_DATABASE = 'scraping'
ERROR_MONGO_COLLECTION = 'errors'
```

**Note:** Update `ERROR_MONGO_URL`, `ERROR_MONGO_DATABASE` and `ERROR_MONGO_COLLECTION` 
with the appropriate values for the specific environment.

It is possible to use the same MongoDB connection for the pipeline, the cache and the error component
by replacing `PIPELINE_MONGO_URL`, `HTTPCACHE_MONGO_URL` and `ERROR_MONGO_URL` with a single `MONGO_URL` setting.
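
As a sketch, a `settings.py` that shares one connection string might look like this. The values are the placeholders from the earlier examples, and keeping the database and collection settings separate per component is an assumption based on the settings shown above.

```python
# One connection string shared by the pipeline, the cache and the error component
MONGO_URL = "mongodb://localhost:27017"

# Pipeline
ITEM_PIPELINES = {"scrapy_mongo.MongoPipeline": 300}
PIPELINE_MONGO_DATABASE = "mycollection"

# Cache
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = "scrapy_mongo.MongoCacheStorage"
HTTPCACHE_MONGO_DATABASE = "scraping"

# Error logging
DOWNLOADER_MIDDLEWARES = {"scrapy_mongo.TraceErrorDownloaderMiddleware": 1000}
SPIDER_MIDDLEWARES = {"scrapy_mongo.TraceErrorSpiderMiddleware": 1000}
ERROR_MONGO_DATABASE = "scraping"
ERROR_MONGO_COLLECTION = "errors"
```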


## Build for publish

Install dependencies:

```shell
pip install build twine
```

Build the package:

```shell
python -m build --outdir dist
```

Publish to PyPI:

```shell
python -m twine upload dist/*
```
