Metadata-Version: 2.3
Name: modis-crawler-utils
Version: 0.3.55
Summary: Scrapy utils for Modis crawlers projects.
License: BSD
Author: Varlamov
Author-email: varlamov@ispras.ru
Requires-Python: >=3.9,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: Scrapy
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: Pillow (>=7.1.2)
Requires-Dist: certifi
Requires-Dist: dateparser (>=1.2.0)
Requires-Dist: ephemeral-port-reserve (>=1.1.1)
Requires-Dist: itemadapter (>=0.2.0)
Requires-Dist: kafka-python-ng (>=2.2.3,<3.0.0)
Requires-Dist: opensearch-py (>=2.7.1,<3.0.0)
Requires-Dist: pika (>=1.3.2,<2.0.0)
Requires-Dist: pydantic (>=2.4.2,<3.0.0)
Requires-Dist: pymongo (>=3.10.1)
Requires-Dist: pytest (>=8.3.4,<9.0.0)
Requires-Dist: pytest-cases (>=3.8.6,<4.0.0)
Requires-Dist: pytest-check (>=2.5.0,<3.0.0)
Requires-Dist: pytest-dependency (>=0.6.0,<0.7.0)
Requires-Dist: python-logstash (>=0.4.6)
Requires-Dist: requests (>=2.23.0)
Requires-Dist: scrapy (>=2.6.0)
Requires-Dist: scrapy-puppeteer-client (>=0.0.6)
Requires-Dist: scrapy-splash (>=0.8.0)
Requires-Dist: sentry-sdk (>=2.13.0,<3.0.0)
Requires-Dist: twisted
Requires-Dist: uuid6 (>=2025.0.1,<2026.0.0)
Project-URL: Homepage, https://gitlab.at.ispras.ru/crawlers/crawler-utils
Project-URL: Repository, https://gitlab.at.ispras.ru/crawlers/crawler-utils
Description-Content-Type: text/markdown

# crawler-utils

Scrapy utils for Modis crawlers projects.

## MongoDB

Utilities related to MongoDB.

MongoDBPipeline - a pipeline for saving items in MongoDB.

Params:
* MONGODB_SERVER - address of the MongoDB server.
* MONGODB_PORT - port of the MongoDB server.
* MONGODB_DB - database where data is saved.
* MONGODB_USERNAME - username for authentication in the MONGODB_DB database.
* MONGODB_PWD - password for authentication.
* DEFAULT_MONGODB_COLLECTION - default collection where data is saved (default value is `test`).
* MONGODB_COLLECTION_KEY - key of the item which identifies the collection name (`MONGO_COLLECTION`)
 where the item is saved (default value is `collection`).
* MONGODB_UNIQUE_KEY - key of the item which uniquely identifies it.
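
A minimal `settings.py` sketch enabling the pipeline; the pipeline's import path
(`crawler_utils.MongoDBPipeline`) and priority are assumptions, adjust them to your project:

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    "crawler_utils.MongoDBPipeline": 300,  # assumed import path and priority
}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "crawler"
MONGODB_USERNAME = "scrapy"
MONGODB_PWD = "secret"
DEFAULT_MONGODB_COLLECTION = "test"    # used when the item has no collection key
MONGODB_COLLECTION_KEY = "collection"  # item[MONGODB_COLLECTION_KEY] selects the target collection
MONGODB_UNIQUE_KEY = "url"             # item key treated as the unique identifier
```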

## Kafka

Utilities related to Kafka.

KafkaPipeline - a pipeline for pushing items into Kafka.

The pipeline outputs data into a stream named `{RESOURCE_TAG}.{DATA_TYPE}`,
where `RESOURCE_TAG` is the tag of the resource from which the data was crawled and `DATA_TYPE` is
the type of crawled data: `data`, `post`, `comment`, `like`, `user`, `friend`, `share`, `member`,
`news`, `community`.

Params:
* KAFKA_ADDRESS - address of the Kafka broker.
* KAFKA_KEY - key of the item which is put into the Kafka record key.
* KAFKA_RESOURCE_TAG_KEY - key of the item which identifies its `RESOURCE_TAG` (default value is `platform`).
* KAFKA_DEFAULT_RESOURCE_TAG - default `RESOURCE_TAG` for crawled items without `KAFKA_RESOURCE_TAG_KEY` (default value is `crawler`).
* KAFKA_DATA_TYPE_KEY - key of the item which identifies its `DATA_TYPE` (default value is `type`).
* KAFKA_DEFAULT_DATA_TYPE - default `DATA_TYPE` for crawled items without `KAFKA_DATA_TYPE_KEY` (default value is `data`).
* KAFKA_COMPRESSION_TYPE - compression type for Kafka records, for example `gzip`.
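
A `settings.py` sketch; the pipeline's import path (`crawler_utils.KafkaPipeline`) and
priority are assumptions:

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    "crawler_utils.KafkaPipeline": 400,  # assumed import path and priority
}

KAFKA_ADDRESS = "localhost:9092"
KAFKA_KEY = "id"                        # item key copied into the Kafka record key
KAFKA_RESOURCE_TAG_KEY = "platform"
KAFKA_DEFAULT_RESOURCE_TAG = "crawler"
KAFKA_DATA_TYPE_KEY = "type"
KAFKA_DEFAULT_DATA_TYPE = "data"
KAFKA_COMPRESSION_TYPE = "gzip"

# With these settings, an item like {"platform": "vk", "type": "post", ...}
# would be pushed to the topic "vk.post".
```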

## OpenSearch

OpenSearchRequestsDownloaderMiddleware transforms each request-response pair into an item
and sends it to OpenSearch.

Settings:
```
OPENSEARCH_REQUESTS_SETTINGS - dict specifying the OpenSearch client connection:
    "hosts": Optional[str | list[str]] = "localhost:9200" - hosts with an OpenSearch endpoint,
    "timeout": Optional[int] = 60 - connection timeout,
    "http_auth": Optional[tuple[str, str]] = None - HTTP authentication, if needed,
    "port": Optional[int] = 443 - access port, if not specified in hosts,
    "use_ssl": Optional[bool] = True - whether to use SSL,
    "verify_certs": Optional[bool] = False - whether to verify certificates,
    "ssl_show_warn": Optional[bool] = False - whether to show SSL warnings,
    "ca_certs": Optional[str] = None - CA certificate path,
    "client_key": Optional[str] = None - client key path,
    "client_cert": Optional[str] = None - client certificate path,
    "buffer_length": Optional[int] = 500 - number of items in OpenSearchStorage's buffer.

OPENSEARCH_REQUESTS_INDEX: Optional[str] = "scrapy-job-requests" - index in OpenSearch.
```
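
A `settings.py` sketch; the middleware's import path and priority are assumptions,
the setting names and defaults come from the description above:

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    "crawler_utils.OpenSearchRequestsDownloaderMiddleware": 543,  # assumed path and priority
}

OPENSEARCH_REQUESTS_SETTINGS = {
    "hosts": ["opensearch.example.com"],  # placeholder host
    "port": 443,
    "http_auth": ("user", "password"),
    "use_ssl": True,
    "verify_certs": False,
    "buffer_length": 500,
}
OPENSEARCH_REQUESTS_INDEX = "scrapy-job-requests"
```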

See an example in examples/opensearch.

## CaptchaDetection

Captcha detection middleware for Scrapy crawlers.
It extracts the HTML code from the response (if present), sends it to the captcha detection
web server, and logs the result.

If you don't want a particular response to be checked for captcha, set the `dont_check_captcha`
meta key to `True`, as in the sketch below.
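
For instance, a spider can opt a single request out of detection like this (the spider
itself is illustrative):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # This response will bypass captcha detection.
        yield scrapy.Request(
            "https://example.com/",
            meta={"dont_check_captcha": True},
        )
```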

The middleware must be set up with higher precedence (lower number) than RetryMiddleware:
```python
DOWNLOADER_MIDDLEWARES = {
    "crawler_utils.CaptchaDetectionDownloaderMiddleware": 549,  # By default, RetryMiddleware has 550
}
```

Middleware settings:
* ENABLE_CAPTCHA_DETECTOR: bool = True - whether to enable captcha detection.
* CAPTCHA_SERVICE_URL: str - URL of the captcha detection web server, for example `http://127.0.0.1:8000`.
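
A corresponding `settings.py` fragment (the URL is a placeholder):

```python
# settings.py (sketch)
ENABLE_CAPTCHA_DETECTOR = True
CAPTCHA_SERVICE_URL = "http://127.0.0.1:8000"  # placeholder address of the detection service
```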

## Sentry logging

You may want to log exceptions raised during crawling to your Sentry.
Use `crawler_utils.sentry_logging.SentryLoggingExtension` for this.
Note that sentry_sdk wants to be loaded as early as possible.
To satisfy this condition, register the extension with a negative order:
```python
EXTENSIONS = {
    # Load SentryLogging extension before other extensions.
    "crawler_utils.sentry_logging.SentryLoggingExtension": -1,
}
```

Settings:

SENTRY_DSN: str - Sentry's DSN, where events are sent.\
SENTRY_SAMPLE_RATE: float = 1.0 - sample rate for error events; must be in the range from 0.0 to 1.0.\
SENTRY_TRACES_SAMPLE_RATE: float = 1.0 - the percentage chance a given transaction will be sent to Sentry.\
SENTRY_ATTACH_STACKTRACE: bool = False - whether to attach a stacktrace to error events.\
SENTRY_MAX_BREADCRUMBS: int = 10 - maximum number of breadcrumbs to capture with Sentry.
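
Putting it together, a `settings.py` sketch using the setting names above (the DSN is a
placeholder; the sample-rate values are illustrative):

```python
# settings.py (sketch)
SENTRY_DSN = "https://public_key@sentry.example.com/1"  # placeholder DSN
SENTRY_SAMPLE_RATE = 1.0
SENTRY_TRACES_SAMPLE_RATE = 0.1    # send 10% of transactions
SENTRY_ATTACH_STACKTRACE = False
SENTRY_MAX_BREADCRUMBS = 10
```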

For an example, check `examples/sentry_logging`.

