Metadata-Version: 2.4
Name: bhfutils
Version: 0.2.99
Summary: Utilities that are used by any spider of Behoof project
Home-page: https://behoof.app/
Author: Teplygin Vladimir
Author-email: vvteplygin@gmail.com
License: MIT
Keywords: behoof,scrapy-cluster,utilities
Description-Content-Type: text/x-rst
Requires-Dist: python-json-logger==0.1.8
Requires-Dist: redis>=4.0.2
Requires-Dist: kazoo>=2.8.0
Requires-Dist: mock>=4.0.3
Requires-Dist: playwright>=1.17.2
Requires-Dist: testfixtures>=6.18.3
Requires-Dist: ujson>=4.3.0
Requires-Dist: future>=0.18.2
Provides-Extra: test
Requires-Dist: mock>=2.0.0; extra == "test"
Requires-Dist: testfixtures>=4.13.5; extra == "test"
Provides-Extra: all
Requires-Dist: python-json-logger==0.1.8; extra == "all"
Requires-Dist: redis>=4.0.2; extra == "all"
Requires-Dist: kazoo>=2.8.0; extra == "all"
Requires-Dist: mock>=4.0.3; extra == "all"
Requires-Dist: playwright>=1.17.2; extra == "all"
Requires-Dist: testfixtures>=6.18.3; extra == "all"
Requires-Dist: ujson>=4.3.0; extra == "all"
Requires-Dist: future>=0.18.2; extra == "all"
Requires-Dist: mock>=2.0.0; extra == "all"
Requires-Dist: testfixtures>=4.13.5; extra == "all"
Provides-Extra: docs
Requires-Dist: sphinx; extra == "docs"
Requires-Dist: mock>=2.0.0; extra == "docs"
Requires-Dist: testfixtures>=4.13.5; extra == "docs"
Provides-Extra: lint
Requires-Dist: pep8; extra == "lint"
Requires-Dist: pyflakes; extra == "lint"
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: summary

******************************
Behoof Scrapy Cluster Template
******************************

Overview
--------

The ``bhfutils`` package is a collection of utilities shared by every spider in the Behoof project.

Requirements
------------

- Unix-based machine (Linux or OS X)
- Python 2.7 or 3.6

Installation
------------

Inside a virtualenv, run ``pip install -U bhfutils``. This installs the latest version of the Behoof Scrapy Cluster spider utilities. After that, you can use a special ``settings.py`` that is compatible with Scrapy Cluster (a template is provided in ``crawler/setting_template.py``).

Documentation
-------------

Full documentation for the ``bhfutils`` package does not exist. The module summaries below are the only reference.

custom_cookies.py
==================

The ``custom_cookies`` module is a custom cookies middleware. It passes the cookies a request requires along with that request, but does not persist them between calls.
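
The "pass along but do not persist" behaviour can be illustrated without Scrapy. The following is only a minimal sketch of the idea, not the module's actual code: ``StatelessCookieJar`` and ``cookie_header`` are hypothetical names, and the real middleware plugs into Scrapy rather than building headers by hand.

```python
from http.cookies import SimpleCookie


class StatelessCookieJar:
    """Hypothetical helper: builds a Cookie header from per-request cookies only."""

    def cookie_header(self, request_cookies):
        # A fresh jar is created on every call, so a cookie supplied for
        # one request can never leak into the next one.
        jar = SimpleCookie()
        for name, value in request_cookies.items():
            jar[name] = value
        return "; ".join(f"{m.key}={m.value}" for m in jar.values())


helper = StatelessCookieJar()
first = helper.cookie_header({"session": "abc", "region": "ru"})
second = helper.cookie_header({"region": "us"})
print(first)   # session=abc; region=ru
print(second)  # region=us (the "session" cookie was not carried over)
```

Because the jar is local to each call, the second request only sees the cookies it was given explicitly.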

distributed_scheduler.py
========================

The ``distributed_scheduler`` module is a Scrapy request scheduler. It uses Redis throttled priority queues to moderate scrape requests to different domains within a distributed Scrapy cluster.
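
To make the term "throttled priority queue" concrete, here is an in-memory sketch of one such queue for a single domain. It is an assumption-laden illustration, not the scheduler's implementation: the real queues live in Redis, and the class name and the ``hits``/``window`` parameters are hypothetical.

```python
import heapq
import itertools
from collections import deque


class ThrottledPriorityQueue:
    """In-memory stand-in for one domain's throttled queue (hypothetical)."""

    def __init__(self, hits, window):
        self.hits = hits        # maximum pops allowed per window
        self.window = window    # window length in seconds
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order
        self._pops = deque()    # timestamps of recent successful pops

    def push(self, url, priority):
        # heapq is a min-heap, so the priority is negated to pop the
        # highest priority first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))

    def pop(self, now):
        # Forget pops that have fallen out of the sliding window.
        while self._pops and now - self._pops[0] >= self.window:
            self._pops.popleft()
        if not self._heap or len(self._pops) >= self.hits:
            return None  # queue is empty, or the domain is throttled
        self._pops.append(now)
        return heapq.heappop(self._heap)[2]


queue = ThrottledPriorityQueue(hits=2, window=10)
queue.push("https://example.com/a", priority=1)
queue.push("https://example.com/b", priority=5)
queue.push("https://example.com/c", priority=3)

popped = [queue.pop(now=t) for t in (0, 1, 2, 10)]
print(popped)
```

With two hits allowed per ten seconds, the third call is refused even though a request is waiting, and the lowest-priority request is only released once the first pop has left the window.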

redis_domain_max_page_filter.py
===============================

The ``redis_domain_max_page_filter`` module is a Redis-based max page filter. The filter is applied per domain and bounds the maximum number of pages crawled for any particular domain.
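
The per-domain bound can be sketched with an in-memory counter. This is an illustration under stated assumptions rather than the module's code: the counter stands in for Redis, the domain is taken to be the URL's network location, and ``DomainMaxPageFilter`` and ``max_pages`` are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlparse


class DomainMaxPageFilter:
    """Hypothetical in-memory version of a per-domain page cap."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self._pages = Counter()  # pages accepted so far, per domain

    def request_seen(self, url):
        """Return True when the request should be dropped."""
        domain = urlparse(url).netloc  # assumption: domain == netloc
        if self._pages[domain] >= self.max_pages:
            return True
        self._pages[domain] += 1
        return False


page_filter = DomainMaxPageFilter(max_pages=2)
decisions = [
    page_filter.request_seen("https://shop.example/1"),
    page_filter.request_seen("https://shop.example/2"),
    page_filter.request_seen("https://shop.example/3"),
    page_filter.request_seen("https://other.example/1"),
]
print(decisions)  # [False, False, True, False]
```

The third request to ``shop.example`` is dropped, while ``other.example`` still has its own budget.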

redis_dupefilter.py
===================

The ``redis_dupefilter`` module is a Redis-based request duplication filter.
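
A duplication filter generally reduces each request to a fingerprint and remembers the fingerprints it has seen. The sketch below shows that pattern with a Python set in place of Redis. The fingerprint recipe (SHA-1 over method and URL) and the name ``SetDupeFilter`` are assumptions made for illustration, not the module's actual scheme.

```python
import hashlib


class SetDupeFilter:
    """Hypothetical in-memory duplicate filter; a Redis set could play the same role."""

    def __init__(self):
        self._fingerprints = set()

    @staticmethod
    def fingerprint(method, url):
        # Assumed recipe: hash the upper-cased method together with the URL.
        return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()

    def request_seen(self, method, url):
        """Return True if an identical request was already recorded."""
        fp = self.fingerprint(method, url)
        if fp in self._fingerprints:
            return True
        self._fingerprints.add(fp)
        return False


dupefilter = SetDupeFilter()
results = [
    dupefilter.request_seen("GET", "https://example.com/item/1"),
    dupefilter.request_seen("get", "https://example.com/item/1"),
    dupefilter.request_seen("POST", "https://example.com/item/1"),
]
print(results)  # [False, True, False]
```

Keeping the fingerprints in a shared store is what would let several spider processes agree on which requests have already been scheduled.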

redis_global_page_per_domain_filter.py
======================================

The ``redis_global_page_per_domain_filter`` module is a Redis-based request count filter. When this filter is enabled, every crawl job is subject to ``GLOBAL_PAGE_PER_DOMAIN_LIMIT``, a hard limit on the maximum number of pages it may crawl for each individual spiderid+domain+crawlid combination.
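
What distinguishes this filter is its key: the count is kept per spiderid+domain+crawlid combination rather than per domain alone. A minimal sketch of that keying follows, assuming a colon-separated key and an in-memory dictionary instead of Redis. The key format, the class name, and the limit value of 2 are all illustrative assumptions.

```python
from collections import defaultdict

GLOBAL_PAGE_PER_DOMAIN_LIMIT = 2  # illustrative value, not a package default


class GlobalPagePerDomainFilter:
    """Hypothetical in-memory version of the global hard limit."""

    def __init__(self, limit=GLOBAL_PAGE_PER_DOMAIN_LIMIT):
        self.limit = limit
        self._counts = defaultdict(int)

    def request_seen(self, spiderid, domain, crawlid):
        """Return True when the combination has exhausted its page budget."""
        key = f"{spiderid}:{domain}:{crawlid}"  # assumed key layout
        if self._counts[key] >= self.limit:
            return True
        self._counts[key] += 1
        return False


global_filter = GlobalPagePerDomainFilter()
outcome = [
    global_filter.request_seen("link", "example.com", "crawl-1"),
    global_filter.request_seen("link", "example.com", "crawl-1"),
    global_filter.request_seen("link", "example.com", "crawl-1"),
    global_filter.request_seen("link", "example.com", "crawl-2"),
]
print(outcome)  # [False, False, True, False]
```

A second crawl of the same domain by the same spider starts with a fresh budget because its ``crawlid`` differs.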
