scrapy_redis package¶
Submodules¶
scrapy_redis.connection module¶
- scrapy_redis.connection.from_settings(settings)¶
Returns a redis client instance from given Scrapy settings object.
This function uses get_redis to instantiate the client and uses the defaults.REDIS_PARAMS global as default values for the parameters. You can override them using the REDIS_PARAMS setting.
- Parameters:
settings (Settings) – A scrapy settings object. See the supported settings below.
REDIS_URL (str, optional) – Server connection URL.
REDIS_HOST (str, optional) – Server host.
REDIS_PORT (str, optional) – Server port.
REDIS_DB (int, optional) – Server database.
REDIS_ENCODING (str, optional) – Data encoding.
REDIS_PARAMS (dict, optional) – Additional client parameters.
Python 3 only:
REDIS_DECODE_RESPONSES (bool, optional) – Sets the decode_responses keyword argument in the Redis client class constructor.
- Returns:
Redis client instance.
- Return type:
server
- scrapy_redis.connection.get_redis(**kwargs)[source]¶
Returns a redis client instance.
- Parameters:
redis_cls (class, optional) – Defaults to redis.StrictRedis.
url (str, optional) – If given, redis_cls.from_url is used to instantiate the class.
**kwargs – Extra parameters to be passed to the redis_cls class.
- Returns:
Redis client instance.
- Return type:
server
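The choice between redis_cls.from_url and the plain constructor can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's source: FakeRedis is a hypothetical stand-in for redis.StrictRedis so the snippet runs without a Redis server, and get_redis_sketch is an invented name.

```python
class FakeRedis:
    """Hypothetical stand-in for redis.StrictRedis; it only records its arguments."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.url = None

    @classmethod
    def from_url(cls, url, **kwargs):
        client = cls(**kwargs)
        client.url = url
        return client


def get_redis_sketch(**kwargs):
    """Build a client from a URL if one is given, else from keyword arguments."""
    redis_cls = kwargs.pop("redis_cls", FakeRedis)
    url = kwargs.pop("url", None)
    if url:
        return redis_cls.from_url(url, **kwargs)
    return redis_cls(**kwargs)


by_url = get_redis_sketch(url="redis://localhost:6379/0", encoding="utf-8")
by_kwargs = get_redis_sketch(host="localhost", port=6379)
```

In both cases the remaining keyword arguments are forwarded unchanged to the client class.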
- scrapy_redis.connection.get_redis_from_settings(settings)[source]¶
Returns a redis client instance from given Scrapy settings object.
This function uses get_redis to instantiate the client and uses the defaults.REDIS_PARAMS global as default values for the parameters. You can override them using the REDIS_PARAMS setting.
- Parameters:
settings (Settings) – A scrapy settings object. See the supported settings below.
REDIS_URL (str, optional) – Server connection URL.
REDIS_HOST (str, optional) – Server host.
REDIS_PORT (str, optional) – Server port.
REDIS_DB (int, optional) – Server database.
REDIS_ENCODING (str, optional) – Data encoding.
REDIS_PARAMS (dict, optional) – Additional client parameters.
Python 3 only:
REDIS_DECODE_RESPONSES (bool, optional) – Sets the decode_responses keyword argument in the Redis client class constructor.
- Returns:
Redis client instance.
- Return type:
server
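The way defaults and settings combine into client parameters might look like the sketch below. It uses a plain dict in place of a Scrapy Settings object. The default values, the setting-to-parameter mapping, and the rule that individual REDIS_* settings are applied last are all assumptions made for illustration; build_client_params is a hypothetical helper.

```python
# Hypothetical defaults; the real values live in defaults.REDIS_PARAMS.
DEFAULT_PARAMS = {"socket_timeout": 30, "retry_on_timeout": True}

# Assumed mapping from individual settings to client keyword arguments.
SETTING_TO_PARAM = {
    "REDIS_URL": "url",
    "REDIS_HOST": "host",
    "REDIS_PORT": "port",
    "REDIS_DB": "db",
    "REDIS_ENCODING": "encoding",
}


def build_client_params(settings):
    """Merge defaults, the REDIS_PARAMS setting, and individual REDIS_* settings."""
    params = dict(DEFAULT_PARAMS)
    # REDIS_PARAMS overrides the module-level defaults.
    params.update(settings.get("REDIS_PARAMS", {}))
    # Assumption: individual settings are applied last and skip empty values.
    for setting_name, param_name in SETTING_TO_PARAM.items():
        value = settings.get(setting_name)
        if value:
            params[param_name] = value
    return params


params = build_client_params({
    "REDIS_HOST": "localhost",
    "REDIS_PORT": 6379,
    "REDIS_PARAMS": {"socket_timeout": 5},
})
```

Here socket_timeout comes from REDIS_PARAMS, retry_on_timeout from the defaults, and host and port from the individual settings.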
scrapy_redis.dupefilter module¶
- class scrapy_redis.dupefilter.RFPDupeFilter(server, key, debug=False)[source]¶
Bases:
BaseDupeFilter
Redis-based request duplicates filter.
This class can also be used with Scrapy’s default scheduler.
- close(reason='')[source]¶
Delete data on close. Called by Scrapy’s scheduler.
- Parameters:
reason (str, optional)
- classmethod from_crawler(crawler)[source]¶
Returns instance from crawler.
- Parameters:
crawler (scrapy.crawler.Crawler)
- Returns:
Instance of RFPDupeFilter.
- Return type:
RFPDupeFilter
- classmethod from_settings(settings)[source]¶
Returns an instance from given settings.
By default, this uses the key dupefilter:<timestamp>. When using the scrapy_redis.scheduler.Scheduler class, this method is not used, as the scheduler needs to pass the spider name in the key.
- Parameters:
settings (scrapy.settings.Settings)
- Returns:
An RFPDupeFilter instance.
- Return type:
RFPDupeFilter
- log(request, spider)[source]¶
Logs given request.
- Parameters:
request (scrapy.http.Request)
spider (scrapy.spiders.Spider)
- logger = <Logger scrapy_redis.dupefilter (WARNING)>¶
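The idea behind a Redis-based duplicates filter can be sketched with a set per key: adding a fingerprint reports whether it was new. The snippet below is an illustration under assumptions, not the class's implementation. FakeRedisSets is a hypothetical in-memory stand-in for the Redis SADD command, and the fingerprint is deliberately simplified to method plus URL.

```python
import hashlib
import time


class FakeRedisSets:
    """Hypothetical in-memory stand-in for the Redis SADD command."""

    def __init__(self):
        self.sets = {}

    def sadd(self, key, member):
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0  # nothing added: the member was already present
        members.add(member)
        return 1


def fingerprint(method, url):
    """Simplified fingerprint (assumption: method and URL only)."""
    return hashlib.sha1((method + " " + url).encode("utf-8")).hexdigest()


def request_seen(server, key, method, url):
    """Return True if this request was already recorded under `key`."""
    added = server.sadd(key, fingerprint(method, url))
    return added == 0


server = FakeRedisSets()
key = "dupefilter:%d" % int(time.time())  # mirrors the dupefilter:<timestamp> default
first = request_seen(server, key, "GET", "https://example.com")
second = request_seen(server, key, "GET", "https://example.com")
```

Because the set lives in Redis rather than in process memory, several crawler processes sharing the same key would share one view of which requests have been seen.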
scrapy_redis.pipelines module¶
- class scrapy_redis.pipelines.RedisPipeline(server, key='%(spider)s:items', serialize_func=<bound method JSONEncoder.encode of <scrapy.utils.serialize.ScrapyJSONEncoder object>>)[source]¶
Bases:
object
Pushes serialized items into a Redis list/queue.
Settings¶
- REDIS_ITEMS_KEY : str
Redis key where to store items.
- REDIS_ITEMS_SERIALIZER : str
Object path to serializer function.
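A rough sketch of what pushing a serialized item involves is shown below. It is not the pipeline's code. FakeRedisLists is a hypothetical stand-in for a Redis list, json.dumps stands in for the default serializer shown in the class signature, and the use of a right push is an assumption.

```python
import json


class FakeRedisLists:
    """Hypothetical in-memory stand-in for the Redis RPUSH command."""

    def __init__(self):
        self.lists = {}

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])


def push_item(server, item, spider_name,
              key="%(spider)s:items", serialize_func=json.dumps):
    """Serialize `item` and append it to the spider's items list."""
    item_key = key % {"spider": spider_name}
    server.rpush(item_key, serialize_func(item))
    return item_key


server = FakeRedisLists()
used_key = push_item(server, {"title": "Example"}, spider_name="myspider")
```

The %(spider)s placeholder in the key is filled with the spider name, so each spider writes to its own list.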
scrapy_redis.queue module¶
- class scrapy_redis.queue.Base(server, spider, key, serializer=None)[source]¶
Bases:
object
Per-spider base queue class
- class scrapy_redis.queue.FifoQueue(server, spider, key, serializer=None)[source]¶
Bases:
Base
Per-spider FIFO queue
- class scrapy_redis.queue.LifoQueue(server, spider, key, serializer=None)[source]¶
Bases:
Base
Per-spider LIFO queue.
- class scrapy_redis.queue.PriorityQueue(server, spider, key, serializer=None)[source]¶
Bases:
Base
Per-spider priority queue abstraction using redis’ sorted set
- scrapy_redis.queue.SpiderPriorityQueue¶
alias of
PriorityQueue
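The three orderings can be illustrated with in-memory stand-ins. These classes are hypothetical sketches of the ordering semantics only; they do not talk to Redis, and the priority sketch breaks ties by insertion order, which is a choice made here rather than a documented behavior.

```python
import heapq
from collections import deque
from itertools import count


class FifoSketch:
    """First in, first out: push on the left, pop from the right."""

    def __init__(self):
        self._items = deque()

    def push(self, request):
        self._items.appendleft(request)

    def pop(self):
        return self._items.pop() if self._items else None


class LifoSketch(FifoSketch):
    """Last in, first out: push on the left, pop from the left."""

    def pop(self):
        return self._items.popleft() if self._items else None


class PrioritySketch:
    """Highest priority first, using a negated priority as the score."""

    def __init__(self):
        self._heap = []
        self._order = count()

    def push(self, request, priority=0):
        # The smallest score pops first, so negate the priority.
        heapq.heappush(self._heap, (-priority, next(self._order), request))

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


fifo, lifo, prio = FifoSketch(), LifoSketch(), PrioritySketch()
for url in ("a", "b", "c"):
    fifo.push(url)
    lifo.push(url)
prio.push("low", priority=1)
prio.push("high", priority=10)
```

With these pushes, the FIFO sketch pops "a" first, the LIFO sketch pops "c" first, and the priority sketch pops "high" first.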
scrapy_redis.scheduler module¶
- class scrapy_redis.scheduler.Scheduler(server, persist=False, flush_on_start=False, queue_key='%(spider)s:requests', queue_cls='scrapy_redis.queue.PriorityQueue', dupefilter=None, dupefilter_key='%(spider)s:dupefilter', dupefilter_cls='scrapy_redis.dupefilter.RFPDupeFilter', idle_before_close=0, serializer=None)[source]¶
Bases:
object
Redis-based scheduler
Settings¶
- SCHEDULER_PERSIST : bool (default: False)
Whether to persist or clear redis queue.
- SCHEDULER_FLUSH_ON_START : bool (default: False)
Whether to flush redis queue on start.
- SCHEDULER_IDLE_BEFORE_CLOSE : int (default: 0)
How many seconds to wait before closing if no message is received.
- SCHEDULER_QUEUE_KEY : str
Scheduler redis key.
- SCHEDULER_QUEUE_CLASS : str
Scheduler queue class.
- SCHEDULER_DUPEFILTER_KEY : str
Scheduler dupefilter redis key.
- SCHEDULER_DUPEFILTER_CLASS : str
Scheduler dupefilter class.
- SCHEDULER_SERIALIZER : str
Scheduler serializer.
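As an illustration, a project's settings.py might enable the scheduler and set these options as in the fragment below. The specific values are example choices, not recommendations; the key templates mirror the defaults in the class signature above.

```python
# settings.py fragment (example values only)

# Use the Redis-based scheduler instead of Scrapy's default one.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Keep the queue and dupefilter in Redis between runs.
SCHEDULER_PERSIST = True
SCHEDULER_FLUSH_ON_START = False

# Wait up to 10 seconds for a message before treating the queue as empty.
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Key templates; %(spider)s is replaced with the spider name.
SCHEDULER_QUEUE_KEY = "%(spider)s:requests"
SCHEDULER_DUPEFILTER_KEY = "%(spider)s:dupefilter"

# Classes are given as import paths.
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.FifoQueue"
SCHEDULER_DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
```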
scrapy_redis.spiders module¶
- class scrapy_redis.spiders.RedisCrawlSpider(*args: Any, **kwargs: Any)[source]¶
Bases:
RedisMixin, CrawlSpider
Spider that reads urls from redis queue when idle.
- redis_key¶
Redis key where to fetch start URLs from.
- Type:
str (default: REDIS_START_URLS_KEY)
- redis_batch_size¶
Number of messages to fetch from redis on each attempt.
- Type:
int (default: CONCURRENT_REQUESTS)
- redis_encoding¶
Encoding to use when decoding messages from redis queue.
- Type:
str (default: REDIS_ENCODING)
Settings¶
- REDIS_START_URLS_KEY¶
Default Redis key where to fetch start URLs from.
- Type:
str (default: “<spider.name>:start_urls”)
- REDIS_START_URLS_BATCH_SIZE¶
Default number of messages to fetch from redis on each attempt.
- Type:
int (deprecated by CONCURRENT_REQUESTS)
- REDIS_START_URLS_AS_SET¶
Use SET operations to retrieve messages from the redis queue. If False, the messages are retrieved using the LPOP command.
- Type:
bool (default: False)
- REDIS_ENCODING¶
Default encoding to use when decoding messages from redis queue.
- Type:
str (default: “utf-8”)
- class scrapy_redis.spiders.RedisMixin[source]¶
Bases:
object
Mixin class to implement reading urls from a redis queue.
- make_request_from_data(data)[source]¶
Returns a Request instance for data coming from Redis.
This method supports JSON data that contains url, meta, and other optional parameters. meta is a nested JSON object that contains sub-data.
After parsing the data, it sends a FormRequest built from url, meta, and any additional formdata and method.
For example:
{
  "url": "https://example.com",
  "meta": {
    "job-id": "123xsd",
    "start-date": "dd/mm/yy"
  },
  "url_cookie_key": "fertxsas",
  "method": "POST"
}
If url is empty, the method returns [], so verify the url in the data. method is optional: if it is empty, the request uses ‘GET’. meta is optional: if it is empty, the request uses an empty dictionary.
The spider can access this data through the response’s request: request.url, request.meta, request.cookies, request.method.
- Parameters:
data (bytes) – Message from redis.
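The handling of a JSON message can be sketched without Scrapy as below. This is an illustration of the rules stated above, not the method's source: parse_request_data is a hypothetical helper that returns a plain dict instead of a FormRequest, and treating a non-JSON message as a bare URL is an assumption.

```python
import json


def parse_request_data(data, encoding="utf-8"):
    """Decode a Redis message into request fields, or [] if it has no URL."""
    text = data.decode(encoding)
    try:
        payload = json.loads(text)
    except ValueError:
        payload = None
    if not isinstance(payload, dict):
        # Assumption: a message that is not a JSON object is a bare URL.
        payload = {"url": text}

    url = payload.pop("url", None)
    if not url:
        return []

    return {
        "url": url,
        "method": payload.pop("method", None) or "GET",
        "meta": payload.pop("meta", None) or {},
        # Whatever is left (for example url_cookie_key) becomes form data.
        "formdata": payload,
    }


message = (
    b'{"url": "https://example.com", "meta": {"job-id": "123xsd"}, '
    b'"url_cookie_key": "fertxsas", "method": "POST"}'
)
fields = parse_request_data(message)
```

A producer would push such a JSON string to the spider's Redis key; the empty-url case is why the message should be validated before it is pushed.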
- max_idle_time = None¶
- redis_batch_size = None¶
- redis_encoding = None¶
- redis_key = None¶
- server = None¶
- setup_redis(crawler=None)[source]¶
Setup redis connection and idle signal.
This should be called after the spider has set its crawler object.
- spider_idle()[source]¶
Schedules a request if one is available; otherwise waits, or closes the spider once the waiting time exceeds MAX_IDLE_TIME_BEFORE_CLOSE seconds. MAX_IDLE_TIME_BEFORE_CLOSE does not affect SCHEDULER_IDLE_BEFORE_CLOSE.
- spider_idle_start_time = 1720301839¶
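The idle-timeout check described for spider_idle can be sketched as a small pure function. should_close is a hypothetical helper, and treating a value of 0 or None as "no timeout" is an assumption made for this example.

```python
import time


def should_close(idle_start_time, max_idle_time, now=None):
    """Return True once the spider has been idle for more than max_idle_time seconds."""
    if not max_idle_time:
        # Assumption: 0 or None disables the idle timeout.
        return False
    now = time.time() if now is None else now
    return (now - idle_start_time) > max_idle_time


expired = should_close(idle_start_time=100, max_idle_time=10, now=111)
still_waiting = should_close(idle_start_time=100, max_idle_time=10, now=105)
```

Passing now explicitly keeps the check easy to test; in a running spider it would default to the current time.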
- class scrapy_redis.spiders.RedisSpider(*args: Any, **kwargs: Any)[source]¶
Bases:
RedisMixin, Spider
Spider that reads urls from redis queue when idle.
- redis_key¶
Redis key where to fetch start URLs from.
- Type:
str (default: REDIS_START_URLS_KEY)
- redis_batch_size¶
Number of messages to fetch from redis on each attempt.
- Type:
int (default: CONCURRENT_REQUESTS)
- redis_encoding¶
Encoding to use when decoding messages from redis queue.
- Type:
str (default: REDIS_ENCODING)
Settings¶
- REDIS_START_URLS_KEY¶
Default Redis key where to fetch start URLs from.
- Type:
str (default: “<spider.name>:start_urls”)
- REDIS_START_URLS_BATCH_SIZE¶
Default number of messages to fetch from redis on each attempt.
- Type:
int (deprecated by CONCURRENT_REQUESTS)
- REDIS_START_URLS_AS_SET¶
Use SET operations to retrieve messages from the redis queue. If False, the messages are retrieved using the LPOP command.
- Type:
bool (default: False)
- REDIS_ENCODING¶
Default encoding to use when decoding messages from redis queue.
- Type:
str (default: “utf-8”)
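How the batch size, the list-versus-set choice, and the encoding interact when start URLs are fetched can be sketched as follows. FakeRedis is a hypothetical in-memory stand-in that exposes only LPOP and SPOP, fetch_start_urls is an invented helper, and the key myspider:start_urls is an example name following the <spider.name>:start_urls pattern.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in exposing only LPOP and SPOP."""

    def __init__(self):
        self.lists = {}
        self.sets = {}

    def lpop(self, key):
        items = self.lists.get(key)
        return items.pop(0) if items else None

    def spop(self, key):
        members = self.sets.get(key)
        return members.pop() if members else None


def fetch_start_urls(server, key, batch_size, use_set=False, encoding="utf-8"):
    """Pop up to batch_size messages from `key` and decode them."""
    pop = server.spop if use_set else server.lpop
    urls = []
    while len(urls) < batch_size:
        data = pop(key)
        if data is None:
            break  # the queue is empty; return what was collected
        urls.append(data.decode(encoding))
    return urls


server = FakeRedis()
server.lists["myspider:start_urls"] = [
    b"https://example.com/1",
    b"https://example.com/2",
    b"https://example.com/3",
]
batch = fetch_start_urls(server, "myspider:start_urls", batch_size=2)
```

With a list, messages come out in insertion order; with a set, duplicates are collapsed by Redis and the pop order is arbitrary.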