scrapy_redis package

Submodules

scrapy_redis.connection module

scrapy_redis.connection.from_settings(settings)

Returns a redis client instance from a given Scrapy settings object.

This function uses get_redis to instantiate the client and uses the defaults.REDIS_PARAMS global as default values for the parameters. You can override them using the REDIS_PARAMS setting.

Parameters:
  • settings (Settings) – A scrapy settings object. See the supported settings below.

  • REDIS_URL (str, optional) – Server connection URL.

  • REDIS_HOST (str, optional) – Server host.

  • REDIS_PORT (int, optional) – Server port.

  • REDIS_DB (int, optional) – Server database.

  • REDIS_ENCODING (str, optional) – Data encoding.

  • REDIS_PARAMS (dict, optional) – Additional client parameters.

Python 3 only:

  • REDIS_DECODE_RESPONSES (bool, optional) – Sets the decode_responses keyword argument in the Redis class constructor.

Returns:

Redis client instance.

Return type:

server
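
For illustration, a minimal sketch of configuring the client through Scrapy settings (the connection URL is a placeholder, not part of this API):

from scrapy.settings import Settings

from scrapy_redis.connection import from_settings

# Placeholder URL; point this at your own Redis server.
settings = Settings({"REDIS_URL": "redis://localhost:6379/0"})

server = from_settings(settings)
server.ping()  # raises a connection error if the server is unreachable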

scrapy_redis.connection.get_redis(**kwargs)[source]

Returns a redis client instance.

Parameters:
  • redis_cls (class, optional) – Defaults to redis.StrictRedis.

  • url (str, optional) – If given, redis_cls.from_url is used to instantiate the class.

  • **kwargs – Extra parameters to be passed to the redis_cls class.

Returns:

Redis client instance.

Return type:

server
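
A short sketch of both ways to call it (host and URL values are placeholders):

from scrapy_redis.connection import get_redis

# Via URL: redis_cls.from_url is used under the hood.
server = get_redis(url="redis://localhost:6379/0")

# Via keyword arguments forwarded to the client class.
server = get_redis(host="localhost", port=6379, db=0)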

scrapy_redis.connection.get_redis_from_settings(settings)[source]

Returns a redis client instance from a given Scrapy settings object.

This function uses get_redis to instantiate the client and uses the defaults.REDIS_PARAMS global as default values for the parameters. You can override them using the REDIS_PARAMS setting.

Parameters:
  • settings (Settings) – A scrapy settings object. See the supported settings below.

  • REDIS_URL (str, optional) – Server connection URL.

  • REDIS_HOST (str, optional) – Server host.

  • REDIS_PORT (int, optional) – Server port.

  • REDIS_DB (int, optional) – Server database.

  • REDIS_ENCODING (str, optional) – Data encoding.

  • REDIS_PARAMS (dict, optional) – Additional client parameters.

Python 3 only:

  • REDIS_DECODE_RESPONSES (bool, optional) – Sets the decode_responses keyword argument in the Redis class constructor.

Returns:

Redis client instance.

Return type:

server

scrapy_redis.dupefilter module

class scrapy_redis.dupefilter.RFPDupeFilter(server, key, debug=False)[source]

Bases: BaseDupeFilter

Redis-based request duplicates filter.

This class can also be used with Scrapy’s default scheduler.

clear()[source]

Clears fingerprints data.

close(reason='')[source]

Delete data on close. Called by Scrapy’s scheduler.

Parameters:

reason (str, optional)

classmethod from_crawler(crawler)[source]

Returns instance from crawler.

Parameters:

crawler (scrapy.crawler.Crawler)

Returns:

Instance of RFPDupeFilter.

Return type:

RFPDupeFilter

classmethod from_settings(settings)[source]

Returns an instance from given settings.

By default, this uses the key dupefilter:<timestamp>. When using the scrapy_redis.scheduler.Scheduler class, this method is not used, as the scheduler needs to pass the spider name in the key.

Parameters:

settings (scrapy.settings.Settings)

Returns:

A RFPDupeFilter instance.

Return type:

RFPDupeFilter

classmethod from_spider(spider)[source]
log(request, spider)[source]

Logs given request.

Parameters:
  • request (scrapy.http.Request)

  • spider (scrapy.spiders.Spider)

logger = <Logger scrapy_redis.dupefilter (WARNING)>
request_fingerprint(request)[source]

Returns a fingerprint for a given request.

Parameters:

request (scrapy.http.Request)

Return type:

str

request_seen(request)[source]

Returns True if request was already seen.

Parameters:

request (scrapy.http.Request)

Return type:

bool
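
A minimal usage sketch, assuming a reachable Redis server (the URL and key name are illustrative):

from scrapy.http import Request

from scrapy_redis.connection import get_redis
from scrapy_redis.dupefilter import RFPDupeFilter

server = get_redis(url="redis://localhost:6379/0")  # placeholder URL
df = RFPDupeFilter(server, key="myspider:dupefilter")

request = Request("https://example.com")
df.request_seen(request)  # False: first sighting
df.request_seen(request)  # True: fingerprint already stored in Redis
df.clear()                # remove all stored fingerprints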

scrapy_redis.pipelines module

class scrapy_redis.pipelines.RedisPipeline(server, key='%(spider)s:items', serialize_func=<bound method JSONEncoder.encode of <scrapy.utils.serialize.ScrapyJSONEncoder object>>)[source]

Bases: object

Pushes serialized items into a redis list/queue.

Settings

REDIS_ITEMS_KEY (str)

Redis key where to store items.

REDIS_ITEMS_SERIALIZER (str)

Object path to serializer function.

classmethod from_crawler(crawler)[source]
classmethod from_settings(settings)[source]
item_key(item, spider)[source]

Returns redis key based on given spider.

Override this function to use a different key depending on the item and/or spider.

process_item(item, spider)[source]
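
As a sketch, the pipeline is enabled through ITEM_PIPELINES, and item_key can be overridden to shard items; the subclass name and key scheme below are illustrative:

from scrapy_redis.pipelines import RedisPipeline

# In settings.py:
# ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}

class PerTypeRedisPipeline(RedisPipeline):
    def item_key(self, item, spider):
        # Store items in one list per spider and item type,
        # e.g. "myspider:items:ProductItem".
        return f"{spider.name}:items:{type(item).__name__}"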

scrapy_redis.queue module

class scrapy_redis.queue.Base(server, spider, key, serializer=None)[source]

Bases: object

Per-spider base queue class

clear()[source]

Clear queue/stack

pop(timeout=0)[source]

Pop a request

push(request)[source]

Push a request

class scrapy_redis.queue.FifoQueue(server, spider, key, serializer=None)[source]

Bases: Base

Per-spider FIFO queue

pop(timeout=0)[source]

Pop a request

push(request)[source]

Push a request

class scrapy_redis.queue.LifoQueue(server, spider, key, serializer=None)[source]

Bases: Base

Per-spider LIFO queue.

pop(timeout=0)[source]

Pop a request

push(request)[source]

Push a request

class scrapy_redis.queue.PriorityQueue(server, spider, key, serializer=None)[source]

Bases: Base

Per-spider priority queue abstraction using redis’ sorted set

pop(timeout=0)[source]

Pop a request. The timeout argument is not supported in this queue class.

push(request)[source]

Push a request

scrapy_redis.queue.SpiderPriorityQueue

alias of PriorityQueue

scrapy_redis.queue.SpiderQueue

alias of FifoQueue

scrapy_redis.queue.SpiderStack

alias of LifoQueue
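
These queue classes are normally selected through the scheduler configuration rather than instantiated directly; a settings sketch:

# settings.py (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Swap the default PriorityQueue for a plain FIFO queue:
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.FifoQueue"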

scrapy_redis.scheduler module

class scrapy_redis.scheduler.Scheduler(server, persist=False, flush_on_start=False, queue_key='%(spider)s:requests', queue_cls='scrapy_redis.queue.PriorityQueue', dupefilter=None, dupefilter_key='%(spider)s:dupefilter', dupefilter_cls='scrapy_redis.dupefilter.RFPDupeFilter', idle_before_close=0, serializer=None)[source]

Bases: object

Redis-based scheduler

Settings

SCHEDULER_PERSIST (bool, default: False)

Whether to persist or clear the redis queue.

SCHEDULER_FLUSH_ON_START (bool, default: False)

Whether to flush the redis queue on start.

SCHEDULER_IDLE_BEFORE_CLOSE (int, default: 0)

How many seconds to wait before closing if no message is received.

SCHEDULER_QUEUE_KEY (str)

Scheduler redis key.

SCHEDULER_QUEUE_CLASS (str)

Scheduler queue class.

SCHEDULER_DUPEFILTER_KEY (str)

Scheduler dupefilter redis key.

SCHEDULER_DUPEFILTER_CLASS (str)

Scheduler dupefilter class.

SCHEDULER_SERIALIZER (str)

Scheduler serializer.

Scheduler serializer.

close(reason)[source]
enqueue_request(request)[source]
flush()[source]
classmethod from_crawler(crawler)[source]
classmethod from_settings(settings)[source]
has_pending_requests()[source]
next_request()[source]
open(spider)[source]
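
A settings sketch wiring the scheduler together with the dupefilter (values are illustrative):

# settings.py (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

SCHEDULER_PERSIST = True          # keep queue and dupefilter between runs
SCHEDULER_FLUSH_ON_START = False  # do not drop pending requests on start
SCHEDULER_IDLE_BEFORE_CLOSE = 10  # wait up to 10 seconds for new messages

REDIS_URL = "redis://localhost:6379/0"  # placeholder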

scrapy_redis.spiders module

class scrapy_redis.spiders.RedisCrawlSpider(*args: Any, **kwargs: Any)[source]

Bases: RedisMixin, CrawlSpider

Spider that reads URLs from a redis queue when idle.

redis_key

Redis key to fetch start URLs from.

Type:

str (default: REDIS_START_URLS_KEY)

redis_batch_size

Number of messages to fetch from redis on each attempt.

Type:

int (default: CONCURRENT_REQUESTS)

redis_encoding

Encoding to use when decoding messages from redis queue.

Type:

str (default: REDIS_ENCODING)

Settings
REDIS_START_URLS_KEY

Default Redis key to fetch start URLs from.

Type:

str (default: “<spider.name>:start_urls”)

REDIS_START_URLS_BATCH_SIZE

Default number of messages to fetch from redis on each attempt.

Type:

int (deprecated by CONCURRENT_REQUESTS)

REDIS_START_URLS_AS_SET

Use SET operations to retrieve messages from the redis queue.

Type:

bool (default: True)

REDIS_ENCODING

Default encoding to use when decoding messages from redis queue.

Type:

str (default: “utf-8”)

classmethod from_crawler(crawler, *args, **kwargs)[source]
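
A usage sketch (spider name, key and rule are illustrative):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

from scrapy_redis.spiders import RedisCrawlSpider

class MyCrawlSpider(RedisCrawlSpider):
    name = "mycrawler"
    redis_key = "mycrawler:start_urls"  # feed start URLs into this key

    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
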
class scrapy_redis.spiders.RedisMixin[source]

Bases: object

Mixin class to implement reading urls from a redis queue.

make_request_from_data(data)[source]

Returns a Request instance for data coming from Redis.

Override this function to support JSON data that contains url, meta and other optional parameters; meta is a nested JSON object containing sub-data.

After parsing the data you can, for example, send a FormRequest with the url, meta, method and additional formdata.

For example:

{
    "url": "https://example.com",
    "meta": {
        "job-id": "123xsd",
        "start-date": "dd/mm/yy"
    },
    "url_cookie_key": "fertxsas",
    "method": "POST"
}

If url is empty, this method returns [], so you should verify that the data contains a url. If method is missing, the request defaults to ‘GET’. If meta is missing, the request’s meta defaults to an empty dictionary.

In the spider, this data can be accessed through the response via request.url, request.meta, request.cookies and request.method.

Parameters:

data (bytes) – Message from redis.
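
A sketch of such an override, consuming the JSON shape shown above (the class name is hypothetical):

import json

from scrapy.http import FormRequest

from scrapy_redis.spiders import RedisSpider

class JsonRedisSpider(RedisSpider):
    name = "jsonspider"

    def make_request_from_data(self, data):
        # data arrives as bytes; decode with the configured encoding.
        payload = json.loads(data.decode(self.redis_encoding or "utf-8"))
        url = payload.get("url")
        if not url:
            return []  # mirror the documented behaviour for a missing url
        return FormRequest(
            url,
            method=payload.get("method", "GET"),
            meta=payload.get("meta", {}),
            dont_filter=True,
        )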

max_idle_time = None
next_requests()[source]

Returns a request to be scheduled, or None.

pop_list_queue(redis_key, batch_size)[source]
pop_priority_queue(redis_key, batch_size)[source]
redis_batch_size = None
redis_encoding = None
redis_key = None
schedule_next_requests()[source]

Schedules a request if available

server = None
setup_redis(crawler=None)[source]

Setup redis connection and idle signal.

This should be called after the spider has set its crawler object.

spider_idle()[source]

Schedules a request if available; otherwise waits, and closes the spider once the idle time exceeds MAX_IDLE_TIME_BEFORE_CLOSE. MAX_IDLE_TIME_BEFORE_CLOSE does not affect SCHEDULER_IDLE_BEFORE_CLOSE.

spider_idle_start_time = 1720301839
start_requests()[source]

Returns a batch of start requests from redis.

class scrapy_redis.spiders.RedisSpider(*args: Any, **kwargs: Any)[source]

Bases: RedisMixin, Spider

Spider that reads URLs from a redis queue when idle.

redis_key

Redis key to fetch start URLs from.

Type:

str (default: REDIS_START_URLS_KEY)

redis_batch_size

Number of messages to fetch from redis on each attempt.

Type:

int (default: CONCURRENT_REQUESTS)

redis_encoding

Encoding to use when decoding messages from redis queue.

Type:

str (default: REDIS_ENCODING)

Settings
REDIS_START_URLS_KEY

Default Redis key to fetch start URLs from.

Type:

str (default: “<spider.name>:start_urls”)

REDIS_START_URLS_BATCH_SIZE

Default number of messages to fetch from redis on each attempt.

Type:

int (deprecated by CONCURRENT_REQUESTS)

REDIS_START_URLS_AS_SET

Use SET operations to retrieve messages from the redis queue. If False, the messages are retrieved using the LPOP command.

Type:

bool (default: False)

REDIS_ENCODING

Default encoding to use when decoding messages from redis queue.

Type:

str (default: “utf-8”)

classmethod from_crawler(crawler, *args, **kwargs)[source]
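
A minimal usage sketch (spider name and key are illustrative):

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    redis_key = "myspider:start_urls"  # list the spider pops URLs from

    def parse(self, response):
        yield {"url": response.url}

Start URLs can then be queued from any Redis client, for example: redis-cli lpush myspider:start_urls https://example.com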

Module contents