Design

Sneakpeek has 6 core components:

  • Scrapers storage - stores the list of scrapers and their metadata.

  • Jobs queue - populated by the scheduler or the user, and consumed by the workers.

  • Lease storage - stores the lease (global lock) for the scheduler, to make sure there’s only one active scheduler at all times.

  • Scheduler - schedules the scrapers defined in the scrapers storage.

  • Worker - consumes the jobs queue and executes the scraper logic.

  • API - provides a JsonRPC API for interacting with the system.

All of the components are run by the SneakpeekServer.

Scrapers Storage

Storage must implement the abstract class sneakpeek.lib.storage.base.ScrapersStorage. Some of its methods are mandatory to implement, while others are optional.

Currently there are two storage implementations:

  • InMemoryScrapersStorage - in-memory storage. Should only be used in a development environment, or when the list of scrapers is static and won’t change.

  • RedisScrapersStorage - Redis-backed storage.

Jobs queue

The jobs queue must implement the abstract class sneakpeek.lib.storage.base.ScraperJobsStorage.

Currently there are two storage implementations:

  • InMemoryScraperJobsStorage - in-memory storage. Should only be used in a development environment.

  • RedisScraperJobsStorage - Redis-backed storage.

Lease storage

Lease storage is used by the scheduler to ensure that at any point in time there is at most one active scheduler instance that can enqueue scraper jobs. This prevents the same scraper from being scheduled and executed concurrently.

Lease storage must implement the abstract class sneakpeek.lib.storage.base.LeaseStorage.

Currently there are two storage implementations:

  • InMemoryLeaseStorage - in-memory storage. Should only be used in a development environment.

  • RedisLeaseStorage - Redis-backed storage.
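A lease is essentially a lock with an expiry: a scheduler instance acquires it if it is free (or expired), and keeps renewing it while alive. Here is a minimal in-memory sketch of that acquire-if-free-or-expired logic; the class and method names are assumptions for this sketch, not the library's real signatures. A Redis-backed variant would get the same semantics atomically with `SET key owner NX PX <ttl>`.

```python
import time


class InMemoryLeaseStorageSketch:
    """Single-holder lease with expiry: at most one owner at any point in time."""

    def __init__(self) -> None:
        self._owner: str | None = None
        self._expires_at: float = 0.0

    def maybe_acquire_lease(self, owner_id: str, ttl_seconds: float) -> bool:
        now = time.monotonic()
        # The lease is grantable if it is unowned, has expired, or is already
        # held by the caller (so the active scheduler can extend its own lease).
        if self._owner is None or now >= self._expires_at or self._owner == owner_id:
            self._owner = owner_id
            self._expires_at = now + ttl_seconds
            return True
        return False
```

If the active scheduler crashes and stops renewing, the lease expires and a standby instance can take over.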

Scheduler

Scheduler is responsible for:

  • scheduling scrapers based on their configuration

  • finding scraper jobs that haven’t sent a heartbeat for a while and marking them as dead

  • cleaning up the jobs queue by removing old historical scraper jobs

  • exporting metrics on the number of pending jobs in the queue

As of now there is only one implementation, Scheduler, which uses APScheduler under the hood.
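To illustrate the first responsibility, here is a toy single-pass sketch of interval-based scheduling. The real Scheduler delegates the timing to APScheduler and works with full scraper configurations, so the names below (ScheduledScraper, scheduler_tick) are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class ScheduledScraper:
    id: int
    interval_seconds: float
    next_run_at: float = 0.0  # monotonic timestamp of the next due run


def scheduler_tick(scrapers: list[ScheduledScraper], now: float) -> list[int]:
    """One scheduling pass: return ids of scrapers due to run and advance their next-run time.

    In the real system, "due" scrapers would have a job enqueued in the jobs
    queue instead of just being returned.
    """
    due = []
    for scraper in scrapers:
        if now >= scraper.next_run_at:
            due.append(scraper.id)
            scraper.next_run_at = now + scraper.interval_seconds
    return due
```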

Worker

The worker constantly tries to dequeue a job and, when one is available, executes the scraper logic. As of now there is only one implementation, Worker.
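The dequeue-and-execute loop can be sketched as follows. The tuple-based job format, the handler registry, and the heartbeat dict are simplifications invented for this example — the real Worker consumes jobs from the jobs storage and reports heartbeats through it so the scheduler can detect dead jobs:

```python
import queue
import time


def worker_loop(
    jobs: "queue.Queue",
    handlers: dict,
    heartbeats: dict,
    poll_timeout: float = 0.1,
) -> None:
    """Dequeue jobs and run the matching scraper handler, recording a heartbeat per job.

    For the sake of the example this exits once the queue stays empty for
    poll_timeout seconds; a real worker would keep polling forever.
    """
    while True:
        try:
            job_id, scraper_name = jobs.get(timeout=poll_timeout)
        except queue.Empty:
            return
        # Heartbeats let the scheduler detect jobs whose worker died mid-run.
        heartbeats[job_id] = time.monotonic()
        handlers[scraper_name](job_id)
```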

API

Sneakpeek implements:

  • JsonRPC API to programmatically interact with the system (available at /api/v1/jsonrpc). It exposes the following methods:

      • CRUD methods to add, modify and delete scrapers

      • Get the list of a scraper’s jobs

      • Enqueue scraper jobs

  • UI that allows you to interact with the system

  • Swagger documentation (available at /api)

  • Copy of this documentation (available at /docs)
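Assuming the endpoint speaks standard JSON-RPC 2.0 (the usual convention for such APIs), a request posted to /api/v1/jsonrpc would be shaped like the sketch below. The method name "get_scrapers" is illustrative only — check the Swagger documentation at /api for the actual method names and parameters.

```python
import json

# A JSON-RPC 2.0 request body; "id" correlates the response with this request.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "get_scrapers",  # hypothetical method name
    "params": {},
}

# This JSON string would be POSTed to /api/v1/jsonrpc.
body = json.dumps(request)
```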