Design
Sneakpeek has 6 core components:
- Scrapers storage - stores the list of scrapers and their metadata
- Jobs queue - populated by the scheduler or by a user and consumed by the workers
- Lease storage - stores the lease (global lock) for the scheduler, to make sure there is only 1 active scheduler at all times
- Scheduler - schedules scrapers defined in the scrapers storage
- Worker - consumes the jobs queue and executes scraper logic
- API - provides a JsonRPC API for interacting with the system
All of the components are run by the SneakpeekServer.
Scrapers Storage
Storage must implement the abstract class sneakpeek.lib.storage.base.ScrapersStorage.

The following methods are mandatory to implement:

- get_scrapers - get the list of all scrapers
- get_scraper - get a scraper by ID
- is_read_only - whether the storage allows modifications of the scrapers list and their metadata
The following methods are optional to implement:

- create_scraper - create a new scraper
- delete_scraper - delete a scraper by ID
- update_scraper - update an existing scraper
- maybe_get_scraper - get a scraper by ID if it exists
- search_scrapers - search scrapers using given filters
Currently there are 2 storage implementations:

- InMemoryScrapersStorage - in-memory storage. Should only be used in a development environment, or if the list of scrapers is static and won't be changed.
- RedisScrapersStorage - Redis storage.
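To make the contract concrete, here is a minimal in-memory sketch of the documented methods. The `Scraper` record and the synchronous signatures are assumptions for illustration (the real models and abstract class live in sneakpeek.lib and may differ, e.g. by being async); only the method names come from the docs above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Scraper:
    """Hypothetical scraper record; the real model is defined by Sneakpeek."""
    id: int
    name: str


class InMemoryScrapersStorageSketch:
    """Illustrative in-memory storage implementing the documented methods."""

    def __init__(self) -> None:
        self._scrapers: dict[int, Scraper] = {}

    def is_read_only(self) -> bool:
        # In-memory storage allows modifications.
        return False

    def get_scrapers(self) -> list[Scraper]:
        return list(self._scrapers.values())

    def get_scraper(self, scraper_id: int) -> Scraper:
        # Raises KeyError if the scraper does not exist.
        return self._scrapers[scraper_id]

    def maybe_get_scraper(self, scraper_id: int) -> Scraper | None:
        return self._scrapers.get(scraper_id)

    def create_scraper(self, scraper: Scraper) -> Scraper:
        self._scrapers[scraper.id] = scraper
        return scraper

    def delete_scraper(self, scraper_id: int) -> Scraper:
        return self._scrapers.pop(scraper_id)

    def search_scrapers(self, name_filter: str) -> list[Scraper]:
        # A simple substring filter stands in for the real filter set.
        return [s for s in self._scrapers.values() if name_filter in s.name]
```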
Jobs queue
The jobs queue must implement the abstract class sneakpeek.lib.storage.base.ScraperJobsStorage.
The following methods must be implemented:

- get_scraper_jobs - get scraper jobs by scraper ID
- add_scraper_job - add a new scraper job
- update_scraper_job - update an existing scraper job
- get_scraper_job - get an existing scraper job by scraper ID and scraper job ID
- dequeue_scraper_job - dequeue a scraper job from the queue with a given priority
- delete_old_scraper_jobs - delete old historical scraper jobs
- get_queue_len - get the number of pending scraper jobs in the queue with a given priority
Currently there are 2 storage implementations:

- InMemoryScraperJobsStorage - in-memory storage. Should only be used in a development environment.
- RedisScraperJobsStorage - Redis storage.
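The queue methods above revolve around per-priority FIFO queues. A minimal sketch, assuming a numeric priority key and plain-dict jobs (the real job model and priority semantics are defined by Sneakpeek, not here):

```python
from collections import deque


class InMemoryJobsQueueSketch:
    """Illustrative jobs queue with one FIFO queue per priority level."""

    def __init__(self) -> None:
        # priority -> FIFO queue of jobs
        self._queues: dict = {}

    def add_scraper_job(self, job: dict, priority: int = 0) -> None:
        self._queues.setdefault(priority, deque()).append(job)

    def dequeue_scraper_job(self, priority: int):
        # Returns None when there is nothing to do at this priority,
        # so a worker can poll without raising.
        queue = self._queues.get(priority)
        return queue.popleft() if queue else None

    def get_queue_len(self, priority: int) -> int:
        return len(self._queues.get(priority, ()))
```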
Lease storage
Lease storage is used by the scheduler to ensure that at any point in time there is no more than 1 active scheduler instance that can enqueue scraper jobs. This disallows concurrent execution of the same scraper.
Lease storage must implement the abstract class sneakpeek.lib.storage.base.LeaseStorage.

The following methods must be implemented:

- maybe_acquire_lease - try to acquire the lease (or global lock)
- release_lease - release an acquired lease
Currently there are 2 storage implementations:

- InMemoryLeaseStorage - in-memory storage. Should only be used in a development environment.
- RedisLeaseStorage - Redis storage.
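A lease is essentially a named lock with an owner and an expiry, so a crashed scheduler cannot hold it forever. The sketch below illustrates that idea; the parameter names and TTL handling are assumptions, not the real LeaseStorage signatures:

```python
import time


class InMemoryLeaseStorageSketch:
    """Illustrative lease (global lock) storage with expiry."""

    def __init__(self) -> None:
        # lease name -> (owner id, expiration timestamp)
        self._leases: dict = {}

    def maybe_acquire_lease(self, name: str, owner: str, ttl_seconds: float) -> bool:
        holder = self._leases.get(name)
        now = time.monotonic()
        # Acquire if the lease is free, has expired, or is already held
        # by this owner (renewal).
        if holder is None or holder[1] <= now or holder[0] == owner:
            self._leases[name] = (owner, now + ttl_seconds)
            return True
        return False

    def release_lease(self, name: str, owner: str) -> None:
        # Only the current holder may release the lease.
        holder = self._leases.get(name)
        if holder is not None and holder[0] == owner:
            del self._leases[name]
```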
Scheduler
Scheduler is responsible for:
- scheduling scrapers based on their configuration
- finding scraper jobs that haven't sent a heartbeat for a while and marking them as dead
- cleaning up old historical scraper jobs from the jobs queue
- exporting metrics on the number of pending jobs in the queue
As of now there is only one implementation, Scheduler, which uses APScheduler.
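The dead-job detection duty boils down to comparing each job's last heartbeat against a timeout. A minimal sketch, where the timeout value and the job fields are assumptions for illustration:

```python
# Assumed timeout; the real value would come from scheduler configuration.
HEARTBEAT_TIMEOUT_SECONDS = 60.0


def find_dead_jobs(jobs, now):
    """Return jobs whose last heartbeat is older than the timeout.

    The scheduler would mark these jobs as dead so their scrapers
    can be rescheduled.
    """
    return [
        job for job in jobs
        if now - job["last_heartbeat"] > HEARTBEAT_TIMEOUT_SECONDS
    ]
```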
Worker
A worker constantly tries to dequeue a job from the jobs queue and executes each dequeued job.
As of now there is only one implementation, Worker.
API
Sneakpeek implements:

- A JsonRPC API to programmatically interact with the system (available at /api/v1/jsonrpc). It exposes the following methods:
  - CRUD methods to add, modify, and delete scrapers
  - Get the list of a scraper's jobs
  - Enqueue scraper jobs
- A UI that allows you to interact with the system
- Swagger documentation (available at /api)
- A copy of this documentation (available at /docs)
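A JsonRPC call is an ordinary POST of a JSON-RPC 2.0 body to the endpoint. The sketch below only shows the request shape; the method name "get_scrapers" is a hypothetical example, not a confirmed endpoint name:

```python
import json

# JSON-RPC 2.0 request body; POST this to /api/v1/jsonrpc.
# The method name is an assumed example for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "get_scrapers",
    "params": {},
}
body = json.dumps(request)
```

The `id` field ties the eventual response back to this request, per the JSON-RPC 2.0 specification.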