Extraction utils
The utilities for extraction are defined in cmoncrawl.processor.extraction.
It provides helper functions for both filtering and extraction.
Filtering
must_exist_filter: filters out the URLs whose page does not contain the CSS selector
must_not_exist_filter: filters out the URLs whose page contains the CSS selector
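As a rough illustration of the two filters, the sketch below keeps or drops a page depending on whether a CSS selector matches. It does not call cmoncrawl itself: keep_if_present, keep_if_absent, SupportsSelect and FakePage are hypothetical names, and the only assumption about the page object is a BeautifulSoup-style select method that returns a list of matches.

```python
from typing import Any, Dict, List, Protocol


class SupportsSelect(Protocol):
    """Anything exposing a BeautifulSoup-style ``select`` method (assumption)."""

    def select(self, selector: str) -> List[Any]: ...


def keep_if_present(page: SupportsSelect, selector: str) -> bool:
    """Hypothetical 'must exist' check: keep the page only if the selector matches."""
    return len(page.select(selector)) > 0


def keep_if_absent(page: SupportsSelect, selector: str) -> bool:
    """Hypothetical 'must not exist' check: keep the page only if nothing matches."""
    return len(page.select(selector)) == 0


class FakePage:
    """Stand-in for a parsed page: maps a selector to the elements it matches."""

    def __init__(self, matches: Dict[str, List[str]]) -> None:
        self._matches = matches

    def select(self, selector: str) -> List[str]:
        return self._matches.get(selector, [])


article = FakePage({"article.content": ["<article>"]})
paywalled = FakePage({"div.paywall": ["<div>"]})

# Keep only pages that contain the article body and no paywall marker.
kept = [
    page
    for page in (article, paywalled)
    if keep_if_present(page, "article.content") and keep_if_absent(page, "div.paywall")
]
```

With a real parsed page, a BeautifulSoup object could be passed in place of FakePage, since it also provides select.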
Extraction
- check_required: Creates a function that checks whether all the required fields are present in the extracted data.
- chain_transform: Creates a function that chains multiple transformation functions. If any of them returns None, the chain is broken and None is returned. This is especially useful with soup select and similar lookups that may come back empty.
- extract_transform: Creates a function that extracts the data from the soup tag using the CSS selector and transforms it using your transformation functions.
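The behaviour described in the list above can be sketched in plain Python. This is a minimal illustration of the pattern, not the library's implementation: make_chain, make_extractor, make_required_check, digits_or_none and FakeTag are hypothetical names, and the real signatures in cmoncrawl.processor.extraction may differ. The tag is assumed to offer a BeautifulSoup-style select_one method that returns None when nothing matches.

```python
from typing import Any, Callable, Dict, List, Optional

Transform = Callable[[Any], Optional[Any]]


def make_chain(transforms: List[Transform]) -> Transform:
    """Apply transforms left to right; return None as soon as one yields None."""

    def chained(value: Any) -> Optional[Any]:
        for transform in transforms:
            value = transform(value)
            if value is None:
                return None  # chain is broken, later transforms are skipped
        return value

    return chained


def make_extractor(selector: str, transforms: List[Transform]) -> Transform:
    """Select one element by CSS selector, then run it through the chain."""
    chain = make_chain(transforms)

    def extract(tag: Any) -> Optional[Any]:
        element = tag.select_one(selector)
        return None if element is None else chain(element)

    return extract


def make_required_check(required: List[str]) -> Callable[[Dict[str, Any]], bool]:
    """Return a function that is True only if every required field has a value."""

    def check(data: Dict[str, Any]) -> bool:
        return all(data.get(field) is not None for field in required)

    return check


class FakeTag:
    """Stand-in for a soup tag: maps a selector to the text it would yield."""

    def __init__(self, elements: Dict[str, str]) -> None:
        self._elements = elements

    def select_one(self, selector: str) -> Optional[str]:
        return self._elements.get(selector)


def digits_or_none(text: str) -> Optional[str]:
    """Pass the text through only when it is purely numeric."""
    return text if text.isdigit() else None


page = FakeTag({"h1.title": "  Hello  ", "span.views": " n/a "})

record = {
    "title": make_extractor("h1.title", [str.strip])(page),
    # "n/a" is not numeric, so the chain stops before int() is called.
    "views": make_extractor("span.views", [str.strip, digits_or_none, int])(page),
    # No element matches, so the extractor returns None without transforming.
    "author": make_extractor("a.author", [str.strip])(page),
}

has_title = make_required_check(["title"])
has_title_and_views = make_required_check(["title", "views"])
```

Here record holds a value only for title, so has_title(record) is True while has_title_and_views(record) is False. Returning None early means each transform can assume it receives a real value and needs no None checks of its own.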