Command line Extract
Contents
Command line Extract#
The extract mode of the `cmon`
command line tool servers to extract your download files.
The following arguments are needed in this order:
Positional arguments#
config_path - Path to config file containing extraction rules.
output_path - Path to output directory.
files - Files to extract data from.
{record,html} - Extraction mode:
record: Extract data from jsonl (domain record) files.
html: Extract data from HTML files.
To create a config file, see Extractor config file.
Both modes yield the same output format, which is a `.jsonl`
file containing the extracted data,
one per line. For each file a new directory is created in the output directory, named after the
file.
The files created by the download mode, can be directly used with appropriate mode
in the extraction. If you have an html file, you can use the html mode to extract it.
If you have a domain records, which you got some other way (AWS Athena), please refer to Domain Record JSONL format,
which describes how to create `.jsonl`
files from your domain records, which you can then
use with the record mode.
Optional arguments#
- --max_crawls_per_file MAX_CRAWLS_PER_FILE
Max number of extractions per file output.
- --max_directory_size MAX_DIRECTORY_SIZE
Max number of extraction files per directory.
- --n_proc N_PROC
Number of processes to use for extraction. The paralelization is on file level, thus for single file it’s useless to use more than one process.
Record arguments#
- --max_retry MAX_RETRY
Max number of WARC download attempts.
- --sleep_step SLEEP_STEP
Number of additional seconds to add to the sleep time between each failed download attempt.
Html arguments#
- --date DATE
Date of extraction of HTML files in ISO format (e.g., 2021-01-01). The default is today.
- --url URL
URL from which the HTML files were downloaded. By default, it will try to infer from the file content.
Examples#
# Take the domain records downloaded using the first command and extracts them using your extractors
cmon extract config.json extracted_output dr_output/*/*.jsonl record --max_retry 100 --sleep_step 10
# Take the htmls downloaded using the second command and extracts them using your extractors
cmon extract config.json extracted_output html_output/*/*.html html --date 2021-01-01 --url https://www.example.com
When you are going to build the extractors, you gonna appreaciate that you can specify what the url of the html file is and what the date of the extraction is. This is because those information are used during the extractor routing.