Command Line Download
The download mode of the `cmon` command line tool serves to query and download from Common Crawl indexes.
The following arguments are needed in this order:
Positional arguments
url - URL to query.
output - Path to output directory.
{record,html} - Download mode:
record: Download record files from Common Crawl.
html: Download HTML files from Common Crawl.
In html mode, the output directory will contain `.html` files, one for each URL found. In record mode, the output directory will contain `.jsonl` files, each containing multiple domain records in JSON format.
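To check what a finished run left behind, the output directory can be summarized with standard tools. The following is a minimal sketch, not part of `cmon`: `summarize_output` is a hypothetical helper name, and it assumes that each line of a `.jsonl` file holds exactly one record (the usual JSON Lines convention). It searches recursively, so it also works if the files are spread over subdirectories.

```shell
#!/bin/sh
# Summarize an output directory produced by the download mode.
# Usage: summarize_output DIR   (DIR is the `output` positional argument)
# Assumption: every line of a .jsonl file is one domain record.
summarize_output() {
    dir=$1
    # html mode: one .html file per URL found
    html_count=$(find "$dir" -type f -name '*.html' | wc -l)
    # record mode: number of .jsonl files and total lines across them
    jsonl_count=$(find "$dir" -type f -name '*.jsonl' | wc -l)
    record_count=$(find "$dir" -type f -name '*.jsonl' -exec cat {} + | wc -l)
    # $(( )) strips the padding some wc implementations add
    echo "html=$((html_count)) jsonl=$((jsonl_count)) records=$((record_count))"
}
```

Calling `summarize_output html_output` or `summarize_output dr_output` after a download prints a single line with the three counts.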
Options
- --limit LIMIT
Max number of URLs to download.
- --since SINCE
Start date in ISO format (e.g., 2020-01-01).
- --to TO
End date in ISO format (e.g., 2020-01-01).
- --cc_server CC_SERVER
Common Crawl indexes to query. The whole URL must be provided (e.g., https://index.commoncrawl.org/CC-MAIN-2023-14-index).
- --max_retry MAX_RETRY
Max number of retries for a request. Increase this number when requests are failing.
- --sleep_step SLEEP_STEP
Number of additional seconds to add to the sleep time between each failed download attempt. Increase this number if the server tells you to slow down.
- --match_type MATCH_TYPE
Match type for the URL. One of exact, prefix, host, domain. Refer to cdx-api for more information.
- --max_directory_size MAX_DIRECTORY_SIZE
Max number of files per directory.
- --filter_non_200
Filter out results with a non-200 status code.
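The options above can be combined in a single call. The sketch below restricts a query to one year, raises the retry settings, and drops non-200 results. The flag names are taken from the list above; the values are illustrative and are not `cmon`'s defaults. `DRY_RUN` is a hypothetical variable of this script (not a `cmon` feature) that prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: flag names come from the option list above,
# the values are illustrative and are not cmon's defaults.
# DRY_RUN is a hypothetical helper variable of this script.
DRY_RUN=${DRY_RUN:-1}

# Options first, then the positional arguments: url, output, mode.
set -- download \
    --match_type=domain \
    --since=2020-01-01 \
    --to=2020-12-31 \
    --max_retry=10 \
    --sleep_step=5 \
    --filter_non_200 \
    --limit=100 \
    example.com html_output html

if [ "$DRY_RUN" = "1" ]; then
    echo "cmon $*"      # show the command without contacting any server
else
    cmon "$@"
fi
```

Running the script with `DRY_RUN=0` executes the command for real, which requires `cmon` to be installed and network access to the index server.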
Record mode options
- --max_crawls_per_file MAX_CRAWLS_PER_FILE
Max number of domain records per output file.
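Because each file holds at most `MAX_CRAWLS_PER_FILE` records, a download of N records needs at least ceil(N / MAX_CRAWLS_PER_FILE) `.jsonl` files. The sketch below only does that arithmetic; `min_files` is a hypothetical helper name, and the actual number of files `cmon` writes may be higher, since nothing in the option description says files are always filled to the maximum.

```shell
#!/bin/sh
# Lower bound on the number of .jsonl files for a record-mode download.
# Usage: min_files RECORDS PER_FILE
min_files() {
    records=$1
    per_file=$2
    # Integer ceiling division: ceil(records / per_file)
    echo $(( (records + per_file - 1) / per_file ))
}

min_files 1000 250   # prints 4
min_files 1000 300   # prints 4 (three full files plus one partial)
```

For instance, a run with `--limit=1000 --max_crawls_per_file=250` in record mode would leave at least four files in the output directory if all 1000 records are found.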
Examples
# Download the first 1000 domain records for example.com
cmon download --match_type=domain --limit=1000 example.com dr_output record
# Download the first 100 HTML files for example.com
cmon download --match_type=domain --limit=100 example.com html_output html