Scraping Web Pages
There are several ways to scrape web pages. The wget(1) tool is a quick and dirty way, but it does not record much metadata. Archival-standard copies of web sites are possible using a tool such as Heritrix from the Internet Archive, or Browsertrix. These tools make good archives but are not much help for producing browsable copies. For that, the warc2zim tool is useful: it produces .zim files that can be read by the Kiwix software for offline reading of web pages.
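For a quick one-off copy, a typical wget invocation looks something like the following; the exact flags are a matter of taste and the URL is only a placeholder:
# recursive mirror with links rewritten for local browsing
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.org/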
Using zimit
A convenient way to archive web pages and produce both WARC files and .zim files is the Zimit tool, which bundles Browsertrix and warc2zim in a Docker image. Whilst we have opinions about the Docker strategy and the software development patterns that produced it, in this case it is an easy way to get going.
The steps are:
- Install docker in whatever way your operating system wants you to. Debian or Ubuntu systems might do
apt install docker.io
- Obtain the zimit image:
docker pull ghcr.io/openzim/zimit
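A quick way to confirm the setup works is to list the image and ask zimit for its help text, following the same invocation pattern as the commands below (assuming the image entry point behaves as it does in our runs):
docker image ls ghcr.io/openzim/zimit
docker run --rm ghcr.io/openzim/zimit zimit --help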
Now we assume you are working in a particular directory, say /home/name/scraping, that we will call $SCRAPE.
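Setting that up might look like this (the path is only an example):
SCRAPE=/home/name/scraping
mkdir -p "$SCRAPE"
cd "$SCRAPE"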
First run the scrape. We will use the https://maps.org/ web site as an example.
docker run \
-v ${SCRAPE}:/output \
ghcr.io/openzim/zimit zimit \
-w 12 \
--seeds https://maps.org \
--name maps.org-20250310 \
--title MAPS \
--description "Multidisciplinary Association for Psychedelic Studies" \
--scopeExcludeRx '.*add-to-cart=[0-9]*' \
--keep
This needs some explanation.
- -v ${SCRAPE}:/output says to bind what Docker thinks of as the output directory to the working directory.
- -w 12 means to run 12 scraping threads concurrently. On our machine, this is the number of CPU cores.
- --seeds https://maps.org/ is the web site to scrape. It is possible to have multiple web sites, comma separated.
- --name maps.org-20250310 is the filename for the output .zim file.
- --title and --description go in the .zim file metadata.
- --scopeExcludeRx is a regular expression to exclude certain URLs. Necessary in this case so that the shopping cart section of the web site does not create an infinitely recursive scrape; a quick way to test such a pattern is shown after this list.
- --keep causes zimit to keep intermediate files. In particular, it keeps the WARC files, which we also want.
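If you are unsure whether an exclusion pattern matches what you expect, it can be tested against a sample URL with grep before committing to a long crawl; the URL here is invented for illustration:
# prints the URL if the exclusion pattern matches it, nothing otherwise
echo 'https://maps.org/shop/?add-to-cart=1234' | grep -E '.*add-to-cart=[0-9]*'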
Doing this archived the web site but failed at the very end. The reason is as yet undiagnosed, but we suspect it has to do with zimit's management of concurrency. No matter: the WARC files are saved in a temporary directory whose name starts with .tmp followed by some random characters, in this case .tmptp8i9y5f.
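To locate that directory and confirm the WARC files are present, something like the following will do (the directory name changes on every run):
# the .tmp directory name is random; list it and the WARC files inside
ls -d ${SCRAPE}/.tmp*
find ${SCRAPE}/.tmp* -name '*.warc.gz'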
We can work around this by looking in the temporary directory for the WARC files and running warc2zim ourselves:
ls .tmptp8i9y5f/collections/crawl-20250310121334268/archive/*.warc.gz | sed s@^@/output/@ > /tmp/scrape.$$
docker run \
-v ${SCRAPE}:/output \
ghcr.io/openzim/zimit warc2zim \
--name maps.org-20250310 \
--title MAPS \
--description "Multidisciplinary Association for Psychedelic Studies" \
--zim-file /output/maps.org-20250310.zim \
`cat /tmp/scrape.$$`
rm /tmp/scrape.$$
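If the zim-tools package happens to be installed on the host, zimcheck offers some reassurance that the rebuilt file is intact; this is an optional extra, not part of zimit itself:
# run zimcheck's integrity tests against the new file
zimcheck ${SCRAPE}/maps.org-20250310.zim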
Now we can assemble the archive, ready for uploading:
mkdir archive
mv maps.org-20250310.zim archive
mv .tmptp8i9y5f/collections/crawl-20250310121334268/archive/* archive
mv .tmptp8i9y5f/collections/crawl-20250310121334268/crawls/* archive
mv .tmptp8i9y5f/collections/crawl-20250310121334268/pages/* archive
mv .tmptp8i9y5f/collections/crawl-20250310121334268/warc-cdx/* archive
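As a final sanity check before uploading, the .zim file can be browsed locally with kiwix-serve from kiwix-tools, assuming it is installed:
# serve the archive locally; point a browser at http://localhost:8080/ to check it
kiwix-serve --port=8080 archive/maps.org-20250310.zim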