Metadata-Version: 2.4
Name: wayback-machine-archiver
Version: 1.9.2
Summary: A Python script to submit web pages to the Wayback Machine for archiving.
Author-email: Alexander Gude <alex.public.account@gmail.com>
License: # MIT License (MIT)
        
        Copyright © 2018--2025 Alexander Gude
        
        Permission is hereby granted, free of charge, to any person obtaining
        a copy of this software and associated documentation files (the
        "Software"), to deal in the Software without restriction, including
        without limitation the rights to use, copy, modify, merge, publish,
        distribute, sublicense, and/or sell copies of the Software, and to
        permit persons to whom the Software is furnished to do so, subject to
        the following conditions:
        
        The above copyright notice and this permission notice shall be
        included in all copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
        EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
        MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
        IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
        CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
        TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
        SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
        
Project-URL: Homepage, https://github.com/agude/wayback-machine-archiver
Keywords: Internet Archive,Wayback Machine
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: urllib3
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: requests-mock; extra == "dev"
Requires-Dist: bump-my-version; extra == "dev"
Dynamic: license-file

# Wayback Machine Archiver

Wayback Machine Archiver (Archiver for short) is a command-line utility written
in Python to back up GitHub Pages using the [Internet Archive][ia].

[ia]: https://archive.org/

## Installation

The best way to install Archiver is with `pip`:

```bash
pip install wayback-machine-archiver
```

This will give you access to the script simply by calling:

```bash
archiver --help
```

You can also clone this repository:

```bash
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
python ./wayback_machine_archiver/archiver.py --help
```

If you clone the repository, Archiver can be installed as a local application
using the `setup.py` script:

```bash
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
./setup.py install
```

Like installing with `pip`, this gives you access to the script by calling
`archiver`.

Archiver requires [the `requests` library][requests] by Kenneth Reitz and
`urllib3`. Archiver supports Python 3.8 and newer.

[requests]: https://github.com/kennethreitz/requests

## Usage

The simplest way to schedule a backup is by specifying the URL of a web page,
like so:

```bash
archiver https://alexgude.com
```

This will submit the main page of my blog, [alexgude.com][ag], to the Wayback
Machine for archiving.

[ag]: https://alexgude.com

You can also archive all the URLs specified in a [`sitemap.xml`][sitemap] as
follows:

[sitemap]: https://en.wikipedia.org/wiki/Sitemaps

```bash
archiver --sitemaps https://alexgude.com/sitemap.xml
```

This will back up every page listed in the sitemap of my website, [alexgude.com][ag].
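
To preview which pages a sitemap lists before submitting it, the following sketch extracts the `<loc>` entries using only the Python standard library. It is an illustration under stated assumptions, not Archiver's own parsing code: the function name `extract_urls` and the sample XML are hypothetical, and only the namespace comes from the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample; a real sitemap would be downloaded or read from disk.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>
"""


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element, in document order."""
    # Encode to bytes so the XML encoding declaration is honored.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in extract_urls(SAMPLE_SITEMAP):
        print(url)
```

Each printed URL is a page that a sitemap-based run would be expected to submit.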

You can also pass a local `sitemap.xml` file to Archiver; the path requires the `file://` prefix:

```bash
archiver --sitemaps file://sitemap.xml
```

You can back up multiple pages by specifying multiple URLs or sitemaps:

```bash
archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml https://alexgude.com/sitemap.xml
```

You can also back up multiple URLs by writing them to a file (for example,
`urls.txt`), one URL per line, and passing that file to Archiver:

```bash
archiver --file urls.txt
```
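
If the list of URLs comes from another program, a small helper can produce the one-URL-per-line file described above. This is a minimal sketch: the function `write_url_file` is hypothetical and not part of Archiver, and skipping blanks and duplicates is a choice made here, not documented Archiver behavior.

```python
from pathlib import Path


def write_url_file(urls, path="urls.txt"):
    """Write URLs to `path`, one per line, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        # Keep only the first occurrence of each non-empty URL.
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines
```

Calling `write_url_file(["https://alexgude.com", "https://charles.uno"])` creates a `urls.txt` that can then be passed to `archiver --file urls.txt`.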

Sitemaps often exclude themselves, so you can request that the sitemap itself
be backed up using the flag `--archive-sitemap-also`:

```bash
archiver --sitemaps https://alexgude.com/sitemap.xml --archive-sitemap-also
```

## Help

For a full list of command-line flags, see Archiver's built-in help, displayed
with `archiver --help`:

```
usage: archiver [-h] [--version] [--file FILE]
                [--sitemaps SITEMAPS [SITEMAPS ...]]
                [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--log-to-file LOG_FILE] [--archive-sitemap-also]
                [--jobs JOBS] [--rate-limit-wait RATE_LIMIT_IN_SEC]
                [urls [urls ...]]

A script to backup a web pages with Internet Archive

positional arguments:
  urls                  the URLs of the pages to archive

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           path to a file containing urls to save (one url per
                        line)
  --sitemaps SITEMAPS [SITEMAPS ...]
                        one or more URIs to sitemaps listing pages to archive;
                        local paths must be prefixed with 'file://'
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        set the logging level, defaults to WARNING
  --log-to-file LOG_FILE
                        redirect logs to a file
  --archive-sitemap-also
                        also submit the URL of the sitemap to be archived
  --jobs JOBS, -j JOBS  run this many concurrent URL submissions, defaults to
                        1
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        number of seconds to wait between page requests to
                        avoid flooding the archive site, defaults to 5; also
                        used as the backoff factor for retries
```
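
The help text describes `--rate-limit-wait` as both a pause between requests and a backoff factor for retries. As a rough illustration of what a backoff factor usually means, the sketch below computes a plain exponential schedule. This is an assumption made for illustration: the function `backoff_delays` is hypothetical, and the exact formula applied by Archiver or its HTTP libraries may differ.

```python
def backoff_delays(factor, retries):
    """Return the delay in seconds before each retry: factor * 2**n for n = 0, 1, ..."""
    return [factor * 2 ** n for n in range(retries)]


if __name__ == "__main__":
    # The documented default wait of 5 seconds, with three retries assumed.
    print(backoff_delays(5, 3))
```

Under this generic scheme, a larger `--rate-limit-wait` value lengthens every retry delay in proportion, so raising it makes both normal submissions and retries gentler on the archive site.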

## Setting Up a `sitemap.xml` for GitHub Pages

It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site.
Simply use [jekyll/jekyll-sitemap][jsm].

Setup instructions can be found at the link above; they require changing just
a single line of your site's `_config.yml`.

[jsm]: https://github.com/jekyll/jekyll-sitemap
