Metadata-Version: 2.4
Name: wayback-machine-archiver
Version: 2.0.0
Summary: A Python script to submit web pages to the Wayback Machine for archiving.
Author-email: Alexander Gude <alex.public.account@gmail.com>
License: # MIT License (MIT)
        
        Copyright © 2018--2025 Alexander Gude
        
        Permission is hereby granted, free of charge, to any person obtaining
        a copy of this software and associated documentation files (the
        "Software"), to deal in the Software without restriction, including
        without limitation the rights to use, copy, modify, merge, publish,
        distribute, sublicense, and/or sell copies of the Software, and to
        permit persons to whom the Software is furnished to do so, subject to
        the following conditions:
        
        The above copyright notice and this permission notice shall be
        included in all copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
        EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
        MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
        IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
        CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
        TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
        SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
        
Project-URL: Homepage, https://github.com/agude/wayback-machine-archiver
Keywords: Internet Archive,Wayback Machine
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-dotenv
Requires-Dist: requests
Requires-Dist: urllib3
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: requests-mock; extra == "dev"
Requires-Dist: bump-my-version; extra == "dev"
Dynamic: license-file

# Wayback Machine Archiver

Wayback Machine Archiver (Archiver for short) is a command-line utility
written in Python to back up web pages using the [Internet Archive][ia].

[ia]: https://archive.org/

## Installation

The best way to install Archiver is with `pip`:

```bash
pip install wayback-machine-archiver
```

This installs the `archiver` command, which you can run with:

```bash
archiver --help
```

You can also install it directly from a local clone of this repository:

```bash
git clone https://github.com/agude/wayback-machine-archiver.git
cd wayback-machine-archiver
pip install .
```

All dependencies are handled automatically. Archiver supports Python 3.8+.

## Usage

Run Archiver from the command line. The examples below work in either
execution mode.

### Command-Line Examples

**Archive a single page:**
```bash
archiver https://alexgude.com
```

**Archive all pages from a sitemap:**
```bash
archiver --sitemaps https://alexgude.com/sitemap.xml
```

**Archive from a local sitemap file:**
(Note that the `file://` prefix is required.)
```bash
archiver --sitemaps file://sitemap.xml
```

**Archive from a text file of URLs:**
(The file should contain one URL per line)
```bash
archiver --file urls.txt
```

**Combine multiple sources:**
```bash
archiver https://radiokeysmusic.com --sitemaps https://charles.uno/sitemap.xml
```

**Archive a sitemap's pages and the sitemap URL itself:**
```bash
archiver --sitemaps https://alexgude.com/sitemap.xml --archive-sitemap-also
```
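
To make the `--sitemaps` examples more concrete, here is a minimal sketch of
how page URLs can be pulled out of a sitemap using only the standard library.
It is an illustration, not Archiver's actual code: the function name
`extract_urls` and the sample sitemap are hypothetical, and it assumes a
standard sitemap that uses the sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# A made-up sitemap with two pages.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

for url in extract_urls(EXAMPLE_SITEMAP):
    print(url)
```

Each URL printed this way corresponds to one page that would be submitted for
archiving.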

### Execution Modes

The script runs in one of two modes, which it selects automatically based on
whether it finds Internet Archive credentials.

#### Authenticated Mode (Recommended)

This is the preferred mode. The script uses the Internet Archive's **Save Page
Now 2 (SPN2)** API to submit a capture job, wait for it to complete, and
confirm the final success or failure.

**To enable this mode:**

1.  Get your S3-style API keys from your Internet Archive account settings:
    [https://archive.org/account/s3.php](https://archive.org/account/s3.php)

2.  Create a `.env` file in the directory where you run the `archiver`
    command. Add your keys to it:
    ```
    INTERNET_ARCHIVE_ACCESS_KEY="YOUR_ACCESS_KEY_HERE"
    INTERNET_ARCHIVE_SECRET_KEY="YOUR_SECRET_KEY_HERE"
    ```

The script will automatically detect this file (or the equivalent environment
variables) and use the authenticated API.
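
The mode selection can be pictured with the short sketch below. It is not the
script's real implementation: the helper `select_mode` is hypothetical, it
reads only environment variables (it does not parse a `.env` file), and it
assumes that both keys must be present and non-empty for authenticated mode.

```python
import os

ACCESS_KEY_VAR = "INTERNET_ARCHIVE_ACCESS_KEY"
SECRET_KEY_VAR = "INTERNET_ARCHIVE_SECRET_KEY"


def select_mode(env=None):
    """Hypothetical helper: choose a mode from the two credential variables."""
    env = os.environ if env is None else env
    access_key = env.get(ACCESS_KEY_VAR, "").strip()
    secret_key = env.get(SECRET_KEY_VAR, "").strip()
    # Assumption: both keys must be non-empty to use the authenticated API.
    if access_key and secret_key:
        return "authenticated"
    return "unauthenticated"


print(select_mode({ACCESS_KEY_VAR: "key", SECRET_KEY_VAR: "secret"}))
print(select_mode({ACCESS_KEY_VAR: "key"}))
```

Calling `select_mode()` with no argument checks the current process
environment instead of a supplied dictionary.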

#### Unauthenticated Mode

If no credentials are found, the script falls back to the public,
unauthenticated API. This is a "fire-and-forget" method: it submits the
capture request but does not wait to confirm whether the capture succeeded.

## Help

For a full list of command-line flags, run `archiver --help`:

```
usage: archiver [-h] [--version] [--file FILE]
                [--sitemaps SITEMAPS [SITEMAPS ...]]
                [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--log-to-file LOG_FILE]
                [--archive-sitemap-also]
                [--rate-limit-wait RATE_LIMIT_IN_SEC]
                [--random-order]
                [urls ...]

A script to backup a web pages with Internet Archive

positional arguments:
  urls                  the URLs of the pages to archive

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --file FILE           path to a file containing urls to save (one url per
                        line)
  --sitemaps SITEMAPS [SITEMAPS ...]
                        one or more URIs to sitemaps listing pages to
                        archive; local paths must be prefixed with 'file://'
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        set the logging level, defaults to WARNING
  --log-to-file LOG_FILE
                        redirect logs to a file
  --archive-sitemap-also
                        also submit the URL of the sitemap to be archived
  --rate-limit-wait RATE_LIMIT_IN_SEC
                        number of seconds to wait between page requests to
                        avoid flooding the archive site, defaults to 5; also
                        used as the backoff factor for retries
  --random-order        randomize the order of pages before archiving
```
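
The help text says the `--rate-limit-wait` value doubles as the backoff factor
for retries. The sketch below shows how a backoff factor commonly turns into
growing delays. The doubling formula and the helper name `backoff_delays` are
assumptions for illustration; the delays Archiver actually uses depend on its
retry library and version.

```python
def backoff_delays(backoff_factor, retries):
    """Return illustrative delays, in seconds, before each retry.

    Assumption: the delay starts at backoff_factor and doubles on every
    subsequent retry (plain exponential backoff).
    """
    return [backoff_factor * (2 ** attempt) for attempt in range(retries)]


# With the default wait of 5 seconds and four retries:
print(backoff_delays(5, 4))  # [5, 10, 20, 40]
```

Under this assumption, a larger `--rate-limit-wait` slows down both normal
submissions and the retry schedule.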

## Setting Up a `sitemap.xml` for GitHub Pages

It is easy to automatically generate a sitemap for a GitHub Pages Jekyll site.
Simply use [jekyll/jekyll-sitemap][jsm].

Setup instructions can be found on the linked page; they require changing just
a single line of your site's `_config.yml`.

[jsm]: https://github.com/jekyll/jekyll-sitemap
