Metadata-Version: 2.4
Name: internet-archive-extractor
Version: 0.0.1
Summary: Tool for extracting archived web sites from the Internet Archive saving as WARC files.
Author-email: Victor Harbo Johnston <vijo@cas.au.dk>
License: MIT
Project-URL: Homepage, https://github.com/WEB-CHILD/InternetArchiveExtractor
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: certifi
Requires-Dist: charset-normalizer
Requires-Dist: idna
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: pysqlite3
Requires-Dist: python-dateutil
Requires-Dist: python-magic
Requires-Dist: pytz
Requires-Dist: pywaybackup
Requires-Dist: requests
Requires-Dist: six
Requires-Dist: tqdm
Requires-Dist: tzdata
Requires-Dist: urllib3
Requires-Dist: warcio

# The project currently only supports converting output from Bitdruids WaybackUp program to be converted to WARC files.

## WaybackupToWarc

This project is designed to read a CSV file containing URLs, remove any instances of port 80 from those URLs, and generate WARC-GZ files based on the cleaned data.
The CSV file can be constructed by using the following tool: [Python Wayback Machine Downloader](https://github.com/bitdruid/python-wayback-machine-downloader)

## Project Structure

```
WaybackupToWarc
├── src
│   ├── main.py        # Main script for processing CSV and generating WARC files
│   └── utils.py       # Utility functions for reading CSV and modifying URLs
├── requirements.txt    # List of dependencies for the project
└── README.md           # Documentation for the project
```

## Requirements

To run this project, you need to install the following dependencies:

- `warcio`: For creating WARC files.
- `pandas`: For handling CSV data.

You can install the required packages using pip:

```
pip install -r requirements.txt
```

## Usage

1. Place your CSV file in the appropriate directory.
2. Update the `src/main.py` file to specify the path to your CSV file.
3. Run the main script:

```
python src/main.py
```

This will process the CSV file, remove port 80 from the URLs, and generate the corresponding WARC-GZ files.

## Contributing

Feel free to submit issues or pull requests if you have suggestions or improvements for the project.
