Metadata-Version: 2.1
Name: hibp-downloader
Version: 0.4.8
Summary: Efficiently download HIBP new pwned password data by hash-prefix for a local-copy
Keywords: hibp-downloader,hibp,haveibeenpwned,haveibeenpwned-downloader,sha1,ntlm
Author-Email: Nicholas de Jong <ndejong@threatpatrols.com>
License: BSD-3-Clause
Classifier: Environment :: Console
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Project-URL: Documentation, https://threatpatrols.github.io/hibp-downloader
Project-URL: Homepage, https://github.com/threatpatrols/hibp-downloader
Project-URL: Repository, https://github.com/threatpatrols/hibp-downloader
Project-URL: Bug Tracker, https://github.com/threatpatrols/hibp-downloader/issues
Requires-Python: <4,>=3.10
Requires-Dist: httpx[http2]>=0.21
Requires-Dist: httpcore>=0.14
Requires-Dist: aiofiles>=0.8
Requires-Dist: typer>=0.9.0
Requires-Dist: shellingham>=1.3.0
Description-Content-Type: text/markdown

# hibp-downloader

[![pypi](https://img.shields.io/pypi/v/hibp-downloader.svg)](https://pypi.python.org/pypi/hibp-downloader/)
[![python](https://img.shields.io/pypi/pyversions/hibp-downloader.svg)](https://github.com/threatpatrols/hibp-downloader/)
[![build tests](https://github.com/threatpatrols/hibp-downloader/actions/workflows/build-tests.yml/badge.svg)](https://github.com/threatpatrols/hibp-downloader/actions/workflows/build-tests.yml)
[![license](https://img.shields.io/github/license/threatpatrols/hibp-downloader.svg)](https://github.com/threatpatrols/hibp-downloader)

This is a CLI tool to efficiently download a local copy of the pwned password hash data from the very awesome
[HIBP](https://haveibeenpwned.com/Passwords) pwned passwords [api-endpoint](https://api.pwnedpasswords.com) using all the good bits:
multiprocessing, async-processes, local-caching, content-etags and http2-connection pooling, to make things
about as fast as is Pythonly possible.

## Features
 - **Direct password lookups** via the `query` command — check passwords against the *compressed* data store with no database or decompression step needed. Fast enough to use behind a web service.
 - Download and store acquired data in gzip compressed format to save on storage and speed up queries.
 - Download the full dataset in under 45 mins (generally CPU bound).
 - Easily resume interrupted `download` operations into a `--data-path` without re-requesting data that is already stored locally.
 - Only download hash-prefix content blocks when the source content has changed (via content ETag values), making it
   easy to periodically sync-up when needed.
 - Ability to generate a single text file with in-order pwned password hash values, similar to [PwnedPasswordsDownloader](https://github.com/HaveIBeenPwned/PwnedPasswordsDownloader) from
   the awesome HIBP team.
 - Per prefix file metadata in JSON format for easy data reuse by other tooling if required.
 - Standalone validation command to verify the local copy dataset, clean up corrupted or incomplete files, and remove orphaned metadata files.
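
The ETag-based sync described above relies on standard HTTP conditional requests. The following is a minimal sketch of that idea using only the Python standard library; it is not the tool's actual implementation (which is built on `httpx`), and the function name `fetch_if_changed` and its `opener` parameter are hypothetical names chosen for illustration.

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, cached_etag=None, opener=urllib.request.urlopen):
    """Return (body, etag); body is None when the server answers 304 Not Modified.

    Hypothetical helper for illustration only, not part of hibp-downloader.
    """
    # Send the previously stored ETag so the server can skip the body if unchanged.
    headers = {"If-None-Match": cached_etag} if cached_etag else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request) as response:
            # Content is new or changed: return it together with its current ETag.
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Local copy is still current, so nothing needs to be downloaded.
            return None, cached_etag
        raise
```

A caller would store the returned ETag next to each prefix block and pass it back on the next sync; a `None` body then means the local block can be kept as-is.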

## Install
```commandline
pipx install hibp-downloader
```

## Usage (download)
![screenshot-help.png](https://raw.githubusercontent.com/threatpatrols/hibp-downloader/main/docs/docs/assets/img/screenshot-help.png)

## Performance
Sample download activity log from a host with 32 cores on a 500 Mbit/s connection.
```text
...
2024-05-16T10:18:01-0400 | INFO | hibp-downloader | prefix=f80c7 source=[lc:13616 et:3 rc:1002358 ro:25 xx:1] processed=[17836.6MB ~414462H/s] api=[918req/s 17597.4MB] runtime=36.4min
2024-05-16T10:18:02-0400 | INFO | hibp-downloader | prefix=f81af source=[lc:13616 et:3 rc:1002558 ro:25 xx:1] processed=[17840.1MB ~414454H/s] api=[918req/s 17600.9MB] runtime=36.4min
2024-05-16T10:18:02-0400 | INFO | hibp-downloader | prefix=f826f source=[lc:13616 et:3 rc:1002758 ro:25 xx:1] processed=[17843.6MB ~414454H/s] api=[918req/s 17604.4MB] runtime=36.4min
2024-05-16T10:18:03-0400 | INFO | hibp-downloader | prefix=f833f source=[lc:13616 et:3 rc:1002958 ro:25 xx:1] processed=[17847.1MB ~414450H/s] api=[918req/s 17607.9MB] runtime=36.4min
```

 - ~918 requests per second to `api.pwnedpasswords.com`
 - Log sources are shorthand:
     - `lc`: local-cache - request-responses handled locally without hitting the network. 
     - `et`: ETag match - request-responses that confirmed our local data was up-to-date and did not require a new download.
     - `rc`: remote-cache - request-responses that were downloaded to local, but came from the remote-server cache.
     - `ro`: remote-origin - request-responses that were downloaded to local, and the download needed to be fetched from remote origin source.
     - `xx`: unknown/failed - request-responses that failed and were then successfully retried.
 - ~17GB downloaded in ~36 minutes (full dataset)
 - ~414k hash values received per second
 - Processing in this example appears to be CPU bound, measured traffic around ~160 Mbit/s.
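
For monitoring a long download, the progress lines above can be picked apart with a small parser. This is a sketch that assumes the log format stays exactly as in the sample; `parse_progress` is a hypothetical helper, not something the tool provides. The `position` value assumes prefixes are worked through in ascending order, as the sample suggests, across the 16^5 possible five-character hex prefixes.

```python
import re

# One line copied from the sample log above.
SAMPLE = (
    "2024-05-16T10:18:01-0400 | INFO | hibp-downloader | prefix=f80c7 "
    "source=[lc:13616 et:3 rc:1002358 ro:25 xx:1] "
    "processed=[17836.6MB ~414462H/s] api=[918req/s 17597.4MB] runtime=36.4min"
)


def parse_progress(line):
    """Extract the prefix and per-source counters from one progress line.

    Hypothetical helper; returns None when the line is not a progress line.
    """
    prefix = re.search(r"prefix=([0-9a-f]{5})", line)
    source = re.search(r"source=\[([^\]]*)\]", line)
    if not prefix or not source:
        return None
    # "lc:13616 et:3 ..." becomes {"lc": 13616, "et": 3, ...}
    counters = {}
    for item in source.group(1).split():
        name, _, value = item.partition(":")
        counters[name] = int(value)
    return {
        "prefix": prefix.group(1),
        "sources": counters,
        # Rough position within the 16**5 prefix space (assumes ascending order).
        "position": int(prefix.group(1), 16) / 16**5,
    }


if __name__ == "__main__":
    print(parse_progress(SAMPLE))
```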

## Usage (query)
Query passwords directly against the compressed data store — no decompression, no database
import required.  This is the recommended approach for any password-checking lookup.
![screenshot-query-help.png](https://raw.githubusercontent.com/threatpatrols/hibp-downloader/main/docs/docs/assets/img/screenshot-query-help.png)
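
To illustrate why a lookup by hash-prefix can be cheap, here is a rough sketch of the general idea: hash the password, use the first five hex characters to pick one prefix block, and scan only that block for the remaining suffix. It assumes each block is a gzip-compressed copy of the `SUFFIX:COUNT` lines that the HIBP range API returns; the tool's real on-disk layout and lookup code may differ, and the function names and sample counts below are made up for illustration.

```python
import gzip
import hashlib


def split_sha1(password):
    """Return (prefix, suffix) of the uppercase SHA-1 hex digest of a password."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_block(compressed_block, suffix):
    """Return the pwned count for a suffix in one gzip-compressed prefix block.

    Assumes "SUFFIX:COUNT" lines; returns 0 when the suffix is not present.
    """
    text = gzip.decompress(compressed_block).decode("ascii")
    for line in text.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


if __name__ == "__main__":
    prefix, suffix = split_sha1("password")
    # Illustrative block content only; the counts here are invented.
    block = gzip.compress(
        b"0018A45C4D1DEF81644B54AB7F969B88D65:3\r\n" + suffix.encode("ascii") + b":42\r\n"
    )
    print(prefix, count_in_block(block, suffix))
```

Only the single small block matching the prefix is read, which is why no full-dataset decompression or database import is involved in this kind of lookup.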

## Usage (generate)
Generate a single decompressed text file from the data store.  If you are generating this to
import into a database for lookups, consider using the `query` command directly instead —
it is faster to set up, far easier to maintain, and uses a fraction of the storage.
```commandline
hibp-downloader --data-path /path/to/data generate --filename pwned-hashes.txt --hash-type sha1
```

## Usage (validate)
Validate local pwned password files and automatically clean up corrupted data or orphaned metadata files:
```commandline
hibp-downloader --data-path /path/to/data validate --hash-type sha1
```

## Project

 - Docs - [threatpatrols.github.io/hibp-downloader](https://threatpatrols.github.io/hibp-downloader)
 - PyPI - [pypi.org/project/hibp-downloader/](https://pypi.org/project/hibp-downloader/)
 - GitHub - [github.com/threatpatrols/hibp-downloader](https://github.com/threatpatrols/hibp-downloader)

