Metadata-Version: 2.4
Name: pdflinkcheck
Version: 1.1.38
Summary: A purpose-built PDF link analysis and reporting tool with GUI and CLI.
Author-email: George Clayton Bennett <george.bennett@memphistn.gov>
Project-URL: Homepage, https://github.com/city-of-memphis-wastewater/pdflinkcheck
Project-URL: Repository, https://github.com/city-of-memphis-wastewater/pdflinkcheck
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Other Audience
Classifier: Topic :: File Formats
Classifier: Topic :: Office/Business
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Environment :: Console
Classifier: Environment :: MacOS X
Classifier: Environment :: Win32 (MS Windows)
Classifier: Typing :: Typed
Classifier: Development Status :: 4 - Beta
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyhabitat>=1.0.53
Requires-Dist: pymupdf>=1.26.6
Requires-Dist: rich>=14.2.0
Requires-Dist: typer>=0.20.0
Provides-Extra: dev
Requires-Dist: ruff>=0.1.13; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Dynamic: license-file

# pdflinkcheck

A purpose-built tool for comprehensive analysis of hyperlinks and link remnants within PDF documents, primarily using the PyMuPDF library. Use the CLI or the GUI.

-----

![Screenshot of the pdflinkcheck GUI](https://raw.githubusercontent.com/City-of-Memphis-Wastewater/pdflinkcheck/main/assets/pdflinkcheck_gui_v1.1.32.png)

-----

## 📥 Access and Installation

The recommended way to use `pdflinkcheck` is either to install the CLI with `pipx` or to download the latest binary for your system from [Releases](https://github.com/City-of-Memphis-Wastewater/pdflinkcheck/releases/).

### 🚀 Recommended Access (Binary Files)

For the most typical end-user experience, download the single-file binary matching your OS.

| **File Type** | **Primary Use Case** | **Recommended Launch Method** |
| :--- | :--- | :--- |
| **Executable (.exe, .elf, .pyz)** | **GUI (Double-Click)** | Double-click the file (use the accompanying `.bat` file on Windows). |
| **PYZ (Python Zip App)** | **CLI (Terminal)** | Run using your system's `python` command: `python pdflinkcheck-VERSION.pyz analyze ...` |

### Installation via pipx

For an isolated environment where you can access `pdflinkcheck` from any terminal:

```bash
# Ensure you have pipx installed first (if not, run: pip install pipx)
pipx install pdflinkcheck
```

-----

## 💻 Graphical User Interface (GUI)

The tool can be run as a simple cross-platform graphical interface (Tkinter).

### Launching the GUI

There are three ways to launch the GUI:

1.  **Implicit Launch:** Run the main command with no arguments, subcommands, or flags (`pdflinkcheck`).
2.  **Explicit Command:** Use the dedicated GUI subcommand (`pdflinkcheck gui`).
3.  **Binary Double-Click:**
      * **Windows:** Double-click the `pdflinkcheck-VERSION-gui.bat` file.
      * **macOS/Linux:** Double-click the downloaded `.pyz` or `.elf` file.

### Planned GUI Updates

We are actively working on the following enhancements:

  * **Report Export:** Functionality to export the full analysis report to a plain text file.
  * **License Visibility:** A dedicated "License Info" button within the GUI to display the terms of the AGPLv3+ license.

-----

## 🚀 CLI Usage

The core functionality is accessed via the `analyze` command. All commands include the built-in `--help` flag for quick reference.

### Available Commands

|**Command**|**Description**|
|---|---|
|`pdflinkcheck analyze`|Analyzes a PDF file for links and remnants.|
|`pdflinkcheck gui`|Explicitly launch the Graphical User Interface.|
|`pdflinkcheck license`|**Displays the full AGPLv3+ license text in the terminal.**|

### `analyze` Command Options

|**Option**|**Description**|**Default**|
|---|---|---|
|`<PDF_PATH>`|**Required.** The path to the PDF file to analyze.|N/A|
|`--check-remnants / --no-check-remnants`|Toggle scanning the text layer for unlinked URLs/Emails.|`--check-remnants`|
|`--max-links INTEGER`|Maximum number of links/remnants to display in the detailed report sections. Use `0` to show all.|`0` (Show All)|
|`--export-format FORMAT`|Format for the exported report. If specified, the report is saved to a file named after the PDF. Currently supported: `JSON`.|`JSON`|
|`--help`|Show command help and exit.|N/A|

### `gui` Command Options

| **Option**             | **Description**                                                                                               | **Default**    |
| ---------------------- | ------------------------------------------------------------------------------------------------------------- | -------------- |
| `--auto-close INTEGER` | **(For testing/automation only).** Delay in milliseconds after which the GUI window will automatically close. | `0` (Disabled) |

### Example Runs

```bash 
# Analyze a document, show all links/remnants, and save the report as JSON
pdflinkcheck analyze "TE Maxson WWTF O&M Manual.pdf" --export-format JSON

# Analyze a document but skip the time-consuming remnant check
pdflinkcheck analyze "another_doc.pdf" --no-check-remnants 

# Analyze a document but keep the print block short, showing only the first 10 links for each type
pdflinkcheck analyze "TE Maxson WWTF O&M Manual.pdf" --max-links 10

# Show the GUI for only a moment, like in a build check
pdflinkcheck gui --auto-close 3000 
```
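For batch work, the same commands can be assembled from Python. The following is a minimal sketch, not part of `pdflinkcheck` itself: the helper names `build_analyze_command` and `analyze_folder` are hypothetical, and it assumes the `pdflinkcheck` executable is on your `PATH`. Only the flags documented in the tables above are used.

```python
import shlex
import subprocess
from pathlib import Path


def build_analyze_command(pdf_path, check_remnants=True, max_links=0, export_format=None):
    """Assemble the argument list for one `pdflinkcheck analyze` run (hypothetical helper)."""
    cmd = ["pdflinkcheck", "analyze", str(pdf_path)]
    if not check_remnants:
        cmd.append("--no-check-remnants")
    if max_links:
        # 0 means "show all", which is already the documented default.
        cmd += ["--max-links", str(max_links)]
    if export_format:
        cmd += ["--export-format", export_format]
    return cmd


def analyze_folder(folder, **options):
    """Run the CLI once per PDF in `folder`; return {filename: exit code}."""
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        completed = subprocess.run(build_analyze_command(pdf, **options))
        results[pdf.name] = completed.returncode
    return results


if __name__ == "__main__":
    # Print the command that would run, quoted for copy-paste into a shell.
    print(shlex.join(build_analyze_command("TE Maxson WWTF O&M Manual.pdf", max_links=10)))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces or `&`.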


-----

### 📦 Library Access (Advanced)

For developers importing `pdflinkcheck` into other Python projects, the core analysis functions are exposed directly in the root namespace:

|**Function**|**Description**|
|---|---|
|`run_analysis()`|**(Primary function)** Performs the full analysis, prints to console, and handles file export.|
|`extract_links()`|Low-level function to retrieve all explicit links (URIs, GoTo, etc.) from a PDF path.|
|`extract_toc()`|Low-level function to extract the PDF's internal Table of Contents (bookmarks/outline).|

```python
from pdflinkcheck.analyze import run_analysis, extract_links, extract_toc
```

-----

### ✨ Features

  * **Active Link Extraction:** Identifies and categorizes all programmed links (External URIs, Internal GoTo/Destinations, Remote Jumps).
  * **Anchor Text Retrieval:** Extracts the visible text corresponding to each link's bounding box.
  * **Remnant Detection:** Scans the document's text layer for unlinked URIs and email addresses that should potentially be converted into active links.
  * **Structural TOC:** Extracts the PDF's internal Table of Contents (bookmarks/outline).
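
To illustrate the idea behind remnant detection, here is a minimal sketch that scans plain text for URL-like and email-like strings. It is an illustration only and is not the pattern set `pdflinkcheck` actually uses; the function name `find_remnants` and both regular expressions are assumptions made for this example.

```python
import re

# Deliberately simple patterns; real-world URL and email matching has many edge cases.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TRAILING_PUNCTUATION = ".,;:!?)]}"


def find_remnants(text):
    """Return URL-like and email-like strings found in plain text."""
    # Strip sentence punctuation that the greedy URL pattern picks up at the end.
    urls = [m.group().rstrip(TRAILING_PUNCTUATION) for m in URL_PATTERN.finditer(text)]
    emails = [m.group() for m in EMAIL_PATTERN.finditer(text)]
    return {"urls": urls, "emails": emails}


sample = "See https://example.com/manual. Questions? Email ops@example.org or visit www.example.net."
print(find_remnants(sample))
```

In a real pipeline the text would come from each page's text layer, and matches that already overlap an active link would be filtered out so that only unlinked remnants are reported.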

-----

### 📜 License Implications (AGPLv3+)

**pdflinkcheck is licensed under the GNU Affero General Public License version 3 or later (AGPLv3+).**

This license has significant implications for **distribution and network use**, particularly for organizations:

  * **Source Code Provision:** If you distribute this tool (modified or unmodified) to anyone, you **must** provide the full source code under the same license.
  * **Network Interaction (Affero Clause):** If you modify this tool and make the modified version available to users over a computer network (e.g., as a web service or backend), you **must** also offer the source code to those network users.

> **Before deploying or modifying this tool for organizational use, especially for internal web services or distribution, please ensure compliance with the AGPLv3+ terms.**

-----

### ⚠️ Compatibility Notes

  * **Platform Compatibility:** This tool relies on the `PyMuPDF` library. In all testing so far, it has failed to run in a **Termux (Android)** environment because PyMuPDF's underlying C/C++ libraries do not compile there. Use it on a standard Linux, macOS, or Windows operating system.
  * **Document Compatibility:** While `pdflinkcheck` uses the robust PyMuPDF library, not all PDF files can be processed successfully. This tool is designed primarily for digitally generated (vector-based) PDFs.
    Processing may fail or yield incomplete results for:
      * **Scanned PDFs** (images of text) that lack an accessible text layer.
      * **Encrypted or Password-Protected** documents.
      * **Malformed or non-standard** PDF files.
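
A quick pre-flight check can flag these cases before a full analysis. The sketch below is an assumption-laden illustration, not part of `pdflinkcheck`: the names `has_text_layer` and `preflight` and the status strings are hypothetical, and it calls PyMuPDF directly (`pymupdf.open`, `Document.needs_pass`, `Page.get_text`).

```python
def has_text_layer(page_texts, min_chars=1):
    """True if any page's extracted text has at least `min_chars` non-space characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def preflight(pdf_path):
    """Return 'ok', 'encrypted', 'no-text-layer', or 'unreadable' (hypothetical statuses)."""
    import pymupdf  # imported lazily so has_text_layer works without PyMuPDF installed

    try:
        with pymupdf.open(pdf_path) as doc:
            if doc.needs_pass:
                return "encrypted"
            texts = [page.get_text() for page in doc]
    except Exception:
        # Missing, malformed, or otherwise unopenable files end up here.
        return "unreadable"
    return "ok" if has_text_layer(texts) else "no-text-layer"
```

A `no-text-layer` result suggests a scanned document: link annotations may still be present, but remnant detection and anchor text retrieval have nothing to work with.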

-----

### Run from Source (Developers)

```bash
git clone https://github.com/city-of-memphis-wastewater/pdflinkcheck.git
cd pdflinkcheck
uv sync
uv run python src/pdflinkcheck/cli.py --help
```
