Metadata-Version: 2.4
Name: page-gap-scanner
Version: 0.1.2
Summary: CLI tool to compare two URLs and generate a topic & internal linking gap report as CSV.
Author-email: Amal Alexander <amalalex95@gmail.com>
License: MIT
Keywords: seo,internal links,content gaps,python,cli
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Environment :: Console
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: typer>=0.12.0
Dynamic: license-file

# page-gap-scanner

`page-gap-scanner` is a small, focused SEO CLI tool that compares **two URLs** from the same site and generates a **topic & internal-link gap report as CSV**.

The idea is simple:

- You have two pages that might overlap in intent.
- One of them should be the **hero/winner** page.
- The other should probably **support & link to it**.
- This tool compares both pages, extracts topics and headings, and shows you what the **supporter page is missing**, along with suggested anchors and a ready-made CSV you can hand to content or dev teams.

Designed for:
- SEOs who want fast, opinionated insights
- Internal linking & topical cluster work
- Pre-work before consolidation / canonical decisions

---

## Installation

Once the package is on PyPI, install it with:

```bash
pip install page-gap-scanner
```

For local development (from source):

```bash
git clone <your-repo-url>
cd page-gap-scanner
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .
```

---

## CLI Usage

Basic usage (two-URL mode):

```bash
page-gap-scanner scan https://example.com/page-a https://example.com/page-b --output gaps.csv
```

Arguments:

- `url1` – first URL
- `url2` – second URL (either may end up as winner or supporter; see below)
- `--output` – path for the CSV file (default: `gaps.csv`)

Example:

```bash
page-gap-scanner scan \
  https://example.com/credit-card-guide \
  https://example.com/credit-card-fees \
  --output cc_gaps.csv
```

This will:

1. Fetch both URLs.
2. Extract:
   - Page title
   - H1–H3 headings
   - Basic topic/phrase candidates from visible text.
3. Decide which URL is the **winner** (more depth & structure).
4. Find topics the winner has that the supporter **does not**.
5. Generate a CSV suggesting how the supporter should link to the winner.
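Step 1 above is a plain HTTP GET. A minimal sketch of what `fetch.py` might look like is below; the `fetch_html` name, User-Agent string, and timeout default are illustrative, not the published API:

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and raise on HTTP errors (4xx/5xx)."""
    resp = requests.get(
        url,
        timeout=timeout,  # never hang on a slow host
        headers={"User-Agent": "page-gap-scanner/0.1"},
    )
    resp.raise_for_status()
    return resp.text
```

Setting an explicit timeout and calling `raise_for_status()` keeps a bad URL from silently producing an empty gap report.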

---

## Output CSV

Each row represents **one missing topic** that the supporter page could cover, with a suggested link back to the winner.

Columns:

- `missing_topic` – topic/phrase found on winner page but not on supporter page.
- `winner_page` – URL that should receive internal links / authority.
- `supporter_page` – URL that should add the link.
- `recommended_change` – human-readable suggestion.
- `suggested_anchor` – example anchor text.
- `relevance_score` – rough 1–100 score (higher = more important topic).

Sample:

```csv
missing_topic,winner_page,supporter_page,recommended_change,suggested_anchor,relevance_score
"international transaction charges","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Add a short section about 'international transaction charges' on the supporter page and link to the winner page.","learn more about international transaction charges",84
"annual fee waiver","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Mention 'annual fee waiver' and link to the full guide for details.","full guide on annual fee waivers",78
```
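A sketch of how `export.py` could emit rows in exactly this shape, using the standard library's `csv.DictWriter` (the `write_gap_csv` name is illustrative):

```python
import csv

# Column order matches the report format described above.
COLUMNS = [
    "missing_topic",
    "winner_page",
    "supporter_page",
    "recommended_change",
    "suggested_anchor",
    "relevance_score",
]


def write_gap_csv(rows, path="gaps.csv"):
    """Write one dict per missing topic to a CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

`DictWriter` also quotes values containing commas automatically, which matters for the free-text `recommended_change` column.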

---

## How the “winner” page is chosen

Right now, the logic is intentionally simple and transparent:

- Fetch HTML for both URLs.
- Extract visible text and headings.
- Compute a basic **content score** per page:
  - more words → higher score
  - more H1/H2/H3 headings → higher score

The page with the higher score is treated as the **winner**.  
The other becomes the **supporter**.

> In other words: the deeper, better-structured page should typically be your hero page.

You can change this logic later (e.g., integrate crawl data, link counts, or external metrics).
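The heuristic fits in a few lines. The sketch below uses a crude regex tag-strip instead of the project's BeautifulSoup parsing, and the heading weight of 50 is an illustrative choice, not the actual `compare.py` logic:

```python
import re


def content_score(html: str) -> int:
    """More words and more H1-H3 headings both raise the score."""
    # Strip tags crudely to approximate visible text.
    text = re.sub(r"<[^>]+>", " ", html)
    words = len(text.split())
    headings = len(re.findall(r"<h[123]\b", html, flags=re.IGNORECASE))
    return words + 50 * headings  # heading weight is illustrative


def pick_winner(url_a, html_a, url_b, html_b):
    """Return (winner_url, supporter_url) by content score."""
    if content_score(html_a) >= content_score(html_b):
        return url_a, url_b
    return url_b, url_a
```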

---

## Topic extraction (lightweight)

To avoid heavy NLP dependencies, `page-gap-scanner` uses a lightweight approach:

- Collects:
  - `<title>`
  - `<h1>`, `<h2>`, `<h3>`
  - Some visible text snippets
- Splits text into word phrases.
- Filters out:
  - very short tokens
  - common stopwords
- Normalises to lowercase and de-duplicates.

This keeps the tool:

- Fast
- Easy to install
- Safe to run in simple environments or CI
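A minimal sketch of that pipeline for single-word topics; the stopword set here is a tiny illustrative subset, and the real extractor may also build multi-word phrases:

```python
import re

# Illustrative subset; a real stopword list is much larger.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "on", "with"}


def extract_topics(texts, min_len=4):
    """Lowercase, tokenize, drop stopwords and short tokens, de-duplicate.

    Insertion order is preserved, so earlier sources (title, H1)
    surface their topics first.
    """
    seen = {}
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if len(token) >= min_len and token not in STOPWORDS:
                seen.setdefault(token, True)
    return list(seen)
```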

---

## Example console output

When you run the command, you’ll see something like:

```text
Scanning:
  Winner candidate A: https://example.com/page-a
  Winner candidate B: https://example.com/page-b

Winner selected: https://example.com/page-a
Supporter:       https://example.com/page-b

Found 17 missing topics on supporter page.
CSV written to: gaps.csv
```

---

## Project structure

```text
page-gap-scanner/
  pyproject.toml
  README.md
  LICENSE
  page_gap_scanner/
    __init__.py
    cli.py
    compare.py
    fetch.py
    extract.py
    export.py
    utils.py
```

Key modules:

- `cli.py` – defines the Typer-based CLI (`page-gap-scanner`).
- `fetch.py` – fetches HTML safely.
- `extract.py` – extracts headings & topics.
- `compare.py` – core gap logic, winner/supporter decision.
- `export.py` – writes the CSV file.
- `utils.py` – small helpers.
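A skeleton of what the Typer-based `cli.py` might look like, matching the `scan` command shown earlier; the body is stubbed, and the real implementation wires in fetch, extract, compare, and export:

```python
import typer

app = typer.Typer(help="Compare two URLs and emit a topic-gap CSV.")


@app.command()
def scan(
    url1: str = typer.Argument(..., help="First URL."),
    url2: str = typer.Argument(..., help="Second URL."),
    output: str = typer.Option("gaps.csv", "--output", help="Path for the CSV file."),
) -> None:
    """Fetch both pages, find gaps, write the CSV."""
    typer.echo(f"Scanning {url1} and {url2} -> {output}")
    # ... fetch, extract, compare, export ...


if __name__ == "__main__":
    app()
```

Because the app has a single command, Typer exposes it directly, so `page-gap-scanner scan ...` works once the console entry point is registered in `pyproject.toml`.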

---

## Development & contribution

1. Clone the repository.
2. Create and activate a virtual environment.
3. Install dependencies in editable mode:

```bash
pip install -e ".[dev]"
```

4. Run the CLI locally:

```bash
page-gap-scanner scan https://example.com/a https://example.com/b
```

---

## Author

**Name:** Amal Alexander  
**Email:** <amalalex95@gmail.com>

Feel free to fork, tweak, and adapt this tool into your own SEO workflow.

---

## License

This project is licensed under the **MIT License**. See the [`LICENSE`](LICENSE) file for details.
