Metadata-Version: 2.4
Name: pagepull
Version: 0.2.0
Summary: Extract and transform HTML page content with composable CLI tools
License: MIT
Author: Neil Johnson
Author-email: neil@cadent.com
Requires-Python: >=3.11,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: beautifulsoup4 (>=4.12,<5.0)
Requires-Dist: markdownify (>=1.2.2,<2.0.0)
Requires-Dist: requests (>=2.31,<3.0)
Description-Content-Type: text/markdown

# pagepull

Extract structured data from HTML pages via the command line.

pagepull wraps [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) behind a simple CLI, turning common DOM extraction tasks into one-liners. Think of it as `jq` for HTML.

## Install

```bash
pip install pagepull
```

Or with pipx for an isolated install:

```bash
pipx install pagepull
```

## Quick Start

```bash
# Extract the content div from a WordPress page
pagepull div entry-content https://example.com/about

# Same thing, as markdown
pagepull div entry-content --markdown https://example.com/about

# List images and check for missing alt text
pagepull images --alt https://example.com/about

# Pull meta tags
pagepull meta --title --description https://example.com/about

# Use any CSS selector
pagepull select "nav.primary a" page.html
```

## Input

pagepull accepts three input types:

```bash
# Local file
pagepull div content page.html

# URL (fetched automatically)
pagepull div content https://example.com/page

# stdin
curl -s https://example.com | pagepull div content
```
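The three input types above could be told apart roughly as follows. This is a minimal sketch, not pagepull's actual code: the `classify_source` and `read_html` names are hypothetical, and it uses the standard library's `urllib` rather than the `requests` dependency so that it runs without extra installs.

```python
import sys
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def classify_source(source):
    """Return 'stdin', 'url', or 'file' for a command-line input argument."""
    if source is None:
        # No positional input given: assume HTML is being piped in.
        return "stdin"
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    return "file"


def read_html(source=None, stdin=sys.stdin):
    """Load HTML text from stdin, a URL, or a local file path."""
    kind = classify_source(source)
    if kind == "stdin":
        return stdin.read()
    if kind == "url":
        # Network fetch (hypothetical stand-in for a requests-based fetch).
        with urlopen(source, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    return Path(source).read_text(encoding="utf-8")
```

Checking the URL scheme first means a bare name such as `page.html` is always treated as a local file.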

## Commands

### `div` — Extract a div by class or id

```bash
pagepull div entry-content page.html
pagepull div sidebar --by id page.html
pagepull div entry-content --strip script,style --markdown page.html
```

### `images` — List images with metadata

```bash
pagepull images page.html
pagepull images --alt --dimensions page.html
pagepull images --json page.html
```

Use `--alt` to show alt text (images with no alt text are flagged as `[MISSING]`) and `--dimensions` to show width and height.
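To illustrate the kind of check behind `--alt`, here is a small sketch using only the standard library's `html.parser` (pagepull itself wraps BeautifulSoup, so this is not its implementation). The `ImageCollector` and `list_images` names are hypothetical, and the sketch assumes that only an absent `alt` attribute counts as missing; an empty `alt=""` is kept as-is.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (src, alt) pairs for every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # Assumption: flag only images that have no alt attribute at all.
        label = "[MISSING]" if alt is None else alt
        self.images.append((attributes.get("src") or "", label))


def list_images(html):
    """Return a list of (src, alt-or-[MISSING]) tuples in document order."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.images
```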

### `meta` — Extract meta tags

```bash
pagepull meta page.html                         # all meta tags
pagepull meta --title --description page.html   # specific tags
pagepull meta --og page.html                    # Open Graph tags
```
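As a rough picture of what these options select, the sketch below gathers the page title and `<meta>` tags with the standard library's `html.parser`. It is an illustration under assumptions, not pagepull's code: `MetaCollector`, `extract_meta`, and the `og_only` parameter are hypothetical names, and Open Graph tags are assumed to be those whose `property` starts with `og:`.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the <title> text and name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            # Standard tags use name=..., Open Graph tags use property=...
            key = attributes.get("name") or attributes.get("property")
            if key and attributes.get("content") is not None:
                self.meta[key] = attributes["content"]

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_meta(html, og_only=False):
    """Return all meta tags plus the title, or only the og:* tags."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    if og_only:
        return {k: v for k, v in collector.meta.items() if k.startswith("og:")}
    return {"title": collector.title.strip(), **collector.meta}
```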

### `links` — List all links

```bash
pagepull links page.html
pagepull links --external-only page.html
pagepull links --csv page.html
```
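One plausible reading of `--external-only` is "links whose host differs from the page's host". The sketch below shows that idea with the standard library only. The `LinkCollector` and `external_links` names and the `base_url` parameter are hypothetical, and the exact rule pagepull applies (for example, how it treats subdomains) is not stated here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, base_url):
    """Return absolute http(s) links that point to a different host."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    base_host = urlparse(base_url).netloc.lower()
    external = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() != base_host:
            external.append(absolute)
    return external
```

Non-HTTP schemes such as `mailto:` are skipped in this sketch rather than counted as external.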

### `headings` — Heading hierarchy

```bash
pagepull headings page.html
```

```
h1: Welcome to Our Site
  h2: About Us
  h2: Services
    h3: Web Design
```
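An indented outline like the one above can be produced by indenting each heading by its level. The following is a minimal sketch with the standard library's `html.parser`, assuming two spaces per level; `HeadingCollector` and `outline` are hypothetical names, not part of pagepull.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (level, text) for each heading, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Join text from nested inline tags and collapse whitespace.
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None


def outline(html):
    """Render headings as 'hN: text', indented two spaces per level."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(
        f"{'  ' * (level - 1)}h{level}: {text}"
        for level, text in collector.headings
    )
```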

### `text` — Visible text only

```bash
pagepull text page.html
pagepull text --selector "div.entry-content" page.html
```

### `select` — Raw CSS selector

```bash
pagepull select "nav a" page.html
pagepull select "img[alt='']" --json page.html
pagepull select "h2 + p" --text page.html
```

### `strip` — Remove elements

```bash
pagepull strip script noscript style page.html
```

### `table` — Extract HTML tables

```bash
pagepull table --csv page.html
pagepull table --index 0 --json page.html
```
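For a sense of what table extraction involves, this sketch converts one table to CSV using only the standard library. It is an illustration, not pagepull's implementation: `TableCollector` and `table_to_csv` are hypothetical names, the `index` parameter mirrors the zero-based `--index` flag shown above, and the sketch assumes explicit closing tags and no nested tables.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse internal whitespace so each cell is a single line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(html, index=0):
    """Return the table at the given zero-based index as CSV text."""
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(collector.tables[index])
    return buffer.getvalue()
```

Using the `csv` module rather than joining on commas keeps cells that contain commas or quotes correctly escaped.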

## Global Flags

| Flag | Description |
|------|-------------|
| `--selector <css>` | Scope any command to a CSS selector first |
| `--json` | Structured JSON output |
| `--csv` | CSV output (where applicable) |
| `--markdown` | Convert HTML to markdown |
| `--quiet` | Suppress headers and labels |

## Scoping with `--selector`

Any command can be scoped to a portion of the page:

```bash
# Images only within the article
pagepull images --alt --selector "article" page.html

# Links only in the footer
pagepull links --selector "footer" page.html

# Text from a specific section
pagepull text --selector "div.entry-content" page.html
```

## Pairing with sitewalker

pagepull handles one page. [sitewalker](https://github.com/cadentdev/sitewalker) crawls sites. Together they cover site-wide extraction:

```bash
# Audit alt text across an entire site
sitewalker -p https://example.com | xargs -I{} pagepull images --alt --json {}

# Extract every page title
sitewalker -p https://example.com | xargs -I{} pagepull meta --title {}

# Pull article content as markdown
sitewalker -p https://example.com | xargs -I{} pagepull div content --markdown {}
```

## Development

```bash
git clone git@github.com:cadentdev/pagepull.git
cd pagepull
poetry install
poetry run pytest
```

## Requirements

- Python 3.11+
- Dependencies: beautifulsoup4, requests, markdownify

## License

MIT

