Metadata-Version: 2.4
Name: page-segmenter
Version: 0.1.1
Summary: Logical segmentation of web pages using visual and structural DOM heuristics.
Author-email: Gagandeep Singh <gagan@innerkore.com>
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: playwright>=1.40.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: sphinx; extra == "dev"

# Page Segmenter

A structure-aware tool that splits HTML pages into non-overlapping logical segments (e.g., header, navigation, sidebar, main content, footer, cards). It relies on dynamic visual and structural DOM heuristics rather than content-density metrics.

## Features
- **Logical Structure Parsing**: Infers segments from developer intent, as expressed in DOM structure, ARIA landmarks, and semantic HTML.
- **Dynamic Threshold Configuration**: Adjusts structural thresholds based on `page_type` (e.g., `commerce`, `content`, `marketing`) to handle varied website architectures without splintering or collapsing components.
- **Playwright-Backed Evaluation**: Executes visual heuristics directly in the browser context to account for exact layout constraints (widths, heights, visibilities).
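
To illustrate what running heuristics in the browser context can look like, the sketch below reads bounding boxes and computed visibility through Playwright's `page.evaluate`. The JavaScript snippet, the `collect_boxes` and `demo` helpers, and the default selector are assumptions made for illustration; they are not taken from the package's source.

```python
import asyncio

# Hypothetical JavaScript: runs inside the page and returns one record per element.
COLLECT_JS = """
(selector) => Array.from(document.querySelectorAll(selector)).map((el) => {
  const rect = el.getBoundingClientRect();
  const style = window.getComputedStyle(el);
  return {
    tag: el.tagName.toLowerCase(),
    width: rect.width,
    height: rect.height,
    visible: style.display !== "none" && style.visibility !== "hidden",
  };
})
"""


async def collect_boxes(page, selector="header, nav, main, aside, footer"):
    """Return layout records for elements matching `selector` (hypothetical helper)."""
    return await page.evaluate(COLLECT_JS, selector)


async def demo(html):
    """Load an HTML string in headless Chromium and collect its layout records."""
    # Imported here so collect_boxes can be used without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.set_content(html)
        boxes = await collect_boxes(page)
        await browser.close()
        return boxes
```

Calling `asyncio.run(demo("<main>Hello</main>"))` exercises the sketch end to end, provided Playwright's Chromium build is installed. Measuring in the page means the widths and heights reflect the rendered layout rather than values guessed from markup.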

## Installation

You can install the package directly from PyPI (once published):

```bash
pip install page-segmenter
```

Playwright drives a real browser, so its browser binaries must also be available. If they are not already installed, Playwright's own `playwright install` command downloads them.

For development, clone the repository and install it using `make`:

```bash
git clone https://github.com/innerkorehq/page_segmenter.git
cd page_segmenter
make install
# or for dev dependencies
make install-dev
```

## Usage

### Command-Line Interface (CLI)

You can segment a live URL directly from the terminal. Use the optional `--type` argument to apply type-specific heuristics.

```bash
python main.py "https://example.com" --type "marketing"
```

You can also segment a local HTML file:

```bash
python main.py --html ./path/to/page.html --type "doc_page"
```

### Visual Tester

To visually debug and inspect the detected segments inside a browser window:

```bash
python visual_tester.py "https://example.com" --type "product_list"
```

### Python API

You can use the segmenter programmatically in your asynchronous Python applications:

```python
import asyncio
import json
from page_segmenter import find_segments, find_segments_from_html

async def main():
    # Segment a live URL
    url = "https://docs.python.org/3/"
    segments = await find_segments(url, page_type="doc_page")
    print(json.dumps(segments, indent=2))

    # Segment from raw HTML
    html_content = "<html>...</html>"
    segments = await find_segments_from_html(
        html_content,
        base_url="https://example.com",
        page_type="commerce",
    )
    print(json.dumps(segments, indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```
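
When several pages need segmenting, the calls can be run concurrently. The following is a minimal sketch, assuming the segmenter coroutine accepts `(url, page_type=...)` as in the example above; `segment_many` and its `limit` parameter are hypothetical names and not part of the package.

```python
import asyncio


async def segment_many(urls, segment_fn, page_type, limit=3):
    """Segment several URLs concurrently, at most `limit` at a time.

    `segment_fn` is any coroutine function callable as
    segment_fn(url, page_type=...), for example `find_segments`.
    Returns a dict mapping each URL to its result, or to the
    exception raised while processing it.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(url):
        async with semaphore:
            try:
                return url, await segment_fn(url, page_type=page_type)
            except Exception as exc:  # keep going if one page fails
                return url, exc

    pairs = await asyncio.gather(*(run_one(u) for u in urls))
    return dict(pairs)
```

With the package installed, `asyncio.run(segment_many(urls, find_segments, "doc_page"))` would apply it to the real segmenter. Whether `find_segments` is safe to call concurrently is not stated here, so `limit=1` is the conservative choice.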

## How It Works

The segmenter processes the DOM in a series of logical phases:
1. **Pruning**: Discards invisible nodes, non-content noise (such as `script` elements or modals), and microscopic elements.
2. **Decision Logic**: Recursively traverses the DOM, evaluating ARIA landmarks, semantic tags, parent identity scores (padding, borders, shadows), raw text density, structural similarity (card grids), and orphaned child checks.
3. **Adaptive Thresholds**: Adjusts internal thresholds (such as `MIN_SUBTREE_NODES` or `MIN_HEIGHT`) according to the family of the supplied `page_type` (`commerce`, `content`, `marketing`, etc.).

Read the full algorithm details in [algo.md](algo.md).
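
As a rough illustration of the pruning and adaptive-threshold phases, the sketch below works on a simplified in-memory tree. Only the names `MIN_SUBTREE_NODES` and `MIN_HEIGHT` and the three family names come from this document. The numeric values, the `Node` class, the noise-tag list, and every function name are assumptions, and the decision logic is reduced to a single size check.

```python
from dataclasses import dataclass, field

# Hypothetical values; the real thresholds are not documented here.
DEFAULT_THRESHOLDS = {"MIN_SUBTREE_NODES": 5, "MIN_HEIGHT": 40}
FAMILY_OVERRIDES = {
    "commerce": {"MIN_SUBTREE_NODES": 3},
    "content": {"MIN_HEIGHT": 60},
    "marketing": {"MIN_SUBTREE_NODES": 8, "MIN_HEIGHT": 80},
}
NOISE_TAGS = {"script", "style", "noscript"}  # assumed noise list


@dataclass
class Node:
    tag: str
    width: float = 0.0
    height: float = 0.0
    visible: bool = True
    children: list = field(default_factory=list)


def thresholds_for(family):
    """Merge family-specific overrides onto the defaults."""
    return {**DEFAULT_THRESHOLDS, **FAMILY_OVERRIDES.get(family, {})}


def subtree_size(node):
    """Count the node itself plus all of its descendants."""
    return 1 + sum(subtree_size(c) for c in node.children)


def prune(node, min_size=2.0):
    """Return a copy of the tree without invisible, noisy, or tiny nodes."""
    if not node.visible or node.tag in NOISE_TAGS:
        return None
    if node.width < min_size or node.height < min_size:
        return None
    kept = [p for p in (prune(c, min_size) for c in node.children) if p is not None]
    return Node(node.tag, node.width, node.height, node.visible, kept)


def is_segment_candidate(node, family):
    """Check subtree size and height against the family's thresholds."""
    t = thresholds_for(family)
    return subtree_size(node) >= t["MIN_SUBTREE_NODES"] and node.height >= t["MIN_HEIGHT"]
```

Keeping the overrides in a per-family table means the same traversal code can serve every page type, with only the numbers changing.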

## Development

A `Makefile` is included to streamline development tasks:
- `make install`: Install the project.
- `make install-dev`: Install with development dependencies.
- `make build`: Build the distribution packages (`sdist` and `wheel`).
- `make publish`: Build and publish the package to PyPI using twine.
- `make clean`: Clean up build artifacts and cache directories.
- `make lint`: Run basic syntax checks.
- `make docs`: Build Sphinx documentation.
- `make test-run`: Run a quick smoke test.
