Metadata-Version: 2.4
Name: page-segmenter
Version: 0.1.0
Summary: Logical segmentation of web pages using visual and structural DOM heuristics.
Author-email: DeepMind Agent <agent@example.com>
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: playwright>=1.40.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: sphinx; extra == "dev"

# Logical Segmentation Algorithm

## Goal

Cover all user-visible content on a page with non-overlapping logical segments — header, nav, sidebar, main content, footer, card grid, breadcrumb, etc. — each identified by a unique CSS selector and bounding box. Every visible region belongs to exactly one segment.

---

## Core Principle

The page already has logical segments, so segmentation is a structural parsing problem, not a content-finding problem. The algorithm reads developer intent from the DOM rather than inferring it from content density.

A node is either a **segment leaf** (owns its subtree) or a **layout container** (transparent — descend into its children). A node never appears in both roles.

---

## Dynamic Threshold Configuration

The segmenter adapts its pruning and structural heuristics based on an optionally provided `page_type` string (e.g., `product_list`, `doc_page`, `homepage`). Page types are mapped to logical "families" which adjust the baseline configuration:

- **Baseline/Default**: `MIN_WIDTH`=80, `MIN_HEIGHT`=20, `MIN_SUBTREE_DEPTH`=3, `MIN_SUBTREE_NODES`=10, `COMPONENT_SCORE_THRESHOLD`=4
- **Commerce** (Product Lists): Relaxes node/depth checks (`MIN_SUBTREE_DEPTH`=2, `MIN_SUBTREE_NODES`=5) to ensure small product cards aren't skipped.
- **Content/Docs** (Blogs, FAQ): Lowers `COMPONENT_SCORE_THRESHOLD` to 3, because text-heavy pages often lack hard visual boundaries (borders, shadows) but still have distinct sections. It also relaxes the node/depth checks, similarly to Commerce.
- **Marketing** (Homepages): Raises `MIN_SUBTREE_NODES` to 15 so that cohesive, visually rich hero sections do not splinter into small fragments.
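
The family overrides above can be sketched as a baseline dict merged with per-family deltas. This is a minimal illustration, not the package's actual configuration code: the names `BASELINE`, `FAMILY_OVERRIDES`, `PAGE_TYPE_FAMILY`, and `resolve_thresholds` are hypothetical, and the Content/Docs depth and node values are an assumption, since the text only says they are relaxed "similarly".

```python
# Baseline values as listed in this section.
BASELINE = {
    "MIN_WIDTH": 80,
    "MIN_HEIGHT": 20,
    "MIN_SUBTREE_DEPTH": 3,
    "MIN_SUBTREE_NODES": 10,
    "COMPONENT_SCORE_THRESHOLD": 4,
}

# Per-family deltas. The "content" depth/node values are assumed to match
# "commerce"; the document does not state them explicitly.
FAMILY_OVERRIDES = {
    "commerce": {"MIN_SUBTREE_DEPTH": 2, "MIN_SUBTREE_NODES": 5},
    "content": {
        "COMPONENT_SCORE_THRESHOLD": 3,
        "MIN_SUBTREE_DEPTH": 2,  # assumption
        "MIN_SUBTREE_NODES": 5,  # assumption
    },
    "marketing": {"MIN_SUBTREE_NODES": 15},
}

# Hypothetical page_type -> family mapping, using the example page types.
PAGE_TYPE_FAMILY = {
    "product_list": "commerce",
    "doc_page": "content",
    "homepage": "marketing",
}


def resolve_thresholds(page_type=None):
    """Return the baseline merged with the overrides of the page type's family."""
    family = PAGE_TYPE_FAMILY.get(page_type)
    return {**BASELINE, **FAMILY_OVERRIDES.get(family, {})}
```

An unknown or missing `page_type` falls back to the baseline unchanged.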

---

## Phase 1 — Pruning

Remove nodes that are never user-visible before any analysis:

- Non-visible tags: `script`, `style`, `noscript`, `meta`, `svg`, `iframe`, `template`
- Hidden elements: `display:none`, `visibility:hidden`, `opacity:0`, zero bounding box
- Elements smaller than `MIN_WIDTH`×`MIN_HEIGHT` (true UI atoms — tiny badges, invisible spacers)
- Transient noise overlays: `modal`, `popup`, `overlay`, `toast`, `tooltip`, `dropdown`, `cookie`, `consent`

**What is NOT pruned (changed from earlier design):**

- `nav`, `header`, `footer` at any depth — these are user-visible page zones
- Elements with ARIA roles `navigation`, `banner`, `contentinfo`, `complementary`, `main`, `search` — these are all user-visible landmark zones. ARIA landmark roles are treated the same as semantic HTML5 tags.
- `<nav>` in particular was previously blanket-pruned at every depth. It is now a member of `SEMANTIC_SEGMENT_TAGS` and is always a leaf segment.
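
The pruning rules above could be expressed as a single predicate. The sketch below is illustrative only: `NodeInfo` and `should_prune` are hypothetical names, the size rule is assumed to prune when either dimension is below its minimum, and the noise keywords are assumed to be matched as substrings of the class and id names. Landmark tags and roles survive simply because nothing in the predicate targets them.

```python
from dataclasses import dataclass

PRUNE_TAGS = {"script", "style", "noscript", "meta", "svg", "iframe", "template"}
NOISE_KEYWORDS = (
    "modal", "popup", "overlay", "toast",
    "tooltip", "dropdown", "cookie", "consent",
)


@dataclass
class NodeInfo:
    """Hypothetical snapshot of the computed state of one DOM element."""
    tag: str
    width: float
    height: float
    display: str = "block"
    visibility: str = "visible"
    opacity: float = 1.0
    class_and_id: str = ""  # class names and id joined into one string
    role: str = ""


def should_prune(node, min_width=80, min_height=20):
    """Return True if the node is never user-visible under the Phase 1 rules."""
    if node.tag in PRUNE_TAGS:
        return True
    if node.display == "none" or node.visibility == "hidden" or node.opacity == 0:
        return True
    # Covers the zero bounding box case as well as true UI atoms.
    # Assumption: either dimension below its minimum is enough to prune.
    if node.width < min_width or node.height < min_height:
        return True
    # Assumption: keywords are matched as substrings of class/id names.
    names = node.class_and_id.lower()
    return any(keyword in names for keyword in NOISE_KEYWORDS)
```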

---

## Phase 2 — Decision Logic (per node, recursive)

At each surviving node, run these checks in order. Stop at the first match.

### 2a. Leaf tags
`pre`, `code`, `table`, `ul`, `ol`, `figure`, `blockquote`, `video`, `audio`, `canvas`, `picture` — always a segment leaf, never descend. These are atomic content units.

### 2b. ARIA landmark roles
If the element has an ARIA landmark role (`navigation`, `banner`, `contentinfo`, `complementary`, `main`, `search`, `form`, `region`), it is always declared a segment leaf. ARIA landmarks are developer-encoded zone boundaries equivalent to semantic HTML5 tags.

### 2c. Semantic segment tags
`section`, `article`, `aside`, `form`, `nav`, `header`, `footer` — always a segment leaf. The developer used a semantic tag to explicitly mark a component boundary — trust it unconditionally.

`main` — trust only when not a full-width layout wrapper (width < 95% viewport) OR when a hard visual signal is present (border, shadow, radius, background isolation).

### 2d. Parent identity check
Score the node on visual signals. The check fires, declaring the node a segment leaf, only if at least one **hard signal** (background-isolation, border, box-shadow, border-radius) is present and the total score is ≥ `COMPONENT_SCORE_THRESHOLD` (4 by default). If visible siblings with content exist, fall through instead: the parent owns this node and its siblings together.

**Scoring:**
| Signal | Points |
|---|---|
| Background differs from parent (non-transparent) | +1 |
| Has border | +2 |
| Has box-shadow | +2 |
| Has border-radius > 0 | +1 |
| Padding ≥ 16px on any side | +1 |
| Spatial gap ≥ 16px above (from previous sibling) | +1 |
| Compositional completeness (≥2 of: text, media, interactive) | +2 |

**Guards that override scoring:**
- Full-width (≥95% viewport) with no hard isolation → score zeroed
- Height < `MIN_HEIGHT` → score capped at 2 (true UI atoms only — nav bars are typically 35-60px and are not capped)
- Child diversity > 3 (more than 3 independently meaningful or semantic children) → `background-isolation` is demoted and removed from score. A canvas wrapper has many diverse children; a real component does not.
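
A minimal sketch of the scoring table and its guards, under stated assumptions: the function name `identity_check` and the signal labels `box-shadow`, `border-radius`, `padding`, and `spatial-gap` are hypothetical (only `border`, `background-isolation`, and `compositional-completeness` appear elsewhere in this document), and signal detection itself is assumed to have happened already.

```python
HARD_SIGNALS = {"background-isolation", "border", "box-shadow", "border-radius"}

POINTS = {
    "background-isolation": 1,
    "border": 2,
    "box-shadow": 2,
    "border-radius": 1,
    "padding": 1,
    "spatial-gap": 1,
    "compositional-completeness": 2,
}


def identity_check(signals, width, height, viewport_width, child_diversity=0,
                   has_content_siblings=False, min_height=20, threshold=4):
    """Return (score, fires) for one node given its detected signal names."""
    active = {s for s in signals if s in POINTS}
    # Guard: many diverse children means the background is a canvas reset.
    if child_diversity > 3:
        active.discard("background-isolation")
    hard = active & HARD_SIGNALS
    score = sum(POINTS[s] for s in active)
    # Guard: full-width nodes without hard isolation are layout wrappers.
    if width >= 0.95 * viewport_width and not hard:
        score = 0
    # Guard: true UI atoms can never reach the threshold.
    if height < min_height:
        score = min(score, 2)
    fires = bool(hard) and score >= threshold and not has_content_siblings
    return score, fires
```

For example, a bordered node with compositional completeness scores 2 + 2 = 4 and fires, while a node with only padding, a spatial gap, and compositional completeness also scores 4 but does not fire because no hard signal is present.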

### 2e. Raw text node check
If the node has direct non-whitespace text node children, it is a content-mixed node that cannot be safely split. Declare as leaf immediately.

### 2f. Structural similarity check
If ≥75% of direct children share the same deep fingerprint (tag + children tags + grandchildren tags), and there are ≥3 children, this is a repeating pattern (card grid, product list). The parent owns the repetition — declare as leaf.
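
One way to picture the deep fingerprint is as a tuple of the node's tag, its children's tags, and each child's children's tags. The sketch below assumes a simplified in-memory `Node` type (hypothetical) rather than a live DOM, and the function names are illustrative.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag and its element children."""
    tag: str
    children: list = field(default_factory=list)


def fingerprint(node):
    """Tag + children tags + grandchildren tags, as a hashable tuple."""
    child_tags = tuple(c.tag for c in node.children)
    grandchild_tags = tuple(tuple(g.tag for g in c.children) for c in node.children)
    return (node.tag, child_tags, grandchild_tags)


def is_repeating_pattern(parent, min_children=3, min_share=0.75):
    """True if enough direct children share the same deep fingerprint."""
    kids = parent.children
    if len(kids) < min_children:
        return False
    counts = Counter(fingerprint(k) for k in kids)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(kids) >= min_share
```

With four children of which three are identical cards, the share is exactly 75% and the parent is declared a leaf.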

### 2g. Meaningful children
A child is **meaningful** if:
- It is a semantic tag (`SEMANTIC_SEGMENT_TAGS`) OR
- It is a leaf tag (`LEAF_TAGS`) OR
- It has subtree depth ≥ `MIN_SUBTREE_DEPTH` AND node count ≥ `MIN_SUBTREE_NODES`

If no meaningful children exist, the node is a composed unit (heading + shallow elements) — declare as leaf.
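
The meaningful-child test can be sketched as follows. The `Node` type and helper names are hypothetical, and two conventions are assumptions: a node with no children has subtree depth 1, and the node count includes the node itself.

```python
from dataclasses import dataclass, field

SEMANTIC_SEGMENT_TAGS = {"section", "article", "aside", "form", "nav", "header", "footer"}
LEAF_TAGS = {
    "pre", "code", "table", "ul", "ol", "figure",
    "blockquote", "video", "audio", "canvas", "picture",
}


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag and its element children."""
    tag: str
    children: list = field(default_factory=list)


def subtree_depth(node):
    # Assumption: a childless node has depth 1.
    return 1 + max((subtree_depth(c) for c in node.children), default=0)


def subtree_nodes(node):
    # Assumption: the count includes the node itself.
    return 1 + sum(subtree_nodes(c) for c in node.children)


def is_meaningful(node, min_depth=3, min_nodes=10):
    if node.tag in SEMANTIC_SEGMENT_TAGS or node.tag in LEAF_TAGS:
        return True
    return subtree_depth(node) >= min_depth and subtree_nodes(node) >= min_nodes


def is_composed_unit(node, min_depth=3, min_nodes=10):
    """True if the node has no meaningful children and should become a leaf."""
    return not any(is_meaningful(c, min_depth, min_nodes) for c in node.children)
```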

### 2h. Coupled-sibling check
If there are two or three meaningful children and none individually reaches the identity threshold (score ≥ 4 by default), they are functionally coupled (image column + text column, slider + thumbnails). The parent is the component: declare it a leaf. Single-child nodes always descend.
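
As a small illustration of this rule, assuming the identity scores of the meaningful children have already been computed (the function name `is_coupled_group` is hypothetical):

```python
def is_coupled_group(child_scores, threshold=4):
    """True if 2 or 3 meaningful children exist and none reaches the threshold."""
    return 2 <= len(child_scores) <= 3 and all(s < threshold for s in child_scores)
```

A single child or four or more children never count as coupled, so those cases continue to the later checks.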

### 2i. Orphan check
Before descending, verify no non-meaningful sibling would be left stranded. A non-meaningful sibling is **skipped** (not treated as an orphan) when any of the following are true:

- It is hidden or < 10×10px
- Its tag is in `PRUNE_TAGS` (script, style, etc.)
- Its tag is in `LEAF_TAGS` (`ul`, `ol`, `table`, etc.) — independently processable as a leaf segment
- Its tag is in `SEMANTIC_SEGMENT_TAGS` — will be processed independently
- Its height is **< 15% of the largest meaningful child's height** — it is a section title, label, or heading that naturally belongs to the surrounding content and will be silently absorbed

A non-meaningful sibling triggers an orphan stop **only when**:
- It is visible (≥10×10px)
- Its height is **≥ 15%** of the tallest meaningful child (substantial enough to be genuinely stranded)
- It contains visible content (text nodes, `img[src]`, visible headings/links)

If an orphan is detected, the parent declares itself the leaf — no descent.

**Coupling ratio rationale:** The 15% threshold distinguishes two cases:
- A 76px heading next to a 640px product card (ratio = 12%) → heading, absorb silently
- A 200px promo block next to a 640px card (ratio = 31%) → potential orphan, check content

**Why LEAF_TAGS are excluded from orphan detection:** A `<ul class="breadcrumb">` is always an independently processable leaf segment. Treating it as an orphan would cause the entire page wrapper to become one giant segment.
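
The skip conditions and the 15% coupling ratio can be combined into one predicate. This is a sketch under assumptions: `Sibling` and `is_orphan` are hypothetical names, and content detection is reduced to a precomputed boolean.

```python
from dataclasses import dataclass

PRUNE_TAGS = {"script", "style", "noscript", "meta", "svg", "iframe", "template"}
LEAF_TAGS = {
    "pre", "code", "table", "ul", "ol", "figure",
    "blockquote", "video", "audio", "canvas", "picture",
}
SEMANTIC_SEGMENT_TAGS = {"section", "article", "aside", "form", "nav", "header", "footer"}


@dataclass
class Sibling:
    """Hypothetical snapshot of one non-meaningful sibling."""
    tag: str
    width: float
    height: float
    has_visible_content: bool = True
    hidden: bool = False


def is_orphan(sibling, tallest_meaningful_height, ratio=0.15):
    """True if descending would strand this sibling, so the parent must be the leaf."""
    if sibling.hidden or sibling.width < 10 or sibling.height < 10:
        return False
    if sibling.tag in PRUNE_TAGS | LEAF_TAGS | SEMANTIC_SEGMENT_TAGS:
        return False  # pruned anyway, or processed independently
    if sibling.height < ratio * tallest_meaningful_height:
        return False  # small heading or label, absorbed silently
    return sibling.has_visible_content
```

With a 640px card as the tallest meaningful child, the cutoff is 0.15 × 640 = 96px: a 76px heading falls below it and is absorbed, while a 200px promo block with content is reported as an orphan.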

### 2j. Container descent
Recurse into each meaningful child, each semantic-tag child, **and each leaf-tag child**. Non-meaningful, non-semantic, non-leaf siblings are silently absorbed into the parent's segment. If descent yields no results, fall back to declaring the current node a leaf.
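
The overall shape of Phase 2 (first matching check makes a leaf, otherwise descend, with a leaf fallback when descent is empty) can be sketched as a short recursion. The `Node` type here is hypothetical and deliberately simplified: `matched_check` stands in for the result of running checks 2a through 2i, and `descend_into` for the children selected in 2j.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical node with the per-node decisions already computed."""
    name: str
    # Name of the first check (2a to 2i) that matched, or None if none did.
    matched_check: Optional[str] = None
    # Meaningful, semantic-tag, and leaf-tag children chosen for descent.
    descend_into: list = field(default_factory=list)


def segment(node):
    """Return the names of the leaf segments under this node, in document order."""
    if node.matched_check is not None:
        return [node.name]  # segment leaf: owns its subtree
    results = [name for child in node.descend_into for name in segment(child)]
    # Fallback: a container whose descent yields nothing becomes a leaf itself.
    return results or [node.name]
```

Each node ends up in exactly one role: either it is returned as a leaf, or it is transparent and only its descendants appear in the result.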

---

## Role Inference

Role is determined in this priority order:
1. **ARIA role**: `role="navigation"` → `nav`, `role="banner"` → `header`, `role="contentinfo"` → `footer`, `role="complementary"` / `role="search"` → `sidebar`, `role="main"` → `main`
2. **Semantic HTML tag**: `<nav>` → `nav`, `<header>` → `header`, `<footer>` → `footer`, `<aside>` → `sidebar`, `<main>` → `main`, `<article>` → `article`, `<form>` → `form`
3. **Class/id vocabulary scan**: keywords like `hero`, `sidebar`, `card`, `grid`, `cta`, `pricing`, `faq` matched against class + id names
4. **Fallback**: `section`
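
The priority order above maps naturally onto a chain of lookups. In this sketch the function name `infer_role` is hypothetical, and two details are assumptions: vocabulary keywords are matched as substrings of the class and id names, and the matched keyword itself becomes the role.

```python
ARIA_ROLE_MAP = {
    "navigation": "nav",
    "banner": "header",
    "contentinfo": "footer",
    "complementary": "sidebar",
    "search": "sidebar",
    "main": "main",
}

TAG_ROLE_MAP = {
    "nav": "nav",
    "header": "header",
    "footer": "footer",
    "aside": "sidebar",
    "main": "main",
    "article": "article",
    "form": "form",
}

# Only the keywords named in this section; the real vocabulary may be larger.
VOCABULARY = ("hero", "sidebar", "card", "grid", "cta", "pricing", "faq")


def infer_role(tag, aria_role="", class_and_id=""):
    """Return the segment role using ARIA, then tag, then class/id keywords."""
    if aria_role in ARIA_ROLE_MAP:
        return ARIA_ROLE_MAP[aria_role]
    if tag in TAG_ROLE_MAP:
        return TAG_ROLE_MAP[tag]
    names = class_and_id.lower()
    for keyword in VOCABULARY:
        # Assumption: substring match, first keyword in list order wins.
        if keyword in names:
            return keyword
    return "section"
```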

---

## Selector Generation

Each segment gets a unique CSS selector built by walking up the DOM to the nearest ancestor with an `#id`, then constructing a path downward from that anchor using the child combinator (`>`). Each step prefers `#id`, then `tag.class1.class2` (verified unique within the parent's scope), then `tag:nth-of-type(n)`. The final selector is verified unique against the full document. This grounds selectors in meaningful anchors rather than document-relative positional paths.
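
A rough sketch of the per-step preference, using a hypothetical in-memory `El` type instead of a live DOM. It omits the final uniqueness check against the full document and does not escape special characters in ids or class names, so it illustrates the idea rather than the package's selector builder.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity-based equality, so index() finds this exact element
class El:
    tag: str
    id: str = ""
    classes: tuple = ()
    parent: Optional["El"] = None
    children: list = field(default_factory=list)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def step(el):
    """One path step: #id, else tag.classes if unique among siblings, else nth-of-type."""
    if el.id:
        return f"#{el.id}"
    siblings = el.parent.children if el.parent else [el]
    same_tag = [s for s in siblings if s.tag == el.tag]
    if el.classes:
        matches = [s for s in same_tag if set(el.classes) <= set(s.classes)]
        if len(matches) == 1:
            return el.tag + "".join(f".{c}" for c in el.classes)
    return f"{el.tag}:nth-of-type({same_tag.index(el) + 1})"


def build_selector(el):
    """Walk up to the nearest id anchor, then join the steps with ' > '."""
    parts = []
    node = el
    while node is not None:
        parts.append(step(node))
        if node.id:
            break  # anchored: no need to go further up
        node = node.parent
    return " > ".join(reversed(parts))
```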

---

## Output

Each segment is a flat dict:

```json
{
  "selector": "#product_detail > div.container > div.grid",
  "role": "grid",
  "depth": 4,
  "boundingBox": { "x": 50, "y": 300, "width": 1300, "height": 2599 },
  "identityScore": 4,
  "identitySignals": ["border", "compositional-completeness"],
  "children": []
}
```

`children` is always empty in the current flat output model. The tree structure is implicit in `depth` and `selector` ancestry.
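
Because the output is flat, the non-overlap goal can be spot-checked directly on the `boundingBox` values. The helper below is a validation sketch and not part of the algorithm itself; the function names are hypothetical, and boxes that merely share an edge are treated as non-overlapping.

```python
from itertools import combinations


def boxes_overlap(a, b):
    """True if two boundingBox dicts share a region of positive area."""
    return (
        a["x"] < b["x"] + b["width"]
        and b["x"] < a["x"] + a["width"]
        and a["y"] < b["y"] + b["height"]
        and b["y"] < a["y"] + a["height"]
    )


def find_overlaps(segments):
    """Return selector pairs whose bounding boxes overlap."""
    return [
        (s["selector"], t["selector"])
        for s, t in combinations(segments, 2)
        if boxes_overlap(s["boundingBox"], t["boundingBox"])
    ]
```

An empty result suggests the segments tile the page without overlap; it does not by itself prove full coverage.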

---

## Page Load Strategy

For SSR and SPA frameworks (Next.js, React, Vue, etc.) that never reach a true network-idle state (due to keep-alive polling, websockets, or continuous background fetches), a two-phase wait is used:
1. Wait for `domcontentloaded` (reliable, fires after initial HTML parse)
2. Optionally wait up to 5 seconds for `networkidle` — if it times out, proceed with existing DOM

This prevents the 30-second timeout that `wait_until="networkidle"` causes on React/Next.js apps.
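
With Playwright's sync API, the two-phase wait might look like the sketch below. `page.goto(..., wait_until=...)` and `page.wait_for_load_state(..., timeout=...)` are Playwright calls; the function name `load_page` and its return value are hypothetical, and the import fallback exists only so the sketch can be imported where Playwright is not installed.

```python
try:
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
except ImportError:
    # Fallback so the sketch stays importable without Playwright.
    PlaywrightTimeoutError = TimeoutError


def load_page(page, url, idle_timeout_ms=5000):
    """Navigate with a two-phase wait; return True if network idle was reached."""
    # Phase 1: wait only for the initial HTML parse.
    page.goto(url, wait_until="domcontentloaded")
    # Phase 2: give the network a bounded chance to go idle.
    try:
        page.wait_for_load_state("networkidle", timeout=idle_timeout_ms)
        return True
    except PlaywrightTimeoutError:
        # Polling, websockets, or background fetches kept the network busy:
        # proceed with whatever DOM exists now.
        return False
```

The `page` argument is expected to be a Playwright `Page`, for example one created from a browser launched inside `sync_playwright()`.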

---

## Key Design Decisions

**Structural parsing, not content scoring.** The algorithm reads developer-encoded boundaries (semantic tags, visual containment, structural repetition) rather than scoring text density or link ratios. This recovers intentional component structure rather than finding "the most content-rich zone."

**Coverage is the primary goal.** Every user-visible region should belong to exactly one segment. Navigation bars, sidebars, breadcrumbs, and footers are all user-visible and are included. Only true technical noise (hidden elements, script/style tags, transient overlays) is pruned.

**ARIA landmarks and semantic tags are authoritative.** `nav`, `section`, `article`, `aside`, `header`, `footer`, `form`, and any element with an ARIA landmark role (`navigation`, `banner`, `contentinfo`, etc.) are declared segments unconditionally. Developer intent encoded in HTML/ARIA is more reliable than any heuristic.

**Hard signal requirement for identity check.** Soft signals (padding, spatial gap, mixed content types) fire on layout wrappers as readily as on real components. The identity check only fires when at least one hard visual signal (border, shadow, radius, background) is present. This prevents plain wrapper divs from being declared components.

**Child diversity cancels background isolation.** A node with more than 3 independently meaningful children is a layout container. Background isolation on such nodes is a canvas reset, not component identity — the signal is demoted.

**LEAF_TAGS are never orphans.** `<ul>`, `<ol>`, `<table>`, `<figure>`, etc. are always independently processable as leaf segments. They are excluded from orphan detection and included in the descent loop. Previously, a shallow `<ul class="breadcrumb">` next to a large content div would trigger an orphan stop, collapsing the entire page into one segment.

**Orphan check uses a small-sibling exclusion, not a large-sibling exception.** A non-meaningful sibling with content only blocks descent when its height is **≥ 15%** of the tallest meaningful child. Siblings smaller than 15% are section titles, headings, or label blocks that will be silently absorbed — they are not stranded. Previously the guard was inverted (only skipping siblings > 120%), causing small heading divs to trigger orphan stops and collapse entire product-listing pages into one segment.

**MIN_HEIGHT is 30px, not 60px.** Navigation bars, breadcrumbs, and tab bars are commonly 30–55px tall and are user-visible. The original 60px threshold incorrectly filtered these out at the prune step. The identity-score atom guard (cap at 2) is also lowered to 30px so that only true UI atoms are capped.
