Metadata-Version: 2.4
Name: html-shrinker
Version: 0.1.0
Summary: Simple Python helper library that can significantly reduce LLM input tokens by removing unnecessary page code
Project-URL: Homepage, https://github.com/damadjan/html-shrinker
Project-URL: Repository, https://github.com/damadjan/html-shrinker
Project-URL: Issues, https://github.com/damadjan/html-shrinker/issues
Author-email: Todor Donev <me@todordonev.com>
License: MIT
License-File: LICENSE
Keywords: ai-scraping,data-extraction,html,html-cleaning,html-preprocessing,llm,token-optimization,token-reduction
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Markup :: HTML
Requires-Python: >=3.10
Requires-Dist: lxml>=6.0.2
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# html-shrinker

A small Python helper library that can significantly reduce LLM input tokens by removing unnecessary page markup. What gets removed is configurable.

AI scraping usually means sending the full page source to an LLM together with instructions and an output format. The information you need is almost always somewhere in the page's `body`, so the whole `head`, with its styles, scripts, and metadata, can usually be removed safely. This alone reduces token count and cost significantly. Further savings come from removing specific HTML tags, attributes, or even the inner text.
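
To get a feel for how much the `head` can contribute, here is a rough sketch that uses only the standard library. It is not how html-shrinker works internally. `drop_head` is a hypothetical helper, the sample page is made up, and character count is only a crude proxy for token count.

```python
import re


def drop_head(html: str) -> str:
    """Remove the <head>...</head> block with a simple regex (illustration only)."""
    return re.sub(r"<head\b[^>]*>.*?</head>", "", html, flags=re.IGNORECASE | re.DOTALL)


# A made-up page whose head is much larger than its body.
page = (
    "<html><head><style>" + "body{margin:0}" * 50 + "</style>"
    "<script>" + "console.log(1);" * 50 + "</script></head>"
    "<body><p>Price: 42 EUR</p></body></html>"
)

trimmed = drop_head(page)
saved = 1 - len(trimmed) / len(page)
print(f"{len(page)} -> {len(trimmed)} characters ({saved:.0%} smaller)")
```

Real pages vary a lot, so measure on your own inputs before relying on any particular saving.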

## What it does

- Removes noisy tags/attributes or keeps only a whitelist
- Optionally strips inner text
- Removes comments
- Flattens repeated single-child `div > div` wrappers, even if they are nested many levels deep
- Collapses whitespace between tags
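
Two of these steps, comment removal and whitespace collapsing, are simple enough to sketch with regular expressions. The functions below are hypothetical stand-ins written for illustration and are not the library's own code, which may handle edge cases differently (for example whitespace inside `<pre>`).

```python
import re


def remove_comments(html: str) -> str:
    """Drop <!-- ... --> comments, including multi-line ones."""
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


def collapse_between_tags(html: str) -> str:
    """Remove whitespace that sits only between a closing '>' and the next '<'."""
    return re.sub(r">\s+<", "><", html)


sample = "<ul>\n  <!-- nav -->\n  <li>One</li>\n  <li>Two</li>\n</ul>"
compact = collapse_between_tags(remove_comments(sample))
print(compact)
```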

## Quick install

```bash
pip install html-shrinker
```

## Quick start

```python
from html_shrinker import HTMLShrinker
from html_shrinker.defaults import tags

raw_html = """
<html>
  <head><script>ignore me</script></head>
  <body>
    <div><div><p>Hello world</p></div></div>
    <script>alert("x")</script>
  </body>
</html>
"""

shrinker = HTMLShrinker(
    tags=list(tags),
)
result = shrinker.shrink(raw_html)
print(result)
```

## API

```python
from html_shrinker import HTMLShrinker

shrinker = HTMLShrinker(
    tag_mode="remove",
    tags=["script", "style", "head"],
    attribute_mode="remove",
    attributes=["class", "id", "style"],
    strip_innertext=False,
    remove_comments=True,
    flatten_single_child_divs=True,
    collapse_between_tags=True,
)

output = shrinker.shrink("<html>...</html>")
```

Default presets are available from:

```python
from html_shrinker.defaults import tags, arguments
```

Invalid HTML input raises `InvalidHTMLInputError`:

```python
from html_shrinker import HTMLShrinker, InvalidHTMLInputError

try:
    HTMLShrinker().shrink("<div>fragment</div>")
except InvalidHTMLInputError as exc:
    print(exc)
```

## Configuration

`HTMLShrinker(...)` constructor parameters:

- `tag_mode`: `"remove"` or `"keep"` (default: `"remove"`)
- `tags`: `list[str]`
  - If `tag_mode="remove"`: these tags are removed.
  - If `tag_mode="keep"`: only these tags are kept.
- `attribute_mode`: `"remove"` or `"keep"` (default: `"remove"`)
- `attributes`: `list[str]`
  - If `attribute_mode="remove"`: these attributes are removed.
  - If `attribute_mode="keep"`: only these attributes are kept.
- `strip_innertext`: `bool` (default: `False`)
  - If `True`: removes text nodes.
  - Example: `<p>secret</p>` becomes `<p></p>`.
- `remove_comments`: `bool` (default: `True`)
  - If `True`: removes HTML comments such as `<!-- comment -->`.
- `flatten_single_child_divs`: `bool` (default: `True`)
  - If `True`: flattens nested `div > div` wrappers when a `div` contains only one child `div`.
  - This is applied repeatedly, so a deep chain like `<div><div><div><p>...</p></div></div></div>` becomes `<div><p>...</p></div>`.
  - Long chains of wrapper divs like this are very common after aggressive shrinking.
- `collapse_between_tags`: `bool` (default: `True`)
  - If `True`: removes whitespace between tags, so `>   <` becomes `><`.

Notes:

- Default `tags` is empty (`[]`), so no tags are removed by default.
- Default `attributes` is empty (`[]`), so no attributes are removed by default.
- If `tag_mode="keep"`, `tags` must be non-empty.
- If `attribute_mode="keep"`, `attributes` must be non-empty.
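
The repeated flattening described for `flatten_single_child_divs` can be pictured with the following sketch. It uses the standard library's `xml.etree.ElementTree` on well-formed markup and is an assumption about the general technique, not the library's implementation. `is_wrapper` and `flatten` are hypothetical names. This version keeps the innermost `div`; which element's attributes the library itself preserves is not stated here.

```python
import xml.etree.ElementTree as ET


def is_wrapper(element: ET.Element) -> bool:
    """True for a div whose only content is a single child div."""
    if element.tag != "div" or len(element) != 1:
        return False
    inner = element[0]
    has_text = bool((element.text or "").strip() or (inner.tail or "").strip())
    return inner.tag == "div" and not has_text


def flatten(parent: ET.Element) -> None:
    """Replace wrapper divs under `parent` with their inner div, recursively."""
    for index, child in enumerate(list(parent)):
        # Keep unwrapping until the div at this position is no longer a wrapper.
        while is_wrapper(child):
            inner = child[0]
            inner.tail = child.tail
            parent[index] = inner
            child = inner
        flatten(child)


root = ET.fromstring("<body><div><div><div><p>Hi</p></div></div></div></body>")
flatten(root)
flat = ET.tostring(root, encoding="unicode")
print(flat)
```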

## Usage patterns

### 1) Remove mode (default)

```python
from html_shrinker import HTMLShrinker

shrinker = HTMLShrinker(
    tag_mode="remove",
    tags=["script", "style", "head"],
    attribute_mode="remove",
    attributes=["class", "id", "style"],
)
clean = shrinker.shrink(raw_html)
```

### 2) Keep mode

```python
from html_shrinker import HTMLShrinker

shrinker = HTMLShrinker(
    tag_mode="keep",
    tags=["main", "article", "h1", "h2", "p", "ul", "ol", "li", "a"],
    attribute_mode="keep",
    attributes=["href"],
)
clean = shrinker.shrink(raw_html)
```
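
The difference between the two modes can be summarised as a small predicate. This is a conceptual sketch only: `should_keep` is a hypothetical function, and raising `ValueError` for an empty keep list is an assumption, since the exception the library actually raises is not documented here.

```python
def should_keep(name: str, mode: str, names: list[str]) -> bool:
    """Decide whether a tag or attribute called `name` survives filtering."""
    if mode not in ("remove", "keep"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "keep" and not names:
        # Assumption: an empty whitelist is rejected, because it would remove everything.
        raise ValueError("keep mode needs a non-empty list")
    listed = name in names
    return listed if mode == "keep" else not listed


tags_in_page = ["head", "script", "main", "p"]
after_remove = [t for t in tags_in_page if should_keep(t, "remove", ["script", "head"])]
after_keep = [t for t in tags_in_page if should_keep(t, "keep", ["main", "p"])]
print(after_remove, after_keep)
```

With an empty list, remove mode keeps everything, which matches the documented defaults.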

### 3) Strip inner text

```python
from html_shrinker import HTMLShrinker

shrinker = HTMLShrinker(strip_innertext=True)
clean = shrinker.shrink("<html><body><p>secret text</p></body></html>")
# <html><body><p></p></body></html>
```
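
Stripping inner text leaves only the page structure, which can be useful when the LLM needs the layout rather than the content. The idea can be sketched with the standard library's `html.parser`. `TextStripper` is a hypothetical class for illustration and is not part of html-shrinker.

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Re-emit tags as they appear and silently drop all text nodes."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    # handle_data is intentionally not overridden, so text is discarded.


stripper = TextStripper()
stripper.feed('<html><body><p class="x">secret text</p></body></html>')
stripper.close()
structure = "".join(stripper.parts)
print(structure)
```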

## License

MIT