Metadata-Version: 2.4
Name: pulpie
Version: 0.0.2
Summary: Fast content extraction from HTML using encoder models.
Author-email: Feyn <team@usefeyn.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/feyninc/pulpie
Project-URL: Documentation, https://github.com/feyninc/pulpie
Project-URL: Repository, https://github.com/feyninc/pulpie
Keywords: html,content-extraction,web,nlp,encoder,transformer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup :: HTML
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1
Requires-Dist: transformers<5.0,>=4.45
Requires-Dist: lxml>=5.0
Requires-Dist: selectolax>=0.3
Requires-Dist: beautifulsoup4>=4.12
Provides-Extra: markdown
Requires-Dist: html2text>=2024.2; extra == "markdown"
Provides-Extra: all
Requires-Dist: html2text>=2024.2; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.0; extra == "dev"
Requires-Dist: html2text>=2024.2; extra == "dev"
Requires-Dist: ruff>=0.9; extra == "dev"
Requires-Dist: ty>=0.0.20; extra == "dev"
Dynamic: license-file

<div align="center">

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/feyninc/pulpie/main/assets/banner-dark.png">
  <img alt="pulpie" src="https://raw.githubusercontent.com/feyninc/pulpie/main/assets/banner-light.png" width="460">
</picture>

[![PyPI version](https://img.shields.io/pypi/v/pulpie.svg)](https://pypi.org/project/pulpie/)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://pypi.org/project/pulpie/)
[![License](https://img.shields.io/github/license/feyninc/pulpie.svg)](https://github.com/feyninc/pulpie/blob/main/LICENSE)
[![Downloads](https://static.pepy.tech/badge/pulpie)](https://pepy.tech/project/pulpie)
[![Blog](https://img.shields.io/badge/blog-read%20the%20writeup-E34C26.svg)](https://usefeyn.com/blog/pulpie-pareto-optimal-models-for-cleaning-the-web/)
[![GitHub stars](https://img.shields.io/github/stars/feyninc/pulpie.svg)](https://github.com/feyninc/pulpie/stargazers)

_Pareto-optimal models for cleaning the web. Extract main content from HTML at one twentieth the cost._

[Install](#installation) •
[Usage](#usage) •
[Models](#models) •
[How it works](#how-it-works) •
[Benchmarks](#benchmarks) •
[Blog](https://usefeyn.com/blog/pulpie-pareto-optimal-models-for-cleaning-the-web/)

</div>

Pulpie extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It uses small encoder models that label every block in a single forward pass, approaching state-of-the-art extraction quality while running up to 20x faster than autoregressive extractors on an L4 GPU, at roughly one twentieth the cost.

- **Fast.** An encoder labels every block in one forward pass (13.7 pages/sec on an L4).
- **Accurate.** Matches state-of-the-art quality: 0.862 to 0.873 ROUGE-5 F1 on WebMainBench.
- **Small.** The recommended model is 210M parameters and fits on any GPU.
- **Cheap.** Clean 1 billion pages for ~$7,900 on an L4, versus ~$159,000 for Dripper, the leading decoder-based extractor.
- **Simple.** Run `pip install pulpie`, then `Extractor().extract(html)`.
- **Batched.** An overlapped CPU and GPU pipeline scales across multiple GPUs.

## Installation

```bash
pip install pulpie
```

For Markdown output, install the `markdown` extra:

```bash
pip install "pulpie[markdown]"
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv pip install "pulpie[markdown]"
```

## Usage

### Basic

```python
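from pulpie import Extractor

# Any raw HTML string works; this tiny page is only a placeholder.
html = "<html><body><nav>Home</nav><article><p>Hello, world.</p></article></body></html>"

extractor = Extractor()                # defaults to pulpie-orange-small (210M)
result = extractor.extract(html)

print(result.markdown)                 # clean Markdown (needs the `markdown` extra)
print(result.html)                     # clean HTML
print(result.n_main, result.n_other)   # blocks kept vs dropped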
```

The model downloads from Hugging Face on first use.

### Choosing a model

```python
from pulpie import Extractor

extractor = Extractor(model="orange-large")   # "orange-small" (default), "orange-base", "orange-large"
extractor = Extractor(model="path/to/model")  # or a custom checkpoint
extractor = Extractor(device="cpu")           # force CPU
```

### Batch processing

For bulk extraction, `Pipeline` overlaps CPU preprocessing with GPU inference and self-balances across one or more GPUs:

```python
from pulpie import Pipeline, PageInput

# `pages` is a list of raw HTML strings loaded elsewhere.
pipeline = Pipeline(model="orange-small")
results = pipeline.extract_batch(
    [PageInput(html=h, page_id=i) for i, h in enumerate(pages)]
)
```
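
To illustrate what "overlapping CPU preprocessing with GPU inference" means, here is a minimal producer/consumer sketch using only the standard library. It is not Pulpie's implementation: `preprocess`, `infer`, and `run_overlapped` are hypothetical stand-ins, and the real pipeline's batching and multi-GPU balancing are not shown.

```python
import queue
import threading


def preprocess(page):
    # Hypothetical stand-in for CPU work (HTML simplification, tokenization).
    return page.strip().lower()


def infer(batch):
    # Hypothetical stand-in for one GPU forward pass over a batch.
    return [len(item) for item in batch]


def run_overlapped(pages, batch_size=2, max_queued=4):
    """Preprocess on a worker thread while the calling thread runs inference."""
    # Bounded queue: the CPU side blocks instead of running far ahead.
    batches = queue.Queue(maxsize=max_queued)
    done = object()  # sentinel marking the end of input

    def producer():
        batch = []
        for page in pages:
            batch.append(preprocess(page))
            if len(batch) == batch_size:
                batches.put(batch)
                batch = []
        if batch:  # flush the final partial batch
            batches.put(batch)
        batches.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        batch = batches.get()
        if batch is done:
            break
        results.extend(infer(batch))
    worker.join()
    return results
```

With a single producer and a FIFO queue, results come back in input order. While the consumer is busy with one batch, the producer is already preparing the next, which is the overlap the paragraph above describes.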

## Models

All three models are built on [EuroBERT](https://arxiv.org/abs/2503.05500), share a tokenizer, and use the same `<|sep|>` block-marker architecture. Large is the teacher; Base and Small are distilled from it.

| Model | Hugging Face | Params | ROUGE-5 F1 | Notes |
|-------|--------------|--------|------------|-------|
| **Orange Small** | [`feyninc/pulpie-orange-small`](https://huggingface.co/feyninc/pulpie-orange-small) | 210M | 0.862 | **Recommended**, best size-to-quality ratio |
| Orange Base | [`feyninc/pulpie-orange-base`](https://huggingface.co/feyninc/pulpie-orange-base) | 610M | 0.863 | Distilled from Large |
| Orange Large | [`feyninc/pulpie-orange-large`](https://huggingface.co/feyninc/pulpie-orange-large) | 2.1B | 0.873 | Teacher (highest quality) |

`orange-small` is the default. At roughly a third the size of Dripper (the leading extractor, 0.6B parameters), it matches Dripper's quality (0.862 vs 0.864 ROUGE-5 F1) while running 20x faster on an L4.

## How it works

Pulpie keeps the "read the page" approach of model-based extractors but moves the bottleneck from memory bandwidth to compute by using an encoder instead of a decoder. The pipeline runs in four stages:

1. **Simplify.** Remove scripts, styles, and formatting noise; tag each content block with a unique ID.
2. **Chunk.** Split, tokenize, and pack blocks into chunks of up to 8,192 tokens (≈80% of pages fit in one chunk).
3. **Classify.** A single encoder forward pass labels every block as content or boilerplate.
4. **Reconstruct.** Return the kept blocks as HTML, or convert them to Markdown.
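
The Chunk step can be pictured as greedy packing of consecutive blocks under a token budget. The sketch below is an assumption about how such packing might look, not Pulpie's actual code: `pack_blocks` is a hypothetical helper, it takes precomputed per-block token counts, and it places an oversized block in a chunk of its own rather than splitting it.

```python
def pack_blocks(block_token_counts, max_tokens=8192):
    """Greedily pack consecutive blocks into chunks of at most max_tokens tokens.

    Returns a list of chunks, each a list of block indices in document order.
    A single block larger than max_tokens ends up alone in its chunk; a real
    pipeline would need to split such a block first.
    """
    chunks, current, used = [], [], 0
    for index, count in enumerate(block_token_counts):
        # Start a new chunk when this block would overflow the current one.
        if current and used + count > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(index)
        used += count
    if current:
        chunks.append(current)
    return chunks


# Four blocks: the first two fit together, the third starts a new chunk.
print(pack_blocks([5000, 3000, 4000, 200]))
```

Keeping blocks in document order means each chunk is a contiguous slice of the page, so the labels from the Classify step can be mapped back to block IDs by position.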

A decoder emits labels one token at a time, re-reading the full model weights from GPU memory at each step, so its speed is bound by memory bandwidth. An encoder runs one dense forward pass over the whole input and is bound by compute instead. The gap therefore widens on bandwidth-limited GPUs (7x faster than Dripper on an A100, 20x on an L4).

## Benchmarks

Quality on the English subset of [WebMainBench](https://github.com/opendatalab/WebMainBench) (6,647 pages), ROUGE-5 F1:

| Method | Params | ROUGE-5 F1 | Empty pages |
|--------|--------|------------|-------------|
| **Pulpie Orange Large** | 2.1B | **0.873** | 21 |
| Dripper | 0.6B | 0.864 | 135 |
| **Pulpie Orange Base** | 610M | 0.863 | 36 |
| **Pulpie Orange Small** | 210M | 0.862 | 45 |
| magic-html | - | 0.700 | 384 |
| Trafilatura | - | 0.619 | 16 |
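
For readers unfamiliar with the metric, ROUGE-5 F1 is the harmonic mean of precision and recall over overlapping 5-grams between the extracted text and the reference. The sketch below shows the general formula only; `rouge_n_f1` is a hypothetical helper, whitespace tokenization is an assumption, and the benchmark's own tokenizer and normalization may differ.

```python
from collections import Counter


def ngram_counts(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=5):
    """F1 over clipped n-gram overlap between candidate and reference text."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because the unit is a 5-gram, a text shorter than five tokens contributes no n-grams and scores 0.0 under this sketch.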

Speed and cost (Pulpie Orange Small vs Dripper, 1 billion pages):

| Metric | Pulpie Orange Small | Dripper |
|--|--------------------|---------|
| Throughput (L4) | **13.7 pages/sec** | 0.68 pages/sec |
| Cost / 1B pages (L4) | **~$7,900** | ~$159,000 |

Pulpie Orange Small matches Dripper's quality at **20x the throughput** and **20x lower cost** on an L4. See [BENCHMARKS.md](BENCHMARKS.md) for the full comparison, per-difficulty breakdown, and reproduction command.
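
The 20x figures follow directly from the table. The short script below rechecks the arithmetic using only the numbers quoted above; the implied hourly rate it prints is derived from those numbers under the assumption of a single L4 running continuously, not a price stated by the authors.

```python
PAGES = 1_000_000_000

# Figures copied from the speed and cost table above.
throughput = {"pulpie-orange-small": 13.7, "dripper": 0.68}    # pages/sec on one L4
cost_usd = {"pulpie-orange-small": 7_900, "dripper": 159_000}  # approx. USD per 1B pages


def gpu_hours(pages, pages_per_sec):
    # Wall-clock hours for one GPU to process `pages` at a steady rate.
    return pages / pages_per_sec / 3600


speedup = throughput["pulpie-orange-small"] / throughput["dripper"]
cost_ratio = cost_usd["dripper"] / cost_usd["pulpie-orange-small"]
hours = {name: gpu_hours(PAGES, rate) for name, rate in throughput.items()}
implied_rate = {name: cost_usd[name] / hours[name] for name in hours}

print(f"speedup: {speedup:.1f}x, cost ratio: {cost_ratio:.1f}x")
for name in hours:
    print(f"{name}: {hours[name]:,.0f} GPU-hours, ~${implied_rate[name]:.2f}/GPU-hour")
```

Both ratios come out at roughly 20x, and both cost figures imply about the same hourly L4 rate, so the cost gap is explained by the throughput gap alone.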

## Acknowledgements

Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their `simplify_html` preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. We also use their Dripper 0.6B model to cross-validate our training labels. We're grateful they released their tools and data.

## Citation

If you use Pulpie in your research, please cite:

```bibtex
@misc{pulpie2026,
  title        = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
  author       = {Minhas, Bhavnick and Nigam, Shreyash and {Feyn Research}},
  year         = {2026},
  howpublished = {Feyn Field Notes}
}
```

---

<div align="center">
Built by <a href="https://usefeyn.com">Feyn</a>.
</div>
