Metadata-Version: 2.4
Name: pulpie
Version: 0.0.1
Summary: Fast content extraction from HTML using encoder models.
Author-email: Chonkie AI <team@chonkie.ai>
License: Apache-2.0
Project-URL: Homepage, https://github.com/chonkie-inc/pulpie
Project-URL: Documentation, https://github.com/chonkie-inc/pulpie
Project-URL: Repository, https://github.com/chonkie-inc/pulpie
Keywords: html,content-extraction,web,nlp,encoder,transformer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup :: HTML
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.1
Requires-Dist: transformers<5.0,>=4.45
Requires-Dist: lxml>=5.0
Requires-Dist: selectolax>=0.3
Requires-Dist: beautifulsoup4>=4.12
Provides-Extra: markdown
Requires-Dist: html2text>=2024.2; extra == "markdown"
Provides-Extra: all
Requires-Dist: html2text>=2024.2; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.0; extra == "dev"
Requires-Dist: html2text>=2024.2; extra == "dev"
Requires-Dist: ruff>=0.9; extra == "dev"
Requires-Dist: ty>=0.0.20; extra == "dev"

# Pulpie

Fast content extraction from HTML using encoder models. About 16x faster than autoregressive extraction at the same quality (see Performance).

## Install

```bash
pip install pulpie
```

For markdown output, install the `markdown` extra (quoted so shells such as zsh do not expand the brackets):

```bash
pip install "pulpie[markdown]"
```

## Usage

```python
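from pulpie import Extractor

# Any HTML string works here; this one is a tiny placeholder page.
html = "<html><body><article><p>Hello, world.</p></article></body></html>"

extractor = Extractor()  # downloads pulpie-orange-small (210M) on first use

result = extractor.extract(html)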
print(result.markdown)   # clean markdown
print(result.html)       # clean HTML
print(result.n_main)     # number of content blocks
print(result.n_other)    # number of boilerplate blocks
```
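
For more than one page, the same call can be wrapped in a loop. The following is a minimal sketch, assuming only the `extract(html)` method and the `markdown` attribute shown above; `extract_folder` is a hypothetical helper and not part of Pulpie.

```python
from pathlib import Path


def extract_folder(extractor, src_dir, dst_dir):
    """Run extractor.extract on every .html file in src_dir and write
    the resulting markdown to dst_dir. Hypothetical helper, not Pulpie API."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for page in sorted(src.glob("*.html")):
        # Crawled pages often contain invalid bytes; replace them instead of failing.
        html = page.read_text(encoding="utf-8", errors="replace")
        result = extractor.extract(html)
        out = dst / (page.stem + ".md")
        out.write_text(result.markdown, encoding="utf-8")
        written.append(out)
    return written
```

With Pulpie installed, this could be called as `extract_folder(Extractor(), "pages", "clean")`, reusing one `Extractor` so the model is loaded only once.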

## Models

| Model | Size | ROUGE-5 | Speed on L4 (pages/sec) |
|-------|------|---------|-------------------------|
| `orange-small` | 210M | 0.864 | 15 |
| `orange-base` | 610M | 0.849 | ~6 |
| `orange-large` | 2.1B | 0.862 | ~2 |

`orange-small` is the default and recommended model — it matches the 2.1B teacher at 1/10th the size.

```python
# Use a specific model
extractor = Extractor(model="orange-large")

# Use a custom model path
extractor = Extractor(model="path/to/your/model")

# Force CPU
extractor = Extractor(device="cpu")
```

## How it works

Pulpie classifies each HTML block as "main content" or "boilerplate" using a bidirectional encoder. The pipeline:

1. **Simplify** — Strip scripts, styles, normalize HTML (via MinerU-HTML)
2. **Chunk** — Pack blocks into sequences separated by `<|sep|>` tokens
3. **Classify** — Single encoder forward pass classifies all blocks simultaneously
4. **Reconstruct** — Extract content blocks, convert to markdown
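
The chunk and reconstruct steps can be illustrated with a toy sketch. It is not Pulpie's implementation: `pack_blocks` and `keep_main` are hypothetical names, token counts are approximated by whitespace splitting rather than a real tokenizer, and the labels are supplied by hand instead of by the encoder.

```python
SEP = "<|sep|>"  # separator token named in the pipeline above


def pack_blocks(blocks, max_tokens=512):
    """Greedily pack text blocks into sequences of at most max_tokens.

    Token counts are approximated by whitespace splitting (an assumption;
    a real pipeline would use the model's tokenizer). A block that exceeds
    the budget on its own still gets a sequence to itself.
    """
    sequences, current, used = [], [], 0
    for block in blocks:
        cost = len(block.split()) + 1  # +1 accounts for the separator
        if current and used + cost > max_tokens:
            sequences.append(current)
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        sequences.append(current)
    return [SEP.join(seq) for seq in sequences]


def keep_main(blocks, labels):
    """Reconstruct step: keep only the blocks labelled 'main'."""
    return [b for b, label in zip(blocks, labels) if label == "main"]


blocks = [
    "Home | About | Contact",
    "Encoders read the whole page at once.",
    "Copyright 2024 Example",
]
packed = pack_blocks(blocks, max_tokens=12)
main = keep_main(blocks, ["other", "main", "other"])  # hand-written labels
```

Packing many blocks into one sequence is what lets a single forward pass label all of them at once, instead of generating output token by token.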

## Performance

On 500 real Common Crawl pages (NVIDIA L4 GPU):

- **15.1 pages/sec** (single GPU, 210M model)
- **$6,500** to clean 1 billion pages
- **16.4x faster** than Dripper (autoregressive) on the same hardware
- **433 MB** VRAM — fits on any GPU
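
The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed; the hourly GPU price is not stated here, so the rate it prints is derived from the other figures and should be read as an implied value, not a quoted price.

```python
PAGES = 1_000_000_000
PAGES_PER_SEC = 15.1    # from the benchmark above
TOTAL_COST_USD = 6_500  # from the benchmark above
SPEEDUP = 16.4          # vs. Dripper, from the benchmark above

gpu_hours = PAGES / PAGES_PER_SEC / 3600
implied_rate = TOTAL_COST_USD / gpu_hours  # USD per GPU-hour (derived, not stated)
dripper_pps = PAGES_PER_SEC / SPEEDUP      # implied Dripper throughput

print(f"{gpu_hours:,.0f} GPU-hours")
print(f"${implied_rate:.2f} per GPU-hour")
print(f"{dripper_pps:.2f} pages/sec for Dripper")
```

This works out to roughly 18,400 GPU-hours for a billion pages, an implied price of about $0.35 per GPU-hour, and an implied Dripper throughput of about 0.92 pages/sec on the same hardware.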

## License

Apache 2.0
