Metadata-Version: 2.4
Name: magic-eraser
Version: 0.1.0
Summary: Rule-based web ad/clutter eraser, learned from a crowd-labeled dataset of page elements.
Author: alvations
License: MIT
Project-URL: Homepage, https://github.com/alvations/magic-eraser
Project-URL: Dataset, https://github.com/alvations/magic-eraser/tree/main/data
Keywords: adblock,ad-blocking,content-blocker,web,dataset,ads
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: train
Requires-Dist: datasets; extra == "train"
Requires-Dist: scikit-learn; extra == "train"
Dynamic: license-file

# 🪄 magic-eraser

> People who don't know what to sell, sell advertisements.

Rule-based web **ad/clutter eraser**, learned from a continuously growing,
crowd-labeled dataset of real page elements. No LLM is needed at inference time:
the rules are distilled from labels that an LLM (or a human) produced once.

```bash
pip install magic-eraser
```

```python
from magic_eraser import is_ad, css, detect_ads, AdEraser

is_ad({"cls": "ad-slot leaderboard", "eid": "div-gpt-ad-1", "w": 728, "h": 90})
# True

css("www.washingtonpost.com")
# '[class*="ad-slot"],...{display:none !important;height:0 !important;...}'

eraser = AdEraser("example.com")
eraser.detect([{"id": 0, "cls": "advert", "w": 300, "h": 250, "iframe": True},
               {"id": 1, "cls": "article-body", "w": 680, "h": 1200}])
# [0]
```

## How it works

1. A browser (e.g. **Melon**) collects candidate page elements and, on the
   first visit to a site, asks an LLM which of them are ads.
2. Each verdict is appended to `data/ad_dataset.jsonl` as **labeled training
   data** and pushed to this repository.
3. `scripts/build_rules.py` re-derives high-precision class/id token rules +
   per-domain CSS selectors into `magic_eraser/rules.json`.
4. `magic-eraser` then blocks ads with **zero LLM calls** — and gets better
   every time the dataset grows.
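
Step 3 can be pictured with a minimal sketch like the one below. It is not the actual `scripts/build_rules.py`: the function names, the tokenization, and the `min_count` / `min_precision` thresholds are all assumptions made for illustration, and the rows are made up. The idea is to keep only class/id tokens that appear often enough and almost exclusively on elements labeled as ads.

```python
import re
from collections import Counter


def tokens(row):
    """Split the class string and element id into a set of lowercase word tokens."""
    text = f"{row.get('cls', '')} {row.get('eid', '')}".lower()
    return set(re.findall(r"[a-z]+", text))


def derive_token_rules(rows, min_count=2, min_precision=0.95):
    """Return tokens seen at least `min_count` times, almost always on ads.

    Hypothetical helper: the thresholds are illustrative, not the project's.
    """
    ad_hits, all_hits = Counter(), Counter()
    for row in rows:
        for tok in tokens(row):
            all_hits[tok] += 1
            if row["is_ad"]:
                ad_hits[tok] += 1
    return sorted(
        tok
        for tok, n in all_hits.items()
        if n >= min_count and ad_hits[tok] / n >= min_precision
    )


# Made-up rows for illustration; not taken from the real dataset.
rows = [
    {"cls": "ad-slot leaderboard", "eid": "div-gpt-ad-1", "is_ad": True},
    {"cls": "ad-slot sidebar", "eid": "", "is_ad": True},
    {"cls": "article-body", "eid": "main", "is_ad": False},
    {"cls": "sidebar related", "eid": "", "is_ad": False},
]

rules = derive_token_rules(rows)
print(rules)  # ['ad', 'slot']
```

Here `sidebar` is dropped because it appears on both an ad and a content element, which is the kind of ambiguity a high-precision rule set is meant to avoid.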

## The dataset

`data/ad_dataset.jsonl` — one JSON object per labeled page element:

| field | meaning |
|---|---|
| `host`, `url`, `ts` | where/when it was seen |
| `tag` | element tag (DIV, IFRAME, …) |
| `cls`, `eid` | class string, element id |
| `w`, `h` | rendered size (px) |
| `iframe` | whether the element is an iframe |
| `txt` | short visible-text snippet |
| `is_ad` | **label** — ad/clutter (true) or content (false) |

Load it with HuggingFace `datasets`:

```python
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="https://raw.githubusercontent.com/alvations/magic-eraser/main/data/ad_dataset.jsonl",
    split="train",
)
ds[0]  # {'host': ..., 'cls': ..., 'is_ad': True, ...}
```
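
Because the file is plain JSON Lines, it can also be read with only the standard library. The sketch below counts labels per host; the `label_counts` helper is hypothetical, and the two sample lines are made up for illustration (they omit `url` and `ts`, whose exact formats are not assumed here).

```python
import json
from collections import Counter

# Made-up sample lines for illustration; not taken from the real dataset.
sample_jsonl = """\
{"host": "example.com", "tag": "DIV", "cls": "ad-slot", "eid": "", "w": 728, "h": 90, "iframe": false, "txt": "", "is_ad": true}
{"host": "example.com", "tag": "DIV", "cls": "article-body", "eid": "main", "w": 680, "h": 1200, "iframe": false, "txt": "Lorem ipsum", "is_ad": false}
"""


def label_counts(lines):
    """Count (host, is_ad) pairs across JSONL lines, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[(record["host"], record["is_ad"])] += 1
    return counts


counts = label_counts(sample_jsonl.splitlines())
print(counts)
```

With a local copy of the dataset, the same function can be fed an open file handle, since iterating over a text file yields one line at a time.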

## Train a model to replace the rules

```bash
pip install "magic-eraser[train]"
python scripts/build_rules.py     # regenerate rule-based detector from data
```

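The `train` extra installs `datasets` and `scikit-learn`. `build_rules.py`
only regenerates the rules; beyond that, the labeled dataset is designed to
train a small local classifier (features → `is_ad`) that could replace both
the rules and the LLM entirely.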
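
As a rough sketch of what such a classifier could look like, the example below fits a bag-of-tokens logistic regression with scikit-learn. This is an assumption about one possible approach, not a model shipped with the package: the `to_text` and `predict_is_ad` helpers are hypothetical, the rows are made up, and only `cls` and `eid` are used as features (size and `iframe` are left out for brevity).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up rows for illustration; real rows would come from data/ad_dataset.jsonl.
rows = [
    {"cls": "ad-slot leaderboard", "eid": "div-gpt-ad-1", "is_ad": True},
    {"cls": "advert sponsored", "eid": "", "is_ad": True},
    {"cls": "ad-banner promo", "eid": "", "is_ad": True},
    {"cls": "sponsored ad-unit", "eid": "", "is_ad": True},
    {"cls": "article-body", "eid": "", "is_ad": False},
    {"cls": "post-content main", "eid": "", "is_ad": False},
    {"cls": "article-header title", "eid": "", "is_ad": False},
    {"cls": "story-content", "eid": "main", "is_ad": False},
]


def to_text(row):
    """Join class string and element id so the vectorizer can tokenize them."""
    return f"{row.get('cls', '')} {row.get('eid', '')}"


# CountVectorizer turns each string into token counts; LogisticRegression
# learns one weight per token.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit([to_text(r) for r in rows], [int(r["is_ad"]) for r in rows])


def predict_is_ad(row):
    """Hypothetical helper: True if the fitted model labels the element an ad."""
    return bool(model.predict([to_text(row)])[0])


print(predict_is_ad({"cls": "ad-slot sponsored", "eid": ""}))
print(predict_is_ad({"cls": "article-body main", "eid": ""}))
```

On real data you would hold out a test split and check precision before trusting the model to hide elements, since a false positive removes actual content from the page.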

## License

MIT.
