Metadata-Version: 2.4
Name: selectolax-tree
Version: 0.1.2
Summary: YAML-driven HTML extractor powered by selectolax.
Project-URL: Repository, https://github.com/aimscrape/selectolax-tree
Project-URL: Issues, https://github.com/aimscrape/selectolax-tree/issues
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: selectolax>=0.3.0
Requires-Dist: PyYAML>=6.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"

# selectolax-tree

Parse HTML into structured data using a YAML spec.

Powered by AimScrape.

Chinese documentation: `README-ZH.md`

Versioned docs:
- `docs/0.1/USAGE.md` (applies to `0.1.x`)
- Changelog: `CHANGELOG.md`

## Dependencies

This tool is built on top of:
- `selectolax` (HTML parsing + CSS selectors)
- `PyYAML` (YAML spec parsing)

They are declared as install dependencies, so `pip` will install them automatically.

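## Install (recommended: venv)

From a checkout of the repository, create a virtual environment and install the package in editable mode with the `dev` extra (which adds `pytest`):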

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e ".[dev]"
```

## Install via pip

### From GitHub

```bash
python -m pip install "git+https://github.com/aimscrape/selectolax-tree.git"
```

### From PyPI

```bash
python -m pip install selectolax-tree
```

## YAML Spec (minimal)

```yaml
fields:
  title:
    css: "h1"
    text: true

  link_hrefs:
    css: "a"
    list: true
    attr: "href"

  items:
    css: ".item"
    list: true
    fields:
      name: { css: ".name", text: true }
      url:  { css: "a", attr: "href" }
```

## Python usage

```python
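from selectolax_tree import extract_from_yaml

html = "<h1>Hello</h1>"  # HTML document as a string
yaml_spec_str = 'fields:\n  title:\n    css: "h1"\n    text: true\n'  # YAML spec as a string

data = extract_from_yaml(html, yaml_spec_str)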
```

## CLI

```bash
selectolax-tree --spec spec.yml --html-file page.html
```

## Examples

Runnable scenario examples live in `example/`:

```bash
selectolax-tree --spec example/article/spec.yml --html-file example/article/page.html
selectolax-tree --spec example/product_list/spec.yml --html-file example/product_list/page.html
selectolax-tree --spec example/profiles/spec.yml --html-file example/profiles/page.html
```

## Releasing (maintainers)

See `RELEASING.md` for PyPI publishing and GitHub Actions release automation.
