Metadata-Version: 2.4
Name: snakyscraper
Version: 1.1.0
Summary: A lightweight, Pythonic web scraping toolkit built on BeautifulSoup and Requests.
Author-email: Rio Agung Purnomo <hi@ioodev.my.id>
License: MIT
Project-URL: Homepage, https://github.com/ioodev/snakyscraper
Project-URL: Repository, https://github.com/ioodev/snakyscraper
Project-URL: Issues, https://github.com/ioodev/snakyscraper/issues
Project-URL: Changelog, https://github.com/ioodev/snakyscraper/blob/main/CHANGELOG.md
Keywords: scraping,web-scraping,beautifulsoup,html-parser,metadata,open-graph,seo
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Typing :: Typed
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25
Requires-Dist: beautifulsoup4>=4.10
Provides-Extra: lxml
Requires-Dist: lxml>=4.9; extra == "lxml"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: responses>=0.23; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Dynamic: license-file

# 🐍 SnakyScraper

**SnakyScraper** is a lightweight, Pythonic web scraping toolkit built on top of [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and [Requests](https://docs.python-requests.org/). It gives you a clean interface for pulling structured HTML and metadata out of a web page (titles, Open Graph tags, headings, links, images, and elements matched by tag, class, or ID selectors), with predictable, JSON-friendly return values.

> Fast. Accurate. Snake-style scraping. 🐍🎯

**Language:** [English](#-features) | [Indonesia](#-bahasa-indonesia)

---

## 📋 Table of Contents

- [Features](#-features)
- [Installation](#-installation)
- [Quick Start](#️-quick-start)
- [Handling Errors](#-handling-errors)
- [API Reference](#-api-reference)
- [Custom DOM Filtering](#-custom-dom-filtering)
- [Project Structure](#-project-structure)
- [Development](#-development)
- [Changelog](#-changelog)
- [Bahasa Indonesia](#-bahasa-indonesia)

---

## 🚀 Features

- ✅ Extract metadata: title, description, keywords, author, charset, canonical URL, and more
- ✅ Built-in support for Open Graph, Twitter Card, and CSRF tags
- ✅ Extract HTML structures: `h1`–`h6`, `p`, `ul`, `ol`, images, links
- ✅ Powerful `filter()` method with tag, class, and ID-based selectors
- ✅ Parse raw HTML directly with `html=` — no network call required
- ✅ Explicit error handling: inspect `.error` / `.status_code`, or opt into exceptions with `raise_on_error=True`
- ✅ Custom headers, timeout, and `requests.Session` reuse for real-world scraping
- ✅ Fully type-hinted, ships with `py.typed` (PEP 561) for IDE & mypy support
- ✅ No bare `except:` blocks; HTTP 404/500 responses are reported as errors instead of passing through silently
- ✅ Powered by BeautifulSoup4 and Requests — no heavyweight dependencies

---

## 📦 Installation

```bash
pip install snakyscraper
```

Optional extras:

```bash
# Faster HTML parsing via lxml
pip install "snakyscraper[lxml]"

# Development tools (pytest, mypy, build, twine)
pip install "snakyscraper[dev]"
```

> Requires Python 3.8 or later.

---

## 🛠️ Quick Start

```python
from snakyscraper import SnakyScraper

scraper = SnakyScraper("https://example.com")

if scraper.ok():
    print(scraper.title())          # "Example Domain"
    print(scraper.description())    # meta description, or None
    print(scraper.h1())             # ["Example Domain"]
    print(scraper.open_graph())     # {"og:title": ..., "og:image": ..., ...}
else:
    print("Failed:", scraper.error)
```
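
Because the accessors return plain strings, lists, and dicts, a page summary can be serialized directly. The sketch below is illustrative only: `summarize` and `to_json` are hypothetical helpers (not part of SnakyScraper), and they assume nothing beyond an object exposing the `ok()`, `error`, `title()`, `description()`, `h1()`, and `open_graph()` members shown above.

```python
import json
from typing import Any, Dict


def summarize(scraper: Any) -> Dict[str, Any]:
    """Collect a few fields from a scraper-like object into a plain dict.

    Hypothetical helper, not part of SnakyScraper. It relies only on the
    accessors used in the Quick Start example.
    """
    if not scraper.ok():
        # Keep the failure visible instead of returning half-empty data.
        return {"ok": False, "error": str(scraper.error)}
    return {
        "ok": True,
        "title": scraper.title(),
        "description": scraper.description(),
        "h1": scraper.h1(),
        "open_graph": scraper.open_graph(),
    }


def to_json(scraper: Any) -> str:
    """Serialize the summary; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(summarize(scraper), ensure_ascii=False, indent=2)
```

With the real class, this would be called as `to_json(SnakyScraper("https://example.com"))`.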

### Parsing HTML you already have (no network call)

Useful for tests, cached pages, or HTML obtained from a headless browser:

```python
scraper = SnakyScraper(html="<title>Hello</title><h1>Hi there</h1>")
scraper.title()  # "Hello"
scraper.h1()     # ["Hi there"]
```
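
For the cached-page case, one approach is to read the saved file and hand its contents to `html=`. This is a sketch under assumptions: `parse_cached` and its `make_scraper` parameter are hypothetical names introduced here; in real use `make_scraper` would be the `SnakyScraper` class itself.

```python
from pathlib import Path
from typing import Any, Callable


def parse_cached(
    path: str, make_scraper: Callable[..., Any], encoding: str = "utf-8"
) -> Any:
    """Build a scraper from HTML saved on disk, without touching the network.

    Hypothetical helper. `make_scraper` is any callable that accepts the
    `html=` keyword, for example the SnakyScraper class.
    """
    # errors="replace" avoids a crash when the saved bytes do not match the
    # assumed encoding; undecodable bytes become U+FFFD.
    text = Path(path).read_text(encoding=encoding, errors="replace")
    return make_scraper(html=text)
```

Usage would look like `parse_cached("cache/page.html", SnakyScraper).title()`, where the file path is a placeholder.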

### Custom headers, timeout, and session reuse

```python
import requests
from snakyscraper import SnakyScraper

session = requests.Session()
scraper = SnakyScraper(
    "https://example.com",
    timeout=15,
    headers={"Accept-Language": "id-ID,en;q=0.8"},
    session=session,  # reuse connections/cookies across multiple scrapes
)
```
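
To reuse one session across many pages, the same keyword arguments can be passed for every URL. The helper below is a sketch, not library code: `scrape_with_shared_session` and `make_scraper` are hypothetical names, and the keyword names (`timeout`, `headers`, `session`) are taken from the constructor call above.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def scrape_with_shared_session(
    urls: Iterable[str],
    make_scraper: Callable[..., Any],
    session: Any,
    headers: Optional[Dict[str, str]] = None,
    timeout: float = 15,
) -> List[Any]:
    """Create one scraper per URL, all sharing the same session object.

    Hypothetical helper: `make_scraper` stands in for the SnakyScraper class
    and `session` for a `requests.Session`.
    """
    return [
        make_scraper(url, timeout=timeout, headers=headers, session=session)
        for url in urls
    ]
```

In real use this would be `scrape_with_shared_session(urls, SnakyScraper, session)`. Since `requests.Session` is a context manager, creating it with `with requests.Session() as session:` closes the pooled connections once the batch is done.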

---

## ⚠️ Handling Errors

By default, SnakyScraper **never raises**: failures (invalid URL, network error, HTTP 4xx/5xx, parse failure) are captured on the instance and exposed through `.error` and `.status_code` instead of being raised, so a single bad URL in a batch job won't crash the whole run.

```python
scraper = SnakyScraper("https://example.com/this-page-does-not-exist")

scraper.ok()           # False
scraper.status_code    # 404
scraper.error          # HTTPStatusError("'...' returned HTTP 404.")
```
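
The batch-job scenario could be handled roughly as follows. This is a minimal sketch: `scrape_titles` and `make_scraper` are hypothetical names, and the code assumes only the `ok()`, `title()`, `.status_code`, and `.error` members described in this README.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

# (url, status code, captured exception)
Failure = Tuple[str, Optional[int], Optional[BaseException]]


def scrape_titles(
    urls: Iterable[str], make_scraper: Callable[[str], Any]
) -> Tuple[Dict[str, Optional[str]], List[Failure]]:
    """Split a batch into successful titles and recorded failures.

    Hypothetical helper; `make_scraper` stands in for the SnakyScraper class.
    """
    titles: Dict[str, Optional[str]] = {}
    failures: List[Failure] = []
    for url in urls:
        scraper = make_scraper(url)
        if scraper.ok():
            titles[url] = scraper.title()
        else:
            # status_code may be None, e.g. when no request was made.
            failures.append((url, scraper.status_code, scraper.error))
    return titles, failures
```

Called as `scrape_titles(urls, SnakyScraper)`, every URL is attempted and the failures can be logged or retried afterwards.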

If you'd rather fail fast (e.g. while developing), pass `raise_on_error=True`:

```python
from snakyscraper import SnakyScraper, HTTPStatusError, InvalidURLError, FetchError

try:
    scraper = SnakyScraper("https://example.com/missing", raise_on_error=True)
except HTTPStatusError as e:
    print("Server returned an error status:", e.status_code)
except InvalidURLError:
    print("That URL is malformed.")
except FetchError as e:
    print("Network problem:", e)
```

**Exception hierarchy:**

```
SnakyScraperError
├── InvalidURLError     # bad/missing/non-http(s) URL
├── FetchError          # network failure (timeout, DNS, connection refused, ...)
│   └── HTTPStatusError # non-2xx response (has .status_code)
└── ParseError          # HTML could not be parsed
```
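
Because `HTTPStatusError` sits under `FetchError`, a handler for `FetchError` also matches HTTP status errors, so the more specific check has to come first. The sketch below uses local stand-in classes that mirror the tree above; they are illustrative, not the library's source, and the `(message, status_code)` constructor signature is an assumption.

```python
from typing import Optional


class SnakyScraperError(Exception):
    """Stand-in base class (illustrative, not the library's source)."""


class InvalidURLError(SnakyScraperError):
    pass


class FetchError(SnakyScraperError):
    pass


class HTTPStatusError(FetchError):
    def __init__(self, message: str, status_code: Optional[int] = None) -> None:
        super().__init__(message)
        # Assumed signature; the README only states that .status_code exists.
        self.status_code = status_code


class ParseError(SnakyScraperError):
    pass


def classify(exc: SnakyScraperError) -> str:
    """Order matters: test the subclass before its parent."""
    if isinstance(exc, HTTPStatusError):
        return f"http {exc.status_code}"
    if isinstance(exc, FetchError):
        return "network"
    if isinstance(exc, InvalidURLError):
        return "invalid url"
    return "other"
```

The same ordering applies to `except` clauses: placing `except FetchError` before `except HTTPStatusError` would make the second clause unreachable.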

---

## 📖 API Reference

### 🔹 Status

| Member | Returns | Description |
|---|---|---|
| `ok()` | `bool` | `True` if the page was fetched and parsed successfully |
| `.error` | `Exception \| None` | The exception captured during construction, if any |
| `.status_code` | `int \| None` | HTTP status code of the response, if a request was made |

### 🔹 Page Metadata

| Method | Returns | Notes |
|---|---|---|
| `title()` | `str \| None` | |
| `description()` | `str \| None` | |
| `keywords()` | `list[str] \| None` | |
| `keyword_string()` | `str \| None` | |
| `charset()` | `str \| None` | Reads both `<meta charset>` and legacy `http-equiv` forms |
| `canonical()` | `str \| None` | |
| `content_type()` | `str \| None` | |
| `author()` | `str \| None` | |
| `csrf_token()` | `str \| None` | Checks the meta tag, then a hidden input |
| `image()` | `str \| None` | Shortcut for `og:image` |
| `viewport()` | `list[str] \| None` | |
| `viewport_string()` | `str \| None` | |

### 🔹 Open Graph & Twitter Card

```python
scraper.open_graph()              # dict of common og:* properties
scraper.open_graph("og:title")    # a single property

scraper.twitter_card()                  # dict of common twitter:* properties
scraper.twitter_card("twitter:title")   # a single property
```
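
A common use of these values is a link preview that falls back to the regular page metadata when an Open Graph property is missing. The helper below is a hypothetical sketch: it assumes `open_graph()` returns a dict keyed by full property names (as in the Quick Start output), and that `og:description` is among the properties it may contain.

```python
from typing import Any, Dict, Optional


def preview_card(scraper: Any) -> Dict[str, Optional[str]]:
    """Build link-preview fields, preferring Open Graph over plain metadata.

    Hypothetical helper, not part of SnakyScraper.
    """
    # Tolerate a None result as well as missing or empty properties.
    og = scraper.open_graph() or {}
    return {
        "title": og.get("og:title") or scraper.title(),
        "description": og.get("og:description") or scraper.description(),
        "image": og.get("og:image"),
    }
```

The same pattern would work for `twitter_card()` by swapping in the `twitter:*` property names.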

### 🔹 Headings & Text

```python
scraper.h1()  # list[str]
scraper.h2()
scraper.h3()
scraper.h4()
scraper.h5()
scraper.h6()
scraper.p()
```

### 🔹 Lists

```python
scraper.ul()  # flattened text of every <li> in every <ul>
scraper.ol()  # flattened text of every <li> in every <ol>
```

### 🔹 Images

```python
scraper.images()         # ["/img/1.jpg", "/img/2.jpg", ...]
scraper.image_details()  # [{"url": ..., "alt_text": ..., "title": ...}, ...]
```

### 🔹 Links

```python
scraper.links()         # list of href strings (anchors with no href are skipped)
scraper.link_details()  # list of dicts: url, protocol, text, title, target, rel, is_nofollow, ...
```
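
The `images()` example shows site-relative paths such as `/img/1.jpg`, so it may be useful to resolve results against the page URL before storing or following them. The sketch below uses only the standard library; `absolutize` and `http_links` are hypothetical helpers, and the assumption is that `images()` and `links()` return URLs as written in the page.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlsplit


def absolutize(base_url: str, urls: Iterable[str]) -> List[str]:
    """Resolve relative URLs against the page URL; absolute URLs pass through."""
    return [urljoin(base_url, url) for url in urls]


def http_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Keep only http(s) targets, dropping schemes such as mailto:."""
    resolved = absolutize(base_url, hrefs)
    return [url for url in resolved if urlsplit(url).scheme in ("http", "https")]
```

For example, `absolutize("https://example.com/blog/post", scraper.images())` would turn `/img/1.jpg` into `https://example.com/img/1.jpg`.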

---

## 🔍 Custom DOM Filtering

Use `filter()` to target specific elements and optionally pull nested content out of them.

#### ▸ Single element

```python
scraper.filter(
    element="div",
    attributes={"id": "main"},
    multiple=False,
    extract=[".title", "#description", "p"],
)
```

#### ▸ Multiple elements

```python
scraper.filter(
    element="div",
    attributes={"class": "card"},
    multiple=True,
    extract=["h3", ".subtitle", "#meta"],
)
```

> Each `extract` entry is a tag name (`"h3"`), a class selector (`.title`, returned under the key `class__title`),
> or an ID selector (`#meta`, returned under the key `id__meta`).

#### ▸ Clean text instead of raw HTML

```python
scraper.filter(
    element="p",
    attributes={"class": "dark-text"},
    multiple=True,
    return_html=False,
)
```
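
The key naming described in the `extract` note above can be expressed as a small function, which is handy when reading values back out of a result. This is a sketch of the documented mapping, not SnakyScraper's implementation: `extract_key` is a hypothetical helper, and the key used for a bare tag name is an assumption, since the note only documents the class and ID forms.

```python
def extract_key(selector: str) -> str:
    """Map an `extract` selector to the result key described in the note."""
    if selector.startswith("."):
        # ".title" -> "class__title"
        return "class__" + selector[1:]
    if selector.startswith("#"):
        # "#meta" -> "id__meta"
        return "id__" + selector[1:]
    # Assumption: a bare tag name is used as its own key.
    return selector
```

For `extract=["h3", ".subtitle", "#meta"]`, this yields the keys `h3` (assumed), `class__subtitle`, and `id__meta`.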

---

## 🗂 Project Structure

```
snakyscraper/
├── snakyscraper/
│   ├── __init__.py       # public API surface (SnakyScraper, exceptions, __version__)
│   ├── core.py           # SnakyScraper implementation
│   ├── exceptions.py     # SnakyScraperError and subclasses
│   ├── _version.py       # single source of truth for the version string
│   └── py.typed          # PEP 561 marker
├── tests/
│   ├── conftest.py       # shared fixtures (sample HTML pages)
│   ├── test_metadata.py  # title, og, twitter, charset, csrf, ...
│   ├── test_content.py   # headings, lists, images, links
│   ├── test_filter.py    # filter() DOM queries
│   └── test_fetching.py  # URL validation, HTTP mocking, error handling
├── examples/
│   └── basic_usage.py
├── pyproject.toml        # build system + project metadata + tool config
├── LICENSE
└── README.md
```

This split keeps the public API (`__init__.py`) thin, the implementation (`core.py`) self-contained, and error types (`exceptions.py`) reusable without importing the whole scraping engine — making the codebase easier to navigate and extend.

---

## 🧑‍💻 Development

```bash
git clone https://github.com/ioodev/snakyscraper.git
cd snakyscraper
pip install -e ".[dev]"

# Run the test suite (mocked HTTP, no real network calls)
pytest

# With coverage
pytest --cov=snakyscraper --cov-report=term-missing

# Type-check
mypy snakyscraper/

# Build distributable wheel/sdist
python -m build
```

### Contributing

Found a bug or want to request a feature? Open an [issue](https://github.com/ioodev/snakyscraper/issues) or submit a pull request.

---

## 📝 Changelog

See [CHANGELOG.md](./CHANGELOG.md) for the full version history. Highlights for **v1.1.0**:

- Restructured into a proper multi-module package (`core`, `exceptions`, `_version`)
- Fixed: HTTP error pages (404/500) no longer silently treated as successful
- Fixed: `charset()` now reads legacy `http-equiv` charset declarations
- Fixed: `link_details()` no longer breaks on anchors without `href`
- Fixed: `title()` now returns a clean `str` instead of a `NavigableString`
- Added: `html=` kwarg to parse raw HTML with no network call
- Added: typed exception hierarchy (`InvalidURLError`, `FetchError`, `HTTPStatusError`, `ParseError`)
- Added: `.error`, `.status_code`, `.ok()`, `raise_on_error=`, custom `headers=`/`session=`
- Added: full type hints + `py.typed`, full pytest suite (63 tests, 92% coverage)
- Renamed: project ownership moved from `riodevnet` to `ioodev`

---

## 📄 License

MIT License © 2025–2026 — [ioodev](https://ioodev.my.id)

---

## 🔗 Related Projects

- [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/)
- [Requests](https://docs.python-requests.org/)
- [lxml](https://lxml.de/)
- **NodeScraper** ([`@ioodev/nodescraper`](https://www.npmjs.com/package/@ioodev/nodescraper)) — the Node.js sibling of this library
- **ElephScraper** — the PHP sibling of this library

---

## 💡 Why SnakyScraper?

> Think of it as your Pythonic sniper — targeting HTML content with precision and elegance.

---

## 🇮🇩 Bahasa Indonesia

**SnakyScraper** adalah toolkit web scraping yang ringan dan Pythonic, dibangun di atas [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) dan [Requests](https://docs.python-requests.org/). Library ini menyediakan antarmuka yang bersih untuk mengambil HTML terstruktur dan metadata dari halaman web mana pun — judul, tag Open Graph, heading, link, gambar, hingga selector DOM khusus — dengan nilai kembalian yang konsisten dan ramah JSON.

### 🚀 Fitur

- Ekstraksi metadata: title, description, keywords, author, charset, canonical URL, dan lainnya
- Dukungan bawaan untuk Open Graph, Twitter Card, dan tag CSRF
- Ekstraksi struktur HTML: `h1`–`h6`, `p`, `ul`, `ol`, gambar, link
- Metode `filter()` yang fleksibel dengan selector tag, class, dan ID
- Bisa parsing HTML langsung lewat `html=` — tanpa perlu request ke jaringan
- Penanganan error yang jelas: cek `.error` / `.status_code`, atau aktifkan exception dengan `raise_on_error=True`
- Dukungan custom headers, timeout, dan reuse `requests.Session` untuk kebutuhan scraping nyata
- Type hints lengkap, sudah menyertakan `py.typed` (PEP 561) untuk dukungan IDE & mypy
- Tidak ada lagi blok `except:` kosong, tidak ada lagi halaman 404/500 yang lolos begitu saja

### 📦 Instalasi

```bash
pip install snakyscraper
```

Ekstra opsional:

```bash
# Parsing HTML lebih cepat dengan lxml
pip install "snakyscraper[lxml]"

# Tools development (pytest, mypy, build, twine)
pip install "snakyscraper[dev]"
```

> Membutuhkan Python 3.8 atau lebih baru.

### 🛠️ Penggunaan Dasar

```python
from snakyscraper import SnakyScraper

scraper = SnakyScraper("https://example.com")

if scraper.ok():
    print(scraper.title())
    print(scraper.description())
    print(scraper.h1())
    print(scraper.open_graph())
else:
    print("Gagal:", scraper.error)
```

### Parsing HTML yang sudah dimiliki (tanpa request jaringan)

```python
scraper = SnakyScraper(html="<title>Halo</title><h1>Selamat datang</h1>")
scraper.title()  # "Halo"
scraper.h1()     # ["Selamat datang"]
```

### ⚠️ Penanganan Error

Secara default, SnakyScraper **tidak pernah melempar exception** — kegagalan (URL tidak valid, error jaringan, HTTP 4xx/5xx, gagal parsing) ditangkap secara internal, sehingga satu URL bermasalah di dalam batch job tidak akan menghentikan seluruh proses.

```python
scraper = SnakyScraper("https://example.com/halaman-tidak-ada")

scraper.ok()           # False
scraper.status_code    # 404
scraper.error          # HTTPStatusError("'...' returned HTTP 404.")
```

Jika ingin error langsung dilempar sebagai exception (misalnya saat development), gunakan `raise_on_error=True`:

```python
from snakyscraper import SnakyScraper, HTTPStatusError, InvalidURLError, FetchError

try:
    scraper = SnakyScraper("https://example.com/halaman-tidak-ada", raise_on_error=True)
except HTTPStatusError as e:
    print("Server mengembalikan status error:", e.status_code)
except InvalidURLError:
    print("URL tidak valid.")
except FetchError as e:
    print("Masalah jaringan:", e)
```

### 🔍 Filtering DOM Khusus

```python
scraper.filter(
    element="div",
    attributes={"class": "card"},
    multiple=True,
    extract=["h3", ".subtitle", "#meta"],
)
```

> Selector `extract`: nama tag (`"h3"`), class (`.title` → key `class__title`),
> atau ID (`#meta` → key `id__meta`).

### 🧑‍💻 Development

```bash
git clone https://github.com/ioodev/snakyscraper.git
cd snakyscraper
pip install -e ".[dev]"

pytest                  # jalankan test suite
mypy snakyscraper/      # type-check
python -m build         # build wheel/sdist
```

### 📄 Lisensi

MIT License © 2025–2026 — [ioodev](https://ioodev.my.id)
