Metadata-Version: 2.3
Name: typed-soup
Version: 0.1.5
Summary: A type-safe wrapper around BeautifulSoup and related HTML parsing utilities
License: MIT
Keywords: beautifulsoup,html,parsing,type-safe,scrapy,type-hints,static-typing,mypy,pyright,web-scraping,html-parsing,xml-parsing
Author: Robert Shecter
Author-email: robert@public.law
Requires-Python: >=3.10
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Typing :: Typed
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Dist: beautifulsoup4
Requires-Dist: scrapy
Requires-Dist: toolz
Project-URL: Documentation, https://github.com/public-law/typed-soup#readme
Project-URL: Homepage, https://github.com/public-law/typed-soup
Project-URL: Issues, https://github.com/public-law/typed-soup/issues
Project-URL: Repository, https://github.com/public-law/typed-soup.git
Description-Content-Type: text/markdown

# typed-soup

A type-safe wrapper around BeautifulSoup and utilities for parsing HTML/XML with robust return types and error handling. Extracted from [Open-Gov Crawlers](https://github.com/public-law/open-gov-crawlers).

## Motivation

This is an example [from production code](https://github.com/public-law/open-gov-crawlers/blob/d7ad31081a88cec0e48bd51e06a4d0cc6039abec/public_law/parsers/gbr/cpr_glossary.py#L128-L143).

### Before

<p align="center">
  <img src="https://raw.githubusercontent.com/public-law/typed-soup/master/before.jpg" width="75%" alt="Before">
</p>

Here are the first five of the 16 errors that Pyright reports for this code:

```
  error: Type of "rows" is partially unknown
    Type of "rows" is "list[PageElement | Tag | NavigableString] | Unknown" (reportUnknownVariableType)
  error: Type of "find_all" is partially unknown
    Type of "find_all" is "Unknown | ((name: str | bytes | Pattern[str] | bool | ((Tag) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((Tag) -> bool)] | ElementFilter | None = None, attrs: Dict[str, str | bytes | Pattern[str] | bool | ((str) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((str) -> bool)]] = {}, recursive: bool = True, string: str | bytes | Pattern[str] | bool | ((str) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((str) -> bool)] | None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: str | bytes | Pattern[str] | bool | ((str) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((str) -> bool)]) -> ResultSet[PageElement | Tag | NavigableString])" (reportUnknownMemberType)
  error: Cannot access attribute "find_all" for class "PageElement"
    Attribute "find_all" is unknown (reportAttributeAccessIssue)
  error: Cannot access attribute "find_all" for class "NavigableString"
    Attribute "find_all" is unknown (reportAttributeAccessIssue)
  error: Type of "row" is partially unknown
    Type of "row" is "PageElement | Tag | NavigableString | Unknown" (reportUnknownVariableType)
```

### After

Switching out `BeautifulSoup` for `TypedSoup` provides type knowledge to the checker and IDE:

<p align="center">
  <img src="https://raw.githubusercontent.com/public-law/typed-soup/refs/heads/master/after.jpg" width="75%" alt="After">
</p>

## Installation

```bash
pip install typed-soup
```

## Quick Start

```python
from typed_soup import TypedSoup
from bs4 import BeautifulSoup

# Create a type-safe soup object
soup = TypedSoup(BeautifulSoup("<div>Hello <span>World</span></div>", "html.parser"))

# Find elements with type safety
element = soup.find("span")
if element:
    print(element.get_text())  # Type-safe: IDE knows this returns str
```


## Usage

If you're using Scrapy, you can use the `from_response` function to create a `TypedSoup` object from a Scrapy response:

```python
from typed_soup import from_response
from scrapy.http.response.html import HtmlResponse

# Assume 'response' is an HtmlResponse object
soup = from_response(response)

# Find an element
element = soup.find("div", class_="example")
if element:
    print(element.get_text())

# Find all elements
elements = soup("p")
for elem in elements:
    print(elem.get_text())
```

Or, without Scrapy, you can explicitly wrap a `BeautifulSoup` object in `TypedSoup`:

```python
from typed_soup import TypedSoup
from bs4 import BeautifulSoup

soup = TypedSoup(BeautifulSoup(html_content, "html.parser"))
```


## Supported Functions

I'm adding functions as I need them. If you have a request, please open an issue. These are the ones that I needed for [a dozen spiders](https://github.com/public-law/open-gov-crawlers):

- `find`
- `find_all`
- `__call__` (implicit find_all, e.g. `soup("p")` - standard BeautifulSoup API)
- `get_text`
- `children`
- `tag_name`
- `parent`
- `next_sibling`
- `get_content_after_element`
- `string`

And then these help create a `TypedSoup` object:

- `from_response`
- `TypedSoup`

## Type Safety Benefits

- All methods return properly typed results
- No more `None` surprises - optional values are properly typed and described in the function signatures
- IDE autocomplete support for all methods
- Static type checking support with mypy/pyright
- Runtime type validation for BeautifulSoup results
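
To make the idea behind these points concrete, here is a minimal sketch of the general wrapper pattern. It is not typed-soup's actual implementation: `TypedNode` is a hypothetical class, and it wraps the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without any dependencies. It shows how narrow return types (`TypedNode | None`, `list[TypedNode]`, `str`) combined with an `isinstance` check in the constructor give both static and runtime guarantees.

```python
import xml.etree.ElementTree as ET


class TypedNode:
    """Hypothetical typed wrapper, for illustration only (not typed-soup code)."""

    def __init__(self, element: ET.Element) -> None:
        # Runtime validation: refuse anything that is not an Element.
        if not isinstance(element, ET.Element):
            raise TypeError(f"expected Element, got {type(element).__name__}")
        self._element = element

    def find(self, name: str) -> "TypedNode | None":
        """Return the first descendant with the given tag, or None."""
        found = self._element.find(f".//{name}")
        return TypedNode(found) if found is not None else None

    def find_all(self, name: str) -> "list[TypedNode]":
        """Return every descendant with the given tag, always as a list."""
        return [TypedNode(e) for e in self._element.findall(f".//{name}")]

    def get_text(self) -> str:
        """Return the concatenated text of this node and its descendants."""
        return "".join(self._element.itertext())


root = TypedNode(ET.fromstring("<div>Hello <span>World</span></div>"))

# The signature says find() may return None, so the checker requires this guard.
span = root.find("span")
if span is not None:
    print(span.get_text())
```

Because every method returns either the wrapper itself, a list of wrappers, or a plain `str`, a type checker never sees a wide union such as `PageElement | Tag | NavigableString`, and the only case the caller has to handle is the explicit `None`.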

## License

This project is licensed under the MIT License.

