Metadata-Version: 2.1
Name: lst
Version: 0.5.0
Summary: Declarative Scraping Library
Home-page: https://github.com/andriystr/LST
Author: Andriy Stremeluk
Author-email: astremeluk@gmail.com
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: bs4
Requires-Dist: monapy
Requires-Dist: requests
Requires-Dist: soupsieve

LST
===

Declarative web scraping library built on top of `monapy`.

Describe scraping as a chain of data transformations rather than as control flow.

Installation
------------

```sh
pip install lst
```

Quick Example
-------------

```python
from lst import Fetch, Scan

# Define the scraper chain
parser = (
    Fetch()
    << Scan('.pagination a')  # Feedback loop: found links are sent back to Fetch
    >> Scan('.article a')     # Extract article links
    >> Fetch()                # Fetch each article page
    >> Scan('.title').text()  # Extract title text
)

for title in parser('https://example.com'):
    print(title)
```

Configuration and Argument Priority
-----------------------------------

`lst` passes configuration through the entire chain as keyword arguments (`**kwargs`). Arguments can be supplied at two levels:

- **Global Arguments**: Passed when you call the parser (e.g., `parser(url, timeout=10)`). These flow through all steps.
- **Step Arguments**: Passed directly to a step's constructor (e.g., `Fetch(timeout=5)`).
- **Priority**: Step-specific arguments have **higher priority**. A constructor argument overrides a global argument if both are provided.
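
The priority rule can be pictured as a dictionary merge in which the step's own arguments are applied last. This is a minimal sketch of the rule only, not `lst`'s actual code; `resolve_kwargs` is a hypothetical helper introduced for illustration.

```python
def resolve_kwargs(global_kwargs, step_kwargs):
    """Hypothetical helper: merge call-time and constructor arguments."""
    # Later entries win, so constructor (step) arguments override global ones.
    return {**global_kwargs, **step_kwargs}


# Mirrors parser(url, timeout=10, headers=...) combined with Fetch(timeout=5).
effective = resolve_kwargs(
    {"timeout": 10, "headers": {"Accept": "text/html"}},
    {"timeout": 5},
)
print(effective)  # {'timeout': 5, 'headers': {'Accept': 'text/html'}}
```

Global arguments that a step does not override, such as `headers` here, still reach that step unchanged.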

Core Components
---------------

#### Fetch
Performs HTTP requests and handles link extraction.
- **Inputs**: Accepts a URL string or a `bs4.Tag` (it automatically extracts `href` from `<a>` tags).
- **Arguments**: Supports all standard `requests.request` arguments (headers, proxies, params, etc.) except `url`, which is provided by the chain.
- **User Agent**: The `user_agent` parameter can be a string or a callable. A callable may take either the current `url` as its only argument or no arguments at all.
- **Outputs**: Yields `requests.Response` objects.
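
The two callable forms accepted by `user_agent` might look like the following. This is an illustrative sketch: the function names and the agent strings are placeholders, not values defined by `lst`.

```python
import itertools
from urllib.parse import urlparse

# Placeholder user-agent strings, not values used by lst.
_AGENTS = itertools.cycle(["ExampleBot/1.0", "ExampleBot/2.0"])


def rotating_agent():
    """Zero-argument form: return the next agent on every call."""
    return next(_AGENTS)


def agent_for(url):
    """One-argument form: choose an agent from the URL being fetched."""
    host = urlparse(url).hostname or ""
    return "ExampleBot-API/1.0" if host.startswith("api.") else "ExampleBot/1.0"
```

Either function could then be supplied as `Fetch(user_agent=rotating_agent)` or `Fetch(user_agent=agent_for)`.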

##### Error Handling (`on_error`)

You can handle request failures using `on_error` (in the constructor) or `fetch_on_error` (as a global argument). The handler receives `(exception, request, session)`.

A handler must do one of two things:
- **Recover**: Return a `requests.Response` instance. You can use the provided `request` and `session` to retry or return a fallback.
- **Abort**: Raise an exception to stop execution.
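
A handler that tries to recover first and aborts only after repeated failures could be sketched as below. It assumes that `request` is a prepared request that `requests.Session.send()` accepts; the text above does not state the exact type, so treat that as an assumption. `make_retry_handler` is a hypothetical helper, not part of `lst`.

```python
import time


def make_retry_handler(retries=2, delay=1.0):
    """Hypothetical factory for a handler that retries before giving up."""

    def handler(exception, request, session):
        last_error = exception
        for _ in range(retries):
            time.sleep(delay)
            try:
                # Assumption: `request` is a prepared request that
                # requests.Session.send() accepts.
                return session.send(request)  # Recover: hand back a Response
            except Exception as error:  # deliberately broad in this sketch
                last_error = error
        raise last_error  # Abort: re-raise to stop execution

    return handler
```

Such a handler could be passed as `Fetch(on_error=make_retry_handler())`, or as `fetch_on_error=make_retry_handler()` when calling the parser.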

#### Scan

Parses HTML content using `BeautifulSoup`.
- **Inputs**: Accepts `requests.Response`, `bs4.Tag`, `str`, or `bytes`.
- **Outputs**: Yields `bs4.Tag` objects by default.
- **Selectors**: Supports CSS selectors or custom filter functions.

Transformations and Types
-------------------------

`Scan` supports at most one terminal transformation. Applying a transformation changes what is passed to the next step:
- No transformation: Produces `bs4.Tag` instances.
- `.text()`: Produces a string (inner text of the element). Supports `separator` and `strip` arguments.
- `.attr(name)`: Produces the value of the specified attribute.

_Note: Once a transformation is applied, the chain passes strings/values. You cannot follow up with another `Scan` that expects HTML tags._

Operators
---------

- `>>` **(Forward Bind)**: Passes produced values to the next step.
- `<<` **(Feedback Bind)**: Sends values back to a previous step, enabling recursion and pagination.
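
The feedback idea can be illustrated without any network access as a work queue: values produced by a later step re-enter an earlier one until nothing new appears. The sketch below is plain Python, not `lst` or `monapy` code. The `PAGES` table and `crawl` function are hypothetical, and the `seen` set is this sketch's own safeguard against revisiting pages; the processing order in `lst` may differ.

```python
from collections import deque

# Hypothetical site: each page maps to (pagination links, article titles).
PAGES = {
    "/page/1": (["/page/2"], ["A", "B"]),
    "/page/2": (["/page/3", "/page/1"], ["C"]),
    "/page/3": ([], ["D"]),
}


def crawl(start):
    queue = deque([start])
    seen = {start}
    while queue:  # stops once no step produces new values
        url = queue.popleft()
        next_links, titles = PAGES[url]  # stands in for Fetch + Scan
        for link in next_links:  # feedback: links return to the first step
            if link not in seen:
                seen.add(link)
                queue.append(link)
        yield from titles  # forward: values move on to the next step


print(list(crawl("/page/1")))  # ['A', 'B', 'C', 'D']
```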

Under the Hood: The Iterative Model
-----------------------------------

The library's design is based on the `monapy` execution model, which dictates how data moves through the chain:

- **Iterator-Based Execution**: Everything in `lst` works on iterators. Each step is a generator that receives a value and yields new values.

- **Lazy Evaluation**: The chain is "lazy." No requests are made or HTML parsed until you actually iterate over the parser (e.g., in a `for` loop).

- **Memory Efficiency**: Because it uses generators, `lst` processes items one by one. This allows you to scrape vast amounts of data without high memory consumption.

- **Continuous Flow**: In a feedback loop (`<<`), execution continues automatically until no more new values are produced by any step in the cycle.
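
Lazy, one-at-a-time execution is ordinary Python generator behaviour, which the following sketch demonstrates with stand-in steps. `fake_fetch` and `extract_title` are hypothetical placeholders for `Fetch` and `Scan`, and the `log` list exists only to show when work happens.

```python
log = []


def fake_fetch(urls):
    for url in urls:
        log.append(f"fetch {url}")
        yield f"<h1>{url}</h1>"


def extract_title(pages):
    for page in pages:
        log.append("parse")
        yield page[4:-5]  # strip the surrounding <h1>...</h1>


pipeline = extract_title(fake_fetch(["a", "b"]))
print(log)    # [] because building the chain does no work

first = next(pipeline)
print(first)  # a
print(log)    # ['fetch a', 'parse'] because only the first item was processed
```

The second URL is not touched until the consumer asks for the next value.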

For more details on the underlying principles, refer to the `monapy` documentation.

License
-------

MIT
