Metadata-Version: 2.4
Name: scraplet-dsl
Version: 0.1.0
Summary: Safe, data-driven URL adapter engine for structured content extraction
License: Apache-2.0
License-File: LICENSE
Keywords: adapter-engine,content-extraction,crawler,etl,html-extraction,scraping
Author: Andrey Baksalyar
Author-email: andreybaksalyar@gmail.com
Requires-Python: >=3.11,<3.14
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: cssselect (>=1.2.0,<2.0.0)
Requires-Dist: dateparser (>=1.2.0,<2.0.0)
Requires-Dist: httpx (>=0.27.0,<0.28.0)
Requires-Dist: lxml (>=5.2.1,<6.0.0)
Project-URL: homepage, https://github.com/Baksalyar/Scraplet-DSL
Project-URL: repository, https://github.com/Baksalyar/Scraplet-DSL
Description-Content-Type: text/markdown

# Scraplet DSL

Scraplet DSL is a safe, data-driven adapter engine for resolving URLs into structured outputs without executing arbitrary code.

It is designed as a reusable package that can be embedded in other applications (news readers, crawlers, ETL tools, feed resolvers).

![Scraplet DSL](https://raw.githubusercontent.com/Baksalyar/Scraplet-DSL/main/scraplet_dsl_512.png)

## Features

- URL trigger matching by domain and path regex
- Deterministic step pipeline with strict schema validation
- HTML extraction with CSS selectors (`lxml` + `cssselect`)
- Regex extraction and replacement operations
- JSON extraction with dot-path and array wildcard support
- Date helper steps (`datetime`, `parse_date`)
- Pluggable HTTP fetcher and HTML parser interfaces for testability
- Lazy HTML parser construction: `ScriptEngine()` does not import `lxml`/`cssselect` until a `select` step actually runs

## Step Types

- `fetch`: GET URL into variables (`save_body_as`, optional `save_status_as`, optional `save_final_url_as`, optional `retry`, optional `retry_backoff`, optional `timeout`)
- `select`: CSS select text/html/attribute from HTML; `mode` selects `first` (default) or `all` matches
- `regex`: search/findall over variable content, with optional numeric or named capture-group selection and optional `flags` (`i`/`m`/`s`)
- `replace`: regex replace in an input variable, with optional `flags` (`i`/`m`/`s`)
- `assign`: assign literal or templated value
- `assert`: fail if variable is missing/empty
- `set_url`: rewrite `input_url` (and derived `input_host`/`input_path`/`input_query`) for subsequent steps
- `datetime`: write current datetime with custom format
- `parse_date`: parse human dates into normalized format, with optional `month_names` for locale-specific month names (bypasses `dateparser`)
- `json`: parse JSON and extract by path (`items[0].name`, `items[*].name`); scalar values (`int`/`float`/`bool`) are preserved, and invalid array access on non-lists resolves to `""`
- `output`: produce final output map

### URL rewriting

`set_url` and `replace` (when `output = "input_url"`) keep the derived URL parts
in sync. After rewriting `input_url`, the variables `input_host`, `input_path`,
and `input_query` are re-derived from the new URL, so later steps and `output`
templates see consistent values.

`set_url` is URL-aware:

- A value containing `${input_url}` is treated as path manipulation: the
  prefix/suffix around the reference is appended to the base path, the base
  query string is preserved, and the fragment is dropped.
  `${input_url}/rss` against `https://example.com/news?c=1` becomes
  `https://example.com/news/rss?c=1`.
  `archive${input_url}/rss` becomes
  `https://example.com/archive/news/rss?c=1`.
- A value with a scheme (e.g. `https://other.example/feed`) is treated as an
  absolute override.
- A relative value without `${input_url}` (e.g. `/feed`) is joined onto the
  current `input_url` via `urljoin`, with the base's query and fragment
  stripped.
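
The three cases above can be approximated with `urllib.parse`. This is an illustrative sketch, not the library's implementation: `rewrite_url` and `derive_parts` are hypothetical names, and details such as how prefix slashes are normalized or whether `input_host` includes a port are assumptions.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

PLACEHOLDER = "${input_url}"


def rewrite_url(base: str, value: str) -> str:
    """Approximate the three `set_url` cases (hypothetical helper)."""
    parts = urlsplit(base)
    if PLACEHOLDER in value:
        # Path manipulation: wrap the base path, keep the query, drop the fragment.
        prefix, suffix = value.split(PLACEHOLDER, 1)
        path = parts.path
        if prefix:
            # Assumption: the prefix becomes a leading path segment.
            path = "/" + prefix.strip("/") + path
        return urlunsplit((parts.scheme, parts.netloc, path + suffix, parts.query, ""))
    if urlsplit(value).scheme:
        # Absolute override: the value already carries its own scheme.
        return value
    # Relative value: join onto the base with its query and fragment stripped.
    bare_base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return urljoin(bare_base, value)


def derive_parts(url: str) -> dict[str, str]:
    """Re-derive the variables that follow `input_url`."""
    parts = urlsplit(url)
    return {
        "input_url": url,
        "input_host": parts.netloc,  # assumption: host as written, port included
        "input_path": parts.path,
        "input_query": parts.query,
    }


base = "https://example.com/news?c=1"
print(rewrite_url(base, "${input_url}/rss"))         # https://example.com/news/rss?c=1
print(rewrite_url(base, "archive${input_url}/rss"))  # https://example.com/archive/news/rss?c=1
print(derive_parts(rewrite_url(base, "/feed")))
```

Re-deriving the parts from the rewritten URL in one place is what keeps later steps from seeing a stale `input_path` or `input_query`.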

### Regex safety

Adapter regexes are compiled with a conservative safety guard.

- `regex.pattern`, `replace.pattern`, and `url_triggers.path_patterns` reject
  nested repeated subpatterns such as `^(a+)+$`, which are a common source of
  catastrophic backtracking in Python's `re` engine. Possessive quantifiers
  (`a++`) and atomic groups (`(?>...)`) suppress backtracking and are accepted.
- The guard is enforced during adapter load for bundle-defined adapters, and on
  first execution for step instances created directly in Python.
- This is a heuristic safe-subset check, not a complete ReDoS analyser: it
  targets the nested-quantifier shape and does not catch every backtracking
  bomb (e.g. ambiguous alternation like `(a|a)+`). Keep adapter regexes simple
  and treat third-party-sourced adapters as untrusted input unless you review
  them.
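
To make the nested-quantifier shape concrete, here is a deliberately crude scanner for it. It is a sketch under stated assumptions, not the guard in `regex_utils.py`: `has_nested_repeat` is a hypothetical name, only `+` and `*` are treated as repeats (counted `{m,n}` forms are ignored), and character-class handling is simplified.

```python
def has_nested_repeat(pattern: str) -> bool:
    """Return True for a quantified group that itself contains `+` or `*`."""
    stack: list[list[bool]] = []  # per open group: [is_atomic, saw_repeat]
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if in_class:
            in_class = ch != "]"  # simplified: first `]` closes the class
        elif ch == "[":
            in_class = True
        elif ch == "(":
            stack.append([pattern.startswith("?>", i + 1), False])
        elif ch == ")" and stack:
            is_atomic, saw_repeat = stack.pop()
            risky = saw_repeat and not is_atomic
            if risky and pattern[i + 1 : i + 2] in ("+", "*"):
                return True
            if risky and stack:
                stack[-1][1] = True  # the repeat also sits inside the parent group
        elif ch in "+*" and stack:
            if pattern[i + 1 : i + 2] == "+":
                i += 2  # possessive quantifier (`a++`): no backtracking
                continue
            stack[-1][1] = True
        i += 1
    return False


for candidate in (r"^(a+)+$", r"^(a++)+$", r"^(?>a+)+$", r"^(a|a)+$"):
    print(candidate, has_nested_repeat(candidate))
```

Only the first pattern is flagged. The last one passes even though it can still backtrack badly, which is the limitation the text above describes.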

### JSON extraction typing

`json` step results preserve the underlying JSON scalar types. A path like
`items[*].id` returns a list of `int`/`float`/`bool` values as they appear in
the source, not their stringified form. Nested arrays are preserved as nested
lists.

When a path applies `[index]` or `[*]` to a non-list value, the step returns an
empty string rather than serializing the object at that path.
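
The typing rules can be illustrated with a small path walker built on the stdlib `json` module. This is a sketch, not the step's actual code: `extract` is a hypothetical name, and returning `""` for a missing key or an out-of-range index is an assumption that goes beyond what is stated above.

```python
import json
import re

# A key name, or a bracketed index / wildcard such as [0] or [*].
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+|\*)\]")


def extract(document: str, path: str):
    tokens = [(m.group(1), m.group(2)) for m in _TOKEN.finditer(path)]
    return _walk(json.loads(document), tokens)


def _walk(node, tokens):
    if not tokens:
        return node  # scalars, lists, and objects are returned untouched
    (key, index), rest = tokens[0], tokens[1:]
    if key is not None:
        # Assumption: a missing key also resolves to "".
        if not isinstance(node, dict) or key not in node:
            return ""
        return _walk(node[key], rest)
    if not isinstance(node, list):
        return ""  # [index] or [*] applied to a non-list value
    if index == "*":
        return [_walk(item, rest) for item in node]
    position = int(index)
    # Assumption: an out-of-range index also resolves to "".
    return _walk(node[position], rest) if position < len(node) else ""


doc = '{"items": [{"id": 1, "ok": true}, {"id": 2.5, "ok": false}], "meta": {"k": "v"}}'
print(extract(doc, "items[*].id"))  # [1, 2.5]
print(extract(doc, "items[0].ok"))  # True
print(repr(extract(doc, "meta[0]")))  # ''
```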

### Datetime step timezone

The `datetime` step emits timestamps from `datetime.now(UTC)` (with
`from datetime import UTC, datetime`), i.e. a tz-aware UTC value. This avoids
the cross-timezone correctness footgun of the previous naive `datetime.now()`,
which silently used the host's local time. To include the UTC offset in the
formatted string, use the `%z` directive: for example,
`format = "%Y-%m-%dT%H:%M:%S%z"` produces a trailing `+0000`. `%Z` yields the
zone name `UTC` instead.
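
The difference between the aware and naive forms is easy to see with plain stdlib calls. This snippet only illustrates the `strftime` behavior; it does not go through the `datetime` step itself.

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%dT%H:%M:%S%z"

# timezone.utc is the same object as datetime.UTC (the alias added in Python 3.11).
aware = datetime.now(timezone.utc)
naive = datetime.now()  # host-local wall clock, no tzinfo attached

print(aware.strftime(FORMAT))  # ends with +0000
print(naive.strftime(FORMAT))  # no offset: %z renders as an empty string
```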

## Installation

Scraplet DSL supports Python 3.11, 3.12, and 3.13. The 3.11 floor is
intentional: the loader uses the stdlib `tomllib` module, which was added in
Python 3.11. CI runs format, lint, tests, build, and installed-wheel smoke
checks on all supported Python versions.

### From PyPI

```bash
pip install scraplet-dsl
```

This installs the library with the default `httpx`-based HTTP fetcher and the
`lxml`-based HTML parser. Both are runtime dependencies, so a fresh
`pip install` is enough to start resolving URLs.

### Local editable install (for development)

```bash
git clone <your-fork-or-mirror-url>
cd Scraplet-DSL
pip install -e .
```

If you use Poetry, `poetry install` works the same way.

## For Library Users

Once installed, the minimal end-to-end flow is:

```python
from scraplet_dsl import ScriptEngine, load_adapter_bundle
from scraplet_dsl.engine import select_adapter

url = "https://example.com/news/123"  # the URL to resolve

bundle = load_adapter_bundle(...)  # placeholder; see "Adapter Bundle Example" below
adapter = select_adapter(bundle.adapters, url)
if adapter is None:
    raise RuntimeError("No adapter matched the URL")

result = ScriptEngine().resolve(adapter, url)
print(result.output)
```

The bundle is a plain Python dict (or a TOML file loaded the same way); its
schema is described in [Adapter Schema](#adapter-schema-bundle-mode).

## Adapter Bundle Example

A runnable end-to-end example, defining one adapter in a Python dict bundle
and resolving a URL through it:

```python
from scraplet_dsl import load_adapter_bundle, ScriptEngine
from scraplet_dsl.engine import select_adapter

bundle = load_adapter_bundle(
    {
        "schema_version": 1,
        "adapters": [
            {
                "name": "example_article",
                "priority": 10,
                "url_triggers": {"domains": ["example.com"], "path_patterns": [r"^/news/"]},
                "steps": [
                    {"type": "fetch", "url_var": "input_url", "save_body_as": "html", "retry": 2, "retry_backoff": 0.5, "timeout": 20},
                    {"type": "select", "html_var": "html", "selector": "h1", "output": "title"},
                    {"type": "output", "output": {"title": "${title}"}},
                ],
            }
        ],
    }
)

url = "https://example.com/news/123"
adapter = select_adapter(bundle.adapters, url)
if adapter is None:
    raise RuntimeError("No adapter matched the URL")

result = ScriptEngine().resolve(adapter, url)
print(result.output)
```

## Adapter Schema (bundle mode)

Top-level keys:

- `schema_version`: must be integer `1` (`true`/`false` are rejected)
- `adapters`: list of adapter definitions

Adapter keys:

- `name`: unique adapter name (duplicates are rejected at load time)
- `priority`: integer priority; lower value wins when multiple adapters match
- `url_triggers.domains`: non-empty list of domains
- `url_triggers.path_patterns`: optional list of valid regex path filters (compiled during load and rejected if they use blocked nested-repeat forms)
- `steps`: ordered list of step tables
- `headers`: optional table of HTTP headers attached to every `fetch` the adapter issues (validated as a `str -> str` map; values override the engine-level fetcher headers)

Selected numeric validation rules:

- `fetch.retry`: integer `>= 0`
- `fetch.retry_backoff`: number `>= 0` (defaults to `0.5`; retries sleep `retry_backoff * attempt_number` seconds)
- `fetch.timeout`: number `> 0`; when omitted, the active fetcher's default is used (`HttpxFetcher` defaults to `20.0` seconds). This caps a single HTTP request and is separate from the adapter-wide resolution deadline (see [Resolution Deadline](#resolution-deadline)).
- `replace.count`: integer `>= 0`
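
One subtlety behind these rules is that `bool` is a subclass of `int` in Python, so a plain `isinstance(value, int)` check would accept `true`. The sketch below shows one way such checks could be written; the function names are hypothetical, and rejecting booleans for the `fetch` fields (not only for `schema_version`) is an assumption.

```python
def is_strict_int(value: object) -> bool:
    # isinstance(True, int) is True, so booleans must be excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def is_number(value: object) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_schema_version(bundle: dict) -> list[str]:
    version = bundle.get("schema_version")
    if not is_strict_int(version) or version != 1:
        return ["schema_version: must be integer 1"]
    return []


def check_fetch_step(step: dict) -> list[str]:
    errors = []
    if "retry" in step and not (is_strict_int(step["retry"]) and step["retry"] >= 0):
        errors.append("fetch.retry: must be an integer >= 0")
    if "retry_backoff" in step and not (
        is_number(step["retry_backoff"]) and step["retry_backoff"] >= 0
    ):
        errors.append("fetch.retry_backoff: must be a number >= 0")
    if "timeout" in step and not (is_number(step["timeout"]) and step["timeout"] > 0):
        errors.append("fetch.timeout: must be a number > 0")
    return errors


print(check_schema_version({"schema_version": True}))
print(check_fetch_step({"retry": -1, "retry_backoff": 0.5, "timeout": 0}))
```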

## Error Model

- `ScrapletError`: base class for all errors below
- `ScriptValidationError`: invalid schema or step declaration
- `ExecutionError`: runtime step failure
- `MissingDependencyError`: optional dependency missing at runtime

Runtime errors raised through `ScriptEngine.resolve` include the adapter and step context.

## Resolution Deadline

`ScriptEngine.resolve` accepts an optional `timeout=` keyword argument that
caps the total wall-clock budget of a single resolve call:

```python
result = engine.resolve(adapter, url, timeout=10.0)
```

When `timeout` is set, an internal monotonic deadline is propagated through
`ExecutionContext`:

- The engine checks the deadline before each step, so a runaway adapter aborts
  in bounded time with an `ExecutionError` referencing the step that was
  skipped.
- `FetchStep` checks the deadline before each fetch attempt and refuses to
  start another attempt if no budget is left for its retry backoff. The
  retry-backoff sleep is also capped to the remaining budget, so a slow or
  wedged fetch cannot exhaust the budget while sleeping between retries.
- A step that does not respect the deadline (e.g. a misbehaving custom step)
  is still bounded: subsequent steps and retries will see the deadline
  already exceeded and abort.

`timeout` must be `> 0` when given. The default (`None`) preserves the prior
behavior: no deadline is propagated and `ExecutionContext.deadline` stays
unset.
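
The interplay between retries, backoff, and the deadline can be sketched as a retry loop over a monotonic clock. This is an illustration, not `FetchStep`: `fetch_with_retry`, `DeadlineExceeded`, and `FakeClock` are hypothetical names, and numbering attempts from 1 and treating `OSError` as the retryable failure are assumptions.

```python
import time


class DeadlineExceeded(RuntimeError):
    """Hypothetical stand-in for the library's ExecutionError."""


def fetch_with_retry(attempt, *, retry, retry_backoff, deadline=None,
                     clock=time.monotonic, sleep=time.sleep):
    last_error = None
    for attempt_number in range(1, retry + 2):  # one initial try plus `retry` retries
        if deadline is not None and clock() >= deadline:
            raise DeadlineExceeded(f"deadline exceeded before attempt {attempt_number}")
        try:
            return attempt()
        except OSError as exc:  # assumption: transport failures surface as OSError
            last_error = exc
        if attempt_number > retry:
            break
        pause = retry_backoff * attempt_number
        if deadline is not None:
            remaining = deadline - clock()
            if remaining <= 0:
                raise DeadlineExceeded("no budget left for retry backoff")
            pause = min(pause, remaining)  # never sleep past the deadline
        sleep(pause)
    raise last_error


class FakeClock:
    """Deterministic clock so the example does not actually sleep."""

    def __init__(self):
        self.now = 0.0
        self.sleeps = []

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.sleeps.append(seconds)
        self.now += seconds


def always_fails():
    raise ConnectionError("connection refused")


clock = FakeClock()
try:
    fetch_with_retry(always_fails, retry=3, retry_backoff=0.5, deadline=1.25,
                     clock=clock, sleep=clock.sleep)
except DeadlineExceeded as exc:
    print(exc, clock.sleeps)  # deadline exceeded before attempt 3 [0.5, 0.75]
```

The second backoff would have been `1.0` seconds, but only `0.75` seconds of budget remained, so the sleep is shortened and the third attempt is never started.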

## Network Hardening (`HttpxFetcher`)

The default `HttpxFetcher` enforces a small security policy on every request:

- **Scheme allowlist.** Only `http://` and `https://` URLs are accepted. Unsupported
  schemes (`ftp://`, `file://`, ...) are rejected before the client opens a
  socket. The same check is applied to every redirect `Location`.
- **Domain allowlist.** When `allow_domains=...` is configured, the host of the
  initial URL **and** every redirect target is validated against the allowlist
  *before* the request is issued. Disallowed hosts are never contacted.
- **Private-network blocking.** Private, loopback, link-local, multicast,
  reserved, and unspecified addresses are allowed by default for backward
  compatibility. Set `block_private_networks=True` to reject IP literals and
  hostnames that resolve to those address ranges before any request is issued.
- **Redirect cap.** Up to `max_redirects` (default `5`) redirects are followed.
  Beyond that, the fetch fails with `httpx.HTTPError("too many redirects")`.
- **Streaming size cap.** Requests are issued in streaming mode, and response
  bodies are read in chunks via `iter_bytes()`. The fetch is aborted as soon as
  more than `max_bytes` (default `2_000_000`) bytes are accumulated, so
  oversized responses cannot be fully buffered into memory.
- **UTF-8 only.** Bodies that fail UTF-8 decoding are rejected rather than
  silently producing mojibake.
- **HTTP status codes are data, not errors.** A non-2xx response (404, 410,
  500, ...) is returned as a normal `FetchResult` carrying its status, body,
  and headers. Only transport-level and policy failures (unsupported scheme,
  disallowed host, redirect cap, oversized body, invalid UTF-8, timeout/connect
  error) raise. Use `fetch.save_status_as` plus an `assert` or output template
  if an adapter needs to branch on the status.

These guarantees apply to the final response (after redirects) as well as every
intermediate hop.
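
The streaming size cap and the strict UTF-8 rule can be sketched over any iterable of byte chunks, which is the shape `iter_bytes()` produces. `read_capped` and `BodyTooLarge` are hypothetical names for this illustration, not part of `HttpxFetcher`.

```python
from collections.abc import Iterable


class BodyTooLarge(RuntimeError):
    """Hypothetical error type for this sketch."""


def read_capped(chunks: Iterable[bytes], max_bytes: int = 2_000_000) -> str:
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            # Stop pulling chunks as soon as the cap is crossed.
            raise BodyTooLarge(f"response body exceeds {max_bytes} bytes")
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError instead of
    # being replaced with placeholder characters.
    return bytes(buffer).decode("utf-8")


consumed = 0


def endless_chunks():
    """Simulate a server that never stops sending data."""
    global consumed
    while True:
        consumed += 1
        yield b"x" * 1024


try:
    read_capped(endless_chunks(), max_bytes=4096)
except BodyTooLarge as exc:
    print(exc, "after", consumed, "chunks")  # aborted after 5 chunks
```

Decoding once at the end, rather than per chunk, avoids false failures when a multi-byte character is split across two chunks.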

Important: without `allow_domains=...`, `HttpxFetcher` allows any valid
`http://` or `https://` host. If adapters or input URLs can come from untrusted
third parties, configure `allow_domains` and consider enabling
`block_private_networks=True`:

```python
from scraplet_dsl.http import HttpxFetcher

fetcher = HttpxFetcher(
    allow_domains=("example.com",),
    block_private_networks=True,
)
```

Private-network blocking resolves hostnames before each request and redirect.
This is a useful SSRF guard, but it is not a complete network sandbox; deploy
network-level egress controls for high-risk untrusted adapter execution.
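
A per-hop policy check of this kind can be sketched with the stdlib `ipaddress` and `urllib.parse` modules. This is not `HttpxFetcher`'s code: `check_url`, `PolicyError`, and the injectable `resolver` are hypothetical, and exact-host allowlist matching is an assumption.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


class PolicyError(ValueError):
    """Hypothetical error type for this sketch."""


def is_non_public(address: str) -> bool:
    ip = ipaddress.ip_address(address)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def resolve_host(host: str) -> list[str]:
    try:
        ipaddress.ip_address(host)
        return [host]  # already an IP literal, nothing to resolve
    except ValueError:
        return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def check_url(url, allow_domains=None, block_private_networks=False,
              resolver=resolve_host) -> None:
    """Raise PolicyError if the URL violates the policy; call once per hop."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise PolicyError(f"unsupported scheme: {parts.scheme!r}")
    host = (parts.hostname or "").lower()
    if not host:
        raise PolicyError("URL has no host")
    # Assumption: exact host match against the allowlist.
    if allow_domains is not None and host not in allow_domains:
        raise PolicyError(f"host not in allowlist: {host}")
    if block_private_networks:
        for address in resolver(host):
            if is_non_public(address):
                raise PolicyError(f"{host} resolves to non-public address {address}")


for candidate in ("file:///etc/passwd", "http://127.0.0.1/admin", "http://169.254.169.254/"):
    try:
        check_url(candidate, block_private_networks=True)
    except PolicyError as exc:
        print("rejected:", exc)
```

Because the check and the actual connection resolve the hostname separately, a DNS answer can change between the two, which is one reason the text above recommends network-level egress controls as well.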

If a `fetch` step omits `timeout`, the active fetcher's default is used. The
default `HttpxFetcher` uses `20.0` seconds per request. This caps an individual
HTTP request and is distinct from the adapter-wide resolution deadline passed
to `ScriptEngine.resolve(timeout=...)` (see [Resolution Deadline](#resolution-deadline)).

## Development

```bash
poetry install
poetry run pytest
```

## License

Scraplet DSL is distributed under the Apache License 2.0. See `LICENSE` for details.

External contributions are accepted under the contribution terms in `CONTRIBUTING.md`.

## Compatibility Policy

- `scraplet-dsl` is pre-1.0 and may take breaking package/API changes in minor
  releases when that helps the project move faster
- Adapter-schema breaking changes must be deliberate and must bump
  `schema_version`
- User-visible changes and breakages should be recorded in `CHANGELOG.md`

## Project Layout

- `src/scraplet_dsl/engine.py`: adapter matching and execution
- `src/scraplet_dsl/loader.py`: schema parsing and validation
- `src/scraplet_dsl/steps.py`: step implementations
- `src/scraplet_dsl/http.py`: fetcher protocol and default `httpx` fetcher
- `src/scraplet_dsl/html.py`: parser protocol and `lxml` implementation
- `src/scraplet_dsl/variables.py`: variable store and template helpers
- `src/scraplet_dsl/types.py`: shared types (`Value`, `Variables`, `ResolutionResult`)
- `src/scraplet_dsl/errors.py`: error hierarchy
- `src/scraplet_dsl/regex_utils.py`: regex flag parsing and the nested-repeat safety guard

