Metadata-Version: 2.4
Name: sitemap-verify
Version: 0.1.0
Summary: Validate sitemap XML files and inspect their discovered URLs.
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: aiometer>=1.0.0
Requires-Dist: feedparser>=6.0.12
Requires-Dist: httpx>=0.28.1
Requires-Dist: isodate>=0.7.2
Requires-Dist: lxml>=6.0.2
Requires-Dist: protego>=0.6.0
Requires-Dist: pydantic>=2.12.5
Requires-Dist: rfc3986>=2.0.0
Requires-Dist: xmlschema>=4.3.1
Description-Content-Type: text/markdown

# sitemap-verify

A Python 3.10+ CLI tool and library that validates sitemap protocol compliance and checks the reachability of discovered URLs.

Chinese documentation: `README.zh-CN.md`

## Features

- Async library API: `validate_target(...)`
- CLI command: `sitemap-verify check <target>`
- SQLite-backed runtime persistence for long-running validations
- Resume interrupted validations with `--resume-from <sqlite-file>`
- Supports sitemap inputs: XML `urlset`, `sitemapindex`, text sitemap, RSS, Atom
- Uses XSD validation (`xmlschema`) plus protocol semantic validation
- Recursively traverses sitemap indexes with depth/count safeguards
- URL reachability checks with SEO-oriented severity:
  - `2xx` => pass
  - `3xx` / `429` => `warn`
  - other `4xx` / `5xx` / network errors => `error`
- Unified `error` / `warn` diagnostics report with JSON output support

## Requirements

- Python 3.10+
- `uv` for environment and dependency management

## Quick Start

```bash
uv sync --dev
uv run sitemap-verify check path/to/sitemap.xml
```

Install from PyPI:

```bash
pip install sitemap-verify
```

Validate a remote sitemap URL:

```bash
uv run sitemap-verify check https://example.com/sitemap.xml --mode url --format json
```

Validate a domain (discover sitemaps from `robots.txt`, falling back to `/sitemap.xml`):

```bash
uv run sitemap-verify check example.com --mode domain
```
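
As a rough sketch of what domain discovery involves, the function below collects `Sitemap:` directives from a `robots.txt` body and falls back to `/sitemap.xml` when none are present. The name `discover_sitemaps`, the hand-rolled parsing, and the `https://` scheme for the fallback are assumptions for illustration; the package itself may do this differently (for example through its `protego` dependency).

```python
def discover_sitemaps(domain: str, robots_txt: str) -> list[str]:
    """Return sitemap URLs listed in robots.txt, or a default fallback.

    Hypothetical helper: parsing is deliberately simplified.
    """
    found: list[str] = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then split "Key: value" on the first colon.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    # Assumption: the fallback location is served over HTTPS.
    return found or [f"https://{domain}/sitemap.xml"]


robots = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/news.xml\n"
print(discover_sitemaps("example.com", robots))
print(discover_sitemaps("example.com", "User-agent: *\n"))
```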

Enable runtime logs, progress, and write output to a file:

```bash
uv run sitemap-verify check https://example.com/sitemap.xml \
  --mode url \
  --probe-method get \
  --format json \
  --output reports/result.json \
  --log-file logs/run.log \
  --verbose \
  --show-progress
```

Persist validation state to SQLite and resume after interruption:

```bash
uv run sitemap-verify check https://example.com/sitemap.xml \
  --mode url \
  --store reports/example-run.sqlite3

uv run sitemap-verify check https://example.com/sitemap.xml \
  --mode url \
  --resume-from reports/example-run.sqlite3
```

If `--store` is not provided, the CLI creates a timestamped SQLite file under `reports/`.
During resume, sitemap files are parsed again, but URL reachability checks are skipped when a
cached result already exists in the SQLite store.
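
The resume behaviour can be pictured with a minimal cache-then-probe loop over SQLite. The table name `url_results`, its columns, and the `check_with_cache` helper are hypothetical; the real store schema is not documented here.

```python
import sqlite3
from collections.abc import Callable, Iterable


def check_with_cache(
    conn: sqlite3.Connection,
    urls: Iterable[str],
    probe: Callable[[str], int],
) -> dict[str, int]:
    """Probe each URL unless a status is already stored (hypothetical sketch)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_results (url TEXT PRIMARY KEY, status INTEGER)"
    )
    results: dict[str, int] = {}
    for url in urls:
        row = conn.execute(
            "SELECT status FROM url_results WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            results[url] = row[0]  # cached result: skip the network probe
            continue
        status = probe(url)
        conn.execute(
            "INSERT INTO url_results (url, status) VALUES (?, ?)", (url, status)
        )
        conn.commit()  # commit per URL so an interruption loses little work
        results[url] = status
    return results


conn = sqlite3.connect(":memory:")
probed: list[str] = []


def fake_probe(url: str) -> int:
    probed.append(url)
    return 200


check_with_cache(conn, ["https://example.com/a"], fake_probe)
check_with_cache(conn, ["https://example.com/a", "https://example.com/b"], fake_probe)
print(probed)  # the second run only probes the URL that was not cached
```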

Reachability probe modes:

- `--probe-method get` (default): always use GET (recommended for sites that block or mis-handle HEAD)
- `--probe-method head`: HEAD only
- `--probe-method auto`: HEAD first, falling back to GET when HEAD returns a 4xx/5xx status other than 429 (this includes 405 and 501)
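
The `auto` mode's fallback rule can be sketched without any HTTP library by injecting a `send` callable that performs the request and returns a status code. Both `probe_auto` and `send` are hypothetical names; the real implementation presumably issues requests with `httpx`.

```python
from collections.abc import Callable


def probe_auto(send: Callable[[str], int]) -> tuple[str, int]:
    """Try HEAD, then retry with GET on a 4xx/5xx status other than 429.

    Returns the method whose result was kept, plus its status code.
    """
    status = send("HEAD")
    if status >= 400 and status != 429:
        # Covers 405/501 (HEAD unsupported) and any other HEAD failure.
        return "GET", send("GET")
    return "HEAD", status


def fake_send(method: str) -> int:
    # Simulates a server that rejects HEAD but serves GET normally.
    return 405 if method == "HEAD" else 200


print(probe_auto(fake_send))  # ('GET', 200)
```

A `429` from HEAD is kept as-is in this sketch, since retrying with GET against a rate-limited server would likely add load without changing the outcome.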

## Library Usage

```python
import asyncio

from sitemap_verify import validate_target


async def main() -> None:
    report = await validate_target(
        "https://example.com/sitemap.xml",
        mode="url",
        recursive=True,
        check_reachability=True,
        store_path="reports/example-run.sqlite3",
    )
    print(report.model_dump())


asyncio.run(main())
```

Optional persistence arguments:

- `store_path`: write validation state to a specific SQLite file
- `resume_from`: reopen an interrupted SQLite file and reuse existing URL reachability results

When `store_path` is omitted, `validate_target(...)` creates a timestamped SQLite file under
`reports/`.

## Development

Run the test suite:

```bash
uv run pytest
```

Run lint checks:

```bash
uv run ruff check .
```

## Project Structure

- `src/sitemap_verify/`: application package and CLI entrypoint
- `src/sitemap_verify/schemas/`: bundled XSD files used by the validator
- `tests/`: automated tests
- `docs/feat/`: feature planning notes
- `docs/agent-lessons/`: lessons learned from previously fixed agent mistakes
- `.github/`: GitHub workflows and collaboration templates

## GitHub Collaboration

- Bug reports and feature requests use issue templates under `.github/ISSUE_TEMPLATE/`
- Pull requests follow `.github/pull_request_template.md`
- CI runs lint and tests on pushes and pull requests

## Release Process

- Update `project.version` in `pyproject.toml`
- Commit the release changes to `main`
- Create and push a matching tag such as `v0.1.0`
- GitHub Actions builds the package with `uv`, runs tests, validates distributions, smoke-tests `pip install`, and publishes via PyPI Trusted Publishing

Example:

```bash
git tag v0.1.0
git push origin v0.1.0
```

## License

This project is licensed under the MIT License. See `LICENSE` for details.
