Metadata-Version: 2.4
Name: trawcsy
Version: 0.3.0
Summary: Tiny, zero-dependency crawler — fetch, parse, crawl, store, GUI. Works with existing apps.
Author: trawcsy
License-Expression: MIT
Keywords: crawler,scraper,html,cli,dataset
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.9
Description-Content-Type: text/markdown

# trawcsy

**Tiny, zero-dependency crawler.** Fetch, parse, crawl, sitemap, store, GUI. All stdlib.

```bash
pip install trawcsy

trawcsy https://example.com              # shorthand — JSONL to stdout
trawcsy page https://x.com -o data       # to file
trawcsy page https://x.com -f text       # plain text
trawcsy page https://x.com -f json       # JSON array
trawcsy crawl https://site.com --depth 2 # recursive (concurrent)
trawcsy crawl @urls.txt                  # URLs from file
trawcsy sitemap https://site.com/sitemap.xml  # from sitemap
trawcsy gui                              # graphical interface
trawcsy cat < data.jsonl                 # read + print text
```

## Library

```python
from trawcsy import (
    fetch, parse, parse_html,           # single page
    save, load, dumps, loads,           # JSONL
    dumps_json_array, save_json_array,  # JSON array
    crawl, crawl_urls,                  # crawling
    parse_sitemap,                      # sitemaps
)

# single page
page = parse("https://example.com")
print(page.title, len(page.text))

# recursive crawl (concurrent, configurable workers)
pages = crawl("https://docs.python.org/3/", depth=2, max_pages=10, workers=8)

# fetch URL list concurrently
pages = crawl_urls(["https://a.com", "https://b.com"], workers=5)

# from sitemap
urls = parse_sitemap("https://site.com/sitemap.xml")
for url in urls[:10]:
    save(parse(url), path="crawl.jsonl")

# JSON array output
save_json_array(pages, path="output.json")

# pipe-friendly
save(page)                                  # → stdout JSONL
for p in load("data.jsonl"):
    print(p.text[:200])
```
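
The sitemap loop above treats the result of `parse_sitemap` as a list of URL strings. As a rough illustration of what extracting those URLs involves, here is a stdlib-only sketch. It is not trawcsy's implementation: the helper name `extract_locs` and the sample XML are hypothetical, and it assumes a plain `<urlset>` document in the sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter(f"{SITEMAP_NS}loc")
        if el.text and el.text.strip()
    ]


# Hand-written sample; a real sitemap would be downloaded first.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://site.com/</loc></url>
  <url><loc>https://site.com/about</loc></url>
</urlset>"""

print(extract_locs(sample))
```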

### Page fields

| Field | Type | Content |
|-------|------|---------|
| `.url` | `str` | Source URL |
| `.title` | `str` | `<title>` text |
| `.text` | `str` | Visible page text (block-separated by newlines) |
| `.links` | `list[dict]` | `{"href": str, "text": str}` |
| `.tables` | `list[dict]` | `{"caption": str, "headers": list, "rows": list[list]}` |
| `.lists` | `list[dict]` | `{"tag": "ul" \| "ol", "items": list[str]}` |
| `.code` | `list[dict]` | `{"lang": str, "body": str}` |
| `.meta` | `dict` | Meta name/OG property → content |
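
Because `.tables` entries are plain dicts with `headers` and `rows`, they map directly onto the stdlib `csv` module. A minimal sketch, assuming the shape documented in the table above; the helper `table_to_csv` and the sample dict are hypothetical and stand in for one element of `page.tables`.

```python
import csv
import io


def table_to_csv(table: dict) -> str:
    """Serialize one table dict (headers + rows) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    if table.get("headers"):
        writer.writerow(table["headers"])
    writer.writerows(table.get("rows", []))
    return buf.getvalue()


# Hand-written stand-in for one element of page.tables.
sample_table = {
    "caption": "Releases",
    "headers": ["Version", "Year"],
    "rows": [["3.9", "2020"], ["3.12", "2023"]],
}

print(table_to_csv(sample_table))
```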

## CLI

```
trawcsy https://example.com              # shorthand JSONL to stdout
trawcsy page https://x.com -o data       # to file
trawcsy page https://x.com -f text       # plain text
trawcsy page https://x.com -f json       # JSON array
trawcsy crawl https://site.com --depth 2 --workers 10
trawcsy crawl @urls.txt                  # read URLs from file
trawcsy sitemap https://site.com/sitemap.xml
trawcsy gui                              # tkinter GUI
trawcsy cat < data.jsonl                 # read + print
trawcsy completion bash|zsh              # shell completion
trawcsy --version                        # → trawcsy 0.3.0

Options: --rate SEC, --timeout SEC, --workers N
```
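
Since the shorthand form writes JSONL to stdout, its output can be piped into any script that reads one JSON object per line. The sketch below shows such a filter. It assumes each record carries `url` and `text` keys mirroring the Page fields, which this README does not spell out for the serialized form; the function name `iter_long_pages` and the length threshold are hypothetical.

```python
import io
import json


def iter_long_pages(stream, min_chars: int = 100):
    """Yield (url, text length) for JSONL records whose text is long enough."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record.get("text", "")
        if len(text) >= min_chars:
            yield record.get("url", ""), len(text)


# In-memory stand-in for piped input (assumed record shape).
sample = io.StringIO(
    '{"url": "https://a.com", "text": "short"}\n'
    "\n"
    '{"url": "https://b.com", "text": "' + "x" * 150 + '"}\n'
)

for url, size in iter_long_pages(sample):
    print(f"{size}\t{url}")
```

In a real pipeline the same function would be given `sys.stdin` instead of the in-memory sample, for example `trawcsy https://example.com | python filter.py`, where `filter.py` is a hypothetical script name.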

## Why trawcsy

- **Zero dependencies** — stdlib only (`urllib`, `html.parser`, `json`, `xml`, `tkinter`, `concurrent.futures`)
- **~17KB wheel** — installs in <1 second
- **Concurrent** — crawls fetch pages in parallel through a thread pool with a configurable worker count
- **80 tests** — CLI, parser edge cases, concurrent crawl, URL normalization, JSON array, sitemap
- **Modular** — import only what you need
- **Composable** — stdin/stdout JSONL/JSON, works with any pipeline
- **GUI included** — `trawcsy gui` for interactive browsing
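
The concurrency point above rests on a thread pool. The following is a generic stdlib sketch of that pattern with `concurrent.futures`, not trawcsy's own code. `fetch_all` and `fake_fetch` are hypothetical names, and the fetch function is stubbed so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_one, workers: int = 5) -> dict:
    """Run fetch_one over urls in a thread pool; map each URL to its result or error."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one URL fails
                results[url] = exc
    return results


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP fetch (assumption: no network in this sketch)."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<html>{url}</html>"


out = fetch_all(["https://a.com", "https://b.com", "https://bad.example"], fake_fetch)
for url in sorted(out):
    print(url, "->", out[url])
```

Storing the exception instead of re-raising it lets one failing URL coexist with successful results, which is usually what a crawl over many pages wants.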

## License

MIT
