Metadata-Version: 2.4
Name: inklog
Version: 0.1.2
Summary: Analyst-friendly scrapers with TUI and uvx support
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: aiofiles>=24.1.0
Requires-Dist: httpx>=0.28.1
Requires-Dist: rich>=14.3.2
Requires-Dist: textual>=7.5.0
Requires-Dist: typer>=0.21.1
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.24.0; extra == 'dev'
Requires-Dist: pytest>=8.3.4; extra == 'dev'
Requires-Dist: ruff>=0.8.4; extra == 'dev'
Description-Content-Type: text/markdown

# Inklog

Inklog is a collection of uvx-friendly scrapers for analysts. It provides a CLI for automation and a TUI (work in progress) for browsing available scrapers. The current focus is downloading files (PDF/EPUB); webpage-to-Markdown scraping is planned but not yet implemented.

All dates in this repository use the ISO 8601 format: `YYYY-MM-DD`.

## Quick start

### Run a scraper (CLI)

```bash
uvx inklog run malaysia_parliament 2025-11-01 2025-11-30
```

### List scrapers

```bash
uvx inklog list
```

### Show example URLs

```bash
uvx inklog info malaysia_parliament
```

### Run the TUI

```bash
uvx inklog
```

### Run a single document type

```bash
uvx inklog run malaysia_parliament:lower_house_hansard 2025-11-01 2025-11-30
```

## TUI

The TUI is a work in progress. It shows the available scrapers, their descriptions, and lets you run a scraper with date ranges and options.
Each Malaysia Parliament document type appears as its own row for quick runs.

Key bindings:

- `Enter` or the `Run` button: run the selected scraper
- `Q`: quit

Example URLs can be opened from the details pane.

## Malaysia Parliament scraper

This scraper downloads PDFs from the Malaysia Parliament site for the following document types:

- Dewan Rakyat - Jawapan Lisan (Oral Answers)
- Dewan Rakyat - Jawapan Bukan Lisan (Non-Oral Answers)
- Dewan Rakyat - Penyata Rasmi (Official Report)
- Dewan Negara - Jawapan Bertulis (Written Answers)

### Filename rules

Some sources change their filename patterns over the years. Inklog models this using date-bound filename rules, so each document type can define a list of templates tied to start/end dates. The Malaysia Parliament scraper ships with the current pattern and a rules table that can be extended when historical patterns are known.

## CLI options

`inklog run` accepts `--set name=value` to override a scraper's boolean options. Repeat the flag once per option:

```bash
uvx inklog run malaysia_parliament 2025-11-01 2025-11-30 \
  --set jawapan_lisan_rakyat=true \
  --set jawapan_bukan_lisan_rakyat=true \
  --set penyata_rasmi_rakyat=true \
  --set jawapan_bertulis_negara=false
```
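
For readers curious how `name=value` pairs like these could be turned into booleans, here is a minimal sketch. The `parse_overrides` function is hypothetical, and accepting only `true` and `false` (case-insensitive) is an assumption; Inklog's real parsing may differ.

```python
def parse_overrides(pairs: list[str]) -> dict[str, bool]:
    """Turn ["a=true", "b=false"] into {"a": True, "b": False}."""
    overrides: dict[str, bool] = {}
    for pair in pairs:
        name, sep, raw = pair.partition("=")
        name = name.strip()
        if not sep or not name:
            raise ValueError(f"expected name=value, got {pair!r}")
        value = raw.strip().lower()
        # Assumption: only the literal strings "true" and "false" are valid.
        if value not in ("true", "false"):
            raise ValueError(f"{name}: expected true or false, got {raw!r}")
        overrides[name] = value == "true"
    return overrides


print(parse_overrides(["jawapan_lisan_rakyat=true", "jawapan_bertulis_negara=false"]))
```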

## Adding a new scraper

1) Copy `src/inklog/scrapers/template.py` and rename it.
2) Update `ScraperMeta` and implement `run()`.
3) Ensure the module exports `SCRAPER`.

Scrapers are auto-discovered from the `inklog.scrapers` package.
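
The steps above might produce a module shaped roughly like the sketch below. Only the names `ScraperMeta`, `run()`, and `SCRAPER` come from this README; the `ScraperMeta` fields, the `run()` signature, and the `ExampleScraper` class are assumptions made so the example is self-contained. Check `template.py` for the real definitions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScraperMeta:
    """Stand-in for Inklog's ScraperMeta; the real fields may differ."""

    name: str
    description: str


class ExampleScraper:
    """Hypothetical scraper; the real run() signature may differ."""

    meta = ScraperMeta(name="example_source", description="Downloads example PDFs")

    def run(self, start: date, end: date) -> list[str]:
        # A real scraper would download files here. This one only reports
        # the date range it was asked to cover.
        if end < start:
            raise ValueError("end date must not precede start date")
        return [f"{self.meta.name}: {start.isoformat()} to {end.isoformat()}"]


# Step 3: the module-level export (assumed here to be an instance).
SCRAPER = ExampleScraper()

print(SCRAPER.run(date(2025, 11, 1), date(2025, 11, 30)))
```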

## Development

```bash
uv sync --extra dev
uv run ruff check --fix .
uv run ruff format .
uv run pytest
```

## Roadmap

- Add Markdown scraping mode (TBD).
- Expand filename rules for historical Malaysia Parliament patterns.
