Metadata-Version: 2.4
Name: openstax-scraper
Version: 0.1.3
Summary: Scrape OpenStax math textbooks into AI-ready JSON (Markdown + LaTeX), with optional problem/solution pairs.
Author-email: Yoftahe Milkessa <yoftahemilkessa@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/yoftahe/openstax-scraper
Project-URL: Repository, https://github.com/yoftahe/openstax-scraper
Keywords: openstax,scraper,textbook,rag,llm,markdown,mathml,latex
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Education
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31
Requires-Dist: lxml>=5.0
Requires-Dist: markdownify>=0.12
Requires-Dist: jsonschema>=4.0
Requires-Dist: langdetect>=1.0.9
Requires-Dist: yake>=0.4.8
Requires-Dist: platformdirs>=4.0
Requires-Dist: anthropic>=0.40
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# openstax-scraper

Turn [OpenStax](https://openstax.org) textbooks into **AI-ready JSON**: one clean
Markdown document per content page — LaTeX math preserved, practice problems
inline, with rich metadata and quality signals — plus an optional mode that
harvests **problem ↔ solution pairs**. The output is newline-delimited JSON
(JSONL), ready to chunk and embed for retrieval-augmented generation (RAG),
fine-tuning datasets, search indexes, or analysis.

The package installs a single command-line tool, `scrape_openstax`, and is also
usable as a library (`import openstax_scraper`).

## Highlights

- **MathML → LaTeX** for the full element set OpenStax math books use, wrapped as
  `$…$` / `$$…$$`.
- **HTML → Markdown** that preserves LaTeX through Markdown escaping, references
  images by absolute URL (never downloads them), and normalizes whitespace.
- **Polite, cached fetching** — retry/backoff (429/5xx, `Retry-After`), per-host
  delay + jitter, a descriptive User-Agent, `robots.txt` enforcement, and a
  mandatory on-disk cache with a refresh interval (TTL).
- **Idempotent output** — upsert-by-`id` with atomic rewrite, so re-running on an
  unchanged book is a no-op and a changed page updates its line in place.
- **Schema-validated** — every record can be checked against bundled JSON Schemas
  before it is written (`--validate`).
- **Per-page error isolation** — one bad page never aborts a whole book.

## Installation

Requires **Python 3.10+**.

```bash
pip install openstax-scraper
```

Or from a clone, in editable mode with the dev tools (pytest, ruff, build, twine):

```bash
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
```

### Dependencies

All runtime dependencies install automatically with the package:

| Package | Why it's needed |
|---|---|
| [`requests`](https://pypi.org/project/requests/) | HTTP fetching |
| [`lxml`](https://pypi.org/project/lxml/) | HTML/MathML parsing |
| [`markdownify`](https://pypi.org/project/markdownify/) | HTML → Markdown conversion |
| [`jsonschema`](https://pypi.org/project/jsonschema/) | `--validate` against the output schemas |
| [`langdetect`](https://pypi.org/project/langdetect/) | `language` enrichment signal |
| [`yake`](https://pypi.org/project/yake/) | offline `keywords` extraction (no LLM, no network) |
| [`platformdirs`](https://pypi.org/project/platformdirs/) | resolves the per-user cache directory |
| [`anthropic`](https://pypi.org/project/anthropic/) | `--get-qa` solutions-manual discovery (an Anthropic API key is required only for that mode) |

## Usage

The tool has two modes:

1. **Scrape a whole textbook** into a single JSONL file with enriched metadata.
2. **Extract question & answer pairs** from a textbook into a separate JSONL file.

Any in-book page URL works as the seed — it is normalized to the book's preface,
from which the full table of contents is discovered.

### Scrape a textbook into JSONL

Output is `<output-dir>/<book>/page_contents.jsonl` (plus a `manifest.json`).

```bash
# Whole book
scrape_openstax \
  --book-url=https://openstax.org/books/calculus-volume-1/pages/preface \
  --output-dir ./out --validate

# Selected chapters only (comma-separated, no spaces)
scrape_openstax \
  --book-url=https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions \
  --chapters=1,2,3 --output-dir ./out --validate
```

Re-running is **idempotent**: an unchanged book is a no-op, a changed page
updates its line in place. Add `--dry-run` to crawl and validate without writing.

### Extract question-and-answer pairs

`--get-qa` is a separate mode that pairs each problem with its worked solution
into `<output-dir>/<book>/questions_and_answers.jsonl`. It covers the whole book
(ignores `--chapters`) and **needs an Anthropic API key** to discover the
solutions-manual pages.

The key is taken from `--anthropic-api-key=<key>` if given, otherwise from the
`ANTHROPIC_API_KEY` environment variable, which can be exported or loaded
automatically from a local `.env` file. If neither is set, the run exits with a
fatal config error.
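
The precedence can be pictured with a short sketch. This is an illustration only, not the package's code: `resolve_api_key` is a hypothetical name, `.env` loading is left out, and the exit code `2` is taken from the exit-code table further down.

```python
import os
import sys


def resolve_api_key(flag_value, environ=os.environ):
    """Return the flag value if given, else ANTHROPIC_API_KEY; exit 2 if neither."""
    key = flag_value or environ.get("ANTHROPIC_API_KEY")
    if not key:
        # One-line error instead of a traceback, then a fatal-config exit code.
        print("error: --get-qa needs an Anthropic API key", file=sys.stderr)
        raise SystemExit(2)
    return key


# Placeholder strings, not real keys.
from_flag = resolve_api_key("key-from-flag", {"ANTHROPIC_API_KEY": "key-from-env"})
from_env = resolve_api_key(None, {"ANTHROPIC_API_KEY": "key-from-env"})
print(from_flag, from_env)
```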

```bash
scrape_openstax \
  --book-url=https://openstax.org/books/chemistry-2e/pages/preface \
  --get-qa --output-dir ./out --validate
```

A book with no solutions manual produces no file and exits cleanly.

### Parse a single saved page (offline)

```bash
scrape_openstax \
  --from-file page.html \
  --url https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions \
  --output-dir ./out
```

### Use as a library

```python
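from pathlib import Path
from openstax_scraper.adapters.openstax import OpenStaxAdapter

url = "https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions"
html = Path("page.html").read_text(encoding="utf-8")  # a saved copy of that page

adapter = OpenStaxAdapter()
page = adapter.parse_page(url, html)   # -> a PageRecord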
```

## Command-line arguments

| Flag | Meaning |
|---|---|
| `--book-url=<URL>` | Any in-book page URL (`/books/<book>/pages/...`); **crawls that book** (normalized to its `/pages/preface`, whose TOC is discovered). Mutually exclusive with `--from-file`; one of the two is required. |
| `--from-file=<path>` | Parse a single local HTML file instead of crawling (offline). |
| `--url=<URL>` | The canonical URL to record when using `--from-file` (so `id`/`source` are right even though the bytes came from disk). |
| `--output-dir=<dir>` | Where to write output. Default `./out`, or `$OPENSTAX_OUTPUT_DIR` if set. Files: `<dir>/<book>/page_contents.jsonl` + `manifest.json` (or `questions_and_answers.jsonl` under `--get-qa`). |
| `--chapters=<csv>` | Restrict the crawl to one or more chapters, comma-separated with no spaces (e.g. `11` or `1,2,3`). Each keeps pages whose slug begins `<n>-` (e.g. `11-1`, `11-2`). Omit to crawl the whole book. |
| `--get-qa` | Collect problem/solution **pairs** into `questions_and_answers.jsonl` instead of crawling pages. Covers the whole book (ignores `--chapters`). Needs an Anthropic API key. A book with no solutions manual produces no file and exits cleanly. |
| `--anthropic-api-key=<key>` | Anthropic API key for `--get-qa`. **Takes precedence** over the `ANTHROPIC_API_KEY` environment variable (and `.env`); if omitted, that variable is the fallback. If neither is set, `--get-qa` exits with a fatal error. Ignored outside `--get-qa`. |
| `--delay=<sec>` | Politeness delay between requests to the same host (default `1.0`, plus jitter). |
| `--refresh-interval=<sec>` | How long a cached page stays fresh before it's re-fetched. Default `432000` (5 days). Caching is always on; see [Cache location](#cache-location). |
| `--no-robots` | Skip `robots.txt` consultation (default: obey it). |
| `--keywords=<mode>` | `heuristic` (default, offline `yake`) or `none`. |
| `--include-types=<csv>` | Keep only these `content_type`s (e.g. `textbook_section,chapter_intro`). Default: all. |
| `--validate` | Check every record against the bundled JSON Schemas **before** writing, and fail loudly if anything is off-contract. Recommended. |
| `--dry-run` | Crawl + parse + validate, but write nothing. Great for CI. |
| `--user-agent=<str>` | Override the polite identifying User-Agent sent on fetches. |
| `--log-level=<lvl>` | Logging verbosity (`DEBUG`/`INFO`/`WARNING`/…). Default `INFO`. |
| `-h`, `--help` | Print usage and exit. |
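
As an illustration of the `--chapters` rule (a page is kept when its slug begins `<n>-`), here is a minimal sketch. The function names are hypothetical and the slugs are examples, not a real table of contents. Note how the trailing hyphen stops chapter `1` from also matching chapter `11`.

```python
def parse_chapters(csv: str) -> list[str]:
    """Split a --chapters value such as "1,2,3" into chapter numbers."""
    return [part for part in csv.split(",") if part]


def in_selected_chapters(slug: str, chapters: list[str]) -> bool:
    """True if the slug starts with "<n>-" for any selected chapter n."""
    return any(slug.startswith(f"{n}-") for n in chapters)


# Example slugs only.
slugs = [
    "preface",
    "1-1-review-of-functions",
    "2-introduction",
    "11-1-parametric-equations",
]
selected = [s for s in slugs if in_selected_chapters(s, parse_chapters("1,2"))]
print(selected)  # ['1-1-review-of-functions', '2-introduction']
```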

## Output schema

Newline-delimited JSON, one file per book. The authoritative contract is the
bundled JSON Schemas in
[`src/openstax_scraper/schemas/`](src/openstax_scraper/schemas/).

**`page_contents.jsonl`** — one object per content page:

| field | meaning |
|---|---|
| `id` | `sha1(url)` — stable primary key (idempotency) |
| `url`, `title` | page URL and section title |
| `body_text` | cleaned **Markdown** with `$…$` / `$$…$$` LaTeX — the full page, **practice problems included inline** |
| `source` | `{site, book, book_title, chapter, section, page_slug}` |
| `content_type` | `textbook_section` \| `chapter_intro` \| `chapter_summary` \| `glossary` \| `reference` |
| `char_count`, `word_count`, `math_density`, `n_images`, `image_urls` | structural quality signals |
| `language`, `reading_time_min`, `keywords` | enrichment signals |
| `content_hash`, `fetched_at`, `scraper_version` | provenance / change-detection |

> **Why are problems kept inline rather than split into their own records?**
> OpenStax problems have no dependable structure — groups, sub-problems,
> irregular numbering — so a reliable split would require an LLM. Instead they
> are treated as ordinary page content: converted to Markdown and left inline in
> `body_text`, in reading order.
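
Because the output is plain JSONL, it can be consumed with the standard library alone. The sketch below reads a file and keeps the `body_text` of `textbook_section` records, keyed by `id`. The two sample records are hand-written stand-ins with only the fields used here, not real scraper output, and the helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_records(path: Path) -> list[dict]:
    """Read newline-delimited JSON: one object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def sections_by_id(records: list[dict]) -> dict[str, str]:
    """Map id -> body_text for records whose content_type is textbook_section."""
    return {
        r["id"]: r["body_text"]
        for r in records
        if r["content_type"] == "textbook_section"
    }


# Hand-written stand-in records (not real output).
sample = [
    {"id": "a1", "content_type": "textbook_section", "body_text": "The area is $\\pi r^2$."},
    {"id": "b2", "content_type": "glossary", "body_text": "**limit**: ..."},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "page_contents.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8")
    sections = sections_by_id(load_records(path))

print(sections)
```

From here each `body_text` can be handed to whatever chunking or embedding step you use.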

**`questions_and_answers.jsonl`** (from `--get-qa`) — one object per pair:

| field | meaning |
|---|---|
| `id` | `sha1(question_url, fragment)` — stable primary key |
| `question`, `answer` | the problem and its worked solution, both **Markdown + LaTeX** |
| `source` | `{site, book, chapter, section, page_slug}` of the question |
| `question_url`, `answer_url` | where the problem is stated / where the solution lives |
| `question_fragment`, `label` | the problem element's id; the solution's displayed number |
| `content_hash`, `fetched_at`, `scraper_version` | provenance / change-detection |

## How it works

### Parsing: HTML + MathML → Markdown

`OpenStaxAdapter.parse_page(url, html)` turns a page into a `PageRecord`:

- **MathML → LaTeX** (`mathml.py`) for the element set OpenStax math books use
  (`mi mn mo mrow msup msub msubsup mfrac msqrt mroot mtable …`).
- **HTML → Markdown** (`htmlmd.py`) that preserves LaTeX through Markdown escaping
  (sentinel substitution) and references images by absolute URL.
- **Page classification** + routing (section / intro / summary / glossary /
  reference / skip). The entire content body — worked examples, in-text notes,
  *and* practice problems — is kept inline in one Markdown document.
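
To give a feel for the MathML → LaTeX step, here is a deliberately small sketch covering a handful of the elements listed above. It is not the code in `mathml.py`: it uses the standard-library XML parser, the function names are hypothetical, and a real converter needs many more elements plus entity and spacing handling.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip an XML namespace: '{ns}mfrac' -> 'mfrac'."""
    return tag.rsplit("}", 1)[-1]


def to_latex(node: ET.Element) -> str:
    """Convert a small MathML subset to LaTeX, recursively."""
    tag = local(node.tag)
    kids = [to_latex(child) for child in node]
    if tag in ("mi", "mn", "mo"):
        return (node.text or "").strip()
    if tag == "mfrac":
        return rf"\frac{{{kids[0]}}}{{{kids[1]}}}"
    if tag == "msup":
        return f"{{{kids[0]}}}^{{{kids[1]}}}"
    if tag == "msub":
        return f"{{{kids[0]}}}_{{{kids[1]}}}"
    if tag == "msqrt":
        return rf"\sqrt{{{''.join(kids)}}}"
    # math, mrow, and anything unrecognised: concatenate the children.
    return "".join(kids)


mathml = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML"><mfrac>'
    "<mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow>"
    "<msqrt><mi>y</mi></msqrt>"
    "</mfrac></math>"
)
latex = to_latex(ET.fromstring(mathml))
print(f"${latex}$")  # $\frac{x+1}{\sqrt{y}}$
```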

### Crawling and fetching

- **`fetcher.py`** — site-agnostic HTTP with retry/backoff, a polite per-host
  delay + jitter, `robots.txt` enforcement, and a mandatory on-disk cache:
  entries are stored as `<url-hash>-<epoch>.html` and re-used until older than
  `--refresh-interval`, then re-fetched. Every fetch returns a `FetchResult`;
  errors are captured, never raised.
- **`crawler.py`** — generic over the `SiteAdapter`: discovers the full book TOC
  from the seed page, builds an ordered, deduplicated frontier (optionally
  narrowed to `--chapters`), then fetches/classifies/parses/enriches each page
  with content-hash dedup and per-page error isolation.
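
The cache freshness rule can be sketched as follows. This is an illustration under assumptions, not the code in `fetcher.py`: the hash function (SHA-1 of the URL) and the helper names are guesses, and only the `<url-hash>-<epoch>.html` naming and the refresh-interval default come from this README.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional

REFRESH_INTERVAL = 432000  # seconds; the documented --refresh-interval default


def url_hash(url: str) -> str:
    """Hash a URL for use in a file name (SHA-1 is an assumption here)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def cache_name(url: str, fetched_at: int) -> str:
    """Build '<url-hash>-<epoch>.html'."""
    return f"{url_hash(url)}-{fetched_at}.html"


def fresh_entry(cache_dir: Path, url: str, now: int,
                ttl: int = REFRESH_INTERVAL) -> Optional[Path]:
    """Return the newest cached file for url if it is no older than ttl."""
    prefix = url_hash(url) + "-"
    newest = None  # (epoch, path) of the most recent entry seen so far
    for path in cache_dir.glob(prefix + "*.html"):
        epoch = int(path.stem[len(prefix):])
        if newest is None or epoch > newest[0]:
            newest = (epoch, path)
    if newest is not None and now - newest[0] <= ttl:
        return newest[1]
    return None  # missing or stale: the caller should re-fetch


url = "https://openstax.org/books/calculus-volume-1/pages/preface"
now = 1_700_000_000  # fixed timestamp so the example is deterministic
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    (cache_dir / cache_name(url, now - 60)).write_text("<html></html>", encoding="utf-8")
    hit = fresh_entry(cache_dir, url, now) is not None           # 60 s old: fresh
    miss = fresh_entry(cache_dir, url, now + REFRESH_INTERVAL) is None  # now stale
print(hit, miss)  # True True
```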

### Enrichment and idempotent output

- **`enrich.py`** fills quality signals generically: `language` (`langdetect`),
  `reading_time_min` (`word_count / 200`, i.e. minutes at 200 words per minute),
  and offline `keywords` (`yake`).
- **`writers.py`** writes idempotent JSONL (upsert-by-`id`, atomic rewrite) plus
  a per-book `manifest.json` of run metadata and counts.
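
The upsert-by-`id` with atomic rewrite idea can be sketched with the standard library. This is not the code in `writers.py`: `upsert_jsonl` is a hypothetical name, and comparing whole records to detect change is a simplification (the real records carry a `content_hash` for that purpose).

```python
import json
import os
import tempfile
from pathlib import Path


def upsert_jsonl(path: Path, records: list[dict]) -> bool:
    """Merge records into path by 'id'; return True only if the file changed."""
    existing = {}
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    row = json.loads(line)
                    existing[row["id"]] = row
    merged = dict(existing)
    for rec in records:
        # A known id keeps its position in the file; a new id goes to the end.
        merged[rec["id"]] = rec
    if path.exists() and merged == existing:
        return False  # nothing changed: leave the file untouched
    # Write a temp file in the same directory, then swap it in atomically.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for row in merged.values():
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    os.replace(tmp_name, path)
    return True


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "page_contents.jsonl"
    first = upsert_jsonl(out, [{"id": "a", "title": "v1"}, {"id": "b", "title": "B"}])
    again = upsert_jsonl(out, [{"id": "a", "title": "v1"}])    # unchanged: no-op
    changed = upsert_jsonl(out, [{"id": "a", "title": "v2"}])  # updated in place
    rows = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]
print(first, again, changed, rows)
```

Writing to a temporary file and calling `os.replace` means a reader never sees a half-written file: it sees either the old version or the new one.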

### How Q&A pairing works

Hardcoding the pairing of problems with solutions on OpenStax is impractical:
solutions manuals appear under inconsistent names and positions, often cover only
*some* problems, and number them out of order. The trick is that **every
solution's number is a back-link to the problem it solves** — an
`<a class="os-number" … data-page-slug="…" data-page-fragment="…">`. So `--get-qa`:

1. **Discovers the solutions-manual pages with a cheap LLM** — the model reads the
   TOC and returns the solution-page slugs (hallucinated slugs are dropped — only
   real TOC leaves survive). The prompt is bundled at
   [`prompts/discover_solutions.md`](src/openstax_scraper/prompts/discover_solutions.md).
2. **Fetches the whole book**, then on each solutions page finds every `os-number`
   back-link, takes its parent as the answer, and follows the link to the problem
   element on its page. Both halves are converted to Markdown with the same
   MathML→LaTeX pass as page bodies.
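
The first half of step 2, finding the `os-number` back-links, can be sketched with the standard-library HTML parser. This is an illustration, not the code in `qa.py`: the class name is hypothetical, the HTML is a hand-written stand-in built around the anchor attributes described above, and taking the parent as the answer and following the link are left out.

```python
from html.parser import HTMLParser


class BackLinkParser(HTMLParser):
    """Collect (label, page_slug, fragment) for every <a class="os-number">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the back-link currently being read, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "os-number" in (a.get("class") or "").split():
            self._current = {
                "label": "",
                "slug": a.get("data-page-slug"),
                "fragment": a.get("data-page-fragment"),
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["label"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            c = self._current
            self.links.append((c["label"].strip(), c["slug"], c["fragment"]))
            self._current = None


# Hand-written stand-in markup, not copied from a real solutions page.
html = """
<div><a class="os-number" data-page-slug="1-1-review-of-functions"
        data-page-fragment="fs-id1">1</a>. <span>$x = 2$</span></div>
<div><a class="os-number" data-page-slug="1-1-review-of-functions"
        data-page-fragment="fs-id7">3</a>. <span>$y = 5$</span></div>
<a href="#top">not a back-link</a>
"""
parser = BackLinkParser()
parser.feed(html)
print(parser.links)
```

Each tuple names the solution's displayed number, the page slug where the problem lives, and the id of the problem element on that page.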

## Operational reference

### Cache location

The on-disk page cache is a private speed/politeness optimization, not output, so
it lives in one constant place and is reused across runs regardless of where you
write JSONL. The directory is resolved as:

1. **`$OPENSTAX_CACHE_DIR`** if set — an explicit override.
2. otherwise the **per-user cache dir** for this app (via `platformdirs`):
   - Linux: `~/.cache/openstax-scraper` (honors `$XDG_CACHE_HOME`)
   - macOS: `~/Library/Caches/openstax-scraper`
   - Windows: `%LOCALAPPDATA%\openstax-scraper\Cache`

Caching is always on in the sense that no flag turns it off, but it is
best-effort: an unwritable path disables it rather than failing the run.
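
The resolution order above can be sketched as below. It is an assumption-laden illustration, not the package's code: `resolve_cache_dir` is a hypothetical name, an empty `$OPENSTAX_CACHE_DIR` is treated as unset, and `appauthor=False` is chosen because it reproduces the Windows path listed above. The exact arguments the package passes to `platformdirs` are not stated here.

```python
import os
from pathlib import Path


def resolve_cache_dir(environ=os.environ) -> Path:
    """$OPENSTAX_CACHE_DIR if set, else the per-user cache dir from platformdirs."""
    override = environ.get("OPENSTAX_CACHE_DIR")
    if override:
        return Path(override)
    # Imported lazily so the override branch works without the dependency.
    from platformdirs import user_cache_dir

    # appauthor=False avoids an extra author directory on Windows (assumption).
    return Path(user_cache_dir("openstax-scraper", appauthor=False))


cache_dir = resolve_cache_dir({"OPENSTAX_CACHE_DIR": "/tmp/openstax-cache"})
print(cache_dir)
```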

### Exit codes

The CLI follows the rule **"exit non-zero only on fatal config errors; per-page
failures never fail the run."**

| Code | Meaning |
|---|---|
| `0` | Ran to completion. Individual pages that 404, time out, or fail to parse are isolated, counted in the summary (`failed=…`), and do not change the exit code. |
| `2` | A **fatal** config/environment error stopped the run before useful output: `--from-file` path missing, the seed page couldn't be fetched (empty frontier), a record came out off-contract under `--validate`, or `--get-qa` had no API key. These print a one-line error instead of a traceback and write nothing. `argparse` also exits `2` on bad/missing flags. |
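
For CI, the exit-code contract means a wrapper only has to distinguish zero from non-zero. The sketch below assembles a validate-only run from the documented flags and reports whether it succeeded. The helper names are hypothetical, and `run_check` is defined but not called here because it needs the CLI installed and network access.

```python
import subprocess


def build_command(book_url: str, output_dir: str = "./out") -> list[str]:
    """Assemble a crawl + validate run that writes nothing (--dry-run)."""
    return [
        "scrape_openstax",
        f"--book-url={book_url}",
        "--output-dir", output_dir,
        "--validate",
        "--dry-run",
    ]


def run_check(book_url: str) -> bool:
    """Run the CLI; True on exit code 0, False on any non-zero code."""
    result = subprocess.run(build_command(book_url), check=False)
    return result.returncode == 0


cmd = build_command("https://openstax.org/books/calculus-volume-1/pages/preface")
print(" ".join(cmd))
```

Per-page failures do not change the exit code, so a CI job that cares about them would also need to read the `failed=…` count from the run summary.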

## Development

```bash
pip install -e ".[dev]"
pytest          # fully offline, against committed fixtures in tests/fixtures/
ruff check .
```

### Project layout

```
src/openstax_scraper/
  mathml.py            # MathML → LaTeX
  htmlmd.py            # HTML → Markdown (math-aware)
  models.py            # PageRecord, QuestionAndAnswer data classes
  config.py            # runtime configuration (delay, get_qa, cache dir, …)
  enrich.py            # generic quality signals (language, reading time, keywords)
  fetcher.py           # site-agnostic HTTP: retry, throttle, cache, robots
  crawler.py           # orchestrator: TOC frontier → fetch/parse/enrich, dedup
  qa.py                # --get-qa orchestrator: pair problems with solutions
  llm.py               # tiny Anthropic wrapper (used by --get-qa)
  prompts.py           # locate/load bundled prompt templates
  writers.py           # idempotent upsert JSONL + manifest
  cli.py               # scrape_openstax entry point
  adapters/
    base.py            # SiteAdapter protocol + PageClass
    openstax.py        # all OpenStax-specific knowledge
  schemas/             # bundled JSON Schemas (output contract)
  prompts/             # bundled LLM prompt templates
scripts/               # diagnostic probes (live-site, dev-only)
tests/                 # offline tests + committed HTML fixtures
```

The **adapter boundary** keeps OpenStax specifics out of the generic pipeline:
supporting a new site means adding one `SiteAdapter`, not touching the crawler.


## License

[MIT](LICENSE) © Yoftahe Milkessa
