Metadata-Version: 2.4
Name: aozora_py
Version: 0.4.1
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Markup
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Summary: Aozora Bunko notation parser — Python bindings (PyO3).
Keywords: aozora,parser,japanese,ruby
Author: Yasunobu
License: MIT OR Apache-2.0
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/P4suta/aozora
Project-URL: Issues, https://github.com/P4suta/aozora/issues
Project-URL: Repository, https://github.com/P4suta/aozora

# aozora

<p align="center">
  <a href="https://github.com/P4suta/aozora/actions/workflows/ci.yml"><img alt="ci" src="https://github.com/P4suta/aozora/actions/workflows/ci.yml/badge.svg"></a>
  <a href="https://github.com/P4suta/aozora/actions/workflows/docs.yml"><img alt="docs deploy" src="https://github.com/P4suta/aozora/actions/workflows/docs.yml/badge.svg"></a>
  <a href="https://github.com/P4suta/aozora/releases/latest"><img alt="latest release" src="https://img.shields.io/github/v/release/P4suta/aozora?display_name=tag&sort=semver"></a>
  <a href="./LICENSE-APACHE"><img alt="license" src="https://img.shields.io/badge/license-Apache--2.0%20OR%20MIT-blue"></a>
  <a href="./rust-toolchain.toml"><img alt="msrv" src="https://img.shields.io/badge/rust-1.95-orange"></a>
</p>

<p align="center">
  🎮 <a href="https://p4suta.github.io/aozora/playground/"><strong>Playground</strong></a>
  · 📚 <a href="https://p4suta.github.io/aozora/"><strong>Handbook (mdbook)</strong></a>
  · 📖 <a href="https://p4suta.github.io/aozora/api/aozora/"><strong>API reference (rustdoc)</strong></a>
  · 📦 <a href="https://github.com/P4suta/aozora/releases"><strong>Releases &amp; binaries</strong></a>
  · 🇯🇵 <a href="./README.ja.md"><strong>日本語</strong></a>
</p>

Pure-functional Rust parser for **青空文庫記法** (Aozora Bunko notation):
ruby (`｜青梅《おうめ》`), bouten (`［＃「X」に傍点］`), 縦中横, 外字
references (`※［＃…、第3水準1-85-54］`), kunten / kaeriten,
indent / align-end containers (`［＃ここから2字下げ］… ［＃ここで字下げ終わり］`),
and page / section breaks.

The parser is **CommonMark-free, Markdown-free** — this repository deals
only with the 青空文庫 notation itself. The renderer emits semantic HTML5;
the lexer reports structured diagnostics; the AST is a borrowed-arena
tree that can be walked in O(n) without copying source bytes.
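
To make the ruby notation concrete, the sketch below shows how an explicit-base span such as `｜青梅《おうめ》` could map onto an HTML5 `<ruby>` element. It is an illustration only, not the aozora implementation or its API: the function name `render_explicit_ruby` is hypothetical, only the explicit `｜` form is handled, no HTML escaping is done, and the markup the real renderer emits may differ.

```rust
/// Hypothetical helper, not part of the aozora API: converts explicit-base
/// ruby spans (`｜base《reading》`) into HTML5 `<ruby>` elements.
/// Text that does not form a complete span is copied through unchanged.
/// No HTML escaping is performed; this is an illustration only.
fn render_explicit_ruby(src: &str) -> String {
    let mut out = String::new();
    let mut rest = src;
    while let Some(bar) = rest.find('｜') {
        let after_bar = &rest[bar + '｜'.len_utf8()..];
        // A complete span needs `《` after the base and `》` after the reading.
        let Some(open) = after_bar.find('《') else { break };
        let after_open = &after_bar[open + '《'.len_utf8()..];
        let Some(close) = after_open.find('》') else { break };

        out.push_str(&rest[..bar]); // plain text before the span
        let base = &after_bar[..open];
        let reading = &after_open[..close];
        out.push_str("<ruby>");
        out.push_str(base);
        out.push_str("<rt>");
        out.push_str(reading);
        out.push_str("</rt></ruby>");
        rest = &after_open[close + '》'.len_utf8()..];
    }
    out.push_str(rest); // trailing text, or an incomplete span left as-is
    out
}
```

Calling `render_explicit_ruby("｜青梅《おうめ》")` yields `<ruby>青梅<rt>おうめ</rt></ruby>`. The byte offsets are computed with `len_utf8()` because the delimiters are multi-byte full-width characters.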

## Installation

### Pre-built CLI

Pre-built `aozora` CLI binaries for **Linux x86_64**, **macOS arm64**,
and **Windows x86_64** are attached to every GitHub Release —
[the releases page](https://github.com/P4suta/aozora/releases) carries
`aozora-vX.Y.Z-<target>.{tar.gz,zip}` archives with `SHA256SUMS`.

### Build from source

```sh
cargo install --git https://github.com/P4suta/aozora --locked aozora-cli
```

This builds the latest `main`. For reproducible builds, pin to a release
tag; see [the install chapter](https://p4suta.github.io/aozora/getting-started/install.html)
for the tag-pinned form.

### As a Rust library

The `Cargo.toml` snippet (with the current release tag) lives in the
[install chapter](https://p4suta.github.io/aozora/getting-started/install.html#as-a-rust-library) —
keeping it in one place avoids version-pin drift across multiple READMEs.
Publication on crates.io is tied to the 1.0 API freeze.

For WASM / C ABI / Python bindings see the
[Bindings chapters](https://p4suta.github.io/aozora/bindings/rust.html) of
the handbook.

## Quickstart

```rust
use aozora::Document;

let source = "｜青梅《おうめ》".to_owned();
let doc = Document::new(source);
let tree = doc.parse();

let html: String = tree.to_html();
let canonical: String = tree.serialize();
let diagnostics = tree.diagnostics();

assert_eq!(canonical, "｜青梅《おうめ》");
```

`Document` owns a [`bumpalo`](https://docs.rs/bumpalo) arena; `tree`
borrows from it for the lifetime of the `Document`. Dropping the
`Document` releases every node in a single `Bump::reset` step.

## CLI

```sh
aozora check FILE.txt           # lex + report diagnostics
aozora fmt --check FILE.txt     # round-trip parse ∘ serialize check
aozora render FILE.txt          # render to HTML on stdout
aozora check -E sjis FILE.txt   # Shift_JIS source from Aozora Bunko
```

All subcommands accept `-` (or no path argument) to read from stdin.
See the [CLI reference chapter](https://p4suta.github.io/aozora/ref/cli.html)
for the full subcommand reference.

## Crate layout

aozora is a 21-crate workspace.
[`crates/aozora`](./crates/aozora) is the public facade — library
consumers usually import only this one.

| Crate | Purpose |
|---|---|
| [`crates/aozora`](./crates/aozora) | Top-level facade. `Document::parse() → AozoraTree<'_>`, structured `Diagnostic`s, `SLUGS` catalogue, `canonicalise_slug`. The single front door. |
| [`crates/aozora-spec`](./crates/aozora-spec) | Single source of truth for shared types: `Span`, `TriggerKind`, `PairKind`, `Diagnostic`, PUA sentinel codepoints, `SLUGS` dispatch table. No internal dependency. |
| [`crates/aozora-syntax`](./crates/aozora-syntax) | AST types (`AozoraNode` borrowed-arena variants, `ContainerKind`, `BoutenKind`, `Indent`). |
| [`crates/aozora-encoding`](./crates/aozora-encoding) | Shift_JIS decoding + 外字 lookup (compile-time PHF, JIS X 0213 + UCS resolution). |
| [`crates/aozora-scan`](./crates/aozora-scan) | SIMD-friendly multi-pattern scanner backends (Teddy / structural-bitmap / Hoehrmann DFA / naive fallback). |
| [`crates/aozora-veb`](./crates/aozora-veb) | Eytzinger-layout sorted-set lookup (cache-friendly binary search). |
| [`crates/aozora-pipeline`](./crates/aozora-pipeline) | 4-phase lexer (sanitize → events → pair → classify) plus the `lex_into_arena` orchestrator — pure `fn(&str, &Arena) -> BorrowedLexOutput<'_>`. |
| [`crates/aozora-render`](./crates/aozora-render) | HTML and serialise renderers — `html::render_to_string`, `serialize::serialize`. |
| [`crates/aozora-cst`](./crates/aozora-cst) | rowan-backed lossless concrete syntax tree. Editor/formatter surface. |
| [`crates/aozora-query`](./crates/aozora-query) | Tree-sitter-style pattern DSL (`SyntaxKind` + capture) for queries over the CST. |
| [`crates/aozora-pandoc`](./crates/aozora-pandoc) | Pandoc AST projection (`AozoraTree` → `pandoc_ast::Pandoc`); unlocks 50+ output formats via Pandoc writers. |
| [`crates/aozora-cli`](./crates/aozora-cli) | `aozora` binary: `check` / `fmt` / `render` / `schema` / `kinds` / `explain` / `pandoc`. |
| [`crates/aozora-wasm`](./crates/aozora-wasm) | `wasm32-unknown-unknown` target for `wasm-pack build --target web`. |
| [`crates/aozora-ffi`](./crates/aozora-ffi) | C ABI driver (opaque handle, JSON-encoded structured data). |
| [`crates/aozora-py`](./crates/aozora-py) | PyO3 bindings, distributed via `maturin`. |
| [`crates/aozora-bench`](./crates/aozora-bench) | Criterion + corpus-driven probes (PGO profile source). |
| [`crates/aozora-conformance`](./crates/aozora-conformance) | WPT-style conformance fixture runner (golden HTML / serialize / diagnostics / wire across 23 fixtures). |
| [`crates/aozora-corpus`](./crates/aozora-corpus) | Corpus source abstraction for sweep tests (dev-only, set `AOZORA_CORPUS_ROOT`). |
| [`crates/aozora-proptest`](./crates/aozora-proptest) | Shared proptest strategies (`aozora_fragment` / `pathological_aozora` / `unicode_adversarial` and friends; dev-only). |
| [`crates/aozora-trace`](./crates/aozora-trace) | DWARF symbolicator for samply traces. |
| [`crates/aozora-xtask`](./crates/aozora-xtask) | Repo automation (samply wrapper, trace analysis, corpus pack/unpack, schema dumps). |

See the [Architecture chapter](https://p4suta.github.io/aozora/arch/pipeline.html)
of the handbook for the layered design, the borrowed-arena AST, the
SIMD scanner backends, and the dependency graph between these
crates.
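
As background for the `aozora-veb` row in the table above, here is a minimal sketch of an Eytzinger-layout sorted-set lookup. It is not the code in `aozora-veb`: the function names and the `u32` key type are assumptions chosen for illustration. The idea is to store a sorted set in breadth-first (heap) order, so that a binary search walks a predictable path through the array and the first few levels stay in the same cache lines.

```rust
/// Rearranges a sorted slice into Eytzinger (breadth-first) order.
/// Slot 0 is unused so that the children of node `k` sit at `2k` and `2k + 1`.
fn eytzinger_from_sorted(sorted: &[u32]) -> Vec<u32> {
    // In-order traversal of the implicit tree assigns keys in ascending order.
    fn fill(sorted: &[u32], out: &mut [u32], next: &mut usize, k: usize) {
        if k < out.len() {
            fill(sorted, out, next, 2 * k); // left subtree: smaller keys
            out[k] = sorted[*next];
            *next += 1;
            fill(sorted, out, next, 2 * k + 1); // right subtree: larger keys
        }
    }
    let mut out = vec![0; sorted.len() + 1];
    let mut next = 0;
    fill(sorted, &mut out, &mut next, 1);
    out
}

/// Membership test: start at the root (index 1) and descend left or right,
/// exactly like a binary search tree stored implicitly in an array.
fn eytzinger_contains(tree: &[u32], key: u32) -> bool {
    let mut k = 1;
    while k < tree.len() {
        if tree[k] == key {
            return true;
        }
        // Go right (2k + 1) when the current key is smaller, else left (2k).
        k = 2 * k + usize::from(tree[k] < key);
    }
    false
}
```

For the sorted input `[1, 3, 5, 7, 9, 11, 13]`, `eytzinger_from_sorted` produces `[0, 7, 3, 11, 1, 5, 9, 13]` (the leading `0` is the unused slot), and `eytzinger_contains` then finds `9` but not `4`.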

## Development

Everything runs inside Docker — the host toolchain is never invoked.
Bring up the dev image once, then drive every operation through `just`:

```sh
just                # list targets
just build          # cargo build --workspace --all-targets
just test           # cargo nextest run --workspace
just prop           # property-based sweep (128 cases per block)
just lint           # fmt + clippy pedantic+nursery + typos + strict-code
just deny           # cargo-deny licenses + advisories + bans
just coverage       # cargo llvm-cov branch coverage
just ci             # full CI replica
just book-build     # render the mdbook handbook
just book-serve     # live-preview the handbook at localhost:3000
```

Use `just run` to invoke the CLI inside the container:

```sh
just run check FILE.txt
just run render -E sjis FILE.txt > out.html
```

See [`CONTRIBUTING.md`](./CONTRIBUTING.md) for the contribution flow,
testing strategy, and lint policy.

## Documentation

- 📚 [**Handbook**](https://p4suta.github.io/aozora/) — the mdbook
  site: notation reference, architecture (borrowed-arena AST,
  SIMD scanner backends, encoding), bindings (Rust / WASM / C ABI /
  Python), performance (samply / bench / corpus sweep), CLI / API /
  env reference, and the contributor guide.
- 📖 [**API reference (rustdoc)**](https://p4suta.github.io/aozora/api/aozora/)
  — auto-deployed alongside the handbook.
- [`CONTRIBUTING.md`](./CONTRIBUTING.md) — dev setup, TDD flow,
  PR rules.
- [`SECURITY.md`](./SECURITY.md) — vulnerability disclosure.
- [`CHANGELOG.md`](./CHANGELOG.md) — release history.

## Related projects

| Repo | What it is |
|---|---|
| [`P4suta/afm`](https://github.com/P4suta/afm) | CommonMark + GFM + 青空文庫記法 integrated Markdown dialect, built on top of this parser. |
| [`P4suta/aozora-tools`](https://github.com/P4suta/aozora-tools) | Authoring tools: formatter, LSP server, tree-sitter grammar, VS Code extension. |

## License

Dual-licensed under [Apache-2.0](./LICENSE-APACHE) OR [MIT](./LICENSE-MIT)
at your option, matching Rust community convention. See
[`NOTICE`](./NOTICE) for third-party attribution (Aozora Bunko spec
snapshots and public-domain sample works used in tests).

