Metadata-Version: 2.3
Name: batchalign
Version: 0.9.0a3
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Requires-Dist: typer >=0.12
Requires-Dist: click >=8
Requires-Dist: tqdm >=4
Requires-Dist: numpy >=1.24
Requires-Dist: polars >=0.20
Requires-Dist: soundfile >=0.12
Requires-Dist: rich >=13
Requires-Dist: pydantic >=2
Requires-Dist: pycountry
Requires-Dist: transformers >=4.57, <5 ; extra == 'whisper'
Requires-Dist: torch >=2.0 ; extra == 'whisper'
Requires-Dist: torchaudio >=2.0 ; extra == 'whisper'
Requires-Dist: nltk >=3.8 ; extra == 'whisper'
Requires-Dist: openai-whisper ; extra == 'whisper'
Requires-Dist: pycountry ; extra == 'whisper'
Requires-Dist: num2words ; extra == 'whisper'
Requires-Dist: stanza >=1.12 ; extra == 'stanza'
Requires-Dist: transformers >=4.57, <5 ; extra == 'stanza'
Requires-Dist: pyannote-audio >=3.0 ; extra == 'pyannote'
Requires-Dist: onnxruntime >=1.16 ; extra == 'pyannote'
Requires-Dist: rev-ai ; extra == 'revai'
Requires-Dist: pycantonese >=4.2, <5 ; extra == 'cantonese'
Requires-Dist: tencentcloud-sdk-python-asr ; extra == 'cantonese'
Requires-Dist: tencentcloud-sdk-python-tmt ; extra == 'cantonese'
Requires-Dist: aliyun-python-sdk-core >=2.13 ; extra == 'cantonese'
Requires-Dist: aliyun-python-sdk-alimt ; extra == 'cantonese'
Requires-Dist: funasr ==1.3.1 ; extra == 'cantonese'
Requires-Dist: opencc ; extra == 'cantonese'
Requires-Dist: cos-python-sdk-v5 ; extra == 'cantonese'
Requires-Dist: googletrans ; extra == 'translate'
Requires-Dist: qwen-asr ; extra == 'qwen3'
Requires-Dist: transformers >=4.57, <5 ; extra == 'qwen3'
Requires-Dist: torch >=2.0 ; extra == 'qwen3'
Requires-Dist: transformers >=4.57, <5 ; extra == 'nllb'
Requires-Dist: torch >=2.0 ; extra == 'nllb'
Requires-Dist: sentencepiece ; extra == 'nllb'
Requires-Dist: fastapi >=0.110 ; extra == 'api'
Requires-Dist: uvicorn[standard] >=0.27 ; extra == 'api'
Requires-Dist: sse-starlette >=2.0 ; extra == 'api'
Requires-Dist: python-multipart >=0.0.9 ; extra == 'api'
Requires-Dist: transformers >=4.57, <5 ; extra == 'all'
Requires-Dist: torch >=2.0 ; extra == 'all'
Requires-Dist: torchaudio >=2.0 ; extra == 'all'
Requires-Dist: nltk >=3.8 ; extra == 'all'
Requires-Dist: openai-whisper ; extra == 'all'
Requires-Dist: pycountry ; extra == 'all'
Requires-Dist: num2words ; extra == 'all'
Requires-Dist: stanza >=1.12 ; extra == 'all'
Requires-Dist: pyannote-audio >=3.0 ; extra == 'all'
Requires-Dist: onnxruntime >=1.16 ; extra == 'all'
Requires-Dist: rev-ai ; extra == 'all'
Requires-Dist: pycantonese >=4.2, <5 ; extra == 'all'
Requires-Dist: tencentcloud-sdk-python-asr ; extra == 'all'
Requires-Dist: tencentcloud-sdk-python-tmt ; extra == 'all'
Requires-Dist: aliyun-python-sdk-core >=2.13 ; extra == 'all'
Requires-Dist: aliyun-python-sdk-alimt ; extra == 'all'
Requires-Dist: funasr ==1.3.1 ; extra == 'all'
Requires-Dist: opencc ; extra == 'all'
Requires-Dist: cos-python-sdk-v5 ; extra == 'all'
Requires-Dist: googletrans ; extra == 'all'
Requires-Dist: qwen-asr ; extra == 'all'
Requires-Dist: sentencepiece ; extra == 'all'
Requires-Dist: fastapi >=0.110 ; extra == 'all'
Requires-Dist: uvicorn[standard] >=0.27 ; extra == 'all'
Requires-Dist: sse-starlette >=2.0 ; extra == 'all'
Requires-Dist: python-multipart >=0.0.9 ; extra == 'all'
Requires-Dist: maturin ==1.7.4 ; extra == 'dev'
Requires-Dist: pytest >=7 ; extra == 'dev'
Requires-Dist: pytest-xdist ; extra == 'dev'
Requires-Dist: mypy ; extra == 'dev'
Requires-Dist: datamodel-code-generator >=0.26 ; extra == 'dev'
Requires-Dist: build ; extra == 'dev'
Requires-Dist: wheel ; extra == 'dev'
Requires-Dist: setuptools ; extra == 'dev'
Requires-Dist: hatchling ; extra == 'dev'
Requires-Dist: poetry-core ; extra == 'dev'
Provides-Extra: whisper
Provides-Extra: stanza
Provides-Extra: pyannote
Provides-Extra: revai
Provides-Extra: cantonese
Provides-Extra: translate
Provides-Extra: qwen3
Provides-Extra: nllb
Provides-Extra: api
Provides-Extra: all
Provides-Extra: dev
Summary: TalkBank CHAT processing pipeline
Keywords: nlp,linguistics,chat,talkbank,transcription,forced-alignment,morphosyntax
Author-email: Brian MacWhinney <macw@cmu.edu>, Houjun Liu <houjun@stanford.edu>, Franklin Chen <franklinchen@franklinchen.com>
License: BSD-3-Clause
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/TalkBank/talkbank-tools
Project-URL: Repository, https://github.com/TalkBank/talkbank-tools
Project-URL: Issues, https://github.com/TalkBank/talkbank-tools/issues

# batchalign

TalkBank CHAT processing pipeline: ASR, forced alignment, morphosyntax
(`%mor` / `%gra`), utterance segmentation, translation, and comparison.
This is the user-facing Python package; the runtime is a PyO3 extension
backed by the Rust crates in `crates/batchalign/`.

User and developer docs live in `book/src/batchalign/` (the mdBook is
the source of truth). See `book/src/batchalign/developer/building.md`
for the canonical build recipe.

## Install

From PyPI (stable wheels):

```bash
pip install batchalign
```

From source: **use the `just` recipes**. They go through Bazel, so
every dependency (proto codegen, Rust crates, PyO3 extension, Python
wheel extras) is materialized for you. Don't reach for `maturin develop`
or `uv sync` directly unless you're debugging the build itself.

```bash
just batchalign build              # build every Bazel target for batchalign
just batchalign test               # run every Bazel test target
just batchalign cli --help         # run `batchalign3` via the development bridge
just batchalign pytest             # pytest (with full Bazel dep graph)
just batchalign wheel              # host-platform wheel at python/target/wheels/
just batchalign sidecar            # standalone daemon binary via PyApp
just batchalign lint               # mypy (+ ruff)
just batchalign versions           # source-of-truth version readout
```

`just --list batchalign` shows the full recipe list. The `just`
recipes call into Bazel, so `tools/bazel` (the bundled wrapper) takes
care of:

- regenerating the pydantic-v2 wire types from
  `crates/batchalign/batchalign-core/src/proto/*.rs`
- rebuilding the `batchalign_core` PyO3 cdylib via maturin under the
  hood
- staging the binary into the wheel
- propagating dependency changes to dependent targets

If you genuinely need `bazel` directly (because a recipe you want
isn't wrapped):

```bash
bazel build //...           # everything
bazel test //...            # everything
bazel run //book:html       # static book HTML
bazel run //apps/batchalign/batchalign-gui:openapi   # GUI OpenAPI codegen
```

The base wheel ships only the lightweight runtime. Heavy ML backends
are gated behind extras — install only what you use:

```bash
pip install 'batchalign[whisper]'      # Whisper ASR
pip install 'batchalign[stanza]'       # morphosyntax (%mor / %gra)
pip install 'batchalign[pyannote]'     # speaker diarization
pip install 'batchalign[revai]'        # Rev.AI cloud ASR
pip install 'batchalign[cantonese]'    # Cantonese pipeline (FunASR, Tencent)
pip install 'batchalign[qwen3]'        # Qwen3 ASR + forced aligner
pip install 'batchalign[nllb]'         # NLLB translation
pip install 'batchalign[translate]'    # googletrans-based translation
pip install 'batchalign[api]'          # FastAPI daemon (`batchalign3 daemon`)
pip install 'batchalign[all]'          # every runtime extra (excludes dev)
```
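
To see which extra pulls in which dependency, the `Requires-Dist` entries in the package metadata can be grouped by their `extra == '...'` marker. The following is a minimal sketch using only the standard library. `group_by_extra` is a hypothetical helper name, not part of batchalign, and the regex assumes the simple single-extra markers this package uses.

```python
import re
from collections import defaultdict

# Matches the `extra == 'name'` part of an environment marker.
_EXTRA_RE = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def group_by_extra(requirements):
    """Group requirement strings by extra; base requirements go under ""."""
    groups = defaultdict(list)
    for requirement in requirements:
        # Everything after the first ';' is the environment marker, if any.
        spec, _, marker = requirement.partition(";")
        match = _EXTRA_RE.search(marker)
        groups[match.group(1) if match else ""].append(spec.strip())
    return dict(groups)
```

With batchalign installed, `group_by_extra(importlib.metadata.requires("batchalign") or [])` should give the same grouping for the installed wheel. Note that `requires` raises `PackageNotFoundError` when the package is absent.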

## CLI

```bash
batchalign3 --help

batchalign3 transcribe input_dir -o output_dir --lang eng
batchalign3 align     input_dir -o output_dir --engine wav2vec
batchalign3 morphotag input_dir -o output_dir --language en
batchalign3 utseg     input_dir -o output_dir
batchalign3 translate input_dir -o output_dir --target eng
batchalign3 compare   input_dir gold_dir   -o output_dir

batchalign3 version                    # banner, version, git SHA
batchalign3 cache {path,stats,clear}   # local result cache
batchalign3 daemon                     # FastAPI server (needs [api])
```

When `-o` is omitted, results are written back in place. The CLI accepts
either a single CHAT/media file or a folder (walked recursively).
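
When scripting several runs, it can help to build each invocation as an argument list before executing it. The sketch below is one possible wrapper, not part of batchalign: `build_command` and `render` are hypothetical names, and only the subcommands and flags shown in the examples above are assumed to exist.

```python
import shlex


def build_command(subcommand, *inputs, output=None, options=()):
    """Return an argv list for one `batchalign3` invocation."""
    argv = ["batchalign3", subcommand, *map(str, inputs)]
    if output is not None:
        # Leaving -o out means results are written back in place.
        argv += ["-o", str(output)]
    argv += list(options)
    return argv


def render(argv):
    """Shell-quoted form of the command, for logging or a dry run."""
    return shlex.join(argv)
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. Printing `render(argv)` first gives a dry run, which is worth doing before any in-place run that omits `output`.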

## Programmatic API

```python
import batchalign as ba

# Build a backend chain.
pipeline = ba.recipes.morphotag(
    stanza_backend=ba.StanzaBackend(lang="en"),
)

# Or compose your own.
asr = ba.WhisperBackend(language=ba.LanguageCode.from_iso("eng"))
utseg = ba.CHATUtteranceBackend(model="talkbank/CHATUtterance-en")
pipeline = ba.recipes.transcribe(asr_backend=asr, utseg_backend=utseg)

# Run.
inputs = [ba.media_from_path("session.wav")]
outcomes = list(pipeline.run(inputs))
for outcome in outcomes:
    outcome.write("session.cha")
```
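
The loop above writes every outcome to the same fixed path, which works for a single input. With several media files, each outcome needs its own output name. Here is a minimal sketch of one way to derive it with `pathlib`. `output_path_for` is a hypothetical helper, not part of the batchalign API.

```python
from pathlib import Path


def output_path_for(media_path, out_dir=None):
    """Map a media file to a `.cha` path, optionally under `out_dir`."""
    source = Path(media_path)
    # Swap the media extension for the CHAT extension.
    target = source.with_suffix(".cha")
    if out_dir is not None:
        # Keep only the file name when redirecting into another folder.
        target = Path(out_dir) / target.name
    return target
```

Assuming outcomes come back in the same order as the inputs (an assumption about `pipeline.run`, not something this README confirms), each outcome could then be written to `output_path_for(path)` for its matching input path.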

## Repository layout

This package is one slice of the `talkbank-tools` monorepo:

- `python/batchalign/` — Python package (this file).
- `crates/batchalign/` — Rust crates (`batchalign-core`, `batchalign-engine`).
- `crates/core/` — shared CHAT parser / model / transform.
- `apps/batchalign/batchalign-gui/` — Tauri desktop GUI.
- `book/` — user + developer documentation (mdBook; source of truth).

For repo conventions, build commands, and the BA3 cutover plan, see
`CLAUDE.md` at the repo root and
`book/src/batchalign/developer/landing-status.md`.

