Metadata-Version: 2.4
Name: kaldingram
Version: 0.1.0
Summary: Train and entropy-prune ARPA n-gram language models
Author: Next-gen Kaldi Team
License: Apache-2.0
Project-URL: Homepage, https://github.com/pkufool/kaldingram
Project-URL: Repository, https://github.com/pkufool/kaldingram
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# kaldingram

`kaldingram` provides Python and CLI tools to:

- train Kneser-Ney back-off n-gram language models in ARPA format
- entropy-prune ARPA language models

The implementation is based on the Kaldi WSJ scripts and matches SRILM-style behavior.

## Install

```bash
pip install kaldingram
```

## CLI Usage

### Train an n-gram LM

```bash
kaldingram train --ngram-order 4 --text corpus.txt --lm 4gram.arpa
```

Alternatively, stream text from stdin and write the ARPA model to stdout:

```bash
cat corpus.txt | kaldingram train --ngram-order 3 > 3gram.arpa
```
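
To drive training from a Python script, one option is to shell out to the CLI. The sketch below assumes only the flags shown above (`--ngram-order`, `--text`, `--lm`). The helper names `build_train_cmd` and `train_lm` are hypothetical and are not part of `kaldingram`.

```python
import shutil
import subprocess
from typing import List, Optional


def build_train_cmd(
    order: int, text: Optional[str] = None, lm: Optional[str] = None
) -> List[str]:
    """Build the argv for `kaldingram train` (hypothetical helper).

    Only flags shown in this README are used. Leaving out `text` or `lm`
    omits the flag, so the CLI reads stdin or writes stdout instead.
    """
    cmd = ["kaldingram", "train", "--ngram-order", str(order)]
    if text is not None:
        cmd += ["--text", text]
    if lm is not None:
        cmd += ["--lm", lm]
    return cmd


def train_lm(order: int, text: str, lm: str) -> None:
    """Run the training command and raise if the CLI is missing or fails."""
    if shutil.which("kaldingram") is None:
        raise RuntimeError("kaldingram CLI not found on PATH")
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    subprocess.run(build_train_cmd(order, text, lm), check=True)
```

Calling `train_lm(4, "corpus.txt", "4gram.arpa")` would run the same command as the first training example.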

### Prune an n-gram LM

```bash
kaldingram prune --threshold 1e-8 --lm 4gram.arpa --write-lm 4gram_pruned.arpa
```
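
To check what pruning did, it can help to compare the per-order n-gram counts declared in the `\data\` header of the two ARPA files. The following is a minimal sketch that relies only on the standard ARPA header layout (`ngram N=COUNT` lines after `\data\`). The function name `arpa_ngram_counts` is hypothetical and is not part of `kaldingram`.

```python
import re
from typing import Dict, Iterable

# Matches header lines such as "ngram 3=12345".
_COUNT_RE = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)\s*$")


def arpa_ngram_counts(lines: Iterable[str]) -> Dict[int, int]:
    """Return {order: count} from the \\data\\ header of an ARPA file.

    Hypothetical helper: reads only the header and stops at the first
    section marker (for example "\\1-grams:").
    """
    counts: Dict[int, int] = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = _COUNT_RE.match(line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            break  # header is over once an n-gram section begins
    return counts
```

Passing open file objects for `4gram.arpa` and `4gram_pruned.arpa` to this function gives two dictionaries to compare. With entropy pruning, the pruned model would normally be expected to declare fewer higher-order n-grams.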

## Development

Build the package locally:

```bash
python -m pip install --upgrade build
python -m build
```

