Metadata-Version: 2.4
Name: backchannel-classifier
Version: 0.4.0
Summary: backchannel classifier - detect backchannels vs real responses in thai and japanese asr output
Author-email: "100x.fi" <kiri@100x.fi>
License: MIT
Project-URL: Homepage, https://github.com/100x-fi/backchannel-classifier
Project-URL: Repository, https://github.com/100x-fi/backchannel-classifier
Keywords: thai,japanese,nlp,backchannel,aizuchi,voice,asr,classifier
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: scikit-learn>=1.0
Requires-Dist: numpy>=1.20
Dynamic: license-file

# backchannel classifier

distinguishes backchannel responses from real user input in voice ai systems. supports **thai** and **japanese** (aizuchi).

## install

```bash
pip install backchannel-classifier
```

## usage

```python
from backchannel_classifier import is_backchannel

# thai (default)
is_backchannel("ครับ")                    # (True, 0.91)
is_backchannel("ไม่ครับ")                 # (False, 0.01)
is_backchannel("ใช่ แต่ว่า")              # (False, 0.01)

# japanese
is_backchannel("はい", lang="ja")         # (True, 0.99)
is_backchannel("そうですね", lang="ja")    # (True, 0.99)
is_backchannel("予約したいです", lang="ja") # (False, 0.0001)

# direct import
from backchannel_classifier.jp import is_backchannel_ja
is_backchannel_ja("なるほど")              # (True, 0.99)
```

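returns `(is_backchannel: bool, confidence: float)`. in the examples above, `confidence` tracks the backchannel probability, so real responses score near 0.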
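
one way to use the returned tuple is to gate transcripts before they reach the llm. the sketch below is an illustration, not part of the package: `should_forward`, `threshold` and `fake_classify` are hypothetical names, and it assumes the second tuple element is a backchannel probability. the classifier is passed in as an argument so the example runs without the package installed.

```python
from typing import Callable, Tuple

# any callable with the same return shape as is_backchannel
Classifier = Callable[[str], Tuple[bool, float]]


def should_forward(text: str, classify: Classifier, threshold: float = 0.5) -> bool:
    """Return True when the transcript should be sent on to the llm."""
    is_bc, score = classify(text)
    # drop the utterance only when it is flagged AND the score clears the threshold
    return not (is_bc and score >= threshold)


# stand-in classifier (hypothetical) that mimics two results from the usage block;
# in real use, pass backchannel_classifier.is_backchannel instead
def fake_classify(text: str) -> Tuple[bool, float]:
    return (True, 0.91) if text == "ครับ" else (False, 0.01)


print(should_forward("ครับ", fake_classify))     # False: dropped as a backchannel
print(should_forward("ไม่ครับ", fake_classify))  # True: forwarded to the llm
```

raising `threshold` makes the gate more conservative: fewer utterances are dropped, at the cost of letting more backchannels through.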

## why

voice bots built on asr → llm → tts pipelines need to tell backchannels (short acknowledgments that should be ignored) apart from real responses that need processing. exact string matching breaks on asr spelling variants and on edge cases where a backchannel word opens a longer utterance.

## approach

gradient boosting classifier with handcrafted language-specific features. key idea: strip known backchannel components from the text, measure what's left (`remaining_ratio`). if nothing remains, it's a backchannel.
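
a minimal sketch of the `remaining_ratio` idea, under assumptions: the component list here is a tiny hypothetical subset, and the package's actual stripping rules and normalization are not shown in this readme.

```python
from typing import Iterable

# hypothetical subset of backchannel components; the real lists are larger
COMPONENTS = ("ครับ", "ค่ะ", "อืม", "ใช่")


def remaining_ratio(text: str, components: Iterable[str] = COMPONENTS) -> float:
    """Fraction of the (whitespace-free) text left after stripping components."""
    compact = "".join(text.split())  # drop all whitespace first
    if not compact:
        return 0.0
    rest = compact
    # strip longer components first so they are not broken up by shorter ones
    for comp in sorted(components, key=len, reverse=True):
        rest = rest.replace(comp, "")
    return len(rest) / len(compact)


print(remaining_ratio("ครับ"))        # 0.0: nothing is left, so it looks like a backchannel
print(remaining_ratio("ใช่ แต่ว่า"))  # above 0: the continuation survives stripping
```

in the classifier this ratio is one feature among several, so a nonzero value does not decide the label on its own.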

### thai (26 features)

| feature | importance |
|---|---|
| remaining_ratio | 0.9098 |
| has_request | 0.0406 |
| has_negation | 0.0274 |
| particle_ratio | 0.0108 |

- polite particle detection (ครับ/ค่ะ/จ้ะ variants)
- backchannel sound patterns (อืม/อ๋อ/เออ with tone variants)
- question/negation/request/continuation markers
- handles asr misspellings (ค่า→ค่ะ, คับ→ครับ, อื้ม→อืม)
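
the misspelling handling above can be pictured as a lookup from asr variant to canonical form. this is a sketch only: the three pairs come from the list above, while `ASR_VARIANTS` and `normalize_thai` are hypothetical names, and plain substring replacement is an assumption about the mechanism.

```python
# variant -> canonical pairs from the list above; a real table is likely larger
ASR_VARIANTS = {
    "ค่า": "ค่ะ",
    "คับ": "ครับ",
    "อื้ม": "อืม",
}


def normalize_thai(text: str) -> str:
    """Rewrite known asr misspellings to their canonical spelling."""
    for variant, canonical in ASR_VARIANTS.items():
        text = text.replace(variant, canonical)
    return text


print(normalize_thai("คับ") == "ครับ")   # True
print(normalize_thai("อื้ม") == "อืม")   # True
```

substring replacement is naive: it can also rewrite a variant that appears inside a legitimate longer word, so a real implementation would likely restrict where the mapping applies.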

### japanese (27 features)

| feature | importance |
|---|---|
| remaining_ratio | 0.7765 |
| remaining_len | 0.0484 |
| katakana | 0.0347 |
| word_count | 0.0325 |
| kanji_ratio | 0.0206 |

- core aizuchi (はい/ええ/うん/そう)
- agreement, understanding, surprise, filler, reaction markers
- question/continuation/request/negation/negative-verb indicators
- handles asr elongation variants (はーーい, えーーー)
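
elongation variants can be reduced with a regular expression before matching. the sketch below collapses runs of the long-vowel mark to a single mark; whether the package collapses, removes, or otherwise encodes these runs is not stated here, so treat `collapse_elongation` as a hypothetical helper.

```python
import re

# two or more consecutive long-vowel marks
_LONG_RUN = re.compile(r"ー{2,}")


def collapse_elongation(text: str) -> str:
    """Collapse each run of long-vowel marks to a single mark."""
    return _LONG_RUN.sub("ー", text)


print(collapse_elongation("はーーい"))  # はーい
print(collapse_elongation("えーーー"))  # えー
```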

## results

### thai
- **99.49% f1** (5-fold cv)
- test suite: **94/94** (100%)

### japanese
- **98.37% f1** (5-fold cv)
- test suite: **119/119** (100%)
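
for readers unfamiliar with the metric, a 5-fold cross-validated f1 score can be computed with scikit-learn roughly as follows. the data is synthetic and the hyperparameters are arbitrary; this does not reproduce the numbers above or the actual training scripts.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# synthetic stand-in data: column 0 plays the role of remaining_ratio, column 1 is noise
X = rng.random((200, 2))
y = (X[:, 0] < 0.3).astype(int)  # label 1 = backchannel when little text remains

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)

# one f1 score per fold; the headline figure is usually their mean
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(f"mean f1 over {len(scores)} folds: {scores.mean():.4f}")
```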

## test coverage

### thai (94 cases)

- **backchannels (49):** ครับ, ค่ะ, อืม, ใช่, อ๋อ, เหรอ, ฮัลโหล, asr variants...
- **real responses (45):** สวัสดีครับ, ไม่ครับ, ราคาเท่าไหร่ครับ, edge cases (ใช่ แต่ว่า, ครับ แล้วก็)...

### japanese (119 cases)

- **aizuchi (63):** はい, うん, そうですね, なるほど, へー, まじで, えーと, すごい, 承知しました, compounds...
- **real responses (56):** ありがとうございます, いくらですか, 予約したいです, edge cases (はい、質問があります, そうですね、でも...)...

## testing

```bash
python3 -m pytest tests/ -v
```

## files

- `backchannel_classifier/__init__.py` - thai classifier + unified api
- `backchannel_classifier/jp.py` - japanese classifier
- `train.py` - thai training script
- `train_ja.py` - japanese training script
- `tests/test_classifier.py` - thai test suite (94 cases)
- `tests/test_classifier_ja.py` - japanese test suite (119 cases)

## requirements

- python 3.8+
- scikit-learn
- numpy

## memory

~3.7 MB per language model, lazy-loaded. if you only use thai, the japanese model is never loaded (zero overhead).
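
lazy loading of this kind is commonly done with a per-language cache. a minimal sketch, assuming nothing about the package internals: `get_model` and `LOAD_LOG` are hypothetical, and the dict stands in for a real model object.

```python
from functools import lru_cache

LOAD_LOG = []  # records which languages were actually loaded (demo only)


@lru_cache(maxsize=None)
def get_model(lang: str) -> dict:
    """Load the model for one language on first use, then reuse it."""
    # placeholder for an expensive step such as reading a model file from disk
    LOAD_LOG.append(lang)
    return {"lang": lang}


get_model("th")
get_model("th")  # served from the cache, no second load
print(LOAD_LOG)  # ['th']
```

the japanese entry is only created if `get_model("ja")` is ever called, which is the behaviour described above.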
