Metadata-Version: 2.4
Name: cao-official
Version: 0.5.0
Summary: CAO Official — fast, clean emotiCon Analysis and decOding of affective information (Japanese kaomoji affect analysis)
Author: Michal Ptaszynski, Pawel Dybala, Rafal Rzepka, Kenji Araki
Maintainer: Michal Ptaszynski
License-Expression: BSD-3-Clause
Project-URL: Homepage, https://github.com/ptaszynski/cao-official
Project-URL: Repository, https://github.com/ptaszynski/cao-official
Project-URL: Issues, https://github.com/ptaszynski/cao-official/issues
Project-URL: Changelog, https://github.com/ptaszynski/cao-official/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/ptaszynski/cao-official/blob/main/docs/API.md
Keywords: emoticon,kaomoji,emotion,affect,sentiment,nlp,japanese,kinesics,naive-bayes
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Natural Language :: Japanese
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21
Provides-Extra: fast
Requires-Dist: pyahocorasick>=2.0; extra == "fast"
Provides-Extra: app
Requires-Dist: streamlit>=1.30; extra == "app"
Requires-Dist: altair>=5; extra == "app"
Requires-Dist: pandas>=2; extra == "app"
Requires-Dist: pillow>=9; extra == "app"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: build>=1; extra == "dev"
Requires-Dist: twine>=5; extra == "dev"
Dynamic: license-file

# CAO Official 0.5.0

A fast, clean reimplementation of **CAO** — *emotiCon Analysis and decOding of
affective information* — the system that detects Japanese emoticons (*kaomoji*,
e.g. `(^_^)`) in text and classifies the emotion they express, grounded in
Birdwhistell's **theory of kinesics** (an emoticon is body language split into
semantic *kinemes*: eyes, mouth, decorations).

Based on: Ptaszynski et al., *"CAO: A Fully Automatic Emoticon Analysis System
Based on Theory of Kinesics"*, IEEE Transactions on Affective Computing, 2010.
This is a from-scratch Python rewrite of the original C# system — faster, fixed,
and runnable. See [CHANGELOG.md](CHANGELOG.md) for the lineage,
[docs/API.md](docs/API.md) for the API reference, and
[ANALYSIS.md](ANALYSIS.md) for how it differs from the legacy code and the paper.

> **PyPI name:** the bare `cao` is taken, so the distribution is **`cao-official`**
> and the import package is **`cao_official`** (CLI: `cao-official`).

**New in 0.5:** partial faces (bracketless `^o^`, one-bracket `(^o^`, mouthless
`(^^)` / `(--)`); a probabilistic **Naive-Bayes** scorer that is the new default
(fixes the old `relief` bias, gives calibrated confidence); a statistical
borderline detector; mmap model; batch/async API; and pip-installable packaging.

---

## Citation

If you use CAO, please cite:

> M. Ptaszynski, J. Maciejewski, P. Dybala, R. Rzepka, K. Araki.
> "CAO: A Fully Automatic Emoticon Analysis System Based on Theory of Kinesics."
> *IEEE Transactions on Affective Computing*, Vol. 1, No. 1, 2010.

```bibtex
@article{ptaszynski2010cao,
  title={CAO: A fully automatic emoticon analysis system based on theory of kinesics},
  author={Ptaszynski, Michal and Maciejewski, Jacek and Dybala, Pawel and Rzepka, Rafal and Araki, Kenji},
  journal={IEEE Transactions on Affective Computing},
  volume={1},
  number={1},
  pages={46--59},
  year={2010},
  publisher={IEEE}
}
```

---

## What it does

Given an emoticon or a sentence, CAO runs three procedures:

1. **Detection** — find emoticon spans in free text (face-anchored: a candidate
   must contain a recognized face core; brackets optional). 0.5 also finds
   **partial faces** — bracketless, single-bracket, and **mouthless** (eye–eye)
   — via a gated fallback anchor, and rejects prose/number noise.
2. **Extraction** — segment each emoticon into its seven structural areas
   `[additional][bracket][internal][ FACE ][internal][bracket][additional]` and
   decompose the face into eye / mouth / eye (occurrence-weighted; empty mouth
   allowed).
3. **Affect analysis** — score the parts against ten per-emotion databases and
   decide a single emotion, with a calibrated confidence and a 2-D coordinate.

The ten emotions (Nakamura): anger, dislike, excitement, fear, fondness, joy,
relief, shame, sorrow, surprise — also projected onto Russell's **valence ×
activation** plane.
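
As a rough illustration of the 2-D projection, the sketch below names the quadrant of the valence × activation plane for a pair of coordinates. The helper `quadrant` is hypothetical and not part of the package. Only the label `negative-activated` appears in CAO's output in this README; the other three labels and the treatment of zero are assumptions.

```python
def quadrant(valence: float, activation: float) -> str:
    """Name the quadrant of the valence x activation plane (hypothetical helper)."""
    # Assumption: zero counts as the positive / activated side.
    sign = "positive" if valence >= 0 else "negative"
    arousal = "activated" if activation >= 0 else "deactivated"
    return f"{sign}-{arousal}"


# Coordinates reported for the anger face in the Quickstart example.
label = quadrant(-0.51, 0.70)
print(label)
```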

---

## Install

Requires **Python 3.10+** and **numpy**. `pyahocorasick` is recommended for
C-accelerated matching; without it, the code falls back to a pure-Python
automaton that is only ~15% slower.

From PyPI:

```bash
pip install cao-official          # core
pip install "cao-official[fast]"  # + pyahocorasick (recommended)
```
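
To see whether the optional accelerator is importable in a given environment, a check along these lines may help. The helper `matcher_backend` is hypothetical and not part of the package. It only tests whether the `ahocorasick` module (the import name of `pyahocorasick`) can be found, and assumes nothing about how `cao_official` itself selects a backend.

```python
import importlib.util


def matcher_backend() -> str:
    """Report which Aho-Corasick backend is importable (hypothetical helper)."""
    # pyahocorasick installs a top-level module named `ahocorasick`.
    if importlib.util.find_spec("ahocorasick") is not None:
        return "pyahocorasick (native)"
    return "pure-Python fallback"


print(matcher_backend())
```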

The wheel ships a **prebuilt model** (`cao.model` + `cao.model.npy`), so the
first `Cao()` loads in ~5 ms with no database needed — the package is
self-contained.

From source (for development / rebuilding the model from the database):

```bash
git clone https://github.com/ptaszynski/cao-official
cd cao-official
python3 -m venv .venv && ./.venv/bin/pip install -e ".[fast,dev]"
./.venv/bin/python -m cao_official --build   # rebuild model from ../cao_0.2/data
```

---

## Quickstart

### Python API

```python
from cao_official import Cao

cao = Cao()                       # loads the model (builds it once if missing)

r = cao.analyze("(｀Д´)")
print(r.label)                    # 'anger'
print(r.confidence)               # 0.71   (calibrated posterior)
print(r.valence, r.activation)    # -0.51 0.70   -> negative-activated
print(r.areas.as_row())           # ['N/A', '(', 'N/A', '`Д́', 'N/A', ')', 'N/A']
print(r.ranking()[:3])            # [('anger', 0.71), ('excitement', 0.10), ...]
print(r.attribution[0])           # ('face', '`Д́', 'anger', 88.0)  -- why

# detect + analyze every emoticon in a sentence (partial faces included)
for r in cao.analyze_text("今日は嬉しい^o^けど(--)気分"):
    print(r.span, r.emoticon, "->", r.label)

# stream over a big corpus (optionally across processes)
with open("corpus.txt", encoding="utf-8") as corpus:
    for i, results in cao.analyze_batch(corpus, workers=4):
        ...

# pick a different scoring method (the five paper methods remain available)
cao.analyze("(^o^)", method="frequency")
```

### Web app

An interactive **Streamlit** demo (`app.py`), bilingual (English / 日本語), with
single-emoticon, free-text, and whole-**document (file upload)** modes; it shows
detection, the emotion ranking, the Russell valence×activation plot, the 7-area
breakdown, and the kineme attribution. The header logo recolours to the light/dark
theme automatically.

```bash
pip install "cao-official[app]"   # streamlit + altair + pandas + pillow
streamlit run app.py
```

**Deploy on Streamlit Community Cloud:** point it at this repo and `app.py`. The
included `requirements.txt` installs the dependencies, and the package and its
prebuilt model are already in the repo, so it runs as-is.

### Command line

```bash
cao-official "(^_^)" "(｀Д´)"                       # after `pip install`
python -m cao_official "(^_^)" "(｀Д´)"             # or via the module
echo "嬉しい(^o^)です" | python -m cao_official --text   # detect in free text
python -m cao_official --json "orz"                  # JSON output
python -m cao_official --explain "(｀Д´)"             # show kineme attribution
python -m cao_official --method frequency "(^o^)"    # pick a scoring method
python -m cao_official --build                       # rebuild the model
```

Example:

```
emoticon: (｀Д´)
  split : N/A | ( | N/A | `Д́ | N/A | ) | N/A
  label : anger  (confidence 0.71, via bayes)
  2-D   : valence=-0.51 activation=+0.70 [negative-activated]
  top-3 : anger=0.7058, excitement=0.1009, dislike=0.07281
  why   : face='`Д́'->anger(88), left_eye+right_eye='`\t́'->anger(65), mouth='Д'->anger(42)
```

---

## How it works

```
text ──► normalize (NFKC) ──► detect (Aho-Corasick, face-anchored + mouthless fallback)
     ──► extract (position-aware 7 areas + eye/mouth) ──► score (5 methods + Naive-Bayes)
     ──► decide (label + calibrated confidence + Russell 2-D)
```
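
The detect stage in the pipeline above relies on Aho-Corasick multi-pattern matching. The sketch below is a textbook pure-Python version, not the package's `automaton.py`. The names `build_automaton` and `find_all` and the three toy face cores are illustrative assumptions.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a set of string patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns that end at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first pass: set failure links and merge outputs along them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Yield (start, end, pattern) for every match in one left-to-right pass."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield i - len(pat) + 1, i + 1, pat


# Toy face cores; a real automaton would be built from the emoticon databases.
cores = ["^_^", "^o^", "--"]
hits = list(find_all("嬉しい(^o^)けど(--)", build_automaton(cores)))
print(hits)
```

Each character of the text is consumed once, and failure links replace restarts, so the scan cost does not grow with the number of patterns.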

- **One model, built once.** The released per-emotion databases are parsed into
  contiguous `float32` stat tensors (one `(N, 10, 5)` buffer + a `dict[str → row]`
  per table) and a face-core automaton, serialized to `cao_official/cao.model`
  (+ a sidecar `cao.model.npy` tensor buffer, mmappable). Runtime never re-parses
  text files; loading is ~5 ms.
- **One O(text) pass, shared.** A single Aho-Corasick automaton over the
  normalized line replaces the legacy giant regex alternation, the per-character
  detection loop, and every linear database scan — and its hits drive *both*
  detection and extraction (no second scan).
- **Normalized matching with offset map.** NFKC unifies full/half-width and
  combining/precomposed forms (`（＾＿＾）` ≡ `(^_^)`); a canonical→original index
  map lets detection work in normalized space yet report exact original spans.
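
The normalized-matching idea can be sketched with the standard library's `unicodedata`. This is a simplified illustration, not the code in `normalize.py`: the functions `normalize_with_map` and `original_span` are hypothetical, and normalizing one character at a time is an assumption made to keep the index map simple.

```python
import unicodedata


def normalize_with_map(text: str):
    """Return (normalized_text, index_map), where index_map[i] is the index in
    `text` of the original character that produced normalized character i."""
    out_chars = []
    index_map = []
    for orig_idx, ch in enumerate(text):
        # Simplification: normalizing one character at a time keeps the mapping
        # trivial, but it will not compose sequences that span characters.
        for norm_ch in unicodedata.normalize("NFKC", ch):
            out_chars.append(norm_ch)
            index_map.append(orig_idx)
    return "".join(out_chars), index_map


def original_span(index_map, start: int, end: int):
    """Map a half-open span in normalized space back to the original text."""
    return index_map[start], index_map[end - 1] + 1


text = "今日は（＾＿＾）です"
norm, imap = normalize_with_map(text)
start = norm.index("(^_^)")          # match in normalized space
span = original_span(imap, start, start + 5)
print(norm)
print(text[span[0]:span[1]])          # exact original characters
```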

### Scoring methods

Each part is matched exactly against its database; the face follows the cascade
**raw whole-emoticon → triplet core → eyes + mouth**, and the surrounding areas
are added at the paper's `0.25` weight (now separately configurable for internal
vs additional). Six methods are available:

| method | meaning | direction |
|---|---|---|
| `occurrence` | raw count in the emotion DB | higher = stronger |
| `frequency` | occurrence ÷ total occurrences in that DB | higher = stronger |
| `uniqueFrequency` | occurrence ÷ #unique elements in that DB | higher = stronger |
| `position` | rank by occurrence (ties share) | lower = stronger |
| `uniquePosition` | dense rank by occurrence | lower = stronger |
| **`bayes`** *(default)* | smoothed Naive-Bayes posterior over the parts | higher = probability |
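
The five paper statistics in the table can be sketched for a single emotion database as follows. The counts are made up, `paper_stats` is a hypothetical helper, and the reading of "ties share" as a competition rank (tied elements share a rank and the next rank skips) is an assumption, not something confirmed by the package.

```python
def paper_stats(counts: dict[str, int]) -> dict[str, dict[str, float]]:
    """Compute the five table statistics for every element of one emotion DB."""
    total = sum(counts.values())
    n_unique = len(counts)
    # Distinct occurrence values, highest first, for the dense rank.
    distinct = sorted(set(counts.values()), reverse=True)
    stats = {}
    for elem, occ in counts.items():
        stats[elem] = {
            "occurrence": occ,
            "frequency": occ / total,
            "uniqueFrequency": occ / n_unique,
            # Assumed reading of "ties share": 1 + number of strictly larger counts.
            "position": 1 + sum(1 for c in counts.values() if c > occ),
            "uniquePosition": 1 + distinct.index(occ),
        }
    return stats


# Hypothetical mouth counts for a single emotion database.
toy_db = {"o": 6, "_": 3, "Д": 3, "▽": 1}
table = paper_stats(toy_db)
for elem, row in table.items():
    print(elem, row)
```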

**`bayes`** is the 0.5 default. It composes the kineme evidence as a log-product
`log P(e) + Σ wᵢ·log P(partᵢ|e)` with Lidstone smoothing: the canonical
generative model, of which the paper's `frequency` is a single-part special case.
It yields a real posterior probability, which temperature scaling then calibrates
into `confidence`, and it removes the `uniqueFrequency` small-DB bias toward
`relief`. It won the bake-off (`eval bakeoff`: best macro-F1 0.322 and best top-3
69.1%) and the in-context probe (top-1 **28.1% vs 22.1%** for the old default,
top-3 57.8%). All methods are selectable via `method=` and exposed in
`result.scores`.
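
A minimal sketch of the weighted log-product with Lidstone smoothing and temperature scaling is shown below. It is not the code in `score.py`. The function `bayes_posterior`, the toy counts and priors, and the values of `alpha`, `vocab_size`, and `temperature` are all assumptions chosen for illustration.

```python
import math


def bayes_posterior(parts, counts, priors, alpha=0.5, vocab_size=100, temperature=1.0):
    """Posterior over emotions from weighted part evidence.

    parts  : list of (part, weight) pairs
    counts : counts[emotion][part] -> occurrences (toy numbers here)
    priors : priors[emotion] -> P(e)
    """
    log_scores = {}
    for emo, prior in priors.items():
        total = sum(counts[emo].values())
        score = math.log(prior)
        for part, weight in parts:
            # Lidstone smoothing: add alpha to every count.
            p = (counts[emo].get(part, 0) + alpha) / (total + alpha * vocab_size)
            score += weight * math.log(p)
        # Temperature scaling: divide the log-score before the softmax.
        log_scores[emo] = score / temperature

    # Softmax, shifted by the max for numerical stability.
    top = max(log_scores.values())
    exp = {e: math.exp(s - top) for e, s in log_scores.items()}
    z = sum(exp.values())
    return {e: v / z for e, v in exp.items()}


# Hypothetical two-emotion model.
counts = {"anger": {"Д": 40, "`": 25}, "joy": {"^": 60, "o": 30, "Д": 2}}
priors = {"anger": 0.5, "joy": 0.5}
post = bayes_posterior([("Д", 1.0), ("`", 1.0)], counts, priors)
print(max(post, key=post.get), post)
```

Raising `temperature` above 1 flattens the posterior and lowering it sharpens it, while the ranking of emotions stays the same.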

---

## Performance

| operation | time |
|---|---|
| build model | ~0.2 s (once) |
| load model | ~6 ms (eager) |
| model artifact | ~5 MB (0.8 MB pickle + 4.2 MB mmappable `.npy`) |
| full 1000-sentence corpus | ~15 ms (~65k sentences/s end-to-end) |
| pure-Python Aho-Corasick | ~15% slower than native `pyahocorasick` |

Run the harness: `python -m cao_official.eval` (add `bakeoff`, `calibrate`, or
`bench`). Throughput is slightly lower than in version 0.4 because the default
`bayes` scorer computes a posterior per emoticon, but it is still well over 60k
sentences/s.

**Scales with the database, not against it.** Detection, matching and scoring
are all O(text), independent of how many emoticons are in the model: a single
Aho-Corasick pass anchors detection *and* feeds extraction, lookups are
`dict → row` into one contiguous `float32` tensor, and every statistic is
precomputed at build time. Growing the emoticon database 10× or 100× leaves
per-analysis cost unchanged — only the one-off build and the (compact, linear)
artifact grow.
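
The `dict → row` lookup can be pictured with a small numpy sketch. The table contents are made up, `lookup` is a hypothetical helper, and reading the `(N, 10, 5)` axes as element × emotion × statistic is an assumption based on the ten emotions and five paper methods.

```python
import numpy as np

EMOTIONS = ["anger", "dislike", "excitement", "fear", "fondness",
            "joy", "relief", "shame", "sorrow", "surprise"]
N_STATS = 5  # assumed: one slot per paper statistic

# Hypothetical table: three elements, each with a (10, 5) block of statistics.
elements = ["^o^", "^_^", "--"]
row_of = {elem: i for i, elem in enumerate(elements)}
stats = np.zeros((len(elements), len(EMOTIONS), N_STATS), dtype=np.float32)
stats[row_of["^o^"], EMOTIONS.index("joy"), 0] = 12.0  # made-up occurrence


def lookup(elem: str):
    """Return the (10, 5) stat block for `elem`, or None if unseen."""
    row = row_of.get(elem)                        # hash lookup, average O(1)
    return None if row is None else stats[row]    # a view into the buffer, no copy


block = lookup("^o^")
print(block.shape, block[EMOTIONS.index("joy"), 0])
```

Neither the hash lookup nor the row view depends on how many rows the table holds, which is the property the paragraph above describes.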

---

## Project layout

```
cao-official_0.5/
  pyproject.toml       pip-installable (dist `cao-official`, CLI `cao-official`)
  cao_official/        the package
    normalize.py       NFKC canonicalization
    model.py           Emotion set, Russell 2-D, EmoResult, methods
    automaton.py       Aho-Corasick (pyahocorasick + pure-python fallback)
    langmodel.py       one-class char-LM detector for borderline spans
    database.py        build/serialize the model (cao.model + cao.model.npy)
    detect.py          face-anchored detection + mouthless fallback
    extract.py         position-aware 7-area segmentation + eye/mouth split
    score.py           5 paper methods + Naive-Bayes, cascade, decision, 2-D
    cao.py             the Cao facade (analyze / analyze_text / analyze_batch)
    cli.py / __main__  command line
    eval.py            throughput + bake-off + calibrate + bench
  reference_port/      faithful 1:1 port of the C# (regression oracle)
  tests/               behaviour tests (python tests/test_cao.py)
  docs/API.md          API reference + theory-of-kinesics primer
  README.md  CHANGELOG.md  ROADMAP.md  ANALYSIS.md  PLAN.md
```

---

## Limitations & honest caveats

- **Scope: kaomoji only.** Unicode emoji and Western emoticons are currently out
  of scope. The design extends to them (see ROADMAP), but emoji support was
  deliberately deferred in this pass.
- **Accuracy is data-bound.** Labels reflect the *released* database, which is
  more deduplicated than the paper's full DB; e.g. `^o^` resolves to whichever
  emotion it is densest in within this DB. The `joy` class barely surfaces because
  it overlaps almost entirely with `fondness` in this release.
- **Evaluation is in-database.** The bake-off uses the labeled DB itself as gold
  (resubstitution-style, raw stage disabled); the in-context probe only labels 5
  corpus blocks. A truly held-out, hand-labeled gold set is the top ROADMAP item.
- **The `relief` small-DB bias is fixed** by the default `bayes` scorer (relief
  predictions 1018 → 249 on the bake-off; gold 91). `uniqueFrequency` and the
  other paper methods remain available but carry the original bias.
- **Bare 2-char bracketless faces in prose** (e.g. `^^` in `わー^^ね`) are below
  the detection length floor and skipped to avoid prose false-positives;
  bracketed (`(^^)`) and 3-char+ forms are found.

For the faithful, bug-for-bug behaviour of the original system, use
`reference_port/cao_reference.py` (validated against `cao_0.2/eval/1000eval4.txt`).

---

## License

CAO is released under **The BSD 3-Clause License** — the whole system, including
the `cao-official` Python package and the bundled emoticon databases.

> Copyright (c) 2007-2026, Michal Ptaszynski, Jacek Maciejewski, Pawel Dybala,
> Rafal Rzepka, Kenji Araki.

See [LICENSE](LICENSE) for the full text.
