Metadata-Version: 2.2
Name: nyansasua
Version: 0.2.4
Summary: Fast multi-language keyword extraction with tenant-aware stopwords and fuzzy dictionary snapping.
Author: Cire contributors
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Text Processing :: Linguistic
Project-URL: Homepage, https://github.com/yourorg/cire
Project-URL: Issues, https://github.com/yourorg/cire/issues
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# Cire / Nyansasua

> *Cire* (Hausa) — **knowledge / wisdom**. *Nyansasua* (Twi) — **learning / wisdom**.
>
> PyPI package: `nyansasua` · C++ library: Cire

A self-contained, fast C++17 library for **multi-language keyword extraction**,
with first-class Python bindings.

* **No external dependencies** for the C++ core (no ICU, no Boost).
* **18 languages** with stopword lists: English, Spanish, French, German,
  Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic,
  Indonesian, Twi/Akan, Ga, Ewe, Hausa, Fante.
* **Tenant-aware stopword overlays** for isolated domain/agent dictionaries.
* **BK-tree fuzzy snapping** for tenant-scoped canonical term correction.
* **4 algorithms** (run any one, or combine via ensemble):
  * **TF-IDF** — single-doc entropy fallback or corpus-driven
  * **YAKE** — statistical (Campos et al., 2020)
  * **TextRank** — graph-based PageRank (Mihalcea & Tarau, 2004)
  * **RAKE** — rapid automatic keyword extraction (Rose et al., 2010)
* **UTF-8 everywhere** — proper Unicode tokenizer with CJK / Cyrillic /
  Arabic / Hangul / Devanagari / Thai / Hiragana / Katakana support.
* **C++17** + clean public API; **pybind11** Python module ships in
  `python/`.

---

## Project layout

```
cire/
├── cpp/                  C++ core
│   ├── include/cire/     Public headers
│   ├── src/              Implementations
│   ├── examples/         Demo program (9 languages)
│   ├── tests/            C++ test suite
│   └── CMakeLists.txt
├── python/               pybind11 Python wrapper
│   ├── bindings.cpp
│   ├── cire/             Python package
│   └── tests/            pytest suite
├── CMakeLists.txt        Top-level build (optional)
├── pyproject.toml        Python packaging metadata
└── README.md
```

---

## C++ quick start

```cpp
#include <cire/extractor.hpp>
#include <cstdio>
#include <string>

int main() {
    std::string text = "Natural language processing (NLP) is a subfield of "
                       "linguistics, computer science, and artificial "
                       "intelligence concerned with the interactions between "
                       "computers and human language. Transformers have "
                       "revolutionized NLP.";

    cire::EnsembleConfig cfg;
    cfg.language = cire::Language::Auto;
    cfg.top_k = 5;

    for (const auto& k : cire::extract_keywords_ensemble(text, cfg)) {
        std::printf("%-20s  score=%.3f\n", k.text.c_str(), k.score);
    }
}
```

### Build (C++ only)

```bash
cd cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./cire_tests
./cire_bench
./cire_demo           # multi-language demo
```

---

## Python quick start

### Install from PyPI

```bash
pip install nyansasua
```

Nyansasua installs as the `cire` module:

```python
import cire

print(cire.__version__)
```

### Install from source

```bash
cd cire
pip install -e .
```

This uses `scikit-build-core` + `pybind11` to compile the C++ core and produce
a wheel that bundles the compiled extension. Once installed:

```python
import cire

# One-liner
for k in cire.extract_keywords("Hello world", top_k=5):
    print(k.text, k.score)

# Or use the high-level Extractor class
ext = cire.Extractor(language="auto", algorithm="ensemble", top_k=10)
for k in ext.extract("Machine learning is a branch of AI."):
    print(k.text, k.score)

# Batch processing
results = ext.extract_many([
    "Python is widely used in data science and machine learning.",
    "Climate change is one of the biggest challenges facing humanity.",
])

# Corpus-driven TF-IDF (uses document frequency across many texts)
corpus = ["Python is used in data science.", "Java is used in enterprise.",
          "Python is great for scripting.", "Java runs on the JVM."]
kws = ext.extract_corpus_tfidf(corpus, "Python is popular for ML and AI.",
                               top_k=5)
```
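To make the corpus-driven mode more concrete, here is a minimal pure-Python sketch of TF-IDF scoring over the same example corpus. It is not Cire's implementation: the whitespace tokenizer, the smoothed IDF formula, and the helper names `tokenize_simple`, `document_frequency`, and `tfidf_scores` are assumptions made only for illustration.

```python
import math
from collections import Counter

def tokenize_simple(text):
    """Lowercase, split on whitespace, strip edge punctuation (illustrative only)."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]

def document_frequency(corpus):
    """Count how many documents each term appears in."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize_simple(doc)))
    return df

def tfidf_scores(text, df, n_docs):
    """Score each term as tf * smoothed idf; higher means more distinctive."""
    tf = Counter(tokenize_simple(text))
    total = sum(tf.values())
    return {
        term: (count / total) * (math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1.0)
        for term, count in tf.items()
    }

corpus = ["Python is used in data science.", "Java is used in enterprise.",
          "Python is great for scripting.", "Java runs on the JVM."]
df = document_frequency(corpus)
scores = tfidf_scores("Python is popular for ML and AI.", df, len(corpus))

# Print the three most distinctive terms for the query document.
for term, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{term:10s} {score:.3f}")
```

In this sketch, terms that occur in many corpus documents (such as "is") receive the lowest weight, while terms the corpus has never seen (such as "popular") score highest.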

### Tenant dictionaries for domains

Use tenant dictionaries when different November agents need isolated domain
vocabulary in memory at the same time.

```python
import cire

cire.load_tenant_dictionary(
    "education",
    ["mathematics", "english", "fractions", "lesson_note", "B2"],
)
cire.load_tenant_dictionary(
    "banking",
    ["mobile_money", "microloan", "GHS", "susu"],
)

print(cire.snap_term("education", "mathematic"))  # mathematics
print(cire.snap_term("education", "fracions"))    # fractions
print(cire.snap_term("banking", "micro-loan"))    # microloan
print(cire.snap_term("education", "micro-loan"))  # micro-loan, no banking leakage
```
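The idea behind tenant-scoped BK-tree snapping can be sketched in pure Python. This is an illustration of the general technique, not Cire's C++ implementation: the `BKTree` class, the `tenants` registry, and the `load_dictionary` / `snap` helpers are hypothetical names, and plain Levenshtein distance with a default threshold of 2 is an assumption.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Each node is (term, {edit_distance_to_parent_term: child_node})."""

    def __init__(self):
        self.root = None

    def add(self, term):
        if self.root is None:
            self.root = (term, {})
            return
        node = self.root
        while True:
            d = levenshtein(term, node[0])
            if d == 0:
                return  # term already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (term, {})
                return
            node = child

    def nearest(self, query, max_distance):
        """Return (distance, term) for the closest term within max_distance, else None."""
        best = None
        stack = [self.root] if self.root else []
        while stack:
            term, children = stack.pop()
            d = levenshtein(query, term)
            if d <= max_distance and (best is None or d < best[0]):
                best = (d, term)
            # Triangle inequality: only edges in [d - max, d + max] can hold a match.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return best

tenants = {}  # hypothetical registry: tenant id -> BKTree

def load_dictionary(tenant, terms):
    tree = tenants.setdefault(tenant, BKTree())
    for term in terms:
        tree.add(term)

def snap(tenant, term, max_distance=2):
    """Return the closest canonical term for this tenant, or the input unchanged."""
    tree = tenants.get(tenant)
    hit = tree.nearest(term, max_distance) if tree else None
    return hit[1] if hit else term

load_dictionary("education", ["mathematics", "english", "fractions"])
load_dictionary("banking", ["microloan", "susu"])

print(snap("education", "mathematic"))  # mathematics
print(snap("banking", "micro-loan"))    # microloan
print(snap("education", "micro-loan"))  # micro-loan
```

Because each tenant owns a separate tree, a lookup never consults another tenant's vocabulary, which is why the last call returns its input unchanged.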

### Tenant stopwords

Tenant stopwords are isolated overlays on top of built-in language stopwords.

```python
import cire

cire.load_tenant_stopwords(
    "health",
    cire.Language.English,
    ["please", "show", "patient", "case"],
)

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.YAKE
cfg.tenant_id = "health"
cfg.top_k = 5

for kw in cire.extract_keywords("Please show malaria treatment for this patient case", cfg):
    print(kw.text, kw.score)
```
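Conceptually, an overlay is an extra per-tenant set that is consulted in addition to the built-in list. The sketch below shows that idea in pure Python; the data structures, the tiny built-in list, and the names `load_overlay` and `overlay_is_stopword` are hypothetical and do not describe Cire's internals.

```python
# Tiny illustrative subset; the real built-in lists are far larger.
BUILTIN_STOPWORDS = {"en": {"a", "for", "the", "this"}}

# Hypothetical overlay store: (tenant_id, language) -> extra stopwords.
tenant_overlays = {}

def load_overlay(tenant_id, language, words):
    """Add tenant-specific stopwords without touching the built-in list."""
    overlay = tenant_overlays.setdefault((tenant_id, language), set())
    overlay.update(w.lower() for w in words)

def overlay_is_stopword(word, language, tenant_id=None):
    """Check the built-in list first, then the tenant's overlay (if any)."""
    w = word.lower()
    if w in BUILTIN_STOPWORDS.get(language, set()):
        return True
    return w in tenant_overlays.get((tenant_id, language), set())

load_overlay("health", "en", ["please", "show", "patient", "case"])

text = "Please show malaria treatment for this patient case"
with_tenant = [w for w in text.split() if not overlay_is_stopword(w, "en", "health")]
without_tenant = [w for w in text.split() if not overlay_is_stopword(w, "en")]

print(with_tenant)     # ['malaria', 'treatment']
print(without_tenant)  # ['Please', 'show', 'malaria', 'treatment', 'patient', 'case']
```

Keying the overlay on both tenant and language keeps one tenant's domain noise words from affecting any other tenant or any other language.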

### Education lesson-note query example

This mirrors a November Education agent that extracts expected filters first,
uses an alias map for semantic aliases, and lets Nyansasua snap remaining
spelling variants with the Education tenant dictionary.

```python
import cire

cire.load_tenant_dictionary(
    "education",
    [
        "B2",
        "english",
        "lesson_note",
        "GES",
        "core_competencies",
        "assessment_task",
    ],
)

aliases = {
    "basic 2": "B2",
    "english language": "english",
}

entities = {
    "grade": "Basic 2",
    "subject": "Englsh",
}

normalized = {}
for field, value in entities.items():
    key = value.lower()
    normalized[field] = aliases.get(key) or cire.snap_term("education", key, 2)

print(normalized)
# {'grade': 'B2', 'subject': 'english'}
```

### Ghanaian language detection

```python
import cire

samples = {
    "ewe": "ame ƒe nu",
    "hausa": "ɗan makaranta yana karatu",
    "ga": "ŋɔɔ kɛ sane",
    "fante": "me dɛ hom nyina",
}

for label, text in samples.items():
    lang = cire.detect_language(text)
    print(label, cire.language_name(lang), cire.language_code(lang))
```

### Build the Python module directly (no scikit-build)

```bash
cd cpp
mkdir build && cd build
cmake .. -DCIRE_BUILD_PYTHON=ON -Dpybind11_DIR="$(python3 -m pybind11 --cmakedir)"
cmake --build . -j
# The .so is dropped into python/cire/ by the top-level CMake hook.
```

### Run the Python tests

```bash
cd python
pytest tests/
```

---

## API surface

### C++

| Header                | What it does                                    |
|-----------------------|-------------------------------------------------|
| `cire/types.hpp`      | `Language`, `Token`, `Keyword`, `Sentence`      |
| `cire/tokenizer.hpp`  | UTF-8 tokenizer + sentence splitter             |
| `cire/stopwords.hpp`  | Stopword lists + tenant overlays                |
| `cire/snapper.hpp`    | Tenant fuzzy dictionary snapping                |
| `cire/tfidf.hpp`      | TF-IDF extractor (corpus + single-doc fallback) |
| `cire/yake.hpp`       | YAKE statistical extractor                      |
| `cire/textrank.hpp`   | TextRank PageRank extractor                     |
| `cire/rake.hpp`       | RAKE phrase extractor                           |
| `cire/extractor.hpp`  | Top-level facade + ensemble + language detect   |

### Python (`import cire`)

| Symbol                                | What it does                          |
|---------------------------------------|---------------------------------------|
| `Extractor(language, algorithm, …)`   | High-level facade class               |
| `Language`, `Algorithm`               | Enums (with string aliases)           |
| `ExtractConfig`, `EnsembleConfig`     | Per-call configuration                |
| `extract_keywords(text, config)`      | Run one algorithm                     |
| `extract_keywords_ensemble(…)`        | Run all four, merge                   |
| `tokenize`, `split_sentences`         | Low-level token utilities             |
| `is_stopword`, `add_stopword`         | Stopword inspection                   |
| `load_tenant_stopwords`               | Tenant-specific stopword overlays     |
| `load_tenant_dictionary`, `snap_term` | Tenant fuzzy dictionary snapping      |
| `detect_language`                     | Heuristic script detection            |
| `build_corpus_df`                     | Build a DF table for TF-IDF           |

---

## Algorithm selection

| Use case                               | Pick           |
|----------------------------------------|----------------|
| Large corpus, need IDF signal          | `TFIDF`        |
| Single document, no corpus             | `YAKE`         |
| Want graph-based co-occurrence ranking | `TextRank`     |
| Domain phrases (e.g. legal, medical)   | `RAKE`         |
| Combined signal from all four          | `ensemble`     |
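
To illustrate why RAKE suits multi-word domain phrases, here is a compact pure-Python sketch of the published RAKE scoring scheme (word score = degree / frequency, phrase score = sum of its word scores). It is a simplified illustration, not Cire's implementation; the stopword subset, the punctuation regex, and the function name `rake_sketch` are assumptions.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "in", "is", "of", "the", "to"}  # illustrative subset

def rake_sketch(text, top_k=3):
    # 1. Split into candidate phrases at punctuation and stopwords.
    phrases = []
    for chunk in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    # 2. Word statistics: frequency, and degree (co-occurrences within
    #    candidate phrases, counting the word itself).
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # 3. Phrase score is the sum of degree / frequency over its words.
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

text = ("Breach of statutory duty is a tort. "
        "The duty of care is owed in medical negligence claims.")
for phrase, score in rake_sketch(text):
    print(f"{phrase:28s} {score:.1f}")
# medical negligence claims    9.0
# statutory duty               3.5
# duty                         1.5
```

Longer phrases accumulate more degree, so multi-word terms such as "medical negligence claims" outrank single words, which is the behavior that makes RAKE a reasonable fit for legal or medical vocabulary.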

---

## Multi-language behavior

- The tokenizer handles mixed-script text (e.g. `"Hello世界world"`)
  correctly.
- For Chinese / Japanese / Korean, the stopword lists contain function words;
  the tokenizer also splits every CJK character into its own 1-char token, a
  common dictionary-free baseline for text without word delimiters.
- The casing salience feature in YAKE is automatically a no-op for caseless
  scripts (CJK, Hangul, Arabic).
- Use `cire.detect_language(text)` if you don't want to specify the language
  up front.
- Ghanaian language detection is heuristic and works best when native Unicode
  characters such as `ƒ`, `ʋ`, `ɗ`, `ɓ`, `ƙ`, `ŋ`, `ɛ`, and `ɔ` are preserved.
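
The tokenization behavior described above can be approximated in a few lines of Python. This sketch is only an approximation of the idea, not the C++ tokenizer: the function names are hypothetical, and treating just the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as ideographs is a simplifying assumption.

```python
def is_cjk_ideograph(ch):
    """True for the basic CJK Unified Ideographs block (real coverage is wider)."""
    return "\u4e00" <= ch <= "\u9fff"

def tokenize_mixed(text):
    """Group letter/digit runs into tokens; emit each CJK ideograph on its own."""
    tokens, run = [], []

    def flush():
        if run:
            tokens.append("".join(run))
            run.clear()

    for ch in text:
        if is_cjk_ideograph(ch):
            flush()
            tokens.append(ch)  # one token per ideograph
        elif ch.isalnum():
            run.append(ch)     # extend the current alphanumeric run
        else:
            flush()            # whitespace or punctuation ends the run
    flush()
    return tokens

print(tokenize_mixed("Hello世界world"))  # ['Hello', '世', '界', 'world']
print(tokenize_mixed("ŋɔɔ kɛ sane"))     # ['ŋɔɔ', 'kɛ', 'sane']

# Caseless scripts have no upper/lower distinction, so a casing feature
# has nothing to measure on them.
print("世界".upper() == "世界")            # True
```

Note that characters such as `ŋ`, `ɔ`, and `ɛ` are ordinary Unicode letters, so they stay inside their tokens as long as the input has not been ASCII-folded upstream.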

---

## Distribution

The Python package is built with `scikit-build-core`; for cross-platform
wheels, configure `cibuildwheel`:

```toml
[tool.cibuildwheel]
build = ["cp38-*", "cp39-*", "cp310-*", "cp311-*", "cp312-*"]
```

Then:

```bash
python -m cibuildwheel --output-dir wheelhouse
twine upload wheelhouse/*
```

---

## License

MIT.
