Metadata-Version: 2.2
Name: nyansasua
Version: 0.2.3
Summary: Fast multi-language keyword extraction with tenant-aware stopwords and fuzzy dictionary snapping.
Author: Cire contributors
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Text Processing :: Linguistic
Project-URL: Homepage, https://github.com/yourorg/cire
Project-URL: Issues, https://github.com/yourorg/cire/issues
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# Nyansasua

Fast multi-language keyword extraction for Python, powered by the C++17 Cire core.

Nyansasua installs as the `cire` Python module and provides TF-IDF, YAKE, TextRank,
RAKE, and ensemble keyword extraction with UTF-8 tokenization, stopword filtering,
tenant-aware configuration, and fuzzy dictionary snapping.

## Features

- **18 language profiles**: English, Spanish, French, German, Italian, Portuguese,
  Dutch, Russian, Chinese, Japanese, Korean, Arabic, Indonesian, Twi/Akan, Ga,
  Ewe, Hausa, and Fante.
- **4 extraction algorithms**: TF-IDF, YAKE, TextRank, and RAKE, plus an ensemble mode.
- **Tenant-aware stopwords**: isolate domain- or agent-specific stopwords for
  tenants such as Banking, Health, Legal, and Education.
- **BK-tree fuzzy snapping**: fast tenant-scoped correction to canonical terms like
  `NHIS`, `GHS`, or domain vocabulary.
- **Unicode-native**: handles UTF-8 text, including the extended Latin letters used
  in Ghanaian languages (such as `ɛ`, `ɔ`, `ŋ`, and `ƒ`), as well as CJK, Cyrillic,
  Arabic, Hangul, Hiragana, Katakana, Thai, and Devanagari scripts.
- **No Python runtime dependencies** after installation.

## Installation

```bash
pip install nyansasua
```

## Quick Start

```python
import cire

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.YAKE
cfg.top_k = 5

for kw in cire.extract_keywords("Machine learning is a branch of AI.", cfg):
    print(kw.text, kw.score)
```

## High-Level Extractor

```python
import cire

ext = cire.Extractor(language="auto", algorithm="ensemble", top_k=10)

keywords = ext.extract(
    "Natural language processing has seen rapid growth in education tools."
)

for kw in keywords:
    print(kw.text, kw.score)
```

## Ghanaian Language Detection

```python
import cire

print(cire.detect_language("ame ƒe nu"))        # Language.Ewe
print(cire.detect_language("ɗan makaranta"))    # Language.Hausa
print(cire.detect_language("ŋɔɔ kɛ sane"))      # Language.Ga
print(cire.detect_language("me dɛ hom nyina"))  # Language.Fante
```

Detection is heuristic. Text with diagnostic Unicode characters such as `ƒ`, `ʋ`,
`ɗ`, `ɓ`, `ƙ`, `ŋ`, `ɛ`, and `ɔ` is much more reliable than plain ASCII text.
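
To decide how much to trust a detection result, it can help to check whether the
input contains any of the diagnostic characters listed above. The following is a
minimal standalone sketch, not part of the `cire` API: the helper names are
hypothetical, and it only reports which diagnostic characters are present without
mapping them to a language.

```python
# Diagnostic characters named in this README. The helpers below are
# illustrative only and are NOT part of the cire module.
DIAGNOSTIC_CHARS = frozenset("ƒʋɗɓƙŋɛɔ")


def diagnostic_chars_in(text: str) -> set:
    """Return the diagnostic characters that occur in `text`.

    The text is lower-cased first so that capital forms such as "Ɛ" or "Ɔ"
    are counted as well.
    """
    return {ch for ch in text.lower() if ch in DIAGNOSTIC_CHARS}


def has_diagnostic_chars(text: str) -> bool:
    """True if at least one diagnostic character is present."""
    return bool(diagnostic_chars_in(text))


print(has_diagnostic_chars("ame ƒe nu"))      # True
print(has_diagnostic_chars("plain ascii"))    # False
```

A caller could, for example, fall back to an explicitly configured language when
this check returns `False`.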

## Tenant-Aware Stopwords

Use tenant IDs to keep domain-specific stopwords isolated across agents.

```python
import cire

cire.load_tenant_stopwords(
    "banking",
    cire.Language.English,
    ["can", "get", "account", "fees"],
)

cfg = cire.ExtractConfig()
cfg.language = cire.Language.English
cfg.algorithm = cire.Algorithm.RAKE
cfg.tenant_id = "banking"
cfg.top_k = 5

keywords = cire.extract_keywords(
    "Can I get account fees for a mobile money loan?",
    cfg,
)

for kw in keywords:
    print(kw.text, kw.score)
```

Tenant stopwords are additive: built-in language stopwords still apply, and each
tenant gets its own isolated overlay.

## Tenant Fuzzy Dictionary Snapping

Nyansasua can keep separate canonical dictionaries in memory for different
tenants or domains.

```python
import cire

cire.load_tenant_dictionary("health", ["NHIS", "GHS", "malaria treatment"])

print(cire.snap_term("health", "nhsi"))  # NHIS
print(cire.snap_term("legal", "nhsi"))   # nhsi, no cross-tenant leakage
```

The snapper uses a BK-tree per tenant, so large dictionaries avoid a full linear
scan for every query.
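
For readers unfamiliar with BK-trees, the sketch below shows the general technique
in pure Python. It is not the Cire implementation: the class and function names are
hypothetical, and it assumes plain Levenshtein distance with case-insensitive
comparison. Plain Levenshtein counts a transposition as two edits, so the distance
metric and thresholds used by `cire.snap_term` may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (term, {distance_to_child: child_node})."""

    def __init__(self, terms=()):
        self.root = None
        for term in terms:
            self.add(term)

    def add(self, term):
        if self.root is None:
            self.root = (term, {})
            return
        node = self.root
        while True:
            d = levenshtein(term.lower(), node[0].lower())
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (term, {})
                return
            node = child

    def search(self, query, max_dist):
        """Return sorted (distance, term) pairs within max_dist of query."""
        if self.root is None:
            return []
        q, results, stack = query.lower(), [], [self.root]
        while stack:
            term, children = stack.pop()
            d = levenshtein(q, term.lower())
            if d <= max_dist:
                results.append((d, term))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_dist, d + max_dist] can contain a match.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


def snap(tree, query, max_dist=1):
    """Return the closest canonical term, or the query unchanged if none."""
    hits = tree.search(query, max_dist)
    return hits[0][1] if hits else query


tree = BKTree(["NHIS", "GHS", "malaria treatment"])
print(snap(tree, "nhiss"))             # NHIS
print(snap(tree, "malaria treatmnt"))  # malaria treatment
print(snap(tree, "xyz"))               # xyz (no term within distance 1)
```

Keeping one such tree per tenant ID, for example in a dictionary keyed by tenant,
gives the isolation shown in the example above.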

## Batch Processing and Corpus TF-IDF

```python
import cire

ext = cire.Extractor(language="english", algorithm="ensemble", top_k=5)

batch = ext.extract_many([
    "Python is widely used in data science.",
    "Climate change is a significant global challenge.",
])

corpus = [
    "Python is used in data science.",
    "Java is used in enterprise environments.",
    "Python is popular for AI.",
]

kws = ext.extract_corpus_tfidf(
    texts=corpus,
    target_text="Python is heavily used in AI and ML.",
    top_k=3,
)
```
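
As background on what corpus TF-IDF computes, here is a minimal pure-Python sketch
of the idea: terms in the target text are weighted up when they are frequent in the
target and rare across the corpus. It is an assumption-laden illustration, not the
Cire scoring code. The tokenizer, the smoothed IDF formula, the small stopword set,
and the function names are all chosen here for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokenizer (illustrative; not Cire's tokenizer)."""
    return re.findall(r"\w+", text.lower())


def corpus_tfidf(corpus, target_text, top_k=3, stopwords=frozenset()):
    """Rank terms of target_text by TF-IDF against the corpus."""
    docs = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(docs)
    tf = Counter(t for t in tokenize(target_text) if t not in stopwords)
    total = sum(tf.values())
    scores = {}
    for term, count in tf.items():
        df = sum(term in doc for doc in docs)            # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0    # smoothed IDF
        scores[term] = (count / total) * idf
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


corpus = [
    "Python is used in data science.",
    "Java is used in enterprise environments.",
    "Python is popular for AI.",
]

ranked = corpus_tfidf(
    corpus,
    "Python is heavily used in AI and ML.",
    top_k=3,
    stopwords=frozenset({"is", "in", "and"}),
)
print([term for term, _ in ranked])  # ['heavily', 'ml', 'ai']
```

Terms that never appear in the corpus ("heavily", "ml") rank above "ai", which
appears in one document, and above "python" and "used", which appear in two.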

## Performance Snapshot

Results from a recent run of the C++ benchmarks on the development server:

- Stopword lookups: about **0.16-0.63 microseconds per lookup**.
- YAKE short text extraction: about **16.6 microseconds per extraction**.
- BK-tree fuzzy snapping at 10,000 terms: about **243 microseconds per snap**.
- Concurrent tenant stopword isolation: **0 failures** across 160,000 operations.

Exact timings depend on hardware, compiler, build type, and input shape.
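
Because timings vary this much, it is worth measuring on your own hardware. The
helper below is a generic sketch using only the standard library; the function name
is hypothetical and it times a stand-in callable so that it runs without `cire`.
Numbers measured from Python also include interpreter call overhead, so they are
not directly comparable to the C++ figures above.

```python
import time


def microseconds_per_call(fn, *args, repeats=10_000, warmup=100):
    """Average wall-clock time of fn(*args) in microseconds."""
    for _ in range(warmup):          # warm caches before timing
        fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1e6


# Stand-in workload: a set membership test. Swap in your own callable,
# for example a function that runs an extraction on representative text.
stopwords = frozenset({"the", "a", "is"})
per_call = microseconds_per_call(stopwords.__contains__, "the")
print(f"{per_call:.3f} microseconds per call")
```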

## License

MIT License.
